Best way to load and parse an HTML file?



tuthmosis
22nd July 2008, 16:37
Greetings!

Can someone point me to a demo or sample Qt C++ code to load and parse HTML files at specific URLs? (DHTML content)

Thanks!

aamer4yu
22nd July 2008, 18:25
Why do you want to do that?
If you want to display HTML sites, have a look at QWebView, available from Qt 4.4 onwards :)

tuthmosis
22nd July 2008, 18:32
Because I am creating a crawler: a robot application that loads DHTML documents, extracts some links, and recurses into each of those links...

So far, I've tested QHttp, but it does not always work. Sometimes the pages load perfectly (e.g. http://www.google.ca),
and sometimes it loads a "302 Found" dummy page or worse (e.g. any URL that represents a Google query).
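
For the record, the "load, extract links, recurse" part can be sketched independently of how pages are downloaded. The following is only a rough sketch using the C++ standard library, not Qt code: `crawl`, `LinkFetcher` and `maxPages` are hypothetical names made up for illustration, and the download-and-parse step is hidden behind a callback. It uses a queue instead of real recursion, and a set of already-seen URLs so that pages linking back to each other do not loop forever.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical callback: given a URL, returns the links found on that page.
// In a real crawler this would download the page and parse it.
using LinkFetcher = std::function<std::vector<std::string>(const std::string&)>;

// Breadth-first crawl starting at startUrl, visiting at most maxPages pages.
// Returns the URLs in the order they were visited.
std::vector<std::string> crawl(const std::string& startUrl,
                               const LinkFetcher& fetchLinks,
                               std::size_t maxPages)
{
    std::vector<std::string> visitedOrder;
    std::set<std::string> seen;        // every URL ever queued
    std::queue<std::string> pending;   // URLs waiting to be loaded

    if (maxPages == 0) {
        return visitedOrder;
    }
    pending.push(startUrl);
    seen.insert(startUrl);

    while (!pending.empty() && visitedOrder.size() < maxPages) {
        const std::string url = pending.front();
        pending.pop();
        visitedOrder.push_back(url);

        for (const std::string& link : fetchLinks(url)) {
            // insert().second is true only for URLs not seen before
            if (seen.insert(link).second) {
                pending.push(link);
            }
        }
    }
    return visitedOrder;
}
```

The page limit is there because a crawler without some bound will happily wander across the whole web; a depth limit or a same-host check would be other reasonable choices.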

aamer4yu
22nd July 2008, 18:36
Well, in that case I guess you will have to parse the HTML file yourself. I am not aware of such a class in Qt.
Regular expressions might be of some help to you for parsing...

sadjoker
22nd July 2008, 20:39
Sometimes Google displays a CAPTCHA because it suspects a bot is searching. That's probably your 302 problem: a 302 response code means the request was redirected.
If you want to parse the HTML and fetch the links, use a RegExp to do it.
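
To make the redirect point concrete: a 302 response carries the new address in its Location header, and the client is expected to issue a second request to that address. As far as I know QHttp does not do this on its own, so the crawler has to. Below is a minimal sketch in plain standard C++ (not Qt) that pulls the status code and the Location value out of the raw response head; `RedirectInfo` and `parseResponseHead` are hypothetical names invented for this example.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper type: the parts of a response head a crawler needs
// in order to decide whether to follow a redirect.
struct RedirectInfo {
    int statusCode = 0;      // e.g. 200 or 302; 0 if the status line was unreadable
    std::string location;    // value of the Location header, empty if absent

    bool isRedirect() const {
        return (statusCode == 301 || statusCode == 302 || statusCode == 303 ||
                statusCode == 307 || statusCode == 308) && !location.empty();
    }
};

static std::string toLower(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

static std::string trim(const std::string& s) {
    const char* whitespace = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return std::string();
    }
    const std::string::size_type last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Parses the status line and headers of an HTTP response (everything
// before the body) and extracts the status code and Location header.
RedirectInfo parseResponseHead(const std::string& head)
{
    RedirectInfo info;
    std::istringstream in(head);
    std::string line;

    // Status line, e.g. "HTTP/1.1 302 Found"
    if (std::getline(in, line)) {
        std::istringstream status(line);
        std::string version;
        status >> version >> info.statusCode;
    }

    // Header lines, e.g. "Location: http://..."; header names are case-insensitive
    while (std::getline(in, line)) {
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) {
            continue;
        }
        if (toLower(trim(line.substr(0, colon))) == "location") {
            info.location = trim(line.substr(colon + 1));
            break;
        }
    }
    return info;
}
```

If `isRedirect()` is true, the crawler would request `location` instead of treating the "302 Found" page as content. A real implementation should also cap the number of consecutive redirects and resolve relative Location values against the original URL; neither is shown here.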

elcuco
22nd July 2008, 20:47
.. or use Perl, which has dedicated classes for this subject. Maybe Qt is not the best solution for your problem.

mave-rick
21st August 2008, 15:57
This probably won't be read by the original thread author, but I'll post anyway for the record.

You can use Mozilla's engine "Gecko" to parse HTML or XML. Go here and read: http://developer.mozilla.org/en/Gecko

Hope this helps someone.

tuthmosis
22nd August 2008, 01:54
WOW... Thanks mave-rick!!!
I hope this does what it claims when it comes to parsing...

I'll try to find a wrapper class to ease its usage with C++, Eclipse and Qt.

Thanks again

patrik08
23rd August 2008, 11:06
Read Qt Quarterly; there is a sample that queries image src attributes:
http://doc.trolltech.com/qq/qq25-webrobot.html
Change it to query a/href.

My method to load a remote or local image into a QTextEdit is:




/// from http://www.qt-apps.org/content/show.php/XHTML+Wysiwyg+Qeditor?content=59493

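// Member function of the editor class; html, dimage, AppendImage() and
// GetRemoteFile() are members defined elsewhere in the original source.
void Load_Image_Connector()
{
    // Capture the value of every src="..." or src='...' attribute.
    QRegExp expression( "src=[\"\'](.*)[\"\']", Qt::CaseInsensitive );
    expression.setMinimal(true);   // non-greedy: stop at the first closing quote
    int iPosition = 0;
    int canna = 0;                 // number of images found
    while( (iPosition = expression.indexIn( html , iPosition )) != -1 ) {
        QString semi1 = expression.cap( 1 );
        canna++;
        dimage.append(semi1);      /* image list 1 */
        AppendImage(semi1);        /* image list, local or remote */
        iPosition += expression.matchedLength();
    }
    // Schedule GetRemoteFile() to run about 1 ms later, from the event loop.
    QTimer::singleShot(1, this, SLOT(GetRemoteFile()));
}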
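
For the a/href variant, the same idea can be sketched without Qt. This is only an illustration using std::regex from the standard library in place of QRegExp; `extractHrefs` is a hypothetical name, not something from the sources linked in this thread. Instead of a minimal (non-greedy) match it uses a character class that cannot cross the closing quote, which has the same effect here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the value of every quoted href attribute found in html,
// in document order. Attribute names are matched case-insensitively.
std::vector<std::string> extractHrefs(const std::string& html)
{
    // href="..." or href='...'; [^"']* stops at the first closing quote
    static const std::regex hrefPattern(
        R"re(href\s*=\s*["']([^"']*)["'])re", std::regex::icase);

    std::vector<std::string> links;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefPattern);
         it != end; ++it) {
        links.push_back((*it)[1].str());   // capture group 1 is the URL
    }
    return links;
}
```

Like any regular-expression approach, this misses unquoted attribute values and any link that only exists after scripts have run, and it will also pick up href values from tags other than `<a>` (for example `<link>`).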



Another way is the ScribeParser class.
It parses the full document to find internal or external links.
File: http://fop-miniscribus.googlecode.com/svn/trunk/QTextPanel/qtextpanelcontrol.h


But none of these approaches parse JavaScript... even Google's robot engine cannot!