Best way to load and parse an HTML file?

**tuthmosis** · 22nd July 2008, 18:32

Originally Posted by aamer4yu

Why do you want to do that?
If you want to display HTML sites, have a look at QWebView, available from Qt 4.4 onwards.

Because I am creating a crawler: a robot application that loads DHTML documents, extracts some links, and recurses for each of those links...

So far, I've tested QHttp, but it does not always work. Sometimes the pages load perfectly (e.g. http://www.google.ca)
and sometimes it loads a "302 Found" dummy page or worse (e.g. any URL that represents a Google query).

**aamer4yu** · 22nd July 2008, 18:36

Well, in that case I guess you will have to parse the HTML file yourself. I am not aware of such a class in Qt.
Regular expressions might be of some help for parsing...
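
To make the regular-expression idea concrete, here is a minimal sketch in plain standard C++ (`std::regex`, not a Qt class). The function name `extractLinks` is made up for illustration, and the pattern assumes quoted `href` values inside `<a>` tags. It is not a real HTML parser and can be fooled by comments, scripts, or malformed markup.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the href value of every <a ...> tag in `html`.
// Assumption: attribute values are quoted with " or '; unquoted values are skipped.
std::vector<std::string> extractLinks(const std::string& html)
{
    // Group 1 captures the opening quote character, group 2 the URL up to the
    // matching closing quote. Matching is case-insensitive (<A HREF=...> works).
    static const std::regex anchor(
        R"re(<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1)re",
        std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), anchor), end;
         it != end; ++it) {
        links.push_back((*it)[2].str());
    }
    return links;
}
```

A crawler would call something like `extractLinks(pageSource)` on each downloaded page and queue the results. Relative links such as `/search?q=qt` still have to be resolved against the page's own URL before they can be fetched.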

**sadjoker** · 22nd July 2008, 20:39

Originally Posted by tuthmosis

Because I am creating a crawler: a robot application that loads DHTML documents, extracts some links, and recurses for each of those links...

So far, I've tested QHttp, but it does not always work. Sometimes the pages load perfectly (e.g. http://www.google.ca)
and sometimes it loads a "302 Found" dummy page or worse (e.g. any URL that represents a Google query).

Sometimes Google displays a CAPTCHA because it suspects a bot is searching. That's probably your 302 problem: a 302 response code means the request was redirected.
If you want to parse the HTML and fetch the links, use a RegExp to do it.
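
Since a 302 is a redirect, the "dummy page" is just the body of the redirect response, and the real target is in its `Location` header. Below is a rough sketch of detecting that case, written in plain standard C++ and working on the raw header text rather than through any Qt API. The names `statusCode` and `redirectTarget` are invented for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: status code from a status line such as
// "HTTP/1.1 302 Found". Returns -1 if the line is malformed.
int statusCode(const std::string& headers)
{
    std::istringstream in(headers);
    std::string version;
    int code = -1;
    if (!(in >> version >> code) || version.compare(0, 5, "HTTP/") != 0)
        return -1;
    return code;
}

// Hypothetical helper: value of the Location header if the response is a
// redirect (301, 302, 303, 307, 308), otherwise an empty string.
std::string redirectTarget(const std::string& headers)
{
    const int code = statusCode(headers);
    if (code != 301 && code != 302 && code != 303 && code != 307 && code != 308)
        return "";

    std::istringstream in(headers);
    std::string line;
    std::getline(in, line);                      // skip the status line
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) break;                 // blank line ends the headers

        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;

        // Header names are case-insensitive, so compare in lower case.
        std::string name = line.substr(0, colon);
        std::transform(name.begin(), name.end(), name.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (name != "location") continue;

        // Trim spaces and tabs around the header value.
        const std::string::size_type first = line.find_first_not_of(" \t", colon + 1);
        if (first == std::string::npos) return "";
        const std::string::size_type last = line.find_last_not_of(" \t");
        return line.substr(first, last - first + 1);
    }
    return "";
}
```

When `redirectTarget` returns a non-empty URL, the crawler can request that URL instead of parsing the body. It is worth capping the number of hops so a redirect loop cannot trap it. If the target turns out to be a CAPTCHA page, following it will not help, and the crawler has to slow down or skip that site.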

**elcuco** · 22nd July 2008, 20:47

...or use Perl, which has dedicated modules for this. Maybe Qt is not the best solution for your problem.

**mave-rick** · 21st August 2008, 15:57

This probably won't be read by the original thread author, but I'll post anyway for the record.

You can use Mozilla's engine, "Gecko", to parse HTML or XML. Read more here: http://developer.mozilla.org/en/Gecko

Hope this helps someone.

**tuthmosis** · 22nd August 2008, 01:54

Originally Posted by mave-rick

This probably won't be read by the original thread author, but I'll post anyway for the record.

You can use Mozilla's engine, "Gecko", to parse HTML or XML. Read more here: http://developer.mozilla.org/en/Gecko

Hope this helps someone.

Wow... thanks, mave-rick!
I hope it does what it claims when it comes to parsing...

I'll try to find a wrapper class to ease its usage with C++, Eclipse and Qt.

Thanks again.