HTML parsing



enno
28th September 2009, 14:08
Can anyone recommend a class for parsing HTML?
One of the key differences from XML appears to be HTML's sloppy end tags. I have tried subclassing QXmlDefaultHandler, but it dies on missing end tags. Even when I intercept the fatal error and continue, normal event reporting stops.

I have thought about an 'insert' function in a subclass of QXmlSimpleReader (to insert, on the fly, the end tags that HTML elides), but that also looks like a major job.
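
To make the 'insert' idea concrete, here is a minimal sketch in plain standard C++ (no Qt), written as a pre-processing pass over the text rather than as a reader subclass. Everything in it is an assumption for illustration: the names closeElidedTags and tagName are hypothetical, the two tag lists are a small hand-picked subset of HTML's real rules, and comments or attribute values containing '>' are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: lower-cased element name of a tag body
// such as "LI class=x" or "/p".
static std::string tagName(const std::string& body) {
    std::string name;
    std::size_t i = (!body.empty() && body[0] == '/') ? 1 : 0;
    for (; i < body.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(body[i]);
        if (std::isspace(c) || c == '/') break;
        name += static_cast<char>(std::tolower(c));
    }
    return name;
}

// Hypothetical pre-pass: returns the input with missing end tags inserted.
std::string closeElidedTags(const std::string& html) {
    // Assumed subsets only; real HTML has many more rules.
    static const std::set<std::string> voidTags =
        {"br", "hr", "img", "input", "link", "meta"};
    static const std::set<std::string> noSelfNest =
        {"p", "li", "tr", "td", "th", "option", "dt", "dd"};

    std::vector<std::string> open;  // stack of currently open elements
    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        std::size_t lt = html.find('<', pos);
        if (lt == std::string::npos) { out += html.substr(pos); break; }
        out += html.substr(pos, lt - pos);          // copy text before the tag
        std::size_t gt = html.find('>', lt);
        if (gt == std::string::npos) { out += html.substr(lt); break; }
        std::string body = html.substr(lt + 1, gt - lt - 1);
        pos = gt + 1;
        std::string name = tagName(body);

        // Empty tags, comments, doctype, processing instructions: copy as-is.
        if (name.empty() || body[0] == '!' || body[0] == '?') {
            out += "<" + body + ">";
            continue;
        }
        if (body[0] == '/') {                        // end tag
            if (std::find(open.begin(), open.end(), name) == open.end())
                continue;                            // stray end tag: drop it
            while (open.back() != name) {            // close what was left open
                out += "</" + open.back() + ">";
                open.pop_back();
                assert(!open.empty());               // name is known to be on the stack
            }
            out += "</" + name + ">";
            open.pop_back();
            continue;
        }
        if (voidTags.count(name)) {                  // <br> becomes <br/>
            out += "<" + body + (body.back() == '/' ? ">" : "/>");
            continue;
        }
        // A second <li> directly inside an open <li> implies </li> first.
        if (noSelfNest.count(name) && !open.empty() && open.back() == name) {
            out += "</" + name + ">";
            open.pop_back();
        }
        out += "<" + body + ">";
        if (body.back() != '/') open.push_back(name);
    }
    while (!open.empty()) {                          // close leftovers at end of input
        out += "</" + open.back() + ">";
        open.pop_back();
    }
    return out;
}
```

With this sketch, closeElidedTags("<ul><li>one<li>two</ul><p>a<br>b") gives "<ul><li>one</li><li>two</li></ul><p>a<br/>b</p>". The result could then be handed to an XML parser, but unquoted attributes, entities such as &nbsp;, and the many implied-tag rules not listed here would still trip it up, which is why this stays a major job.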

Any suggestions on how to get a proper DOM document from an HTML source?
Enno

luf
29th September 2009, 07:53
Suggested reading:
http://www.qtcentre.org/forum/f-qt-programming-2/t-parsing-html-4698.html

Tidy can be found here:
http://tidy.sourceforge.net/
A C++ wrapper is here:
http://users.rcn.com/creitzel/tidy.html#cplusplus

Tidy can clean the HTML up into well-formed XHTML, which you should then be able to load into QDomDocument.

piotr.dobrogost
29th September 2009, 11:52
Yes, parsing real-world (broken) HTML is not an easy task. You could try HTML Tidy, but if you're already using Qt I would advise against it and suggest using what Qt already provides: QtWebKit and QWebElement, which is new in Qt 4.6. You'll have your DOM ready in 15 minutes.