Parsing HTML



stevey
1st December 2006, 12:19
Hi,

I've been trying to call setContent() on a QDomDocument and wondering why it fails to load the content.
Can anyone recommend a way to parse HTML data?
I am reading from a website that has thousands of pages, all following the same format, and there are some common fields I need to extract from each page.

I've thought of two approaches:
1. The above, where I could search for a child node by name in a QDomDocument.
2. Transform the HTML with XSL, writing out only the elements I require.

I don't think the first will work, though, because the pages are so badly formed, and I'm not sure the XSL approach will work either, for the same reason.
If XSL is the way to go, does Qt provide a way of running an XSLT on a document? I guess I'd still need to run the XSL on a QDomDocument anyway.

I had a quick glance at the HTML Tidy web page and noticed that it still leaves <p> and <li> unclosed. Does QDomDocument know how to handle these cases?



Thanks,

Steve York

jpn
1st December 2006, 12:40
I had a quick glance at the HTML Tidy web page and noticed that it still leaves <p> and <li> unclosed. Does QDomDocument know how to handle these cases?

Unfortunately, no, it can't. QDomDocument can only parse well-formed XML, and most HTML documents aren't well-formed XML. XHTML would make a difference.

Maybe a QRegExp would be sufficient for picking out the information?
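
As a rough sketch of the regular-expression idea: the snippet below uses std::regex from the standard library as a stand-in for QRegExp, so that it compiles without Qt. The helper name extractField and the tag names in the usage note are assumptions made up for illustration, not anything from the actual site. It only suits pages where each wanted field sits in a single, properly closed, non-nested element; it is not a general HTML parser.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds the first <tag ...>...</tag> pair in html and
// copies the text between the tags into out. Returns false when no such pair
// exists. Unclosed tags elsewhere in the page do not matter, because the
// pattern only looks at the one element that is asked for.
bool extractField(const std::string &html, const std::string &tag, std::string &out)
{
    // \b stops "p" from matching "<pre>", [^>]* skips any attributes, and
    // [\s\S]*? is a non-greedy match that may span newlines.
    const std::regex pattern("<" + tag + "\\b[^>]*>([\\s\\S]*?)</" + tag + ">",
                             std::regex::ECMAScript | std::regex::icase);
    std::smatch match;
    if (!std::regex_search(html, match, pattern))
        return false;
    out = match[1].str();
    return true;
}
```

Calling extractField(page, "title", value) on a page that contains <title>My Page</title> would put My Page into value, even if the same page has stray <p> or <li> tags that are never closed. If several elements share the same tag name, only the first one is returned, so in practice the pattern would probably need to match on an id or class attribute as well.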

Brandybuck
1st December 2006, 20:01
Tidy will convert HTML to XML. You need to run it in a separate process though. If you need to do this in-process, then take a look at libxslt.