Need a Qt class to handle HTML documents...



tuthmosis
25th May 2010, 00:42
I thought QDomDocument could do the job, but it just can't handle HTML... at least I cannot make it work.

Any ideas?

Thanks

SixDegrees
25th May 2010, 07:54
What is the job? QDomDocument is built to work with XML. HTML, in general, is not well-formed XML, and XML parsers will generally choke on it.

tuthmosis
25th May 2010, 14:01
Our goal is to load a web page containing a table that displays data we must import into a MySQL database.

The web page and the table will never change in their structure.

aamer4yu
25th May 2010, 14:34
Will QTextEdit::setHtml or QWebView::setHtml be of some use to you?

tuthmosis
25th May 2010, 15:00
I should have pointed out that this extraction process has to be automated... The application will be a daemon that loads the web page every morning at 1am.

I am currently looking at QWebPage but am having some difficulty using the QtWebKit module... it looks like I always have to go through QWebPage, then QWebFrame, then QWebElement....

Not sure I am on the right track.

jryannel
25th May 2010, 20:38
Maybe you can use XQuery; see http://doc.qt.nokia.com/4.7-snapshot/qxmlquery.html. If your document is not valid XML, it might be even better to parse it with a regexp and extract the required information.

XQuery adds some complexity to your project, as you need to understand it first ;-) Here is a tutorial: http://www.w3schools.com/xquery/default.asp.

Good luck!

SixDegrees
25th May 2010, 20:51
If the web page will never change, then just slurp the HTML into a string and parse it yourself, perhaps using regular expressions. Or, if the page section containing whatever you're interested in conforms to XML specifications, extract that and hand it off to the XML parser for final processing.
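
As a rough sketch of the "slurp it into a string and parse it yourself" idea: assuming the page has already been downloaded into a string, the rows and cells of a simple table can be pulled out with two regular expressions. The function name extractTableRows is made up for illustration, and the code uses std::regex from the C++ standard library rather than a Qt class so that it compiles without Qt. It assumes a flat table with no nested tables, which is only reasonable because the page structure is fixed.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns one vector of cell strings per <tr> row.
// Assumes a simple, non-nested table whose markup never changes.
std::vector<std::vector<std::string>> extractTableRows(const std::string& html) {
    // \b keeps <tr from matching other tags and <th from matching <thead>.
    // [\s\S] is used instead of '.' so that a row may span several lines.
    static const std::regex rowPattern(R"(<tr\b[^>]*>([\s\S]*?)</tr>)",
                                       std::regex::icase);
    static const std::regex cellPattern(R"(<t[dh]\b[^>]*>([\s\S]*?)</t[dh]>)",
                                        std::regex::icase);

    std::vector<std::vector<std::string>> rows;
    const std::sregex_iterator end;
    for (std::sregex_iterator row(html.begin(), html.end(), rowPattern);
         row != end; ++row) {
        // Capture group 1 holds everything between <tr ...> and </tr>.
        const std::string rowBody = (*row)[1].str();
        std::vector<std::string> cells;
        for (std::sregex_iterator cell(rowBody.begin(), rowBody.end(), cellPattern);
             cell != end; ++cell) {
            // Capture group 1 holds the raw cell content, tags excluded.
            cells.push_back((*cell)[1].str());
        }
        rows.push_back(cells);
    }
    return rows;
}
```

Each inner vector then maps naturally onto one row to insert into the database. Cell contents are returned raw, so any entities or inline markup inside a cell would still need cleaning up.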

tuthmosis
27th May 2010, 02:34
One thing is for sure: we cannot use QRegExp, as it is not compliant with standard regular expression syntax....
For example, assuming we want to extract "this is a test" from "<TH>this is a test</TD>": QRegExp handles lookahead but not lookbehind, so it is possible to say "return the string that is immediately followed by </TD>", but it is not possible to say "return the string that immediately follows <TH>".
So at best I could extract the following string:
"<TH>this is a test"

Does anyone know how to do this?
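
One common way around missing lookbehind is a capturing group: match the surrounding tags as well, but wrap only the wanted text in parentheses and read back that group instead of the whole match. The sketch below shows this with std::regex from the C++ standard library so that it compiles without Qt; the function name textBetweenTags is made up for illustration. QRegExp offers the same idea through indexIn() followed by cap(1), with setMinimal(true) taking the place of the non-greedy `.*?` syntax.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text between <TH> and </TD>,
// or an empty string when the pattern does not occur in the input.
std::string textBetweenTags(const std::string& input) {
    // The tags are part of the match, but only the parenthesised
    // group is returned, so no lookbehind is needed.
    static const std::regex pattern("<TH>(.*?)</TD>", std::regex::icase);
    std::smatch match;
    if (std::regex_search(input, match, pattern)) {
        return match[1].str();  // group 0 is the whole match, group 1 the text
    }
    return std::string();
}
```

Called on "<TH>this is a test</TD>", the helper returns the text with neither tag attached.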