
QT HTML Parser (+XQuery)



monkazer
9th July 2012, 00:01
Hello

I'm looking for an HTML parser that works with Qt.
I have some HTML source and I'd like to run XQuery on it. I already tried QWebPage + QWebElement, but I don't like that solution: first, it doesn't work outside the GUI thread (because of QWebPage), and second, it only supports CSS selectors, not XPath.
The other solution I tried is QXmlQuery. It works great, except that it fails as soon as the page contains an error. For example, the first page I tried was missing the system identifier in its DOCTYPE declaration, so parsing was aborted.

I heard that Gecko can be used for parsing, but I have no idea how to use it with Qt.

Do you have any suggestions?

Thanks

ChrisW67
9th July 2012, 00:37
Qt does not have a Beautiful Soup-style parser outside of the WebKit components that I am aware of. Most HTML is not XML so the XML tools, including XSLT, XPath and XQuery, are essentially useless.

You could look at Tidy (http://tidy.sourceforge.net/) for a way to handle the ill-disciplined mush that is HTML and make a valid, though not necessarily meaningful, XML document from it.

monkazer
10th July 2012, 00:08
Thanks for your answer.
Do you have an example of converting an HTML string to an XML document with Tidy?

ChrisW67
10th July 2012, 00:52
Browsers have no problem with test.html:


<html>
<head>
<title>Title</title>
</HEAD>
<BODY>
<P>A para
<p>Another para
<p>None of it is <b>XML</p>
</Body>


Using the tidy command line tool,


$ tidy -asxml test.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 8 column 22 - Warning: missing </b> before </p>
Info: Document content looks like XHTML 1.0 Strict
2 warnings, 0 errors were found!

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />
<title>Title</title>
</head>
<body>
<p>A para</p>
<p>Another para</p>
<p>None of it is <b>XML</b></p>
</body>
</html>

To learn more about HTML Tidy see http://tidy.sourceforge.net
Please fill bug reports and queries using the "tracker" on the Tidy web site.
Additionally, questions can be sent to html-tidy@w3.org
HTML and CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium

I am sure the equivalent is possible through the tidy library.

Another approach might be to use the Qt WebKit Bridge to execute JavaScript in the browser to extract the elements you are after.

monkazer
24th July 2012, 23:52
I finally ended up using libxml2. It does exactly what I wanted.
Here is an example.


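#include <QByteArray>
#include <QDebug>
#include <QStringList>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

// Evaluates an XPath expression against (possibly malformed) HTML and returns
// the content of the first child of each matching node.
QStringList Core::queryHTML(const QString &html, const QString &query) {
    QStringList list;

    // Convert once and keep the UTF-8 buffers alive during the libxml2 calls.
    const QByteArray htmlUtf8 = html.toUtf8();
    const QByteArray queryUtf8 = query.toUtf8();

    htmlParserCtxtPtr ctxt = htmlNewParserCtxt();
    if (ctxt == NULL)
        return list;
    // The buffer is UTF-8 after toUtf8(), so say so instead of passing NULL.
    htmlDocPtr htmlDoc = htmlCtxtReadMemory(ctxt, htmlUtf8.constData(), htmlUtf8.size(),
                                            "", "UTF-8", 0);
    if (htmlDoc == NULL) {
        qDebug() << "Could not parse HTML";
        htmlFreeParserCtxt(ctxt);
        return list;
    }

    xmlXPathContextPtr context = xmlXPathNewContext(htmlDoc);
    xmlXPathObjectPtr result = xmlXPathEvalExpression(
        reinterpret_cast<const xmlChar *>(queryUtf8.constData()), context);
    xmlXPathFreeContext(context);
    if (result == NULL) {
        qDebug() << "Invalid XPath expression?";
    } else {
        xmlNodeSetPtr nodeSet = result->nodesetval;
        if (!xmlXPathNodeSetIsEmpty(nodeSet)) {
            for (int i = 0; i < nodeSet->nodeNr; i++) {
                xmlNodePtr nodePtr = nodeSet->nodeTab[i];
                // Skip nodes without a text child instead of dereferencing NULL.
                if (nodePtr->children == NULL || nodePtr->children->content == NULL)
                    continue;
                QString xml = QString::fromUtf8(
                    reinterpret_cast<const char *>(nodePtr->children->content));
                list.append(decodeXml(xml)); // decodeXml() is a helper not shown here
            }
        }
        xmlXPathFreeObject(result);
    }

    // Release the document and parser context, which the first version leaked.
    xmlFreeDoc(htmlDoc);
    htmlFreeParserCtxt(ctxt);
    return list;
}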
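
The decodeXml() helper called above is not shown in the post. One plausible reading is that it unescapes the five predefined XML entities. Under that assumption only, a minimal stand-in could look like the sketch below; the name decodeXmlEntities is hypothetical, and it uses std::string rather than QString so that it compiles without Qt.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for decodeXml(): replaces the five predefined XML
// entities with the characters they denote, in a single left-to-right pass.
std::string decodeXmlEntities(const std::string &in) {
    static const struct { const char *entity; char ch; } table[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&quot;", '"'}, {"&apos;", '\''}, {"&amp;", '&'},
    };
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        bool matched = false;
        if (in[i] == '&') {
            for (const auto &e : table) {
                const std::string name(e.entity);
                if (in.compare(i, name.size(), name) == 0) {
                    out += e.ch;       // emit the decoded character
                    i += name.size();  // and skip past the entity
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            out += in[i++];  // ordinary character or unknown entity: copy as-is
    }
    return out;
}
```

Because the input is scanned once, already-decoded output is never decoded a second time, so "&amp;lt;" becomes "&lt;" rather than "<".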