Re: QT HTML Parser (+XQuery)
As far as I am aware, Qt does not have a Beautiful Soup-style parser outside of the WebKit components. Most HTML is not well-formed XML, so the XML tools, including XSLT, XPath and XQuery, are essentially useless on it.
You could look at Tidy for a way to handle the ill-disciplined mush that is HTML and make a valid, though not necessarily meaningful, XML document from it.
Re: QT HTML Parser (+XQuery)
Thanks for your answer.
Is there an example of converting an HTML string to an XML document with Tidy?
Re: QT HTML Parser (+XQuery)
Browsers have no problem with test.html:
Code:
<html>
<head>
<title>Title</title>
</HEAD>
<BODY>
<P>A para
<p>Another para
<p>None of it is <b>XML</p>
</Body>
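An XML parser, on the other hand, would reject that file: tag names are case-sensitive in XML, and every element must be closed. Here is a rough sketch of a tag-balance check that shows the difference. The function names looksWellFormed() and selfCheck() are made up for illustration, and this is nowhere near a real XML parser (it ignores comments, CDATA and '>' inside attribute values).

```cpp
#include <cassert>
#include <cstddef>
#include <stack>
#include <string>

// Naive check: true only if every start tag is closed, in order, by an end
// tag with the identical (case-sensitive) name. Hypothetical helper, not a
// substitute for a real XML parser.
bool looksWellFormed(const std::string &markup)
{
    std::stack<std::string> open;
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos)
            return false;                 // unterminated tag
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty())
            return false;                 // "<>" is never valid
        if (tag[0] == '!' || tag[0] == '?')
            continue;                     // DOCTYPE or processing instruction
        if (tag.back() == '/')
            continue;                     // self-closing, e.g. <meta ... />
        if (tag[0] == '/') {
            // End tag: must match the most recently opened element exactly.
            if (open.empty() || open.top() != tag.substr(1))
                return false;
            open.pop();
        } else {
            // Start tag: keep only the element name, drop any attributes.
            open.push(tag.substr(0, tag.find_first_of(" \t\r\n")));
        }
    }
    return open.empty();                  // leftover open elements are an error
}

// Usage: a fragment in the style of test.html fails, a tidied one passes.
void selfCheck()
{
    assert(!looksWellFormed("<head><title>Title</title></HEAD>"));
    assert(!looksWellFormed("<p>A para<p>Another para</p>"));
    assert(looksWellFormed("<p>None of it is <b>XML</b></p>"));
}
```

The first fragment fails on the head/HEAD case mismatch and the second on the unclosed paragraph, which are the same kinds of problems test.html has.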
Using the tidy command line tool:
Code:
$ tidy -asxml test.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 8 column 22 - Warning: missing </b> before </p>
Info: Document content looks like XHTML 1.0 Strict
2 warnings, 0 errors were found!
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />
<title>Title</title>
</head>
<body>
<p>A para</p>
<p>Another para</p>
<p>None of it is <b>XML</b></p>
</body>
</html>
To learn more about HTML Tidy see http://tidy.sourceforge.net
Please fill bug reports and queries using the "tracker" on the Tidy web site.
Additionally, questions can be sent to html-tidy@w3.org
HTML and CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium
I am sure the equivalent is possible through the tidy library.
Another approach might be to use the Qt WebKit Bridge to execute JavaScript in the browser to extract the elements you are after.
Re: QT HTML Parser (+XQuery)
I finally ended up using libxml2. It does exactly what I wanted.
Here is an example.
Code:
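// Requires <libxml/HTMLparser.h> and <libxml/xpath.h>. decodeXml() is a helper that is not shown here.
// The signature below is assumed; the original snippet started mid-function.
QStringList runXPath(const QString &html, const QString &query)
{
    QStringList list;
    const QByteArray htmlUtf8 = html.toUtf8();   // keep the buffers alive while libxml2 reads them
    const QByteArray queryUtf8 = query.toUtf8();
    htmlParserCtxtPtr ctxt = htmlNewParserCtxt();
    htmlDocPtr htmlDoc = htmlCtxtReadMemory(ctxt, htmlUtf8.constData(), htmlUtf8.size(),
                                            "", "UTF-8", 0);
    if (htmlDoc == NULL) {
        qDebug() << "Could not parse the HTML";
        htmlFreeParserCtxt(ctxt);
        return list;
    }
    xmlXPathContextPtr context = xmlXPathNewContext(htmlDoc);
    xmlXPathObjectPtr result = xmlXPathEvalExpression(
        reinterpret_cast<const xmlChar *>(queryUtf8.constData()), context);
    xmlXPathFreeContext(context);
    if (result == NULL) {
        qDebug() << "Invalid XPath expression?";
    }
    else {
        xmlNodeSetPtr nodeSet = result->nodesetval;
        if (!xmlXPathNodeSetIsEmpty(nodeSet)) {
            for (int i = 0; i < nodeSet->nodeNr; i++) {
                xmlNodePtr nodePtr = nodeSet->nodeTab[i];
                // Assumption: the original did not show where `xml` came from; serialise the node here.
                xmlBufferPtr buffer = xmlBufferCreate();
                xmlNodeDump(buffer, htmlDoc, nodePtr, 0, 0);
                const QString xml = QString::fromUtf8(reinterpret_cast<const char *>(xmlBufferContent(buffer)));
                xmlBufferFree(buffer);
                list.append(decodeXml(xml));
            }
        }
        xmlXPathFreeObject(result);
    }
    xmlFreeDoc(htmlDoc);        // the document and parser context were leaked before
    htmlFreeParserCtxt(ctxt);
    return list;
}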