QT HTML Parser (+XQuery)

9th July 2012, 00:01
monkazer

QT HTML Parser (+XQuery)

Hello

I'm looking for a Qt HTML parser tool.
I have some HTML source and I'd like to run XQuery on it. I already tried QWebPage + QWebElement, but I don't like that solution: first, it doesn't work on a non-GUI thread (because of QWebPage), and second, it only supports CSS selectors, not XPath.
The other solution I tried is QXmlQuery. It works great, but it fails if the page contains any error. For example, the first page I tried was missing the system identifier in its DOCTYPE declaration, so parsing was aborted.

I heard we can use Gecko for parsing, but I have no idea how to use it with Qt.

Do you have any suggestions?

Thanks

9th July 2012, 00:37
ChrisW67

Re: QT HTML Parser (+XQuery)

As far as I am aware, Qt does not have a Beautiful Soup-style parser outside of the WebKit components. Most HTML is not well-formed XML, so the XML tools, including XSLT, XPath and XQuery, are essentially useless on it.

You could look at Tidy for a way to handle the ill-disciplined mush that is HTML and make a valid, though not necessarily meaningful, XML document from it.

10th July 2012, 00:08
monkazer

Re: QT HTML Parser (+XQuery)

Thanks for your answer.
Do you have an example of converting an HTML string to an XML document with Tidy?

Re: QT HTML Parser (+XQuery)

Browsers have no problem with test.html:

Code:

<html>
<head>
<title>Title</title>
</HEAD>
<BODY>
    <P>A para
    <p>Another para
    <p>None of it is <b>XML</p>
</Body>
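
An XML tool, by contrast, stops on this file because XML tag names are case-sensitive and every opened element must be closed in order. The following is only a minimal sketch of those two rules, not a real XML parser: the function name firstXmlTagError is made up for illustration, and it assumes there are no comments, CDATA sections, or '>' characters inside attribute values.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns an empty string if the tags in `markup` nest like well-formed XML,
// otherwise a short description of the first problem found.
// Simplified on purpose: no comments, CDATA, or '>' inside attribute values.
std::string firstXmlTagError(const std::string &markup) {
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos) {
            return "unterminated tag";
        }
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty() || tag[0] == '!' || tag[0] == '?') {
            continue;  // skip declarations such as <!DOCTYPE ...> and <?xml ...?>
        }
        if (tag.back() == '/') {
            continue;  // self-closing element, e.g. <br/>
        }
        const bool closing = (tag[0] == '/');
        // The element name runs up to the first whitespace (attributes follow it).
        std::string name = tag.substr(closing ? 1 : 0);
        name = name.substr(0, name.find_first_of(" \t\r\n"));
        if (!closing) {
            open.push_back(name);
        } else if (open.empty()) {
            return "unexpected </" + name + ">";
        } else if (open.back() != name) {  // exact, case-sensitive comparison
            return "expected </" + open.back() + "> but found </" + name + ">";
        } else {
            open.pop_back();
        }
    }
    if (!open.empty()) {
        return "missing </" + open.back() + ">";
    }
    return std::string();
}
```

Called on the contents of test.html, this reports the </HEAD> that is supposed to close <head> as the first problem, while markup whose tags are closed in order and in matching case yields an empty string.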

Using the tidy command-line tool:

Code:

$ tidy -asxml test.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 8 column 22 - Warning: missing </b> before </p>
Info: Document content looks like XHTML 1.0 Strict
2 warnings, 0 errors were found!
 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />
<title>Title</title>
</head>
<body>
<p>A para</p>
<p>Another para</p>
<p>None of it is <b>XML</b></p>
</body>
</html>
 
To learn more about HTML Tidy see http://tidy.sourceforge.net
Please fill bug reports and queries using the "tracker" on the Tidy web site.
Additionally, questions can be sent to html-tidy@w3.org
HTML and CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium

I am sure the equivalent is possible through the tidy library.

Another approach might be to use the Qt WebKit Bridge to execute JavaScript in the browser to extract the elements you are after.

Re: QT HTML Parser (+XQuery)

I finally ended up using libxml2. It does exactly what I wanted.
Here is an example.

Code:

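#include <QDebug>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

// Runs an XPath expression against (possibly malformed) HTML and returns the
// text of the first child of each matching node.
QStringList Core::queryHTML(const QString &html, const QString &query) {
    QStringList list;

    // Convert once and keep the UTF-8 buffers alive for the libxml2 calls.
    const QByteArray htmlUtf8 = html.toUtf8();
    const QByteArray queryUtf8 = query.toUtf8();

    htmlParserCtxtPtr ctxt = htmlNewParserCtxt();
    if (ctxt == NULL) {
        return list;
    }
    htmlDocPtr htmlDoc = htmlCtxtReadMemory(ctxt, htmlUtf8.constData(), htmlUtf8.size(),
                                            "", NULL, 0);
    if (htmlDoc == NULL) {
        qDebug() << "Could not parse HTML";
        htmlFreeParserCtxt(ctxt);
        return list;
    }

    xmlXPathContextPtr context = xmlXPathNewContext(htmlDoc);
    xmlXPathObjectPtr result = xmlXPathEvalExpression(
        reinterpret_cast<const xmlChar *>(queryUtf8.constData()), context);
    xmlXPathFreeContext(context);
    if (result == NULL) {
        qDebug() << "Invalid XPath expression?";
    }
    else {
        xmlNodeSetPtr nodeSet = result->nodesetval;
        if (!xmlXPathNodeSetIsEmpty(nodeSet)) {
            for (int i = 0; i < nodeSet->nodeNr; i++) {
                xmlNodePtr nodePtr = nodeSet->nodeTab[i];
                // Skip nodes without a text child instead of dereferencing NULL.
                if (nodePtr->children == NULL || nodePtr->children->content == NULL) {
                    continue;
                }
                QString xml = QString::fromUtf8((const char *) nodePtr->children->content);
                // decodeXml() is my own helper; its definition is not shown here.
                list.append(decodeXml(xml));
            }
        }

        xmlXPathFreeObject(result);
    }

    // Release the document and the parser context so they are not leaked.
    xmlFreeDoc(htmlDoc);
    htmlFreeParserCtxt(ctxt);

    return list;
}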