
Thread: QT HTML Parser (+XQuery)

  1. #1

    QT HTML Parser (+XQuery)

    Hello

    I'm looking for an HTML parser that I can use with Qt.
    I have some HTML source and I'd like to run XQuery on it. I already tried QWebPage + QWebElement, but I don't like that solution: first, it doesn't work in a non-GUI thread (because of QWebPage), and second, QWebElement only accepts CSS selectors, not XPath.
    The other solution I tried is QXmlQuery. It works great, except that it fails as soon as the page contains an error. For example, the first page I tried was missing the system identifier in its DOCTYPE declaration, so parsing was aborted.

    I heard Gecko can be used for parsing, but I have no idea how to use it with Qt.

    Do you have any suggestions?

    Thanks

  2. #2
    Join Date
    Mar 2009
    Location
    Brisbane, Australia
    Posts
    7,729
    Thanks
    13
    Thanked 1,610 Times in 1,537 Posts
    Qt products
    Qt4 Qt5
    Platforms
    Unix/X11 Windows
    Wiki edits
    17

    Re: QT HTML Parser (+XQuery)

    As far as I am aware, Qt does not have a Beautiful Soup-style parser outside of the WebKit components. Most HTML is not well-formed XML, so the XML tools, including XSLT, XPath and XQuery, are essentially useless on it.

    You could look at Tidy for a way to handle the ill-disciplined mush that is HTML and make a valid, though not necessarily meaningful, XML document from it.

  3. #3

    Re: QT HTML Parser (+XQuery)

    Thanks for your answer.
    Do you have an example of converting an HTML string to an XML document with Tidy?

  4. #4

    Re: QT HTML Parser (+XQuery)

    Browsers have no problem with test.html:
    Qt Code:
    <html>
    <head>
    <title>Title</title>
    </HEAD>
    <BODY>
    <P>A para
    <p>Another para
    <p>None of it is <b>XML</p>
    </Body>

    Using the tidy command line tool,
    Qt Code:
    $ tidy -asxml test.html
    line 1 column 1 - Warning: missing <!DOCTYPE> declaration
    line 8 column 22 - Warning: missing </b> before </p>
    Info: Document content looks like XHTML 1.0 Strict
    2 warnings, 0 errors were found!

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="generator" content=
    "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />
    <title>Title</title>
    </head>
    <body>
    <p>A para</p>
    <p>Another para</p>
    <p>None of it is <b>XML</b></p>
    </body>
    </html>

    To learn more about HTML Tidy see http://tidy.sourceforge.net
    Please fill bug reports and queries using the "tracker" on the Tidy web site.
    Additionally, questions can be sent to html-tidy@w3.org
    HTML and CSS specifications are available from http://www.w3.org/
    Lobby your company to join W3C, see http://www.w3.org/Consortium

    I am sure the equivalent is possible through the tidy library.

    Another approach might be to use the Qt WebKit Bridge to execute JavaScript in the browser to extract the elements you are after.

  5. #5

    Re: QT HTML Parser (+XQuery)

    I finally ended up using libxml2. It does exactly what I wanted.
    Here is an example.

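    Qt Code:
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>

    // Parses (possibly malformed) HTML with libxml2 and returns the text of the
    // first child of every node matched by the XPath expression in `query`.
    // decodeXml() is a helper from my own code base that is not shown here.
    QStringList Core::queryHTML(const QString &html, const QString &query) {
        QStringList list;
        // Keep the UTF-8 buffers alive for as long as libxml2 reads from them.
        const QByteArray htmlUtf8 = html.toUtf8();
        const QByteArray queryUtf8 = query.toUtf8();

        htmlParserCtxtPtr ctxt = htmlNewParserCtxt();
        if (ctxt == NULL)
            return list;
        // The buffer is known to be UTF-8, so say so instead of letting libxml2 guess.
        htmlDocPtr htmlDoc = htmlCtxtReadMemory(ctxt, htmlUtf8.constData(), htmlUtf8.size(),
                                                "", "UTF-8", 0);
        if (htmlDoc == NULL) {
            htmlFreeParserCtxt(ctxt);
            return list;
        }

        xmlXPathContextPtr context = xmlXPathNewContext(htmlDoc);
        xmlXPathObjectPtr result = xmlXPathEvalExpression(
            reinterpret_cast<const xmlChar *>(queryUtf8.constData()), context);
        xmlXPathFreeContext(context);
        if (result == NULL) {
            qDebug() << "Invalid XPath expression?";
        } else {
            xmlNodeSetPtr nodeSet = result->nodesetval;
            if (!xmlXPathNodeSetIsEmpty(nodeSet)) {
                for (int i = 0; i < nodeSet->nodeNr; i++) {
                    xmlNodePtr nodePtr = nodeSet->nodeTab[i];
                    // Skip matches without a text child (e.g. empty elements).
                    if (nodePtr->children == NULL || nodePtr->children->content == NULL)
                        continue;
                    QString xml = QString::fromUtf8((const char *) nodePtr->children->content);
                    list.append(decodeXml(xml));
                }
            }
            xmlXPathFreeObject(result);
        }

        // Release the document and the parser context.
        xmlFreeDoc(htmlDoc);
        htmlFreeParserCtxt(ctxt);
        return list;
    }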
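    The `decodeXml` helper called in the listing above is not shown in the post, so what it really does is unknown. As a rough sketch only, assuming it simply unescapes the five predefined XML entities, it might look something like the function below. The name `decodeXmlEntities` is hypothetical, and `std::string` is used instead of `QString` so that the snippet compiles without Qt.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the post's decodeXml(): replaces the five
// predefined XML entities with the characters they stand for.
// Assumption: this is all the original helper did; the post does not say.
std::string decodeXmlEntities(const std::string &in) {
    static const struct { const char *entity; char ch; } table[] = {
        {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'},
        {"&quot;", '"'}, {"&apos;", '\''}
    };

    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        bool matched = false;
        if (in[i] == '&') {
            for (const auto &e : table) {
                const std::size_t len = std::char_traits<char>::length(e.entity);
                if (in.compare(i, len, e.entity) == 0) {
                    out += e.ch;   // emit the decoded character
                    i += len;      // and skip the whole entity
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            out += in[i++];        // copy anything else unchanged
    }
    return out;
}
```

    The input is scanned once from left to right, so text that was escaped twice is only decoded one level: `&amp;lt;` becomes `&lt;`, not `<`. A bare `&` that does not start a known entity is copied through unchanged.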
