
Thread: Parsing HTML

  1. #1

    Parsing HTML

    Hi,

    I've been trying to call setContent() on a QDomDocument and I'm wondering why it fails on my content.
    Can anyone recommend a way to parse HTML data?
    I am reading from a website which has thousands of pages which follow a format and there are some common fields I need to extract from each page.

    I've thought of 2 approaches:
    1. The above where I could search for a child node by name in a QDomDocument.
    2. Transform the HTML with XSL only writing out the elements I require.

    I don't think the first will work, though, because the pages are so badly formed, and I'm not sure the XSL approach will work either, for the same reason.
    If XSL is the way to go, does Qt provide a way of running an XSLT on a document? I guess I'd still need to load the page into a QDomDocument before running the XSL anyway.

    I had a quick glance at the HTML Tidy web page and noticed that it still leaves <p> and <li> unclosed. Does QDomDocument know how to handle these cases?



    Thanks,

    Steve York

  2. #2

    Re: Parsing HTML

    Originally posted by stevey:
    I had a quick glance at the HTML Tidy web page and noticed that it still leaves <p> and <li> unclosed. Does QDomDocument know how to handle these cases?
    Unfortunately not. QDomDocument can only parse well-formed XML, and most HTML documents aren't well-formed XML. XHTML would make a difference.

    Maybe a QRegExp would be sufficient for picking the information?
    J-P Nurmi
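
For readers who want to try the regular-expression route suggested above, here is a minimal sketch. It uses std::regex from the C++ standard library instead of QRegExp so that it compiles without Qt, and the field it extracts (the page's <title>) is only a hypothetical stand-in for whatever common fields the real pages contain. The helper names extractFirst and extractTitle are made up for this example.

```cpp
#include <regex>
#include <string>

// Stores the text captured by the first group of `pattern` in `out`.
// Returns false (and leaves `out` untouched) if the pattern does not occur.
bool extractFirst(const std::string& html, const std::regex& pattern, std::string& out)
{
    std::smatch match;
    if (std::regex_search(html, match, pattern) && match.size() > 1) {
        out = match[1].str();
        return true;
    }
    return false;
}

// Hypothetical field: the contents of the <title> element.
// Matching is case-insensitive because old HTML often uses upper-case tags.
bool extractTitle(const std::string& html, std::string& out)
{
    static const std::regex titlePattern(R"(<title[^>]*>([^<]*)</title>)",
                                         std::regex::icase);
    return extractFirst(html, titlePattern, out);
}
```

Because the pattern only looks at the text around the field, it does not care whether <p> or <li> elsewhere on the page are closed. The trade-off is that it breaks as soon as the site changes the markup around the field, so it suits pages that really do follow one fixed format.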

  3. #3
    Join Date
    Mar 2006
    Location
    Mountain View, California
    Posts
    489
    Thanks
    3
    Thanked 74 Times in 54 Posts
    Qt products
    Qt3 Qt4 Qt/Embedded
    Platforms
    MacOS X Unix/X11 Windows

    Default Re: Parsing HTML

    Tidy will convert HTML to XML. You need to run it in a separate process though. If you need to do this in-process, then take a look at libxslt.
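
A rough sketch of the separate-process approach follows. Assumptions: a POSIX system with a shell, the tidy executable on PATH, and an input path that contains no shell metacharacters. It uses popen() to stay free of dependencies; in a Qt application QProcess would be the more natural choice. The function names runAndCapture and tidyToXhtml are hypothetical.

```cpp
#include <stdio.h>   // popen, pclose, fread (POSIX)
#include <stdexcept>
#include <string>

// Runs `command` through the shell and returns everything it wrote to stdout.
std::string runAndCapture(const std::string& command)
{
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe)
        throw std::runtime_error("popen failed for: " + command);

    std::string output;
    char buffer[4096];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0)
        output.append(buffer, n);

    pclose(pipe);  // exit status ignored here; tidy reports warnings through it
    return output;
}

// Asks Tidy for XHTML output with numeric entities.
// Assumption: `inputPath` is trusted, since it is pasted into a shell command.
std::string tidyToXhtml(const std::string& inputPath)
{
    return runAndCapture("tidy -quiet -asxhtml -numeric \"" + inputPath +
                         "\" 2>/dev/null");
}
```

The string returned by tidyToXhtml() is what you would then hand to QDomDocument::setContent(). In XHTML output mode Tidy is expected to close elements such as <p> and <li>, but it is worth checking the result on a few real pages before relying on it.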
