Thread: Extracting text from HTML documents

  1. #1

    Re: Extracting text from HTML documents

    What if you extract the text from a copy of the element, a copy on which you call removeAllChildren() before extraction?

    Cheers,
    _

  2. #2

    Re: Extracting text from HTML documents

    Originally posted by anda_skoa:
    What if you extract the text from a copy of the element, a copy on which you call removeAllChildren() before extraction?

    Cheers,
    _
    So I tried the following code:

    Qt Code:
    void MainWindow::traverseElements(const QWebElement& parentElement)
    {
        QWebElement element = parentElement.firstChild();
        while (!element.isNull()) {
            // Work on a clone so the live document stays untouched.
            QWebElement copy = element.clone();
            copy.removeAllChildren();
            const QString text = copy.toPlainText();

            // Escape newlines on a separate copy: QString::replace() modifies
            // the string in place, which would leave nothing to split on below.
            QString escaped = text;
            escaped.replace('\n', "\\n");
            qDebug() << copy.tagName() << copy.attribute("class") << escaped;

            const QStringList sentences = text.split('\n');
            for (const QString& sentence : sentences) {
                // Skip empty and whitespace-only lines.
                if (sentence.trimmed().isEmpty()) {
                    continue;
                }
                ui->listWidget->addItem(sentence);
            }

            traverseElements(element);
            element = element.nextSibling();
        }
    }

    And got this output:

    Qt Code:
    "HEAD" "" ""
    "BODY" "" ""
    "DIV" "one" ""
    "DIV" "two" ""
    "DIV" "three" ""

    It seems that removeAllChildren() removes the text nodes as well, so there is nothing left to extract.

    So far I've tried QWebView and QDomDocument, but neither gives the correct output. I've also tried QXmlStreamReader, pugixml, and rapidxml, but they all fail because the HTML documents have some tags that aren't closed properly. I'm thinking of writing a simple parser myself now.
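
A hand-written extractor of the kind mentioned above could start out roughly like the sketch below. It is only an illustration in plain standard C++, not code from this thread: the function name `extractTextRuns` is hypothetical, and it assumes that skipping everything between `<` and `>` is good enough. Because tags are never matched against each other, improperly closed elements do not matter. It does not decode entities, does not skip `<script>` or `<style>` content or comments, and would be confused by a `>` inside an attribute value.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collect the non-blank text runs found between tags.
// Tags are skipped wholesale and never paired up, so unclosed elements
// cannot break the extraction.
std::vector<std::string> extractTextRuns(const std::string& html)
{
    std::vector<std::string> runs;
    std::string current;
    bool inTag = false;

    // Trim the pending run, store it if anything is left, then reset it.
    auto flush = [&]() {
        std::size_t begin = 0;
        std::size_t end = current.size();
        while (begin < end && std::isspace(static_cast<unsigned char>(current[begin]))) {
            ++begin;
        }
        while (end > begin && std::isspace(static_cast<unsigned char>(current[end - 1]))) {
            --end;
        }
        if (end > begin) {
            runs.push_back(current.substr(begin, end - begin));
        }
        current.clear();
    };

    for (char c : html) {
        if (inTag) {
            if (c == '>') {
                inTag = false;   // end of the tag, back to text
            }
        } else if (c == '<') {
            flush();             // a tag starts: the current text run ends here
            inTag = true;
        } else if (c == '\n') {
            flush();             // treat each line as its own run
        } else {
            current += c;
        }
    }
    flush();                     // text after the last tag, if any
    return runs;
}
```

Each returned string could then be passed to `ui->listWidget->addItem()` after conversion with `QString::fromStdString()`. Splitting on newlines mirrors the `split('\n')` step in the code posted earlier.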
