Extracting text from HTML documents

**anda_skoa** · 25th June 2015, 09:53

What if you extract the text from a copy of the element, a copy on which you call removeAllChildren() before extraction?

Cheers,
_

**themagician** · 25th June 2015, 21:35

Originally Posted by anda_skoa

What if you extract the text from a copy of the element, a copy on which you call removeAllChildren() before extraction?

Cheers,
_

So I tried the following code:

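void MainWindow::traverseElements(const QWebElement& parentElement)
{
    // Matches lines that are empty or contain only whitespace.
    QRegExp ws("^\\s*$");
    QWebElement element = parentElement.firstChild();
    while (!element.isNull()) {
        // Work on a copy so the document itself is left untouched.
        QWebElement copy = element.clone();
        copy.removeAllChildren();
        QString text = copy.toPlainText();
        // Escape newlines on a temporary: QString::replace() modifies the string
        // in place, which would leave nothing for split('\n') below to split on.
        qDebug() << copy.tagName() << copy.attribute("class") << QString(text).replace('\n', "\\n");
        QStringList sentences = text.split('\n');
        for (const QString& sentence : sentences) {
            if (ws.exactMatch(sentence)) {
                continue;
            }

            ui->listWidget->addItem(sentence);
        }

        // Recurse into the original element, which still has its children.
        traverseElements(element);
        element = element.nextSibling();
    }
}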

And got this output:

"HEAD" "" ""
"BODY" "" ""
"DIV" "one" ""
"DIV" "two" ""
"DIV" "three" ""

It seems that removeAllChildren() removes the text nodes as well as the child elements, so there is no text left to extract.

So far I've tried QWebView and QDomDocument, but neither gives the correct output. I've also tried QXmlStreamReader, pugixml, and rapidxml, but they all fail because the HTML documents have some tags that aren't closed properly. I'm thinking of writing a simple parser myself now.
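
For reference, here is a rough sketch of what such a simple parser might look like in plain C++, with no Qt involved. It is only an illustration under several assumptions: the names `TextRun`, `extractTextRuns`, `trim`, and `isVoidTag` are made up for this example, entities such as `&amp;` are not decoded, and comments or `<script>`/`<style>` contents containing `>` are not handled. Each run of text is attributed to the innermost tag that is still open, and a closing tag that skips over unclosed tags simply pops them all.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical result type: a piece of text and the innermost open tag around it.
struct TextRun {
    std::string tag;   // upper-cased tag name, empty if the text is outside any tag
    std::string text;  // text with surrounding whitespace trimmed
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    std::size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// A few elements that never have content (deliberately incomplete list).
static bool isVoidTag(const std::string& name) {
    static const std::vector<std::string> voids = {"BR", "HR", "IMG", "INPUT", "LINK", "META"};
    return std::find(voids.begin(), voids.end(), name) != voids.end();
}

std::vector<TextRun> extractTextRuns(const std::string& html) {
    std::vector<TextRun> runs;
    std::vector<std::string> open;  // stack of currently open tag names
    std::string buffer;             // text collected since the last tag

    // Emit the buffered text, if any, under the innermost open tag.
    auto flush = [&]() {
        const std::string text = trim(buffer);
        if (!text.empty())
            runs.push_back({open.empty() ? std::string() : open.back(), text});
        buffer.clear();
    };

    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') { buffer += html[i++]; continue; }
        const std::size_t end = html.find('>', i);
        if (end == std::string::npos) {  // unterminated '<': treat the rest as text
            buffer.append(html, i, std::string::npos);
            break;
        }
        flush();
        std::size_t j = i + 1;
        const bool closing = (j < end && html[j] == '/');
        if (closing) ++j;
        std::string name;
        while (j < end && std::isalnum(static_cast<unsigned char>(html[j])))
            name += static_cast<char>(std::toupper(static_cast<unsigned char>(html[j++])));
        const bool selfClosing = (html[end - 1] == '/');
        if (name.empty()) {
            // Doctype, comment, or stray "<>": ignored in this sketch.
        } else if (closing) {
            // Pop back to the nearest matching open tag; unclosed tags above it
            // are dropped, and a closing tag with no match is ignored.
            auto it = std::find(open.rbegin(), open.rend(), name);
            if (it != open.rend()) open.erase(std::prev(it.base()), open.end());
        } else if (!selfClosing && !isVoidTag(name)) {
            open.push_back(name);
        }
        i = end + 1;
    }
    flush();
    return runs;
}
```

Calling `extractTextRuns("<div>Hello<div>World<p>Unclosed</div>Again</div>")` should attribute "Unclosed" to the `P` that was never closed and "Again" to the outer `DIV`. Carrying the `class` attribute along, as in the debug output above, would mean parsing attributes as well, which this sketch leaves out.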