Extracting text from HTML documents

themagician

24th June 2015, 23:30

EDIT: Problem has been solved thanks to QXmlStreamReader.
EDIT2: QXmlStreamReader doesn't work for all the documents so the problem is still open.
EDIT3: QXmlStreamReader and pugixml don't work because the HTML documents aren't well-formed: some closing tags are missing.

I have this function that outputs the text elements from an HTML document with QWebView:

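void MainWindow::traverseElements(const QWebElement& parentElement)
{
    // Matches lines that consist only of whitespace.
    QRegExp ws("^\\s+$");
    QWebElement element = parentElement.firstChild();
    while (!element.isNull()) {
        // toPlainText() returns the text of the element including its descendants.
        QString text = element.toPlainText();
        // Caution: QString::replace() modifies text in place, so the split
        // below sees a literal backslash followed by 'n', not real newlines.
        qDebug() << element.tagName() << element.attribute("class") << text.replace('\n', "\\n");
        QStringList sentences = text.split('\n');
        for (const QString& sentence : sentences) {
            if (ws.exactMatch(sentence) || sentence.isEmpty()) {
                continue;
            }

            ui->listWidget->addItem(sentence);
        }

        // Recurse into the children, then move on to the next sibling.
        traverseElements(element);
        element = element.nextSibling();
    }
}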

I call it like this (view is a QWebView):

QWebFrame *frame = view->page()->mainFrame();
QWebElement element = frame->documentElement();
traverseElements(element);

The problem is, given this HTML document:

<div class="one"><div class="two">foo<div class="three">hello</div>bar</div></div>

It outputs:

"HEAD" "" ""
"BODY" "" "foo\nhello\nbar"
"DIV" "one" "foo\nhello\nbar"
"DIV" "two" "foo\nhello\nbar"
"DIV" "three" "hello"

So the call to ui->listWidget->addItem(sentence) adds "foo\nhello\nbar" three times and "hello" once.

What I want is for ui->listWidget to end up with exactly three items: "foo", "hello", and "bar", in the order they appear on the web page.

Note that I can't just take the text output from the BODY element: I need to store EVERY element, its original text, and a translation, so that I can save the exact same HTML document with the text replaced by the translations.

Thanks.

anda_skoa

25th June 2015, 09:53

What if you extract the text from a copy of the element, a copy on which you call removeAllChildren() before extraction?

Cheers,
_

themagician

25th June 2015, 21:35

So I tried the following code:

void MainWindow::traverseElements(const QWebElement& parentElement)
{
    QRegExp ws("^\\s*$");
    QWebElement element = parentElement.firstChild();
    while (!element.isNull()) {
        QWebElement copy = element.clone();
        copy.removeAllChildren();
        QString text = copy.toPlainText();
        qDebug() << copy.tagName() << copy.attribute("class") << text.replace('\n', "\\n");
        QStringList sentences = text.split('\n');
        for (const QString& sentence : sentences) {
            if (ws.exactMatch(sentence)) {
                continue;
            }

            ui->listWidget->addItem(sentence);
        }

        traverseElements(element);
        element = element.nextSibling();
    }
}

And got this output:

"HEAD" "" ""
"BODY" "" ""
"DIV" "one" ""
"DIV" "two" ""
"DIV" "three" ""

It seems that removeAllChildren() removes the text nodes as well, not just the child elements.
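
A likely explanation for the empty output: in the DOM, an element's text is stored in text nodes, and those text nodes are themselves children of the element, so removing all children takes the text with them. As far as I can tell, QWebElement only walks elements and gives no direct access to text nodes, so the following is a toy model in plain C++ rather than Qt code. Node, makeText, makeElement, collectText, and sampleTexts are hypothetical names made up for this sketch. It shows the traversal being asked for: visit every child in document order and record each text node exactly once.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy DOM node (not a Qt type): an element has a tag, a text node has text.
struct Node {
    std::string tag;                              // empty for text nodes
    std::string text;                             // empty for elements
    std::vector<std::unique_ptr<Node>> children;  // kept in document order
};

std::unique_ptr<Node> makeText(const std::string& s) {
    auto n = std::make_unique<Node>();
    n->text = s;
    return n;
}

std::unique_ptr<Node> makeElement(const std::string& tag) {
    auto n = std::make_unique<Node>();
    n->tag = tag;
    return n;
}

// Depth-first walk: a text node is recorded once, by its direct parent only;
// an element is descended into instead of being flattened to plain text.
void collectText(const Node& node, std::vector<std::string>& out) {
    for (const auto& child : node.children) {
        if (child->tag.empty()) {
            out.push_back(child->text);
        } else {
            collectText(*child, out);
        }
    }
}

// Builds the sample document from the first post and collects its text nodes.
std::vector<std::string> sampleTexts() {
    auto three = makeElement("div");
    three->children.push_back(makeText("hello"));

    auto two = makeElement("div");
    two->children.push_back(makeText("foo"));
    two->children.push_back(std::move(three));
    two->children.push_back(makeText("bar"));

    auto one = makeElement("div");
    one->children.push_back(std::move(two));

    std::vector<std::string> out;
    collectText(*one, out);
    return out;
}
```

Calling sampleTexts() on this model yields "foo", "hello", "bar" in that order. Clearing a node's children vector would drop its text nodes along with its child elements, which matches the empty strings seen above.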

So far I've tried QWebView and QDomDocument, but neither gives the correct output. I've also tried QXmlStreamReader, pugixml, and rapidxml, but they all fail because the HTML documents have some tags that aren't closed properly. I'm thinking of writing a simple parser myself now.
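
For the "simple parser" idea, here is a minimal sketch in standard C++, assuming the documents contain no script or style blocks, no comments, and no '>' inside attribute values. It does not build a tree at all; it only splits the input into tags and text runs, so a missing closing tag cannot make it fail. Token, tokenize, isBlank, extractTexts, and rebuild are hypothetical names for this sketch, and character entities such as &amp; are left untouched.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One piece of the input: either a tag such as <div class="one"> or a text run.
struct Token {
    bool isTag;
    std::string value;
};

// Splits HTML into tags and text runs in document order.
std::vector<Token> tokenize(const std::string& html) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] == '<') {
            std::size_t end = html.find('>', pos);
            if (end == std::string::npos) {
                end = html.size() - 1;  // unterminated tag: take the rest
            }
            tokens.push_back({true, html.substr(pos, end - pos + 1)});
            pos = end + 1;
        } else {
            std::size_t end = html.find('<', pos);
            if (end == std::string::npos) {
                end = html.size();
            }
            tokens.push_back({false, html.substr(pos, end - pos)});
            pos = end;
        }
    }
    return tokens;
}

// True if the string is empty or contains only whitespace.
bool isBlank(const std::string& s) {
    for (unsigned char c : s) {
        if (!std::isspace(c)) return false;
    }
    return true;
}

// Returns the non-blank text runs, in the order they appear in the document.
std::vector<std::string> extractTexts(const std::vector<Token>& tokens) {
    std::vector<std::string> texts;
    for (const Token& t : tokens) {
        if (!t.isTag && !isBlank(t.value)) texts.push_back(t.value);
    }
    return texts;
}

// Re-serializes the tokens, swapping in a translation for each text run that
// has one. Tags and untranslated text are copied through unchanged.
std::string rebuild(const std::vector<Token>& tokens,
                    const std::map<std::string, std::string>& translations) {
    std::string out;
    for (const Token& t : tokens) {
        auto it = t.isTag ? translations.end() : translations.find(t.value);
        out += (it != translations.end()) ? it->second : t.value;
    }
    return out;
}
```

Because every byte of the input ends up in exactly one token, rebuild() with an empty translation map reproduces the original document, and extractTexts() on the sample from the first post gives "foo", "hello", "bar". Keying translations by the text itself is a simplification; keying by token index would let two identical strings be translated differently.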