
Originally Posted by
anda_skoa
What if you extract the text from a copy of the element, a copy on which you call removeAllChildren() before extraction?
Cheers,
_
So I tried the following code:
void MainWindow::traverseElements(const QWebElement& parentElement)
{
    // Matches lines that are empty or contain only whitespace.
    QRegExp ws("^\\s*$");
    QWebElement element = parentElement.firstChild();
    while (!element.isNull()) {
        // Work on a copy so the original document is left untouched.
        QWebElement copy = element.clone();
        copy.removeAllChildren();
        QString text = copy.toPlainText();
        // QString::replace() modifies the string in place, so escape the
        // newlines on a temporary copy and keep 'text' intact for split().
        qDebug() << copy.tagName() << copy.attribute("class") << QString(text).replace('\n', "\\n");
        QStringList sentences = text.split('\n');
        for (const QString& sentence : sentences) {
            if (ws.exactMatch(sentence)) {
                continue;
            }
            ui->listWidget->addItem(sentence);
        }
        traverseElements(element);
        element = element.nextSibling();
    }
}
And got this output:
"HEAD" "" ""
"BODY" "" ""
"DIV" "one" ""
"DIV" "two" ""
"DIV" "three" ""
It seems removeAllChildren() also removes the text nodes, presumably because they are children of the element too.
So far I've tried QWebView and QDomDocument, but neither gives the correct output. I've also tried QXmlStreamReader, pugixml, and rapidxml, but they all fail because the HTML documents have some tags that aren't closed properly. I'm thinking of writing a simple parser myself now.
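For what it's worth, here is a rough sketch of what such a simple parser might look like in plain C++, with no Qt involved. It only tries to attribute each run of text to the innermost open tag, and it tolerates unclosed tags by popping back to the matching opener when a closing tag arrives. The names TextRun, trim and extractOwnText are made up for this example. It makes no attempt to handle void elements such as <br>, comments that contain '>', script/style content, or entities, so treat it as a starting point only.

```cpp
#include <cctype>
#include <string>
#include <vector>

// One run of text plus the tag of the element that directly contains it.
struct TextRun {
    std::string tag;   // upper-cased tag name, empty if outside any element
    std::string text;  // text with leading/trailing whitespace removed
};

// Hypothetical helper: strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    std::size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// Hypothetical helper: collect each element's own text, tolerating unclosed tags.
std::vector<TextRun> extractOwnText(const std::string& html) {
    std::vector<std::string> open;  // stack of currently open tags
    std::vector<TextRun> runs;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] != '<') {
            // Text up to the next tag belongs to the innermost open element.
            std::size_t lt = html.find('<', pos);
            if (lt == std::string::npos) lt = html.size();
            std::string text = trim(html.substr(pos, lt - pos));
            if (!text.empty()) {
                std::string tag = open.empty() ? std::string() : open.back();
                runs.push_back(TextRun{tag, text});
            }
            pos = lt;
            continue;
        }
        std::size_t gt = html.find('>', pos);
        if (gt == std::string::npos) break;  // truncated tag: stop scanning
        std::string inner = html.substr(pos + 1, gt - pos - 1);
        pos = gt + 1;
        // Skip empty tags, comments/doctype and processing instructions.
        if (inner.empty() || inner[0] == '!' || inner[0] == '?') continue;
        bool closing = inner[0] == '/';
        std::size_t i = closing ? 1 : 0;
        std::string name;
        while (i < inner.size() && std::isalnum(static_cast<unsigned char>(inner[i]))) {
            name += static_cast<char>(std::toupper(static_cast<unsigned char>(inner[i])));
            ++i;
        }
        if (name.empty()) continue;
        if (!closing) {
            // Self-closing tags like <x/> never become the current element.
            if (inner.back() != '/') open.push_back(name);
            continue;
        }
        // Closing tag: pop back to the matching opener, dropping any unclosed
        // tags above it; a stray closer with no opener is ignored.
        for (std::size_t k = open.size(); k-- > 0;) {
            if (open[k] == name) {
                open.resize(k);
                break;
            }
        }
    }
    return runs;
}
```

Calling extractOwnText() on something like <div class="one">Hello<div class="two">World<p>unclosed</div>Tail</div> should attribute "unclosed" to P and "Tail" to the outer DIV, even though the <p> is never closed.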