Is QWebEnginePage::setHtml() synchronous or asynchronous?



ayanda83
10th January 2017, 08:48
Hi there guys, compliments of the season. I'm creating a web crawler that mines data off websites. Here is my problem: I am using QNAM to pull the HTML from a website, and then I create a QWebEnginePage object and set the HTML on the page so that I can use QWebEnginePage::runJavaScript() to extract the data I need. The problem is that QWebEnginePage::runJavaScript() doesn't work if I call it immediately after QWebEnginePage::setHtml().

So I resorted to connecting a lambda to the QWebEnginePage::loadFinished() signal with QObject::connect(). The outcome is that QWebEnginePage::runJavaScript() retrieves data that is typically at the top of the page (e.g. "document.title;" works), but it returns an empty string for data from the mid-section to the end of the HTML document. This gives me the impression that I am trying to access elements that the page has not loaded yet, which is confusing because I am running the lambda on the loadFinished() signal. Please see the code below.
void CPT_Page::getTenderSite()
{
    page_1 = new QWebEnginePage(this);
    page_1->setHtml(tenderPage_reply->readAll());

    QWebEngineView *view_1 = new QWebEngineView();
    view_1->setPage(page_1);
    view_1->show();

    QObject::connect(page_1, &QWebEnginePage::loadFinished, [&]() {
        page_1->runJavaScript("document.getElementById(\"rfqsTable\").innerHTML;",
                              [&](const QVariant &data) {
                                  qDebug() << data.toString() << endl;
                              });
    });
}

ayanda83
10th January 2017, 12:10
Anybody...:(

anda_skoa
10th January 2017, 13:40
setHtml is definitely asynchronous, as the WebKit/Blink architecture uses a separate process for the web content processing.

loadFinished() should however be the indicator of when the web content has been loaded.

Maybe the web content executes scripts as well which change the DOM tree after load?

Cheers,
_

ayanda83
10th January 2017, 13:53
anda_skoa wrote: "Maybe the web content executes scripts as well which change the DOM tree after load?"

Thank you for your reply. I suspected that too, because the data I am trying to retrieve is in a jQuery data table. Now, is there a way to disable the page's scripts and still be able to run my own script on the page? I've read about QWebEngineScript::ScriptWorldId in the documentation, but I am not sure whether that would be my solution.

ayanda83
10th January 2017, 16:26
The line of code below disables the JavaScript that comes with the page, but that unfortunately means I cannot run my own JavaScript on the same page. :(
page_1->settings()->setAttribute(QWebEngineSettings::JavascriptEnabled, false);

anda_skoa
11th January 2017, 10:28
Are you sure the table is even there if you disable the script?
I.e. isn't jQuery creating it?

Maybe there is some form of script event in the web domain that you can use to trigger your script.

Cheers,
_

ayanda83
11th January 2017, 11:48
anda_skoa wrote: "Are you sure the table is even there if you disable the script? I.e. isn't jQuery creating it?"

Because I am using QNAM to get the HTML, I get all the table data and all the page scripts. When I disable the scripts, the table is still there as plain HTML, but because scripts are disabled I am unable to run my own script. If I enable the scripts, something in jQuery interferes with my own script, so I cannot retrieve the data.

The url below is the site I am trying to get the data from (i.e. the data inside the table).
http://web1.capetown.gov.za/web1/TenderPortal/Tender/
I've been exploring a different approach where I strip all in-page scripts from the HTML (using a regular expression) before I set the HTML on the QWebEnginePage. The idea is to match everything between the script tags and remove it with QString::remove(), but I am still struggling to get the right regular expression. See the code below.
    QString html = (QString)tenderPage_reply->readAll();
    QString temp_html = html.remove(QRegExp("\\b<script.*</script>\b"));

    qDebug() << temp_html << endl;

    QFile linksFile(QDir::currentPath().append("/Include/Program_Files/tenderMainPage2.txt"));

    if (!linksFile.open(QFile::WriteOnly | QFile::Text))
    {
        QMessageBox msgBox_2;
        msgBox_2.setText("Tender Main File did not open...");
        msgBox_2.exec();
        return;
    }

    QTextStream out(&linksFile);
    out << temp_html << endl;

    page_1 = new QWebEnginePage(this);
    //page_1->settings()->setAttribute(QWebEngineSettings::JavascriptEnabled, true);

    page_1->setHtml(temp_html);

    QWebEngineView *view_1 = new QWebEngineView();

    view_1->setPage(page_1);
    view_1->show();

    QObject::connect(page_1, &QWebEnginePage::loadFinished, [&]() {
        page_1->runJavaScript("document.getElementById(\"rfqsTable\").innerHTML;",
                              QWebEngineScript::MainWorld,
                              [&](const QVariant &data) {
                                  qDebug() << data.toString() << endl;
                              });
    });
}
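
A minimal sketch of the script-stripping idea, written with the standard library's std::regex rather than QRegExp so that it stands alone. The helper name stripScripts is hypothetical. Three things in the pattern above are likely to get in the way: the trailing "\b" has a single backslash, so C++ turns it into a backspace character; a word boundary in front of "<" only matches when the preceding character is a letter, digit, or underscore; and ".*" is greedy, so it can run from the first opening script tag to the last closing one. The sketch uses a lazy match instead. A regular expression is not an HTML parser, so this is an approximation that assumes the script tags are well-formed.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): removes every
// <script ...>...</script> element from an HTML string.
std::string stripScripts(const std::string &html)
{
    // <script\b[^>]*>  opening tag with optional attributes
    // [\s\S]*?         any characters, newlines included, as few as possible,
    //                  so each script element is matched on its own
    // </script\s*>     closing tag
    static const std::regex scriptElement(
        R"(<script\b[^>]*>[\s\S]*?</script\s*>)",
        std::regex::ECMAScript | std::regex::icase);

    return std::regex_replace(html, scriptElement, "");
}
```

With Qt strings, one way to use it would be to pass html.toStdString() in and convert the result back with QString::fromStdString() before calling setHtml().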

anda_skoa
12th January 2017, 09:42
If the content you are passing to setHtml already contains all the data, one thing you could try is to replace all "http/https" occurrences with a URL scheme that the web engine simply can't load.

Cheers,
_
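
The scheme-replacement suggestion could be sketched roughly as follows, again in plain C++ with std::regex so the snippet is self-contained. The function name neutralizeUrls and the scheme name "blocked" are made up for illustration; any scheme the web engine has no handler for should have the same effect. Note the assumptions: this only rewrites absolute http:// and https:// URLs, so inline scripts and relative or protocol-relative references are left untouched, and it also rewrites ordinary links in the page.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites "http://" and "https://" to a made-up
// scheme so that externally hosted resources (for example a jQuery file)
// cannot be fetched when the HTML is loaded.
std::string neutralizeUrls(const std::string &html)
{
    // Matches "http://" or "https://" in any letter case.
    static const std::regex scheme(
        R"(https?://)",
        std::regex::ECMAScript | std::regex::icase);

    // "blocked" is an arbitrary, unregistered scheme name (an assumption).
    return std::regex_replace(html, scheme, "blocked://");
}
```

The table markup itself contains no URLs that matter for extraction, so document.getElementById("rfqsTable") should still find the same element after this rewrite, provided the table is present in the downloaded HTML.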