Is QWebEnginePage::setHtml() synchronous or Asynchronous?
Hi there guys, compliments of the season. I'm writing a web crawler that mines data from websites. I use QNAM to fetch the HTML of a page, then create a QWebEnginePage object and set that HTML on it so that I can use QWebEnginePage::runJavaScript() to extract the data I need. The problem is that runJavaScript() doesn't work if I call it immediately after QWebEnginePage::setHtml(). So I resorted to calling it from a lambda connected to the QWebEnginePage::loadFinished() signal. With that, runJavaScript() retrieves data that is near the top of the page (e.g. "document.title;" works), but it returns an empty string for data from the middle to the end of the HTML document. This gives me the impression that I am trying to access elements that the page has not loaded yet, which is confusing because the lambda runs on the loadFinished() signal. Please see the code below.
Code:
void CPT_Page::getTenderSite()
{
    page_1 = new QWebEnginePage(this);
    page_1->setHtml(tenderPage_reply->readAll());
    QWebEngineView *view_1 = new QWebEngineView();
    view_1->setPage(page_1);
    view_1->show();
    // Run the extraction script once the page reports that loading has finished.
    QObject::connect(page_1, &QWebEnginePage::loadFinished, [&]() {
        // runJavaScript() is asynchronous; the result is delivered to the callback.
        page_1->runJavaScript("document.getElementById(\"rfqsTable\").innerHTML;",
                              [&](const QVariant &data) {
                                  qDebug() << data.toString() << endl;
                              });
    });
}
Re: Is QWebEnginePage::setHtml() synchronous or Asynchronous?
setHtml() is definitely asynchronous, as the Chromium/Blink architecture behind Qt WebEngine uses a separate process for web content processing.
loadFinished() should however be the indicator of when the web content has been loaded.
Maybe the web content executes scripts as well which change the DOM tree after load?
Cheers,
_
Re: Is QWebEnginePage::setHtml() synchronous or Asynchronous?
Quote:
Originally Posted by
anda_skoa
Maybe the web content executes scripts as well which change the DOM tree after load?
_
Thank you for your reply. I suspected that too, because the data I am trying to retrieve is in a jQuery data table. Now, is there a way to disable the page's scripts and still be able to run my own script on the page? I've read about QWebEngineScript::ScriptWorldId in the documentation, but I am not sure whether that would be my solution.
Re: Is QWebEnginePage::setHtml() synchronous or Asynchronous?
The line of code below disables the JavaScript that comes with the page, but unfortunately that also means I cannot run my own JavaScript on the same page.
Code:
page_1->settings()->setAttribute(QWebEngineSettings::JavascriptEnabled, false);
:(
Re: Is QWebEnginePage::setHtml() synchronous or Asynchronous?
Are you sure the table is even there if you disable the script?
I.e. isn't jQuery creating it?
Maybe there is some form of script event in the web domain that you can use to trigger your script.
Cheers,
_
Re: Is QWebEnginePage::setHtml() synchronous or Asynchronous?
Quote:
Originally Posted by
anda_skoa
Are you sure the table is even there if you disable the script?
I.e. isn't jQuery creating it?
_
Because I am using QNAM to get the HTML, I get all the table data and all the page scripts. When I disable the scripts, the table is still there in plain HTML, but because scripts are disabled I am unable to run my own script. If I enable the scripts, something in jQuery interferes with my own script, so I cannot retrieve the data.
The url below is the site I am trying to get the data from (i.e. the data inside the table).
http://web1.capetown.gov.za/web1/TenderPortal/Tender/
I've been exploring a different approach where I strip all in-page scripts from the HTML before I set it on the QWebEnginePage. The idea is to use a regular expression that matches everything between script tags and pass it to QString::remove(). I am still struggling to get the right regular expression, though. See code below.
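For reference, here is a minimal sketch of one pattern that could match whole script elements. It is written with std::regex so that it compiles without Qt, and stripScripts() is a made-up helper name, not something from the code in this thread. The key part is `[\s\S]*?`, which matches across newlines and stops at the first closing tag. The same pattern should be usable with QRegularExpression (with QRegularExpression::CaseInsensitiveOption) and QString::remove(), but that variant is an untested assumption here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not from the thread): removes every
// <script ...>...</script> element, including its contents, from an HTML string.
std::string stripScripts(const std::string &html)
{
    // <script\b[^>]*>  opening tag with optional attributes
    // [\s\S]*?         any characters, including newlines, as few as possible,
    //                  so each match ends at the first closing tag
    // </script\s*>     closing tag
    static const std::regex scriptElement(
        R"(<script\b[^>]*>[\s\S]*?</script\s*>)",
        std::regex::ECMAScript | std::regex::icase);
    return std::regex_replace(html, scriptElement, "");
}
```

Calling stripScripts() on the downloaded HTML before passing the result to setHtml() would be the intended use. A regular expression is only an approximation of HTML parsing, so unusual markup (for example the text "</script>" inside a string literal) may still trip it up.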
Code:
qDebug() << temp_html << endl;
QFile linksFile(QDir::currentPath().append("/Include/Program_Files/tenderMainPage2.txt"));
if (!linksFile.open(QFile::WriteOnly | QFile::Text)) {
    msgBox_2.setText("Tender Main File did not open...");
    msgBox_2.exec();
    return;
}
out << temp_html << endl;
page_1 = new QWebEnginePage(this);
//page_1->settings()->setAttribute(QWebEngineSettings::JavascriptEnabled, true);
page_1->setHtml(temp_html);
QWebEngineView *view_1 = new QWebEngineView();
view_1->setPage(page_1);
view_1->show();
QObject::connect(page_1, &QWebEnginePage::loadFinished, [&]() {
    page_1->runJavaScript("document.getElementById(\"rfqsTable\").innerHTML;",
                          QWebEngineScript::MainWorld,
                          [&](const QVariant &data) {
                              qDebug() << data.toString() << endl;
                          });
});
}
Re: Is QWebEnginePage::setHtml() synchronous or Asynchronous?
If the content you are passing to setHtml() already contains all the data, one thing you could try is to replace all "http/https" occurrences with a URL scheme that the web engine simply can't load.
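A rough sketch of that idea, in plain C++ so it can be tried without Qt. The helper names and the "blocked" scheme are invented for illustration; any scheme for which the engine has no handler should behave the same way. In Qt code, QString::replace() on the downloaded HTML would do the same job.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replaces every occurrence of `from` in `text` with `to`.
std::string replaceAll(std::string text, const std::string &from, const std::string &to)
{
    if (from.empty())
        return text;
    std::size_t pos = text.find(from);
    while (pos != std::string::npos) {
        text.replace(pos, from.size(), to);
        // Continue searching after the inserted text so it is never rescanned.
        pos = text.find(from, pos + to.size());
    }
    return text;
}

// Hypothetical helper: rewrites absolute http/https URLs to a made-up scheme
// ("blocked") so that external scripts and other resources cannot be fetched.
std::string neutralizeUrls(const std::string &html)
{
    const std::string withoutHttps = replaceAll(html, "https://", "blocked://");
    return replaceAll(withoutHttps, "http://", "blocked://");
}
```

Note that this only touches absolute URLs and does nothing about inline scripts, so whether it is enough depends on how the page loads jQuery.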
Cheers,
_