Parsing html via QT after update



dram
13th October 2014, 12:34
Hello. I was using this code with Qt 5.3.1 (32-bit),
but now I have Qt 5.3.2 (64-bit):

std::string html = std::move(output.buffer); // html from curl - all ok
QWebPage * tmp_webpage = new QWebPage();
tmp_webpage->mainFrame()->setHtml(QString::fromStdString(html));
std::fstream test_stream;
test_stream.open("example14.html", std::ios::out | std::ios::in);
test_stream << tmp_webpage->mainFrame()->toHtml().toStdString(); // the HTML comes out cut off at about 50%
test_stream.close();

QWebFrame * tmp_frame = tmp_webpage->mainFrame();
QWebElement mainTable_site = tmp_frame->findFirstElement(QString::fromStdString(mainTable_selector)); // not found, because Qt cut off my correct HTML



If necessary, I can share my HTML (but trust me, it is fine). After updating Qt to version 5.3.2 (64-bit), something went wrong.

Best regards

wysota
13th October 2014, 13:03
What if you convert to something that can accept non-ASCII input instead of std::string, which will cut the text at the first occurrence of a null character? E.g. stay with QString?
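
A note on the std::string point, as a minimal sketch and not a diagnosis of this thread's bug: a std::string can itself hold embedded null characters and arbitrary non-ASCII bytes. Truncation at the first '\0' typically happens when the data passes through a plain const char* (a C string) somewhere along the way. The helper names fromCString, fromBuffer and runNulDemo below are hypothetical and only illustrate the difference; whether output.buffer ever goes through such a conversion is not shown in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: treats the buffer as a C string,
// so copying stops at the first '\0'.
std::string fromCString(const char* data) {
    return std::string(data);
}

// Hypothetical helper: takes the buffer plus an explicit length,
// so embedded '\0' bytes are preserved.
std::string fromBuffer(const char* data, std::size_t size) {
    return std::string(data, size);
}

// Compares both conversions on a buffer with a NUL in the middle.
void runNulDemo() {
    const char raw[] = "<p>a</p>\0<p>b</p>";      // 17 payload bytes plus the terminator
    const std::size_t payload = sizeof(raw) - 1;  // drop the trailing terminator

    assert(fromCString(raw).size() == 8);           // only "<p>a</p>" survives
    assert(fromBuffer(raw, payload).size() == 17);  // the whole payload survives
}
```

If the download buffer is a char array with a known length, passing that length along (as in fromBuffer) is one way to rule this kind of truncation out before looking at QWebFrame.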

dram
13th October 2014, 23:29
Hmm, like this?


QString html = std::move(output.buffer);
QWebPage * tmp_webpage = new QWebPage();
tmp_webpage->mainFrame()->setHtml(html);
std::fstream test_stream;
test_stream.open("example14.html", std::ios::out | std::ios::in);
test_stream << tmp_webpage->mainFrame()->toHtml().toStdString(); //only toHtml is not enough
test_stream.close();


This returns the same result ^^

But this:



QString html = std::move(output.buffer);
QWebPage * tmp_webpage = new QWebPage();
tmp_webpage->mainFrame()->setHtml(html);
std::fstream test_stream;
test_stream.open("example14.html", std::ios::out | std::ios::in);
test_stream << html.toStdString();// tmp_webpage->mainFrame()->toHtml().toStdString();



writes the correct HTML (to test_stream)...

The HTML is cut after
</script>, and the appended code is
</div></div></body></html>, whereas in the original HTML, what follows </script> is
<table "class">....


And at the end:



std::cout << html.size() << " vs " << tmp_webpage->mainFrame()->toHtml().length();


Prints: 13876 vs 8509... Where is the rest of my HTML code?

wysota
14th October 2014, 07:00
Start by replacing fstream with QFile (and possibly QTextStream if you want to stream the text in) and do not convert via std::string.
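
Independent of the QFile advice, one detail of the original fstream call may be worth ruling out for the test dump: opening with std::ios::in | std::ios::out neither creates a missing file nor truncates an existing one. The dump can therefore fail silently, or keep stale bytes from an earlier, longer run. The sketch below uses only the standard library; writeDump, readDump, opensWithInOut and runDumpDemo are hypothetical helper names and the scratch file name is arbitrary. It does not by itself explain the shorter toHtml() result.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: overwrite `path` with exactly `data`.
// trunc discards old contents; binary avoids newline translation.
bool writeDump(const std::string& path, const std::string& data) {
    std::ofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out.is_open()) {
        return false;
    }
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    out.flush();
    return out.good();
}

// Hypothetical helper: read the whole file back as bytes.
std::string readDump(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Reports whether the open mode used in the thread can open `path`.
bool opensWithInOut(const std::string& path) {
    std::fstream f(path, std::ios::in | std::ios::out);
    return f.is_open();
}

// Walks through both behaviours on a scratch file.
void runDumpDemo() {
    const std::string path = "dump_demo_scratch.tmp";
    std::remove(path.c_str());               // start without the file
    assert(!opensWithInOut(path));           // in|out does not create it
    assert(writeDump(path, "0123456789"));
    assert(writeDump(path, "abc"));          // shorter second write
    assert(readDump(path) == "abc");         // no stale tail, thanks to trunc
    assert(opensWithInOut(path));            // works once the file exists
    std::remove(path.c_str());
}
```

Checking is_open() right after opening the stream, whichever class is used, makes it obvious when nothing was written at all.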

However, the real question here is why you are using curl to download an HTML document and then setting it on a browser object that could download the document by itself.

dram
14th October 2014, 09:48
wysota, thanks for the reply.

I'm using curl because I log in to the site and do many other things. I thought curl was the best way.
And I set the HTML on the browser object in order to parse it.

OK, let's see. Here is the code with QFile:



QByteArray html = std::move(output.buffer);
QByteArray html_second;

QWebPage tmp_webpage;// = new QWebPage();
//tmp_webpage->mainFrame()->setHtml(html); // the result is the same with setContent / setHtml
tmp_webpage.mainFrame()->setContent(html);

html_second.append(tmp_webpage.mainFrame()->toHtml());
QFile test_stream("example14.html");
test_stream.open(QIODevice::ReadWrite | QIODevice::Truncate);
test_stream.write(html_second);
test_stream.close();


Still not solved.

Something is wrong with setHtml/setContent.


QByteArray html = std::move(output.buffer);
test_stream.write(html);

Here the variable "html" holds the correct HTML; the code only changes after setContent() and toHtml().

wysota
14th October 2014, 10:52
Does the page contain the proper content when you view it in QWebView? Or is it already truncated?

dram
14th October 2014, 11:10
I don't think so, because the plain text is cut too.

I don't display a QWebView in my program.

All of these operations are done in the background.

wysota
14th October 2014, 11:25
"I don't display a QWebView in my program."
So do it just for the test.

dram
14th October 2014, 12:04
I don't understand why I have to do that.

Let's see. In the original HTML code I have


<table id="production_table" >

But after setHtml, this and everything after it is cut.

So


std::string mainTable_selector = "table[id=\"production_table\"]";
QWebElement mainTable_site = tmp_frame->findFirstElement(QString::fromStdString(mainTable_selector));

The result is NULL, so this element is not there.

But remember, my code was working on 32-bit with version 5.3.1...

wysota
14th October 2014, 14:23
I doubt the architecture has anything to do with this. What happens if you save the downloaded content in a file and then point QWebPage directly to that file?

Does this work?


QByteArray html = output.buffer;
QFile file("file.html");
file.open(QIODevice::WriteOnly | QIODevice::Truncate);
file.write(html);
file.close();
QWebPage page;
connect(page.mainFrame(), &QWebFrame::loadFinished, [&]() { qDebug() << page.mainFrame()->findAll("#production_table").count(); }); // CONFIG+=c++11 for lambda to work
page.mainFrame()->setUrl(QUrl::fromLocalFile("file.html"));
QEventLoop loop;
loop.exec();

dram
14th October 2014, 16:28
But everything broke when I changed the Qt version (5.3.1 -> 5.3.2).

I changed


connect(page.mainFrame(), &QWebFrame::loadFinished, [&]() { qDebug() << page.mainFrame()->findAll("#production_table").count(); }); // CONFIG+=c++11 for lambda to work
to


connect(page.mainFrame(), &QWebFrame::loadFinished, [&]() { qDebug() << page.mainFrame()->findAllElements("#production_table").count(); } ); // CONFIG+=c++11 for lambda to work


because there is no findAll method in QWebFrame.

In the console I got '0'.

But the thread hangs on loop.exec();

wysota
14th October 2014, 16:37
Does file.html contain the expected content or is it truncated?

dram
14th October 2014, 16:42
It contains the expected content, wysota.

Maybe QWebFrame from 5.3.1 (but on 64-bit) will fix my problem?

Could you tell me where I could find the 5.3.1 (64-bit) version?

I think I should take only QWebFrame from version 5.3.1?

Did the Qt team change some code in QWebFrame in the update from 5.3.1 to 5.3.2?

wysota
14th October 2014, 21:31
Could you attach the file here? Just use the exact same file saved with QFile, do not copy and paste the content.

dram
14th October 2014, 22:08
https://www.dropbox.com/s/1ivmgv8fyq0yr1o/file.html?dl=0

(By the way, where can I find Qt 5.3.1 Windows 64 msvc 2013 x64bit? I found Windows 32bit msvc2013 64bit, but where is Windows 64 bit?)

dram
15th October 2014, 09:20
(I can't edit) I think setContent/setHtml is failing. After the update to 5.3.2 (64-bit) my code crashed...

Remember, before the update everything was working.

Now I have a strange problem. Maybe someone from Qt support can answer this question: why is the HTML code truncated after the setHtml function?

wysota
15th October 2014, 11:01
"(I can't edit) I think setContent/setHtml is failing. After the update to 5.3.2 (64-bit) my code crashed..."

I really doubt this is caused by setHtml.

This is a test app where your page seems to work fine:


#include <QtWidgets>
#include <QWebView>
#include <QWebPage>
#include <QWebInspector>
#include <QWebSettings>

int main(int argc, char **argv) {
    QApplication app(argc, argv);
    QWebView view;
    QWebInspector inspector;
    QWebPage page;
    page.settings()->setAttribute(QWebSettings::DeveloperExtrasEnabled, true);
    view.setPage(&page);
    view.setUrl(QUrl::fromLocalFile("/path/to/file.html"));
    // view.setUrl(QUrl("http://www.google.com"));
    inspector.setPage(&page);
    inspector.show();
    view.show();
    return app.exec();
}

It requires /path/to/file.html to point to the file downloaded with curl.

I tested it with Qt 5.3.2 on a 64-bit system (Linux, but that shouldn't matter).

dram
15th October 2014, 23:36
But how do I find

QWebElement mainTable_site = tmp_frame->findFirstElement(QString::fromStdString(mainTable_selector));

in our code?

Is there a meaningful answer as to why my code doesn't work after the update?

wysota
16th October 2014, 08:12
Right now I was trying to verify whether setHtml is broken or not. It seems it is not, so you have to look for the problem in your code and not in Qt. So far I have not seen any complete piece of your code, so it is hard for me to test it. It might help if you prepared a minimal compilable example reproducing the problem, similar to what I did. I can only give you a hint: I would wait until the page is fully loaded before trying to access its contents.