
View Full Version : retrieve web content from url and parse it



gig-raf
26th March 2015, 20:26
I am thinking about writing a Qt application that will:


1. download an HTML page (store the source in a QString)
2. parse the HTML in the QString and build a QList of URLs (links)
3. download each and every link found, in parallel, in separate threads.


I looked at QNetworkAccessManager and QWebView for doing this, but both seem a bit like overkill.

Do I need to write this asynchronously? I mean, can't I just start the download in a separate thread and, when done, send the result back to the main thread?
- will this cause the main GUI thread to lock up?

What would be the simplest way of doing this?

jefftee
26th March 2015, 20:54
QNetworkAccessManager, QNetworkRequest, and QNetworkReply are all very easy to use in my opinion and that's what I'd recommend that you use for this task.

It is true that the Qt networking classes above are asynchronous by nature, and you can force synchronous behaviour by using a QEventLoop, but that generally prompts the question "Why isn't using signals/slots acceptable?"

Threading is much more complex to do correctly. Most people who wind up implementing threading in Qt cobble something together that works most of the time, but isn't actually implemented correctly. If you insist on forcing synchronous behaviour, the snippet below should get you started:



QEventLoop loop;
QNetworkAccessManager nam;
QNetworkRequest req(QUrl("http://google.com"));
QNetworkReply *reply = nam.get(req);
// Block this function (not the whole GUI) until the reply finishes.
QObject::connect(reply, &QNetworkReply::finished, &loop, &QEventLoop::quit);
loop.exec();
QByteArray buffer = reply->readAll();
reply->deleteLater();
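For the recommended signals/slots route, here is a minimal asynchronous sketch (the Downloader class and pageReady signal are illustrative names of my own, not Qt API):

```cpp
// Minimal asynchronous fetch via signals/slots. Class and member
// names are illustrative; only the QNetwork* classes are real Qt API.
#include <QNetworkAccessManager>
#include <QNetworkReply>
#include <QNetworkRequest>
#include <QObject>
#include <QUrl>

class Downloader : public QObject
{
    Q_OBJECT
public:
    explicit Downloader(QObject *parent = nullptr) : QObject(parent) {}

    void fetch(const QUrl &url)
    {
        QNetworkReply *reply = m_nam.get(QNetworkRequest(url));
        // finished() is delivered through the event loop:
        // no blocking, no extra threads needed.
        connect(reply, &QNetworkReply::finished, this, [this, reply]() {
            if (reply->error() == QNetworkReply::NoError)
                emit pageReady(reply->readAll());
            reply->deleteLater();
        });
    }

signals:
    void pageReady(const QByteArray &html);

private:
    QNetworkAccessManager m_nam;
};
```

Connect pageReady to a slot that parses the HTML and calls fetch() again for each extracted link; the downloads then overlap naturally without any QThread code.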

BeautiCode
26th March 2015, 22:35
You can send a GET request to the web server using QTcpSocket and receive the HTML in return.
Then do your own parsing.
Just open up a socket, connect to the website (example.com),
then send


GET /dir.html HTTP/1.1\r\nHost: example.com\r\n\r\n


Then you'll receive the HTTP header back, along with the HTML.
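If you do go the raw-socket route, the response arrives as headers, a blank line, then the body; a small sketch (plain standard C++, no Qt) of splitting the two:

```cpp
#include <string>

// Split a raw HTTP/1.1 response at the first blank line (CRLF CRLF)
// and return only the body. Returns an empty string if no separator
// is present. (Real responses may also be chunked or compressed;
// this handles only the plain case.)
std::string httpBody(const std::string &response)
{
    const std::string::size_type sep = response.find("\r\n\r\n");
    if (sep == std::string::npos)
        return std::string();
    return response.substr(sep + 4);
}
```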

gig-raf
27th March 2015, 20:24
Thank you all.

I have decided to go for the QNetworkAccessManager with the signals/slots option. I still think it is overkill, but at least I learn by using it :-)

The QTcpSocket was not feasible for what I wanted to do, as not all sites would return a header..


Once again, thanks for taking your time.

ChrisW67
27th March 2015, 20:53
QNetworkAccessManager is almost the least you could do to handle connections and their failures. There is no point reinventing that wheel. If you use this approach then you still need to parse whatever marginally compliant HTML is returned (if indeed the returned content is HTML at all). That is by far the least trivial part of this exercise to do properly from first principles. There is a reason that browsers are beasts.

You can use the very capable QWebPage and QWebFrame::findAllElements() together to do the link extraction work without having to worry about the networking level so much. The example in the QWebPage detailed description shows how to fetch the content and where to parse it.
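A rough sketch of that approach (Qt WebKit; untested here, and the CSS selector is just one way to grab anchor elements):

```cpp
#include <QStringList>
#include <QWebElement>
#include <QWebFrame>
#include <QWebPage>

// Collect the href attribute of every anchor in a loaded page.
// Call this after QWebPage::loadFinished(), so any JavaScript that
// builds links has already run.
QStringList extractLinks(QWebPage &page)
{
    QStringList urls;
    const QWebElementCollection anchors =
        page.mainFrame()->findAllElements("a[href]");
    for (const QWebElement &a : anchors)
        urls << a.attribute("href");
    return urls;
}
```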

BTW: every HTTP server must return a header before the content.

gig-raf
1st April 2015, 22:33
Hi again,

I finally had time to look into this. For the time being I chose to try out QNetworkAccessManager. It works, but I have a problem, which might not be related to Qt at all, but it would be great if someone here could confirm.

Example: using a Firefox browser, I go to Google Images and search for anything. If I right-click and pick "view page source" I see lots of URLs, some looking like these:

....imgurl=http://www.test.de/1.jpg&..... and so on.

If I open Google Images with the same search conditions in Qt via QNetworkAccessManager and print the data of the QByteArray, the source is different. There are no links with the keyword imgurl.


I am not an expert, but I believe the Google Images site is using some JavaScript that hides the links, or that needs to be executed first. Or?

I basically need to trick the web page into believing I am a Firefox browser :-) but this apparently is not easy, at least not with QNetworkAccessManager. Will QWebView be any better at this?

I know this is probably too much to ask, but it would be really nice for my soul to get someone else's opinion on this.

thanks.

jefftee
1st April 2015, 22:53
Have you tried setting the "User-Agent" header to the same value used by your browser? You may need other headers such as "Accept" as well (and possibly others).

Edit: added the hyphen to User-Agent. As an example, here's the User-Agent from my Firefox browser on the Mac:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Firefox/31.0

ttimt
2nd April 2015, 02:09
(quoting gig-raf's post above)
What source are you getting?

gig-raf
2nd April 2015, 19:33
I will try setting the header; I will let you know. I did some tests with the fancybrowser example (QWebView); there the pages are presented correctly, as the JavaScript is executed correctly. So another way would be to use that class instead.

I will let you know what I come up with. I tried to do this in Python some years ago and never succeeded; Google does what they can to make it hard to scrape their images.

But I don't want to give up. I know it is possible and I want to learn how to do it. :-)

gig-raf
2nd April 2015, 21:38
Thank you so much!!!

After setting the User-Agent I got what I wanted!! I am so happy!



QNetworkRequest request;
request.setUrl(QUrl(url));
// Present ourselves as a desktop Firefox so the full page source is served.
request.setRawHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Firefox/31.0");
reply = qnam.get(request);
..
..
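Once the full source comes back, the imgurl values can be pulled out with a regular expression; a minimal sketch in plain standard C++ (the "imgurl" pattern is an assumption about the current page format, not anything Google documents or guarantees):

```cpp
#include <regex>
#include <string>
#include <vector>

// Extract every "imgurl=..." value (up to the next '&') from the
// downloaded page source.
std::vector<std::string> extractImgUrls(const std::string &source)
{
    std::vector<std::string> urls;
    static const std::regex re("imgurl=([^&]+)");
    for (std::sregex_iterator it(source.begin(), source.end(), re), end;
         it != end; ++it)
        urls.push_back((*it)[1].str());
    return urls;
}
```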