Webkit: extract information from HTML

bunjee
29th December 2010, 14:51
Greetings QtCentre,

I would like to do the following:

- Get the HTML from a webpage without loading all the media and other resources.
- Extract a given div by its id.

Is there a way to do that?

Thanks a lot!

B.A.

javimoya
29th December 2010, 15:47
You are looking for an HTML parser... :) (I don't know of any easy and reliable HTML parser for C++.)
Anyway, you can use QWebPage for that. It's easy and reliable, but that is not its purpose, so it is not efficient or fast for this job.

- Get the HTML from a webpage without loading all the media and other resources.

QWebSettings * settings = QWebSettings::globalSettings();
settings->setAttribute(QWebSettings::AutoLoadImages, false);
settings->setAttribute(QWebSettings::JavascriptEnabled, false);
settings->setAttribute(QWebSettings::JavaEnabled, false);
settings->setAttribute(QWebSettings::PluginsEnabled, false);
settings->setAttribute(QWebSettings::PrivateBrowsingEnabled, true);

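- Extract a given div by its id.

// 'page' is assumed to be a QWebPage that has already finished loading.
QWebFrame * frame = page.mainFrame();

// Either take the first element matching a CSS selector (extremely powerful):
QWebElement elem = frame->findFirstElement("div.tabContent h1");

// or collect every element that matches:
QWebElementCollection elems = frame->findAllElements("table#myId tbody tr");

// Then read the text of an element:
QString text = elem.toPlainText();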

wysota
30th December 2010, 00:39
- Get the HTML from a webpage without loading all the media and other resources.
- Extract a given div by its id.

Is there a way to do that?
Use QNetworkAccessManager instead of WebKit. For parsing you can use QXmlQuery if the page is valid XML. If not, then... well... QWebElement probably wouldn't work with it anyway. You can always use QRegExp; if you're only interested in a single tag, that's probably the best choice.
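
The single-tag idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: it uses std::regex instead of QRegExp so that it compiles without Qt, the names escapeForRegex and extractDivById are made up for the example, and it assumes the target div contains no nested div (the search stops at the first closing tag). The HTML string would come from whatever performed the download, for example a QNetworkAccessManager reply.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: escape regex metacharacters so the id matches literally.
static std::string escapeForRegex(const std::string &text)
{
    const std::string meta = ".^$|()[]{}*+?\\";
    std::string escaped;
    for (char c : text) {
        if (meta.find(c) != std::string::npos)
            escaped += '\\';
        escaped += c;
    }
    return escaped;
}

// Hypothetical helper: copy the content of the first <div id="..."> into 'out'.
// Returns false if no such div (or no closing </div>) is found.
// Assumption: the div has no nested <div>, since the first </div> ends it.
bool extractDivById(const std::string &html, const std::string &id, std::string &out)
{
    assert(!id.empty());

    // Match only the opening tag; the id attribute may appear after others.
    const std::regex openTag(
        "<div\\b[^>]*\\sid\\s*=\\s*[\"']" + escapeForRegex(id) + "[\"'][^>]*>",
        std::regex::icase);

    std::smatch match;
    if (!std::regex_search(html, match, openTag))
        return false;

    const std::size_t start =
        static_cast<std::size_t>(match.position(0) + match.length(0));
    const std::size_t end = html.find("</div>", start); // case-sensitive
    if (end == std::string::npos)
        return false;

    out = html.substr(start, end - start);
    return true;
}
```

Calling extractDivById(html, "score", text) fills text with whatever sits between the opening tag and the first closing tag. Anything more structured than a single flat div is better served by a real parser.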

javimoya
30th December 2010, 02:01
... QWebElement wouldn't work with it anyway ...

I disagree!
I have used it many times... and it works. It has been reliable with every HTML page I've tested.
If a QWebView can render it, QWebElement can parse it.

wysota
30th December 2010, 10:43
I disagree!
I have used it many times... and it works. It has been reliable with every HTML page I've tested.
If a QWebView can render it, QWebElement can parse it.

Maybe. Nevertheless the resulting tree might be different from what you would expect :)

squidge
30th December 2010, 11:15
I don't know why you want to do this, but if it's a regular thing where you want to extract some information from a webpage at regular intervals, then a better choice might be Python coupled with something like Beautiful Soup. It's pretty easy to use even if you don't know Python (it took me about 15 minutes or so to parse a webpage in the exact way that I wanted, and I'd never used Python before). You can tell it (copied from the website) "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text." I find it perfect for screen scraping. It also doesn't choke on invalid XML.

wysota
30th December 2010, 12:12
I guess using QtScript (or some other JavaScript engine) with jQuery is also an option, if we're talking about alternative approaches, provided QtScript understands the DOM. If not, then you have to wrap it all in QWebPage.

bunjee
1st January 2011, 19:50
Wow, so many replies.

Looks like I started a debate here :).

What I want to do is actually quite simple: I want to extract a "metascore" from this website: http://www.metacritic.com/

I suspect QNetworkAccessManager is my best bet, since the parsing required is rather simple.
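
For parsing this simple, a plain string search over the downloaded HTML may be enough. The sketch below is only an illustration: extractScoreAfterMarker is a made-up helper, and the marker passed to it (for example class="metascore") is a guess. The real markup of the page is not known here and would have to be checked in the page source.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: find 'marker' in the HTML, skip to the end of the
// tag that contains it, and read the integer that follows.
// Returns -1 if the marker, the closing '>' or the digits are missing.
int extractScoreAfterMarker(const std::string &html, const std::string &marker)
{
    assert(!marker.empty());

    const std::size_t markerPos = html.find(marker);
    if (markerPos == std::string::npos)
        return -1;

    // Move past the '>' that closes the tag holding the marker.
    std::size_t pos = html.find('>', markerPos + marker.size());
    if (pos == std::string::npos)
        return -1;
    ++pos;

    // Skip whitespace between the tag and the number.
    while (pos < html.size() && std::isspace(static_cast<unsigned char>(html[pos])))
        ++pos;

    // Read at most 9 digits so the int cannot overflow.
    int score = 0;
    int digits = 0;
    while (pos < html.size() && digits < 9 &&
           std::isdigit(static_cast<unsigned char>(html[pos]))) {
        score = score * 10 + (html[pos] - '0');
        ++digits;
        ++pos;
    }
    return digits > 0 ? score : -1;
}
```

The HTML string itself would come from the QNetworkAccessManager reply once the request has finished. A search like this breaks as soon as the site changes its markup, so it only suits a small one-off job.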