How to get data from webkit requests



rbp
28th September 2010, 13:36
From the QNetworkReply (http://doc.qt.nokia.com/4.6/qnetworkreply.html) docs:


QNetworkReply is a sequential-access QIODevice, which means that once data is read from the object, it is no longer kept by the device. It is therefore the application's responsibility to keep this data if it needs to.

How can I save this data before it is read and lost?

I tried catching the QNetworkAccessManager::finished() signal but usually (not always) the data was already lost.

wysota
28th September 2010, 14:57
What do you mean it was "already lost"? Did you read it in some other place?

rbp
29th September 2010, 00:44
webkit would have read that data to help render the webpage

rbp
29th September 2010, 02:10
Perhaps more explanation is needed. I am using QWebView to render webpages and have subclassed QNetworkAccessManager to try to save the data from certain requests. However, the data has usually already been read by the time the QNetworkAccessManager::finished() signal is emitted.

Would the solution be to override QNetworkAccessManager::createRequest() and create my own custom QNetworkReply objects that maintain the returned data?
If so do you know of any examples?

wysota
29th September 2010, 10:22
If you even manage to intercept the data then WebKit won't be able to access it. Is that really what you want? Isn't it better to pick it up directly from WebKit's webpage object after it is downloaded and used?

rbp
29th September 2010, 15:16
If you even manage to intercept the data then WebKit won't be able to access it. Is that really what you want?

No. For example, if I use peek() rather than read() then the data is still there. The problem is peeking before WebKit reads, or overriding QNetworkReply so that this is not a problem.



Isn't it better to pick it up directly from WebKit's webpage object after it is downloaded and used?

Unfortunately not possible in my use case because data is not retained.
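
To illustrate the peek()/read() distinction discussed above, here is a minimal plain-Python sketch. It is not Qt code: SequentialDevice and its methods are hypothetical stand-ins that only model the behaviour the QNetworkReply docs describe, where reading consumes data and peeking does not.

```python
class SequentialDevice:
    """Hypothetical model of a sequential-access device (not a Qt class)."""

    def __init__(self, payload):
        self._pending = bytes(payload)  # data not yet consumed

    def peek(self, size):
        # Return up to `size` bytes without consuming them.
        return self._pending[:size]

    def read(self, size):
        # Return up to `size` bytes and drop them from the device.
        chunk, self._pending = self._pending[:size], self._pending[size:]
        return chunk

    def bytes_available(self):
        return len(self._pending)


device = SequentialDevice(b"<html>hello</html>")
peeked = device.peek(6)      # data is still in the device afterwards
first = device.read(6)       # same bytes, but now they are consumed
rest = device.read(1024)     # whatever remained
leftover = device.read(1024) # nothing left for a second reader
```

The last call is the situation described in this thread: whoever reads second finds the device empty.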

rbp
5th October 2010, 06:04
I struggled subclassing QNetworkReply - I don't sufficiently understand the internals.

I am using the Python bindings so does that open up any additional possibilities?

Or is there some way to just tell QIODevice to maintain the buffer?

wysota
5th October 2010, 06:57
Or is there some way to just tell QIODevice to maintain the buffer?
The device is sequential. Once you read it, the data is gone.

The only approach I can think of right now that I know will work is to reimplement QNetworkAccessManager::createRequest() so that it returns a custom QNetworkReply acting as a proxy between a real network reply (which you have to create too) and whatever uses it (WebKit and your code). All calls to read() then go through your network reply class, so you can do anything you want with the data (e.g. copy it and provide it elsewhere). I don't know if that's the best or the simplest approach, but I know it will work (though it might be quite a lot of work).
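
The proxy idea described above can be sketched without Qt. The following is a plain-Python illustration under stated assumptions: RecordingReader is a hypothetical name, and io.BytesIO stands in for the real network reply. It only shows the principle that every read passes through the proxy, which keeps its own copy; it does not cover the signals, headers, and attributes a real QNetworkReply subclass has to forward.

```python
import io


class RecordingReader:
    """Hypothetical proxy: forwards read() to a source and keeps a copy."""

    def __init__(self, source):
        self._source = source        # any object with read(size) -> bytes
        self.captured = bytearray()  # everything that has passed through

    def read(self, size=-1):
        chunk = self._source.read(size)
        self.captured.extend(chunk)  # copy before handing the data on
        return chunk


# io.BytesIO plays the role of the real reply; the loop plays the consumer.
upstream = io.BytesIO(b"HTTP body delivered in pieces")
proxy = RecordingReader(upstream)

seen_by_consumer = b""
while True:
    chunk = proxy.read(8)
    if not chunk:
        break
    seen_by_consumer += chunk
```

The consumer receives the data unchanged, and proxy.captured still holds the full body after the source has been drained.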

rbp
5th October 2010, 14:44
Yeah, I was trying to take that proxy approach, but I feel I am working in the dark because I am not familiar with the QNetworkReply internals.

I find that the html at the URL is downloaded correctly and the finished() signal is emitted, but the content of QWebView remains empty.
How does QWebView get the content from QNetworkReply? None of peek(), read(), readLine(), readData(), or readAll() is ever called in this class, but the page renders properly when I use a standard QNetworkReply.

Here is my code (adapted from this example (http://gitorious.org/qtwebkit/performance/blobs/master/host-tools/mirror/main.cpp)):

# assuming PyQt4; with PySide, import from PySide.QtNetwork instead
from PyQt4.QtNetwork import QNetworkReply, QNetworkRequest


class NetworkReply(QNetworkReply):
    def __init__(self, reply):
        self.reply = reply
        QNetworkReply.__init__(self)

        # handle these to forward
        reply.metaDataChanged.connect(self.applyMetaData)
        reply.readyRead.connect(self.readInternal)
        reply.error.connect(self.errorInternal)
        # forward signals
        reply.finished.connect(self.finished)
        reply.uploadProgress.connect(self.uploadProgress)
        reply.downloadProgress.connect(self.downloadProgress)

        self.setOpenMode(QNetworkReply.ReadOnly)
        # data keeps the full response, buffer holds what has not been read yet
        self.data = self.buffer = ''

    def operation(self):
        return self.reply.operation()

    def request(self):
        return self.reply.request()

    def url(self):
        return self.reply.url()

    def abort(self):
        self.reply.abort()

    def close(self):
        self.reply.close()

    def isSequential(self):
        return self.reply.isSequential()

    def setReadBufferSize(self, size):
        # the unbound base-class call needs self passed explicitly
        QNetworkReply.setReadBufferSize(self, size)
        self.reply.setReadBufferSize(size)

    def applyMetaData(self):
        for header in self.reply.rawHeaderList():
            self.setRawHeader(header, self.reply.rawHeader(header))

        self.setHeader(QNetworkRequest.ContentTypeHeader, self.reply.header(QNetworkRequest.ContentTypeHeader))
        self.setHeader(QNetworkRequest.ContentLengthHeader, self.reply.header(QNetworkRequest.ContentLengthHeader))
        self.setHeader(QNetworkRequest.LocationHeader, self.reply.header(QNetworkRequest.LocationHeader))
        self.setHeader(QNetworkRequest.LastModifiedHeader, self.reply.header(QNetworkRequest.LastModifiedHeader))
        self.setHeader(QNetworkRequest.SetCookieHeader, self.reply.header(QNetworkRequest.SetCookieHeader))

        self.setAttribute(QNetworkRequest.HttpStatusCodeAttribute, self.reply.attribute(QNetworkRequest.HttpStatusCodeAttribute))
        self.setAttribute(QNetworkRequest.HttpReasonPhraseAttribute, self.reply.attribute(QNetworkRequest.HttpReasonPhraseAttribute))
        self.setAttribute(QNetworkRequest.RedirectionTargetAttribute, self.reply.attribute(QNetworkRequest.RedirectionTargetAttribute))
        self.setAttribute(QNetworkRequest.ConnectionEncryptedAttribute, self.reply.attribute(QNetworkRequest.ConnectionEncryptedAttribute))
        self.setAttribute(QNetworkRequest.CacheLoadControlAttribute, self.reply.attribute(QNetworkRequest.CacheLoadControlAttribute))
        self.setAttribute(QNetworkRequest.CacheSaveControlAttribute, self.reply.attribute(QNetworkRequest.CacheSaveControlAttribute))
        self.setAttribute(QNetworkRequest.SourceIsFromCacheAttribute, self.reply.attribute(QNetworkRequest.SourceIsFromCacheAttribute))
        # attribute does not exist
        #self.setAttribute(QNetworkRequest.DoNotBufferUploadDataAttribute, self.reply.attribute(QNetworkRequest.DoNotBufferUploadDataAttribute))
        self.metaDataChanged.emit()

    def errorInternal(self, e):
        self.error.emit(e)
        self.setError(e, str(e))

    def readInternal(self):
        # this is called
        s = self.reply.readAll()
        self.data += s
        self.buffer += s
        self.readyRead.emit()

    def bytesAvailable(self):
        return len(self.buffer) + self.reply.bytesAvailable()

    def readAll(self):
        # this is never called
        return self.data

    def readData(self, data, size):
        # this is never called
        size = min(size, len(self.buffer))
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        return size

rbp
7th October 2010, 08:24
figured it out - the docs are wrong. readData() only takes an int and returns a string:


    def readData(self, size):
        size = min(size, len(self.buffer))
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        return str(data)
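
The buffering scheme behind readInternal() and the corrected readData() can be tried out in isolation. This is a plain-Python sketch, not PyQt code: ReplyBuffer and feed() are hypothetical names, with feed() playing the role of readInternal() receiving a chunk from the real reply. It assumes the same split as the class posted above: data keeps everything, buffer keeps only what the consumer has not read yet.

```python
class ReplyBuffer:
    """Hypothetical model of the data/buffer pair used in the proxy reply."""

    def __init__(self):
        self.data = b""    # complete copy, never drained
        self.buffer = b""  # pending bytes for the consumer

    def feed(self, chunk):
        # Stand-in for readInternal(): store the chunk in both places.
        self.data += chunk
        self.buffer += chunk

    def bytesAvailable(self):
        return len(self.buffer)

    def readData(self, size):
        # Python-style signature: take a size, return the bytes read.
        size = min(size, len(self.buffer))
        chunk, self.buffer = self.buffer[:size], self.buffer[size:]
        return chunk


buf = ReplyBuffer()
buf.feed(b"<html>")
buf.feed(b"Test</html>")
first = buf.readData(4)    # consumer reads a few bytes
rest = buf.readData(1024)  # asking for more than is left is fine
```

After the consumer has drained buffer, data still holds the whole response, which is the copy the application wanted to keep.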

wysota
7th October 2010, 12:28
figured it out - the docs are wrong. readData() only takes an int and returns a string:
Hmm? Where are the docs wrong?

rbp
8th October 2010, 02:08
See Python QIODevice documentation (http://www.pyside.org/docs/pyside/PySide/QtCore/QIODevice.html?highlight=qiodevice#PySide.QtCore.QIODevice.readData)

wysota
8th October 2010, 02:09
Didn't you mix readData() with read()? The latter does what you say.

rbp
8th October 2010, 02:21
Mix them how? My example wouldn't work until I defined readData() as above, which means that documentation is wrong.

wysota
8th October 2010, 11:29
No, the docs are not wrong, because Qt's QIODevice::readData() really takes two parameters. If something is wrong then it's the PyQt wrapping of that method.

rbp
8th October 2010, 15:24
Yes you are right that Qt's readData() takes two parameters - I am not disputing that.
However the Python bindings use a modified API to better suit the language, which is why I was able to guess the true syntax. The documentation is wrong.

wysota
8th October 2010, 16:23
In that case there should be no readData() method at all as it completely duplicates what read() does :) In C++ it makes sense to have two methods, in Python, it seems it doesn't.

piotr.dobrogost
15th December 2010, 22:34
Once I wanted to create a proxy for QNetworkReply too. That's when I started the thread Subclassing QNetworkReply (http://www.qtcentre.org/threads/23857).

I'm curious; how did you find this example (http://gitorious.org/qtwebkit/performance/blobs/master/host-tools/mirror/main.cpp)?

rbp
16th December 2010, 05:00
someone passed me that link on the Qt webkit mailing list.

KBHomes
17th December 2010, 04:36
Wow, this is *exactly* what I needed! I was in the same position, couldn't figure out how to correctly subclass QNetworkReply, but this did just the trick!

Thanks a ton.

rbp
11th January 2011, 06:15
I've had ongoing problems with sub-classing QNetworkReply. For example proxies and caching don't work reliably.

So to get these features working I use the standard QNetworkReply but cache everything (with QNetworkDiskCache) and then read the data from the cache.

piotr.dobrogost
11th January 2011, 21:39
I've had ongoing problems with sub-classing QNetworkReply. For example proxies and caching don't work reliably.

Can you give any examples?

rbp
12th January 2011, 00:40
Example of new code, old code, websites that don't work - which?

piotr.dobrogost
13th January 2011, 00:01
Example of new code, old code, websites that don't work - which?

Example of when the QNetworkReply's proxy from host-tools does not work.

rbp
13th January 2011, 01:34
My code was a port of host-tools into pyqt.
One of the problems was with AJAX dependent apps like Google's keyword tool: https://adwords.google.com/select/KeywordToolExternal
Some of the AJAX requests were never called.

rbp
23rd August 2012, 17:12
A few people have contacted me asking for the source code. Seems this is a common problem.

Here it is:
http://code.google.com/p/webscraping/source/browse/webkit.py

John Peterson
14th November 2012, 09:56
When I load the local file

<!doctype html>
<html>
<head>
<title>Test</title>
</head>
<body>
Test
</body>
</html>

the call sequence is

downloadProgress 85 85
readyRead
bytesAvailable [return 85 + 0]
abort
aboutToClose
finished [len(self.buffer) = 85]

In other words, QWebView doesn't call any read* method (and then displays an empty page).

Can you confirm this?

Thanks!

PyQt-win-gpl-4.9.5

Update

Answer: self.setUrl(reply.url())