Read gzip file line-by-line



The_Fallen
5th September 2011, 15:40
Hi,

I'm currently trying to read a gzipped file. It contains an ASCII table, so I would like to read it line-by-line in order to get the values from the table. For this I wrote a small class CompressedFile, which is derived from QIODevice.

The overloaded readLineData function looks like this:



qint64 CompressedFile::readLineData(char *data, qint64 maxlen)
{
    qint64 len = 0;
    while (len < maxlen) {
        if (_bufPos >= _bufSize) {
            fillBuffer();
            if (_bufSize == 0) return 0;
        }
        data[len++] = _buffer[_bufPos++];
        if (data[len-1] == '\n') {
            data[len-1] = '\0';
            break;
        }
    }
    return len;
}


Since I had problems with qUncompress, I'm using zlib directly, so fillBuffer() looks like this:



void CompressedFile::fillBuffer()
{
    _bufOffset += _bufSize;
    _bufSize = gzread(_file, _buffer, BUFSIZE);
    _bufPos = 0;
}


So in principle I uncompress the file into a buffer (I tried sizes between 1kB and 1MB) and read from that. When I come to the end of the buffer, I read again.
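
The refill-and-scan idea described above can be sketched without Qt or zlib. This is only an illustration, not the CompressedFile class from this post: ChunkSource is a hypothetical stand-in for the gzread() call, and ChunkedLineReader and readAllLines are made-up names. Instead of copying one byte per loop iteration, the sketch looks for the line break with std::memchr and appends the whole span at once. Whether that closes the gap to the IDL/Python timings is not something this thread establishes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for gzread(): fills `dst` with up to `cap` bytes and
// returns the number of bytes produced (0 at end of input).
using ChunkSource = std::function<std::size_t(char *dst, std::size_t cap)>;

class ChunkedLineReader {
public:
    ChunkedLineReader(ChunkSource source, std::size_t bufSize)
        : source_(std::move(source)), buffer_(bufSize) {
        assert(bufSize > 0);
    }

    // Reads one line without its trailing '\n'. Returns false only when the
    // input is exhausted and nothing was read for this line.
    bool readLine(std::string &line) {
        line.clear();
        bool gotData = false;
        for (;;) {
            if (pos_ >= size_) {
                // Buffer consumed: refill it and restart at its beginning.
                size_ = source_(buffer_.data(), buffer_.size());
                pos_ = 0;
                if (size_ == 0) return gotData;
            }
            gotData = true;
            const char *start = buffer_.data() + pos_;
            const std::size_t avail = size_ - pos_;
            const void *nl = std::memchr(start, '\n', avail);
            if (nl) {
                // Line break found: append everything before it in one call.
                const std::size_t n = static_cast<const char *>(nl) - start;
                line.append(start, n);
                pos_ += n + 1; // skip the '\n' itself
                return true;
            }
            // No line break in the rest of the buffer: take it all and refill.
            line.append(start, avail);
            pos_ = size_;
        }
    }

private:
    ChunkSource source_;
    std::vector<char> buffer_;
    std::size_t pos_ = 0;
    std::size_t size_ = 0;
};

// Convenience helper for the illustration: collects every line.
std::vector<std::string> readAllLines(ChunkSource source, std::size_t bufSize) {
    ChunkedLineReader reader(std::move(source), bufSize);
    std::vector<std::string> lines;
    std::string line;
    while (reader.readLine(line)) lines.push_back(line);
    return lines;
}
```

Unlike the loop in the post, readLine() here reports success with a bool, so empty lines in the table do not end the read loop, and a last line without a trailing line break is still returned.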

I use it like this:



// open file
CompressedFile file(filename);
file.open(QIODevice::ReadOnly);

// start reading
QString line;
while (!(line = file.readLine()).isEmpty()) {
    // do something
}


My problem is that reading one of my files this way takes about 35 seconds. I have some IDL and Python code here that does the same in ~10 seconds. Any ideas how I can speed things up?

Cheers,
fallen

high_flyer
6th September 2011, 13:08
This code never gets called IMHO, or only once, if _bufPos is initialized to something other than zero that is greater than _bufSize at that point:


if (_bufPos >= _bufSize) {
    fillBuffer();
    if (_bufSize == 0) return 0;
}

Since in here, _bufPos is always set to 0:


void CompressedFile::fillBuffer()
{
    _bufOffset += _bufSize;
    _bufSize = gzread(_file, _buffer, BUFSIZE);
    _bufPos = 0; //<--- always 0.
}


Is there a reason you are reading char after char and not a whole line using QIODevice::readLine()?
Because this is what takes so long...

The_Fallen
6th September 2011, 13:14
Hi,


"This code never gets called IMHO, or only once, if _bufPos is initialized to something other than zero that is greater than _bufSize at that point:"

_bufPos is incremented in readLineData() and reset in fillBuffer(), since I start reading at the beginning of the buffer after refilling it. So the code works fine, it's just quite slow...


"Is there a reason you are reading char after char and not a whole line using QIODevice::readLine()?
Because this is what takes so long..."

But there is no QIODevice yet; that is what I'm trying to create right now. The CompressedFile class is a wrapper for a gzipped file that inherits QIODevice, and readLineData() is the protected function that is called internally from readLine(). And I'm reading the buffer char after char because I'm looking for a line break...

Cheers,
fallen

high_flyer
6th September 2011, 13:39
"_bufPos is incremented in readLineData()"
Oh, right, missed that one... sorry.


"And I'm reading the buffer char after char, since I'm looking for a line break..."
But readLine() will also recognize line breaks... that is exactly what it does!

Why not just unzip the file, get the buffer, put the buffer in a QTextStream, and read it line by line?
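
A rough illustration of that idea in plain standard C++, with std::istringstream standing in for QTextStream (so this is not Qt code). The function name splitLines and the parameter decompressed are made up for the sketch, and the buffer is assumed to already contain the complete uncompressed table. Note that this keeps the whole uncompressed file in memory.

```cpp
#include <sstream>
#include <string>
#include <vector>

// `decompressed` is assumed to hold the complete uncompressed file contents.
std::vector<std::string> splitLines(const std::string &decompressed) {
    std::istringstream stream(decompressed); // copies the buffer into the stream
    std::vector<std::string> lines;
    std::string line;
    // std::getline strips the '\n' and stops at the end of the buffer.
    while (std::getline(stream, line)) lines.push_back(line);
    return lines;
}
```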

The_Fallen
6th September 2011, 13:41
"Why not just unzip the file, get the buffer, put the buffer in a QTextStream, and read it line by line?"

The file is too large, a couple of hundred MB. But maybe I'll do that anyway... Thanks.