
Parsing/extracting from a binary QByteArray



Phlucious
2nd December 2011, 02:17
Forgive me if "parse" only applies to text-based data, but the idea is the same. I have a 30-byte chunk of bytes stored in a QByteArray that I extracted from a very organized external file via QFile::read(qint64). I know that the first 12 bytes are 3 longs, the next 2 bytes are an unsigned short, etc.

QFile infile("c:\\temp\\file.bin");
//... open the file, etc etc
QByteArray chunk = infile.read(30);
I then pass that QByteArray to a parser function running in another thread while the main thread moves on to the next chunk. This happens millions of times in sequence, so wrapping that QByteArray in a QDataStream as suggested in the documentation (http://doc.qt.nokia.com/latest/qdatastream.html#QDataStream-4) slows the process down so much that I want to try manipulating the QByteArray directly.

How do I (for example) read the first 4 bytes from a QByteArray, concatenate them, then cast the result into a long int? I'm looking for a faster equivalent to this that doesn't require me to construct a QDataStream (and, by extension, a QBuffer):

QDataStream io(chunk);
qint32 data;
io >> data;
I've tried this:

bool ok = true;
chunk.mid(0,4).toInt(&ok); //ok==false
but it doesn't work since the data is binary, not characters.

The only option I've come up with looks really ugly:

quint32 a = quint8(chunk.at(0)); // cast through quint8 to avoid sign extension
quint32 b = quint8(chunk.at(1));
quint32 c = quint8(chunk.at(2));
quint32 d = quint8(chunk.at(3));
quint32 abcd = (a << 24) | (b << 16) | (c << 8) | d;
I'm not even sure that this works or if it's faster! Of course, I still have to deal with endianness and all that, but let's just assume that's not a problem for now.

Any suggestions?

Santosh Reddy
2nd December 2011, 04:41
This will work assuming that the hardware platform and the data chunk have the same endianness. If the hardware platform and the data chunk have different endianness, you are looking at a different problem. An endianness mismatch will certainly degrade performance.


This is just plain C style, nothing much C++ about it:



struct Record
{
quint32 Long1;
quint32 Long2;
quint32 Long3;
quint32 Long4;
quint16 Short1;
quint16 Short2;
//...
};


// Reinterpret the raw bytes as a Record; the field layout must match the file.
const Record *record = (const Record*)chunk.constData();
quint32 long_data1 = record->Long1;
quint32 long_data2 = record->Long2;
quint16 short_data1 = record->Short1;
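
One caveat (I'm assuming your exact layout here): the compiler may insert padding between struct members, in which case the fields won't line up with the bytes in the file. If that happens you can force byte packing. This is just a sketch, and #pragma pack is compiler-specific (it works at least on MSVC and GCC):

// Force 1-byte packing so the struct matches the on-disk layout exactly.
#pragma pack(push, 1)
struct PackedRecord
{
    quint32 Long1;   // bytes 0-3
    quint32 Long2;   // bytes 4-7
    quint32 Long3;   // bytes 8-11
    quint16 Short1;  // bytes 12-13
    //...
};
#pragma pack(pop)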

marcvanriet
2nd December 2011, 09:28
If the contents of the 30-byte chunks are fixed as you indicate, and the endianness is the same, then Santosh's solution is the way to go.

Otherwise you must decode it like you did, but you can write it more compactly:

unsigned char *pData = (unsigned char*)chunk.data();
// assemble a little-endian quint32 from four consecutive bytes
quint32 long_data1 = pData[0] | ((quint32)pData[0+1]<<8) | ((quint32)pData[0+2]<<16) | ((quint32)pData[0+3]<<24);

And to make it more readable, create a #define to extract the data, substituting the '0' with an offset parameter in your define. And create different #defines for the different data types.
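
For example, something along these lines (just a sketch, assuming little-endian data in the file; 'p' must point to unsigned chars as above):

// Extract a little-endian quint32 or quint16 starting at byte offset 'ofs'.
#define GET_U32_LE(p, ofs) ( (quint32)(p)[(ofs)]           \
                           | ((quint32)(p)[(ofs)+1] << 8)  \
                           | ((quint32)(p)[(ofs)+2] << 16) \
                           | ((quint32)(p)[(ofs)+3] << 24) )
#define GET_U16_LE(p, ofs) ( (quint16)((p)[(ofs)] | ((p)[(ofs)+1] << 8)) )

quint32 long_data1 = GET_U32_LE(pData, 0);   // first long
quint32 long_data2 = GET_U32_LE(pData, 4);   // second long
quint16 short_data1 = GET_U16_LE(pData, 12); // the unsigned short after the 3 longs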

As a sidenote... I don't think that parsing such small chunks of data in a separate thread will speed things up. Possibly the effort of task switching is greater than the effort of parsing.

Best regards,
Marc

Phlucious
2nd December 2011, 17:44
Thanks to both of you for the useful responses. I hadn't thought of re-casting the char* into a struct. Wouldn't I have to add an extra char to the end of my struct because QByteArray::data() and QByteArray::constData() both return a null-terminated string?

Qt has some useful endian-conversion functions in QtEndian that I might be able to use after including <QtEndian>.
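
Something like this, for instance, should decode the fields regardless of the platform's byte order (untested; I'm assuming the file stores big-endian values, otherwise qFromLittleEndian would be the one to use):

#include <QtEndian>

const uchar *p = reinterpret_cast<const uchar*>(chunk.constData());
quint32 long1 = qFromBigEndian<quint32>(p);       // bytes 0-3
quint16 short1 = qFromBigEndian<quint16>(p + 12); // bytes 12-13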


As a sidenote... I don't think that parsing such small chunks of data in a separate thread will speed things up. Possibly the effort of task switching is greater than the effort of parsing.

My reasoning was that it would be faster to read a 30-byte (or 60-byte, or 190-byte... there's a handful of possible record formats) chunk from the disk and parse it from memory instead of reading it 1-8 bytes at a time, since HDD reads/writes tend to be very slow unless you have an SSD. Is that reasoning sound, or does Qt cache the file more than I realize? I've had issues trying to rapidly process a lot of data straight off the hard drive in the past; it tends to be many times slower than loading the whole file into memory first.

You make a good point about task switching, though... maybe I'll thread a couple thousand records at a time instead of one at a time.
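
Something like this is what I'm picturing (just a sketch; RECORD_SIZE and BATCH are made-up numbers, and parseBlock stands in for my parser function):

const int RECORD_SIZE = 30; // one of the possible record formats
const int BATCH = 4096;     // records per read, to be tuned

while (!infile.atEnd()) {
    // One big read instead of thousands of tiny ones...
    QByteArray block = infile.read(qint64(RECORD_SIZE) * BATCH);
    // ...then the worker thread walks the block record by record.
    parseBlock(block);
}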

I read in this thread (http://www.qtcentre.org/threads/36619-QVector-Slower-than-STL-Vector?p=168752#post168752) that the system might run faster if I used the native int instead of the smaller integers. Supposing for a second that speed's more important than memory, would it be worthwhile to re-cast the records into a struct of native integers after importing? I guess there's only one way to really find out...

marcvanriet
2nd December 2011, 19:15
I read in ... that the system might run faster if I used the native int instead of the smaller integers. Supposing for a second that speed's more important than memory, would it be worthwhile to re-cast the records into a struct of native integers after importing?

That won't speed up the reading part, of course.

It may only be useful if you run complex numerical algorithms on the data (for instance image manipulation, Fourier transforms, color space transformations, ...).

It won't make a difference if you're just displaying the data in a graph or logging it in a report.

Best regards,
Marc