
View Full Version : Problem in converting Large QDomDocument to QByteArray



sattu
21st August 2012, 12:34
Hi everyone,


I am using the "QDomDocument" class to create XML out of my database, which has more than 5 lakh records. Following is the code I am using to achieve this:-



QDomDocument XMLdoc;
//Put all the records inside XMLdoc
QByteArray XmlByte = XMLdoc.toByteArray(); //Here is where I am facing the problem


So the issue is: if my database has fewer than 5 lakh records (say 3 lakh), then "XMLdoc.toByteArray()" works completely fine.
But if my database has >= 5 lakh records, then "XMLdoc.toByteArray()" hangs. I am not able to figure out how to solve this. Is this due to some memory issue or something?

spirit
21st August 2012, 12:36
What does "lakh" mean?

sattu
21st August 2012, 13:53
What does "lakh" mean?

1 lakh = 100,000

yeye_olive
21st August 2012, 15:33
If I am not mistaken, QByteArray always stores its contents in a single contiguous memory block, which probably leads to many reallocations as XMLdoc is serialized. Clearly this approach does not scale. What do you need the QByteArray for? If its only use is to be written to a file (or any QIODevice), I suggest an alternative: set up a QTextStream on top of that device and call QDomNode::save() on XMLdoc. That way the serialized representation will be progressively computed and written to the device, and never stored completely in memory.

Note that your approach still stores the whole database in memory (the QDomDocument), but at least QDomDocument does not use a single contiguous memory block. If this is not the primary representation of your data, it may even be better to use QXmlStreamWriter directly, without building a QDomDocument at all.

sattu
21st August 2012, 16:28
If I am not mistaken, QByteArray always stores its contents in a single contiguous memory block, which probably leads to many reallocations as XMLdoc is serialized. Clearly this approach does not scale. What do you need the QByteArray for? If its only use is to be written to a file (or any QIODevice), I suggest an alternative: set up a QTextStream on top of that device and call QDomNode::save() on XMLdoc. That way the serialized representation will be progressively computed and written to the device, and never stored completely in memory.
Well, you are correct on the memory point. But I still need the QByteArray, because I need to send the XML over a TCP socket.


Note that your approach still stores the whole database in memory (the QDomDocument), but at least QDomDocument does not use a single contiguous memory block. If this is not the primary representation of your data, it may even be better to use QXmlStreamWriter directly, without building a QDomDocument at all.
Yeah, but I found QDomDocument much easier and simpler to use than QXmlStreamWriter. But tell me one thing: my program is going to run on an embedded system. So, as you have pointed out, from a memory point of view I should be using QXmlStreamWriter rather than QDomDocument, right? Because in embedded systems, memory and speed are very critical factors.

yeye_olive
21st August 2012, 17:25
Well, you are correct on the memory point. But I still need the QByteArray, because I need to send the XML over a TCP socket.
This is debatable. The approach I outlined works for any QIODevice, and QTcpSocket derives from QIODevice (and even if for some reason you did not use QTcpSocket but another interface to TCP sockets, you could still write a QIODevice wrapper for it). But you have a point: since QTcpSocket works asynchronously, data is only sent when control returns to the event loop. Calling QDomNode::save() would therefore serialize the QDomDocument completely into memory, which is essentially the same as your current approach.

Therefore you should not serialize the whole document in one go, but progressively as needed. For example you could write some data and wait for the socket to emit its bytesWritten() signal before sending more. You may want to check that the socket's bytesToWrite() is below a certain threshold before deciding to send more data. I think that QXmlStreamWriter will be well adapted to this approach (read below).


Yeah, but I found QDomDocument much easier and simpler to use than QXmlStreamWriter. But tell me one thing: my program is going to run on an embedded system. So, as you have pointed out, from a memory point of view I should be using QXmlStreamWriter rather than QDomDocument, right? Because in embedded systems, memory and speed are very critical factors.
Right, QXmlStreamWriter has the advantage that you do not need to build the whole XML representation in memory. So, to expand on my comment above, here is what I suggest:
- Define a class responsible for reading the database, serializing its XML representation and sending it over TCP.
- You will be reading the database progressively. Add member fields allowing you to track what you have already read. For example, if your database is a container such as QVector, you could use a const_iterator referring to the next element to read.
- Set up a QXmlStreamWriter on top of the QTcpSocket.
- Write a method sendSomeData() that -- well -- sends some data. In the QVector example, the method could read an element, advance the iterator, and serialize that element by calling the appropriate methods of the QXmlStreamWriter. It may be useful to check the socket's bytesToWrite() to decide when you have written enough data.
- Begin sending some data. For example you can call sendSomeData() once.
- Use the socket's bytesWritten() and bytesToWrite(), as explained above, to decide when to call sendSomeData() again.
- When sendSomeData() has nothing left to do, it can e.g. emit a signal sendingDone() to let the rest of the application know that everything has been sent.

Using this approach you can send the whole data over TCP without ever allocating too much memory.

sattu
21st August 2012, 18:06
Thanks a lot, yeye_olive. Your approach certainly saves a lot of memory. And yes, taking your streaming hint, I put my QByteArray inside a QTextStream (just for now, to test) and then called XMLdoc.save() on that stream. Now it works perfectly fine for any length of data, without any issues.



- You will be reading the database progressively. Add member fields allowing you to track what you have already read. For example, if your database is a container such as QVector, you could use a const_iterator referring to the next element to read.


Well, this I can't do, as I need to put all the records inside a single parent node. My XML structure is like:-



<MyParent>
<record>
<UserID> 1 </UserID>
<UserName> XYZ1 </UserName>
<UserType> Admin </UserType>
</record>
<record>
<UserID> 2 </UserID>
<UserName> XYZ2 </UserName>
<UserType> User </UserType>
</record>
<record>
<UserID> 3 </UserID>
<UserName> XYZ3 </UserName>
<UserType> User </UserType>
</record>
.
.
.//and so on. Each <record></record> represents a single record in the table.
.
.
.
</MyParent>


So, if I keep reading the database progressively and sending over the socket, then I think I would need many "MyParent" nodes instead of just one. I am not sure, though, because I have never used the "QXmlStreamWriter" class.



What I was thinking was:-
1) First form the entire XML from all the available records.
2) Then send the formed XML packet by packet over the socket.


What do you say?

yeye_olive
21st August 2012, 19:02
So, if I keep reading the database progressively and sending over the socket, then I think I would need many "MyParent" nodes instead of just one. I am not sure, though, because I have never used the "QXmlStreamWriter" class.
You can achieve the result you want with QXmlStreamWriter:
1. Initialize the document with


streamWriter.writeStartDocument();
streamWriter.writeStartElement(QString::fromAscii("MyParent"));

2. Each time you want to add a record, do


streamWriter.writeStartElement(QString::fromAscii("record"));
streamWriter.writeTextElement(QString::fromAscii("UserID"), QString::number(theUserID));
streamWriter.writeTextElement(QString::fromAscii("UserName"), theUserName);
streamWriter.writeTextElement(QString::fromAscii("UserType"), theUserType);
streamWriter.writeEndElement();

3. When you have written all the records, do


streamWriter.writeEndElement();
streamWriter.writeEndDocument();

Have a look at QXmlStreamWriter's API to see how you can choose the codec and auto-formatting options (line breaks, indentation). What you need to do is monitor bytesWritten() and bytesToWrite() on the QTcpSocket to gradually process the records. Use an iterator (or the equivalent for your data structure) to keep track of the next record to process.


What do you say?
I would not do that, as it builds a huge data structure in memory when you can easily generate it on the fly with QXmlStreamWriter, as explained above.

sattu
22nd August 2012, 08:35
OK, I got your idea. I tried using the "QXmlStreamWriter" class the way you told me, and it's working great. But until now, I haven't told you about the protocol that we follow in our socket programming.


What we ultimately send through the socket is (HEADER + ACTUAL_DATA).
1) HEADER contains the info regarding the length of ACTUAL_DATA.
2) ACTUAL_DATA is the XML data that we need to form and send.

So at the receiver end:-
1) We first check the HEADER value to get the length that we actually need to receive.
2) Then we keep looping for ACTUAL_DATA until we have received the total number of bytes specified in HEADER.


So, if we use the "QXmlStreamWriter" class, how do we know in advance the length of the TOTAL XML data that will be formed?
Ultimately the requirement is that, at the receiver end, we need to loop continuously until we have received the entire XML data. Earlier we had thought of using an 'EOF' character to know that we had received the entire data, but we had problems with that, so we switched to adding a HEADER before ACTUAL_DATA so that we know in advance how much data we are actually going to receive.
So, can you suggest any change in our protocol so that we can use the "QXmlStreamWriter" class and simultaneously meet the objective of informing the receiver when it has received the complete data?

yeye_olive
22nd August 2012, 10:33
Well, the two usual ways to deal with the "message framing" problem of TCP are indeed using a delimiter or prefixing with the length of the message.

With your current approach (prefixing with the length of the message), I am afraid that you need to serialize the whole data before sending any of it, because you need to determine and send the size first. You can still improve over your original "XMLdoc.toByteArray()" solution with any of these options:
1. Use QXmlStreamWriter on top of a QByteArray (see the QXmlStreamWriter::QXmlStreamWriter(QByteArray *) constructor). This will allocate a huge QByteArray and serialize to it but at least you will not build a huge QDomDocument. There is still the problem of the periodic reallocations of the QByteArray as it grows. Once the QByteArray is ready, send its size through the QTcpSocket, then send its contents progressively using bytesWritten() and bytesToWrite() to avoid duplicating the whole QByteArray in the socket's internal buffer.
2. Same as 1, but call QByteArray::reserve() with an overestimate of the final size to avoid reallocations. It may be feasible in your case, it all depends on whether the size of a serialized record and the number of records are predictable.
3. (More sophisticated.) Make QXmlStreamWriter operate on a custom QIODevice that stores the data in a linked list of QByteArrays. When more storage space is needed, a new QByteArray is allocated (with QByteArray::reserve() called on it with a suitable value, e.g. double the size of the previous QByteArray) and simply appended to the linked list, which does not require any reallocation. Keep track of the total number of bytes written since the beginning and of the number of remaining pre-reserved bytes in the last QByteArray to know when to allocate a new one. When everything is serialized, send the total length, then the data progressively as usual. You will need a const iterator to know which QByteArray of the list is currently being sent, and an integer to know the position of the next byte to send in that QByteArray.
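The linked-list idea from option 3 can be sketched in plain C++ (ChunkedBuffer is a made-up name, not a Qt class; a real implementation would wrap something like this in a QIODevice subclass):

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Grow-only buffer made of a linked list of blocks. Appending never
// reallocates or copies existing data; each new block doubles in size.
class ChunkedBuffer {
public:
    explicit ChunkedBuffer(std::size_t firstBlock = 4096)
        : m_nextBlockSize(firstBlock), m_totalSize(0) {}

    void append(const char *data, std::size_t len) {
        while (len > 0) {
            if (m_blocks.empty() ||
                m_blocks.back().size() == m_blocks.back().capacity()) {
                m_blocks.emplace_back();
                m_blocks.back().reserve(m_nextBlockSize);
                m_nextBlockSize *= 2;          // next block will be bigger
            }
            std::vector<char> &block = m_blocks.back();
            std::size_t room = block.capacity() - block.size();
            std::size_t n = len < room ? len : room;
            block.insert(block.end(), data, data + n);
            data += n;
            len -= n;
            m_totalSize += n;
        }
    }

    // Total length: this is the value to send in the HEADER.
    std::size_t totalSize() const { return m_totalSize; }

    // Concatenate (for inspection only; sending would walk the blocks).
    std::string toString() const {
        std::string out;
        for (const std::vector<char> &b : m_blocks)
            out.append(b.begin(), b.end());
        return out;
    }

private:
    std::list<std::vector<char> > m_blocks;
    std::size_t m_nextBlockSize;
    std::size_t m_totalSize;
};
```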

The other approach (using a delimiter) is much easier here. If changing the protocol is still an option, I would suggest you do that. It is well-adapted to cases like this one in which you cannot compute the size of a message in advance, which is typical with text-based serializations like XML. Since XML does not use the NUL ('\0') character, why don't you use that as a delimiter? Then all you have to do on the writing end is use the QXmlStreamWriter as you currently do and, when everything is serialized, write an additional byte '\0' to the socket. On the receiving end, you can read the incoming data blocks and scan them to stop just before the '\0' byte. You can send all these blocks to a QXmlStreamReader (using QXmlStreamReader::addData()) and pull the XML tokens to gradually rebuild the structure.
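On the receiving end, scanning for the '\0' delimiter across arbitrarily split TCP chunks could look like this (plain C++ sketch; DelimitedReader is a made-up helper, and in Qt each extracted message would then be fed to QXmlStreamReader::addData()):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Accumulates incoming TCP chunks and yields complete NUL-delimited
// messages as they become available.
class DelimitedReader {
public:
    // Feed a received chunk; returns any messages completed by it.
    std::vector<std::string> addData(const std::string &chunk) {
        std::vector<std::string> messages;
        m_pending += chunk;
        std::size_t start = 0, pos;
        while ((pos = m_pending.find('\0', start)) != std::string::npos) {
            messages.push_back(m_pending.substr(start, pos - start));
            start = pos + 1;
        }
        m_pending.erase(0, start);   // keep the unfinished tail
        return messages;
    }

private:
    std::string m_pending;
};
```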

sattu
22nd August 2012, 12:48
Actually olive, using a delimiter isn't feasible, as many times we need to transfer binary data; there is a possibility of the delimiter being present in the ACTUAL_DATA.
So your first option looks the best, but then again it's not possible to serialize the whole data before sending, because there is a chance of memory fragmentation when the number of records is huge. So we are planning to add the following modifications to our protocol:-

1) Reading from the database, forming the XML, and sending over the socket will all happen frame by frame, as you described.
2) Each frame will have the usual HEADER containing the length of ACTUAL_DATA present in that particular frame.
3) Modification:- The first frame will have the additional info of the TOTAL NO. OF RECORDS that we will send gradually. So at the receiver end we will keep looping until the number of parsed records equals the value present in the HEADER of the first frame.
4) I am planning to skip the following step suggested by you. I hope this shouldn't be an issue:-

streamWriter.writeStartDocument();
streamWriter.writeEndDocument();
Reason- I don't want the following header in my XML: <?xml version="1.0" encoding="UTF-8"?>

yeye_olive
22nd August 2012, 14:25
Actually olive, using a delimiter isn't feasible, as many times we need to transfer binary data; there is a possibility of the delimiter being present in the ACTUAL_DATA.
You could use a hybrid protocol using length-prefixing for some messages and delimiters for others. Just prefix each message with a byte indicating which one of the two conventions is used for that message. This allows you to choose length-prefixing for binary data with predictable length, and delimiters for XML. This is not difficult to do, and may be the best approach in your situation.
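A sketch of such hybrid framing in plain C++ (the mode bytes 'L' and 'D' and the function names are my own invention):

```cpp
#include <cstdint>
#include <string>

// Mode byte 'L': length-prefixed payload (for binary data).
std::string encodeLengthPrefixed(const std::string &payload) {
    std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    std::string frame;
    frame.push_back('L');
    // big-endian 32-bit length
    frame.push_back(static_cast<char>((len >> 24) & 0xFF));
    frame.push_back(static_cast<char>((len >> 16) & 0xFF));
    frame.push_back(static_cast<char>((len >> 8) & 0xFF));
    frame.push_back(static_cast<char>(len & 0xFF));
    frame += payload;
    return frame;
}

// Mode byte 'D': NUL-delimited payload (for XML, which never
// contains a '\0' byte).
std::string encodeDelimited(const std::string &payload) {
    std::string frame;
    frame.push_back('D');
    frame += payload;
    frame.push_back('\0');
    return frame;
}
```

The receiver reads the first byte of each frame and dispatches to the matching decoding loop.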


So your first option looks the best, but then again it's not possible to serialize the whole data before sending, because there is a chance of memory fragmentation when the number of records is huge. So we are planning to add the following modifications to our protocol:-

1) Reading from the database, forming the XML, and sending over the socket will all happen frame by frame, as you described.
2) Each frame will have the usual HEADER containing the length of ACTUAL_DATA present in that particular frame.
3) Modification:- The first frame will have the additional info of the TOTAL NO. OF RECORDS that we will send gradually. So at the receiver end we will keep looping until the number of parsed records equals the value present in the HEADER of the first frame.
OK, that works too. Besides, the receiving end can use the total number of records to optimize the allocation of the internal structures for storing the database.



4) I am planning to skip the following step suggested by you. I hope this shouldn't be an issue:-

streamWriter.writeStartDocument();
streamWriter.writeEndDocument();
Reason- I don't want the following header in my XML: <?xml version="1.0" encoding="UTF-8"?>
Is there a good reason for not including this small header? Frankly, it does not weigh much compared to the huge database. The documentation for QXmlStreamWriter does not explicitly state that what you suggest is forbidden, but it does not state that it works as you expect either. I am more concerned about the receiving end. By removing this header (called the XML declaration), you remove the information about the text encoding. It seems that QXmlStreamReader relies on the XML declaration, since it does not offer a way to set a codec manually. If I were you, I would keep the XML declaration and save myself some trouble.

Finally, if you can completely change the protocol, why do you use XML in the first place? You could encode the database to binary data. For example:

Database = [number of records (integer)][record 1][record 2]...
Record = [UserID (integer)][UserName (string)][UserType (integer)]
integer = big-endian 32-bit unsigned integer
string = UTF-8 encoded string followed by NUL byte

This format would also be suited to progressive serialization and deserialization.
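Sketched in plain C++ (the struct and function names are mine, and UserType is encoded as an integer code rather than a string, as in the format above):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Record {
    std::uint32_t userId;
    std::string userName;   // assumed to be UTF-8 already
    std::uint32_t userType; // e.g. 0 = Admin, 1 = User
};

// Append a big-endian 32-bit unsigned integer.
static void appendU32(std::string &out, std::uint32_t v) {
    out.push_back(static_cast<char>((v >> 24) & 0xFF));
    out.push_back(static_cast<char>((v >> 16) & 0xFF));
    out.push_back(static_cast<char>((v >> 8) & 0xFF));
    out.push_back(static_cast<char>(v & 0xFF));
}

// [number of records][UserID][UserName + NUL][UserType]...
std::string encodeDatabase(const std::vector<Record> &records) {
    std::string out;
    appendU32(out, static_cast<std::uint32_t>(records.size()));
    for (const Record &r : records) {
        appendU32(out, r.userId);
        out += r.userName;
        out.push_back('\0');     // string terminator
        appendU32(out, r.userType);
    }
    return out;
}
```

Decoding mirrors this: read the count, then loop, reading two integers and one NUL-terminated string per record, which works just as well on a progressively arriving stream.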

sattu
22nd August 2012, 15:45
It seems that QXmlStreamReader relies on the XML declaration since it does not offer a way to set a codec manually.

Actually, at the receiver end we use VB.NET, and it works fine without the XML declaration part.


Finally, if you can completely change the protocol, why do you use XML in the first place? You could encode the database to binary data.
Yeah, we have kept this as an option. Rather than fetching the records manually, putting the field names and values into a container, and then forming the XML out of it, we could directly send the entire database binary file over the socket. At the receiver end we could save it to another database file and extract the field values there. This could save a lot of overhead in the case of huge transactions.