How to read an XML file that uses UTF-8?



PaladinKnight
5th April 2010, 17:51
Hi!

There's a program I want to modify that has some problems parsing an XML file that uses UTF-8.

The content of some of the fields in the XML file is dumped into a flat file using a QTextStream (to which I tried specifying the encoding) but I can see that the characters which are not present in normal 7-bit ASCII are not correctly processed.

For example, a UTF-8 character that takes two bytes ends up in the flat file taking 4-5 bytes.

My guess is that when the file is read, Qt (the code in question uses a QDomDocument) thinks the file is in ISO-8859-1 (or something like that) and reads each two-byte UTF-8 character as two separate characters. When it then dumps the text into the flat file, it stores each of those two characters as its own multi-byte UTF-8 character.

The end result is that the text strings end up being corrupted.
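
To make the guess above concrete, here is a minimal sketch in plain C++ (no Qt involved) of what such a double encoding would do to the bytes. The helper names appendUtf8 and latin1ToUtf8 are made up for illustration and are not taken from the program in question; the sketch only assumes the input was misread as ISO-8859-1.

```cpp
#include <string>

// Append the UTF-8 encoding of a code point below U+0800.
// That range is enough here, since ISO-8859-1 only covers U+0000..U+00FF.
static void appendUtf8(std::string &out, unsigned int cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);                 // plain ASCII, 1 byte
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));   // lead byte
        out += static_cast<char>(0x80 | (cp & 0x3F)); // continuation byte
    }
}

// Hypothetical helper: treat every input byte as one ISO-8859-1 character
// and write that character back out as UTF-8. Feeding it text that is
// already UTF-8 reproduces the suspected corruption.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    for (unsigned char b : bytes)
        appendUtf8(out, b);
    return out;
}
```

Calling latin1ToUtf8("\xC3\xA9") (the two UTF-8 bytes of é) returns the four bytes C3 83 C2 A9, which matches the 4-byte case. A 5-byte result would be consistent with the misread using a Windows-1252-style table instead, where some bytes in the 0x80-0x9F range map to characters that need three bytes in UTF-8, but that part is only a guess.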

Is there a way to tell a QDomDocument which character set to use? Is it supposed to do it by itself using the XML header or is there something else to do? The correct character set (UTF-8) is declared in the XML file header.

Thank you!

Nick

Lykurg
5th April 2010, 19:10
What does your code look like? Did you use
<?xml version="1.0" encoding="utf-8"?> inside your XML files?
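
A quick way to check which encoding a file actually declares is to look at the start of its contents. The following is a rough sketch in plain C++, not Qt code and not how any particular parser does it; declaredEncoding is a made-up name, and the sketch ignores byte order marks and assumes the declaration itself is written in ASCII-compatible bytes.

```cpp
#include <string>

// Hypothetical helper: return the encoding named in the XML declaration at
// the start of `head`, or "UTF-8" (the default the XML specification gives
// for documents without a declaration or byte order mark).
std::string declaredEncoding(const std::string &head)
{
    const std::string fallback = "UTF-8";

    // The declaration, if present, must be the very first thing in the file.
    if (head.compare(0, 5, "<?xml") != 0)
        return fallback;
    const std::size_t end = head.find("?>");
    if (end == std::string::npos)
        return fallback;
    const std::string decl = head.substr(0, end);

    // Find the encoding pseudo-attribute and the quote that opens its value.
    std::size_t pos = decl.find("encoding");
    if (pos == std::string::npos)
        return fallback;
    pos = decl.find_first_of("\"'", pos);
    if (pos == std::string::npos)
        return fallback;

    // The value runs up to the matching quote character.
    const char quote = decl[pos];
    const std::size_t close = decl.find(quote, pos + 1);
    if (close == std::string::npos)
        return fallback;
    return decl.substr(pos + 1, close - pos - 1);
}
```

For the declaration quoted above, this returns "utf-8". If it returns something else for the real file, the declaration and the actual bytes disagree, and that would be worth fixing first.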

PaladinKnight
10th April 2010, 13:52
Hi!

Sorry for the delayed reply, this week has been completely crazy...

The program in question is not one I wrote; it's one I'm trying to modify.

I'll have to simplify it somewhat in order to post it here.

The XML file does have the encoding declaration you posted; that's what I meant by "The correct character set (UTF-8) is declared in the XML file header."

I was hoping the fix would be some sort of encoding setting (like the one a QTextStream offers), but there doesn't seem to be a way to specify one when using QDomDocument (setContent) with a QFile.

Thank you!

Nick