Qt Unicode Problems [Archive]

Sven

25th December 2010, 17:55

Hello folks,

I've got a problem and I really don't know how to solve this.

For instance, I've got a QString containing a German umlaut, built from a Unicode escape like this:

QString str = "\u00fc"; // works: gives "ü", the German umlaut

If I have a text file which contains the Unicode escapes, Qt does not transform them into German umlauts. Example:

QFile f("test.txt");
f.open(QIODevice::ReadOnly);
QString str = f.readAll();

Can anyone explain why? I have tested the codecs example with my text file, and again it does not replace the characters. I have also tried the QTextStream and QTextEncoder classes with no success. Can anyone help?

Lesiok

25th December 2010, 18:34

From the QFile documentation: "By default, QFile assumes binary, i.e. it doesn't perform any conversion on the bytes stored in the file."
So use QTextStream, and read carefully about QTextStream::setCodec.

Sven

25th December 2010, 18:54

I have already tried this class, and I know that I have to use UTF-8, but it still does not convert the strings for me.

Even if I launch the codecs example with my text file, it does not convert the string.

ChrisW67

26th December 2010, 02:05

This code:

#include <QtCore>
#include <QDebug>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    QString test = QString::fromStdWString(L"\u00fc");

    QFile data("test.txt");
    if (data.open(QIODevice::WriteOnly)) {
        QTextStream out(&data);
        out.setCodec("UTF-8");
        out << test;
    }
    data.close();

    if (data.open(QIODevice::ReadOnly)) {
        QTextStream in(&data);
        in.setCodec("UTF-8");
        QString str;
        in >> str;
        qDebug() << str << test;
    }

}

happily prints "ü" "ü". It writes this UTF-8 encoded file containing U+00FC:

$ od -tx1 test.txt
0000000 c3 bc
0000002
$ cat test.txt
ü

Sven

26th December 2010, 10:35

Yes, that works for sure, but I still have a problem because in my main application I get a QNetworkReply which contains the data (with \uXXXX).

QString test = QString::fromStdWString(L"\u00fc");

And this only works because your compiler does the byte conversion for you.

So in my case I get a string which contains those \uXXXX Unicode escapes, and I convert them using a function I have written, which works in a pretty simple way:

Get the hex value
Use QByteArray::fromHex()
Replace \uXXXX with the byte array

This method works for German umlauts and a few other characters, but it does not work for the Cyrillic alphabet, and I really don't know what to do about that.

Here is a snippet:

QByteArray strReply = m_pSearchReply->readAll();

bool buni = strReply.indexOf("\\u");
if (buni) {
    do
    {
        int idx = strReply.indexOf("\\u");

        QByteArray hex = strReply.mid(idx+2, 4);
        hex.replace("0", "");
        hex = QByteArray::fromHex(hex);
        strReply.replace(idx, 6, hex);

    } while (strReply.contains("\\u"));
}

[...]

So does anyone know what I am doing wrong with the Unicode character sets?
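A likely reason the snippet above appears to work for umlauts but not for Cyrillic: the hex digits of a code point are not, in general, the bytes of any text encoding. For U+00FC the single byte 0xFC happens to be "ü" in Latin-1, whereas U+043F (Cyrillic "п") turned into the raw bytes 0x04 0x3F is not meaningful text in Latin-1 or UTF-8. The following is a minimal sketch in plain standard C++ (not Qt) of what encoding a code point as UTF-8 involves; the function name encodeUtf8 is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
// Hypothetical helper for illustration; it does not reject surrogate values.
std::string encodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);  // highest valid Unicode code point
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: covers Latin-1 and Cyrillic
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: rest of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: supplementary planes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encodeUtf8(0xFC) gives the bytes c3 bc and encodeUtf8(0x43F) gives d0 bf, neither of which matches the hex digits of the escape itself.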

Lesiok

26th December 2010, 13:55

I think you are doing nothing wrong. In UTF-16, a character is normally represented as two binary bytes. Something like \uXXXX is not a binary UTF character; it is an ASCII representation.
Are you sure that these codes are correct?

ChrisW67

26th December 2010, 23:07

Yes, that works for sure, but I still have a problem because in my main application I get a QNetworkReply which contains the data (with \uXXXX).

The literal \u00FC identifies a Unicode code point. Despite being expressed in hex, it does not necessarily dictate the actual bits that are used to represent that character.

Read http://www.joelonsoftware.com/articles/Unicode.html for more

The real question is, "How is the Unicode character encoded in the stream you are receiving?" The encoding dictates how you should try to get the character(s) into a QString. So, get your stream into a QByteArray and dump it in hex and see what you have.

If you are receiving a 0xC3 byte followed by a 0xBC byte then it is UTF-8 encoded.
If you are receiving a 0x00 byte followed by a 0xFC byte, or in the reverse order, then it is probably UTF-16/UCS-2 encoded.
If you are receiving three 0x00 bytes followed by a 0xFC byte, in a few possible byte orderings, then it is probably UTF-32/UCS-4 encoded.
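The three rules above can be sketched as a small check on a hex dump. This is a rough illustration in plain standard C++ rather than Qt; the names Bytes, hexDump and guessEncodingOfU00FC are invented here, and the check only recognises the exact byte patterns for the single character U+00FC, plus the six-character ASCII escape as a further case.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Render bytes as space-separated lowercase hex, similar to `od -tx1`.
std::string hexDump(const Bytes &bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i != 0)
            out += ' ';
        out += digits[bytes[i] >> 4];
        out += digits[bytes[i] & 0x0F];
    }
    return out;
}

// Guess how U+00FC was encoded, given exactly the bytes that were received.
// Only an illustration: real data needs a proper decoder, not pattern matching.
std::string guessEncodingOfU00FC(const Bytes &b)
{
    if (b == Bytes{0xC3, 0xBC}) return "UTF-8";
    if (b == Bytes{0x00, 0xFC}) return "UTF-16BE";
    if (b == Bytes{0xFC, 0x00}) return "UTF-16LE";
    if (b == Bytes{0x00, 0x00, 0x00, 0xFC}) return "UTF-32BE";
    if (b == Bytes{0xFC, 0x00, 0x00, 0x00}) return "UTF-32LE";
    // Six ASCII characters: backslash, 'u', and four lowercase hex digits.
    if (b == Bytes{'\\', 'u', '0', '0', 'f', 'c'}) return "ASCII escape";
    return "unknown";
}
```

Dumping the received bytes first with something like hexDump and then comparing against these patterns shows whether the stream carries an encoded character or only a textual escape.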

Sven

27th December 2010, 14:51

I'll check it out later. But what can I do with the QString once I know which encoding the string has?

Edit: OK, I have checked it, and it seems that there is no encoding, because I cannot find any UTF-X header in that QNetworkReply.

Sven

27th December 2010, 18:51

OK, I finally found a solution for my problem. I wrote my own string class, which converts the Unicode escapes as needed in my special case.

Anyway, I think it's not the fastest way and still not the best solution, so if anyone has a better method, let me know.

class QscGrooveString : public QString
{
public:
    QscGrooveString(const QString &aData = QString())
        : QString(aData) {

        if (contains("\\u")) {
            do {
                int idx = indexOf("\\u");
                QString strHex = mid(idx, 6);
                strHex = strHex.replace("\\u", QString());
                int nHex = strHex.toInt(0, 16);
                replace(idx, 6, QChar(nHex));
            } while (indexOf("\\u") != -1);
        }
    }
};

ChrisW67

28th December 2010, 06:28

By UTF-x header I assume that you mean the optional byte order marker which is just another Unicode code point. The absence of a BOM tells you nothing about the stream of bytes that follow, but you need to know about that stream in order to make sense of it. The presence of a BOM allows you to make a guess at the encoding and byte order where that is relevant.
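As a rough illustration of that guess, here is a minimal sketch in plain standard C++ (not Qt) that looks for the usual byte order mark patterns at the start of a buffer. The function name encodingFromBom is made up for this example. An empty result only means no BOM was found; it says nothing about the actual encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the encoding suggested by a leading byte order mark, or "" if none.
// Hypothetical helper; the result is a guess, not proof of the encoding.
std::string encodingFromBom(const std::string &data)
{
    auto startsWith = [&data](const std::string &prefix) {
        return data.compare(0, prefix.size(), prefix) == 0;
    };
    // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with
    // FF FE. That sequence could equally be a UTF-16LE BOM followed by
    // U+0000, so this ordering is itself an assumption.
    if (startsWith(std::string("\x00\x00\xFE\xFF", 4))) return "UTF-32BE";
    if (startsWith(std::string("\xFF\xFE\x00\x00", 4))) return "UTF-32LE";
    if (startsWith("\xEF\xBB\xBF")) return "UTF-8";
    if (startsWith("\xFE\xFF")) return "UTF-16BE";
    if (startsWith("\xFF\xFE")) return "UTF-16LE";
    return "";
}
```

For a reply that consists of ASCII escapes such as the six characters backslash, u, 0, 0, f, c, this sketch returns an empty string, which matches the observation that no header could be found.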

You are receiving a string "\\u00FC" which is not a Unicode character in some UTF-encoded form, but a series of six characters. No wonder you couldn't make sense of it. What is sending this across the network?

Your solution will work for the Unicode Basic Multilingual Plane, i.e. code points below U+10000, but not the supplementary code points. This might be adequate depending on purpose (it includes the Cyrillic code points).

You could save a few string operations:

class QscGrooveString : public QString
{
public:
    QscGrooveString(const QString &aData = QString())
        : QString(aData)
    {
        int idx = -1;
        while ( ( idx = indexOf("\\u") ) != -1 ) {
            int nHex = mid(idx + 2, 4).toInt(0, 16);
            replace(idx, 6, QChar(nHex));
        }
    }
};
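For the supplementary code points mentioned above, one possible approach is sketched below in plain standard C++ (not Qt). It assumes, as JSON-style senders typically do, that a code point above U+FFFF arrives as two consecutive escapes forming a UTF-16 surrogate pair; whether the sender in this thread does that is not known. The names parseHex4 and unescapeUnicode are invented for illustration, and the input is assumed to be ASCII apart from the escapes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Parse exactly four hex digits starting at text[pos]; return -1 if malformed.
int parseHex4(const std::string &text, std::size_t pos)
{
    if (pos + 4 > text.size())
        return -1;
    int value = 0;
    for (std::size_t i = pos; i < pos + 4; ++i) {
        const char c = text[i];
        int digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return -1;
        value = value * 16 + digit;
    }
    return value;
}

// Replace backslash-u escapes with code points in a single forward pass.
// A high surrogate escape directly followed by a low surrogate escape is
// combined into one supplementary code point. Malformed escapes are copied.
std::u32string unescapeUnicode(const std::string &text)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < text.size()) {
        int unit = -1;
        if (text[i] == '\\' && i + 1 < text.size() && text[i + 1] == 'u')
            unit = parseHex4(text, i + 2);
        if (unit < 0) {  // not a valid escape: copy the character unchanged
            out += static_cast<char32_t>(static_cast<unsigned char>(text[i]));
            ++i;
            continue;
        }
        i += 6;  // skip backslash, 'u' and four hex digits
        char32_t cp = static_cast<char32_t>(unit);
        if (unit >= 0xD800 && unit <= 0xDBFF
            && i + 1 < text.size() && text[i] == '\\' && text[i + 1] == 'u') {
            const int low = parseHex4(text, i + 2);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000
                   + ((static_cast<char32_t>(unit) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
                i += 6;  // the low surrogate escape has been consumed too
            }
        }
        out += cp;
    }
    return out;
}
```

Scanning forward once, instead of searching from the start after every replacement, also means an escape that decodes to a backslash followed by a literal "u" is not decoded a second time.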