How to convert QString to std::string or char*?



yangyunzhao
24th August 2009, 06:38
If my QString contains only ASCII characters, I can convert it to std::string or char* as follows:


QString x = "hello";
std::string str = x.toStdString();
str.c_str();


But if my QString contains Chinese characters, I cannot convert it to std::string or char*.


QString x = " 你 好 ";
std::string str = x.toStdString(); // str becomes unreadable garbage
str.c_str();

yogeshgokul
24th August 2009, 06:59
QString x = " 你 好 ";
std::string str = x.toStdString(); // str becomes unreadable garbage
str.c_str();

As the Qt docs say for QString::toStdString():
If the QString contains non-ASCII Unicode characters, using this operator can lead to loss of information, since the implementation calls toAscii().

You can use:

std::wstring QString::toStdWString () const

yangyunzhao
24th August 2009, 07:58
As the Qt docs say for QString::toStdString():

You can use:

std::wstring QString::toStdWString () const

Yes, if I use std::wstring, it works OK.

But in my program I use a third-party API, and that API requires a std::string or char* parameter.

yogeshgokul
24th August 2009, 08:01
But in my program I use a third-party API, and that API requires a std::string or char* parameter.

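Try:

QByteArray utf8 = x.toUtf8(); // toUtf8() returns a QByteArray by value
const char *p = utf8.constData(); // valid only while utf8 is alive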

nish
24th August 2009, 08:04
Yes, if I use std::wstring, it works OK.

But in my program I use a third-party API, and that API requires a std::string or char* parameter.

If you want char*, then forget about Unicode... you can't pass a Chinese character in a char*. Don't waste your time... tell your boss that char* is 7-bit ASCII (-127 to 127), so either change the third-party library or live with English only.

franz
24th August 2009, 08:05
Never heard of UTF-8 (http://en.wikipedia.org/wiki/Utf-8), have you?

Follow yogeshgokul's advice and convert your strings to utf-8 encoding.

nish
24th August 2009, 10:36
Never heard of UTF-8 (http://en.wikipedia.org/wiki/Utf-8), have you?

Follow yogeshgokul's advice and convert your strings to utf-8 encoding.

No matter what encoding he converts his QString to, when it is eventually converted to a char* (by any function: toAscii(), data(), or anything else), it will lose the Unicode characters.

The Storm
24th August 2009, 20:52
It will not. char* is just a pointer to memory, so nothing is lost; it all depends on how the data is read later.
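
To illustrate, here is a minimal sketch in standard C++ (no Qt needed). It assumes the six bytes below are what QString::toUtf8() would produce for "你好"; thirdPartyApi is a hypothetical stand-in for a library function that takes a char* and only copies the bytes.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical stand-in for a third-party function that accepts char*
// and stores the bytes without interpreting them.
std::string thirdPartyApi(const char *text)
{
    return std::string(text); // copies up to the terminating '\0'
}

// UTF-8 bytes for U+4F60 U+597D ("你好"), written as escapes so the
// example does not depend on the source file encoding.
const char kNiHaoUtf8[] = "\xe4\xbd\xa0\xe5\xa5\xbd";

// Returns true if the bytes come back unchanged after the char* round trip.
bool roundTripKeepsBytes()
{
    std::string out = thirdPartyApi(kNiHaoUtf8);
    return out.size() == 6 && std::memcmp(out.data(), kNiHaoUtf8, 6) == 0;
}
```

Whether the text is displayed correctly afterwards still depends on the receiving code treating the bytes as UTF-8.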

nightghost
24th August 2009, 21:31
Yes, but it's likely that the author of the library would have used something other than char* if the library supported non-ASCII characters.

nish
25th August 2009, 03:21
It will not. char* is just a pointer to memory, so nothing is lost; it all depends on how the data is read later.

And how do you read a char*? 8 bits at a time, isn't it? So no matter how the data is stored in memory, the functions that manipulate char* will ALWAYS lose data (or rather, corrupt it).

The Storm
25th August 2009, 08:40
You have a long way to go...

This function is from the Qt sources:


QString QString::fromUtf8(const char *str, int size)
{
    if (!str)
        return QString();
    if (size < 0)
        size = qstrlen(str);

    QString result;
    result.resize(size); // worst case
    ushort *qch = result.d->data;
    uint uc = 0;
    uint min_uc = 0;
    int need = 0;
    int error = -1;
    uchar ch;
    int i = 0;

    // skip utf8-encoded byte order mark
    if (size >= 3
        && (uchar)str[0] == 0xef && (uchar)str[1] == 0xbb && (uchar)str[2] == 0xbf)
        i += 3;

    for (; i < size; ++i) {
        ch = str[i];
        if (need) {
            if ((ch&0xc0) == 0x80) {
                uc = (uc << 6) | (ch & 0x3f);
                need--;
                if (!need) {
                    if (uc > 0xffffU && uc < 0x110000U) {
                        // surrogate pair
                        *qch++ = QChar::highSurrogate(uc);
                        uc = QChar::lowSurrogate(uc);
                    } else if ((uc < min_uc) || (uc >= 0xd800 && uc <= 0xdfff) || (uc >= 0xfffe)) {
                        // overlong seqence, UTF16 surrogate or BOM
                        uc = QChar::ReplacementCharacter;
                    }
                    *qch++ = uc;
                }
            } else {
                i = error;
                need = 0;
                *qch++ = QChar::ReplacementCharacter;
            }
        } else {
            if (ch < 128) {
                *qch++ = ch;
            } else if ((ch & 0xe0) == 0xc0) {
                uc = ch & 0x1f;
                need = 1;
                error = i;
                min_uc = 0x80;
            } else if ((ch & 0xf0) == 0xe0) {
                uc = ch & 0x0f;
                need = 2;
                error = i;
                min_uc = 0x800;
            } else if ((ch&0xf8) == 0xf0) {
                uc = ch & 0x07;
                need = 3;
                error = i;
                min_uc = 0x10000;
            } else {
                // Error
                *qch++ = QChar::ReplacementCharacter;
            }
        }
    }
    if (need) {
        // we have some invalid characters remaining we need to add to the string
        for (int i = error; i < size; ++i)
            *qch++ = QChar::ReplacementCharacter;
    }

    result.truncate(qch - result.d->data);
    return result;
}


As I told you, the data is not lost; everything depends on how you read it. Another nice example is sending an int directly over the network. Every function that sends data over the network takes a char* pointer. Let's say we write:



int i = 100;
socket->write( (const char*)&i, sizeof(int) ); // pass the address of i, not its value


And now what, is the integer lost? On the other side you will get 4 bytes (on a 32-bit OS); then you just need a few bit operations to get the integer back.
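
The same idea can be sketched without a socket, in standard C++. toBytes and fromBytes are hypothetical helper names for this illustration, and the sketch assumes sender and receiver share the same int size and byte order.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Copy the raw bytes of an int into a byte buffer, the way a
// write(const char*, size) style function would see them.
std::vector<char> toBytes(int value)
{
    std::vector<char> buffer(sizeof(int));
    std::memcpy(buffer.data(), &value, sizeof(int));
    return buffer;
}

// Rebuild the int on the receiving side. This assumes the buffer holds
// at least sizeof(int) bytes in the same byte order as the sender.
int fromBytes(const std::vector<char> &buffer)
{
    int value = 0;
    std::memcpy(&value, buffer.data(), sizeof(int));
    return value;
}
```

Using memcpy instead of a pointer cast avoids alignment problems when the bytes arrive at an arbitrary offset in a receive buffer.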

nish
25th August 2009, 08:57
Actually we are both saying the same thing, but looking at it from different angles...


then you need to make just some bit operations in order to get back the integer number.
Will someAsciiSearchFunc(char*, char*) do that correctly if you give it 16-bit data? It will only check 8 bits at a time...
So yes, you can play around with typecasting if you own the code on both sides...

nish
25th August 2009, 09:09
ahhh... now i got it... just read about utf8 ... looks like i really have a long way to go... time to go back to school..:(

yogeshgokul
25th August 2009, 10:30
time to go back to school..:(
Yes ! school bell is ringing. ;)

So what is the final outcome? What should yangyunzhao do to get this done?

wysota
25th August 2009, 10:40
Lets say we put:



int i = 100;
socket->write( (const char*)&i, sizeof(int) ); // pass the address of i, not its value


And now what - the integer is lost ?

This is different. String manipulation includes a special interpretation of the \0 character. In the above case it will work because you are passing the size, but when you pass a string that contains \0 characters to, let's say, strcpy(), you will lose data. So it all comes down to whether a UTF-8 encoded string contains null bytes or not. From what I see it doesn't, so storing UTF-8 encoded strings in char* should be safe. On the other hand, if the 3rd party API does something with the data it gets, it is likely you will lose data or get incorrect output (e.g. the size of such "text" will be incorrect).
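
A small sketch of the size point, in standard C++ without Qt: strlen() reports bytes, which differs from the number of characters once the text is non-ASCII UTF-8. utf8CodePoints is a hypothetical helper written for this illustration and assumes valid UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Count code points in a NUL-terminated UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8CodePoints(const char *s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xc0) != 0x80)
            ++count;
    }
    return count;
}
```

For the bytes "\xe4\xbd\xa0\xe5\xa5\xbd" (the UTF-8 form of "你好"), strlen() gives 6 while this helper gives 2, so an API that truncates or pads by byte count can cut a character in half.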

nish
25th August 2009, 10:45
i was trying to say the same thing... ummm.. my english..

yogeshgokul
25th August 2009, 10:51
Then why does the QString class have this function?

QString QString::fromUtf8 ( const char * str, int size = -1 ) [static]

This function takes a char* parameter and its name says fromUtf8. Doesn't that mean Qt accepts a UTF-8 encoded string pointed to by a char*?
So Qt itself implies that a UTF-8 encoded string can be pointed to by a char*.

nish
25th August 2009, 11:54
Go and read Wikipedia:
UTF-8 encodes each character (code point (http://en.wikipedia.org/wiki/Universal_Character_Set)) in 1 to 4 octets (http://en.wikipedia.org/wiki/Octet_%28computing%29) (8-bit bytes (http://en.wikipedia.org/wiki/Byte)), with the single octet encoding used only for the 128 US-ASCII characters. See the Description section below for details.
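
As a rough sketch of the 1 to 4 octet rule, here is a standalone encoder in standard C++. encodeUtf8 is a hypothetical name for this illustration, not a Qt function; it follows the usual UTF-8 bit layout and returns an empty string for values it cannot encode.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8. Returns an empty string for
// values that are not encodable (surrogates or above U+10FFFF).
std::string encodeUtf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // 1 octet
    } else if (cp < 0x800) {
        out += static_cast<char>(0xc0 | (cp >> 6));           // 2 octets
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff)
            return out; // UTF-16 surrogates are not valid code points
        out += static_cast<char>(0xe0 | (cp >> 12));          // 3 octets
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x110000) {
        out += static_cast<char>(0xf0 | (cp >> 18));          // 4 octets
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    }
    return out;
}
```

Every octet of a multi-octet sequence has its high bit set, which is why such sequences never collide with ASCII bytes, including the zero byte.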

wysota
25th August 2009, 12:11
Then why does the QString class have this function?

QString QString::fromUtf8 ( const char * str, int size = -1 ) [static]

This function takes a char* parameter and its name says fromUtf8.

"const char *" in C/C++ is "an array of 8bit values" so anything can go there. Its Qt equivalent is QByteArray. How the data stored in a particular const char * is interpreted is another story (that's why you have the "size" parameter here). UTF-8 strings CAN contain 0x00 values - but only if you directly place them there yourself. So in 99.9% of the cases keeping a utf-8 string in a C byte array without also keeping its size is fine.
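
The remaining 0.1% can be sketched in standard C++ like this. The three bytes below are "a", U+0000, "b", which is valid UTF-8 containing a zero byte; the names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// "a", U+0000, "b": a valid UTF-8 sequence that contains a zero byte.
const char kWithNul[] = { 'a', '\0', 'b' };
const std::size_t kWithNulSize = sizeof(kWithNul); // 3 bytes

// Without the size, C string functions stop at the first zero byte.
std::size_t lengthWithoutSize() { return std::strlen(kWithNul); }

// With the size carried alongside the pointer, all bytes are kept.
std::string copyWithSize() { return std::string(kWithNul, kWithNulSize); }
```

Here lengthWithoutSize() sees only the first byte, while copyWithSize() keeps all three.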

yogeshgokul
25th August 2009, 12:17
So in 99.9% of the cases keeping a utf-8 string in a C byte array without also keeping its size is fine.

Then how do you explain this:


No matter what encoding he converts his QString to, when it is eventually converted to a char* (by any function: toAscii(), data(), or anything else), it will lose the Unicode characters.

wysota
25th August 2009, 12:32
Why am I to explain someone else's statements? :)

yogeshgokul
25th August 2009, 12:37
Why am I to explain someone else's statements? :)
Yep, that's true.
I only posted that because of this post, so I thought you guys meant the same thing.

i was trying to say the same thing... ummm.. my english..

After this thread I have become:
ME = ME - UTF. :crying::crying:

The Storm
25th August 2009, 21:00
"const char *" in C/C++ is "an array of 8bit values" so anything can go there. Its Qt equivalent is QByteArray. How the data stored in a particular const char * is interpreted is another story (that's why you have the "size" parameter here). UTF-8 strings CAN contain 0x00 values - but only if you directly place them there yourself. So in 99.9% of the cases keeping a utf-8 string in a C byte array without also keeping its size is fine.

Another way around this is serialization. He writes the software, so he knows what to expect. For example, the size of an int may vary across CPUs and OSes, which is why Qt provides qint8, qint16, qint32 and qint64; with these we can guarantee a fixed size in bytes. So even if he does not have a 0x00 at the end of the UTF-8 string, he can use other methods. For example, with his own protocol: if he knows that the first 2 bytes to arrive are the length of the string, he knows how many more bytes to read. I know that you are aware of this, I just point it out to make things clear (and a bit off-topic). :)

On topic: in my opinion, the function in the API that yangyunzhao is talking about does not recognize UTF-8.
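
A length-prefixed protocol of that kind could look roughly like this in standard C++. frame and unframe are hypothetical names, and the big-endian byte order of the 2-byte length is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Frame a payload as: 2-byte big-endian length, then the raw bytes.
// Payloads longer than 65535 bytes do not fit and yield an empty frame.
std::string frame(const std::string &payload)
{
    if (payload.size() > 0xffff)
        return std::string();
    std::string out;
    out += static_cast<char>((payload.size() >> 8) & 0xff);
    out += static_cast<char>(payload.size() & 0xff);
    out += payload;
    return out;
}

// Read one frame back. Returns false if the buffer is too short.
bool unframe(const std::string &buffer, std::string &payload)
{
    if (buffer.size() < 2)
        return false;
    std::size_t length =
        (static_cast<unsigned char>(buffer[0]) << 8) |
         static_cast<unsigned char>(buffer[1]);
    if (buffer.size() < 2 + length)
        return false;
    payload.assign(buffer, 2, length);
    return true;
}
```

Because the receiver reads exactly the announced number of bytes, the payload may contain any byte values, including 0x00.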

yogeshgokul
26th August 2009, 06:52
On topic: in my opinion, the function in the API that yangyunzhao is talking about does not recognize UTF-8.
But yangyunzhao wants to pass a Chinese string to the API.

nish
26th August 2009, 07:41
But yangyunzhao wants to pass a Chinese string to the API.
BTW it's called Mandarin, not Chinese... it's like calling Hindi "Indian".. :p

yogeshgokul
26th August 2009, 07:49
BTW it's called Mandarin, not Chinese... it's like calling Hindi "Indian".. :p
I am not bothered, because the thread starter (yangyunzhao) used the term Chinese. The rest I don't care about.
Better look at this, and don't forget to pay attention to the bold letters. :D


But if my QString has chinses character, I cannot conver it to std::string or char*.

nish
26th August 2009, 07:52
:crying::crying:buhuhuhuhu... this thread has killed all my reputation:crying::crying: