Correct and full conversion to/from std::string. [Archive]

nroberts

3rd January 2011, 18:12

I'm working on a project that needs to support Unicode and could end up being used in countries like Japan. We want to use std::string as the underlying type that holds string data in the data layer (see http://stackoverflow.com/questions/4521252/qt-msvc-and-zcwchar-t-i-want-to-blow-up-the-world as to why). The problem is that I'm not completely sure which function pair (to/from) to use for this so that we're 100% compatible with anything the user might enter in the Qt layer.

A look at to/fromStdString indicates that I'd have to use setCodecForCStrings. The documentation for that function though indicates that I wouldn't want to do this for things like Japanese. This is the set that I'd LIKE to use though. Does someone know enough to explain how I'd set this up if it's possible?

The other option, which I could be pretty sure would work, is the toUtf8/fromUtf8 functions. Those would require a two-step approach though, so I'd prefer the other if possible.

Is there anything I've missed?

Repost from: http://stackoverflow.com/questions/4586969/which-function-pair-in-qstring-to-use-for-converting-to-from-stdstring

franz

3rd January 2011, 20:40

Well, your question seems to consist of a lot of sub-questions, and it will be tough to answer them all. The basic rule is to use UTF-8 encoding when serializing/storing string data. It is one of the encodings defined by the Unicode standard (http://www.unicode.org/) and can represent every Unicode code point, so converting to it and back won't lose data. Even though it is a two-step process, I'd recommend you do all your serialization to and from UTF-8.
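To illustrate the "lossless" point, here is a minimal sketch of a UTF-8 round trip in plain standard C++ (C++11), with no Qt involved. The names encode_utf8 and decode_utf8 are made up for this illustration, and the decoder assumes well-formed input. In Qt the equivalent work is done by the QString UTF-8 conversion functions mentioned in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
std::string encode_utf8(char32_t cp) {
    // Valid code points only: up to U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte, plain ASCII
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: decode UTF-8 back into code points.
// Assumes well-formed input (for example, output of encode_utf8); no validation.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        // Number of continuation bytes that follow the lead byte.
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Keep only the payload bits of the lead byte.
        char32_t cp = extra == 0 ? lead : lead & (0x3F >> extra);
        for (int k = 1; k <= extra && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

Encoding a Japanese character, an ASCII letter and a character outside the Basic Multilingual Plane, then decoding the result, gives back the same three code points.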

QString is, as you state in one of your Stack Overflow posts, a Unicode monster, but the common copying and editing operations are just as quick as, if not quicker than, those of std::string (I always find std to be a worrying thing to type...). Internally, the string data is stored in UTF-16 encoding.

There is indeed evidence that Qt compiles correctly with wchar_t as a built-in type. It will mean that your compiled libraries are binary incompatible with other Qt builds, but as long as you have it documented somewhere with your program, I guess there will be no issue. Also, the way shared libraries work on Windows doesn't really require you to keep the binary compatibility around. Even so, you will probably be happier with your data stored in UTF-8, where you don't have to take the byte order into account. Also, whenever your string data 'leaves' the QString, you will have to make sure you are using the correct locale/encoding.
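A small sketch of the byte-order remark, in plain standard C++ (C++11). The names Bytes2, utf16_le, utf16_be and demo_byte_order are invented for this illustration. A 16-bit UTF-16 code unit has two possible byte layouts when written out, whereas UTF-8 is defined as a byte sequence and has only one.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

typedef std::array<std::uint8_t, 2> Bytes2;  // hypothetical alias for this sketch

// Serialize one UTF-16 code unit with the low byte first (little-endian).
Bytes2 utf16_le(char16_t unit) {
    Bytes2 b = {{ static_cast<std::uint8_t>(unit & 0xFF),
                  static_cast<std::uint8_t>(unit >> 8) }};
    return b;
}

// Serialize the same code unit with the high byte first (big-endian).
Bytes2 utf16_be(char16_t unit) {
    Bytes2 b = {{ static_cast<std::uint8_t>(unit >> 8),
                  static_cast<std::uint8_t>(unit & 0xFF) }};
    return b;
}

// The same character yields different bytes depending on the byte order,
// so a reader of UTF-16 data has to know (or detect) which one was used.
void demo_byte_order() {
    assert(utf16_le(0x65E5) != utf16_be(0x65E5));
}
```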

The fact of the matter is that QString is one of the few properly usable Unicode-compatible string implementations available. Especially when using a full Qt implementation, without any special stuff, Qt users will practically never run into this type of issue, because Qt solves the biggest portion of it for you (serializing to binary data and writing XML documents are both included). I am pretty certain that this is the reason this type of issue doesn't get a lot of replies.

Added after 10 minutes:

By the way, if your goal is to really be i18n-ready and you are planning on continuing the use of Qt, you should really look into using Qt's i18n and translation (http://doc.trolltech.com/latest/i18n-source-translation.html) approach.

nroberts

3rd January 2011, 20:59

Even so, you will probably be happier with your data stored in UTF-8, where you don't have to take the byte order into account. Also, whenever your string data 'leaves' the QString, you will have to make sure you are using the correct locale/encoding.

That's what it seemed to me. The only issue I think we're liable to run into is with Windows, which (at least in older versions, not so sure now) requires wide characters in order to access Unicode, so things like opening files are going to have to be handled differently.

Can't wait until the standard dictates Unicode strings!

As to the two step process, someone on the other site suggested switching the codec for CStrings to UTF-8. Is this sufficient or would I run into problems? I guess the two part process isn't any big deal since I already have a templated "string_cast" function that I can wrap it up in but that was an idea for removing it and just using to/fromStdString.

The fact of the matter is that QString is one of the few properly usable Unicode-compatible string implementations available. Especially when using a full Qt implementation, without any special stuff, Qt users will practically never run into this type of issue, because Qt solves the biggest portion of it for you (serializing to binary data and writing XML documents are both included). I am pretty certain that this is the reason this type of issue doesn't get a lot of replies.

Yeah, thing is that we've already got the serialization and all that figured out with standard components and boost. We're pretty concerned about not getting tied down to any given UI framework. Too many times that's come up and bitten us in the ass.

By the way, if your goal is to really be i18n-ready and you are planning on continuing the use of Qt, you should really look into using Qt's i18n and translation (http://doc.trolltech.com/latest/i18n-source-translation.html) approach.

Yeah, we'll be using that in the UI specific layers. Still not sure how to deal with the lower layers though.

As to continuing to use Qt...I can't tell and try not to let that make a difference. Qt is a step up from just about everything else we've tried but there's also a lot of things I don't like about it. I'd prefer compiler errors when I connect signals and slots incorrectly for instance...or be able to connect lambda expressions to signals. Various bits of our stuff also get used in web programs and such too so I certainly can't get locked to the UI. I'm actually rather hoping that someday someone will make the C++ ui library I want. It would probably resemble GTKmm more than Qt in many ways and defer a lot of things to more standard components. To tell the truth the only reason we picked Qt over GTKmm was the fact that GTKmm doesn't obey ANY accessibility protocol in windows and thus can't be driven by test scripts like Qt can. If that were to be fixed we just might switch.

franz

3rd January 2011, 21:39

That's what it seemed to me. The only issue I think we're liable to run into is with Windows, which (at least in older versions, not so sure now) requires wide characters in order to access Unicode, so things like opening files are going to have to be handled differently.

Well, here you should just use the W-postfixed function names, I guess. Windows uses an almost correct UTF-16 codec. The older versions use UCS-2 (yes, it's different). Just so you know. Good luck in finding the right solution for your needs in any case.
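As a sketch of why UTF-16 and UCS-2 differ, here is a minimal UTF-16 encoder in plain standard C++ (C++11). The names encode_utf16 and fits_in_ucs2 are invented for this illustration. Code points above U+FFFF need two 16-bit units (a surrogate pair) in UTF-16, while UCS-2 is limited to a single 16-bit unit per character and so cannot represent them.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
std::u16string encode_utf16(char32_t cp) {
    // Valid code points only: up to U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, same as UCS-2.
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}

// Hypothetical helper: a code point is representable in UCS-2 only if it
// fits in a single 16-bit unit.
bool fits_in_ucs2(char32_t cp) {
    return encode_utf16(cp).size() == 1;
}
```

For example, U+65E5 fits in one unit, while U+20BB7 (a CJK ideograph outside the Basic Multilingual Plane) becomes the pair 0xD842 0xDFB7.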

Cheers.

ChrisW67

3rd January 2011, 22:00

A look at to/fromStdString indicates that I'd have to use setCodecForCStrings. The documentation for that function though indicates that I wouldn't want to do this for things like Japanese. This is the set that I'd LIKE to use though. Does someone know enough to explain how I'd set this up if it's possible?

The documentation for that function says that you need to be careful using encodings that do not preserve the ASCII range, and uses Shift-JIS encoding as an example. Since the encoding that you wish to use is UTF-8, which does preserve the ASCII range, you should have no issues in this regard. Japanese Unicode code points can happily be encoded in UTF-8.
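A minimal sketch of what "preserves the ASCII range" means for UTF-8, in plain standard C++. The name ascii_bytes_only is made up for this illustration. In UTF-8 every byte of a multi-byte character is 0x80 or above, so any byte below 0x80 is always the ASCII character it looks like.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: keep only the bytes in the ASCII range (0x00 to 0x7F).
// For UTF-8 input this yields exactly the ASCII characters of the text,
// because multi-byte sequences never contain bytes below 0x80.
std::string ascii_bytes_only(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b < 0x80)
            out += utf8[i];
    }
    return out;
}

void demo_ascii_preserved() {
    // "path\" + U+65E5 U+672C (two Japanese characters, written as UTF-8 bytes) + ".txt"
    const std::string name = "path\\\xE6\x97\xA5\xE6\x9C\xAC.txt";
    // The backslash and the other ASCII characters survive untouched.
    assert(ascii_bytes_only(name) == "path\\.txt");
}
```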