Unicode + plain C++



ct
17th March 2007, 12:35
When I use Qt everything works fine, but when I try to do something in plain C++, I find Unicode quite a painful experience.

How do I read Unicode text from a text file? I have read many references that say to use (wchar_t *), but it would be helpful if I could get a real example.

wchar_t *unicode = L'unicode';

Is this valid? And what if my Unicode character needs a different font? Do I also need an IDE that supports Unicode for this purpose?

Or maybe someone could help me get a Unicode string from its code value. In Qt it is all too simple. If I want the Unicode character 0x0915, all I have to do is
QString s = QChar(2325); //0x0915 = 2325d

But how do we perform such a task in plain C++??

One last thing, I have seen that many browsers represent Unicode in the form
&#2325; (which renders as क). Could someone throw some light on that one too?

wysota
17th March 2007, 13:15
When I use Qt everything works fine, but when I try to do something in plain C++, I find Unicode quite a painful experience.
Plain C++ doesn't support Unicode.


And what if my Unicode character needs a different font? Do I also need an IDE that supports Unicode for this purpose?
Could you explain what you mean by "a different font"?


But how do we perform such a task in plain C++??
You can't. You have to use a compiler that has Unicode support and then you can use the wide character type (wchar_t). I think the STL has a string class supporting wchar_t (std::wstring), you might use it.


One last thing, I have seen that many browsers represent Unicode in the form
&#2325; (which renders as क). Could someone throw some light on that one too?

This is called an "entity" and comes from SGML. &#xAAAA; means "a character with a hexadecimal value of 'AAAA'".
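To make the notation concrete, here is a minimal sketch that builds such a reference from a code point. The helper name toCharRef is made up for illustration; the "&#...;" (decimal) and "&#x...;" (hexadecimal) forms are the two spellings browsers accept.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of any library): format a code point as a
// numeric character reference, either decimal ("&#2325;") or hex ("&#x915;").
std::string toCharRef(unsigned long codePoint, bool hex)
{
    char buf[32]; // large enough for "&#x" + 16 hex digits + ";" + terminator
    if (hex)
        std::snprintf(buf, sizeof buf, "&#x%lX;", codePoint);
    else
        std::snprintf(buf, sizeof buf, "&#%lu;", codePoint);
    return std::string(buf);
}
```

With this, toCharRef(0x0915, false) and toCharRef(0x0915, true) give the decimal and hexadecimal spellings of the same character.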

ct
18th March 2007, 07:18
Could you explain what you mean by "a different font"?


I mean characters like क <-- this one.



You can't. You have to use a compiler that has Unicode support and then you can use the wide character type (wchar_t). I think the STL has a string class supporting wchar_t (std::wstring), you might use it.

I was wondering how Qt handles Unicode, since it is based on C++. The point I was really trying to ask about is how to use wchar_t to read special Unicode characters like क from a file, or how to store those characters in a data structure by their Unicode code point, just as we can store a character by its ASCII code. (I know Unicode characters can take multiple bytes, but is there still a way to represent a Unicode character by its hex value in C++?)

camel
18th March 2007, 09:28
I mean characters like क <-- this one.


What I see there is the "Replacement Glyph" (http://unicode.org/glossary/#replacement_glyph). It is used if the real glyph could not be rendered to the screen, for example because the currently used font does not have a glyph for the Unicode character.

(If that was not the one you wanted to show, welcome to the problems with unicode ;-)



I was wondering how Qt handles Unicode, since it is based on C++.


Qt has its very own string class that is basically a vector of QChar, which is a wrapper around a 16-bit wide integer. Thus Qt can handle UTF-16 internally (all values that do not fit into 16 bits are handled via surrogate pairs (http://unicode.org/glossary/#surrogate_pair)).

There is nothing inherently wrong with C++ that prevents it from working with Unicode, it is just not well supported without help :-)
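For reference, the surrogate pair arithmetic itself is small. The following is a minimal sketch in plain C++, not Qt's actual code; the function name toUtf16 is made up, and the input is assumed to be a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <stdint.h>
#include <vector>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid Unicode scalar value; no validation is done.
std::vector<uint16_t> toUtf16(uint32_t cp)
{
    std::vector<uint16_t> units;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        units.push_back(static_cast<uint16_t>(cp));
    } else {
        // Split the remaining 20 bits into a high and a low surrogate.
        cp -= 0x10000;
        units.push_back(static_cast<uint16_t>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}
```

A character such as U+0915 comes back as one unit, while something outside the 16-bit range (U+1D11E, for example) comes back as two.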


The point I was really trying to ask about is how to use wchar_t to read special Unicode characters like क from a file, or how to store those characters in a data structure by their Unicode code point, just as we can store a character by its ASCII code.

Do you really need to do this without Qt? Files come in many different encodings (Latin-1, ASCII, UTF-8, ...), and to be able to read all of them you would have to write (or find somewhere) a complete reimplementation of QTextCodec and friends.

Would it be possible to read all data using Qt, and then transform it via QString::toStdWString() to a std::wstring, which you then could handle in the pure c++ code?

(I have never worked with std::wstring, and I have read that there are rather large implementation differences in this class and that it might not be available everywhere... but that might be a thing of the past, I just do not know :-/ )
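If the file is known to be UTF-8 and nothing else, the decoding step on its own is manageable without Qt. Here is a rough sketch, assuming the bytes have already been read into a std::string (for example with std::ifstream in binary mode). The name decodeUtf8 is made up, and the sketch does not reject overlong forms or encoded surrogates, which a real decoder should.

```cpp
#include <cstddef>
#include <stdint.h>
#include <string>
#include <vector>

// Hypothetical helper: turn UTF-8 bytes into a list of code points.
// Malformed input is replaced by U+FFFD; overlong forms are NOT detected.
std::vector<uint32_t> decodeUtf8(const std::string &bytes)
{
    std::vector<uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        uint32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; } // invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; } // truncated input
            const unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; } // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : static_cast<uint32_t>(0xFFFD));
    }
    return out;
}
```

The resulting vector holds one 32-bit value per character, so it can be indexed and compared by code point much like an array of ASCII codes.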

I found this nice FAQ: UTF-8 and Unicode FAQ for Unix/Linux (http://www.cl.cam.ac.uk/~mgk25/unicode.html#c)

And to quote the Unicode Standard (http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf), Chapter 5.2:

With the wchar_t wide character type, ANSI/ISO C provides for inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension. The Unicode characters in the ASCII range U+0020 to U+007E satisfy these conditions. Thus, if an implementation uses ASCII to code the portable C execution set, the use of the Unicode character set for the wchar_t type, in either UTF-16 or UTF-32 form, fulfills the requirement.


If I read that correctly

const wchar_t test[] = L"A small test!";
should be a valid unicode string. But this is only (always) valid for a subset of the possible unicode characters. :-/
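For characters outside that subset, C++ has universal character names (\uXXXX), which let you write a character by its hex value without typing the glyph in the editor. A small sketch follows. The variable names are made up, and it assumes the compiler uses Unicode (UTF-16 or UTF-32) for wchar_t, which is common but not guaranteed by the standard.

```cpp
// U+0915 written with a universal character name instead of the glyph itself.
// Assumption: the compiler's wide execution character set is Unicode.
const wchar_t ka = L'\u0915';
const wchar_t kaString[] = L"\u0915";
```

Because the source file then contains only ASCII, this also sidesteps the question of whether the IDE or editor supports Unicode.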



I know Unicode characters can take multiple bytes, but is there still a way to represent a Unicode character by its hex value in C++?

You can always use:

const uint32_t testChar = 0xFFFD;
const uint32_t testString[] = {0xFFFD, 0x0};
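Such 32-bit values cannot be written to a text file directly, so they usually get encoded first. Here is a minimal sketch of a code point to UTF-8 conversion; the function name encodeUtf8 is made up, and the input is assumed to be a valid code point (no range or surrogate checks).

```cpp
#include <stdint.h>
#include <string>

// Hypothetical helper: encode one code point as a UTF-8 byte sequence.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate; nothing is validated.
std::string encodeUtf8(uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The returned std::string can be written with ordinary byte-oriented output (std::ofstream, fwrite), with no dependence on the width of wchar_t.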

camel
18th March 2007, 09:53
I should read further before quoting...the next paragraph in the Unicode standard is quite interesting too:

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers. However, programmers who want a UTF-16 implementation can use a macro or typedef (for example, UNICHAR) that can be compiled as unsigned short or wchar_t depending on the target compiler and platform. Other programmers who want a UTF-32 implementation can use a macro or typedef that might be compiled as unsigned int or wchar_t, depending on the target compiler and platform. This choice enables correct compilation on different platforms and compilers. Where a 16-bit implementation of wchar_t is guaranteed, such macros or typedefs may be predefined (for example, TCHAR on the Win32 API).

ct
19th March 2007, 08:46
I can see that whether the replacement glyph shows up can depend on the OS. If you are using Windows XP you will probably see the character, but on some *nix distros you will have to install the proper unicode.

Actually I should rename this thread to Unicode + STL C++ (or standard C++; "plain" just isn't informative ;) ). Anyway, I was missing the whole set of wide-character types and related functions provided by the standard library, notably wstring, wint_t and operations on wide characters and strings like fputwc(wchar_t, FILE *).

I found a good article which could be pretty much helpful.

http://www.codeproject.com/file/ConfigString.asp

And here is a piece of code that outputs a Unicode character from its equivalent character code, to a file of course.


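// needs <stdio.h>, <wchar.h> and <locale.h>
setlocale(LC_ALL, "");            // pick up the user's locale so the wide character can be converted (a UTF-8 locale is assumed)
wint_t code = 0x0915;             // U+0915, DEVANAGARI LETTER KA
wchar_t ch = (wchar_t)code;
FILE *fp = fopen("out.txt", "w");
if (fp != NULL) {
    fputwc(ch, fp);               // argument order: character first, then the stream
    fclose(fp);
}
// done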

wysota
19th March 2007, 10:33
What is "a proper unicode"? Unicode is Unicode, there are no variants of Unicode, that's why it has "Uni" in its name. The only thing that decides if the glyph is displayed or not is the font - it might or might not have a particular glyph in its set, the OS shouldn't have anything to do with this.

ct
20th March 2007, 08:34
What is "a proper unicode"? Unicode is Unicode, there are no variants of Unicode, that's why it has "Uni" in its name. The only thing that decides if the glyph is displayed or not is the font - it might or might not have a particular glyph in its set, the OS shouldn't have anything to do with this.

What I meant by "proper unicode" was the proper fonts and settings required for the glyphs of various scripts. Of course the core of the OS has nothing to do with it; I was talking about the general configuration and fonts that the OS ships with. For example, it is easy to display the script of my language (which is Devanagari) in Windows XP, whereas some fonts have to be installed on some Linux versions.

I am confused about the exact settings in various OSes (maybe someone could clarify this). I think we may have to tweak the rendering engine to properly display some glyphs.