Does QChar support Unicode code-points beyond 16-bit FFFF?



jamadagni
2nd August 2009, 02:47
The page http://doc.qtsoftware.com/4.5/qchar.html says that "The QChar class provides a 16-bit Unicode character." My understanding is that each QChar uses 16-bits and therefore it can hold unique values only from 0000 to FFFF. Therefore the question arises:

Does QChar support Unicode code-points beyond 16-bit FFFF?

http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes shows how beyond the code point FFFF exists the other planes such as Supplementary Multilingual Plane -- how can I represent characters from those coderanges using QChar?

miwarre
2nd August 2009, 11:48
Unicode points belonging to planes other than the first (commonly referenced as Basic Multilingual Plane or BMP), i.e. points exceeding the range 0x0000-0xFFFF, are represented in UTF-16 as pairs of 16-bit values known as surrogate pairs or simply surrogates. Surrogates use the reserved BMP area U+D800 - U+DFFF.

From a 32-bit (actually 21 bits are enough...) non-BMP character int ch32, you can get its surrogates wchar_t c0, c1 with:



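/* subtract 0x10000 first: the pair encodes a 20-bit offset, not the raw code point */
c0 = (wchar_t)(((ch32 - 0x10000) >> 10) + 0xD800);   /* high (leading) surrogate */
c1 = (wchar_t)(((ch32 - 0x10000) & 0x3FF) + 0xDC00); /* low (trailing) surrogate */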


The good news is that you can plainly store surrogates in QString and related classes and get them back after basic manipulations without problems.

The possibly less good news is that some kinds of processing are less plain:
* a surrogate pair takes the memory of two wchar_t but actually counts as a single character (I'm pretty sure QString::length() returns 2, not 1; it is up to you what to do with this);
* sorting probably does not work correctly with off-the-shelf methods (but I never tried, and sorting those kinds of characters raises a whole lot of other issues...);
* if you split a string you must be sure not to cut a pair in the middle; and so on.
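To make the first and last points above concrete, here is a minimal sketch in plain C++ over std::u16string (not Qt code, so it runs without Qt). The helper names codePointCount and safeSplitIndex are made up for illustration; isHighSurrogate and isLowSurrogate are free-function stand-ins and only test the U+D800 - U+DBFF and U+DC00 - U+DFFF ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if u is the first (high) half of a surrogate pair: U+D800..U+DBFF.
bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if u is the second (low) half of a surrogate pair: U+DC00..U+DFFF.
bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points rather than 16-bit units: a well-formed pair counts once.
// Unpaired surrogates are counted as one item each.
std::size_t codePointCount(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
            ++i;  // skip the low half of the pair
        ++n;
    }
    return n;
}

// Move a proposed split index back by one if it would cut a pair in half.
std::size_t safeSplitIndex(const std::u16string& s, std::size_t pos)
{
    assert(pos <= s.size());
    if (pos > 0 && pos < s.size()
        && isHighSurrogate(s[pos - 1]) && isLowSurrogate(s[pos]))
        return pos - 1;
    return pos;
}
```

For a string made of 'a', the pair D801 DC00 (U+10400, a Deseret letter) and 'b', size() reports 4 units while codePointCount reports 3 characters, and a split requested at index 2 is moved back to index 1 so the pair stays together.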

So, they are there and you can use them, but with some care...

May I ask why you care? Non-BMP planes contain very unusual characters, mostly archaic or 'non-standard' Chinese hanzi or not-so-common scripts like Kharoshthi, Osmanya, Deseret, Shavian, or Near-East ancient scripts... as very few people care about them, I'm curious who else does and why, just curious...

Ciao,

M.

jamadagni
2nd August 2009, 12:22
Unicode points belonging to planes other than the first (commonly referenced as Basic Multilingual Plane or BMP), i.e. points exceeding the range 0x0000-0xFFFF, are represented in UTF-16 as pairs of 16-bit values known as surrogate pairs

I am aware of this. I just wanted to confirm that a single QChar cannot be used to represent characters beyond the BMP.


May I ask why you care? Non-BMP planes contain very unusual characters, mostly archaic or 'non-standard' Chinese hanzi or not-so-common scripts like Kharoshthi, Osmanya, Deseret, Shavian, or Near-East ancient scripts... as very few people care about them, I'm curious who else does and why, just curious...

To be frank, we are working on a proposal for a script which has been allotted space in the SMP roadmap (http://unicode.org/roadmaps/smp/) but since this script is still in use our judgment was that putting it in the SMP would hinder easy implementation. From what you say, what I suspected is in fact true. IIRC most Unicode support libraries (such as Qt and ICU (http://www.icu-project.org/)) use 16 bits to ostensibly uniquely represent any Unicode character that might be needed. And going beyond FFFF means those libraries cannot cope anymore... I daresay Trolltech (oops, Qt Software) will not easily accept a request to upgrade QChar to 24 bits. (Or even 17 bits would be sufficient for the SMP?) Let me try...

jamadagni
2nd August 2009, 12:41
Well I have gone and filed a bug in the Qt Task Tracker asking for QChar to be made 17-bit -- let us see what the good guys at Qt do.

miwarre
3rd August 2009, 12:08
To be frank, we are working on a proposal for a script which has been allotted space in the SMP roadmap but since this script is still in use our judgment was that putting it in the SMP would hinder easy implementation. From what you say, what I suspected is in fact true. IIRC most Unicode support libraries (such as Qt and ICU) use 16 bits to ostensibly uniquely represent any Unicode character that might be needed. And going beyond FFFF means those libraries cannot cope anymore... I daresay Trolltech (oops, Qt Software) will not easily accept a request to upgrade QChar to 24 bits. (Or even 17 bits would be sufficient for the SMP?) Let me try...
Oh, I see.

Well, I suspect convincing Qt or any other vendor to change such a basic concept as character will not be easy...

However, if this is your perspective, the current Qt classes can give some support: for instance, QChar methods like isHighSurrogate(), isLowSurrogate(), highSurrogate(uint ucs4) and lowSurrogate(uint ucs4) can make the UTF-32 <-> UTF-16 trips more manageable. Also, the QString and QTextCodec methods that convert back and forth between UTF-8 and other encodings support the full Unicode plane space.
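As a rough illustration of what those UTF-32 <-> UTF-16 trips involve, here is a minimal sketch in plain C++ using std::u32string and std::u16string. It is not the Qt implementation and does not call the Qt methods named above; toUtf16 and toUtf32 are hypothetical names chosen for this example, and unpaired surrogates are simply copied through, which is an assumption rather than anything Qt is documented to do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-32 -> UTF-16: BMP code points map to one unit, others to a surrogate pair.
std::u16string toUtf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF);  // anything larger is outside the Unicode code space
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;  // 20-bit offset into planes 1..16
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}

// UTF-16 -> UTF-32: a high surrogate followed by a low one is recombined;
// any other unit (including an unpaired surrogate) is copied through as-is.
std::u32string toUtf32(const std::u16string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            const char32_t low = in[++i];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00));
        } else {
            out.push_back(u);
        }
    }
    return out;
}
```

Converting 'A' followed by U+10400 gives three 16-bit units (0041, D801, DC00), and converting those back returns the original two code points.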

If some standard encoding for your script exists or can be devised, it is also possible to extend QTextCodec to support it, easing semi-transparent conversion between a custom encoding and UTF-xxx.

So, the current situation is not so dark...

I'm afraid that until we all (or most of us...) agree that a 'real' character is 32 bits wide (odd sizes like 24 bits have no real chance) and includes properties like direction, joining, separability, etc., there will not be an easy world-wide solution.

Ciao,

M.

jamadagni
5th August 2009, 19:42
Thanks for your interest. I just got a mail from the Qt people saying that QChar supports surrogates, and that it is not convenient to use 24 bits (far less 17 bits). So it is unlikely that full 32-bit conversion will take place in Qt. I suspected as much, and have actually found space (it will be a squeeze but it's ok) in the BMP for my script, so wish me luck!

miwarre
7th August 2009, 17:00
I just got a mail from the Qt people saying that QChar supports surrogates, and that it is not convenient to use 24 bits (far less 17 bits). So it is unlikely that full 32-bit conversion will take place in Qt. I suspected as much,
It was to be expected, I dare say...

and have actually found space (it will be a squeeze but it's ok) in the BMP for my script, so wish me luck!
Oh! Convincing the Unicode Consortium to occupy some precious points in the BMP may be even more difficult!! (Unless you plan to use the Private Use Area, which has its own problems.)

You are probably fully aware of the following link: http://unicode.org/roadmaps/index.html. If not, you may want to check it out...

Ciao,

M.