utf8 filenames / QDir::entryList



sedi
22nd August 2013, 19:06
Hi,
my program has to work on Windows and Android.
When I display the files in a folder, I get encoding problems, apparently because Android uses UTF-8 for its file system. This is what I do:



ui->label->setText("Hallo Ä Ö Ü ä ö ü ß .,;*+");
QStringList files = dir.entryList(QStringList() << "*.*");
for (int i = 0; i < files.count(); i++)
{
    QTreeWidgetItem* item = new QTreeWidgetItem(QStringList() << files.at(i));
    ui->treeWidget->addTopLevelItem(item);
}

The label->setText text is shown correctly, so this is not a display problem. The treeWidget looks like this:
[attachment 9456] (Android)
instead of this:
[attachment 9457] (Windows)
So the umlauts are borked.

I've tried setting the default locale to German and to English, but this didn't help:


QLocale germanLocale(QLocale::German,QLocale::Germany) ;
QLocale englishLocale(QLocale::English, QLocale::UnitedStates);
QLocale::setDefault(germanLocale);
// QLocale::setDefault(englishLocale);


Furthermore, I've tried getting the entryInfoList and then decoding the fileName. But since QFileInfo::fileName() already returns a QString, I feel like I'm chasing my own tail here:


QFileInfoList infos = dir.entryInfoList(QStringList() << "*.*");
for (int i = 0; i < infos.count(); i++)
{
    QFileInfo inf = infos.at(i);
    QString fileName = QFile::decodeName(inf.fileName().toUtf8());
    QTreeWidgetItem* item = new QTreeWidgetItem(QStringList() << fileName);
    ui->treeWidget->addTopLevelItem(item);
}

Needless to say, that didn't work either.

The docs at http://qt-project.org/doc/qt-4.8/porting4.html#qdir say that QDir::encodedEntryList() has been removed.
At http://qt-project.org/doc/qt-5.1/qtcore/qfile-compat.html#setEncodingFunction they also say "does nothing". I'm still on 4.7, but I will probably move to 5.1 sooner or later.


I'm at my (small) wit's end.

ChrisW67
23rd August 2013, 00:04
On your Android device, read a broken file name from the list and inspect it:


for (int i = 0; i < fileName.size(); ++i)
    qDebug() << i << fileName.at(i).unicode();

Do you get this:


0 66
1 114
2 252
3 100
4 101
5 114
6 72
7 246
8 114
9 116

for "BrüderHört" or something else?

sedi
23rd August 2013, 00:23
Actually it is


66 B
114 r
195 Ã
188 ¼
100 d
101 e
114 r
72 H
195 Ã
182 ¶
114 r
116 t
46 .
106 j
112 p
103 g

sedi
23rd August 2013, 02:39
...but with that inspiration I wrote a brute-force workaround which seems to work for me, though it is quite ugly.
I am very open to better ideas, especially concerning performance...


// Replaces each pair (195, x) that resulted from a UTF-8 byte pair being read as Latin1
// with the single character it was meant to be.
// The non-ASCII literals are converted via QString(const char*), so the result depends on
// the source file encoding and on the codec Qt uses for C strings.
QString MainWindow::fixUtf8BrokenString(QString text)
{
    int index = text.indexOf(QChar(195));
    while (index >= 0)
    {
        if (text.count() > ++index)
        {
            // unicode() gives the full code value; toAscii() returns a char, which may be signed
            int code = text[index].unicode();
            switch (code)
            {
            case 128: text.replace(QString(QChar(195))+QString(QChar(128)),"À");break;
            case 129: text.replace(QString(QChar(195))+QString(QChar(129)),"Á");break;
            case 130: text.replace(QString(QChar(195))+QString(QChar(130)),"Â");break;
            case 131: text.replace(QString(QChar(195))+QString(QChar(131)),"Ã");break;
            case 132: text.replace(QString(QChar(195))+QString(QChar(132)),"Ä");break;
            case 133: text.replace(QString(QChar(195))+QString(QChar(133)),"Å");break;
            case 135: text.replace(QString(QChar(195))+QString(QChar(135)),"Ç");break;
            case 136: text.replace(QString(QChar(195))+QString(QChar(136)),"È");break;
            case 137: text.replace(QString(QChar(195))+QString(QChar(137)),"É");break;
            case 138: text.replace(QString(QChar(195))+QString(QChar(138)),"Ê");break;
            case 139: text.replace(QString(QChar(195))+QString(QChar(139)),"Ë");break;
            case 140: text.replace(QString(QChar(195))+QString(QChar(140)),"Ì");break;
            case 141: text.replace(QString(QChar(195))+QString(QChar(141)),"Í");break;
            case 142: text.replace(QString(QChar(195))+QString(QChar(142)),"Î");break;
            case 143: text.replace(QString(QChar(195))+QString(QChar(143)),"Ï");break;
            case 144: text.replace(QString(QChar(195))+QString(QChar(144)),"Ð");break;
            case 145: text.replace(QString(QChar(195))+QString(QChar(145)),"Ñ");break;
            case 146: text.replace(QString(QChar(195))+QString(QChar(146)),"Ò");break;
            case 147: text.replace(QString(QChar(195))+QString(QChar(147)),"Ó");break;
            case 148: text.replace(QString(QChar(195))+QString(QChar(148)),"Ô");break;
            case 149: text.replace(QString(QChar(195))+QString(QChar(149)),"Õ");break;
            case 150: text.replace(QString(QChar(195))+QString(QChar(150)),"Ö");break;
            case 152: text.replace(QString(QChar(195))+QString(QChar(152)),"Ø");break;
            case 153: text.replace(QString(QChar(195))+QString(QChar(153)),"Ù");break;
            case 154: text.replace(QString(QChar(195))+QString(QChar(154)),"Ú");break;
            case 155: text.replace(QString(QChar(195))+QString(QChar(155)),"Û");break;
            case 156: text.replace(QString(QChar(195))+QString(QChar(156)),"Ü");break;
            case 157: text.replace(QString(QChar(195))+QString(QChar(157)),"Ý");break;
            case 158: text.replace(QString(QChar(195))+QString(QChar(158)),"Þ");break;
            case 159: text.replace(QString(QChar(195))+QString(QChar(159)),"ß");break;
            case 160: text.replace(QString(QChar(195))+QString(QChar(160)),"à");break;
            case 161: text.replace(QString(QChar(195))+QString(QChar(161)),"á");break;
            case 162: text.replace(QString(QChar(195))+QString(QChar(162)),"â");break;
            case 163: text.replace(QString(QChar(195))+QString(QChar(163)),"ã");break;
            case 164: text.replace(QString(QChar(195))+QString(QChar(164)),"ä");break;
            case 165: text.replace(QString(QChar(195))+QString(QChar(165)),"å");break;
            case 166: text.replace(QString(QChar(195))+QString(QChar(166)),"æ");break;
            case 167: text.replace(QString(QChar(195))+QString(QChar(167)),"ç");break;
            case 168: text.replace(QString(QChar(195))+QString(QChar(168)),"è");break;
            case 169: text.replace(QString(QChar(195))+QString(QChar(169)),"é");break;
            case 170: text.replace(QString(QChar(195))+QString(QChar(170)),"ê");break;
            case 171: text.replace(QString(QChar(195))+QString(QChar(171)),"ë");break;
            case 172: text.replace(QString(QChar(195))+QString(QChar(172)),"ì");break;
            case 173: text.replace(QString(QChar(195))+QString(QChar(173)),"í");break;
            case 174: text.replace(QString(QChar(195))+QString(QChar(174)),"î");break;
            case 175: text.replace(QString(QChar(195))+QString(QChar(175)),"ï");break;
            case 177: text.replace(QString(QChar(195))+QString(QChar(177)),"ñ");break;
            case 178: text.replace(QString(QChar(195))+QString(QChar(178)),"ò");break;
            case 179: text.replace(QString(QChar(195))+QString(QChar(179)),"ó");break;
            case 180: text.replace(QString(QChar(195))+QString(QChar(180)),"ô");break;
            case 181: text.replace(QString(QChar(195))+QString(QChar(181)),"õ");break;
            case 182: text.replace(QString(QChar(195))+QString(QChar(182)),"ö");break;
            case 184: text.replace(QString(QChar(195))+QString(QChar(184)),"ø");break;
            case 185: text.replace(QString(QChar(195))+QString(QChar(185)),"ù");break;
            case 186: text.replace(QString(QChar(195))+QString(QChar(186)),"ú");break;
            case 187: text.replace(QString(QChar(195))+QString(QChar(187)),"û");break;
            case 188: text.replace(QString(QChar(195))+QString(QChar(188)),"ü");break;
            case 189: text.replace(QString(QChar(195))+QString(QChar(189)),"ý");break;
            case 191: text.replace(QString(QChar(195))+QString(QChar(191)),"ÿ");break;
            }
        }
        // Continue searching after the current position; restarting from 0 would loop
        // forever on a 195 that is followed by an unmapped character or ends the string.
        index = text.indexOf(QChar(195), index);
    }
    return text;
}

ChrisW67
23rd August 2013, 05:44
So the file name coming off the device is being treated as a Latin1 string, leading to a broken result.


195 = 0xC3
188 = 0xBC

These are the correct bytes for a UTF-8 encoded "ü" (U+00FC), but each byte is being converted to its own QChar.
Quite how to fix this I don't know.
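
To make the byte arithmetic concrete, here is a minimal sketch in plain C++ (no Qt) of how a two-byte UTF-8 sequence maps back to a single code point. The function name decodeTwoByteUtf8 is made up for illustration; it is not a Qt API.

```cpp
#include <cstdint>

// Hypothetical helper: decode one two-byte UTF-8 sequence (110xxxxx 10yyyyyy)
// into a code point. Returns 0 if the bytes are not a valid two-byte sequence.
std::uint32_t decodeTwoByteUtf8(unsigned char lead, unsigned char trail)
{
    const bool leadOk  = (lead  & 0xE0) == 0xC0;  // lead byte looks like 110xxxxx
    const bool trailOk = (trail & 0xC0) == 0x80;  // continuation looks like 10yyyyyy
    if (!leadOk || !trailOk)
        return 0;
    // Five payload bits from the lead byte, six from the continuation byte.
    const std::uint32_t cp = ((lead & 0x1Fu) << 6) | (trail & 0x3Fu);
    return cp >= 0x80 ? cp : 0;                   // reject overlong encodings
}
```

With the bytes from the listing above, decodeTwoByteUtf8(0xC3, 0xBC) yields 0xFC (252, "ü") and decodeTwoByteUtf8(0xC3, 0xB6) yields 0xF6 (246, "ö").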

sedi
23rd August 2013, 10:37
Are you sure about Latin1 and "U+00FC"? To me it looks like Ã+00BC, with Ã acting as a sort of escape character.

I've looked up the codes in this UTF-8 table (http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?names=-&utf8=dec). With that information I can just string-replace all umlauts and other important characters. For my case, it actually works as expected.

That said, tackling the problem this way is probably quite slow; it feels like reassembling the debris instead of preventing the accident.


Many thanks for the idea to actually look into the bytes myself - sometimes I don't see the wood for the trees. If anyone has a better idea (in terms of performance or safety of use), I'd be very happy to improve or entirely change my approach, but for the moment I can use that.
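
On the performance question, one possible alternative to a replace table is a single pass that treats each character as the byte it came from and decodes the result as UTF-8. The sketch below uses only the C++ standard library, with std::u16string standing in for QString. The name repairLatin1Misread is hypothetical, and 4-byte sequences and overlong encodings are deliberately not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt API): undo a "UTF-8 bytes read as Latin1" mix-up.
// If the input does not look like misread UTF-8, it is returned unchanged.
std::u16string repairLatin1Misread(const std::u16string &broken)
{
    std::u16string fixed;
    fixed.reserve(broken.size());
    std::size_t i = 0;
    while (i < broken.size()) {
        const char16_t b0 = broken[i];
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t cp = 0;         // code point being assembled
        if (b0 < 0x80) {
            cp = b0;                               // plain ASCII
        } else if (b0 <= 0xFF && (b0 & 0xE0) == 0xC0) {
            extra = 1; cp = b0 & 0x1F;             // 110xxxxx
        } else if (b0 <= 0xFF && (b0 & 0xF0) == 0xE0) {
            extra = 2; cp = b0 & 0x0F;             // 1110xxxx
        } else {
            return broken;                         // not misread UTF-8, give up
        }
        if (i + extra >= broken.size())
            return broken;                         // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const char16_t b = broken[i + k];
            if (b > 0xFF || (b & 0xC0) != 0x80)
                return broken;                     // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        fixed.push_back(static_cast<char16_t>(cp)); // at most 0xFFFF for 3 bytes
        i += extra + 1;
    }
    return fixed;
}
```

Each character is visited once, so the cost grows linearly with the length of the name. In Qt, the same round trip can presumably be expressed with the existing QString::toLatin1() and QString::fromUtf8() functions, though that is untested on Android here.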

ChrisW67
23rd August 2013, 11:21
U+00FC is a Unicode code point (http://www.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement) for the character 'ü'. When encoded in UTF-8 that single Unicode character becomes 2 bytes 0xC3 0xBC.

If you interpret those two bytes as Latin1 (http://www.wikipedia.org/wiki/Latin1) characters (which are always one byte-one char) you get, as you point out, Ã and ¼.

So, the file name is encoded in UTF8 on the device. It is read as a set of bytes that are then incorrectly treated as a Latin1 string.
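
The whole chain can be reproduced without a device. The following plain C++ sketch (both function names are invented for illustration, and code points from U+0800 upward are simply skipped) encodes a name as UTF-8 and then misreads the bytes as Latin1, one value per byte.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: encode code points below U+0800 as UTF-8.
// Larger code points are skipped in this sketch.
std::string encodeUtf8Small(const std::u16string &text)
{
    std::string bytes;
    for (char16_t cp : text) {
        if (cp < 0x80) {
            bytes.push_back(static_cast<char>(cp));                  // 0xxxxxxx
        } else if (cp < 0x800) {
            bytes.push_back(static_cast<char>(0xC0 | (cp >> 6)));    // 110xxxxx
            bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));  // 10yyyyyy
        }
    }
    return bytes;
}

// Hypothetical helper: misread a byte string as Latin1,
// so every byte becomes its own character value.
std::vector<unsigned> readAsLatin1(const std::string &bytes)
{
    std::vector<unsigned> units;
    for (unsigned char b : bytes)
        units.push_back(b);
    return units;
}
```

For "Brüder", readAsLatin1(encodeUtf8Small(...)) gives 66, 114, 195, 188, 100, 101, 114, which matches the start of the values reported from the Android device.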