unicode



JeanC
10th May 2015, 20:10
Hello,
I want to load a file containing Unicode text:

The file:


Никифорова Сэндэма Сохондо, Читинская обл.,
сон Москва, Московская обл.,


I made this function:


void loadfromfile(QStringList *list, QString file, bool unicode)
{
    QFile File(file);
    if (File.open(QFile::ReadOnly))
    {
        QTextStream in(&File);
        if (unicode)
            in.setCodec("UTF-16");
        QString line;
        do
        {
            line = in.readLine();
            if (!line.isEmpty())
                list->append(line);
        }
        while (!line.isNull());
        File.close();
    }
}


Yet when I load the file I am still not seeing the right characters:



loadfromfile(&list, filename, true);
QString s;
foreach (s, list)
{
    QMessageBox msg;
    msg.setText(s);
    msg.exec();
}


Thanks for help.

d_stranz
10th May 2015, 21:12
Run it in the debugger, set a breakpoint on line 12 of loadfromfile() and inspect what "line" contains when the readLine() call returns. Is it what you expect?

anda_skoa
10th May 2015, 22:45
And make sure the file is indeed UTF-16, not some other Unicode encoding, e.g. UTF-8 or UTF-32, or not a Unicode encoding at all.

Cheers,
_
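
One quick way to test the UTF-16 assumption is to look for a byte-order mark (BOM) in the first bytes of the file. The following is a minimal sketch in plain C++ without Qt; the helper name detectBom is hypothetical, and it assumes the first few bytes of the file have already been read into a std::string. A missing BOM does not by itself rule out UTF-16, since the mark is optional.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the encoding implied by a leading
// byte-order mark, or an empty string when no BOM is present.
std::string detectBom(const std::string &bytes)
{
    auto startsWith = [&bytes](const std::string &prefix) {
        return bytes.compare(0, prefix.size(), prefix) == 0;
    };
    // The UTF-32 marks are tested first because the UTF-32LE mark begins
    // with the same two bytes as the UTF-16LE mark.
    if (startsWith(std::string("\xff\xfe\x00\x00", 4))) return "UTF-32LE";
    if (startsWith(std::string("\x00\x00\xfe\xff", 4))) return "UTF-32BE";
    if (startsWith("\xef\xbb\xbf")) return "UTF-8";
    if (startsWith("\xfe\xff")) return "UTF-16BE";
    if (startsWith("\xff\xfe")) return "UTF-16LE";
    return "";
}
```

An empty result only means that the encoding has to be worked out from the content itself.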

JeanC
11th May 2015, 11:49
Thank you both.

I suspect it does indeed have to do with the encoding. The problem is that I don't know what the encoding is: the file was generated by some software as export data, and I would like to import it.

If you would like to try, I attached "test.txt" (attachment 11170); it contains Russian characters.

P.S. Simple code to demonstrate:


QFile file("test.txt");
file.open(QIODevice::ReadOnly | QIODevice::Text);
QTextStream in(&file);
QString s = in.readLine();
QMessageBox msg;
msg.setText(s);
msg.exec();
file.close();

ChrisW67
11th May 2015, 13:34
First part of the file in hex:


0000000: cd e8 ea e8 f4 ee f0 ee e2 e0 20 d1 fd ed e4 fd  .......... .....
0000010: ec e0 20 20 20 20 20 4a 41 4e 20 30 39 2c 20 31  ..     JAN 09, 1
0000020: 39 35 36 30 36 3a 30 32 3a 34 38 20 41 4d 20 20  95606:02:48 AM
0000030: 20 20 2d 30 39 3a 30 30 31 31 32 45 33 32 27 30    -09:00112E32'0
0000040: 30 22 35 31 4e 34 39 27 30 30 22 d1 ee f5 ee ed  0"51N49'00".....
0000050: e4 ee 2c 20 d7 e8 f2 e8 ed f1 ea e0 ff 20 ee e1  .., ......... ..
0000060: eb 2e 2c 20 0d 0a f1 ee ed 20 20 20 20 20 20 20  .., .....
0000070: 20 20 20 20 20 20 20 20 20 20 20 20 20 46 45 42               FEB

There is no 16-bit byte-order mark (either 0xfe 0xff or 0xff 0xfe) at the start, which suggests an 8-bit encoding.
The space characters (0x20) and digits (0x30 - 0x39) appear as single bytes, so this is either an eight-bit encoding or UTF-8, not UTF-16, which would have a zero byte associated with each of these bytes.
The line endings, 0d 0a, are Windows style.
The first few bytes are not valid UTF-8: 0xcd would start a two-byte sequence, but the byte that follows, 0xe8, is not a continuation byte (0x80 - 0xbf).
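
The zero-byte and UTF-8 checks listed above can be sketched in plain C++ as follows. The function names hasZeroBytes and looksLikeUtf8 are hypothetical, and the UTF-8 check is deliberately simplified: it only verifies that each lead byte is followed by the right number of continuation bytes, and does not reject overlong forms or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if any byte is zero. UTF-16 text containing ASCII characters
// (spaces, digits) has a zero byte next to each of them.
bool hasZeroBytes(const std::string &bytes)
{
    return bytes.find('\0') != std::string::npos;
}

// Structural UTF-8 check (simplified, see the note above the code).
bool looksLikeUtf8(const std::string &bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;
        if (lead < 0x80)                extra = 0; // ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1; // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) extra = 2; // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) extra = 3; // 11110xxx
        else return false; // stray continuation byte or invalid lead byte
        if (extra > bytes.size() - i - 1) return false; // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

Fed the first bytes of the dump (cd e8 ea e8), looksLikeUtf8 fails on the second byte, which is the same reasoning as in the list.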

Looks like Windows CP1251 (http://en.wikipedia.org/wiki/Windows-1251) encoded Cyrillic to me.
The first three bytes (cd e8 ea) decode to the first three characters: Ник

Same file encoded as UTF-8: attachment 11171

JeanC
11th May 2015, 14:48
Thanks Chris (the thanks button is not working on the forum).
With that file and setCodec("UTF-8") I get the correct text.
The problem is that I have to deal with the file as it is. Is there any way I can read it at all?
If not, shrug... Unicode can be such a mess sometimes.

Radek
11th May 2015, 15:03
It is most likely cp1251, as Chris writes. I was able to open the file in Kate and got sensible text. Checking the encoding of the text, I got cp1251.
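
If the file really is cp1251, it can be read as it is once the decoder is told the right encoding. With the Qt 4/5 API used in this thread, that would presumably mean passing "Windows-1251" to QTextStream::setCodec() instead of "UTF-16", assuming the Qt build includes that codec. To show what such a decoder does, here is a minimal sketch in plain C++; the helper names appendUtf8 and cp1251ToUtf8 are hypothetical, and only part of the code page is mapped (ASCII, the letters at 0xC0 - 0xFF, and the two letters at 0xA8 and 0xB8), so it is an illustration and not a complete converter.

```cpp
#include <cassert>
#include <string>

// Appends the UTF-8 encoding of a code point below U+0800
// (one byte for ASCII, two bytes otherwise).
void appendUtf8(std::string &out, unsigned int cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts Windows-1251 bytes to UTF-8.
// Bytes outside the mapped subset are replaced with '?'.
std::string cp1251ToUtf8(const std::string &bytes)
{
    std::string out;
    for (char ch : bytes) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80)       appendUtf8(out, b);                   // ASCII
        else if (b >= 0xC0) appendUtf8(out, 0x0410 + (b - 0xC0)); // U+0410..U+044F
        else if (b == 0xA8) appendUtf8(out, 0x0401);              // capital IO
        else if (b == 0xB8) appendUtf8(out, 0x0451);              // small IO
        else                out += '?';                           // not mapped here
    }
    return out;
}
```

Applied to the first three bytes of the dump, cd e8 ea, this yields the UTF-8 bytes d0 9d d0 b8 d0 ba, which is "Ник".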