unicode
JeanC
10th May 2015, 20:10
Hello,
I want to load a file with unicode:
The file:
Никифорова Сэндэма Сохондо, Читинская обл.,
сон Москва, Московская обл.,
I made this function:
void loadfromfile(QStringList *list, QString file, bool unicode)
{
    QFile File(file);
    if (File.open(QFile::ReadOnly))
    {
        QTextStream in(&File);
        if (unicode)
            in.setCodec("UTF-16");
        QString line;
        do
        {
            line = in.readLine();
            if (!line.isEmpty())
                list->append(line);
        }
        while (!line.isNull());
        File.close();
    }
}
Yet when I load the file I am still not seeing the right characters:
loadfromfile(&list, filename, true);
QString s;
foreach (s, list)
{
    QMessageBox msg;
    msg.setText(s);
    msg.exec();
}
Thanks for help.
d_stranz
10th May 2015, 21:12
Run it in the debugger, set a breakpoint on line 12 of loadfromfile(), and inspect what "line" contains when the readLine() call returns. Is it what you expect?
anda_skoa
10th May 2015, 22:45
And make sure the file is indeed UTF-16 and not some other Unicode encoding, e.g. UTF-8 or UTF-32, or not Unicode at all.
Cheers,
_
JeanC
11th May 2015, 11:49
Thank you both.
I suspect it does indeed have to do with the encoding. The problem is that I don't know what the encoding is: the file was generated by some software as export data, and I would like to import it.
If you'd like to try, I attached "test.txt"; it has Russian characters in it.
PS: simple code to demonstrate:
QFile file("test.txt");
file.open(QIODevice::ReadOnly | QIODevice::Text);
QTextStream in(&file);
QString s = in.readLine();
QMessageBox msg;
msg.setText(s);
msg.exec();
file.close();
ChrisW67
11th May 2015, 13:34
First part of file in hex:
0000000: cd e8 ea e8 f4 ee f0 ee e2 e0 20 d1 fd ed e4 fd  .......... .....
0000010: ec e0 20 20 20 20 20 4a 41 4e 20 30 39 2c 20 31  ..     JAN 09, 1
0000020: 39 35 36 30 36 3a 30 32 3a 34 38 20 41 4d 20 20  95606:02:48 AM
0000030: 20 20 2d 30 39 3a 30 30 31 31 32 45 33 32 27 30    -09:00112E32'0
0000040: 30 22 35 31 4e 34 39 27 30 30 22 d1 ee f5 ee ed  0"51N49'00".....
0000050: e4 ee 2c 20 d7 e8 f2 e8 ed f1 ea e0 ff 20 ee e1  .., ......... ..
0000060: eb 2e 2c 20 0d 0a f1 ee ed 20 20 20 20 20 20 20  .., .....
0000070: 20 20 20 20 20 20 20 20 20 20 20 20 20 46 45 42               FEB
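There is no 16-bit byte-order mark (either 0xfe 0xff or 0xff 0xfe) at the start, which suggests an 8-bit encoding rather than UTF-16.
The space characters (hex 0x20) and digits (0x30 - 0x39) appear as single bytes, so this is either an eight-bit encoding or UTF-8, not UTF-16, which would have a zero byte paired with each of these bytes.
The line endings, 0d 0a, point to Windows.
The first few bytes are not valid UTF-8: 0xcd would start a two-byte sequence, but the next byte, 0xe8, is not a continuation byte (0x80 - 0xbf).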
Looks like Windows CP1251 (http://en.wikipedia.org/wiki/Windows-1251) encoded Cyrillic to me.
First three bytes == first three characters: Ник
The same file, re-encoded as UTF-8, is attached.
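The checks described in this post (byte-order mark, zero bytes, UTF-8 validity) can be sketched in plain standard C++. This is only a rough illustration and not part of the original thread: the function names hasUtf16Bom, hasNulByte and looksLikeUtf8 are made up for the example, and the UTF-8 test is purely structural (it does not reject overlong forms or surrogates).

```cpp
#include <cstddef>
#include <string>

// True if the data starts with a UTF-16 byte-order mark (FE FF or FF FE).
bool hasUtf16Bom(const std::string& bytes)
{
    if (bytes.size() < 2)
        return false;
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
}

// True if any byte is zero. UTF-16 text containing ASCII characters
// (spaces, digits) would have a zero byte next to each of them.
bool hasNulByte(const std::string& bytes)
{
    return bytes.find('\0') != std::string::npos;
}

// Structural UTF-8 check: every lead byte must be followed by the
// right number of continuation bytes (0x80 - 0xBF).
bool looksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;
        if (c < 0x80)
            extra = 0;                      // plain ASCII
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;                      // two-byte sequence
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;                      // three-byte sequence
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;                      // four-byte sequence
        else
            return false;                   // stray continuation or invalid lead byte
        if (i + extra >= bytes.size())
            return false;                   // sequence is cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;               // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

Fed the first 16 bytes of the dump above, all three functions return false, which matches the reasoning: no BOM, no zero bytes, and not valid UTF-8, leaving an 8-bit code page as the most plausible candidate.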
JeanC
11th May 2015, 14:48
Thanks Chris, (the thanks button is not working on the forum)
With that file and setCodec("UTF-8") I get the correct text.
The problem is that I have to deal with the original file as it is. Is there any way I can read it at all?
If not, shrug.. Unicode can be such a mess sometimes..
Radek
11th May 2015, 15:03
It is most likely cp1251 (as Chris writes). I was able to open the file in Kate and got sensible text. Checking the encoding Kate had detected, I got cp1251.
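If the file really is cp1251, then with Qt 4 or Qt 5 passing "Windows-1251" to QTextStream::setCodec() in place of "UTF-16" should be enough to read the original file, since QTextCodec lists the Windows-1250 to 1258 code pages among its supported encodings. To show what that conversion does, here is a minimal sketch in plain standard C++ that decodes cp1251 bytes to UTF-8. It is an illustration under stated assumptions, not the Qt implementation: the names appendUtf8 and cp1251ToUtf8 are made up, and only ASCII, the Russian letters 0xC0 - 0xFF and Ё/ё are mapped, while every other byte becomes U+FFFD.

```cpp
#include <string>

// Appends one code point (assumed to be below U+10000) to out as UTF-8.
void appendUtf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Partial Windows-1251 decoder (hypothetical helper, not a full table):
//   0x00 - 0x7F  ASCII, copied unchanged
//   0xC0 - 0xFF  Russian letters, U+0410 - U+044F
//   0xA8, 0xB8   Ё (U+0401) and ё (U+0451)
//   anything else becomes U+FFFD, the replacement character
std::string cp1251ToUtf8(const std::string& bytes)
{
    std::string out;
    for (char ch : bytes) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80)
            appendUtf8(out, b);
        else if (b >= 0xC0)
            appendUtf8(out, 0x0410 + (b - 0xC0));
        else if (b == 0xA8)
            appendUtf8(out, 0x0401);
        else if (b == 0xB8)
            appendUtf8(out, 0x0451);
        else
            appendUtf8(out, 0xFFFD);
    }
    return out;
}
```

Applied to the first three bytes of test.txt (cd e8 ea), cp1251ToUtf8 yields the UTF-8 bytes d0 9d d0 b8 d0 ba, i.e. "Ник". For real use a complete mapping table or the Qt codec is the better choice, because the 0x80 - 0xBF range of cp1251 holds further letters and punctuation that this sketch does not cover.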