QByteArray and UTF-8



Benecore
30th May 2012, 12:10
Hi guys,

I am trying to read data from a binary file, and everything works until the data contains a UTF-8 character.
For example, I have this code:

QFile *sub = new QFile("C:\\TestFile.ctl");
if (!sub->open(QIODevice::ReadOnly)){
    return 0;
}
QByteArray data = sub->read(200);
int start = data.indexOf(QByteArray::fromHex("0001"), data.indexOf(QByteArray::fromHex("0000"), 32))+8;
int stop = data.indexOf(QByteArray::fromHex("0000"), start);
QStringList test;
for (int i = start; i < stop; ++i)
{
    // each entry is the C string that starts at byte i and runs to the next zero byte
    test << reinterpret_cast<const char*>(&data.constData()[i]);
}
name = test.join(" ").replace(" ", "");
qDebug() << name;

qDebug() prints this:
http://image.devpda.net/di/E2R7/vystup.png
but the original string in the binary file is Čeština.

How can I get the original string?

Thank you for your help.

wysota
30th May 2012, 12:42
qDebug() << QString::fromUtf8(data.constData(), data.size());

By the way, what's the point of QByteArray::fromHex("0001") and this whole search you are doing?

Benecore
30th May 2012, 12:53
TestFile.ctl is not my file and it contains a lot of data. The indices computed on lines 6-7 of the code mark where the name is stored, and the name differs from file to file. They help me find the start and stop positions of the name.
I write the data to a QStringList named test, so this

qDebug() << QString::fromUtf8(data.constData(), data.size());
is not a solution for me :(

BTW: I tried name.toUtf8() and other encodings, but none of them shows the original name.

wysota
30th May 2012, 13:01
Why is it not a solution for you?

My question about fromHex is that I don't understand why you're calling fromHex at all. This whole search of yours looks weird. If you tell us what exactly you are trying to do, maybe we'll find a more straightforward solution.

Benecore
30th May 2012, 13:18
The file stores the name and the vendor name of an application (Symbian OS), and I want to extract this information. The following code

void GetData()
{
    QFile *sub = new QFile(QString("c:\\TestFile.ctl"));
    if (sub->exists()){
        if (!sub->open(QIODevice::ReadOnly)){
            return;
        }
        QByteArray data = sub->read(200);
        int findvendor = data.indexOf(QByteArray::fromHex("0000"), 32);
        QStringList vendorList;
        for (int i = 32; i<findvendor; i++){
            vendorList << reinterpret_cast<const char*>(&data.constData()[i]);
        }
        vendor = vendorList.join(" ").replace(" ", "");
        int beginName = data.indexOf(QByteArray::fromHex("0001"), findvendor)+8;
        int endName = data.indexOf(QByteArray::fromHex("0000"), beginName);
        QStringList nameList;
        for (int i = beginName; i<endName; i++){
            nameList << reinterpret_cast<const char*>(&data.constData()[i]);
        }
        qDebug() << nameList;
        name = nameList.join(" ").replace(" ", "").replace(QChar(0x00), "").replace(QChar(0x0c), "").replace(QChar(0x02), "");
    }else{
        name = tr("Unknown");
        vendor = tr("Unknown");
    }
    qDebug() << name << '\n' << vendor;
    delete sub;
}

works fine as long as the file does not contain UTF-8 characters.
Here are two files: one without UTF-8 characters and one with them.
(attachment 7764)

wysota
30th May 2012, 13:30
I was expecting something more like "I need to extract a text string that starts no earlier than 32 bytes from the beginning of the file, is prefixed by 0x0001 and suffixed by 0x0000". Presenting more weird code is definitely not going to help.

Benecore
30th May 2012, 13:32
Sorry, my English is bad. Yes, I want to get the string which starts in 32 bytes.

wysota
30th May 2012, 13:43
In 32 bytes? Why then are you searching for 0x0000?

Benecore
30th May 2012, 13:56
I have been programming in C++ for just a couple of months. This code is rewritten from a Python application.
The author gave me permission to use the code. I have little experience with binary files in C++.
This is the Python code:

finvend = s.cont.find(hexto('0000'), 32)
vendor = s.cont[32:finvend]
start = (s.cont.find(hexto('0001'), finvend) + 8)
name = s.cont[start:s.cont.find(hexto('0000'), start)]
I tried to rewrite this code, and I know it is not a good solution, but for now I don't understand how to read binary files in C++ :(

BTW: I tried, for example, reading just 4 or 8 bytes starting at byte 32 (file.seek(32) and then a read), but the result is not the same as with the previous code. What I miss in C++ is slicing like index[from:to].
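For readers porting the Python snippet, here is a minimal sketch of how `find` plus `[from:to]` slicing could look in plain C++, using only the standard library so it runs without Qt. The helper names `slice` and `sliceUntil` are hypothetical and not part of any library; in Qt the same pair of operations is spelled QByteArray::indexOf() and QByteArray::mid().

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: mirrors Python's cont[from:to] for a byte buffer.
std::string slice(const std::string &cont, std::size_t from, std::size_t to)
{
    if (from >= cont.size() || to <= from)
        return std::string();
    return cont.substr(from, to - from);   // substr clamps the count at the end
}

// Hypothetical helper: the bytes from `from` up to (not including) the next
// occurrence of `marker`, similar to cont[from:cont.find(marker, from)].
// Unlike the Python expression, a missing marker yields an empty result.
std::string sliceUntil(const std::string &cont, std::size_t from,
                       const std::string &marker)
{
    const std::size_t end = cont.find(marker, from);
    if (end == std::string::npos)
        return std::string();
    return slice(cont, from, end);
}
```

Note that the marker has to be built with an explicit length, for example `std::string("\x00\x00", 2)`, because it contains zero bytes.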

wysota
30th May 2012, 15:51
It's not a matter of knowing C++ or not. It's a matter of knowing what you are looking for. Do you know what you are looking for?


int endOfVendor = ba.indexOf("\x00\x00", 32);
QByteArray vendorDat = ba.mid(32, endOfVendor-32);
QString vendorStr = QString::fromUtf8(vendorDat.constData(), vendorDat.size());

Benecore
31st May 2012, 09:49
Thanks for another option, but the result is the same (a little better).
(attachment 7765)

BTW: This
ba.indexOf("\x00\x00", 32);
and this
ba.indexOf(QByteArray::fromHex("0000"), 32);
are not the same, because the return value (start byte position) is different. I also tried two QChar(0x00), but that doesn't matter; my problem is only the result for strings with UTF-8 characters.
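One likely reason for the different return values: a string literal that starts with a zero byte has length 0 when it is measured as a C string, so a search given only a `const char *` sees an empty pattern. The sketch below shows this in plain C++ without Qt. That QByteArray::indexOf(const char *, int) measures its argument the same way is an assumption here, and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by an API that receives only a const char* and scans for
// the terminating NUL: the first byte is already NUL, so the result is 0.
std::size_t lengthAsCString(const char *s)
{
    return std::strlen(s);
}

// Build the pattern with an explicit length so the embedded NULs survive.
std::string twoNulBytes()
{
    return std::string("\x00\x00", 2);
}

// Search with a sized pattern, starting at offset `from`.
std::size_t findFrom(const std::string &data, const std::string &pattern,
                     std::size_t from)
{
    return data.find(pattern, from);
}
```

With an empty pattern the search matches immediately at the start offset, which would explain a result equal to the offset that was passed in.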

wysota
31st May 2012, 09:58
The result is wrong probably because printing to the console directly goes through latin1 decoding instead of utf-8. Try printing the result to a file and then open the file in some editor capable of using utf-8.
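To illustrate what a Latin-1 reading of UTF-8 bytes would look like, here is a small sketch in plain C++. Latin-1 maps every byte to the code point with the same value, so each multi-byte UTF-8 sequence turns into several unrelated characters. The function name `decodeLatin1` is hypothetical.

```cpp
#include <string>
#include <vector>

// Decode bytes as Latin-1: each byte becomes the code point of equal value.
std::vector<unsigned int> decodeLatin1(const std::string &bytes)
{
    std::vector<unsigned int> codePoints;
    for (unsigned char b : bytes)
        codePoints.push_back(b);
    return codePoints;
}
```

Fed the UTF-8 bytes of Čeština, this yields 9 code points instead of 7; the leading Č (bytes C4 8C) becomes U+00C4 followed by the control code U+008C.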

Benecore
31st May 2012, 10:19
Hm, something is wrong, but what? Jesus...
This is the content of the text file:
(attachment 7766)

I write the string like this:

QFile zapis("C:\\test.txt");
zapis.open(QIODevice::WriteOnly | QIODevice::Text);
QTextStream streamFileOut(&zapis);
//streamFileOut.setCodec("UTF-8");
streamFileOut << vendorStr;
streamFileOut.flush();
zapis.close();

wysota
31st May 2012, 10:33
Probably the editor does not know the string is in utf-8 (if it is in utf-8).

Benecore
31st May 2012, 10:46
The string is UTF-8. I tried writing this to a file:

QFile zapis("C:\\test.txt");
zapis.open(QIODevice::WriteOnly | QIODevice::Text);
QTextStream streamFileOut(&zapis);
streamFileOut << QString::fromUtf8("Čeština");
streamFileOut.flush();
zapis.close();
and the result is not the same. :confused:

Is something wrong in this code?

QFile file("C:/TestFile.ctl");
if (!file.open(QIODevice::ReadOnly)){
    return 0;
}
QByteArray data = file.read(200);
QByteArray nameData = data.mid(59, 14);
QString nameStr = QString::fromUtf8(nameData.constData(), nameData.size());
file.close();
QFile zapis("C:\\test.txt");
zapis.open(QIODevice::WriteOnly | QIODevice::Text);
QTextStream streamFileOut(&zapis);
//streamFileOut.setCodec("UTF-8");
streamFileOut << nameStr;
streamFileOut.flush();
zapis.close();

I don't understand why the string does not come out correctly :(

wysota
31st May 2012, 12:47
I will repeat my earlier question -- do you know what you are looking for or are you just guessing what you are doing? Are you sure the string is stored in the file the way you are trying to read it? Reading 14 characters from a utf-8 string will definitely not give you "Čeština". If this string was utf-8 encoded, it would be stored using 9 bytes (5 bytes for ascii characters + 4 bytes for two non-ascii characters).


#include <QtCore>

int main(int argc, char **argv) {
    QString str = QString::fromUtf8("Čeština");
    QByteArray ba = str.toUtf8();
    qDebug() << "size:" << ba.size();
    qDebug() << "hex:" << ba.toHex();
}

Output:

size: 9
hex: "c48c65c5a174696e61"
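The 9-byte figure can also be checked by hand. Below is a minimal sketch of the UTF-8 encoding rules in plain C++ with no Qt; it handles only code points up to U+FFFF, which is enough for this word, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one code point. Sketch only: values above
// U+FFFF and surrogates are not handled.
void appendUtf8(std::string &out, unsigned int cp)
{
    if (cp < 0x80) {                       // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

std::string encodeUtf8(const unsigned int *cps, std::size_t n)
{
    std::string out;
    for (std::size_t i = 0; i < n; ++i)
        appendUtf8(out, cps[i]);
    return out;
}

// Lower-case hex dump of a byte string.
std::string toHex(const std::string &bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    for (unsigned char b : bytes) {
        hex += digits[b >> 4];
        hex += digits[b & 0x0F];
    }
    return hex;
}
```

Encoding the code points U+010C, U+0065, U+0161, U+0074, U+0069, U+006E, U+0061 this way gives 9 bytes, with Č and š taking two bytes each.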

Benecore
31st May 2012, 13:54
Yes, I know what I'm looking for. I'm looking for a solution to my problem, and my problem is getting the correct result for a UTF-8 string, that's all. I know that Čeština takes 9 bytes. But inside the binary file there is one empty char (00, whether that counts as a char or not) between each pair of chars, so Čeština takes not 9 bytes but 14 bytes.
Image of the binary file in a hex editor:
The chars Č and š are probably somewhere else.
(attachment 7768)
I have a lot of files in this format, and these are the results if I use a loop:
(attachment 7769)
As you can see, all names are retrieved from the binary files by the algorithm and displayed correctly. Only names that contain UTF-8 characters are not displayed correctly. But it doesn't matter if there is no solution. :( Sorry for my English.

wysota
31st May 2012, 14:52
"But inside the binary file there is one empty char (00, whether that counts as a char or not) between each pair of chars, so Čeština takes not 9 bytes but 14 bytes."
So it is not UTF-8 encoded. End of story. Even if each character was separated by a null character, then UTF-8 encoded Čeština would be 18 bytes and not 14. To me it seems you simply have a Unicode string there with each character encoded using 16 bits.


#include <QtCore>

int main() {
    QString str = QString::fromUtf8("Čeština");
    QByteArray utf8 = str.toUtf8();
    qDebug() << "UTF-8 size:" << utf8.size() << utf8.toHex();
    QVector<unsigned int> ucs4array = str.toUcs4();
    QByteArray ucs4((const char*)ucs4array.data(), ucs4array.size() * sizeof(unsigned int)); // I'm lazy, not converting to big-endian
    qDebug() << "UCS-4 size:" << ucs4.size() << ucs4.toHex();
    const QChar *unicodeStr = str.unicode();
    QByteArray unicode;
    const QChar *c = unicodeStr;
    while(c->unicode()) {
        ushort val = qToBigEndian(c->unicode());
        QByteArray ba((const char*)&val, sizeof(ushort));
        unicode.append(ba);
        c++; // always wanted to do that :)
    }
    qDebug() << "Unicode size:" << unicode.size() << unicode.toHex();
    return 0;
}

Output:
UTF-8 size: 9 "c48c65c5a174696e61"
UCS-4 size: 28 "0c010000650000006101000074000000690000006e00000061000000"
Unicode size: 14 "010c0065016100740069006e0061"

Remark: The above output shows a series of values and not an encoded string (hence the difference between '010c' and '0c01').

Benecore
2nd June 2012, 14:23
Thanks for this answer; I did not know this. :)

So my last question is: is it possible to get the original strings with UTF-8 support from these files, or not?

wysota
2nd June 2012, 20:12
Yes. However what you have is not UTF-8, it's pure 16-bit Unicode. See QString::setUnicode()
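As a rough illustration of that last step, here is a sketch in plain C++ (no Qt) that combines byte pairs into 16-bit code units and re-encodes them as UTF-8. The thread does not establish the byte order used in the .ctl files, so `bigEndian` is an assumption the caller has to supply, and the function names are hypothetical. With Qt, the decoded units would instead be handed to a QString, for example through QString::setUnicode() as suggested above.

```cpp
#include <cstddef>
#include <string>

// Combine byte pairs into 16-bit code units. The byte order is an assumption
// passed in by the caller; a trailing odd byte is ignored.
std::u16string decodeUtf16(const std::string &bytes, bool bigEndian)
{
    std::u16string units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned int first = static_cast<unsigned char>(bytes[i]);
        const unsigned int second = static_cast<unsigned char>(bytes[i + 1]);
        const unsigned int unit = bigEndian ? (first << 8) | second
                                            : (second << 8) | first;
        units += static_cast<char16_t>(unit);
    }
    return units;
}

// Re-encode as UTF-8. Sketch only: surrogate pairs are not combined, so this
// is correct for characters up to U+FFFF.
std::string toUtf8(const std::u16string &units)
{
    std::string out;
    for (char16_t u : units) {
        const unsigned int cp = u;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Applied to the 14 bytes 01 0c 00 65 01 61 00 74 00 69 00 6e 00 61 with `bigEndian` set to true, this produces 7 code units and the same 9 UTF-8 bytes as the earlier output.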