Character encoding issues



yagabey
14th December 2008, 21:43
Hello,
I am writing a Qt console application. It parses an RSS source, searches for a keyword that comes from user input, and finally writes the matching items from the RSS source to a text file.

I get the word to be searched from the console:

cout<<"Authorname ?\n";
cin>>author;
searchedAuthor=author;//searchedAuthor is a QString

parsing the rss:


if (currentTag == "dc:creator"){
authorString += xml.text().toString();

searching for my author:

if(authorString.contains(searchedAuthor,Qt::CaseInsensitive)){
outputfile << titleString<< " "<<linkString <<" "<< descriptionString << authorString <<"\n";

This code works well for English, but I need to use Turkish, which has some extra characters such as "ğ, ş, ı, ö, ç".
The problem is that if authorString contains any of those extra characters, QString::contains() cannot find the author it should find. Printing with cout works without any problem (the characters are displayed correctly). I don't know much about character encoding issues, but I have seen setCodecForTr in the docs:


QTextCodec::setCodecForTr(QTextCodec::codecForName("eucTR"));

I didn't know whether eucTR is installed or not, I just tried it, and it didn't work either. What can I do to make QString::contains() work properly for Turkish?

Thanks in advance...

caduel
15th December 2008, 11:03
You have to make sure that you read the data with the encoding set to whatever the data is actually encoded in. If the data is Turkish text, you probably need to do something like QTextCodec::setCodecForLocale(QTextCodec::codecForName("eucTR"));

Note that it probably is better not to set that globally, but only for your input file. See QTextStream::setCodec().


(Note that setCodecForTr sets the codec used for translations, i.e. the tr("...") calls in Qt code; the "Tr" has nothing to do with Turkish here ;-) )

HTH

yagabey
15th December 2008, 11:20
Hello,
Do you mean something like that:


cout<<"Author name ?\n";
QTextCodec::setCodecForLocale(QTextCodec::codecForName("eucTR"));
cin>>author;
searchedAuthor=author;//searchedAuthor is a QString

I tried that, but the result didn't change.
And how can I find out whether "eucTR" exists or not? (Is there a list of installed codecs?)


the "Tr" has nothing to do with Turkish here ;-)

yes you are right :) ...

caduel
15th December 2008, 12:13
no, I'd drop cin here.
try something like

QTextStream in(stdin);
in.setCodec("eucTR"); // setCodec(), not setLocale(); the name must be a codec Qt actually knows
in >> searchedAuthor;

alternatively:

cin >> author;
searchedAuthor = QString::fromAscii(author.c_str()); // after setting the locale
// ... if (see docs) QTextCodec::setCodecForCStrings() has been set

HTH

gerome69
15th December 2008, 12:42
First of all:
Qt always uses Unicode internally, so all Turkish, German, Chinese letters etc. can be represented.

The Microsoft Windows console uses a country-specific code page,
e.g.:
- Codepage 850 in western Europe: http://de.wikipedia.org/wiki/Codepage_850
- Codepage 857 for Turkish: http://de.wikipedia.org/wiki/Codepage_857

=> So you have to convert the input from your specific encoding to Unicode.

Let's have a look at:
http://doc.trolltech.com/4.4/qtextcodec.html
There you can read:
The supported encodings are:
[...]
# IBM 850
# IBM 866
# IBM 874
[...]
# ISO 8859-1 to 10

So, unfortunately, the encoding you need, "IBM 857", is missing.
I don't know if it works, but try ISO 8859-9:
http://de.wikipedia.org/wiki/ISO_8859-9

A simple test program goes like this (look at my comments in the code!):


#include <QtCore>
#include <QtGui>

#include <cstdio>
#include <iostream>
using namespace std;

int main(int argc, char** argv) {
    QApplication app(argc, argv);

    char author[100] = {0};

    cout<<"Authorname ?\n";
    cin.width(sizeof(author)); // never read more than the buffer holds
    cin>>author;

    // Just to get an idea of how the character codes look internally.
    // Please enter some of your own special chars. In German I always use "äöü".
    for (int i=0; author[i]!='\0'; i++) {
        int c=author[i];

        if (c<0) c+=256; // char may be signed: map to 0..255
        printf("%d ", c);
    }
    printf("\n");

    QByteArray encodedString=author;
    // QTextCodec *codec=QTextCodec::codecForName("IBM 850"); // western europe
    // QTextCodec *codec=QTextCodec::codecForName("IBM 857"); // turkish but will not work :-(
    QTextCodec *codec=QTextCodec::codecForName("ISO 8859-9"); // try it

    if (!codec) {
        printf("Codec not supported.\n");

        return 0;
    }

    QString searchedAuthor=codec->toUnicode(encodedString);

    // This message box lets you check whether the encoding is interpreted correctly.
    // Output on the console does not show you anything, because it does not use unicode.
    QMessageBox::information(NULL, "Ausgabe", searchedAuthor);

    return 0;
}

Have fun, Gérôme
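
The byte dump printed by the test program above can also be checked mechanically. The following is a minimal sketch in plain C++ (no Qt), assuming the goal is to tell whether the bytes read from the console have the shape of UTF-8 or come from a single-byte code page. The function name looksLikeUtf8 is hypothetical, and the check is deliberately simplified: it validates lead and continuation bytes only and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every byte sequence in `bytes` has the
// shape of a UTF-8 encoded character (a lead byte followed by the right number
// of 10xxxxxx continuation bytes). Overlong forms are not rejected.
bool looksLikeUtf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;                       // continuation bytes expected
        if (lead < 0x80)                extra = 0;   // plain ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;   // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) extra = 2;   // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) extra = 3;   // 11110xxx
        else return false;                           // stray continuation or invalid lead

        if (extra > bytes.size() - i - 1) return false;  // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

For example, the two bytes 196 176 (0xC4 0xB0) pass this check, while a lone 167 does not, which points to a single-byte code page instead.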

yagabey
15th December 2008, 14:32
I couldn't get c_str() to work, although I added the <cstring> and <string> headers:


searchedAuthor = QString::fromAscii(author.c_str());

The codec check passed (the codec is supported):

if (!codec) {
printf("Codec not supported.\n");
return 0;
}

I also tried:

QTextCodec *q = QTextCodec::codecForName("ISO-8859-9");
QTextCodec::setCodecForCStrings(q);

and


QTextCodec *codec=QTextCodec::codecForName("ISO 8859-9");

In the message box, the characters are wrong again (it shows "Ä°" instead of "İ"):

QMessageBox::information(NULL, "Ausgabe", searchedAuthor);

What else should I do? :confused:

caduel
15th December 2008, 14:43
I assumed that author is a std::string; if it is not (maybe it's just a char[20] or so...), just drop the .c_str() call.

yagabey
15th December 2008, 14:47
searchedAuthor = QString::fromAscii(author);

didn't fix the problem :crying:

caduel
15th December 2008, 15:08
show us the (complete) code you are using

gerome69
15th December 2008, 15:23
QTextCodec *codec=QTextCodec::codecForName("ISO 8859-9");

in the message box, characters are not correct again..(it shows "Ä°" instead of "İ" )

QMessageBox::information(NULL, "Ausgabe", searchedAuthor);

what else should i do? :confused:

So the right codec is definitely Codepage 857, which is not supported by Qt :-(

You have to implement the conversion from CP 857 to ISO-8859-9 yourself.

Just walk through the byte array and convert the chars > 127:



for (int i=0; i<auth_len; i++) {
    int c=author[i];

    if (c<0) {
        c+=256;

        switch(c) {
        // e.g. the g with a "bow" on it
        case 167: author[i]=(char)240; break; // same as 240-256=-16 for a signed char; try it, I've got no turkish windows to test it!
        // I think you don't have to do it for all 128 chars, only for the few
        // turkish special chars you need
        }
    }
}


G.
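
The per-character conversion suggested above could be sketched as follows, in plain C++ without Qt. This is an illustration under assumptions, not tested on a Turkish Windows console: the function names are hypothetical, only the Turkish letters are mapped, and the byte values are taken from the CP 857 and ISO 8859-9 code tables as I understand them, so please verify them against those tables before relying on them. Working on unsigned char avoids the negative values that a signed char produces for bytes above 127.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: maps one CP 857 byte to ISO 8859-9 for the Turkish
// letters only; every other byte is passed through unchanged.
unsigned char cp857ToIso88599(unsigned char c) {
    switch (c) {
    case 0x80: return 0xC7; // Ç
    case 0x81: return 0xFC; // ü
    case 0x87: return 0xE7; // ç
    case 0x8D: return 0xFD; // ı
    case 0x94: return 0xF6; // ö
    case 0x98: return 0xDD; // İ
    case 0x99: return 0xD6; // Ö
    case 0x9A: return 0xDC; // Ü
    case 0x9E: return 0xDE; // Ş
    case 0x9F: return 0xFE; // ş
    case 0xA6: return 0xD0; // Ğ
    case 0xA7: return 0xF0; // ğ (167 -> 240, as in the snippet above)
    default:   return c;
    }
}

// Converts a whole byte string, one byte at a time.
std::string convertCp857ToIso88599(const std::string& in) {
    std::string out(in);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<char>(
            cp857ToIso88599(static_cast<unsigned char>(out[i])));
    }
    return out;
}
```

The converted bytes could then be handed to the ISO 8859-9 codec, as in the earlier test program.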

yagabey
15th December 2008, 21:34
Here is a summary of the code:



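#include <QApplication>
#include <QBuffer>
#include <QByteArray>
#include <QDebug>
#include <QFile>
#include <QHttp>
#include <QTextCodec>
#include <QTextStream>
#include <QUrl>
#include <QWidget>
#include <QXmlStreamReader>

#include <iostream>
using namespace std;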


class ColumnListing : public QWidget
{
Q_OBJECT
public:
ColumnListing(QWidget *parent = 0);

public slots:
void fetch();
void readData(const QHttpResponseHeader &);

private:
void parseXml();

QXmlStreamReader xml;
QString currentTag;
QString linkString;
QString titleString;
QString descriptionString;
QString authorString;
QString urltext;
QString inputNews;
QString searchedAuthor;
QFile file;
QFile *htmlFile;
QHttp httpInstance;
int subconnectionId;
QUrl urlColumnToGo;

QHttp http;
int connectionId;

};

constructor part:


ColumnListing::ColumnListing(QWidget *parent)
: QWidget(parent)
{

connect(&http, SIGNAL(readyRead(const QHttpResponseHeader &)),
this, SLOT(readData(const QHttpResponseHeader &)));

char *input;
char *author;
input = new char[100];
author = new char[100];

cout<<"Newspaper ?\n";
cin>>input;
inputNews=input;//inputNews is global QString

cout<<"Author ?\n";
cin>>author;
searchedAuthor=author;//searchedAuthor is global QString

if (inputNews=="milliyet"){
urltext.append("http://www.milliyet.com.tr/D/rss/rss/RssY.xml?ver=51");
}
else if (inputNews=="sabah"){
urltext.append("http://www.sabah.com.tr/rss/yazarlar.xml");
}
else if (inputNews=="radikal"){
urltext.append("http://www.radikal.com.tr/radikal_yazar.xml");
}
/*** More News sites here....***/

file.setFileName("output.txt");//file is global QFile
if (!file.open(QFile::ReadWrite | QFile::Truncate))
return;

htmlFile= new QFile("htmloutput.html");//htmlFile is global QFile
if (!htmlFile->open(QFile::ReadWrite | QFile::Truncate))
return;
}

fetching:

void ColumnListing::fetch()
{
xml.clear();
QUrl url(urltext);
http.setHost(url.host());
connectionId = http.get(url.path());
}

parsing the rss doc:

void ColumnListing::parseXml()
{
QTextStream inputText(&file);

while (!xml.atEnd()) {
xml.readNext();
if (xml.isStartElement()) {
if (xml.name() == "item")
linkString = xml.attributes().value("rss:about").toString();
currentTag = xml.name().toString();
} else if (xml.isEndElement()) {
if (xml.name() == "item") {

if(authorString.contains(searchedAuthor,Qt::CaseInsensitive) ){
inputText << titleString<< " "<<linkString <<" "<< descriptionString << authorString <<"\n";

QUrl url(linkString);

httpInstance.setHost(url.host());
subconnectionId = httpInstance.get(url.path(),htmlFile);//write the author page into html file (request sent through the object whose host was just set)
}

titleString.clear();
linkString.clear();
descriptionString.clear();
authorString.clear();

}

} else if (xml.isCharacters() && !xml.isWhitespace()) {
if (currentTag == "title"){
titleString += xml.text().toString();
}
else if (currentTag == "link"){
linkString += xml.text().toString();
}
else if (currentTag == "description"){
descriptionString += xml.text().toString();
}
else if (currentTag == "dc:creator"){
authorString += xml.text().toString();
}
}
}
if (xml.error() && xml.error() != QXmlStreamReader::PrematureEndOfDocumentError) {
qWarning() << "XML ERROR:" << xml.lineNumber() << ": " << xml.errorString();
http.abort();
}
}

main.cpp

int main(int argc, char **argv)
{
QApplication app(argc, argv);

QTextCodec *q = QTextCodec::codecForName("ISO-8859-9");
QTextCodec::setCodecForCStrings(q);

ColumnListing *columnlisting = new ColumnListing;
columnlisting->fetch();
return app.exec();
}

yagabey
15th December 2008, 22:20
OK, at last I got it to work:

instead of:

searchedAuthor = QString::fromAscii(author);

I used :

searchedAuthor = QString::fromUtf8(author);

and now everything works perfectly, thank you all... :)
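
One plausible reading of why fromUtf8() fixed it, offered as an assumption and not something the thread confirms: the console delivered the input as UTF-8, and the earlier attempts decoded each byte separately with a single-byte codec. "İ" is U+0130, which UTF-8 encodes as the two bytes 0xC4 0xB0. Read as single-byte characters (these two positions are identical in Latin-1 and ISO 8859-9) they are "Ä" and "°", which is exactly the "Ä°" reported earlier. The sketch below reproduces this in plain C++ without Qt; the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Encodes one Unicode code point as UTF-8 (up to U+FFFF only, for brevity).
std::string encodeUtf8(unsigned int cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Simulates the wrong decoding: every input byte is treated as one Latin-1
// character, and the result is re-encoded as UTF-8 so it can be compared.
std::string misreadAsLatin1(const std::string& bytes) {
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        out += encodeUtf8(static_cast<unsigned char>(bytes[i]));
    }
    return out;
}
```

Here encodeUtf8(0x130) yields the bytes 0xC4 0xB0, and misreadAsLatin1() applied to them yields the UTF-8 encoding of "Ä°". Decoding the same two bytes as UTF-8 gives back the single character "İ", which is what QString::fromUtf8() does.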