convert ampersand encoded HTML into something readable



tetsuoii
16th October 2010, 22:41
I'm extracting strings from a webpage, but they are full of national characters encoded as HTML numeric character references (ampersand, hash, decimal character code, semicolon), looking something like this:

"sekret&#0230;r i k&#0248;"

This is awful, and I can't find any information on how to decode these special characters, to the point where I'm close to writing a parser just for the purpose. The web page is encoded in UTF-8, and the browser has no problem displaying it, but in Qt all I get is mangled text strings...

Does anyone know how to decode HTML-encoded characters correctly?

Lykurg
16th October 2010, 23:02
Do not double post! I'll close this one.

EDIT: Oh, come on, decide where you want to post before you post! And once you have posted, don't change the first post and make it unreadable. Ask a moderator to move your post if really needed.

For educational purposes, I'll leave this one closed. Edit your first one and you will get an answer.

ChrisW67
17th October 2010, 04:30
0xFFFF ? A Unicode non-character perhaps? What was the question?

Lykurg
17th October 2010, 10:32
Ok, a little mess here, but now both threads are merged and open again. ChrisW67's answer was referring to the now deleted post, which didn't contain a question...

Lykurg
17th October 2010, 10:37
Ok, and now to prove that we are gentle here:

Have a look at QTextDocument. Set the HTML and get the plain text back. A more lightweight solution would be to search for such notations and replace them by hand.
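
For the "replace them by hand" route, here is a rough sketch in plain C++ with no Qt dependency. The names decodeNumericEntities, appendUtf8 and digitValue are made up for illustration and are not part of Qt or any library. It assumes you want UTF-8 output, and it only handles numeric references such as &#230; or &#xE6;. Named entities like &amp; are left untouched, as is anything malformed.

```cpp
#include <cstddef>
#include <string>

// Return the numeric value of c in the given base (10 or 16), or -1 if invalid.
static int digitValue(char c, int base) {
    if (c >= '0' && c <= '9') return c - '0';
    if (base == 16) {
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    }
    return -1;
}

// Append the UTF-8 encoding of code point cp (assumed <= 0x10FFFF) to out.
static void appendUtf8(std::string &out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: replace &#NNN; and &#xHHH; with the UTF-8 character.
// Everything else, including malformed references, is copied through as-is.
std::string decodeNumericEntities(const std::string &in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '&' && i + 2 < in.size() && in[i + 1] == '#') {
            std::size_t j = i + 2;
            int base = 10;
            if (in[j] == 'x' || in[j] == 'X') {
                base = 16;
                ++j;
            }
            unsigned long cp = 0;
            std::size_t digits = 0;
            // Cap at 8 digits so cp cannot overflow a 32-bit unsigned long.
            while (j < in.size() && digits < 8) {
                int d = digitValue(in[j], base);
                if (d < 0) break;
                cp = cp * static_cast<unsigned long>(base) + static_cast<unsigned long>(d);
                ++j;
                ++digits;
            }
            bool terminated = digits > 0 && j < in.size() && in[j] == ';';
            bool validCp = cp > 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
            if (terminated && validCp) {
                appendUtf8(out, cp);
                i = j + 1;  // skip past the ';'
                continue;
            }
        }
        out += in[i++];
    }
    return out;
}
```

In a Qt program the result could then be handed to QString::fromUtf8() before it goes into a widget. If you already link against QtGui, the QTextDocument route is probably less work.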

tetsuoii
24th October 2010, 18:49
sorry 'bout the double posting, i'll try not to heat your helmet next time =)

anyway, setting all labels to label->setTextFormat(Qt::RichText); fixed all my problems: both the one described above and the one where I couldn't use Norwegian non-ASCII characters directly and had to substitute them with &#0230; etc. They don't display like "h<?>lvetes j<?>vla kr<?>ket<?>r" :confused: anymore!

It also improved my mood, which was on a downward slope... So to all Scandinavian, French, German, Polish and other special character users, calling setTextFormat(Qt::RichText) is highly recommended!

And thanks a lot to you, Helmet-Man, for your valuable advice, which may have saved me days of work!