QtXml/QDomDocument and invalid attribute syntax



AyaKoshigaya
29th March 2011, 16:13
Hi,

I am using QDomDocument to parse our XML files. It works nicely and fast, except for one big problem.

Some of the XML files use invalid syntax in attributes, like this:


<xml>
<node someAttribute="this is <b>text</b>" />
</xml>

As you can see, there are "<" and ">" characters in the attribute value. Our old XML parser had no problems with this, since the value is in quotes.

I can't change all the XML files by hand because we get a ton of them every day from our customers. So, does anyone have an idea how I can solve this problem?

QDomDocument doesn't work with this: it just stops reading the XML at the first node with a "<" in the attribute value :(

Thanks,
Aya

wysota
30th March 2011, 11:56
Well... your file is not XML, it's "XML-like". Every proper XML parser should bail out on this. What you could do is prescan the whole file using a regular expression and convert all invalid attributes to valid ones before passing the text to an XML parser. Something like:

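// Assumes "text" is a QString holding the raw file contents.
QMap<QString,QString> replacements;
replacements.insert("<", "&lt;");
replacements.insert(">", "&gt;");
// etc
QMapIterator<QString,QString> iter(replacements);
while (iter.hasNext()) {
    iter.next(); // advance to the next item; next() does not return a bool
    text.replace(iter.key(), iter.value()); // QString::replace() modifies text in place
}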

Of course this is an oversimplification, as it would replace all angle brackets, including the ones that delimit the tags themselves, and that's certainly not what you want. You have to detect attributes first (using regular expressions) and only operate on their contents.

Maybe something like this?

QRegExp rx("([A-Za-z]+)\\s*=\\s*\"([^\"]+)\""); // cap(1) contains attr name, cap(2) contains value
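
To make the idea concrete, here is a minimal sketch of the "detect attributes first, then escape only their contents" approach. It is not Qt code: it uses std::regex and std::string from the standard library in place of QRegExp and QString so that it compiles on its own. The function names escapeAngleBrackets and sanitizeAttributes are made up for this example, and the attribute-name pattern is slightly wider than the one above (it also allows digits, "_", ":", "." and "-"). It assumes attribute values are double-quoted and contain no embedded double quotes; single-quoted values are left untouched.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Replace raw '<' and '>' with their entities. '&' is deliberately left
// alone (assumption: the values contain no stray ampersands to fix).
std::string escapeAngleBrackets(const std::string &value)
{
    std::string out;
    out.reserve(value.size());
    for (char c : value) {
        if (c == '<')      out += "&lt;";
        else if (c == '>') out += "&gt;";
        else               out += c;
    }
    return out;
}

// Find every name="value" pair and escape only the value part, so the
// angle brackets that delimit the tags themselves are not changed.
std::string sanitizeAttributes(const std::string &text)
{
    // Group 1 is the attribute name, group 2 is the attribute value.
    static const std::regex attrRx(
        "([A-Za-z_:][A-Za-z0-9_:.\\-]*)\\s*=\\s*\"([^\"]*)\"");

    std::string result;
    std::size_t last = 0; // end of the previously copied region
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), attrRx); it != end; ++it) {
        const std::smatch &m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position(0));
        result.append(text, last, pos - last); // copy text before the match as-is
        result += m[1].str() + "=\"" + escapeAngleBrackets(m[2].str()) + "\"";
        last = pos + static_cast<std::size_t>(m.length(0));
    }
    result.append(text, last, std::string::npos); // copy the remaining tail
    return result;
}
```

The sanitized string could then be handed to QDomDocument::setContent(). Keep in mind that a regular expression is only a heuristic here: anything in the element text that happens to look like name="value" would be rewritten too, so it is worth checking the output against a few real customer files.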