reading XML node and sub nodes



elcuco
15th August 2014, 21:04
Hi all,

I need to parse an XML file. When a specific node is found, I need to get the exact text inside that tag, including any child markup, not only the "xml text".

Example:


<root>
<a>11111</a>
<a><b>test 123</b></a>
</root>


In this XML I need to be able to get: "11111" and "<b>test 123</b>".

I am using this code in Qt5:


void test1() {
    QString rawXML =
        "<root>"
        " <a>11111</a>"
        " <a><b>test 123</b></a>"
        "</root>";
    QXmlStreamReader xml(rawXML);
    QStringRef s;

    // The document element must be <root>.
    xml.readNextStartElement();
    s = xml.name();
    if (s != "root") {
        return;
    }

    // Visit each <a> child of <root>.
    while (!xml.atEnd()) {
        xml.readNextStartElement();
        s = xml.name();
        if (s != "a") {
            break;
        }

        // Returns the concatenated text only; child tags are dropped.
        QString ss = xml.readElementText(QXmlStreamReader::IncludeChildElements);
        qDebug("%s", qPrintable(ss));
    }
}


This is not working as I expect. I am getting "test 123" and not "<b>test 123</b>".

What is the best approach to handle this situation? How can I parse the XML and get the desired result?

EDIT: similar stackoverflow question: http://stackoverflow.com/questions/5200713/parsing-xml-with-nodes-containing-html-in-qt

ChrisW67
15th August 2014, 22:52
Take a look at the tokenString() function. You may be able to construct the result as you go through the start, text, and end tokens of the <b> element.

anda_skoa
16th August 2014, 12:02
Alternatively use QDomDocument for parsing, QDomDocument::elementsByTagName to get all the <a> tags, then call QDomNode::save() on all their children.

Cheers,
_

elcuco
16th August 2014, 18:42
Another solution that does not work:



QByteArray ba(rawXML);
QBuffer bytes;

bytes.setBuffer(&ba);
bytes.open(QIODevice::ReadOnly);
QXmlStreamReader xml(&bytes);
...
QIODevice *device = xml.device();
qint64 pstart = device->pos();
QString ss = xml.readElementText(QXmlStreamReader::IncludeChildElements);
qint64 pend = device->pos();

// Try to re-read the raw bytes that readElementText() consumed.
char line[100];
qint64 len = pend - pstart;
device->seek(pstart);
device->read(line, len);
device->seek(pend);


This fails because QXmlStreamReader seems to buffer my whole data in advance, so "pos()" always returns the position of the last byte in the raw data.

elcuco
17th August 2014, 18:45
2nd attempt, using QDomDocument:



void test2() {
    const char* rawXML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        "<root>"
        " <a>11111</a>"
        " <a><b>test 123</b></a>"
        "</root>";
    QDomDocument xml("rawXML");
    QByteArray ba(rawXML);
    xml.setContent(ba);

    QDomElement rootElement = xml.documentElement();
    qDebug("root element has %d childs", rootElement.childNodes().count());
    qDebug("root element is %s ", qPrintable(rootElement.nodeName()));

    // Serialize every <a> element (tags included) into a string.
    QDomNodeList a = rootElement.elementsByTagName("a");
    for (int i = 0; i < a.length(); i++) {
        QString aContent;
        QTextStream ts(&aContent);
        a.at(i).save(ts, 0);
        qDebug("- [%s]", qPrintable(aContent));
    }
}


Now, I get not only the content, but the tags as well. This is the output:



root element has 2 childs
root element is root
- [<a>11111</a>
]
- [<a>
<b>test 123</b>
</a>
]


I assume I can trim the leading <a> and the trailing </a>\n, but this is at the same level of ugliness I am trying to avoid.

Any other ideas?

EDIT:

I should be testing more before posting. anda_skoa, I re-read your post and changed:



const char* rawXML = "<root><a>11111</a><a><b>test 123</b></a></root>";

// alist is the QDomNodeList of <a> elements; save the child, not <a> itself.
for (int i = 0; i < alist.length(); i++) {
    QDomNode a = alist.at(i);
    QString aContent;
    QTextStream ts(&aContent);
    a.firstChild().save(ts, 0);
    qDebug("- [%s]", qPrintable(aContent));
}


Which is better:


- [11111]
- [<b>test 123</b>
]

Still, I get an extra newline after the </b>, which is bad for me... but it's much better than before. It seems that QDomDocument is adding extra newlines when parsing. Am I correct? How can I disable this "feature"?

anda_skoa
17th August 2014, 23:01
Does the newline matter? It is still the same XML content, no?

One other thing you could look into is XQuery; Qt has a module for that as well (called Qt XML Patterns).

Cheers,
_

elcuco
18th August 2014, 06:58
I modified the XML source to be on a single line (see last example). The newline is important, since it can be part of the input (although I think it would need to be escaped there, so it may be possible to filter out the added one). It still bugs me, as it feels like Qt is doing extra work behind my back.

Yes, XmlPatterns is another option... I was looking into it, but as I know nothing about it, I was hoping someone would give me the one-liner I am looking for... :)

anda_skoa
18th August 2014, 08:08
Hmm, I am not sure the newline at this point (after an end tag) is relevant as far as XML goes.

Cheers,
_

elcuco
18th August 2014, 17:00
In theory you are right.
In practice, I am parsing XML files that contain data which was (once) user input, and thus I need to store the exact text that the user wrote.

Note to self:
If you ever get into problems ... UUENCODE or base64 that &^%&^% text.... unfortunately this time the format is pre-defined for me.