QRegExp to capture HTML link


T4ng10r

25th October 2011, 20:12

Hi.
I want to extract links from a few pages. First I used QWebFrame and it worked, but why use a cannon to shoot a fly?
Now I'm trying to use a RegExp to capture those links. I could use an XML parser instead, but writing a separate parser for each page layout seems like too much code.

The link I'm searching for is:

<a href="/?p=AMD+X2+II+555+AM3+B" class="produkt" title="Opis"><span class="produkt">AMD Phenom II X2 555 Black Edition s.AM3 BOX</span></a>
I tried
<a [\d\w= ]*(class)=\"(produkt)\"> but it was too greedy.
Any suggestions?

ChrisW67

25th October 2011, 22:48

QRegExp::setMinimal() may help, but regular expressions that match nested delimiter pairs (i.e. <>, "", etc.) are difficult to get right. Inconsistent HTML does not help: upper/lower case, single, double or no quotes, valueless attributes, etc.

T4ng10r

26th October 2011, 06:53

Yes, I tried setting this minimal flag.

const QString cstrProduktLinkRegExp("<a .*class=\"produkt\">");
QRegExp stProductRow(cstrProduktLinkRegExp, Qt::CaseSensitive, QRegExp::RegExp2);
stProductRow.setMinimal(true);
iStart = strSearchTable.indexOf(stProductRow,iStart);
if (iStart == -1) // indexOf() returns -1 when nothing matches
return -1;
QString strProductRow = stProductRow.cap();

Still, it starts at the first <a> it finds and only ends the match AFTER the first appearance of class="produkt".
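
That behaviour is expected: a regex engine reports the leftmost match, and the minimal flag only controls how far that match extends, so it cannot skip an earlier <a>. One common workaround is to replace .* with [^>]*, which cannot run past the end of the tag it started in. Below is a minimal sketch of that idea. It uses std::regex rather than QRegExp so that it compiles without Qt, and the helper names firstProduktLink and hrefOf are made up for illustration. The same pattern should carry over to QRegExp, but that is an assumption I have not tested here, and it still assumes double-quoted attributes.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first <a ...> opening tag that itself
// carries class="produkt", or an empty string if there is none.
std::string firstProduktLink(const std::string& html)
{
    // [^>]* stops at the closing '>' of the tag, so the match cannot
    // start in one <a> and finish inside a later tag.
    static const std::regex re(R"re(<a\s[^>]*class="produkt"[^>]*>)re",
                               std::regex::icase);
    std::smatch m;
    return std::regex_search(html, m, re) ? m.str(0) : std::string();
}

// Hypothetical helper: pulls the double-quoted href value out of a tag.
std::string hrefOf(const std::string& tag)
{
    // [^"]* keeps the capture inside a single pair of quotes.
    static const std::regex re(R"re(href="([^"]*)")re", std::regex::icase);
    std::smatch m;
    return std::regex_search(tag, m, re) ? m.str(1) : std::string();
}
```

Calling firstProduktLink() on a page and passing the result to hrefOf() gives the link target. Anchors whose own attributes lack class="produkt" are skipped, even when they contain a <span class="produkt"> child.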