Regular expression help needed [Archive]

aurora

13th January 2012, 11:24

I have a collection of files; the contents of every file have the following format:

-- File name
--
-- listOne (L1)
-- listTwo (L2)
-- listThree (L3)
-- HeaderLine (HE)
--     listFour (L6)
--     listFive (L2)
--     listSix (L9)
--     listSeven (L0)
-- someline (SL)
--     listeight (LL)
--
--
REMAINING CONTENTS OF THE LINE
-----------------------------------------------------------------------
some more contents
------------------------------------------------------------------------

Here I want to store only L1, L2, L3, etc. in a list, and skip HE, SL and the remaining lines of the file.
How can I do that?
Please help me. I went through the QRegExp class definition and wrote some code, but it is very long and it inserts some blank strings into the stored list:

while(!f.atEnd() && (!line.contains("------------------------------------------")))
{

    if(!line.contains("-- "))
    {
        flag=1;
        QRegExp rx("[\(]([a-z]|[0-9]|[_]|[A-Z])+[\)]");
        rx.indexIn(line);
        QRegExp rx1("([a-z]|[0-9]|[_]|[A-Z])+");
        rx1.indexIn(rx.cap(0));
        captured.append(rx1.cap(0));
        line=f.readLine();
    }
    else if(flag==1)
    {
        flag++;
        captured.pop_back();
        QRegExp rx("[\(]([a-z]|[0-9]|[_]|[A-Z])+[\)]");
        rx.indexIn(line);
        QRegExp rx1("([a-z]|[0-9]|[_]|[A-Z])+");
        rx1.indexIn(rx.cap(0));
        captured.append(rx1.cap(0));
        line=f.readLine();
    }

    else if(flag>0)
    {
        flag++;
        QRegExp rx("[\(]([a-z]|[0-9]|[_]|[A-Z])+[\)]");
        rx.indexIn(line);
        QRegExp rx1("([a-z]|[0-9]|[_]|[A-Z])+");
        rx1.indexIn(rx.cap(0));

        captured.append(rx1.cap(0));
        line=f.readLine();
    }

}

Please help me sort out this problem.

Lykurg

13th January 2012, 11:30

QRegExp rx("[\(]([a-z]|[0-9]|[_]|[A-Z])+[\)]");
can be written as

QRegExp rx("\\(([a-z0-9_A-Z]+)\\)");

which is a little more readable. Then use QRegExp::capturedTexts() to get the content inside the brackets.
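
For anyone who wants to try the pattern outside Qt, here is a minimal sketch of the same idea. It uses std::regex from the C++ standard library as a stand-in for QRegExp (that swap is an assumption, it is not what the thread uses), and the helper name bracketCode is made up for illustration. It returns the text inside a bracketed code at the end of a line, or an empty string when there is none.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the code inside "(...)" at the very end of
// the line, or an empty string when the line does not end with such a group.
std::string bracketCode(const std::string &line)
{
    // A literal "(", one or more letters, digits or underscores (captured),
    // a literal ")", then the end of the input.
    static const std::regex re(R"re(\(([A-Za-z0-9_]+)\)$)re");

    std::smatch m;
    if (std::regex_search(line, m, re))
        return m[1].str();   // capture group 1: the text between the brackets
    return std::string();
}
```

Checking for the empty return value before appending to a list is one way to avoid storing blank strings for lines that have no bracketed code.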

aurora

13th January 2012, 11:47

Sorry Lykurg, I posted it wrongly... please look at the file format once more.
I don't want to capture lines like "HE" and "SL", which have sub-lines.

aurora

15th January 2012, 14:06

For example, my input is the file shown above,
and the regular expression must capture
only L1,L2,L3,L6,L2,L9,L0,LL

It should not capture a line which has sub-lines, that's all...

amleto

15th January 2012, 14:43

Does everything you want to save start with "(L"?

aurora

16th January 2012, 03:56

Does everything you want to save start with "(L"?

No amleto, nothing like that...
> All text inside round brackets, which is present at the end of every line.
> And the regular expression should not capture a line which has sub-lines,
e.g.:

-- its main line
--     its sub line
--     its another subline
> Consider all the lines which start with "-- "

ChrisW67

16th January 2012, 05:50

In your example all the lines you wish to capture start with "-- " and end with "(L.)" where "." is a wildcard, and the unwanted grouping lines do not. Amleto is asking if that is a rule (or something like a rule) you can use. If amleto's observation is correct then this:

#include <QtCore>
#include <QDebug>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    QFile in("test.txt");
    if (in.open(QIODevice::ReadOnly)) {
        QTextStream s(&in);

        // "-- ", any text, then a two-character code starting with L in brackets
        QRegExp re("--\\s.*\\((L.)\\)");
        while (!s.atEnd()) {
            QString line = s.readLine();
            // exactMatch() requires the whole line to match the pattern
            if (re.exactMatch(line))
                qDebug() << re.cap(1);
        }
    }
    app.exec();
}

trivially gives:

"L1"
"L2"
"L3"
"L6"
"L2"
"L9"
"L0"
"LL"

If not, you need to infer hierarchy from changes in the amount of whitespace between "-- " and the text of lines ending "\\(.+\\)". Something like:

#include <QtCore>
#include <QDebug>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);

    QFile in("test.txt");
    if (in.open(QIODevice::ReadOnly)) {
        QTextStream s(&in);

        int lastDepth = 0;
        QString lastValue;
        QStringList result;

        // cap(1): whitespace after "--" (its length is the depth)
        // cap(2): text inside the brackets at the end of the line
        QRegExp re("--(\\s+).*\\((.+)\\)");
        while (!s.atEnd()) {
            QString line = s.readLine();

            if (re.exactMatch(line)) {
                int newDepth = re.cap(1).length();
                // The pending entry was not followed by a deeper line,
                // so it has no sub-lines: keep it.
                if (lastDepth > 0 && newDepth <= lastDepth)
                    result << lastValue;
                lastDepth = newDepth;
                lastValue = re.cap(2);
            }
            else if (line.startsWith("---")) {
                // A line of hyphens ends the header: output the pending entry and stop.
                if (lastDepth > 0)
                    result << lastValue;
                break;
            }
        }

        qDebug() << result;
    }

    app.exec();
}

outputs:

("L1", "L2", "L3", "L6", "L2", "L9", "L0", "LL")
for your example. This is dependent on reliable indenting which, if humans are involved, is unlikely to be the case.
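
One way to make the depth comparison a little less fragile when tabs and spaces are mixed is to measure the indentation in columns instead of characters. The sketch below is only an illustration under stated assumptions: the helper name indentWidth and the tab stop of 4 are made up, and columns are counted from the start of the whitespace run rather than from the start of the line.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: width in columns of a run of whitespace, where a tab
// advances to the next multiple of tabStop and any other character counts as 1.
std::size_t indentWidth(const std::string &ws, std::size_t tabStop = 4)
{
    std::size_t col = 0;
    for (char c : ws) {
        if (c == '\t')
            col += tabStop - (col % tabStop);  // jump to the next tab stop
        else
            ++col;
    }
    return col;
}
```

In the loop above, the depth could then be taken as indentWidth(re.cap(1).toStdString()) instead of re.cap(1).length(). This does not help if the indentation itself is inconsistent.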

aurora

16th January 2012, 09:32

Yes Chris,
your second option is the one I wanted, but I think I need to make changes to the regular expression; it's not capturing anything.
Would you mind telling me what changes I should make? The regular expression seems a bit complex to me.
I'm getting confused about how to change it.

ChrisW67

16th January 2012, 23:27

The regular expression and code I gave you do capture the values inside () at the end of suitable lines in your example data. By "end" I mean really the end: no trailing whitespace, for example.
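
To illustrate the point about trailing whitespace, here is a small sketch. It uses std::regex from the C++ standard library as a stand-in for QRegExp (an assumption made so it runs without Qt), with std::regex_match playing the role of exactMatch(). The helper names rtrim and matchesEntry are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: copy of s without trailing spaces, tabs, CR or LF.
std::string rtrim(const std::string &s)
{
    const std::size_t end = s.find_last_not_of(" \t\r\n");
    return end == std::string::npos ? std::string() : s.substr(0, end + 1);
}

// Hypothetical helper: true when the whole line is "--", whitespace, any
// text, and a bracketed code as the very last thing on the line.
bool matchesEntry(const std::string &line)
{
    static const std::regex re(R"re(--(\s+).*\((.+)\))re");
    return std::regex_match(line, re);  // whole-string match, like exactMatch()
}
```

A line such as "-- Brad Hogg (BH) " with a space after the closing bracket fails the whole-line match, while the trimmed line passes. In the Qt code, calling trimmed() on the line would also strip the leading whitespace, which is harmless here because the depth is measured after the "--".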

What exactly are you feeding the routine as input?

aurora

19th January 2012, 04:08

The regular expression and code I gave you do capture the values inside () at the end of suitable lines in your example data. By "end" I mean really the end: no trailing whitespace, for example.

What exactly are you feeding the routine as input?

I gave it a file with the following contents:

-- Filename: abcd.txt
-- Player names
-- Geraint Jones (GJa)
-- James Anderson (JA)
-- England (Eng)
-- Captain (cap)
-- KevinPietersen (KPa)
-- Andrew Strauss (AS)
-- Paul Collingwood (PC)
-- Flintoff (AF)
-- Australia (Au)
-- Manager (Man)
-- Ponting (RP)
-- Adam Gilchrist (AG)
-- Brad Hogg (BH)
--
--

Here it should capture only the player names, and should not capture the country names and designations.
This is the input I gave, but that regular expression didn't capture anything...

ChrisW67

19th January 2012, 05:38

This is the input I gave, but that regular expression didn't capture anything...
You sure? Really sure?

Using the code exactly as I posted above, with a test.txt containing exactly what you just posted I get:

("GJa", "JA", "KPa", "AS", "PC", "AF", "RP", "AG")

The country entries "Eng" and "Au" are missing because your stated requirement was that we "should not capture line which has sub line". The same reasoning excludes the "cap" and "Man" entries. Hoggy (BH) is missing because your example text does not meet the format you posted at the start of this thread: there, a line of hyphens terminates the header, and that line is what triggers the output of the last pending entry. This input has no such line, so the last entry is never output.

Have you tried to understand how the code functions?

aurora

19th January 2012, 06:10

Thanks a lot Chris. A minor correction was needed in my main program; now it works as I expected.
Thank you so much for looking at my issue with so much patience... :)

ChrisW67

19th January 2012, 06:35

A variation that captures BH as well:

#include <QtCore>
#include <QDebug>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);

    QFile in("test.txt");
    if (in.open(QIODevice::ReadOnly)) {
        QTextStream s(&in);

        int lastDepth = 0;
        QStringList result;

        // cap(1): whitespace after "--" (its length is the depth)
        // cap(2): text inside the brackets at the end of the line
        QRegExp re("--(\\s+).*\\((.+)\\)");
        while (!s.atEnd()) {
            QString line = s.readLine();

            if (re.exactMatch(line)) {
                int newDepth = re.cap(1).length();
                // This line is deeper than the previous entry, so that
                // entry has sub-lines: drop it again.
                if (newDepth > lastDepth && result.size() > 0)
                    result.removeLast();
                lastDepth = newDepth;
                result << re.cap(2);
            }
            else if (line.startsWith("---"))
                break;
        }

        qDebug() << result;
    }

    app.exec();
}
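
The same logic can be tried without Qt. The sketch below ports the loop above to std::regex from the C++ standard library (a stand-in for QRegExp, not what the thread uses) and takes the lines as a vector so that it is easy to test. The function name leafCodes is made up, and the indentation used in the usage example is an assumption about how the player list is nested.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical port of the loop above: collects the bracketed code of every
// entry that is not followed by a more deeply indented entry.
std::vector<std::string> leafCodes(const std::vector<std::string> &lines)
{
    // Group 1: whitespace after "--" (its length is the depth).
    // Group 2: text inside the brackets at the end of the line.
    static const std::regex re(R"re(--(\s+).*\((.+)\))re");

    std::vector<std::string> result;
    std::size_t lastDepth = 0;

    for (const std::string &line : lines) {
        std::smatch m;
        if (std::regex_match(line, m, re)) {
            const std::size_t newDepth = static_cast<std::size_t>(m[1].length());
            // Deeper than the previous entry: that entry is a parent, drop it.
            if (newDepth > lastDepth && !result.empty())
                result.pop_back();
            lastDepth = newDepth;
            result.push_back(m[2].str());
        } else if (line.compare(0, 3, "---") == 0) {
            break;  // a line of hyphens ends the header
        }
    }
    return result;
}
```

With the player list indented so that "Captain" sits under "England" and "KevinPietersen" under "Captain" (and likewise for Australia), this returns GJa, JA, KPa, AS, PC, AF, RP, AG, BH: every entry is stored as soon as it is seen and only removed if a deeper line follows, so the last entry no longer depends on a terminating line of hyphens.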

aurora

19th January 2012, 11:03

OK... Thank you once again, Chris... :)