Separate numbers from text



jaca
12th October 2011, 18:47
I need a regular expression that will separate the numbers from the text.



QStringList linesInt;
QRegExp rx("(\\d+)");
QString text = "5001 1001 5002 1002.5 Observation reason: 10 river and pond.";

int pos = 0;
while ((pos = rx.indexIn(text, pos)) != -1) {
    linesInt << rx.cap(1);
    pos += rx.matchedLength();
}
// Result: linesInt = (5001, 1001, 5002, 1002, 5, 10)
// Wanted: linesInt = (5001, 1001, 5002, 1002.5)
// or:     linesInt = (5001, 1001, 5002, 1002.5, Observation reason: 10 river and pond.)


The code below gives me the result I expect, but I would still have to test whether the first 4 fields are numbers.


QStringList linesInt;
QRegExp rx("\t");
QString text = "5001\t1001\t5002\t1002.5\tObservation reason: 10 river and pond.";
linesInt = text.split(rx, QString::SkipEmptyParts);
//linesInt = (5001, 1001, 5002, 1002.5, Observation reason: 10 river and pond.)


Could someone suggest a QRegExp for this?

stampede
12th October 2011, 21:32
A simple improvement of your first regexp:

QRegExp rx("(\\d+\\.?\\d*)");
but it will catch the last "10" as well (and numbers like "1000." too).

jaca
12th October 2011, 22:24
Thanks stampede!

You are right: it also catches the last "10" and numbers like "1000.". I have been trying, but have not found a solution yet.
The other problem is isolating the text part of the string.

stampede
12th October 2011, 22:33
I think catching strings like "1000." is actually a good thing. For example, in C, 100. is as good as 100.0
You can fix this by replacing the '*' with '+' if you don't like it (though the pattern then needs at least two digits to match, so a lone "5" is skipped).

ChrisW67
13th October 2011, 06:28
Or

QRegExp rx("(\\d+(\\.\\d+)?)");

will capture the dot in a number if it is followed by other digits but not otherwise. It really depends on what you need.
You hint that the sample string does not cover all of your possible inputs (the number list may vary in length). You could do something like:


QRegExp rx("^((?:(?:\\d+(?:\\.\\d+)?)\\s+)*)(.*)");

and then split group 1 on white space to get only the leading numbers. Group 2 contains the trailing remainder.
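
The three number patterns suggested so far differ only in how they treat a trailing dot and single digits. As a rough sketch, the helper below collects every match of a pattern so the variants can be compared side by side. It uses std::regex from the standard library in place of QRegExp so that it runs without Qt, and findAll is a hypothetical name, not something from the thread.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every non-overlapping match of `pattern` in `text`, left to right.
// Assumption: std::regex (ECMAScript syntax) stands in for QRegExp here.
std::vector<std::string> findAll(const std::string& text, const std::string& pattern) {
    const std::regex rx(pattern);
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

For the input "7 1000. 1002.5", the pattern \d+\.?\d* yields 7, 1000. and 1002.5; the pattern \d+\.?\d+ yields 1000 and 1002.5 (the lone 7 is lost); and \d+(\.\d+)? yields 7, 1000 and 1002.5.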

jaca
13th October 2011, 13:55
Thanks ChrisW67,

I used the code below and it worked well. But the problem remains with values like "1000.": the dot is not dropped, and the value is not treated as a real number.


QStringList lines, linesInt;
QString linesStr;
QRegExp rxIn("^((?:(?:\\d+(?:\\.\\d+)?)\\s+)*)(.*)");
int pos = 0;
while ((pos = rxIn.indexIn(lines[i],pos)) != -1){
linesInt << rxIn.cap(1).split("\t",QString::SkipEmptyParts);
linesStr = rxIn.cap(2);
pos += rxIn.matchedLength();
}

wysota
13th October 2011, 16:01
Unless there is a good reason not to, I would suggest parsing the text manually instead of using regular expressions. You seem to be focused on a single class of characters (digits plus the dot), so a regular expression will not be much faster (if at all) than parsing the string by hand, and you will save a lot of time otherwise spent getting the expression right. You can even use some kind of parser generator if you want.
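
One possible shape for such a manual parser, as a minimal sketch only: it walks the line token by token, keeps leading tokens that look like plain numbers, and returns the rest untouched. It is written against the C++ standard library rather than QString so that it runs without Qt. ParsedLine, isPlainNumber and splitLeadingNumbers are hypothetical names, and the rule that "500." does not count as a number is an assumption chosen here for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

struct ParsedLine {
    std::vector<std::string> numbers;  // leading numeric tokens, kept as text
    std::string remainder;             // everything from the first non-number on
};

// True if `tok` is one or more digits, optionally followed by '.' and more digits.
// Assumption: a trailing dot ("500.") is rejected.
bool isPlainNumber(const std::string& tok) {
    std::size_t i = 0;
    auto digits = [&] {
        const std::size_t start = i;
        while (i < tok.size() && std::isdigit(static_cast<unsigned char>(tok[i]))) ++i;
        return i > start;  // at least one digit consumed
    };
    if (!digits()) return false;
    if (i == tok.size()) return true;
    if (tok[i] != '.') return false;
    ++i;
    return digits() && i == tok.size();
}

ParsedLine splitLeadingNumbers(const std::string& line) {
    ParsedLine result;
    std::size_t pos = 0;
    while (pos < line.size()) {
        // Skip whitespace (spaces or tabs) before the next token.
        while (pos < line.size() && std::isspace(static_cast<unsigned char>(line[pos]))) ++pos;
        std::size_t end = pos;
        while (end < line.size() && !std::isspace(static_cast<unsigned char>(line[end]))) ++end;
        const std::string tok = line.substr(pos, end - pos);
        if (!isPlainNumber(tok)) break;  // first non-number ends the numeric prefix
        result.numbers.push_back(tok);
        pos = end;
    }
    result.remainder = line.substr(pos);
    return result;
}
```

Accepting "500." as well would only require relaxing isPlainNumber, which is the kind of change that tends to be easier here than reworking a regular expression.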

ChrisW67
14th October 2011, 00:07
You wrote that the code worked well, but you have misused the anchored regular expression. You should match it against each line once and then split the first capture group. There is no while() loop required.


QStringList lines;
lines
    << "5001 1001 5002 1002.5 Observation reason: 10 river and pond."
    << "500. 1001 5002 1002.7 Observation reason: 20 lake or dam"
    << "5001 100. 5002 1002.7 Observation reason: 20 lake or dam"
    << "30.1 30.1 30 Observation reason: 10 river and pond.";

QRegExp rx("^((?:(?:\\d+(?:\\.\\d+)?)\\s+)*)(.*)");
// QRegExp rx("^((?:(?:\\d+\\.?\\d*)\\s+)*)(.*)");
foreach (QString line, lines) {
    if (rx.indexIn(line) != -1) {
        qDebug() << "Numbers:" << rx.cap(1).split(' ', QString::SkipEmptyParts) << "Remainder" << rx.cap(2);
    } else {
        qDebug() << "No match:" << line;
    }
}

The expression treats "500." as not-a-number and stops capturing at the first non-number. The commented expression includes these as numbers but you get the dot which is fine in most circumstances.
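
To make that behaviour concrete without a Qt build, here is a rough port of the uncommented expression to std::regex. This is a sketch under the assumption that the ECMAScript syntax used by std::regex handles this pattern the same way QRegExp does; matchLine is a hypothetical helper name.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Match the anchored pattern once per line: group 1 is the run of leading
// numbers, group 2 is whatever follows. Returns {numbers, remainder}.
std::pair<std::vector<std::string>, std::string> matchLine(const std::string& line) {
    static const std::regex rx(R"(^((?:(?:\d+(?:\.\d+)?)\s+)*)(.*))");
    std::vector<std::string> numbers;
    std::smatch m;
    if (!std::regex_search(line, m, rx)) {
        return {numbers, line};  // no match: hand the line back unchanged
    }
    // Split group 1 on whitespace; operator>> skips empty fields.
    std::istringstream fields(m[1].str());
    for (std::string tok; fields >> tok;) {
        numbers.push_back(tok);
    }
    return {numbers, m[2].str()};
}
```

With the four sample lines above, the second line yields no numbers at all because "500." ends the numeric prefix immediately, and the third yields only 5001 with "100. 5002 1002.7 Observation reason: 20 lake or dam" as the remainder.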

I agree, by the way, with wysota that the problem is probably easier to handle manually.