How do I use QRegExp to split an expression



lni
12th January 2016, 10:01
Hi,

I need to split a string such as "Stage1 <= 4.4e-05 || Stage == 1.2 && Comp >= 1.4e+03 || A+e-C > D" to get all variables in the expression.

In the example string, the result should be "Stage1", "4.4e-05", "Stage", "1.2", "Comp", "1.4e+03", "A", "e", "C", "D"

I am using QRegExp rx( "[+\\-*/(),<>&=| ]" ) to split it.

However, it also splits "4.4e-05" and "1.4e+03". How can I write the QRegExp so that it splits the string without breaking scientific notation?

Thanks!


Here is sample code




#include <QStringList>
#include <QRegExp>
#include <QDebug>

int main()
{
    QString str( "Stage1 <= 4.4e-05 || Stage == 1.2 && Comp >= 1.4e+03 || A+e-C > D" );
    qDebug() << "str =" << str;

    // Split on operators, parentheses, commas and spaces.
    QRegExp rx( "[+\\-*/(),<>&=| ]" );
    QStringList strList = str.split( rx, QString::SkipEmptyParts );

    qDebug() << "strList =" << strList;
}

d_stranz
13th January 2016, 05:27
You could try making two passes. In the first pass, do not use "+" or "-" in your reg exp. Take the string list result from pass 1 and examine each entry to see if it matches a reg exp for a number (you can search online for suitable regular expressions). If it matches, keep it. If not, then submit the substring to a second pass that splits on your original reg exp (the rx in your sample code).

It is hard to write a single regular expression that will match arbitrary string expressions in a single pass. This is why the lex / yacc and flex / bison tools exist.
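
A minimal sketch of the two-pass idea, written with std::regex from the standard library rather than QRegExp so that it compiles without Qt. The names splitSkipEmpty and twoPassSplit and the number pattern are assumptions made for illustration, not taken from the code in this thread.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split `text` on every match of `sep`, dropping empty pieces
// (similar in spirit to QString::split with SkipEmptyParts).
static std::vector<std::string> splitSkipEmpty(const std::string& text,
                                               const std::regex& sep) {
    std::vector<std::string> parts;
    std::sregex_token_iterator it(text.begin(), text.end(), sep, -1), end;
    for (; it != end; ++it) {
        if (it->length() > 0) parts.push_back(it->str());
    }
    return parts;
}

// Hypothetical helper: pass 1 splits on everything except '+' and '-';
// pieces that are not plain numbers go through pass 2 with the full set.
std::vector<std::string> twoPassSplit(const std::string& expr) {
    static const std::regex pass1Sep(R"([*/(),<>&=| ])");
    static const std::regex pass2Sep(R"([+\-*/(),<>&=| ])");
    // Assumed number pattern: digits, optional fraction, optional exponent.
    static const std::regex number(R"((?:\d+\.?\d*|\.\d+)(?:[eE][+\-]?\d+)?)");

    std::vector<std::string> tokens;
    for (const std::string& piece : splitSkipEmpty(expr, pass1Sep)) {
        if (std::regex_match(piece, number)) {
            tokens.push_back(piece);  // a complete number: keep it whole
        } else {
            for (const std::string& sub : splitSkipEmpty(piece, pass2Sep)) {
                tokens.push_back(sub);
            }
        }
    }
    return tokens;
}
```

With spaces around the operators, "Stage1 <= 4.4e-05" yields "Stage1" and "4.4e-05". A number glued to an operator, as in "aa+4.4e-05", is not recognized as a number in pass 1, so pass 2 still breaks it apart.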

lni
13th January 2016, 06:00
I did use two passes, but it still fails. I used the following string as a test:
"a-aa+bb+4.4e-05-1.2e+2"

It should split into "a", "aa", "bb", "4.4e-05", "1.2e+2", but it doesn't. Please help.

Here is my code


#include <QStringList>
#include <QRegExp>
#include <QDebug>

static QStringList getFormulaVarList( const QString& formula )
{
QString::SplitBehavior behavior = QString::SkipEmptyParts;

QRegExp opRx( "[*/()<>&=| ]" );
QRegExp plusMinus( "[+\\-]" );

QStringList strList;
foreach( const QString& str, formula.split( opRx, behavior ) ) {
bool ok;
str.toDouble( &ok );
if ( !ok ) {
strList << str.split( plusMinus, behavior );
}
}
strList.removeDuplicates();

QStringList result;
foreach( const QString& str, strList ) {
bool ok;
str.toDouble( &ok );
if ( !ok && !str.startsWith( "math.", Qt::CaseInsensitive ) ) {
result << str;
}
}
result.removeDuplicates();

return result;

} // getFormulaVarList

int main( int argc, char** argv )
{
//QString formula( "Stage1 <= 4.4e-05 || Stage == 1.2 && Comp >= 1.4e+03 || A+e-C > D" );
//QString formula( "a * aa+4.4e-05 + math.log( b )" );
//QString formula( "a * aa + 4.4e-05 + math.log( b )" );
QString formula( argv[ 1 ] );

qDebug() << "formula =" << formula;

qDebug() << "items =" << getFormulaVarList( formula );
}

d_stranz
13th January 2016, 06:26
I think you should read the documentation on QRegularExpression and especially read the Perl tutorial (http://perldoc.perl.org/perlretut.html) linked to in that doc. Your regular expressions are much too simple to match the kind of strings you have as input, and I do not think you can do it with a single regular expression.


"a-aa+bb+4.4e-05-1.2e+2"

You might also read this (http://www.technical-recipes.com/2011/a-mathematical-expression-parser-in-java-and-cpp/).
You can see that even parsing simple expressions like this one takes a lot of code to recognize the symbols, operators, and constants and convert that into tokens.

lni
13th January 2016, 07:48
That code doesn't seem to be right.

I downloaded the C/C++ code and tested it with "4.4e-05 + 1"; it gives
Result = -0.6
.....

The problem is to split on "+" and "-" when they are math operators, but not when they are part of scientific notation such as "4.4e-05".

Added after 48 minutes:

I used 3 passes and it appears to work:



#include <QStringList>
#include <QRegExp>
#include <QDebug>

static QStringList getFormulaVarList( const QString& formula )
{
QString::SplitBehavior behavior = QString::SkipEmptyParts;

// split the formula in 3 passes
// note: [eE], not [e|E]; inside a character class '|' is a literal character
QRegExp noScientificRx( "\\d+[eE][+-]\\d+" );
QRegExp opRx( "[*/()<>&=| ]" );
QRegExp plusMinusRx( "[+\\-]" );

QStringList varList;
// pass 1 - remove scientific notation
foreach( const QString& scientific, formula.split( noScientificRx, behavior ) ) {
// pass 2 - remove math operator
foreach( const QString& str, scientific.split( opRx, behavior ) ) {
bool ok;
str.toDouble( &ok );
if ( !ok ) {
// pass 3 - remove +/-
varList << str.split( plusMinusRx, behavior );
}
}
}
varList.removeDuplicates();

QStringList result;
foreach( const QString& str, varList ) {
// finally remove numbers and math objects
bool ok;
str.toDouble( &ok );
if ( !ok && !str.startsWith( "math.", Qt::CaseInsensitive ) ) {
result << str;
}
}
result.removeDuplicates();

return result;

} // getFormulaVarList

int main( int argc, char** argv )
{
//QString formula( "Stage1 <= 4.4e-05 || Stage == 1.2 && Comp >= 1.4e+03 || A+e-C > D" );
//QString formula( "a * aa+4.4e-05 + math.log( b )" );
//QString formula( argv[ 1 ] );
QString formula;
if ( argc == 1 ) {
formula = "a*aa+4.4e-05+math.log( b )";
} else {
formula = argv[ 1 ];
}

qDebug() << "formula =" << formula;
qDebug() << "variables =" << getFormulaVarList( formula );
}


Added after 5 minutes:

I reduced it to 2 passes:



#include <QStringList>
#include <QRegExp>
#include <QDebug>

static QStringList getFormulaVarList( const QString& formula )
{
    QString::SplitBehavior behavior = QString::SkipEmptyParts;

    // split the formula in 2 passes
    // note: [eE], not [e|E]; inside a character class '|' is a literal character
    QRegExp scientificRx( "\\d+[eE][+-]\\d+" );
    QRegExp opRx( "[*/()<>&=|+\\- ]" );

    QStringList varList;
    // pass 1 - remove scientific notation
    foreach( const QString& noScientific,
             formula.split( scientificRx, behavior ) ) {
        // pass 2 - remove math operators
        foreach( const QString& str,
                 noScientific.split( opRx, behavior ) ) {
            bool ok;
            str.toDouble( &ok );
            if ( !ok ) {
                varList << str;
            }
        }
    }
    varList.removeDuplicates();

    QStringList result;
    foreach( const QString& str, varList ) {
        // finally remove numbers and math objects
        bool ok;
        str.toDouble( &ok );
        if ( !ok && !str.startsWith( "math.", Qt::CaseInsensitive ) ) {
            result << str;
        }
    }
    result.removeDuplicates();

    return result;

} // getFormulaVarList

int main( int argc, char** argv )
{
    //QString formula( "Stage1 <= 4.4e-05 || Stage == 1.2 && Comp >= 1.4e+03 || A+e-C > D" );
    //QString formula( "a * aa+4.4e-05 + math.log( b )" );
    //QString formula( argv[ 1 ] );
    QString formula;
    if ( argc == 1 ) {
        formula = "a*aa+4.4e-05+math.log( b )";
    } else {
        formula = argv[ 1 ];
    }

    qDebug() << "formula =" << formula;
    qDebug() << "variables =" << getFormulaVarList( formula );
}

d_stranz
13th January 2016, 17:22
//QString formula( "Stage1 <= 4.4e-05 || Stage == 1.2 && Comp >= 1.4e+03 || A+e-C > D" );
//QString formula( "a * aa+4.4e-05 + math.log( b )" );

Well, good for you. What is the output of the final qDebug() line (the one that prints "variables =") for these two inputs?

lni
14th January 2016, 09:55
"Stage1 <= 4.4e-05 || Stage == 1.2 && Comp >= 1.4e+03 || A+e-C > D"
gives ("Stage1", "Stage", "Comp", "A", "e", "C", "D")

"a * aa+4.4e-05 + math.log( b )"
gives ("a", "aa", "b")

This is what I need. I need all variables in the formula so I can give those inputs to the script engine.

yeye_olive
14th January 2016, 11:11
Forget QRegExp and QRegularExpression and spend a few tens of minutes to learn a real lexer generator such as Flex (or Flex++).

yeye_olive
14th January 2016, 15:03
I'm having a change of heart; I still recommend that you use a lexer generator, but here is a solution based on QRegularExpression.

The thing to pay attention to is that, even if you are interested in identifiers only, you have to parse floating-point numbers too, because you need to identify which occurrences of "e" belong to a number and which are part of an identifier.

Anyway, here is a function that prints all the numbers and identifiers in the string s:

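void printMatches(const QString &s) {
    // Alternation: an identifier, or a number with an optional exponent.
    // The identifier tail includes '_' so that names like "my_var" stay whole.
    static QRegularExpression varOrNumMatcher("[a-zA-Z_][a-zA-Z_\\d]*|(?:\\.\\d+|\\d+\\.?\\d*)(?:[eE][+-]?\\d+)?");
    QRegularExpressionMatchIterator i = varOrNumMatcher.globalMatch(s);
    while (i.hasNext())
        qDebug() << i.next().capturedRef();
}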
Identifiers are made of underscores, ASCII letters and digits, and must not begin with a digit.
Numbers have an optional integer part, an optional fractional part, and an optional exponent (introduced by 'e' or 'E' and an optional sign).
This simplistic lexer does not parse negative numbers, and does not recognize built-in identifiers like "math" and "log", from your last example.
You'll have to take it from here.
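
One possible way to take it from there, as a sketch only: lex identifiers and numbers with a single pattern, then keep just the identifiers. It uses std::regex instead of QRegularExpression so that it builds without Qt. The function name extractVariables, the dotted-name extension of the identifier rule, and the choice to skip anything starting with "math." are assumptions for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the distinct variable names in `formula`,
// in order of first appearance. Numbers are matched so that the 'e' of an
// exponent is consumed as part of the number, then they are discarded.
std::vector<std::string> extractVariables(const std::string& formula) {
    static const std::regex tokenRx(
        // identifier, optionally dotted (assumption, so "math.log" is one token)
        R"([a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*)"
        // or an unsigned number with optional fraction and exponent
        R"(|(?:\.\d+|\d+\.?\d*)(?:[eE][+\-]?\d+)?)");

    std::vector<std::string> vars;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(formula.begin(), formula.end(), tokenRx);
         it != end; ++it) {
        const std::string tok = it->str();
        const unsigned char first = static_cast<unsigned char>(tok[0]);
        if (std::isdigit(first) || first == '.') continue;  // a number
        if (tok.compare(0, 5, "math.") == 0) continue;      // assumed built-in
        if (std::find(vars.begin(), vars.end(), tok) == vars.end()) {
            vars.push_back(tok);
        }
    }
    return vars;
}
```

For "a * aa+4.4e-05 + math.log( b )" this returns "a", "aa", "b". Because the number alternative swallows "4.4e-05" whole, the standalone "e" in "A+e-C" is still reported as a variable.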

Notice that a regular expression is OK for lexing, but cannot help you parse recursive expressions (such as arithmetic expressions with parentheses). You need a context-free grammar for that. See Flex and Bison.

d_stranz
14th January 2016, 16:47
I'm having a change of heart; I still recommend that you use a lexer generator

That was my first thought, but there is so much overhead to learning those tools that the OP might as well spend the time using regular expressions in Qt. Agreed that if the expressions ever get more complex than the examples posted or if there are precedence rules / recursive expressions, a grammar will be needed. It appears that all the OP needs at this point are the identifiers and not an expression tree for evaluation.