Regex to filter words containing only English letters [Archive]

dipeshtech

31st March 2011, 17:18

I have text like the sample below. I want to keep only the words made up of English letters (i.e. no special characters, no operators, no quotes, etc.).

As output, I expect words like this:

Input:

Buy 1 lakh SMS at 5p/SMS & Get 1lakh Data & keyword for 2 months free. Pay Rs.2000 more & Get a Dynamic 15 pages WebSite. Call 9811968238.

Output:

Buy lakh SMS at SMS Get lakh Data keyword for months free Pay Rs more Get Dynamic pages WebSite Call

// Output file
QFile file("E:\\SMS\\dout.csv");
file.open(QIODevice::ReadWrite | QIODevice::Text);
QTextStream out(&file);
out << "This file is generated by Qt\n\n\n";

// Input file
QFile file2("E:\\SMS\\CCHECK.txt");
file2.open(QIODevice::ReadWrite | QIODevice::Text);
QTextStream cc_in(&file2);

QString chk_ln = cc_in.readLine();

while (!chk_ln.isNull())
{
    // Problem is in the line below
    QStringList list2 = chk_ln.split(QRegExp("\\W+"), QString::SkipEmptyParts);

    for (int i = 0; i < list2.size(); ++i)
    {
        out << list2[i] << " ";
    }
    out << "\n";

    chk_ln = cc_in.readLine();
}

Any help is appreciated !!

wysota

31st March 2011, 20:35

QFile in(...);
if (!in.open(...)) ...;
QFile out(...);
if (!out.open(...)) ...;
char c;
QChar ch;
while (!in.atEnd()) {
    in.getChar(&c);
    ch = c;
    // copy only whitespace and letters to the output
    if (ch.isSpace() || ch.isLetter())
        out.putChar(c);
}
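
For readers without Qt at hand, the same per-character filter can be sketched in plain standard C++. This is only an illustration under assumptions: the function name keepLettersAndSpaces is made up here, and std::isalpha/std::isspace stand in for QChar::isLetter()/QChar::isSpace() (in the default "C" locale they cover ASCII only).

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): keeps letters and whitespace,
// drops every other character, mirroring the isSpace()/isLetter() test above.
std::string keepLettersAndSpaces(const std::string& input)
{
    std::string result;
    result.reserve(input.size());
    for (char c : input) {
        // <cctype> functions require a value representable as unsigned char
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalpha(uc) || std::isspace(uc)) {
            result.push_back(c);
        }
    }
    return result;
}
```

Note that dropping characters one at a time can glue neighbours together: "Pay Rs.2000 more" becomes "Pay Rs more", but "5p/SMS" becomes "pSMS", so this sketch does not by itself produce the word list asked for in the first post.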

dipeshtech

31st March 2011, 20:42

Thanks...!!

I got it done like this:

QString temp;

while (!chk_ln.isNull())
{
    QStringList list2 = chk_ln.split(QRegExp("\\W+"), QString::SkipEmptyParts);

    for (int i = 0; i < list2.size(); ++i)
    {
        temp = list2[i];                      // added
        if (temp.contains(QRegExp("[0-9]")))  // added
        { continue; }                         // added
        out << list2[i] << " ";
    }
    out << "\n";

    chk_ln = cc_in.readLine();
}

But, thanks for answering!

SixDegrees

31st March 2011, 22:04

What happens when your input contains punctuation marks?

I think you want something along the lines of /[a-zA-Z]+/, or a simple check of the ASCII/Unicode value range, which can take diacritical marks into account while excluding all control characters, punctuation marks and whitespace.
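
A minimal sketch of the letters-only pattern idea, written with std::regex from the standard library rather than QRegExp so it compiles without Qt. The function name extractLetterRuns is hypothetical, and the character class [A-Za-z] is an assumption that only unaccented ASCII letters count as "English".

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every maximal run of ASCII letters in `line`.
// Digits, punctuation and whitespace act as separators and are discarded.
std::vector<std::string> extractLetterRuns(const std::string& line)
{
    // Compiled once and reused across calls.
    static const std::regex letters("[A-Za-z]+");

    std::vector<std::string> words;
    for (std::sregex_iterator it(line.begin(), line.end(), letters), end;
         it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

With this approach a mixed token such as "1lakh" yields "lakh" instead of being dropped, which differs from the contains("[0-9]") filter posted earlier.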

dipeshtech

31st March 2011, 22:13

Yeah..!! You spotted it right, I am looking for exactly what you mentioned.

But with the solution from my previous post I do get the right answer even when punctuation marks are present.

I tested it.

wysota

31st March 2011, 22:53

Using a regular expression just to see if a single character is a digit is a really inefficient idea.

dipeshtech

31st March 2011, 23:08

Yeah..!! I agree it is a bit inefficient, but I didn't want to read character by character, so I tried this. I am not an expert and I am new to Qt, so please forgive my ignorance. Anyway, thanks for pointing it out. (I am still learning.)

wysota

31st March 2011, 23:13

Not writing a character-by-character loop yourself doesn't mean your code avoids one internally. It does read character by character, and it does so inefficiently: QString::contains() with a regexp that matches a single character is effectively matching one-character strings, so it performs the same per-character evaluation but more slowly, because it works on strings rather than on single characters. The fastest option is an ASCII value comparison:

if((c>='A' && c<='Z') || (c>='a' && c<='z')) character is ok;
On most architectures this compiles to only a handful of machine instructions. If you have a lot of comparisons to make, the difference in performance is significant.
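
As a hedged sketch, the comparison can be wrapped in a small helper (the name isAsciiLetter is made up for illustration). It tests the two letter ranges separately, because in ASCII the six codes between 'Z' and 'a' are the characters [ \ ] ^ _ and the backtick, which a single 'A'..'z' range would let through.

```cpp
// Hypothetical helper: true only for the 52 unaccented ASCII letters.
// A single range check from 'A' to 'z' would also accept '[', '\\', ']',
// '^', '_' and '`', which sit between 'Z' and 'a' in the ASCII table.
inline bool isAsciiLetter(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
```

For example, isAsciiLetter('q') is true, while isAsciiLetter('_') and isAsciiLetter('5') are false.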

dipeshtech

31st March 2011, 23:25

Yeah!! I was thinking along these lines and had this point in mind:

doesn't internally read character by character

but I wasn't aware (actually ignorant) that it is slower than character-by-character evaluation. I need to process a large amount of data, and on my mobile device at that, once the algorithm is implemented. It would really make a difference for my processing.

Thanks a TON for explaining the fact VERY Clearly.

wysota

31st March 2011, 23:42

but, wasn't aware (actually ignorant) that it is slower than character by character evaluation.
You also have the overhead of compiling a state machine for the regular expression in every iteration of the loop. The least you can do is move the regular expression out of the loop so that it lives across iterations. Even so, a one-character regular expression is inefficient, except perhaps when the expression tests many classes of characters rather than just one as in your case, and definitely not when used through contains().
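
A minimal sketch of hoisting the expression out of the loop, using std::regex in place of QRegExp so it builds without Qt. The function name dropTokensWithDigits and the vector-of-strings interface are assumptions made for illustration; the point is only that the pattern is constructed once and then reused for every token.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the tokens that contain no digit.
std::vector<std::string> dropTokensWithDigits(const std::vector<std::string>& tokens)
{
    // Built once, before the loop, instead of once per token.
    const std::regex digit("[0-9]");

    std::vector<std::string> kept;
    for (const std::string& token : tokens) {
        if (std::regex_search(token, digit)) {
            continue;  // token contains at least one digit, skip it
        }
        kept.push_back(token);
    }
    return kept;
}
```

This removes only the repeated construction cost; a plain per-character digit test would still avoid the regular expression engine altogether.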

dipeshtech

31st March 2011, 23:52

I think I will go with simple character-by-character tokenization to filter the words. That should reduce the overhead.

wysota

1st April 2011, 00:09

The fastest approach I can think of is something similar to QStringRef: instead of copying data out of the original string in every iteration, record the position and length of every valid token, then extract those tokens from the string in one go at the end, possibly merging adjacent marks so that you can extract the largest possible ranges. That avoids the split, the data copying and other expensive operations. It also avoids a side effect of your current implementation, which doesn't preserve whitespace.
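
One way to read the mark-then-extract idea is sketched below in standard C++, with (start, length) pairs standing in for QStringRef. The names Span, isLetterChar, markLetterRuns and joinSpans are all hypothetical, and the sketch assumes ASCII input and joins tokens with a single space rather than merging marks or preserving the original whitespace.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (start, length) of a token inside the original string; no characters copied.
using Span = std::pair<std::size_t, std::size_t>;

inline bool isLetterChar(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Pass 1: record where each maximal run of letters starts and how long it is.
std::vector<Span> markLetterRuns(const std::string& text)
{
    std::vector<Span> spans;
    std::size_t i = 0;
    while (i < text.size()) {
        while (i < text.size() && !isLetterChar(text[i])) ++i;  // skip separators
        const std::size_t start = i;
        while (i < text.size() && isLetterChar(text[i])) ++i;   // consume letters
        if (i > start) spans.emplace_back(start, i - start);
    }
    return spans;
}

// Pass 2: copy the marked ranges out once, into a buffer sized up front.
std::string joinSpans(const std::string& text, const std::vector<Span>& spans)
{
    std::size_t total = 0;
    for (const Span& s : spans) total += s.second + 1;

    std::string out;
    out.reserve(total);
    for (const Span& s : spans) {
        if (!out.empty()) out += ' ';
        out.append(text, s.first, s.second);
    }
    return out;
}
```

For the input "Call 9811 now", markLetterRuns records the spans (0, 4) and (10, 3), and joinSpans then produces "Call now".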

dipeshtech

1st April 2011, 00:24

I have to try it out...!! Not getting it at present, but will try it for sure.

It's already dawn here, will try after some rest.