Are fast reg exps too much to ask? [Archive]

View Full Version : Are fast reg exps too much to ask?

fullmetalcoder

13th June 2007, 18:24

I'm building a generic text processing engines which creates parser from XML definitions and I'm facing a kinda big problem : reg exp entities are just SOOOOOOOOOOO SLOW...

You see, I knew that reg exp ate a lot of CPU time but I barely noticed a difference until I loaded a 1.2Mb file (15k lines). The cost of reg exp is anything but affordable :

RX enabled : processing lasts more than one minute
RX disabled : processing takes only 3 secondsIs there anyone knowing a smart trick to make reg exp decently fast or should I simply abandon them and try to write my own minimal RX-like text matching routines? :(

marcel

13th June 2007, 18:46

Well, if you're going to make your own engine: http://swtch.com/~rsc/regexp/regexp1.html

mm78

13th June 2007, 21:13

Is there anyone knowing a smart trick to make reg exp decently fast or should I simply abandon them and try to write my own minimal RX-like text matching routines? :(

Yeah, it's a bit slow, but you can run in trough your favorite profiling tool, identify the bottlenecks, optimize it and share the patch with us! :rolleyes:

wysota

13th June 2007, 22:08

Yeah, it's a bit slow, but you can run in trough your favorite profiling tool, identify the bottlenecks, optimize it and share the patch with us! :rolleyes:

The problem is not in the regexp engine but in the regexp itself. The link provided in the previous post points out the problem and suggests how to overcome it. It needs some knowledge about finite state automata and/or some exercises, but after some practice it's not that hard to provide a minimal regexp matching requirements. You have to convert your expressions to a automata, merge and minimize them and convert the result back to a regexp (or use the resulting finite state machine directly).

mm78

13th June 2007, 22:42

Interesting article :)

So far I just skimmed trough it, but I noticed that it's focused on pathological regular expressions, which are corner cases, being slow with Perl, OCRE, Python and Ruby. GNU awk, GNU grep, TCL and awk don't perform too bad compared with the Thomson implementation used.

I have no idea what algorithm Qt uses, it might be a horrible one or it might be a good one. Ohh, I found this (but so far only skimmed trough it as well): Seriously Weird QRegExp (http://doc.trolltech.com/qq/qq01-seriously-weird-qregexp.html). It might be outdated and not valid any more though.

(And yes, I did read the section about an algorithm being predictable, consistent and fast on all input. ;))

fullmetalcoder

14th June 2007, 15:49

Well, I think if I'm courageous enough I'll drop regexps and port my whole text processing engine to finite automatons because AFAIK a new approach could make it something like 10 times as fast (and maybe more actually). The downside is that I don't know much about finite automatons and state machines... So I'll probably have a hard time writing something working.... If anyone has good links about these topics please share them :)

marcel

14th June 2007, 19:12

Finite state machines are a discrete math concept. A practical approach can be seen in Computer Structure/Organization fields and logical circuit design.

As for software implementations of finite state machines I think the best example is a compiler. So you should look in that direction...

Hey, gcc is open source :). No, really, I think you can manage to find a good book on compiler theory.

Regards

wysota

14th June 2007, 19:37

You might want to take a look at parser generators such as bison (yacc) or simmilar.

fullmetalcoder

15th June 2007, 20:43

You might want to take a look at parser generators such as bison (yacc) or simmilar.
I really don't like to digg into someone else's code... And AFAIK I'm not the only one... You know it's always hard to find your way in a code that you've not written because you have no idea on how the author think so unless there are a dozen line of comments for each line of actual code it's pretty difficult to figure out how things work... ;)

But anyway I finally came to something that appears to work pretty fine. I first crafted a very minimal regexp-like NFA-based "parser" after a careful reading of the paper pointed out by marcel (thanks again for this great link :)) but it turned out to take about 8 seconds to parse my 16k lines rcc-generated file (you know, the kind of source code with 20 hex numbers on each line...) so I did a bit of brainstorming and decided to mix two methods : the NFA stuff I had and a tree-based stuff I thought about before NFA showed up. Basically I am using a nested QHash<> structure to match word tokens (limited by word boundaries) by doing nested lookups of chars encountered in the aforementioned structure until I found a token end marker. If no word token is found I fallback on the more permissive regexp-like approach (which is significantly faster than QRegExp but has less features).
My achievement so far is : about two seconds (depends on how busy th CPU is) to process that goddamn file. I'm not sure I can do better but I'm convinced that someone here would be able to find what to improve if anything can be ;)

Here comes the code (a minimal C++ syntax engine is constructed within it so try loading C(++) files if you want to test it...) :

marcel

15th June 2007, 20:55

I really don't like to digg into someone else's code... And AFAIK I'm not the only one... You know it's always hard to find your way in a code that you've not written because you have no idea on how the author think so unless there are a dozen line of comments for each line of actual code it's pretty difficult to figure out how things work... ;)

Just wait till you get a job.:)

fullmetalcoder

15th June 2007, 21:02

Just wait till you get a job.:)
Hopefully I should not get a job in computer sciences but (once I finish my studies) in chemistry instead. Aside from that I'm sure it would be pretty hard for me to run both Open source projects and work-related stuff. Indeed I'm willing to bet that I'd spend all my time in Open Source stuff and only the remaining (so not much... ;)) in work stuffs. But I'll lead a first experiment this summer because I got a little PHP job (I don't know anything at PHP BTW but I'll have to learn faaast...) during holidays.