Results 1 to 11 of 11

Thread: Are fast reg exps too much to ask?

  1. #1
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Are fast reg exps too much to ask?

    I'm building a generic text processing engines which creates parser from XML definitions and I'm facing a kinda big problem : reg exp entities are just SOOOOOOOOOOO SLOW...

    You see, I knew that reg exp ate a lot of CPU time but I barely noticed a difference until I loaded a 1.2Mb file (15k lines). The cost of reg exp is anything but affordable :
    • RX enabled : processing lasts more than one minute
    • RX disabled : processing takes only 3 seconds
    Is there anyone knowing a smart trick to make reg exp decently fast or should I simply abandon them and try to write my own minimal RX-like text matching routines?
    Current Qt projects : QCodeEdit, RotiDeCode

  2. #2
    Join Date
    Feb 2006
    Location
    Romania
    Posts
    2,744
    Thanks
    8
    Thanked 541 Times in 521 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Re: Are fast reg exps too much to ask?

    Well, if you're going to make your own engine: http://swtch.com/~rsc/regexp/regexp1.html

  3. The following user says thank you to marcel for this useful post:

    fullmetalcoder (14th June 2007)

  4. #3
    Join Date
    Oct 2006
    Posts
    42
    Thanks
    1
    Thanked 8 Times in 8 Posts
    Qt products
    Qt4
    Platforms
    Windows

    Default Re: Are fast reg exps too much to ask?

    Quote Originally Posted by fullmetalcoder View Post
    Is there anyone knowing a smart trick to make reg exp decently fast or should I simply abandon them and try to write my own minimal RX-like text matching routines?
    Yeah, it's a bit slow, but you can run in trough your favorite profiling tool, identify the bottlenecks, optimize it and share the patch with us!

  5. #4
    Join Date
    Jan 2006
    Location
    Warsaw, Poland
    Posts
    33,372
    Thanks
    3
    Thanked 5,019 Times in 4,795 Posts
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows Android Maemo/MeeGo
    Wiki edits
    10

    Default Re: Are fast reg exps too much to ask?

    Quote Originally Posted by mm78 View Post
    Yeah, it's a bit slow, but you can run in trough your favorite profiling tool, identify the bottlenecks, optimize it and share the patch with us!
    The problem is not in the regexp engine but in the regexp itself. The link provided in the previous post points out the problem and suggests how to overcome it. It needs some knowledge about finite state automata and/or some exercises, but after some practice it's not that hard to provide a minimal regexp matching requirements. You have to convert your expressions to a automata, merge and minimize them and convert the result back to a regexp (or use the resulting finite state machine directly).

  6. #5
    Join Date
    Oct 2006
    Posts
    42
    Thanks
    1
    Thanked 8 Times in 8 Posts
    Qt products
    Qt4
    Platforms
    Windows

    Default Re: Are fast reg exps too much to ask?

    Interesting article

    So far I just skimmed trough it, but I noticed that it's focused on pathological regular expressions, which are corner cases, being slow with Perl, OCRE, Python and Ruby. GNU awk, GNU grep, TCL and awk don't perform too bad compared with the Thomson implementation used.

    I have no idea what algorithm Qt uses, it might be a horrible one or it might be a good one. Ohh, I found this (but so far only skimmed trough it as well): Seriously Weird QRegExp. It might be outdated and not valid any more though.

    (And yes, I did read the section about an algorithm being predictable, consistent and fast on all input. )

  7. #6
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Re: Are fast reg exps too much to ask?

    Well, I think if I'm courageous enough I'll drop regexps and port my whole text processing engine to finite automatons because AFAIK a new approach could make it something like 10 times as fast (and maybe more actually). The downside is that I don't know much about finite automatons and state machines... So I'll probably have a hard time writing something working.... If anyone has good links about these topics please share them
    Current Qt projects : QCodeEdit, RotiDeCode

  8. #7
    Join Date
    Feb 2006
    Location
    Romania
    Posts
    2,744
    Thanks
    8
    Thanked 541 Times in 521 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Re: Are fast reg exps too much to ask?

    Finite state machines are a discrete math concept. A practical approach can be seen in Computer Structure/Organization fields and logical circuit design.

    As for software implementations of finite state machines I think the best example is a compiler. So you should look in that direction...

    Hey, gcc is open source . No, really, I think you can manage to find a good book on compiler theory.

    Regards

  9. #8
    Join Date
    Jan 2006
    Location
    Warsaw, Poland
    Posts
    33,372
    Thanks
    3
    Thanked 5,019 Times in 4,795 Posts
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows Android Maemo/MeeGo
    Wiki edits
    10

    Default Re: Are fast reg exps too much to ask?

    You might want to take a look at parser generators such as bison (yacc) or simmilar.

  10. #9
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Re: Are fast reg exps too much to ask?

    Quote Originally Posted by wysota View Post
    You might want to take a look at parser generators such as bison (yacc) or simmilar.
    I really don't like to digg into someone else's code... And AFAIK I'm not the only one... You know it's always hard to find your way in a code that you've not written because you have no idea on how the author think so unless there are a dozen line of comments for each line of actual code it's pretty difficult to figure out how things work...

    But anyway I finally came to something that appears to work pretty fine. I first crafted a very minimal regexp-like NFA-based "parser" after a careful reading of the paper pointed out by marcel (thanks again for this great link ) but it turned out to take about 8 seconds to parse my 16k lines rcc-generated file (you know, the kind of source code with 20 hex numbers on each line...) so I did a bit of brainstorming and decided to mix two methods : the NFA stuff I had and a tree-based stuff I thought about before NFA showed up. Basically I am using a nested QHash<> structure to match word tokens (limited by word boundaries) by doing nested lookups of chars encountered in the aforementioned structure until I found a token end marker. If no word token is found I fallback on the more permissive regexp-like approach (which is significantly faster than QRegExp but has less features).
    My achievement so far is : about two seconds (depends on how busy th CPU is) to process that goddamn file. I'm not sure I can do better but I'm convinced that someone here would be able to find what to improve if anything can be

    Here comes the code (a minimal C++ syntax engine is constructed within it so try loading C(++) files if you want to test it...) :
    Attached Files Attached Files
    Current Qt projects : QCodeEdit, RotiDeCode

  11. #10
    Join Date
    Feb 2006
    Location
    Romania
    Posts
    2,744
    Thanks
    8
    Thanked 541 Times in 521 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Re: Are fast reg exps too much to ask?

    I really don't like to digg into someone else's code... And AFAIK I'm not the only one... You know it's always hard to find your way in a code that you've not written because you have no idea on how the author think so unless there are a dozen line of comments for each line of actual code it's pretty difficult to figure out how things work...
    Just wait till you get a job.

  12. #11
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Re: Are fast reg exps too much to ask?

    Quote Originally Posted by marcel View Post
    Just wait till you get a job.
    Hopefully I should not get a job in computer sciences but (once I finish my studies) in chemistry instead. Aside from that I'm sure it would be pretty hard for me to run both Open source projects and work-related stuff. Indeed I'm willing to bet that I'd spend all my time in Open Source stuff and only the remaining (so not much... ) in work stuffs. But I'll lead a first experiment this summer because I got a little PHP job (I don't know anything at PHP BTW but I'll have to learn faaast...) during holidays.
    Current Qt projects : QCodeEdit, RotiDeCode

Similar Threads

  1. Replies: 12
    Last Post: 7th September 2011, 16:37
  2. Fast serialization/deserialization
    By fullmetalcoder in forum Qt Programming
    Replies: 22
    Last Post: 9th May 2007, 12:58
  3. fast writing of large amounts of data to files
    By TheKedge in forum Qt Programming
    Replies: 1
    Last Post: 13th February 2007, 16:33
  4. Replies: 6
    Last Post: 8th January 2007, 10:24
  5. Fast Keyboard, help !
    By Alex63 in forum Qt Programming
    Replies: 2
    Last Post: 27th June 2006, 18:18

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Qt is a trademark of The Qt Company.