Thread: ideas for text extraction/transformation spec

  1. #1
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    I am currently writing an engine for generic text extraction and transformation. I got the generic lexer and the generic parser to work but I'm a bit unsure about the design to use for the transformation layer.

    When the parser finds a match it notifies a handler with a match identifier and a list of matched tokens (some more data could be passed but at the moment I see no use for that).
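
    To make the notification concrete, here is a minimal sketch of what such a handler could look like. The names `Token`, `MatchHandler` and `RecordingHandler` are made up for illustration, and the token layout (an integer kind plus the matched text) is an assumption, not the engine's actual API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token type: the post only says a list of matched tokens is passed.
struct Token {
    int id;            // token kind assigned by the lexer (assumed to be an integer)
    std::string text;  // matched source text (assumed to be available)
};

// Hypothetical callback interface the parser notifies on every match.
class MatchHandler {
public:
    virtual ~MatchHandler() = default;
    virtual void matched(int matchId, const std::vector<Token>& tokens) = 0;
};

// Example handler: records "<matchId>: <token texts joined by spaces>".
class RecordingHandler : public MatchHandler {
public:
    void matched(int matchId, const std::vector<Token>& tokens) override {
        std::string joined;
        for (const Token& t : tokens) {
            if (!joined.empty()) joined += ' ';
            joined += t.text;
        }
        records.push_back(std::to_string(matchId) + ": " + joined);
    }

    std::vector<std::string> records;
};
```

    Whatever form the transformation spec ends up taking, it would probably sit behind an interface of roughly this shape, so the parser itself would not need to change.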

    Now the question is: how could the actions to perform upon a given match (taking the matched tokens into account of course) be specified?

    Ideally I'd want to avoid resorting to a full-blown scripting language as I want the engine to be as fast as possible. Currently the lexer can handle 1.4Mb of C++ in less than 100ms on my laptop, 40% of that being spent loading the data from disk to RAM, and the parser can extract the whole structure of qwidget.h in 5ms (a smaller file but much more complex to handle).

    Is there any spec for such text transformations or do I have to create my own? If so, do you have any ideas for creating something that would be relatively easy to implement and easy to understand and use for people not necessarily versed in programming?

    As a side note, here are two things I would expect that spec to be able to describe :

    • creation of a class tree from code
    • autogeneration of SIP bindings files from C++ sources
    Current Qt projects : QCodeEdit, RotiDeCode

  2. #2
    Join Date
    Mar 2007
    Location
    Germany
    Posts
    229
    Thanks
    2
    Thanked 29 Times in 28 Posts
    Qt products
    Qt4
    Platforms
    Windows

    I don't think my post will be useful to you, but perhaps you can clear up the confusion I was left with after reading your post.

    I didn't get what your "engine" can do. It is neither a lexer nor a parser that recognises "languages" (probably defined in (E)BNF).
    It can... well... extract (?) text and transform it into what?

    This sounds somewhat like your engine does something comparable to XSLT (transforming one XML document into another with a different XSL (Extensible Stylesheet Language)).

  3. #3
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    It reads a set of files which define a lexer and a parser. These files are basically a set of rules using a regexp-like syntax which get translated into automata.

    So strictly speaking it does not generate lexers/parsers, as they only extract the data they recognize and won't complain/abort due to syntax errors, contrary to what the usual LL/LR/LALR/GLR parsers do. Hence the use of "extract", which can be puzzling I admit, but is the best description I can give.
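
    To illustrate the "extract what you recognize, ignore the rest" behaviour, here is a crude sketch. It uses `std::regex` rather than the rule-generated automata described above, so it only shows the idea, not how the engine works; the function name `extractClassNames` and the pattern are invented for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collects the names that follow a `class` or `struct` keyword.
// Anything the pattern does not recognize is silently skipped, so malformed
// input never causes an error (unlike a strict LL/LR parser).
std::vector<std::string> extractClassNames(const std::string& source) {
    static const std::regex decl(R"(\b(?:class|struct)\s+([A-Za-z_]\w*))");
    std::vector<std::string> names;
    for (std::sregex_iterator it(source.begin(), source.end(), decl), end;
         it != end; ++it) {
        names.push_back((*it)[1].str());  // capture group 1 is the identifier
    }
    return names;
}
```

    A pattern this naive will also fire inside comments, string literals and `enum class` declarations, which is part of why a real lexer/parser pair is needed for anything serious.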

    Once that data is extracted it has to be processed in some way to create something else (so the operation as a whole can be perceived as a "transformation" but strictly speaking it would be better described as "generation controlled by extracted data").

    Ideally I would like the processing of the extracted data to be general purpose but if that is not possible I will focus on the specific goals I had when starting this project (the two I mentioned in my previous post).

    I hope this makes my point clearer.

  4. #4
    Join Date
    Jan 2006
    Location
    Warsaw, Poland
    Posts
    33,359
    Thanks
    3
    Thanked 5,015 Times in 4,792 Posts
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows Android Maemo/MeeGo
    Wiki edits
    10

    Quote Originally Posted by fullmetalcoder View Post
    Now the question is : how could the actions to perform upon a given match (taking the matched tokens into account of course) be specified?
    Maybe use an approach similar to delegates in item views (I don't remember the exact design pattern name for that)? Allow users to inject small objects with little pieces of code to manipulate tokens.

    Is there any spec for such text transformations or do I have to create my own?
    Well... there is XSL-T

    If so do you have any ideas to create something that would be relatively easy to implement and easy to understand and use for people not necessarily versed in programming.
    I'm currently working on a code generator based on XSD schemas and I'm using QtScript to provide means of data manipulation. The overall performance is not impressive but I didn't bother to do any optimizations as speed is not a priority for my use case.

    Apart from QtScript I'm using a simple replace mechanism - you feed the engine with a text template with variable placeholders and the engine substitutes them with their values.
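
    A minimal sketch of such a replace mechanism, assuming a `${name}` placeholder syntax (the actual syntax is not stated here, and `expandTemplate` is a made-up name):

```cpp
#include <cassert>
#include <map>
#include <string>

// Replaces every ${name} placeholder in `tmpl` with its value from `vars`.
// Unknown or unterminated placeholders are copied to the output unchanged.
std::string expandTemplate(const std::string& tmpl,
                           const std::map<std::string, std::string>& vars) {
    std::string out;
    std::size_t pos = 0;
    while (pos < tmpl.size()) {
        const std::size_t start = tmpl.find("${", pos);
        if (start == std::string::npos) {           // no more placeholders
            out.append(tmpl, pos, std::string::npos);
            break;
        }
        const std::size_t end = tmpl.find('}', start + 2);
        if (end == std::string::npos) {             // unterminated: keep verbatim
            out.append(tmpl, pos, std::string::npos);
            break;
        }
        out.append(tmpl, pos, start - pos);          // literal text before "${"
        const std::string name = tmpl.substr(start + 2, end - start - 2);
        const auto it = vars.find(name);
        if (it != vars.end())
            out += it->second;                       // known variable
        else
            out.append(tmpl, start, end - start + 1); // unknown: leave as-is
        pos = end + 1;
    }
    return out;
}
```

    For example, expanding `class ${name} : public ${base} {};` with `name` set to `Foo` and `base` set to `QObject` yields `class Foo : public QObject {};`.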

    creation of a class tree from code
    I have a nice C++ class generator. It works quite well, handles dependencies, creates accessors for variables, etc. It can still be improved significantly but I'd say it's a nice tool and I'm using it quite often.

    As for a mechanism you are looking for - try coming up with some kind of algebra for data manipulation so that one could write transformation rules in forms of formulas (I guess that's a bit similar approach to what XSL-T does).

  5. #5
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Quote Originally Posted by wysota View Post
    Maybe use an approach similar to delegates in item views (I don't remember the exact design pattern name for that)? Allow users to inject small objects with little pieces of code to manipulate tokens.
    Could you elaborate on this? I don't really see how it would work.

    Quote Originally Posted by wysota View Post
    Well... there is XSL-T
    XSL-T is a trimmed-down scripting language hidden in XML tags. I'd rather use a real scripting language from the start, or write my own specifically tailored set of primitives, and avoid cluttering them with angle brackets...

    Quote Originally Posted by wysota View Post
    I'm currently working on a code generator based on XSD schemas and I'm using QtScript to provide means of data manipulation. The overall performance is not impressive but I didn't bother to do any optimizations as speed is not a priority with my usecase.
    Quote Originally Posted by wysota View Post
    Apart from QtScript I'm using a simple replace mechanism - you feed the engine with a text template with variable placeholders and the engine substitutes them with their values.
    That was my original plan, with some special operators to select tokens or groups of tokens within the matches but it looked a bit cumbersome.

    Quote Originally Posted by wysota View Post
    I have a nice C++ class generator It works quite well, handles dependencies, creates accessors for variables, etc. It can still be improved significantly but I'd say it's a nice tool and I'm using it quite often.
    How does it generate classes? (or did you forget a "tree" somewhere?) Are the sources available?

    Quote Originally Posted by wysota View Post
    As for a mechanism you are looking for - try coming up with some kind of algebra for data manipulation so that one could write transformation rules in forms of formulas.
    This deserves some consideration but I'm not sure I have enough time to craft such a thing atm.

  6. #6
    Join Date
    Jan 2006
    Location
    Warsaw, Poland
    Posts
    33,359
    Thanks
    3
    Thanked 5,015 Times in 4,792 Posts
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows Android Maemo/MeeGo
    Wiki edits
    10

    Quote Originally Posted by fullmetalcoder View Post
    Could you elaborate this? I don't really see how it would work.
    You provide an interface with a set of methods that the "delegate" can access to get tokens and to push output values and let the user register a function (or rather an instance of a class) that will be called in certain conditions and which will do the manipulation. You can do it as a set of plugins so that you don't have to recompile the main parsing application if that's an issue.
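
    Roughly, and only as a sketch: `MatchContext`, `DelegateRegistry` and their methods are invented names, and tokens are assumed to be available as strings. A `std::function` is used so that either a plain function or an instance of a class with `operator()` can be registered.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// What a delegate is allowed to see: read access to the matched tokens
// and a way to push output values.
class MatchContext {
public:
    explicit MatchContext(std::vector<std::string> tokens)
        : tokens_(std::move(tokens)) {}

    std::size_t tokenCount() const { return tokens_.size(); }
    const std::string& token(std::size_t i) const { return tokens_.at(i); }
    void push(const std::string& text) { output_ += text; }
    const std::string& output() const { return output_; }

private:
    std::vector<std::string> tokens_;
    std::string output_;
};

using Delegate = std::function<void(MatchContext&)>;

// Maps a match identifier to the delegate that handles it.
class DelegateRegistry {
public:
    void registerDelegate(int matchId, Delegate d) {
        delegates_[matchId] = std::move(d);
    }

    // Calls the delegate registered for matchId; returns false if there is none.
    bool dispatch(int matchId, MatchContext& ctx) const {
        const auto it = delegates_.find(matchId);
        if (it == delegates_.end()) return false;
        it->second(ctx);
        return true;
    }

private:
    std::map<int, Delegate> delegates_;
};
```

    With plugins, each plugin would simply call something like `registerDelegate` when it is loaded.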

    How does it generate classes? (or did you forget a "tree" somewhere?)
    It generates single classes and allows adding and resolving dependencies to them so that all classes land in an appropriate order in the destination file(s). It's actually a very simple mechanism but works quite well.
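
    The ordering part can be pictured as a depth-first topological sort over the declared dependencies. This is a guess at one way to do it, not the generator's actual code; `DependencyGraph` and `emissionOrder` are names made up for the sketch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// class name -> names of the classes it depends on
using DependencyGraph = std::map<std::string, std::vector<std::string>>;

namespace detail {
// Depth-first visit: emit all dependencies of `name` before `name` itself.
void visit(const std::string& name, const DependencyGraph& deps,
           std::set<std::string>& done, std::set<std::string>& inProgress,
           std::vector<std::string>& order) {
    if (done.count(name)) return;
    if (!inProgress.insert(name).second)
        throw std::runtime_error("cyclic dependency involving " + name);
    const auto it = deps.find(name);
    if (it != deps.end())
        for (const std::string& dep : it->second)
            visit(dep, deps, done, inProgress, order);
    inProgress.erase(name);
    done.insert(name);
    order.push_back(name);
}
}  // namespace detail

// Returns the class names in an order where every class comes after
// everything it depends on. Throws std::runtime_error on a cycle.
std::vector<std::string> emissionOrder(const DependencyGraph& deps) {
    std::set<std::string> done, inProgress;
    std::vector<std::string> order;
    for (const auto& entry : deps)
        detail::visit(entry.first, deps, done, inProgress, order);
    return order;
}
```

    Classes that are only mentioned as dependencies (and never declared themselves) still appear in the result, so the caller can decide whether to emit an include or a forward declaration for them.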

    Are the sources available?
    Not at the moment but at some point I think we can think about opening its code.

  7. #7
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Quote Originally Posted by wysota View Post
    You provide an interface with a set of methods that the "delegate" can access to get tokens and to push output values and let the user register a function (or rather an instance of a class) that will be called in certain conditions and which will do the manipulation. You can do it as a set of plugins so that you don't have to recompile the main parsing application if that's an issue.
    Ok I see. This is very similar to what I already do but a bit more "elaborate": currently I have a Handler interface which has to be reimplemented. An instance of it is passed to the parser which feeds it with match data. I think I will keep this general "model" but my point is now to offer a way to define processing of the parsed data without being forced to write C++ code (well, I'll have to write some to implement that but the end user won't).

    Quote Originally Posted by wysota View Post
    It generates single classes and allows adding and resolving dependencies to them so that all classes land in an appropriate order in the destination file(s). It's actually a very simple mechanism but works quite well.

    Not at the moment but at some point I think we can think about opening its code.
    That sounds interesting but I am not sure I understand how and when you use this generator. Is this somewhat similar to generating code from UML descriptions?

  8. #8
    Join Date
    Jan 2006
    Location
    Warsaw, Poland
    Posts
    33,359
    Thanks
    3
    Thanked 5,015 Times in 4,792 Posts
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows Android Maemo/MeeGo
    Wiki edits
    10

    Quote Originally Posted by fullmetalcoder View Post
    That sounds interesting but I am not sure I understand how and when you use this generator. Is this somewhat similar to generating code from UML descriptions?
    For instance I have been using it to generate implicitly shared classes, data "payload" classes or widget implementations from XSD schemas. Of course the fact that I was using XSD has nothing to do with the class generator, it's just a data source that's fed to a parser which then calls the CPP generator.

  9. #9
    Join Date
    Jan 2006
    Location
    travelling
    Posts
    1,116
    Thanks
    8
    Thanked 127 Times in 121 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Just sharing my latest thoughts (comments welcome as always):

    • this "level" of the framework is supposed to generate tags/entries: (match identifier (e.g. variable declaration, namespace or whatever construct might have been matched), list of tokens (integers), associated list of productions) -> text
    • tag generation rules defined as S-expressions: easy to parse, excellent simplicity/flexibility ratio
    • basic operators available: all string operations provided by QString, maybe a few more tailored to the use case, plus several lookup operators to access the token/production data
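
    To give an idea of what such a rule could look like, here is a toy evaluator for rules like `(concat "class " (upper (token 1)) ";")`. Everything in it is an assumption made for the sketch: the operator names `concat`, `upper` and `token`, the class name `RuleEvaluator`, and the fact that tokens are looked up as plain strings rather than integers plus productions. It uses `std::string` instead of QString to stay self-contained.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Parses and evaluates one rule in a single pass.
// Assumed grammar: expr := "string" | atom | ( operator expr* )
class RuleEvaluator {
public:
    RuleEvaluator(std::string rule, std::vector<std::string> tokens)
        : rule_(std::move(rule)), tokens_(std::move(tokens)) {}

    std::string run() { pos_ = 0; return eval(); }

private:
    static bool isSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }
    void skipSpaces() {
        while (pos_ < rule_.size() && isSpace(rule_[pos_])) ++pos_;
    }
    // Reads a bare atom (operator name or integer) up to a space or parenthesis.
    std::string readAtom() {
        const std::size_t start = pos_;
        while (pos_ < rule_.size() && !isSpace(rule_[pos_]) &&
               rule_[pos_] != '(' && rule_[pos_] != ')')
            ++pos_;
        if (pos_ == start) throw std::runtime_error("expected an atom");
        return rule_.substr(start, pos_ - start);
    }
    std::string eval() {
        skipSpaces();
        if (pos_ >= rule_.size()) throw std::runtime_error("unexpected end of rule");
        if (rule_[pos_] == '"') {  // string literal, no escape sequences
            const std::size_t end = rule_.find('"', pos_ + 1);
            if (end == std::string::npos) throw std::runtime_error("unterminated string");
            std::string s = rule_.substr(pos_ + 1, end - pos_ - 1);
            pos_ = end + 1;
            return s;
        }
        if (rule_[pos_] != '(') return readAtom();
        ++pos_;  // consume '('
        skipSpaces();
        const std::string op = readAtom();
        std::vector<std::string> args;
        for (;;) {
            skipSpaces();
            if (pos_ >= rule_.size()) throw std::runtime_error("missing ')'");
            if (rule_[pos_] == ')') { ++pos_; break; }
            args.push_back(eval());  // arguments are evaluated eagerly
        }
        return apply(op, args);
    }
    std::string apply(const std::string& op,
                      const std::vector<std::string>& args) const {
        if (op == "concat") {
            std::string out;
            for (const std::string& a : args) out += a;
            return out;
        }
        if (op == "upper" && args.size() == 1) {
            std::string out = args[0];
            for (char& c : out)
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            return out;
        }
        if (op == "token" && args.size() == 1)  // lookup operator: token by index
            return tokens_.at(std::stoul(args[0]));
        throw std::runtime_error("unknown operator or wrong arity: " + op);
    }

    std::string rule_;
    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};
```

    Since every sub-expression just returns a string, adding a new operator amounts to one more branch in `apply`; this sketch does nothing about named intermediate results.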

    I have yet to figure out an elegant way to allow creation of "intermediate" results.

    Note: I have not started implementing anything yet. These are just ideas and will remain so until my exams end.
