ideas for text extraction/transformation spec

**fullmetalcoder** · 19th February 2009, 16:49

I am currently writing an engine for generic text extraction and transformation. I got the generic lexer and the generic parser to work but I'm a bit unsure about the design to use for the transformation layer.

When the parser finds a match it notifies a handler with a match identifier and a list of matched tokens (some more data could be passed but at the moment I see no use for that).
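To make the notification step concrete, here is a minimal sketch of what such a callback could look like. It is written in Python for brevity rather than in the engine's C++, and every name in it (`Handler`, `on_match`, `CollectingHandler`, the `"variable_decl"` identifier) is an assumption made for illustration, not the engine's actual API.

```python
from abc import ABC, abstractmethod


class Handler(ABC):
    """Hypothetical callback interface: the parser calls on_match per match."""

    @abstractmethod
    def on_match(self, match_id: str, tokens: list[str]) -> None:
        ...


class CollectingHandler(Handler):
    """Records every match so that a later stage can process it."""

    def __init__(self) -> None:
        self.matches: list[tuple[str, list[str]]] = []

    def on_match(self, match_id: str, tokens: list[str]) -> None:
        # Copy the token list so later parser activity cannot alter it.
        self.matches.append((match_id, list(tokens)))


handler = CollectingHandler()
# A real parser would issue this call itself; it is invoked by hand here.
handler.on_match("variable_decl", ["int", "x", ";"])
```

The open question in this thread is what should replace the hand-written body of `on_match` so that end users do not have to write code themselves.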

Now the question is: how could the actions to perform upon a given match (taking the matched tokens into account of course) be specified?

Ideally I'd want to avoid resorting to a full-blown scripting language as I want the engine to be as fast as possible. Currently the lexer can handle 1.4 MB of C++ in less than 100 ms on my laptop, 40% of that being spent loading the data from disk to RAM, and the parser can extract the whole structure of qwidget.h in 5 ms (a smaller file but much more complex to handle).

Is there any spec for such text transformations or do I have to create my own? If so, do you have any ideas for creating something that would be relatively easy to implement and easy to understand and use for people not necessarily versed in programming?

As a side note, here are two things I would expect that spec to be able to describe:

- creation of a class tree from code
- autogeneration of SIP bindings files from C++ sources

**Boron** · 19th February 2009, 18:33

I don't think my post will be useful to you, but perhaps you can clarify my confusion after I have read your post.

I didn't get what your "engine" can do. It is neither a lexer nor a parser to recognise "languages" (probably defined in (E)BNF).
It can... well... extract (?) text and transform it into what?

This sounds somewhat like your engine can do something comparable to XSLT (transforming one XML document into another with a different XSL (Extensible Stylesheet Language)).

**fullmetalcoder** · 19th February 2009, 21:02

It reads a set of files which define a lexer and a parser. These files are basically a set of rules using a regexp-like syntax which get translated into automata.

So strictly speaking it does not generate lexers/parsers, as they only extract the data they recognize and won't complain/abort due to syntax errors, contrary to what usual LL/LR/LALR/GLR parsers do. Hence the use of "extract", which can be puzzling I admit but is the best description I can give.
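As a rough illustration of "extract what is recognized, ignore the rest", the sketch below uses a single regular expression instead of the engine's rule files and automata. The rule `CLASS_RULE` and the function `extract_class_names` are hypothetical, and the rule is far too naive for real C++ (it would also match `template <class T>`, for example).

```python
import re

# Hypothetical rule: a class or struct keyword followed by an identifier.
CLASS_RULE = re.compile(r"\b(?:class|struct)\s+([A-Za-z_]\w*)")


def extract_class_names(source: str) -> list[str]:
    """Return the names matched by CLASS_RULE; unmatched text is ignored."""
    return [m.group(1) for m in CLASS_RULE.finditer(source)]


# The stray "@@" and the unbalanced braces would make a strict parser fail,
# but extraction simply skips over whatever it does not recognize.
broken = "class Foo { @@ int x; \n struct Bar : Foo { }; }}}"
names = extract_class_names(broken)
```

Here `names` ends up holding `Foo` and `Bar` even though the input is not valid C++.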

Once that data is extracted it has to be processed in some way to create something else (so the operation as a whole can be perceived as a "transformation" but strictly speaking it would be better described as "generation controlled by extracted data").

Ideally I would like the processing of the extracted data to be general purpose but if it is not possible I will focus on the specific goals I had when starting this project (the two I mentioned in my previous post).

I hope this makes my point clearer.

**wysota** · 19th February 2009, 21:43

Originally Posted by fullmetalcoder

Now the question is: how could the actions to perform upon a given match (taking the matched tokens into account of course) be specified?

Maybe use an approach similar to delegates in item views (I don't remember the exact design pattern name for that)? Allow users to inject small objects with little pieces of code to manipulate tokens.

Is there any spec for such text transformations or do I have to create my own?

Well... there is XSL-T

If so, do you have any ideas for creating something that would be relatively easy to implement and easy to understand and use for people not necessarily versed in programming?

I'm currently working on a code generator based on XSD schemas and I'm using QtScript to provide a means of data manipulation. The overall performance is not impressive but I didn't bother to do any optimizations as speed is not a priority for my use case.

Apart from QtScript I'm using a simple replace mechanism - you feed the engine with a text template with variable placeholders and the engine substitutes them with their values.
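A placeholder-substitution mechanism of this kind can be sketched with Python's standard `string.Template`. The `$name` syntax is that class's own convention and the template text and values are made up for the example; the placeholder syntax of the generator described above is not stated in this thread.

```python
from string import Template

# Hypothetical template: placeholders are written as $identifier.
template = Template("class $name : public $base\n{\n};\n")

# Values that an extraction step might have produced (made up here).
values = {"name": "MyWidget", "base": "QWidget"}

# substitute() raises KeyError if a placeholder has no value.
generated = template.substitute(values)
```

The appeal is that the template stays readable for non-programmers; the limitation is that anything beyond plain substitution (selection, loops, conditions) needs extra operators.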

creation of a class tree from code

I have a nice C++ class generator

It works quite well, handles dependencies, creates accessors for variables, etc. It can still be improved significantly but I'd say it's a nice tool and I'm using it quite often.

As for the mechanism you are looking for - try coming up with some kind of algebra for data manipulation so that one could write transformation rules in the form of formulas (I guess that's a somewhat similar approach to what XSL-T does).

**fullmetalcoder** · 19th February 2009, 22:00

Originally Posted by wysota

Maybe use an approach similar to delegates in item views (I don't remember the exact design pattern name for that)? Allow users to inject small objects with little pieces of code to manipulate tokens.

Could you elaborate on this? I don't really see how it would work.

Originally Posted by wysota

Well... there is XSL-T

XSL-T is a trimmed-down script language hidden in XML tags. I'd rather use a real script language from the start or write my own specifically tailored set of primitives and avoid cluttering them with angle brackets...

Originally Posted by wysota

I'm currently working on a code generator based on XSD schemas and I'm using QtScript to provide means of data manipulation. The overall performance is not impressive but I didn't bother to do any optimizations as speed is not a priority with my usecase.

Originally Posted by wysota

Apart from QtScript I'm using a simple replace mechanism - you feed the engine with a text template with variable placeholders and the engine substitutes them with their values.

That was my original plan, with some special operators to select tokens or groups of tokens within the matches but it looked a bit cumbersome.

Originally Posted by wysota

I have a nice C++ class generator

It works quite well, handles dependencies, creates accessors for variables, etc. It can still be improved significantly but I'd say it's a nice tool and I'm using it quite often.

How does it generate classes? (or did you forget a "tree" somewhere?) Are the sources available?

Originally Posted by wysota

As for a mechanism you are looking for - try coming up with some kind of algebra for data manipulation so that one could write transformation rules in forms of formulas.

This deserves some consideration but I'm not sure I have enough time to craft such a thing atm.

**wysota** · 20th February 2009, 08:15

Originally Posted by fullmetalcoder

Could you elaborate this? I don't really see how it would work.

You provide an interface with a set of methods that the "delegate" can use to get tokens and to push output values, and you let the user register a function (or rather an instance of a class) that will be called under certain conditions and will do the manipulation. You can implement the delegates as a set of plugins so that you don't have to recompile the main parsing application, if that's an issue.
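A minimal sketch of that arrangement, in Python for brevity, might look like the following. The names `MatchContext`, `Delegate`, `Engine`, `register`, `dispatch` and `ClassTreeDelegate` are all invented for the illustration, and plugin loading is left out entirely.

```python
from typing import Protocol


class MatchContext:
    """Hypothetical interface given to delegates: read tokens, push output."""

    def __init__(self, tokens: list[str]) -> None:
        self._tokens = tokens
        self.output: list[str] = []

    def token(self, index: int) -> str:
        return self._tokens[index]

    def push(self, text: str) -> None:
        self.output.append(text)


class Delegate(Protocol):
    def process(self, ctx: MatchContext) -> None: ...


class Engine:
    """Maps match identifiers to the delegate registered for them."""

    def __init__(self) -> None:
        self._delegates: dict[str, Delegate] = {}

    def register(self, match_id: str, delegate: Delegate) -> None:
        self._delegates[match_id] = delegate

    def dispatch(self, match_id: str, tokens: list[str]) -> list[str]:
        ctx = MatchContext(tokens)
        delegate = self._delegates.get(match_id)
        if delegate is not None:  # matches without a delegate yield nothing
            delegate.process(ctx)
        return ctx.output


class ClassTreeDelegate:
    """Example delegate: emits one 'derived -> base' line per match."""

    def process(self, ctx: MatchContext) -> None:
        ctx.push(f"{ctx.token(0)} -> {ctx.token(1)}")


engine = Engine()
engine.register("class_decl", ClassTreeDelegate())
result = engine.dispatch("class_decl", ["QPushButton", "QAbstractButton"])
```

The delegate only sees the narrow `MatchContext` interface, which is what would make it possible to ship delegates separately from the parsing application.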

How does it generate classes? (or did you forget a "tree" somewhere?)

It generates single classes and allows adding and resolving dependencies to them so that all classes land in an appropriate order in the destination file(s). It's actually a very simple mechanism but works quite well.
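The generator itself is not public, so the following is only a generic illustration of dependency-driven ordering, done with a topological sort from Python's standard `graphlib` module (Python 3.9 or later). The class names and the `dependencies` mapping are made up.

```python
from graphlib import TopologicalSorter

# Hypothetical input: each class maps to the classes it depends on.
dependencies = {
    "Widget": {"Object", "PaintDevice"},
    "Object": set(),
    "PaintDevice": set(),
    "Button": {"Widget"},
}

# static_order() yields every class after all of its dependencies,
# and raises graphlib.CycleError if the dependencies are circular.
order = list(TopologicalSorter(dependencies).static_order())
```

Writing the classes to the destination file in `order` guarantees that each definition appears after everything it depends on.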

Are the sources available?

Not at the moment but at some point I think we can think about opening its code.

**fullmetalcoder** · 20th February 2009, 19:06

Originally Posted by wysota

You provide an interface with a set of methods that the "delegate" can access to get tokens and to push output values and let the user register a function (or rather an instance of a class) that will be called in certain conditions and which will do the manipulation. You can do it as a set of plugins so that you don't have to recompile the main parsing application if that's an issue.

Ok I see. This is very similar to what I already do but a bit more "elaborate": currently I have a Handler interface which has to be reimplemented. An instance of it is passed to the parser, which feeds it with match data. I think I will keep this general "model" but my point is now to offer a way to define processing of the parsed data without being forced to write C++ code (well, I'll have to write some to implement that but the end user won't).

Originally Posted by wysota

It generates single classes and allows adding and resolving dependencies to them so that all classes land in an appropriate order in the destination file(s). It's actually a very simple mechanism but works quite well.

Not at the moment but at some point I think we can think about opening its code.

That sounds interesting but I am not sure I understand how and when you use this generator. Is this somewhat similar to generating code from UML descriptions?

**wysota** · 20th February 2009, 19:30

Originally Posted by fullmetalcoder

That sounds interesting but I am not sure I understand how and when you use this generator. Is this somewhat similar to generating code from UML descriptions?

For instance I have been using it to generate implicitly shared classes, data "payload" classes or widget implementations from XSD schemas. Of course the fact that I was using XSD has nothing to do with the class generator, it's just a data source that's fed to a parser which then calls the CPP generator.

**fullmetalcoder** · 10th April 2009, 23:30

Just sharing my latest thoughts (comments welcome as always):

- This "level" of the framework is supposed to generate tags/entries: (match identifier (e.g. variable decl, namespace or whatever construct might have been matched), list of tokens (integers), associated list of productions) -> text.
- Tag generation rules are defined as S-expressions: easy to parse, excellent simplicity/flexibility ratio.
- Basic operators available: all string operations provided by QString and maybe a few more tailored to the use case, plus several lookup operators to access the token/production data.
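Since nothing has been implemented yet, the following is only a guess at how such S-expression rules could be evaluated. The operator names `token`, `upper` and `concat` are hypothetical, tokens are plain strings here rather than integers with productions, and Python string methods stand in for the QString operations.

```python
import re

# Parentheses, double-quoted strings, or bare atoms (names and integers).
TOKEN_RE = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse(text):
    """Parse one S-expression into nested lists."""
    items = TOKEN_RE.findall(text)
    pos = 0

    def read():
        nonlocal pos
        item = items[pos]
        pos += 1
        if item == "(":
            node = []
            while items[pos] != ")":
                node.append(read())
            pos += 1  # skip the closing parenthesis
            return node
        if item.startswith('"'):
            return ("str", item[1:-1])  # string literal, quotes removed
        if item.isdigit():
            return int(item)
        return item  # operator name

    return read()


# Hypothetical operators; each receives the matched tokens first.
OPERATORS = {
    "token": lambda tokens, index: tokens[index],
    "upper": lambda tokens, s: s.upper(),
    "concat": lambda tokens, *parts: "".join(parts),
}


def evaluate(node, tokens):
    """Evaluate a parsed rule against the tokens of one match."""
    if isinstance(node, tuple):
        return node[1]
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        raise ValueError(f"operator {node!r} used outside a list")
    op, *args = node
    return OPERATORS[op](tokens, *(evaluate(a, tokens) for a in args))


rule = parse('(concat (upper (token 0)) " : " (token 1))')
tag = evaluate(rule, ["foo", "int"])
```

With the tokens `foo` and `int`, the rule produces `FOO : int`. Adding an operator means adding one entry to `OPERATORS`, which is much of the appeal of this form.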

I have yet to figure out an elegant way to allow creation of "intermediate" results.

Note: I have not started implementing anything yet; these are just ideas and will remain so until my exams end.