
Parsing Files



JayJay
20th August 2007, 17:42
Hey Guys,

I'm looking for a professional way to parse text files.
How do I start with Qt? Is there a "Qt way", or must I implement this with regular expressions?

My goal is to parse C++ header files ...

fullmetalcoder
20th August 2007, 17:48
My goal is to parse C++ header files ...
What kind of parsing do you want to achieve? Building an AST? Extracting symbols/tags? Highlighting? Something else? The goal will affect the implementation but, in most cases, a good old hand-made parser performs best with languages as complex as C++.

JayJay
20th August 2007, 18:05
I want to extract class, function and member information from a header file.
Handmade means some orgies with QRegExp?!

Michiel
20th August 2007, 18:13
I don't think you have to use QRegExp that much. I'd say, first separate the header file by its whitespace (and the places between words and special characters) to get a list of tokens. Then go through the list with some sort of finite state machine next to it.
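
The splitting step described above can be sketched roughly as follows. This is only a minimal illustration in plain standard C++ (no Qt classes), and `tokenize` is a hypothetical helper name, not something from Qt. It assumes identifiers and numbers are made of letters, digits and underscores, and it treats every other non-space character as a token of its own, so multi-character operators such as `::` come out as two tokens.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split C++-like source text into tokens: runs of letters, digits and
// underscores are kept whole, every other non-space character becomes a
// one-character token. Hypothetical helper for illustration only.
std::vector<std::string> tokenize(const std::string &text)
{
    std::vector<std::string> tokens;
    std::string word;

    for (char c : text) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalnum(uc) || c == '_') {
            word += c;                           // still inside a word
            continue;
        }
        if (!word.empty()) {                     // a word just ended
            tokens.push_back(word);
            word.clear();
        }
        if (!std::isspace(uc))
            tokens.push_back(std::string(1, c)); // special character
    }
    if (!word.empty())                           // flush a trailing word
        tokens.push_back(word);
    return tokens;
}
```

Calling `tokenize("class a { int foo; };")` yields the tokens `class`, `a`, `{`, `int`, `foo`, `;`, `}`, `;`, which a state machine can then walk through one by one.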

fear
20th August 2007, 18:18
If you want to store some settings, the easiest way is to use QSettings.

fullmetalcoder
20th August 2007, 18:37
I want to extract class, function and member information from a header file.
And what do you want to do with the extracted data?


Handmade means some orgies with QRegExp?!
Not at all! Handmade means crafting a lexer and a parser in C++. The lexer reads a character stream and turns it into a sequence of tokens (a token stream). Then the parser analyzes the tokens and builds a tree, a tag list, or whatever you want.

If you don't want to waste time writing a parser from scratch, you can have a look at the one used by the Qt tools (lupdate or qt3to4, or both), or generate one using yacc, ANTLR or something similar.
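
To make the parser half of that split a bit more concrete, here is a deliberately naive sketch in standard C++. It assumes the lexer has already produced a list of string tokens, and it only recognises the shape `class Name {` or `class Name :` (likewise for `struct`). The name `findClassNames` and the three states are invented for this example; export macros between the keyword and the name, `enum class`, and `final` are not handled.

```cpp
#include <string>
#include <vector>

// States of a tiny automaton that recognises "class Name {" / "struct Name :".
enum class State { Idle, SawKeyword, SawName };

// Return the names of classes/structs that are defined (not merely
// forward-declared) in the token list. Hypothetical helper for illustration.
std::vector<std::string> findClassNames(const std::vector<std::string> &tokens)
{
    std::vector<std::string> names;
    State state = State::Idle;
    std::string candidate;

    for (const std::string &tok : tokens) {
        switch (state) {
        case State::Idle:
            if (tok == "class" || tok == "struct")
                state = State::SawKeyword;
            break;
        case State::SawKeyword:
            candidate = tok;                  // assume the next token is the name
            state = State::SawName;
            break;
        case State::SawName:
            if (tok == "{" || tok == ":")     // a body or base-class list follows
                names.push_back(candidate);
            state = State::Idle;              // anything else: not a definition
            break;
        }
    }
    return names;
}
```

A real parser would build a tree and track nesting, but the structure stays the same: the token decides the transition, and the state decides what the token means.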

elcuco
20th August 2007, 18:37
I want to extract class, function and member information from a header file.
Handmade means some orgies with QRegExp?!

How about another direction: run ctags on that file and parse the resulting tags file.
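
Reading the tags file is the easy part of this route. The sketch below, in standard C++, assumes the usual tab-separated layout of a tags line: tag name, file name, an ex address, and optionally `;"` followed by extension fields such as the kind. `TagEntry` and `parseTagLine` are names made up for this example, and an address that ends in `;"` with no fields after it is left as-is.

```cpp
#include <cstddef>
#include <string>

// One entry of a tags file. The field names are this sketch's own.
struct TagEntry {
    std::string name;    // symbol name
    std::string file;    // file the symbol was found in
    std::string address; // ex command (search pattern or line number)
    std::string fields;  // extension fields, still tab-separated; may be empty
    bool valid = false;  // false for blank, metadata or malformed lines
};

TagEntry parseTagLine(const std::string &line)
{
    TagEntry e;
    if (line.empty() || line.compare(0, 6, "!_TAG_") == 0)
        return e;                             // skip blank and metadata lines

    const std::size_t t1 = line.find('\t');
    if (t1 == std::string::npos)
        return e;
    const std::size_t t2 = line.find('\t', t1 + 1);
    if (t2 == std::string::npos)
        return e;

    e.name = line.substr(0, t1);
    e.file = line.substr(t1 + 1, t2 - t1 - 1);

    // The address ends where the ;" marker introduces the extension fields.
    const std::string marker = ";\"\t";
    const std::size_t m = line.find(marker, t2 + 1);
    if (m == std::string::npos) {
        e.address = line.substr(t2 + 1);
    } else {
        e.address = line.substr(t2 + 1, m - t2 - 1);
        e.fields = line.substr(m + marker.size());
    }
    e.valid = true;
    return e;
}
```

The first extension field is normally the kind of the symbol (for example a class, function or member), which is what would be needed to rebuild a class/member list from the entries.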

JayJay
20th August 2007, 18:38
Storing isn't the problem. An algorithm for parsing these files is the problem ...

After stripping all whitespace, I think I have to go through it with QRegExp ... That's the best way, I think ...

marcel
20th August 2007, 18:45
A lexer can be implemented in any number of ways. Using regular expressions is one of the hardest/slowest.

I'd say look in the opposite direction, towards finite automata, especially if you are interested in only one language.

Regards

Michiel
20th August 2007, 18:46
Storing isn't the problem. An algorithm for parsing these files is the problem ...

After stripping all whitespace, I think I have to go through it with QRegExp ... That's the best way, I think ...

Why? I don't think you'll need QRegExp at all. What's your plan?

JayJay
20th August 2007, 18:46
Not at all! Handmade means crafting a lexer and a parser in C++. The lexer reads a character stream and turns it into a sequence of tokens (a token stream). Then the parser analyzes the tokens and builds a tree, a tag list, or whatever you want.


Easy to read ... but how do I write one? Do you have any examples or links?

I don't want to use ctags; no dependency is the best dependency ...

elcuco
20th August 2007, 18:47
Storing isn't the problem. An algorithm for parsing these files is the problem ...

After stripping all whitespace, I think I have to go through it with QRegExp ... That's the best way, I think ...

Bad idea: you are not aware of the context in which the string is found. Here are some bad examples (fullmetalcoder will be able to explain NFA and DFA; I suggest you read about Turing machines as well):



QString s = "class a { int foo; };";

// class a { int foo; };

#if 0
class a
{
    int foo;
};
#endif

/*
class a
{
    int foo;
};
*/
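
One way to deal with most of these cases is a small state machine that remembers whether the current character sits in code, in a comment, or in a literal. The sketch below uses standard C++, and the function name `blankCommentsAndStrings` is made up for this example. It overwrites comments and string/character literals with spaces, so a later search for `class` only sees real code. It does not evaluate the preprocessor, so the `#if 0` case is NOT covered, and raw string literals are ignored as well.

```cpp
#include <cstddef>
#include <string>

// Return a copy of src in which comments and string/character literals are
// overwritten with spaces. Newlines are kept so line numbers stay valid.
std::string blankCommentsAndStrings(const std::string &src)
{
    enum State { Code, LineComment, BlockComment, StringLit, CharLit };
    State state = Code;
    std::string out = src;

    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';

        switch (state) {
        case Code:
            if (c == '/' && next == '/') {
                state = LineComment;
                out[i] = ' ';
            } else if (c == '/' && next == '*') {
                state = BlockComment;
                out[i] = out[i + 1] = ' ';
                ++i;                          // skip the '*'
            } else if (c == '"') {
                state = StringLit;
                out[i] = ' ';
            } else if (c == '\'') {
                state = CharLit;
                out[i] = ' ';
            }
            break;
        case LineComment:
            if (c == '\n')
                state = Code;                 // comment ends with the line
            else
                out[i] = ' ';
            break;
        case BlockComment:
            if (c == '*' && next == '/') {
                out[i] = out[i + 1] = ' ';
                ++i;                          // skip the '/'
                state = Code;
            } else if (c != '\n') {
                out[i] = ' ';
            }
            break;
        case StringLit:
        case CharLit:
            if (c == '\\' && next != '\0') {  // escaped character, e.g. \"
                out[i] = ' ';
                if (next != '\n')
                    out[i + 1] = ' ';
                ++i;
            } else {
                const char quote = (state == StringLit) ? '"' : '\'';
                if (c == quote)
                    state = Code;             // closing quote
                if (c != '\n')
                    out[i] = ' ';
            }
            break;
        }
    }
    return out;
}
```

Because the output has the same length as the input, positions found in the cleaned text map directly back to the original file.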

Michiel
20th August 2007, 18:51
Bad idea: you are not aware of the context in which the string is found. Here are some bad examples (fullmetalcoder will be able to explain NFA and DFA; I suggest you read about Turing machines as well):

Note to JayJay: That's (Non)Deterministic Finite Automata. And while Turing Machines are very interesting, I don't think they're relevant for this at all.

Anyway, I think everyone here is pointing you in the right direction. You need to create a token-stream/list and go through them with a finite automaton. But how formal and elaborate you want to make your parser really depends on what kind of stuff you expect to find in these header files. Are they more predictable than elcuco's example?

fullmetalcoder
20th August 2007, 19:01
Easy to read ... but how do I write one? Do you have any examples or links?
Hehe... That's just so true! Well, I've already suggested looking into the Qt sources. The lexer/parser used by the porting tool (just checked ;)), which is also used by the HEAD version of KDevelop BTW, is ready to use, though I've never actually tried it. If all you need is "lesser" parsing (only symbols/tags) put into a tree, then you can consider using QCodeModel 2. It's a small module I made for that very purpose (for Edyuk), and it works pretty well, turning the full Qt headers into a tree in about 6 seconds... http://edyuk.svn.sf.net/svnroot/edyuk/trunk/3rdparty/qcodemodel2 (do a checkout... you can't download the sources from there...)