
View Full Version : Parsing CSV data



cmre123
28th October 2010, 17:36
Can anyone assist me with writing a program to parse CSV files? I've searched for a good way to parse data using Qt, but most info suggests using third-party software or learning QLALR, which is not an option.

wysota
28th October 2010, 17:57
You can parse CSV data using a regular expression. The expression is available somewhere on the web.
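For illustration, a minimal sketch of such an expression using QRegExp; the pattern and the parseCsvLine() helper are just my guess at what the approach might look like. It copes with quoted fields and "" escapes on a single line, but not with records that span several lines:

#include <QRegExp>
#include <QString>
#include <QStringList>

QStringList parseCsvLine(const QString &line)
{
    QStringList fields;
    // Either a quoted field (with "" as an escaped quote) or a run of
    // characters up to the next comma.
    QRegExp rx("\"((?:[^\"]|\"\")*)\"|([^,]*)");
    int pos = 0;
    while (pos <= line.length()) {
        if (rx.indexIn(line, pos) < 0)
            break;
        QString field = rx.cap(1).isEmpty() ? rx.cap(2) : rx.cap(1);
        field.replace("\"\"", "\"");      // collapse escaped quotes
        fields << field;
        pos += rx.matchedLength() + 1;    // skip the field and the comma
    }
    return fields;
}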

ChrisW67
28th October 2010, 22:27
Can anyone assist me with writing a program to parse CSV files? I've searched for a good way to parse data using Qt, but most info suggests using third-party software or learning QLALR, which is not an option.
"CSV" covers a huge range of pseudo-standard formats with all sorts of nasty peculiarities, for example quotes or no quotes, header line(s), embedded commas in fields, variable escape characters, fields that contain embedded new line characters etc. See RFC 4180 (http://tools.ietf.org/html/rfc4180) for example. The reason many suggest using a third-party library is that this variability is taken care of by someone else.

As Wysota wrote, you can use a regular expression, but you will have fun with certain CSV variations. If the format is suitably simple and reliable, then QString::split() might be adequate.
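A rough sketch of that simple case, assuming one record per line and no quoting; the file name and the way the fields are used are just placeholders:

#include <QFile>
#include <QString>
#include <QStringList>
#include <QTextStream>

void readSimpleCsv(const QString &fileName)
{
    QFile file(fileName);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return;

    QTextStream in(&file);
    while (!in.atEnd()) {
        // Only works if fields never contain commas, quotes or newlines.
        QStringList fields = in.readLine().split(',');
        // ... do something with fields.at(0), fields.at(1), ...
    }
}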

adzajac
29th October 2010, 07:56
Hi.
Here is a fast method to parse a line of text into tokens. It is faster than using a regexp, and this one is designed for CSV according to the RFC.


#include <string>
#include <vector>

// Split one CSV record into fields. Handles RFC-style quoting:
// fields may be enclosed in double quotes, and "" inside a quoted
// field stands for a literal quote.
std::vector<std::string> tokenize(const std::string& str, char delimiter) {
    std::vector<std::string> tokens;

    std::string::size_type pos = 0;
    bool quotes = false;          // true while inside a quoted field
    std::string field;

    while (pos < str.length()) {
        char c = str[pos];
        if (!quotes && c == '"') {
            quotes = true;                                    // opening quote
        } else if (quotes && c == '"') {
            if (pos + 1 < str.length() && str[pos + 1] == '"') {
                field.push_back(c);                           // "" -> literal "
                pos++;
            } else {
                quotes = false;                               // closing quote
            }
        } else if (!quotes && c == delimiter) {
            tokens.push_back(field);                          // end of field
            field.clear();
        } else if (!quotes && (c == 0x0A || c == 0x0D)) {
            break;                                            // end of record
        } else {
            field.push_back(c);
        }
        pos++;
    }
    tokens.push_back(field);      // the last field has no trailing delimiter
    return tokens;
}

wysota
29th October 2010, 09:40
I don't think it parses CSV correctly. It doesn't consider embedded newlines, and I think it will also fail on embedded quote characters, and it will certainly fail on Unicode strings. Oh, and I don't think it is faster than a regular expression, as you are pushing characters onto the string one by one, which is basically slow. You can easily improve it by storing the positions of the beginning and end of a token and then extracting the whole item in one go.
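A simplified sketch of that improvement, ignoring quoting entirely just to show the idea of remembering where the field starts and extracting it in one go:

#include <string>
#include <vector>

std::vector<std::string> tokenizeFast(const std::string &str, char delimiter)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    for (std::string::size_type pos = 0; pos <= str.length(); ++pos) {
        if (pos == str.length() || str[pos] == delimiter) {
            // Extract the whole field at once instead of character by character.
            tokens.push_back(str.substr(start, pos - start));
            start = pos + 1;
        }
    }
    return tokens;
}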

adzajac
29th October 2010, 14:44
Quotes, even embedded ones, are taken into account, and so are '\n' characters if they are inside quotes, as in the RFC. Unicode is in fact not supported; the input must be ASCII. As for performance, it's faster than boost::tokenizer (I've measured that, because the CSVs in my project are huge) and I think it could be faster than a regexp. A regexp could give you the ability to parse files with variable character length, i.e. Unicode.

wysota
29th October 2010, 14:57
I think the efficiency of boost::tokenizer depends strictly on how the parsing function is implemented. From what I understand, boost::tokenizer is just a wrapper over the parsing function, so I believe your solution could be put into boost::tokenizer too if you strip the outer while loop. But I still claim that either a regexp or a dedicated parser based on automata would be faster. Not that this really matters unless you are parsing megabytes of CSV a second ;)
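For what it's worth, boost::tokenizer already ships with such a parsing function for CSV-like input, escaped_list_separator; as far as I know it handles quotes and backslash escapes, but not the RFC 4180 "" escape. A minimal usage example:

#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main()
{
    std::string line = "one,\"two, with comma\",three";
    typedef boost::tokenizer<boost::escaped_list_separator<char> > Tokenizer;
    Tokenizer tok(line);
    for (Tokenizer::iterator it = tok.begin(); it != tok.end(); ++it)
        std::cout << *it << std::endl;   // prints each field on its own line
    return 0;
}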