Remove Duplicate Lines in a file



nmuntz
9th January 2009, 14:09
Hi,

Can somebody point me in the right direction on the most efficient ways of removing duplicate lines in a file? I probably won't be loading files bigger than 5 MB, but I would still like to do it as efficiently as possible.

Thanks a lot in advance !

Boron
9th January 2009, 15:16
1.) I would use a QCryptographicHash to calculate a hash over each line.
2.) I would store the hashes together with their corresponding line number in a QList (or any other generic container; access should be fast for the next step).
3.) I would sort the QList using qSort from QtAlgorithms.
4.) I would iterate over the QList and compare the current element with the "next" element (if any). If next is identical, you know the line number of a duplicate line in your text file.

(I hope hashes can be compared faster than strings).

That's what I would do.
I think this could be done faster. But if you're not dealing with millions of lines it should be OK.
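
The four steps above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses standard C++ instead of Qt, so `std::hash` stands in for QCryptographicHash and `std::sort` for qSort, and the function name `findDuplicateLines` is made up for the example. Because two different lines can share a hash value, the sketch also compares the actual text before reporting a duplicate.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the zero-based indices of all lines that
// repeat an earlier line, in ascending order.
std::vector<std::size_t> findDuplicateLines(const std::vector<std::string>& lines)
{
    // Steps 1 and 2: hash every line and store it with its line index.
    std::vector<std::pair<std::size_t, std::size_t>> entries; // (hash, index)
    entries.reserve(lines.size());
    std::hash<std::string> hasher;
    for (std::size_t i = 0; i < lines.size(); ++i)
        entries.emplace_back(hasher(lines[i]), i);

    // Step 3: sort by hash, then by index, so equal hashes become adjacent
    // and the earliest occurrence of each line comes first in its group.
    std::sort(entries.begin(), entries.end());

    // Step 4: walk each group of equal hashes. Inside a group, compare the
    // real text so that a hash collision is not mistaken for a duplicate.
    std::vector<std::size_t> duplicates;
    std::size_t groupStart = 0;
    while (groupStart < entries.size()) {
        std::vector<std::size_t> firstSeen; // distinct lines in this group
        std::size_t k = groupStart;
        for (; k < entries.size() && entries[k].first == entries[groupStart].first; ++k) {
            const std::size_t idx = entries[k].second;
            bool isDuplicate = false;
            for (std::size_t f : firstSeen) {
                if (lines[f] == lines[idx]) { isDuplicate = true; break; }
            }
            if (isDuplicate)
                duplicates.push_back(idx);
            else
                firstSeen.push_back(idx);
        }
        groupStart = k;
    }

    std::sort(duplicates.begin(), duplicates.end());
    return duplicates;
}
```

The caller would read the file into a vector of lines first and then skip the returned indices when writing the result. The sort makes this O(n log n), which should be fine for a file of a few MB.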

maverick_pol
11th January 2009, 19:00
Hi,

There is a good example in Peter D. Hipson's book "Advanced C" (look for "PURGING"). It is written in C, so it does not depend on Qt, and you can easily port it to C++ streams. Of course, the idea above sounds good too, and it uses Qt, which everyone here likes.


Good luck.

wysota
11th January 2009, 20:49
A cryptographic hash is too complex. Here is something much simpler:


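#include <QByteArray>
#include <QFile>
#include <QSet>

// Goes inside a function; the file names are placeholders.
QFile f("input.txt");
QFile out("output.txt");
if(f.open(QFile::ReadOnly|QFile::Text) && out.open(QFile::WriteOnly|QFile::Text)){
    // Keep the lines themselves. QSet hashes them with qHash() internally but
    // also compares the contents, so a hash collision cannot drop a unique line.
    QSet<QByteArray> linesSeen;
    while(!f.atEnd()){
        QByteArray s = f.readLine();
        if(linesSeen.contains(s)) continue; // already written once
        linesSeen << s;
        out.write(s);
    }
}
f.close();
out.close();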

nmuntz
12th January 2009, 01:04
Agreed. I tried to look into the cryptographic solution but I gave up on it, as it was too complicated.
Your solution is working great!!!

Thank you very much!