Thread: Remove Duplicate Lines in a file

  1. #1

    Hi,

    Can somebody point me toward an efficient way of removing duplicate lines from a file? I probably won't be loading files bigger than 5 MB, but I would still like to do it as efficiently as possible.

    Thanks a lot in advance!

  2. The following user says thank you to nmuntz for this useful post:

    zeFree (23rd April 2013)

  3. #2

    1.) I would use QCryptographicHash to calculate a hash of each line.
    2.) I would store each hash together with its line number in a QList (or any other generic container; access should be fast for the next step).
    3.) I would sort the QList by hash using qSort from QtAlgorithms.
    4.) I would iterate over the sorted QList and compare each element with the next one (if any). If the two hashes are identical, the next element's line number is that of a duplicate line in your text file.

    (I hope hashes can be compared faster than strings).

    That's what I would do.
    I think this could be done faster. But if you're not dealing with millions of lines it should be OK.
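
The four steps above can be sketched roughly as follows. This is only an illustration in plain standard C++ rather than Qt: `std::hash` stands in for QCryptographicHash, `std::sort` for qSort, and the function name `duplicateLineIndices` is made up for this example. It also compares the actual text whenever two hashes match, on the assumption that a hash match alone should not be trusted to mean the lines are equal.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Returns the 0-based indices of all lines that repeat an earlier line.
// Hypothetical helper: std::hash replaces QCryptographicHash in this sketch.
std::vector<std::size_t> duplicateLineIndices(const std::vector<std::string>& lines)
{
    // Steps 1 and 2: store each line's hash together with its line index.
    std::vector<std::pair<std::size_t, std::size_t>> hashed;
    hashed.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i)
        hashed.emplace_back(std::hash<std::string>{}(lines[i]), i);

    // Step 3: sort by hash, then by index, so that equal hashes are adjacent
    // and the earliest occurrence comes first within each run.
    std::sort(hashed.begin(), hashed.end());

    // Step 4: walk the sorted list and inspect each run of equal hashes.
    std::vector<std::size_t> duplicates;
    std::size_t runStart = 0;
    for (std::size_t k = 1; k < hashed.size(); ++k) {
        if (hashed[k].first != hashed[runStart].first) {
            runStart = k; // a new hash value starts a new run
            continue;
        }
        // Same hash as earlier entries: confirm with a real string comparison
        // so that a hash collision is not reported as a duplicate.
        for (std::size_t j = runStart; j < k; ++j) {
            if (lines[hashed[j].second] == lines[hashed[k].second]) {
                duplicates.push_back(hashed[k].second);
                break;
            }
        }
    }

    // Report the duplicates in file order.
    std::sort(duplicates.begin(), duplicates.end());
    return duplicates;
}
```

The returned indices can then be used to skip those lines when writing the file back out. Sorting costs O(n log n), which is why a set-based single pass is usually the simpler choice for files of a few megabytes.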

  4. #3

    Hi,

    There is a good example in Peter D. Hipson's book "Advanced C" (look for "PURGING"). It is written in plain C, so it does not depend on Qt, and you can easily port it to C++ streams. Of course, the idea above sounds good too, and it uses Qt, which everyone here likes.


    Good luck.
    Qt allows you to use everything you want
    wysota
    --------------------------------------------------------------------------------
    #if defined(Q_OS_UNIX) && defined(QT_DEBUG)
    abort(); // trap; generates core dump
    #else
    exit(1); // goodbye cruel world
    #endif

  5. #4

    A cryptographic hash is too complex. Here is something much simpler:

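    Qt Code:
    // Needs <QFile>, <QSet> and <QByteArray>. f is the input QFile; outputPath is a
    // hypothetical QString naming the destination (the original gave "out" no file name).
    f.open(QFile::ReadOnly | QFile::Text);
    QFile out(outputPath);
    out.open(QFile::WriteOnly | QFile::Text);
    // Remember the lines themselves, not just qHash(line): two different lines can
    // share a hash, and the second one would then be dropped by mistake.
    QSet<QByteArray> linesSeen;
    while (!f.atEnd()) {
        const QByteArray line = f.readLine(); // keeps the trailing '\n'
        if (linesSeen.contains(line))
            continue; // duplicate, skip it
        linesSeen << line;
        out.write(line); // write() accepts a QByteArray, not a QString
    }
    f.close();
    out.close();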

  6. The following user says thank you to wysota for this useful post:

    zeFree (23rd April 2013)

  7. #5

    Originally posted by wysota:
    A cryptographic hash is too complex. Here is something much simpler:

    Agreed. I tried to look into the cryptographic solution but I gave up on it, as it was too complicated.
    Your solution is working great!!!

    Thank you very much!

  8. The following user says thank you to nmuntz for this useful post:

    zeFree (23rd April 2013)
