File based merge sort

**pixx** · 8th December 2010, 15:40

Hi!

I try to solve word count problem (calculate count of each word in files array) with very big set of data - larger than my pc's memory (RAM).

Format of each file:
string: int
string: int

I can save intermediate results in files, but at the end i'll have hundreds of files, each of them is equal in size than my pc's memory.

Can i merge all this intermediate files in 1 with Qt tools? Is there an implementation of merge sort algorithm in qt or community code? Like this http://adaszewski.fachowcy.pl/file-based-merge-sort/ but working with Qt data containers like QMap?

Thank you for help!

**high_flyer** · 8th December 2010, 16:37

I can save intermediate results in files

What exactly is an intermediate result?
What is such a result made of in terms of data?
I don't understand why intermediate results should consume so much memory...

**pixx** · 8th December 2010, 17:48

What exactly is an intermediate result?

Eg:
word1 23
word2 12
Where word* - it's natural language word, and numbers - it's counts of this words in some collection of plain text files. The size of intermediate result is limited by RAM size.

What is such a result made of in terms of data?

Qt Code:

Switch view

typedef QMap<QString, int> WordCount;

typedef QMap<QString, int> WordCount;

To copy to clipboard, switch view to plain text mode

Note that i talk about standard example from Qt examples. The difference is that i have very huge set of source plain text files to process and QMap could not fit into RAM. So i make intermediate results of the same format. After that i want to merge all of intermediate results with summarize counts of the same words in different intermediate results, eg:
intermediate result 1:
cat 3
dog 1
intermediate result 2:
bird 10
cat 5
final result (after external merge sort):
bird 10
cat 8
dog 1

Intermediate results stores in separate files limited by size of RAM. So i cant just read all files in memory and sort them, because each file is little less than memory size.
I know that there is a standard algorithm called "external memory merge sort" which can sort many large sets of data in one single largest set.
So i could not find such implementation for Qt and asked people about it.

May be there is an unofficial plugin for Qt?

Sorry for my chinglish...

**ChrisW67** · 8th December 2010, 21:25

STXXL http://stxxl.sourceforge.net/

**marcvanriet** · 9th December 2010, 00:08

So you have a text file that is so big that even the word count is a bit less than your computers' memory ??? Wow. Can you mail it to me ?

Merging two results files is easy, especially if they are already sorted. Roughly something like this :

Qt Code:

Switch view

//- Open two results files
//   - read a line of each file
//- until both files are completely read :
//   - are the words identical ?
//      - yes -> write word and the sum to your output file
//                   read next line in both files
//      - no -> write only the word that is lowest in the alphabet to your output file
//                  read next line from this words' results file

//- Open two results files
//   - read a line of each file
//- until both files are completely read :
//   - are the words identical ?
//      - yes -> write word and the sum to your output file
//                   read next line in both files
//      - no -> write only the word that is lowest in the alphabet to your output file
//                  read next line from this words' results file

To copy to clipboard, switch view to plain text mode

You could speed this up by comparing e.g. 100 files simultaneously.

Maybe you could also use a MySql database, but that will probably be too slow for word counting operations. You could store the results of all the input text files in a single table, and then execute a grouping and summing SQL statement onto it. I wonder if you would hit any limitation of MySql.

Best regards,
Marc