Re: File based merge sort
Quote:
I can save intermediate results in files
What exactly is an intermediate result?
What is such a result made of in terms of data?
I don't understand why intermediate results should consume so much memory...
Re: File based merge sort
Quote:
What exactly is an intermediate result?
E.g.:
word1 23
word2 12
where word* is a natural-language word and the number is the count of that word in some collection of plain text files. The size of an intermediate result is limited by the RAM size.
Quote:
What is such a result made of in terms of data?
Code:
typedef QMap<QString, int> WordCount;
Note that I am talking about the standard word-count example from the Qt examples. The difference is that I have a very large set of source plain text files to process, so the QMap cannot fit into RAM. So I write intermediate results of the same format to disk. After that I want to merge all of the intermediate results, summing the counts of the same words across the different intermediate results, e.g.:
intermediate result 1:
cat 3
dog 1
intermediate result 2:
bird 10
cat 5
final result (after external merge sort):
bird 10
cat 8
dog 1
Intermediate results are stored in separate files, each limited by the size of RAM. So I can't just read all the files into memory and sort them, because each file is a little less than the memory size.
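Note that QMap already keeps its keys in sorted order, so every intermediate file is written out sorted by word. A minimal sketch of how one intermediate result could be flushed to disk, assuming one "word count" pair per line (the helper name is just for illustration):
Code:
#include <QFile>
#include <QMap>
#include <QString>
#include <QTextStream>

typedef QMap<QString, int> WordCount;

// Flush the in-memory counts to disk as one intermediate result.
// QMap iterates in ascending key order, so the file comes out
// sorted by word, ready for a later merge.
bool writeIntermediate(const WordCount &counts, const QString &path)
{
    QFile file(path);
    if (!file.open(QIODevice::WriteOnly | QIODevice::Text))
        return false;
    QTextStream out(&file);
    for (WordCount::const_iterator it = counts.constBegin();
         it != counts.constEnd(); ++it) {
        out << it.key() << ' ' << it.value() << '\n';
    }
    return true;
}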
I know that there is a standard algorithm called "external memory merge sort" which can merge many large sorted sets of data into one single sorted set.
I could not find such an implementation for Qt, so I am asking people about it.
Maybe there is an unofficial plugin for Qt?
Sorry for my chinglish... :confused:
Re: File based merge sort
So you have a text file that is so big that even the word count is a bit less than your computer's memory? Wow. Can you mail it to me? ;)
Merging two result files is easy, especially if they are already sorted. Roughly something like this:
Code:
// - open the two result files
// - read a line from each file
// - until one of the files is completely read:
//   - are the words identical?
//     - yes -> write the word and the sum of both counts to your output file,
//              then read the next line from both files
//     - no  -> write only the entry whose word is lowest in the alphabet to
//              your output file, then read the next line from that word's file
// - copy the remaining lines of the file that is not yet exhausted
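Here is a minimal Qt sketch of that pseudocode, assuming each result file holds one "word count" pair per line and is already sorted by word. The function and helper names are made up, and error handling is reduced to the essentials:
Code:
#include <QFile>
#include <QString>
#include <QTextStream>

// Read one "word count" pair; returns false at end of file.
static bool readEntry(QTextStream &in, QString &word, qint64 &count)
{
    word.clear();
    if (in.atEnd())
        return false;
    in >> word >> count;
    return !word.isEmpty();
}

// Merge two sorted result files into one, summing counts of equal words.
bool mergeTwo(const QString &aPath, const QString &bPath, const QString &outPath)
{
    QFile fileA(aPath), fileB(bPath), fileOut(outPath);
    if (!fileA.open(QIODevice::ReadOnly | QIODevice::Text) ||
        !fileB.open(QIODevice::ReadOnly | QIODevice::Text) ||
        !fileOut.open(QIODevice::WriteOnly | QIODevice::Text))
        return false;

    QTextStream a(&fileA), b(&fileB), out(&fileOut);
    QString wordA, wordB;
    qint64 countA = 0, countB = 0;
    bool hasA = readEntry(a, wordA, countA);
    bool hasB = readEntry(b, wordB, countB);

    while (hasA && hasB) {
        if (wordA == wordB) {            // identical words: write the sum
            out << wordA << ' ' << (countA + countB) << '\n';
            hasA = readEntry(a, wordA, countA);
            hasB = readEntry(b, wordB, countB);
        } else if (wordA < wordB) {      // write the alphabetically lower word
            out << wordA << ' ' << countA << '\n';
            hasA = readEntry(a, wordA, countA);
        } else {
            out << wordB << ' ' << countB << '\n';
            hasB = readEntry(b, wordB, countB);
        }
    }
    // drain whichever file still has entries left
    while (hasA) {
        out << wordA << ' ' << countA << '\n';
        hasA = readEntry(a, wordA, countA);
    }
    while (hasB) {
        out << wordB << ' ' << countB << '\n';
        hasB = readEntry(b, wordB, countB);
    }
    return true;
}
Applying this pass repeatedly, until only one file remains, is exactly the merge phase of an external merge sort.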
You could speed this up by merging e.g. 100 files simultaneously, comparing the current word of each file; a sketch of that k-way merge follows.
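For the many-files variant, a small min-heap over the current word of each open file keeps the whole pass to a single scan. A sketch under the same assumptions as above (names again made up):
Code:
#include <QFile>
#include <QString>
#include <QStringList>
#include <QTextStream>
#include <memory>
#include <queue>
#include <vector>

// Same helper as in the two-way sketch: read one "word count" pair.
static bool readEntry(QTextStream &in, QString &word, qint64 &count)
{
    word.clear();
    if (in.atEnd())
        return false;
    in >> word >> count;
    return !word.isEmpty();
}

struct Entry {
    QString word;
    qint64 count;
    int src;                      // which input stream this pair came from
};

struct WordGreater {              // makes std::priority_queue a min-heap on word
    bool operator()(const Entry &a, const Entry &b) const
    { return a.word > b.word; }
};

// Merge any number of sorted result files in a single pass.
bool mergeMany(const QStringList &inPaths, const QString &outPath)
{
    std::vector<std::unique_ptr<QFile>> files;
    std::vector<std::unique_ptr<QTextStream>> streams;
    std::priority_queue<Entry, std::vector<Entry>, WordGreater> heap;

    for (int i = 0; i < inPaths.size(); ++i) {
        files.emplace_back(new QFile(inPaths.at(i)));
        if (!files.back()->open(QIODevice::ReadOnly | QIODevice::Text))
            return false;
        streams.emplace_back(new QTextStream(files.back().get()));
        Entry e;
        e.src = i;
        if (readEntry(*streams.back(), e.word, e.count))
            heap.push(e);
    }

    QFile outFile(outPath);
    if (!outFile.open(QIODevice::WriteOnly | QIODevice::Text))
        return false;
    QTextStream out(&outFile);

    QString curWord;              // word currently being accumulated
    qint64 curCount = 0;
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        if (e.word == curWord) {
            curCount += e.count;  // same word from another file: sum it
        } else {
            if (!curWord.isEmpty())
                out << curWord << ' ' << curCount << '\n';
            curWord = e.word;
            curCount = e.count;
        }
        Entry next;
        next.src = e.src;         // refill the heap from the same stream
        if (readEntry(*streams[e.src], next.word, next.count))
            heap.push(next);
    }
    if (!curWord.isEmpty())
        out << curWord << ' ' << curCount << '\n';
    return true;
}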
Maybe you could also use a MySQL database, but that will probably be too slow for word counting operations. You could store the results of all the input text files in a single table and then execute a grouping and summing SQL statement on it. I wonder if you would hit any limitation of MySQL.
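If you try the database route, the grouping and summing statement could look like the one below, run here through QtSql. The driver, database, table, and column names are invented for the example, and the pairs from all intermediate files are assumed to be loaded into the table already:
Code:
#include <QSqlDatabase>
#include <QSqlQuery>

// Sum the per-file counts with one GROUP BY, assuming every (word, count)
// pair from every intermediate file was loaded into a table "word_counts".
// Connection details (host, user, password) are omitted in this sketch.
bool sumCountsWithSql()
{
    QSqlDatabase db = QSqlDatabase::addDatabase("QMYSQL");
    db.setDatabaseName("wordcount");   // hypothetical database name
    if (!db.open())
        return false;

    QSqlQuery query(db);
    return query.exec("SELECT word, SUM(cnt) AS total "
                      "FROM word_counts "
                      "GROUP BY word "
                      "ORDER BY word");
}
Iterating over the result set with query.next() and writing it back to a file would then replace the merge step entirely.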
Best regards,
Marc