File based merge sort



pixx
8th December 2010, 15:40
Hi!

I'm trying to solve a word count problem (calculating the count of each word in an array of files) with a very big data set - larger than my PC's memory (RAM).

Format of each file:
string: int
string: int

I can save intermediate results in files, but at the end I'll have hundreds of files, each of them nearly equal in size to my PC's memory.

Can I merge all these intermediate files into one with Qt tools? Is there an implementation of a merge sort algorithm in Qt or in community code? Something like this http://adaszewski.fachowcy.pl/file-based-merge-sort/, but working with Qt data containers like QMap?

Thank you for your help!

high_flyer
8th December 2010, 16:37
I can save intermediate results in files
What exactly is an intermediate result?
What is such a result made of in terms of data?
I don't understand why intermediate results should consume so much memory...

pixx
8th December 2010, 17:48
What exactly is an intermediate result?
E.g.:
word1 23
word2 12
Where word* is a natural language word, and the numbers are the counts of those words in some collection of plain text files. The size of an intermediate result is limited by the RAM size.


What is such a result made of in terms of data?

typedef QMap<QString, int> WordCount;
Note that I'm talking about the standard word count example from the Qt examples. The difference is that I have a very large set of source plain text files to process, so the QMap cannot fit into RAM. That's why I produce intermediate results in the same format. After that I want to merge all of the intermediate results, summing the counts of the same word across the different intermediate results, e.g.:
intermediate result 1:
cat 3
dog 1
intermediate result 2:
bird 10
cat 5
final result (after external merge sort):
bird 10
cat 8
dog 1

The intermediate results are stored in separate files, each limited by the size of RAM. So I can't just read all the files into memory and sort them, because each file is only a little smaller than the memory size.
I know that there is a standard algorithm called "external memory merge sort" which can merge many large sorted sets of data into one single larger set.
I could not find such an implementation for Qt, which is why I'm asking about it here.
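
For clarity, this is roughly how I write each intermediate result to disk (just a sketch; the function name, file name handling and the space separator are my own choices). Since QMap iterates in key order, each intermediate file comes out already sorted by word:

#include <QFile>
#include <QMap>
#include <QString>
#include <QTextStream>

typedef QMap<QString, int> WordCount;

// Dump the current in-memory counts to a file, one "word count" pair per line.
// QMap iterates in key order, so the resulting file is already sorted by word.
void flushIntermediate(const WordCount &counts, const QString &fileName)
{
    QFile file(fileName);
    if (!file.open(QIODevice::WriteOnly | QIODevice::Text))
        return;

    QTextStream out(&file);
    for (WordCount::const_iterator it = counts.constBegin(); it != counts.constEnd(); ++it)
        out << it.key() << ' ' << it.value() << '\n';
}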

Maybe there is an unofficial plugin for Qt?

Sorry for my chinglish... :confused:

ChrisW67
8th December 2010, 21:25
STXXL http://stxxl.sourceforge.net/

marcvanriet
9th December 2010, 00:08
So you have a text file that is so big that even the word count is a bit less than your computer's memory??? Wow. Can you mail it to me? ;)

Merging two results files is easy, especially if they are already sorted. Roughly something like this:

// - open the two results files
// - read one line from each file
// - until both files have been completely read:
//     - are the two words identical?
//       - yes -> write the word and the sum of both counts to the output file,
//                then read the next line from both files
//       - no  -> write the word that comes first alphabetically (with its count)
//                to the output file, then read the next line from that word's file


You could speed this up by merging e.g. 100 files simultaneously instead of just two.
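
A minimal sketch of the two-file merge in Qt (my own helper names; it assumes each line is "word count" separated by a space, as in your examples, and that both inputs are already sorted by word):

#include <QFile>
#include <QString>
#include <QStringList>
#include <QTextStream>

// Read one "word count" line; returns false when the end of the file is reached.
static bool readEntry(QTextStream &in, QString &word, int &count)
{
    if (in.atEnd())
        return false;
    const QStringList parts = in.readLine().split(' ', QString::SkipEmptyParts);
    word = parts.value(0);
    count = parts.value(1).toInt();
    return true;
}

// Merge two sorted "word count" files into one, summing the counts of equal words.
void mergeTwo(const QString &pathA, const QString &pathB, const QString &outPath)
{
    QFile fileA(pathA), fileB(pathB), fileOut(outPath);
    if (!fileA.open(QIODevice::ReadOnly | QIODevice::Text) ||
        !fileB.open(QIODevice::ReadOnly | QIODevice::Text) ||
        !fileOut.open(QIODevice::WriteOnly | QIODevice::Text))
        return;

    QTextStream inA(&fileA), inB(&fileB), out(&fileOut);
    QString wordA, wordB;
    int countA = 0, countB = 0;
    bool haveA = readEntry(inA, wordA, countA);
    bool haveB = readEntry(inB, wordB, countB);

    while (haveA && haveB) {
        if (wordA == wordB) {            // same word: write the summed count
            out << wordA << ' ' << (countA + countB) << '\n';
            haveA = readEntry(inA, wordA, countA);
            haveB = readEntry(inB, wordB, countB);
        } else if (wordA < wordB) {      // wordA comes first alphabetically
            out << wordA << ' ' << countA << '\n';
            haveA = readEntry(inA, wordA, countA);
        } else {
            out << wordB << ' ' << countB << '\n';
            haveB = readEntry(inB, wordB, countB);
        }
    }
    // Copy whatever remains of the longer file.
    while (haveA) { out << wordA << ' ' << countA << '\n'; haveA = readEntry(inA, wordA, countA); }
    while (haveB) { out << wordB << ' ' << countB << '\n'; haveB = readEntry(inB, wordB, countB); }
}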

Maybe you could also use a MySQL database, although that will probably be too slow for word counting operations. You could store the results of all the input text files in a single table, and then execute a grouping and summing SQL statement on it. I wonder if you would hit any limitation of MySQL.
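
A rough sketch of that database approach with the QtSql module (the connection parameters, table name and column names are only placeholders):

#include <QCoreApplication>
#include <QDebug>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QVariant>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    // Placeholder connection parameters - adjust to your own server.
    QSqlDatabase db = QSqlDatabase::addDatabase("QMYSQL");
    db.setHostName("localhost");
    db.setDatabaseName("wordcounts");
    db.setUserName("user");
    db.setPassword("password");
    if (!db.open()) {
        qWarning() << "Could not open database";
        return 1;
    }

    QSqlQuery query;
    // One row per (word, count) pair taken from the intermediate files.
    query.exec("CREATE TABLE IF NOT EXISTS partial_counts (word VARCHAR(255), cnt INT)");

    // ... insert all intermediate results here, e.g. with prepared statements ...

    // Group identical words and sum their counts to get the final result.
    query.exec("SELECT word, SUM(cnt) FROM partial_counts GROUP BY word ORDER BY word");
    while (query.next())
        qDebug() << query.value(0).toString() << query.value(1).toInt();

    return 0;
}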

Best regards,
Marc