Re: File based merge sort
Quote:
I can save intermediate results in files
What exactly is an intermediate result?
What is such a result made of in terms of data?
I don't understand why intermediate results should consume so much memory...
Re: File based merge sort
Quote:
What exactly is an intermediate result?
E.g.:
word1 23
word2 12
where word* is a natural-language word and the number is the count of that word in some collection of plain text files. The size of an intermediate result is limited by the RAM size.
Quote:
What is such a result made of in terms of data?
Code:
typedef QMap<QString, int> WordCount;
Note that I am talking about the standard word-count example from the Qt examples. The difference is that I have a very large set of source plain text files to process, so the QMap cannot fit into RAM. So I write intermediate results of the same format to disk. After that I want to merge all of the intermediate results, summing the counts of the same words across the different intermediate results, e.g.:
intermediate result 1:
cat 3
dog 1
intermediate result 2:
bird 10
cat 5
final result (after external merge sort):
bird 10
cat 8
dog 1
Intermediate results are stored in separate files, each limited by the size of RAM. So I can't just read all the files into memory and sort them, because each file is a little less than the memory size.
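Note that QMap already keeps its keys in sorted order, so every intermediate file is written out sorted by word. A minimal sketch of how one intermediate result could be flushed to disk, assuming one "word count" pair per line (the helper name is just for illustration):
Code:
#include <QFile>
#include <QMap>
#include <QString>
#include <QTextStream>

typedef QMap<QString, int> WordCount;

// Flush the in-memory counts to disk as one intermediate result.
// QMap iterates in ascending key order, so the file comes out
// sorted by word, ready for a later merge.
bool writeIntermediate(const WordCount &counts, const QString &path)
{
    QFile file(path);
    if (!file.open(QIODevice::WriteOnly | QIODevice::Text))
        return false;
    QTextStream out(&file);
    for (WordCount::const_iterator it = counts.constBegin();
         it != counts.constEnd(); ++it) {
        out << it.key() << ' ' << it.value() << '\n';
    }
    return true;
}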
I know that there is a standard algorithm called "external memory merge sort" which can merge many large sorted sets of data into one single sorted set.
I could not find such an implementation for Qt, so I am asking people about it.
Maybe there is an unofficial plugin for Qt?
Sorry for my chinglish... :confused:
Re: File based merge sort
So you have a text file that is so big that even the word count is a bit less than your computer's memory? Wow. Can you mail it to me? ;)
Merging two result files is easy, especially if they are already sorted. Roughly something like this:
Code:
// - open the two result files
// - read a line from each file
// - until one of the files is completely read:
//   - are the words identical?
//     - yes -> write the word and the sum of both counts to your output file,
//              then read the next line from both files
//     - no  -> write only the entry whose word is lowest in the alphabet to
//              your output file, then read the next line from that word's file
// - copy the remaining lines of the file that is not yet exhausted
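Here is a minimal Qt sketch of that pseudocode, assuming each result file holds one "word count" pair per line and is already sorted by word. The function and helper names are made up, and error handling is reduced to the essentials:
Code:
#include <QFile>
#include <QString>
#include <QTextStream>

// Read one "word count" pair; returns false at end of file.
static bool readEntry(QTextStream &in, QString &word, qint64 &count)
{
    word.clear();
    if (in.atEnd())
        return false;
    in >> word >> count;
    return !word.isEmpty();
}

// Merge two sorted result files into one, summing counts of equal words.
bool mergeTwo(const QString &aPath, const QString &bPath, const QString &outPath)
{
    QFile fileA(aPath), fileB(bPath), fileOut(outPath);
    if (!fileA.open(QIODevice::ReadOnly | QIODevice::Text) ||
        !fileB.open(QIODevice::ReadOnly | QIODevice::Text) ||
        !fileOut.open(QIODevice::WriteOnly | QIODevice::Text))
        return false;

    QTextStream a(&fileA), b(&fileB), out(&fileOut);
    QString wordA, wordB;
    qint64 countA = 0, countB = 0;
    bool hasA = readEntry(a, wordA, countA);
    bool hasB = readEntry(b, wordB, countB);

    while (hasA && hasB) {
        if (wordA == wordB) {            // identical words: write the sum
            out << wordA << ' ' << (countA + countB) << '\n';
            hasA = readEntry(a, wordA, countA);
            hasB = readEntry(b, wordB, countB);
        } else if (wordA < wordB) {      // write the alphabetically lower word
            out << wordA << ' ' << countA << '\n';
            hasA = readEntry(a, wordA, countA);
        } else {
            out << wordB << ' ' << countB << '\n';
            hasB = readEntry(b, wordB, countB);
        }
    }
    // drain whichever file still has entries left
    while (hasA) {
        out << wordA << ' ' << countA << '\n';
        hasA = readEntry(a, wordA, countA);
    }
    while (hasB) {
        out << wordB << ' ' << countB << '\n';
        hasB = readEntry(b, wordB, countB);
    }
    return true;
}
Applying this pass repeatedly, until only one file remains, is exactly the merge phase of an external merge sort.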
You could speed this up by merging e.g. 100 files simultaneously, comparing the current word of each file; a sketch of that k-way merge follows.
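For the many-files variant, a small min-heap over the current word of each open file keeps the whole pass to a single scan. A sketch under the same assumptions as above (names again made up):
Code:
#include <QFile>
#include <QString>
#include <QStringList>
#include <QTextStream>
#include <memory>
#include <queue>
#include <vector>

// Same helper as in the two-way sketch: read one "word count" pair.
static bool readEntry(QTextStream &in, QString &word, qint64 &count)
{
    word.clear();
    if (in.atEnd())
        return false;
    in >> word >> count;
    return !word.isEmpty();
}

struct Entry {
    QString word;
    qint64 count;
    int src;                      // which input stream this pair came from
};

struct WordGreater {              // makes std::priority_queue a min-heap on word
    bool operator()(const Entry &a, const Entry &b) const
    { return a.word > b.word; }
};

// Merge any number of sorted result files in a single pass.
bool mergeMany(const QStringList &inPaths, const QString &outPath)
{
    std::vector<std::unique_ptr<QFile>> files;
    std::vector<std::unique_ptr<QTextStream>> streams;
    std::priority_queue<Entry, std::vector<Entry>, WordGreater> heap;

    for (int i = 0; i < inPaths.size(); ++i) {
        files.emplace_back(new QFile(inPaths.at(i)));
        if (!files.back()->open(QIODevice::ReadOnly | QIODevice::Text))
            return false;
        streams.emplace_back(new QTextStream(files.back().get()));
        Entry e;
        e.src = i;
        if (readEntry(*streams.back(), e.word, e.count))
            heap.push(e);
    }

    QFile outFile(outPath);
    if (!outFile.open(QIODevice::WriteOnly | QIODevice::Text))
        return false;
    QTextStream out(&outFile);

    QString curWord;              // word currently being accumulated
    qint64 curCount = 0;
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        if (e.word == curWord) {
            curCount += e.count;  // same word from another file: sum it
        } else {
            if (!curWord.isEmpty())
                out << curWord << ' ' << curCount << '\n';
            curWord = e.word;
            curCount = e.count;
        }
        Entry next;
        next.src = e.src;         // refill the heap from the same stream
        if (readEntry(*streams[e.src], next.word, next.count))
            heap.push(next);
    }
    if (!curWord.isEmpty())
        out << curWord << ' ' << curCount << '\n';
    return true;
}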
Maybe you could also use a MySQL database, but that will probably be too slow for word counting operations. You could store the results of all the input text files in a single table and then execute a grouping and summing SQL statement on it. I wonder if you would hit any limitation of MySQL.
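If you try the database route, the grouping and summing statement could look like the one below, run here through QtSql. The driver, database, table, and column names are invented for the example, and the pairs from all intermediate files are assumed to be loaded into the table already:
Code:
#include <QSqlDatabase>
#include <QSqlQuery>

// Sum the per-file counts with one GROUP BY, assuming every (word, count)
// pair from every intermediate file was loaded into a table "word_counts".
// Connection details (host, user, password) are omitted in this sketch.
bool sumCountsWithSql()
{
    QSqlDatabase db = QSqlDatabase::addDatabase("QMYSQL");
    db.setDatabaseName("wordcount");   // hypothetical database name
    if (!db.open())
        return false;

    QSqlQuery query(db);
    return query.exec("SELECT word, SUM(cnt) AS total "
                      "FROM word_counts "
                      "GROUP BY word "
                      "ORDER BY word");
}
Iterating over the result set with query.next() and writing it back to a file would then replace the merge step entirely.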
Best regards,
Marc