Advice: Reading large text file.



enricong
14th July 2011, 16:44
I need to read a large (multi-GB) text file.

It is formatted something like this:



------------------------

Record ID: 1
Attribute 1: 1234
Attribute 2: 1234 1234 1234
1234 1234 1234
Attribute 3: 1234

--------------------------------

Record ID: 2
Attribute 1:

....


There may be hundreds of thousands of records in the file. I have a list of maybe 500 record IDs, and I need to pull the attributes for each of those 500. Also, a record may refer to another record; in that case, I also need to pull the data for the referenced record (e.g. Record ID #1000 refers to ID #50, so I then need to go back and get Record #50).

I decided to readAll() the whole file into a QString.
Then I use indexOf() to find the section of text that I want.
Then I feed the record text into a QTextStream so I can use readLine() to parse out every line and get the attributes.
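
Roughly, the approach looks like this (a sketch only; the file name, the record ID, and the separator string are placeholders based on the example format above):

#include <QDebug>
#include <QFile>
#include <QString>
#include <QTextStream>

int main()
{
    QFile file("records.txt");                          // placeholder file name
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return 1;

    // Pull the entire file into memory as one QString.
    const QString contents = QString::fromUtf8(file.readAll());

    // Find one record by ID and cut out the text up to the next separator line.
    const int start = contents.indexOf("Record ID: 1000");
    if (start != -1) {
        const int end = contents.indexOf("------", start + 1);
        QString recordText = contents.mid(start, end == -1 ? -1 : end - start);

        // Parse the record line by line.
        QTextStream in(&recordText);
        while (!in.atEnd()) {
            const QString line = in.readLine();
            if (line.startsWith("Attribute"))
                qDebug() << line;
        }
    }
    return 0;
}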

I read the file into memory first because I figure operations in memory will be faster.

Another thought is to just readLine() directly from the QFile, saving off what I need. However, since I would be making repeated accesses to disk, I figure this might be slower.

Also, I may need to read back and find another record. For example, when I get to record 1000, it may refer to record 50, so I have to go back and find record 50. I may have encountered record 50 first, but I would have discarded it because I would not know I needed it until I got to record 1000. This seems to be another advantage of the first approach, since I can just use indexOf() on the QString.


Does this sound like the best way for me to do this, or is there a faster way?

Santosh Reddy
14th July 2011, 17:00
It would be better to store the processed records in memory and reuse them, instead of re-loading them from the file.

I would say: open the file, load all the records, store them in memory in a processed format, and then run your processing on them. That way the file is read only once, during loading; the rest of the time, records are read from memory. (But you need to be sure of the size of each record and the total memory required to load all the records.)
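
A rough sketch of that idea, assuming the record layout from the first post (the Record struct and the field names here are just illustrative, not an existing API):

#include <QFile>
#include <QHash>
#include <QString>
#include <QStringList>
#include <QTextStream>

struct Record {
    QString id;
    QStringList attributes;
};

QHash<QString, Record> loadAllRecords(const QString &path)
{
    QHash<QString, Record> records;
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return records;

    QTextStream in(&file);
    Record current;
    while (!in.atEnd()) {
        const QString line = in.readLine();
        if (line.startsWith("-----")) {              // separator: flush the record
            if (!current.id.isEmpty())
                records.insert(current.id, current);
            current = Record();
        } else if (line.startsWith("Record ID:")) {
            current.id = line.mid(QStringLiteral("Record ID:").size()).trimmed();
        } else if (!line.trimmed().isEmpty()) {
            current.attributes << line;
        }
    }
    if (!current.id.isEmpty())                       // last record has no trailing separator
        records.insert(current.id, current);
    return records;
}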

enricong
14th July 2011, 19:05
I have tried that too and it seems to take slightly longer.

If I understand you correctly, you mean parsing through the entire file and saving the records in memory.

The file may have a million records, so I tried going through it line by line and saving the million records in a QHash, then accessing the QHash to get the 500 records I actually care about.

However, when I changed to just loading the entire file TEXT into a QString, then searching the QString for the 500 records I want, it was faster.

I think this is because parsing through a record to get the data I want takes X amount of time. If I just do it for 500 records, that's 500X time, versus 1,000,000X time if I do it for every record. I guess just storing the text in memory and using indexOf() is faster.

But yes, I agree I will be wasting a lot of memory doing this. I could delete the QString after I'm done with it, but all the program does after it gets the 500 records I care about is save them to a CSV file and then exit.

I'll have to do some testing to see how much memory it's using and whether it's too much. I wish there were an indexOf() function that let me search for a substring in a file directly. That way I could pull just the parts of the text I need into memory instead of the entire thing.
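
Something like that could probably be approximated by scanning the file in chunks. This is only a rough sketch (there is no such built-in Qt function as far as I know, and the chunk size and boundary handling are my own guesses):

#include <QByteArray>
#include <QFile>

qint64 fileIndexOf(QFile &file, const QByteArray &needle, qint64 from = 0)
{
    const qint64 chunkSize = 1 << 20;                // read 1 MB at a time
    QByteArray window;                               // previous tail + current chunk
    qint64 windowStart = from;

    file.seek(from);
    while (!file.atEnd()) {
        window += file.read(chunkSize);
        const int hit = window.indexOf(needle);
        if (hit != -1)
            return windowStart + hit;                // absolute offset in the file

        // Keep only the last needle.size()-1 bytes so a match spanning a
        // chunk boundary is not missed.
        const int keep = needle.size() - 1;
        if (window.size() > keep) {
            windowStart += window.size() - keep;
            window = window.right(keep);
        }
    }
    return -1;
}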

Another option would be to make two passes over the file, but then I have to read the entire file twice.

enricong
15th July 2011, 17:46
I tried to open a 2.5 GB file and found that readAll() would not work.

I tried reading line by line and saving the formatted data; however, that took a long time.

Then I tried reading through the entire file and indexing the file position of the start of each record (keyed by its record ID). Then I used seek() to get the details of each record I was interested in. It involved two passes, though.
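
Roughly like this (a sketch only; the separator and field names follow the example format from my first post):

#include <QFile>
#include <QHash>
#include <QString>
#include <QStringList>

// Pass 1: build an index of record ID -> byte offset of the record's "Record ID:" line.
QHash<QString, qint64> indexRecords(QFile &file)
{
    QHash<QString, qint64> index;
    file.seek(0);
    while (!file.atEnd()) {
        const qint64 lineStart = file.pos();
        const QByteArray line = file.readLine();
        if (line.startsWith("Record ID:"))
            index.insert(QString::fromUtf8(line.mid(10)).trimmed(), lineStart);
    }
    return index;
}

// Pass 2: jump to one record with seek() and read until the next separator line.
QStringList readRecordAt(QFile &file, qint64 offset)
{
    QStringList lines;
    file.seek(offset);
    while (!file.atEnd()) {
        const QString line = QString::fromUtf8(file.readLine()).trimmed();
        if (line.startsWith("-----"))
            break;
        lines << line;
    }
    return lines;
}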

I will have to test which way is faster.

DanH
15th July 2011, 21:07
You say you only need about 500 records out of the file. Read the records one at a time and see if each is one you need. If so, save it; otherwise discard it. Then read the next.

Be sure to enable buffering (i.e. use QTextStream or the equivalent) on the file so that performance doesn't suck.
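
Something like this, for example (a sketch; the wanted-ID set and the field names are placeholders based on the format posted above):

#include <QFile>
#include <QHash>
#include <QSet>
#include <QString>
#include <QStringList>
#include <QTextStream>

QHash<QString, QStringList> collectRecords(const QString &path,
                                           const QSet<QString> &wantedIds)
{
    QHash<QString, QStringList> kept;
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return kept;

    QTextStream in(&file);                 // buffered reads over the raw file
    QString currentId;
    QStringList currentLines;
    while (!in.atEnd()) {
        const QString line = in.readLine();
        if (line.startsWith("-----")) {    // end of a record: keep or discard it
            if (wantedIds.contains(currentId))
                kept.insert(currentId, currentLines);
            currentId.clear();
            currentLines.clear();
        } else {
            if (line.startsWith("Record ID:"))
                currentId = line.mid(10).trimmed();
            currentLines << line;
        }
    }
    if (wantedIds.contains(currentId))     // last record has no trailing separator
        kept.insert(currentId, currentLines);
    return kept;
}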

enricong
16th July 2011, 04:25
You say you only need about 500 records out of the file. Read the records one at a time and see if each is one you need. If so, save it; otherwise discard it. Then read the next.

Be sure to enable buffering (i.e. use QTextStream or the equivalent) on the file so that performance doesn't suck.

Yes, but I don't know up front what all the records I need are.
I will start with a list of maybe 300 record IDs, so I go through the file one by one as you say. However, inside each record there may be a reference to another record; in that case, I need to go find that record too. I could read the file line by line as you suggest, but I would need two passes to go back and get the references. That is why I do one full pass first to index the file, then get the records I need using seek(). I assume that when I use seek(), it jumps to that part of the file and does not need to read every line.
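
Roughly like this (a sketch only; the "Refers to:" field name is just a guess at how a reference might appear, and readRecordAt() is the helper from my earlier sketch):

#include <QFile>
#include <QHash>
#include <QList>
#include <QQueue>
#include <QSet>
#include <QString>
#include <QStringList>

// Helper from the earlier indexing sketch: reads one record at a given offset.
QStringList readRecordAt(QFile &file, qint64 offset);

QHash<QString, QStringList> pullWithReferences(QFile &file,
                                               const QHash<QString, qint64> &index,
                                               const QList<QString> &startIds)
{
    QHash<QString, QStringList> result;
    QQueue<QString> pending;
    QSet<QString> seen;
    for (const QString &id : startIds) {
        pending.enqueue(id);
        seen.insert(id);
    }

    while (!pending.isEmpty()) {
        const QString id = pending.dequeue();
        if (!index.contains(id))
            continue;
        const QStringList lines = readRecordAt(file, index.value(id));
        result.insert(id, lines);

        // Queue any records this one refers to, if not already handled.
        for (const QString &line : lines) {
            if (line.startsWith("Refers to:")) {
                const QString refId = line.mid(10).trimmed();
                if (!seen.contains(refId)) {
                    seen.insert(refId);
                    pending.enqueue(refId);
                }
            }
        }
    }
    return result;
}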

nish
16th July 2011, 08:04
Handling a text file of GB size is stupidity. Convert your text file into an SQLite database, and then you can fire simple queries to select the desired data. Since your text file is already formatted, it would take just one small function to convert the text records into SQLite records.
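
Something like this, using Qt's SQL module (QT += sql in the .pro file); the table layout and the values here are placeholders:

#include <QCoreApplication>
#include <QDebug>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>
#include <QVariant>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    // One-time conversion target: an SQLite file (name is a placeholder).
    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE");
    db.setDatabaseName("records.db");
    if (!db.open())
        return 1;

    QSqlQuery query;
    query.exec("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, attributes TEXT)");

    // ... parse the text file once (as in the earlier sketches) and insert each record:
    query.prepare("INSERT OR REPLACE INTO records (id, attributes) VALUES (?, ?)");
    query.addBindValue(QString("1000"));
    query.addBindValue(QString("Attribute 1: 1234"));
    query.exec();

    // Afterwards, pulling any record (including referenced ones) is a simple
    // query, with no need to rescan the text file.
    query.prepare("SELECT attributes FROM records WHERE id = ?");
    query.addBindValue(QString("1000"));
    if (query.exec() && query.next())
        qDebug() << query.value(0).toString();

    return 0;
}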

DanH
16th July 2011, 12:11
Certainly you can read through once and then index back with "seek". No need to read the entire file at once.

Or convert to a SQL DB.