reading from a file



mickey
9th July 2007, 00:42
hi, I need to put the ENTIRE content of a file into 'buffer'. Is it possible? Is this a fast way to manage a file?
The code below reads 1000 chars at a time, but I need to read until EOF. Is there a way?


char* buffer=0;
while ( ifile.get(buffer,1000) ) {
cout << " buffer " << buffer << endl;
}

mickey
9th July 2007, 09:38
EDIT: (or is there a faster way?) I have a very large text file where every line is very long! thanks

jacek
9th July 2007, 12:45
The code below reads 1000 chars at a time, but I need to read until EOF. Is there a way?
What is the type of that ifile variable? You can check the file size and allocate a buffer of the appropriate size.

mickey
9th July 2007, 13:35
std::fstream ifile;
ifile.open( file_name.c_str() );

1. hmm, but anyway the 'get' causes a crash at runtime....
2. But is there a way to assign the whole text file to a char* in one "hit"? (I thought of using a vector<string> too, but if I have a very large file with long lines maybe the vector gets slow...)
thanks

jacek
9th July 2007, 16:15
the 'get' causes a crash at runtime....
How did you allocate the buffer?


But is there a way to assign the whole text file to a char* in one "hit"? (I thought of using a vector<string> too, but if I have a very large file with long lines maybe the vector gets slow...)
You could use seekg() and tellg() to check the file size first and then use a large enough buffer, but it might lead to a race condition if some other process is modifying that file.
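
A minimal sketch of that idea (the file name is an assumption, and error handling is omitted):


#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream ifile( "data.txt", std::ios::binary );

    // Find the file size: seek to the end and ask for the position.
    ifile.seekg( 0, std::ios::end );
    std::streampos size = ifile.tellg();
    ifile.seekg( 0, std::ios::beg );

    // Allocate a buffer of that size and read everything in one call.
    std::vector<char> buffer( static_cast<std::size_t>( size ) );
    ifile.read( &buffer[0], static_cast<std::streamsize>( size ) );

    std::cout << "read " << ifile.gcount() << " bytes" << std::endl;
    return 0;
}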

mickey
9th July 2007, 17:38
thanks for the reply;
I just looked at the seekg/tellg solution and it seems OK. But could I put the file into a vector<string> too? (I mean: is a vector efficient? e.g. my file is 2GB large)

jacek
9th July 2007, 21:25
is a vector efficient?
If you fill it element by element with push_back, it won't be. On the other hand, if you allocate it in one go, it won't differ much from reading the data line by line.


e.g. my file is 2GB large
How much RAM do you have in your machine? You certainly don't want your machine to start swapping.

If you want to optimize something, first use a profiler to check where the bottleneck is.

Remember: "premature optimization is the root of all evil".

mickey
9th July 2007, 23:03
Hi, I have 512MB, and with the line below my machine starts swapping! So can't I hope to use a file larger than my RAM?


ifile.read (memblock, size);

Sorry: are you saying that working on "memblock" is much less efficient than using push_back to put the file in a vector?

thanks

jacek
9th July 2007, 23:30
Hi, I have 512MB, and with the line below my machine starts swapping! So can't I hope to use a file larger than my RAM?
You can use files larger than your RAM, but in parts that fit in memory. Getting data from disk is *a lot* slower than getting it from memory, so you should avoid swapping at all costs.


are you saying that working on "memblock" is much less efficient than using push_back to put the file in a vector?
On the contrary. push_back() might cause a reallocation that copies all of the data in the vector.
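
A small illustration of the difference (the line count here is arbitrary, and how fast the capacity grows is implementation-defined):


#include <string>
#include <vector>

int main()
{
    std::vector<std::string> lines;

    // Without this reserve(), every push_back that outgrows the current
    // capacity allocates a bigger block and copies all elements over.
    // Reserving up front means one allocation and no copying while filling.
    lines.reserve( 256000 );

    for( int i = 0; i < 256000; ++i )
        lines.push_back( "some line" );

    return 0;
}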

mickey
10th July 2007, 08:57
hi, what does that mean?


...... but in parts that fit in memory.

So is "memblock = new char [size];" is the best solution?
thanks

jacek
10th July 2007, 13:11
What does that mean?
If you have 512MB of RAM, you shouldn't try to read more than that.


So is "memblock = new char [size];" is the best solution?
std::vector and std::string will be OK too, but only if you won't change their size too often.

mickey
11th July 2007, 17:01
hi, I tried reading a 200MB text file with the memblock solution above. It goes fast.
Then I tried to read the file and put it into a vector:


std::vector<std::string> text;
std::istream_iterator<std::string> it_file( ifile ), eos;   // from <iterator>; eos is the end-of-stream iterator
std::copy( it_file, eos, std::inserter( text, text.begin() ) );

this goes very slowly (I never saw the end of the copy!)
I tried this "copy" operation because I also need to sort the lines of the file. I thought I'd put the lines in a vector and then call sort() on it, but the reading is too slow and I need speed.
I'm thinking of implementing my own sort on memblock, but maybe it won't be as fast as calling sort() on a vector....... What do you suggest, please?

jacek
11th July 2007, 19:20
What is the format of the file you want to read, and what operations do you need to perform?

mickey
11th July 2007, 20:37
hi, my file contains lines of strings (it's text). At the moment I'd like to order the lines of the file (line1 < line2).

jacek
11th July 2007, 20:57
First of all, there's a small application called sort that will do that for you, but I'm not sure if it's available on Windows.

If you still want to do it yourself, it sounds like a job for merge sort. You can save some time by starting with the biggest blocks that fit into your RAM and sorting them with quicksort or a similar algorithm.

If you know the maximum line length, or at least its approximate value, you can create a std::vector< std::string > of some constant size and preallocate space in every string (you can do that using the two-parameter version of std::vector's constructor).

mickey
12th July 2007, 00:26
If you know the maximum line length, or at least its approximate value, you can create a std::vector< std::string > of some constant size and preallocate space in every string (you can do that using the two-parameter version of std::vector's constructor).
I'm confused; I know the maximum length of a line (1KB) but I cannot know the file size... When I say vector<string> I mean every element of the vector is a line of the file.
Are you saying to do something like: vector<string> line (10, ""); ?? (if so, I have allocated 10 lines... and I don't know how many lines the file could have).

jacek
12th July 2007, 12:59
I cannot know the file size...
Luckily it isn't required for merge sort.


Are you saying to do something like: vector<string> line (10, ""); ?? (if so, I have allocated 10 lines... and I don't know how many lines the file could have).
Yes. To be exact, something like:

std::string temp;
temp.reserve( MAX_LINE_LEN );
std::vector< std::string > lines( NLINES, temp );

// or if std::string tries to be smart:

std::vector< std::string > lines( NLINES );
for( std::vector< std::string >::iterator it = lines.begin();
     it != lines.end(); ++it ) {
    it->reserve( MAX_LINE_LEN );
}

Where MAX_LINE_LEN = 1024 and NLINES is around 256k. Remember that the point is to avoid reallocations and swapping.

This way you can read 256k lines into preallocated space, which should be fast enough. Then you can sort that vector using std::sort, dump everything into a temporary file and read another set of lines, and so on until the end of the file. When you are finished, you can free the memory and continue with the merge sort (http://en.wikipedia.org/wiki/Mergesort) algorithm.
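
A rough sketch of that first pass (the input and run file names are assumptions; error handling omitted):


#include <algorithm>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    const std::size_t NLINES = 256000;   // lines per in-memory chunk

    std::ifstream in( "X.txt" );         // hypothetical input file
    std::vector<std::string> lines;
    lines.reserve( NLINES );

    std::string line;
    int chunk = 0;
    while( in ) {
        lines.clear();                   // keeps the reserved capacity
        // Read up to NLINES lines into the preallocated vector.
        while( lines.size() < NLINES && std::getline( in, line ) )
            lines.push_back( line );
        if( lines.empty() )
            break;

        // Sort this chunk in memory and dump it into a temporary file
        // (run0.txt, run1.txt, ... are hypothetical names).
        std::sort( lines.begin(), lines.end() );
        std::ostringstream name;
        name << "run" << chunk++ << ".txt";
        std::ofstream out( name.str().c_str() );
        for( std::size_t i = 0; i < lines.size(); ++i )
            out << lines[i] << '\n';
    }
    return 0;
}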

fullmetalcoder
12th July 2007, 13:43
If you have 512MB of RAM, you shouldn't try to read more than that.
This doesn't make much sense... In most cases (users who have dropped text-based interfaces) the OS/desktop/background tasks occupy about half of the available memory (sometimes more). Thus it is highly recommended not to load a whole file that is bigger than 40% of your available memory (yet it is possible). In such cases the best way is to read pieces of the file, perform some tasks and then discard them before loading new pieces... It can be a little tricky to work this way when sorting lines of text, but it's still doable.

mickey
12th July 2007, 16:55
OK! But I still don't understand this:
1. I don't know the number of lines (I only know the file's size, via the instructions below)


ifile.seekg (0, ios::end);
size = ifile.tellg();

So I cannot allocate the space. However, I know the maximum length of one line, so I can allocate that much for each string.
Also, I only have to sort line by line (not the words inside a line and then line by line).
2. After that, in your opinion, can I use


memblock = new char [size];
ifile.read (memblock, size);

to do all the work (I still believe it'll be faster)??

jacek
12th July 2007, 16:58
This doesn't make much sense...
What doesn't make sense exactly?


In such cases the best way is to read pieces of the file, perform some tasks and then discard them before loading new pieces... It can be a little tricky to work this way when sorting lines of text, but it's still doable.
Isn't that what I wrote in my two previous posts?

jacek
12th July 2007, 17:06
1. I don't know the number of lines (I only know the file's size, via the instructions below)
If you are going to use merge sort, you don't have to know this.


After that, in your opinion, can I use
[...]
to do all the work (I still believe it'll be faster)??
The version with std::vector will be easier to implement, and if you preallocate memory it shouldn't be significantly slower.

mickey
12th July 2007, 17:29
EDIT: OK. Like the other user, I don't understand the trick of working on blocks of 256k.

jacek
12th July 2007, 18:52
Like the other user, I don't understand the trick of working on blocks of 256k.
Suppose you have file X and you want to sort it. Using merge sort you would do it this way:
1. blockSize := 1.
2. while not end of file X:
       copy blockSize lines to file A,
       if no more lines in file X:
           END: file A is sorted.
       copy blockSize lines to file B.
3. rewind files X, A and B.
4. while not end of file A and B:
       merge two blocks from A and B and write them to X.
5. rewind files X, A and B.
6. blockSize := 2 * blockSize.
7. go to 2.


For example, let X = 42, 24, 67, 45, 79, 33, 76, 85, 29, 59, 26, 92, 30, 56, 81, 27.

Iteration 1:
A = 24, 45, 33, 85, 59, 92, 56, 27
B = 42, 67, 79, 76, 29, 26, 30, 81

X = 24, 42, 45, 67, 33, 79, 76, 85, 29, 59, 26, 92, 30, 56, 27, 81

Now X consists of 2-element sorted blocks.

Iteration 2 (this time blockSize is 2):
A = 24, 42, 33, 79, 29, 59, 30, 56
B = 45, 67, 76, 85, 26, 92, 27, 81

X = 24, 42, 45, 67, 33, 76, 79, 85, 26, 29, 59, 92, 27, 30, 56, 81

Now 4-element blocks are sorted.

Iteration 3 (blockSize = 4):
A = 24, 42, 45, 67, 26, 29, 59, 92
B = 33, 76, 79, 85, 27, 30, 56, 81

X = 24, 33, 42, 45, 67, 76, 79, 85, 26, 27, 29, 30, 56, 59, 81, 92

Final iteration:
A = 24, 33, 42, 45, 67, 76, 79, 85,
B = 26, 27, 29, 30, 56, 59, 81, 92

X = 24, 26, 27, 29, 30, 33, 42, 45, 56, 59, 67, 76, 79, 81, 85, 92


Note that:

- you don't have to read whole blocks to merge them,
- you don't have to start with blockSize = 1 (you can use some more efficient algorithm to sort blocks in memory, dump them into a file and then continue with merge sort),
- when the size of the input file isn't a power of 2, you will have a block that is shorter than blockSize.
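
As a sketch of the merge step itself, here is how two fully sorted line files could be merged into one (file names are assumptions); merging fixed-size blocks repeatedly works the same way, just bounded by a line counter:


#include <fstream>
#include <string>

int main()
{
    std::ifstream a( "A.txt" ), b( "B.txt" );   // two sorted inputs
    std::ofstream x( "X.txt" );

    std::string la, lb;
    bool haveA = !std::getline( a, la ).fail();
    bool haveB = !std::getline( b, lb ).fail();

    // Repeatedly emit the smaller head line and refill from that file.
    while( haveA && haveB ) {
        if( la <= lb ) {
            x << la << '\n';
            haveA = !std::getline( a, la ).fail();
        } else {
            x << lb << '\n';
            haveB = !std::getline( b, lb ).fail();
        }
    }
    // Copy whatever remains in the file that isn't exhausted yet.
    while( haveA ) { x << la << '\n'; haveA = !std::getline( a, la ).fail(); }
    while( haveB ) { x << lb << '\n'; haveB = !std::getline( b, lb ).fail(); }

    return 0;
}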

mickey
12th July 2007, 20:21
The thing becomes interesting....


The version with std::vector will be easier to implement, and if you preallocate memory it shouldn't be significantly slower.
I still have one point that isn't clear: how do I put the lines of file X into the vector of lines?


std::string temp;
std::size_t l = 0;
while( getline( ifile, temp, '\n' ) ) {
    lines[l] = temp;
    l++;
}

When I speak of "speed" I mean that I know reading a file line by line isn't so efficient... (!?)
Then you said to work on 3 files: but is file access and working on files that efficient? (Maybe it would be better to use 2 vectors instead of 2 files?)
thanks

jacek
12th July 2007, 21:26
how do I put the lines of file X into the vector of lines?
I would use operator>>.


When I speak of "speed" I mean that I know reading a file line by line isn't so efficient... (!?)
Then you said to work on 3 files: but is file access and working on files that efficient? (Maybe it would be better to use 2 vectors instead of 2 files?)
If you read a 2GB file into 512MB of RAM, over 3/4 of it will land back on the hard drive in a swap partition/file. If you divide that file into several blocks, sort them in memory and finally merge them, it should be faster than sorting a vector of which 3/4 is in swap.

If you write each sorted part of the file into separate temporary file and then merge all of those files at once (not two at a time as in classical merge sort), you will need only one merge operation. This means that you will have to read and write every line from/to hard drive only twice.
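
That one-pass merge of all run files could look like this sketch, keeping only the current head line of each run in a priority queue (the run file names and count are assumptions carried over from the earlier sketch):


#include <fstream>
#include <functional>
#include <queue>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main()
{
    const int NRUNS = 8;              // number of sorted run files (assumed)

    typedef std::pair<std::string, int> Entry;   // (head line, run index)
    // std::greater turns the default max-heap into a min-heap on the line text.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;

    std::vector<std::ifstream*> runs;
    for( int i = 0; i < NRUNS; ++i ) {
        std::ostringstream name;
        name << "run" << i << ".txt";
        runs.push_back( new std::ifstream( name.str().c_str() ) );
        std::string line;
        if( !std::getline( *runs[i], line ).fail() )
            heap.push( Entry( line, i ) );
    }

    std::ofstream out( "sorted.txt" );
    while( !heap.empty() ) {
        Entry e = heap.top();         // smallest head line over all runs
        heap.pop();
        out << e.first << '\n';
        std::string line;             // refill from the run that line came from
        if( !std::getline( *runs[e.second], line ).fail() )
            heap.push( Entry( line, e.second ) );
    }

    for( int i = 0; i < NRUNS; ++i )
        delete runs[i];
    return 0;
}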

fullmetalcoder
13th July 2007, 09:30
What doesn't make sense exactly?
Saying that a file bigger than memory shouldn't be loaded in order to prevent swap use, because the point where swapping starts is actually much lower than the physical memory size...

mickey
13th July 2007, 17:50
Luckily it isn't required for merge sort.
Yes. To be exact, something like:

std::string temp;
temp.reserve( MAX_LINE_LEN );
std::vector< std::string > lines( NLINES, temp );

// or if std::string tries to be smart:

std::vector< std::string > lines( NLINES );
for( std::vector< std::string >::iterator it = lines.begin();
     it != lines.end(); ++it ) {
    it->reserve( MAX_LINE_LEN );
}

Where MAX_LINE_LEN = 1024 and NLINES is around 256k. Remember that the point is to avoid reallocations and swapping.

This way you can read 256k lines into preallocated space, which should be fast enough. Then you can sort that vector using std::sort, dump everything into a temporary file and read another set of lines, and so on until the end of the file. When you are finished, you can free the memory and continue with the merge sort (http://en.wikipedia.org/wiki/Mergesort) algorithm.
Sorry, but is the solution with the examples from some posts before different from what you said here above?? In the solution with the examples, where do I have the vectors? Where do I call sort()?

jacek
13th July 2007, 22:17
Saying that a file bigger than memory shouldn't be loaded in order to prevent swap use, because the point where swapping starts is actually much lower than the physical memory size...
In other words, you are saying that the upper bound is incorrect because it's too big?

jacek
13th July 2007, 22:44
Sorry, but is the solution with the examples from some posts before different from what you said here above??
No, that's the first step, which allows you to save some time.


In the solution with the examples, where do I have the vectors? Where do I call sort()?
Instead of starting the merge sort algorithm with blockSize = 1, you can start it with blockSize = 256000, but you'll need a file that contains sorted blocks of 256000 lines. That's where you are going to use the vector and sort().

1. f := [A, B].
2. fileId := 0.
3. while not end of file X:
       vector := read 256000 lines from file X,
       sort vector,
       write vector to file f[fileId],
       fileId := 1 - fileId.
4. blockSize := 256000.
5. go to 3 from post #23.

Merge sort is one of the basic algorithms that every programmer should know. If you need more explanations, see Algorithms + Data Structures = Programs by Niklaus Wirth.

mickey
14th July 2007, 19:52
sorry, last question: does your solution rely on many (file_size/block_size) temporary files? Because that's what I've only just understood...

jacek
15th July 2007, 00:45
sorry, last question: does your solution rely on many (file_size/block_size) temporary files? Because that's what I've only just understood...
You need at least two temporary files, but you can use more.

mickey
17th July 2007, 02:25
sorry, but I'm thinking that what I don't understand isn't the merge sort...
You said that I can do it with 2 files A and B. So I work with 256,000 lines: sort them and put them in file A; then read another 256,000 lines, sort(), and put them in file B. Now I use merge sort to merge the two files and write them to the first 512,000 lines of file X. Now I go on: 256,000*2 = 512,000; now the first 512,000 lines of X are sorted! But the file isn't finished. Here begins what I don't understand.... (I'm thinking this:) rewind X, A, B. From line 512,001 of X: take a block of 256,000 lines, sort, copy to A; take the next block of 256,000 lines of X, sort, copy to B; merge sort A and B and overwrite lines 512,001 to 1,024,000; take a block of 256,000 lines of X from line 1,024,001: there are only 120,000 lines! Sort them. Restart with blockSize 256,000*2 = 512,000: copy lines 1 to 512,000 to A (they're sorted); copy lines from 512,001 to B (sorted too); merge sort A and B and write 1,024,000 lines to X; take the lines from 1,024,001 to the end: copy the 1,024,000 sorted lines to A and copy the 120,000 lines to B; merge sort A and B and write to X. Stop. Is that it? It seems slow to me...

jacek
19th July 2007, 01:04
You said that I can do it with 2 files A and B. So I work with 256,000 lines: sort them and put them in file A; then read another 256,000 lines, sort(), and put them in file B. Now I use merge sort to merge the two files and write them to the first 512,000 lines of file X. Now I go on: 256,000*2 = 512,000; now the first 512,000 lines of X are sorted! But the file isn't finished. Here begins what I don't understand....
Before you start to merge anything, you have to split the whole file, not only the first blocks.

If you have a file 'X' with 2048k lines, you have to do the following:
repeat 4 times:
    lines := read 256k lines from X,
    sort lines,
    write lines to file A,
    lines := read 256k lines from X,
    sort lines,
    write lines to file B.
rewind files.
repeat 4 times:
    merge 256k-line blocks from A and B and append them to X.
rewind files.
repeat 2 times:
    append 512k lines from X to A,
    append 512k lines from X to B.
rewind files.
repeat 2 times:
    merge 512k-line blocks from A and B and append them to X.
rewind files.
append 1024k lines from X to A.
append 1024k lines from X to B.
rewind files.
merge the two 1024k-line blocks from A and B and append them to X.



It seems slow to me...
Of course it's slow. If you can't fit your data in memory, everything will be slow, but other methods might be even slower.

Although there might be another solution. If you can find a function f such that line1 < line2 implies f(line1) < f(line2), you can sort the values returned by that function instead.
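
One common instance of this idea (my own illustration, not something spelled out above) is to build a cheap fixed-size numeric key from each line's prefix, compare the keys first, and fall back to the full strings only when two prefixes tie:


#include <algorithm>
#include <string>
#include <vector>

// Pack the first 8 characters of a line into a 64-bit key. Unsigned
// byte-by-byte packing preserves lexicographic order on that prefix.
unsigned long long prefixKey( const std::string& s )
{
    unsigned long long k = 0;
    for( std::size_t i = 0; i < 8; ++i ) {
        k <<= 8;
        if( i < s.size() )
            k |= static_cast<unsigned char>( s[i] );
    }
    return k;
}

struct Record {
    unsigned long long key;
    const std::string* line;      // points into the loaded block
};

// Compare the cheap integer keys first; only when two prefixes are
// equal do we pay for a full string comparison.
bool byKey( const Record& a, const Record& b )
{
    if( a.key != b.key )
        return a.key < b.key;
    return *a.line < *b.line;
}

int main()
{
    std::vector<std::string> lines;   // filled from the file elsewhere
    lines.push_back( "banana" );
    lines.push_back( "apple" );

    std::vector<Record> recs( lines.size() );
    for( std::size_t i = 0; i < lines.size(); ++i ) {
        recs[i].key = prefixKey( lines[i] );
        recs[i].line = &lines[i];
    }
    std::sort( recs.begin(), recs.end(), byKey );

    return 0;
}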