
View Full Version : Reading large unsorted file



GSS
14th July 2016, 12:28
Hi everybody!

I have a few text files with roughly 50 million lines. Each line has 3 fields separated by spaces, the first of which is an ID.

The file is not sorted, so there are multiple entries for the same ID scattered all over the file. I would like to create a separate file for each ID, but with my limited capabilities I have no idea how to make this fast. Any advice?

Thanks

ps: I want the lines in each file sorted by the second column.

anda_skoa
14th July 2016, 13:22
I would do this in two steps:

1) read the input file line by line and create a temporary file for each ID
2) then process each of those files: read the data into a sorted container in memory, e.g. QMap, then write that into the target file

Cheers,
_
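The splitting step can be sketched as follows. This is a minimal in-memory illustration in standard C++, where a map of vectors stands in for the per-ID temporary files (real code would append each group to its own QFile opened in Append mode):

```cpp
#include <map>
#include <string>
#include <vector>

// Step 1 of the two-step plan: group lines by their leading ID field.
// Each bucket corresponds to one per-ID temporary file.
std::map<std::string, std::vector<std::string>> splitById(
        const std::vector<std::string>& lines) {
    std::map<std::string, std::vector<std::string>> buckets;
    for (const auto& line : lines) {
        // the ID is everything before the first space
        std::string id = line.substr(0, line.find(' '));
        buckets[id].push_back(line);   // append, like a file opened for appending
    }
    return buckets;
}
```

Because the input is read strictly line by line, this step needs only one pass and essentially no memory beyond the buffers for the open output files.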

GSS
14th July 2016, 14:13
I would do this in two steps:

1) read the input file line by line and create a temporary file for each ID
2) then process each of those files: read the data into a sorted container in memory, e.g. QMap, then write that into the target file

Cheers,
_

Thanks!

Wouldn't it be even better/faster/simpler if I put all the initial data into a QMultiMap (I'm not sure if this is even possible)?

d_stranz
14th July 2016, 16:44
Wouldn't it be even better/faster/simpler if I put all the initial data into a QMultiMap

I would use QMap< QPair< QString, QString >, QString >. QMap sorts by key value. The QPair<> key has the ID as the first entry, the column 1 value as the second entry. QPair's less than operator compares the first values and then the second values if there is a tie. Thus, when you finish building the map, you can read it out in order sorted by ID then column 1 within ID.

Assuming you can fit 50 million lines in memory, that is.
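A minimal sketch of this pair-key approach, using std::map and std::pair as stand-ins for QMap and QPair (both maps sort by key, and both pairs compare the first element, then the second on a tie):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sort "id time value" lines by (id, time) using a pair-keyed sorted map.
// Note: like QMap::insert, operator[] keeps only one value per duplicate
// (id, time) key; duplicates would need a multimap instead.
std::vector<std::string> sortLines(const std::vector<std::string>& lines) {
    std::map<std::pair<std::string, std::string>, std::string> m;
    for (const auto& line : lines) {
        auto p1 = line.find(' ');
        auto p2 = line.find(' ', p1 + 1);
        m[{line.substr(0, p1), line.substr(p1 + 1, p2 - p1 - 1)}] =
            line.substr(p2 + 1);
    }
    // iterating the map yields the lines sorted by id, then time within id
    std::vector<std::string> out;
    for (const auto& [key, value] : m)
        out.push_back(key.first + " " + key.second + " " + value);
    return out;
}
```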

anda_skoa
14th July 2016, 17:30
Assuming you can fit 50 million lines in memory, that is.

Exactly, hence the idea of processing each ID separately.

But if you have the memory to do all in one go then I would go for a map of maps or a hash of maps.

Cheers,
_

Radek
16th July 2016, 09:16
If you cannot process 50 million records in memory, look up an algorithm called file merging (external merge sort). It lets you sort huge files quite effectively using your hard disk and only a small amount of memory. 50 million records isn't a big task for that algorithm.
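The key merge step of an external merge sort can be illustrated in memory. In this sketch, two already-sorted vectors stand in for sorted chunk files on disk; the merge only ever inspects the head of each run, so memory use stays constant regardless of run size:

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Merge two sorted runs into one sorted run. On disk, `a` and `b` would be
// chunk files produced by sorting pieces of the input that fit in RAM, and
// `out` would be written to a new file line by line.
std::vector<std::string> mergeRuns(const std::vector<std::string>& a,
                                   const std::vector<std::string>& b) {
    std::vector<std::string> out;
    std::merge(a.begin(), a.end(), b.begin(), b.end(),
               std::back_inserter(out));
    return out;
}
```

Repeating this pairwise (or with a k-way merge) over all chunks yields a fully sorted file while keeping memory bounded.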

deepal_de
16th July 2016, 13:40
If you want to make it fast, use Qt Concurrent. No matter how bad your code is, Qt Concurrent will use all your cores and finish the job faster.




#include <QtConcurrent>

void readDataFile(const QString &filePath)
{
    // read data from file here
}

int main(int argc, char *argv[])
{
    // note the escaped backslashes in the Windows paths;
    // run() uses the global thread pool by default
    QFuture<void> f1 = QtConcurrent::run(&readDataFile, QString("C:\\DataFile1.txt"));
    QFuture<void> f2 = QtConcurrent::run(&readDataFile, QString("C:\\DataFile2.txt"));
    QFuture<void> f3 = QtConcurrent::run(&readDataFile, QString("C:\\DataFile3.txt"));

    // wait, otherwise main() returns before the tasks finish
    f1.waitForFinished();
    f2.waitForFinished();
    f3.waitForFinished();
}

GSS
18th July 2016, 10:32
Hi everyone!

I tried the suggestions of using QMap and QPair but it takes ages! For example, one of the files has almost 25 million lines and it took 14 hours just to read it with the following code:



#include <QCoreApplication>
#include <QMultiMap>
#include <QtCore/QFile>
#include <QTextStream>
#include <QStringList>
#include <iostream>
#include <QDebug>
#include <QDir>
#include <QByteArray>
#include <QTime>

using namespace std;

int main(int argc, char *argv[])
{
    QTime myTimer;
    QCoreApplication a(argc, argv);
    QPair<QString, QString> pair1;
    QMap<QPair<QString, QString>, QString> map1;
    QMap<QString, QString> map2;
    QMap<QString, QMap<QString, QString> > map3;
    QString id, time, value;

    QFile file("File2.txt");
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        qDebug() << file.fileName() << file.error() << file.errorString();
        return 0;
    }

    QTime begin, end;
    begin = begin.currentTime();

    QTextStream in(&file);
    myTimer.start();
    int count = 0;
    /*while (!in.atEnd()) {
        QString line = in.readLine();
        QStringList list1 = line.split(' ');
        pair1.first = list1[0];
        pair1.second = list1[1];
        map1.insert(pair1, list1[2]);
        cout << count++ << "\n";
    }*/
    while (!in.atEnd()) {
        QString line = in.readLine();
        QStringList list1 = line.split(' ');
        map2.insert(list1[1], list1[2]);
        map3.insert(list1[0], map2);
        cout << count++ << "\n";
    }

    end = end.currentTime();
    cout << begin.hour() << ":" << begin.minute() << ":" << begin.second() << " "
         << myTimer.elapsed() << " "
         << end.hour() << ":" << end.minute() << ":" << end.second() << "\n";
    return 0;
}


I'll give QtConcurrent a try.

Lesiok
18th July 2016, 10:38
As I see, you don't reset map2, so after each line map2 contains all lines from the beginning up to the current one.
I think the first line in the while loop should be:
map2.clear();

anda_skoa
18th July 2016, 11:08
I'll give QtConcurrent a try.
No point in doing that: QtConcurrent is for parallelizing multiple tasks, and you only have one task.


As I see, you don't reset map2, so after each line map2 contains all lines from the beginning up to the current one.
I think the first line in the while loop should be:
map2.clear();

No, every loop iteration is a single record.

The loop needs to work with map3 and only with map3.

Something like


map3[list1[0]].insertMulti(list1[1], list1[2]);

i.e. retrieve or create the inner map for key "list1[0]" then insert the current inner pair, allowing multiple values for key "list1[1]"

Since the result is written back into files again, one could probably even avoid the encoding and decoding and just work with QByteArray instead of QString.

Is the machine's physical RAM large enough to do that without swapping?
Such a long time sounds as if the machine has started swapping.

Cheers,
_
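The corrected loop can be sketched with standard containers, where std::map stands in for the outer map3 and std::multimap stands in for the inner QMap with insertMulti (both keep all values, including repeated inner keys):

```cpp
#include <map>
#include <string>

// A single nested container keyed by ID, where each inner container keeps
// *all* (time, value) pairs for that ID -- the std equivalent of
//     map3[list1[0]].insertMulti(list1[1], list1[2]);
using Records = std::map<std::string, std::multimap<std::string, std::string>>;

void addRecord(Records& map3, const std::string& id,
               const std::string& time, const std::string& value) {
    // operator[] retrieves or creates the inner map for `id`;
    // multimap::emplace allows repeated `time` keys
    map3[id].emplace(time, value);
}
```

Iterating the outer map then yields one sorted group per ID, each already sorted by the second column, ready to be written out to the per-ID files.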

GSS
18th July 2016, 11:58
As I see, you don't reset map2, so after each line map2 contains all lines from the beginning up to the current one.
I think the first line in the while loop should be:
map2.clear();

Well spotted. This reduced the running time to 40 minutes.


The loop needs to work with map3 and only with map3.

Something like


map3[list1[0]].insertMulti(list1[1], list1[2]);

i.e. retrieve or create the inner map for key "list1[0]" then insert the current inner pair, allowing multiple values for key "list1[1]"

Since the result is written back into files again, one could probably even avoid the encoding and decoding and just work with QByteArray instead of QString.

Is the machine's physical RAM large enough to do that without swapping?
Such a long time sounds as if the machine has started swapping.

Cheers,
_

I'm still running with your suggestions so I cannot tell if it will take less than the 40 minutes.

Regarding memory, it doesn't seem to be an issue. The file is almost 400 MB and the application uses less than 3 GB while running. Anyway, I'm not familiar with the concept of "swap memory".

Using QByteArray seems to be interesting but how could I then sort the entries and put everything back to files?

Thanks!

EDIT: Maybe memory is an issue. I just got a "Segmentation fault" and only half of the lines were read. Perhaps this happened now and not with the previous attempts because I have other applications running now.

EDIT2:

The loop needs to work with map3 and only with map3.

Something like


map3[list1[0]].insertMulti(list1[1], list1[2]);


I ran it for only 1/4 of the lines and it took 10 minutes, which means about 40 minutes in total, the same as with the correction suggested by Lesiok.

Lesiok
18th July 2016, 12:56
No, every loop iteration is a single record.
Sorry anda_skoa, you're wrong. map2 is declared outside the loop, thus every iteration appends further elements. Adding a clear() significantly shortened the duration of the process.

anda_skoa
18th July 2016, 13:29
Sorry anda_skoa, you're wrong.

Maybe, but I don't think so :-)



map2 is declared outside the loop thus every iteration appends next elements.

yes



Adding a clear() significantly shortened the duration of the process.
Yes, and it also makes map3 contain only the last map2 for each key, which will always be just one entry, since it is cleared for every line in the file.

My understanding was that the program should keep all entries for a "map3" key, not just the last pair.


Regarding memory, it doesn't seem to be an issue. The file has almost 400 MB and the application uses less than 3 GB while running. Anyway, I'm not familiar with the concepts of "swap memory".

Swap is a file or partition on disk to which the operating system "swaps" memory that is currently not used by the running application, e.g. memory used by another application which is currently not running.
See https://en.wikipedia.org/wiki/Swap_memory

Getting a system into a state where it starts swapping heavily will slow it "to its knees", since there is a lot of I/O overhead with a memory device that is slow compared to RAM.



Using QByteArray seems to be interesting but how could I then sort the entries and put everything back to files?

I am not sure what you mean, QFile is a QIODevice subclass, it has read/write functions for QByteArray.

Cheers,
_

GSS
18th July 2016, 13:45
Yes, and it also makes map3 contain only the last map2 for each key, which will always be just one entry, since it is cleared for every line in the file.

My understanding was that the program should keep all entries for a "map3" key, not just the last pair.

You are absolutely right. It has to be as you say or it will replace and not append.



I am not sure what you mean, QFile is a QIODevice subclass, it has read/write functions for QByteArray.


Sorry, I thought you were referring to QFile::readAll() and working from there.

I will explore the use of QByteArray to avoid the conversion to QString.

Thank you very much!