QFile::exists( filename ) slow



lni
23rd February 2015, 16:09
Hello, I have about 3,000,000 small files in a directory tree that need to be transferred over the network by FTP.

I use QNetworkAccessManager to handle the job. About 1,000,000 of the files had already been transferred to another computer, so I tarred those files, untarred them on the target computer, and now I do:



foreach ( const QString &filename, fileList ) {
    if ( !QFile::exists( filename ) ) {
        // Not on disk yet: start the FTP transfer of filename using QNetworkAccessManager.
    }
}



However, QFile::exists( filename ) takes a lot of time (about 0.02 seconds) per file. If I kill the program and restart it, the files that have already been checked by QFile::exists( filename ) return quickly (0.001 seconds), which is 20 times faster.

How can I solve this problem?

Thanks.

wysota
23rd February 2015, 18:27
The difference comes from the fact that the second time your system reads from cache without directly accessing the disk. Performance may heavily depend on the filesystem.

lni
23rd February 2015, 18:57
> The difference comes from the fact that the second time your system reads from cache without directly accessing the disk. Performance may heavily depend on the filesystem.

But I am not reading the contents of the file; I am merely checking whether the file exists.

I am using CentOS 6.6. Is there such a thing as a file index in Linux?

Thanks

EDIT: I rebooted the machine. QFile::exists() is still quick for the files that were previously checked.

wysota
23rd February 2015, 21:16
> But I am not reading the contents of the file; I am merely checking whether the file exists.
This still involves accessing and caching the inode.


> I am using CentOS 6.6. Is there such a thing as a file index in Linux?
Depends what software you have installed.


> EDIT: I rebooted the machine. QFile::exists() is still quick for the files that were previously checked.
A soft reboot will likely not invalidate the on-disk cache.

lni
23rd February 2015, 21:54
I am not familiar with the system or the kernel. Could you please tell me whether there is a way to improve the access time, or what software I could install to help?

It appears that if I call QFile::exists() on all 3 million files, they will all be quick to access afterwards. I don't think all 3 million files will be kept in the cache, will they?

Many thanks.

wysota
23rd February 2015, 23:45
> I am not familiar with the system or the kernel. Could you please tell me whether there is a way to improve the access time, or what software I could install to help?
There is no general, instant acme-improve-my-access-times solution. Checking whether 1M files exist will take time if you do it again every time you run your program. What you can do is list all the files in the directory once and check your files against that list, preferably while other files are being transferred over the network.


> It appears that if I call QFile::exists() on all 3 million files, they will all be quick to access afterwards. I don't think all 3 million files will be kept in the cache, will they?
You are not accessing the files but the directory they reside in. And sure, it will all fit into the cache easily.

lni
25th February 2015, 22:51
I gave up on the file-exists check and decided to save the data into a MySQL database instead.

However, I have ended up with a 5 GB database, and it is still growing. How can I compress the data or otherwise reduce its size when saving to the database? Thanks!

Here is my pseudocode:

QByteArray compressed = qCompress( fileContent.toAscii(), 9 ); // fileContent is a QString; 9 is the maximum compression level
QByteArray b64 = compressed.toBase64(); // Base64 so the bytes can be embedded in the SQL text

QString sql = QString( "INSERT INTO myTable ( data ) values ( '%1' );" ).arg( QString( b64 ) );

query.exec( sql );
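
One size factor in the snippet above is the Base64 step itself: padded Base64 turns every group of up to 3 input bytes into 4 output characters, so the stored string is about a third larger than the compressed data. A minimal sketch of that arithmetic in standard C++ follows; the function name base64EncodedSize is made up for the illustration.

```cpp
// Sketch only: the function name is hypothetical.
#include <cstddef>

// Length of padded Base64 output for inputBytes bytes of input:
// each group of up to 3 input bytes is written as 4 output characters.
constexpr std::size_t base64EncodedSize(std::size_t inputBytes)
{
    return 4 * ((inputBytes + 2) / 3);
}

// 300 compressed bytes are stored as 400 Base64 characters.
static_assert(base64EncodedSize(300) == 400, "Base64 adds one third");
```

Assuming the column can be a binary (BLOB) type, binding the compressed QByteArray as a query parameter instead of embedding Base64 text in the SQL string would avoid this overhead.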

wysota
25th February 2015, 23:09
We don't know what you are saving into the database, so it is hard to suggest anything. But what exactly was the reason for using QFile::exists() anyway?

lni
25th February 2015, 23:28
> We don't know what you are saving into the database, so it is hard to suggest anything. But what exactly was the reason for using QFile::exists() anyway?

I am trying to save text files into the database, so each one is essentially a QString.

I want to download more than 3,000,000 small text files from the network. QFile::exists() was used to check whether a file already existed, so that I would not need to download it again. Previously the files were saved into a directory tree. Now I save them into the database, and the database is at 5 GB and counting. I am trying to find a way to compress the QString (or QByteArray) as much as possible.

wysota
26th February 2015, 00:19
Why not simply have a registry of the files you have already downloaded and only download those that are not in the registry? You don't need to check whether each file exists. Just read what you have already downloaded and proceed from there. As for the database, saving individual files into individual records of a database is basically a bad idea. If all the files are text files, it is best to compress them all together using some smart data structure tailored to compressing text. Much depends on what you want to do with the files once you have downloaded them.

Edit: Better yet, why not simply use rsync instead of writing your own software :)
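
The registry idea mentioned above could look something like the following. It is a rough sketch in standard C++ rather than Qt, and it assumes a plain text file with one downloaded file name per line; that file format and the function names loadRegistry and recordDownloaded are invented for the example.

```cpp
// Sketch only: the registry format and all names are hypothetical.
#include <fstream>
#include <string>
#include <unordered_set>

// Load the registry: one already-downloaded file name per line.
// A missing registry file simply yields an empty set.
std::unordered_set<std::string> loadRegistry(const std::string &registryPath)
{
    std::unordered_set<std::string> done;
    std::ifstream in(registryPath);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty())
            done.insert(line);
    }
    return done;
}

// Append one name to the registry once its download has finished.
// Returns false if the registry could not be written.
bool recordDownloaded(const std::string &registryPath, const std::string &name)
{
    std::ofstream out(registryPath, std::ios::app);
    out << name << '\n';
    out.flush();
    return static_cast<bool>(out);
}
```

At startup the program would call loadRegistry() once, skip every name already in the returned set, and call recordDownloaded() after each completed transfer, so a restart never has to touch the individual files.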