Identifying file when filename is not enough



drhex
8th November 2006, 21:02
Hi,

I'm writing an app that associates metadata with files in a directory hierarchy. Not all relevant file types allow embedding of metadata, so it is kept in a separate database.
The user might have modified the files while my program was not running, so it can scan the directory and compare the results to those of a previous scan.
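
Editor's sketch of the scan-and-compare step described above, in Python for brevity rather than Qt/C++. The function names `scan` and `diff_scans` are hypothetical, and a real program would load the previous scan from its database instead of holding it in memory.

```python
import os

def scan(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found

def diff_scans(previous, current):
    """Split two scans into (missing, new) sets of paths."""
    missing = previous - current   # known before, gone now
    new = current - previous       # not known before, present now
    return missing, new
```

The two returned sets are exactly the "missing" and "new" files the post talks about; the diff alone cannot say whether a pair of them is a rename or a delete plus a create.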

Let's say it turns out that two files are missing and two new files have appeared.
This can be because

1) Two files were deleted and two other files were created, OR
2) Two files were renamed

In the first case, the metadata associated with the missing files should also be removed.
In the second case, the metadata should be transferred over to the new files.

So how can one tell which of 1) and 2) has happened? :crying:
I have considered matching them by remembering the file sizes, but a size is not unique enough to identify a single file among many.
Another option is to compute a hash of each file's contents (to see if any of the "new" files has the same hash as a "missing" one), but that would make importing large numbers of files into the directory very slow.
The best solution I have so far is to identify the files by inode number, as I've noticed that it does not change when a file is moved or renamed within the filesystem.
However, QFileInfo does not have a method to return the inode number, making the solution Linux-specific. Is there a more portable way?
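
Editor's sketch of the inode idea, in Python for brevity. It assumes the previous scan stored a (device, inode) pair for every file; `file_id` and `match_renames` are hypothetical names. Python's `os.lstat` reports the inode number on Unix and the file index on Windows, but some filesystems may not provide a stable identifier, and an inode number can be reused after a file is deleted, so a match is a strong hint rather than proof.

```python
import os

def file_id(path):
    """Return (device, inode) for path without following symlinks."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)

def match_renames(missing_ids, new_paths):
    """Pair missing files with new files that carry the same file id.

    missing_ids: dict mapping old path -> file id from the previous scan.
    new_paths:   paths that appeared since the previous scan.
    Returns a dict mapping old path -> new path.
    """
    by_id = {file_id(p): p for p in new_paths}
    return {old: by_id[fid] for old, fid in missing_ids.items() if fid in by_id}
```

Missing files that end up without a partner would be treated as deleted, and their metadata removed.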

wysota
8th November 2006, 21:19
This is indeed very tricky, especially since you'd probably want to handle hard and soft links as well :)
I see another problem - what happens if a file is renamed/deleted and another file is created with the same name? How should your system behave then? To notice that, it would have to scan all the files all the time. Hashes won't help you here either, as two files might have the same contents and therefore the same hash value, and for large file sets you could encounter hash collisions.
The inode approach is very tricky too: if you move a file, it might be transferred to another filesystem, which changes its inode. There is also the problem with links, as several hard links can point to the same inode... And different filesystems have different inode structures and some don't have inodes at all (FAT?).

The first thing I would do is think about ways of maintaining/tracking the information you need - without going into implementation details (so no inodes, no Qt, no C++). For example, tracking file contents is not sufficient, because two distinct files can have the same content. Tracking file path and file contents is better, but files can still be moved, so the path changes. Another idea is to track last modification times - if you keep a database of those, you can try to guess whether something happened to a file while your system was not running. Try this first, and when you come up with a theoretical solution, then dive into implementation details.
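
Editor's sketch of one way to combine two of the signals discussed so far (file size and content hash), in Python for brevity. This is not a method proposed in the thread. It assumes the database already holds a (size, hash) pair for each known file; `content_hash` and `guess_renames` are hypothetical names, and SHA-256 is just an example hash. A new file is hashed only when its size equals that of some missing file, and ambiguous matches (identical contents) are left unresolved.

```python
import hashlib
import os
from collections import defaultdict

def content_hash(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks to keep memory use bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def guess_renames(missing, new_paths):
    """missing: dict old path -> (size, hexdigest) from the previous scan.
    Returns dict old path -> new path for unambiguous matches only."""
    wanted = defaultdict(list)
    for old, key in missing.items():
        wanted[key].append(old)
    sizes = {size for size, _digest in wanted}
    found = defaultdict(list)
    for p in new_paths:
        size = os.path.getsize(p)
        if size in sizes:  # skip hashing when no missing file has this size
            found[(size, content_hash(p))].append(p)
    return {olds[0]: found[key][0]
            for key, olds in wanted.items()
            if len(olds) == 1 and len(found.get(key, [])) == 1}
```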

aamer4yu
9th November 2006, 04:38
What are you using the database for?
Is it for identifying files based on meta-data, or do you want to search for files based on some meta-data?

If it is the former, I can't suggest anything for now, but for the latter I can suggest the following:

1. Get the input meta-data from the user.
2. Search in the database
3. Get the filename from database associated with the meta-data.
4. Regenerate meta-data for that file.
5. If the meta-data is same in both cases, then the file must be same.
6. If the meta-data is not same, the file must have been modified / deleted / etc.
   In this case regenerate your database, as it needs updating, and start again from Step 2.

I guess this algo will help to
- identify if files have been modified/deleted/renamed
- update database only when required

TheRonin
9th November 2006, 12:54
I would hash each file and then simply compare hashes. To speed things up you might want to use a fast hash function and perform the hashing in a different thread, perhaps even with a queue of files to be hashed in the background. If I'm not mistaken, Tiger Tree Hash (TTH) is both fast (hundreds of MB/s, depending on your hardware of course) and secure: http://en.wikipedia.org/wiki/Hash_tree#Tiger_tree_hash
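
Editor's sketch of a background hashing queue, in Python for brevity. Python's standard `hashlib` has no Tiger implementation, so BLAKE2b is used here purely as a stand-in; `hash_worker` and `start_hasher` are hypothetical names, and putting `None` on the queue is this sketch's own convention for telling the worker to stop.

```python
import hashlib
import queue
import threading

def hash_worker(jobs, results):
    """Consume paths from jobs until None arrives; store path -> hexdigest."""
    while True:
        path = jobs.get()
        if path is None:
            break
        h = hashlib.blake2b()
        try:
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            results[path] = h.hexdigest()
        except OSError:
            results[path] = None  # file vanished or is unreadable

def start_hasher():
    """Start one background worker; return (jobs, results, thread)."""
    jobs = queue.Queue()
    results = {}
    worker = threading.Thread(target=hash_worker, args=(jobs, results), daemon=True)
    worker.start()
    return jobs, results, worker
```

The caller would `jobs.put(path)` for every imported file, keep the user interface responsive in the meantime, and finally `jobs.put(None)` followed by `worker.join()` before reading `results`.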

drhex
9th November 2006, 20:44
The metadata is to be used for finding files based on metadata.


aamer4yu wrote: "4. Regenerate meta-data for that file."

It would surely be wonderful if metadata could be generated automatically by a program. Alas, the data is things like "name of person in this picture".


wysota wrote: "Tracing file path and file contents is better"

Yes, the primary link between metadata and contents is the full file path.
Identifying files that have been renamed/moved is considered an extra bonus for the user.
If he e.g. swaps two files so that the names remain the same, that is assumed to be intentional.


wysota wrote: "Another idea is to trace last modification times"

I've considered that as well. It seems less precise than inodes - the resolution is rarely better than a second, and systems can process many files per second. Also, a simple edit to a file (where no metadata needs to be changed) changes the last modification time but not the inode. To keep the modification-time database up to date, the directories would have to be scanned automatically on each program start.
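
Editor's sketch of a modification-time snapshot, in Python for brevity. `snapshot` and `changed_since` are hypothetical names. `st_mtime_ns` exposes the timestamp in nanoseconds, but the precision actually stored depends on the filesystem, so the one-second concern above can still apply.

```python
import os

def snapshot(paths):
    """Map each path to (size, modification time in nanoseconds)."""
    result = {}
    for p in paths:
        st = os.stat(p)
        result[p] = (st.st_size, st.st_mtime_ns)
    return result

def changed_since(old_snapshot, new_snapshot):
    """Paths present in both snapshots whose size or mtime differs."""
    common = old_snapshot.keys() & new_snapshot.keys()
    return sorted(p for p in common if old_snapshot[p] != new_snapshot[p])
```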


TheRonin wrote: "a queue of files to be hashed running in the background."

Now that's clever! Might work for that auto-scan-on-startup as well. I want a dual-core cpu :rolleyes: