PDA

View Full Version : Algorithm or way to compare two txt files



ZSWASW
14th January 2014, 22:43
Hello. I searched a lot of algorithms, and I couldn't find that one I need :(

I've got 2 txt file. In 1st I have:


file1 176
file2 209
file3 159392
file4 405617
file5 405886
file6 516485

And is 2nd:


file1 176
file4 405617
file3 159392
file8 49556
file2 209


And I want to compare these files, to get only changes, so in this example it should be:


file5 405886
file6 516485
file8 49556


1st part of line is name of file, and 2nd is size, so I can split them, put into array... anything.
Only way I found to do it, is to put these files to array, take 1 line, and seek full 2nd array to find it, take next line and again. And vice versa.

Is it possible to make it easier?

ChrisW67
15th January 2014, 03:12
Easier? Probably not, your algorithm is about as simple as it can be.

How many entries in the lists? A QHash will make the search faster for data sets of hundreds or more.
Is order important in the output?
Is there an implied or explicit order in the data?

ZSWASW
15th January 2014, 09:04
Easier? Probably not, your algorithm is about as simple as it can be.
Oh sorry, I didn't mean "easier". I mean "more professional", because searching array, entry after entry, as I think, is not efficient and 'nice'. I thought that maybe exists any algorithm to do this things.


How many entries in the lists? A QHash will make the search faster for data sets of hundreds or more.
Is order important in the output?
Is there an implied or explicit order in the data?
How many... It should be universal, but probably between 1 to 1000 entries. Order is not important in output. I'm making a synchronization between PC and Network, so I must know which file should I upload/download.
And last question: I don't know if I good understood it (any hidden data?), but there is no an implied or explicit order

d_stranz
16th January 2014, 03:38
And I want to compare these files, to get only changes

You can have 3 kinds of changes:

1 - A file has been added on one side or the other
2 - A file has been removed on one side or the other
3 - A file is present on both sides, but the sizes are different

Do you want to check for all 3 cases? If so, what should be the output in the case of a file that has been removed?

One way to improve your current algorithm is to sort both lists first. Then you do not need to scan through all of list2 for each item on list1. You can make one pass through each of the lists by keeping track of where you are in each list.

This discussion from stackoverflow (http://stackoverflow.com/questions/12993655/find-common-elements-in-2-sorted-arrays) describes what I mean.

ZSWASW
16th January 2014, 13:14
You can have 3 kinds of changes:

1 - A file has been added on one side or the other
2 - A file has been removed on one side or the other
3 - A file is present on both sides, but the sizes are different

Do you want to check for all 3 cases? If so, what should be the output in the case of a file that has been removed?

Yes, sth like that, but checking for modification of file I will try later. For removed files, I thought that that files just will be not showed in the list... But now I see that is bigger problem.
I can have 2 (or more) clients in some PCs, management will be only with these clients. When I remove file from client 1, how to check in client 2, whether this file has been removed form client 1 and delete it, or if file has been added in client 2 and upload it...
I want to do it like in dropbox :/ Any ideas? Leave other list in server with removed files..?


One way to improve your current algorithm is to sort both lists first. Then you do not need to scan through all of list2 for each item on list1. You can make one pass through each of the lists by keeping track of where you are in each list.

This discussion from stackoverflow (http://stackoverflow.com/questions/12993655/find-common-elements-in-2-sorted-arrays) describes what I mean.

Thanks for that, that will be probably the best option to optimize.