Guidance Forums / Reginald Rexx / Comparing Large files.

Message1. Comparing Large files.
#13279
Posted by: PeterJ 2010-03-27 01:51:15 Last edited by: PeterJ 2010-03-27 02:12:05 (Total edited 10 times)
To compare two large files you have to read both files and either perform a
byte-by-byte compare or create CRCs. Either way the entire file must be read, which may be expensive.

Over time my digital photo directories have grown, and duplicates ended up all over the place.
In the past I used a full binary compare, which took time and was inelegant.

I have implemented a small DLL which creates partial fingerprints of files. Instead of reading several MB, you decide, for example, to read 20 blocks of 20 K each and produce the fingerprint of those blocks only. This is of course not absolutely safe, but it is very unlikely that these blocks are equal while the files are not.

Two fingerprint methods may be used: SHA1 (http://en.wikipedia.org/wiki/SHA1)
or a CRC fingerprint (http://en.wikipedia.org/wiki/Cyclic_redundancy_check).

The DLL implements 2 functions which create a fingerprint of a file:

sha1=sha1file(file,blocks-to-read,block-length)
crc=crcfile(file,blocks-to-read,block-length)

blocks-to-read: the number of blocks which are read to create
                the SHA1 or CRC fingerprint
                default value is 32
block-length:   the length of each block which is read
                default value is 16000

If blocks-to-read*block-length is greater than 90% of the file size, the entire file is processed.
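
The sampling idea described above can be sketched in Python. This is not the DLL's code: the post does not say where in the file the blocks are taken from, so the evenly spaced offsets here are an assumption, as are the name `partial_sha1` and the reading of block-length as a byte count. Only the defaults (32 and 16000) and the 90% rule come from the post.

```python
import hashlib
import os


def partial_sha1(path, blocks_to_read=32, block_length=16000):
    """SHA-1 over sampled blocks of a file (hypothetical stand-in for sha1file)."""
    if blocks_to_read < 1 or block_length < 1:
        raise ValueError("blocks_to_read and block_length must be positive")
    size = os.path.getsize(path)
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Rule from the post: if the sample would cover more than 90% of
        # the file, process the entire file instead.
        if blocks_to_read * block_length > 0.9 * size:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
            return digest.hexdigest()
        # Assumption: blocks start at evenly spaced offsets.
        stride = size // blocks_to_read
        for index in range(blocks_to_read):
            handle.seek(index * stride)
            digest.update(handle.read(block_length))
    return digest.hexdigest()
```

Two files that differ only in a region that is never sampled get the same fingerprint, which is the "not absolutely safe" case mentioned above.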

Additionally, there are 2 functions which return the fingerprint of a passed string:

sha1=sha1string(string)
crc=crcstring(string)

Example:

/* Path of the FingerPrint DLL (expected in the current directory) */
envdir=DIRECTORY()"\FingerPrint"
rc=0
/* Register the four DLL functions and show each FUNCDEF result */
SAY FUNCDEF('SHA1File','str,str,32,32',envdir,'SHA1FILE')
SAY FUNCDEF('CRCFile','32u,str,32,32',envdir,'CRCFILE')
SAY FUNCDEF('SHA1String','str,str,32',envdir,'SHA1STRING')
SAY FUNCDEF('CRCString','32u,str,32',envdir,'CRCSTRING')

/* Look up the Reginald directory in the registry and fingerprint RXLaunch.exe */
regdir=VALUE('HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\App Paths\RxLaunch.exe\Path', ,"WIN32")
file=regdir"RXLaunch.exe"
SAY "Reginald Directory: "regdir
SAY "Fingerprint of    : "file 
/* 256 blocks with a block length of 100 */
SAY "SHA1: "sha1file(file,256,100)
SAY "CRC : "crcfile(file,256,100)

/* The two strings differ in a single letter */
SAY "Fingerprint of strings" 
SAY "SHA1: "sha1string("The quick brown fox jumps over the lazy dog")
SAY "SHA1: "sha1string("The quick brown fox jumps over the lazy cog")

SAY "CRC : "crcstring("The quick brown fox jumps over the lazy dog")
SAY "CRC : "crcstring("The quick brown fox jumps over the lazy cog")
FingerPrint.zip
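
For readers without the DLL, the string fingerprints from the example can be approximated in Python. This is only a sketch: `sha1_string` and `crc_string` are hypothetical names, and the use of CRC-32 as implemented in `zlib` is an assumption, since the post does not say which CRC variant the DLL computes (the `32u` return type in the FUNCDEF calls only suggests an unsigned 32-bit value).

```python
import hashlib
import zlib


def sha1_string(text):
    """Hex SHA-1 digest of a string (hypothetical counterpart of sha1string)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def crc_string(text):
    """CRC-32 of a string as an unsigned integer (assumed CRC variant)."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


# The two sentences differ in a single letter.
for sentence in ("The quick brown fox jumps over the lazy dog",
                 "The quick brown fox jumps over the lazy cog"):
    print("SHA1:", sha1_string(sentence))
    print("CRC :", format(crc_string(sentence), "08x"))
```

For the "dog" sentence this prints the SHA-1 value 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12 and the CRC-32 value 414fa339; changing one letter gives a completely different SHA-1.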
Message2. Personally
#13285
Posted by: Michael S 2010-03-27 14:48:14 Last edited by: Michael S 2010-03-27 14:50:19 (Total edited 2 times)
I use Beyond Compare. You might like to have a look at it. Here's a quote from the online help:
At various times, Beyond Compare performs a content comparison on a pair of files to determine if they match.

Content comparison methods

CRC comparison compares CRC values.  A CRC (Cyclic Redundancy Check) is a numeric representation of the contents of a file.  If the CRCs of two files are different, then the contents of those files must be different.  If the CRCs match, then it is likely (but not absolutely certain) that the contents match.  When comparing two versions of the same data, you can be reasonably confident that they are exactly the same if their CRCs match.

Binary comparison compares files byte-by-byte.

Rules-based comparison compares files based on their file format associations.  It allows you to define unimportant differences, such as changes in whitespace or source code comments.  A rules-based comparison can also ignore differences in file encoding or line endings.
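
The difference between a CRC comparison and a binary comparison can be sketched in Python for the duplicate-file case discussed in this thread. This is not how Beyond Compare works internally; it is a minimal illustration, with hypothetical names `file_crc32` and `find_duplicates`, of using a matching CRC only as a hint and confirming it byte by byte.

```python
import filecmp
import os
import zlib
from collections import defaultdict


def file_crc32(path):
    """CRC-32 of a whole file, read in 1 MiB chunks."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def find_duplicates(paths):
    """Return groups of paths whose contents are byte-for-byte identical."""
    candidates = defaultdict(list)
    for path in paths:
        # Different size or different CRC: the contents must differ.
        candidates[(os.path.getsize(path), file_crc32(path))].append(path)
    groups = []
    for remaining in candidates.values():
        # Same size and CRC: likely equal, so confirm with a binary compare.
        while remaining:
            first, rest = remaining[0], remaining[1:]
            matches = [first]
            remaining = []
            for other in rest:
                if filecmp.cmp(first, other, shallow=False):
                    matches.append(other)
                else:
                    remaining.append(other)
            if len(matches) > 1:
                groups.append(matches)
    return groups
```

Only files that already share a size and a CRC are compared byte by byte, so the expensive binary compare runs on a small set of candidates.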

It has added advantages as well, inasmuch as you (we) can compare source code on the PC with source code in TSO. Here's a quote from the online help
Message3.
#13290
Posted by: PeterJ 2010-03-27 15:09:37 Last edited by: PeterJ 2010-03-27 15:45:39 (Total edited 2 times)
Thanks Michael, I use it too, mainly for directory synchronisation.