I recently had to set up a new Windows Server 2003 machine. The setup process
was straightforward until I installed Perl. You can call me old fashioned, but
I prefer not to run a setup application or use an .msi file to install Perl.
Instead, I chose to copy the Perl directory from another machine.
Several days later, I discovered a problem. The copy process was successful, but I had copied the Perl directory from the wrong machine. As a result, a couple of files were missing and a couple of others were slightly different. Usually, it wouldn't be too much of a problem to hunt down such discrepancies. However, the Perl directory consists of more than 6000 files. So, to ensure that I knew what files were different between machines, I needed a tool that could compare directories. I decided to write a Perl script to do the job. To write this script, I had to determine which algorithm to use, then implement that algorithm.
Determining the Algorithm
You can use different algorithms in a script to compare directories. For example,
a script might use a comparison algorithm that takes into account the directories'
tree structure, file count, filenames, file sizes, and file timestamps. The
ideal algorithm would also analyze each file's contents to verify that each
file contains the same data. It's possible that two different files could have
the same relative file paths and identical file sizes yet have different contents.
For example, you could have the C:\Dir1\readme.txt and C:\Dir2\readme.txt
files. Both files have the same relative path (relative to C:\Dir1 or C:\Dir2),
the same filename (readme.txt), and the same file size (21 bytes). However,
one file might contain the text "This is a readme file" and the other file
might contain the text "I think I smell food!"
I didn't need such an elaborate comparison between the Perl directories. I just needed a simple algorithm that would take into account the directories' partial tree structure, filenames, and file sizes. So, I decided to create an algorithm that compared the files' paths and byte sizes.
To create this algorithm, I determined that I could use a single hash in which
each relative path is stored as a hash key. Each key's value is a subhash
whose keys identify the directory being analyzed (1 or 2) and whose values give
the file's size in that directory. The resulting hash might look like the one in
Figure 1. In this sample hash, the file named FileNumber1.txt exists in both directories
(1 and 2) and both files are the same size (1234 bytes). However, only directory
1 contains a file named FileNumber2.txt, which has a size of only 32 bytes.
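A hash of the shape just described can be sketched in Perl as follows. The literal below is built only from the values mentioned in the text, so treat it as an illustration rather than a copy of Figure 1.

```perl
use strict;
use warnings;

# Illustrative data: relative path => { directory number => size in bytes }
my %FileList = (
    'FileNumber1.txt' => { 1 => 1234, 2 => 1234 },    # in both directories
    'FileNumber2.txt' => { 1 => 32 },                 # only in directory 1
);

# Show which directories each path was found in.
foreach my $Path ( sort keys %FileList ) {
    my @Dirs = sort keys %{ $FileList{$Path} };
    print "$Path: found in directory @Dirs\n";
}
```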
With this algorithm, reporting the results would be simple. The script would
simply need to walk through each key in the %FileList hash. If the %FileList
hash key's value contains only one subhash key called {1} or {2}, the file exists
in only one directory, so the script would print that file's path on screen.
If both {1} and {2} subhash keys exist, the script would compare their values.
If the values differ, the files have different sizes, so the script
would print each file's path and size on screen.
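The reporting walk described above might be sketched like this. CompareFileList is a hypothetical name, the report wording is invented, and the FileNumber3.txt entry is made-up sample data. The real PrintReport() subroutine uses the write command instead of building a list.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one report line per discrepancy for a hash
# shaped like  path => { 1 => size, 2 => size }.
sub CompareFileList {
    my ($FileList) = @_;
    my @Report;
    foreach my $Path ( sort keys %$FileList ) {
        my $Entry = $FileList->{$Path};
        if ( exists $Entry->{1} && exists $Entry->{2} ) {
            # Present in both directories: report only if the sizes differ.
            next if $Entry->{1} == $Entry->{2};
            push @Report,
                "$Path differs in size: $Entry->{1} bytes vs. $Entry->{2} bytes";
        }
        else {
            # Only one subhash key exists, so the file is in one directory.
            my ($Dir) = keys %$Entry;
            push @Report, "$Path exists only in directory $Dir";
        }
    }
    return @Report;
}

my %Sample = (
    'FileNumber1.txt' => { 1 => 1234, 2 => 1234 },
    'FileNumber2.txt' => { 1 => 32 },
    'FileNumber3.txt' => { 1 => 10, 2 => 20 },    # made-up entry
);
print "$_\n" foreach CompareFileList( \%Sample );
```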
Implementing the Algorithm
The DirDiff.pl script demonstrates how I implemented the algorithm I decided
to use. Listing 1 shows an excerpt from that
script. (You can download the entire script from the Windows Scripting Solutions
Web site. See page 1 for download information.) The code at callout A in Listing
1 declares some of the variables that the script uses. The use vars line
declares the global variables %Config and $gFileCount. These variables must
be accessible from every subroutine, so they're declared as package globals
rather than lexically with the my keyword.
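As a rough illustration of that kind of declaration, the sketch below declares the same two globals with use vars and touches them from a subroutine. The Verbose setting and the CountFile subroutine are hypothetical and aren't taken from DirDiff.pl.

```perl
use strict;
use warnings;
use vars qw(%Config $gFileCount);    # package globals, visible in every sub

%Config     = ( Verbose => 1 );      # hypothetical setting
$gFileCount = 0;

# Hypothetical helper that updates the global counter.
sub CountFile {
    $gFileCount++;
    print STDERR "Files so far: $gFileCount\n" if $Config{Verbose};
    return $gFileCount;
}

CountFile() for 1 .. 3;
```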
The %FileList, %File, and %Size variables are declared at the beginning of
the script because they are used later by the write command in the PrintReport()
subroutine. For the write command to print the values in these variables, the
variables must be visible in the scope in which the report format is defined.
Ordinarily, you might localize such variables in the subroutine that calls
write. However, because the script uses strict (as all good Perl scripts
should), the variables must first be declared, and you can't locally scope
lexical variables (i.e., those variables created using my). So, the script
lexically declares these variables at the beginning of the script. Perl variables
aren't typically declared this way. The only reason the script does it this
way is the use of the write command.
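The general Perl behavior involved here is that a format can see a lexical variable only if the format is declared within that variable's scope. The following minimal sketch shows the pattern with three hypothetical scalars ($Path, $Size1, $Size2) and a hypothetical PrintRow subroutine. It isn't an excerpt from DirDiff.pl, whose format and variables may look different.

```perl
use strict;
use warnings;

# Declared at file scope so the format below can see them.
my ( $Path, $Size1, $Size2 );

# Report format for the STDOUT handle; the closing dot ends the format.
format STDOUT =
@<<<<<<<<<<<<<<<<<<< @>>>>>>>> @>>>>>>>>
$Path,               $Size1,   $Size2
.

# Hypothetical helper: fill the format's variables, then print one row.
sub PrintRow {
    ( $Path, $Size1, $Size2 ) = @_;
    write;
    return;
}

PrintRow( 'FileNumber3.txt', 10, 20 );
```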
Callout B highlights the main block of DirDiff.pl. This block of code enumerates the directories being compared with a foreach loop and calls the CollectFileList() subroutine for each one. It would have been just as easy to hardcode two calls to CollectFileList() and pass in the two directories, but less fun to script. Finally, the block of code calls the PrintReport() subroutine.
In callout B, note the print statement that specifies the STDERR file handle. You'll find such print statements throughout DirDiff.pl. I included these statements for those users who want to redirect the script's output to a file instead of the screen. Because of these statements, all data printed to STDOUT (the default print file handle) will be redirected, but the STDERR output will continue to display on screen. That way, the script's progress information doesn't clutter the output redirected to the file.
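The split between the two file handles can be seen in a small sketch such as the one below. ReportProgress and both messages are hypothetical and stand in for the script's own print statements.

```perl
use strict;
use warnings;

# Hypothetical helper: progress information goes to STDERR.
sub ReportProgress {
    my ($Message) = @_;
    print STDERR "$Message\n";
    return;
}

ReportProgress('Scanning directory 1...');

# Results go to STDOUT, the default print file handle.
print "FileNumber2.txt exists only in directory 1\n";
```

If this sketch were saved under a hypothetical name such as sketch.pl and run as perl sketch.pl > report.txt, the result line would land in report.txt and the progress line would still appear on screen.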
The CollectFileList() subroutine, which callout C shows, creates a list of the files in a directory. The subroutine accepts four parameters. The first parameter ($Path) specifies the path to the directory to be examined.
The second parameter ($FileList) is a reference to the %FileList hash. I used
a reference because this hash will be modified and I want these changes to persist
across multiple calls to the subroutine. Alternatively, you could pass in the
hash itself instead of a hash reference, then return the modified hash. However,
this hash can grow quite large for bigger directories, and passing a hash by
value makes Perl copy its contents on the way in and again on the way out,
which costs both memory and time.
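A minimal sketch of the pass-by-reference approach follows. AddFile is a hypothetical helper, not a subroutine from DirDiff.pl. Because it receives a reference, it changes the caller's hash directly and nothing has to be copied or returned.

```perl
use strict;
use warnings;

# Hypothetical helper: records one file's size under its directory number.
sub AddFile {
    my ( $FileList, $Path, $Context, $Size ) = @_;
    $FileList->{$Path}{$Context} = $Size;    # modifies the caller's hash
    return;
}

my %FileList;
AddFile( \%FileList, 'readme.txt', 1, 21 );
AddFile( \%FileList, 'readme.txt', 2, 21 );

# Both calls updated the same hash, so only one path is recorded.
print scalar( keys %FileList ), " path(s) recorded\n";
```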
The last two parameters identify which of the two directories is being examined ($Context) and give the path relative to that directory ($RelativePath). Using the relative path rather than the full path is important because the script must compare file paths relative to the two directories. Although the full paths will never match each other, the relative paths should.
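To make the four parameters concrete, here is one way a recursive collector along these lines could be written. This is a sketch under assumptions, not the code at callout C. The name CollectFileListSketch is hypothetical, $Context is assumed to be the directory number (1 or 2), the top-level call is assumed to omit $RelativePath, and forward slashes are used as separators, which Perl on Windows accepts.

```perl
use strict;
use warnings;

# Sketch only: walks $Path recursively and records each file's size in
# $FileList->{relative path}{$Context}.
sub CollectFileListSketch {
    my ( $Path, $FileList, $Context, $RelativePath ) = @_;
    $RelativePath = '' unless defined $RelativePath;

    opendir( my $Dir, $Path ) or do {
        print STDERR "Cannot open $Path: $!\n";
        return;
    };
    my @Entries = grep { $_ ne '.' && $_ ne '..' } readdir($Dir);
    closedir($Dir);

    foreach my $Name (@Entries) {
        my $FullPath = "$Path/$Name";
        my $RelPath  = $RelativePath eq '' ? $Name : "$RelativePath/$Name";
        if ( -d $FullPath ) {
            # Recurse, extending the relative path by one level.
            CollectFileListSketch( $FullPath, $FileList, $Context, $RelPath );
        }
        else {
            # -s returns a false value for empty files, so default to 0.
            $FileList->{$RelPath}{$Context} = ( -s $FullPath ) || 0;
        }
    }
    return;
}

# Assumed usage: each directory named on the command line gets the next
# directory number.
my %FileList;
my $Context = 0;
foreach my $Dir (@ARGV) {
    CollectFileListSketch( $Dir, \%FileList, ++$Context );
}
print scalar( keys %FileList ), " relative path(s) collected\n";
```

Keying the hash on the relative path is what lets a file found under both directories end up in a single entry with both subhash keys.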