A mate from work had a LUKS volume that he had set up and was using to store his personal documents, things like scanned copies of invoices and tax information. Unfortunately, in a momentary lapse of concentration after mounting it, he ran mkfs.ext4 over it.
I'm sure we have all been there before; anyone who has spent enough time in tech knows the gut-wrenching feeling after entering the wrong command. I can still remember the tense feeling after I ran a query but forgot the WHERE clause and saw (2986 row(s) affected) when I was expecting (1 row(s) affected).
The good news was that my mate had a backup; the bad news was that the backup was six months out of date. He was able to make a copy of the LUKS volume and run PhotoRec over it, which pulled back all the files.
PhotoRec is a brilliant tool, but it can't recover metadata, so things like file names, creation and last modified dates, and directory structures were all missing. He was left with thousands of recovered files, some of which he already had and others that were new.
I wrote a simple Python script to run through two directories and check for files in the new directory (the recovered files) that were not in the old directory (the backup of the original files).
#!/usr/bin/python3
# -*- coding: UTF-8 -*-
"""
A small Python script to find new files in two similar directories.
"""
import os
import os.path
import hashlib
import argparse


def setup_options():
    """
    Parse options and get the location of the old and the new directory.
    """
    parser = argparse.ArgumentParser(
        description=('Run through two directories (including sub directories) '
                     'and find files that are in the new directory but not in '
                     'the old directory.'))
    parser.add_argument(
        'old_files',
        metavar='old_directory',
        type=str,
        help='The old directory with the original files.')
    parser.add_argument(
        'new_files',
        metavar='new_directory',
        type=str,
        help='The new directory with both original files and new ones.')
    return parser.parse_args()


def compare_two_directories(settings):
    """
    Run through two directories (including sub directories) and find files that
    are in the new directory but not in the old directory.
    """
    original_files = set()
    # Run through the original directory and create an MD5 sum for each of the
    # files. MD5 is insecure because of known hash collisions, however we are
    # not trying to validate the files' contents so it's good enough and
    # faster than SHA256.
    for dirpath, dirnames, filenames in os.walk(settings.old_files):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'rb') as input_file:
                file_hash = hashlib.md5(input_file.read()).hexdigest()
            original_files.add(file_hash)
    # Run through the new directory and print files whose MD5 hash is not in
    # the original set of hashes.
    for dirpath, dirnames, filenames in os.walk(settings.new_files):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'rb') as input_file:
                file_hash = hashlib.md5(input_file.read()).hexdigest()
            if file_hash not in original_files:
                print(file_path)


if __name__ == "__main__":
    compare_two_directories(setup_options())
I've put this script up on GitHub so anyone can use it. Running it is as simple as:
michael@xo:~$ ./compare.py /home/michael/backup /home/michael/recovered-files
It runs through all the files in the old directory (and its subdirectories) and calculates an MD5 sum for each one. While MD5 is broken for validating the contents of files because of known hash collisions, and should never be used for storing passwords, we are just trying to compare two documents, neither of which comes from an untrusted source, so it's good enough and quicker than SHA256¹ when running over a few thousand documents.
Then it stores the MD5 hash in a set. I've used a set rather than, say, a list, because I don't want duplicates and I want to be able to check quickly whether a value is already in it.
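As a rough illustration of why a set fits here (the hashes below are just example values):

new_hash = 'd41d8cd98f00b204e9800998ecf8427e'
original_files = {'9e107d9d372bb6826bd81d3542a419d6'}
# Adding the same hash twice is a no-op, and membership checks are
# constant time on average rather than a scan of the whole collection.
original_files.add('9e107d9d372bb6826bd81d3542a419d6')
print(len(original_files))          # 1, duplicates are ignored
print(new_hash in original_files)   # False, so this file would be printed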
Next it runs through the new directory (and its subdirectories) and, for every file with an MD5 sum that's not in the set, it prints the file path.
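The script reads each file into memory in one go, which is fine for documents. If you're adapting it for much larger files, hashing in fixed-size chunks avoids that; here's a small sketch of that variation, with the 64 KiB chunk size being an arbitrary choice:

import hashlib

def hash_file(file_path, chunk_size=64 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    file_hash = hashlib.md5()
    with open(file_path, 'rb') as input_file:
        # Feed the hash one chunk at a time so the whole file never has
        # to sit in memory at once.
        for chunk in iter(lambda: input_file.read(chunk_size), b''):
            file_hash.update(chunk)
    return file_hash.hexdigest()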
This script brought the number of recovered files down from thousands to a manageable amount, so hopefully it's useful for someone else in a similar situation.
1. I believe that, thanks to optimizations in the design of SHA256, it can potentially be quicker than MD5. I've heard that with OpenSSL, SHA256 is quicker than MD5, but in my tests using Python's hashlib, MD5 was faster than SHA256.
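If you want to check the difference on your own hardware, a quick benchmark along these lines will do (the 100 MB of random data and the ten runs are arbitrary choices for illustration):

import hashlib
import os
import timeit

# Hash an in-memory blob so the comparison measures hashing speed, not disk I/O.
data = os.urandom(100 * 1024 * 1024)

for algorithm in ('md5', 'sha256'):
    seconds = timeit.timeit(lambda: hashlib.new(algorithm, data).hexdigest(),
                            number=10)
    print('{}: {:.2f} seconds for 10 runs'.format(algorithm, seconds))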