Wednesday, December 25, 2013

Similar is not same (Part 2)

Now, we'll start the real thing.

The best way to get rid of duplicate photos is never to let them onto our hard disks in the first place. Simple as that. The problem arises when a photo isn't the same, just similar.

Let's first take the simple case: the photos are exactly the same. There is a very easy way to check this in PHP.

It's called hashing, and there are a few ways to do it. I have chosen perhaps the simplest one: computing the MD5 checksum of the file.

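// Download the image; skip it if the download failed.
$dump = @file_get_contents($img_url);
if ($dump === false)
  continue;
$md5 = md5($dump);
$size = strlen($dump);
// Skip the image if one with the same checksum and size is already stored.
// Note: rowCount() on a SELECT is driver-dependent in PDO.
$stmt_check_same->execute(array($md5, $size));
if ($stmt_check_same->rowCount() > 0)
  continue;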


Here $stmt_check_same is a prepared SQL statement like this:

$stmt_check_same = $db->prepare("SELECT img_id FROM Images WHERE image_md5=? AND image_size=?");

You can see that, as a precaution, I check that both the MD5 checksums and the file sizes are equal. MD5, like all hashing algorithms, may produce the same hash for different files; otherwise you could encode any file in 128 bits, which would be the ultimate compression mechanism. Unfortunately that's not so, and we have to guard against it. I have chosen simply to check that the file sizes match too (as I need the file size in my database anyway).
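To see the checksum-plus-size check in isolation, here is a minimal sketch that keeps the index in a plain PHP array instead of the Images table. The helper name filter_duplicates() and the fake image strings are assumptions made for illustration; the actual script asks the database instead.

```php
<?php
// Sketch: drop byte-identical blobs, using md5 + size as the lookup key.
// filter_duplicates() is a hypothetical helper, not part of the real script.
function filter_duplicates(array $blobs): array
{
    $seen = array();    // "md5:size" => true
    $unique = array();
    foreach ($blobs as $name => $dump) {
        $key = md5($dump) . ':' . strlen($dump);
        if (isset($seen[$key])) {
            continue;   // same checksum and same size: treat as a duplicate
        }
        $seen[$key] = true;
        $unique[$name] = $dump;
    }
    return $unique;
}

$photos = array(
    'a.jpg' => "fake image bytes 1",
    'b.jpg' => "fake image bytes 2",
    'c.jpg' => "fake image bytes 1",   // exact copy of a.jpg
);
print_r(array_keys(filter_duplicates($photos)));
```

Only a.jpg and b.jpg survive; c.jpg is skipped because its checksum and size are already in the index.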

Another way to confirm sameness would have been, for example, to calculate one or more additional hash values (with different hashing algorithms) and declare the photos the same only if they all match.
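A rough sketch of that multi-hash alternative might look like the following. It uses PHP's built-in hash() function with MD5 and SHA-1; the choice of SHA-1 as the second algorithm and the function names photo_fingerprint() and same_photo() are my assumptions for illustration only.

```php
<?php
// Sketch: fingerprint a file's bytes with two different hash algorithms plus its size.
// photo_fingerprint() and same_photo() are hypothetical names.
function photo_fingerprint(string $dump): array
{
    return array(
        'md5'  => hash('md5', $dump),
        'sha1' => hash('sha1', $dump),
        'size' => strlen($dump),
    );
}

function same_photo(string $a, string $b): bool
{
    // Strict comparison: loose == could treat numeric-looking hash strings as equal numbers.
    return photo_fingerprint($a) === photo_fingerprint($b);
}

var_dump(same_photo("fake image bytes", "fake image bytes"));
var_dump(same_photo("fake image bytes", "fake image bytez"));
```

Two files count as the same photo only when every component of the fingerprint matches, so an accidental clash would have to happen in both algorithms at once.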

Philosophically, I don't mind a few photos being dismissed without being real duplicates, so I don't lose any sleep over the possibility of a hash clash.

Now we have got rid of the exactly identical photos. That's a start.

What is the difference between same and similar? By a similar image I mean one that looks the same but isn't identical bit for bit. The photo may have been resized (either by scaling or by changing its quality) or subtly altered in some other way. Another simple change is converting a color photo to black and white. There are further possibilities still, like mirroring the photo or cropping it.

MD5 is very sensitive to even minor changes in the battery of bits that a file is, so it doesn't help us find similar photos through any cunning sorting scheme. Even the slightest change in the image file makes the MD5 (very probably) look and smell totally different (as far as you can get that different within 16 bytes, that is, 32 hexadecimal digits).
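This sensitivity is easy to try out. The sketch below flips a single bit in a block of fake image data and counts how many of the 128 digest bits change; md5_bit_difference() is a hypothetical helper written just for this demonstration, and the test data is made up.

```php
<?php
// Sketch: count how many bits differ between the md5 digests of two strings.
// md5_bit_difference() is a hypothetical helper for illustration.
function md5_bit_difference(string $a, string $b): int
{
    // Raw 16-byte digests, XORed byte by byte; set bits mark the differences.
    $x = md5($a, true) ^ md5($b, true);
    $bits = 0;
    for ($i = 0; $i < strlen($x); $i++) {
        $byte = ord($x[$i]);
        while ($byte > 0) {
            $bits += $byte & 1;
            $byte >>= 1;
        }
    }
    return $bits;
}

$original = str_repeat("fake image bytes ", 1000);
$changed = $original;
$changed[500] = chr(ord($changed[500]) ^ 1);   // flip a single bit in the data

echo md5($original), "\n", md5($changed), "\n";
echo md5_bit_difference($original, $changed), " of 128 bits differ\n";
```

The two printed checksums share no useful resemblance, which is why sorting or comparing MD5 values tells us nothing about how alike two photos are.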

In the next part we'll get acquainted with pHash (perceptual hash), which tries to catch similarities between photos, and see where we get from there.
