Friday, December 27, 2013

Looks like it - or does it? (Part 3)

As I promised, we'll now enter the world of similarity and perceptual hashing.

A perceptual hash is a way of condensing an image (or, really, any lump of data) into a single value, such that the values of similar images are close to each other. As with the MD5 hash, there will be hash clashes, where two images share the same perceptual hash even though they are substantially different, often so different that it is hard to see any similarity at all. I will call these hash clashes false positives.

For background, here is an excellent introduction to perceptual hashing:

Looks Like It

But to put it briefly and neatly: perceptual hashing may be implemented by at least two different methodologies:

  1. Re-sizing the photo to a known small size, converting it to gray-scale, calculating the average value of all the pixels, then comparing every pixel of the re-sized, gray-scaled image against that average and building a bit array where every bit is 0 or 1 (what a surprise) depending on whether the pixel is darker or lighter than the average. This bit array is easily converted into a string of hexadecimal characters.
  2. Doing a discrete cosine transformation: re-sizing the photo, gray-scaling it, computing the DCT, reducing the DCT matrix, computing the average DCT value and constructing the hash from it (see the Looks Like It article for a somewhat longer explanation, maybe one from which you really do grasp something!)
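The steps of methodology 1 can be sketched in a few lines of plain PHP. This is only an illustration, not the phasher class itself: the function name average_hash() is mine, and it assumes the GD extension for the image work.

```php
<?php
// A minimal average-hash sketch (assumes the GD extension; average_hash()
// is an illustrative name, not part of the phasher class).
function average_hash($img, $size = 16) {
    // 1. Shrink to a known small size.
    $small = imagescale($img, $size, $size, IMG_BILINEAR_FIXED);
    // 2. Gray-scale: collect a luminance value per pixel.
    $lum = [];
    for ($y = 0; $y < $size; $y++) {
        for ($x = 0; $x < $size; $x++) {
            $rgb = imagecolorat($small, $x, $y);
            $r = ($rgb >> 16) & 0xFF;
            $g = ($rgb >> 8) & 0xFF;
            $b = $rgb & 0xFF;
            $lum[] = 0.299 * $r + 0.587 * $g + 0.114 * $b;
        }
    }
    // 3. Compare every pixel against the average: 1 = lighter, 0 = darker.
    $avg = array_sum($lum) / count($lum);
    $bits = '';
    foreach ($lum as $v) {
        $bits .= ($v >= $avg) ? '1' : '0';
    }
    // 4. Pack the bit string into hexadecimal, four bits at a time.
    $hex = '';
    foreach (str_split($bits, 4) as $nibble) {
        $hex .= dechex(bindec($nibble));
    }
    return $hex;
}
```

For a 16 x 16 image this yields 256 bits, i.e. a 64-character hexadecimal string.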

My solution builds (at least currently) on the phasher PHP class, which implements methodology 1. Maybe, whenever I have the time - I don't have much of it, though it may sound like I do ;) - I'll cook up or find Discrete Cosine Transformation (DCT) PHP code and check how the scene changes, or try to implement the pHash library.

From now on I'll concentrate on the average-lightness hashing (methodology 1). Theoretically it should handle situations where the image is re-sized or gray-scaled (heck, it does these itself), or where its lightness and contrast are changed, but only linearly; applying e.g. gamma correction or histogram equalization may, and probably would, move pixels relative to the average, and so the perceptual hash would change.
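Why a uniform lightness shift is harmless is easy to demonstrate: every pixel and the average move together, so the darker/lighter comparisons come out the same. Here is a tiny self-contained check (the values are made up for illustration, chosen so that nothing clips at 0 or 255 - clipping would make the change non-linear):

```php
<?php
// Demonstration: adding a constant to every pixel shifts the average
// by the same constant, so the bit pattern is unchanged.
$pixels  = [10, 200, 90, 140, 60, 220];
$shifted = array_map(fn($p) => $p + 30, $pixels);

function bits(array $px) {
    $avg = array_sum($px) / count($px);
    return implode('', array_map(fn($p) => $p >= $avg ? '1' : '0', $px));
}

echo bits($pixels), "\n", bits($shifted), "\n"; // identical bit strings
```

A gamma correction, by contrast, moves dark and light pixels by different amounts, so pixels near the average can cross it and flip their bits.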

When we want to find similar images there will be false positives and also false negatives, i.e. we will get the same hash for dissimilar images, and similar photos may give different hashes. In practice I have found that when I search for photos similar to the one under scrutiny, there may be a few similar ones and a couple of different ones. If I then check the similar images, I may find more similar ones and some new or the same different ones. And I suspect there are still more similar photos that don't show up even when running the search over all the similar photos found so far.

Why is that so? As said earlier, the photos may have been altered by edits that move pixels relative to the average, or by adding a text or a frame, or by removing them. If the photo is merely re-sized it should theoretically give the same phash, but in practice it doesn't have to. The problem lies partly in the re-sizing algorithms. An image consists of pixels, which are discrete square buckets - empty, full or somewhere in-between - of three paints (red, green, blue). Now when we re-size the photo smaller, we have to join a lot of these buckets and decide how much paint we pour into the one bucket that may represent tens of pixels in the original photo. If all those pixels are the same color it's really easy, but if the colors vary greatly the decision lies with the algorithm. If the image dimensions don't divide evenly, we also need to take care of sub-pixels. So, it's not a perfect world :)

My first implementation was based on calculating the phash for images re-sized to 16 x 16 pixels by running them through phasher. Here is the code for this simple phashing step:

// Sorry for omitting the <?php tags, but Blogger doesn't like them.

include_once('./phasher.class.php');
$I = PHasher::Instance();

try {
   $dbh = new PDO('mysql:host=localhost;dbname=imgs', "imgs-user", "imgs-pw");
   $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

   // Prepared statement for storing the computed hash.
   $sql="UPDATE Images SET img_phash=? WHERE img_id=?";
   $stmt=$dbh->prepare($sql);

   // Pick up every image that is not skipped and not yet hashed.
   $sql="SELECT img_id,img_name,image_dir FROM Images WHERE image_status<>'S' AND img_phash IS NULL";
   foreach ($dbh->query($sql) as $r) {
     $filename="images" . $r[2] . "/" . $r[1];
     $iid=$r[0];
     $hash_array=$I->HashImage($filename,0,0,16); // 16 x 16 average hash
     $signature=$I->HashAsString($hash_array);
     $stmt->execute(array($signature,$iid));
   }
} catch(PDOException $ex) {
   echo "An error occurred: " . $ex->getMessage();
}



So I go through the database looking for the new entries that the harvester script has accumulated. The image status S means that the photo should be skipped anyway (it is a dismissed duplicate, or not ... art enough). If I had had this system running from the beginning I wouldn't need the check here, as every photo written to the database and disk would have been phashed automatically, and I couldn't know a priori whether it is a duplicate (similar, not the same) or whether its ... artistic ... value is sub-optimal.

I run this from the command line, or actually a crontab script runs it as part of the harvesting and cleaning cycle.

As I prefer to do things the hard way, I also calculate two additional phashes in the same manner, one from the image re-sized to 19 x 19 and another from an 11 x 11 initial image. So I have three different hashes for every image. The size choices were fairly arbitrary; I used prime numbers (except for the initial 16) to reduce the similarity of the hashes, which may or may not be a good thing. The jury is still pondering it.

Now that we have the hashes, what next? My first decision was to hunt for the most common hashes, which I predicted would belong to widely shared similar photos. I was wrong. For some reason still totally unknown to me, the images sharing the most common hashes were all different.

I wrote a program which shows me the images in groups, sorted by diminishing count of shared hashes and, within each group, ordered by hash. I still have a hard time believing it, but the images in, say, the first ten or so clusters are all different. Then come a lot of groups of similar images, with a few dissimilar ones dispersed in-between for variation only (and those dissimilar ones are, again, all dissimilar among themselves). Now, I understand that some hashes are more popular than others, but it never ceases to astonish me that the most common hashes are shared by different photos, and only by different ones.

Sometimes I found two groups of similar photos very close to each other in the view and understood that this meant the hash values were really close but not the same. I also figured out that these images appeared close together only because of the (secondary) ordering by hash value, and only because the difference was in some of the last bits. Otherwise there would be a lot of distance between such groups, and I don't have an eidetic memory to recall a group whenever its twin turns up.

Given two hashes, it is easy to calculate the difference (or distance) between the two images. It's called the Hamming distance, and it has been readily available in PHP's GMP functions (GNU Multiple Precision) since PHP 4.0.4.
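Since the hashes are stored as hexadecimal strings, they drop straight into gmp_hamdist() via gmp_init() with base 16. A small sketch (assumes the GMP extension is loaded, and that both hashes have the same bit length - comparing a 16 x 16 hash to an 11 x 11 hash would be meaningless):

```php
<?php
// Hamming distance between two hex phash strings (assumes ext-gmp).
function phash_distance($hexA, $hexB) {
    // gmp_hamdist counts the bits that differ between the two values.
    return gmp_hamdist(gmp_init($hexA, 16), gmp_init($hexB, 16));
}

echo phash_distance('00ff', '00fe'); // the hashes differ in one bit
```

A distance of 0 means identical hashes; small distances suggest similar images.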

But that doesn't actually help us find the two images to start with. Of course, we could compare each image to every other, and although there is one image less to compare after each round (the Hamming distance is symmetric), the thought of building a matrix of more than 400 000 images - and keeping it up-to-date afterwards - isn't compelling.
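To see why it isn't compelling, here is what the naive all-pairs comparison looks like; the $hashes array is a made-up stand-in for the database rows:

```php
<?php
// Naive all-pairs comparison (assumes ext-gmp; $hashes is a hypothetical
// array of hex phash strings keyed by image id, standing in for the DB).
$hashes = ['1' => '00ff', '2' => '00fe', '3' => 'ff00'];

$pairs = [];
$ids = array_keys($hashes);
$n = count($ids);
for ($i = 0; $i < $n; $i++) {
    for ($j = $i + 1; $j < $n; $j++) { // symmetry: start $j above $i
        $d = gmp_hamdist(gmp_init($hashes[$ids[$i]], 16),
                         gmp_init($hashes[$ids[$j]], 16));
        $pairs[$ids[$i] . '-' . $ids[$j]] = $d;
    }
}
// Still n*(n-1)/2 distance computations: for 400 000 images that is
// about 8 * 10^10 pairs - and every new image adds n more.
```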

Dead end, after all? Sigh, what to do?

The answer is: wait for the next part :)
