Sunday, February 2, 2014

Yes, it is perfect now :)

I continued the research, though I was quite happy with the results.

There were still false positives, not so many, but some.

I wanted to check whether the false positives were caused by the initial re-size of the image. To test this I changed the initial re-sizing target from 32x32 to 128x128 pixels. I started the computation and ... well, the dots that indicated progress actually indicated no progress. Another wall to hit your head against.

I made a compromise and aimed for 64x64. Even so the calculation was slow but I let it crunch.

Meanwhile I decided it was time to bite the bullet. Via Google I found a fast DCT library (as part of a Fast Fourier Transform library). And now, my dear readers, I have to confess that I went overboard and implemented the hashing algorithm in C (I have programmed in C, but not for ages, and had forgotten some if not most of it) instead of PHP. Sorry for that.

After some fighting I got it working. The fast DCT from the library probably runs in something like n^2 log n time for the whole n x n image (n being the number of pixels along one side of the re-sized image) instead of the n^4 of the initial straightforward algorithm, and the speed is from another galaxy compared with PHP. (My PHP implementation was still struggling at about 16k images, after roughly 24 hours, when my C program started, and the C missile by-passed it in a few moments. Quietly I pressed Ctrl+C on the PHP screen, sighed and closed the screen.)

It still took some hours to go through the database - but I am happy to inform you that the results are spectacular, if not stellar!

With my 428k+ images I haven't yet found any false positives. Not one. Zero. None. Ninguno. And it found really many more similar images than the earlier (32 x 32 initial size) version did, even though I had already more or less pruned those away.

So, my code does the same steps as the earlier one, but instead of the 32 x 32 pixel initial re-size it re-sizes the initial image to 128 x 128. Then it calculates the DCT and reorganizes the bits just as I described earlier. And it finds just similar images, and I'd think it finds at least most of them. Image after image.
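In terms of the PHP class from the January 28 entry below, the only change would be the initial size parameter. A minimal sketch, assuming the class file is named phash_bis.php and the low-frequency block stays at 8 x 8 (and keeping in mind that this exact computation is the one that was far too slow in PHP, hence the C program):

<?php
  include_once('phash_bis.php');     // hypothetical file name for the Phash_Bis class below
  $hasher = new Phash_Bis(128, 8);   // 128 x 128 initial re-size, 8 x 8 DCT block -> 64-bit hash
  $hash = $hasher->hash("images/example.jpg");   // placeholder file name
?>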

I have made a utility script that shows the images sorted by descending number of similar images. With the 32 x 32 initial cell size the most-matched images were all more or less false positives, just as with the average hash with which I started this little project of pruning similar images from the database. Now it shows just rows of similar images; page after page. So the similarity clashes of the 32 x 32 version could be much more a result of the re-size than of collisions in the hash algorithm.

I would go as far as saying: problem solved.

As this blog is about PHP I haven't (at least not yet) published the C code here (or anywhere). Basically there isn't anything new in it; all I have done is convert the code to C and replace the DCT code with a library call.

Tuesday, January 28, 2014

The better DCT perceptual hash algorithm

In my previous blog entry I derived a simple but efficient method of constructing a DCT hash which facilitates sorting the hash values, basically producing a perceptual sort of a set of images - at least for well-behaving "normal" images.

My first solution was to transform the existing "traditional" DCT hashes into the sort-friendly form. As I don't think it is very productive to first construct the DCT hashes in the traditional way and then run a re-index, I present in this blog post a class that calculates the sort-friendly hash directly.

The class is as follows:



<?php
  // This class constructs perceptual hash for an image
  // It is based on the basic DCT hash but by reorganizing the hash bits
  // the hash is more or less sortable, basically building a perceptual index
  //
  // DCT code initial author: Elliot Shepherd
  //
  // Sort-friendly transformation author: The Nekkid PHP Programmer
  // http://nekkidphpprogrammer.blogspot.fi
  // nekkidphpprogrammer<at>gmail.com
  //
  class Phash_Bis {
    private $size=32;
    private $small_size=8;  
    private $c=array();
    private $reorder=array();
       
    private function InitCoefficients() {
      $this->c[0]=1.0/sqrt(2);
      for ($i=1; $i < $this->size ; $i++)
        $this->c[$i]=1.0;                   
//
// here we initialize the matrix that places the most significant frequencies at
// the beginning of the hash
//
      for ($l=0;$l<$this->small_size;$l++) {
        $stp=$l*$l;
        for ($u=0;$u<=$l;$u++) {
          $this->reorder[$stp]=array($u,$l);
          $stp+=2;
        }
        $stp=$l*$l+1;
        for ($v=0;$v<$l;$v++) {
          $this->reorder[$stp]=array($l,$v);
          $stp+=2;
        }
      }
      ksort($this->reorder);
    }
    
    private function blue($img,$x,$y) {
      return imagecolorat($img,$x, $y) & 0xff;
    }
        
    public function __construct($my_size=32,$my_small_size=8) {
      $this->size=$my_size;
      $this->small_size=$my_small_size;
      $this->InitCoefficients();
    }
    
    private function ApplyDCT($f) {
      $n=$this->size;
      $F=array();
      for ($u=0;$u<$n;$u++) {
        for ($v=0;$v<$n;$v++) {
          $sum=0.0;
          for ($i=0;$i<$n;$i++) {
            for ($j=0;$j<$n;$j++) {
              $sum+=cos(((2*$i+1)/(2.0*$n))*$u*M_PI)*cos(((2*$j+1)/(2.0*$n))*$v*M_PI)*($f[$i][$j]);
            }
          }
          $sum*=(($this->c[$u]*$this->c[$v])/4.0);
          $F[$u][$v]=$sum;
        }
      } 
      return $F;
    }
    
    public function hash($image) {
      $timing=microtime(true);
      $hash="missing";
      if (file_exists($image)) {
        $size=$this->size;
        $res=imagecreatefromstring(file_get_contents($image));
        $img = imagecreatetruecolor($size, $size);
        imagecopyresampled($img, $res, 0, 0, 0, 0, $size, $size, imagesx($res), imagesy($res));
        imagecopymergegray($img, $res, 0, 0, 0, 0, $size, $size, 50);
        $vals=array();
        for ($x=0;$x<$size;$x++) {
          for ($y=0;$y<$size;$y++) {
            $vals[$x][$y]=$this->blue($img,$x,$y);
          }
        }
        $dct_vals=$this->ApplyDCT($vals);
        $total=0.0;
        for ($x=0;$x<$this->small_size;$x++) {
          for ($y=0;$y<$this->small_size;$y++) {
            $total += $dct_vals[$x][$y];
          }
        }
        $total-=$dct_vals[0][0];
        $avg=$total/($this->small_size*$this->small_size-1);
        $hash=0;
//
// Transformed hash generation
//        
        foreach ($this->reorder as $ptr) {
          $hash = gmp_mul($hash, 2);
          if ($dct_vals[$ptr[0]][$ptr[1]]>$avg)
            $hash=gmp_add($hash, 1);
        }
//
//  Hash is returned as a hexadecimal string, my preference
//
        $hash_len=$this->small_size*$this->small_size/4;   // hash length in hexadecimal characters
        return substr(str_repeat("0",$hash_len).gmp_strval($hash,16),-$hash_len);
        
        
      }
      return $hash;
    }
  }
?>
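A minimal usage sketch (the include path and the image file name are placeholders, not part of the class):

<?php
  include_once('phash_bis.php');   // wherever you saved the class above
  $hasher = new Phash_Bis();       // defaults: 32 x 32 re-size, 8 x 8 low-frequency block
  echo $hasher->hash("images/example.jpg") . "\n";   // 16-character hexadecimal, sort-friendly bit order
?>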

I hope this helps you to find those pesky duplicates more effectively :)

If you encounter an error or any other problem, you may comment here and I'll see if I can be of help.

Monday, January 27, 2014

Not all bits are created equal!

In the previous blog entry I mused that maybe the bits just aren't equal. It is something you might consider when you think about how the DCT goes from low frequency to high frequency, the latter meaning the smaller details.

It is true that taking the average of all the DCT cells and coloring each pixel based on whether the cell value is over or under the average might somehow equalize the cells, but I still felt quite intrigued (read: haunted) by the thought that maybe the ordering of the bits was significant.

First I thought about running all the images (once more) through the DCT and saving the bits, but then figured out that the bits are already there; we just have to change their order. I instinctively ended up with the following order of bits from MSB to LSB (here, for simplicity, only the upper-left 4 x 4 corner of the averaged DCT matrix):


 0  1  4  9
 2  3  6 11
 5  7  8 13
10 12 14 15

As you can see, the MSB of the hash remains the same and so does the LSB. How did I end up with this order? If you look at the DCT frequency images you can see that each row and each column increases the frequency (= amount of detail), so whenever we go right or down we refer to smaller details. I could of course just as well have used the mirror matrix (mirrored with respect to the diagonal).

To rebuild a weighted DCT hash I used the following code:


<?php
  $order=array();
  for ($l=0;$l<8;$l++) {
    $stp=$l*$l;
    for ($u=0;$u<=$l;$u++) {
      $order[$stp]=($l+$u*8);
      $stp+=2;
    }
    $stp=$l*$l+1;
    for ($v=0;$v<$l;$v++) {
      $order[$stp]=($l*8+$v);
      $stp+=2;
    }
  }
  ksort($order); // Now the keys are in the same order as the bits
  
  $dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
  $sql="INSERT INTO DCT_ndx (image_id,dct_hash) VALUES (?,?)";
  $stmt=$dbh->prepare($sql);
  $sql="SELECT image_id,dct_hash FROM DCT_index; // old "traditional" hash
  $cnt=0;
  foreach ($dbh->query($sql) as $r) {
    $img_id=$r[0];
    $img_hash=$r[1];
    $old_hash=gmp_init($img_hash,16);
    $hash_bin=str_pad(gmp_strval($old_hash,2),64,"0",STR_PAD_LEFT); // pad to a full 64 bits so the bit positions match
    $new_hash=gmp_init(0);    
    foreach ($order as $bpos) {
      $new_hash=gmp_mul($new_hash,2);
      if (substr($hash_bin,$bpos,1)=="1")
        $new_hash=gmp_add($new_hash,1);
    }
    $nh=substr("0000000000000000" . gmp_strval($new_hash,16),-16);
    $stmt->execute(array($img_id,$nh));           
    $cnt++;
    if (($cnt % 500)==0)
        echo "$cnt ";
  }
  echo "\nDone ($cnt entries)\n";   
?>
I made a copy of the current index table into another index table, as I wasn't sure whether this logic would work out, and anyhow I prefer not to destroy anything that works even somehow.

After a little crunching (it took a few hours) I looked at the results and, yes, it worked even better than my already quite good earlier solution, where the bits were laid out row by row. Of course, where the hashes clashed (= false positives), reordering the bits didn't change anything.

But when I started to look for similar pictures by their leftmost substrings, the results were much better than earlier: fewer false positives (some, of course) and many more true positives found. I experimented with how many of the leading hexadecimal characters to use, and found that maybe the first 8 or 10 are the most efficient values. It may seem a bit odd that I can now get by with fewer matching leading bits (or rather groups of 4 bits, as our hash consists of 4-bit hexadecimal digits), but as the hashes are now more coherent the number of false positives doesn't explode even with that.
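To give an idea of what I mean by leftmost substrings, here is a sketch of the kind of prefix query involved - my reconstruction rather than the exact utility I run, assuming the re-ordered hashes live in DCT_ndx as in the script above:

<?php
  // group images by the first 9 hexadecimal characters of the re-ordered hash
  $dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
  $sql = "SELECT LEFT(dct_hash,9) AS prefix, COUNT(*) AS matches
            FROM DCT_ndx
           GROUP BY prefix
          HAVING matches > 1
           ORDER BY matches DESC";
  foreach ($dbh->query($sql) as $r)
    echo $r[0] . " : " . $r[1] . " candidate images\n";
?>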

So, it could be argued that this ordering of bits has created a visual sorting (or at least a semi-sorting) of the images - or a more exact hashing algorithm. This makes it quite efficient to look for duplicates, minimizing the time needed to clean the database.

I haven't yet deleted the duplicates (because of the research I want to do with these hashing algorithms), so I can now look for the all-time duplicates of the same images. One image was in the database 27 times! And, yes, this new ordering of bits shows JUST those 27 images. It is of course possible that there might still be one or more similar images left and I have no way of knowing it, but I now have really high trust in this index.

So, as you have now seen - not all bits are created equal.

Saturday, January 25, 2014

The Future is here (Part 2)

I have been experimenting with the DCT (Discrete Cosine Transform) hash during the last two weeks (yes, I have done something else, too) and I think I currently have quite a well-greased duplicate-finding number-crunching apparatus to handle my image library, which gets cleaned of duplicates (or triplicates or...) every day.

But first things first. As I said in my previous blog post (The Future is Here, Part 1), I implemented the DCT in PHP by re-engineering the Java code I introduced there.

The DCT in PHP is here:


<?php
  class Phash_npp {
    private $size=32;
    private $small_size=8;  
    private $c=array();
      
    private function InitCoefficients() {
      $this->c[0]=1.0/sqrt(2);
      for ($i=1; $i < $this->size ; $i++)
        $this->c[$i]=1.0;     
    }
  
    private function blue($img,$x,$y) {
      return imagecolorat($img,$x, $y) & 0xff;
    }
      
    public function __construct($my_size=32,$my_small_size=8) {
      $this->size=$my_size;
      $this->small_size=$my_small_size;
      $this->InitCoefficients();
    }
  
    private function ApplyDCT($f) {
      $n=$this->size;
      $F=array();
      for ($u=0;$u<$n;$u++) {
        for ($v=0;$v<$n;$v++) {
          $sum=0.0;
          for ($i=0;$i<$n;$i++) {
            for ($j=0;$j<$n;$j++) {
              $sum+=cos(((2*$i+1)/(2.0*$n))*$u*M_PI)*cos(((2*$j+1)/(2.0*$n))*$v*M_PI)*($f[$i][$j]);
            }
          }
          $sum*=(($this->c[$u]*$this->c[$v])/4.0);
          $F[$u][$v]=$sum;
        }
      }
      return $F;
    }
  
    public function hash($image) {
      $hash="0000000000000000";
      if (file_exists($image)) {
        $size=$this->size;
        $res=imagecreatefromstring(file_get_contents($image));
        $img = imagecreatetruecolor($size, $size);
        imagecopyresampled($img, $res, 0, 0, 0, 0, $size, $size, imagesx($res), imagesy($res));
        imagecopymergegray($img, $res, 0, 0, 0, 0, $size, $size, 50);
        $vals=array();
        for ($x=0;$x<$size;$x++) {
          for ($y=0;$y<$size;$y++) {
            $vals[$x][$y]=$this->blue($img,$x,$y);
          }
        }
        $dct_vals=$this->ApplyDCT($vals);
        $total=0.0;
        for ($x=0;$x<$this->small_size;$x++) {
          for ($y=0;$y<$this->small_size;$y++) {
            $total += $dct_vals[$x][$y];
          }
        }
        $total-=$dct_vals[0][0];
        $avg=$total/($this->small_size*$this->small_size-1);
        $hash=0;
        for ($x=0;$x<$this->small_size;$x++) {
          for ($y=0;$y<$this->small_size;$y++) {
            $hash = gmp_mul($hash, 2);
            if ($dct_vals[$x][$y]>$avg)
              $hash=gmp_add($hash, 1);
          }
        }
        return substr("0000000000000000" . gmp_strval($hash,16),-16);   // pad to a full 16-character hexadecimal string
      }
      return $hash;
    }
  }
?>


There's nothing fancy in the code itself - except that it works. I prefer saving the hashes in the database as 16 hexadecimal characters, so I return the calculated 64-bit DCT hash as a 16-character hexadecimal string.

The hashing is done by a simple PHP script that runs after the harvesting script (launched from the same bash script).

<?php

  include_once('phash_dct.php');
  $dct_hasher=new Phash_npp;

  try {
    $dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
    $sql="INSERT INTO DCT_index (image_id,dct_hash) VALUES (?,?)";
    $stmt=$dbh->prepare($sql);
    
    $sql="SELECT img_id,img_name FROM Images LEFT JOIN `DCT_index` ON img_id = image_id WHERE image_id IS NULL";
    $n=0;
    foreach ($dbh->query($sql) as $r) { 
      $filename="images/" . $r[1];
      $iid=$r[0];
      $hash=$dct_hasher->hash($filename);
      $stmt->execute(array($iid,$hash));            
      $n++;
      if (($n % 500)==0)
        echo "$n "; // To show, we aren't dead, much needed during iniatial go
    }
    echo "Done\r\n";
  } catch(PDOException $ex) {
    echo "An error occured!"; //user friendly message
  }
?>
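For reference, the DCT index table this script fills is roughly the following - a sketch only, my actual definition may differ in details:

<?php
  $dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
  $dbh->exec("CREATE TABLE IF NOT EXISTS DCT_index (
                image_id INT NOT NULL PRIMARY KEY,   -- points to Images.img_id
                dct_hash CHAR(16) NOT NULL,          -- 64-bit hash as 16 hexadecimal characters
                INDEX (dct_hash)
              )");
?>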

As you can see, I designed the database so that the DCT index is a separate table with a key pointing to the image id in the main image table. So now that we have the DCT index table, what can we do with it?

I have about 99.9% trust that the same DCT hash means the same image. There are a few exceptions - some images share the same DCT hash though they don't look at all the same to me - but they are rare, with quite a space in between the encounters.

Unfortunately the reverse doesn't hold: two similar images don't necessarily share the same DCT hash. There are some bits that change; I have seen cases where about seven bits differ between apparently identical images.

My first idea was to construct an MVP-tree adaptation similar to the one used with the initial phash algorithm, where it worked quite well. Unfortunately this seemed to be a total cul-de-sac. There were just too many hits and I had to drop the idea. Maybe (just maybe) it might still work with many different images, but I flushed the idea as I a) didn't really believe in it and b) didn't want to calculate Hamming distances to a dozen or so images for every image in the database.

Hmm... what to do? As we found out earlier, there isn't any good sorting scheme for indexing close images with regard to Hamming distance. Of course one could go through all images with brute force, but that seems just too heavy (and slow) a solution.

I thought about starting with baby steps by taking the hash of an image as a starting point. Then I constructed a list of different hash values by xor-ing the initial value with a single bit that traveled from the lowest position to the highest - so in the first pass I xor-ed the hash with 1, which inverts the LSB, put that on a list, then multiplied the 1 by 2 and xor-ed with that. I repeated this until, in the end, I had 64 hashes covering every hash that differs from the current one by one bit in any position. I constructed a SQL statement where the hash was IN (<list of hash values>).
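A sketch of that first attempt (my reconstruction; $base_hash stands for the 16-character hexadecimal hash of the image we start from):

<?php
  $base = gmp_init($base_hash, 16);
  $candidates = array();
  $mask = gmp_init(1);
  for ($i = 0; $i < 64; $i++) {
    $flipped = gmp_xor($base, $mask);   // invert exactly one bit
    $candidates[] = "'" . substr("0000000000000000" . gmp_strval($flipped, 16), -16) . "'";
    $mask = gmp_mul($mask, 2);          // move the flipped bit one position up
  }
  $sql = "SELECT image_id FROM DCT_index WHERE dct_hash IN (" . implode(",", $candidates) . ")";
?>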

It worked, but I still missed many matches I knew existed. Hmm, an easy thing to expand the idea: next I permuted all the two-bit combinations and got better but still insufficient results. And with three or more changed bits the SQL statement would get insanely long. Another show stopper.

Short (but just a little short) of getting really annoyed, I continued experimenting. I listed the images sorted by hash and discovered that the similar images were often quite close to each other, though not always successive. There were at least two possible explanations: either there were really many duplicates and we just stumbled upon the closest ones, or not all bits are equal.

The first explanation was so depressing that I had to grasp the latter one. If you think about the DCT you may understand that not all bits are created equal. If we were handling the DCT itself, it would be quite clear that the LSB bits are less important than the higher ones. Actually we are dealing with a bit pattern, which is built on each value being over or under the average of all the DCT values (a float matrix), so this conclusion isn't clear-cut (and it MAY be very flawed for all I know).

Anyhow, I wrote a selection program that takes a parameter telling how many of the first bits we consider - or more correctly, how many of the most significant groups of four bits, as the data is stored in the database as a hexadecimal string. So, if I set the parameter to 16, I find only the matches with exactly the same hash. The value 15 means that we don't care about the last hexadecimal character, i.e. only the first 60 bits are significant. This is very simple to do, as we just use the SQL function LEFT() with the given argument.
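The core of the selection looks roughly like this - a sketch with made-up variable and parameter names ($img is the image id we compare against, "chars" is the parameter described above), re-using the table and column names from the scripts in these blog posts:

<?php
  $img = (int)$_GET["img"];
  $chars = isset($_GET["chars"]) ? (int)$_GET["chars"] : 9;   // 16 = exact match only
  $dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
  $stmt = $dbh->prepare("SELECT dct_hash FROM DCT_index WHERE image_id = ?");
  $stmt->execute(array($img));
  $hash = $stmt->fetchColumn();
  $stmt = $dbh->prepare("SELECT image_dir,img_name,img_id,image_size
                           FROM Images JOIN DCT_index ON img_id = image_id
                          WHERE LEFT(dct_hash,$chars) = ?
                            AND image_status <> 'S'
                          ORDER BY image_size DESC");
  $stmt->execute(array(substr($hash, 0, $chars)));
?>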

After experimenting I landed on the value 9, which seems to hit the sweet spot where we get most of the matches with minimal false positives. There will be false positives, but I'd say only in maybe every tenth image. Some images get really many false positives (at least partly the same odd group we had with the initial average hash), but quite soon I was able to see mostly clean successions of similar images and smash them.

The selection program selects similar images (as described above) and shows them on the screen as "film strips", ordered by decreasing number of similar images, so I was able to delete the most common duplicate images first.

I am quite pleased with the current state of affairs. It is possible (or even probable) that some duplicates remain (even after I have gone through all the matches - if I ever get to that point), but I think I am very close to the optimal technical solution - at least so close that I am satisfied.

Theoretically there might be a more efficient ordering of the bits in the hash, as now they are just laid out row by row, which is just fine as long as you only consider exact matches. My intuition says that if we store the bits in the right order, the most important values would lie more coherently at the beginning of the hash. I might rebuild an index with this logic (just out of curiosity) if and when I get the urge to do so. Probably quite soon ;)

I also still dream about writing a general-purpose class for an image library, mostly containing the tools I have found effective and working.

But until next entry, have a nice day/evening/night/morning or whatever.

Monday, January 6, 2014

The Future is Here! (Part 1)

Hello

This is just a quick note about the achievements since last writing.

I just had to bite the bullet and translate a Java implementation of the 2D DCT transformation found (= robbed) on the Net. The source code can be found for example here. The code shows Elliot Shepherd as the author, so the credits go there - thank you and sorry ;)

I also wrote a class library that calculates the DCT phash (as a 16-character hexadecimal string) to make it analogous to my earlier hashes and to facilitate its handling in the database. The class works more or less like the phasher class referred to in the previous blog posts.

I am not yet giving anything out, because the code is still calculating the hashes for the database for the first time (about half a second per hash) and it will take long (I mean a long time, like 70 hours) to crunch its way through the database, and there's a little research to do after that as well before I show my findings and clean up the class library, which will be made available if there's any interest... ;)

So why hurry to inform you, my (almost non-existent) reader? Because I have some 12,000 images hashed so far and the results are just stunningly beautiful (and not only the art in the images). All the shared hashes belong to the same image; no false positives found so far. I understand now why the DCT was chosen for JPEGs. :)

So, I'll most probably pop up during the next week or so with something to look forward to.

Monday, December 30, 2013

Ponderations and Paths for the Future and Conclusions (Part 5)

Pondering

In the four preceding blog posts I have introduced a quite simple, yet somewhat effective database solution for finding similar images in a big heap of them. In our haystack there may be a lot of needles, and our goal was to catch as many of them as possible without un-stacking all the hay.

First, let us ponder a bit how well we have achieved our goals (or at least I have). My gut feeling is that though the system finds quite a lot of similar images, it doesn't get them all. I have already given some explanations for why the perceptual hash may differ between similar photos. Unfortunately I don't have time to go through the whole evolving collection, so I can't give any exact statistics. I just see some images I think I have seen quite often (considering the 380 966 unhidden photos), which would suggest that there are still a lot of duplicates in the database.

However, my feeling also says that the number of duplicates lessens from day to day; every day I get fewer series of, say, five or more similar photos. Let me explain a bit how I clean the database (i.e. my workflow). I try to find time to go through the new images. I read a bunch of them, find the number of similar images (as defined in my previous blog entry) and print it below each image. If I click a photo, a new window opens and shows all the images which are phash-wise similar. So, if the photo has just one similar photo (itself) and it is... appealing, I just look at the next one. If it isn't... art enough to my eye, I just delete it (set image_status to 'S').

If there are two or more similar photos, I click the photo and get a new window with all the photos that are close enough by perceptual hash, sorted by decreasing image size. Now I click all the other similar photos except the one we came from, and mark all but the biggest as hidden. By clicking all the similar photos I open more tabs, each with that image as the comparison base. Why on earth? After experimenting with the system I have found that some similar photos may slip past the comparison with the initial one but still match some other photo that was close enough to the initial one. Mathematically speaking, the similarity defined by our perceptual hash isn't an equivalence relation, because it isn't transitive: if the Hamming distance between phash(A) and phash(B) is below x and the Hamming distance between phash(B) and phash(C) is also below x, it doesn't mean that the Hamming distance between phash(A) and phash(C) is below x. This means that sometimes we get a "path" which goes via successive similar images until no new ones are found.

I also get some false positives which are not similar to the eye, but whose phashes are still within the given radius. A bad, bad thing, isn't it? Well, it depends on how you think about it. I think I have found both duplicates and images without... art value almost more via false positives than via true positives. Furthermore, the unpleasing images seem to attract other unpleasing images, which both raises the efficiency of cleaning and keeps astounding me. It's like there is more in this system than what the eye can see. Then again, it may just be my impression.

Optimizations

A few words about optimizations. I should have mentioned this already, but to make things work on a reasonable time scale you have to index the phash and precomputed Hamming distance columns in your database.
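As a sketch of what I mean (the column names are the ones appearing in the snippets of the earlier parts below; adapt to your own schema):

<?php
  $db = mysql_connect("192.168.200.85","imgs-user","imgs-pass");
  mysql_select_db("imgs");
  mysql_query("ALTER TABLE Images
                 ADD INDEX (img_phash), ADD INDEX (img_phash11), ADD INDEX (img_phash19),
                 ADD INDEX (d1), ADD INDEX (d2), ADD INDEX (d3),
                 ADD INDEX (e1), ADD INDEX (e2), ADD INDEX (e3),
                 ADD INDEX (f1), ADD INDEX (f2), ADD INDEX (f3)");
?>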

The minimal distance for images to be considered the same is a delicate balance between missing true positives and drowning in false positives. With my data a Hamming distance of four bits seems to be optimal (as explained in my previous blog entry). I have learnt that I can grasp at most about 20 images on one screen without starting to miss the similar ones (given that most are false positives) when rapidly eyeing through the photos.

The Future (if it ever exists)

I have tried different rules for proximity, but this has been the only one working well so far. I'll keep trying, but don't hold your breath. For me this works as well as I need for now, and the next meaningful step would need something that changes the scene remarkably. That would mean finding a better function for the phash (the Discrete Cosine Transform may be it), one that finds more truly similar images and fewer false positives. Another goal would be implementing some logic that could find clusters of similar images effectively. There might be some way with the current database, but I haven't found it yet.

As a programmer I would also like to encapsulate this functionality into a generic class. It isn't so easy to do, as currently the database operations are tightly intertwined throughout the code and my image files are organized into different directories, which needs some additional logic. I have some thoughts about this frontier, though, and I will blog about them here when and if they ever see the light.

I end this subject (for now, at least) with a conclusion of what we have got and how it works.

Conclusion

Managing a big image base from different sources is like gardening. Unless you diligently prune and nurture the garden, the weeds will make it less enjoyable. So it means that you have to devote time, code and tenderness to your collection - unless you don't mind a wild garden (but why would you even read this article in the first place?). If you came here for a turnkey system which automatically deletes just the duplicate images (similar, not same) in your database, I am sorry to disappoint you. Perceptual hashes as I have implemented them here don't find just the similar photos; they also find different photos and think they are similar. Nor do they find all the similar photos, because there are always inaccuracies in re-sizing the initial photos and calculating the average value of the re-sized photos, which may throw the phash off.

I haven't (yet, at least) encapsulated the functionality in a simple and general class, nor given the whole source of my working system, but instead shown just the significant parts and left the rest as an easy exercise for the reader. I still hope that you have got some insight into perceptual hashes and also grasped why they aren't an easy solution to handle. Given some background in PHP, I would estimate that it is possible to re-create these ideas as a part of your own image collection.

Saturday, December 28, 2013

Finding similarities - at last (Part 4)

As we - at least I ;) - figured out last time, getting the perceptual hash value wasn't the whole solution we were looking for, that of finding similar images in a heap of photos.

The problem is that with a big image base the N x N comparison just takes too much time. Furthermore, sorting by perceptual hash doesn't work too well, as we can find close matches only when the difference is in the "last bits" of the perceptual hash. (For example, two hashes differing only in their most significant bit are just one bit apart but end up at opposite ends of the sorted list.)

In the last blog entry we also discovered that the difference, or distance, between two phashes can be readily calculated as the Hamming distance, which is available in PHP via the gmp_hamdist function.
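As a reminder, with the hashes stored as 16-character hexadecimal strings the distance is a one-liner (the two hash values here are made up for illustration):

<?php
  $phash_a = "ffd8a0c4b2e19375";   // example values only
  $phash_b = "ffd8a0c4b2e19374";
  $distance = gmp_hamdist(gmp_init($phash_a, 16), gmp_init($phash_b, 16));
  echo "$distance differing bit(s)\n";   // prints "1 differing bit(s)" for these two
?>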

As I browsed Google-land I found a very thorough article about finding similar entries in high-dimensional data. You can read it by following this link here. The article covers very many different indexing schemes for similarity search. It went partly (say 99 parts out of 100) over my head. There were K-trees, VP-trees, MVP-trees and so on. What I sort of gathered from the article was that MVP-trees (Multi-Vantage Point trees) would be one of the best possible solutions in my case, which inherently suffers from the "Curse of Dimensionality", i.e. the problem is so multidimensional that one can't handle it effectively with traditional methods.

I didn't find any source code for an MVP-tree solution, so I made a... hack, which sort of borrows the idea but is still quite easy to implement.

The difference between two hashes is multidimensional and can point in "any direction"; the Hamming distance only counts the differing bits. If, say, two bits differ in the phash, the distance is two, and the distance stays the same whichever two bits are different.

A Multi-Vantage Point tree builds on some, well, vantage points, and all other points are located relative to these vantage points. So, in my algorithm I choose three points quite randomly, and for each perceptual hash (remember I calculate the phashes for 11x11, 16x16 and 19x19) I calculate the Hamming distance to each of these vantage points.

Well, I didn't choose the vantage points totally randomly: after calculating the distances to the first point I looked in the database and selected a second one which was quite far away (= a great Hamming distance), and similarly a third one so that its Hamming distances to the first two images were distinct. The article linked above stated that the selection of the vantage points isn't very important; there was very little difference in performance between heuristically selected and random vantage points. My reasoning was that if the vantage points lie close to each other, the tree acts almost like a tree with one vantage point less, which would be - pointless, as I just said.

So now I actually have 12 values for each image row in the database: three perceptual hashes (one for each initial re-size) and three times three Hamming distances, one set to each of the vantage points. Why three vantage points? Good question. Any more questions? ;) We are always balancing between space, time and Nathaniel - I mean effectiveness - and my first hunch was that the balance lies here, but I reserve the right to change my opinion.

Probably one more vantage point would lessen false positives and still (probably) not lead to missed true positives. I shall explore these areas and report here later (see the forthcoming Part 5 about the avenues there are for further tinkering).

At the end of the nightly run I run the dist.php below, which builds the distances to the vantage points for each new image. Sorry for not using PDO here, but I preferred building a column name via a variable and just don't know how to do that with PDO. The first line contains the image ids of the vantage points.

$d=array(array('d',55080),array('e',55082),array('f',55108));

foreach ($d as $d1x) {
  $variable=$d1x[0];
  $d1=$d1x[1];
  $db=mysql_connect("192.168.200.85","imgs-user","imgs-pass");
  mysql_select_db("imgs");
  mysql_set_charset('latin1',$db);
  $sql="SELECT img_phash,img_phash11,img_phash19 FROM Images WHERE img_id=$d1";
  $rs=mysql_query($sql);
  $r=mysql_fetch_row($rs);
  $h1=gmp_init($r[0],16);
  $h2=gmp_init($r[1],16);
  $h3=gmp_init($r[2],16);
  $sql="SELECT img_id,img_phash,img_phash11,img_phash19 FROM Images WHERE image_status<>'S' AND {$variable}1 IS NULL";
  $rs=mysql_query($sql);

  while ($r=mysql_fetch_row($rs)) {
    $c1=gmp_init($r[1],16);
    $c2=gmp_init($r[2],16);
    $c3=gmp_init($r[3],16);
    $diff1=gmp_hamdist($h1,$c1);
    $diff2=gmp_hamdist($h2,$c2);
    $diff3=gmp_hamdist($h3,$c3);
    $sql="UPDATE Images SET {$variable}1=$diff1,{$variable}2=$diff2,{$variable}3=$diff3 WHERE img_id={$r[0]}";
    mysql_query($sql);
  }
}


So now we have the Hamming distance from each image to a set of known images. How does that help us in our great quest? Well, what were our challenges before this step? Yes - the Hamming distances weren't easy to sort, so finding close images was just too slow and resource-greedy. We could quite easily find the exact matches, but for many reasons exact matches were not available for similar photos.

But now the situation has changed quite substantially. For each vantage point we can define a Hamming distance radius and show all images inside that radius. My first (and current) implementation sets all the radii the same (though I know that getting the same Hamming distance with 19 x 19 is much more demanding than with 11 x 11 - more on this in Part 5) and selects an image as similar if it is inside the radius with respect to every vantage point.

I don't give my whole find_similar.php as there is too much supporting code (for showing the images and so on), but the basic SQL is actually very simple:



// $img is the image for which we are trying to find the similar ones
//
// Hamming distances on different cell sizes and each vantage point
//
$sql="SELECT d1,d2,d3,e1,e2,e3,f1,f2,f3 FROM Images WHERE img_id=$img";
.
.
.
$d1=$r[0];
$d2=$r[1];
$d3=$r[2];
$e1=$r[3];
$e2=$r[4];
$e3=$r[5];
$f1=$r[6];
$f2=$r[7];
$f3=$r[8];
$betweens="";
if (isset($_GET["step"]))
  $value=$_GET["step"];
else
  $value=4;
for ($i=1;$i<4;$i++) {
  if (strlen($betweens)>0)
    $betweens .= " AND ";
  $v="d$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.="$v BETWEEN $vmin AND $vmax";
  $v="e$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.=" AND $v BETWEEN $vmin AND $vmax";
  $v="f$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.=" AND $v BETWEEN $vmin AND $vmax";
}
$sql="SELECT image_dir,img_name,img_id,image_size FROM Images WHERE $betweens AND image_status<>'S' ORDER BY image_size DESC";
.
.
.


As you can see, the default Hamming distance is 4. I ended up with it after checking different values, and for my data (sorry, not available) 4 seems to be the sweet spot.

This concludes our discussion for today. In the next part I will mainly try to envision what directions there are to explore in the future.