Sunday, February 2, 2014

Yes, it is perfect now :)

I continued the research, though I was quite happy with the results.

There were still false positives, not so many, but some.

I wanted to check whether the false positives were caused by the initial resize of the image. To test this I changed the initial resize target from 32x32 to 128x128 pixels. I started the computation and ... well, the dots that indicated progress actually indicated no progress. No wonder: with the four nested loops of the straightforward DCT, going from 32 to 128 multiplies the work by 4^4 = 256. Another wall to hit your head against.

I made a compromise and aimed for 64x64. Even so the calculation was slow, but I let it crunch.

Meanwhile I decided it was time to bite the bullet. Through Google I found a library with a fast DCT (as part of a Fast Fourier transform library). And now, my dear readers, I have to confess that I went overboard and implemented the hashing algorithm in C (I have programmed in C, but not for ages, and had forgotten some if not most of it) instead of PHP. Sorry for that.

After some fighting I got it working. The fast DCT from the library probably runs in about n^2 log n time (n being the number of pixels along one side of the resized image) instead of the n^4 of the initial straightforward algorithm, and the speed is from another galaxy compared with PHP. My PHP implementation was struggling at about 16k images (after about 24 hours) when my C program started, and the C missile by-passed it in a few moments. Quietly I pressed Ctrl+C on the PHP screen, sighed and closed the screen.
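For the curious, a big part of that speed-up is available without any library at all. The quartic cost of the straightforward algorithm comes from its four nested loops; because the 2D DCT is separable, you can transform the rows first and then the columns, which drops the cost to n^3 even in PHP (a fast 1D transform then takes it down further). Here is a sketch of that idea - an illustration written for this post, not the library code I actually used:

```php
<?php
// Separable 2D DCT: 1D DCT over every row, then over every column.
// Scaling matches ApplyDCT below: F[u][v] = (c[u]*c[v]/4) * double sum.

function dct1d(array $in): array {
  $n = count($in);
  $out = array();
  for ($u = 0; $u < $n; $u++) {
    $sum = 0.0;
    for ($i = 0; $i < $n; $i++)
      $sum += cos(((2 * $i + 1) / (2.0 * $n)) * $u * M_PI) * $in[$i];
    $c = ($u == 0) ? 1.0 / sqrt(2) : 1.0;   // same c[] as in InitCoefficients
    $out[$u] = $sum * $c / 2.0;             // c/2 per axis gives c*c/4 in 2D
  }
  return $out;
}

function dct2d_separable(array $f): array {
  $n = count($f);
  $rows = array();
  for ($i = 0; $i < $n; $i++)               // pass 1: transform each row
    $rows[$i] = dct1d($f[$i]);
  $F = array();
  for ($v = 0; $v < $n; $v++) {             // pass 2: transform each column
    $col = array();
    for ($i = 0; $i < $n; $i++)
      $col[$i] = $rows[$i][$v];
    $t = dct1d($col);
    for ($u = 0; $u < $n; $u++)
      $F[$u][$v] = $t[$u];
  }
  return $F;                                // total work O(n^3) instead of O(n^4)
}
```

For a flat all-ones 4x4 input this produces F[0][0] = 2 and zeros elsewhere, exactly like the four-loop version.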

It still took some hours to go through the database - but I am happy to inform you that the results are spectacular, if not stellar!

With my 428k+ images I haven't yet found any false positives. Not one. Zero. None. Ninguno. And it found many more similar images than the earlier run (32 x 32 initial size), which I had already more or less pruned away.

So, my code performs the same steps as the earlier one, but instead of an initial 32 x 32 resize it resizes the image to 128 x 128. Then it calculates the DCT and reorganizes the bits just like I described earlier. And it finds just similar images, and I'd think it finds at least most of them. Image after image.

I have made a utility script that shows the images sorted by descending number of similar images. With the 32 x 32 initial cell size the most-matched images were all more or less false positives, the same way as with the average hash I used when I started this little project of pruning similar images from the database. Now it shows just rows of similar images, page after page. So the false positives of the 32 x 32 version could be much more a result of the resize than of clashes in the hash algorithm.

I would go as far as saying: problem solved.

As this blog is about PHP, I haven't (at least not yet) published the C code here (or anywhere). Basically there isn't anything new in it; all I have done is convert the code to C and replace the DCT code with a library call.

Tuesday, January 28, 2014

The better DCT perceptual hash algorithm

In my previous blog entry I derived a simple but efficient method of constructing a DCT hash which facilitates sorting the hash values, basically producing a perceptual sort of a set of images - at least for well-behaving "normal" images.

My first solution was to transform the existing "traditional" DCT hashes into a sort-friendly hash. As I don't think it very productive to first construct the DCT hashes in the traditional way and then run a re-index, I present in this blog post a class that calculates the sort-friendly hash directly.

The class is as follows:



<?php
  // This class constructs perceptual hash for an image
  // It is based on the basic DCT hash but by reorganizing the hash bits
  // the hash is more or less sortable, so building basically a perceptual index
  //
  // DCT code initial author: Elliot Shepherd
  //
  // Sort-friendly transformation author: The Nekkid PHP Programmer
  // http://nekkidphpprogrammer.blogspot.fi
  // nekkidphpprogrammer<at>gmail.com
  //
  class Phash_Bis {
    private $size=32;
    private $small_size=8;  
    private $c=array();
    private $reorder=array();
       
    private function InitCoefficients() {
      $this->c[0]=1.0/sqrt(2);
      for ($i=1; $i < $this->size ; $i++)
        $this->c[$i]=1.0;                   
//
// here we initialize the matrix for placing the most significant frequencies at
// the beginning of the hash
//
      for ($l=0;$l<$this->small_size;$l++) {
        $stp=$l*$l;
        for ($u=0;$u<=$l;$u++) {
          $this->reorder[$stp]=array($u,$l);
          $stp+=2;
        }
        $stp=$l*$l+1;
        for ($v=0;$v<$l;$v++) {
          $this->reorder[$stp]=array($l,$v);
          $stp+=2;
        }
      }
      ksort($this->reorder);
    }
    
    private function blue($img,$x,$y) {
      return imagecolorat($img,$x, $y) & 0xff;
    }
        
    public function __construct($my_size=32,$my_small_size=8) {
      $this->size=$my_size;
      $this->small_size=$my_small_size;
      $this->InitCoefficients();
    }
    
    private function ApplyDCT($f) {
      $n=$this->size;
      $F=array();
      for ($u=0;$u<$n;$u++) {
        for ($v=0;$v<$n;$v++) {
          $sum=0.0;
          for ($i=0;$i<$n;$i++) {
            for ($j=0;$j<$n;$j++) {
              $sum+=cos(((2*$i+1)/(2.0*$n))*$u*M_PI)*cos(((2*$j+1)/(2.0*$n))*$v*M_PI)*($f[$i][$j]);
            }
          }
          $sum*=(($this->c[$u]*$this->c[$v])/4.0);
          $F[$u][$v]=$sum;
        }
      } 
      return $F;
    }
    
    public function hash($image) {
      $hash="missing";
      if (file_exists($image)) {
        $size=$this->size;
        $res=imagecreatefromstring(file_get_contents($image));
        $img = imagecreatetruecolor($size, $size);
        imagecopyresampled($img, $res, 0, 0, 0, 0, $size, $size, imagesx($res), imagesy($res));
        imagecopymergegray($img, $res, 0, 0, 0, 0, $size, $size, 50);
        $vals=array();
        for ($x=0;$x<$size;$x++) {
          for ($y=0;$y<$size;$y++) {
            $vals[$x][$y]=$this->blue($img,$x,$y);
          }
        }
        $dct_vals=$this->ApplyDCT($vals);
        $total=0.0;
        for ($x=0;$x<$this->small_size;$x++) {
          for ($y=0;$y<$this->small_size;$y++) {
            $total += $dct_vals[$x][$y];
          }
        }
        $total-=$dct_vals[0][0];
        $avg=$total/($this->small_size*$this->small_size-1);
        $hash=0;
//
// Transformed hash generation
//        
        foreach ($this->reorder as $ptr) {
          $hash = gmp_mul($hash, 2);
          if ($dct_vals[$ptr[0]][$ptr[1]]>$avg)
            $hash=gmp_add($hash, 1);
        }
//
//  Hash is returned as a hexadecimal string, my preference
//        
        $hash_len=$this->small_size*$this->small_size/4;
        return substr(str_repeat("0",$hash_len).gmp_strval($hash,16),-$hash_len);
        
        
      }
      return $hash;
    }
  }
?>

I hope this helps you to find those pesky duplicates more effectively :)

If you encounter an error or any other problem, you may comment here and I'll see if I can be of help.

Monday, January 27, 2014

Not all bits are created equal!

In the previous blog entry I mused that maybe the bits just aren't equal. It is something you might consider when thinking about how the DCT goes from low frequency to high frequency, the latter meaning the smaller details.

It is true that taking the average of all the DCT cells and coloring each pixel based on whether the cell value is over or under the average might somehow equalize the cells, but I still felt quite intrigued (read: haunted) by the thought that maybe the ordering of bits was significant.

First I thought about running all the images (once more) through the DCT and saving the bits, but then figured out that the bits are already there; we just have to change their order. I instinctively ended up with the following order of bits from MSB to LSB (here, for simplicity, only the 4 x 4 upper-left corner of the DCT averaged-values matrix):


 0  1  4  9
 2  3  6 11
 5  7  8 13
10 12 14 15

As you can see, the MSB of the hash remains the same, and the LSB as well. How did I end up with this order? If you look at the DCT frequency images you can see that each row and each column increases the frequency (= amount of detail), so whenever we go right or down we refer to smaller details. I could of course just as well have used a mirror matrix (mirrored with respect to the diagonal).
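If you want to convince yourself that this walk really produces the matrix above, here is a small self-contained helper (my own naming, written for this post) that builds the same reorder mapping and turns it into matrix form:

```php
<?php
// Build the bit-order matrix: matrix[u][v] = position of cell (u,v) in the hash,
// using the diagonal "down column $l, then across row $l" walk described above.
function bit_order_matrix(int $n): array {
  $reorder = array();
  for ($l = 0; $l < $n; $l++) {
    $stp = $l * $l;
    for ($u = 0; $u <= $l; $u++) {        // walk down column $l
      $reorder[$stp] = array($u, $l);
      $stp += 2;
    }
    $stp = $l * $l + 1;
    for ($v = 0; $v < $l; $v++) {         // walk across row $l
      $reorder[$stp] = array($l, $v);
      $stp += 2;
    }
  }
  ksort($reorder);
  $m = array();
  foreach ($reorder as $bit => $cell)     // invert: cell -> bit position
    $m[$cell[0]][$cell[1]] = $bit;
  return $m;
}
```

For $n = 8 the same walk yields the full 64-bit order used in the re-indexing script below.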

To rebuild a weighted DCT hash I used the following code:


<?php
  $order=array();
  for ($l=0;$l<8;$l++) {
    $stp=$l*$l;
    for ($u=0;$u<=$l;$u++) {
      $order[$stp]=($l+$u*8);
      $stp+=2;
    }
    $stp=$l*$l+1;
    for ($v=0;$v<$l;$v++) {
      $order[$stp]=($l*8+$v);
      $stp+=2;
    }
  }
  ksort($order); // Now the keys are in the same order as the bits
  
  $dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
  $sql="INSERT INTO DCT_ndx (image_id,dct_hash) VALUES (?,?)";
  $stmt=$dbh->prepare($sql);
  $sql="SELECT image_id,dct_hash FROM DCT_index"; // the old "traditional" hash
  $cnt=0;
  foreach ($dbh->query($sql) as $r) {
    $img_id=$r[0];
    $img_hash=$r[1];
    $old_hash=gmp_init($img_hash,16);
    $hash_bin=str_pad(gmp_strval($old_hash,2),64,"0",STR_PAD_LEFT); // pad to 64 bits so bit positions line up
    $new_hash=gmp_init(0);    
    foreach ($order as $bpos) {
      $new_hash=gmp_mul($new_hash,2);
      if (substr($hash_bin,$bpos,1)=="1")
        $new_hash=gmp_add($new_hash,1);
    }
    $nh=substr("0000000000000000" . gmp_strval($new_hash,16),-16);
    $stmt->execute(array($img_id,$nh));           
    $cnt++;
    if (($cnt % 500)==0)
        echo "$cnt ";
  }
  echo "\nDone ($cnt entries)\n";   
?>
I made a copy of the current index table into another index table, as I wasn't sure whether this logic would work out, and anyhow I prefer not to destroy anything that works, even somehow.

After some crunching (it took a few hours) I looked at the results and, yes, it worked even better than my earlier, quite good solution where the bits were laid out row by row. Of course, where the hashes had clashes (= false positives), reordering the bits didn't change anything.

But when I started to look for similar pictures by leftmost substrings, the results were much better than earlier: fewer false positives (some, of course) and many more true positives found. I experimented with the number of leading characters to use and found that maybe the first 8 or 10 are the most efficient values. It may seem a bit odd that I can now match on fewer leading bits (or rather 4-bit groups, as our hash consists of 4-bit hexadecimal digits), but as the hashes are now more coherent, the number of false positives doesn't explode even with that.

So, it could be argued that this reordering of bits has created a visual sorting (or at least semi-sorting) of images, or at least a more exact hashing algorithm. This makes it quite efficient to look for duplicates, minimizing the time needed to clean the database.

I haven't yet deleted the duplicates (because of the research I want to do with these hashing algorithms), so I can now look for the all-time duplicates of the same images. One image was in the database 27 times! And yes, this new ordering of bits shows JUST those 27 images. It is of course possible that there might still be one or more similar images left, and I have no way of knowing it, but I now have really high trust in this index.

So, as you have now seen - not all bits are created equal.

Saturday, January 25, 2014

The Future is here (Part 2)

I have been experimenting with DCT hash (Discrete Cosine Transform) during the last two weeks (yes, I have done something else, too) and I think I currently have quite a well-greased duplicate-finding number-crunching apparatus to handle my image library, which gets cleaned of duplicates (or triplicates or...) every day.

But first things first. As I said in my previous blog post (The Future is Here, Part 1), I implemented the DCT in PHP by re-engineering the Java code introduced there.

The DCT in PHP is here:


<?php
  class Phash_npp {
    private $size=32;
    private $small_size=8;  
    private $c=array();
      
    private function InitCoefficients() {
      $this->c[0]=1.0/sqrt(2);
      for ($i=1; $i < $this->size ; $i++)
        $this->c[$i]=1.0;     
    }
  
    private function blue($img,$x,$y) {
      return imagecolorat($img,$x, $y) & 0xff;
    }
      
    public function __construct($my_size=32,$my_small_size=8) {
      $this->size=$my_size;
      $this->small_size=$my_small_size;
      $this->InitCoefficients();
    }
  
    private function ApplyDCT($f) {
      $n=$this->size;
      $F=array();
      for ($u=0;$u<$n;$u++) {
        for ($v=0;$v<$n;$v++) {
          $sum=0.0;
          for ($i=0;$i<$n;$i++) {
            for ($j=0;$j<$n;$j++) {
              $sum+=cos(((2*$i+1)/(2.0*$n))*$u*M_PI)*cos(((2*$j+1)/(2.0*$n))*$v*M_PI)*($f[$i][$j]);
            }
          }
          $sum*=(($this->c[$u]*$this->c[$v])/4.0);
          $F[$u][$v]=$sum;
        }
      }
      return $F;
    }
  
    public function hash($image) {
      $hash="0000000000000000";
      if (file_exists($image)) {
        $size=$this->size;
        $res=imagecreatefromstring(file_get_contents($image));
        $img = imagecreatetruecolor($size, $size);
        imagecopyresampled($img, $res, 0, 0, 0, 0, $size, $size, imagesx($res), imagesy($res));
        imagecopymergegray($img, $res, 0, 0, 0, 0, $size, $size, 50);
        $vals=array();
        for ($x=0;$x<$size;$x++) {
          for ($y=0;$y<$size;$y++) {
            $vals[$x][$y]=$this->blue($img,$x,$y);
          }
        }
        $dct_vals=$this->ApplyDCT($vals);
        $total=0.0;
        for ($x=0;$x<$this->small_size;$x++) {
          for ($y=0;$y<$this->small_size;$y++) {
            $total += $dct_vals[$x][$y];
          }
        }
        $total-=$dct_vals[0][0];
        $avg=$total/($this->small_size*$this->small_size-1);
        $hash=0;
        for ($x=0;$x<$this->small_size;$x++) {
          for ($y=0;$y<$this->small_size;$y++) {
            $hash = gmp_mul($hash, 2);
            if ($dct_vals[$x][$y]>$avg)
              $hash=gmp_add($hash, 1);
          }
        }
        return substr(str_repeat("0",16).gmp_strval($hash,16),-16); // left-pad to 16 hex characters
      }
      return $hash;
    }
  }
?>


There's nothing fancy in the code itself - except that it works. I prefer saving the hashes in the database as 16-character hexadecimal strings, so I return the calculated 64-bit DCT hash as a 16-character hexadecimal string.

The hashing is done by a simple PHP script that runs after the harvesting script (launched by the same bash script).

<?php

  include_once('phash_dct.php');
  $dct_hasher=new Phash_npp;

  try {
    $dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
    $sql="INSERT INTO DCT_index (image_id,dct_hash) VALUES (?,?)";
    $stmt=$dbh->prepare($sql);
    
    $sql="SELECT img_id,img_name FROM Images LEFT JOIN `DCT_index` ON img_id = image_id WHERE image_id IS NULL";
    $n=0;
    foreach ($dbh->query($sql) as $r) { 
      $filename="images/" . $r[1];
      $iid=$r[0];
      $hash=$dct_hasher->hash($filename);
      $stmt->execute(array($iid,$hash));            
      $n++;
      if (($n % 500)==0)
        echo "$n "; // To show we aren't dead; much needed during the initial run
    }
    echo "Done\r\n";
  } catch(PDOException $ex) {
    echo "An error occurred!"; // user-friendly message
  }
?>

As you can see, I designed the database so that the DCT index is an external table with a key pointing to the image id in the main image table. So now that we have the DCT index table, what can we do with it?

I have about 99.9% trust that the same DCT hash means the same image. There are a few exceptions - some images share the same DCT hash though they don't look the same to me at all - but they are rare and far between.

Unfortunately the converse doesn't hold: two similar images don't always share the same DCT hash. There are some bits that change; I have seen cases where about seven bits differ between apparently identical images.
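As an aside, measuring that bit difference is cheap with GMP, which we already use for the hashes. A small helper (my naming, not part of my production scripts):

```php
<?php
// Hamming distance between two hexadecimal hash strings.
// gmp_hamdist counts the differing bits of two non-negative GMP numbers.
function hash_hamming(string $h1, string $h2): int {
  return gmp_hamdist(gmp_init($h1, 16), gmp_init($h2, 16));
}
```

For example, hash_hamming("ffff", "0000") is 16; two hashes of an "apparently identical" image pair would typically give something like 0 to 7 here.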

My first idea was to construct an MVP-tree adaptation similar to the one used with the initial phash algorithms, where it worked quite well. Unfortunately this seemed to be a total cul-de-sac. There were just too many hits and I had to drop the idea. Maybe (just maybe) it might still work with many different images, but I flushed it as I a) didn't really believe in it and b) didn't want to calculate Hamming distances to a dozen or so images for every image in the database.

Hmm... what to do? As we found out earlier, there isn't any good sorting algorithm to index close images with regard to Hamming distance. Of course one could go through all images with brute force, but that seems just too heavy (and slow) a solution.

I thought about starting with baby steps by taking the hash of an image as a starting point. Then I constructed a list of different hash values by XOR-ing the initial value with a single bit that traveled from position 0 to position 63 - so in the first pass I XOR-ed the hash with 1, which inverts the LSB, put that on a list, then multiplied the 1 by 2 and XOR-ed with that. Repeating this, in the end I had 64 hashes covering every hash that differed from the current one by one bit in any position. I constructed an SQL statement where the hash was IN (<list of hash values>).
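The travelling-bit enumeration is straightforward with GMP; a sketch of the idea (my own function name - the real script built the SQL string directly):

```php
<?php
// All 64 hashes at Hamming distance exactly 1 from the given 64-bit hash:
// XOR with a single bit that travels from position 0 (LSB) to position 63.
function one_bit_neighbours(string $hex): array {
  $h = gmp_init($hex, 16);
  $bit = gmp_init(1);
  $out = array();
  for ($b = 0; $b < 64; $b++) {
    $n = gmp_xor($h, $bit);
    // left-pad to 16 hex characters, as stored in the database
    $out[] = substr(str_repeat("0", 16) . gmp_strval($n, 16), -16);
    $bit = gmp_mul($bit, 2);              // move the bit one position left
  }
  return $out;
}
```

The resulting list is what goes into the IN (...) clause.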

It worked, but I still missed many matches I knew existed. Hmm, easy to expand the idea: next I permuted all the two-bit combinations and got better, but still insufficient, results. And with three or more bits changed, the SQL statement would get insanely long. Another show-stopper.

Short (but just a little short) of getting really annoyed, I continued experimenting. I listed the images sorted by hash and discovered that similar images were often quite close to each other, though not always successive. There were at least two explanations: either there were really many duplicates and we just stumbled upon the closest ones, or all bits are not equal.

The first explanation was so depressing that I had to grasp the latter one. If you think about the DCT, you may understand that not all bits are created equal. If we were handling the DCT values themselves, it would be quite clear that the LSB bits are less important than the higher ones. Actually we are dealing with a bit pattern built on whether each value is over or under the average of all the DCT values (a float matrix), so this conclusion isn't clear-cut (and it MAY be very flawed for all I know).

Anyhow, I wrote a selection program that takes a parameter telling how many of the first bits we consider. Or more correctly, how many of the most significant groups of four bits we consider, as the data is stored in the database as a hexadecimal string. So, setting the parameter to 16 means I find only the matches with exactly the same hash. Value 15 means we don't mind the last hexadecimal character, i.e. only the 60 first bits are significant. This is very simple to do, as we just use the SQL function LEFT() with the given argument.
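In PHP terms the same test is just a prefix comparison - each hexadecimal character carries four bits, so parameter $p compares the 4*$p most significant bits (again a sketch written for this post):

```php
<?php
// Two hashes "match" when their $p most significant hex characters agree,
// i.e. when their first 4*$p bits are equal - the PHP twin of SQL's LEFT().
function prefix_match(string $h1, string $h2, int $p): bool {
  return substr($h1, 0, $p) === substr($h2, 0, $p);
}
```

With $p = 16 only exact duplicates match; $p = 15 ignores the last hexadecimal character (the 4 least significant bits).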

After experimenting I landed on the value 9, which seems to hit the sweet spot where we get most of the matches with minimal false positives. There will be false positives, but I'd say only in maybe every tenth image. Some images get really many false positives (at least partly the same odd group we had with the initial average phash), but quite soon I was able to see mostly clean successions of similar images and smash them.

The selection program selects similar images (as described above) and shows them on the screen as "film strips", ordered by decreasing number of similar images, so I was able to delete the most common duplicate images first.

I am quite pleased with the current state of the situation. It is possible (or even probable) that some duplicates remain (even after I have gone through all the matches - if I ever get to that situation), but I think I am very close to the optimal technical solution - at least so close that I am satisfied.

Theoretically there might be a more efficient ordering of the bits in the hash, as now they are just laid out row by row, which is just fine as long as you consider only exact matches. My intuition says that if we store the bits in the right order, the most important values would sit more coherently at the beginning of the hash. I might rebuild an index with this logic (just out of curiosity) if and when I get the urge to do so. Probably quite soon ;)

I also still dream about writing a general-purpose class for image library mostly containing the tools I have found effective and working.

But until next entry, have a nice day/evening/night/morning or whatever.

Monday, January 6, 2014

The Future is Here! (Part 1)

Hello

This is just a quick note about the achievements since last writing.

I just had to bite the bullet and translate a Java implementation of the 2D DCT transformation found (= robbed) from the Net. The source code can be found, for example, here. The code shows Elliot Shepherd as the author, so the credits go there - thank you and sorry ;)

I also wrote a class library that calculates the DCT phash (as a 16-character hexadecimal string) to make it analogous to my earlier hashes and to facilitate its handling in the database. The class works more or less like the phasher class referred to in the previous blog posts.

I am not yet giving anything out, because the code is just now calculating the hashes for the database for the first time (about half a second per hash) and it will take long (I mean a long time, like 70 hours) to crunch its way through the database. There's a little research to do after that as well before I show my findings and clean up the class library, which will be available if there's any interest... ;)

So why hurry to inform you, my (almost non-existent) reader? Because I have some 12,000 images hashed so far and the results are just stunningly beautiful (and not only the art in the images). All the shared hashes so far belong to the same image; no false positives found. I understand now why the DCT was chosen for JPEGs. :)

So, I'll pop up most probably during next week or so with something to look forward to.