I have been experimenting with DCT hash (Discrete Cosine Transform) during the last two weeks (yes, I
have done something else, too) and I think I currently have quite a well-greased duplicate-finding number-crunching apparatus to handle my image library, which gets cleaned of duplicates (or triplicates or...) every day.
But first things first. As I said in my previous blog post (The future is here Part 1) I implemented the DCT in PHP by re-engineering Java code I introduced in the previous blog post.
The DCT in PHP is here:
<?php
class Phash_npp {
private $size=32;
private $small_size=8;
private $c=array();
private function InitCoefficients() {
$this->c[0]=1.0/sqrt(2);
for ($i=1; $i < $this->size ; $i++)
$this->c[$i]=1.0;
}
private function blue($img,$x,$y) {
return imagecolorat($img,$x, $y) & 0xff;
}
public function __construct($my_size=32,$my_small_size=8) {
$this->size=$my_size;
$this->small_size=$my_small_size;
$this->InitCoefficients();
}
private function ApplyDCT($f) {
$n=$this->size;
$F=array();
for ($u=0;$u<$n;$u++) {
for ($v=0;$v<$n;$v++) {
$sum=0.0;
for ($i=0;$i<$n;$i++) {
for ($j=0;$j<$n;$j++) {
$sum+=cos(((2*$i+1)/(2.0*$n))*$u*M_PI)*cos(((2*$j+1)/(2.0*$n))*$v*M_PI)*($f[$i][$j]);
}
}
$sum*=(($this->c[$u]*$this->c[$v])/4.0);
$F[$u][$v]=$sum;
}
}
return $F;
}
public function hash($image) {
$hash="0000000000000000";
if (file_exists($image)) {
$size=$this->size;
$res=imagecreatefromstring(file_get_contents($image));
$img = imagecreatetruecolor($size, $size);
imagecopyresampled($img, $res, 0, 0, 0, 0, $size, $size, imagesx($res), imagesy($res));
imagecopymergegray($img, $res, 0, 0, 0, 0, $size, $size, 50);
$vals=array();
for ($x=0;$x<$size;$x++) {
for ($y=0;$y<$size;$y++) {
$vals[$x][$y]=$this->blue($img,$x,$y);
}
}
$dct_vals=$this->ApplyDCT($vals);
$total=0.0;
for ($x=0;$x<$this->small_size;$x++) {
for ($y=0;$y<$this->small_size;$y++) {
$total += $dct_vals[$x][$y];
}
}
$total-=$dct_vals[0][0];
$avg=$total/($this->small_size*$this->small_size-1);
$hash=0;
for ($x=0;$x<$this->small_size;$x++) {
for ($y=0;$y<$this->small_size;$y++) {
$hash = gmp_mul($hash, 2);
if ($dct_vals[$x][$y]>$avg)
$hash=gmp_add($hash, 1); ;
}
}
return gmp_strval($hash,16);
}
return $hash;
}
}
?>
There's nothing fancy in the code itself - except that it works. I prefer saving the hashes in the database as 16-byte hexadecimal characters, so I'll return the calculated 64-bit DCT hash with 16-character hexadecimal string.
The hashing is done by a simple PHP-script that runs after the harvesting script (by the same bash script).
<?php
include_once('phash_dct.php');
$dct_hasher=new Phash_npp;
try {
$dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
$sql="INSERT INTO DCT_index (image_id,dct_hash) VALUES (?,?)";
$stmt=$dbh->prepare($sql);
$sql="SELECT img_id,img_name FROM Images LEFT JOIN `DCT_index` ON img_id = image_id WHERE image_id IS NULL";
$n=0;
foreach ($dbh->query($sql) as $r) {
$filename="images/" . $r[1];
$iid=$r[0];
$hash=$dct_hasher->hash($filename);
$stmt->execute(array($iid,$hash));
$n++;
if (($n % 500)==0)
echo "$n "; // To show, we aren't dead, much needed during iniatial go
}
echo "Done\r\n";
} catch(PDOException $ex) {
echo "An error occured!"; //user friendly message
}
?>
As you can see, I designed the database so that DCT index is an external table with key pointing to image id in the main image table.
So now that we have the DCT index table, what can we do with it?
I have about 99,9% trust in that same DCT hash means same image. There are a few exceptions, some images share same DCT hash, though they don't seem same to me at all - but they are rare and quite a space in between the encounters.
Unfortunately it isn't true if we state that two similar images share same DCT hash. There are some bits that change; I have seen some images where about seven bits may be different on apparently same image.
My first idea was that I construct similar MVP-tree adoption that was used with initial phash algorithms, where it worked quite well. Unfortunately this seemed to be a total cul.de-sac. There just were too many hits and I had to drop this idea. Maybe (just maybe) it might still work with many different images but I flushed the idea as I didn't a) really believe it and b) want to calculate hamming distances to a dozen or so images for every image in the database.
Hmm... what to do? As we found out earlier there isn't any good sorting algorithm to index close images in regard of hamming distance. Of course one could go through all images with brute force but it seems just too heavy (and slow) a solution.
I thought about starting with baby steps by taking a hash of an image as a starting point. The I constructed a list of different hash values by xor-ing the initial value with one bit, that traveled from position 1 to to position 63 - so in the first pass I xor-ed the hash with 1 which inverts the LSB bit, putting that to a list, and then multiplying the 1 with 2 and xor-ing with that. Then I repeated that so in the end I had 64 hashes that contained all hashes that differed from current one with 1 bit in any position. I constructed a SQL statement where the hash was IN(<list of hash values>).
It worked, but I still missed many matches I know existed. Hmm, easy thing to expand the idea: next I permutated all the two bit-combinations and got better but still insufficient results. And when I thought of three or more bits changed the SQL statement would get insanely long. Another show stopper.
Short (but just a little short) of getting really annoyed I continued experimenting. I listed the images sorted by hash and discovered that the similar images where often quite close to each other, though not always successive. Of course there were at least two explanations. Either there were really many duplicates and we just stumbled with the closest ones or all bits are not equal.
The first explanation was so depressing so I had to grasp the latter one. If you think about the DCT you may understand that not all bits are crated equal. If we would be handling the DCT itself it would be quite clear that the LSB bits are less important than higher ones. Actually we are dealing with a bit pattern, which is built on the value being over or under the average value of all the DCT values (a float matrix) and this conclusion isn't clear (and it MAY be very flawed for all I know).
Anyhow, I wrote a selection program, that gets in a parameter that tells how many of the first bits we consider. Or more correctly, how many most significant groups of four bits we consider as the data is in the database in hexadecimal string. So, if I set the parameter to value 16 it means that I find only the matches with exactly same hashes. Value 15 means that we don't mind the last hexadecimal character value or only 60 first bits are significant. This is very simple to do as we just take SQL function LEFT() with given argument.
After experimenting I landed on value 9 which seems to set the soft spot, where we get most of the matches with minimal false positives. There will be false positives but I'd say that only maybe in every tenth image. Some images get really many false positives (at least partly the same odd group we had with the initial average phash) but quite soon I was able to see mostly clean successions of the similar images and smash them.
The selection program selects similar images (as described above) and shows them on the screen as "film strips" ordered by decreasing number of similar images so I was able to delete first the most common duplicate images.
I am quite pleased with the current state of the situation. It is possible (or even probable) that there remains some duplicates (even after I have gone through all the matches - if I ever got to that situation) but I think I am very close to the optimal technical solution - at least so close I am satisfied.
Theoretically there might be more efficient ordering of the bits in the hash as now the are just set row by row, which is just fine as long as you consider the exact match. My intuition says that if we save the bits correctly the most important values would be more coherently in the beginning of the hash. I might rebuild an index with this logic (just out of curiosity), if and when I get the urge to do so. Probably quite soon ;)
I also still dream about writing a general-purpose class for image library mostly containing the tools I have found effective and working.
But until next entry, have a nice day/evening/night/morning or whatever.