It is true that taking the average of all the DCT cells and coloring each pixel based if the cell value is over or under the average might somehow equalize the cells but I still felt quite intrigued (read haunted) about the thought that maybe the ordering of bits was significant.
First I thought about running all the images (once more) through DCT and saving the bits but then figured out, that the bits are already there, we have just to change their order. I instinctly ended with the following order of bits from MSB -> LSB: (here for simplicity only first four cells of the upper left of the DCT averaged values in the matrix).
0 | 1 | 4 | 9 |
2 | 3 | 6 | 11 |
5 | 7 | 8 | 13 |
10 | 12 | 14 | 15 |
To rebuild a weighted DCT hash I used the following code:
<?php
$order=array();
for ($l=0;$l<8;$l++) {
$stp=$l*$l;
for ($u=0;$u<=$l;$u++) {
$order[$stp]=($l+$u*8);
$stp+=2;
}
$stp=$l*$l+1;
for ($v=0;$v<$l;$v++) {
$indexer=$l*8+$v;
$order[$stp]=($l*8+$v);
$stp+=2;
}
}
ksort($order); // Now the keys are in the same order as the bits
$dbh = new PDO('mysql:host=localhost;dbname=imgs', "user", "password");
$sql="INSERT INTO DCT_ndx (image_id,dct_hash) VALUES (?,?)";
$stmt=$dbh->prepare($sql);
$sql="SELECT image_id,dct_hash FROM DCT_index; // old "traditional" hash
$cnt=0;
foreach ($dbh->query($sql) as $r) {
$img_id=$r[0];
$img_hash=$r[1];
$old_hash=gmp_init($img_hash,16);
$hash_bin=gmp_strval($old_hash,2);
$new_hash=gmp_init(0);
foreach ($order as $bpos) {
$new_hash=gmp_mul($new_hash,2);
if (substr($hash_bin,$bpos,1)=="1")
$new_hash=gmp_add($new_hash,1);
}
$nh=substr("0000000000000000" . gmp_strval($new_hash,16),-16);
$stmt->execute(array($img_id,$nh));
$cnt++;
if (($cnt % 500)==0)
echo "$cnt ";
}
echo "\nDone ($cnt entries)\n";
?>
I made a copy of current index table to another index table as I wasn't sure if this logic worked out or not and anyhow I prefer not to destroy anything that works even somehow.After a small crunching (took a few hours) I looked the results and, yes, it worked even better than my quite a good solution earlier, where the bits were laid row by row. Of course, where the hashes had clashes (= false positive) ordering the bits didn't change anything.
But when I started to look similar pictures by leftmost substrings the results were much better than earlier: less false positives (some, of course) and much more true positives found. I experimented with number of bytes to use and found out that maybe 8 or 10 first bytes are the most efficient values. It may seem a bit odd that I can now go to less similar first bits (or rather 4 bit arrays as out hash consists of 4-bit hexadecimal numbers) but as the hashes are now more coherent the amount of false positives don't explode even with that.
So, it could be argued that this ordering of bits has created a visual sorting (or at least semi-sorting) of images or - more exact hashing algorithm. This makes it quite efficient to look for duplicates minimizing the time to clean the database.
I haven't yet deleted the duplicates (because of the research I want to do with these hashing algorithms) and I can now look for the all-time duplicates of the same images. One image was in the database 27 times! And, yes, this new ordering of bits show JUST those 27 images. It is of course possible that there might still be one or more similar images left and I have no way of knowing it but I now have really high trust on this index.
So, as you have now seen - not all bits are created equal.
No comments:
Post a Comment