Saturday, December 28, 2013

Finding similarities - at last (Part 4)

As we - at least I ;) - figured out last time getting the perceptual hash value wasn't the whole solution we were looking, that of finding similar images from a heap of photos.

The problem is that with big image base the N x N comparison just takes too much time. Furthermore sorting via perceptual hash doesn't work too well as we can find close matches only when the difference is in the "last bits" of the perceptual hash.

In the last blog entry we also discovered that the difference or distance between two phashes can be readily calculated with Hamming distance that is available in PHP in gmp_hamdist-function.

As I browsed the Google-land I found a very thorough article about finding similar entries in high dimensional data. You can read it bu following this link here. The article handles very many different indexing schemes for similarity search. It went partly (say 99 parts out of 100) over my head. There were K-trees, VP-trees, MVP-trees and so on. What I sort of gathered from the article that MVP-trees (Multi-Vantage Point) would be one of the best possible solutions in my case, which inherently suffers from "Curse of Dimensionality" i.e. the problem is so multidimensional that one can't handle effectively with traditional methods.

I didn't find any source code for MVP-tree solution, so I made a ... hack which sort of borrows the idea but is still quite easy to implement.

The Hamming distance between two points is a multidimensional vector and can be to "any direction". If say two bits differ in the phash the distance is two and distance stays the same whichever two bits are different.

The Multi-Vantage Point tree builds on some, well, Vantage Points and all other points are relative to these vantage points. So, in my algorithm I choose three points quite randomly and for each perceptual hash (remember I calculate the phashes for 11x11, 16x16 and 19x19) I calculate the Hamming distance from each of these vantage points.

Well, I didn't choose the vantage points totally randomly, but after calculating the distance for first point I looked in the database and selected second which was quite far away (= great hamming distance) and similarly the third so that the hamming distance to the first two images was distinct. In the article linked above was stated that selection of the vantage points isn't very important, there was very small difference in the performance between heuristically selected and random vantage points. My reasoning was that if the vantage points lie close to each other the tree is acting almost like a tree with one vantage point less which would be like - pointless as I just said.

Now I have currently actually 12 values for each image row in the database: three perceptual hashes for each initial image re-size and three times three Hamming distances to each of the vantage points. Why three vantage points? Good question. Any more questions? ;) We are always balancing between space, time and Nathaniel - I mean effectiveness and my first hunch was that the balance lies here but I reserve the right to change my opinion.

Probably one more vantage point would lessen false positives and still (probably) not lead to missed true positives. I shall explore these areas and report here later (see the forthcoming Part 5 about the avenues there are for further tinkering).

As a part of the nightly run I run in the end the dist.php below which builds the distances to the vantage points for each new image. Sorry for not using PDO here, but I preferred building a column name via a variable and just don't know how to do it with PDO. The first line contains the indexes of the vantage points.

$d=array(array('d',55080),array('e',55082),array('f',55108));

foreach ($d as $d1x) {
  $variable=$d1x[0];
  $d1=$d1x[1];
  $db=mysql_connect("192.168.200.85","imgs-user","imgs-pass");
  mysql_select_db("imgs");
  mysql_set_charset('latin1',$db);
  $sql="SELECT img_phash,img_phash11,img_phash19 FROM Images WHERE img_id=$d1";
  $rs=mysql_query($sql);
  $r=mysql_fetch_row($rs);
  $h1=gmp_init($r[0],16);
  $h2=gmp_init($r[1],16);
  $h3=gmp_init($r[2],16);
  $sql="SELECT img_id,img_phash,img_phash11,img_phash19 FROM Images WHERE image_status<>'S' AND {$variable}1 IS NULL";
  $rs=mysql_query($sql);

  while ($r=mysql_fetch_row($rs)) {
    $c1=gmp_init($r[1],16);
    $c2=gmp_init($r[2],16);
    $c3=gmp_init($r[3],16);
    $diff1=gmp_hamdist($h1,$c1);
    $diff2=gmp_hamdist($h2,$c2);
    $diff3=gmp_hamdist($h3,$c3);
    $sql="UPDATE Images SET {$variable}1=$diff1,{$variable}2=$diff2,{$variable}3=$diff3 WHERE img_id={$r[0]}";
    mysql_query($sql);
  }
}


For now we can have the Hamming distance for each image to a set of known images, How does it help us in our great quest? Well, what were our challenges before this step? Yes, the Hamming distances weren't easy to sort so finding close images was just too slow and resource-greedy. We could quite easily find the exact matches but for many reasons the exact matches were not available for similar photos.

But now the situation has changed quite substantially. For each vantage point we can define a Hamming distance radius and show all images inside the radius. My first (and current) implementation sets all the radii same (though I know that getting same Hamming Distance with 19 x 19 is much more demanding than with 11 x 11, more of it in Part 5) and select image as similar if it is inside radii in respect of all vantage points.

I don't give my whole find_similar.php as there is too much supporting code (for showing the images and so on), but the basic SQL is actually very simple:



// $img is the image for which we are trying to find the similar ones
//
// Hamming distances on different cell sizes and each vantage point
//
$sql="SELECT d1,d2,d3,e1,e2,e3,f1,f2,f3 FROM Images WHERE img_id=$img";
.
.
.
$d1=$r[0];
$d2=$r[1];
$d3=$r[2];
$e1=$r[3];
$e2=$r[4];
$e3=$r[5];
$f1=$r[6];
$f2=$r[7];
$f3=$r[8];
$betweens="";
if (isset($_GET["step"]))
  $value=$_GET["step"];
else
  $value=4;
for ($i=1;$i<4 data-blogger-escaped-br="" data-blogger-escaped-i="">  if (strlen($betweens)>0)
    $betweens .= " AND ";
  $v="d$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.="$v BETWEEN $vmin AND $vmax";
  $v="e$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.=" AND $v BETWEEN $vmin AND $vmax";
  $v="f$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.=" AND $v BETWEEN $vmin AND $vmax";
}
$sql="SELECT image_dir,img_name,img_id,image_size FROM Images WHERE $betweens AND image_status<>'S' ORDER BY image_size DESC";
.
.
.


As You can see the default Hamming distance is 4. I ended up with it after checking with different values and for my data (sorry, not available) the 4 seems to be the sweet spot.

This concludes our discussion today. In the next part I mainly try to vision what directions there are to explore in the future..

No comments:

Post a Comment