Monday, December 30, 2013

Ponderings and Paths for the Future, and Conclusions (Part 5)

Pondering

In the four preceding blog posts I have introduced a quite simple yet somewhat effective database solution for finding similar images in a big heap of them. Our haystack may hold a lot of needles, and our goal was to catch as many of these needles as possible without un-stacking all the hay.

First, let us ponder a bit how well we have achieved our goals (or at least I have). My gut feeling is that though the system finds quite a lot of similar images, it doesn't get them all. I have already given some explanations of why the perceptual hash may differ between similar photos. Unfortunately I don't have time to go through the whole evolving collection, so I can't give any exact statistics. I just see some images I think I have seen quite often (considering the 380 966 unhidden photos), which suggests that there are still a lot of duplicates in the database.

However, my feeling is also that the number of duplicates lessens from day to day; every day I see fewer series of, say, five or more similar photos. Let me explain a bit how I clean the database (i.e. my workflow). I try to find time to go through the new images. I read a bunch of them, count the number of similar ones (as defined in my previous blog entry) and print that count below each image. If I click a photo, a new window opens and shows all the images which are phash-wise similar to it. So, if the photo has just one similar photo (itself) and it is... appealing, I just look at the next one. If it isn't... art enough to my eye, I just delete it (set image_status to 'S').

If there are two or more similar photos I click the photo and get a new window with all the photos that are close enough by perceptual hashes, sorted by decreasing image size. Now I click all the other similar photos, except the one we came from, and mark all but the biggest as hidden. By clicking each similar photo I open another tab with that image as the comparison base. Why on earth? After experimenting with the system I have found that some similar photos may fail the comparison with the initial one but may still match some others which were close enough to the initial one. Mathematically speaking, similarity by perceptual hash isn't an equivalence relation; it isn't transitive. If the Hamming distance between phash(A) and phash(B) is below x and the Hamming distance between phash(B) and phash(C) is also below x, it doesn't follow that the Hamming distance between phash(A) and phash(C) is below x (with x = 4, distances A-B = 3 and B-C = 3 still allow A-C to be as large as 6). This means that sometimes we get a "path" which proceeds via successive similar images until no new ones turn up, which can be mechanized as sketched below.
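That path-following can be mechanized with a simple breadth-first search. This is just a sketch: find_similar_ids($id) is a hypothetical helper that returns the img_ids within the radii of the given image (essentially Part 4's SELECT wrapped in a function).

// Collect a whole "path" (cluster) of similar images by breadth-first search.
// find_similar_ids($id) is an assumed helper returning img_ids within the radii.
function collect_cluster($start) {
  $seen = array($start => true);
  $queue = array($start);
  while (count($queue) > 0) {
    $id = array_shift($queue);
    foreach (find_similar_ids($id) as $next) {
      if (!isset($seen[$next])) {   // enqueue each newly found image once
        $seen[$next] = true;
        $queue[] = $next;
      }
    }
  }
  return array_keys($seen);         // img_ids of the whole cluster
}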

I also get some false positives which are not similar to the eye but whose phashes are still below the given radius. A bad, bad thing, isn't it? Well, it depends on how you think about it. I think I have found both duplicates and images without... art value almost more via false positives than via true positives. Furthermore, the unpleasing images seem to attract other unpleasing images, which both raises the efficiency of cleaning and keeps astounding me. It's like there is more in this system than what the eye can see. Then again, it may just be my impression.

Optimizations

A few words about optimizations. I should have mentioned this already, but to make things work on a reasonable time scale you have to index the phash and the precomputed Hamming distance columns in your database.
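A minimal sketch of what I mean, using the table and column names from the earlier parts of this series (adapt to your own schema):

ALTER TABLE Images ADD INDEX idx_phash (img_phash);
-- one composite index per vantage point over its three distance columns
ALTER TABLE Images ADD INDEX idx_vp_d (d1,d2,d3);
ALTER TABLE Images ADD INDEX idx_vp_e (e1,e2,e3);
ALTER TABLE Images ADD INDEX idx_vp_f (f1,f2,f3);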

The distance threshold for considering images similar is a delicate balance between missing true positives and drowning in false positives. With my data a Hamming distance of four bits seems to be optimal (as explained in my previous blog entry). I have also learnt that I can grasp at most about 20 images on one screen without starting to miss the similar ones (given that most are false positives) when rapidly eyeing through the photos.

The Future (if it ever exists)

I have tried different rules for proximity, but this has been the only one working well so far. I'll keep trying, but don't hold your breath. For me this works as well as I need for now, and the next meaningful step would need something that changes the scene remarkably. That would mean finding a better function for the phash (the Discrete Cosine Transform may be it) which would find more truly similar images and fewer false positives. Another goal would be implementing some logic that could and would find clusters of similar images effectively. There might be some way with the current database, but I haven't found it yet.

As a programmer I would also like to encapsulate this functionality into a generic class. It isn't so easy to do, as currently the database operations are tightly intertwined throughout the code, and my image files are organized in different directories, which needs some additional logic. I have some thoughts about this frontier, though, and I will blog them here when and if they ever see the light.

I'll end this subject (for now, at least) with a conclusion of what we have got and how it works.

Conclusion

Managing a big image base from different sources is like gardening. Unless you diligently prune and nurture the garden, the weeds will make it less enjoyable. So you have to devote time, code and tenderness to your collection, unless you don't mind a wild garden (but why would you even read this article in the first place?). If you came here for a turnkey system which automatically deletes just the duplicate images (similar, not same) in your database, I am sorry to disappoint you. Perceptual hashes as I have implemented them here don't find just the similar photos; they also find different photos and think they are similar. Nor do they find all the similar photos, because there are always inaccuracies when re-sizing the initial photos and calculating the average value of the re-sized photos, which may put the phash off.

I haven't (yet, at least) encapsulated the functionality in a simple and general class, nor given the whole source of my working system, but have instead shown just the significant parts and left the rest as an easy exercise for the reader. I still hope that you have got some insight into perceptual hashes and also grasped why they aren't an easy solution to handle. Given some background in PHP, I would estimate that it is possible to re-create these ideas as a part of your own image collection.

Saturday, December 28, 2013

Finding similarities - at last (Part 4)

As we - at least I ;) - figured out last time, getting the perceptual hash value wasn't the whole solution we were looking for: that of finding similar images in a heap of photos.

The problem is that with a big image base the N x N comparison just takes too much time: with over 400 000 images that's on the order of 8 x 10^10 pairs. Furthermore, sorting by perceptual hash doesn't work too well, as it finds close matches only when the difference is in the "last bits" of the perceptual hash.

In the last blog entry we also discovered that the difference, or distance, between two phashes can be readily calculated as the Hamming distance, which is available in PHP via the gmp_hamdist function.

As I browsed the Google-land I found a very thorough article about finding similar entries in high-dimensional data. You can read it by following this link here. The article covers very many different indexing schemes for similarity search. It went partly (say 99 parts out of 100) over my head. There were K-trees, VP-trees, MVP-trees and so on. What I sort of gathered from the article is that MVP-trees (Multi-Vantage Point trees) would be one of the best possible solutions in my case, which inherently suffers from the "Curse of Dimensionality", i.e. the problem is so multidimensional that one can't handle it effectively with traditional methods.

I didn't find any source code for an MVP-tree solution, so I made a ... hack which sort of borrows the idea but is still quite easy to implement.

The Hamming distance between two hashes is just a count of differing bits, so the difference can be in "any direction": if, say, two bits differ in the phash, the distance is two, and it stays the same whichever two bits they are.

A Multi-Vantage Point tree builds on some, well, vantage points, and all other points are placed relative to these vantage points. So, in my algorithm I choose three points quite randomly, and for each perceptual hash (remember, I calculate the phashes for 11x11, 16x16 and 19x19) I calculate the Hamming distance to each of these vantage points.

Well, I didn't choose the vantage points totally randomly: after calculating the distances to the first point I looked in the database and selected a second one which was quite far away (= great Hamming distance), and similarly a third one so that its Hamming distances to the first two images were distinct. The article linked above stated that the selection of the vantage points isn't very important; there was very little difference in performance between heuristically selected and random vantage points. My reasoning was that if the vantage points lie close to each other, the tree acts almost like a tree with one vantage point less, which would be, well, pointless, as I just said.

So each image row in the database currently holds 12 values: three perceptual hashes (one per initial re-size) and three times three Hamming distances, one to each of the vantage points. Why three vantage points? Good question. Any more questions? ;) We are always balancing between space, time and Nathaniel - I mean effectiveness - and my first hunch was that the balance lies here, but I reserve the right to change my opinion.

Probably one more vantage point would lessen false positives and still (probably) not lead to missed true positives. I shall explore these areas and report here later (see the forthcoming Part 5 about the avenues for further tinkering).

At the end of the nightly run I run the dist.php below, which builds the distances to the vantage points for each new image. Sorry for not using PDO here: I preferred building a column name via a variable and just didn't know how to do it with PDO (see the sketch after the code for one way). The first line contains the img_ids of the vantage points.

// dist.php: compute the Hamming distances to the vantage points for new images.
// Each entry is (column prefix, img_id of the vantage point).
$d=array(array('d',55080),array('e',55082),array('f',55108));

$db=mysql_connect("192.168.200.85","imgs-user","imgs-pass");
mysql_select_db("imgs");
mysql_set_charset('latin1',$db);

foreach ($d as $d1x) {
  $variable=$d1x[0];   // column prefix (d, e or f)
  $d1=$d1x[1];         // img_id of this vantage point

  // The three phashes (16x16, 11x11, 19x19) of the vantage point itself
  $sql="SELECT img_phash,img_phash11,img_phash19 FROM Images WHERE img_id=$d1";
  $rs=mysql_query($sql);
  $r=mysql_fetch_row($rs);
  $h1=gmp_init($r[0],16);
  $h2=gmp_init($r[1],16);
  $h3=gmp_init($r[2],16);

  // All images that haven't been measured against this vantage point yet
  $sql="SELECT img_id,img_phash,img_phash11,img_phash19 FROM Images WHERE image_status<>'S' AND {$variable}1 IS NULL";
  $rs=mysql_query($sql);

  while ($r=mysql_fetch_row($rs)) {
    $c1=gmp_init($r[1],16);
    $c2=gmp_init($r[2],16);
    $c3=gmp_init($r[3],16);
    $diff1=gmp_hamdist($h1,$c1);
    $diff2=gmp_hamdist($h2,$c2);
    $diff3=gmp_hamdist($h3,$c3);
    $sql="UPDATE Images SET {$variable}1=$diff1,{$variable}2=$diff2,{$variable}3=$diff3 WHERE img_id={$r[0]}";
    mysql_query($sql);
  }
}
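
For what it's worth, here is a sketch (not my production code) of how the same could be done with PDO after all: column names can't be bound as parameters, but since the prefixes come from our own hard-coded list, not from user input, it is safe to build them into the SQL string and bind only the values.

// A PDO sketch of dist.php: bind the values, interpolate the (trusted) column prefixes.
$vantage = array('d' => 55080, 'e' => 55082, 'f' => 55108);
$dbh = new PDO('mysql:host=192.168.200.85;dbname=imgs', 'imgs-user', 'imgs-pass');

foreach ($vantage as $p => $vid) {
  // The three phashes of the vantage point itself
  $row = $dbh->query("SELECT img_phash,img_phash11,img_phash19 FROM Images WHERE img_id=$vid")->fetch(PDO::FETCH_NUM);
  $h1 = gmp_init($row[0],16);
  $h2 = gmp_init($row[1],16);
  $h3 = gmp_init($row[2],16);

  $upd = $dbh->prepare("UPDATE Images SET {$p}1=?,{$p}2=?,{$p}3=? WHERE img_id=?");
  $sel = $dbh->query("SELECT img_id,img_phash,img_phash11,img_phash19 FROM Images WHERE image_status<>'S' AND {$p}1 IS NULL");
  while ($r = $sel->fetch(PDO::FETCH_NUM)) {
    $upd->execute(array(
      gmp_hamdist($h1, gmp_init($r[1],16)),
      gmp_hamdist($h2, gmp_init($r[2],16)),
      gmp_hamdist($h3, gmp_init($r[3],16)),
      $r[0]));
  }
}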


Now we have the Hamming distance from each image to a set of known images. How does it help us in our great quest? Well, what were our challenges before this step? Yes, the Hamming distances weren't easy to sort, so finding close images was just too slow and resource-greedy. We could quite easily find exact matches, but for many reasons exact matches were not available for similar photos.

But now the situation has changed quite substantially. For each vantage point we can define a Hamming distance radius and show all images inside the radius. The trick is the triangle inequality: if an image is within Hamming distance r of our query image, its distance to any vantage point can differ from the query's distance by at most r, so simple indexed BETWEEN ranges pre-filter the candidates. My first (and current) implementation sets all the radii the same (though I know that staying within the same Hamming distance with 19 x 19 is much more demanding than with 11 x 11; more on that in Part 5) and selects an image as similar if it is inside the radius with respect to all vantage points.

I don't give my whole find_similar.php as there is too much supporting code (for showing the images and so on), but the basic SQL is actually very simple:



// $img is the image for which we are trying to find the similar ones
//
// Hamming distances on different cell sizes and each vantage point
//
$sql="SELECT d1,d2,d3,e1,e2,e3,f1,f2,f3 FROM Images WHERE img_id=$img";
.
.
.
$d1=$r[0];
$d2=$r[1];
$d3=$r[2];
$e1=$r[3];
$e2=$r[4];
$e3=$r[5];
$f1=$r[6];
$f2=$r[7];
$f3=$r[8];
$betweens="";
if (isset($_GET["step"]))
  $value=$_GET["step"];
else
  $value=4;
for ($i=1;$i<4 data-blogger-escaped-br="" data-blogger-escaped-i="">  if (strlen($betweens)>0)
    $betweens .= " AND ";
  $v="d$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.="$v BETWEEN $vmin AND $vmax";
  $v="e$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.=" AND $v BETWEEN $vmin AND $vmax";
  $v="f$i";
  $vmin=0+$$v-$value;
  $vmax=0+$$v+$value;
  $betweens.=" AND $v BETWEEN $vmin AND $vmax";
}
$sql="SELECT image_dir,img_name,img_id,image_size FROM Images WHERE $betweens AND image_status<>'S' ORDER BY image_size DESC";
.
.
.


As you can see, the default Hamming distance is 4. I ended up with it after checking different values; for my data (sorry, not available) 4 seems to be the sweet spot.
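
To make the generated query concrete, here is roughly what it looks like for a hypothetical image with step=4 and vantage distances d1=10, e1=12, f1=9 (all values made up):

SELECT image_dir,img_name,img_id,image_size FROM Images
WHERE d1 BETWEEN 6 AND 14 AND e1 BETWEEN 8 AND 16 AND f1 BETWEEN 5 AND 13
  -- ... six more BETWEENs for d2,e2,f2 and d3,e3,f3 ...
  AND image_status<>'S'
ORDER BY image_size DESC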

This concludes our discussion for today. In the next part I will mainly try to envision what directions there are to explore in the future.

Friday, December 27, 2013

Looks like it - or does it? (Part 3)

As I promised, we'll now enter the world of similarity and perceptual hashing.

A perceptual hash is a way of condensing an image (or actually any lump of data) into a single value, such that the values of two similar images lie close to each other. As with the md5 hash, there will be hash clashes where two different images share the same perceptual hash even though the images are substantially different, often so that it is very difficult to see any similarity at all. I will call these hash clashes false positives.

To give some background, here is an excellent introduction to perceptual hashing:

Looks Like It

But to put things fast and neat into a simple case: perceptual hashing may be implemented by at least two different methodologies:

  1. Re-sizing the photo to a known small size, turning it gray-scale, calculating the average value of all the pixels, then comparing every pixel of the re-sized gray-scaled image against that average and building a bit array where every bit is 0 or 1 (what a surprise) depending on whether the pixel is darker or lighter than the average. This bit array may easily be converted into a string of hexadecimal characters (see the sketch after this list).
  2. Doing a discrete cosine transform: re-sizing the photo, gray-scaling it, computing the DCT, reducing the DCT matrix, computing the average DCT value and constructing the hash from it (see the Looks Like It article for a somewhat longer explanation, maybe one where you really do grasp something!)
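
To make methodology 1 concrete, here is a minimal average-hash sketch in plain PHP GD. It is independent of the phasher class I actually use below, and the function name is my own:

// A minimal average-hash sketch using PHP's GD extension.
// Returns the hash as a hexadecimal string for a $size x $size re-size.
function average_hash($filename, $size = 16) {
  $src = imagecreatefromstring(file_get_contents($filename));
  $img = imagecreatetruecolor($size, $size);
  imagecopyresampled($img, $src, 0, 0, 0, 0, $size, $size,
                     imagesx($src), imagesy($src));
  imagefilter($img, IMG_FILTER_GRAYSCALE);

  // Collect the gray levels and compute their average
  $gray = array();
  $sum = 0;
  for ($y = 0; $y < $size; $y++) {
    for ($x = 0; $x < $size; $x++) {
      $g = imagecolorat($img, $x, $y) & 0xFF; // gray image: R=G=B, the blue byte is enough
      $gray[] = $g;
      $sum += $g;
    }
  }
  $avg = $sum / ($size * $size);

  // One bit per pixel: 1 = lighter than average, 0 = darker
  $bits = "";
  foreach ($gray as $g)
    $bits .= ($g > $avg) ? "1" : "0";

  // Convert the bit string to hexadecimal (note: GMP drops leading zero bits;
  // pad the result if you need a fixed-width hash)
  return gmp_strval(gmp_init($bits, 2), 16);
}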

My solution builds (at least currently) on the phasher PHP class, which implements methodology 1. Maybe, whenever I have time - I don't have too much of it, though it may sound like it ;) - I'll cook up or find Discrete Cosine Transform (DCT) PHP code and check how the scene changes, or try to implement the pHash library.

From now on I'll concentrate on the average lightness hashing (methodology 1). Theoretically it should handle the situations where the image is re-sized or gray-scaled (heck, it does these itself) and where its lightness and contrast are changed, but only linearly: if every pixel is mapped as p -> a*p + b with a > 0, the average is mapped the same way, so the darker/lighter comparison is unchanged. Applying gamma correction or histogram equalization is not linear, so it can move pixels to the other side of the average, and the perceptual hash would change.

When we want to find similar images there will be false positives and also false negatives, i.e. we'll get the same hashes for dissimilar images, and similar photos may give different hashes. In practice I have found that when I search for photos similar to the one under scrutiny, there may be a few similar ones and a couple of different ones. If I then check the similar images, I may find more similar ones and some new or the same different ones. And I suspect that there are still more similar photos that don't show up even when running the search over all the found similar photos.

Why is that so? As said earlier, the photos may have been altered in ways that change the average (for a change), or text or a frame may have been added around them or removed. If the photo is re-sized it should theoretically give the same phash, but in practice it doesn't need to. The problem lies partly in the re-sizing algorithms. An image consists of pixels, which are discrete square buckets, empty, full or in-between, of three paints (red, green, blue). Now when we re-size the photo smaller we have to join a lot of these buckets and decide what amount of paint we pour into the new bucket, which may represent tens of pixels in the original photo. If all those pixels are of the same color it's really easy, but if the colors vary greatly the decision lies in the algorithm. If the image dimensions don't divide evenly, we also need to take care of sub-pixels. So, it's not a perfect world :)

My first implementation was based on calculating the phash for images re-sized to 16 x 16 pixels by running them through phasher. Here is the code for this simple phashing algorithm:

// Sorry for omitting the <?php tags, but Blogger doesn't like them.

include_once('./phasher.class.php');
$I = PHasher::Instance();

try {
   $dbh = new PDO('mysql:host=localhost;dbname=imgs', "imgs-user", "imgs-pw");

   // Prepared statement for storing the computed hash
   $sql="UPDATE Images SET img_phash=? WHERE img_id=?";
   $stmt=$dbh->prepare($sql);

   // Hash every non-skipped image that doesn't have a phash yet
   $sql="SELECT img_id,img_name,image_dir FROM Images WHERE image_status<>'S' AND img_phash IS NULL";
   foreach ($dbh->query($sql) as $r) {
     $filename="images" . $r[2] . "/" . $r[1];
     $iid=$r[0];
     $hash_array=$I->HashImage($filename,0,0,16);  // 16 x 16 re-size
     $signature=$I->HashAsString($hash_array);
     $stmt->execute(array($signature,$iid));
   }
} catch(PDOException $ex) {
   echo "An error occurred!";
}



So I go through the database for the new entries that the harvester script has accumulated. The image status S means that the photo should be skipped anyhow (it is a dismissed duplicate or not ... art enough). If I had had this system running from the beginning I wouldn't need the check here, as all the photos written to the database and disk would have been phashed automatically; and I can't know a priori whether a photo is a duplicate (similar, not same) or whether its ... artistic ... value is sub-optimal.

I run this from the command line, or actually a crontab script runs it as a part of the harvesting and cleaning cycle.

As I prefer to do things the hard way, I also similarly calculate two additional phashes, one from a 19 x 19 re-sized image and another from an 11 x 11 one. So I have three different hashes for every image. The size choices were quite arbitrary; I used prime numbers (except for the initial 16) to reduce the similarity of the hashes, which may or may not be a good thing. The jury is still pondering it.

Now that we have the hashes, what's next? My first decision was to hunt for the most shared hashes, which I predicted would belong to common similar photos. I was wrong. For some reason still totally unknown to me, the images sharing the most common hashes were all different.

I wrote a program which shows me the images in groups, sorted by diminishing count of images sharing the same hash and, within each group, ordered by hash. I still have a hard time believing it, but the images in, say, the first ten clusters are all different. Then come a lot of groups of similar images with a few dissimilar ones dispersed in-between for variation only (but then again, the dissimilar ones are all dissimilar among themselves). Now, I understand that some hashes are more popular than others, but it never ceases to astonish me that the most common hashes are shared by different photos and only different ones.

Sometimes I found two groups of similar photos very close to each other and understood what it meant: the hash values were really close but not the same. I also figured out that these images appeared close together in the view only because of the (secondary) ordering by hash value, and only because the difference was in some of the last bits. Otherwise there would be a lot of distance between such groups, and I don't have an eidetic memory to notice it whenever it happens.

If we know two hashes, it is easy to calculate the difference (or distance) between the two images. It's called the Hamming distance, and it is readily available in PHP's GMP functions (GNU Multiple Precision) since PHP 4.0.4.
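
For example, a tiny demonstration (the hex values are made up):

// Hamming distance between two hex-string phashes via GMP:
// counts the bits that differ between the two values.
$a = gmp_init("f0f0f0f0f0f0f0f0", 16);
$b = gmp_init("f0f0f0f0f0f0f0f3", 16);
echo gmp_hamdist($a, $b);   // prints 2 - the two lowest bits differ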

But it doesn't actually help us to find the two images to start with. Of course, we could do it by comparing each image to every other, and though we have to compare against one image fewer after each round (the Hamming distance is symmetric), the thought of building a matrix of more than 400 000 images, and keeping it up-to-date afterwards, isn't compelling.

Dead end, after all? Sigh, what to do?

The answer is: wait for the next part :)

Wednesday, December 25, 2013

Similar is not same (Part 2)

Now, we'll start the real thing.

The best way to get rid of duplicate photos is to never ever let them onto our hard disks in the first place. Simple as that. But the problem arises when the photo isn't the same, but similar.

Let's first take the simple case: the photos are just the same. There is a very easy way to check it in PHP.

It's called hashing, and there are a few ways to do it. I have chosen maybe the simplest one: just taking the md5 checksum.

$dump=@file_get_contents($img_url);
if ($dump===false)       // download failed, skip this URL
  continue;
$md5=md5($dump);
$size=strlen($dump);
// skip the image if we already have one with the same md5 and size
$stmt_check_same->execute(array($md5,$size));
if ($stmt_check_same->rowCount()>0)
  continue;


Here $stmt_check_same is a prepared SQL statement like this:

$stmt_check_same=$db->prepare("SELECT img_id FROM Images WHERE image_md5=? AND image_size=?");

You can see that as a precaution I check that both the md5's and the photo sizes are equal. Md5, like all hashing algorithms, may end up with the same hash for different files; otherwise you could encode everything in 128 bits, which would be the ultimate zipping mechanism. That's (unfortunately) not so, and we have to counter-attack it. I have chosen just to check that the file sizes are the same (as I will need the file size in my database anyhow).

Another possibility to ascertain sameness would have been e.g. calculating one or more additional hash values (with different hashing algorithms) and declaring the photos the same only if they all match, as sketched below.
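
A minimal sketch of that idea, assuming a hypothetical extra image_sha1 column (my actual schema doesn't have one):

// Sketch: declare two files the same only if both digests and the size match.
// The image_sha1 column is hypothetical - add it to your schema if you want this.
$stmt=$db->prepare("SELECT img_id FROM Images WHERE image_md5=? AND image_sha1=? AND image_size=?");
$stmt->execute(array(md5($dump), sha1($dump), strlen($dump)));
if ($stmt->rowCount()>0)
  continue;   // already have this exact file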

Philosophically I don't mind a few photos getting dismissed without being real duplicates, so I don't lose any sleep over this hash clash possibility.

Now we have got rid of the exactly-same photos; that's a start.

What is the difference between same and similar? By a similar image I mean an image that looks the same but isn't so bit-by-bit. The photo may be re-sized (either by scaling or by changing its quality) or changed subtly otherwise. Another simple change is that a color photo has been made black and white. There are still further possibilities, like mirroring the photo or cropping it.

Md5 is very sensitive to even minor changes in the battery of bits that a file is, and it doesn't help us find similar photos by any cunning sorting scheme; even the slightest change in the image file would (very probably) make the md5 look and smell totally different (as far as you can get that different with 16 bytes, i.e. 32 hexadecimal characters).
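
You can see this avalanche effect for yourself; one changed character gives a completely different digest:

// One changed byte produces a completely different md5 digest
echo md5("hello world"), "\n";
echo md5("hello world!"), "\n";   // shares no visible pattern with the above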

In the next part we'll get acquainted with phash (perceptual hash) which tries to catch similarities between photos and see where we do get from there.

First project - introduction (Part 1)

My first blog project is something I have fought with during these last months, and it has been both challenging and rewarding. The problem, in essence, is about managing a big collection (417 371 as of now) of - ehem - art photos found on the net and lying now on the hard disks of my Linux server. For various reasons I will not discuss how they ended up there, even though it would make an excellent and possibly quite popular article (yes, PHP plays a big part in it). Maybe in some form in the future, but just maybe... OK, I'll give a few hints in the second part of this series of articles, but we won't go into the details.

To follow and possibly duplicate my project you'll need at least a computer with PHP and a database. Mine are Linux and Mysql, but changing either of those is trivial for a PHP programmer who knows the basics of the language - and others are free to envy us ;)

The main problems with this size of a photo collection are at least the following two:
  1. there are duplicates (often multiple duplicates) of photos as they come from different sources
  2. there are photos I myself don't label as .. umm .. artistic
In the series I am about to write, I'll show how I have tried to overcome these two problems. Life would have been easier if I had had this routine in place from the beginning, but as this kind of leisure-time project is quite evolutionary by nature, I'll have to handle the challenges as they come.

When trying to resolve the problems I try to achieve at least the following goals:
  • minimize disk size taken by unneeded photos (duplicates or unsavoury)
  • keep the highest quality photo
  • try to prune all similar photos whenever possible
I hope I have now given you the introduction needed and it's about time to jump into the project.

Hello World!

Isn't that the way to greet everything new, as a programmer at least? I think it was K&R who made this greeting well-known, even famous, with their first C example.

Anyhow, this blog isn't about C but PHP, which is one of the most common programming languages for web applications even today. I have been programming with it for maybe 15 years and I still get some surprises, both fine and nasty, every week or two. Of course I am doing mostly routine things but every now and then I have to do something really new and that's where the surprises lurk.

I know PHP is well documented in the manual pages and in numerous articles and blogs all around the blogosphere. Why then another booooring blog, you might think? Well, my only answer is that I try NOT to make it boring, but aim at a "higher" level instead. By a higher level I mean that the blog entries (or series of them) will build something that isn't doable with just a couple of native function calls. I won't promise rapid serial blogging, but when I have something remarkable (from my viewpoint), I'll share it with you.

I also believe in open source and this blog is (or tries to be) my way to give something back.

About myself: I am a programmer somewhere in the net. That's about all you need to know ;) I won't make my person identifiable, even less remarkable, through this blog. I want the sharp point to point at PHP itself and not at dull me. Of course some of my screwed personality (if there's such a thing) will shine through, but then, let it shine...

With these words I'll skip the introductory s**t and let's get started... :)