Pondering
In the four preceding blog posts I introduced a quite simple, yet reasonably effective database solution for finding similar images in a big heap of them. Our haystack may hold a lot of needles, and the goal was to catch as many of them as possible without unstacking all the hay.
First, let us ponder a bit how well we have achieved our goals (or at least how well I have). My gut feeling is that although the system finds quite a lot of similar images, it doesn't get them all. I have already given some explanations of why the perceptual hash may differ between similar photos. Unfortunately I don't have time to go through the whole evolving collection, so I can't give any exact statistics. I just keep seeing some images rather often (considering there are 380 966 unhidden photos), which suggests that there are still a lot of duplicates in the database.
However, my feeling is also that the number of duplicates lessens from day to day: every day I get fewer series of, say, five or more similar photos. Let me explain a bit how I clean the database (i.e. my workflow). I try to find time to go through the new images. I read a bunch of them, find the number of similar images for each (as defined in my previous blog entry) and print it below the image. If I click a photo, a new window opens and shows all the images which are phash-wise similar. So, if the photo has just one similar photo (itself) and it is... appealing, I just move on to the next one. If it isn't... art enough to my eye, I just delete it (set image_status to 'S').
If there are two or more similar photos, I click the photo and get a new window with all the photos that are close enough by perceptual hash, sorted by decreasing image size. Now I click all the other similar photos, except the one we came from, and mark all but the biggest as hidden. Clicking each similar photo opens another tab with that image as the comparison base. Why on earth? After experimenting with the system I have found that some similar photos may miss the comparison with the initial one but still match some others which were close enough to the initial one. Mathematically speaking, similarity by perceptual hash isn't an equivalence relation because it isn't transitive: if the Hamming distance between phash(A) and phash(B) is below x and the Hamming distance between phash(B) and phash(C) is also below x, it doesn't follow that the Hamming distance between phash(A) and phash(C) is below x. As a result we sometimes get a "path" which goes via successive similar images until no new ones turn up.
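The missing transitivity and the resulting "path" can be shown with a small sketch. It is written in Python rather than the PHP of the actual system, and the 16-bit hashes, the names `hamming`, `similar` and `follow_chain`, and the inclusive comparison against the radius are all assumptions made for illustration only.

```python
from collections import deque


def hamming(a: int, b: int) -> int:
    """Number of bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def similar(a: int, b: int, radius: int = 4) -> bool:
    """Assumed rule: two hashes match when at most `radius` bits differ."""
    return hamming(a, b) <= radius


def follow_chain(start: str, hashes: dict, radius: int = 4) -> set:
    """Breadth-first walk over successive similar images, starting from `start`.

    This mimics opening each similar photo in its own tab and repeating
    the comparison until no new photos turn up.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other, other_hash in hashes.items():
            if other not in seen and similar(hashes[current], other_hash, radius):
                seen.add(other)
                queue.append(other)
    return seen


# Made-up hashes: B is 3 bits from A, C is 3 bits from B, but C is 6 bits from A.
hashes = {
    "A": 0b0000000000000000,
    "B": 0b0000000000000111,
    "C": 0b0000000000111111,
    "D": 0b1111111100000000,  # far from everything else
}

direct = {name for name, h in hashes.items() if similar(hashes["A"], h)}
chained = follow_chain("A", hashes)
```

Here `direct` contains only A and B, while `chained` also picks up C through B. D is never reached because it is not within the radius of any image on the path.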
I also get some false positives which are not similar to the eye although their phashes are within the given radius. A bad, bad thing, isn't it? Well, it depends on how you think about it. I think I have found both duplicates and images without... art value almost more often via false positives than via true positives. Furthermore, the unpleasing images seem to attract other unpleasing images, which both raises the efficiency of cleaning and keeps astounding me. It's as if there is more in this system than the eye can see. Then again, it may just be my impression.
Optimizations
A few words about optimizations. I should have mentioned this already, but to make things work on a reasonable time scale you have to index the phash and the precomputed Hamming distance columns in your database.
The distance threshold below which images are considered the same is a delicate balance between missing true positives and drowning in false positives. With my data a Hamming distance of four bits seems to be optimal (as explained in my previous blog entry). I have learnt that I can take in at most about 20 images on one screen without starting to miss the similar ones (given that most are false positives) when rapidly eyeing through the photos.
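One way to get a feel for this balance is to count how many candidates a single image pulls in at each radius. The sketch below is in Python rather than the PHP of the actual system and works on an in-memory list instead of the database. The function names, the sample hashes and the `screen_limit` parameter are hypothetical and not part of the system described in these posts.

```python
def hamming(a: int, b: int) -> int:
    """Number of bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def matches_per_radius(query: int, hashes: list, max_radius: int = 8) -> dict:
    """Return {radius: count of hashes within that many bits of `query`}."""
    distances = [hamming(query, h) for h in hashes]
    return {r: sum(d <= r for d in distances) for r in range(max_radius + 1)}


def largest_radius_that_fits(query: int, hashes: list,
                             screen_limit: int = 20, max_radius: int = 8):
    """Largest radius whose result set still fits on one screen, or None."""
    counts = matches_per_radius(query, hashes, max_radius)
    fitting = [r for r, n in counts.items() if n <= screen_limit]
    return max(fitting) if fitting else None


# Made-up hashes at distances 0, 1, 2, ... 6 from the query hash 0.
sample = [0b0, 0b1, 0b11, 0b111, 0b1111, 0b11111, 0b111111]
counts = matches_per_radius(0, sample)
best = largest_radius_that_fits(0, sample, screen_limit=5)
```

With this toy data every extra bit of radius adds one more candidate, so a screen limit of five images is reached at a radius of four. On a real collection the growth is usually much steeper, which is why a small change in radius can flood the screen with false positives.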
The Future (if it ever exists)
I have tried different rules for proximity, but this has been the only one that works well so far. I'll keep trying, but don't hold your breath. For me this works as well as I need for now, and the next meaningful step would need something that changes the scene remarkably. That would mean finding a better function for the phash (one based on the Discrete Cosine Transform may be that) which would find more truly similar images and fewer false positives. Another goal would be implementing some logic that could find clusters of similar images effectively. There might be a way to do it with the current database, but I haven't found it yet.
As a programmer I would also like to encapsulate this functionality in a generic class. That isn't so easy to do, because currently the database operations are tightly intertwined throughout the code and my image files are organized in different directories, which needs some additional logic. I have some thoughts about this frontier, though, and I will blog about them here when and if they ever see the light.
I end this subject (for now, at least) with a conclusion of what we have got and how it works.
Conclusion
Managing a big image base from different sources is like gardening. Unless you diligently prune and nurture the garden, the weeds will make it less enjoyable. So you have to devote time, code and tenderness to your collection, unless you don't mind a wild garden (but then why would you read this article in the first place?). If you came here for a turnkey system which automatically deletes just the duplicate images (similar, not same) in your database, I am sorry to disappoint you. Perceptual hashes as I have implemented them here don't find just the similar photos; they also find different photos and think they are similar. They don't find all the similar photos either, because there are always inaccuracies in resizing the initial photos and in calculating the average value of the resized photos, which may put the phash off.
I haven't (yet, at least) encapsulated the functionality in a simple and general class, nor even given the whole source of my working system. Instead I have shown just the significant parts and left the rest as an easy exercise for the reader. I still hope that you have gained some insight into perceptual hashes and also grasped why they aren't an easy solution to handle. Given some background in PHP, I would estimate that it is possible to re-create these ideas as a part of your own image collection.