Sunday, February 2, 2014

Yes, it is perfect now :)

I continued the research, though I was quite happy with the results.

There were still false positives, not so many, but some.

I wanted to check if the false positives were caused by the initial re-size of the image. So, to accomplish this I changed the initial image re-sizing factor by setting it to 128x128 pixels instead of 32x32. I started the computing and ... well the dots that indicated progress actually indicated no progress. Another wall to hit your head.

I made a compromise and aimed for 64x64. Even so the calculation was slow but I let it crunch.

Meanwhile I decided, it's time to bite the bullet. From Google I found a library of fast DCT (as a part of Fast Fourier transform library). And now, me dear readers, I have to confess that I went over the board and constructed the hashing algorithm in C (I have been programming C, but not for ages and had forgotten some if not most of it) instead of PHP. Sorry for that.

After some fighting I got it working. The fast DCT algorithm from the library runs probably n log n (n being the number of pixels along one side of the re-sized image) instead of the n4 of the initial straight-forward algorithm and the speed is from other galaxy compared with PHP. (My PHP implementation was struggling on about 16k images (after about 24 hours)when my C program started and the C missile by-passed it in a few moments. Quietly I pressed Ctrl+C on on PHP screen, sighed and closed the screen.

It still took some hours to go through the database - but I am happy to inform you that the results are spectacular, if not stellar!

With my 428k+ images I haven't yet found any false positives. Not one. Zero. None. Ninguno. And it found really many more images that were similar as compared with the earlier (32 x 32-bit initial size) images that I had more or less pruned away.

So, my code does the same steps as earlier one, but instead of 32 x 32 bits original-resize it re-sizes the initial image to 128 x 128. Then it calculates the DCT and reorganizes the bits just like I told earlier. And it finds just similar images and I'd think that it finds at least most of them. Image after image.

I have made a utility script that shows the images sorted by descending value of similar images. With 32 x 32 intial cell-size the most-matched images were all more or less false positives, same way as with average hash I started my little project of pruning similar images from the database. Now it shows just rows of similar images; page after page. So the similarity of the 32 x 32 bith could be much more result of the re-size than clashes of hash algorithm.

I would go as far as saying: problem solved.

As this blog is PHP I haven't (at least not yet) published the C code here (or anywhere). Basically there isn't anything new; all I have done has been converting the code to C and replacing DCT-code with a library call.