Image hashing is a technique for generating distinct "fingerprints" of images, which can be used to identify and group similar images. "phash" (perceptual hash) is one of the most popular and effective hashing algorithms. We tried it on 10k images from our archive, with promising results.
This post walks through how we computed the phashes with the ImageHash library, created easily navigable clusters (groups) of images whose fingerprints (hashes) are identical, and found images that are similar to a query image. An elegant feature of phashes is that similar images have similar hashes. To learn how the hashing algorithm works, check out this other blog.
The code for this walkthrough can be found in a Jupyter notebook here. An executed version of the notebook has been archived by the Wayback Machine and can be found here.
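Before opening the notebook, it may help to see the three steps in miniature. The snippet below is a rough sketch, not the notebook's code: the helper names (`compute_phash`, `hamming_distance`, `cluster_identical`, `find_similar`) and the `max_distance=10` threshold are our own assumptions for illustration. It relies on ImageHash's `phash` function and stores each hash as the hex string you get from `str()`, so "similar hashes" simply means a small number of differing bits (Hamming distance).

```python
from collections import defaultdict


def compute_phash(path):
    """Return the phash of the image at `path` as a hex string."""
    # Imported here so the helpers below also work without Pillow/ImageHash.
    from PIL import Image
    import imagehash

    with Image.open(path) as img:
        # With the default hash size this is a 64-bit hash (16 hex characters).
        return str(imagehash.phash(img))


def hamming_distance(hash_a, hash_b):
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def cluster_identical(hashes):
    """Group image names whose hashes are exactly equal.

    `hashes` maps an image name to its hex hash string.
    """
    clusters = defaultdict(list)
    for name, image_hash in hashes.items():
        clusters[image_hash].append(name)
    return dict(clusters)


def find_similar(query_hash, hashes, max_distance=10):
    """Return (name, distance) pairs within `max_distance` bits, closest first.

    The default threshold of 10 is an arbitrary illustrative value.
    """
    scored = [
        (name, hamming_distance(query_hash, image_hash))
        for name, image_hash in hashes.items()
    ]
    close = [(name, dist) for name, dist in scored if dist <= max_distance]
    return sorted(close, key=lambda pair: pair[1])
```

In practice you would build the `hashes` dictionary by calling `compute_phash` on every file in the archive, pass it to `cluster_identical` to get the groups of identical fingerprints, and call `find_similar` with the hash of a query image. The right distance threshold depends on your data and is worth tuning by eye.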