Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add convenience wrapper for quick similarity computation #1

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

miracvbasaran
Copy link

Thanks for working on this, it works like a charm!

As I was using your libraries, I thought it was a low-hanging fruit to add a simple convenience wrapper around imagehash & images4, to simplify user experience. It seems to me like new Image.Similar() is the only thing users need from the imagehash library, so we can hide away the complexity of centralHash and hashSet.

Additionally, users could directly use the imagehash.Image struct to create a hashtable for quick lookup, e.g.:

var images map[string]imagehash.Image
for p := range paths {
    images[p] = imagehash.Open(p)
}

func lookup(imgToLookup imagehash.Image) string {
    for p, img := range images {
        if img.Similar(imgToLookup) {
            return p
        }
    }
    return ""
}

LMK if you think this is a good idea and what tests you'd like me to add for them.

@vitali-fedulov
Copy link
Owner

vitali-fedulov commented Jan 21, 2023

Hello Miraç,

I am very excited you have found the package useful! And I have looked at your pull request.

But I am confused, because the addition is contradictory to the purpose of imagehash package:

(1) The main purpose of imagehash package is to avoid each-to-each image comparison, which happens, when using images4.Similar or any (!) Similar function. Hashes in this package are not meant for each-to-each image comparison.

func Similar implies each-to-each comparison.

You propose a func Similar, which merges image4.Similar with imagehash similarity. But it is faster to use image4.Similar directly. That is imagehash is not needed at all for this scenario. And it should not impact precision of comparison.

(2) You propose the Image struct, which is a very large object to keep it in memory for every image.

The purpose of hashes and icons was to avoid keeping images in memory. Additional purpose of icon is generalization for comparison, because pixel-by-pixel comparison of full images does not generalize well.

Actually hashes are an additional memory-saving step to avoid keeping icons in memory.


RECOMMENDATION

Try to use images4 directly, unless you work with billions images.

If you do work with billions of images, then imagehash(es) are used to compare by those hashes only - not any Similar func. One huge hash table is necessary - not the one like in the example on the package README.

It is probably the confusion of terms, because I use "hash" as precise term related to an address in hash table, unlike many people use it for, say, Hamming distance calculations. In imagehash, no distances are calculated between hashes. But for icons - yes, distances are calculated.


Am I missing something? Let me know what you think.

Thank you for the contribution! It helps clarify package descriptions.

@miracvbasaran
Copy link
Author

Hello Vitali,

Thanks for getting back to me! IIUC, you have two concerns:

1. Performance

Perhaps my PR got a bit confusing with all the images and similarities: imagehash.Image, image.Image, Image.Similar(), images4.Similar()... :)

My PR doesn't actually run images4.Similar() unless the imagehash comparison (comparing CentralHash with HashSet) yields a match. See the following snippet:

return foundSimilarImage && images4.Similar(img.Icon, img2.Icon)

When foundSimilarImage is false,i.e. when imagehash comparison doesn't yield a match or in other words, when images are dissimilar, images4.Similar() is not called because the first part of the expression (foundSimilarImage) evaluates to false.

This PR indeed improves performance as opposed to using images4.Similar() directly, even before going into the "billions of images territory".

I ran some informal experiments:
Setup: ~45000 images with ~200 matching image pairs.
Comparing them directly with images4.Similar(): 99s
Comparing them with imagehash.Image.Similar() (This PR): 46s

2. Memory

When I wrote this snippet for my own project before creating this PR, I didn't have images.Image in the imagehash.Image struct because I didn't have need for it. So it was like the following:

type Image struct {
	Icon         images4.IconT
	CentralHash  uint64
	HashSet      []uint64
}

I added Image images.Image when creating the PR because I thought it might be useful for others: e.g. if they want to use this function in images4.

I didn't consider the memory impact. You are right, when working with billions of images that would be prohibitive. I removed it now.

Just as an example, after this change, in my experiments an average imagehash.Image object is about ~0.18 Kb, meaning 1 billion imagehash.Image would be ~180 Mb, which is reasonably small. A user could easily hold a hashmap with tens of billions of images.

I hope that makes sense!

Cheers,
Miraç

@vitali-fedulov
Copy link
Owner

Miraç, got it, thanks!

At this point I am thinking:

  1. Maybe your proposal for faster Similar func would better fit images4 package, or even a future version images5.

  2. Indeed, from looking how people use images4, sometimes they compare 2 images only. They do not even loop in 1-for-each.

  3. Do you mind to provide a sample of your code how you use it in one of your projects? To get an idea. Have you tried or need a hash table as a primary step? - it will be interesting to see the speed gain in addition to 2 of your other benchmarks.

Hopefully your benchmarks reflect purely the comparison step with all icons/hash structs ready, without including image reads time.

I am going slowly on this to make sure we are getting an elegant solution.

Best,
Vitali

@vitali-fedulov vitali-fedulov added the enhancement New feature or request label Jan 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants