user-user or item-item similarity scores #907

victusfate · 2022-04-06T17:40:20Z

In a previous recommendation model (implicit ALS) I used the following code to compute user-user similarity scores (the normalized dot product of user factor vectors).

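    # latent vector of the target user, and the full user-factor matrix
    factor  = self.model.user_factors[user_idx]
    factors = self.model.user_factors
    # precomputed norms of every user's factor vector
    norms   = self.model.user_norms
    norm    = norms[user_idx]
    # cosine similarity between the target user and every user
    scores  = factors.dot(factor) / (norm * norms)

    similar_users = {}
    for selected_user in selected_users:
        # convert np.float32 to python number
        similar_users[selected_user] = scores[self.inv_user_map[selected_user]].item()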

How would I go about creating a similar (fast) user-user comparison score from, say, a matrix factorization model (i.e. using self.model.regressor.steps['FMRegressor'].weights)?

I was considering gathering the specified user's predictions for items they have explicitly scored, and then computing a normalized distance between that and other user predictions for those items. I didn't see anything built-in to support this use case, and wanted to make sure I didn't miss something obvious.
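The idea in the paragraph above could be sketched roughly as follows. This is a minimal illustration, not anything built into the library: `predict`, `prediction_similarity`, and the toy data are hypothetical names, and cosine similarity is used as one possible choice of "normalized distance".

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def prediction_similarity(predict, target, others, rated_items):
    """Compare predicted scores on the items the target user has rated.

    `predict(user, item)` is a hypothetical callable returning a predicted
    score; in practice it would wrap the trained model.
    """
    target_vec = [predict(target, item) for item in rated_items]
    return {
        other: cosine_similarity(
            target_vec, [predict(other, item) for item in rated_items]
        )
        for other in others
    }

# Toy stand-in for a trained model's predictions (made-up numbers).
toy_scores = {
    ("ann", "a"): 4.0, ("ann", "b"): 2.0,
    ("ben", "a"): 2.0, ("ben", "b"): 1.0,
    ("cat", "a"): 1.0, ("cat", "b"): 4.0,
}

def toy_predict(user, item):
    return toy_scores[(user, item)]

sims = prediction_similarity(toy_predict, "ann", ["ben", "cat"], ["a", "b"])
```

Here "ben" scores higher than "cat" because his predicted scores are proportional to "ann"'s. Note that this needs one prediction per (other user, item) pair, so the cost grows with both the number of users compared and the number of rated items.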

Answered by gbolmier

gbolmier · 2022-04-10T21:56:35Z

gbolmier
Apr 10, 2022
Maintainer

Thanks for the feedback @victusfate, this is an interesting use case! Indeed, we don't support it right now. Latent factors are stored as a dictionary of numpy arrays in facto.FMRegressor, so a latent vector is accessed by key: model.regressor.steps['FMRegressor'].latents['Bob'].

We can use the internal math module to perform a dot product between two users:

river/river/utils/math.py, line 252 in 082d5eb:

    def dot(x: dict, y: dict):

But for many comparisons this would be slow compared to a vectorized approach. If speed matters to you, you could convert the dictionary of numpy arrays (or a subset of it) into a single numpy array and vectorize the computation.
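A minimal sketch of that conversion, assuming the latents behave like a plain mapping from user key to 1-D numpy array. The helper names `stack_latents` and `cosine_to_all` are hypothetical, and the toy dictionary only stands in for the model's latents. The exact key format in a real model is also an assumption, and the real dictionary may hold item entries too, so passing an explicit `users` subset is probably safer.

```python
import numpy as np

def stack_latents(latents, users=None):
    """Stack a {key: 1-D array} mapping into one (n_users, k) matrix.

    Membership is checked with `in` so that a defaultdict is never asked
    to create entries for unknown keys.
    """
    if users is None:
        user_ids = list(latents)
    else:
        user_ids = [u for u in users if u in latents]
    matrix = np.vstack([latents[u] for u in user_ids])
    return user_ids, matrix

def cosine_to_all(target, user_ids, matrix):
    """Cosine similarity between one user's row and every row of the matrix."""
    idx = user_ids.index(target)
    norms = np.linalg.norm(matrix, axis=1)
    denom = norms * norms[idx]
    denom[denom == 0] = np.inf  # all-zero vectors get a similarity of 0
    scores = matrix @ matrix[idx] / denom
    return {u: float(s) for u, s in zip(user_ids, scores)}

# Toy latents standing in for the model's dictionary of latent vectors.
latents = {
    "Alice": np.array([1.0, 0.0]),
    "Bob": np.array([2.0, 0.0]),
    "Carol": np.array([0.0, 3.0]),
}

user_ids, matrix = stack_latents(latents)
scores = cosine_to_all("Bob", user_ids, matrix)
```

In an online setting the latents keep changing as the model learns, so the stacked matrix would need to be rebuilt (or rebuilt for just the users being compared) before each batch of comparisons.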

4 replies

victusfate Apr 11, 2022
Author

Oh this is helpful information, thanks @gbolmier
I ended up doing something slow, just so I could get scores. Ideally, I can use those already stored latent vectors.

For reference, this is what I'm using now (cosine distance between the target user's predicted vector and those of other users, over the target user's explicitly rated items):
https://github.com/victusfate/concierge/blob/c60399241de7542284a21fd29944406f1ac2851d/concierge/collaborative_filter.py#L329-L369

I also added a secondary user_rankings method based on your input, but I found the results to not be quite as good (based only on my limited review). It is MUCH faster than what I did originally, and requires less additional data.
https://github.com/victusfate/concierge/blob/5a9f597815b2112c49fd7ecad38e6e5526d0d58f/concierge/collaborative_filter.py#L371-L390

gbolmier Apr 20, 2022
Maintainer

May I ask how many users you are comparing to? And what is your response time requirement?

victusfate Apr 21, 2022
Author

It compares one user to a dozen or two other users at a time. It currently runs in at most 270ms, which is slow but tolerable where it's used (that's for users with several hundred scores in a highly sparse dataset).

It's easy to cache if needed, but that would limit the responsiveness of the predictions.

I could speed it up further if needed:

It takes a bit of time to call predict_one for each of the X other users on the target user's rated items. A batch predictor would be faster, but I didn't see one available for the Matrix Factorization Recommender model. This is the slowest portion, taking ~10ms per prediction on the system I run it on, multiplied by the number of other users compared (i.e. 20+).
It also takes a small bit of time to generate the numpy arrays from lists, but then the cosine distance calculation is very fast.

victusfate May 17, 2022
Author

Heads up @gbolmier: I ended up going with the faster version. As the number of interactions and users grew, the original approach slowed down too much (to multiple seconds).

Latest quick dot product:
https://github.com/victusfate/concierge/blob/9fa7e27a879f478b478f6deaf03255b48c1630e9/concierge/collaborative_filter.py#L386-L406
