
Option to add user and/or item features #159

Open
wants to merge 54 commits into base: user_item_features

Conversation


@martincousi commented Mar 30, 2018

I started modifying the Dataset and Trainset classes to include the option of having user and/or item features since I later want to work on an algorithm that accepts these. I think I made good progress but I still need to figure out how to create a testset with these features.

PS: It appears that this branch also includes other modifications I made with respect to asymetric_measures and to suppressing the printing during the computation of baselines and similarities. Edit: I have reverted the changes to accuracy.py and AlgoBase.

@NicolasHug
Owner

Thanks for the PR!

Coming up with a way to handle content-based features is a lot of work. I gave it a quick try before and found it not to be worth the hassle. I'm not saying it's not doable; it may even be quite easy if you're going for a very specific solution. But integrating the features into the whole data pipeline (cross-validation, etc.) in a generic fashion can be tricky, and it also requires making choices about dataset loading, etc.

So I think the best way is probably for you to work on it on your own fork for now, and submit a complete PR once you think it's done (if you still want to). But even then, I can't promise we will be able to merge it: it will depend on how useful this addition is and how well it integrates with the current codebase.

Would that work for you?

Thanks!
Nicolas

@martincousi
Author

@NicolasHug What do you think of my implementation? It appears that some tests need to be modified.

@NicolasHug
Owner

NicolasHug commented Apr 7, 2018

Thanks,

I really appreciate all the effort on the clean code and the good documentation :)!

I have only skimmed the code so far, but I'm wondering why you're passing the user/item features to the predict and estimate methods, and storing them in the Prediction object. I may be wrong, but it seems to me that the features should simply be stored in the Trainset object?

Could you brief me a bit on what those user/item features actually are? That is, what kinds of datasets need such features? What are some examples of algorithms that use them? Any reference for the Lasso algorithm that you implemented? Are there publicly available datasets that we could natively support?

For the tests: you'll need to add scikit-learn as a dependency in all the requirements.txt files (including the one for Travis). That being said, I'm not sure yet whether adding scikit-learn as a dependency would be a wise move -- if the Lasso algorithm you implemented is simply a regularized linear regression model, maybe it would be best to code it ourselves. I love scikit-learn, but I'd like to minimize the dependencies as much as possible.
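For reference, Lasso (in scikit-learn's formulation) is just least squares with an L1 penalty on the coefficients, so a small dedicated implementation would essentially be:

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```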

Thanks!
Nicolas

@martincousi
Author

I am passing user/item features to the predict and estimate methods since it makes more sense (in my opinion) that the features are stored in the testset object. In this way, it is also possible to predict for a new user and/or item for which the features were not added to the Dataset and/or Trainset objects. For the Prediction object, it is not necessary to put the features in it. I just made it this way to simplify post-analysis of the predictions.

These features can be a variety of things. For example, user features could be demographic information (e.g., age, gender) or other elicited information (e.g., preferences for certain actors or movie genres). Item features can be attributes associated with the item (e.g., movie genre or studio) or expert information (e.g., an expert rating). Most recommender systems do not have such information, but for some applications it is possible to ask users for it or to scrape it off the web (e.g., expert information).
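As a made-up illustration, such features could be stored as simple per-user and per-item tables (the column names here are invented):

```python
# Made-up example of user and item feature tables; the column names are
# invented for illustration only.
import pandas as pd

user_features = pd.DataFrame({
    'userID': [1, 2, 3],
    'age': [25, 41, 33],          # demographic information
    'likes_scifi': [1, 0, 1],     # elicited preference
})

item_features = pd.DataFrame({
    'itemID': [10, 20],
    'genre_drama': [1, 0],        # item attribute
    'expert_rating': [4.5, 3.0],  # scraped expert information
})
```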

Algorithms can be designed to use only user features, only item features, or both. These can be hybrid algorithms [1] (i.e., a mix of algorithms) or a single specific algorithm. The lasso algorithm I implemented is only a naïve/simplistic implementation of [2]; I did it just to test my implementation of the features add-on. I am now working towards an implementation of factorization machines [3], which is (in my opinion) a much better approach. I plan to implement factorization machines by importing the tffm library, which appears more complete than polylearn, another factorization machine library.
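For reference, the second-order FM model of [3] scores a feature vector x (the one-hot user and item indicators plus any side features) as:

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```

so SVD-style models are recovered as the special case where x contains only the user and item indicators.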

I am pretty sure there must exist publicly available datasets that include such features, but I am unaware of which ones and don't have time to look into it at the moment. Maybe one of the Yahoo datasets?

As I said, I don't think the lasso algorithm necessarily needs to be included. I had removed it, but it came back when I updated my master branch; I have now removed it again. I am currently working on an implementation of factorization machines on another branch and will open a PR when it's ready.

[1] R. Burke, “Hybrid Recommender Systems: Survey and Experiments,” User Model. User-adapt. Interact., vol. 12, no. 4, pp. 331–370, 2002.
[2] A. Ansari, S. Essegaier, and R. Kohli, “Internet Recommendation Systems,” J. Mark. Res., vol. 37, no. 3, pp. 363–375, 2000.
[3] S. Rendle, “Factorization machines,” in Proceedings - IEEE International Conference on Data Mining, ICDM, 2010, pp. 995–1000.

@NicolasHug
Owner

Thanks a lot for the update!

> I am passing user/item features to the predict and estimate methods since it makes more sense (in my opinion) that the features are stored in the testset object. In this way, it is also possible to predict for a new user and/or item for which the features were not added to the Dataset and/or Trainset objects.

Yeah, you're absolutely right; I was looking at it the wrong way.

> For the Prediction object, it is not necessary to put the features in it. I just made it this way to simplify post-analysis of the predictions.

Probably best to remove it from there, then.

I've been thinking about adding additional dependencies (scikit-learn, tffm, or whatever), and I think it's OK as long as we keep them optional. E.g., if you implement FM with tffm, only those who want to use the FM model would need to install tffm, but it's still not a core dependency (concretely, we don't add tffm to requirements.txt).
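Concretely, I'm thinking of the usual optional-import pattern. This is only a sketch: FMAlgo is a hypothetical class name, and TFFMRegressor is assumed to be tffm's regressor entry point.

```python
# Sketch of the optional-dependency pattern: tffm is only needed by users
# of this particular algorithm, not by Surprise as a whole.
from surprise import AlgoBase

try:
    from tffm import TFFMRegressor  # optional dependency
except ImportError:
    TFFMRegressor = None


class FMAlgo(AlgoBase):  # hypothetical FM algorithm built on tffm
    def __init__(self, **kwargs):
        if TFFMRegressor is None:
            raise ImportError('FMAlgo requires the optional tffm package; '
                              'install it with `pip install tffm`.')
        AlgoBase.__init__(self, **kwargs)
```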

@NicolasHug changed the base branch from master to user_item_features on April 13, 2018
@martincousi
Author

martincousi commented Apr 13, 2018

OK, so I removed the features from the Prediction object. Do we want to merge this branch now, or do we wait for at least one algorithm that supports features (i.e., should the additional algorithms be part of this PR or a separate one)?

Also, do we need to fix the tests?

@NicolasHug
Owner

NicolasHug commented Apr 13, 2018

Thanks,

I can merge this PR into a new feature branch if you want, and you can then send more PRs to that branch.

For the tests: I suspect you'll have other failing tests when you implement the algorithm, so it's up to you. If you prefer to solve the test issues all at once, I'm OK with that.

BTW, have you thought of a way to integrate the new changes with the cross validation iterators?

@martincousi
Author

The features option already works with the cross validation iterators (to my knowledge).

@martincousi
Author

@NicolasHug How can we enable tests on this base branch so that I can see which tests fail?

@NicolasHug
Owner

We'd need to modify the .travis.yml file, but you should definitely run the tests locally before any commit anyway.

@martincousi
Author

What is the best way to run the tests locally without having to do python test_name.py for each test?

@NicolasHug
Owner

Just run pytest at the root directory.

If you haven't already, check out the contributing guidelines.

@martincousi
Author

I have corrected the tests so that they now pass on my computer when running pytest. Unfortunately, I don't have time to write new tests for the features option.

@NicolasHug
Owner

NicolasHug commented Apr 13, 2018

No worries, it can wait.
But I'm sure you understand I cannot merge anything that is not thoroughly tested, especially when it's such a big feature / improvement.

EDIT: I mean merging into the master branch for a future release. I don't mind merging untested code into a feature branch.

@igorsantana

Hey, any updates on this? I've been following the conversation and would like to know if you guys have any plans to merge this. I am working with context-aware recommender systems and I'm rewriting my code from Java to Python (in which I'm kind of a newbie).

Is there a way to populate the Dataset with more info than just user item rating [timestamp]? I've searched through the docs and haven't found one.

Keep up the nice work! 😊

@martincousi
Author

martincousi commented Jun 6, 2018

I have been working on other projects in the meantime, but this branch should work without issues. However, I would recommend using my factorization-machines branch, as it should contain the latest updates.

Note that this branch takes into account user and item features, but not context. Also, from looking at the code, I don't think the timestamp option is working. To add context with many variables, it would be easy to extend my code to attach features to user-item pairs; you would then need to extend the algorithms to take these features into account.

@NicolasHug
Owner

@martincousi is 100% correct

No plans to merge this (or the other branch), unfortunately, because I don't have enough visibility on how well it would integrate with the current codebase.

@Paola123456

Hi Martin and Nicolas,
Has the matrix factorisation algorithm (SVD or SVD++) with user-item features been implemented yet? If it has, could you point me in the right direction? Many thanks.

@martincousi
Author

If you want to add user/item features to a factorization algorithm, you should take a look at factorization machines.

I have a working implementation in factorization_machines.py. Note that this is on the sample_weight branch of my fork; it is the most up-to-date branch and requires PyTorch in order to use the FM class.

To use this class, you first add your features using Dataset.load_features_df(). Then, after building your Trainset, you can use the FM class with the option rating_lst=('userID', 'itemID') (similar to SVD) or rating_lst=('userID', 'itemID', 'imp_u_rating') (similar to SVD++). The labels of the features are provided through user_lst and item_lst; see the sketch below.
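A minimal sketch of that workflow. The exact signatures of load_features_df and FM live in my fork, so treat them as approximate, and the data here is made up:

```python
# Sketch only: Dataset.load_features_df and FM come from the fork
# described above, so their exact signatures are assumptions here.
import pandas as pd
from surprise import Dataset, Reader
from factorization_machines import FM  # fork-only module; requires PyTorch

ratings = pd.DataFrame({'userID': [1, 1, 2],
                        'itemID': [10, 20, 10],
                        'rating': [3.0, 4.0, 5.0]})
user_features = pd.DataFrame({'userID': [1, 2], 'age': [25, 41]})
item_features = pd.DataFrame({'itemID': [10, 20], 'genre_drama': [1, 0]})

data = Dataset.load_from_df(ratings, Reader(rating_scale=(1, 5)))
data.load_features_df(user_features, item_features)  # fork-only API
trainset = data.build_full_trainset()

algo = FM(rating_lst=('userID', 'itemID'),  # SVD-like rating terms
          user_lst=['age'],                 # user feature labels
          item_lst=['genre_drama'])         # item feature labels
algo.fit(trainset)
```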
