Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_raw_interactions() NaN assumptions and edge case #269

Open
DavidLandup0 opened this issue Feb 25, 2025 · 0 comments
Open

get_raw_interactions() NaN assumptions and edge case #269

DavidLandup0 opened this issue Feb 25, 2025 · 0 comments
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@DavidLandup0
Copy link

DavidLandup0 commented Feb 25, 2025

What happened?

Hello! Great work on the library! I recently found an edge case in the get_raw_interactions() calls, which may or may not be intended, so I'm opening an issue for discussion.

Context

Namely, the get_raw_interactions() method is a couple of wrappers for to the convert_to_external() calls from each identifier, which uses get_from_series_by_index().

get_from_series_by_index() seems to have an explicit assumption that NaNs in the resulting dataframe are caused by the reindex() operation. The operation introduces NaNs when there is a mismatch between item IDs and user IDs:

https://github.com/MobileTeleSystems/RecTools/blob/main/rectools/utils/indexing.py#L123-L127

Edge Case

The reindex() operation isn't the only time NaNs can be introduced, though. For example, if a Dataset has a NaN by itself in its raw form - this exception is raised anyways. For example:

>>> import rectools
>>> import pandas as pd
>>> from rectools.dataset import Dataset

>>> data = {
...     'user_id': [1, 2, 3, None, 5, 1, 2, 3, 4, 5],  # There is a None in the User IDs
...     'item_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
...     'rating': [5, 3, 4, 2, 1, 4, 5, 3, 2, 1]
... }

>>> df = pd.DataFrame(data)
>>> df['weight'] = 1
>>> df['datetime'] = pd.to_datetime('2025-02-25')
>>> dataset = Dataset.construct(df)
>>> dataset.get_raw_interactions()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/dataset.py", line 248, in get_raw_interactions
    return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/interactions.py", line 181, in to_external
    Columns.User: user_id_map.convert_to_external(self.df[Columns.User].values),
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/identifiers.py", line 221, in convert_to_external
    result = get_from_series_by_index(self.to_external, internal, strict, return_missing)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/utils/indexing.py", line 127, in get_from_series_by_index
    raise KeyError("Some indices do not exist")
KeyError: 'Some indices do not exist'

While get_from_series_by_index() itself has a very useful strict mode that can be turned off, the get_raw_interactions() method doesn't allow passing this argument down to the external export calls.

Workaround

A simple workaround is to simply re-implement the constituent calls and pass strict=False in a utility module or patch the lib. However, it would be much nicer developer UX if we can simply pass it through the call natively.

Proposal

One backwards-compatible way to allow for this edge case would be to allow **kwargs in the get_raw_interactions() method and pass them to the underlying calls.

Alternatively, it would be nice UX to also raise which indices are missing or provide a bit of extra context in the error message to alleviate part of the debugging.

Thoughts?

Expected behavior

No response

Relevant logs and/or screenshots

No response

Operating System

MacOS

Python Version

3.11

RecTools version

0.8.0

@DavidLandup0 DavidLandup0 added bug Something isn't working good first issue Good for newcomers labels Feb 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

1 participant