You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hello! Great work on the library! I recently found an edge case in the get_raw_interactions() calls, which may or may not be intended, so I'm opening an issue for discussion.
Context
Namely, the get_raw_interactions() method is a couple of wrappers for to the convert_to_external() calls from each identifier, which uses get_from_series_by_index().
get_from_series_by_index() seems to have an explicit assumption that NaNs in the resulting dataframe are caused by the reindex() operation. The operation introduces NaNs when there is a mismatch between item IDs and user IDs:
The reindex() operation isn't the only time NaNs can be introduced, though. For example, if a Dataset has a NaN by itself in its raw form - this exception is raised anyways. For example:
>>> import rectools
>>> import pandas as pd
>>> from rectools.dataset import Dataset
>>> data = {
... 'user_id': [1, 2, 3, None, 5, 1, 2, 3, 4, 5], # There is a None in the User IDs
... 'item_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
... 'rating': [5, 3, 4, 2, 1, 4, 5, 3, 2, 1]
... }
>>> df = pd.DataFrame(data)
>>> df['weight'] = 1
>>> df['datetime'] = pd.to_datetime('2025-02-25')
>>> dataset = Dataset.construct(df)
>>> dataset.get_raw_interactions()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/dataset.py", line 248, in get_raw_interactions
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/interactions.py", line 181, in to_external
Columns.User: user_id_map.convert_to_external(self.df[Columns.User].values),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/identifiers.py", line 221, in convert_to_external
result = get_from_series_by_index(self.to_external, internal, strict, return_missing)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/utils/indexing.py", line 127, in get_from_series_by_index
raise KeyError("Some indices do not exist")
KeyError: 'Some indices do not exist'
While get_from_series_by_index() itself has a very useful strict mode that can be turned off, the get_raw_interactions() method doesn't allow passing this argument down to the external export calls.
Workaround
A simple workaround is to simply re-implement the constituent calls and pass strict=False in a utility module or patch the lib. However, it would be much nicer developer UX if we can simply pass it through the call natively.
Proposal
One backwards-compatible way to allow for this edge case would be to allow **kwargs in the get_raw_interactions() method and pass them to the underlying calls.
Alternatively, it would be nice UX to also raise which indices are missing or provide a bit of extra context in the error message to alleviate part of the debugging.
Thoughts?
Expected behavior
No response
Relevant logs and/or screenshots
No response
Operating System
MacOS
Python Version
3.11
RecTools version
0.8.0
The text was updated successfully, but these errors were encountered:
What happened?
Hello! Great work on the library! I recently found an edge case in the
get_raw_interactions()
calls, which may or may not be intended, so I'm opening an issue for discussion.Context
Namely, the
get_raw_interactions()
method is a couple of wrappers for to theconvert_to_external()
calls from each identifier, which usesget_from_series_by_index()
.get_from_series_by_index()
seems to have an explicit assumption that NaNs in the resulting dataframe are caused by thereindex()
operation. The operation introduces NaNs when there is a mismatch between item IDs and user IDs:https://github.com/MobileTeleSystems/RecTools/blob/main/rectools/utils/indexing.py#L123-L127
Edge Case
The
reindex()
operation isn't the only time NaNs can be introduced, though. For example, if a Dataset has a NaN by itself in its raw form - this exception is raised anyways. For example:While
get_from_series_by_index()
itself has a very usefulstrict
mode that can be turned off, theget_raw_interactions()
method doesn't allow passing this argument down to the external export calls.Workaround
A simple workaround is to simply re-implement the constituent calls and pass
strict=False
in a utility module or patch the lib. However, it would be much nicer developer UX if we can simply pass it through the call natively.Proposal
One backwards-compatible way to allow for this edge case would be to allow
**kwargs
in theget_raw_interactions()
method and pass them to the underlying calls.Alternatively, it would be nice UX to also raise which indices are missing or provide a bit of extra context in the error message to alleviate part of the debugging.
Thoughts?
Expected behavior
No response
Relevant logs and/or screenshots
No response
Operating System
MacOS
Python Version
3.11
RecTools version
0.8.0
The text was updated successfully, but these errors were encountered: