in_set predicate raises error unhashable type: 'Series' #773

Joachim-Sh · 2022-09-05T09:44:06Z

The in_set predicate raises the error unhashable type: 'Series' when used with make_batch_reader and make_petastorm_dataset. I am using pandas 1.3.5. See below for a minimal working example.

import pandas as pd
from petastorm.predicates import in_set
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Write a small single-column Parquet dataset.
output_url = 'file:///tmp/hello_world_dataset'
hello_world = pd.DataFrame({'id': list(range(100))})
hello_world.to_parquet(output_url)

# Keep only rows whose 'id' is one of 1..5.
predicate_id = in_set([1, 2, 3, 4, 5], 'id')
with make_batch_reader(output_url, num_epochs=1, workers_count=1, predicate=predicate_id) as reader:
    ds = make_petastorm_dataset(reader)
    train_values = list(ds.as_numpy_iterator())  # the error is raised here

For me, the issue is resolved by applying the in operator elementwise in the do_include method of predicates.in_set:

def do_include(self, values):
    # Test each element of the column separately; a whole Series cannot be hashed.
    def apply_elementwise(value):
        return value in self._inclusion_values
    return values[self._predicate_field].apply(apply_elementwise)

Instead of applying it to the whole column (a pandas Series) at once:

def do_include(self, values):
    return values[self._predicate_field] in self._inclusion_values
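
The difference between the two versions can be reproduced with pandas alone. This is a minimal sketch, assuming `_inclusion_values` is a Python set (the hashing error suggests this, but it is not confirmed here). The names `inclusion_values` and `batch` are made up for illustration. `Series.isin` is shown only as a possible vectorized alternative to `apply`, not as the fix petastorm adopted.

```python
import pandas as pd

# Hypothetical stand-ins for self._inclusion_values and the batch passed to do_include.
inclusion_values = {1, 2, 3, 4, 5}
batch = pd.DataFrame({'id': list(range(10))})

# Whole-column membership test: the set has to hash the Series, which fails.
try:
    batch['id'] in inclusion_values
    whole_column_error = None
except TypeError as exc:
    whole_column_error = exc

# Element-wise test, as in the proposed fix: one boolean per row.
mask_apply = batch['id'].apply(lambda value: value in inclusion_values)

# Vectorized alternative (an assumption, not part of the proposed fix).
mask_isin = batch['id'].isin(inclusion_values)

# Either mask selects the rows whose 'id' is in the inclusion set.
selected_ids = batch.loc[mask_isin, 'id'].tolist()
```

Both masks should agree row for row. `isin` avoids a Python-level call per element, which may matter for large batches.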
