How to use tokenizers with DataSet and DataBundle? #449

Open
davidleejy opened this issue Jan 25, 2023 · 1 comment

@davidleejy

I've tried a few ways to use tokenizers with DataSet and DataBundle objects, but haven't been successful.

Basically, just trying to do this:

# Initialize DataSet object `ds` with data.
# Initialize DataBundle object with DataSet object `ds`.
# Define tokenizer.
# Associate tokenizer with field in DataSet or DataBundle object.
# Hope to see tokenizer work when batches of data are extracted from DataSet object.
from fastNLP import DataSet
from fastNLP import Vocabulary
from fastNLP.io import DataBundle
from functools import partial
from transformers import GPT2Tokenizer

data = {'idx': [0, 1, 2],  
        'sentence':["This is an apple .", "I like apples .", "Apples are good for our health ."],
        'words': [['This', 'is', 'an', 'apple', '.'], 
                  ['I', 'like', 'apples', '.'], 
                  ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],
        'num': [5, 4, 7]}

dataset = DataSet(data)    # Initialize DataSet object with data.

data_bundle = DataBundle(datasets={'train': dataset})    # Initialize DataBundle object

# Define tokenizer:
tokenizer_in = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer_in.pad_token, tokenizer_in.padding_side = tokenizer_in.eos_token, 'left'
tokenizer_in_fn = partial(tokenizer_in.encode_plus, padding=True, return_attention_mask=True)
print(tokenizer_in_fn)       # ensure that settings are as expected.

# Associate tokenizer with field:
data_bundle.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')

print(dataset[0:3])
# Gives:
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | idx | sentence       | words          | num | input_ids      | attention_mask     | length |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | 0   | This is an ... | ['This', 'i... | 5   | [1212, 318,... | [1, 1, 1, 1, 1]... | 5      |
# | 1   | I like appl... | ['I', 'like... | 4   | [40, 588, 2... | [1, 1, 1, 1]       | 4      |
# | 2   | Apples are ... | ['Apples', ... | 7   | [4677, 829,... | [1, 1, 1, 1, 1,... | 8      |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+

# Try to obtain batch data:
ds = data_bundle.get_dataset('train')
print(ds['sentence'].get([0,1,2])) # okay, no problem.
print(ds['input_ids'].get([0,1,2])) # throws exception.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[66], line 1
----> 1 print(ds['input_ids'].get([0,1,2]))

File ~/condaenvs/bbt-hf425-py310/lib/python3.10/site-packages/fastNLP/core/dataset/field.py:77, in FieldArray.get(self, indices)
     75 except BaseException as e:
     76     raise e
---> 77 return np.array(contents)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

I've also tried applying the tokenizer to the DataSet object directly, but the same exception is raised:

# ds is the initialized DataSet object from above
ds.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')
ds['input_ids'].get([0,1,2])  # throws same exception as above.

Environment: Python 3.10, numpy 1.24.1. (Are there other Python packages whose versions I need to be careful about?)
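
For what it's worth, the failure reproduces with numpy alone, so this looks like a numpy-version issue rather than a tokenizer one. A minimal sketch, independent of fastNLP (the token ids here are just illustrative):

import numpy as np

# Rows of different lengths, exactly what a tokenizer produces
# without fixed-length padding.
ragged = [[1212, 318, 281, 17180, 764], [40, 588, 22514, 764]]

np.array(ragged)                # numpy >= 1.24: raises ValueError
np.array(ragged, dtype=object)  # works: a 1-D object array of lists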

@x54-729
Collaborator

x54-729 commented Jan 25, 2023

Thanks for your report! The example code works well with numpy 1.21.6, so you can temporarily avoid this problem by downgrading to numpy 1.21.6.
For more detail, numpy 1.21.6 already emits a deprecation warning at the same spot:

/remote-home/shxing/anaconda3/envs/fastnlp/lib/python3.7/site-packages/fastNLP/core/dataset/field.py:77: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return np.array(contents)

Therefore, another solution is to change the source code at fastNLP/core/dataset/field.py:77 from

return np.array(contents)

to

return np.array(contents, dtype=object)

The output then becomes:

[list([1212, 318, 281, 17180, 764]) list([40, 588, 22514, 764])
 list([4677, 829, 389, 922, 329, 674, 1535, 764])]
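
If editing the installed source is inconvenient, the same change can be applied as a runtime monkey-patch. This is only a sketch: it assumes FieldArray stores its raw items in a `content` attribute, so please verify against your installed fastNLP version:

import numpy as np
from fastNLP.core.dataset.field import FieldArray

_original_get = FieldArray.get

def _ragged_safe_get(self, indices):
    try:
        return _original_get(self, indices)
    except ValueError:
        # Fall back to an object array, which accepts rows of different
        # lengths (assumes self.content holds the raw field items).
        return np.array([self.content[i] for i in indices], dtype=object)

FieldArray.get = _ragged_safe_get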

We're sorry that we didn't take different package versions into consideration, and we apologize for the inconvenience. We will discuss this problem and provide a better solution in a future version.
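
In the meantime, if pinning numpy or patching fastNLP is not an option, a user-side workaround is to pad every example to a fixed length so the field never becomes ragged. A sketch, where max_length=16 is an arbitrary value chosen for this example:

from functools import partial

tokenizer_in_fn = partial(
    tokenizer_in.encode_plus,
    padding='max_length',        # pad every example to the same length
    max_length=16,               # arbitrary fixed length for this sketch
    truncation=True,             # guard against sentences longer than max_length
    return_attention_mask=True,
)
data_bundle.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')

Because every row now has the same length, np.array(contents) builds a regular 2-D array and FieldArray.get no longer raises.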
