How to use tokenizers with DataSet and DataBundle? #449
Comments
Thanks for your report! The example code works with numpy 1.21.6, so you can temporarily avoid this problem by using numpy 1.21.6. In our environment it prints the following warning:

    /remote-home/shxing/anaconda3/envs/fastnlp/lib/python3.7/site-packages/fastNLP/core/dataset/field.py:77: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      return np.array(contents)

Another solution is therefore to change the source code at that line from return np.array(contents) to return np.array(contents, dtype=object). The output is:

    [list([1212, 318, 281, 17180, 764]) list([40, 588, 22514, 764])
     list([4677, 829, 389, 922, 329, 674, 1535, 764])]

We are sorry that we didn't take the versions of different packages into consideration, and we apologize for the inconvenience. We are going to discuss this problem later so that we can provide a better solution in a future version.
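The dtype=object workaround can be tried on its own, outside fastNLP. The following is a minimal sketch, not the fastNLP code itself: the variable names are illustrative, and the token id lists are copied from the output quoted in the comment. As far as we know, NumPy 1.24 and later raise a ValueError for ragged input when dtype=object is omitted, where older versions only emitted the deprecation warning. That would explain why the same code behaves differently on numpy 1.21.6 and 1.24.1.

```python
import numpy as np

# Token id lists of different lengths (a "ragged" nested sequence),
# copied from the output quoted in the comment.
contents = [
    [1212, 318, 281, 17180, 764],
    [40, 588, 22514, 764],
    [4677, 829, 389, 922, 329, 674, 1535, 764],
]

# dtype=object asks NumPy for a 1-D array whose elements are the Python
# lists themselves, so the rows do not need to share a length.
ragged = np.array(contents, dtype=object)

print(ragged.shape)  # one entry per sentence
print(ragged.dtype)
print(ragged)
```

Each element of the resulting array is still a plain Python list, so any padding or batching of the token ids has to happen in a later step.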
I've tried a few ways to use tokenizers with DataSet and DataBundle objects but have not been successful.
Basically, just trying to do this:
I've tried associating the tokenizer to the DataSet object but the same exception is encountered:
I'm using Python 3.10 and numpy 1.24.1. Are there other Python packages whose versions I need to be careful about?