Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files #1033
base: dev
Conversation
Signed-off-by: Santosh Borse <[email protected]>
Force-pushed from 4231f1d to b0d1490
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Thank you very much, @santoshborse. I tested the transform by running
One problem with this transform is that it has no tests:
```
$ make test-src
...
source venv/bin/activate; \
export PYTHONPATH=../src:../: ; \
cd test; pytest -s .
/bin/bash: line 3: cd: test: No such file or directory
========================== test session starts ==========================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/cma/de/data-prep-kit/transforms
configfile: pyproject.toml
plugins: cov-6.0.0, anyio-4.8.0
collected 0 items
========================= no tests ran in 0.01s =========================
make: *** [../../../.make.defaults:442: .defaults.test-src] Error 5
```
Please add some tests to the transform.
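A minimal sketch of a starting point (this only exercises the Arrow IPC round-trip contract the transform relies on, not the transform class itself, whose constructor isn't shown in this thread):

```python
# test/test_tokenization2arrow.py — minimal sketch so `make test-src` collects
# at least one test. It checks the "bytes in, .arrow bytes out" contract at the
# pyarrow level; a real test should also drive the transform class itself.
import pyarrow as pa


def test_arrow_ipc_roundtrip():
    table = pa.table({"document_id": ["doc-1"], "tokens": [[1, 2, 3]]})

    # Serialize the way the transform does: Arrow IPC file bytes.
    bos = pa.BufferOutputStream()
    with pa.ipc.new_file(bos, table.schema) as writer:
        writer.write_table(table)
    data = bos.getvalue().to_pybytes()

    # Anything emitted with an ".arrow" extension must read back as an IPC file.
    restored = pa.ipc.open_file(pa.BufferReader(data)).read_all()
    assert restored.equals(table)
```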
Resolved review threads:
- transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform.py (outdated)
- transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform_python.py
- transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform_ray.py
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Can you use the content from here? https://github.com/IBM/data-prep-kit/blob/tokenization2arrow_transform/transforms/Dockerfile.ray.template
For consistency, it may be better to use ray.transform. This will make the module easier to maintain.
I think because we have `transform_python`, which we use as `from dpk_tokenization2arrow.transform_python import Tokenization2Arrow`, it makes more sense to have `transform_ray`, but if you insist I will change that.
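For what it's worth, the symmetry being argued for looks like this (the Ray import is the proposed naming, not confirmed API):

```python
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow   # pure-Python runtime
# from dpk_tokenization2arrow.transform_ray import Tokenization2Arrow    # Ray runtime (proposed)
```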
```python
# TODO: check if we should add anything to tokenization_metadata
return [(bos.getvalue().to_pybytes(), ".arrow")], tokenization_metadata


def transform_binary(self, file_name: str, byte_array: bytes) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
```
Not clear why this is implementing transform_binary() and not transform(). I think the code structure will be easier to understand/maintain if you implement transform() and then call super().transform() before calling transforms_to_arrow().
`transform_binary` returns `tuple[list[tuple[bytes, str]], dict[str, Any]]`, while `transform` returns `tuple[list[pa.Table], dict[str, Any]]`. I am using `transform_binary` so that I can return data in bytes (so that the framework can write the .arrow files).
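To make the bytes contract concrete, here is a minimal sketch of the serialization step behind `bos.getvalue().to_pybytes()` above (the function names are illustrative; only the pyarrow calls are real API):

```python
from typing import Any

import pyarrow as pa


def table_to_arrow_bytes(table: pa.Table) -> bytes:
    """Illustrative helper: serialize a pyarrow Table to Arrow IPC file bytes."""
    bos = pa.BufferOutputStream()
    with pa.ipc.new_file(bos, table.schema) as writer:
        writer.write_table(table)
    return bos.getvalue().to_pybytes()


def transform_binary_sketch(tokenized: pa.Table, metadata: dict[str, Any]) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
    # The framework writes each (bytes, extension) pair out for us, so
    # returning ".arrow" bytes is what makes it emit .arrow files.
    return [(table_to_arrow_bytes(tokenized), ".arrow")], metadata
```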
Please provide one or more unit tests in the test folder.
Signed-off-by: Maroun Touma <[email protected]>
Let's discuss how we can redo this one. Maybe two requirements.txt files: one used as part of the packaging and one used for pulling in the tokenization dependency. Also, have you considered making this module part of the tokenization module? Would it be easier, for inheritance, for this module to be an extension of tokenization rather than its own module?
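A sketch of the inheritance option (module, class, and method names here are assumptions based on the repo layout, not confirmed API):

```python
from typing import Any

# Assumed import path for the existing tokenization transform.
from dpk_tokenization.transform import TokenizationTransform


class Tokenization2ArrowTransform(TokenizationTransform):
    """Extend tokenization and change only the output encoding to .arrow."""

    def transform_binary(self, file_name: str, byte_array: bytes) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
        # Hypothetical structure: let the parent tokenize, then re-encode
        # its tables as Arrow IPC bytes with an ".arrow" extension.
        ...
```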
Added the module to pyproject.toml so it is included when building the wheel.
Signed-off-by: Santosh Borse <[email protected]>
Thanks for adding the notebook, @santoshborse!
Signed-off-by: Santosh Borse <[email protected]>
Why are these changes needed?
Tokenization2Arrow is a new transform that tokenizes data and generates .arrow and metadata files.
Related issue number (if any).
#1009
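A sketch of the intended usage, based on the import shown earlier in the thread (the constructor parameters follow the usual data-prep-kit input_folder/output_folder convention and are assumptions here):

```python
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow

# Parameter names assumed from the data-prep-kit runtime convention: read
# parquet from input_folder, write .arrow plus metadata files to output_folder.
Tokenization2Arrow(
    input_folder="input",
    output_folder="output",
).transform()
```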