Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files #1033

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from

Conversation

santoshborse
Copy link
Collaborator

Why are these changes needed?

Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files

Related issue number (if any).

#1009

@santoshborse santoshborse force-pushed the tokenization2arrow_transform branch from 4231f1d to b0d1490 Compare February 10, 2025 19:30
@touma-I touma-I requested a review from cmadam February 10, 2025 19:38
touma-I and others added 2 commits February 10, 2025 14:56
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
@shahrokhDaijavad
Copy link
Member

Thank you very much, @santoshborse. I tested the transform by running make run-cli-sample-python successfully.
There are some things missing like a test directory that we use for CI/CD testing and I will make the README more consistent with the other READMEs, but for now, one small change in README:

make run-cli-sample => make run-cli-sample-python
and another urgent request for a simple notebook that @Hajar-Emami can use to mimic in her GneissWeb recipe notebook.
An example of such a minimum notebook is this one:
https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/tokenization.ipynb
i.e.,

  1. Show the CLI option table that you have in README
  2. from dpk_tokenization2arrow.transform_python import Tokenization2Arrow
  3. Tokenization2Arrow(input_folder= "test-data/ds02/input",
    output_folder= "output",
    .... other CLI arguments).transform()
  4. import glob
    glob.glob("output/*")

Copy link
Collaborator

@cmadam cmadam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem with this transform is that it has no tests:

$ make test-src
...
source venv/bin/activate;       \
export PYTHONPATH=../src:../: ;  \
cd test; pytest -s .
/bin/bash: line 3: cd: test: No such file or directory
============================================================================================================ test session starts ============================================================================================================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/cma/de/data-prep-kit/transforms
configfile: pyproject.toml
plugins: cov-6.0.0, anyio-4.8.0
collected 0 items                                                                                                                                                                                                                           

=========================================================================================================== no tests ran in 0.01s ===========================================================================================================
make: *** [../../../.make.defaults:442: .defaults.test-src] Error 5

Please add some tests to the transform.

Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, it may be better to use ray.transform. This will make it easier for maintaining the module

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think because we have transform_python which we use as,
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow

I makes more sense to have transform_ray, but if you insist I will change that

# TODO: check if we should add anything to tokenization_metadata
return [(bos.getvalue().to_pybytes(), ".arrow")], tokenization_metadata

def transform_binary(self, file_name: str, byte_array: bytes) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear why this is implementing transform_binary() and not transform(). I think the code structure will be easier to understand/maitain if you implement transform() and then call super.transform() before calling transforms_to_arrow() .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transform_binary returns - tuple[list[tuple[bytes, str]], dict[str, Any]]:
transform returns - tuple[list[pa.Table], dict[str, Any]]:

I am using transform_binary so that I can return data in bytes ( so that f/w can write .arrow files )

Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide one or more Unit Test in the test folder.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss how we can redo this one. Maybe 2 requirements.txt, one that is used as part of the packaging and one that is used for pulling the dependency on the tokenization. Also, have you considered making this module as part of the tokenization module ? Would it be easier for inheritance for this module to be an extension on the tokenization rather than its own ?

Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the module to pyproject.toml for when building the wheel.

@shahrokhDaijavad
Copy link
Member

Thanks for adding the notebook, @santoshborse!

Signed-off-by: Santosh Borse <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants