Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files #1033

santoshborse · 2025-02-10T19:27:58Z

Why are these changes needed?

Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files

Related issue number (if any).

#1009

Signed-off-by: Santosh Borse <[email protected]>

Signed-off-by: Maroun Touma <[email protected]>

Signed-off-by: Santosh Borse <[email protected]>

shahrokhDaijavad · 2025-02-11T00:36:14Z

Thank you very much, @santoshborse. I tested the transform by running make run-cli-sample-python, and it completed successfully.
Some things are missing, such as the test directory that we use for CI/CD testing, and I will make the README more consistent with the other READMEs. For now, please make one small change in the README:

make run-cli-sample => make run-cli-sample-python
I also have an urgent request for a simple notebook that @Hajar-Emami can use as a model for her GneissWeb recipe notebook.
An example of such a minimal notebook is this one:
https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/tokenization.ipynb
i.e.,

Show the CLI option table that you have in the README, then:

    from dpk_tokenization2arrow.transform_python import Tokenization2Arrow
    Tokenization2Arrow(input_folder="test-data/ds02/input",
                       output_folder="output",
                       .... other CLI arguments).transform()
    import glob
    glob.glob("output/*")

cmadam

One problem with this transform is that it has no tests:

$ make test-src
...
source venv/bin/activate;       \
export PYTHONPATH=../src:../: ;  \
cd test; pytest -s .
/bin/bash: line 3: cd: test: No such file or directory
============================= test session starts ==============================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/cma/de/data-prep-kit/transforms
configfile: pyproject.toml
plugins: cov-6.0.0, anyio-4.8.0
collected 0 items

============================ no tests ran in 0.01s =============================
make: *** [../../../.make.defaults:442: .defaults.test-src] Error 5

Please add some tests to the transform.
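
The log above can be read in two steps: the recipe joins its commands with `;`, so the failed `cd test` does not stop the line and pytest still runs from the wrong directory; pytest then exits with code 5, which is its "no tests collected" status, and make reports that as Error 5. A minimal sketch of the first step, assuming a POSIX `/bin/sh` is available (the `run` helper and the `echo still-ran` stand-in for pytest are illustrative, not part of the repository):

```python
import subprocess
import tempfile


def run(script: str, cwd: str) -> subprocess.CompletedProcess:
    """Run a one-line shell script, similar to how make runs a recipe line."""
    return subprocess.run(
        ["/bin/sh", "-c", script], cwd=cwd, capture_output=True, text=True
    )


with tempfile.TemporaryDirectory() as workdir:
    # The empty temporary directory has no "test" subdirectory, like the transform.
    # With ';' the failed cd is ignored and the next command still runs.
    chained_with_semicolon = run("cd test; echo still-ran", workdir)
    # With '&&' the failed cd short-circuits the rest of the line.
    chained_with_and = run("cd test && echo still-ran", workdir)

print("';' stdout:", repr(chained_with_semicolon.stdout))
print("'&&' return code:", chained_with_and.returncode)
```

Either way the fix requested in the review is the same: add a test directory with at least one test so that pytest has something to collect.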

transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform.py

transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform_python.py

transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform_ray.py

Signed-off-by: Santosh Borse <[email protected]>

transforms/universal/tokenization2arrow/Dockerfile.python

touma-I · 2025-02-11T12:31:38Z

transforms/universal/tokenization2arrow/Dockerfile.ray

Can you use the content from here ? https://github.com/IBM/data-prep-kit/blob/tokenization2arrow_transform/transforms/Dockerfile.ray.template

touma-I · 2025-02-11T12:34:40Z

transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform_ray.py

For consistency, it may be better to use ray.transform. This will make the module easier to maintain.

I think that because we have transform_python, which we use as
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow

it makes more sense to have transform_ray, but if you insist, I will change that.

touma-I · 2025-02-11T12:38:37Z

transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform.py

+        # TODO: check if we should add anything to tokenization_metadata
+        return [(bos.getvalue().to_pybytes(), ".arrow")], tokenization_metadata
+
+    def transform_binary(self, file_name: str, byte_array: bytes) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:


It is not clear why this implements transform_binary() rather than transform(). I think the code structure will be easier to understand and maintain if you implement transform() and then call super().transform() before calling transforms_to_arrow().

transform_binary returns tuple[list[tuple[bytes, str]], dict[str, Any]]
transform returns tuple[list[pa.Table], dict[str, Any]]

I am using transform_binary so that I can return the data as bytes, which lets the framework write the .arrow files.

touma-I

Please provide one or more unit tests in the test folder.

Signed-off-by: Maroun Touma <[email protected]>

touma-I · 2025-02-11T17:20:09Z

transforms/universal/tokenization2arrow/requirements.txt

Let's discuss how we can redo this one. Maybe use two requirements.txt files: one used as part of the packaging and one used for pulling in the dependency on tokenization. Also, have you considered making this module part of the tokenization module? Would inheritance be easier if this module were an extension of tokenization rather than a module of its own?

touma-I

Added the module to pyproject.toml so that it is included when building the wheel.

Signed-off-by: Santosh Borse <[email protected]>

shahrokhDaijavad · 2025-02-11T20:16:12Z

Thanks for adding the notebook, @santoshborse!

Signed-off-by: Santosh Borse <[email protected]>

santoshborse requested review from shahrokhDaijavad and touma-I February 10, 2025 19:27

First version of Tokenization2Arrow Transform

b0d1490

Signed-off-by: Santosh Borse <[email protected]>

santoshborse force-pushed the tokenization2arrow_transform branch from 4231f1d to b0d1490 Compare February 10, 2025 19:30

touma-I requested a review from cmadam February 10, 2025 19:38

touma-I and others added 2 commits February 10, 2025 14:56

enable workflow

66b1dd5

Signed-off-by: Maroun Touma <[email protected]>

Adding Docker file

68d2b7e

Signed-off-by: Santosh Borse <[email protected]>

cmadam reviewed Feb 11, 2025


santoshborse added 2 commits February 10, 2025 22:19

Adding 2 tests

16284b8

Signed-off-by: Santosh Borse <[email protected]>

Adding python test

dd2c80c

Signed-off-by: Santosh Borse <[email protected]>

touma-I reviewed Feb 11, 2025

transforms/universal/tokenization2arrow/Dockerfile.python (outdated)

touma-I reviewed Feb 11, 2025

touma-I requested changes Feb 11, 2025

Added tokenization2arrow module

4a56d77

Signed-off-by: Maroun Touma <[email protected]>

touma-I reviewed Feb 11, 2025


santoshborse added 2 commits February 11, 2025 14:05

Adding example notebook

1353b17

Removing explicit install of data-prep-toolkit

e5459ca

Signed-off-by: Santosh Borse <[email protected]>

Adding Ray based notebook

4d9723f

Signed-off-by: Santosh Borse <[email protected]>
