
[REVIEW] Add Translation Module Example #96

Closed

Conversation

VibhuJawa
Collaborator

@VibhuJawa VibhuJawa commented Jun 2, 2024

Description

This PR adds a translation module based on Umair Ahmed's initial work. It adds:

  • Add input tokens
  • Add crossfit sequence to sequence module
  • Add detokenization
  • Make it work with nemo-curator modules

Checklist

  • I am familiar with the Contributing Guide.
  • The documentation is up to date with these changes.

Example Command:

  
python3 translation_example.py \
  --input-data-dir /raid/vjawa/subset_CC-MAIN-2023-14_english/ --input-file-type jsonl \
  --output-data-dir /raid/vjawa/translation_CC-MAIN-2023-14_english --output-file-type parquet \
  --autocast \
  --pretrained-model-name-or-path /raid/vjawa/indictrans2-en-indic-1B/
  
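At a high level, the example wires together the three steps listed above. A minimal single-GPU sketch of the same flow with plain transformers (a conceptual illustration only, not the crossfit/Dask path the example actually takes; the trust_remote_code flag and generation settings here are assumptions):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "/raid/vjawa/indictrans2-en-indic-1B/"  # same checkpoint as in the command above

# The IndicTrans2 checkpoints ship custom code, so trust_remote_code is assumed here.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path, trust_remote_code=True).cuda().eval()

texts = ["NeMo Curator makes large-scale data curation easier."]

# 1. Tokenize the input text
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to("cuda")

# 2. Sequence-to-sequence generation (what the crossfit module batches on the GPU)
with torch.no_grad(), torch.autocast("cuda"):
    output_ids = model.generate(**inputs, max_length=256, num_beams=5)

# 3. Detokenize the generated ids back into text
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))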

@VibhuJawa VibhuJawa self-assigned this Jun 4, 2024
@VibhuJawa VibhuJawa changed the title [WIP] Add Translation Module Example [REVIEW] Add Translation Module Example Jun 4, 2024
@VibhuJawa VibhuJawa requested a review from ayushdg June 4, 2024 01:27
@VibhuJawa VibhuJawa added the enhancement New feature or request label Jun 4, 2024
Signed-off-by: Vibhu Jawa <[email protected]>
Collaborator

@ayushdg ayushdg left a comment


Thanks for the initial example.
Does it make sense to generalize this and move it to a module similar to DistributedDataClassifier or do you feel it's better as a standalone example?

Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa
Collaborator Author

Does it make sense to generalize this and move it to a module similar to DistributedDataClassifier or do you feel it's better as a standalone example?

I think we can start with a standalone example, just to show folks how to run generation models (like translation) with NeMo-Curator.

I think a module like DistributedDataClassifier abstracts away too much logic. It is useful for the models we release, but I'm unsure if we should do the same abstraction for other models. As a first step we can always start with an example and then expand from there.

@ayushdg
Collaborator

ayushdg commented Jun 6, 2024

cc: @sarahyurick if you want to take a look as well.

@VibhuJawa
Collaborator Author

There is a quick change I want to make before merging; please hold off on merging for now.

@ayushdg ayushdg marked this pull request as draft June 10, 2024 17:12
Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa VibhuJawa marked this pull request as ready for review June 10, 2024 19:30
@VibhuJawa
Collaborator Author

@ayushdg / @sarahyurick, ready for review again. I made the minor change.

Collaborator

@ayushdg ayushdg left a comment


Small nit: Not blocking.

General comment: A lot of the classifier examples work with dataframes directly rather than using DocumentDataset for IO like other examples do. Right now there are a lot of df manipulations here, so it doesn't quite make sense to go with DocumentDataset, but I'm wondering if we should also expose map_partitions & apply on the DocumentDataset class itself.
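A hypothetical sketch of what exposing those two on DocumentDataset could look like, assuming the class keeps its underlying dask dataframe on a .df attribute (these methods are made up for illustration and are not part of the current API):

class DocumentDataset:
    def __init__(self, dataset_df):
        # Underlying dask / dask-cuDF dataframe
        self.df = dataset_df

    def map_partitions(self, func, *args, **kwargs):
        # Partition-wise transform, re-wrapped so pipelines keep working with DocumentDataset
        return DocumentDataset(self.df.map_partitions(func, *args, **kwargs))

    def apply(self, func, *args, **kwargs):
        # Row-wise apply delegated to the underlying dataframe
        return DocumentDataset(self.df.apply(func, *args, **kwargs))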

Comment on lines +163 to +165
input_files = [
os.path.join(args.input_data_dir, x) for x in os.listdir(args.input_data_dir)
]
Collaborator


Can potentially replace with get_all_files_paths_under.
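Something along these lines, assuming the helper lives in nemo_curator.utils.file_utils as in the other examples:

from nemo_curator.utils.file_utils import get_all_files_paths_under

# Replaces the manual os.listdir comprehension above
input_files = get_all_files_paths_under(args.input_data_dir)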

@uahmed93

uahmed93 commented Jun 14, 2024

With the changes required for passing the translation config to CustomModel, I am getting the following warning followed by an error:

Warning

2024-06-14 02:42:03,937 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.86 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,006 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.89 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,027 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.85 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,177 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.84 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,031 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,093 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.71 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,110 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,129 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.88 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,251 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.92 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,359 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.76 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,364 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43691 (pid=1215242) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,377 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:35729 (pid=1215237) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,429 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,430 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:38207 (pid=1215234) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,513 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:42121 (pid=1215230) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,545 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.71 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,733 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:45741 (pid=1215246) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,816 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,864 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:40093 (pid=1215249) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,868 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,926 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,979 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:07,062 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:07,135 - distributed.nanny - WARNING - Restarting worker

Error

2024-06-14 02:42:28,219 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/core.py", line 175, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/core.py", line 172, in _decode_default
    return pickle.loads(sub_header["pickled-obj"], buffers=sub_frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 94, in loads
    return pickle.loads(x, buffers=buffers)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/abc.py", line 182, in host_deserialize
    obj = cls.device_deserialize(header, frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/abc.py", line 136, in device_deserialize
    return typ.deserialize(header, frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/dataframe.py", line 1178, in deserialize
    obj = super().deserialize(
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/frame.py", line 113, in deserialize
    columns = deserialize_columns(header["columns"], frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 2418, in deserialize_columns
    colobj = col_typ.deserialize(meta, frames[:col_frame_count])
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 1209, in deserialize
    data, frames = unpack(header["data"], frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 1197, in unpack
    obj = klass.deserialize(header, frames[:count])
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/buffer.py", line 444, in deserialize
    owner = owner_type._from_host_memory(frame)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/spillable_buffer.py", line 178, in _from_host_memory
    ret._finalize_init(ptr_desc={"type": "cpu", "memoryview": data})
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/spillable_buffer.py", line 113, in _finalize_init
    raise ValueError(
ValueError: cannot create <class 'cudf.core.buffer.spillable_buffer.SpillableBufferOwner'> without a global spill manager

System Info

nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-b65d5e9d-eeaa-f149-71e9-86895ba5d11d)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-70a9c47b-e350-a7f4-4d67-b90c3b8cf39c)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-22501371-c67b-1183-65bd-cb03bc220f3b)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-1025102a-cde3-bb5a-792d-941ef232cf23)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-ac9a8547-e8ad-de83-7fe1-0e44ce1b2375)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-c403abb7-f38b-548d-e8bf-e1f890b2309f)

Why am I getting this, and how can I resolve it?

After adding

os.environ["CUDF_SPILL"] = "on"

there is no change in the results.
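One possible explanation (an assumption, not something confirmed in this thread): cudf reads CUDF_SPILL when it is imported, so the variable has to be set before cudf is imported in every process involved, i.e. before the Dask-CUDA cluster and its workers are started, not afterwards. A minimal sketch of the ordering:

import os

# Must be set before cudf / dask_cuda are imported anywhere in this process,
# so that workers spawned by LocalCUDACluster inherit the setting too.
os.environ["CUDF_SPILL"] = "on"

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)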

Collaborator

@sarahyurick sarahyurick left a comment


LGTM!

My only comment is that I had to initialize a cache_dir and pass cache_dir=cache_dir at L48, L52, L99, and L109. Otherwise, I would get an error:

PermissionError: [Errno 13] Permission denied: '/home/nfs/syurick/.cache/huggingface/hub/.locks/models--ai4bharat--indictrans2-en-indic-1B'

Not sure if we want to allow the user to set their own cache_dir to handle this case, or if I'm getting this error because of setup issues on my end which we don't anticipate for users. LMK what you think.

Edit: Command I am using below.

python3 /home/nfs/syurick/NeMo-Curator/examples/translation_example.py \
--input-data-dir /home/nfs/syurick/LLM_domain_classifier_inference/justext_resiliparse_trafilatura2/ --input-file-type jsonl \
--output-data-dir /raid/syurick/translation_justext_resiliparse_trafilatura2 --output-file-type parquet \
--autocast \
--pretrained-model-name-or-path ai4bharat/indictrans2-en-indic-1B
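If we do expose it, a minimal sketch of threading a user-supplied cache directory through to the Hugging Face loaders (assuming the example uses transformers' from_pretrained; the --hf-cache-dir argument is hypothetical and the exact call sites may differ from the line numbers above):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

cache_dir = args.hf_cache_dir  # hypothetical new CLI argument, e.g. --hf-cache-dir

tokenizer = AutoTokenizer.from_pretrained(
    args.pretrained_model_name_or_path, cache_dir=cache_dir, trust_remote_code=True
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    args.pretrained_model_name_or_path, cache_dir=cache_dir, trust_remote_code=True
)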

@uahmed93

Hi @VibhuJawa,
I am getting an error about a mismatch in output tensor sizes when I provide the following translation config:

translation_config = TranslationConfig(
    pretrained_model_name_or_path=args.pretrained_model_name_or_path,
    max_length=256,
    num_beams=5,
    autocast=args.autocast,
)

it fails with the following error:

2024-06-18 23:40:52,501 - distributed.worker - WARNING - Compute Failed
Key:       ('single_partition_write_with_filename-8438c8f8f730c2b1d33a17630e343c07', 6)
Function:  subgraph_callable-02f8b235-601a-4d78-be44-a70bf33d
args:      ('outputs/', 'combine_text-60b511141364d07a62bf9fcf20d113b9', 'translate_tokens-fd51c23e5eefc82c0252f1a8e993a01e', '<crossfit.backend.torch.op.base.Predictor object a-cad1b684c98fcb3af6e5fe49db90bdc9', {'number': 6, 'division': None}, '<crossfit.op.tokenize.Tokenizer object at 0x1552ad-aede8748151e916cba92a7e6d2baedd2', {'number': 6, 'division': None}, 'preprocess_df-ee75cb0cd0a9eb07b9cb8e3dabef3b9e', 'process_input_text-7e46f561991596c150a5386e9b2fc247', 'read_single_partition-0103a9e25a6103736ebfd21f8468db77', ['inputs/text_ag.jsonl'])
kwargs:    {}
Exception: "RuntimeError('Sizes of tensors must match except in dimension 0. Expected size 80 but got size 202 for tensor number 1 in the list.')"

Traceback (most recent call last):
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 380, in <module>
    main()
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 322, in main
    main_func(args)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 309, in main_func
    write_to_disk(
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 469, in write_to_disk
    output = output.compute()
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/dask/base.py", line 379, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/dask/base.py", line 665, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/crossfit/crossfit/op/base.py", line 94, in __call__
    output = self.call(data, *args, partition_info=partition_info, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/crossfit/crossfit/backend/torch/op/base.py", line 90, in call
    outputs = cp.asarray(torch.cat(all_outputs_ls, dim=0))
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 80 but got size 202 for tensor number 1 in the list.

It seems to me this error is coming from crossfit, from here.

Moreover, this type of error persists if we change max_length = 20 in the TranslationConfig above, which gives:

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 20 but got size 18 for tensor number 11 in the list.
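For context, torch.cat only works when the non-concatenated dimensions already agree, so batches whose generated sequences were padded to different lengths cannot be concatenated directly; padding everything to a common length first avoids the error. A minimal illustration (independent of how crossfit ultimately fixes it):

import torch
import torch.nn.functional as F

batch_a = torch.zeros(4, 80, dtype=torch.long)   # one batch of generated ids, padded to 80
batch_b = torch.zeros(4, 202, dtype=torch.long)  # next batch, padded to 202

# torch.cat([batch_a, batch_b], dim=0) raises:
#   RuntimeError: Sizes of tensors must match except in dimension 0 ...

max_len = max(t.shape[1] for t in (batch_a, batch_b))
padded = [F.pad(t, (0, max_len - t.shape[1]), value=0) for t in (batch_a, batch_b)]
all_outputs = torch.cat(padded, dim=0)  # shapes now agree: (8, 202)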

cc @ayushdg

@VibhuJawa
Collaborator Author

cc @ayushdg

This should be fixed after 2b7c794

@ryantwolf
Collaborator

Closing in favor of #189

@ryantwolf ryantwolf closed this Aug 12, 2024