
[REVIEW] Add Translation Module Example #96

Closed

Conversation

VibhuJawa
Collaborator

@VibhuJawa VibhuJawa commented Jun 2, 2024

Description

This PR adds a translation module based on Umair Ahmed's initial work. It adds:

  • Add input tokens
  • Add crossfit sequence to sequence module
  • Add detokenization
  • Make it work with nemo-curator modules

Checklist

  • I am familiar with the Contributing Guide.
  • The documentation is up to date with these changes.

Example Command:

  
python3 translation_example.py \
  --input-data-dir /raid/vjawa/subset_CC-MAIN-2023-14_english/ --input-file-type jsonl \
  --output-data-dir /raid/vjawa/translation_CC-MAIN-2023-14_english --output-file-type parquet \
  --autocast \
  --pretrained-model-name-or-path /raid/vjawa/indictrans2-en-indic-1B/
  
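At a high level, the example wires together the three steps listed above. A minimal single-GPU sketch of the same flow with plain transformers (a conceptual illustration only, not the crossfit/Dask path the example actually takes; the trust_remote_code flag and generation settings here are assumptions):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "/raid/vjawa/indictrans2-en-indic-1B/"  # same checkpoint as in the command above

# The IndicTrans2 checkpoints ship custom code, so trust_remote_code is assumed here.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path, trust_remote_code=True).cuda().eval()

texts = ["NeMo Curator makes large-scale data curation easier."]

# 1. Tokenize the input text
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to("cuda")

# 2. Sequence-to-sequence generation (what the crossfit module batches on the GPU)
with torch.no_grad(), torch.autocast("cuda"):
    output_ids = model.generate(**inputs, max_length=256, num_beams=5)

# 3. Detokenize the generated ids back into text
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))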

@VibhuJawa VibhuJawa self-assigned this Jun 4, 2024
@VibhuJawa VibhuJawa changed the title [WIP] Add Translation Module Example [REVIEW] Add Translation Module Example Jun 4, 2024
@VibhuJawa VibhuJawa requested a review from ayushdg June 4, 2024 01:27
@VibhuJawa VibhuJawa added the enhancement New feature or request label Jun 4, 2024
Signed-off-by: Vibhu Jawa <[email protected]>
Collaborator

@ayushdg ayushdg left a comment


Thanks for the initial example.
Does it make sense to generalize this and move it to a module similar to DistributedDataClassifier or do you feel it's better as a standalone example?

Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa
Collaborator Author

Does it make sense to generalize this and move it to a module similar to DistributedDataClassifier or do you feel it's better as a standalone example?

I think we can start with a standalone example, just to show folks how to run generation models (like translation) with NeMo-Curator.

I think a module like DistributedDataClassifier abstracts away too much logic. It is useful for the models we release, but I'm unsure if we should do the same abstraction for other models. As a first step we can always start with an example and then expand from there.

@ayushdg
Collaborator

ayushdg commented Jun 6, 2024

cc: @sarahyurick if you want to take a look as well.

@VibhuJawa
Collaborator Author

There is a quick change I want to make before merging; please hold off on merging for now.

@ayushdg ayushdg marked this pull request as draft June 10, 2024 17:12
Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa VibhuJawa marked this pull request as ready for review June 10, 2024 19:30
@VibhuJawa
Collaborator Author

@ayushdg / @sarahyurick, ready for review again. I made the minor change.

Collaborator

@ayushdg ayushdg left a comment


Small nit: Not blocking.

General comment: A lot of the classifier examples work with dataframes directly rather than using DocumentDataset for IO like other examples do. Right now there are a lot of df manipulations here, so it doesn't quite make sense to go with DocumentDataset, but I'm wondering if we should also expose map_partitions & apply on the DocumentDataset class itself.
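A hypothetical sketch of what exposing those two on DocumentDataset could look like, assuming the class keeps its underlying dask dataframe on a .df attribute (these methods are made up for illustration and are not part of the current API):

class DocumentDataset:
    def __init__(self, dataset_df):
        # Underlying dask / dask-cuDF dataframe
        self.df = dataset_df

    def map_partitions(self, func, *args, **kwargs):
        # Partition-wise transform, re-wrapped so pipelines keep working with DocumentDataset
        return DocumentDataset(self.df.map_partitions(func, *args, **kwargs))

    def apply(self, func, *args, **kwargs):
        # Row-wise apply delegated to the underlying dataframe
        return DocumentDataset(self.df.apply(func, *args, **kwargs))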

Comment on lines +163 to +165
input_files = [
os.path.join(args.input_data_dir, x) for x in os.listdir(args.input_data_dir)
]
Collaborator


Can potentially replace with get_all_files_paths_under.
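Something along these lines, assuming the helper lives in nemo_curator.utils.file_utils as in the other examples:

from nemo_curator.utils.file_utils import get_all_files_paths_under

# Replaces the manual os.listdir comprehension above
input_files = get_all_files_paths_under(args.input_data_dir)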

@uahmed93

uahmed93 commented Jun 14, 2024

With the changes required for passing the translation config to CustomModel, I am getting the following warning followed by an error:

Warning

2024-06-14 02:42:03,937 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.86 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,006 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.89 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,027 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.85 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,177 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.84 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,031 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,093 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.71 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,110 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,129 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.88 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,251 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.92 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,359 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.76 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,364 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43691 (pid=1215242) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,377 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:35729 (pid=1215237) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,429 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,430 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:38207 (pid=1215234) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,513 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:42121 (pid=1215230) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,545 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.71 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,733 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:45741 (pid=1215246) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,816 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,864 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:40093 (pid=1215249) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,868 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,926 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,979 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:07,062 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:07,135 - distributed.nanny - WARNING - Restarting worker

Error

2024-06-14 02:42:28,219 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/core.py", line 175, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/core.py", line 172, in _decode_default
    return pickle.loads(sub_header["pickled-obj"], buffers=sub_frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 94, in loads
    return pickle.loads(x, buffers=buffers)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/abc.py", line 182, in host_deserialize
    obj = cls.device_deserialize(header, frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/abc.py", line 136, in device_deserialize
    return typ.deserialize(header, frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/dataframe.py", line 1178, in deserialize
    obj = super().deserialize(
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/frame.py", line 113, in deserialize
    columns = deserialize_columns(header["columns"], frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 2418, in deserialize_columns
    colobj = col_typ.deserialize(meta, frames[:col_frame_count])
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 1209, in deserialize
    data, frames = unpack(header["data"], frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 1197, in unpack
    obj = klass.deserialize(header, frames[:count])
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/buffer.py", line 444, in deserialize
    owner = owner_type._from_host_memory(frame)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/spillable_buffer.py", line 178, in _from_host_memory
    ret._finalize_init(ptr_desc={"type": "cpu", "memoryview": data})
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/spillable_buffer.py", line 113, in _finalize_init
    raise ValueError(
ValueError: cannot create <class 'cudf.core.buffer.spillable_buffer.SpillableBufferOwner'> without a global spill manager

System Info

nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-b65d5e9d-eeaa-f149-71e9-86895ba5d11d)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-70a9c47b-e350-a7f4-4d67-b90c3b8cf39c)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-22501371-c67b-1183-65bd-cb03bc220f3b)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-1025102a-cde3-bb5a-792d-941ef232cf23)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-ac9a8547-e8ad-de83-7fe1-0e44ce1b2375)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-c403abb7-f38b-548d-e8bf-e1f890b2309f)

Why am I getting this, and how can I resolve it?

After adding

os.environ["CUDF_SPILL"] = "on"

there is no change in the results.
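One possible explanation (an assumption, not something confirmed in this thread): cudf reads CUDF_SPILL when it is imported, so the variable has to be set before cudf is imported in every process involved, i.e. before the Dask-CUDA cluster and its workers are started, not afterwards. A minimal sketch of the ordering:

import os

# Must be set before cudf / dask_cuda are imported anywhere in this process,
# so that workers spawned by LocalCUDACluster inherit the setting too.
os.environ["CUDF_SPILL"] = "on"

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)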

Collaborator

@sarahyurick sarahyurick left a comment


LGTM!

My only comment is that I had to initialize a cache_dir and pass cache_dir=cache_dir at L48, L52, L99, and L109. Otherwise, I would get an error:

PermissionError: [Errno 13] Permission denied: '/home/nfs/syurick/.cache/huggingface/hub/.locks/models--ai4bharat--indictrans2-en-indic-1B'

Not sure if we want to allow the user to set their own cache_dir to handle this case, or if I'm getting this error because of setup issues on my end which we don't anticipate for users. LMK what you think.

Edit: Command I am using below.

python3 /home/nfs/syurick/NeMo-Curator/examples/translation_example.py \
--input-data-dir /home/nfs/syurick/LLM_domain_classifier_inference/justext_resiliparse_trafilatura2/ --input-file-type jsonl \
--output-data-dir /raid/syurick/translation_justext_resiliparse_trafilatura2 --output-file-type parquet \
--autocast \
--pretrained-model-name-or-path ai4bharat/indictrans2-en-indic-1B
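If we do expose it, a minimal sketch of threading a user-supplied cache directory through to the Hugging Face loaders (assuming the example uses transformers' from_pretrained; the --hf-cache-dir argument is hypothetical and the exact call sites may differ from the line numbers above):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

cache_dir = args.hf_cache_dir  # hypothetical new CLI argument, e.g. --hf-cache-dir

tokenizer = AutoTokenizer.from_pretrained(
    args.pretrained_model_name_or_path, cache_dir=cache_dir, trust_remote_code=True
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    args.pretrained_model_name_or_path, cache_dir=cache_dir, trust_remote_code=True
)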

@uahmed93

Hi @VibhuJawa,
I am getting an error about a mismatch in output tensor sizes when I provide the following translation config:

translation_config = TranslationConfig(
    pretrained_model_name_or_path=args.pretrained_model_name_or_path,
    max_length=256,
    num_beams=5,
    autocast=args.autocast,
)

it fails with the following error:

2024-06-18 23:40:52,501 - distributed.worker - WARNING - Compute Failed
Key:       ('single_partition_write_with_filename-8438c8f8f730c2b1d33a17630e343c07', 6)
Function:  subgraph_callable-02f8b235-601a-4d78-be44-a70bf33d
args:      ('outputs/', 'combine_text-60b511141364d07a62bf9fcf20d113b9', 'translate_tokens-fd51c23e5eefc82c0252f1a8e993a01e', '<crossfit.backend.torch.op.base.Predictor object a-cad1b684c98fcb3af6e5fe49db90bdc9', {'number': 6, 'division': None}, '<crossfit.op.tokenize.Tokenizer object at 0x1552ad-aede8748151e916cba92a7e6d2baedd2', {'number': 6, 'division': None}, 'preprocess_df-ee75cb0cd0a9eb07b9cb8e3dabef3b9e', 'process_input_text-7e46f561991596c150a5386e9b2fc247', 'read_single_partition-0103a9e25a6103736ebfd21f8468db77', ['inputs/text_ag.jsonl'])
kwargs:    {}
Exception: "RuntimeError('Sizes of tensors must match except in dimension 0. Expected size 80 but got size 202 for tensor number 1 in the list.')"

Traceback (most recent call last):
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 380, in <module>
    main()
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 322, in main
    main_func(args)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 309, in main_func
    write_to_disk(
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 469, in write_to_disk
    output = output.compute()
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/dask/base.py", line 379, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/dask/base.py", line 665, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/crossfit/crossfit/op/base.py", line 94, in __call__
    output = self.call(data, *args, partition_info=partition_info, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/crossfit/crossfit/backend/torch/op/base.py", line 90, in call
    outputs = cp.asarray(torch.cat(all_outputs_ls, dim=0))
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 80 but got size 202 for tensor number 1 in the list.

It seems to me this error is coming from crossfit, from here.

Moreover, this type of error persists if we change max_length = 20 in the TranslationConfig above, which gives:

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 20 but got size 18 for tensor number 11 in the list.
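For context, torch.cat only works when the non-concatenated dimensions already agree, so batches whose generated sequences were padded to different lengths cannot be concatenated directly; padding everything to a common length first avoids the error. A minimal illustration (independent of how crossfit ultimately fixes it):

import torch
import torch.nn.functional as F

batch_a = torch.zeros(4, 80, dtype=torch.long)   # one batch of generated ids, padded to 80
batch_b = torch.zeros(4, 202, dtype=torch.long)  # next batch, padded to 202

# torch.cat([batch_a, batch_b], dim=0) raises:
#   RuntimeError: Sizes of tensors must match except in dimension 0 ...

max_len = max(t.shape[1] for t in (batch_a, batch_b))
padded = [F.pad(t, (0, max_len - t.shape[1]), value=0) for t in (batch_a, batch_b)]
all_outputs = torch.cat(padded, dim=0)  # shapes now agree: (8, 202)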

cc @ayushdg

@VibhuJawa
Collaborator Author

cc @ayushdg

This should be fixed after 2b7c794

@ryantwolf
Collaborator

Closing in favor of #189

@ryantwolf ryantwolf closed this Aug 12, 2024