-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add Translation Module Example #96
Conversation
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the initial example.
Does it make sense to generalize this and move it to a module similar to DistributedDataClassifier
or do you feel it's better as a standalone example?
Signed-off-by: Vibhu Jawa <[email protected]>
I think we can start with stand-alone example, just to show folks how to do generation models(like translation) with NeMo-Curator. I think a module abstracts like |
cc: @sarahyurick if you want to take a look as well. |
There is a quick change i want to make before merging in, please hold off on merging |
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
@ayushdg/@sarahyurick, Read for review again. Made the minor change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit: Not blocking.
General comment: A lot of the classifier examples work with dataframes directly rather than DocumentDataset
for IO similar to other examples. Right now there's a lot of df manipulations for it to make sense to go with DocumentDatset, but I'm wondering if we should also expose map_partitions
& apply
to the DocumentDataset
class itself.
input_files = [ | ||
os.path.join(args.input_data_dir, x) for x in os.listdir(args.input_data_dir) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can potentially replace with get_all_files_paths_under
.
With the changes require for passing translation config to CustomModel, I am getting following warning followed by an error : Warning
Error
System Info
Why I am getting this and how to resolve it? After Adding
No change in results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
My only comment is that I had to initialize and add a cache_dir=cache_dir
at L48, L52, L99, and L109. Otherwise, I would get an error:
PermissionError: [Errno 13] Permission denied: '/home/nfs/syurick/.cache/huggingface/hub/.locks/models--ai4bharat--indictrans2-en-indic-1B'
Not sure if we want to allow the user to set their own cache_dir
to handle this case, or if I'm getting this error because of setup issues on my end which we don't anticipate for users. LMK what you think.
Edit: Command I am using below.
python3 /home/nfs/syurick/NeMo-Curator/examples/translation_example.py \
--input-data-dir /home/nfs/syurick/LLM_domain_classifier_inference/justext_resiliparse_trafilatura2/ --input-file-type jsonl \
--output-data-dir /raid/syurick/translation_justext_resiliparse_trafilatura2 --output-file-type parquet \
--autocast \
--pretrained-model-name-or-path ai4bharat/indictrans2-en-indic-1B
Hi @VibhuJawa ,
it fails with error as :
It seems to me this error is coming from crossfit from here Moreover this type of error persist if we change max_length = 20 in TranslationConfig(from above) and it gave :
cc @ayushdg |
2b7c794
to
eee2e41
Compare
Closing in favor of #189 |
Description
This PR adds a translation module based on Umair ahmeds initial work. This PR adds :
Checklist
Example Command: