Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #43 (empty files creation) and improve reading/writing speed #44

Closed
wants to merge 4 commits into from

Conversation

miguelusque
Copy link
Contributor

This commit fixes issue #43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero.

In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method.

@ayushdg ayushdg requested review from ryantwolf and ayushdg May 6, 2024 22:20
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton for this @miguelusque! I love the kind of change where I can just run it on a large chunk of data, and use diff to see that everything is ok. Good catch with the speedup. I only have one minor suggestion, but we should be good.

nemo_curator/utils/file_utils.py Show resolved Hide resolved
miguelusque and others added 3 commits May 7, 2024 06:01
…g performance

This commit fixes issue NVIDIA#43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero.

In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method.

Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
There were some style errors to be fixed.

Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
@miguelusque miguelusque force-pushed the miguelusque-fix-43 branch from 11412bd to beecf84 Compare May 7, 2024 10:02
@miguelusque miguelusque requested a review from ryantwolf May 7, 2024 15:38
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again!

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the changes @miguelusque. Minor suggestion around function naming, but overall looks good to me!

nemo_curator/utils/file_utils.py Outdated Show resolved Hide resolved
@miguelusque miguelusque requested a review from ayushdg May 7, 2024 17:29
Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @miguelusque

@miguelusque miguelusque closed this May 8, 2024
@miguelusque miguelusque deleted the miguelusque-fix-43 branch May 8, 2024 13:12
@miguelusque miguelusque changed the title Fix issue #43 (empty files creation) and improve reading/writing speed Fix #43 (empty files creation) and improve reading/writing speed Jun 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants