Skip to content

Conversation

justinvyu
Copy link
Contributor

Summary

The main change of this PR is to create a deepcopy of the base dataset's context before setting the process-global context.

Otherwise, mutations to the base dataset's context during the planning phase are also propagated to the global context, which can affect future dataset executions launched from the same process.

Misc. drive-by changes

  • Utility to create a StorageContext from the RunConfig directly
  • Pipe the DatasetShardMetadata from the outermost level among other changes, for easier patching

@justinvyu justinvyu requested review from a team as code owners September 3, 2025 22:49
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a key fix to prevent state leakage in DataContext by using a deep copy. It also includes several nice refactorings, such as simplifying DatasetsSetupCallback's initialization by using TrainRunContext, and centralizing StorageContext creation within RunConfig.

I have a few suggestions:

  • A critical issue where a test is broken due to the refactoring of DatasetsSetupCallback.
  • A high-severity suggestion to improve the performance of the new storage_context property by caching its result.
  • A medium-severity suggestion to restore a helpful note in a docstring for better code maintainability.

Overall, the changes are well-structured and improve the codebase. Addressing the identified issues will make this PR even better.

Signed-off-by: Justin Yu <[email protected]>
self._scaling_config = scaling_config
def __init__(self, train_run_context: TrainRunContext):
self._datasets = train_run_context.datasets
self._data_config = copy.deepcopy(train_run_context.dataset_config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be deepcopied?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to deepcopy to avoid modifying the user's configs in place

@ray-gardener ray-gardener bot added train Ray Train Related Issue data Ray Data-related issues labels Sep 4, 2025
Copy link
Contributor

@srinathk10 srinathk10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
@justinvyu justinvyu enabled auto-merge (squash) September 4, 2025 07:43
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 4, 2025
@justinvyu justinvyu merged commit 6f3689a into ray-project:master Sep 4, 2025
6 of 7 checks passed
@justinvyu justinvyu deleted the ds_callback_run_context branch September 4, 2025 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues go add ONLY when ready to merge, run all tests train Ray Train Related Issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants