[data][train] Create a deepcopy of the data context on the split coordinator process #56211

justinvyu · 2025-09-03T22:49:32Z

Summary

The main change of this PR is to create a deepcopy of the base dataset's context before setting the process-global context.

Otherwise, mutations to the base dataset's context during the planning phase are also propagated to the global context, which can affect future dataset executions launched from the same process.
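The leak described above can be sketched in a few lines. This is only an illustration, not Ray's actual code: `Context`, `set_current`, `get_current`, and `plan` are hypothetical stand-ins for the data context, the process-global setter and getter, and the planning phase.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical stand-in for a dataset's context object."""
    options: dict = field(default_factory=dict)


# Hypothetical process-global context.
_global_context = Context()


def set_current(ctx: Context) -> None:
    """Install ctx as the process-global context."""
    global _global_context
    _global_context = ctx


def get_current() -> Context:
    return _global_context


def plan(ctx: Context) -> None:
    """Hypothetical planning step that mutates the context it is given."""
    ctx.options["planned"] = True


# Without a copy, the global context is the same object as the dataset's
# context, so the mutation made during planning shows up globally.
base = Context()
set_current(base)
plan(base)
leaked = "planned" in get_current().options

# With a deepcopy, the global context is independent of the dataset's
# context, so the same mutation stays local to the dataset.
base2 = Context()
set_current(copy.deepcopy(base2))
plan(base2)
isolated = "planned" not in get_current().options
```

In this sketch `leaked` and `isolated` both evaluate to `True`: the first case shares one object between the dataset and the global state, and the second does not.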

Misc. drive-by changes

- Utility to create a StorageContext from the RunConfig directly
- Pipe the DatasetShardMetadata from the outermost level, among other changes, for easier patching

gemini-code-assist

Code Review

This pull request introduces a key fix to prevent state leakage in DataContext by using a deep copy. It also includes several nice refactorings, such as simplifying DatasetsSetupCallback's initialization by using TrainRunContext, and centralizing StorageContext creation within RunConfig.

I have a few suggestions:

- A critical issue where a test is broken due to the refactoring of DatasetsSetupCallback.
- A high-severity suggestion to improve the performance of the new storage_context property by caching its result.
- A medium-severity suggestion to restore a helpful note in a docstring for better code maintainability.

Overall, the changes are well-structured and improve the codebase. Addressing the identified issues will make this PR even better.
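One common way to cache a property's result in Python is `functools.cached_property`. The sketch below only illustrates that general technique. `RunConfigSketch` and its fields are hypothetical and do not reflect the real `RunConfig` or `StorageContext` classes, and the dict stands in for whatever object the real property builds.

```python
from functools import cached_property


class RunConfigSketch:
    """Hypothetical stand-in; not the real RunConfig."""

    def __init__(self, storage_path: str, name: str):
        self.storage_path = storage_path
        self.name = name
        self.build_count = 0  # counts how often the property body runs

    @cached_property
    def storage_context(self) -> dict:
        # The body runs on first access only; the result is then stored
        # on the instance and returned directly on later accesses.
        self.build_count += 1
        return {"path": f"{self.storage_path}/{self.name}"}


config = RunConfigSketch("/tmp/results", "exp")
first = config.storage_context
second = config.storage_context
```

Both accesses return the same object and the body runs once. The trade-off is that the cached value is not recomputed if `storage_path` or `name` is changed after the first access.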

python/ray/train/v2/_internal/callbacks/datasets.py

python/ray/train/v2/api/config.py

python/ray/train/v2/_internal/execution/train_fn_utils.py

matthewdeng · 2025-09-03T23:33:52Z

python/ray/train/v2/_internal/callbacks/datasets.py

-        self._scaling_config = scaling_config
+    def __init__(self, train_run_context: TrainRunContext):
+        self._datasets = train_run_context.datasets
+        self._data_config = copy.deepcopy(train_run_context.dataset_config)


Does this need to be deepcopied?

justinvyu · 2025-09-04

I want to deepcopy to avoid modifying the user's configs in place.
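A small sketch of why a deep copy, rather than a shallow one, protects the user's config. `CallbackSketch` and the `datasets_to_split` key are hypothetical names used only for illustration, and the config is modeled as a plain dict rather than the real config class.

```python
import copy


class CallbackSketch:
    """Hypothetical callback that keeps a private copy of the user's config."""

    def __init__(self, dataset_config: dict):
        # Deep copy so nested containers are not shared with the caller.
        self._data_config = copy.deepcopy(dataset_config)

    def setup(self) -> None:
        # Internal mutation that should not be visible to the user.
        self._data_config["datasets_to_split"].append("extra")


user_config = {"datasets_to_split": ["train"]}
callback = CallbackSketch(user_config)
callback.setup()

# For contrast: a shallow copy still shares the nested list with its source,
# so mutating the nested list through the copy changes the original too.
other_config = {"datasets_to_split": ["train"]}
shallow = copy.copy(other_config)
shallow["datasets_to_split"].append("extra")
```

After this runs, `user_config` is unchanged, while `other_config` has picked up the `"extra"` entry through the shallow copy.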

srinathk10

LGTM

justinvyu added 3 commits September 3, 2025 15:35

[oss] pass train run context into ds callback + build storage context…

ea6b2e8

… from runconfig Signed-off-by: Justin Yu <[email protected]>

port over some changes

3cfb45f

Signed-off-by: Justin Yu <[email protected]>

pass ds metadata from the outermost level

ab1d8d5

Signed-off-by: Justin Yu <[email protected]>

justinvyu requested review from a team as code owners September 3, 2025 22:49

gemini-code-assist bot reviewed Sep 3, 2025

update gen dataset import

e437ec2

Signed-off-by: Justin Yu <[email protected]>

matthewdeng approved these changes Sep 3, 2025

ray-gardener bot added the train (Ray Train Related Issue) and data (Ray Data-related issues) labels Sep 4, 2025

srinathk10 approved these changes Sep 4, 2025

justinvyu added 3 commits September 3, 2025 23:51

Merge branch 'master' of https://github.com/ray-project/ray into ds_c…

b317321

…allback_run_context Signed-off-by: Justin Yu <[email protected]>

fix test

275c5bd

Signed-off-by: Justin Yu <[email protected]>

cached property

2cd38cc

Signed-off-by: Justin Yu <[email protected]>

justinvyu enabled auto-merge (squash) September 4, 2025 07:43

github-actions bot added the go (add ONLY when ready to merge, run all tests) label Sep 4, 2025

justinvyu merged commit 6f3689a into ray-project:master Sep 4, 2025
6 of 7 checks passed

justinvyu deleted the ds_callback_run_context branch September 4, 2025 15:43
