-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[data][train] Create a deepcopy of the data context on the split coordinator process #56211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data][train] Create a deepcopy of the data context on the split coordinator process #56211
Conversation
… from runconfig Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a key fix to prevent state leakage in DataContext
by using a deep copy. It also includes several nice refactorings, such as simplifying DatasetsSetupCallback
's initialization by using TrainRunContext
, and centralizing StorageContext
creation within RunConfig
.
I have a few suggestions:
- A critical issue where a test is broken due to the refactoring of
DatasetsSetupCallback
. - A high-severity suggestion to improve the performance of the new
storage_context
property by caching its result. - A medium-severity suggestion to restore a helpful note in a docstring for better code maintainability.
Overall, the changes are well-structured and improve the codebase. Addressing the identified issues will make this PR even better.
Signed-off-by: Justin Yu <[email protected]>
self._scaling_config = scaling_config | ||
def __init__(self, train_run_context: TrainRunContext): | ||
self._datasets = train_run_context.datasets | ||
self._data_config = copy.deepcopy(train_run_context.dataset_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be deepcopied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to deepcopy to avoid modifying the user's configs in place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…allback_run_context Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Summary
The main change of this PR is to create a deepcopy of the base dataset's context before setting the process-global context.
Otherwise, mutations to the base dataset's context during the planning phase are also propagated to the global context, which can affect future dataset executions launched from the same process.
Misc. drive-by changes
StorageContext
from theRunConfig
directlyDatasetShardMetadata
from the outermost level among other changes, for easier patching