[TensorFlow][Training][Sagemaker] TensorFlow 2.19.0 Currency Release #4789
Open: bhanutejagk wants to merge 89 commits into aws:master from bhanutejagk:TF2.19
89 commits
3dc6380
Building for training TF 2.19 EC2
bd6674d
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
c3259be
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
cd16864
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
80faa58
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
edfaf85
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
9eb5188
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
3db0623
Added buildspec-2-19-sm.yml file and made other changes to build sm i…
45502f6
Merge branch 'master' into TF2.19
bhanutejagk 70408d0
commented autopatch_build line in buildspec
f8bd89a
Commented out SMP_URL and its installation code logic
7087d67
Added logic for timeout and retry for failed downloads
9d58e54
Resolved error due to '=' instead of '=='
2a7ef55
Modified 'default-timeout' to '--default-timeout'
ef5743b
Merge branch 'master' into TF2.19
bhanutejagk b38268a
Merge branch 'master' into TF2.19
bhanutejagk 6b02887
Merge branch 'master' into TF2.19
bhanutejagk dea10d5
changed build to false
11d7cb7
Skipped tests which lead to failures and updated python version to 312
845638d
Merge branch 'master' into TF2.19
bhanutejagk 7fcd1e9
debugging with print statements in test_pip_check
8aa3c2a
Added decorator to skip the sagemaker tests as sm profiler binary is …
f7acca8
modified dlc_template.py and added copy command to dockerfiles for si…
34f04c4
Building the image since copy command in docker files is added
169a633
Modified main to remove telemetry test failures
fa1b332
Pinned previous versions for tensorflow-datasets & tensorflow-metadat…
3bd7184
added 3.12 py_version as not included previously
66e090e
added 2 allowlist files to handle unremovable errors
84ff7e3
reverted back to py3.10 from py 3.12 as protobuf conflict occurs and c…
10a1b4f
removed prev tags to tensorflow metadata since we reverted back to 3.…
b0cf14c
Merge branch 'master' into TF2.19
bhanutejagk ae2f718
changed tensorflow-metadata version as it is incompatible with protobuf
a292530
building updated image after yadans push
b5a0c41
Building for training TF 2.19 EC2
eb68e6f
added telemetry to bashrc and entrypoint
fe06f1f
Added cross-spawn allowlist code to remove sanity test errors
58dd208
removed build as we only change is made to remove test errors
2af2427
deleted extra docker files and rebuilding image
87bfd61
Merge branch 'master' into TF2.19
bhanutejagk 5df4eff
added telemetry to bashrc and entrypoint
2ae1a6f
doing build
626a4fa
Added logic for timeout and retry for failed downloads
6906342
Resolved error due to '=' instead of '=='
d4d006d
Modified 'default-timeout' to '--default-timeout'
3a14eda
changed build to false
37ccd15
Skipped tests which lead to failures and updated python version to 312
68739b6
modified dlc_template.py and added copy command to dockerfiles for si…
ab3b7a5
Building the image since copy command in docker files is added
af26262
added 3.12 py_version as not included previously
c69b8dd
reverted back to py3.10 from py 3.12 as protobuf conflict occurs and c…
73b542a
building updated image after yadans push
9672506
reverted back correct changes from conflicts while rebasing
a5a9119
changed code for adding telemetry
8eee9f7
added dockerd_ec2_entrypoint in buildspec
d30fa44
Merge branch 'master' into TF2.19
23a7d83
updating py to 3.12 as sagemaker toolkit blocker removed
8e40217
added 312 in test_training.py and in conftest.py as latest version
3ec8768
Merge remote-tracking branch 'upstream/master' into TF2.19
9d4a751
Installed rust and cargo & modified sagemaker-training to <5.0
5ed527f
removed the version constraints on sagemaker-training and y-py
2d7de83
modified RUST installation section
b2956a5
removed source command in rust installation
66fec1b
modified range for sagemaker version to install
a69fc39
Merge branch 'master' into TF2.19
bddcb90
removed pins for sagemaker relevant packages to see if it can find a …
7878509
modified code for telemetry entrypoints
2dac2b9
Merge branch 'master' into TF2.19
463cc91
added cve's to allow list and modified telemetry code
f078c25
Rerunning sagemaker-local-tests and sanity tests
816ca58
Removed sitecustomize file in docker files
10d688c
reverted change in 2.18 file which by mistake was changed to 2.19
80ea17d
building image and testing sanity tests.
f3b71d3
removed protobuf version constraint as sagemaker-training-toolkit is …
f8c6346
Merge branch 'master' into TF2.19
e7298a9
running all basic tests
0b80737
Testing image with deep tests
ae34cb8
enabled further testing
619c83b
reverting back toml file
07d8973
reverted back dlc_template file
be4f400
removed unwanted commented lines in docker files
38a0eac
formatted code using black
990f6bb
downgraded the black version to 23.12.1 and formatted the files
694f98a
changes done based on review. refactoring and minor changes
0f8bc9e
building image
ca19de7
refactoring code and checking if it works
09e9185
Merge branch 'master' into TF2.19
d4dca1c
trying to build the image since it failed previously
d422c3d
modified code for PYTHON, PYTHON_VERSION from ARG to ENV
20ba3a0
Merge branch 'master' into TF2.19
@@ -0,0 +1,68 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK tensorflow
+version: &VERSION 2.19.0
+short_version: &SHORT_VERSION "2.19"
+arch_type: x86
+#autopatch_build: "True"
+
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/,
+                                                     *RELEASE_REPOSITORY_NAME ]
+
+context:
+  training_context: &TRAINING_CONTEXT
+    start_cuda_compat:
+      source: docker/build_artifacts/start_cuda_compat.sh
+      target: start_cuda_compat.sh
+    dockerd-entrypoint:
+      source: docker/build_artifacts/dockerd-entrypoint.py
+      target: dockerd-entrypoint.py
+    dockerd_ec2_entrypoint:
+      source: docker/build_artifacts/dockerd_ec2_entrypoint.sh
+      target: dockerd_ec2_entrypoint.sh
+    deep_learning_container:
+      source: ../../src/deep_learning_container.py
+      target: deep_learning_container.py
+
+images:
+  BuildTensorflowSageMakerCpuPy310TrainingDockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &TENSORFLOW_CPU_TRAINING_PY3 false
+    image_size_baseline: &IMAGE_SIZE_BASELINE 7500
+    device_type: &DEVICE_TYPE cpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
+    # build_tag_override: "pr:2.16.2-cpu-py310-ubuntu20.04-sagemaker-pr-4362-autopatch"
+    target: sagemaker
+    enable_test_promotion: true
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildTensorflowSageMakerGpuPy310Cu125TrainingDockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &TENSORFLOW_GPU_TRAINING_PY3 false
+    image_size_baseline: &IMAGE_SIZE_BASELINE 11998
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    cuda_version: &CUDA_VERSION cu125
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile., *DEVICE_TYPE ]
+    # build_tag_override: "pr:2.16.2-gpu-py310-cu123-ubuntu20.04-sagemaker-pr-4362-autopatch"
+    target: sagemaker
+    enable_test_promotion: true
+    context:
+      <<: *TRAINING_CONTEXT
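For reviewers who want to sanity-check the resulting image tags and Dockerfile paths, here is a minimal Python sketch of what the `tag` and `docker_file` entries in this buildspec should expand to. It assumes the custom `!join` tag simply concatenates its list items with no separator; the helper names (`join`, `image_tag`, `docker_file`) are hypothetical and not part of the repository.

```python
# Values copied from the buildspec anchors in this diff.
VERSION = "2.19.0"
SHORT_VERSION = "2.19"
DOCKER_PYTHON_VERSION = "py3"
TAG_PYTHON_VERSION = "py312"
OS_VERSION = "ubuntu22.04"


def join(parts):
    """Assumed behaviour of the custom !join tag: plain concatenation."""
    return "".join(str(part) for part in parts)


def image_tag(device_type, cuda_version=None):
    """Mirror the `tag` entry; the GPU image adds a CUDA segment."""
    parts = [VERSION, "-", device_type, "-", TAG_PYTHON_VERSION, "-"]
    if cuda_version:
        parts += [cuda_version, "-"]
    parts += [OS_VERSION, "-sagemaker"]
    return join(parts)


def docker_file(device_type, cuda_version=None):
    """Mirror the `docker_file` entry; the GPU path adds a CUDA directory."""
    parts = ["docker/", SHORT_VERSION, "/", DOCKER_PYTHON_VERSION, "/"]
    if cuda_version:
        parts += [cuda_version, "/"]
    parts += ["Dockerfile.", device_type]
    return join(parts)


if __name__ == "__main__":
    for device, cuda in (("cpu", None), ("gpu", "cu125")):
        print(image_tag(device, cuda), "<-", docker_file(device, cuda))
```

Under that assumption the tags carry `py312`, while the image keys in the buildspec still say `Py310`.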
@@ -1 +1 @@
-buildspec_pointer: buildspec-2-18-sm.yml
+buildspec_pointer: buildspec-2-19-sm.yml
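The one-line change above redirects the pointer file from the 2.18 buildspec to the 2.19 one. As a rough illustration of how such a pointer could be followed, the Python sketch below assumes the target buildspec sits in the same directory as the pointer file. The function name `resolve_buildspec` is hypothetical, and the repository's actual resolution logic may differ.

```python
from pathlib import Path


def resolve_buildspec(pointer_file: Path) -> Path:
    """Return the buildspec path named by a `buildspec_pointer:` line.

    Assumes the target lives next to the pointer file (an assumption made
    for this sketch, not something stated in the PR).
    """
    for line in pointer_file.read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "buildspec_pointer":
            return pointer_file.parent / value.strip()
    raise ValueError(f"no buildspec_pointer entry in {pointer_file}")
```

With the new pointer contents, `resolve_buildspec` would return a path ending in `buildspec-2-19-sm.yml`.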