[TensorFlow][Training][Sagemaker] TensorFlow 2.19.0 Currency Release #4789

Open
wants to merge 89 commits into
base: master
89 commits
3dc6380
Building for training TF 2.19 EC2
May 8, 2025
bd6674d
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
May 8, 2025
c3259be
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
May 8, 2025
cd16864
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
May 8, 2025
80faa58
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
May 8, 2025
edfaf85
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
May 8, 2025
9eb5188
[TensorFlow][Training][EC2] TensorFlow 2.19.0 Currency Release
May 8, 2025
3db0623
Added buildspec-2-19-sm.yml file and made other changes to build sm i…
May 8, 2025
45502f6
Merge branch 'master' into TF2.19
bhanutejagk May 8, 2025
70408d0
commented autopatch_build line in buildspec
May 8, 2025
f8bd89a
Commented out SMP_URL and its installation code logic
May 8, 2025
7087d67
Added logic for timeout and retry for failed downloads
May 8, 2025
9d58e54
Resolved error due to '=' instead of '=='
May 9, 2025
2a7ef55
Modified 'default-timeout' to '--default-timeout'
May 9, 2025
ef5743b
Merge branch 'master' into TF2.19
bhanutejagk May 9, 2025
b38268a
Merge branch 'master' into TF2.19
bhanutejagk May 12, 2025
6b02887
Merge branch 'master' into TF2.19
bhanutejagk May 12, 2025
dea10d5
changed build to false
May 12, 2025
11d7cb7
Skipped tests which lead to failures and updated python version to 312
May 13, 2025
845638d
Merge branch 'master' into TF2.19
bhanutejagk May 13, 2025
7fcd1e9
debugging with print statements in test_pip_check
May 13, 2025
8aa3c2a
Added decorator to skip the sagemaker tests as sm profiler binary is …
May 14, 2025
f7acca8
modified dlc_template.py and added copy command to dockerfiles for si…
May 14, 2025
34f04c4
Building the image since copy command in docker files is added
May 14, 2025
169a633
Modified main to remove telemetry test failures
May 14, 2025
fa1b332
Pinned previous versions for tensorflow-datasets & tensorflow-metadat…
May 14, 2025
3bd7184
added 3.12 py_version as not included previously
May 14, 2025
66e090e
added 2 allowlist files to handle unremovable errors
May 14, 2025
84ff7e3
reverted back to py3.10 from py3.12 as protobuf conflict occurs and c…
May 15, 2025
10a1b4f
removed prev tags to tensorflow metadata since we reverted back to 3.…
May 16, 2025
b0cf14c
Merge branch 'master' into TF2.19
bhanutejagk May 16, 2025
ae2f718
changed tensorflow-metadata version as it is incompatible with protobuf
May 16, 2025
a292530
building updated image after yadans push
May 27, 2025
b5a0c41
Building for training TF 2.19 EC2
May 8, 2025
eb68e6f
added telemetry to bashrc and entrypoint
May 28, 2025
fe06f1f
Added cross-spawn allowlist code to remove sanity test errors
May 29, 2025
58dd208
removed build as the only change made is to remove test errors
May 29, 2025
2af2427
deleted extra docker files and rebuilding image
May 29, 2025
87bfd61
Merge branch 'master' into TF2.19
bhanutejagk May 29, 2025
5df4eff
added telemetry to bashrc and entrypoint
May 29, 2025
2ae1a6f
doing build
May 29, 2025
626a4fa
Added logic for timeout and retry for failed downloads
May 8, 2025
6906342
Resolved error due to '=' instead of '=='
May 9, 2025
d4d006d
Modified 'default-timeout' to '--default-timeout'
May 9, 2025
3a14eda
changed build to false
May 12, 2025
37ccd15
Skipped tests which lead to failures and updated python version to 312
May 13, 2025
68739b6
modified dlc_template.py and added copy command to dockerfiles for si…
May 14, 2025
ab3b7a5
Building the image since copy command in docker files is added
May 14, 2025
af26262
added 3.12 py_version as not included previously
May 14, 2025
c69b8dd
reverted back to py3.10 from py3.12 as protobuf conflict occurs and c…
May 15, 2025
73b542a
building updated image after yadans push
May 27, 2025
9672506
reverted back correct changes from conflicts while rebasing
May 30, 2025
a5a9119
changed code for adding telemetry
May 30, 2025
8eee9f7
added dockerd_ec2_entrypoint in buildspec
May 30, 2025
d30fa44
Merge branch 'master' into TF2.19
Jun 11, 2025
23a7d83
updating py to 3.12 as sagemaker toolkit blocker removed
Jun 11, 2025
8e40217
added 312 in test_training.py and in conftest.py as latest version
Jun 16, 2025
3ec8768
Merge remote-tracking branch 'upstream/master' into TF2.19
Jun 17, 2025
9d4a751
Installed rust and cargo & modified sagemaker-training to <5.0
Jun 17, 2025
5ed527f
removed the version constraints on sagemaker-training and y-py
Jun 17, 2025
2d7de83
modified RUST installation section
Jun 18, 2025
b2956a5
removed source command in rust installation
Jun 18, 2025
66fec1b
modified range for sagemaker version to install
Jun 18, 2025
a69fc39
Merge branch 'master' into TF2.19
Jun 18, 2025
bddcb90
removed pins for sagemaker relevant packages to see if it can find a …
Jun 19, 2025
7878509
modified code for telemetry entrypoints
Jun 19, 2025
2dac2b9
Merge branch 'master' into TF2.19
Jun 19, 2025
463cc91
added CVEs to allowlist and modified telemetry code
Jun 19, 2025
f078c25
Rerunning sagemaker-local-tests and sanity tests
Jun 20, 2025
816ca58
Removed sitecustomize file in docker files
Jun 20, 2025
10d688c
reverted change in 2.18 file which by mistake was changed to 2.19
Jun 20, 2025
80ea17d
building image and testing sanity tests.
Jun 20, 2025
f3b71d3
removed protobuf version constraint as sagemaker-training-toolkit is …
Jun 20, 2025
f8c6346
Merge branch 'master' into TF2.19
Jun 20, 2025
e7298a9
running all basic tests
Jun 20, 2025
0b80737
Testing image with deep tests
Jun 20, 2025
ae34cb8
enabled further testing
Jun 20, 2025
619c83b
reverting back toml file
Jun 20, 2025
07d8973
reverted back dlc_template file
Jun 20, 2025
be4f400
removed unwanted commented lines in docker files
Jun 21, 2025
38a0eac
formatted code using black
Jun 23, 2025
990f6bb
downgraded the black version to 23.12.1 and formatted the files
Jun 23, 2025
694f98a
changes done based on review. refactoring and minor changes
Jun 25, 2025
0f8bc9e
building image
Jun 25, 2025
ca19de7
refactoring code and checking if it works
Jun 25, 2025
09e9185
Merge branch 'master' into TF2.19
Jun 25, 2025
d4dca1c
trying to build the image since it failed previously
Jun 25, 2025
d422c3d
modified code for PYTHON, PYTHON_VERSION from ARG to ENV
Jun 26, 2025
20ba3a0
Merge branch 'master' into TF2.19
Jun 26, 2025
8 changes: 4 additions & 4 deletions dlc_developer_config.toml
@@ -37,12 +37,12 @@ deep_canary_mode = false
[build]
# Add in frameworks you would like to build. By default, builds are disabled unless you specify building an image.
# available frameworks - ["base", "vllm", "autogluon", "huggingface_tensorflow", "huggingface_pytorch", "huggingface_tensorflow_trcomp", "huggingface_pytorch_trcomp", "pytorch_trcomp", "tensorflow", "pytorch", "stabilityai_pytorch"]
-build_frameworks = []
+build_frameworks = ["tensorflow"]


# By default we build both training and inference containers. Set true/false values to determine which to build.
build_training = true
-build_inference = true
+build_inference = false

# Set do_build to "false" to skip builds and test the latest image built by this PR
# Note: at least one build is required to set do_build to "false"
@@ -120,7 +120,7 @@ use_scheduler = false

# Standard Framework Training
dlc-pr-pytorch-training = ""
-dlc-pr-tensorflow-2-training = ""
+dlc-pr-tensorflow-2-training = "tensorflow/training/buildspec-2-19-sm.yml"
dlc-pr-autogluon-training = ""

# ARM64 Training
@@ -176,4 +176,4 @@ dlc-pr-stabilityai-pytorch-inference = ""

# EIA Inference
dlc-pr-pytorch-eia-inference = ""
-dlc-pr-tensorflow-2-eia-inference = ""
+dlc-pr-tensorflow-2-eia-inference = ""
68 changes: 68 additions & 0 deletions tensorflow/training/buildspec-2-19-sm.yml
@@ -0,0 +1,68 @@
account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
prod_account_id: &PROD_ACCOUNT_ID 763104351884
region: &REGION <set-$REGION-in-environment>
framework: &FRAMEWORK tensorflow
version: &VERSION 2.19.0
short_version: &SHORT_VERSION "2.19"
arch_type: x86
#autopatch_build: "True"

repository_info:
  training_repository: &TRAINING_REPOSITORY
    image_type: &TRAINING_IMAGE_TYPE training
    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
    repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE]
    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/,
      *RELEASE_REPOSITORY_NAME ]

context:
  training_context: &TRAINING_CONTEXT
    start_cuda_compat:
      source: docker/build_artifacts/start_cuda_compat.sh
      target: start_cuda_compat.sh
    dockerd-entrypoint:
      source: docker/build_artifacts/dockerd-entrypoint.py
      target: dockerd-entrypoint.py
    dockerd_ec2_entrypoint:
      source: docker/build_artifacts/dockerd_ec2_entrypoint.sh
      target: dockerd_ec2_entrypoint.sh
    deep_learning_container:
      source: ../../src/deep_learning_container.py
      target: deep_learning_container.py

images:
  BuildTensorflowSageMakerCpuPy310TrainingDockerImage:
    <<: *TRAINING_REPOSITORY
    build: &TENSORFLOW_CPU_TRAINING_PY3 false
    image_size_baseline: &IMAGE_SIZE_BASELINE 7500
    device_type: &DEVICE_TYPE cpu
    python_version: &DOCKER_PYTHON_VERSION py3
    tag_python_version: &TAG_PYTHON_VERSION py312
    os_version: &OS_VERSION ubuntu22.04
    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
    # build_tag_override: "pr:2.16.2-cpu-py310-ubuntu20.04-sagemaker-pr-4362-autopatch"
    target: sagemaker
    enable_test_promotion: true
    context:
      <<: *TRAINING_CONTEXT
  BuildTensorflowSageMakerGpuPy310Cu125TrainingDockerImage:
    <<: *TRAINING_REPOSITORY
    build: &TENSORFLOW_GPU_TRAINING_PY3 false
    image_size_baseline: &IMAGE_SIZE_BASELINE 11998
    device_type: &DEVICE_TYPE gpu
    python_version: &DOCKER_PYTHON_VERSION py3
    tag_python_version: &TAG_PYTHON_VERSION py312
    cuda_version: &CUDA_VERSION cu125
    os_version: &OS_VERSION ubuntu22.04
    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile., *DEVICE_TYPE ]
    # build_tag_override: "pr:2.16.2-gpu-py310-cu123-ubuntu20.04-sagemaker-pr-4362-autopatch"
    target: sagemaker
    enable_test_promotion: true
    context:
      <<: *TRAINING_CONTEXT
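
The `tag` and `docker_file` entries above are assembled from anchored values through the custom `!join` tag. As a rough illustration, the sketch below reproduces the resulting strings in plain Python. It assumes `!join` concatenates its list items with no separator (the explicit `"-"` and `/` items in the lists suggest this, but the loader is not part of this diff); `image_tag` and `docker_file` are hypothetical helper names, not functions from the repository.

```python
from typing import Optional

VERSION = "2.19.0"        # version: &VERSION
SHORT_VERSION = "2.19"    # short_version: &SHORT_VERSION


def join(*parts: str) -> str:
    """Assumed behaviour of the buildspec's !join tag: plain concatenation."""
    return "".join(parts)


def image_tag(device_type: str, tag_python_version: str, os_version: str,
              cuda_version: Optional[str] = None) -> str:
    """Mirror the `tag` lists; the GPU variant inserts the CUDA version."""
    parts = [VERSION, "-", device_type, "-", tag_python_version, "-"]
    if cuda_version:
        parts += [cuda_version, "-"]
    parts += [os_version, "-sagemaker"]
    return join(*parts)


def docker_file(device_type: str, docker_python_version: str,
                cuda_version: Optional[str] = None) -> str:
    """Mirror the `docker_file` lists; the GPU variant adds a CUDA directory."""
    parts = ["docker/", SHORT_VERSION, "/", docker_python_version]
    if cuda_version:
        parts += ["/", cuda_version]
    parts += ["/Dockerfile.", device_type]
    return join(*parts)


print(image_tag("cpu", "py312", "ubuntu22.04"))
print(image_tag("gpu", "py312", "ubuntu22.04", cuda_version="cu125"))
print(docker_file("cpu", "py3"))
print(docker_file("gpu", "py3", cuda_version="cu125"))
```

Under that assumption the image tags carry `py312`, while the image keys in the buildspec still contain `Py310`.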
2 changes: 1 addition & 1 deletion tensorflow/training/buildspec.yml
@@ -1 +1 @@
-buildspec_pointer: buildspec-2-18-sm.yml
+buildspec_pointer: buildspec-2-19-sm.yml
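
This one-line change redirects the default TensorFlow training buildspec from the 2.18 file to the new 2.19 file. How the repository follows the pointer is not shown in this diff; the sketch below is one plausible reading, assuming the pointer names a sibling file in the same directory. `resolve_buildspec` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def resolve_buildspec(pointer_file: str, pointer_text: str) -> str:
    """Hypothetical helper: follow a `buildspec_pointer` entry to a sibling file.

    Returns the original path unchanged when no pointer entry is present.
    """
    for line in pointer_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "buildspec_pointer":
            # Assumption: the pointer is relative to the pointer file's directory.
            return str(PurePosixPath(pointer_file).parent / value.strip())
    return pointer_file


print(resolve_buildspec(
    "tensorflow/training/buildspec.yml",
    "buildspec_pointer: buildspec-2-19-sm.yml",
))
```

Under that assumption, the resolved path equals the value set for `dlc-pr-tensorflow-2-training` in `dlc_developer_config.toml`.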