Skip to content

Conversation

@ChaonengQuan
Copy link
Contributor

@ChaonengQuan ChaonengQuan commented Jan 8, 2025

Summary

In Dockerfile

  1. create a folder in /tmp/recovery-mode, which will be used as home directory in recovery mode
  2. create a new micromamba environment recovery-mode, which only contains 2 basic jupyterlab extension required for recovery mode

In Entrypoint scripts:

  1. Add if condition to start switch micromamba environment based on recovery mode

Motivation

From the customer’s perspective, Recovery Mode allows them to quickly recover from issues on their own, without waiting more than a day (sometimes even longer) for AWS support. This minimal environment provides immediate access to their data and lets them continue AI/ML tasks with minimal downtime, improving both productivity and overall experience.

Design Doc

https://quip-amazon.com/5M9NAedp2jHN/LLD-StudioV2-Recovery-Mode

Test

Added new unit test to test jupyterlab entrypoint

(sagemaker-distribution) 
(25-01-09 17:09:47) <0> [~/workplace/sagemaker-distribution-condarc-fix]  
dev-dsk-chaoneng-2a-e96269ee % pytest -n auto -rA --local-image-version 2.3.0 -m cpu -k 'recovery-mode'

============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.12.7, pytest-8.3.4, pluggy-1.5.0
rootdir: /workplace/chaoneng/sagemaker-distribution-condarc-fix
configfile: pytest.ini
plugins: xdist-3.6.1, mock-3.14.0
8 workers [1 item]      
.                                                                                                                                                                                                          [100%]
================================================================================================ warnings summary ================================================================================================
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
  /local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
    EPOCH_NAIVE = datetime.utcfromtimestamp(0)

src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
  /workplace/chaoneng/sagemaker-distribution-condarc-fix/src/utils.py:4: DeprecationWarning: conda.cli.python_api is deprecated and will be removed in 25.9. Use `conda.testing.conda_cli` instead.
    import conda.cli.python_api

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================================================== PASSES =====================================================================================================
___________________________________________________________________ test_dockerfiles_for_cpu[recovery-mode.test.Dockerfile-required_packages0] ___________________________________________________________________
[gw0] linux -- Python 3.12.7 /local/home/chaoneng/micromamba/envs/sagemaker-distribution/bin/python3.12
---------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------
Will start running test for: recovery-mode.test.Dockerfile against: localhost/sagemaker-distribution:2.3.0-cpu
Built a test image: sha256:cf11fb8245fa5bf55fc9d0e06b029e5485f991227c1216569059a07ceb5aa17b, will now execute its default CMD.
Starting test to verify JupyterLab can be started...
Container logs indicate JupyterLab started successfully.
Stopped and removed the container.
============================================================================================ short test summary info =============================================================================================
PASSED test/test_dockerfile_based_harness.py::test_dockerfiles_for_cpu[recovery-mode.test.Dockerfile-required_packages0]
========================================================================================= 1 passed, 16 warnings in 6.49s =========================================================================================

@claytonparnell
Copy link
Contributor

Can you also add a unit test for this functionality? It can be pretty simple, setting the env variable then calling script, and checking current mamba env.

@ChaonengQuan
Copy link
Contributor Author

Can you also add a unit test for this functionality? It can be pretty simple, setting the env variable then calling script, and checking current mamba env.

Added p0 sanity test

# Setup the Recovery Mode home directory and micromamba environment
mkdir -p $RECOVERY_MODE_HOME && \
chown $MAMBA_USER:$MAMBA_USER $RECOVERY_MODE_HOME && \
micromamba create -n recovery-mode && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you verified image size changes with these lines? I'm concerning if setting up a separate python venv will significantly increase the image size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think image size will be reduced even more in the latest commit, since I move the install part before micromamba clean statement so that some file will be cleaned after that

Copy link
Contributor

@aws-tianquaw aws-tianquaw Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, can you verify the latest size increase with your latest DockerFile once?

@ChaonengQuan ChaonengQuan self-assigned this Jan 10, 2025
@ChaonengQuan
Copy link
Contributor Author

Image size impact is < 5% to compare to 2.3.0 vanilla version, should be safe to merge

localhost/sagemaker-distribution                  2.3.0-cpu           bbf559c9f085   13 minutes ago   7.99GB
localhost/sagemaker-distribution                  2.3.0-gpu           16c499e1ad6c   5 minutes ago   13.3GB

I ran the same script and build an SMD 2.3.0 with no changes, this is the size

localhost/sagemaker-distribution         2.3.0-cpu           7db6a8ebdaef   19 seconds ago   7.66GB
localhost/sagemaker-distribution                  2.3.0-gpu           16c499e1ad6c   5 minutes ago   13.0GB

(7.66 - 7.33) / 7.33 = 4%

See also image size after compression on ECR

Screenshot 2025-01-10 at 16 59 38

@claytonparnell claytonparnell merged commit 1a090b2 into aws:main Jan 13, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants