Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add recovery mode setup #543

Merged
merged 14 commits into from
Jan 13, 2025
Merged

feat: add recovery mode setup #543

merged 14 commits into from
Jan 13, 2025

Conversation

ChaonengQuan
Copy link
Contributor

@ChaonengQuan ChaonengQuan commented Jan 8, 2025

Summary

In Dockerfile

  1. create a folder in /tmp/recovery-mode, which will be used as home directory in recovery mode
  2. create a new micromamba environment recovery-mode, which only contains 2 basic jupyterlab extension required for recovery mode

In Entrypoint scripts:

  1. Add if condition to start switch micromamba environment based on recovery mode

Motivation

From the customer’s perspective, Recovery Mode allows them to quickly recover from issues on their own, without waiting more than a day (sometimes even longer) for AWS support. This minimal environment provides immediate access to their data and lets them continue AI/ML tasks with minimal downtime, improving both productivity and overall experience.

Design Doc

https://quip-amazon.com/5M9NAedp2jHN/LLD-StudioV2-Recovery-Mode

Test

Added new unit test to test jupyterlab entrypoint

(sagemaker-distribution) 
(25-01-09 17:09:47) <0> [~/workplace/sagemaker-distribution-condarc-fix]  
dev-dsk-chaoneng-2a-e96269ee % pytest -n auto -rA --local-image-version 2.3.0 -m cpu -k 'recovery-mode'

============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.12.7, pytest-8.3.4, pluggy-1.5.0
rootdir: /workplace/chaoneng/sagemaker-distribution-condarc-fix
configfile: pytest.ini
plugins: xdist-3.6.1, mock-3.14.0
8 workers [1 item]      
.                                                                                                                                                                                                          [100%]
================================================================================================ warnings summary ================================================================================================
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
../../../local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426
  /local/home/chaoneng/micromamba/envs/sagemaker-distribution/lib/python3.12/site-packages/boltons/timeutils.py:426: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
    EPOCH_NAIVE = datetime.utcfromtimestamp(0)

src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
src/utils.py:4
  /workplace/chaoneng/sagemaker-distribution-condarc-fix/src/utils.py:4: DeprecationWarning: conda.cli.python_api is deprecated and will be removed in 25.9. Use `conda.testing.conda_cli` instead.
    import conda.cli.python_api

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================================================== PASSES =====================================================================================================
___________________________________________________________________ test_dockerfiles_for_cpu[recovery-mode.test.Dockerfile-required_packages0] ___________________________________________________________________
[gw0] linux -- Python 3.12.7 /local/home/chaoneng/micromamba/envs/sagemaker-distribution/bin/python3.12
---------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------
Will start running test for: recovery-mode.test.Dockerfile against: localhost/sagemaker-distribution:2.3.0-cpu
Built a test image: sha256:cf11fb8245fa5bf55fc9d0e06b029e5485f991227c1216569059a07ceb5aa17b, will now execute its default CMD.
Starting test to verify JupyterLab can be started...
Container logs indicate JupyterLab started successfully.
Stopped and removed the container.
============================================================================================ short test summary info =============================================================================================
PASSED test/test_dockerfile_based_harness.py::test_dockerfiles_for_cpu[recovery-mode.test.Dockerfile-required_packages0]
========================================================================================= 1 passed, 16 warnings in 6.49s =========================================================================================

template/v2/Dockerfile Outdated Show resolved Hide resolved
@claytonparnell
Copy link
Contributor

Can you also add a unit test for this functionality? It can be pretty simple, setting the env variable then calling script, and checking current mamba env.

@ChaonengQuan
Copy link
Contributor Author

Can you also add a unit test for this functionality? It can be pretty simple, setting the env variable then calling script, and checking current mamba env.

Added p0 sanity test

template/v2/Dockerfile Outdated Show resolved Hide resolved
# Setup the Recovery Mode home directory and micromamba environment
mkdir -p $RECOVERY_MODE_HOME && \
chown $MAMBA_USER:$MAMBA_USER $RECOVERY_MODE_HOME && \
micromamba create -n recovery-mode && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you verified image size changes with these lines? I'm concerning if setting up a separate python venv will significantly increase the image size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think image size will be reduced even more in the latest commit, since I move the install part before micromamba clean statement so that some file will be cleaned after that

Copy link
Contributor

@aws-tianquaw aws-tianquaw Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, can you verify the latest size increase with your latest DockerFile once?

@ChaonengQuan ChaonengQuan self-assigned this Jan 10, 2025
@ChaonengQuan
Copy link
Contributor Author

Image size impact is < 5% to compare to 2.3.0 vanilla version, should be safe to merge

localhost/sagemaker-distribution                  2.3.0-cpu           bbf559c9f085   13 minutes ago   7.99GB
localhost/sagemaker-distribution                  2.3.0-gpu           16c499e1ad6c   5 minutes ago   13.3GB

I ran the same script and build an SMD 2.3.0 with no changes, this is the size

localhost/sagemaker-distribution         2.3.0-cpu           7db6a8ebdaef   19 seconds ago   7.66GB
localhost/sagemaker-distribution                  2.3.0-gpu           16c499e1ad6c   5 minutes ago   13.0GB

(7.66 - 7.33) / 7.33 = 4%

See also image size after compression on ECR

Screenshot 2025-01-10 at 16 59 38

@claytonparnell claytonparnell merged commit 1a090b2 into aws:main Jan 13, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants