Commit 42a1a09

SageMaker Serverless Inference GA changes (aws#3340)
* SageMaker Serverless Inference GA changes
* Update huggingface-text-classification-serverless-inference.ipynb
* SageMaker Serverless Inference GA changes
* SageMaker Serverless Inference GA changes
* SageMaker Serverless Inference GA changes
* Update Serverless-Inference-Walkthrough.ipynb
* Update Serverless-Inference-Walkthrough.ipynb
* Update Serverless-Inference-Walkthrough.ipynb
1 parent: f56585e · commit: 42a1a09

File tree: 4 files changed, +15 -44 lines


sagemaker_processing/fairness_and_explainability/text_explainability_sagemaker_algorithm/container/readme.md

Lines changed: 3 additions & 3 deletions
@@ -4,7 +4,7 @@ This example shows how to package an algorithm for use with SageMaker.
 
 SageMaker supports two execution modes: _training_ where the algorithm uses input data to train a new model and _serving_ where the algorithm accepts HTTP requests and uses the previously trained model to do an inference (also called "scoring", "prediction", or "transformation").
 
-The algorithm that we have built here supports both training and scoring in SageMaker with the same container image. It is perfectly reasonable to build an algorithm that supports only training _or_ scoring as well as to build an algorithm that has separate container images for training and scoring.v
+The algorithm that we have built here supports both training and scoring in SageMaker with the same container image. It is perfectly reasonable to build an algorithm that supports only training _or_ scoring as well as to build an algorithm that has separate container images for training and scoring.
 
 In order to build a production grade inference server into the container, we use the following stack to make the implementer's job simple:
 
@@ -16,11 +16,11 @@ In order to build a production grade inference server into the container, we use
 
 The components are as follows:
 
-* __Dockerfile__: The _Dockerfile_ describes how the image is built and what it contains. It is a recipe for your container and gives you tremendous flexibility to construct almost any execution environment you can imagine. Here. we use the Dockerfile to describe a pretty standard python science stack and the simple scripts that we're going to add to it. See the [Dockerfile reference][dockerfile] for what's possible here.
+* __Dockerfile__: The _Dockerfile_ describes how the image is built and what it contains. It is a recipe for your container and gives you tremendous flexibility to construct almost any execution environment you can imagine. Here, we use the Dockerfile to describe a pretty standard python science stack and the simple scripts that we're going to add to it. See the [Dockerfile reference][dockerfile] for what's possible here.
 
 * __build\_and\_push.sh__: The script to build the Docker image (using the Dockerfile above) and push it to the [Amazon EC2 Container Registry (ECR)][ecr] so that it can be deployed to SageMaker. Specify the name of the image as the argument to this script. The script will generate a full name for the repository in your account and your configured AWS region. If this ECR repository doesn't exist, the script will create it.
 
-* __blazing_text__: The directory that contains the application to run in the container. See the next session for details about each of the files.
+* __blazing_text__: The directory that contains the application to run in the container. See the next section for details about each of the files.
 
 * __local-test__: A directory containing scripts and a setup for running simple training and inference jobs locally so that you can test that everything is set up correctly. See below for details.
serverless-inference/Serverless-Inference-Walkthrough.ipynb

Lines changed: 4 additions & 4 deletions
@@ -13,7 +13,7 @@
 "For this notebook we'll be working with the SageMaker XGBoost Algorithm to train a model and then deploy a serverless endpoint. We will be using the public S3 Abalone regression dataset for this example.\n",
 "\n",
 "<b>Notebook Setting</b>\n",
-"- <b>SageMaker Classic Notebook Instance</b>: ml.m5.xlarge Notebook Instance & conda_python3 Kernel\n",
+"- <b>SageMaker Classic Notebook Instance</b>: ml.m5.xlarge Notebook Instance & `conda_python3` Kernel\n",
 "- <b>SageMaker Studio</b>: Python 3 (Data Science)\n",
 "- <b>Regions Available</b>: SageMaker Serverless Inference is currently available in the following regions: US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo) and Asia Pacific (Sydney)"
 ]
@@ -49,7 +49,7 @@
 "id": "1affea20",
 "metadata": {},
 "source": [
-"Let's start by installing preview wheels of the Python SDK, boto and aws cli"
+"Let's start by upgrading the Python SDK, `boto3` and AWS `CLI` (Command Line Interface) packages."
 ]
 },
 {
@@ -313,7 +313,7 @@
 "source": [
 "### Endpoint Configuration Creation\n",
 "\n",
-"This is where you can adjust the <b>Serverless Configuration</b> for your endpoint. The current max concurrent invocations for a single endpoint, known as <b>MaxConcurrency</b>, can be any value from <b>1 to 50</b>, and <b>MemorySize</b> can be any of the following: <b>1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB</b>."
+"This is where you can adjust the <b>Serverless Configuration</b> for your endpoint. The current max concurrent invocations for a single endpoint, known as `MaxConcurrency`, can be any value from <b>1 to 200</b>, and `MemorySize` can be any of the following: <b>1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB</b>."
 ]
 },
 {
@@ -373,7 +373,7 @@
 "id": "831d2181",
 "metadata": {},
 "source": [
-"Wait until the endpoint status is InService before invoking the endpoint."
+"Wait until the endpoint status is `InService` before invoking the endpoint."
 ]
 },
 {
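For readers following the walkthrough, the <b>Serverless Configuration</b> hunk above maps onto the `ServerlessConfig` block of a boto3 `create_endpoint_config` call. A minimal sketch of that call follows; the endpoint-config and model names are hypothetical placeholders, not values taken from the notebook:

import boto3

client = boto3.client("sagemaker")

# Names below are illustrative placeholders; the model must already exist.
client.create_endpoint_config(
    EndpointConfigName="xgboost-serverless-epc",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "xgboost-serverless-model",
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,  # 1024, 2048, 3072, 4096, 5120, or 6144
                "MaxConcurrency": 10,    # any value from 1 to 200 after GA
            },
        }
    ],
)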

serverless-inference/huggingface-serverless-inference/huggingface-text-classification-serverless-inference.ipynb

Lines changed: 4 additions & 33 deletions
@@ -85,14 +85,14 @@
 "source": [
 "import sys\n",
 "\n",
-"!{sys.executable} -m pip install \"scikit_learn==0.20.0\" \"sagemaker>=2.75.1\" \"transformers==4.6.1\" \"datasets==1.6.2\" \"nltk==3.4.4\""
+"!{sys.executable} -m pip install \"scikit_learn==0.20.0\" \"sagemaker>=2.86.1\" \"transformers==4.6.1\" \"datasets==1.6.2\" \"nltk==3.4.4\""
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Make sure SageMaker version is >= 2.75.1"
+"Make sure SageMaker version is >= 2.86.1"
 ]
 },
 {
@@ -1097,7 +1097,7 @@
 "\n",
 "#### Concurrent invocations - `max_concurrency`\n",
 " \n",
-"Serverless Inference manages predefined scaling policies and quotas for the capacity of your endpoint. Serverless endpoints have a quota for how many concurrent invocations can be processed at the same time. If the endpoint is invoked before it finishes processing the first request, then it handles the second request concurrently. You can set the maximum concurrency for a <b>single endpoint up to 50</b>, and the total number of serverless endpoint variants you can host in a Region is 50. The total concurrency you can share between all serverless endpoints per Region in your account is 200. The maximum concurrency for an individual endpoint prevents that endpoint from taking up all the invocations allowed for your account, and any endpoint invocations beyond the maximum are throttled."
+"Serverless Inference manages predefined scaling policies and quotas for the capacity of your endpoint. Serverless endpoints have a quota for how many concurrent invocations can be processed at the same time. If the endpoint is invoked before it finishes processing the first request, then it handles the second request concurrently. You can set the maximum concurrency for a <b>single endpoint up to 200</b>, and the total number of serverless endpoint variants you can host in a Region is 50. The total concurrency you can share between all serverless endpoints per Region in your account is 200. The maximum concurrency for an individual endpoint prevents that endpoint from taking up all the invocations allowed for your account, and any endpoint invocations beyond the maximum are throttled."
 ]
 },
 {
@@ -1114,33 +1114,6 @@
 ")"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### HuggingFace Inference Image `URI`\n",
-"\n",
-"In order to deploy the SageMaker Endpoint with Serverless configuration, we will need to supply the HuggingFace Inference Image URI."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"image_uri = sagemaker.image_uris.retrieve(\n",
-"    framework=\"huggingface\",\n",
-"    base_framework_version=\"pytorch1.7\",\n",
-"    region=sess.boto_region_name,\n",
-"    version=\"4.6\",\n",
-"    py_version=\"py36\",\n",
-"    instance_type=\"ml.m5.large\",\n",
-"    image_scope=\"inference\",\n",
-")\n",
-"image_uri"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -1157,9 +1130,7 @@
 "source": [
 "%%time\n",
 "\n",
-"predictor = huggingface_estimator.deploy(\n",
-"    serverless_inference_config=serverless_config, image_uri=image_uri\n",
-")"
+"predictor = huggingface_estimator.deploy(serverless_inference_config=serverless_config)"
 ]
 },
 {
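With the image-URI cells deleted above, serverless deployment in this notebook reduces to the SDK's `ServerlessInferenceConfig`. A minimal sketch, assuming the notebook's trained `huggingface_estimator` and `sagemaker>=2.86.1`; the memory and concurrency values are illustrative:

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,  # 1024 to 6144 MB, in 1024 MB increments
    max_concurrency=10,      # per-endpoint limit, up to 200 after GA
)

# As the diff shows, the GA SDK resolves the HuggingFace inference image
# itself, so no explicit image_uri argument is passed to deploy().
predictor = huggingface_estimator.deploy(serverless_inference_config=serverless_config)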

serverless-inference/serverless-model-registry.ipynb

Lines changed: 4 additions & 4 deletions
@@ -20,7 +20,7 @@
 "source": [
 "Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for customers to deploy and scale ML models. Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints also automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies.\n",
 "\n",
-"[SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) can be used to catalog and manage different model versions. Model Registry now supports deploying registered models to serverless endpoints. For this notebook we will take the existing [XGBoost Serverless example](https://github.com/aws/amazon-sagemaker-examples/blob/main/serverless-inference/Serverless-Inference-Walkthrough.ipynb) and integrate with the Model Registry. From there we will take our trained model and deploy it to a serverless endpoint using the Boto3 Python SDK. Note that there is not Model Registry support for the SageMaker SDK with serverless endpoints at the moment.\n",
+"[SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) can be used to catalog and manage different model versions. Model Registry now supports deploying registered models to serverless endpoints. For this notebook we will take the existing [XGBoost Serverless example](https://github.com/aws/amazon-sagemaker-examples/blob/main/serverless-inference/Serverless-Inference-Walkthrough.ipynb) and integrate with the Model Registry. From there we will take our trained model and deploy it to a serverless endpoint using the Boto3 Python SDK. Note that there is no support for Model Registry in the SageMaker SDK with serverless endpoints at the moment.\n",
 "\n",
 "<b>Notebook Setting</b>\n",
 "- <b>SageMaker Studio</b>: Python 3 (Data Science)\n",
@@ -351,7 +351,7 @@
 "metadata": {},
 "source": [
 "### Endpoint Configuration Creation\n",
-"This is where you can adjust the [Serverless Configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints-create.html) for your endpoint. The current max concurrent invocations for a single endpoint, known as MaxConcurrency, can be any value from 1 to 50, and MemorySize can be any of the following: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB."
+"This is where you can adjust the [Serverless Configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints-create.html) for your endpoint. The current max concurrent invocations for a single endpoint, known as `MaxConcurrency`, can be any value from 1 to 200, and `MemorySize` can be any of the following: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB."
 ]
 },
 {
@@ -402,7 +402,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Wait until the endpoint status is InService before invoking the endpoint."
+"Wait until the endpoint status is `InService` before invoking the endpoint."
 ]
 },
 {
@@ -488,4 +488,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 4
-}
\ No newline at end of file
+}
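Since this notebook drives everything through Boto3, the "wait until `InService`" step above corresponds to the built-in `endpoint_in_service` waiter. A minimal sketch follows; the endpoint and config names are hypothetical placeholders, and the payload is a dummy CSV row rather than real Abalone data:

import boto3

client = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

endpoint_name = "xgboost-serverless-ep"  # hypothetical name

client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName="xgboost-serverless-epc",  # the serverless config created above
)

# Block until the endpoint status is InService before invoking.
waiter = client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=b"0,0.5,0.4,0.1,0.5,0.2,0.1,0.15",  # placeholder feature row
)
print(response["Body"].read())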
