
Commit bb6e3df

Aydin-ab and angelinalg authored
angelina: Apply suggestions from code review (large size notebook)
Co-authored-by: angelinalg <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
1 parent 545cd80 commit bb6e3df

File tree

  • doc/source/serve/tutorials/deployment-serve-llm/large-size-llm

1 file changed: +27 -27 lines changed

doc/source/serve/tutorials/deployment-serve-llm/large-size-llm/notebook.ipynb

Lines changed: 27 additions & 27 deletions
@@ -7,27 +7,27 @@
77
"source": [
88
"# Deploy a large size LLM\n",
99
"\n",
10-
"A large size LLM typically runs on multiple nodes with multiple GPUs, prioritizing peak quality and capability: stronger reasoning, broader knowledge, longer context windows, more robust generalization. It’s the right choice when state-of-the-art results are required and higher latency, complexity, and cost are acceptable trade-offs.\n",
10+
"A large LLM typically runs on multiple nodes with multiple GPUs, prioritizing peak quality and capability: stronger reasoning, broader knowledge, longer context windows, more robust generalization. When higher latency, complexity, and cost are acceptable trade-offs because you require state-of-the-art results.\n",
1111
"\n",
12-
"This tutorial deploys a large size LLM like DeepSeek-R1 (685&nbsp;B parameters) using Ray Serve LLM. For smaller models, see [Deploying a small size LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html) or [Deploying a medium size LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/medium-size-llm/README.html).\n",
12+
"This tutorial deploys DeepSeek-R1, a large LLM with 685&nbsp;B parameters, using Ray Serve LLM. For smaller models, see [Deploying a small-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html) or [Deploying a medium-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/medium-size-llm/README.html).\n",
1313
"\n",
1414
"---\n",
1515
"\n",
16-
"## Challenges of large-scale deployment\n",
16+
"## Challenges of large-scale deployments\n",
1717
"\n",
18-
"Deploying a 685&nbsp;B-parameter model like DeepSeek-R1 presents significant technical challenges. At this scale, the model can't fit on a single GPU or even a single node. It must be distributed across multiple GPUs and nodes using *tensor parallelism* (splitting tensors within each layer) and *pipeline parallelism* (spreading layers across devices). \n",
18+
"Deploying a 685&nbsp;B-parameter model like DeepSeek-R1 presents significant technical challenges. At this scale, the model can't fit on a single GPU or even a single node. You must distribute it across multiple GPUs and nodes using *tensor parallelism* (splitting tensors within each layer) and *pipeline parallelism* (spreading layers across devices). \n",
1919
"\n",
2020
"Deploying a model of this scale normally requires you to manually launch and coordinate multiple nodes, unless you use a managed platform like [Anyscale](https://www.anyscale.com/), which automates cluster scaling and node orchestration. See [Deploy to production with Anyscale Services](#deploy-to-production-with-anyscale-services) for more details.\n",
2121
"\n",
2222
"---\n",
2323
"\n",
2424
"## Configure Ray Serve LLM\n",
2525
"\n",
26-
"A large size LLM is typically deployed across multiple nodes with multiple GPUs. To fully utilize the hardware, set `pipeline_parallel_size` to the number of nodes and `tensor_parallel_size` to the number of GPUs per node, which distributes the model’s weights evenly.\n",
26+
"A large-sized LLM is typically deployed across multiple nodes with multiple GPUs. To fully utilize the hardware, set `pipeline_parallel_size` to the number of nodes and `tensor_parallel_size` to the number of GPUs per node, which distributes the model’s weights evenly.\n",
2727
"\n",
2828
"Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. Use [`build_openai_app`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.build_openai_app.html#ray.serve.llm.build_openai_app) to build a full application from your [`LLMConfig`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) object.\n",
2929
"\n",
30-
"**Optional:** Since Deepseek-R1 is a reasoning model, we use vLLM’s built-in reasoning parser to correctly separate its reasoning content from the final response. See [Deploying a reasoning LLM: Parse reasoning outputs](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/reasoning-llm/README.html#parse-reasoning-outputs)."
30+
"**Optional:** Because Deepseek-R1 is a reasoning model, this tutorial uses vLLM’s built-in reasoning parser to correctly separate its reasoning content from the final response. See [Deploying a reasoning LLM: Parse reasoning outputs](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/reasoning-llm/README.html#parse-reasoning-outputs)."
3131
]
3232
},
3333
{
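For orientation, the configuration described above can look roughly like the following sketch. It's a minimal example assuming a 2-node × 8-GPU cluster; the file name, the `model_id` value, and the `reasoning_parser` engine argument are illustrative choices, not the notebook's exact cell:

```python
# serve_deepseek_r1.py -- minimal sketch, not the notebook's exact code.
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="deepseek",                     # name clients use in requests
        model_source="deepseek-ai/DeepSeek-R1",  # Hugging Face model source
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # Assumed hardware: 2 nodes with 8 GPUs each. Pipeline parallelism spans the
    # nodes; tensor parallelism spans the GPUs within each node (2 x 8 = 16-way split).
    engine_kwargs=dict(
        tensor_parallel_size=8,
        pipeline_parallel_size=2,
        reasoning_parser="deepseek_r1",  # optional: separate reasoning from the final answer
    ),
)

# Build an OpenAI-compatible Serve application from the config.
app = build_openai_app({"llm_configs": [llm_config]})
```

With a module like this, `serve_deepseek_r1:app` is the import path that the Serve and Anyscale service configs later in the notebook can point at.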
@@ -51,7 +51,7 @@
5151
" min_replicas=1, max_replicas=1,\n",
5252
" )\n",
5353
" ),\n",
54-
" ### Uncomment if your model is gated and need your Huggingface Token to access it\n",
54+
" ### Uncomment if your model is gated and needs your Hugging Face token to access it.\n",
5555
" #runtime_env=dict(\n",
5656
" # env_vars={\n",
5757
" # \"HF_TOKEN\": os.environ.get(\"HF_TOKEN\")\n",
@@ -74,7 +74,7 @@
7474
"id": "6b2231a5",
7575
"metadata": {},
7676
"source": [
77-
"**Note:** Before moving to a production setup, migrate to using a [Serve config file](https://docs.ray.io/en/latest/serve/production-guide/config.html) to make your deployment version-controlled, reproducible, and easier to maintain for CI/CD pipelines. See [Serving LLMs: Production Guide](https://docs.ray.io/en/latest/serve/llm/serving-llms.html#production-deployment) for an example.\n",
77+
"**Note:** Before moving to a production setup, migrate to a [Serve config file](https://docs.ray.io/en/latest/serve/production-guide/config.html) to make your deployment version-controlled, reproducible, and easier to maintain for CI/CD pipelines. See [Serving LLMs: Production Guide](https://docs.ray.io/en/latest/serve/llm/serving-llms.html#production-deployment) for an example.\n",
7878
"\n",
7979
"---\n",
8080
"\n",
@@ -223,9 +223,9 @@
223223
"\n",
224224
"---\n",
225225
"\n",
226-
"## Deploy to production with Anyscale Services\n",
226+
"## Deploy to production with Anyscale services\n",
227227
"\n",
228-
"For production deployment, use Anyscale Services to deploy the Ray Serve app to a dedicated cluster without modifying the code. Anyscale provides scalability, fault tolerance, and load balancing, keeping the service resilient against node failures, high traffic, and rolling updates, while also automating multi-node setup and autoscaling for large models like DeepSeek-R1.\n",
228+
"For production deployment, use Anyscale services to deploy the Ray Serve app to a dedicated cluster without modifying the code. Anyscale provides scalability, fault tolerance, and load balancing, keeping the service resilient against node failures, high traffic, and rolling updates, while also automating multi-node setup and autoscaling for large models like DeepSeek-R1.\n",
229229
"\n",
230230
"**Beware**: this is an expensive deployment. At the time of writing, the deployment cost is around \\$110 USD per hour in the `us-west-2` AWS region using on-demand instances. Because this node has a high amount of inter-node traffic, and cross-zone traffic is expensive (around \\$0.02 per GB), it's recommended to *disable cross-zone autoscaling*. This demo is pre-configured with cross-zone autoscaling disabled for your convenience.\n",
231231
"\n",
@@ -239,9 +239,9 @@
239239
"\n",
240240
"### Launch the service\n",
241241
"\n",
242-
"Anyscale provides out-of-the-box images (`anyscale/ray-llm`) which come pre-loaded with Ray Serve LLM, vLLM, and all required GPU/runtime dependencies. This makes it easy to get started without building a custom image.\n",
242+
"Anyscale provides out-of-the-box images (`anyscale/ray-llm`), which come pre-loaded with Ray Serve LLM, vLLM, and all required GPU/runtime dependencies. This makes it easy to get started without building a custom image.\n",
243243
"\n",
244-
"Create your Anyscale Service configuration in a new `service.yaml` file:\n",
244+
"Create your Anyscale service configuration in a new `service.yaml` file:\n",
245245
"```yaml\n",
246246
"#service.yaml\n",
247247
"name: deploy-deepseek-r1\n",
@@ -274,7 +274,7 @@
274274
"- import_path: serve_deepseek_r1:app\n",
275275
"```\n",
276276
"\n",
277-
"Deploy your Service"
277+
"Deploy your service"
278278
]
279279
},
280280
{
@@ -295,7 +295,7 @@
295295
"id": "18226fd7",
296296
"metadata": {},
297297
"source": [
298-
"**Note:** If your model is gated, make sure to pass your HuggingFace Token to the Service with `--env HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>`\n",
298+
"**Note:** If your model is gated, make sure to pass your Hugging Face token to the service with `--env HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>`\n",
299299
"\n",
300300
"**Custom Dockerfile** \n",
301301
"You can customize the container by building your own Dockerfile. In your Anyscale Service config, reference the Dockerfile with `containerfile` (instead of `image_uri`):\n",
@@ -319,22 +319,22 @@
319319
"```console\n",
320320
"(anyscale +3.9s) curl -H \"Authorization: Bearer <YOUR-TOKEN>\" <YOUR-ENDPOINT>\n",
321321
"```\n",
322-
"You can also retrieve both from the service page in the Anyscale Console. Click the **Query** button at the top. See [Send requests](#send-requests) for example requests, but make sure to use the correct endpoint and authentication token. \n",
322+
"You can also retrieve both from the service page in the Anyscale console. Click the **Query** button at the top. See [Send requests](#send-requests) for example requests, but make sure to use the correct endpoint and authentication token. \n",
323323
"\n",
324324
"---\n",
325325
"\n",
326326
"### Access the Serve LLM dashboard\n",
327327
"\n",
328-
"See [Enable LLM monitoring](#enable-llm-monitoring) for instructions on enabling LLM-specific logging. To open the Ray Serve LLM Dashboard from an Anyscale Service:\n",
328+
"See [Enable LLM monitoring](#enable-llm-monitoring) for instructions on enabling LLM-specific logging. To open the Ray Serve LLM dashboard from an Anyscale service:\n",
329329
"1. In the Anyscale console, go to your **Service** or **Workspace**\n",
330330
"2. Navigate to the **Metrics** tab\n",
331-
"3. Expand **View in Grafana** and click **Serve LLM Dashboard**\n",
331+
"3. Click **View in Grafana** and click **Serve LLM Dashboard**\n",
332332
"\n",
333333
"---\n",
334334
"\n",
335335
"### Shutdown \n",
336336
" \n",
337-
"Shutdown your Anyscale Service:"
337+
"Shutdown your Anyscale service:"
338338
]
339339
},
340340
{
@@ -358,7 +358,7 @@
358358
"\n",
359359
"## Enable LLM monitoring\n",
360360
"\n",
361-
"The *Serve LLM Dashboard* offers deep visibility into model performance, latency, and system behavior, including:\n",
361+
"The *Serve LLM dashboard* offers deep visibility into model performance, latency, and system behavior, including:\n",
362362
"\n",
363363
"* Token throughput (tokens/sec)\n",
364364
"* Latency metrics: Time To First Token (TTFT), Time Per Output Token (TPOT)\n",
@@ -390,7 +390,7 @@
390390
"INFO 07-30 11:56:04 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 29.06x\n",
391391
"```\n",
392392
"\n",
393-
"Here are a few ways to improve concurrency depending on your model and hardware: \n",
393+
"The following are a few ways to improve concurrency depending on your model and hardware: \n",
394394
"\n",
395395
"**Reduce `max_model_len`** \n",
396396
"Lowering `max_model_len` reduces the memory needed for KV cache.\n",
@@ -399,14 +399,14 @@
399399
"* `max_model_len = 32,768` → concurrency ≈ 29\n",
400400
"* `max_model_len = 16,384` → concurrency ≈ 58\n",
401401
"\n",
402-
"**Use Distilled or Quantized Models** \n",
402+
"**Use distilled or quantized models** \n",
403403
"Quantizing or distilling your model reduces its memory footprint, freeing up space for more KV cache and enabling more concurrent requests. For example, see [`deepseek-ai/DeepSeek-R1-Distill-Llama-70B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) for a distilled version of DeepSeek-R1.\n",
404404
"\n",
405405
"\n",
406406
"**Upgrade to GPUs with more memory** \n",
407407
"Some GPUs provide significantly more room for KV cache and allow for higher concurrency out of the box.\n",
408408
"\n",
409-
"**Scale with more Replicas** \n",
409+
"**Scale with more replicas** \n",
410410
"In addition to tuning per-GPU concurrency, you can scale *horizontally* by increasing the number of replicas in your config. \n",
411411
"Each replica runs on its own GPU, so raising the replica count increases the total number of concurrent requests your service can handle, especially under sustained or bursty traffic.\n",
412412
"```yaml\n",
@@ -416,24 +416,24 @@
416416
" max_replicas: 4\n",
417417
"```\n",
418418
"\n",
419-
"*For more details on tuning strategies, hardware guidance, and serving configurations, see the [GPU Selection Guide for LLM Serving](https://docs.anyscale.com/overview) and [Tuning vLLM and Ray Serve Parameters for LLM Deployment](https://docs.anyscale.com/overview).*\n",
419+
"*For more details on tuning strategies, hardware guidance, and serving configurations, see [Choose a GPU for LLM serving](https://docs.anyscale.com/llm/serving/gpu-guidance) and [Tune parameters for LLMs on Anyscale services](https://docs.anyscale.com/llm/serving/parameter-tuning).*\n",
420420
"\n",
421421
"---\n",
422422
"\n",
423423
"## Troubleshooting\n",
424424
"\n",
425-
"**HuggingFace Auth Errors** \n",
425+
"**Hugging Face auth errors** \n",
426426
"Some models, such as Llama-3.1, are gated and require prior authorization from the organization. See your model’s documentation for instructions on obtaining access.\n",
427427
"\n",
428-
"**Out-Of-Memory Errors** \n",
428+
"**Out-Of-Memory errors** \n",
429429
"Out‑of‑memory (OOM) errors are one of the most common failure modes when deploying LLMs, especially as model sizes, and context length increase. \n",
430-
"See this [Troubleshooting Guide](https://docs.anyscale.com/overview) for common errors and how to fix them.\n",
430+
"See [Troubleshooting Guide](https://docs.anyscale.com/overview) for common errors and how to fix them.\n",
431431
"\n",
432432
"---\n",
433433
"\n",
434434
"## Summary\n",
435435
"\n",
436-
"In this tutorial, you deployed a large size LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM, deploy your service on your Ray cluster, and how to send requests. You also learned how to monitor your app and common troubleshooting issues."
436+
"In this tutorial, you deployed a large-sized LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM, deploy your service on your Ray cluster, and how to send requests. You also learned how to monitor your app and troubleshoot common issues."
437437
]
438438
}
439439
],
