Merge pull request #1377 from madeline-underwood/RTP-LLM-chatbot
Rtp llm chatbot_KB to review
jasonrandrews authored Nov 12, 2024
2 parents 4cc83ec + f77a1c1 commit 7a8c687
Showing 6 changed files with 106 additions and 50 deletions.
@@ -1,17 +1,18 @@
---
title: Run a Large Language Model (LLM) chatbot with rtp-llm on Arm servers
title: Run an LLM chatbot with rtp-llm on Arm-based servers

minutes_to_complete: 30

who_is_this_for: This is an introductory topic for developers interested in running LLMs on Arm-based servers.
who_is_this_for: This is an introductory topic for developers who are interested in running a Large Language Model (LLM) with rtp-llm on Arm-based servers.

learning_objectives:
- Build rtp-llm on your Arm server.
- Build rtp-llm on an Arm-based server.
- Download a Qwen model from Hugging Face.
- Run a Large Language Model with rtp-llm.

prerequisites:
- An Arm Neoverse N2 or Neoverse V2 [based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server. This Learning Path was tested on an AliCloud Yitian710 g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance to test Arm performance optimizations.
- Any Arm Neoverse N2-based or Arm Neoverse V2-based instance running Ubuntu 22.04 LTS from a cloud service provider or an on-premise Arm server.
- For the server, at least four cores and 16GB of RAM, with disk storage configured up to at least 32 GB.

author_primary: Tianyu Li

@@ -4,7 +4,12 @@ next_step_guidance: >
recommended_path: "/learning-paths/servers-and-cloud-computing/nlp-hugging-face/"


further_reading:
- resource:
title: Qwen2-0.5B-Instruct
link: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
type: website
- resource:
title: Getting started with RTP-LLM
link: https://github.com/alibaba/rtp-llm
@@ -18,9 +23,10 @@ further_reading:
link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
type: blog
- resource:
title: Qwen2-0.5B-Instruct
link: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
title: Get started with Arm-based cloud instances
link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/
type: website



# ================================================================================
@@ -2,23 +2,33 @@
review:
- questions:
question: >
Can you run LLMs on Arm CPUs?
Are at least four cores, 16GB of RAM, and 32GB of disk storage required to run the LLM chatbot using rtp-llm on an Arm-based server?
answers:
- "Yes"
- "No"
correct_answer: 1
explanation: >
Yes. The advancements made in the Generative AI space with smaller parameter models make LLM inference on CPUs very efficient.
It depends on the size of the LLM. The higher the number of parameters of the model, the greater the system requirements.
- questions:
question: >
Can rtp-llm be built and run on CPU?
Does the rtp-llm project use the --config=arm option to optimize LLM inference for Arm CPUs?
answers:
- "Yes"
- "No"
correct_answer: 1
explanation: >
Yes. rtp-llm not only support built and run on GPU, but also it can be run on Arm CPU.
rtp-llm uses the GPU for inference by default. It optimizes LLM inference on the Arm architecture when you provide the --config=arm configuration option during the build process.
- questions:
question: >
Is the given Python script the only way to run the LLM chatbot on an Arm AArch64 CPU and output a response from the model?
answers:
- "Yes"
- "No"
correct_answer: 2
explanation: >
rtp-llm can also be deployed as an API server, and the user can use curl or another client to generate an LLM chatbot response.
# ================================================================================
# FIXED, DO NOT MODIFY
@@ -0,0 +1,33 @@
---
title: Background
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---
Arm CPUs are widely used in ML and AI use cases. In this Learning Path, you will learn how to run the generative AI inference-based use case of an LLM chatbot on an Arm-based CPU. You will do this by deploying the [Qwen2-0.5B-Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on an Arm-based CPU using `rtp-llm`.


{{% notice Note %}}
This Learning Path has been tested on an Alibaba Cloud g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance.
{{% /notice %}}


[rtp-llm](https://github.com/alibaba/rtp-llm) is an open-source C/C++ project developed by Alibaba that enables efficient LLM inference on a variety of hardware.

RTP-LLM is a Large Language Model inference acceleration engine developed by Alibaba. Qwen is the name given to a series of Large Language Models developed by Alibaba Cloud that are capable of performing a variety of tasks.

Alibaba Cloud offers a wide range of models, each suitable for different tasks and use cases.

Besides generating text, they are also able to perform actions such as:

* Answering questions through information retrieval and analysis.
* Processing images and producing written descriptions of visual content.
* Processing audio content.
* Providing multilingual support, with over 27 additional languages on top of the core languages of English and Chinese.

Qwen is open source, flexible, and encourages contribution from the software development community.




@@ -1,23 +1,13 @@
---
title: Run a Large Language model (LLM) chatbot with rtp-llm on Arm servers
title: Run an LLM chatbot with rtp-llm on an Arm server
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Before you begin
The instructions in this Learning Path are for any Arm Neoverse N2 or Neoverse V2 based server running Ubuntu 22.04 LTS. You need an Arm server instance with at least four cores and 16GB of RAM to run this example. Configure disk storage up to at least 32 GB. The instructions have been tested on an Alibaba Cloud g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance.

## Overview

Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you will learn how to run generative AI inference-based use case like a LLM chatbot on Arm-based CPUs. You do this by deploying the [Qwen2-0.5B-Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on your Arm-based CPU using `rtp-llm`.

[rtp-llm](https://github.com/alibaba/rtp-llm) is an open source C/C++ project developed by Alibaba that enables efficient LLM inference on a variety of hardware.

## Install dependencies

Install `micromamba` to setup python 3.10 at path `/opt/conda310`, required by `rtp-llm` build system:
Install `micromamba` to set up python 3.10 at path `/opt/conda310`, as required by the `rtp-llm` build system:

```bash
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
@@ -34,14 +24,14 @@ chmod +x bazelisk-linux-arm64
sudo mv bazelisk-linux-arm64 /usr/bin/bazelisk
```

Install `git/gcc/g++` on your machine:
Install `git/gcc/g++`:

```bash
sudo apt install git -y
sudo apt install build-essential -y
```

Install `openblas` developmwnt package and fix the header paths:
Install the `openblas` development package and fix the header paths:

```bash
sudo apt install libopenblas-dev
@@ -53,28 +43,28 @@ sudo ln -sf /usr/include/aarch64-linux-gnu/cblas.h /usr/include/openblas/cblas.h

You are now ready to start building `rtp-llm`.

Clone the source repository for rtp-llm:
Start by cloning the source repository for rtp-llm:

```bash
git clone https://github.com/alibaba/rtp-llm
cd rtp-llm
git checkout 4656265
```

Comment out the lines 7-10 in `deps/requirements_lock_torch_arm.txt` as some hosts are not accessible from the Internet.
Next, comment out lines 7-10 in `deps/requirements_lock_torch_arm.txt` as some hosts are not accessible from the web:

```bash
sed -i '7,10 s/^/#/' deps/requirements_lock_torch_arm.txt
```

By default, `rtp-llm` builds for GPU only on Linux. You need to provide extra config `--config=arm` to build it for the Arm CPU that you will run it on:
By default, `rtp-llm` builds for GPU only on Linux. You need to provide the additional flag `--config=arm` to build it for the Arm CPU that you will run it on.

Configure and build:

```bash
bazelisk build --config=arm //maga_transformer:maga_transformer_aarch64
```
The output from your build should look like:
The output from your build should look like this:

```output
INFO: 10094 processes: 8717 internal, 1377 local.
@@ -87,7 +77,7 @@ Install the built wheel package:
pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
```
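You can quickly check that the wheel installed correctly by importing the package it provides, `maga_transformer`, which is also the package used by the test script below:

```python
# Smoke test: confirm the maga_transformer package from the wheel imports cleanly.
import maga_transformer

print("maga_transformer imported successfully")
```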

Create a file named `python-test.py` in your `/tmp` directory with the contents below:
Create a file named `python-test.py` in your `/tmp` directory with the contents shown below:

```python
from maga_transformer.pipeline import Pipeline
@@ -140,7 +130,9 @@ Now run this file:
python /tmp/python-test.py
```

If `rtp-llm` has built correctly on your machine, you will see the LLM model response for the prompt input. A snippet of the output is shown below:
If `rtp-llm` has built correctly on your machine, you will see the LLM model response for the prompt input.

A snippet of the output is shown below:

```output
['I am a large language model created by Alibaba Cloud. My name is Qwen.']
@@ -174,5 +166,7 @@ If `rtp-llm` has built correctly on your machine, you will see the LLM model res
```


You have successfully run a LLM chatbot with Arm optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
You have successfully run an LLM chatbot with Arm optimizations, running on an Arm AArch64 CPU on your server.

You can continue to experiment with the chatbot by trying out different prompts on the model.

@@ -5,25 +5,32 @@ weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Setup

You can use the `rtp-llm` server program and submit requests using an OpenAI-compatible API.
This enables applications to be created which access the LLM multiple times without starting and stopping it. You can also access the server over the network to another machine hosting the LLM.
You can now move on to using the `rtp-llm` server program and submitting requests using an OpenAI-compatible API.

One additional software package is required for this section. Install `jq` on your computer using:
This enables you to build applications that access the LLM multiple times without starting and stopping it.

You can also access the server over the network from a machine other than the one hosting the LLM.

One additional software package is required for this section.

Install `jq` on your computer using the following command:

```bash
sudo apt install jq -y
```

# Running the Server
## Install Hugging Face Hub
## Running the Server

There are a few different ways you can download the Qwen2 0.5B model. In this Learning Path, you download the model from Hugging Face.
There are a few different ways you can download the Qwen2 0.5B model. In this Learning Path, you will download the model from Hugging Face.

[Hugging Face](https://huggingface.co/) is an open source AI community where you can host your own AI models, train them and collaborate with others in the community. You can browse through the thousands of models that are available for a variety of use cases like NLP, audio, and computer vision.
[Hugging Face](https://huggingface.co/) is an open source AI community where you can host your own AI models, train them, and collaborate with others in the community. You can browse through thousands of models that are available for a variety of use cases such as Natural Language Processing (NLP), audio, and computer vision.

The `huggingface_hub` library provides APIs and tools that let you easily download and fine-tune pre-trained models. You will use `huggingface-cli` to download the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).

## Install Hugging Face Hub

Install the required Python packages:

```bash
@@ -51,14 +58,18 @@ You can now download the model using the huggingface cli:
huggingface-cli download Qwen/Qwen2-0.5B-Instruct
```
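Alternatively, you can fetch the same files from Python with the `huggingface_hub` API. Here is a short sketch; downloads go to the same local Hugging Face cache used by the CLI:

```python
# Download the Qwen2 0.5B Instruct model with the huggingface_hub Python API.
# snapshot_download returns the local path of the cached model files.
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="Qwen/Qwen2-0.5B-Instruct")
print(f"Model files are in: {path}")
```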

## Start rtp-llm server
The server executable has already compiled during the stage detailed in the previous section, when you ran `bazelisk build`. Install the pip wheel in your active virtual environment:
## Start the rtp-llm server

{{% notice Note %}}
The server executable was compiled during the previous stage, when you ran `bazelisk build`. {{% /notice %}}

Install the pip wheel in your active virtual environment:

```bash
pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
pip install grpcio-tools
```
Start the server from the command line, it listens on port 8088:
Start the server from the command line. It listens on port 8088:

```bash
export CHECKPOINT_PATH=${HOME}/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/
@@ -67,8 +78,9 @@ export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
MODEL_TYPE=qwen_2 FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
```

# Client
## Use curl
## Client

### Using curl

You can access the API using the `curl` command.

@@ -90,15 +102,15 @@ curl http://localhost:8088/v1/chat/completions -H "Content-Type: application/jso
}' 2>/dev/null | jq -C
```

The `model` value in the API is not used, you can enter any value. This is because there is only one model loaded in the server.
The `model` value in the API is not used, and you can enter any value. This is because there is only one model loaded in the server.
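To illustrate, here is a minimal Python equivalent of the request, using only the standard library; the model name below is an arbitrary placeholder:

```python
# Send one chat completion request to the local rtp-llm server.
# The "model" field is a placeholder: the server ignores it because
# only one model is loaded.
import json
import urllib.request

payload = {
    "model": "any-model-name",
    "messages": [{"role": "user", "content": "Write a hello world program in C++"}],
}
req = urllib.request.Request(
    "http://localhost:8088/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.loads(resp.read()), indent=2))
```

The `curl` version of the same request is in the `curl-test.sh` script.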

Run the script:

```bash
bash ./curl-test.sh
```

The `curl` command accesses the LLM and you see the output:
The `curl` command accesses the LLM and you should see the output:

```output
{
@@ -124,9 +136,9 @@ The `curl` command accesses the LLM and you see the output:
}
```

In the returned JSON data you see the LLM output, including the content created from the prompt.
In the returned JSON data, you will see the LLM output, including the content created from the prompt.

## Use Python
### Using Python

You can also use a Python program to access the OpenAI-compatible API.

@@ -165,13 +177,13 @@ for chunk in completion:
print(chunk.choices[0].delta.content or "", end="")
```
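For reference, a minimal, self-contained client along these lines, assuming the `openai` Python package is installed and treating the base URL and API key values as placeholders for the local server, looks like this:

```python
# Minimal sketch of an OpenAI-compatible streaming client for the
# local rtp-llm server. base_url and api_key are assumptions: the
# server listens on port 8088 and does not validate the key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8088/v1", api_key="unused")

completion = client.chat.completions.create(
    model="any-model-name",  # ignored by the server; only one model is loaded
    messages=[{"role": "user", "content": "Write a hello world program in C++"}],
    stream=True,
)

# Print the response tokens as they stream back from the server.
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
```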

Run the Python file (make sure the server is still running):
Ensure that the server is still running, and then run the Python file:

```bash
python ./python-test.py
```

You see the output generated by the LLM:
You should see the output generated by the LLM:

```output
Sure, here's a simple C++ program that prints "Hello, World!" to the console:
@@ -187,4 +199,4 @@ int main() {
This program includes the `iostream` library, which is used for input/output operations. The `main` function is the entry point of the program, and it calls the `cout` object to print the message "Hello, World!" to the console.
```

You can continue to experiment with different large language models and write scripts to access them.
Now you can continue to experiment with different large language models, and have a go at writing scripts to access them.
