In this chapter we use vLLM to build the inference service. We chose vLLM for two reasons: 1. vLLM has good performance; you can find more performance details at https://vllm.ai/ . 2. vLLM can be deployed as a server that mimics the OpenAI API protocol, which allows it to be used as a drop-in replacement for applications that use the OpenAI API.
#install vLLM
pip install vllm
#start an OpenAI-compatible API server serving the Llama 2 7B chat model
python -m vllm.entrypoints.openai.api_server --model ./Llama2/models/llama-2-7b-chat-hf
#list models
curl http://127.0.0.1:8000/v1/models|jq
{
  "object": "list",
  "data": [
    {
      "id": "./Llama2/models/llama-2-7b-chat-hf",
      "object": "model",
      "created": 1692330491,
      "owned_by": "vllm",
      "root": "./Llama2/models/llama-2-7b-chat-hf",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-ed8520baef03464d8314f1010b48f7ec",
          "object": "model_permission",
          "created": 1692330491,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
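Because the server follows the OpenAI HTTP protocol, any HTTP client can send a completion request to it. The sketch below uses the requests library and is an illustration only: the prompt and generation parameters are hypothetical, and it assumes the server is running on the default port 8000.

#send a raw request to the OpenAI-compatible /v1/completions endpoint
import requests

payload = {
    "model": "./Llama2/models/llama-2-7b-chat-hf",
    "prompt": "Hello, my name is",   #hypothetical prompt for illustration
    "max_tokens": 64,
    "temperature": 0.6,
}
response = requests.post("http://127.0.0.1:8000/v1/completions", json=payload)
#the response body mirrors the OpenAI completion schema
print(response.json()["choices"][0]["text"])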
We can also use the openai client to send our prompt to the vLLM API server.
#use the openai client to send a query to the vLLM API server
...
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"
...
completion = openai.Completion.create(
    model="./Llama2/models/llama-2-7b-chat-hf",
    prompt=prompt,
    temperature=0.6,
    max_tokens=2048,
)
...
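For reference, here is a self-contained version of the snippet above. It is a minimal sketch assuming the legacy openai Python package (pre-1.0 interface); the prompt is hypothetical, and the model path must match the one the server was started with.

#complete example: query the vLLM server with the openai client (pre-1.0 interface)
import openai

openai.api_key = "EMPTY"   #vLLM does not check the key, but the client requires one
openai.api_base = "http://localhost:8000/v1"

prompt = "What is vLLM?"   #hypothetical prompt for illustration

completion = openai.Completion.create(
    model="./Llama2/models/llama-2-7b-chat-hf",
    prompt=prompt,
    temperature=0.6,
    max_tokens=2048,
)
print(completion.choices[0].text)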
Start a chat conversation
python chapter5.py --file_path ../../pdf/