
feat: better LLM response format #387

Merged: 2 commits merged into livepeer:main from nv/better-llm-response on Jan 14, 2025

Conversation

kyriediculous (Contributor)

  • Improve LLM responses (OpenAI-compatible format for non-streaming responses; an illustrative response shape is sketched below)
  • Use a UUID for vLLM request generation
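
For readers unfamiliar with the target format, here is a minimal Go sketch of the general OpenAI chat-completions response shape the description refers to. This is illustrative only: the field names follow the example chunk later in this thread plus the standard OpenAI convention of a message object for non-streaming choices, and are not taken from this PR's generated schema.

package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types only; not the repo's generated bindings.
type LLMChoice struct {
	Index        int               `json:"index"`
	Message      map[string]string `json:"message"` // {"role": ..., "content": ...} for non-streaming
	FinishReason string            `json:"finish_reason"`
}

type LLMResponse struct {
	ID         string      `json:"id"`
	Model      string      `json:"model"`
	Created    int64       `json:"created"`
	Choices    []LLMChoice `json:"choices"`
	TokensUsed int         `json:"tokens_used"`
}

func main() {
	// Example values mirror the streaming chunk shown later in this thread.
	resp := LLMResponse{
		ID:      "chatcmpl-<uuid>", // placeholder ID; the PR uses a UUID here
		Model:   "meta-llama/Meta-Llama-3.1-8B-Instruct",
		Created: 1735437990,
		Choices: []LLMChoice{{
			Index:        0,
			Message:      map[string]string{"role": "assistant", "content": "Hello!"},
			FinishReason: "stop",
		}},
		TokensUsed: 552,
	}
	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}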

@kyriediculous (Contributor, Author)

I don't know exactly why, but I keep running into differently generated bindings than you guys. Might be a versioning thing.

@ad-astra-video (Collaborator) commented on Dec 29, 2024

Reviewed the updates; some comments below:

  • stream_generator needs to be updated to process the LLMResponse that is returned now. Below is how I did it to test end to end; you may want to do it differently, I just wanted to provide a head start.
 async def stream_generator(generator):
     try:
         async for chunk in generator:
-            if isinstance(chunk, dict):
-                if "choices" in chunk:
+            if isinstance(chunk, LLMResponse):
+                if len(chunk.choices) > 0:
                     # Regular streaming chunk or final chunk
-                    yield f"data: {json.dumps(chunk)}\n\n"
-                    if chunk["choices"][0].get("finish_reason") == "stop":
+                    yield f"data: {chunk.model_dump_json()}\n\n"
+                    if chunk.choices[0].finish_reason == "stop":
                         break

handleStreamingResponse also needs an update; here is an example I used to test end to end:

@@ -795,16 +796,16 @@ func (w *Worker) handleStreamingResponse(ctx context.Context, c *RunnerContainer
                                                return
                                        }
 
-                                       var streamData LlmStreamChunk
+                                       var streamData LLMResponse
                                        if err := json.Unmarshal([]byte(data), &streamData); err != nil {
                                                slog.Error("Error unmarshaling stream data", slog.String("err", err.Error()))
                                                continue
                                        }
 
-                                       totalTokens += streamData.TokensUsed
-
+                                       totalTokens = streamData.TokensUsed
+                                       chunk := LlmStreamChunk{Chunk: data, Done: false, TokensUsed: totalTokens}
                                        select {
-                                       case outputChan <- streamData:
+                                       case outputChan <- chunk:
                                        case <-ctx.Done():
                                                return
                                        }
  • The defer cancel() in the LLM function of worker.go needs to be removed, since it returns the container immediately for streamed responses. I was not able to test this until testing through go-livepeer managed containers. (A minimal illustration of the issue is sketched after this list.)
+++ b/worker/worker.go
@@ -397,7 +397,7 @@ func (w *Worker) AudioToText(ctx context.Context, req GenAudioToTextMultipartReq
 func (w *Worker) LLM(ctx context.Context, req GenLLMJSONRequestBody) (interface{}, error) {
        isStreaming := req.Stream != nil && *req.Stream
        ctx, cancel := context.WithCancel(ctx)
-       defer cancel()
+
        c, err := w.borrowContainer(ctx, "llm", *req.Model)
        if err != nil {
                return nil, err
  • The total_tokens seems to be double counted, maybe because the tokenizer counts the space as a token? Is this expected behavior? Note: this was not changed in this PR and does not cause payment issues, but I wanted to confirm.

  • Do you still want to use LlmStreamChunk to send the response through go-livepeer? If you want to change it to LLMResponse, go-livepeer will need an update, but it should be a quick one. Example response chunk from streaming (see the wrapping sketch below):

data: {"chunk":"{\"choices\":[{\"delta\":{\"role\":\"assistant\",\"content\":\"\"},\"index\":0,\"finish_reason\":\"stop\"}],\"tokens_used\":552,\"id\":\"chatcmpl-d2dd5298-8330-48db-8370-901dea8e5a12\",\"model\":\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\"created\":1735437990}","tokens_used":552}
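
A minimal sketch of how the wrapped payload above is produced, with field names taken from the diff and the example chunk (the repo's generated types may differ). Sending LLMResponse directly would avoid the double JSON encoding, at the cost of the matching go-livepeer update mentioned above.

package main

import (
	"encoding/json"
	"fmt"
)

// Field names inferred from the diff and example payload above;
// not the repo's generated bindings.
type LlmStreamChunk struct {
	Chunk      string `json:"chunk"` // raw LLMResponse JSON from the runner
	Done       bool   `json:"done,omitempty"`
	TokensUsed int    `json:"tokens_used"`
}

func main() {
	// One SSE data line from the runner, already an OpenAI-style LLMResponse.
	data := `{"choices":[{"delta":{"role":"assistant","content":""},"index":0,"finish_reason":"stop"}],"tokens_used":552,"id":"chatcmpl-d2dd5298-8330-48db-8370-901dea8e5a12","model":"meta-llama/Meta-Llama-3.1-8B-Instruct","created":1735437990}`

	// Wrapping it in LlmStreamChunk double-encodes the JSON, which is exactly
	// the shape of the example chunk above.
	wrapped, _ := json.Marshal(LlmStreamChunk{Chunk: data, Done: false, TokensUsed: 552})
	fmt.Println("data: " + string(wrapped))
}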

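And to illustrate the defer cancel() point from the list above: a stripped-down sketch (not the actual worker.go code) of why cancelling the context as soon as the function returns cuts off a stream that a goroutine is still producing. startStream stands in for the streaming branch of Worker.LLM, which hands back a channel while a goroutine keeps reading from the runner under ctx.

package main

import (
	"context"
	"fmt"
	"time"
)

// startStream keeps producing chunks in a goroutine until ctx is cancelled;
// it is a stand-in for the streaming response handling, not the real code.
func startStream(ctx context.Context) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for i := 0; i < 5; i++ {
			select {
			case <-ctx.Done():
				// With `defer cancel()` in the caller, we land here almost
				// immediately, before the stream has finished.
				fmt.Println("stream aborted:", ctx.Err())
				return
			case out <- fmt.Sprintf("chunk %d", i):
				time.Sleep(10 * time.Millisecond)
			}
		}
	}()
	return out
}

func llm(ctx context.Context) <-chan string {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // the problem: fires as soon as llm() returns the channel
	return startStream(ctx)
}

func main() {
	// The consumer usually sees only a chunk or two before the abort message.
	for chunk := range llm(context.Background()) {
		fmt.Println(chunk)
	}
}
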
@ad-astra-video (Collaborator)

@kyriediculous can you comment on these items above:

  • handleStreamingResponse also needs an update (example above, used to test end to end).

  • The defer cancel() in the LLM function of worker.go needs to be removed, since it returns the container immediately for streamed responses. I was not able to test this until testing through go-livepeer managed containers.

  • Do you still want to use LlmStreamChunk to send the response through go-livepeer?

@ad-astra-video (Collaborator) left a comment:

A couple of additional updates for the change in response.

Four resolved review comments on runner/app/pipelines/llm.py (outdated).
@victorges (Member) left a comment:

Code LGTM, but I haven't checked the OpenAPI schema to make sure the changes keep it compatible. I only reviewed the changes here.

One resolved review comment on runner/app/routes/utils.py (outdated).
@kyriediculous force-pushed the nv/better-llm-response branch from 97af593 to 5d10a62 on January 13, 2025 at 11:17
@ad-astra-video (Collaborator)

Approved! Will fix the OpenAPI gen in a separate PR.

@ad-astra-video merged commit 1ede01e into livepeer:main on Jan 14, 2025
6 of 8 checks passed