
feat: better LLM response format #387

Merged: 2 commits merged into livepeer:main from nv/better-llm-response on Jan 14, 2025

Conversation

kyriediculous (Contributor)

  • Improve LLM responses (OpenAI-compatible format for non-streaming responses; an illustrative response shape is sketched below)
  • Use a UUID for vLLM request generation
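
For readers unfamiliar with the target format, here is a minimal Go sketch of the general OpenAI chat-completions response shape the description refers to. This is illustrative only: the field names follow the example chunk later in this thread plus the standard OpenAI convention of a message object for non-streaming choices, and are not taken from this PR's generated schema.

package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types only; not the repo's generated bindings.
type LLMChoice struct {
	Index        int               `json:"index"`
	Message      map[string]string `json:"message"` // {"role": ..., "content": ...} for non-streaming
	FinishReason string            `json:"finish_reason"`
}

type LLMResponse struct {
	ID         string      `json:"id"`
	Model      string      `json:"model"`
	Created    int64       `json:"created"`
	Choices    []LLMChoice `json:"choices"`
	TokensUsed int         `json:"tokens_used"`
}

func main() {
	// Example values mirror the streaming chunk shown later in this thread.
	resp := LLMResponse{
		ID:      "chatcmpl-<uuid>", // placeholder ID; the PR uses a UUID here
		Model:   "meta-llama/Meta-Llama-3.1-8B-Instruct",
		Created: 1735437990,
		Choices: []LLMChoice{{
			Index:        0,
			Message:      map[string]string{"role": "assistant", "content": "Hello!"},
			FinishReason: "stop",
		}},
		TokensUsed: 552,
	}
	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}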

@kyriediculous (Contributor, Author)

I don't know exactly why, but I keep running into differently generated bindings than you guys. Might be a versioning thing.

@ad-astra-video (Collaborator) commented on Dec 29, 2024

Reviewed the updates; some comments below:

  • stream_generator needs to be updated to process the LLMResponse that is returned now. Below is how I did it to test end to end; you may want to do it differently, I just wanted to provide a head start.
 async def stream_generator(generator):
     try:
         async for chunk in generator:
-            if isinstance(chunk, dict):
-                if "choices" in chunk:
+            if isinstance(chunk, LLMResponse):
+                if len(chunk.choices) > 0:
                     # Regular streaming chunk or final chunk
-                    yield f"data: {json.dumps(chunk)}\n\n"
-                    if chunk["choices"][0].get("finish_reason") == "stop":
+                    yield f"data: {chunk.model_dump_json()}\n\n"
+                    if chunk.choices[0].finish_reason == "stop":
                         break

handleStreamingResponse also needs an update; here is an example I used to test end to end:

@@ -795,16 +796,16 @@ func (w *Worker) handleStreamingResponse(ctx context.Context, c *RunnerContainer
                                                return
                                        }
 
-                                       var streamData LlmStreamChunk
+                                       var streamData LLMResponse
                                        if err := json.Unmarshal([]byte(data), &streamData); err != nil {
                                                slog.Error("Error unmarshaling stream data", slog.String("err", err.Error()))
                                                continue
                                        }
 
-                                       totalTokens += streamData.TokensUsed
-
+                                       totalTokens = streamData.TokensUsed
+                                       chunk := LlmStreamChunk{Chunk: data, Done: false, TokensUsed: totalTokens}
                                        select {
-                                       case outputChan <- streamData:
+                                       case outputChan <- chunk:
                                        case <-ctx.Done():
                                                return
                                        }
  • The defer cancel() in the LLM function of worker.go needs to be removed, since it returns the container immediately for streamed responses. I was not able to test this until testing through go-livepeer managed containers. (A minimal illustration of the issue is sketched after this list.)
+++ b/worker/worker.go
@@ -397,7 +397,7 @@ func (w *Worker) AudioToText(ctx context.Context, req GenAudioToTextMultipartReq
 func (w *Worker) LLM(ctx context.Context, req GenLLMJSONRequestBody) (interface{}, error) {
        isStreaming := req.Stream != nil && *req.Stream
        ctx, cancel := context.WithCancel(ctx)
-       defer cancel()
+
        c, err := w.borrowContainer(ctx, "llm", *req.Model)
        if err != nil {
                return nil, err
  • The total_tokens seems to be double counted, maybe because the tokenizer counts the space as a token? Is this expected behavior? Note: this was not changed in this PR and does not cause payment issues, but I wanted to confirm.

  • Do you still want to use LlmStreamChunk to send the response through go-livepeer? If you want to change it to LLMResponse, go-livepeer will need an update, but it should be a quick one. Example response chunk from streaming (see the wrapping sketch below):

data: {"chunk":"{\"choices\":[{\"delta\":{\"role\":\"assistant\",\"content\":\"\"},\"index\":0,\"finish_reason\":\"stop\"}],\"tokens_used\":552,\"id\":\"chatcmpl-d2dd5298-8330-48db-8370-901dea8e5a12\",\"model\":\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\"created\":1735437990}","tokens_used":552}
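
A minimal sketch of how the wrapped payload above is produced, with field names taken from the diff and the example chunk (the repo's generated types may differ). Sending LLMResponse directly would avoid the double JSON encoding, at the cost of the matching go-livepeer update mentioned above.

package main

import (
	"encoding/json"
	"fmt"
)

// Field names inferred from the diff and example payload above;
// not the repo's generated bindings.
type LlmStreamChunk struct {
	Chunk      string `json:"chunk"` // raw LLMResponse JSON from the runner
	Done       bool   `json:"done,omitempty"`
	TokensUsed int    `json:"tokens_used"`
}

func main() {
	// One SSE data line from the runner, already an OpenAI-style LLMResponse.
	data := `{"choices":[{"delta":{"role":"assistant","content":""},"index":0,"finish_reason":"stop"}],"tokens_used":552,"id":"chatcmpl-d2dd5298-8330-48db-8370-901dea8e5a12","model":"meta-llama/Meta-Llama-3.1-8B-Instruct","created":1735437990}`

	// Wrapping it in LlmStreamChunk double-encodes the JSON, which is exactly
	// the shape of the example chunk above.
	wrapped, _ := json.Marshal(LlmStreamChunk{Chunk: data, Done: false, TokensUsed: 552})
	fmt.Println("data: " + string(wrapped))
}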

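And to illustrate the defer cancel() point from the list above: a stripped-down sketch (not the actual worker.go code) of why cancelling the context as soon as the function returns cuts off a stream that a goroutine is still producing. startStream stands in for the streaming branch of Worker.LLM, which hands back a channel while a goroutine keeps reading from the runner under ctx.

package main

import (
	"context"
	"fmt"
	"time"
)

// startStream keeps producing chunks in a goroutine until ctx is cancelled;
// it is a stand-in for the streaming response handling, not the real code.
func startStream(ctx context.Context) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for i := 0; i < 5; i++ {
			select {
			case <-ctx.Done():
				// With `defer cancel()` in the caller, we land here almost
				// immediately, before the stream has finished.
				fmt.Println("stream aborted:", ctx.Err())
				return
			case out <- fmt.Sprintf("chunk %d", i):
				time.Sleep(10 * time.Millisecond)
			}
		}
	}()
	return out
}

func llm(ctx context.Context) <-chan string {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // the problem: fires as soon as llm() returns the channel
	return startStream(ctx)
}

func main() {
	// The consumer usually sees only a chunk or two before the abort message.
	for chunk := range llm(context.Background()) {
		fmt.Println(chunk)
	}
}
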
@ad-astra-video (Collaborator)

@kyriediculous can you comment on these items above:

  • handleStreamingResponse also needs an update (example above, used to test end to end).

  • The defer cancel() in the LLM function of worker.go needs to be removed, since it returns the container immediately for streamed responses. I was not able to test this until testing through go-livepeer managed containers.

  • Do you still want to use LlmStreamChunk to send the response through go-livepeer?

@ad-astra-video (Collaborator) left a comment:

A couple of additional updates for the change in response.

Four resolved review comments on runner/app/pipelines/llm.py (outdated).
@victorges (Member) left a comment:

Code LGTM, but I haven't checked the OpenAPI schema to make sure the changes keep it compatible. I only reviewed the changes here.

One resolved review comment on runner/app/routes/utils.py (outdated).
@kyriediculous force-pushed the nv/better-llm-response branch from 97af593 to 5d10a62 on January 13, 2025 at 11:17
@ad-astra-video (Collaborator)

Approved! Will fix the OpenAPI gen in a separate PR.

@ad-astra-video merged commit 1ede01e into livepeer:main on Jan 14, 2025
6 of 8 checks passed