# OpenAI API standard conformity #9
**Open** · vlbosch wants to merge 3 commits into `nath1295:main` from `vlbosch:main`
Merge commit message:

```
# Conflicts:
#	src/mlx_textgen/server.py
```
Hi @vlbosch, I am wondering if the completion id is not consistent within the same streaming session in the current version, as I think the id is created within the ModelEngine class. For the rest, I do agree that it is more or less what you described.
### Problem Description

MLX-Textgen offers OpenAI API compatibility, but some clients, such as Witsy.ai, expect the API format to be followed exactly. The original implementation deviated from the OpenAI API specification in several subtle ways:

1. For streaming chat completions, the initial message containing only `{"delta": {"role": "assistant"}}` was missing
2. Deltas often contained multiple fields at once, such as `{"role": "assistant", "content": "text"}`, while OpenAI's API sends only one field per delta object (illustrated in the sketch after this list)
3. The final chunk sometimes lacked the empty delta object with `"finish_reason": "stop"`
4. Some metadata, such as `system_fingerprint` and `logprobs`, was missing
5. The `data:` prefix and `\n\n` suffix for SSE chunks were applied inconsistently
6. Completion IDs were inconsistent within a single streaming session
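To make deviation 2 concrete, the dict literals below sketch the same stream fragment in the old combined form and in the spec-conformant split form; the values are illustrative and not taken from the actual codebase:

```python
# Old behaviour: role and content combined in a single delta.
combined_delta = {"role": "assistant", "content": "Hello"}

# Spec-conformant: the same information split across two chunks,
# each delta carrying exactly one field.
role_delta = {"role": "assistant"}    # first chunk of the stream
content_delta = {"content": "Hello"}  # one chunk per content fragment
```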
### Solution

This PR implements precise OpenAI API compatibility by:

1. Restructuring the streaming functions to follow the correct format exactly
2. Formatting each streamed chunk with the exact fields and values
3. Using a consistent ID and timestamp for all chunks in a session
4. Handling delta objects properly, with only one change per chunk
5. Adding an explicit final message with `"finish_reason": "stop"`
6. Retaining the original functionality, including tool calling

The main changed components are:

1. `async_generate_stream` - For streaming text completions
2. `async_generate` - For non-streaming text completions
3. `async_chat_generate_stream` - For streaming chat completions
4. `async_chat_generate` - For non-streaming chat completions

For debugging purposes, a logger has also been added that writes detailed information about the streaming messages to a log file (a minimal setup sketch follows).
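The PR text does not show the logger configuration, so the snippet below is only a sketch of how such a debug logger could be wired up with Python's standard `logging` module; the logger name and log file path are assumptions, not values from the PR:

```python
import logging

# Hypothetical names: the actual logger name and log file path in the PR
# may differ; this only sketches the standard-library setup.
stream_logger = logging.getLogger("mlx_textgen.stream_debug")
stream_logger.setLevel(logging.DEBUG)

file_handler = logging.FileHandler("stream_debug.log")
file_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(message)s")
)
stream_logger.addHandler(file_handler)

# Inside a streaming function, each outgoing SSE chunk could be recorded:
# stream_logger.debug("chunk: %s", chunk_str)
```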
### Technical Details

#### Correct Structure of a Chat Completion Stream

A chat completion stream must follow this structure exactly (a combined sketch follows the list):

1. Start with a message whose delta contains only the role:

```json
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"model-name","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
```

2. Follow this with messages whose deltas contain only content:

```json
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"model-name","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}]}
```

3. End with an empty delta object and `"finish_reason": "stop"`:

```json
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"model-name","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}
```

4. Conclude with `data: [DONE]`
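Putting the four steps together, here is a minimal, self-contained sketch of an async generator that yields this sequence as SSE-framed strings. It is illustrative only: `chat_completion_sse`, `make_chunk`, and `token_stream` are hypothetical names rather than the PR's actual helpers, and the real `async_chat_generate_stream` additionally handles tool calls.

```python
import json
import time
import uuid
from typing import AsyncIterator, Optional

async def chat_completion_sse(
    token_stream: AsyncIterator[str], model: str
) -> AsyncIterator[str]:
    """Yield an OpenAI-style chat completion stream as SSE strings."""
    # One completion ID and timestamp, reused for every chunk in the session.
    completion_id = f"chatcmpl-{uuid.uuid4().hex}"
    created = int(time.time())

    def make_chunk(delta: dict, finish_reason: Optional[str]) -> str:
        payload = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "system_fingerprint": None,
            "choices": [{
                "index": 0,
                "delta": delta,
                "logprobs": None,
                "finish_reason": finish_reason,
            }],
        }
        return f"data: {json.dumps(payload)}\n\n"  # SSE framing

    yield make_chunk({"role": "assistant"}, None)       # step 1: role only
    async for token in token_stream:
        if token:                                       # skip empty deltas
            yield make_chunk({"content": token}, None)  # step 2: content only
    yield make_chunk({}, "stop")                        # step 3: empty delta + stop
    yield "data: [DONE]\n\n"                            # step 4: terminator
```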
#### Important Changes in the Implementation

* We create a consistent completion ID and timestamp for all chunks in a session
* For delta objects, we ensure that only one field is used at a time (role OR content OR tool_calls); see the splitting sketch after this list
* Empty deltas are skipped, except for the final message
* SSE chunks always contain the `data:` prefix and `\n\n` suffix
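The one-field rule can be pictured as a small splitting step. The helper below is a hypothetical sketch, not the PR's actual code, turning a combined delta into single-field deltas in the order the stream structure above requires:

```python
def split_delta(delta: dict) -> list[dict]:
    """Split a delta carrying several fields into single-field deltas.

    Emission order follows the stream structure above: role first,
    then content, then tool_calls. Hypothetical helper, for illustration.
    """
    order = ("role", "content", "tool_calls")
    return [{key: delta[key]} for key in order if key in delta]

# The old combined delta from the problem description becomes two chunks:
assert split_delta({"role": "assistant", "content": "Hi"}) == [
    {"role": "assistant"},
    {"content": "Hi"},
]
```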
### Testing
These changes have been tested with Witsy.ai as the client and resolve the streaming issues; a client-side verification sketch follows the list below. All functionality is retained, including:

* Text completion streaming
* Chat completion streaming
* Tool calling
* Non-streaming requests
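To reproduce the check without Witsy.ai, the snippet below sketches a verification pass using the official `openai` Python client against a locally running MLX-Textgen server; the base URL, port, API key, and model name are assumptions, not values from the PR:

```python
from openai import OpenAI

# Assumed local endpoint; adjust host, port, and model name to your setup.
client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="model-name",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)

for chunk in stream:
    choice = chunk.choices[0]
    # Conformant streams share one id, carry one delta field per chunk,
    # and end with an empty delta plus finish_reason == "stop".
    print(chunk.id, choice.delta, choice.finish_reason)
```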