# OpenAI API standard conformity #9

Open · wants to merge 3 commits into `main`

**vlbosch** (Contributor) commented Mar 3, 2025

### Problem Description

MLX-Textgen offers OpenAI API compatibility, but some clients, such as Witsy.ai, expect the API format to be followed exactly. The original implementation deviated from the OpenAI API specification in several subtle ways:

1. For streaming chat completions, an initial message containing only `{"delta": {"role": "assistant"}}` was missing
2. Deltas often contained multiple fields at once, such as `{"role": "assistant", "content": "text"}`, while OpenAI's API sends only one field per delta object
3. The final chunk sometimes lacked the empty delta object with `"finish_reason": "stop"`
4. Some metadata, such as `system_fingerprint` and `logprobs`, was missing
5. The `data:` prefix and `\n\n` suffix for SSE chunks were applied inconsistently
6. Completion IDs were inconsistent within a single streaming session

### Solution

This PR implements precise OpenAI API compatibility by:

1. Restructuring the streaming functions to follow the correct format exactly
2. Formatting each streamed chunk with the exact fields and values
3. Using a consistent ID and timestamp for all chunks in a session (see the sketch after this list)
4. Ensuring each delta object carries only one change per chunk
5. Adding an explicit final message with `"finish_reason": "stop"`
6. Retaining the original functionality, including tool calling
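
For illustration, the session-wide ID and timestamp of point 3 could be created once and reused for every chunk, roughly as below; the helper name is hypothetical and not taken from the PR:

```python
import time
import uuid

def new_session_metadata(model: str) -> dict:
    """Build the fields shared by every chunk of one streaming session.

    The ID and timestamp are generated once, so every chunk of the same
    stream reports identical "id" and "created" values.
    """
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "system_fingerprint": None,
    }
```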

The main changed components are:

1. `async_generate_stream` - for streaming text completions
2. `async_generate` - for non-streaming text completions
3. `async_chat_generate_stream` - for streaming chat completions
4. `async_chat_generate` - for non-streaming chat completions

For debugging purposes, a logger has also been added that writes detailed information about the streaming messages to a log file.
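
As a rough illustration of such a logger (the file name, logger name, and log format here are assumptions, not taken from the PR):

```python
import logging

# Hypothetical setup; the actual names in the PR may differ.
stream_logger = logging.getLogger("mlx_textgen.stream_debug")
stream_logger.setLevel(logging.DEBUG)
handler = logging.FileHandler("stream_debug.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
stream_logger.addHandler(handler)

# Each outgoing SSE chunk can then be recorded before it is sent, e.g.:
# stream_logger.debug("chunk: %s", sse_chunk)
```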

### Technical Details

#### Correct Structure of a Chat Completion Stream

A chat completion stream must follow this structure exactly:

1. Start with a message that contains only the role:

   ```json
   data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"model-name","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
   ```

2. Follow this with messages that contain only content:

   ```json
   data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"model-name","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}]}
   ```

3. End with an empty delta object and `"finish_reason": "stop"`:

   ```json
   data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"model-name","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}
   ```

4. Conclude with `data: [DONE]` (a sketch of a generator emitting this full sequence follows the list)
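
A minimal sketch of an async generator emitting this exact sequence, assuming tokens arrive from an upstream async iterator; `format_sse`, `chat_completion_stream`, and `token_stream` are illustrative names rather than the PR's actual API, and `new_session_metadata` is the helper sketched earlier:

```python
import json
from typing import AsyncIterator, Optional

def format_sse(payload: dict) -> str:
    """Frame one chunk for SSE: 'data: ' prefix and blank-line ('\\n\\n') suffix."""
    return f"data: {json.dumps(payload)}\n\n"

async def chat_completion_stream(model: str,
                                 token_stream: AsyncIterator[str]) -> AsyncIterator[str]:
    meta = new_session_metadata(model)  # same "id"/"created" for every chunk

    def chunk(delta: dict, finish_reason: Optional[str] = None) -> dict:
        return {**meta, "choices": [{"index": 0, "delta": delta,
                                     "logprobs": None, "finish_reason": finish_reason}]}

    # 1. Initial chunk carrying only the role.
    yield format_sse(chunk({"role": "assistant"}))

    # 2. Content-only chunks; empty deltas are skipped.
    async for token in token_stream:
        if token:
            yield format_sse(chunk({"content": token}))

    # 3. Final chunk: empty delta with finish_reason "stop".
    yield format_sse(chunk({}, finish_reason="stop"))

    # 4. Terminator.
    yield "data: [DONE]\n\n"
```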

#### Important Changes in the Implementation

* A consistent completion ID and timestamp are created for all chunks in a session
* Each delta object uses only one field at a time (role OR content OR tool_calls)
* Empty deltas are skipped, except for the final message
* SSE chunks always carry the `data:` prefix and `\n\n` suffix (the one-field rule is sketched below)
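
As a rough sketch of the one-field-per-delta rule (the function name is hypothetical), a delta that mixes fields can be split into spec-conformant single-field deltas before being emitted:

```python
def split_delta(delta: dict) -> list:
    """Split a delta mixing several fields into single-field deltas.

    Fields are kept in the order clients expect (role, then content,
    then tool_calls); empty values are dropped entirely.
    """
    ordered_fields = ("role", "content", "tool_calls")
    return [{field: delta[field]} for field in ordered_fields
            if delta.get(field)]

# Example: the non-conformant {"role": "assistant", "content": "Hello"}
# becomes [{"role": "assistant"}, {"content": "Hello"}].
```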

### Testing

These changes have been tested with Witsy.ai as the client and resolve the streaming issues. All functionality is retained, including:

* Text completion streaming
* Chat completion streaming
* Tool calling
* Non-streaming requests

*vlbosch added 3 commits on March 3, 2025*

**nath1295** (Owner) commented Mar 5, 2025

Hi @vlbosch,

I am wondering whether the completion ID is actually inconsistent within the same streaming session in the current version, as I think the ID is created within the ModelEngine class. For the rest, I agree that it is more or less as you described.
