How could I set the stop sequence for inference like in CodeLlama 70B? #666
davideuler
started this conversation in
General
Replies: 1 comment 6 replies
-
This is more of an application-level implementation issue than an mlx issue. I have implemented this in one of my projects; you can take a look here.
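A minimal sketch of the application-level approach: accumulate the streamed token text and cut the output at the first occurrence of any stop sequence. The `token_stream` generator and all names here are illustrative assumptions, not MLX's actual API; searching the accumulated text (rather than individual tokens) also handles stop sequences that span token boundaries.

```python
def generate_with_stop(token_stream, stop_sequences):
    """Accumulate streamed token text; truncate at the first stop sequence.

    token_stream: any iterable yielding decoded text chunks (hypothetical
    stand-in for a model's streaming output).
    stop_sequences: list of strings at which generation should be cut off.
    """
    text = ""
    for chunk in token_stream:
        text += chunk
        for stop in stop_sequences:
            idx = text.find(stop)
            if idx != -1:
                # Drop the stop sequence itself and everything after it.
                return text[:idx]
    return text

# Example with a fake token stream; note the stop sequence is split
# across two chunks, which the accumulated-text search still catches.
chunks = ["Hello", " world", "<st", "ep>", " ignored"]
print(generate_with_stop(chunks, ["<step>"]))  # -> "Hello world"
```

In a real setup you would break out of the model's generation loop as soon as a match is found, so no further tokens are sampled.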
-
When running inference with CodeLlama 70B, I need to specify a stop sequence, as I do in llama.cpp or Ollama.
When I run the CodeLlama 70B 4-bit MLX model, it outputs lots of EOT tokens and never stops. I am not sure whether this is caused by the stop-sequence settings. How can I set the stop sequence in MLX?