
Dynamic Context Window Size for Ollama Chat #6582

Merged: 3 commits into infiniflow:main on Mar 28, 2025

Conversation

MarcusYuan (Contributor)

Dynamic Context Window Size for Ollama Chat

Problem Statement

Previously, the Ollama chat implementation used a fixed context window size of 32768 tokens. This caused two main issues:

  1. Performance degradation due to unnecessarily large context windows for small conversations
  2. Potential business logic failures when using smaller fixed sizes (e.g., 2048 tokens)

Solution

Implemented a dynamic context window size calculation that:

  1. Uses a base context size of 8192 tokens
  2. Applies a 1.2x buffer ratio to the total token count
  3. Adds multiples of 8192 tokens based on the buffered token count
  4. Implements a smart context size update strategy

Implementation Details

Token Counting Logic

def count_tokens(text):
    """Calculate token count for text"""
    # Simple calculation: 1 token per ASCII character
    # 2 tokens for non-ASCII characters (Chinese, Japanese, Korean, etc.)
    total = 0
    for char in text:
        if ord(char) < 128:  # ASCII characters
            total += 1
        else:  # Non-ASCII characters
            total += 2
    return total
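
For illustration, a few values that follow directly from the per-character rule above:

print(count_tokens("Hello"))    # 5 ASCII characters   -> 5 tokens
print(count_tokens("你好"))      # 2 non-ASCII characters -> 4 tokens
print(count_tokens("Hi 你好"))   # 3 ASCII + 2 non-ASCII  -> 3 + 4 = 7 tokens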

Dynamic Context Calculation

def _calculate_dynamic_ctx(self, history):
    """Calculate dynamic context window size"""
    # Calculate total tokens for all messages
    total_tokens = 0
    for message in history:
        content = message.get("content", "")
        content_tokens = count_tokens(content)
        role_tokens = 4  # Role marker token overhead
        total_tokens += content_tokens + role_tokens

    # Apply 1.2x buffer ratio
    total_tokens_with_buffer = int(total_tokens * 1.2)
    
    # Calculate context size in multiples of 8192
    if total_tokens_with_buffer <= 8192:
        ctx_size = 8192
    else:
        ctx_multiplier = (total_tokens_with_buffer // 8192) + 1
        ctx_size = ctx_multiplier * 8192
    
    return ctx_size
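
As a quick sanity check of the rounding rule, the function above can be called directly with a placeholder in place of self, since it never touches instance state; the histories below are synthetic:

# 996 ASCII chars + 4 role tokens = 1000; 1000 * 1.2 = 1200 fits the 8192 base window
print(_calculate_dynamic_ctx(None, [{"role": "user", "content": "a" * 996}]))    # 8192
# 29996 + 4 = 30000; 30000 * 1.2 = 36000; (36000 // 8192) + 1 = 5 -> 5 * 8192
print(_calculate_dynamic_ctx(None, [{"role": "user", "content": "a" * 29996}]))  # 40960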

Integration in Chat Method

def chat(self, system, history, gen_conf):
    if system:
        history.insert(0, {"role": "system", "content": system})
    if "max_tokens" in gen_conf:
        del gen_conf["max_tokens"]
    try:
        # Calculate new context size
        new_ctx_size = self._calculate_dynamic_ctx(history)
        
        # Prepare options with context size
        options = {
            "num_ctx": new_ctx_size
        }
        # Add other generation options
        if "temperature" in gen_conf:
            options["temperature"] = gen_conf["temperature"]
        if "max_tokens" in gen_conf:
            options["num_predict"] = gen_conf["max_tokens"]
        if "top_p" in gen_conf:
            options["top_p"] = gen_conf["top_p"]
        if "presence_penalty" in gen_conf:
            options["presence_penalty"] = gen_conf["presence_penalty"]
        if "frequency_penalty" in gen_conf:
            options["frequency_penalty"] = gen_conf["frequency_penalty"]
            
        # Make API call with dynamic context size
        response = self.client.chat(
            model=self.model_name,
            messages=history,
            options=options,
            keep_alive=60
        )
        return response["message"]["content"].strip(), response.get("eval_count", 0) + response.get("prompt_eval_count", 0)
    except Exception as e:
        return "**ERROR**: " + str(e), 0

Benefits

  1. Improved Performance: Uses appropriate context windows based on conversation length
  2. Better Resource Utilization: Context window size scales with content
  3. Maintained Compatibility: Works with existing business logic
  4. Predictable Scaling: Context growth in 8192-token increments
  5. Smart Updates: Context size updates are optimized to reduce unnecessary model reloads

Future Considerations

  1. Fine-tune buffer ratio based on usage patterns
  2. Add monitoring for context window utilization
  3. Consider language-specific token counting optimizations (one possible direction is sketched after this list)
  4. Implement adaptive threshold based on conversation patterns
  5. Add metrics for context size update frequency
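
On item 3, one possible direction (not part of this PR) is to substitute a real tokenizer for the per-character heuristic when one is installed. The sketch below uses the optional tiktoken package purely as an example and falls back to the heuristic above otherwise:

def count_tokens_precise(text):
    """Token count via a real tokenizer when available, else the heuristic above."""
    try:
        import tiktoken  # optional dependency, named here only for illustration
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        return count_tokens(text)  # character-based heuristic from this PR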

@asiroliu (Contributor)

@MarcusYuan @KevinHuSh
During chat operations, every stream response is being output in real-time.
[screen recording attachment: 20250327161410_rec_]

@MarcusYuan (Contributor, Author)

@asiroliu "I don’t understand what you’re trying to say."

@asiroliu (Contributor)

@MarcusYuan
The expected behavior is incremental word-by-word or line-by-line response generation, but the current implementation displays outputs in unpredictable locations.

@MarcusYuan (Contributor, Author)

@asiroliu What I did was dynamically modify Ollama's "num_ctx": ctx_size. This only solves the issue where a fixed (hardcoded) value would cause truncation when too many input tokens are processed. I don’t see how this relates to the video you mentioned above.

@asiroliu (Contributor) commented Mar 28, 2025

@MarcusYuan
I've tested both your submitted version and the nightly build (image ID: e0655386618e). The issue does not occur in the nightly version.

Steps to reproduce:

  1. Deploy local Ollama (https://ragflow.io/docs/dev/deploy_local_llm)
     • chat model: llama3.2
     • embedding model: bge-m3
  2. Retrieve the full codebase from this PR's branch:
     git clone -b main https://github.com/MarcusYuan/ragflow.git 6582
  3. Build the Docker image:
     cd 6582
     docker build --progress=plain --build-arg LIGHTEN=1 --build-arg NEED_MIRROR=1 -f Dockerfile -t infiniflow/ragflow:6582 .
  4. Deploy a container from the built image:
     cd docker
     sed -i "s#^RAGFLOW_IMAGE=.*#RAGFLOW_IMAGE=infiniflow/ragflow:6582#" .env
     docker compose -f docker-compose.yml up

@KevinHuSh KevinHuSh merged commit c61df5d into infiniflow:main Mar 28, 2025
2 checks passed