[Feature] Add Rate Limiting Middleware to Prevent LLM API Overuse #55

@SandeepChauhan00

Description

The /api/chat endpoint in backend/main.py makes direct LLM API calls with no rate limiting or request throttling. This creates risks of uncontrolled API costs, unhandled 429 errors from LLM providers, and vulnerability to abuse or accidental request loops in multi-user deployments.

Problem

Currently, there is no mechanism to limit how many requests a user can send to the LLM API within a given time window. The chat_endpoint function in main.py directly calls assistant.handle_chat() with no throttling.

Impact

  • Bursts of requests reach the LLM API with no upper bound
  • Google Gemini/Vertex AI rate limits trigger unhandled 429 errors
  • No cost control or usage visibility
  • No protection against bot abuse or accidental request loops

Steps to Reproduce

Send 50 rapid concurrent requests — all go through with zero throttling:

for i in $(seq 1 50); do
  curl -X POST http://localhost:8000/api/chat \
    -H "Content-Type: application/json" \
    -d '{"query": "What is a neuron?"}' &
done

Expected Behavior

Requests should be throttled or queued after a configurable limit.

Actual Behavior

All 50 requests hit the LLM API simultaneously, causing potential rate limit errors or unexpected billing spikes.

Proposed Solution

Integrate slowapi — a FastAPI-compatible rate limiting library.

Add to backend/main.py:

import os

from fastapi import Request
from fastapi.responses import JSONResponse
from slowapi import Limiter
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Read the limit from the environment so it is configurable via .env
RATE_LIMIT = os.getenv("RATE_LIMIT", "10/minute")

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter

@app.exception_handler(RateLimitExceeded)
async def rate_limit_handler(request: Request, exc: RateLimitExceeded):
    return JSONResponse(
        status_code=429,
        content={"detail": "Too many requests. Please wait and try again."},
    )

@app.post("/api/chat", response_model=ChatResponse, tags=["Chat"])
@limiter.limit(RATE_LIMIT)
async def chat_endpoint(request: Request, msg: ChatMessage):
    # slowapi requires the Request parameter to be present on the route
    ...
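For intuition, the per-key check slowapi applies can be sketched with a stdlib-only moving-window counter. This is illustrative only (the class and method names here are hypothetical; the real logic lives in the limits package that slowapi wraps):

```python
import time
from collections import defaultdict, deque


class MovingWindowLimiter:
    """Allow at most `limit` hits per `window` seconds, per key (e.g. client IP)."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable for deterministic tests
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        q = self.hits[key]
        # Drop timestamps that have fallen out of the window
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # caller should respond with 429
        q.append(now)
        return True


# 10 requests per minute, matching the proposed "10/minute" limit
limiter = MovingWindowLimiter(limit=10, window=60.0)
results = [limiter.allow("203.0.113.7") for _ in range(12)]
print(results.count(True))   # → 10 (first 10 allowed)
print(results.count(False))  # → 2 (requests 11 and 12 rejected)
```

A different key (another client IP) gets its own window, which is why `key_func=get_remote_address` gives per-client rather than global limiting.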

Add to pyproject.toml dependencies:

"slowapi>=0.1.9",

Add to .env.template:

RATE_LIMIT=10/minute
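A malformed RATE_LIMIT value would otherwise only surface as an error on the first request. A small startup check could fail fast instead; this is a sketch with a hypothetical helper name, and the regex covers only the common `count/period` form rather than the full grammar the limits library accepts:

```python
import os
import re


def validate_rate_limit(value: str) -> str:
    """Reject values that don't look like '10/minute' at startup."""
    value = value.strip()
    if not re.fullmatch(r"\d+/(second|minute|hour|day)", value):
        raise ValueError(f"RATE_LIMIT must look like '10/minute', got {value!r}")
    return value


rate_limit = validate_rate_limit(os.getenv("RATE_LIMIT", "10/minute"))
print(rate_limit)  # → 10/minute (when RATE_LIMIT is unset)
```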

Acceptance Criteria

  • Rate limiting middleware added to /api/chat endpoint in backend/main.py
  • Limit is configurable via .env file
  • Returns clear 429 response with user-friendly error message
  • Basic request count logging added for monitoring
  • Existing tests still pass after integration
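The "basic request count logging" criterion could be satisfied with a simple per-client counter alongside the limiter. A minimal sketch, assuming stdlib logging and a hypothetical record_request helper called from the endpoint:

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("rate_limit")

request_counts: Counter = Counter()  # key: client IP


def record_request(client_ip: str) -> int:
    """Count requests per client and log every 10th hit for monitoring."""
    request_counts[client_ip] += 1
    n = request_counts[client_ip]
    if n % 10 == 0:
        log.info("client %s has made %d /api/chat requests", client_ip, n)
    return n


for _ in range(25):
    record_request("203.0.113.7")
print(request_counts["203.0.113.7"])  # → 25
```

An in-memory Counter resets on restart and is per-process; for multi-worker deployments a shared store (e.g. Redis, which slowapi also supports as a storage backend) would be needed.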

Environment

  • OS: Any (Linux/macOS/Windows)
  • Python: 3.12+
  • Framework: FastAPI
  • LLM Provider: Google Gemini / Vertex AI
  • Relevant Files: backend/main.py, pyproject.toml, .env.template
