Description
The /api/chat endpoint in backend/main.py makes direct LLM API calls with no rate limiting or request throttling. This creates risks of uncontrolled API costs, unhandled 429 errors from LLM providers, and vulnerability to abuse or accidental request loops in multi-user deployments.
Problem
Currently, there is no mechanism to limit how many requests a user can send to the LLM API within a given time window. The chat_endpoint function in main.py calls assistant.handle_chat() directly, with no throttling in between.
Impact
- Unlimited rapid requests hit the LLM API simultaneously
- Google Gemini/Vertex AI rate limits trigger unhandled errors
- No cost control or usage visibility
- No protection against bot abuse or accidental loops
Steps to Reproduce
Send 50 rapid concurrent requests — all go through with zero throttling:
for i in $(seq 1 50); do
  curl -X POST http://localhost:8000/api/chat \
    -H "Content-Type: application/json" \
    -d '{"query": "What is a neuron?"}' &
done
Expected Behavior
Requests should be throttled or queued after a configurable limit.
Actual Behavior
All 50 requests hit the LLM API simultaneously, causing potential rate limit errors or unexpected billing spikes.
Proposed Solution
Integrate slowapi — a FastAPI-compatible rate limiting library.
Add to backend/main.py:
from fastapi import Request
from fastapi.responses import JSONResponse
from slowapi import Limiter
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter

@app.exception_handler(RateLimitExceeded)
async def rate_limit_handler(request: Request, exc: RateLimitExceeded):
    return JSONResponse(
        status_code=429,
        content={"detail": "Too many requests. Please wait and try again."},
    )

@app.post("/api/chat", response_model=ChatResponse, tags=["Chat"])
@limiter.limit("10/minute")
async def chat_endpoint(request: Request, msg: ChatMessage):
    ...
Add to pyproject.toml dependencies:
"slowapi>=0.1.9",
Add to .env.template:
RATE_LIMIT=10/minute
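To satisfy the "configurable via .env" criterion, the limit string could be read from the environment at startup. The helper below is a hypothetical sketch (function name and validation are assumptions); it validates slowapi's "<count>/<period>" notation and falls back to the 10/minute default.

```python
import os


def load_rate_limit(default: str = "10/minute") -> str:
    """Read RATE_LIMIT from the environment, validating slowapi's
    "<count>/<period>" string format (e.g. "10/minute")."""
    value = os.getenv("RATE_LIMIT", default)
    count, _, period = value.partition("/")
    if not count.isdigit() or period not in {"second", "minute", "hour", "day"}:
        raise ValueError(f"Invalid RATE_LIMIT value: {value!r}")
    return value
```

The returned string would then be passed to the decorator, e.g. @limiter.limit(load_rate_limit()), instead of the hard-coded "10/minute".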
Acceptance Criteria
- Rate limiting middleware added to /api/chat endpoint in backend/main.py
- Limit is configurable via .env file
- Returns a clear 429 response with a user-friendly error message
- Basic request count logging added for monitoring
- Existing tests still pass after integration
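For the request-count logging criterion, a minimal stdlib-only sketch of a per-client counter is shown below. The names (logger name, record_request) are hypothetical; how this hooks into the endpoint is an open implementation detail.

```python
import logging
from collections import Counter

logger = logging.getLogger("chat.requests")
request_counts: Counter = Counter()


def record_request(client_ip: str) -> int:
    """Increment and log the running request count for a client IP."""
    request_counts[client_ip] += 1
    total = request_counts[client_ip]
    logger.info("chat request #%d from %s", total, client_ip)
    return total
```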
Environment
- OS: Any (Linux/macOS/Windows)
- Python: 3.12+
- Framework: FastAPI
- LLM Provider: Google Gemini / Vertex AI
- Relevant Files:
backend/main.py, pyproject.toml, .env.template