Token Buckets
Understanding the token bucket algorithm that powers Choked’s dual rate limiting system.
What is a Token Bucket?
A token bucket is a rate limiting algorithm that allows for controlled bursts of traffic while maintaining an average rate limit over time. Choked implements dual token buckets - one for requests and one for tokens. Think of it like two physical buckets:
- Request bucket: Holds request tokens (each function call = 1 token)
- Token bucket: Holds estimated tokens (based on input text content)
- Both buckets refill at steady rates
- Function calls must acquire tokens from BOTH buckets (if both limits are set)
How It Works
1. Dual Bucket System
When you create rate limits with both request and token limits, Choked maintains two independent buckets, each with its own capacity and refill rate (sketched below).
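The following is a minimal plain-Python sketch of that state, written for illustration only - it is not Choked’s actual implementation, and the names and capacities here are assumptions.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Bucket:
    """Illustrative token bucket state (not Choked's internals)."""
    capacity: float      # maximum number of tokens the bucket can hold
    refill_rate: float   # tokens added back per second
    tokens: float = 0.0  # tokens currently available
    last_refill: float = field(default_factory=time.monotonic)

    def refill(self) -> None:
        """Add tokens for the elapsed time, capped at capacity."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

# One bucket counts requests, the other counts estimated text tokens.
request_bucket = Bucket(capacity=50, refill_rate=50, tokens=50)                     # e.g. 50 requests/second
token_bucket = Bucket(capacity=100_000, refill_rate=100_000 / 60, tokens=100_000)   # e.g. 100K tokens/minute
```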
2. Token Consumption
Each function call must acquire from both buckets: one token from the request bucket, plus the call’s estimated token count from the token bucket (see the sketch below).
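Continuing the sketch above (still illustrative, not Choked’s code), a single call succeeds only if both buckets can cover it, and consumes from both at once:

```python
def try_acquire(request_bucket: Bucket, token_bucket: Bucket, estimated_tokens: int) -> bool:
    """Succeed only if both buckets have enough capacity; consume from both, or neither."""
    request_bucket.refill()
    token_bucket.refill()
    if request_bucket.tokens >= 1 and token_bucket.tokens >= estimated_tokens:
        request_bucket.tokens -= 1
        token_bucket.tokens -= estimated_tokens
        return True
    return False
```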
3. Rate Limit Enforcement
Both limits must be satisfied before the call proceeds; if either bucket runs short, the call waits (see the sketch below).
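A minimal enforcement loop under the same assumptions; the fixed sleep here is a simplification of the exponential backoff described later on this page.

```python
import time

def acquire_or_wait(request_bucket: Bucket, token_bucket: Bucket, estimated_tokens: int) -> None:
    """Block until both buckets can cover the call (simplified wait strategy)."""
    while not try_acquire(request_bucket, token_bucket, estimated_tokens):
        time.sleep(0.05)  # Choked actually waits with exponential backoff and jitter
```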
4. Automatic Token Estimation
Token estimators analyze the function arguments to estimate how many tokens a call will consume before it runs (see the sketch below).
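As a rough illustration (not Choked’s estimators), a naive word-based estimator could pull text out of the arguments like this; the ~0.75 tokens/word factor and the 1-token minimum follow the fallback behavior described further down this page.

```python
def estimate_tokens(*args, **kwargs) -> int:
    """Naive word-based estimate over all string arguments (illustration only)."""
    texts = []
    for value in list(args) + list(kwargs.values()):
        if isinstance(value, str):
            texts.append(value)
        elif isinstance(value, (list, tuple)):
            texts.extend(item for item in value if isinstance(item, str))
    word_count = sum(len(text.split()) for text in texts)
    return max(1, int(word_count * 0.75))  # ~0.75 tokens/word, minimum 1 token
```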
Algorithm Details
Rate Calculation
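The arithmetic is simple: a limit expressed as a count per time window becomes a refill rate of count divided by the window length in seconds. For example, using the limits that appear later on this page:

```python
# Refill rate = limit count / window length in seconds.
request_refill_rate = 50 / 1        # "50/s"   -> 50 request tokens per second
token_refill_rate = 100_000 / 60    # "100K/m" -> ~1,667 text tokens per second
```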
Token Availability Check
Before each function call:
- Request bucket check:
  - Calculate elapsed time since the last refill
  - Add request tokens: elapsed_time × request_refill_rate
  - Cap at capacity: min(current + new, capacity)
  - Check if ≥1 request token is available
- Token bucket check (if token_limit is set):
  - Estimate tokens needed from the function arguments
  - Calculate elapsed time and refill the token bucket
  - Check if the estimated tokens are available
- Proceed or wait:
  - If both buckets have sufficient tokens: consume and proceed
  - If either bucket lacks tokens: wait with exponential backoff
Exponential Backoff
When limits are reached, Choked waits using exponential backoff with jitter, so that concurrent callers do not all retry at the same moment (sketched below).
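A hedged sketch of the backoff pattern; the base delay, cap, and jitter range that Choked actually uses are not specified here, so the values below are placeholders.

```python
import random
import time

def wait_with_backoff(attempt: int, base: float = 0.1, cap: float = 10.0) -> None:
    """Sleep for an exponentially growing, jittered delay (placeholder values)."""
    delay = min(cap, base * (2 ** attempt))  # exponential growth, capped
    delay *= random.uniform(0.5, 1.5)        # jitter so concurrent callers spread out
    time.sleep(delay)
```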
Token Estimation
Supported Estimators
OpenAI Estimator ("openai")
VoyageAI Estimator ("voyageai")
Default Estimator ("default")
Text Extraction Logic
Estimators automatically extract text from the arguments of the rate-limited function call.
Fallback Behavior
If token estimation fails:
- VoyageAI → Falls back to the OpenAI estimator
- OpenAI → Falls back to word-based estimation (~0.75 tokens/word)
- Word-based → Returns minimum 1 token
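A sketch of that fallback chain; the two estimator functions below are stand-ins, not Choked’s real estimators.

```python
def voyageai_estimate(text: str) -> int:
    raise NotImplementedError  # stand-in for the real VoyageAI-based estimator

def openai_estimate(text: str) -> int:
    raise NotImplementedError  # stand-in for the real OpenAI-based estimator

def estimate_with_fallback(text: str) -> int:
    """Illustrative fallback chain: VoyageAI -> OpenAI -> word-based, never below 1 token."""
    for estimator in (voyageai_estimate, openai_estimate):
        try:
            return estimator(text)
        except Exception:
            continue
    return max(1, int(len(text.split()) * 0.75))  # word-based fallback (~0.75 tokens/word)
```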
Practical Examples
Example 1: OpenAI Chat API
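Tying the earlier sketches together (Bucket, try_acquire, wait_with_backoff, estimate_tokens), a chat workload with these limits could be wrapped as shown here; this is an illustrative decorator built on those sketches, not Choked’s actual API.

```python
import functools

def rate_limited(request_bucket: Bucket, token_bucket: Bucket, estimator):
    """Illustrative decorator built from the sketches above - not the Choked API."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            needed = estimator(*args, **kwargs)  # estimate once per call
            attempt = 0
            while not try_acquire(request_bucket, token_bucket, needed):
                wait_with_backoff(attempt)
                attempt += 1
            return fn(*args, **kwargs)
        return inner
    return wrap

# Example 1 limits: 50 requests/second and 100K tokens/minute.
chat_requests = Bucket(capacity=50, refill_rate=50, tokens=50)
chat_tokens = Bucket(capacity=100_000, refill_rate=100_000 / 60, tokens=100_000)

@rate_limited(chat_requests, chat_tokens, estimate_tokens)
def send_chat(prompt: str):
    ...  # call the OpenAI Chat API here
```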
- Short messages: Limited primarily by request rate (50/s)
- Long messages: May hit token limit first (100K/m = ~1,667/s)
- Automatic balancing between both limits
Example 2: VoyageAI Embeddings
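Under the same sketch, a token-only limit simply skips the request bucket and charges each batch by its estimated size (illustrative; the limit value below is a placeholder, not a VoyageAI quota).

```python
# Token-only limiting: no request bucket, just a token bucket (placeholder limit value).
embed_tokens = Bucket(capacity=200_000, refill_rate=200_000 / 60, tokens=200_000)

def embed_batch(batch: list[str]):
    needed = estimate_tokens(batch)  # larger batches cost proportionally more
    attempt = 0
    while True:
        embed_tokens.refill()
        if embed_tokens.tokens >= needed:
            embed_tokens.tokens -= needed
            break
        wait_with_backoff(attempt)
        attempt += 1
    ...  # call the VoyageAI embeddings API here
```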
- No request limiting (only token limiting)
- Large batches automatically throttled based on estimated tokens
- Scales naturally with content size
Example 3: Multi-Worker Coordination
- All workers share 100 requests/second AND 200K tokens/minute
- Natural load balancing based on actual usage patterns
- No manual coordination needed between workers
Comparison with Single-Limit Systems
Traditional Request-Only Limiting
Choked’s Dual Limiting
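To make the contrast concrete with the earlier word-based sketch: request-only limiting charges every call exactly one request regardless of size, while dual limiting also charges by content size.

```python
short_prompt = "hi there"
long_prompt = "lorem " * 5_000  # roughly 5,000 words

# Request-only limiting: both calls cost exactly 1 request token.
# Dual limiting: each call additionally consumes its estimated text tokens.
print(estimate_tokens(short_prompt))  # ~1 token
print(estimate_tokens(long_prompt))   # ~3,750 tokens at ~0.75 tokens/word
```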
Benefits of Dual System
- Fair Resource Usage: Large requests consume proportionally more tokens
- Natural Load Balancing: Workers automatically balance based on content size
- API Compliance: Respects both rate limits imposed by API providers
- Burst Handling: Allows bursts within both request and token constraints
Tuning Guidelines
Choose Request Limits
- High values (100+/s): For APIs with generous request limits
- Medium values (10-100/s): For standard API rate limits
- Low values (1-10/s): For strict or expensive APIs
Choose Token Limits
- Based on API provider limits: Match your API’s token/minute limits
- Content-aware: Consider typical content size in your application
- Headroom: Leave 10-20% buffer for estimation errors
Optimal Combinations
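As a sketch of combinations consistent with the examples above (illustrative values only - always match them to your provider’s published limits):

```python
# Illustrative pairings, not official recommendations.
combinations = {
    "chat_api":    {"requests": "50/s",  "tokens": "100K/m"},  # as in Example 1
    "embeddings":  {"requests": None,    "tokens": "200K/m"},  # token-only, as in Example 2
    "worker_pool": {"requests": "100/s", "tokens": "200K/m"},  # shared across workers, as in Example 3
}
```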
Estimator Selection
"openai": For OpenAI APIs, general text processing"voyageai": For VoyageAI embedding APIs"default": When unsure (same as OpenAI)