Token Buckets

Understanding the token bucket algorithm that powers Choked’s dual rate limiting system.

What is a Token Bucket?

A token bucket is a rate limiting algorithm that allows for controlled bursts of traffic while maintaining an average rate limit over time. Choked implements dual token buckets, one for requests and one for tokens. Think of it like two physical buckets:
  • Request bucket: Holds request tokens (each function call = 1 token)
  • Token bucket: Holds estimated tokens (based on input text content)
  • Both buckets refill at steady rates
  • Function calls must acquire tokens from BOTH buckets (if both limits are set); a minimal single-bucket sketch follows this list
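
To make the refill mechanics concrete, here is a minimal, illustrative single-bucket implementation in plain Python. It is a sketch of the general algorithm, not Choked's actual code, which keeps bucket state in shared storage so multiple workers can coordinate:
import time

class TokenBucket:
    """Minimal illustrative token bucket (local, single-process)."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity            # maximum tokens the bucket can hold
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = capacity              # start full, allowing an initial burst
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # add tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_acquire(self, amount: float = 1.0) -> bool:
        self._refill()
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False

In these terms, a request_limit of "50/s" corresponds to TokenBucket(50, 50), and a token_limit of "100000/m" to TokenBucket(100000, 100000 / 60).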

How It Works

1. Dual Bucket System

When you create rate limits with both request and token limits:
@choke(key="api", request_limit="50/s", token_limit="100000/m", token_estimator="openai")
This creates:
  • Request bucket: capacity of 50 request tokens, refills at 50 tokens/second
  • Token bucket: capacity of 100,000 tokens, refills at ~1,667 tokens/second

2. Token Consumption

Each function call must acquire from both buckets:
@choke(key="api", request_limit="10/s", token_limit="1000/m", token_estimator="openai")
def chat_completion(messages):
    return openai.chat.completions.create(model="gpt-4", messages=messages)

# Function call with messages = [{"role": "user", "content": "Hello world"}]
# Consumes: 1 request token + ~3 estimated tokens

3. Rate Limit Enforcement

Both limits must be satisfied:
# If request bucket has tokens but token bucket is empty:
#   → Function waits until token bucket refills

# If token bucket has tokens but request bucket is empty:  
#   → Function waits until request bucket refills

# Both buckets must have sufficient tokens for function to proceed

4. Automatic Token Estimation

Token estimators analyze function arguments:
@choke(key="embed", token_limit="1000000/m", token_estimator="voyageai")
def get_embeddings(texts):
    # Automatically estimates tokens from 'texts' parameter
    return voyage.embed(texts)

# Call with texts=["Hello", "World"] 
# Automatically estimates ~2 tokens total

Algorithm Details

Rate Calculation

# For request_limit="50/s":
request_capacity = 50
request_refill_rate = 50             # tokens per second

# For token_limit="100000/m":
token_capacity = 100000
token_refill_rate = 100000 / 60      # ≈ 1,667 tokens per second
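
These numbers can be derived mechanically from the limit strings. The helper below is a hypothetical illustration of that mapping, not Choked's internal parser, and it only handles the per-second and per-minute periods shown in this documentation:
PERIOD_SECONDS = {"s": 1, "m": 60}   # periods used in this documentation

def parse_limit(limit: str) -> tuple[int, float]:
    """Map a limit string like "50/s" or "100000/m" to (capacity, refill rate per second)."""
    amount, period = limit.split("/")
    capacity = int(amount)
    return capacity, capacity / PERIOD_SECONDS[period]

print(parse_limit("50/s"))       # (50, 50.0)
print(parse_limit("100000/m"))   # (100000, 1666.66...)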

Token Availability Check

Before each function call (a simplified sketch follows these steps):
  1. Request bucket check:
    • Calculate elapsed time since last refill
    • Add request tokens: elapsed_time × request_refill_rate
    • Cap at capacity: min(current + new, capacity)
    • Check if ≥1 request token available
  2. Token bucket check (if token_limit set):
    • Estimate tokens needed from function arguments
    • Calculate elapsed time and refill token bucket
    • Check if estimated tokens are available
  3. Proceed or wait:
    • If both buckets have sufficient tokens: consume and proceed
    • If either bucket lacks tokens: wait with exponential backoff
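
Put together, a simplified single-process version of that check might look like the sketch below. Choked performs the equivalent bookkeeping against shared state, but the per-call logic is the same: refill both buckets, check both, then either consume or report that the caller has to wait:
import time
from dataclasses import dataclass

@dataclass
class Bucket:
    capacity: float
    refill_rate: float   # tokens per second
    tokens: float
    last_refill: float

def refill(bucket: Bucket) -> None:
    now = time.monotonic()
    elapsed = now - bucket.last_refill
    bucket.tokens = min(bucket.capacity, bucket.tokens + elapsed * bucket.refill_rate)
    bucket.last_refill = now

def try_acquire(request_bucket: Bucket, token_bucket: Bucket, estimated_tokens: int) -> bool:
    refill(request_bucket)                        # step 1: refill the request bucket
    refill(token_bucket)                          # step 2: refill the token bucket
    if request_bucket.tokens >= 1 and token_bucket.tokens >= estimated_tokens:
        request_bucket.tokens -= 1                # step 3a: both sufficient, consume and proceed
        token_bucket.tokens -= estimated_tokens
        return True
    return False                                  # step 3b: caller waits (with backoff) and retries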

Exponential Backoff

When limits are reached, Choked uses exponential backoff with jitter:
wait_time = base_sleep_time * (2 ** attempt) * random_jitter
# random_jitter is drawn between 0.8 and 1.2 to prevent a thundering herd
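
A runnable version of that retry behavior could look like the sketch below; the function name, the default base sleep time, and the cap on the maximum wait are illustrative assumptions, not Choked's actual internals:
import random
import time

def wait_with_backoff(attempt: int, base_sleep_time: float = 0.1, max_wait: float = 10.0) -> None:
    """Sleep for an exponentially growing, jittered interval (assumed names and defaults)."""
    jitter = random.uniform(0.8, 1.2)   # spread retries out so workers don't wake up in lockstep
    wait_time = min(base_sleep_time * (2 ** attempt) * jitter, max_wait)
    time.sleep(wait_time)

# Typical usage around a bucket check like the try_acquire sketch above:
# attempt = 0
# while not try_acquire(request_bucket, token_bucket, estimated_tokens):
#     wait_with_backoff(attempt)
#     attempt += 1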

Token Estimation

Supported Estimators

OpenAI Estimator ("openai")

@choke(key="gpt", token_limit="100000/m", token_estimator="openai")
def chat_completion(messages):
    # Uses tiktoken with GPT-4 tokenizer
    # Handles: [{"role": "user", "content": "Hello"}] format
    return openai.chat.completions.create(model="gpt-4", messages=messages)

VoyageAI Estimator ("voyageai")

@choke(key="embed", token_limit="1000000/m", token_estimator="voyageai")
def get_embeddings(texts):
    # Uses HuggingFace voyageai/voyage-3.5 tokenizer
    # Handles: ["text1", "text2", ...] format
    return voyage.embed(texts)

Default Estimator ("default")

@choke(key="general", token_limit="50000/m", token_estimator="default")
def process_text(content):
    # Uses tiktoken with GPT-4 tokenizer (same as "openai")
    return process(content)
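
As a rough illustration of what a tiktoken-based estimator does with chat messages, a minimal version could look like this (assuming tiktoken is installed; Choked's real estimators may count role and formatting overhead differently):
import tiktoken

def estimate_chat_tokens(messages: list[dict]) -> int:
    """Rough estimate: encode each message's content with the GPT-4 encoding and sum the lengths."""
    encoding = tiktoken.encoding_for_model("gpt-4")
    return sum(len(encoding.encode(message.get("content", ""))) for message in messages)

print(estimate_chat_tokens([{"role": "user", "content": "Hello world"}]))  # ~2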

Text Extraction Logic

Estimators automatically extract text from:
# String arguments
@choke(key="api", token_limit="1000/m", token_estimator="openai")
def process(text: str):  # ← Extracted
    pass

# Keyword arguments  
@choke(key="api", token_limit="1000/m", token_estimator="openai")
def process(prompt: str, content: str):  # ← Both extracted
    pass

# Lists of strings
@choke(key="api", token_limit="1000/m", token_estimator="openai") 
def process(texts: list[str]):  # ← All strings extracted
    pass

# OpenAI message format
@choke(key="api", token_limit="1000/m", token_estimator="openai")
def chat(messages):  # ← Content from messages extracted
    # messages = [{"role": "user", "content": "Hello"}]
    pass
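
A hypothetical helper covering those cases might look like the following; the real extraction lives inside Choked's estimators and may handle additional shapes:
def extract_text(value) -> list[str]:
    """Collect strings from plain strings, lists of strings, and OpenAI-style messages."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict) and isinstance(value.get("content"), str):
        return [value["content"]]
    if isinstance(value, (list, tuple)):
        texts = []
        for item in value:
            texts.extend(extract_text(item))
        return texts
    return []

# extract_text("Hello")                                 -> ["Hello"]
# extract_text(["Hello", "World"])                      -> ["Hello", "World"]
# extract_text([{"role": "user", "content": "Hello"}])  -> ["Hello"]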

Fallback Behavior

If token estimation fails (a simplified fallback chain is sketched after this list):
  1. VoyageAI → Falls back to OpenAI estimator
  2. OpenAI → Falls back to word-based estimation (~0.75 tokens/word)
  3. Word-based → Returns minimum 1 token
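
A simplified version of that chain, with the word-based estimate as the last resort, could look like this (function names and structure are illustrative, not Choked's internals):
def estimate_with_fallback(text: str) -> int:
    """Tokenizer-based estimate when possible, otherwise a word-count heuristic, never below 1."""
    try:
        import tiktoken
        encoding = tiktoken.encoding_for_model("gpt-4")
        return max(1, len(encoding.encode(text)))
    except Exception:
        # word-based fallback: ~0.75 tokens per word, with a floor of 1 token
        return max(1, int(len(text.split()) * 0.75))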

Practical Examples

Example 1: OpenAI Chat API

from choked import Choked

choke = Choked(redis_url="redis://localhost:6379/0")

@choke(key="openai_chat", request_limit="50/s", token_limit="100000/m", token_estimator="openai")
def chat_completion(messages):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=messages
    )

# Call with short message
result = chat_completion([{"role": "user", "content": "Hi"}])
# Consumes: 1 request + ~1 token

# Call with long message  
long_msg = [{"role": "user", "content": "Very long message..." * 100}]
result = chat_completion(long_msg)  
# Consumes: 1 request + several hundred estimated tokens
Behavior:
  • Short messages: Limited primarily by request rate (50/s)
  • Long messages: May hit token limit first (100K/m = ~1,667/s)
  • Automatic balancing between both limits

Example 2: VoyageAI Embeddings

@choke(key="voyage_embed", token_limit="1000000/m", token_estimator="voyageai")
def get_embeddings(texts, model="voyage-3"):
    return voyage.embed(texts, model=model)

# Small batch
result = get_embeddings(["Hello", "World"])
# Consumes: ~2 tokens

# Large batch
large_batch = ["Document " + str(i) * 100 for i in range(100)]
result = get_embeddings(large_batch)
# Consumes: ~10,000+ tokens, may trigger rate limiting
Behavior:
  • No request limiting (only token limiting)
  • Large batches automatically throttled based on estimated tokens
  • Scales naturally with content size

Example 3: Multi-Worker Coordination

# Multiple workers sharing the same API key
@choke(key="shared_openai", request_limit="100/s", token_limit="200000/m", token_estimator="openai")
def worker_chat(messages):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=messages
    )

# Worker A processes short messages → uses more request tokens
# Worker B processes long messages → consumes more from the token bucket
# Both automatically coordinate through shared buckets
Behavior:
  • All workers share 100 requests/second AND 200K tokens/minute
  • Natural load balancing based on actual usage patterns
  • No manual coordination needed between workers (see the sketch below)
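
Because the buckets live behind the shared key, each worker only needs to point at the same Redis instance and reuse the same key; a minimal sketch (the file name and Redis URL are illustrative):
# worker.py: run this same script in as many processes or machines as needed
import openai
from choked import Choked

choke = Choked(redis_url="redis://localhost:6379/0")   # every worker points at the same Redis

@choke(key="shared_openai", request_limit="100/s", token_limit="200000/m", token_estimator="openai")
def worker_chat(messages):
    return openai.chat.completions.create(model="gpt-4", messages=messages)

# All processes decorating with key="shared_openai" draw from the same two buckets.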

Comparison with Single-Limit Systems

Traditional Request-Only Limiting

# Old approach: Only request limiting
@rate_limit("100/minute")
def api_call(data):
    return process(data)

# Problem: short requests and long requests are treated equally
# 100 short requests cost the same as 100 long requests (unfair for token-based APIs)

Choked’s Dual Limiting

# New approach: Both request and token limiting
@choke(key="api", request_limit="100/m", token_limit="10000/m", token_estimator="openai")
def api_call(data):
    return process(data)

# Solution: Short requests limited by request rate
#          Long requests limited by token rate
#          Fair usage for token-based APIs

Benefits of Dual System

  1. Fair Resource Usage: Large requests consume proportionally more tokens
  2. Natural Load Balancing: Workers automatically balance based on content size
  3. API Compliance: Respects both rate limits imposed by API providers
  4. Burst Handling: Allows bursts within both request and token constraints

Tuning Guidelines

Choose Request Limits

  • High values (100+/s): For APIs with generous request limits
  • Medium values (10-100/s): For standard API rate limits
  • Low values (1-10/s): For strict or expensive APIs

Choose Token Limits

  • Based on API provider limits: Match your API’s token/minute limits
  • Content-aware: Consider typical content size in your application
  • Headroom: Leave 10-20% buffer for estimation errors

Optimal Combinations

# High-frequency, short content (e.g., simple API calls)
@choke(key="simple", request_limit="100/s", token_limit="50000/m", token_estimator="openai")

# Medium-frequency, medium content (e.g., chat applications)  
@choke(key="chat", request_limit="50/s", token_limit="100000/m", token_estimator="openai")

# Low-frequency, long content (e.g., document processing)
@choke(key="docs", request_limit="10/s", token_limit="200000/m", token_estimator="openai")

# Token-only for embedding APIs
@choke(key="embed", token_limit="1000000/m", token_estimator="voyageai")

Estimator Selection

  • "openai": For OpenAI APIs, general text processing
  • "voyageai": For VoyageAI embedding APIs
  • "default": When unsure (same as OpenAI)
The dual token bucket system provides sophisticated rate limiting that adapts to both request frequency and content size, making it ideal for modern AI/ML APIs with complex rate limiting requirements.