
Welcome to Choked

Choked is a simple, powerful Python rate-limiting library that uses the token bucket algorithm to control the rate of function calls, with support for both request-based and token-based limiting.

Features

  • Easy to use: Simple class-based API with decorator pattern
  • Dual limiting: Support both request limits and token limits (for AI/ML APIs)
  • Flexible backends: Supports both Redis and managed proxy service backends
  • Async/Sync support: Works with both synchronous and asynchronous functions
  • Smart token estimation: Built-in estimators for OpenAI, VoyageAI, and general text
  • Exponential backoff: Smart retry logic with jitter to prevent thundering herd
  • Distributed: Share rate limits across multiple processes or servers
  • Multi-worker scaling: Perfect for managing multiple workers using the same API key

Quick Start

Install choked:
pip install choked
Create a Choked instance and use it as a decorator:
from choked import Choked

# Using Redis backend
choke = Choked(redis_url="redis://localhost:6379/0")

@choke(key="api_calls", request_limit="10/m")
def make_api_call():
    # This function is rate limited to 10 requests per minute
    return "API response"

# The decorator handles rate limiting automatically
make_api_call()  # Runs immediately while the bucket still has tokens available
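
The features above also mention async support; here is a minimal sketch of decorating a coroutine, assuming the same decorator arguments apply (the function body is just a stand-in):

import asyncio
from choked import Choked

choke = Choked(redis_url="redis://localhost:6379/0")

@choke(key="api_calls_async", request_limit="10/m")
async def make_async_api_call():
    # Same limiting behaviour as the sync example, but awaited instead of blocking
    return "API response"

asyncio.run(make_async_api_call())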

How it Works

Choked uses a token bucket algorithm with dual limiting support:
  1. Request limiting: Each function call consumes 1 request token
  2. Token limiting: Each function call consumes estimated tokens based on input text
  3. Buckets refill at steady rates (e.g., "100/s" = 100 tokens per second)
  4. When limits are reached, functions wait with exponential backoff
This allows bursts of traffic while maintaining the average rate, making it perfect for AI/ML APIs.
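
To make the refill-and-consume cycle concrete, here is a small, self-contained token bucket sketch. It illustrates the algorithm only; it is not Choked's internal implementation, and the class name and parameters are ours:

import time

class TokenBucket:
    """Illustrative token bucket: refills at a steady rate up to a fixed capacity."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second (e.g. 100 for "100/s")
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full so an initial burst is allowed
        self.updated = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_consume(self, amount: float = 1.0) -> bool:
        """Take `amount` tokens if available; report failure otherwise."""
        self._refill()
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False

bucket = TokenBucket(rate=100, capacity=100)   # "100/s"
if not bucket.try_consume(1):                  # a request costs 1 request token
    time.sleep(0.01)                           # Choked instead waits with exponential backoff and jitter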

Perfect for AI/ML APIs

Choked excels with token-based APIs like OpenAI, VoyageAI, and others:
# Using managed proxy service
choke = Choked(api_token="your-api-token")

@choke(key="openai_chat", request_limit="50/s", token_limit="100000/m", token_estimator="openai")
def chat_completion(messages):
    # Rate limited by both requests (50/s) and tokens (100K/m)
    return openai.chat.completions.create(
        model="gpt-4",
        messages=messages
    )

# Automatic token estimation from messages
# Dual rate limiting prevents both request and token overages
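
Calling the decorated function then looks like any ordinary call; the messages payload below is purely illustrative and assumes your OpenAI credentials are already configured:

messages = [
    {"role": "user", "content": "Summarize the token bucket algorithm in one sentence."}
]

# Waits (with backoff) until both the request bucket and the token bucket have
# capacity, then forwards the call to the OpenAI API.
response = chat_completion(messages)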

Benefits:
  • Dual limiting: Respect both request and token limits simultaneously
  • Smart estimation: Automatic token counting for popular AI services
  • Auto-scaling: Add/remove workers without changing rate limits
  • No overages: Never exceed your API provider’s limits
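
As a sketch of the multi-worker scenario: bucket state lives in the backend (Redis or the managed proxy), so every process that points at the same backend and uses the same key draws from one shared budget. The worker loop below is illustrative, not part of the library:

# worker.py -- run several copies of this process; together they share one 10/m budget
from choked import Choked

choke = Choked(redis_url="redis://localhost:6379/0")

@choke(key="api_calls", request_limit="10/m")
def make_api_call():
    return "API response"

if __name__ == "__main__":
    while True:
        make_api_call()   # every worker consumes from the same "api_calls" bucket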