# Rate Limiting Patterns
Practical patterns for configuring rate limits based on your API tier, deployment environment, and workload.
## Pattern 1: Multi-Tier Configuration

Adjust rate limits per environment using TOML profiles:
```toml
# Development: Conservative limits to avoid burning quota
[dev.rate-limits]
requests-per-minute = 5
tokens-per-minute = 10000

# Production: Match your actual API tier
[prod.rate-limits]
requests-per-minute = 50
tokens-per-minute = 100000
```

Use with:
```bash
llmist dev "test prompt"    # Uses dev limits
llmist prod "real task"     # Uses prod limits
```
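The leading key in each TOML table (`dev`, `prod`) is the profile name you pass as the first CLI argument.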
## Pattern 2: Provider-Specific Defaults

llmist auto-detects provider defaults, but override for your specific tier:
```ts
import { LLMist } from 'llmist';

// Anthropic Tier 2 (100 RPM, 80K TPM)
const anthropicAgent = LLMist.createAgent()
  .withModel('sonnet')
  .withRateLimits({
    requestsPerMinute: 100,
    tokensPerMinute: 80_000,
  });

// Gemini 1.5 Pro with higher limits
const geminiAgent = LLMist.createAgent()
  .withModel('gemini:gemini-1.5-pro')
  .withRateLimits({
    requestsPerMinute: 360,
    tokensPerMinute: 4_000_000,
  });
```
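If the auto-detected defaults already match your tier, you can omit the override entirely; a minimal sketch:

```ts
// With no withRateLimits() call, llmist falls back to the auto-detected
// provider defaults for the chosen model.
const defaultAgent = LLMist.createAgent().withModel('sonnet');
```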
## Pattern 3: Batch Processing with Rate Limits

Process multiple tasks while respecting rate limits:
```ts
const tasks = ['task1', 'task2', 'task3', /* ...100 tasks */];

const agent = LLMist.createAgent()
  .withModel('haiku')
  .withRateLimits({
    requestsPerMinute: 50,
    tokensPerMinute: 100_000,
  })
  .withHooks({
    observers: {
      onRateLimitThrottle: (ctx) => {
        console.log(`[Throttled] Waiting ${Math.ceil(ctx.delayMs / 1000)}s...`);
      },
    },
  });

for (const task of tasks) {
  const result = await agent.askAndCollect(task);
  console.log(`Completed: ${task}`);
  // Agent automatically paces requests
}
```
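At 50 requests per minute, a 100-task batch takes at least two minutes end to end (assuming one LLM call per task); the `onRateLimitThrottle` observer makes that pacing visible instead of letting the run look stalled.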
## Pattern 4: Cost + Rate Limit Tracking

Track both cost and rate limit usage together:
```ts
let totalCost = 0;
let throttleCount = 0;

const agent = LLMist.createAgent()
  .withModel('sonnet')
  .withRateLimits({
    requestsPerMinute: 50,
    tokensPerMinute: 40_000,
    safetyMargin: 0.8, // Start throttling at 80%
  })
  .withHooks({
    observers: {
      onLLMCallComplete: (ctx) => {
        totalCost += ctx.cost ?? 0;
        console.log(`Cost so far: $${totalCost.toFixed(4)}`);
      },
      onRateLimitThrottle: (ctx) => {
        throttleCount++;
        console.log(
          `Throttled ${throttleCount}x (RPM: ${ctx.stats.requestsInCurrentMinute}/${ctx.stats.requestsPerMinute})`,
        );
      },
    },
  });
```
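With `safetyMargin: 0.8`, throttling begins at 40 requests per minute (0.8 × 50) and 32,000 tokens per minute (0.8 × 40,000), leaving headroom before the provider's hard limits.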
## Pattern 5: Dynamic Rate Limits Based on Time

Adjust limits during peak and off-peak hours:
```ts
// Assumes llmist exports the RateLimitConfig type used below
import type { RateLimitConfig } from 'llmist';

function getRateLimits(): RateLimitConfig {
  const hour = new Date().getHours();
  const isPeakHours = hour >= 9 && hour <= 17;

  return {
    requestsPerMinute: isPeakHours ? 30 : 50, // More conservative during peak
    tokensPerMinute: isPeakHours ? 60_000 : 100_000,
    safetyMargin: isPeakHours ? 0.7 : 0.8,
  };
}

const agent = LLMist.createAgent()
  .withModel('sonnet')
  .withRateLimits(getRateLimits());

const answer = await agent.ask('...');
```
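Note that `getRateLimits()` is evaluated once, when the agent is built. For a long-running worker, a minimal sketch (hypothetical `runTask` helper, reusing only the builder calls shown above) that re-evaluates the limits per job:

```ts
// Rebuild the agent per task so the current hour's limits apply.
async function runTask(prompt: string) {
  const agent = LLMist.createAgent()
    .withModel('sonnet')
    .withRateLimits(getRateLimits()); // re-evaluated on every call

  return agent.askAndCollect(prompt);
}
```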
## Pattern 6: Disable for Local/Mock Providers

Skip rate limiting for local models or mocks, where there is no provider-side quota to protect:
```ts
const isLocal = process.env.LLM_PROVIDER === 'local';

const agent = LLMist.createAgent()
  .withModel(isLocal ? 'local:llama3' : 'sonnet')
  .withRateLimits(
    isLocal
      ? { enabled: false }
      : { requestsPerMinute: 50, tokensPerMinute: 40_000 },
  );
```
## Pattern 7: Subagent Rate Limiting

When using subagents (like `BrowseWeb`), rate limits are shared across the entire agent tree:
```ts
// Parent agent: 20 RPM limit
const parent = LLMist.createAgent()
  .withModel('sonnet')
  .withRateLimits({
    requestsPerMinute: 20,
    tokensPerMinute: 50_000,
  })
  .withGadgets(BrowseWeb); // BrowseWeb is a subagent gadget

// The BrowseWeb subagent shares the 20 RPM limit with the parent.
// Total system throughput: max 20 RPM (parent + subagent combined).
```

This ensures your configured limits actually protect against provider quotas, since API limits apply to your API key, not to individual agent instances. The `RateLimitTracker` is shared via the `ExecutionContext` when subagents are created with `withParentContext(ctx)`.
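If you wire up your own subagent gadget, the sharing mechanism is the same `withParentContext(ctx)` call. A minimal sketch, assuming `ctx` is the `ExecutionContext` your gadget receives from the framework:

```ts
// Hypothetical gadget internals: chaining withParentContext(ctx) hands the
// subagent the parent's ExecutionContext, including its RateLimitTracker,
// so parent and subagent draw from the same RPM/TPM budget.
const subagent = LLMist.createAgent()
  .withModel('haiku')
  .withParentContext(ctx);

const pageSummary = await subagent.askAndCollect('Summarize the fetched page');
```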
## Pattern 8: Understanding Which Limit Triggered

The `triggeredBy` field in `RateLimitStats` tells you exactly which limit(s) caused throttling:
```ts
.withHooks({
  observers: {
    onRateLimitThrottle: (ctx) => {
      const { triggeredBy } = ctx.stats;

      if (triggeredBy?.daily) {
        // Daily token limit reached - wait until midnight UTC
        console.log(`Daily limit hit: ${triggeredBy.daily.current}/${triggeredBy.daily.limit} tokens`);
      } else if (triggeredBy?.rpm) {
        // Request rate limit
        console.log(`RPM limit: ${triggeredBy.rpm.current}/${triggeredBy.rpm.limit} requests`);
      } else if (triggeredBy?.tpm) {
        // Token rate limit
        console.log(`TPM limit: ${triggeredBy.tpm.current}/${triggeredBy.tpm.limit} tokens`);
      }
    },
  },
})
```
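A `daily` trigger can mean a wait of hours rather than seconds, so long-running jobs may prefer to checkpoint their progress and exit when it fires instead of blocking until midnight UTC.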
## Troubleshooting

**Problem:** Still hitting rate limits despite configuration
**Solution:** Check the safety margin and actual token usage:
```ts
.withHooks({
  observers: {
    onRateLimitThrottle: (ctx) => {
      const { rpm, tpm, triggeredBy } = ctx.stats;
      console.log(`Current: RPM=${rpm}, TPM=${tpm}`);
      if (triggeredBy) {
        console.log(`Triggered by: ${Object.keys(triggeredBy).join(', ')}`);
      }
      // If these are below your configured limits, increase safetyMargin
    },
  },
})
```

**Problem:** Too much throttling, performance is slow
**Solution:** Raise the safety margin (so throttling starts later) or increase the limits:
```ts
.withRateLimits({
  requestsPerMinute: 50,
  tokensPerMinute: 100_000,
  safetyMargin: 0.95, // Only throttle at 95% usage
})
```
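A 0.95 margin leaves little headroom for token-count estimation error, so the occasional provider 429 can still slip through; pair a high margin with a retry strategy (see below) to absorb those.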
## See Also

- Retry Strategies - Reactive error handling
- Cost Tracking - Monitor spending
- Hooks Guide - Custom monitoring
- CLI Configuration - TOML rate limit configuration