Rate Limiting Patterns

Practical patterns for configuring rate limits based on your API tier, deployment environment, and workload.

Pattern 1: Environment-Specific Limits

Adjust rate limits per environment using TOML profiles:

~/.llmist/cli.toml

# Development: Conservative limits to avoid burning quota
[dev.rate-limits]
requests-per-minute = 5
tokens-per-minute = 10000

# Production: Match your actual API tier
[prod.rate-limits]
requests-per-minute = 50
tokens-per-minute = 100000

Select a profile on the command line:

llmist dev "test prompt" # Uses dev limits
llmist prod "real task" # Uses prod limits

Pattern 2: Match Your Provider Tier

llmist auto-detects provider defaults, but override them to match your specific tier:

import { LLMist } from 'llmist';

// Anthropic Tier 2 (100 RPM, 80K TPM)
const anthropicAgent = LLMist.createAgent()
  .withModel('sonnet')
  .withRateLimits({
    requestsPerMinute: 100,
    tokensPerMinute: 80_000,
  });

// Gemini 1.5 Pro with higher limits
const geminiAgent = LLMist.createAgent()
  .withModel('gemini:gemini-1.5-pro')
  .withRateLimits({
    requestsPerMinute: 360,
    tokensPerMinute: 4_000_000,
  });

Pattern 3: Batch Processing with Rate Limits

Process multiple tasks while respecting rate limits:

const tasks = ['task1', 'task2', 'task3', /* ...100 tasks */];

const agent = LLMist.createAgent()
  .withModel('haiku')
  .withRateLimits({
    requestsPerMinute: 50,
    tokensPerMinute: 100_000,
  })
  .withHooks({
    observers: {
      onRateLimitThrottle: (ctx) => {
        console.log(`[Throttled] Waiting ${Math.ceil(ctx.delayMs / 1000)}s...`);
      },
    },
  });

for (const task of tasks) {
  const result = await agent.askAndCollect(task);
  console.log(`Completed: ${task}`);
  // The agent automatically paces requests
}
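Because pacing lives in the agent's shared rate limiter, you can also drain the queue with a small worker pool. A minimal sketch, assuming askAndCollect is safe to call concurrently on a single agent:

// Hypothetical concurrent variant: a few workers draining a shared queue.
// The shared limiter still caps combined throughput at the configured RPM/TPM.
const CONCURRENCY = 5;
const queue = [...tasks];

await Promise.all(
  Array.from({ length: CONCURRENCY }, async () => {
    for (let task = queue.shift(); task !== undefined; task = queue.shift()) {
      await agent.askAndCollect(task);
      console.log(`Completed: ${task}`);
    }
  }),
);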

Pattern 4: Cost and Rate-Limit Tracking

Track both cost and rate limit usage together:

let totalCost = 0;
let throttleCount = 0;

const agent = LLMist.createAgent()
  .withModel('sonnet')
  .withRateLimits({
    requestsPerMinute: 50,
    tokensPerMinute: 40_000,
    safetyMargin: 0.8, // Start throttling at 80%
  })
  .withHooks({
    observers: {
      onLLMCallComplete: (ctx) => {
        totalCost += ctx.cost ?? 0;
        console.log(`Cost so far: $${totalCost.toFixed(4)}`);
      },
      onRateLimitThrottle: (ctx) => {
        throttleCount++;
        console.log(`Throttled ${throttleCount}x (RPM: ${ctx.stats.requestsInCurrentMinute}/${ctx.stats.requestsPerMinute})`);
      },
    },
  });
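With these settings, and assuming safetyMargin scales both limits, throttling begins at 40 requests/minute (50 × 0.8) and 32,000 tokens/minute (40,000 × 0.8), leaving headroom before the provider's hard limit.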

Pattern 5: Dynamic Rate Limits Based on Time

Adjust limits during peak/off-peak hours:

import type { RateLimitConfig } from 'llmist'; // assuming the type is exported

function getRateLimits(): RateLimitConfig {
  const hour = new Date().getHours();
  const isPeakHours = hour >= 9 && hour <= 17;
  return {
    requestsPerMinute: isPeakHours ? 30 : 50, // More conservative during peak
    tokensPerMinute: isPeakHours ? 60_000 : 100_000,
    safetyMargin: isPeakHours ? 0.7 : 0.8,
  };
}

const agent = LLMist.createAgent()
  .withModel('sonnet')
  .withRateLimits(getRateLimits());

await agent.askAndCollect('...');
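Note that getRateLimits() runs once, when the agent is built. Assuming limits are fixed for an agent's lifetime, a long-running process should rebuild the agent (or re-apply withRateLimits) when it crosses a peak/off-peak boundary.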

Pattern 6: Disable for Local/Mock Providers

Skip rate limiting for local models or mocks:

const isLocal = process.env.LLM_PROVIDER === 'local';

const agent = LLMist.createAgent()
  .withModel(isLocal ? 'local:llama3' : 'sonnet')
  .withRateLimits(
    isLocal
      ? { enabled: false }
      : {
          requestsPerMinute: 50,
          tokensPerMinute: 40_000,
        },
  );

Pattern 7: Shared Limits Across Subagents

When using subagents (like BrowseWeb), rate limits are shared across the entire agent tree:

// Parent agent: 20 RPM limit
const parent = LLMist.createAgent()
  .withModel('sonnet')
  .withRateLimits({
    requestsPerMinute: 20,
    tokensPerMinute: 50_000,
  })
  .withGadgets(BrowseWeb); // BrowseWeb is a subagent gadget

// The BrowseWeb subagent shares the 20 RPM limit with the parent.
// Total system throughput: max 20 RPM (parent + subagent combined).

This ensures your configured limits actually protect against provider quotas, since API limits apply to your API key, not to individual agent instances. The RateLimitTracker is shared via the ExecutionContext when subagents are created with withParentContext(ctx).
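To make the sharing concrete, here is a rough sketch of a custom gadget spawning its own subagent. The handler name and signature are hypothetical; ExecutionContext and withParentContext(ctx) are the only pieces named above:

import type { ExecutionContext } from 'llmist'; // assuming the type is exported

// Hypothetical gadget handler: `ctx` is the ExecutionContext the framework
// passes in when the gadget runs.
async function runBrowseStep(ctx: ExecutionContext) {
  const subagent = LLMist.createAgent()
    .withModel('haiku')
    .withParentContext(ctx); // inherits the parent's RateLimitTracker
  // No .withRateLimits() call needed: throttling state is shared with the parent.
  return subagent.askAndCollect('Summarize the fetched page.');
}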

Pattern 8: Understanding Which Limit Triggered

The triggeredBy field in RateLimitStats tells you exactly which limit(s) caused throttling:

.withHooks({
  observers: {
    onRateLimitThrottle: (ctx) => {
      const { triggeredBy } = ctx.stats;
      if (triggeredBy?.daily) {
        // Daily token limit reached - wait until midnight UTC
        console.log(`Daily limit hit: ${triggeredBy.daily.current}/${triggeredBy.daily.limit} tokens`);
      } else if (triggeredBy?.rpm) {
        // Request rate limit
        console.log(`RPM limit: ${triggeredBy.rpm.current}/${triggeredBy.rpm.limit} requests`);
      } else if (triggeredBy?.tpm) {
        // Token rate limit
        console.log(`TPM limit: ${triggeredBy.tpm.current}/${triggeredBy.tpm.limit} tokens`);
      }
    },
  },
})
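To schedule a retry after a daily-limit hit, a small helper can compute the wait. This is a hypothetical utility, assuming the daily counter resets at midnight UTC:

// Milliseconds until the next midnight UTC, when the daily token
// counter is assumed to reset.
function msUntilUtcMidnight(now: Date = new Date()): number {
  return (
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1) -
    now.getTime()
  );
}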

Troubleshooting

Problem: Still hitting provider rate limits despite configuration

Solution: Check safety margin and actual token usage:

.withHooks({
  observers: {
    onRateLimitThrottle: (ctx) => {
      const { rpm, tpm, triggeredBy } = ctx.stats;
      console.log(`Current: RPM=${rpm}, TPM=${tpm}`);
      if (triggeredBy) {
        console.log(`Triggered by: ${Object.keys(triggeredBy).join(', ')}`);
      }
      // If provider errors persist while these stay below your configured
      // limits, your limits are likely set above your real tier: lower them,
      // or lower safetyMargin so throttling starts earlier.
    },
  },
})

Problem: Too much throttling; performance is slow

Solution: Raise the safety margin (throttle closer to the true limit) or increase your limits:

.withRateLimits({
  requestsPerMinute: 50,
  tokensPerMinute: 100_000,
  safetyMargin: 0.95, // Only throttle at 95% usage
})
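The trade-off: a higher margin means fewer throttle pauses but less headroom for estimation error, so provider-side rejections become more likely if your token counts undercount actual usage.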