# Retry Strategies

LLM APIs can fail transiently due to rate limits, server overload, or network issues. llmist provides a two-layer protection system to handle these failures reliably.

llmist uses two complementary layers to handle these failures:

| Layer | Purpose | How it Works |
|---|---|---|
| Proactive | Prevent errors | Track usage and delay requests before hitting limits |
| Reactive | Handle errors | Retry with backoff when errors occur, respecting Retry-After headers |
Both layers can be combined on a single agent:

```typescript
const agent = LLMist.createAgent()
  .withModel('sonnet')
  // Layer 1: Proactive throttling
  .withRateLimits({
    requestsPerMinute: 50,
    tokensPerMinute: 100000,
  })
  // Layer 2: Reactive retry (enabled by default)
  .withRetry({
    retries: 3,
    respectRetryAfter: true,
  })
  .ask('...');
```

Retry is enabled by default with sensible settings:

```typescript
// Default behavior - automatic retry on transient errors
const agent = await LLMist.createAgent()
  .withModel('sonnet')
  .ask('...');
```

To customize:

```typescript
// Custom retry configuration
.withRetry({
  retries: 5,        // Max 5 attempts
  minTimeout: 2000,  // Start with 2s delay
  maxTimeout: 60000, // Cap at 60s
  onRetry: (error, attempt) => {
    console.log(`Retry ${attempt}: ${error.message}`);
  },
})

// Disable retry
.withRetry({ enabled: false })
```

The `RetryConfig` interface supports these options:

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable/disable retry |
| retries | number | 3 | Maximum retry attempts |
| minTimeout | number | 1000 | Initial delay in ms |
| maxTimeout | number | 30000 | Maximum delay in ms |
| factor | number | 2 | Exponential backoff factor |
| randomize | boolean | true | Add jitter to prevent thundering herd |
| respectRetryAfter | boolean | true | Honor Retry-After headers from providers |
| maxRetryAfterMs | number | 120000 | Cap on server-requested wait time |
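
Assembled into a single TypeScript interface, the options look roughly like this (a sketch reconstructed from the table above plus the hooks documented below, not llmist's actual source):

```typescript
// Sketch of the RetryConfig shape; field names and defaults come from the
// options table above. Illustrative only.
interface RetryConfig {
  enabled?: boolean;           // default: true
  retries?: number;            // default: 3
  minTimeout?: number;         // default: 1000 (ms)
  maxTimeout?: number;         // default: 30000 (ms)
  factor?: number;             // default: 2
  randomize?: boolean;         // default: true
  respectRetryAfter?: boolean; // default: true
  maxRetryAfterMs?: number;    // default: 120000 (ms)
  // Classification and lifecycle hooks (documented below):
  shouldRetry?: (error: Error) => boolean;
  onRetry?: (error: Error, attempt: number) => void;
  onRetriesExhausted?: (error: Error, attempts: number) => void;
}
```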
Two lifecycle hooks let you observe retries as they happen:

```typescript
.withRetry({
  // Called before each retry
  onRetry: (error: Error, attempt: number) => {
    metrics.increment('llm.retry', { attempt });
    console.warn(`Retry ${attempt}: ${error.message}`);
  },
  // Called when all retries are exhausted
  onRetriesExhausted: (error: Error, attempts: number) => {
    alerting.notify(`LLM failed after ${attempts} attempts`);
  },
})
```

Override the default error classification:

```typescript
.withRetry({
  shouldRetry: (error: Error) => {
    // Only retry rate limits, not server errors
    return error.message.includes('429');
  },
})
```

By default, llmist retries these errors automatically:

| Error Type | Examples |
|---|---|
| Rate Limits | 429, "rate limit exceeded", "rate_limit" |
| Server Errors | 500, 502, 503, 504, "internal server error" |
| Timeouts | "timeout", "etimedout", "timed out" |
| Connection Issues | "econnreset", "econnrefused", "enotfound" |
| Provider Overload | "overloaded", "capacity" |
These errors are not retried:

| Error Type | Examples |
|---|---|
| Authentication | 401, 403, "unauthorized", "forbidden" |
| Bad Request | 400, "invalid" (except HuggingFace, see below) |
| Not Found | 404 |
| Content Policy | "content policy", "safety" |

llmist handles provider-specific error patterns:

| Provider | Retryable Errors |
|---|---|
| Gemini | RESOURCE_EXHAUSTED, UNAVAILABLE, DEADLINE_EXCEEDED, "quota exceeded" |
| Anthropic | overloaded_error, api_error |
| OpenAI | RateLimitError, ServiceUnavailableError, APITimeoutError |
| HuggingFace | 400 errors (often transient on serverless inference due to model loading/capacity) |
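
If you want similar classification in a custom `shouldRetry`, a sketch mirroring the patterns in the table above (illustrative only; llmist applies equivalent logic internally by default):

```typescript
// Illustrative classifier for the provider-specific patterns listed above.
function isProviderTransient(error: Error): boolean {
  const text = `${error.name} ${error.message}`;
  const lower = text.toLowerCase();
  return (
    // Gemini status codes and quota messages
    /RESOURCE_EXHAUSTED|UNAVAILABLE|DEADLINE_EXCEEDED/.test(text) ||
    lower.includes('quota exceeded') ||
    // Anthropic error types
    lower.includes('overloaded_error') ||
    lower.includes('api_error') ||
    // OpenAI error classes (surface via the error name)
    /RateLimitError|ServiceUnavailableError|APITimeoutError/.test(text)
  );
}
```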

Prevent rate limit errors before they occur by configuring usage limits:

```typescript
const agent = LLMist.createAgent()
  .withRateLimits({
    requestsPerMinute: 60,   // Your API tier's RPM limit
    tokensPerMinute: 100000, // Your API tier's TPM limit
    tokensPerDay: 1000000,   // Optional daily limit (Gemini free tier)
    safetyMargin: 0.9,       // Start throttling at 90% of limit
  })
  .ask('...');
```
| Option | Type | Default | Description |
|---|---|---|---|
| requestsPerMinute | number | - | Max RPM for your tier |
| tokensPerMinute | number | - | Max TPM for your tier |
| tokensPerDay | number | - | Daily token limit |
| safetyMargin | number | 0.9 | Threshold to start throttling |
| enabled | boolean | true | Enable/disable proactive limiting |
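
Conceptually, proactive limiting is a sliding-window counter that delays a request once usage approaches `safetyMargin` of the configured limit. A minimal sketch of the idea, not llmist's internals:

```typescript
// Minimal sliding-window request limiter (illustrative).
class WindowLimiter {
  private timestamps: number[] = [];

  constructor(
    private limitPerMinute: number,
    private safetyMargin = 0.9,
  ) {}

  // Resolves once sending a request would stay under the throttle threshold.
  async acquire(): Promise<void> {
    for (;;) {
      const now = Date.now();
      // Drop entries older than one minute.
      this.timestamps = this.timestamps.filter((t) => now - t < 60_000);
      if (this.timestamps.length < this.limitPerMinute * this.safetyMargin) {
        this.timestamps.push(now);
        return;
      }
      // Wait until the oldest entry leaves the window, then re-check.
      const wait = 60_000 - (now - this.timestamps[0]);
      await new Promise((resolve) => setTimeout(resolve, wait));
    }
  }
}
```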

Example configurations for common tiers:

```typescript
// Gemini free tier
.withRateLimits({
  requestsPerMinute: 15,
  tokensPerMinute: 1_000_000,
  tokensPerDay: 1_500_000,
})

// OpenAI Tier 1
.withRateLimits({
  requestsPerMinute: 500,
  tokensPerMinute: 200_000,
})

// Anthropic Tier 1
.withRateLimits({
  requestsPerMinute: 50,
  tokensPerMinute: 40_000,
})
```

llmist automatically parses and respects Retry-After headers from providers:

```typescript
.withRetry({
  respectRetryAfter: true, // Default: true
  maxRetryAfterMs: 60000,  // Cap wait at 1 minute
})
```

| Provider | Retry-After Format |
|---|---|
| Anthropic | HTTP header (seconds) |
| OpenAI | HTTP header (seconds) |
| Gemini | Parsed from error message (e.g., "retry in 45.2s") |

When a provider sends `Retry-After: 30`, llmist will wait 30 seconds before retrying instead of using exponential backoff.
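
For readers implementing their own handling, a sketch of how a Retry-After value can be turned into a capped wait time (llmist does this for you):

```typescript
// Illustrative: derive a wait in ms from a Retry-After value, handling both
// delta-seconds ("30") and HTTP-date forms, capped at maxRetryAfterMs.
function retryAfterToMs(value: string, maxRetryAfterMs = 120_000): number | null {
  const seconds = Number(value);
  if (value.trim() !== '' && Number.isFinite(seconds)) {
    return Math.min(seconds * 1000, maxRetryAfterMs);
  }
  const date = Date.parse(value);
  if (!Number.isNaN(date)) {
    return Math.min(Math.max(date - Date.now(), 0), maxRetryAfterMs);
  }
  return null; // Unparseable: fall back to exponential backoff.
}
```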

The delay between retries follows exponential backoff:

```
delay = min(minTimeout * (factor ^ attempt), maxTimeout)
```

With `randomize: true` (default), jitter is added:

```
delay = delay * random(0.5, 1.5)
```

Example with defaults:

- Attempt 1: ~1-1.5s
- Attempt 2: ~2-3s
- Attempt 3: ~4-6s
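
In code, the schedule looks like this (a direct sketch of the formulas above, with attempts indexed from 0 so the first retry starts at `minTimeout`; not llmist's source):

```typescript
// Delay in ms before retry number `attempt` (0-based), per the formula above.
function backoffDelay(
  attempt: number,
  { minTimeout = 1000, maxTimeout = 30_000, factor = 2, randomize = true } = {},
): number {
  const base = Math.min(minTimeout * factor ** attempt, maxTimeout);
  // Jitter: multiply by random(0.5, 1.5) to avoid thundering herds.
  return randomize ? base * (0.5 + Math.random()) : base;
}
```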
For maximum resilience, raise the retry budget and wire in logging and alerting:

```typescript
const agent = LLMist.createAgent()
  .withRetry({
    retries: 5,
    minTimeout: 2000,
    maxTimeout: 120000, // Up to 2 minutes between retries
    onRetry: (error, attempt) => {
      logger.warn(`LLM retry ${attempt}/5`, { error: error.message });
    },
    onRetriesExhausted: (error, attempts) => {
      logger.error(`LLM failed permanently after ${attempts} attempts`);
      alerting.trigger('llm_failure');
    },
  })
  .ask('...');
```
To fail fast instead, keep retries and delays small:

```typescript
const agent = LLMist.createAgent()
  .withRetry({
    retries: 1,      // Only one retry
    minTimeout: 500, // Short delay
    maxTimeout: 2000, // Cap quickly
  })
  .ask('...');
```

Or disable retry entirely:

```typescript
// For testing or when you handle retries externally
const agent = LLMist.createAgent()
  .withRetry({ enabled: false })
  .ask('...');
```
To retry only specific error classes, provide a custom `shouldRetry`:

```typescript
const agent = LLMist.createAgent()
  .withRetry({
    shouldRetry: (error) => {
      // Only retry rate limits and overload
      const msg = error.message.toLowerCase();
      return msg.includes('429') ||
        msg.includes('rate limit') ||
        msg.includes('overloaded');
    },
  })
  .ask('...');
```
To feed retry activity into a metrics pipeline:

```typescript
const agent = LLMist.createAgent()
  .withRetry({
    onRetry: (error, attempt) => {
      statsd.increment('llm.retries', {
        attempt: String(attempt),
        error_type: classifyError(error),
      });
    },
    onRetriesExhausted: (error, attempts) => {
      statsd.increment('llm.failures', {
        total_attempts: String(attempts),
      });
    },
  })
  .ask('...');
```

llmist also provides `formatLLMError()` to clean up verbose API error messages:

```typescript
import { formatLLMError } from 'llmist';

try {
  await agent.askAndCollect('...');
} catch (error) {
  // Instead of: "{\"error\":{\"message\":\"Rate limit exceeded...\"}}"
  // You get: "Rate limit exceeded (429) - retry after a few seconds"
  console.error(formatLLMError(error));
}
```