I Read 1 Million Lines of Code and Found the Layer Most LLM Apps Are Missing
Distilling production-grade resilience patterns from the OpenClaw open-source Agent engine into a reusable library for LLM applications
Your AI app will hit these 5 failure modes on day one. Most developers slap on a try/catch and show an error toast. Here’s a better way.
The Problem
I recently dug into an open-source project called OpenClaw. It’s an “AI assistant gateway” — it plugs AI into your existing chat tools like WhatsApp, Slack, and Telegram, so the AI shows up right where your conversations already happen.
1.05 million lines of TypeScript. 3,000+ files. 713 contributors. Nearly 12,000 commits in three months.
While reading the core file of its Agent execution engine (run.ts, 997 lines), I found a design pattern that stopped me in my tracks.
Most LLM applications are architected like this:
User Input → Call LLM API → Return Result
OpenClaw adds one more layer:
User Input → Resilience Layer (fault tolerance + recovery) → Call LLM API → Return Result
It looks trivial. But this layer determines the ceiling of your user experience.
5 Failures That Will Definitely Happen
If your AI app calls an LLM API, these 5 errors will occur:
| # | Error | Frequency | What the user sees |
|---|---|---|---|
| 1 | Rate limit (429) | Daily during peak hours | “Too many requests, try again later” |
| 2 | Auth failure (401/403) | When a key expires | “Service temporarily unavailable” |
| 3 | Context overflow | During long conversations | “Conversation too long, please start a new one” |
| 4 | Thinking mode unsupported | When switching models | “This model doesn’t support this feature” |
| 5 | Billing error (402) | When quota runs out | “Service temporarily unavailable” |
How most developers handle this:
try {
  const response = await anthropic.messages.create({ ... });
  return response;
} catch (err) {
  return "Something went wrong, please try again"; // 😅
}
The user sees “it’s broken.” Then they close your app and go back to ChatGPT.
OpenClaw’s Approach: The User Never Notices
The core of OpenClaw’s Agent engine is a while(true) loop. Every error type has a corresponding automatic recovery strategy:
while (true) {
  result = await callLLM(...)
  if (context overflow) → auto-compress conversation, retry
  if (rate limit) → cool down current key, silently switch to backup
  if (auth failure) → mark as failed, switch to next auth profile
  if (thinking unsupported) → auto-downgrade extended → deep → off
  if (billing error) → long cooldown, switch to next provider
  if (success) → break
}
What the user perceives is “it just works.” Behind the scenes, it may have rotated through 2 keys, downgraded thinking once, and compressed the conversation — all invisibly.
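The loop above can be sketched in TypeScript. This is a control-flow illustration only: the failure kinds, the `Attempt` shape, and the scripted list of outcomes are stand-ins of my own, not OpenClaw's actual types (a real implementation would await a provider SDK inside the loop).

```typescript
// Illustrative failure kinds; OpenClaw's real classifier matches provider
// error messages against a large set of regex patterns.
type FailureKind = "rate_limit" | "context_overflow" | "success";

interface Attempt {
  kind: FailureKind;
  text?: string;
}

// Sketch of the recovery loop: every failure kind maps to a recovery action
// plus a retry; the loop exits only on success or when retries run out.
// We take a scripted list of outcomes so the control flow is easy to follow.
function resilientLoop(outcomes: Attempt[], maxRounds = 5): string {
  let round = 0;
  while (true) {
    if (round >= maxRounds) throw new Error("recovery exhausted");
    const result = outcomes[Math.min(round, outcomes.length - 1)];
    round++;
    switch (result.kind) {
      case "context_overflow":
        continue; // would compress the conversation here, then retry
      case "rate_limit":
        continue; // would cool down the key and rotate to a backup, then retry
      case "success":
        return result.text ?? "";
    }
  }
}
```

The key property is that every branch either returns, throws, or retries; there is no path where an error silently escapes to the user.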
The 5 Recovery Strategies in Detail
Strategy 1: Key Rotation + Exponential Backoff Cooldown
Instead of storing a single API key, maintain an ordered candidate list:
const keys = ["sk-key1", "sk-key2", "sk-key3"];
let keyIndex = 0;
// When a key gets rate-limited:
// → Mark it for cooldown (1min → 5min → 25min → 1hr, exponential backoff)
// → Switch to the next key
// → User notices nothing
OpenClaw’s cooldown schedule:
| Consecutive failures | Cooldown duration |
|---|---|
| 1 | 1 minute |
| 2 | 5 minutes |
| 3 | 25 minutes |
| 4+ | 1 hour (capped) |
For billing errors, cooldowns are much longer: they start at 5 hours and cap at 24 hours, since it takes time for an account owner to top up.
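The rate-limit schedule in the table is a base-5 exponential with a cap, which makes it a one-liner. Here is a sketch of that calculation plus a minimal key-rotation pick; the `KeyState` shape and function names are mine, not OpenClaw's API.

```typescript
// Cooldown schedule from the table: 1 min → 5 min → 25 min, capped at 1 hour.
function cooldownMs(consecutiveFailures: number): number {
  const ONE_MINUTE = 60_000;
  const ONE_HOUR = 3_600_000;
  return Math.min(ONE_MINUTE * 5 ** (consecutiveFailures - 1), ONE_HOUR);
}

interface KeyState {
  id: string;
  failures: number;
  cooldownUntil: number; // epoch ms; 0 means never cooled down
}

// Rotate to the first key that is out of cooldown. Returning undefined
// signals that every key is cooling and we should fall back to another
// provider (Strategy 2 below).
function nextAvailableKey(keys: KeyState[], now: number): KeyState | undefined {
  return keys.find((k) => k.cooldownUntil <= now);
}
```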
Strategy 2: Multi-Provider Fallback
When all Anthropic keys are in cooldown:
Anthropic (all keys cooling down)
↓ auto-fallback
OpenAI (try GPT-4o)
↓ also unavailable
Google (try Gemini)
The user might notice a shift in response style, but at least there’s no interruption.
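A minimal sketch of that fallback walk, assuming a provider counts as usable when at least one of its keys is out of cooldown (the `ProviderState` shape is hypothetical, for illustration only):

```typescript
interface ProviderState {
  name: string;
  keyCooldownsUntil: number[]; // epoch ms per key; 0 means available
}

// Walk providers in priority order and return the first usable one.
// undefined means everything is cooling down and we must surface an error.
function pickProvider(
  providers: ProviderState[],
  now: number,
): string | undefined {
  return providers.find((p) => p.keyCooldownsUntil.some((t) => t <= now))
    ?.name;
}
```

Keeping the list ordered by preference means the fallback logic and the "preferred provider" logic are the same line of code.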
Strategy 3: Three Lines of Defense Against Context Overflow
This is the most elegant design. When the LLM reports “context too long,” instead of failing outright, it applies a tiered recovery:
Level 1: SDK already auto-compressed? → Retry directly (zero cost)
Level 2: Actively call compression function → Summarize old messages with a cheap model → Retry
Level 3: Truncate oversized tool results → Retry
Level 4: All failed → Return a friendly message
The thresholds for tool result truncation are carefully calibrated:
// A single tool result can occupy at most 30% of the context
const MAX_TOOL_RESULT_CONTEXT_SHARE = 0.3;
// Hard cap at 400K characters (≈100K tokens)
const HARD_MAX_TOOL_RESULT_CHARS = 400_000;
// After truncation, keep at least 2000 characters (so the LLM understands what the content is)
const MIN_KEEP_CHARS = 2_000;
There’s a nice detail in the truncation logic: it tries to cut at a newline boundary rather than in the middle of a line.
let cutPoint = keepChars;
const lastNewline = text.lastIndexOf("\n", keepChars);
if (lastNewline > keepChars * 0.8) { // newline is past the 80% mark
  cutPoint = lastNewline; // cut there instead
}
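Putting the thresholds and the newline heuristic together, a self-contained version of the truncation step might look like this. The function name and the `contextWindowChars` parameter are my own; only the constants and the boundary heuristic come from the source.

```typescript
const MAX_TOOL_RESULT_CONTEXT_SHARE = 0.3;
const HARD_MAX_TOOL_RESULT_CHARS = 400_000;
const MIN_KEEP_CHARS = 2_000;

function truncateToolResult(text: string, contextWindowChars: number): string {
  // Budget: at most 30% of the context window, hard-capped at 400K chars,
  // but never so small that the LLM can't tell what the content was.
  const budget = Math.min(
    Math.floor(contextWindowChars * MAX_TOOL_RESULT_CONTEXT_SHARE),
    HARD_MAX_TOOL_RESULT_CHARS,
  );
  const keepChars = Math.max(budget, MIN_KEEP_CHARS);
  if (text.length <= keepChars) return text;

  // Prefer cutting at a newline, but only if it lands past 80% of the
  // budget, so a clean boundary never costs us much content.
  let cutPoint = keepChars;
  const lastNewline = text.lastIndexOf("\n", keepChars);
  if (lastNewline > keepChars * 0.8) cutPoint = lastNewline;
  return text.slice(0, cutPoint) + "\n…[truncated]";
}
```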
Strategy 4: Thinking Mode Auto-Downgrade
Different models support different thinking levels. OpenClaw uses a Set to track attempted levels and avoid infinite loops:
const attempted = new Set<ThinkingLevel>();
// extended → deep → off
// Already tried? Skip it. No infinite loops.
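The whole guard fits in a few lines. This sketch uses my own names for the downgrade order and helper; the Set-based termination idea is the part taken from the source.

```typescript
type ThinkingLevel = "extended" | "deep" | "off";

// Downgrade order from the comment above: extended → deep → off.
const DOWNGRADE_ORDER: ThinkingLevel[] = ["extended", "deep", "off"];

// Return the next level not yet attempted; undefined means every level has
// been rejected, so the retry loop terminates instead of spinning forever.
function nextThinkingLevel(
  attempted: Set<ThinkingLevel>,
): ThinkingLevel | undefined {
  return DOWNGRADE_ORDER.find((level) => !attempted.has(level));
}
```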
Strategy 5: Honest Token Accounting
This one detail is worth its weight in gold.
In a tool-calling loop, each API call reports token usage for the full context. If there are 5 tool calls and you naively sum them up, you get 5x over-reporting.
// ❌ Wrong approach
totalInputTokens += response.usage.input_tokens; // 5 calls = 5x inflation
// ✅ Correct approach (OpenClaw's method)
// Output tokens: accumulate (total generated)
// Prompt tokens: take only the last call's value (reflects actual context size)
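That accounting rule is small enough to show whole. The `Usage` shape mirrors the `usage` object returned by Anthropic-style APIs; the function itself is my sketch of the rule, not OpenClaw's code.

```typescript
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

// Output tokens accumulate, because each call generates new text. Input
// tokens come from the LAST call only, because every call's prompt already
// contains the whole conversation so far; summing them would multiply the
// context by the number of tool-calling rounds.
function accountTokens(usages: Usage[]): { input: number; output: number } {
  return {
    input: usages.length > 0 ? usages[usages.length - 1].input_tokens : 0,
    output: usages.reduce((sum, u) => sum + u.output_tokens, 0),
  };
}
```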
Add This Layer to Your LLM App Today
You might be thinking: I could write this while(true) loop myself. And you’re right — but the devil is in the details. The 40+ regex patterns for error classification, the exponential backoff math, the last-call token accounting, the infinite-loop guard on thinking downgrades… each of these edge cases was born from a production incident.
So I distilled the essence of OpenClaw’s 997-line run.ts into a drop-in library:
npm install @yuyuqueen/resilient-llm
GitHub: github.com/yuyuqueen/llm-toolkit — Stars welcome
5-Minute Integration
All 5 recovery strategies from above, in one call:
import { createResilientLLM } from '@yuyuqueen/resilient-llm'
import Anthropic from '@anthropic-ai/sdk'
const resilient = createResilientLLM({
  providers: [
    {
      name: 'anthropic',
      model: 'claude-sonnet-4-20250514',
      keys: [
        { id: 'key-1', value: process.env.ANTHROPIC_KEY_1! },
        { id: 'key-2', value: process.env.ANTHROPIC_KEY_2! },
      ],
    },
    {
      name: 'openai',
      model: 'gpt-4o',
      keys: [{ id: 'openai-1', value: process.env.OPENAI_KEY! }],
    },
  ],
})
const result = await resilient.call(async (ctx) => {
  const client = new Anthropic({ apiKey: ctx.apiKey.value })
  return {
    response: await client.messages.create({
      model: ctx.model,
      max_tokens: 1024,
      messages: [{ role: 'user', content: 'Hello!' }],
    }),
  }
})
// The user doesn't need to know how many key swaps or downgrades happened
console.log(result.response.content[0].text)
Before and after comparison:
| Scenario | Before | After |
|---|---|---|
| Rate limit | Crash, show error | Silently rotate keys, user unaware |
| Key expired | Service outage | Auto-switch to backup key |
| Context overflow | “Please start a new chat” | Auto-compress, conversation continues |
| Provider down | Entire app unavailable | Fallback to backup provider |
| Token billing | 5x over-reporting | Accurate accounting |
Core Design Principles
- Provider-agnostic: Not tied to any LLM SDK — you provide the callback, the library handles orchestration
- Zero dependencies: Pure TypeScript, no runtime dependencies
- 5 auto-recovery strategies: Key rotation, provider fallback, context compression, thinking downgrade, exponential backoff
Context Compression
const result = await resilient.call(
  callFn,
  {
    thinkingLevel: 'high',
    contextCompressor: async () => {
      const removed = trimOldMessages(messages)
      return removed > 0
        ? { compressed: true, description: `Removed ${removed} messages` }
        : { compressed: false }
    },
  },
)
Key Health Monitoring
const health = resilient.getKeyHealth()
// → { keys: [{ id: 'key-1', status: 'cooldown', errorCount: 2 }, ...] }
Know the status of every key at a glance. No more getting paged at 3 AM to figure out which key went down.
Conclusion
The UX ceiling of your LLM application isn’t determined by the AI model’s capability — it’s determined by your resilience layer.
Everyone’s calling the same APIs with the same models. The real differentiator is what the user sees when things go wrong: “Something broke” or “nothing happened.”
That’s the missing layer. And now you can add it to your project with a single npm install.
→ @yuyuqueen/resilient-llm on npm → GitHub source
This is the first post in the “Distilling Libraries from Open Source” series. Coming next:
- Three Lines of Defense for Context Window Management (@yuyuqueen/llm-context-kit)
- Stop Hardcoding Your System Prompts (@yuyuqueen/prompt-assembler)
Follow for updates → Twitter @YuYuQueen_ · GitHub