
I Read 1 Million Lines of Code and Found the Layer Most LLM Apps Are Missing

Distilling production-grade resilience patterns from OpenClaw's open-source Agent engine into a reusable library for LLM applications

Your AI app will hit these 5 failure modes on day one. Most developers slap on a try/catch and show an error toast. Here’s a better way.

The Problem

I recently dug into an open-source project called OpenClaw. It’s an “AI assistant gateway” — it plugs AI into your existing chat tools like WhatsApp, Slack, and Telegram, so the AI shows up right where your conversations already happen.

1.05 million lines of TypeScript. 3,000+ files. 713 contributors. Nearly 12,000 commits in three months.

While reading the core file of its Agent execution engine (run.ts, 997 lines), I found a design pattern that stopped me in my tracks.

Most LLM applications are architected like this:

User Input → Call LLM API → Return Result

OpenClaw adds one more layer:

User Input → Resilience Layer (fault tolerance + recovery) → Call LLM API → Return Result

It looks trivial. But this layer determines the ceiling of your user experience.

5 Failures That Will Definitely Happen

If your AI app calls an LLM API, these 5 errors will occur:

| # | Error | Frequency | What the user sees |
|---|-------|-----------|--------------------|
| 1 | Rate limit (429) | Daily during peak hours | “Too many requests, try again later” |
| 2 | Auth failure (401/403) | When a key expires | “Service temporarily unavailable” |
| 3 | Context overflow | During long conversations | “Conversation too long, please start a new one” |
| 4 | Thinking mode unsupported | When switching models | “This model doesn’t support this feature” |
| 5 | Billing error (402) | When quota runs out | “Service temporarily unavailable” |

How most developers handle this:

try {
  const response = await anthropic.messages.create({ ... });
  return response;
} catch (err) {
  return "Something went wrong, please try again";  // 😅
}

The user sees “it’s broken.” Then they close your app and go back to ChatGPT.

OpenClaw’s Approach: The User Never Notices

The core of OpenClaw’s Agent engine is a while(true) loop. Every error type has a corresponding automatic recovery strategy:

while (true) {
    result = await callLLM(...)

    if (context overflow)      → auto-compress conversation, retry
    if (rate limit)            → cool down current key, silently switch to backup
    if (auth failure)          → mark as failed, switch to next auth profile
    if (thinking unsupported)  → auto-downgrade extended → deep → off
    if (billing error)         → long cooldown, switch to next provider

    if (success) → break
}

What the user perceives is “it just works.” Behind the scenes, it may have rotated through 2 keys, downgraded thinking once, and compressed the conversation — all invisibly.

The 5 Recovery Strategies in Detail

Strategy 1: Key Rotation + Exponential Backoff Cooldown

Instead of storing a single API key, maintain an ordered candidate list:

const keys = ["sk-key1", "sk-key2", "sk-key3"];
let keyIndex = 0;

// When a key gets rate-limited:
// → Mark it for cooldown (1min → 5min → 25min → 1hr, exponential backoff)
// → Switch to the next key
// → User notices nothing

OpenClaw’s cooldown schedule:

| Consecutive failures | Cooldown duration |
|----------------------|-------------------|
| 1 | 1 minute |
| 2 | 5 minutes |
| 3 | 25 minutes |
| 4+ | 1 hour (capped) |

For billing errors, cooldowns are much longer: starting at 5 hours and capped at 24 hours, since it takes time for users to top up their accounts.
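The rate-limit schedule above is a plain exponential: each consecutive failure multiplies the cooldown by 5, capped at one hour. A sketch (the function name is mine, not OpenClaw's):

```typescript
// Cooldown for the nth consecutive rate-limit failure:
// 1 min, 5 min, 25 min, then capped at 1 hour.
const BASE_MS = 60_000;     // 1 minute
const CAP_MS = 3_600_000;   // 1 hour

function cooldownMs(consecutiveFailures: number): number {
  const ms = BASE_MS * 5 ** (consecutiveFailures - 1);
  return Math.min(ms, CAP_MS);
}
```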

Strategy 2: Multi-Provider Fallback

When all Anthropic keys are in cooldown:

Anthropic (all keys cooling down)
    ↓ auto-fallback
OpenAI (try GPT-4o)
    ↓ also unavailable
Google (try Gemini)

The user might notice a shift in response style, but at least there’s no interruption.
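In code, the fallback is just an ordered scan over the configured providers. A sketch, assuming each provider can report whether any of its keys is currently usable (the `available` check and model names are illustrative):

```typescript
interface Provider {
  name: string;
  model: string;
  available: () => boolean;  // e.g. "at least one key not in cooldown"
}

// Walk the configured order and return the first usable provider;
// undefined means everything is down.
function pickProvider(providers: Provider[]): Provider | undefined {
  return providers.find((p) => p.available());
}
```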

Strategy 3: Three Lines of Defense Against Context Overflow

This is the most elegant design. When the LLM reports “context too long,” instead of failing outright, it applies a tiered recovery:

Level 1: SDK already auto-compressed? → Retry directly (zero cost)
Level 2: Actively call compression function → Summarize old messages with a cheap model → Retry
Level 3: Truncate oversized tool results → Retry
Level 4: All failed → Return a friendly message
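The tiering itself can be sketched as an ordered list of recovery callbacks, tried cheapest-first; this is my own illustration of the pattern, not OpenClaw's actual function:

```typescript
// Each level returns true if it changed something (compressed, truncated),
// which means the LLM call is worth retrying.
type RecoveryLevel = () => Promise<boolean>;

async function recoverFromOverflow(levels: RecoveryLevel[]): Promise<boolean> {
  for (const level of levels) {
    if (await level()) return true;  // progress made: retry the call
  }
  return false;  // every level failed: surface a friendly message instead
}
```

Because the levels short-circuit, the expensive strategies (summarizing with another model) only run when the free ones did nothing.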

The thresholds for tool result truncation are carefully calibrated:

// A single tool result can occupy at most 30% of the context
const MAX_TOOL_RESULT_CONTEXT_SHARE = 0.3;
// Hard cap at 400K characters (≈100K tokens)
const HARD_MAX_TOOL_RESULT_CHARS = 400_000;
// After truncation, keep at least 2000 characters (so the LLM understands what the content is)
const MIN_KEEP_CHARS = 2_000;

There’s a nice detail in the truncation logic: it tries to cut at a newline boundary rather than in the middle of a line.

let cutPoint = keepChars;
const lastNewline = text.lastIndexOf("\n", keepChars);
if (lastNewline > keepChars * 0.8) {  // newline is past the 80% mark
  cutPoint = lastNewline;              // cut there instead
}
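Putting the thresholds and the newline heuristic together, a truncation function might look like this. The constants come from the snippets above; the function name, the `contextChars` parameter, and the truncation marker are my own assumptions:

```typescript
const MAX_TOOL_RESULT_CONTEXT_SHARE = 0.3;   // ≤30% of the context window
const HARD_MAX_TOOL_RESULT_CHARS = 400_000;  // ≈100K tokens
const MIN_KEEP_CHARS = 2_000;                // always keep enough to be legible

function truncateToolResult(text: string, contextChars: number): string {
  const budget = Math.min(
    Math.floor(contextChars * MAX_TOOL_RESULT_CONTEXT_SHARE),
    HARD_MAX_TOOL_RESULT_CHARS,
  );
  const keepChars = Math.max(budget, MIN_KEEP_CHARS);
  if (text.length <= keepChars) return text;

  // Prefer cutting at a newline boundary, but only when that newline falls
  // in the last 20% of the kept region, so we don't discard too much.
  let cutPoint = keepChars;
  const lastNewline = text.lastIndexOf("\n", keepChars);
  if (lastNewline > keepChars * 0.8) {
    cutPoint = lastNewline;
  }

  return text.slice(0, cutPoint) + "\n[... truncated ...]";
}
```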

Strategy 4: Thinking Mode Auto-Downgrade

Different models support different thinking levels. OpenClaw uses a Set to track attempted levels and avoid infinite loops:

const attempted = new Set<ThinkingLevel>();
// extended → deep → off
// Already tried? Skip it. No infinite loops.
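A sketch of the downgrade ladder with the attempted-set guard (the type matches the snippet above; the function name and ladder array are my own illustration):

```typescript
type ThinkingLevel = "extended" | "deep" | "off";
const DOWNGRADE_ORDER: ThinkingLevel[] = ["extended", "deep", "off"];

// Returns the next level to try, or undefined once every level has been
// attempted -- the caller then gives up instead of looping forever.
function nextThinkingLevel(
  attempted: Set<ThinkingLevel>,
): ThinkingLevel | undefined {
  return DOWNGRADE_ORDER.find((level) => !attempted.has(level));
}
```

Each time the provider rejects a level, add it to `attempted` and call again; the `Set` guarantees the loop terminates after at most three tries.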

Strategy 5: Honest Token Accounting

This one detail is worth its weight in gold.

In a tool-calling loop, each API call reports token usage for the full context. If there are 5 tool calls and you naively sum them up, you get 5x over-reporting.

// ❌ Wrong approach
totalInputTokens += response.usage.input_tokens;  // 5 calls = 5x inflation

// ✅ Correct approach (OpenClaw's method)
// Output tokens: accumulate (total generated)
// Prompt tokens: take only the last call's value (reflects actual context size)
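The rule over a multi-call tool loop can be sketched as follows. The usage shape mirrors Anthropic's `input_tokens`/`output_tokens` fields; the accumulator function is illustrative, not OpenClaw's actual code:

```typescript
interface Usage { input_tokens: number; output_tokens: number }

// Output tokens accumulate across calls (everything the model generated).
// Input (prompt) tokens are overwritten each iteration, so the final value
// reflects the last call's actual context size rather than an inflated sum.
function tallyUsage(calls: Usage[]): Usage {
  let output_tokens = 0;
  let input_tokens = 0;
  for (const u of calls) {
    output_tokens += u.output_tokens;
    input_tokens = u.input_tokens;  // keep only the last call's value
  }
  return { input_tokens, output_tokens };
}
```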

Add This Layer to Your LLM App Today

You might be thinking: I could write this while(true) loop myself. And you’re right — but the devil is in the details. The 40+ regex patterns for error classification, the exponential backoff math, the last-call token accounting, the infinite-loop guard on thinking downgrades… each of these edge cases was born from a production incident.

So I distilled the essence of OpenClaw’s 997-line run.ts into a drop-in library:

npm install @yuyuqueen/resilient-llm

GitHub: github.com/yuyuqueen/llm-toolkit — Stars welcome

5-Minute Integration

All 5 recovery strategies from above, in one call:

import { createResilientLLM } from '@yuyuqueen/resilient-llm'
import Anthropic from '@anthropic-ai/sdk'

const resilient = createResilientLLM({
  providers: [
    {
      name: 'anthropic',
      model: 'claude-sonnet-4-20250514',
      keys: [
        { id: 'key-1', value: process.env.ANTHROPIC_KEY_1! },
        { id: 'key-2', value: process.env.ANTHROPIC_KEY_2! },
      ],
    },
    {
      name: 'openai',
      model: 'gpt-4o',
      keys: [{ id: 'openai-1', value: process.env.OPENAI_KEY! }],
    },
  ],
})

const result = await resilient.call(async (ctx) => {
  const client = new Anthropic({ apiKey: ctx.apiKey.value })
  return {
    response: await client.messages.create({
      model: ctx.model,
      max_tokens: 1024,
      messages: [{ role: 'user', content: 'Hello!' }],
    }),
  }
})

// The user doesn't need to know how many key swaps or downgrades happened
console.log(result.response.content[0].text)

Before and after comparison:

| Scenario | Before | After |
|----------|--------|-------|
| Rate limit | Crash, show error | Silently rotate keys, user unaware |
| Key expired | Service outage | Auto-switch to backup key |
| Context overflow | “Please start a new chat” | Auto-compress, conversation continues |
| Provider down | Entire app unavailable | Fallback to backup provider |
| Token billing | 5x over-reporting | Accurate accounting |

Core Design Principles

  • Provider-agnostic: Not tied to any LLM SDK — you provide the callback, the library handles orchestration
  • Zero dependencies: Pure TypeScript, no runtime dependencies
  • 5 auto-recovery strategies: Key rotation, provider fallback, context compression, thinking downgrade, exponential backoff

Context Compression

const result = await resilient.call(
  callFn,
  {
    thinkingLevel: 'high',
    contextCompressor: async () => {
      const removed = trimOldMessages(messages)
      return removed > 0
        ? { compressed: true, description: `Removed ${removed} messages` }
        : { compressed: false }
    },
  },
)

Key Health Monitoring

const health = resilient.getKeyHealth()
// → { keys: [{ id: 'key-1', status: 'cooldown', errorCount: 2 }, ...] }

Know the status of every key at a glance. No more getting paged at 3 AM to figure out which key went down.

Conclusion

The UX ceiling of your LLM application isn’t determined by the AI model’s capability — it’s determined by your resilience layer.

Everyone’s calling the same APIs with the same models. The real differentiator is what the user sees when things go wrong: “Something broke” or “nothing happened.”

That’s the missing layer. And now you can add it to your project with a single npm install.

@yuyuqueen/resilient-llm on npm · GitHub source


This is the first post in the “Distilling Libraries from Open Source” series.

Follow for updates → Twitter @YuYuQueen_ · GitHub
