Your AI Conversations Break After 20 Turns? You Need These Three Lines of Defense
Distilled from OpenClaw's context management system: a three-tier defense strategy against context window overflow in LLM applications
Your user has been chatting with your AI assistant for 20 turns. They’ve built up rich context — debugging history, file references, decisions made. Then they hit: “Conversation too long, please start a new chat.” All that context? Gone. They uninstall your app.
The Problem
In the previous post, I distilled the resilience layer from OpenClaw’s Agent engine. I briefly mentioned one recovery strategy — context overflow recovery — in a single paragraph.
But when I actually read OpenClaw’s context handling code, I found it far more sophisticated than “compress and retry.”
It’s a three-tier defense system. Each tier has its own trigger conditions, execution strategy, and fallback path.
How most LLM apps handle context overflow:
try {
  await llm.chat(messages)
} catch (err) {
  if (err.message.includes('context')) {
    return "Conversation too long, please start a new chat"
  }
}
OpenClaw’s approach — three lines of defense, from light to heavy:
Before every API call
|
v
Defense 1: Context Budget Check (proactive)
-> Estimate current token count, predict overflow
-> Over threshold? Trigger Defense 2 or 3
|
v
Defense 2: Tool Result Truncation (lightweight)
-> Find the largest tool results, truncate by threshold
-> Pure string operations, completes in milliseconds
|
v
Defense 3: Conversation Compression (heavyweight)
-> Summarize old messages with a cheap model
-> Preserve recent turns + system message
|
v
All fail -> Graceful error (not a crash)
The user notices nothing. The conversation continues.
The key difference: most developers only handle overflow after the API throws an error. OpenClaw knows before sending whether overflow will happen.
Defense 1: Context Budget — Knowing You’ll Overflow Before You Do
The other two defenses are about “how to fix it when things break.” This one is about “knowing things will break before they do.”
// Reserve 4096 tokens for model output
const RESERVE_OUTPUT_TOKENS = 4_096
// Heuristic: 4 characters ~ 1 token
const CHARS_PER_TOKEN = 4
Before every request, calculate the total token count of current messages, subtract the output reserve, and you know how much room is left. When utilization exceeds 70%, proactively trigger truncation or compression — instead of waiting for a 100% crash.
This is the “radar” of the entire defense system. Without it, you only learn about overflow after the API fails. With it, you can start handling things with 30% headroom remaining.
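The budget check can be sketched in a few lines. This is my own minimal version built from the constants and the 70% threshold described above, not the library's internals; the `Message` shape is simplified to plain string content.

```typescript
// Simplified message shape; real messages may carry structured content blocks.
interface Message { role: string; content: string }

const CONTEXT_WINDOW_TOKENS = 200_000
const RESERVE_OUTPUT_TOKENS = 4_096
const CHARS_PER_TOKEN = 4

function estimateTokens(messages: Message[]): number {
  // Heuristic from above: roughly 4 characters per token
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0)
  return Math.ceil(chars / CHARS_PER_TOKEN)
}

function checkBudget(messages: Message[]) {
  const estimated = estimateTokens(messages)
  const available = CONTEXT_WINDOW_TOKENS - RESERVE_OUTPUT_TOKENS
  return {
    estimatedTokens: estimated,
    availableTokens: available - estimated,
    utilizationPercent: Math.round((estimated / available) * 100),
    withinBudget: estimated < available * 0.7, // proactive 70% threshold
  }
}
```

The estimate is deliberately cheap: a character count and a division, so you can afford to run it before every single request.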
Defense 2: Tool Result Truncation
The lightest defense — no LLM calls needed, pure string operations, millisecond completion.
Why Tool Results Are the Biggest Overflow Source
In Agent scenarios, a single read_file can return hundreds of thousands of characters. A database query can return hundreds of records. These tool results get stuffed directly into the context window.
OpenClaw’s solution: set thresholds and truncate.
// A single tool result gets at most 30% of the context
const MAX_TOOL_RESULT_CONTEXT_SHARE = 0.3
// Hard cap at 400K characters (~100K tokens)
const HARD_MAX_TOOL_RESULT_CHARS = 400_000
// After truncation, keep at least 2000 chars (so the LLM understands what the content is)
const MIN_KEEP_CHARS = 2_000
Why 30%? Because the context also needs to hold the system prompt, conversation history, and other tool results. A single tool result taking 30% is already a large share.
Truncation Details
Truncation isn’t a naive .slice(0, n). One detail in OpenClaw impressed me:
// Try to cut at a newline, not in the middle of a line
let cutPoint = keepChars
const lastNewline = text.lastIndexOf("\n", keepChars)
if (lastNewline > keepChars * 0.8) { // Newline is past the 80% mark
  cutPoint = lastNewline // Cut at the newline
}
Why? Because when an LLM reads a half-cut line of JSON or code, it can hallucinate. Cutting at newlines ensures every line is complete.
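Wrapped into a runnable function, the idea looks like this. The function name and the truncation marker are my own; the snapping rule (only move the cut back to a newline if that newline sits past the 80% mark) follows the snippet above.

```typescript
// Newline-aware truncation: never hand the LLM a half-cut line.
function truncateAtNewline(text: string, keepChars: number): string {
  if (text.length <= keepChars) return text // already within budget

  let cutPoint = keepChars
  const lastNewline = text.lastIndexOf("\n", keepChars)
  // Snap to the newline only if it isn't too far back; otherwise we'd
  // discard more content than the budget requires.
  if (lastNewline > keepChars * 0.8) {
    cutPoint = lastNewline
  }
  return text.slice(0, cutPoint) + "\n[... truncated ...]"
}
```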
What About Multiple Tool Results?
If a message contains multiple text blocks (e.g., Anthropic-format tool_results), the truncation budget is allocated proportionally:
Text block A: 100K chars (50%) -> Gets 50% of the budget
Text block B: 100K chars (50%) -> Gets 50% of the budget
Fair distribution — no single block gets completely sacrificed.
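Proportional allocation is simple to express. A hypothetical sketch, working on block sizes in characters rather than on real message objects:

```typescript
// Split a total character budget across text blocks in proportion to size.
function allocateBudget(blockSizes: number[], totalBudget: number): number[] {
  const total = blockSizes.reduce((a, b) => a + b, 0)
  if (total <= totalBudget) return blockSizes // nothing to trim
  // Each block keeps the same share of the budget as its share of the total.
  return blockSizes.map(size => Math.floor((size / total) * totalBudget))
}
```

A block holding 50% of the characters gets 50% of the budget, so every block shrinks by the same ratio instead of one block being wiped out.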
Defense 3: Conversation Compression
When tool truncation isn’t enough, it’s time for the heavy artillery — summarizing old conversations with a cheap LLM.
The Algorithm
- Keep the system message (never compressed)
- Keep the most recent N turns (default: 4)
- Old messages in between -> sent to a cheap model (Haiku-tier) for summarization
- The summary replaces the original messages, marked with
[Previous conversation compressed]
Before compression:
[system] You are an assistant
[user] Help me look at this bug <- old message
[assistant] Sure, let me check <- old message
[user] What about this file? <- old message
[assistant] There's an issue here... <- old message
[user] What do the logs say? <- kept (recent 4 turns)
[assistant] The logs show... <- kept
[user] How do I fix it? <- kept
[assistant] I'd suggest this fix... <- kept
After compression:
[system] You are an assistant
[user] [Previous conversation compressed]
User was debugging a bug, found an issue in a file...
[user] What do the logs say?
[assistant] The logs show...
[user] How do I fix it?
[assistant] I'd suggest this fix...
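The partition step can be sketched as below. This is my own simplified version of the algorithm described above (names are illustrative, and a "turn" here is a single user or assistant message, matching the before/after example):

```typescript
interface Msg { role: "system" | "user" | "assistant"; content: string }

// Split messages into: system (never compressed), old (to summarize),
// and the most recent N turns (kept verbatim).
function partitionForCompression(messages: Msg[], preserveRecentTurns = 4) {
  const system = messages.filter(m => m.role === "system")
  const rest = messages.filter(m => m.role !== "system")
  const keepFrom = Math.max(0, rest.length - preserveRecentTurns)
  return {
    system,
    toSummarize: rest.slice(0, keepFrom), // sent to the cheap model
    recent: rest.slice(keepFrom),         // kept as-is
  }
}

// After summarization, the old span collapses into one marker message.
function rebuild(system: Msg[], summary: string, recent: Msg[]): Msg[] {
  return [
    ...system,
    { role: "user", content: `[Previous conversation compressed]\n${summary}` },
    ...recent,
  ]
}
```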
Three Critical Details
1. Tool Use / Tool Result Must Be Paired
An LLM’s tool_use message and its corresponding tool_result are a pair. If you compress the tool_result but leave the tool_use, the LLM gets confused: “I called the tool — where’s the result?”
So the compression boundary must never split a tool_use/tool_result pair.
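One way to enforce this (my own sketch, not the library's code) is to walk the boundary backward until the first kept message is no longer an orphaned tool result:

```typescript
// Simplified message kinds for illustration.
interface M { kind: "text" | "tool_use" | "tool_result" }

// `boundary` is the index of the first message that will be kept.
// If that message is a tool_result, its matching tool_use would be
// compressed away, so move the boundary back to keep the pair together.
function safeBoundary(messages: M[], boundary: number): number {
  let b = boundary
  while (b > 0 && messages[b].kind === "tool_result") b--
  return b
}
```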
2. Images Are Discarded
Image content in old messages is dropped during compression — you can’t “summarize” an image into text. The summary preserves only textual context.
3. Timeout Protection
Compression itself requires an LLM call, which can also fail. OpenClaw sets a 5-minute safety timeout to prevent the compression process from becoming the problem.
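A timeout wrapper like this is one way to get that protection; the 5-minute figure comes from the post, while the `Promise.race` wiring is my own sketch (a production version would also abort the underlying request, e.g. via `AbortController`):

```typescript
const COMPRESSION_TIMEOUT_MS = 5 * 60 * 1000 // 5 minutes

// Race the compression call against a timer; whichever settles first wins.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("compression timed out")), ms)
  })
  try {
    return await Promise.race([work, timeout])
  } finally {
    clearTimeout(timer!) // don't leave the timer running after work settles
  }
}
```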
Add These Three Lines of Defense to Your LLM App Today
At this point you might be thinking: the principles make sense, but the edge cases are endless. Tool result truncation has to handle both Anthropic and OpenAI message formats; conversation compression has to respect tool_use/tool_result pairing and strip images; the heuristic parameters for budget estimation need tuning…
I distilled OpenClaw’s 1000+ lines of code into a drop-in library — zero dependencies, not tied to any LLM provider:
npm install @yuyuqueen/llm-context-kit
GitHub: github.com/yuyuqueen/llm-toolkit — Stars welcome
Tool Result Truncation
import { createToolResultTruncator } from '@yuyuqueen/llm-context-kit'

const truncator = createToolResultTruncator()

// Automatically finds oversized tool results and truncates them
const { messages: safeMessages, truncatedCount } =
  truncator.truncate(messages, 200_000) // context window tokens

console.log(`Truncated ${truncatedCount} tool results`)
Conversation Compression
import { createContextCompressor } from '@yuyuqueen/llm-context-kit'

const compressor = createContextCompressor({
  summarize: async ({ messages, systemPrompt }) => {
    // Use a cheap model for summarization
    const response = await anthropic.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 4096,
      system: systemPrompt,
      messages: messages.map(m => ({
        role: m.role as 'user' | 'assistant',
        content: m.content,
      })),
    })
    return response.content[0].text
  },
  preserveRecentTurns: 4, // Keep the last 4 turns
})

const result = await compressor.compress(messages)
if (result.compressed) {
  messages = result.messages
  console.log(result.description)
  // -> "Compressed 12 messages into summary"
}
Context Budget Check
import { createContextBudget } from '@yuyuqueen/llm-context-kit'

const budget = createContextBudget({
  contextWindowTokens: 200_000,
  reserveOutputTokens: 4_096,
})

const status = budget.check(messages)
console.log(status)
// -> {
//   withinBudget: true,
//   estimatedTokens: 45000,
//   availableTokens: 150904,
//   utilizationPercent: 23
// }
Wiring the Three Defenses Together
async function chat(messages) {
  // Defense 1: Check the budget
  let status = budget.check(messages)

  // Defense 2: Budget tight? Truncate tool results first
  if (status.utilizationPercent > 70) {
    messages = truncator.truncate(messages, 200_000).messages
    status = budget.check(messages) // Re-check after truncation
  }

  // Defense 3: Still not enough? Compress old conversations
  if (status.utilizationPercent > 85) {
    const compressed = await compressor.compress(messages)
    if (compressed.compressed) messages = compressed.messages
  }

  return callLLM({ messages })
}
Before and After
| Scenario | Before | After |
|---|---|---|
| Tool returns 50KB JSON | Context overflows immediately | Auto-truncated to safe range, cut at newlines |
| 30-turn conversation | Model loses early instructions, degrades | Old messages compressed to summary, last 4 turns fully preserved |
| Context nearly full | No warning, sudden crash | Detected early, proactive truncation/compression |
| Tool call pairing | Compression splits tool_use/result | Always paired, never split |
| Processing time | No handling (just crashes) | Truncation in milliseconds, compression with 5-min timeout |
Using with resilient-llm
This library pairs seamlessly with @yuyuqueen/resilient-llm from the previous post:
import { createResilientLLM } from '@yuyuqueen/resilient-llm'
import { createContextCompressor } from '@yuyuqueen/llm-context-kit'

const compressor = createContextCompressor({
  summarize: async ({ messages, systemPrompt }) => { /* ... */ },
})

let messages = [/* ... */]

const resilient = createResilientLLM({
  providers: [/* ... */],
  contextCompressor: async () => {
    const result = await compressor.compress(messages)
    if (result.compressed) {
      messages = result.messages // Update the external messages array
    }
    return {
      compressed: result.compressed,
      description: result.description,
    }
  },
})

// Context overflow automatically triggers compression, invisible to the user
const result = await resilient.call(async (ctx) => {
  return {
    response: await anthropic.messages.create({
      model: ctx.model,
      max_tokens: 1024,
      messages,
    }),
  }
})
The two libraries together form a complete production-grade LLM defense:
API call fails
|
+-- Rate limit -> resilient-llm auto-rotates keys
+-- Auth error -> resilient-llm switches provider
+-- Context overflow -> llm-context-kit truncates/compresses
+-- Other errors -> resilient-llm exponential backoff retry
Design Principles
This library follows the same design philosophy as resilient-llm:
- Provider-agnostic — Not tied to any LLM SDK. Compression needs an LLM call? You provide the callback, the library handles orchestration
- Zero dependencies — Pure TypeScript, no runtime dependencies
- Immutable — All operations return new arrays, never mutating your original data
- Dual-format compatible — Supports both Anthropic and OpenAI message formats
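For context, the two formats differ in where a tool result lives on the wire. The field names below match the public Anthropic and OpenAI message schemas; the detection helpers themselves are my own illustrative sketch, not the library's internals:

```typescript
type AnyMessage = Record<string, unknown>

// OpenAI: a dedicated message with role "tool" and a tool_call_id.
function isOpenAIToolResult(m: AnyMessage): boolean {
  return m.role === "tool" && typeof m.tool_call_id === "string"
}

// Anthropic: a user message whose content array contains tool_result blocks.
function isAnthropicToolResult(m: AnyMessage): boolean {
  return (
    m.role === "user" &&
    Array.isArray(m.content) &&
    m.content.some((b: any) => b?.type === "tool_result")
  )
}
```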
Conclusion
Context windows aren’t infinite, but your users’ conversations can be.
200K tokens sounds like a lot, but in Agent scenarios, a few read_file calls plus a few rounds of tool use can eat through most of it. An app without defenses crashes at the moment the user is most engaged. With three lines of defense, the conversation just keeps going.
-> @yuyuqueen/llm-context-kit on npm -> @yuyuqueen/resilient-llm on npm -> GitHub source
This is the second post in the “Distilling Libraries from Open Source Projects” series.
- Part 1: I Read 1 Million Lines of Code and Found the Layer Most LLM Apps Are Missing
- Part 3: Stop Hardcoding Your System Prompts (@yuyuqueen/prompt-assembler)
Follow for updates -> Twitter @YuYuQueen_ · GitHub