Your AI Conversations Break After 20 Turns? You Need These Three Lines of Defense
Distilled from OpenClaw's context management system: a three-tier defense strategy against context window overflow in LLM applications
Your user has been chatting with your AI assistant for 20 turns. They’ve built up rich context — debugging history, file references, decisions made. Then they hit: “Conversation too long, please start a new chat.” All that context? Gone. They uninstall your app.
The Problem
In the previous post, I distilled the resilience layer from OpenClaw’s Agent engine. I briefly mentioned one recovery strategy — context overflow recovery — in a single paragraph.
But when I actually read OpenClaw’s context handling code, I found it far more sophisticated than “compress and retry.”
It’s a three-tier defense system. Each tier has its own trigger conditions, execution strategy, and fallback path.
How most LLM apps handle context overflow:
try {
  await llm.chat(messages)
} catch (err) {
  if (err.message.includes('context')) {
    return "Conversation too long, please start a new chat"
  }
}
OpenClaw’s approach — three lines of defense, from light to heavy:
Before every API call
|
v
Defense 1: Context Budget Check (proactive)
-> Estimate current token count, predict overflow
-> Over threshold? Trigger Defense 2 or 3
|
v
Defense 2: Tool Result Truncation (lightweight)
-> Find the largest tool results, truncate by threshold
-> Pure string operations, completes in milliseconds
|
v
Defense 3: Conversation Compression (heavyweight)
-> Summarize old messages with a cheap model
-> Preserve recent turns + system message
|
v
All fail -> Graceful error (not a crash)
The user notices nothing. The conversation continues.
The key difference: most developers only handle overflow after the API throws an error. OpenClaw knows before sending whether overflow will happen.
Defense 1: Context Budget — Knowing You’ll Overflow Before You Do
The other two defenses are about “how to fix it when things break.” This one is about “knowing things will break before they do.”
// Reserve 4096 tokens for model output
const RESERVE_OUTPUT_TOKENS = 4_096
// Heuristic: 4 characters ~ 1 token
const CHARS_PER_TOKEN = 4
Before every request, calculate the total token count of current messages, subtract the output reserve, and you know how much room is left. When utilization exceeds 70%, proactively trigger truncation or compression — instead of waiting for a 100% crash.
This is the “radar” of the entire defense system. Without it, you only learn about overflow after the API fails. With it, you can start handling things with 30% headroom remaining.
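The budget check can be sketched in a few lines. This is my own minimal version built from the constants and the 70% threshold described above, not the library's internals; the `Message` shape is simplified to plain string content.

```typescript
// Simplified message shape; real messages may carry structured content blocks.
interface Message { role: string; content: string }

const CONTEXT_WINDOW_TOKENS = 200_000
const RESERVE_OUTPUT_TOKENS = 4_096
const CHARS_PER_TOKEN = 4

function estimateTokens(messages: Message[]): number {
  // Heuristic from above: roughly 4 characters per token
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0)
  return Math.ceil(chars / CHARS_PER_TOKEN)
}

function checkBudget(messages: Message[]) {
  const estimated = estimateTokens(messages)
  const available = CONTEXT_WINDOW_TOKENS - RESERVE_OUTPUT_TOKENS
  return {
    estimatedTokens: estimated,
    availableTokens: available - estimated,
    utilizationPercent: Math.round((estimated / available) * 100),
    withinBudget: estimated < available * 0.7, // proactive 70% threshold
  }
}
```

The estimate is deliberately cheap: a character count and a division, so you can afford to run it before every single request.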
Defense 2: Tool Result Truncation
The lightest defense — no LLM calls needed, pure string operations, millisecond completion.
Why Tool Results Are the Biggest Overflow Source
In Agent scenarios, a single read_file can return hundreds of thousands of characters. A database query can return hundreds of records. These tool results get stuffed directly into the context window.
OpenClaw’s solution: set thresholds and truncate.
// A single tool result gets at most 30% of the context
const MAX_TOOL_RESULT_CONTEXT_SHARE = 0.3
// Hard cap at 400K characters (~100K tokens)
const HARD_MAX_TOOL_RESULT_CHARS = 400_000
// After truncation, keep at least 2000 chars (so the LLM understands what the content is)
const MIN_KEEP_CHARS = 2_000
Why 30%? Because the context also needs to hold the system prompt, conversation history, and other tool results. A single tool result taking 30% is already a large share.
Truncation Details
Truncation isn’t a naive .slice(0, n). One detail in OpenClaw impressed me:
// Try to cut at a newline, not in the middle of a line
let cutPoint = keepChars
const lastNewline = text.lastIndexOf("\n", keepChars)
if (lastNewline > keepChars * 0.8) { // Newline is past the 80% mark
  cutPoint = lastNewline // Cut at the newline
}
Why? Because when an LLM reads a half-cut line of JSON or code, it can hallucinate. Cutting at newlines ensures every line is complete.
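Wrapped into a runnable function, the idea looks like this. The function name and the truncation marker are my own; the snapping rule (only move the cut back to a newline if that newline sits past the 80% mark) follows the snippet above.

```typescript
// Newline-aware truncation: never hand the LLM a half-cut line.
function truncateAtNewline(text: string, keepChars: number): string {
  if (text.length <= keepChars) return text // already within budget

  let cutPoint = keepChars
  const lastNewline = text.lastIndexOf("\n", keepChars)
  // Snap to the newline only if it isn't too far back; otherwise we'd
  // discard more content than the budget requires.
  if (lastNewline > keepChars * 0.8) {
    cutPoint = lastNewline
  }
  return text.slice(0, cutPoint) + "\n[... truncated ...]"
}
```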
What About Multiple Tool Results?
If a message contains multiple text blocks (e.g., Anthropic-format tool_results), the truncation budget is allocated proportionally:
Text block A: 100K chars (50%) -> Gets 50% of the budget
Text block B: 100K chars (50%) -> Gets 50% of the budget
Fair distribution — no single block gets completely sacrificed.
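Proportional allocation is simple to express. A hypothetical sketch, working on block sizes in characters rather than on real message objects:

```typescript
// Split a total character budget across text blocks in proportion to size.
function allocateBudget(blockSizes: number[], totalBudget: number): number[] {
  const total = blockSizes.reduce((a, b) => a + b, 0)
  if (total <= totalBudget) return blockSizes // nothing to trim
  // Each block keeps the same share of the budget as its share of the total.
  return blockSizes.map(size => Math.floor((size / total) * totalBudget))
}
```

A block holding 50% of the characters gets 50% of the budget, so every block shrinks by the same ratio instead of one block being wiped out.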
Defense 3: Conversation Compression
When tool truncation isn’t enough, it’s time for the heavy artillery — summarizing old conversations with a cheap LLM.
The Algorithm
- Keep the system message (never compressed)
- Keep the most recent N turns (default: 4)
- Old messages in between -> sent to a cheap model (Haiku-tier) for summarization
- The summary replaces the original messages, marked with
[Previous conversation compressed]
Before compression:
[system] You are an assistant
[user] Help me look at this bug <- old message
[assistant] Sure, let me check <- old message
[user] What about this file? <- old message
[assistant] There's an issue here... <- old message
[user] What do the logs say? <- kept (recent 4 turns)
[assistant] The logs show... <- kept
[user] How do I fix it? <- kept
[assistant] I'd suggest this fix... <- kept
After compression:
[system] You are an assistant
[user] [Previous conversation compressed]
User was debugging a bug, found an issue in a file...
[user] What do the logs say?
[assistant] The logs show...
[user] How do I fix it?
[assistant] I'd suggest this fix...
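The partition step can be sketched as below. This is my own simplified version of the algorithm described above (names are illustrative, and a "turn" here is a single user or assistant message, matching the before/after example):

```typescript
interface Msg { role: "system" | "user" | "assistant"; content: string }

// Split messages into: system (never compressed), old (to summarize),
// and the most recent N turns (kept verbatim).
function partitionForCompression(messages: Msg[], preserveRecentTurns = 4) {
  const system = messages.filter(m => m.role === "system")
  const rest = messages.filter(m => m.role !== "system")
  const keepFrom = Math.max(0, rest.length - preserveRecentTurns)
  return {
    system,
    toSummarize: rest.slice(0, keepFrom), // sent to the cheap model
    recent: rest.slice(keepFrom),         // kept as-is
  }
}

// After summarization, the old span collapses into one marker message.
function rebuild(system: Msg[], summary: string, recent: Msg[]): Msg[] {
  return [
    ...system,
    { role: "user", content: `[Previous conversation compressed]\n${summary}` },
    ...recent,
  ]
}
```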
Three Critical Details
1. Tool Use / Tool Result Must Be Paired
An LLM’s tool_use message and its corresponding tool_result are a pair. If you compress the tool_result but leave the tool_use, the LLM gets confused: “I called the tool — where’s the result?”
So the compression boundary must never split a tool_use/tool_result pair.
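One way to enforce this (my own sketch, not the library's code) is to walk the boundary backward until the first kept message is no longer an orphaned tool result:

```typescript
// Simplified message kinds for illustration.
interface M { kind: "text" | "tool_use" | "tool_result" }

// `boundary` is the index of the first message that will be kept.
// If that message is a tool_result, its matching tool_use would be
// compressed away, so move the boundary back to keep the pair together.
function safeBoundary(messages: M[], boundary: number): number {
  let b = boundary
  while (b > 0 && messages[b].kind === "tool_result") b--
  return b
}
```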
2. Images Are Discarded
Image content in old messages is dropped during compression — you can’t “summarize” an image into text. The summary preserves only textual context.
3. Timeout Protection
Compression itself requires an LLM call, which can also fail. OpenClaw sets a 5-minute safety timeout to prevent the compression process from becoming the problem.
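A timeout wrapper like this is one way to get that protection; the 5-minute figure comes from the post, while the `Promise.race` wiring is my own sketch (a production version would also abort the underlying request, e.g. via `AbortController`):

```typescript
const COMPRESSION_TIMEOUT_MS = 5 * 60 * 1000 // 5 minutes

// Race the compression call against a timer; whichever settles first wins.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("compression timed out")), ms)
  })
  try {
    return await Promise.race([work, timeout])
  } finally {
    clearTimeout(timer!) // don't leave the timer running after work settles
  }
}
```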
Add These Three Lines of Defense to Your LLM App Today
At this point you might be thinking: the principles make sense, but the edge cases are endless. Tool result truncation has to handle both Anthropic and OpenAI message formats; conversation compression has to respect tool_use/tool_result pairing and strip images; the heuristic parameters for budget estimation need tuning…
I distilled OpenClaw’s 1000+ lines of code into a drop-in library — zero dependencies, not tied to any LLM provider:
npm install @yuyuqueen/llm-context-kit
GitHub: github.com/yuyuqueen/llm-toolkit — Stars welcome
Tool Result Truncation
import { createToolResultTruncator } from '@yuyuqueen/llm-context-kit'

const truncator = createToolResultTruncator()

// Automatically finds oversized tool results and truncates them
const { messages: safeMessages, truncatedCount } =
  truncator.truncate(messages, 200_000) // context window tokens

console.log(`Truncated ${truncatedCount} tool results`)
Conversation Compression
import { createContextCompressor } from '@yuyuqueen/llm-context-kit'

const compressor = createContextCompressor({
  summarize: async ({ messages, systemPrompt }) => {
    // Use a cheap model for summarization
    const response = await anthropic.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 4096,
      system: systemPrompt,
      messages: messages.map(m => ({
        role: m.role as 'user' | 'assistant',
        content: m.content,
      })),
    })
    return response.content[0].text
  },
  preserveRecentTurns: 4, // Keep the last 4 turns
})

const result = await compressor.compress(messages)
if (result.compressed) {
  messages = result.messages
  console.log(result.description)
  // -> "Compressed 12 messages into summary"
}
Context Budget Check
import { createContextBudget } from '@yuyuqueen/llm-context-kit'

const budget = createContextBudget({
  contextWindowTokens: 200_000,
  reserveOutputTokens: 4_096,
})

const status = budget.check(messages)
console.log(status)
// -> {
//   withinBudget: true,
//   estimatedTokens: 45000,
//   availableTokens: 150904,
//   utilizationPercent: 23
// }
Wiring the Three Defenses Together
async function chat(messages) {
  // Defense 1: Check the budget
  let status = budget.check(messages)

  // Defense 2: Budget tight? Truncate tool results first
  if (status.utilizationPercent > 70) {
    messages = truncator.truncate(messages, 200_000).messages
    status = budget.check(messages) // Re-check after truncation
  }

  // Defense 3: Still not enough? Compress old conversations
  if (status.utilizationPercent > 85) {
    const compressed = await compressor.compress(messages)
    if (compressed.compressed) messages = compressed.messages
  }

  return callLLM({ messages })
}
Before and After
| Scenario | Before | After |
|---|---|---|
| Tool returns 50KB JSON | Context overflows immediately | Auto-truncated to safe range, cut at newlines |
| 30-turn conversation | Model loses early instructions, degrades | Old messages compressed to summary, last 4 turns fully preserved |
| Context nearly full | No warning, sudden crash | Detected early, proactive truncation/compression |
| Tool call pairing | Compression splits tool_use/result | Always paired, never split |
| Processing time | No handling (just crashes) | Truncation in milliseconds, compression with 5-min timeout |
Using with resilient-llm
This library pairs seamlessly with @yuyuqueen/resilient-llm from the previous post:
import { createResilientLLM } from '@yuyuqueen/resilient-llm'
import { createContextCompressor } from '@yuyuqueen/llm-context-kit'

const compressor = createContextCompressor({
  summarize: async ({ messages, systemPrompt }) => { /* ... */ },
})

let messages = [/* ... */]

const resilient = createResilientLLM({
  providers: [/* ... */],
  contextCompressor: async () => {
    const result = await compressor.compress(messages)
    if (result.compressed) {
      messages = result.messages // Update the external messages array
    }
    return {
      compressed: result.compressed,
      description: result.description,
    }
  },
})

// Context overflow automatically triggers compression, invisible to the user
const result = await resilient.call(async (ctx) => {
  return {
    response: await anthropic.messages.create({
      model: ctx.model,
      max_tokens: 1024,
      messages,
    }),
  }
})
The two libraries together form a complete production-grade LLM defense:
API call fails
|
+-- Rate limit -> resilient-llm auto-rotates keys
+-- Auth error -> resilient-llm switches provider
+-- Context overflow -> llm-context-kit truncates/compresses
+-- Other errors -> resilient-llm exponential backoff retry
Design Principles
This library follows the same design philosophy as resilient-llm:
- Provider-agnostic — Not tied to any LLM SDK. Compression needs an LLM call? You provide the callback, the library handles orchestration
- Zero dependencies — Pure TypeScript, no runtime dependencies
- Immutable — All operations return new arrays, never mutating your original data
- Dual-format compatible — Supports both Anthropic and OpenAI message formats
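For context, the two formats differ in where a tool result lives on the wire. The field names below match the public Anthropic and OpenAI message schemas; the detection helpers themselves are my own illustrative sketch, not the library's internals:

```typescript
type AnyMessage = Record<string, unknown>

// OpenAI: a dedicated message with role "tool" and a tool_call_id.
function isOpenAIToolResult(m: AnyMessage): boolean {
  return m.role === "tool" && typeof m.tool_call_id === "string"
}

// Anthropic: a user message whose content array contains tool_result blocks.
function isAnthropicToolResult(m: AnyMessage): boolean {
  return (
    m.role === "user" &&
    Array.isArray(m.content) &&
    m.content.some((b: any) => b?.type === "tool_result")
  )
}
```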
Conclusion
Context windows aren’t infinite, but your users’ conversations can be.
200K tokens sounds like a lot, but in Agent scenarios, a few read_file calls plus a few rounds of tool use can eat through most of it. An app without defenses crashes at the moment the user is most engaged. With three lines of defense, the conversation just keeps going.
-> @yuyuqueen/llm-context-kit on npm -> @yuyuqueen/resilient-llm on npm -> GitHub source
This is the second post in the “Distilling Libraries from Open Source Projects” series.
- Part 1: I Read 1 Million Lines of Code and Found the Layer Most LLM Apps Are Missing
- Part 3: Stop Hardcoding Your System Prompts (@yuyuqueen/prompt-assembler)
Follow for updates -> Twitter @YuYuQueen_ · GitHub