LLM Context Window Engineering: Fit More Into 200K Tokens and Get Better Answers

A 200K token context window is not a license to dump everything you have into the prompt. LLMs have positional bias — they pay more attention to content at the beginning and end of their context. They degrade on long-context tasks even when the answer is technically present. And a 200K-token call costs roughly 100x what a 2K call does. Context window engineering — how you structure, compress, and position information — is the difference between good and great LLM outputs.

TL;DR: Put the most important information first and last — LLMs exhibit “lost in the middle” degradation. Compress context aggressively before sending (50% reduction typical). Use hierarchical summarization for long documents. Chunk code by semantic unit not character count. Always measure whether more context actually improves your specific task.

Lost in the middle — the positional bias problem

// Research finding: LLMs perform worse when the relevant information
// is in the middle of a long context vs beginning or end
// Effect is measurable and consistent across models

// Experiment:
// - 50 documents, 1 contains the answer
// - Measure accuracy by answer document position
// Position 1 (beginning): 95% accuracy
// Position 25 (middle):   62% accuracy
// Position 50 (end):      91% accuracy

// Mitigation strategies:

// 1. Put instructions and key context FIRST
const prompt = [
  SYSTEM_INSTRUCTIONS,     // Always first
  CRITICAL_CONTEXT,        // High-relevance material next
  BACKGROUND_DOCUMENTS,    // Supporting material in middle
  THE_ACTUAL_QUESTION,     // Repeat key points and question at END
].join('\n');

// 2. Repeat key facts near the question
// "Given that [key constraint from earlier], answer the following..."
// Repetition costs only a handful of extra tokens; the impact is significant

// 3. Rank retrieved chunks by relevance, not source order
const sortedChunks = [...chunks].sort((a, b) => b.relevanceScore - a.relevanceScore);
// Put highest-relevance chunks first (copy before sorting — .sort() mutates)
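The position experiment described above is easy to reproduce. A minimal sketch of the prompt builder — the `distractors` array, `needleDoc`, and the downstream scoring loop are assumptions you supply:

```javascript
// Insert the answer-bearing document at a given 0-based slot, leaving the
// distractor documents in their original order around it.
function buildNeedleTest(distractors, needleDoc, position) {
  const docs = distractors.slice();
  docs.splice(position, 0, needleDoc);
  return docs.map((d, i) => `[Document ${i + 1}]\n${d}`).join('\n\n');
}

// Sweep positions (start, middle, end) and score each prompt with your own
// evaluator to reproduce the accuracy-by-position curve.
```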

Context compression — reduce tokens without losing information

// Compression techniques ordered by cost/effectiveness:

// 1. Strip whitespace and comments from code (free, near-lossless for JS/TS)
function compressCode(code) {
  // Typical reduction: 25-40% fewer tokens. Caveats: collapsing whitespace
  // flattens code to one line — do not use on indentation-sensitive
  // languages like Python — and the naive comment regexes will also strip
  // '//' sequences inside strings and URLs.
  return code
    .replace(/\/\/.*$/gm, '')          // Remove line comments
    .replace(/\/\*[\s\S]*?\*\//g, '')  // Remove block comments
    .replace(/^\s*$/gm, '')            // Remove blank lines
    .replace(/\s+/g, ' ')              // Collapse whitespace
    .trim();
}

// 2. Extractive summarization — keep key sentences, remove the rest
async function extractiveSummarize(text, targetTokens) {
  const response = await llm.complete(
    `Extract the most important sentences from this text, within ~${targetTokens} tokens.
     Keep exact wording for key facts and numbers.
     Remove: examples, repetition, transition phrases.
     Return ONLY the extracted content, no commentary.\n\n${text}`
  );
  return response; // Typical reduction: 40-60%
}

// 3. Structured extraction — convert prose to structured data
// "The server takes approximately 200ms to respond to requests
//  under normal load conditions"
// → { metric: "response_time", value: 200, unit: "ms", condition: "normal_load" }
// Structured data uses 70% fewer tokens than prose for the same information
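One way to implement the extraction step — a sketch that assumes an `llm.complete`-style client like the pseudocode above, with illustrative schema keys:

```javascript
// Ask the model for JSON and parse it. The client is passed in so this
// composes with whatever `llm` wrapper the earlier examples assume.
async function extractStructured(llmClient, prose) {
  const raw = await llmClient.complete(
    `Convert each factual claim in the text below to a JSON object with keys
     "metric", "value", "unit", "condition". Return ONLY a JSON array.\n\n${prose}`
  );
  return JSON.parse(raw); // throws on malformed output; retry or repair in production
}
```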

// 4. Hierarchical summarization for long documents
async function hierarchicalSummarize(document, targetTokens) {
  const CHUNK_SIZE = 4000;
  const chunks = splitIntoChunks(document, CHUNK_SIZE);

  // First pass: summarize each chunk
  const summaries = await Promise.all(
    chunks.map(c => llm.complete(`Summarize in 200 tokens:\n${c}`))
  );

  // Second pass: if still too long, summarize summaries
  const combined = summaries.join('\n');
  if (estimateTokens(combined) > targetTokens) {
    return llm.complete(`Summarize in ${targetTokens} tokens:\n${combined}`);
  }
  return combined;
}
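The `estimateTokens` and `splitIntoChunks` helpers used above are left undefined in the snippets. A rough sketch — the ~4 characters per token heuristic is an approximation for English text, so use a real tokenizer for anything billing-sensitive:

```javascript
// ~4 chars/token is a common English-text heuristic, not an exact count.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Naive character-based splitter; swap in a boundary-aware version
// (see the chunking section below) for production use.
function splitIntoChunks(text, chunkTokens) {
  const chunkChars = chunkTokens * 4; // invert the same heuristic
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkChars) {
    chunks.push(text.slice(i, i + chunkChars));
  }
  return chunks;
}
```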

Chunking strategy — semantic units beat character counts

// Bad chunking: fixed character count (common but wrong)
function naiveChunk(text, size = 1000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size)); // Cuts sentences, breaks context
  }
  return chunks;
}

// Good chunking: semantic boundaries
function semanticChunk(code) {
  // For source code: chunk by top-level function/class definition
  const functionPattern = /^(async function|function|class|const \w+ = \()/gm;
  const boundaries = [];
  let match;
  while ((match = functionPattern.exec(code)) !== null) {
    boundaries.push(match.index);
  }
  if (boundaries[0] !== 0) boundaries.unshift(0); // keep code before the first match

  return boundaries.map((start, i) => ({
    content: code.slice(start, boundaries[i + 1]), // last slice runs to end of file
    type: 'function',
    startLine: code.slice(0, start).split('\n').length
  }));
}

// For prose: chunk at paragraph boundaries with a one-paragraph overlap
function proseSemantic(text, maxTokens = 512) {
  const paragraphs = text.split(/\n\n+/);
  const chunks = [];
  let current = [];
  let currentTokens = 0;

  for (const para of paragraphs) {
    const paraTokens = estimateTokens(para);
    if (currentTokens + paraTokens > maxTokens && current.length > 0) {
      chunks.push(current.join('\n\n'));
      // Keep last paragraph for overlap
      current = current.slice(-1);
      currentTokens = estimateTokens(current[0] || '');
    }
    current.push(para);
    currentTokens += paraTokens;
  }
  if (current.length) chunks.push(current.join('\n\n'));
  return chunks;
}

Context engineering checklist

  • ✅ Put instructions and key context first, repeat the question at the end
  • ✅ Sort retrieved chunks by relevance score, not source document order
  • ✅ Compress code before including: strip comments and whitespace (25-40% reduction)
  • ✅ Use hierarchical summarization for documents over 20K tokens
  • ✅ Chunk at semantic boundaries (functions, paragraphs) not character counts
  • ✅ Always measure: does more context actually improve your task accuracy?
  • ❌ Never dump an entire codebase in context hoping the model finds what it needs
  • ❌ Never trust that long-context models have solved the “lost in the middle” problem
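The last checklist item can be made concrete with a small A/B harness. A sketch — the `llmClient` and eval-set shape are assumptions, and substring matching stands in for a real grading function:

```javascript
// Run the same eval set through two prompt builders (e.g. with and without
// the extra context) and compare accuracy.
async function compareContextVariants(llmClient, evalSet, buildPromptA, buildPromptB) {
  let correctA = 0, correctB = 0;
  for (const { input, expected } of evalSet) {
    const answerA = await llmClient.complete(buildPromptA(input));
    const answerB = await llmClient.complete(buildPromptB(input));
    if (answerA.includes(expected)) correctA++;
    if (answerB.includes(expected)) correctB++;
  }
  return { accuracyA: correctA / evalSet.length, accuracyB: correctB / evalSet.length };
}
```

If accuracy is flat while cost doubles, the extra context is not paying for itself.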

Context window engineering is the foundation of production AI agent memory systems — the compression and chunking techniques here power the episodic memory retrieval tier. For implementing RAG with these chunking strategies, the RAG vs fine-tuning guide covers when to use each approach. External reference: Lost in the Middle: How Language Models Use Long Contexts.

Level Up: LLM Engineering and Context Management

Python Bootcamp on Udemy — Build real AI agents and automation tools with Python from scratch.

Designing Data-Intensive Applications — The infrastructure foundation every AI engineer needs.

Sponsored links. We may earn a commission at no extra cost to you.

