LLM Context Window Engineering: Fit More Into 200K Tokens and Get Better Answers

A 200K token context window is not a license to dump everything you have into the prompt. LLMs have positional bias — they pay more attention to content at the beginning and end of their context. They degrade on long-context tasks even when the answer is technically present. And a 200K-token call costs roughly 100x what a 2K call does. Context window engineering — how you structure, compress, and position information — is the difference between good and great LLM outputs.

TL;DR: Put the most important information first and last — LLMs exhibit “lost in the middle” degradation. Compress context aggressively before sending (50% reduction typical). Use hierarchical summarization for long documents. Chunk code by semantic unit not character count. Always measure whether more context actually improves your specific task.

Lost in the middle — the positional bias problem

// Research finding: LLMs perform worse when the relevant information
// is in the middle of a long context vs beginning or end
// Effect is measurable and consistent across models

// Experiment:
// - 50 documents, 1 contains the answer
// - Measure accuracy by answer document position
// Position 1 (beginning): 95% accuracy
// Position 25 (middle):   62% accuracy
// Position 50 (end):      91% accuracy

// Mitigation strategies:

// 1. Put instructions and key context FIRST
const prompt = [
  SYSTEM_INSTRUCTIONS,     // Always first
  CRITICAL_CONTEXT,        // High-relevance material next
  BACKGROUND_DOCUMENTS,    // Supporting material in middle
  THE_ACTUAL_QUESTION,     // Repeat key points and question at END
].join('\n');

// 2. Repeat key facts near the question
// "Given that [key constraint from earlier], answer the following..."
// Repetition costs only a handful of extra tokens; the impact is significant

// 3. Rank retrieved chunks by relevance, not source order
const sortedChunks = [...chunks].sort((a, b) => b.relevanceScore - a.relevanceScore);
// Put highest-relevance chunks first (copy before sorting — .sort() mutates)
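The position experiment described above is easy to reproduce. A minimal sketch of the prompt builder — the `distractors` array, `needleDoc`, and the downstream scoring loop are assumptions you supply:

```javascript
// Insert the answer-bearing document at a given 0-based slot, leaving the
// distractor documents in their original order around it.
function buildNeedleTest(distractors, needleDoc, position) {
  const docs = distractors.slice();
  docs.splice(position, 0, needleDoc);
  return docs.map((d, i) => `[Document ${i + 1}]\n${d}`).join('\n\n');
}

// Sweep positions (start, middle, end) and score each prompt with your own
// evaluator to reproduce the accuracy-by-position curve.
```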

Context compression — reduce tokens without losing information

// Compression techniques ordered by cost/effectiveness:

// 1. Strip whitespace and comments from code (free, near-lossless for JS/TS)
function compressCode(code) {
  // Typical reduction: 25-40% fewer tokens. Caveats: collapsing whitespace
  // flattens code to one line — do not use on indentation-sensitive
  // languages like Python — and the naive comment regexes will also strip
  // '//' sequences inside strings and URLs.
  return code
    .replace(/\/\/.*$/gm, '')          // Remove line comments
    .replace(/\/\*[\s\S]*?\*\//g, '')  // Remove block comments
    .replace(/^\s*$/gm, '')            // Remove blank lines
    .replace(/\s+/g, ' ')              // Collapse whitespace
    .trim();
}

// 2. Extractive summarization — keep key sentences, remove the rest
async function extractiveSummarize(text, targetTokens) {
  const response = await llm.complete(
    `Extract the most important sentences from this text, within ~${targetTokens} tokens.
     Keep exact wording for key facts and numbers.
     Remove: examples, repetition, transition phrases.
     Return ONLY the extracted content, no commentary.\n\n${text}`
  );
  return response; // Typical reduction: 40-60%
}

// 3. Structured extraction — convert prose to structured data
// "The server takes approximately 200ms to respond to requests
//  under normal load conditions"
// → { metric: "response_time", value: 200, unit: "ms", condition: "normal_load" }
// Structured data uses 70% fewer tokens than prose for the same information
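One way to implement the extraction step — a sketch that assumes an `llm.complete`-style client like the pseudocode above, with illustrative schema keys:

```javascript
// Ask the model for JSON and parse it. The client is passed in so this
// composes with whatever `llm` wrapper the earlier examples assume.
async function extractStructured(llmClient, prose) {
  const raw = await llmClient.complete(
    `Convert each factual claim in the text below to a JSON object with keys
     "metric", "value", "unit", "condition". Return ONLY a JSON array.\n\n${prose}`
  );
  return JSON.parse(raw); // throws on malformed output; retry or repair in production
}
```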

// 4. Hierarchical summarization for long documents
async function hierarchicalSummarize(document, targetTokens) {
  const CHUNK_SIZE = 4000;
  const chunks = splitIntoChunks(document, CHUNK_SIZE);

  // First pass: summarize each chunk
  const summaries = await Promise.all(
    chunks.map(c => llm.complete(`Summarize in 200 tokens:\n${c}`))
  );

  // Second pass: if still too long, summarize summaries
  const combined = summaries.join('\n');
  if (estimateTokens(combined) > targetTokens) {
    return llm.complete(`Summarize in ${targetTokens} tokens:\n${combined}`);
  }
  return combined;
}
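The `estimateTokens` and `splitIntoChunks` helpers used above are left undefined in the snippets. A rough sketch — the ~4 characters per token heuristic is an approximation for English text, so use a real tokenizer for anything billing-sensitive:

```javascript
// ~4 chars/token is a common English-text heuristic, not an exact count.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Naive character-based splitter; swap in a boundary-aware version
// (see the chunking section below) for production use.
function splitIntoChunks(text, chunkTokens) {
  const chunkChars = chunkTokens * 4; // invert the same heuristic
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkChars) {
    chunks.push(text.slice(i, i + chunkChars));
  }
  return chunks;
}
```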

Chunking strategy — semantic units beat character counts

// Bad chunking: fixed character count (common but wrong)
function naiveChunk(text, size = 1000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size)); // Cuts sentences, breaks context
  }
  return chunks;
}

// Good chunking: semantic boundaries
function semanticChunk(code) {
  // For source code: chunk by top-level function/class definition
  const functionPattern = /^(async function|function|class|const \w+ = \()/gm;
  const boundaries = [];
  let match;
  while ((match = functionPattern.exec(code)) !== null) {
    boundaries.push(match.index);
  }
  if (boundaries[0] !== 0) boundaries.unshift(0); // keep code before the first match

  return boundaries.map((start, i) => ({
    content: code.slice(start, boundaries[i + 1]), // last slice runs to end of file
    type: 'function',
    startLine: code.slice(0, start).split('\n').length
  }));
}

// For prose: chunk at paragraph boundaries with a one-paragraph overlap
function proseSemantic(text, maxTokens = 512) {
  const paragraphs = text.split(/\n\n+/);
  const chunks = [];
  let current = [];
  let currentTokens = 0;

  for (const para of paragraphs) {
    const paraTokens = estimateTokens(para);
    if (currentTokens + paraTokens > maxTokens && current.length > 0) {
      chunks.push(current.join('\n\n'));
      // Keep last paragraph for overlap
      current = current.slice(-1);
      currentTokens = estimateTokens(current[0] || '');
    }
    current.push(para);
    currentTokens += paraTokens;
  }
  if (current.length) chunks.push(current.join('\n\n'));
  return chunks;
}

Context engineering checklist

  • ✅ Put instructions and key context first, repeat the question at the end
  • ✅ Sort retrieved chunks by relevance score, not source document order
  • ✅ Compress code before including: strip comments and whitespace (25-40% reduction)
  • ✅ Use hierarchical summarization for documents over 20K tokens
  • ✅ Chunk at semantic boundaries (functions, paragraphs) not character counts
  • ✅ Always measure: does more context actually improve your task accuracy?
  • ❌ Never dump an entire codebase in context hoping the model finds what it needs
  • ❌ Never trust that long-context models have solved the “lost in the middle” problem
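The last checklist item can be made concrete with a small A/B harness. A sketch — the `llmClient` and eval-set shape are assumptions, and substring matching stands in for a real grading function:

```javascript
// Run the same eval set through two prompt builders (e.g. with and without
// the extra context) and compare accuracy.
async function compareContextVariants(llmClient, evalSet, buildPromptA, buildPromptB) {
  let correctA = 0, correctB = 0;
  for (const { input, expected } of evalSet) {
    const answerA = await llmClient.complete(buildPromptA(input));
    const answerB = await llmClient.complete(buildPromptB(input));
    if (answerA.includes(expected)) correctA++;
    if (answerB.includes(expected)) correctB++;
  }
  return { accuracyA: correctA / evalSet.length, accuracyB: correctB / evalSet.length };
}
```

If accuracy is flat while cost doubles, the extra context is not paying for itself.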

Context window engineering is the foundation of production AI agent memory systems — the compression and chunking techniques here power the episodic memory retrieval tier. For implementing RAG with these chunking strategies, the RAG vs fine-tuning guide covers when to use each approach. External reference: Lost in the Middle: How Language Models Use Long Contexts.

Level Up: LLM Engineering and Context Management

Python Bootcamp on Udemy — Build real AI agents and automation tools with Python from scratch.

Designing Data-Intensive Applications — The infrastructure foundation every AI engineer needs.

Sponsored links. We may earn a commission at no extra cost to you.

