A 200K token context window is not a license to dump everything you have into the prompt. LLMs have positional bias — they pay more attention to content at the beginning and end of their context. They degrade on long-context tasks even when the answer is technically present. And with per-token pricing, a 200K-token call costs roughly 100x what a 2K-token call does. Context window engineering — how you structure, compress, and position information — is the difference between good and great LLM outputs.
⚡ TL;DR: Put the most important information first and last — LLMs exhibit “lost in the middle” degradation. Compress context aggressively before sending (50% reduction typical). Use hierarchical summarization for long documents. Chunk code by semantic unit not character count. Always measure whether more context actually improves your specific task.
Lost in the middle — the positional bias problem
// Research finding: LLMs perform worse when the relevant information
// is in the middle of a long context vs beginning or end
// Effect is measurable and consistent across models
// Experiment:
// - 50 documents, 1 contains the answer
// - Measure accuracy by answer document position
// Position 1 (beginning): 95% accuracy
// Position 25 (middle): 62% accuracy
// Position 50 (end): 91% accuracy
// Mitigation strategies:
// 1. Put instructions and key context FIRST
const prompt = [
SYSTEM_INSTRUCTIONS, // Always first
CRITICAL_CONTEXT, // High-relevance material next
BACKGROUND_DOCUMENTS, // Supporting material in middle
THE_ACTUAL_QUESTION, // Repeat key points and question at END
].join('\n');
// 2. Repeat key facts near the question
// "Given that [key constraint from earlier], answer the following..."
// Repetition costs only a handful of tokens; the accuracy impact is significant
// 3. Rank retrieved chunks by relevance, not source order
const sortedChunks = [...chunks].sort((a, b) => b.relevanceScore - a.relevanceScore); // copy before sorting — .sort() mutates in place
// Put highest-relevance chunks first
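The three mitigations compose naturally into a single prompt builder. A minimal sketch — the `{ text, relevanceScore }` chunk shape and the function name are assumptions for illustration, not a real library API:

```javascript
// Combine mitigations 1-3: instructions first, high-relevance chunks early,
// key facts repeated next to the question, question last.
function assemblePrompt(systemInstructions, question, keyFacts, chunks) {
  const ranked = [...chunks].sort((a, b) => b.relevanceScore - a.relevanceScore);
  return [
    systemInstructions,                   // instructions first
    ...ranked.map((c) => c.text),         // highest-relevance material early
    `Given that ${keyFacts.join('; ')},`, // repeat key facts near the question
    question,                             // question last
  ].join('\n');
}
```

Background material still lands in the middle, but it is the content you can most afford to have the model skim.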
Context compression — reduce tokens without losing information
// Compression techniques ordered by cost/effectiveness:
// 1. Strip comments and blank lines from code (cheap, but not strictly
//    lossless: comments can carry real information, so keep the ones that do)
function compressCode(code) {
  return code
    .replace(/\/\/.*$/gm, '')         // Remove line comments (naive: also hits "//" inside strings)
    .replace(/\/\*[\s\S]*?\*\//g, '') // Remove block comments
    .replace(/^[ \t]*\r?\n/gm, '')    // Remove blank lines
    .replace(/[^\S\n]+/g, ' ')        // Collapse horizontal whitespace; keep newlines,
    .trim();                          // since ASI-reliant JS breaks without them
  // Typical reduction: 25-40% fewer tokens
}
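To sanity-check reduction claims like "25-40%" without a real tokenizer, a crude characters-divided-by-four heuristic is usually close enough. `roughTokens` here is a hypothetical helper, not a library function:

```javascript
// Rough token estimate: ~4 characters per token for English text and code.
// Fine for budgeting decisions; use the model's actual tokenizer when counts
// must be exact (billing, hard context limits).
const roughTokens = (s) => Math.ceil(s.length / 4);

const original = '  const total = items.reduce((sum, i) => sum + i.price, 0); // sum prices';
const compressed = original
  .replace(/\/\/.*$/gm, '') // strip the trailing comment
  .replace(/\s+/g, ' ')     // collapse whitespace runs (fine for a one-liner)
  .trim();
console.log(roughTokens(original), roughTokens(compressed));
```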
// 2. Extractive summarization — keep key sentences, remove the rest
async function extractiveSummarize(text, targetTokens) {
const response = await llm.complete(
`Extract the most important content from this text, within about ${targetTokens} tokens.
Keep exact wording for key facts and numbers.
Remove: examples, repetition, transition phrases.
Return ONLY the extracted content, no commentary.\n\n${text}`
);
return response; // Typical reduction: 40-60%
}
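When an extra LLM call is too slow or too expensive, a deterministic baseline can do surprisingly well. This is a classic frequency-based (Luhn-style) sentence scorer, not the post's LLM approach — sentences are ranked by how many high-frequency words they contain, kept until a rough token budget is spent, and re-emitted in original order:

```javascript
// Deterministic extractive baseline: no LLM call, no latency, no cost.
function extractiveBaseline(text, targetTokens) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  // Score each word by its frequency across the whole text
  const freq = {};
  for (const w of text.toLowerCase().match(/\b\w+\b/g) || []) {
    freq[w] = (freq[w] || 0) + 1;
  }
  const score = (s) =>
    (s.toLowerCase().match(/\b\w+\b/g) || []).reduce((sum, w) => sum + freq[w], 0);
  const ranked = sentences
    .map((s, i) => ({ s: s.trim(), i, score: score(s) }))
    .sort((a, b) => b.score - a.score);
  // Greedily keep top-scoring sentences within the budget
  const kept = [];
  let used = 0;
  for (const { s, i } of ranked) {
    const t = Math.ceil(s.length / 4); // rough token estimate
    if (used + t > targetTokens) continue;
    kept.push({ s, i });
    used += t;
  }
  // Restore original order so the summary still reads coherently
  return kept.sort((a, b) => a.i - b.i).map((k) => k.s).join(' ');
}
```

Quality is well below an LLM summary, but it is free, reproducible, and a useful first pass before spending tokens on the real thing.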
// 3. Structured extraction — convert prose to structured data
// "The server takes approximately 200ms to respond to requests
// under normal load conditions"
// → { metric: "response_time", value: 200, unit: "ms", condition: "normal_load" }
// Structured data uses 70% fewer tokens than prose for the same information
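How much you actually save depends on the tokenizer and the schema — JSON punctuation costs tokens, so terse key=value forms often beat verbose JSON. A quick illustration with the crude chars/4 heuristic (the field names and `roughTokens` helper are made up for this sketch):

```javascript
const roughTokens = (s) => Math.ceil(s.length / 4); // crude heuristic

const prose =
  'The server takes approximately 200ms to respond to requests under normal load conditions.';
const asJson = JSON.stringify({
  metric: 'response_time', value: 200, unit: 'ms', condition: 'normal_load',
});
const asCompact = 'response_time=200ms@normal_load'; // terse key=value form
console.log(roughTokens(prose), roughTokens(asJson), roughTokens(asCompact));
```

The savings compound when you batch many facts under one schema, since keys and structure are stated once instead of per sentence.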
// 4. Hierarchical summarization for long documents
async function hierarchicalSummarize(document, targetTokens) {
const CHUNK_SIZE = 4000;
const chunks = splitIntoChunks(document, CHUNK_SIZE);
// First pass: summarize each chunk
const summaries = await Promise.all(
chunks.map(c => llm.complete(`Summarize in 200 tokens:\n${c}`))
);
// Second pass: if still too long, summarize summaries
const combined = summaries.join('\n');
if (estimateTokens(combined) > targetTokens) {
return llm.complete(`Summarize in ${targetTokens} tokens:\n${combined}`);
}
return combined;
}
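`hierarchicalSummarize` calls `estimateTokens` and `splitIntoChunks` without defining them. A minimal sketch of both, assuming the rough 4-chars-per-token heuristic — these are stand-ins, not the post's actual helpers:

```javascript
// Stand-in helpers for hierarchicalSummarize (assumed implementations).
function estimateTokens(text) {
  return Math.ceil(text.length / 4); // ~4 chars per token; swap for a real tokenizer
}

function splitIntoChunks(text, maxTokens) {
  const maxChars = maxTokens * 4; // invert the same heuristic
  const chunks = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}
```

Note this splitter cuts on character offsets; for better results, feed it through the semantic chunking approach in the next section instead.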
Chunking strategy — semantic units beat character counts
// Bad chunking: fixed character count (common but wrong)
function naiveChunk(text, size = 1000) {
const chunks = [];
for (let i = 0; i < text.length; i += size) {
chunks.push(text.slice(i, i + size)); // Cuts sentences, breaks context
}
return chunks;
}
// Good chunking: semantic boundaries
function semanticChunk(code) {
  // For source code: chunk by top-level function/class definition.
  // Caveats: the pattern misses arrow functions (`const f = x =>`), and any
  // preamble before the first match (imports, constants) is dropped —
  // extend the pattern or prepend a header chunk as needed.
  const functionPattern = /^(async function|function|class|const \w+ = \()/gm;
  const boundaries = [];
  let match;
  while ((match = functionPattern.exec(code)) !== null) {
    boundaries.push(match.index);
  }
  return boundaries.map((start, i) => ({
    content: code.slice(start, boundaries[i + 1]), // undefined end = slice to EOF
    type: 'function',
    startLine: code.slice(0, start).split('\n').length
  }));
}
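A quick standalone check of the boundary logic on a two-definition sample (`semanticChunk` is re-declared here so the snippet runs on its own — same logic as above):

```javascript
function semanticChunk(code) {
  const functionPattern = /^(async function|function|class|const \w+ = \()/gm;
  const boundaries = [];
  let match;
  while ((match = functionPattern.exec(code)) !== null) {
    boundaries.push(match.index);
  }
  return boundaries.map((start, i) => ({
    content: code.slice(start, boundaries[i + 1]), // undefined end = slice to EOF
    type: 'function',
    startLine: code.slice(0, start).split('\n').length,
  }));
}

const sample = 'function a() {\n  return 1;\n}\n\nclass B {\n  method() {}\n}\n';
const chunks = semanticChunk(sample);
console.log(chunks.map((c) => c.startLine)); // each chunk knows where it came from
```

Keeping `startLine` pays off later: when a retrieved chunk makes it into a prompt, you can cite the exact source location back to the user.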
// For prose: chunk at paragraph boundaries with one paragraph of overlap
function proseSemantic(text, maxTokens = 512) {
const paragraphs = text.split(/\n\n+/);
const chunks = [];
let current = [];
let currentTokens = 0;
for (const para of paragraphs) {
const paraTokens = estimateTokens(para);
if (currentTokens + paraTokens > maxTokens && current.length > 0) {
chunks.push(current.join('\n\n'));
// Carry the last paragraph into the next chunk for overlap
current = current.slice(-1);
currentTokens = estimateTokens(current[0] || '');
}
current.push(para);
currentTokens += paraTokens;
}
if (current.length) chunks.push(current.join('\n\n'));
return chunks;
}
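Here is the paragraph chunker exercised end to end on a tiny document, with `proseSemantic` re-declared (and `estimateTokens` stubbed with the chars/4 heuristic) so the snippet runs standalone:

```javascript
function proseSemantic(text, maxTokens = 512) {
  const estimateTokens = (s) => Math.ceil(s.length / 4); // rough heuristic
  const paragraphs = text.split(/\n\n+/);
  const chunks = [];
  let current = [];
  let currentTokens = 0;
  for (const para of paragraphs) {
    const paraTokens = estimateTokens(para);
    if (currentTokens + paraTokens > maxTokens && current.length > 0) {
      chunks.push(current.join('\n\n'));
      current = current.slice(-1); // last paragraph overlaps into next chunk
      currentTokens = estimateTokens(current[0] || '');
    }
    current.push(para);
    currentTokens += paraTokens;
  }
  if (current.length) chunks.push(current.join('\n\n'));
  return chunks;
}

const doc = [
  'First paragraph here.',
  'Second paragraph here.',
  'Third paragraph here.',
].join('\n\n');
const chunks = proseSemantic(doc, 12);
```

With a 12-token budget, the middle paragraph appears in both chunks — that overlap is what keeps a retrieval hit near a chunk boundary from losing its surrounding context.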
Context engineering checklist
- ✅ Put instructions and key context first, repeat the question at the end
- ✅ Sort retrieved chunks by relevance score, not source document order
- ✅ Compress code before including: strip comments and whitespace (25-40% reduction)
- ✅ Use hierarchical summarization for documents over 20K tokens
- ✅ Chunk at semantic boundaries (functions, paragraphs) not character counts
- ✅ Always measure: does more context actually improve your task accuracy?
- ❌ Never dump an entire codebase in context hoping the model finds what it needs
- ❌ Never trust that long-context models have solved the “lost in the middle” problem
Context window engineering is the foundation of production AI agent memory systems — the compression and chunking techniques here power the episodic memory retrieval tier. For implementing RAG with these chunking strategies, the RAG vs fine-tuning guide covers when to use each approach. External reference: Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023).