Building Production AI Agents: Tool Calling, Memory, and Orchestration That Actually Work

Building an AI agent that works in a demo is easy. Building one that works reliably in production is a completely different engineering problem. Tool calling breaks silently. Memory grows unbounded. Reasoning loops get stuck. The patterns here separate hobbyist agents from systems running at scale.

TL;DR: Production agents need structured tool schemas, bounded memory with summarization, deterministic orchestration graphs (not recursive self-calls), and cost+latency budgets per task. Use LangGraph or raw API tool_use for control.

Tool calling done right — schema design matters

// Raw Anthropic API tool calling
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();

const tools = [{
  name: 'search_codebase',
  description: 'Search codebase for files matching a pattern. Returns file paths and snippets.',
  input_schema: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query — filename, function name, or code snippet' },
      file_type: { type: 'string', enum: ['ts', 'js', 'py', 'go', 'all'] },
      max_results: { type: 'number', description: 'Max results. Default: 10. Max: 50.' }
    },
    required: ['query', 'file_type']
  }
},
{
  name: 'run_tests',
  description: 'Run test suite for a file and return results. Only call after code changes.',
  input_schema: {
    type: 'object',
    properties: {
      test_path: { type: 'string', description: 'Path to test file or directory' },
      timeout_seconds: { type: 'number', description: 'Max wait time. Default: 30.' }
    },
    required: ['test_path']
  }
}];
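Models occasionally emit arguments that drift from the schema, so validate before executing. A minimal sketch (this helper is not part of the Anthropic SDK — use Ajv or Zod for real JSON Schema validation):

```typescript
// Minimal validator for the schemas above — checks required fields,
// primitive types, and enums. Not a full JSON Schema implementation.
type JsonSchema = {
  type: string;
  properties?: Record<string, { type: string; enum?: string[] }>;
  required?: string[];
};

function validateToolInput(schema: JsonSchema, input: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const key of schema.required ?? []) {
    if (!(key in input)) errors.push(`missing required field: ${key}`);
  }
  for (const [key, value] of Object.entries(input)) {
    const prop = schema.properties?.[key];
    if (!prop) { errors.push(`unexpected field: ${key}`); continue; }
    if (typeof value !== prop.type) errors.push(`${key}: expected ${prop.type}`);
    if (prop.enum && !prop.enum.includes(value as string)) errors.push(`${key}: not in enum`);
  }
  return errors; // empty array means the input is valid
}
```

Returning errors as plain strings lets the agent loop feed them back to the model as a tool_result so it can self-correct instead of crashing.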

// Agent loop with proper tool execution
async function runAgent(task, maxSteps = 10) {
  const messages = [{ role: 'user', content: task }];
  let steps = 0;

  while (steps < maxSteps) {
    const response = await client.messages.create({
      model: 'claude-opus-4-5',
      max_tokens: 4096,
      tools,
      messages,
    });

    messages.push({ role: 'assistant', content: response.content });
    // Stop on anything that isn't a tool call (end_turn, max_tokens, stop_sequence)
    if (response.stop_reason !== 'tool_use') break;

    const results = [];
    for (const block of response.content) {
      if (block.type === 'tool_use') {
        const result = await executeTool(block.name, block.input);
        results.push({ type: 'tool_result', tool_use_id: block.id, content: JSON.stringify(result) });
      }
    }
    messages.push({ role: 'user', content: results });
    steps++;
  }

  if (steps >= maxSteps) console.warn('Agent reached max steps — possible loop');
  return messages;
}
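The loop above delegates to an executeTool helper. A hedged sketch of that dispatch, with stub handlers standing in for real implementations; the key production detail is that failures come back as tool results the model can read, not as thrown exceptions that kill the loop:

```typescript
// Dispatch a tool call by name; failures become results the model can see.
type ToolHandler = (input: Record<string, unknown>) => Promise<unknown>;

const toolHandlers: Record<string, ToolHandler> = {
  // Illustrative stubs — wire these to real search and test runners.
  search_codebase: async (input) => ({ matches: [], query: input.query }),
  run_tests: async (input) => ({ passed: true, path: input.test_path }),
};

async function executeTool(name: string, input: Record<string, unknown>): Promise<unknown> {
  const handler = toolHandlers[name];
  if (!handler) return { error: `unknown tool: ${name}` };
  try {
    return await handler(input);
  } catch (err) {
    // Surface the failure to the model so it can retry or work around it.
    return { error: err instanceof Error ? err.message : String(err) };
  }
}
```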

Memory architecture — four types every production agent needs

// 1. Working memory — current conversation (messages array)
// 2. Episodic memory — summaries of past interactions
// 3. Semantic memory — facts in vector store (pgvector, Pinecone)
// 4. Procedural memory — tool usage examples in system prompt

class EpisodicMemory {
  summaries = [];

  async store(messages, outcome) {
    // Use cheap model for summarization
    const summary = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 200,
      messages: [{ role: 'user', content:
        'Summarize in 2 sentences: what task, what worked, what failed.\n'
        + 'Outcome: ' + outcome + '\nTranscript:\n' + JSON.stringify(messages).slice(0, 4000)
      }]
    });
    this.summaries.push({ ts: Date.now(), text: summary.content[0].text });
  }

  getRecent(n = 3) {
    // Production: use vector similarity (pgvector, Pinecone)
    return this.summaries.slice(-n).map(s => s.text);
  }
}

// Inject into system prompt:
const memory = episodicMemory.getRecent();
const systemPrompt = 'You are a code assistant agent.'
  + (memory.length ? '\n\nRELEVANT PAST EXPERIENCE:\n' + memory.join('\n') : '');
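Working memory (type 1 above) needs its own bound. A sketch of one common truncation strategy, assuming the first message holds the task; in production the placeholder would be a Haiku-generated summary of the dropped turns:

```typescript
// Bound the messages array: keep the task and the most recent turns,
// collapse everything in between into a single placeholder message.
type Msg = { role: 'user' | 'assistant'; content: unknown };

function boundWorkingMemory(messages: Msg[], keepRecent = 6): Msg[] {
  if (messages.length <= keepRecent + 1) return messages;
  const dropped = messages.length - 1 - keepRecent;
  return [
    messages[0], // the original task — never drop it
    // In production, replace this placeholder with a real summary.
    { role: 'user', content: `[${dropped} earlier messages summarized]` },
    ...messages.slice(-keepRecent),
  ];
}
```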

LangGraph orchestration — deterministic state machines

// LangGraph: agent as state machine, not recursive loop
// Gives deterministic flow, breakpoints, and resumability
import { StateGraph, END } from '@langchain/langgraph';

const workflow = new StateGraph({
  channels: {
    messages: { reducer: (a, b) => [...a, ...b], default: () => [] },
    plan: { reducer: (_, b) => b, default: () => [] },
    step: { reducer: (_, b) => b, default: () => 0 },
    results: { reducer: (a, b) => ({...a, ...b}), default: () => ({}) },
    error: { reducer: (_, b) => b, default: () => null },
  }
});

// Each node: state -> state (pure functions)
workflow.addNode('plan', async (state) => {
  const plan = await generatePlan(state.messages[0]);
  return { plan, step: 0 };
});

workflow.addNode('execute', async (state) => {
  const result = await executeStep(state.plan[state.step]);
  return { results: { [state.plan[state.step]]: result }, step: state.step + 1 };
});

// handle_error must be registered as a node before it can be a route target
workflow.addNode('handle_error', async (state) => ({ error: null }));

// Conditional routing — no recursive self-calls
workflow.addConditionalEdges('execute', (state) => {
  if (state.error) return 'handle_error';
  if (state.step >= state.plan.length) return END;
  return 'execute';
});

workflow.setEntryPoint('plan');
workflow.addEdge('plan', 'execute');
workflow.addEdge('handle_error', 'execute');
const app = workflow.compile();
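The same deterministic routing works without the LangGraph dependency. A plain-TypeScript sketch of the idea — node names and state shape are invented for illustration:

```typescript
// Minimal deterministic state machine: nodes map state -> state, a router
// names the next node, and a step cap replaces recursive self-calls.
type State = { plan: string[]; step: number; results: Record<string, string> };
type Node = (s: State) => State;

const END = 'END';
const nodes: Record<string, Node> = {
  plan: (s) => ({ ...s, plan: ['fetch', 'transform'], step: 0 }),
  execute: (s) => ({
    ...s,
    results: { ...s.results, [s.plan[s.step]]: 'done' },
    step: s.step + 1,
  }),
};

function route(current: string, s: State): string {
  if (current === 'plan') return 'execute';
  return s.step >= s.plan.length ? END : 'execute';
}

function run(entry: string, s: State, maxSteps = 20): State {
  let node = entry;
  for (let i = 0; i < maxSteps && node !== END; i++) {
    s = nodes[node](s);
    node = route(node, s);
  }
  return s;
}
```

The explicit `maxSteps` cap means a bad router can stall but never loop forever — the same guarantee LangGraph's recursion limit provides.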

Cost and latency budgets — mandatory for production

class AgentBudget {
  tokensUsed = 0;
  startTime = Date.now();

  constructor(
    maxTokens = 100000,   // ~$0.30 of input at Claude Sonnet pricing
    maxLatencyMs = 30000  // 30 second hard timeout
  ) {
    this.maxTokens = maxTokens;
    this.maxLatencyMs = maxLatencyMs;
  }

  check(newTokens) {
    this.tokensUsed += newTokens;
    if (this.tokensUsed > this.maxTokens)
      throw new Error('Token budget exceeded: ' + this.tokensUsed);
    if (Date.now() - this.startTime > this.maxLatencyMs)
      throw new Error('Timeout exceeded');
  }

  get estimatedCost() {
    return (this.tokensUsed / 1000000) * 9; // ~$9/M average
  }
}
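Tied into a loop, the budget enables graceful degradation: catch the budget error and return partial results instead of crashing. A self-contained sketch (the class is repeated here with types so the example runs standalone; step costs are simulated token counts):

```typescript
// Typed copy of AgentBudget from above, so this sketch is self-contained.
class AgentBudget {
  tokensUsed = 0;
  startTime = Date.now();
  constructor(public maxTokens = 100_000, public maxLatencyMs = 30_000) {}
  check(newTokens: number) {
    this.tokensUsed += newTokens;
    if (this.tokensUsed > this.maxTokens) throw new Error('Token budget exceeded');
    if (Date.now() - this.startTime > this.maxLatencyMs) throw new Error('Timeout exceeded');
  }
}

// Graceful degradation: stop when the budget trips, return what we have.
function runSteps(stepCosts: number[], budget: AgentBudget): { results: number[]; truncated: boolean } {
  const results: number[] = [];
  for (const cost of stepCosts) {
    try {
      budget.check(cost);
    } catch {
      return { results, truncated: true }; // partial results, not a crash
    }
    results.push(cost);
  }
  return { results, truncated: false };
}
```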

Production agent checklist

  • ✅ Structured JSON tool schemas with explicit types and required fields
  • ✅ Bounded context window — summarize old messages, never grow unboundedly
  • ✅ Explicit max_steps and token budget on every agent run
  • ✅ LangGraph state machine for complex multi-step agents
  • ✅ Structured logging of every tool call and result
  • ✅ Graceful degradation — return partial results when budget exceeded
  • ❌ Never use recursive self-calls — runaway loops become hard to detect and impossible to bound
  • ❌ Never run agents without explicit timeouts and step limits
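The structured-logging item on the checklist can be as small as one JSON line per tool call. A sketch with illustrative field names:

```typescript
// One JSON line per tool call: trivially greppable, easy to ship to a
// log store, and enough to reconstruct what the agent actually did.
function logToolCall(entry: {
  tool: string;
  input: unknown;
  ok: boolean;
  durationMs: number;
}): string {
  const line = JSON.stringify({ ts: new Date().toISOString(), ...entry });
  console.log(line);
  return line;
}
```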

Production agents pair naturally with Lambda Function URL streaming to stream tool call results and reasoning to users in real time. For deploying agent backends, the Lambda cold start guide ensures first-token latency stays fast. External reference: Anthropic tool use documentation.
