Every backend eventually faces the same question: how do we integrate AI without turning our clean architecture into a tangled mess of API calls, retry logic, and ballooning costs? The answer is not “just call OpenAI” — it is a set of proven architectural patterns that treat AI as a first-class service layer.
This guide covers the eight patterns that production AI backends rely on in 2026. Whether you are building intelligent document APIs or adding AI-powered features to an existing microservices architecture, these patterns will help you ship AI features that are reliable, observable, and cost-effective.
| Pattern | Category | When To Use | Complexity |
|---|---|---|---|
| ML Model Serving | Inference | Real-time predictions via REST/gRPC endpoints | High |
| RAG Pipelines | Knowledge | Context-aware responses from domain-specific data | High |
| Vector Databases | Storage | Semantic search and similarity matching | Medium |
| LLM Integration | Generation | Text generation, summarization, classification | Medium |
| AI Middleware | Orchestration | Request routing, prompt templating, fallbacks | Medium |
| Feature Stores | Data | Consistent feature serving for training and inference | High |
| A/B Testing ML | Experimentation | Comparing model versions in production | Medium |
| AI Observability | Monitoring | Tracking latency, cost, drift, and accuracy | Low |
Part 1: AI Integration Architecture
The most common mistake teams make when integrating AI is scattering LLM calls throughout their business logic. A better approach is to treat AI as a dedicated service layer — a boundary that sits between your application logic and the AI providers, handling prompt construction, response parsing, caching, and error recovery in one place.
This pattern mirrors how you would integrate any external service: behind an interface, with retries, circuit breakers, and fallback strategies. The difference is that AI services have unique characteristics — they are non-deterministic, latency-heavy, and priced per token — which require specialized handling.
Synchronous vs Asynchronous AI Pipelines
Synchronous pipelines work for low-latency tasks like classification or extraction where the model response is under 500ms. For generative tasks like document drafting or multi-step reasoning, asynchronous pipelines with streaming or job queues are essential. At TurboDocx Writer, we use a hybrid approach: classification runs synchronously in the request path, while content generation is streamed via Server-Sent Events.
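The hybrid split described above can be sketched as a small routing layer: tasks with predictable sub-500ms latency stay in the request path, while generative tasks become jobs the client polls or streams. This is a minimal in-memory sketch; the task names, the in-memory job map (a stand-in for a real queue like BullMQ or SQS), and the sync/async split are illustrative assumptions, not part of any specific framework.

```typescript
type TaskKind = 'classify' | 'extract' | 'draft' | 'summarize-long';

interface JobRecord {
  id: string;
  kind: TaskKind;
  status: 'queued' | 'done';
  result?: string;
}

// Tasks with predictable sub-500ms latency stay in the request path.
const SYNC_TASKS: ReadonlySet<TaskKind> = new Set(['classify', 'extract']);

function isSynchronous(kind: TaskKind): boolean {
  return SYNC_TASKS.has(kind);
}

// In-memory stand-in for a real job queue (BullMQ, SQS, etc.)
const jobs = new Map<string, JobRecord>();
let nextId = 0;

// Route a task: run inline, or enqueue and hand back a job id to poll/stream.
function submit(kind: TaskKind): { mode: 'sync' } | { mode: 'async'; jobId: string } {
  if (isSynchronous(kind)) return { mode: 'sync' };
  const id = `job-${++nextId}`;
  jobs.set(id, { id, kind, status: 'queued' });
  return { mode: 'async', jobId: id };
}
```

A classification request would return `{ mode: 'sync' }` and execute inline, while a draft request gets a `jobId` whose progress the client consumes via polling or SSE.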
* Code examples throughout this guide are simplified for illustrative purposes. Refer to the linked official documentation for complete API references and production-ready configurations.
Model Serving Patterns: REST, gRPC, and Streaming
REST is the default for most AI endpoints because it is simple and compatible with every client. gRPC shines when you need binary-efficient communication between internal services — particularly for feature vector transfer and batch inference. Streaming (SSE or WebSockets) is non-negotiable for any endpoint that returns LLM-generated text; users will not wait 5 seconds staring at a spinner.
```typescript
// AI Service Layer — single boundary for all AI operations
import crypto from 'crypto';
import { OpenAI } from 'openai';
import { Redis } from 'ioredis';

interface AIServiceConfig {
  model: string;
  maxTokens: number;
  temperature: number;
  cacheTtlSeconds: number;
}

interface AIResponse<T> {
  data: T;
  cached: boolean;
  latencyMs: number;
  tokensUsed: { prompt: number; completion: number };
}

class AIService {
  private client: OpenAI;
  private cache: Redis;
  private config: AIServiceConfig;

  constructor(config: AIServiceConfig) {
    this.client = new OpenAI();
    this.cache = new Redis(process.env.REDIS_URL!);
    this.config = config;
  }

  // Prompt cache key based on exact content hash
  private getCacheKey(prompt: string): string {
    const hash = crypto
      .createHash('sha256')
      .update(prompt)
      .digest('hex')
      .slice(0, 16);
    return `ai:completion:${this.config.model}:${hash}`;
  }

  async complete<T>(
    systemPrompt: string,
    userPrompt: string,
    parser: (raw: string) => T
  ): Promise<AIResponse<T>> {
    const cacheKey = this.getCacheKey(systemPrompt + userPrompt);
    const start = performance.now();

    // Check cache first
    const cached = await this.cache.get(cacheKey);
    if (cached) {
      return {
        data: parser(cached),
        cached: true,
        latencyMs: performance.now() - start,
        tokensUsed: { prompt: 0, completion: 0 },
      };
    }

    // Call LLM (retry/backoff logic omitted for brevity)
    const response = await this.client.chat.completions.create({
      model: this.config.model,
      max_tokens: this.config.maxTokens,
      temperature: this.config.temperature,
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userPrompt },
      ],
    });

    const raw = response.choices[0]?.message?.content ?? '';
    const usage = response.usage;

    // Cache the raw response
    await this.cache.setex(cacheKey, this.config.cacheTtlSeconds, raw);

    return {
      data: parser(raw),
      cached: false,
      latencyMs: performance.now() - start,
      tokensUsed: {
        prompt: usage?.prompt_tokens ?? 0,
        completion: usage?.completion_tokens ?? 0,
      },
    };
  }
}

// Usage in an Express route
const aiService = new AIService({
  model: 'gpt-4o',
  maxTokens: 2048,
  temperature: 0.3,
  cacheTtlSeconds: 3600,
});

app.post('/api/documents/classify', async (req, res) => {
  const { content } = req.body;
  const result = await aiService.complete(
    'Classify the document into one of: invoice, contract, proposal, sow.',
    content,
    (raw) => JSON.parse(raw) as { category: string; confidence: number }
  );
  res.json({
    classification: result.data,
    cached: result.cached,
    latencyMs: Math.round(result.latencyMs),
  });
});
```
Docs: OpenAI API Reference | ioredis
Architecture principle
Never call an LLM directly from a route handler. Always go through a service layer that owns caching, retries, token tracking, and response parsing. This single boundary makes it trivial to swap providers, add logging, or implement fallbacks later.
Part 2: RAG Pipelines
Retrieval-Augmented Generation is the single most impactful pattern for building AI-powered backends. Instead of relying solely on an LLM's training data, RAG retrieves relevant documents from your own knowledge base and injects them into the prompt context. The result is responses that are grounded in your data, dramatically reducing hallucinations.
A production RAG pipeline has four stages: ingest (chunk and embed documents), index (store embeddings in a vector database), retrieve (find the most relevant chunks for a query), and generate (pass retrieved context to the LLM). Each stage has its own optimization levers.
Vector Databases: Choosing the Right Store
The vector database landscape in 2026 has consolidated around three tiers. Pinecone and Weaviate dominate managed solutions with sub-10ms query latency at billion-vector scale. pgvector is the pragmatic choice when you want to keep embeddings alongside your relational data without introducing a new service. Qdrant and Milvus offer the best self-hosted performance for teams that need data-sovereignty guarantees.
For most teams building developer-facing applications, starting with pgvector inside your existing Postgres instance is the right call. You avoid operational overhead and can always migrate to a dedicated vector database when you cross the 10-million-vector threshold.
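As a concrete illustration of the pgvector route, the sketch below shows the shape of a cosine-similarity query, assuming the `pg` client and a hypothetical `chunks` table (the table schema, column names, and index choice are illustrative, not prescribed).

```typescript
// Assumed schema (run once in Postgres with the pgvector extension):
//   CREATE EXTENSION IF NOT EXISTS vector;
//   CREATE TABLE chunks (id bigserial PRIMARY KEY, content text,
//                        embedding vector(3072));
//   CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

// pgvector accepts vectors as a '[v1,v2,...]' literal string.
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`;
}

// `<=>` is pgvector's cosine-distance operator, so similarity = 1 - distance.
const SIMILARITY_QUERY = `
  SELECT id, content, 1 - (embedding <=> $1::vector) AS score
  FROM chunks
  ORDER BY embedding <=> $1::vector
  LIMIT $2
`;

// Usage (requires a live Postgres instance; not run here):
// const { rows } = await pool.query(SIMILARITY_QUERY,
//   [toVectorLiteral(queryEmbedding), 5]);
```

Because the embeddings live in the same database as your relational data, you can join similarity results directly against ownership and permission tables in one query, something a dedicated vector store cannot do.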
Embedding Generation and Chunking Strategy
Chunking strategy matters more than embedding model choice. Overlapping chunks of 512 tokens with a 64-token overlap consistently outperform both smaller and larger chunks for document Q&A tasks. For structured documents like statements of work or contracts, section-aware chunking that respects heading boundaries delivers significantly better retrieval accuracy.
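Section-aware chunking can be sketched as a first pass that splits on heading boundaries before any token-based splitting happens; a real pipeline would then sub-split oversized sections by token count. This is a minimal sketch for markdown-style headings only.

```typescript
interface Section {
  heading: string;
  content: string;
}

// Split markdown into sections at heading boundaries, so no chunk
// straddles two sections. Oversized sections would be sub-split later.
function splitByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: '', content: '' };
  for (const line of markdown.split('\n')) {
    if (/^#{1,6} /.test(line)) {
      // Close out the previous section if it holds anything.
      if (current.heading || current.content.trim()) sections.push(current);
      current = { heading: line.replace(/^#+ /, ''), content: '' };
    } else {
      current.content += line + '\n';
    }
  }
  if (current.heading || current.content.trim()) sections.push(current);
  return sections;
}
```

For a statement of work, this keeps "Scope" and "Payment Terms" in separate chunks, so a retrieval query about invoicing never pulls in half a scope paragraph as noise.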
```typescript
// Production RAG Pipeline with vector search and LLM generation
import { OpenAIEmbeddings } from '@langchain/openai';
import { PineconeStore } from '@langchain/pinecone';
import { Pinecone } from '@pinecone-database/pinecone';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
import { OpenAI } from 'openai';

interface RAGConfig {
  indexName: string;
  namespace: string;
  topK: number;
  scoreThreshold: number;
}

interface RetrievedChunk {
  content: string;
  metadata: { source: string; page: number; score: number };
}

class RAGPipeline {
  private embeddings: OpenAIEmbeddings;
  private vectorStore: PineconeStore;
  private splitter: RecursiveCharacterTextSplitter;
  private config: RAGConfig;

  private constructor(config: RAGConfig, vectorStore: PineconeStore) {
    this.config = config;
    this.vectorStore = vectorStore;
    this.embeddings = new OpenAIEmbeddings({
      model: 'text-embedding-3-large',
      dimensions: 3072,
    });
    this.splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 512,
      chunkOverlap: 64,
      separators: ['\n## ', '\n### ', '\n\n', '\n', ' '],
    });
  }

  static async create(config: RAGConfig): Promise<RAGPipeline> {
    const embeddings = new OpenAIEmbeddings({
      model: 'text-embedding-3-large',
      dimensions: 3072,
    });
    const pinecone = new Pinecone();
    const index = pinecone.index(config.indexName);
    const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
      pineconeIndex: index,
      namespace: config.namespace,
    });
    return new RAGPipeline(config, vectorStore);
  }

  // Stage 1 & 2: Ingest and Index
  async ingestDocument(
    content: string,
    metadata: { source: string; docId: string }
  ): Promise<{ chunksIndexed: number }> {
    const chunks = await this.splitter.createDocuments(
      [content],
      [metadata],
      { chunkHeader: `SOURCE: ${metadata.source}\n\n` }
    );
    await this.vectorStore.addDocuments(chunks);
    return { chunksIndexed: chunks.length };
  }

  // Stage 3: Retrieve relevant chunks
  async retrieve(query: string): Promise<RetrievedChunk[]> {
    const results = await this.vectorStore.similaritySearchWithScore(
      query,
      this.config.topK
    );
    return results
      .filter(([, score]) => score >= this.config.scoreThreshold)
      .map(([doc, score]) => ({
        content: doc.pageContent,
        metadata: {
          source: doc.metadata.source,
          page: doc.metadata.page ?? 0,
          score: Math.round(score * 1000) / 1000,
        },
      }));
  }

  // Stage 4: Generate with retrieved context
  async queryWithContext(
    query: string,
    systemPrompt: string
  ): Promise<{ answer: string; sources: RetrievedChunk[] }> {
    const chunks = await this.retrieve(query);
    const contextBlock = chunks
      .map((c, i) => `[Source ${i + 1}: ${c.metadata.source}]\n${c.content}`)
      .join('\n\n---\n\n');

    const augmentedPrompt = `${systemPrompt}

Use the following context to answer the user's question.
If the context does not contain enough information, say so explicitly.

CONTEXT:
${contextBlock}

USER QUESTION:
${query}`;

    const client = new OpenAI();
    const response = await client.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: augmentedPrompt }],
      temperature: 0.2,
      max_tokens: 1024,
    });

    return {
      answer: response.choices[0].message.content ?? '',
      sources: chunks,
    };
  }
}

// Express route using the RAG pipeline
const rag = await RAGPipeline.create({
  indexName: 'documents',
  namespace: 'production',
  topK: 5,
  scoreThreshold: 0.75,
});

app.post('/api/knowledge/query', async (req, res) => {
  const { question } = req.body;
  const result = await rag.queryWithContext(
    question,
    'You are a helpful assistant for a document automation platform.'
  );
  res.json({
    answer: result.answer,
    sources: result.sources.map((s) => ({
      source: s.metadata.source,
      relevance: s.metadata.score,
    })),
  });
});
```
Docs: LangChain JS | Pinecone | OpenAI Embeddings
Production tip
Always include a scoreThreshold filter on your vector search results. Without it, the LLM will receive low-relevance chunks that confuse the generation. A threshold of 0.75 on cosine similarity is a good starting point for most document Q&A use cases.
Part 3: LLM Integration Patterns
Integrating an LLM into a production backend is not simply wrapping an API call. You need prompt versioning, token optimization, streaming, and graceful degradation. These patterns turn a fragile prototype into a resilient production system.
Prompt Management and Versioning
Treat prompts like code. Store them in version-controlled template files, not inline strings. Use a prompt registry that maps prompt IDs to versioned templates with variable interpolation. This lets you A/B test prompt variants, roll back regressions, and audit changes — critical when your prompts drive business logic in document automation workflows.
Token Optimization and Semantic Caching
Token cost is the largest line item for AI-heavy backends. Three techniques consistently reduce costs by 40-60%: prompt compression (stripping redundant whitespace and instructions), semantic caching (caching responses for semantically similar queries, not just exact matches), and tiered model routing (using a smaller model for simple tasks and a larger model only when complexity warrants it).
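Tiered model routing can be sketched as a heuristic complexity score that decides which model handles a request. The scoring rules, thresholds, and model names below are illustrative assumptions; production routers typically combine heuristics like these with a small classifier model.

```typescript
interface RoutingDecision {
  model: string;
  reason: string;
}

// Crude complexity heuristic (illustrative): long inputs, reasoning
// verbs, or multi-question prompts escalate to the larger model.
function scoreComplexity(prompt: string): number {
  let score = 0;
  if (prompt.length > 2000) score += 2;
  if (/\b(step[- ]by[- ]step|compare|analyze|reason)\b/i.test(prompt)) score += 2;
  if ((prompt.match(/\?/g)?.length ?? 0) > 1) score += 1;
  return score;
}

// Default to the cheap model; escalate only when complexity warrants it.
function routeModel(prompt: string): RoutingDecision {
  const score = scoreComplexity(prompt);
  return score >= 2
    ? { model: 'gpt-4o', reason: `complexity=${score}` }
    : { model: 'gpt-4o-mini', reason: `complexity=${score}` };
}
```

Because simple lookups and extractions vastly outnumber multi-step reasoning requests in most workloads, even a crude router like this shifts the bulk of traffic onto the cheaper tier.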
Streaming Responses with Server-Sent Events
For any endpoint that generates text longer than a sentence, streaming is not optional. Users perceive a streaming response as significantly faster than a batch response with the same total latency. The pattern below shows how to pipe OpenAI's streaming API directly through your Express backend to the client.
```typescript
// LLM Streaming with prompt versioning and fallback
import { OpenAI, AzureOpenAI } from 'openai';
import type { Request, Response } from 'express';

// Prompt registry — version-controlled templates
const PROMPT_REGISTRY = {
  'document-draft': {
    v1: {
      system: `You are a professional document writer.
Generate content based on the template variables provided.
Output clean, well-structured prose.`,
      maxTokens: 4096,
      temperature: 0.7,
    },
    v2: {
      system: `You are a professional document writer for a B2B SaaS platform.
Generate content using the template variables below.
Follow the document type conventions. Be concise and professional.
Output well-structured prose with clear section headings.`,
      maxTokens: 4096,
      temperature: 0.5,
    },
  },
} as const;

type PromptId = keyof typeof PROMPT_REGISTRY;

interface StreamOptions {
  promptId: PromptId;
  promptVersion: string;
  userMessage: string;
  onToken?: (token: string) => void;
}

class LLMService {
  private primary: OpenAI;
  private fallback: AzureOpenAI;

  constructor() {
    this.primary = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
    this.fallback = new AzureOpenAI({
      apiKey: process.env.AZURE_OPENAI_API_KEY,
      endpoint: process.env.AZURE_OPENAI_ENDPOINT,
      apiVersion: '2024-08-01-preview',
    });
  }

  async streamCompletion(options: StreamOptions, res: Response): Promise<void> {
    const promptConfig =
      PROMPT_REGISTRY[options.promptId]?.[
        options.promptVersion as keyof (typeof PROMPT_REGISTRY)[PromptId]
      ];
    if (!promptConfig) {
      throw new Error(`Unknown prompt: ${options.promptId}@${options.promptVersion}`);
    }

    // Set SSE headers
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');

    try {
      const stream = await this.primary.chat.completions.create({
        model: 'gpt-4o',
        stream: true,
        max_tokens: promptConfig.maxTokens,
        temperature: promptConfig.temperature,
        messages: [
          { role: 'system', content: promptConfig.system },
          { role: 'user', content: options.userMessage },
        ],
      });

      let totalTokens = 0;
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) {
          totalTokens++; // Note: approximate — each chunk may contain multiple tokens
          res.write(`data: ${JSON.stringify({ token: content })}\n\n`);
        }
      }

      // Send completion event with metadata
      res.write(`data: ${JSON.stringify({ done: true, totalTokens })}\n\n`);
      res.end();
    } catch (error) {
      // Fallback to Azure OpenAI on primary failure
      console.error('Primary LLM failed, falling back:', error);
      const fallbackResponse = await this.fallback.chat.completions.create({
        model: 'gpt-4o',
        max_tokens: promptConfig.maxTokens,
        temperature: promptConfig.temperature,
        messages: [
          { role: 'system', content: promptConfig.system },
          { role: 'user', content: options.userMessage },
        ],
      });
      const content = fallbackResponse.choices[0]?.message?.content ?? '';
      res.write(`data: ${JSON.stringify({ token: content })}\n\n`);
      res.write(`data: ${JSON.stringify({ done: true, fallback: true })}\n\n`);
      res.end();
    }
  }
}

// Express route with streaming
const llm = new LLMService();

app.post('/api/ai/generate', async (req, res) => {
  const { promptId, version, input } = req.body;
  await llm.streamCompletion(
    { promptId, promptVersion: version, userMessage: input },
    res
  );
});
```
Docs: OpenAI Streaming | Azure OpenAI
Fallback strategy
Always configure a fallback LLM provider. If your primary is OpenAI, use Azure OpenAI or Anthropic as your secondary. The fallback should be a non-streaming batch call — simpler to implement and sufficient for error scenarios. Monitor your fallback rate; if it exceeds 2%, investigate the primary provider's reliability.
Part 4: ML Model Serving
Not every AI task needs an LLM. Classification, anomaly detection, recommendation, and scoring tasks are better served by purpose-built ML models that are faster, cheaper, and more predictable than generative models. The challenge is serving these models reliably at scale.
Serving Frameworks: TensorFlow Serving vs FastAPI
TensorFlow Serving is the production standard for TensorFlow and SavedModel formats. It supports model versioning, canary deployments, and batching out of the box. FastAPI + ONNX Runtime is the flexible alternative when your models come from PyTorch, scikit-learn, or XGBoost. FastAPI gives you full control over preprocessing, validation, and response formatting while ONNX Runtime handles cross-framework inference.
Model Versioning and A/B Testing
Model versioning follows the same principles as service versioning in microservices. Every model artifact gets a semantic version. A/B testing routes a percentage of traffic to the new model version while monitoring accuracy, latency, and business metrics. Only promote when the challenger outperforms the champion on your primary metric.
Feature Stores for Consistent Inference
A feature store guarantees that the features used during training are identical to those used during inference. Without one, training-serving skew silently degrades model accuracy. Tools like Feast, Tecton, and even a simple Redis-backed feature cache solve this problem at different scales.
```typescript
// ML Model Serving with versioning and A/B routing
import express from 'express';
import { Redis } from 'ioredis';
import { InferenceSession, Tensor } from 'onnxruntime-node';

interface ModelVersion {
  version: string;
  session: InferenceSession;
  trafficWeight: number; // 0-100
}

interface PredictionResult {
  prediction: number[];
  confidence: number;
  modelVersion: string;
  latencyMs: number;
}

class ModelServingService {
  private models: Map<string, ModelVersion[]> = new Map();

  async loadModel(
    modelName: string,
    version: string,
    modelPath: string,
    trafficWeight: number
  ): Promise<void> {
    const session = await InferenceSession.create(modelPath);
    const versions = this.models.get(modelName) ?? [];
    versions.push({ version, session, trafficWeight });
    this.models.set(modelName, versions);
  }

  // Weighted random routing for A/B testing
  private selectVersion(modelName: string): ModelVersion {
    const versions = this.models.get(modelName);
    if (!versions?.length) throw new Error(`No model: ${modelName}`);
    const totalWeight = versions.reduce((sum, v) => sum + v.trafficWeight, 0);
    let random = Math.random() * totalWeight;
    for (const version of versions) {
      random -= version.trafficWeight;
      if (random <= 0) return version;
    }
    return versions[versions.length - 1];
  }

  async predict(modelName: string, features: number[]): Promise<PredictionResult> {
    const model = this.selectVersion(modelName);
    const start = performance.now();
    // ONNX Runtime expects a typed array for float32 tensors
    const inputTensor = new Tensor('float32', Float32Array.from(features), [
      1,
      features.length,
    ]);
    const results = await model.session.run({
      [model.session.inputNames[0]]: inputTensor,
    });
    const output = results[model.session.outputNames[0]].data as Float32Array;
    return {
      prediction: Array.from(output),
      confidence: Math.max(...Array.from(output)),
      modelVersion: model.version,
      latencyMs: performance.now() - start,
    };
  }
}

// Feature Store — Redis-backed for real-time serving
class FeatureStore {
  private redis: Redis;
  private ttl: number;

  constructor(redis: Redis, ttlSeconds = 300) {
    this.redis = redis;
    this.ttl = ttlSeconds;
  }

  async getFeatures(
    entityId: string,
    featureNames: string[]
  ): Promise<Record<string, number>> {
    const pipeline = this.redis.pipeline();
    for (const name of featureNames) {
      pipeline.hget(`features:${entityId}`, name);
    }
    const results = await pipeline.exec();
    const features: Record<string, number> = {};
    featureNames.forEach((name, i) => {
      features[name] = parseFloat((results?.[i]?.[1] as string) ?? '0');
    });
    return features;
  }

  async setFeatures(
    entityId: string,
    features: Record<string, number>
  ): Promise<void> {
    const pipeline = this.redis.pipeline();
    for (const [name, value] of Object.entries(features)) {
      pipeline.hset(`features:${entityId}`, name, value.toString());
    }
    pipeline.expire(`features:${entityId}`, this.ttl);
    await pipeline.exec();
  }
}

// Express route: classify a document using ML model
const modelService = new ModelServingService();

// Load two model versions for A/B testing
await modelService.loadModel('doc-classifier', 'v2.1', './models/v2.1.onnx', 80);
await modelService.loadModel('doc-classifier', 'v2.2', './models/v2.2.onnx', 20);

app.post('/api/ml/classify', async (req, res) => {
  const { features } = req.body;
  const result = await modelService.predict('doc-classifier', features);
  res.json({
    prediction: result.prediction,
    confidence: result.confidence,
    model: result.modelVersion,
    latencyMs: Math.round(result.latencyMs),
  });
});
```
Docs: ONNX Runtime Node.js
Real-time vs batch inference
Use real-time inference for user-facing requests where latency matters (classification, scoring, recommendations). Use batch inference for offline tasks like retraining data preparation, bulk scoring, and periodic report generation. The cost difference is significant — batch inference on GPU instances is 3-5x cheaper per prediction than real-time serving.
Part 5: AI Observability
Traditional APM tools measure request latency and error rates. AI backends need a fundamentally different observability stack: one that tracks prompt versions, token costs, model accuracy, and output quality drift. Without it, your AI features will silently degrade as data distributions shift and prompt effectiveness decays.
Monitoring Model Performance
Track three categories of metrics: operational (latency, throughput, error rates), quality (accuracy, precision, recall, user feedback scores), and business (conversion rate, task completion rate, cost per prediction). Most teams only monitor operational metrics and are blindsided when quality degrades.
Prompt Tracking and Cost Management
Every LLM call should log the prompt version, input tokens, output tokens, model used, latency, and cache hit status. Aggregate these into a cost dashboard that shows spend by prompt, by model, and by endpoint. At TurboDocx's API layer, this dashboard helped us identify a single prompt that was consuming 40% of our monthly token budget — we optimized it and cut costs by 35%.
Drift Detection
Model drift happens when the distribution of input data changes over time, causing model predictions to degrade. For ML models, track feature distribution statistics (mean, variance, quantiles) and alert when they deviate beyond a threshold. For LLMs, track output length distribution, refusal rates, and structured output parsing failure rates as proxies for quality drift.
```typescript
// AI Observability Layer — track every AI operation
import { EventEmitter } from 'events';
import type { Request, Response, NextFunction } from 'express';

interface AIMetric {
  timestamp: Date;
  endpoint: string;
  promptId: string;
  promptVersion: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  cached: boolean;
  success: boolean;
  costUsd: number;
  userFeedback?: 'positive' | 'negative';
}

// Token pricing per model (per 1M tokens, as of 2026)
// Note: Verify current pricing against official API docs — rates change frequently
const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
  'claude-sonnet-4-5': { input: 3.00, output: 15.00 },
  'text-embedding-3-large': { input: 0.13, output: 0 },
};

class AIObserver extends EventEmitter {
  private metrics: AIMetric[] = [];
  private flushInterval: NodeJS.Timeout;

  constructor(private flushIntervalMs = 30_000) {
    super();
    this.flushInterval = setInterval(() => this.flush(), flushIntervalMs);
  }

  // Calculate cost based on model pricing
  private calculateCost(
    model: string,
    inputTokens: number,
    outputTokens: number
  ): number {
    const pricing = MODEL_PRICING[model];
    if (!pricing) return 0;
    return (
      (inputTokens / 1_000_000) * pricing.input +
      (outputTokens / 1_000_000) * pricing.output
    );
  }

  record(metric: Omit<AIMetric, 'timestamp' | 'costUsd'>): void {
    const fullMetric: AIMetric = {
      ...metric,
      timestamp: new Date(),
      costUsd: metric.cached
        ? 0
        : this.calculateCost(metric.model, metric.inputTokens, metric.outputTokens),
    };
    this.metrics.push(fullMetric);
    this.emit('metric', fullMetric);

    // Alert on high-latency operations
    if (fullMetric.latencyMs > 5000) {
      this.emit('alert', {
        type: 'high_latency',
        metric: fullMetric,
        message: `AI call to ${metric.endpoint} took ${metric.latencyMs}ms`,
      });
    }
  }

  // Aggregate metrics for dashboard
  getSummary(windowMinutes = 60): {
    totalCost: number;
    totalRequests: number;
    cacheHitRate: number;
    avgLatencyMs: number;
    errorRate: number;
    costByModel: Record<string, number>;
    costByPrompt: Record<string, number>;
  } {
    const cutoff = new Date(Date.now() - windowMinutes * 60 * 1000);
    const window = this.metrics.filter((m) => m.timestamp >= cutoff);
    const totalCost = window.reduce((sum, m) => sum + m.costUsd, 0);
    const cached = window.filter((m) => m.cached).length;
    const errors = window.filter((m) => !m.success).length;
    const avgLatency =
      window.reduce((sum, m) => sum + m.latencyMs, 0) / (window.length || 1);
    const costByModel: Record<string, number> = {};
    const costByPrompt: Record<string, number> = {};
    for (const m of window) {
      costByModel[m.model] = (costByModel[m.model] ?? 0) + m.costUsd;
      costByPrompt[m.promptId] = (costByPrompt[m.promptId] ?? 0) + m.costUsd;
    }
    return {
      totalCost: Math.round(totalCost * 100) / 100,
      totalRequests: window.length,
      cacheHitRate: window.length ? cached / window.length : 0,
      avgLatencyMs: Math.round(avgLatency),
      errorRate: window.length ? errors / window.length : 0,
      costByModel,
      costByPrompt,
    };
  }

  // Drift detection: compare current distribution to baseline
  detectDrift(
    baselineAvgTokens: number,
    baselineRefusalRate: number,
    windowMinutes = 60
  ): { tokenDrift: boolean; refusalDrift: boolean } {
    const cutoff = new Date(Date.now() - windowMinutes * 60 * 1000);
    const window = this.metrics.filter((m) => m.timestamp >= cutoff && m.success);
    const avgOutputTokens =
      window.reduce((sum, m) => sum + m.outputTokens, 0) / (window.length || 1);
    const refusals = window.filter((m) => m.outputTokens < 10).length;
    const currentRefusalRate = window.length ? refusals / window.length : 0;
    return {
      tokenDrift:
        Math.abs(avgOutputTokens - baselineAvgTokens) / baselineAvgTokens > 0.3,
      refusalDrift: Math.abs(currentRefusalRate - baselineRefusalRate) > 0.05,
    };
  }

  private async flush(): Promise<void> {
    if (this.metrics.length === 0) return;
    // Flush to your analytics pipeline (BigQuery, ClickHouse, etc.)
    const batch = this.metrics.splice(0, this.metrics.length);
    console.log(`Flushing ${batch.length} AI metrics`);
    // await analyticsClient.insert('ai_metrics', batch);
  }

  destroy(): void {
    clearInterval(this.flushInterval);
    this.flush();
  }
}

// Usage: wrap every AI call with the observer
const observer = new AIObserver();
observer.on('alert', (alert) => {
  // Send to PagerDuty, Slack, etc.
  console.warn('AI Alert:', alert.message);
});

// Express middleware to track AI metrics
function trackAICall(endpoint: string, promptId: string, promptVersion: string) {
  return (req: Request, res: Response, next: NextFunction) => {
    const start = performance.now();
    const originalJson = res.json.bind(res);
    res.json = ((body: Record<string, unknown>) => {
      observer.record({
        endpoint,
        promptId,
        promptVersion,
        model: (body.model as string) ?? 'unknown',
        inputTokens: (body.inputTokens as number) ?? 0,
        outputTokens: (body.outputTokens as number) ?? 0,
        latencyMs: performance.now() - start,
        cached: (body.cached as boolean) ?? false,
        success: res.statusCode < 400,
      });
      return originalJson(body);
    }) as typeof res.json;
    next();
  };
}
```
Cost management rule of thumb
Set per-endpoint daily cost budgets and circuit-break when they are exceeded. A single prompt regression can 10x your daily spend if left unchecked. Tools like LangSmith, Helicone, and custom dashboards make this straightforward. Pair cost alerts with your existing workflow automation to auto-disable runaway endpoints.
Putting It All Together: Architecture Reference
A complete AI-powered backend combines these patterns into a layered architecture. Here is how the layers interact in a production system — using TurboDocx Writer's AI document generation as a concrete example.
Request flow: AI document generation
1. Client POST /api/documents/generate
→ AI Middleware (auth, rate limit, validate)
→ Feature Store (retrieve user preferences, org settings)
→ RAG Pipeline (retrieve relevant template examples)
→ LLM Service (stream generation with prompt v2.1)
→ AI Observer (log tokens, latency, cost)
→ SSE stream to client
Total latency: 200ms to first token, 3-8s to completion
Cost: ~$0.003 per document (with semantic caching)
Each layer is independently testable, swappable, and observable. The RAG pipeline can be replaced with a fine-tuned model. The LLM provider can be swapped from OpenAI to Anthropic. The feature store can scale from Redis to Feast. None of these changes require modifying the request flow.
This modular approach is the same philosophy behind building scalable IT platforms — isolate concerns, define clear interfaces, and make every component replaceable without a rewrite.
Key Takeaways
Start with RAG Before Fine-Tuning
RAG pipelines give you 80% of the quality of a fine-tuned model at 10% of the cost. Always validate that retrieval-augmented generation cannot solve your problem before investing in model training.
Stream Everything
LLM responses take seconds. Streaming tokens to the client as they are generated turns a 5-second wait into an instant-feeling interaction. Use Server-Sent Events or WebSockets for all generative endpoints.
Cache Aggressively at Every Layer
Embedding generation, LLM completions, and vector search results are all cacheable. A prompt cache alone can reduce your LLM API costs by 40-60% in production workloads.
Observe Prompts Like You Observe Code
Every prompt is a function call with variable inputs. Log prompt versions, track token usage, measure latency percentiles, and set up drift alerts. AI systems degrade silently without proper observability.
AI Backend Production Checklist
Before shipping an AI-powered feature to production, run through this checklist. Each item has caused production incidents at real companies.
Every LLM call goes through a service layer — never called directly from route handlers
Semantic caching is enabled for all deterministic prompts (temperature < 0.3)
Streaming is implemented for all generative endpoints
A fallback LLM provider is configured and tested
Per-endpoint daily cost budgets are set with circuit breakers
Prompt versions are logged with every request
Token usage, latency, and cache hit rates are tracked in a dashboard
Vector search results are filtered by a minimum score threshold
Input validation rejects prompts that exceed context window limits
Rate limiting is applied per user and per organization
Model A/B testing infrastructure is in place before deploying new model versions
Drift detection alerts are configured for both ML models and LLM outputs
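The context-window validation item from the checklist above can be sketched as a pre-flight check. The 4-characters-per-token estimate is a rough heuristic and the context limits are illustrative; production code should use a real tokenizer (e.g. tiktoken) and verify each model's actual limit against provider documentation.

```typescript
// Illustrative context limits — verify against provider docs
const CONTEXT_LIMITS: Record<string, number> = {
  'gpt-4o': 128_000,
};

// Rough heuristic: ~4 characters per token for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Reject a request before spending money on a call that will fail,
// reserving headroom for the model's output tokens.
function validatePromptSize(
  model: string,
  prompt: string,
  reservedForOutput: number
): { ok: boolean; estimatedTokens: number } {
  const limit = CONTEXT_LIMITS[model] ?? 8_000; // conservative default
  const estimatedTokens = estimateTokens(prompt);
  return { ok: estimatedTokens + reservedForOutput <= limit, estimatedTokens };
}
```

Running this check in middleware means oversized documents are rejected with a clear 4xx error instead of surfacing as an opaque provider failure mid-request.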
Related Resources
Microservices & Event-Driven Architecture
Patterns for building distributed backend systems that AI services plug into — event sourcing, CQRS, and saga orchestration.
React Performance Optimization
Optimize the frontend that consumes your AI-powered API endpoints — memoization, code splitting, and virtualization.
TurboDocx API & SDK
See how TurboDocx exposes AI-powered document generation through its API and developer toolkit.
TurboDocx for Developers
How developers integrate AI-powered document automation into their applications and workflows.
Build AI-Powered Document Workflows
TurboDocx uses these exact patterns to power intelligent document generation, template automation, and AI-driven content creation — so you can focus on building great products.
