Every backend eventually faces the same question: how do we integrate AI without turning our clean architecture into a tangled mess of API calls, retry logic, and ballooning costs? The answer is not “just call OpenAI” — it is a set of proven architectural patterns that treat AI as a first-class service layer.
This guide covers the eight patterns that production AI backends rely on in 2026. Whether you are building intelligent document APIs or adding AI-powered features to an existing microservices architecture, these patterns will help you ship AI features that are reliable, observable, and cost-effective.
| Pattern | Category | When To Use | Complexity |
|---|---|---|---|
| ML Model Serving | Inference | Real-time predictions via REST/gRPC endpoints | High |
| RAG Pipelines | Knowledge | Context-aware responses from domain-specific data | High |
| Vector Databases | Storage | Semantic search and similarity matching | Medium |
| LLM Integration | Generation | Text generation, summarization, classification | Medium |
| AI Middleware | Orchestration | Request routing, prompt templating, fallbacks | Medium |
| Feature Stores | Data | Consistent feature serving for training and inference | High |
| A/B Testing ML | Experimentation | Comparing model versions in production | Medium |
| AI Observability | Monitoring | Tracking latency, cost, drift, and accuracy | Low |
Part 1: AI Integration Architecture
The most common mistake teams make when integrating AI is scattering LLM calls throughout their business logic. A better approach is to treat AI as a dedicated service layer — a boundary that sits between your application logic and the AI providers, handling prompt construction, response parsing, caching, and error recovery in one place.
This pattern mirrors how you would integrate any external service: behind an interface, with retries, circuit breakers, and fallback strategies. The difference is that AI services have unique characteristics — they are non-deterministic, latency-heavy, and priced per token — which require specialized handling.
Synchronous vs Asynchronous AI Pipelines
Synchronous pipelines work for low-latency tasks like classification or extraction where the model response is under 500ms. For generative tasks like document drafting or multi-step reasoning, asynchronous pipelines with streaming or job queues are essential. At TurboDocx Writer, we use a hybrid approach: classification runs synchronously in the request path, while content generation is streamed via Server-Sent Events.
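The hybrid split described above can be sketched as a small routing layer: tasks with predictable sub-500ms latency stay in the request path, while generative tasks become jobs the client polls or streams. This is a minimal in-memory sketch; the task names, the in-memory job map (a stand-in for a real queue like BullMQ or SQS), and the sync/async split are illustrative assumptions, not part of any specific framework.

```typescript
type TaskKind = 'classify' | 'extract' | 'draft' | 'summarize-long';

interface JobRecord {
  id: string;
  kind: TaskKind;
  status: 'queued' | 'done';
  result?: string;
}

// Tasks with predictable sub-500ms latency stay in the request path.
const SYNC_TASKS: ReadonlySet<TaskKind> = new Set(['classify', 'extract']);

function isSynchronous(kind: TaskKind): boolean {
  return SYNC_TASKS.has(kind);
}

// In-memory stand-in for a real job queue (BullMQ, SQS, etc.)
const jobs = new Map<string, JobRecord>();
let nextId = 0;

// Route a task: run inline, or enqueue and hand back a job id to poll/stream.
function submit(kind: TaskKind): { mode: 'sync' } | { mode: 'async'; jobId: string } {
  if (isSynchronous(kind)) return { mode: 'sync' };
  const id = `job-${++nextId}`;
  jobs.set(id, { id, kind, status: 'queued' });
  return { mode: 'async', jobId: id };
}
```

A classification request would return `{ mode: 'sync' }` and execute inline, while a draft request gets a `jobId` whose progress the client consumes via polling or SSE.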
* Code examples throughout this guide are simplified for illustrative purposes. Refer to the linked official documentation for complete API references and production-ready configurations.
Model Serving Patterns: REST, gRPC, and Streaming
REST is the default for most AI endpoints because it is simple and compatible with every client. gRPC shines when you need binary-efficient communication between internal services — particularly for feature vector transfer and batch inference. Streaming (SSE or WebSockets) is non-negotiable for any endpoint that returns LLM-generated text; users will not wait 5 seconds staring at a spinner.
```typescript
// AI Service Layer — single boundary for all AI operations
import crypto from 'crypto';
import { OpenAI } from 'openai';
import { Redis } from 'ioredis';

interface AIServiceConfig {
  model: string;
  maxTokens: number;
  temperature: number;
  cacheTtlSeconds: number;
}

interface AIResponse<T> {
  data: T;
  cached: boolean;
  latencyMs: number;
  tokensUsed: { prompt: number; completion: number };
}

class AIService {
  private client: OpenAI;
  private cache: Redis;
  private config: AIServiceConfig;

  constructor(config: AIServiceConfig) {
    this.client = new OpenAI();
    this.cache = new Redis(process.env.REDIS_URL!);
    this.config = config;
  }

  // Prompt cache key based on exact content hash
  private getCacheKey(prompt: string): string {
    const hash = crypto
      .createHash('sha256')
      .update(prompt)
      .digest('hex')
      .slice(0, 16);
    return `ai:completion:${this.config.model}:${hash}`;
  }

  async complete<T>(
    systemPrompt: string,
    userPrompt: string,
    parser: (raw: string) => T
  ): Promise<AIResponse<T>> {
    const cacheKey = this.getCacheKey(systemPrompt + userPrompt);
    const start = performance.now();

    // Check cache first
    const cached = await this.cache.get(cacheKey);
    if (cached) {
      return {
        data: parser(cached),
        cached: true,
        latencyMs: performance.now() - start,
        tokensUsed: { prompt: 0, completion: 0 },
      };
    }

    // Call LLM (retry/backoff logic omitted for brevity)
    const response = await this.client.chat.completions.create({
      model: this.config.model,
      max_tokens: this.config.maxTokens,
      temperature: this.config.temperature,
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userPrompt },
      ],
    });

    const raw = response.choices[0]?.message?.content ?? '';
    const usage = response.usage;

    // Cache the raw response
    await this.cache.setex(cacheKey, this.config.cacheTtlSeconds, raw);

    return {
      data: parser(raw),
      cached: false,
      latencyMs: performance.now() - start,
      tokensUsed: {
        prompt: usage?.prompt_tokens ?? 0,
        completion: usage?.completion_tokens ?? 0,
      },
    };
  }
}

// Usage in an Express route
const aiService = new AIService({
  model: 'gpt-4o',
  maxTokens: 2048,
  temperature: 0.3,
  cacheTtlSeconds: 3600,
});

app.post('/api/documents/classify', async (req, res) => {
  const { content } = req.body;
  const result = await aiService.complete(
    'Classify the document into one of: invoice, contract, proposal, sow.',
    content,
    (raw) => JSON.parse(raw) as { category: string; confidence: number }
  );
  res.json({
    classification: result.data,
    cached: result.cached,
    latencyMs: Math.round(result.latencyMs),
  });
});
```
Docs: OpenAI API Reference | ioredis
Architecture principle
Never call an LLM directly from a route handler. Always go through a service layer that owns caching, retries, token tracking, and response parsing. This single boundary makes it trivial to swap providers, add logging, or implement fallbacks later.
Part 2: RAG Pipelines
Retrieval-Augmented Generation is the single most impactful pattern for building AI-powered backends. Instead of relying solely on an LLM's training data, RAG retrieves relevant documents from your own knowledge base and injects them into the prompt context. The result is responses that are grounded in your data, dramatically reducing hallucinations.
A production RAG pipeline has four stages: ingest (chunk and embed documents), index (store embeddings in a vector database), retrieve (find the most relevant chunks for a query), and generate (pass retrieved context to the LLM). Each stage has its own optimization levers.
Vector Databases: Choosing the Right Store
The vector database landscape in 2026 has consolidated around three tiers. Pinecone and Weaviate dominate managed solutions with sub-10ms query latency at billion-vector scale. pgvector is the pragmatic choice when you want to keep embeddings alongside your relational data without introducing a new service. Qdrant and Milvus offer the best self-hosted performance for teams that need data-sovereignty guarantees.
For most teams building developer-facing applications, starting with pgvector inside your existing Postgres instance is the right call. You avoid operational overhead and can always migrate to a dedicated vector database when you cross the 10-million-vector threshold.
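As a concrete illustration of the pgvector route, the sketch below shows the shape of a cosine-similarity query, assuming the `pg` client and a hypothetical `chunks` table (the table schema, column names, and index choice are illustrative, not prescribed).

```typescript
// Assumed schema (run once in Postgres with the pgvector extension):
//   CREATE EXTENSION IF NOT EXISTS vector;
//   CREATE TABLE chunks (id bigserial PRIMARY KEY, content text,
//                        embedding vector(3072));
//   CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

// pgvector accepts vectors as a '[v1,v2,...]' literal string.
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`;
}

// `<=>` is pgvector's cosine-distance operator, so similarity = 1 - distance.
const SIMILARITY_QUERY = `
  SELECT id, content, 1 - (embedding <=> $1::vector) AS score
  FROM chunks
  ORDER BY embedding <=> $1::vector
  LIMIT $2
`;

// Usage (requires a live Postgres instance; not run here):
// const { rows } = await pool.query(SIMILARITY_QUERY,
//   [toVectorLiteral(queryEmbedding), 5]);
```

Because the embeddings live in the same database as your relational data, you can join similarity results directly against ownership and permission tables in one query, something a dedicated vector store cannot do.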
Embedding Generation and Chunking Strategy
Chunking strategy matters more than embedding model choice. Overlapping chunks of 512 tokens with a 64-token overlap consistently outperform both smaller and larger chunks for document Q&A tasks. For structured documents like statements of work or contracts, section-aware chunking that respects heading boundaries delivers significantly better retrieval accuracy.
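Section-aware chunking can be sketched as a first pass that splits on heading boundaries before any token-based splitting happens; a real pipeline would then sub-split oversized sections by token count. This is a minimal sketch for markdown-style headings only.

```typescript
interface Section {
  heading: string;
  content: string;
}

// Split markdown into sections at heading boundaries, so no chunk
// straddles two sections. Oversized sections would be sub-split later.
function splitByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: '', content: '' };
  for (const line of markdown.split('\n')) {
    if (/^#{1,6} /.test(line)) {
      // Close out the previous section if it holds anything.
      if (current.heading || current.content.trim()) sections.push(current);
      current = { heading: line.replace(/^#+ /, ''), content: '' };
    } else {
      current.content += line + '\n';
    }
  }
  if (current.heading || current.content.trim()) sections.push(current);
  return sections;
}
```

For a statement of work, this keeps "Scope" and "Payment Terms" in separate chunks, so a retrieval query about invoicing never pulls in half a scope paragraph as noise.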
```typescript
// Production RAG Pipeline with vector search and LLM generation
import { OpenAIEmbeddings } from '@langchain/openai';
import { PineconeStore } from '@langchain/pinecone';
import { Pinecone } from '@pinecone-database/pinecone';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
import { OpenAI } from 'openai';

interface RAGConfig {
  indexName: string;
  namespace: string;
  topK: number;
  scoreThreshold: number;
}

interface RetrievedChunk {
  content: string;
  metadata: { source: string; page: number; score: number };
}

class RAGPipeline {
  private embeddings: OpenAIEmbeddings;
  private vectorStore: PineconeStore;
  private splitter: RecursiveCharacterTextSplitter;
  private config: RAGConfig;

  private constructor(config: RAGConfig, vectorStore: PineconeStore) {
    this.config = config;
    this.vectorStore = vectorStore;
    this.embeddings = new OpenAIEmbeddings({
      model: 'text-embedding-3-large',
      dimensions: 3072,
    });
    this.splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 512,
      chunkOverlap: 64,
      separators: ['\n## ', '\n### ', '\n\n', '\n', ' '],
    });
  }

  static async create(config: RAGConfig): Promise<RAGPipeline> {
    const embeddings = new OpenAIEmbeddings({
      model: 'text-embedding-3-large',
      dimensions: 3072,
    });
    const pinecone = new Pinecone();
    const index = pinecone.index(config.indexName);
    const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
      pineconeIndex: index,
      namespace: config.namespace,
    });
    return new RAGPipeline(config, vectorStore);
  }

  // Stage 1 & 2: Ingest and Index
  async ingestDocument(
    content: string,
    metadata: { source: string; docId: string }
  ): Promise<{ chunksIndexed: number }> {
    const chunks = await this.splitter.createDocuments(
      [content],
      [metadata],
      { chunkHeader: `SOURCE: ${metadata.source}\n\n` }
    );
    await this.vectorStore.addDocuments(chunks);
    return { chunksIndexed: chunks.length };
  }

  // Stage 3: Retrieve relevant chunks
  async retrieve(query: string): Promise<RetrievedChunk[]> {
    const results = await this.vectorStore.similaritySearchWithScore(
      query,
      this.config.topK
    );
    return results
      .filter(([, score]) => score >= this.config.scoreThreshold)
      .map(([doc, score]) => ({
        content: doc.pageContent,
        metadata: {
          source: doc.metadata.source,
          page: doc.metadata.page ?? 0,
          score: Math.round(score * 1000) / 1000,
        },
      }));
  }

  // Stage 4: Generate with retrieved context
  async queryWithContext(
    query: string,
    systemPrompt: string
  ): Promise<{ answer: string; sources: RetrievedChunk[] }> {
    const chunks = await this.retrieve(query);
    const contextBlock = chunks
      .map((c, i) => `[Source ${i + 1}: ${c.metadata.source}]\n${c.content}`)
      .join('\n\n---\n\n');

    const augmentedPrompt = `${systemPrompt}

Use the following context to answer the user's question.
If the context does not contain enough information, say so explicitly.

CONTEXT:
${contextBlock}

USER QUESTION:
${query}`;

    const client = new OpenAI();
    const response = await client.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: augmentedPrompt }],
      temperature: 0.2,
      max_tokens: 1024,
    });

    return {
      answer: response.choices[0].message.content ?? '',
      sources: chunks,
    };
  }
}

// Express route using the RAG pipeline
const rag = await RAGPipeline.create({
  indexName: 'documents',
  namespace: 'production',
  topK: 5,
  scoreThreshold: 0.75,
});

app.post('/api/knowledge/query', async (req, res) => {
  const { question } = req.body;
  const result = await rag.queryWithContext(
    question,
    'You are a helpful assistant for a document automation platform.'
  );
  res.json({
    answer: result.answer,
    sources: result.sources.map((s) => ({
      source: s.metadata.source,
      relevance: s.metadata.score,
    })),
  });
});
```
Docs: LangChain JS | Pinecone | OpenAI Embeddings
Production tip
Always include a scoreThreshold filter on your vector search results. Without it, the LLM will receive low-relevance chunks that confuse the generation. A threshold of 0.75 on cosine similarity is a good starting point for most document Q&A use cases.
Part 3: LLM Integration Patterns
Integrating an LLM into a production backend is not simply wrapping an API call. You need prompt versioning, token optimization, streaming, and graceful degradation. These patterns turn a fragile prototype into a resilient production system.
Prompt Management and Versioning
Treat prompts like code. Store them in version-controlled template files, not inline strings. Use a prompt registry that maps prompt IDs to versioned templates with variable interpolation. This lets you A/B test prompt variants, roll back regressions, and audit changes — critical when your prompts drive business logic in document automation workflows.
Token Optimization and Semantic Caching
Token cost is the largest line item for AI-heavy backends. Three techniques consistently reduce costs by 40-60%: prompt compression (stripping redundant whitespace and instructions), semantic caching (caching responses for semantically similar queries, not just exact matches), and tiered model routing (using a smaller model for simple tasks and a larger model only when complexity warrants it).
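Tiered model routing can be sketched as a heuristic complexity score that decides which model handles a request. The scoring rules, thresholds, and model names below are illustrative assumptions; production routers typically combine heuristics like these with a small classifier model.

```typescript
interface RoutingDecision {
  model: string;
  reason: string;
}

// Crude complexity heuristic (illustrative): long inputs, reasoning
// verbs, or multi-question prompts escalate to the larger model.
function scoreComplexity(prompt: string): number {
  let score = 0;
  if (prompt.length > 2000) score += 2;
  if (/\b(step[- ]by[- ]step|compare|analyze|reason)\b/i.test(prompt)) score += 2;
  if ((prompt.match(/\?/g)?.length ?? 0) > 1) score += 1;
  return score;
}

// Default to the cheap model; escalate only when complexity warrants it.
function routeModel(prompt: string): RoutingDecision {
  const score = scoreComplexity(prompt);
  return score >= 2
    ? { model: 'gpt-4o', reason: `complexity=${score}` }
    : { model: 'gpt-4o-mini', reason: `complexity=${score}` };
}
```

Because simple lookups and extractions vastly outnumber multi-step reasoning requests in most workloads, even a crude router like this shifts the bulk of traffic onto the cheaper tier.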
Streaming Responses with Server-Sent Events
For any endpoint that generates text longer than a sentence, streaming is not optional. Users perceive a streaming response as significantly faster than a batch response with the same total latency. The pattern below shows how to pipe OpenAI's streaming API directly through your Express backend to the client.
```typescript
// LLM Streaming with prompt versioning and fallback
import { OpenAI, AzureOpenAI } from 'openai';
import type { Request, Response } from 'express';

// Prompt registry — version-controlled templates
const PROMPT_REGISTRY = {
  'document-draft': {
    v1: {
      system: `You are a professional document writer.
Generate content based on the template variables provided.
Output clean, well-structured prose.`,
      maxTokens: 4096,
      temperature: 0.7,
    },
    v2: {
      system: `You are a professional document writer for a B2B SaaS platform.
Generate content using the template variables below.
Follow the document type conventions. Be concise and professional.
Output well-structured prose with clear section headings.`,
      maxTokens: 4096,
      temperature: 0.5,
    },
  },
} as const;

type PromptId = keyof typeof PROMPT_REGISTRY;

interface StreamOptions {
  promptId: PromptId;
  promptVersion: string;
  userMessage: string;
  onToken?: (token: string) => void;
}

class LLMService {
  private primary: OpenAI;
  private fallback: AzureOpenAI;

  constructor() {
    this.primary = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
    this.fallback = new AzureOpenAI({
      apiKey: process.env.AZURE_OPENAI_API_KEY,
      endpoint: process.env.AZURE_OPENAI_ENDPOINT,
      apiVersion: '2024-08-01-preview',
    });
  }

  async streamCompletion(options: StreamOptions, res: Response): Promise<void> {
    const promptConfig =
      PROMPT_REGISTRY[options.promptId]?.[
        options.promptVersion as keyof (typeof PROMPT_REGISTRY)[PromptId]
      ];
    if (!promptConfig) {
      throw new Error(`Unknown prompt: ${options.promptId}@${options.promptVersion}`);
    }

    // Set SSE headers
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');

    try {
      const stream = await this.primary.chat.completions.create({
        model: 'gpt-4o',
        stream: true,
        max_tokens: promptConfig.maxTokens,
        temperature: promptConfig.temperature,
        messages: [
          { role: 'system', content: promptConfig.system },
          { role: 'user', content: options.userMessage },
        ],
      });

      let totalTokens = 0;
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) {
          totalTokens++; // Note: approximate — each chunk may contain multiple tokens
          res.write(`data: ${JSON.stringify({ token: content })}\n\n`);
        }
      }

      // Send completion event with metadata
      res.write(`data: ${JSON.stringify({ done: true, totalTokens })}\n\n`);
      res.end();
    } catch (error) {
      // Fallback to Azure OpenAI on primary failure
      console.error('Primary LLM failed, falling back:', error);
      const fallbackResponse = await this.fallback.chat.completions.create({
        model: 'gpt-4o',
        max_tokens: promptConfig.maxTokens,
        temperature: promptConfig.temperature,
        messages: [
          { role: 'system', content: promptConfig.system },
          { role: 'user', content: options.userMessage },
        ],
      });
      const content = fallbackResponse.choices[0]?.message?.content ?? '';
      res.write(`data: ${JSON.stringify({ token: content })}\n\n`);
      res.write(`data: ${JSON.stringify({ done: true, fallback: true })}\n\n`);
      res.end();
    }
  }
}

// Express route with streaming
const llm = new LLMService();

app.post('/api/ai/generate', async (req, res) => {
  const { promptId, version, input } = req.body;
  await llm.streamCompletion(
    { promptId, promptVersion: version, userMessage: input },
    res
  );
});
```
Docs: OpenAI Streaming | Azure OpenAI
Fallback strategy
Always configure a fallback LLM provider. If your primary is OpenAI, use Azure OpenAI or Anthropic as your secondary. The fallback should be a non-streaming batch call — simpler to implement and sufficient for error scenarios. Monitor your fallback rate; if it exceeds 2%, investigate the primary provider's reliability.
Part 4: ML Model Serving
Not every AI task needs an LLM. Classification, anomaly detection, recommendation, and scoring tasks are better served by purpose-built ML models that are faster, cheaper, and more predictable than generative models. The challenge is serving these models reliably at scale.
Serving Frameworks: TensorFlow Serving vs FastAPI
TensorFlow Serving is the production standard for TensorFlow and SavedModel formats. It supports model versioning, canary deployments, and batching out of the box. FastAPI + ONNX Runtime is the flexible alternative when your models come from PyTorch, scikit-learn, or XGBoost. FastAPI gives you full control over preprocessing, validation, and response formatting while ONNX Runtime handles cross-framework inference.
Model Versioning and A/B Testing
Model versioning follows the same principles as service versioning in microservices. Every model artifact gets a semantic version. A/B testing routes a percentage of traffic to the new model version while monitoring accuracy, latency, and business metrics. Only promote when the challenger outperforms the champion on your primary metric.
Feature Stores for Consistent Inference
A feature store guarantees that the features used during training are identical to those used during inference. Without one, training-serving skew silently degrades model accuracy. Tools like Feast, Tecton, and even a simple Redis-backed feature cache solve this problem at different scales.
```typescript
// ML Model Serving with versioning and A/B routing
import express from 'express';
import { Redis } from 'ioredis';
import { InferenceSession, Tensor } from 'onnxruntime-node';

interface ModelVersion {
  version: string;
  session: InferenceSession;
  trafficWeight: number; // 0-100
}

interface PredictionResult {
  prediction: number[];
  confidence: number;
  modelVersion: string;
  latencyMs: number;
}

class ModelServingService {
  private models: Map<string, ModelVersion[]> = new Map();

  async loadModel(
    modelName: string,
    version: string,
    modelPath: string,
    trafficWeight: number
  ): Promise<void> {
    const session = await InferenceSession.create(modelPath);
    const versions = this.models.get(modelName) ?? [];
    versions.push({ version, session, trafficWeight });
    this.models.set(modelName, versions);
  }

  // Weighted random routing for A/B testing
  private selectVersion(modelName: string): ModelVersion {
    const versions = this.models.get(modelName);
    if (!versions?.length) throw new Error(`No model: ${modelName}`);
    const totalWeight = versions.reduce((sum, v) => sum + v.trafficWeight, 0);
    let random = Math.random() * totalWeight;
    for (const version of versions) {
      random -= version.trafficWeight;
      if (random <= 0) return version;
    }
    return versions[versions.length - 1];
  }

  async predict(modelName: string, features: number[]): Promise<PredictionResult> {
    const model = this.selectVersion(modelName);
    const start = performance.now();
    // ONNX Runtime expects a typed array for float32 tensors
    const inputTensor = new Tensor('float32', Float32Array.from(features), [
      1,
      features.length,
    ]);
    const results = await model.session.run({
      [model.session.inputNames[0]]: inputTensor,
    });
    const output = results[model.session.outputNames[0]].data as Float32Array;
    return {
      prediction: Array.from(output),
      confidence: Math.max(...Array.from(output)),
      modelVersion: model.version,
      latencyMs: performance.now() - start,
    };
  }
}

// Feature Store — Redis-backed for real-time serving
class FeatureStore {
  private redis: Redis;
  private ttl: number;

  constructor(redis: Redis, ttlSeconds = 300) {
    this.redis = redis;
    this.ttl = ttlSeconds;
  }

  async getFeatures(
    entityId: string,
    featureNames: string[]
  ): Promise<Record<string, number>> {
    const pipeline = this.redis.pipeline();
    for (const name of featureNames) {
      pipeline.hget(`features:${entityId}`, name);
    }
    const results = await pipeline.exec();
    const features: Record<string, number> = {};
    featureNames.forEach((name, i) => {
      features[name] = parseFloat((results?.[i]?.[1] as string) ?? '0');
    });
    return features;
  }

  async setFeatures(
    entityId: string,
    features: Record<string, number>
  ): Promise<void> {
    const pipeline = this.redis.pipeline();
    for (const [name, value] of Object.entries(features)) {
      pipeline.hset(`features:${entityId}`, name, value.toString());
    }
    pipeline.expire(`features:${entityId}`, this.ttl);
    await pipeline.exec();
  }
}

// Express route: classify a document using ML model
const modelService = new ModelServingService();

// Load two model versions for A/B testing
await modelService.loadModel('doc-classifier', 'v2.1', './models/v2.1.onnx', 80);
await modelService.loadModel('doc-classifier', 'v2.2', './models/v2.2.onnx', 20);

app.post('/api/ml/classify', async (req, res) => {
  const { features } = req.body;
  const result = await modelService.predict('doc-classifier', features);
  res.json({
    prediction: result.prediction,
    confidence: result.confidence,
    model: result.modelVersion,
    latencyMs: Math.round(result.latencyMs),
  });
});
```
Docs: ONNX Runtime Node.js
Real-time vs batch inference
Use real-time inference for user-facing requests where latency matters (classification, scoring, recommendations). Use batch inference for offline tasks like retraining data preparation, bulk scoring, and periodic report generation. The cost difference is significant — batch inference on GPU instances is 3-5x cheaper per prediction than real-time serving.
Part 5: AI Observability
Traditional APM tools measure request latency and error rates. AI backends need a fundamentally different observability stack: one that tracks prompt versions, token costs, model accuracy, and output quality drift. Without it, your AI features will silently degrade as data distributions shift and prompt effectiveness decays.
Monitoring Model Performance
Track three categories of metrics: operational (latency, throughput, error rates), quality (accuracy, precision, recall, user feedback scores), and business (conversion rate, task completion rate, cost per prediction). Most teams only monitor operational metrics and are blindsided when quality degrades.
Prompt Tracking and Cost Management
Every LLM call should log the prompt version, input tokens, output tokens, model used, latency, and cache hit status. Aggregate these into a cost dashboard that shows spend by prompt, by model, and by endpoint. At TurboDocx's API layer, this dashboard helped us identify a single prompt that was consuming 40% of our monthly token budget — we optimized it and cut costs by 35%.
Drift Detection
Model drift happens when the distribution of input data changes over time, causing model predictions to degrade. For ML models, track feature distribution statistics (mean, variance, quantiles) and alert when they deviate beyond a threshold. For LLMs, track output length distribution, refusal rates, and structured output parsing failure rates as proxies for quality drift.
```typescript
// AI Observability Layer — track every AI operation
import { EventEmitter } from 'events';
import type { Request, Response, NextFunction } from 'express';

interface AIMetric {
  timestamp: Date;
  endpoint: string;
  promptId: string;
  promptVersion: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  cached: boolean;
  success: boolean;
  costUsd: number;
  userFeedback?: 'positive' | 'negative';
}

// Token pricing per model (per 1M tokens, as of 2026)
// Note: Verify current pricing against official API docs — rates change frequently
const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
  'claude-sonnet-4-5': { input: 3.00, output: 15.00 },
  'text-embedding-3-large': { input: 0.13, output: 0 },
};

class AIObserver extends EventEmitter {
  private metrics: AIMetric[] = [];
  private flushInterval: NodeJS.Timeout;

  constructor(private flushIntervalMs = 30_000) {
    super();
    this.flushInterval = setInterval(() => this.flush(), flushIntervalMs);
  }

  // Calculate cost based on model pricing
  private calculateCost(
    model: string,
    inputTokens: number,
    outputTokens: number
  ): number {
    const pricing = MODEL_PRICING[model];
    if (!pricing) return 0;
    return (
      (inputTokens / 1_000_000) * pricing.input +
      (outputTokens / 1_000_000) * pricing.output
    );
  }

  record(metric: Omit<AIMetric, 'timestamp' | 'costUsd'>): void {
    const fullMetric: AIMetric = {
      ...metric,
      timestamp: new Date(),
      costUsd: metric.cached
        ? 0
        : this.calculateCost(metric.model, metric.inputTokens, metric.outputTokens),
    };
    this.metrics.push(fullMetric);
    this.emit('metric', fullMetric);

    // Alert on high-latency operations
    if (fullMetric.latencyMs > 5000) {
      this.emit('alert', {
        type: 'high_latency',
        metric: fullMetric,
        message: `AI call to ${metric.endpoint} took ${metric.latencyMs}ms`,
      });
    }
  }

  // Aggregate metrics for dashboard
  getSummary(windowMinutes = 60): {
    totalCost: number;
    totalRequests: number;
    cacheHitRate: number;
    avgLatencyMs: number;
    errorRate: number;
    costByModel: Record<string, number>;
    costByPrompt: Record<string, number>;
  } {
    const cutoff = new Date(Date.now() - windowMinutes * 60 * 1000);
    const window = this.metrics.filter((m) => m.timestamp >= cutoff);
    const totalCost = window.reduce((sum, m) => sum + m.costUsd, 0);
    const cached = window.filter((m) => m.cached).length;
    const errors = window.filter((m) => !m.success).length;
    const avgLatency =
      window.reduce((sum, m) => sum + m.latencyMs, 0) / (window.length || 1);
    const costByModel: Record<string, number> = {};
    const costByPrompt: Record<string, number> = {};
    for (const m of window) {
      costByModel[m.model] = (costByModel[m.model] ?? 0) + m.costUsd;
      costByPrompt[m.promptId] = (costByPrompt[m.promptId] ?? 0) + m.costUsd;
    }
    return {
      totalCost: Math.round(totalCost * 100) / 100,
      totalRequests: window.length,
      cacheHitRate: window.length ? cached / window.length : 0,
      avgLatencyMs: Math.round(avgLatency),
      errorRate: window.length ? errors / window.length : 0,
      costByModel,
      costByPrompt,
    };
  }

  // Drift detection: compare current distribution to baseline
  detectDrift(
    baselineAvgTokens: number,
    baselineRefusalRate: number,
    windowMinutes = 60
  ): { tokenDrift: boolean; refusalDrift: boolean } {
    const cutoff = new Date(Date.now() - windowMinutes * 60 * 1000);
    const window = this.metrics.filter((m) => m.timestamp >= cutoff && m.success);
    const avgOutputTokens =
      window.reduce((sum, m) => sum + m.outputTokens, 0) / (window.length || 1);
    const refusals = window.filter((m) => m.outputTokens < 10).length;
    const currentRefusalRate = window.length ? refusals / window.length : 0;
    return {
      tokenDrift:
        Math.abs(avgOutputTokens - baselineAvgTokens) / baselineAvgTokens > 0.3,
      refusalDrift: Math.abs(currentRefusalRate - baselineRefusalRate) > 0.05,
    };
  }

  private async flush(): Promise<void> {
    if (this.metrics.length === 0) return;
    // Flush to your analytics pipeline (BigQuery, ClickHouse, etc.)
    const batch = this.metrics.splice(0, this.metrics.length);
    console.log(`Flushing ${batch.length} AI metrics`);
    // await analyticsClient.insert('ai_metrics', batch);
  }

  destroy(): void {
    clearInterval(this.flushInterval);
    this.flush();
  }
}

// Usage: wrap every AI call with the observer
const observer = new AIObserver();
observer.on('alert', (alert) => {
  // Send to PagerDuty, Slack, etc.
  console.warn('AI Alert:', alert.message);
});

// Express middleware to track AI metrics
function trackAICall(endpoint: string, promptId: string, promptVersion: string) {
  return (req: Request, res: Response, next: NextFunction) => {
    const start = performance.now();
    const originalJson = res.json.bind(res);
    res.json = ((body: Record<string, unknown>) => {
      observer.record({
        endpoint,
        promptId,
        promptVersion,
        model: (body.model as string) ?? 'unknown',
        inputTokens: (body.inputTokens as number) ?? 0,
        outputTokens: (body.outputTokens as number) ?? 0,
        latencyMs: performance.now() - start,
        cached: (body.cached as boolean) ?? false,
        success: res.statusCode < 400,
      });
      return originalJson(body);
    }) as typeof res.json;
    next();
  };
}
```
Cost management rule of thumb
Set per-endpoint daily cost budgets and circuit-break when they are exceeded. A single prompt regression can 10x your daily spend if left unchecked. Tools like LangSmith, Helicone, and custom dashboards make this straightforward. Pair cost alerts with your existing workflow automation to auto-disable runaway endpoints.
Putting It All Together: Architecture Reference
A complete AI-powered backend combines these patterns into a layered architecture. Here is how the layers interact in a production system — using TurboDocx Writer's AI document generation as a concrete example.
Request flow: AI document generation
1. Client POST /api/documents/generate
→ AI Middleware (auth, rate limit, validate)
→ Feature Store (retrieve user preferences, org settings)
→ RAG Pipeline (retrieve relevant template examples)
→ LLM Service (stream generation with prompt v2.1)
→ AI Observer (log tokens, latency, cost)
→ SSE stream to client
Total latency: 200ms to first token, 3-8s to completion
Cost: ~$0.003 per document (with semantic caching)
Each layer is independently testable, swappable, and observable. The RAG pipeline can be replaced with a fine-tuned model. The LLM provider can be swapped from OpenAI to Anthropic. The feature store can scale from Redis to Feast. None of these changes require modifying the request flow.
This modular approach is the same philosophy behind building scalable IT platforms — isolate concerns, define clear interfaces, and make every component replaceable without a rewrite.
Key Takeaways
Start with RAG Before Fine-Tuning
RAG pipelines give you 80% of the quality of a fine-tuned model at 10% of the cost. Always validate that retrieval-augmented generation cannot solve your problem before investing in model training.
Stream Everything
LLM responses take seconds. Streaming tokens to the client as they are generated turns a 5-second wait into an instant-feeling interaction. Use Server-Sent Events or WebSockets for all generative endpoints.
Cache Aggressively at Every Layer
Embedding generation, LLM completions, and vector search results are all cacheable. A prompt cache alone can reduce your LLM API costs by 40-60% in production workloads.
Observe Prompts Like You Observe Code
Every prompt is a function call with variable inputs. Log prompt versions, track token usage, measure latency percentiles, and set up drift alerts. AI systems degrade silently without proper observability.
AI Backend Production Checklist
Before shipping an AI-powered feature to production, run through this checklist. Each item has caused production incidents at real companies.
Every LLM call goes through a service layer — never called directly from route handlers
Semantic caching is enabled for all deterministic prompts (temperature < 0.3)
Streaming is implemented for all generative endpoints
A fallback LLM provider is configured and tested
Per-endpoint daily cost budgets are set with circuit breakers
Prompt versions are logged with every request
Token usage, latency, and cache hit rates are tracked in a dashboard
Vector search results are filtered by a minimum score threshold
Input validation rejects prompts that exceed context window limits
Rate limiting is applied per user and per organization
Model A/B testing infrastructure is in place before deploying new model versions
Drift detection alerts are configured for both ML models and LLM outputs
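The context-window validation item from the checklist above can be sketched as a pre-flight check. The 4-characters-per-token estimate is a rough heuristic and the context limits are illustrative; production code should use a real tokenizer (e.g. tiktoken) and verify each model's actual limit against provider documentation.

```typescript
// Illustrative context limits — verify against provider docs
const CONTEXT_LIMITS: Record<string, number> = {
  'gpt-4o': 128_000,
};

// Rough heuristic: ~4 characters per token for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Reject a request before spending money on a call that will fail,
// reserving headroom for the model's output tokens.
function validatePromptSize(
  model: string,
  prompt: string,
  reservedForOutput: number
): { ok: boolean; estimatedTokens: number } {
  const limit = CONTEXT_LIMITS[model] ?? 8_000; // conservative default
  const estimatedTokens = estimateTokens(prompt);
  return { ok: estimatedTokens + reservedForOutput <= limit, estimatedTokens };
}
```

Running this check in middleware means oversized documents are rejected with a clear 4xx error instead of surfacing as an opaque provider failure mid-request.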
Related Resources
Microservices & Event-Driven Architecture
Patterns for building distributed backend systems that AI services plug into — event sourcing, CQRS, and saga orchestration.
React Performance Optimization
Optimize the frontend that consumes your AI-powered API endpoints — memoization, code splitting, and virtualization.
TurboDocx API & SDK
See how TurboDocx exposes AI-powered document generation through its API and developer toolkit.
TurboDocx for Developers
How developers integrate AI-powered document automation into their applications and workflows.
Build AI-Powered Document Workflows
TurboDocx uses these exact patterns to power intelligent document generation, template automation, and AI-driven content creation — so you can focus on building great products.
