The Challenge
Business Context
We built a document intelligence platform for mid-market financial services firms — think regional banks and insurance companies processing contracts, compliance documents, and customer correspondence through LLM-based extraction and summarization pipelines. At peak, 40,000 documents per day, each triggering between 3 and 12 LLM calls depending on document type and complexity.
The business constraint was simple: our customers were regulated entities. If our system exposed one customer's document content to another customer's LLM context, or generated a summary that contradicted the actual document in a legally significant way, we would lose the customer and potentially face regulatory action alongside them. Guardrails weren't a feature — they were the product.
We had 6 months from initial deployment to demonstrate enough control over AI output quality and safety to satisfy two enterprise customer security audits. This is the honest account of what we built, what broke, and what we'd change.
Technical Constraints
The stack we inherited: Python FastAPI services on AWS ECS, PostgreSQL (RDS) for job state, S3 for document storage, and an early integration with OpenAI's GPT-4 (before gpt-4o existed). The team was 4 engineers total — no dedicated ML or AI safety specialist.
Non-negotiable constraints:
- Latency SLA: Document processing must complete within 90 seconds end-to-end for 95th percentile
- Multi-tenant isolation: Zero tolerance for cross-tenant data leakage. Each financial firm's documents are logically isolated.
- Audit trail: Every LLM call must be logged in immutable, queryable storage for compliance audits
- Availability: 99.9% uptime. Guardrail failures cannot take down the processing pipeline.
What we didn't have: a dedicated vector database (we used PostgreSQL pgvector), a dedicated safety classifier service, or any prior ML infrastructure.
Scale Requirements
Month 1: 500 documents/day. Month 3: 8,000 documents/day. Month 6: 40,000 documents/day. We had to build a system that worked at Month 1 scale without requiring a rewrite at Month 6 scale.
The growth curve was the main architectural challenge. A guardrail that adds 500ms synchronously is fine at 500 requests/day. At 40,000 requests/day distributed across business hours, that same synchronous check becomes a bottleneck that requires either optimization or a fundamental redesign.
Architecture Decision
Options Evaluated
We evaluated three guardrail architectures before settling on our final approach:
Option A: Fully synchronous, in-process guardrails. Every check runs in the FastAPI request handler before LLM calls. Simple, easy to reason about, no additional infrastructure. Problem: adding multiple checks at 200-500ms each would blow our 90-second SLA at scale.
Option B: Dedicated guardrail microservice. All checks go through a separate service with its own scaling. Clean separation of concerns, independently scalable. Problem: at Month 1 scale, this is two engineers maintaining infrastructure instead of building features. And we'd be adding an additional network hop to every document processing call.
Option C: Layered async pipeline with synchronous blocking layer. Synchronous checks only for the highest-risk categories (PII leakage, cross-tenant data isolation). All other checks run async post-processing with a separate human review queue for flagged items.
We chose Option C.
Decision Criteria
The key insight was that not all guardrail failures have the same consequence:
| Failure type | Consequence | Required: block or flag? |
|---|---|---|
| Cross-tenant data leak | Immediate regulatory/legal incident | Block synchronously |
| PII in output | Compliance violation, customer complaint | Block synchronously |
| Hallucination in summary | Poor product quality, customer complaint | Flag async, human review |
| Inconsistency across documents | Product quality issue | Flag async, weekly review |
| Prompt injection attempt | Security incident | Block synchronously |
Only three categories required synchronous blocking. Everything else could be async with a review queue. This let us keep the synchronous blocking layer thin and fast (<100ms total) while still catching quality issues post-hoc.
Final Architecture
Immutable audit log: every LLM call logged to CloudWatch Logs with a 7-year retention policy (regulatory requirement), indexed in OpenSearch for queries.
Implementation
Phase 1: Foundation
The synchronous guardrail layer. This had to be fast and reliable because it ran on every document, in the critical path.
Phase 2: Core Features
The async quality pipeline ran as a Lambda function triggered after the synchronous processing completed. Its job was hallucination detection and consistency checking:
Phase 3: Optimization
By month 4, with 20,000 documents/day, we had three concrete performance problems:
Problem 1: AWS Comprehend PII detection was adding 150-300ms per call. At scale, this was significant.
Fix: Cache Comprehend results by a hash of the input text segment. Documents in financial services are often templated — the same boilerplate appears across thousands of documents. Cache hit rate reached 34% within a week.
Problem 2: Hallucination check via LLM was expensive — $0.04 per document when checking every output. At 40,000 documents/day, that's $1,600/day on quality checks alone.
Fix: Risk-tiered checking. High-risk document types (contracts, compliance filings) checked every output. Low-risk types (routine correspondence) sampled at 10%. Reduced hallucination check costs by 78% with no detectable change in quality incident rate.
Problem 3: The human review queue was backlogging. Reviewers couldn't keep up.
Fix: Added a confidence-based routing layer. Items with hallucination confidence > 0.95 were auto-rejected (re-processed). Items with confidence 0.7-0.95 went to human review. Items below 0.7 were auto-approved (classifier uncertainty, not actual hallucination).
Need a second opinion on your AI systems architecture?
I run free 30-minute strategy calls for engineering teams tackling this exact problem.
Book a Free CallResults & Metrics
Performance Gains
After 6 months of iteration:
- Synchronous guardrail latency: 82ms median (down from 340ms at launch, before caching)
- End-to-end document processing: 38 seconds median, 71 seconds p95 (well within the 90s SLA)
- Cross-tenant data leak incidents: 0 in production
- Prompt injection attempts blocked: 847 over 6 months (0.003% of total requests — lower than expected)
The hallucination detection system flagged 2.3% of all document outputs for human review. Of those flagged, 61% were confirmed as having at least one unsupported claim. That's a meaningful catch rate for a financial services context.
Cost Impact
| Component | Month 1 cost | Month 6 cost | Notes |
|---|---|---|---|
| AWS Comprehend PII | $12 | $1,840 | Volume-driven, caching helped |
| Hallucination checks (LLM) | $31 | $1,620 | Risk-tiered reduced from $6,200 est. |
| Redis cache | $0 | $180 | 34% cache hit rate offset Comprehend costs |
| CloudWatch Logs (audit) | $8 | $890 | Regulatory requirement, non-negotiable |
Total guardrail cost at 40,000 docs/day: ~$4,530/month, approximately 18% of total infrastructure costs. We considered this acceptable for the risk management function it provided.
Developer Productivity
Unexpected finding: the guardrail system accelerated development of new AI features, not slowed it. Because we had a standardized interface for adding guardrails, engineers could ship new LLM features with confidence that the safety layer handled the common risks. The pre-launch checklist went from "fill this out before the security team will approve" to "oh right, I should check these 6 things" — a sign that it had become part of the workflow rather than overhead.
Lessons Learned
What Worked
Risk tiering was the right call. Treating hallucination as "flag and review" rather than "block synchronously" was the decision that made the system viable. 100% synchronous blocking of uncertain outputs would have required the review queue to handle thousands of items per day — economically and operationally impossible for a 4-person team.
AWS Comprehend for PII was the right trade-off. We evaluated building our own PII classifier. At startup scale, Comprehend's accuracy (92% precision, 88% recall on our financial document corpus) was sufficient and saved 6 weeks of ML work. We would revisit this at 3x scale when the cost becomes a real line item.
Immutable audit logs paid dividends. One enterprise customer requested a compliance audit 3 months after contract signing. We pulled a complete, tamper-evident log of every LLM call for their document portfolio in 4 hours. That capability closed a deal with a second customer who was watching the audit closely.
What Surprised Us
Prompt injection attempts were rarer than expected. We anticipated this as the primary attack vector. In practice, the injection patterns we blocked came mostly from poorly formatted customer documents that happened to contain phrases like "ignore previous" in legal disclaimers — false positives, not attacks. We tuned the patterns down significantly after the first month.
Hallucination was correlated with document quality, not model behavior. Our hallucination detector flagged outputs on documents that were scanned PDFs with poor OCR quality. The model was doing its best with ambiguous source text. We added an input quality check (OCR confidence score) that caught these before LLM processing, reducing flagged hallucinations by 40%.
Cache invalidation for PII patterns was an edge case we missed. When we updated our PII detection patterns, stale cache entries using old patterns persisted for an hour. For a one-hour window, documents processed from cache used the outdated detection logic. We added cache versioning and a pattern change procedure.
Key Takeaways
-
Ship the synchronous blocking layer first. It's the only one that prevents incidents. Everything else is quality assurance.
-
Risk-tier your checks. Not all safety failures have the same consequence. Reserve synchronous blocking for the ones that would be immediately harmful.
-
Build the audit log before you need it. It will be asked for at the worst possible time (a customer audit, an incident investigation). The cost of building it retroactively under time pressure is high.
-
Measure false positive rates weekly. Guardrails that over-block are a product quality problem, not just a safety nuance. Every blocked legitimate document erodes customer trust.
What We'd Do Differently
Architecture Changes
Introduce a dedicated guardrail service earlier. We kept guardrail logic in the main processing service for the first 4 months. By the time we extracted it, we had enough interdependencies that the extraction took 2 weeks instead of 2 days. A shared Python package with a stable interface, deployed to a separate service, would have given us independent scaling from month 2.
Use a purpose-built LLM observation platform. We logged everything to CloudWatch, which worked fine for compliance but made debugging difficult. Platforms like Langfuse or Helicone would have given us cost breakdowns by document type, latency percentiles by guardrail, and per-tenant usage — all of which we ended up building manually in Grafana.
Build the confidence-based auto-routing from day 1. We built the human review queue and manually watched it grow for two months before we added automated routing. We should have designed the routing logic first and added human review as the fallback tier, not the primary tier.
Process Improvements
Establish a "guardrail regression" test suite before launch. When we changed PII patterns or injection detection heuristics, we had no automated way to check that existing safe inputs were still passing. We caught two regressions via customer complaints before we built regression tests. This should have been standard from the first deploy.
Create a shared "known-bad" document corpus for testing. We accumulated adversarial examples over 6 months organically (from real incidents and injection attempts). Having this corpus from month 1 — even with synthetic data — would have given us a more rigorous baseline for guardrail quality.
Conclusion
Building production-grade AI guardrails for regulated industries is fundamentally an exercise in risk stratification. The layered architecture — synchronous blocking for high-severity failures like cross-tenant leakage and PII exposure, async flagging for quality issues like hallucination and inconsistency — let us keep the critical path fast while still catching the long tail of quality problems. The 80-100ms synchronous layer was the key design constraint that made the system viable at 40,000 documents per day without requiring a dedicated guardrail microservice.
The most valuable takeaway is that guardrail systems are living infrastructure, not set-and-forget rules. Our false positive rates shifted significantly as document types changed and LLM model versions updated. Invest early in observability and feedback loops — immutable audit logs, dashboards tracking block rates by category, and a human review queue that feeds corrections back into your classifiers. Start with the simplest check that addresses your highest-risk failure mode, measure its real-world performance, and add complexity only when the data justifies it.