The Challenge
Business Context
This case study documents a real-world migration of a document processing platform from a rule-based extraction pipeline to an agentic AI workflow running on AWS. The platform handled contract review for a mid-sized legal services company — approximately 2,400 contracts per month, each requiring extraction of 15–20 structured fields (parties, dates, governing law, termination clauses, payment terms) and a risk assessment summary.
The previous system: a combination of regex patterns, Apache Tika for text extraction, and a lookup-based classification model trained on 3,000 labeled examples. It handled 70% of contracts accurately with no human review. The remaining 30% required a paralegal to correct the output — a bottleneck that was growing linearly with business volume and represented roughly $18,000/month in labor cost.
The business case for agentic AI: eliminate the majority of the manual review queue, reduce per-contract processing cost, and enable the company to handle 3x contract volume without proportional headcount growth. The target was 92% accuracy (up from 70%) with a processing time under 90 seconds per contract.
Technical Constraints
Several constraints shaped the architecture:
Data residency. Contracts contained privileged legal information. Cloud provider AI services were acceptable; third-party SaaS AI was not. This ruled out direct integrations with OpenAI's API for production — all LLM inference had to run through AWS Bedrock, keeping data within the AWS account boundary and VPC.
Latency SLA. The existing system processed contracts in under 20 seconds. A move to LLM-based processing would inherently increase latency. The business accepted up to 90 seconds for the initial implementation, with a target of under 60 seconds after optimization.
Auditability. Every field extracted and every risk assessment generated needed a provenance record: which document sections informed the extraction, what the LLM was asked, and what it returned. This was a compliance requirement, not a nice-to-have.
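A provenance record like this can be captured as a small structured object per extracted field. The sketch below is illustrative only — the field names and shape are assumptions, not the production schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """One audit entry per extracted field (illustrative shape, not the production schema)."""
    contract_id: str
    field_name: str
    source_sections: list       # section IDs that informed the extraction
    prompt: str                 # exact prompt sent to the LLM
    raw_response: str           # unmodified model output
    extracted_value: str
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = ProvenanceRecord(
    contract_id="C-1042",
    field_name="governing_law",
    source_sections=["sec-12"],
    prompt="What is the governing law of this agreement?",
    raw_response="The agreement is governed by the laws of Delaware.",
    extracted_value="Delaware",
)
```

Writing one such record per field per contract keeps the 7-year audit trail queryable rather than buried in raw logs.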
Existing infrastructure. The team was AWS-native — already using SQS for job queuing, S3 for document storage, RDS PostgreSQL for structured data, and Lambda for lightweight processing steps. The new system needed to integrate with these rather than replace them.
Scale Requirements
- Peak load: 120 contracts per hour during business hours (Monday–Friday, 8am–6pm)
- Monthly volume: ~2,400 contracts with expected 40% YoY growth
- Concurrency requirement: support 20 simultaneous contract processing jobs without queue buildup
- Availability: 99.5% uptime during business hours
- Storage: retain full audit trail (inputs, LLM prompts, raw outputs) for 7 years per legal requirement
Architecture Decision
Options Evaluated
Four approaches were evaluated before selecting the final architecture:
Option A: Direct LLM integration per field. One LLM call per field, so 15–20 calls per contract. Pros: simple to implement, easy to test each extraction independently. Cons: 15–20x API cost per contract, very high latency (sequential calls at 2–5 seconds each = 30–100 seconds just for extraction, before risk assessment).
Option B: Single-prompt extraction. One LLM call with all 15–20 fields requested in a single structured output schema. Pros: lowest cost and latency. Cons: performance degraded significantly on contracts over 15 pages (context window constraints) and accuracy dropped for complex nested fields like termination clauses that required reasoning, not just extraction.
Option C: Agentic pipeline with specialized steps. A multi-step workflow where: (1) a document segmentation agent identifies and labels contract sections, (2) a field extraction agent pulls structured data from relevant sections, (3) a risk assessment agent reasons over the full contract. Pros: better accuracy on complex contracts, smaller context per step. Cons: higher implementation complexity, orchestration overhead.
Option D: Hybrid agentic with tool-augmented extraction. A single extraction agent with access to tools for semantic search over the contract sections. The agent calls the search tool to retrieve relevant sections before answering each extraction question. Pros: handles long contracts without full-document context, maintains accuracy on complex fields. Cons: more LLM calls than Option B (but fewer than Option A).
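The core loop of Option D — retrieve relevant sections, then extract from that narrow context — can be sketched as follows. The search and LLM functions here are toy stand-ins (in the real system the search tool queries pgvector and inference runs through Amazon Bedrock):

```python
# Hypothetical stand-ins for the real semantic search tool and Bedrock call.
SECTIONS = {
    "sec-1": "This Agreement is between Acme Corp and Beta LLC.",
    "sec-7": "Either party may terminate with 30 days written notice.",
}

def search_tool(query: str) -> list:
    """Toy keyword match standing in for semantic search over contract sections."""
    words = query.lower().split()
    return [text for text in SECTIONS.values()
            if any(w in text.lower() for w in words)]

def llm_extract(question: str, context: list) -> str:
    """Stub for the Bedrock extraction call; echoes the retrieved context."""
    return context[0] if context else "NOT_FOUND"

def extract_field(question: str) -> str:
    # The agent retrieves relevant sections first, then answers from that
    # narrow context instead of feeding the full document to the model.
    context = search_tool(question)
    return llm_extract(question, context)

answer = extract_field("terminate notice")
```

Because each extraction question sees only its retrieved sections, context size stays bounded regardless of contract length — the property that made Option D viable for 40+ page contracts.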
Decision Criteria
| Criterion | Weight | Option A | Option B | Option C | Option D |
|---|---|---|---|---|---|
| Accuracy on test set | 40% | 88% | 79% | 91% | 90% |
| Avg. processing time | 25% | 85s | 22s | 68s | 41s |
| Cost per contract | 20% | High | Low | Medium | Medium |
| Implementation complexity | 15% | Low | Low | High | Medium |
Option D scored highest overall. The key tiebreaker was Option C's complexity — the team was three engineers and needed to ship in six weeks.
Final Architecture
Key AWS services: S3, SQS, Lambda, ECS Fargate (for LangGraph workflow runner), Amazon Bedrock, RDS PostgreSQL with pgvector extension, CloudWatch for metrics.
Implementation
Phase 1: Foundation
The first two weeks focused on the document preparation pipeline and getting LangGraph running reliably on ECS Fargate. The most important decision here: run the LangGraph orchestration on ECS (not Lambda) because Lambda's 15-minute timeout and 10GB memory limit were borderline for complex contracts.
Phase 2: Core Features
The extraction agent used LangGraph's StateGraph with a simple two-node graph: an extract node that called the LLM with tool access, and a validate node that ran Pydantic validation on the output.
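The control flow of that two-node graph looks roughly like this. This is a plain-Python sketch of the same extract-then-validate shape (the production version used LangGraph's StateGraph; the node names here mirror that graph, and the payloads are invented):

```python
def extract_node(state: dict) -> dict:
    # Stub for the tool-augmented LLM call; returns invented field values.
    state["raw"] = {"governing_law": "Delaware", "notice_days": "30"}
    return state

def validate_node(state: dict) -> dict:
    # Minimal validation gate standing in for the Pydantic model:
    # invalid output flags the job for human review instead of being written.
    raw = state["raw"]
    state["valid"] = raw.get("notice_days", "").isdigit()
    state["needs_review"] = not state["valid"]
    return state

# Run the two nodes in sequence, as the graph edges would.
state = validate_node(extract_node({"contract_id": "C-1042"}))
```

Keeping validation as its own node (rather than inline in the extract step) made it easy to route failed validations to the review queue as a distinct graph edge.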
Phase 3: Optimization
After two weeks in production, three optimization opportunities emerged from the metrics:
1. Parallel field extraction reduced latency by 38%. The original extraction called the LLM once for all fields. Restructuring to call it three times in parallel — financial terms, parties/dates, and risk assessment as separate concurrent tasks — reduced p95 latency from 67s to 41s.
2. Bedrock on-demand vs. provisioned throughput. At peak load (120 contracts/hour), on-demand Bedrock throttled 8% of requests. Switching to provisioned throughput for 20 model units eliminated throttling and reduced the 429 retry overhead.
3. Chunk cache in ElastiCache. Contract embeddings were re-computed on retry. Adding a Redis cache (ElastiCache) for the embedding lookup eliminated re-embedding cost on retries entirely and shaved 3–4 seconds off retried jobs.
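The parallel split in item 1 can be sketched with asyncio. The three extraction groups are the ones named above; the LLM call is a stub with a token sleep standing in for real latency:

```python
import asyncio

async def extract_group(name: str) -> dict:
    """Stub for one Bedrock extraction call covering a group of fields."""
    await asyncio.sleep(0.01)       # stands in for seconds of LLM latency
    return {name: "extracted"}      # placeholder payload

async def extract_contract() -> dict:
    # The three groups run concurrently, so wall-clock time is the slowest
    # single call rather than the sum of all three.
    groups = ["financial_terms", "parties_dates", "risk_assessment"]
    results = await asyncio.gather(*(extract_group(g) for g in groups))
    merged = {}
    for r in results:
        merged.update(r)
    return merged

fields = asyncio.run(extract_contract())
```

The same pattern works with `concurrent.futures` for synchronous SDK clients; the key point is that the groups are independent, so nothing forces them to run sequentially.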
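The cache-aside pattern in item 3 keys on a content hash so identical chunks hit the cache even across retried jobs. A minimal sketch, with a dict standing in for ElastiCache/Redis and a stub for the embedding call:

```python
import hashlib

cache: dict = {}   # dict standing in for the ElastiCache/Redis lookup

def embed(text: str) -> list:
    """Stub for the Bedrock embedding call (the expensive part skipped on retries)."""
    return [float(len(text))]

def cached_embed(text: str) -> list:
    # Cache-aside: compute only on a miss, keyed by content hash so the
    # same chunk embedded by a retried job reuses the stored vector.
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = embed(text)
    return cache[key]

cached_embed("termination clause")   # miss: computes and stores
cached_embed("termination clause")   # hit: no recompute
```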
Results & Metrics
Performance Gains
After three months in production with 7,200 contracts processed:
| Metric | Before | After | Change |
|---|---|---|---|
| Field extraction accuracy | 70% | 91.4% | +21.4pp |
| Manual review queue rate | 30% | 7.2% | -22.8pp |
| p50 processing time | 18s | 38s | +20s |
| p95 processing time | 24s | 58s | +34s |
Latency increased, as expected — the 90-second SLA was met with margin. The accuracy improvement was the primary success metric and exceeded the 92% target in months two and three (month one was 88.2% during the warm-up period).
Cost Impact
| Cost Category | Before | After | Change |
|---|---|---|---|
| Paralegal review labor | $18,000/mo | $4,320/mo | -76% |
| Infrastructure (AWS) | $2,100/mo | $5,800/mo | +176% |
| Net monthly cost | $20,100/mo | $10,120/mo | -50% |
LLM inference costs (Bedrock) ran approximately $0.85 per contract at the initial model size. After switching intermediate steps to Claude Haiku for lower-complexity extractions, this dropped to $0.52 per contract.
Developer Productivity
The engineering team spent approximately 280 hours on implementation and three months of iteration. The payback period on engineering time, at the $9,980/month net savings, was approximately 4 weeks after the system reached production accuracy targets.
The unexpected productivity gain: the extraction system's structured output became the foundation for two additional downstream features (contract comparison and renewal reminders) that would have required separate engineering work without the structured data store.
Lessons Learned
What Worked
Bedrock for data residency. Using Amazon Bedrock exclusively meant zero legal review friction. Privacy and compliance teams approved the architecture in two days — a process that had previously taken six weeks for third-party AI vendors.
pgvector in RDS. The decision to use pgvector in the existing RDS instance rather than a standalone vector database (Pinecone, Weaviate) was correct at this scale. One less service to operate, and the SQL join between contract metadata and vector search results was trivial.
LangSmith for debugging. Even though LangSmith required data to leave the VPC (for the traces), the team used it in staging only for debugging. The step-by-step trace visibility cut debugging time by roughly 60% during development.
Pydantic validation as the quality gate. Every extraction went through strict Pydantic validation before writing to the database. Fields that failed validation were flagged for human review rather than silently written as incorrect data. This preserved data integrity and gave the team a clear quality signal for prompt iteration.
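A minimal version of that gate, assuming Pydantic v2 (field names and the review-routing shape are illustrative, not the production models):

```python
from pydantic import BaseModel, ValidationError

class ContractFields(BaseModel):
    governing_law: str
    notice_days: int    # coerces "30" -> 30, rejects "thirty"

def gate(raw: dict):
    """Return (validated_fields, needs_human_review).

    Failed validations route to the review queue instead of the database,
    so bad extractions are never silently written.
    """
    try:
        return ContractFields(**raw).model_dump(), False
    except ValidationError:
        return None, True

ok, review = gate({"governing_law": "Delaware", "notice_days": "30"})
bad, review2 = gate({"governing_law": "Delaware", "notice_days": "thirty"})
```

The `needs_human_review` flag doubles as the quality signal mentioned above: counting which fields fail validation most often tells you exactly which prompts to iterate on.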
What Surprised Us
Context window mattered less than expected. The initial assumption was that long contracts would degrade accuracy due to context window constraints. In practice, the semantic search tool allowed the agent to retrieve relevant sections without full-document context, and accuracy on 40+ page contracts was comparable to 5-page contracts (within 2 percentage points).
Retry rates were higher than expected. Bedrock throttled significantly more at peak hours than the published limits suggested. Real-world throughput limits were approximately 70% of the documented values during business hours. This required building more aggressive backoff logic than initially planned.
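The "more aggressive backoff" amounted to capped exponential backoff with full jitter around the Bedrock call. A sketch (for brevity, any exception from the wrapped call is treated as retryable here; in production only throttling errors such as boto3's ThrottlingException should retry):

```python
import random
import time

def with_backoff(call, max_attempts: int = 6, base: float = 1.0, cap: float = 30.0):
    """Retry `call` with full-jitter exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential,
            # which spreads retries out instead of synchronizing them.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

attempts = {"n": 0}
def flaky():
    # Simulates a call that is throttled twice before succeeding.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = with_backoff(flaky, base=0.01, cap=0.05)
```

Full jitter matters under sustained load: plain exponential backoff makes throttled workers retry in lockstep and re-trigger the throttle.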
Prompt iteration was continuous. The team expected to write the extraction prompts once and move on. In practice, new contract types surfaced edge cases every two weeks. Building a prompt versioning system in week one turned out to be critical — without it, rolling back a bad prompt update would have required a code deploy.
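The essence of that prompt versioning system is a versioned store with an active pointer that can move backward without a deploy. A minimal in-memory sketch (the production version kept versions in the database):

```python
class PromptStore:
    """Minimal in-memory prompt version store with instant rollback."""

    def __init__(self):
        self._versions: dict = {}   # name -> list of prompt texts
        self._active: dict = {}     # name -> index of the active version

    def publish(self, name: str, text: str) -> int:
        self._versions.setdefault(name, []).append(text)
        self._active[name] = len(self._versions[name]) - 1
        return self._active[name]

    def rollback(self, name: str) -> int:
        # Step the active pointer back one version; no code deploy needed.
        if self._active[name] > 0:
            self._active[name] -= 1
        return self._active[name]

    def get(self, name: str) -> str:
        return self._versions[name][self._active[name]]

store = PromptStore()
store.publish("extract_parties", "v1: list all parties to the agreement")
store.publish("extract_parties", "v2: list parties with their roles")
store.rollback("extract_parties")   # bad update? revert immediately
```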
Key Takeaways
- Semantic search over full-document context is the right tradeoff for long documents. Accuracy comparable to full context at a fraction of the token cost.
- Plan for 60–70% of documented provider throughput limits in production. Build your concurrency model around this, not the published ceiling.
- The audit trail requirement, while initially seen as overhead, became a product feature. Customers valued being able to see exactly what the AI extracted and from which clause. It reduced objections to AI-based processing.
- Structured output validation is the highest-leverage reliability improvement. It costs almost nothing to add and eliminates an entire class of silent data quality failures.
What We'd Do Differently
Architecture Changes
Start with provisioned throughput on day one. The team spent three weeks debugging intermittent throttling before purchasing provisioned Bedrock throughput. At the expected volume, provisioned throughput would have been cost-justified from launch.
Add a human-review feedback loop earlier. The cases that the agent escalated to human review were never fed back into prompt improvement. After month three, the team realized that the manual review queue was a goldmine of training signal — every case where the human corrected the agent's output was a prompt improvement opportunity. Building the feedback mechanism took one additional sprint and immediately started improving accuracy on the most problematic contract types.
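The mechanism is simple to sketch: log every human override from the review queue, grouped by contract type, and rank types by correction volume to prioritize prompt iteration. The contract types and field names below are invented for illustration:

```python
from collections import defaultdict

# contract_type -> list of (field, agent_value, human_value) overrides
corrections = defaultdict(list)

def record_correction(contract_type: str, field: str,
                      agent_value: str, human_value: str) -> None:
    """Capture a human override from the review queue as training signal."""
    if agent_value != human_value:
        corrections[contract_type].append((field, agent_value, human_value))

def worst_contract_types(top_n: int = 3) -> list:
    # Rank contract types by correction count: these are the prompts to fix first.
    return sorted(corrections, key=lambda t: len(corrections[t]), reverse=True)[:top_n]

record_correction("multi_party", "termination_notice", "30 days", "60 days")
record_correction("multi_party", "governing_law", "Delaware", "New York")
record_correction("bilateral", "payment_terms", "net 30", "net 30")  # match: not logged
```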
Use ECS task definition versioning from the start. Mid-production updates to the extraction workflow required carefully coordinated deploys. A blue/green deployment pattern with ECS task definition versioning would have made workflow updates safer.
Process Improvements
Define accuracy benchmarks before implementation, not after. The team measured accuracy against a held-out test set of 200 contracts — but this test set was assembled after development began. Defining it before would have caught a data distribution gap: the test set underrepresented multi-party agreements, which were 15% of production volume and had 83% accuracy vs. 93% for standard bilateral contracts.
Write the runbook in parallel with the code. The first three production incidents each took 2–3 hours to resolve because the on-call engineer was not the engineer who built the relevant component. A runbook written during development would have cut resolution time by at least half.
Conclusion
This migration validated a specific architectural bet: a tool-augmented single agent with semantic search outperforms both the brute-force approach of one LLM call per field and the monolithic single-prompt approach for document extraction at production scale. The hybrid architecture (Option D) delivered 91.4% accuracy — exceeding the 92% target in months two and three — while keeping processing time well within the 90-second SLA and cutting net operational cost by 50%.
Three takeaways transfer directly to other production agentic deployments. First, plan for 60–70% of your LLM provider's documented throughput limits — real-world capacity under sustained load is consistently lower than published figures. Second, structured output validation with Pydantic is the single highest-leverage reliability improvement you can make; it costs almost nothing and eliminates silent data quality failures entirely. Third, build the human-review feedback loop from day one. Every case the agent escalates is training signal for prompt improvement, and the teams that capture this signal systematically are the ones whose accuracy curves keep climbing after launch.