AI Architecture

Agentic AI Workflows at Scale: Lessons from Production

Real-world lessons from implementing Agentic AI Workflows in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 8 min read

The Challenge

Business Context

This case study documents a real-world migration of a document processing platform from a rule-based extraction pipeline to an agentic AI workflow running on AWS. The platform handled contract review for a mid-sized legal services company — approximately 2,400 contracts per month, each requiring extraction of 15–20 structured fields (parties, dates, governing law, termination clauses, payment terms) and a risk assessment summary.

The previous system: a combination of regex patterns, Apache Tika for text extraction, and a lookup-based classification model trained on 3,000 labeled examples. It handled 70% of contracts accurately with no human review. The remaining 30% required a paralegal to correct the output — a bottleneck that was growing linearly with business volume and represented roughly $18,000/month in labor cost.

The business case for agentic AI: eliminate the majority of the manual review queue, reduce per-contract processing cost, and enable the company to handle 3x contract volume without proportional headcount growth. The target was 92% accuracy (up from 70%) with a processing time under 90 seconds per contract.

Technical Constraints

Several constraints shaped the architecture:

Data residency. Contracts contained privileged legal information. Cloud provider AI services were acceptable; third-party SaaS AI was not. This ruled out direct integrations with OpenAI's API for production — all LLM inference had to run through AWS Bedrock, keeping data within the AWS account boundary and VPC.

Latency SLA. The existing system processed contracts in under 20 seconds. A move to LLM-based processing would inherently increase latency. The business accepted up to 90 seconds for the initial implementation, with a target of under 60 seconds after optimization.

Auditability. Every field extracted and every risk assessment generated needed a provenance record: which document sections informed the extraction, what the LLM was asked, and what it returned. This was a compliance requirement, not a nice-to-have.

Existing infrastructure. The team was AWS-native — already using SQS for job queuing, S3 for document storage, RDS PostgreSQL for structured data, and Lambda for lightweight processing steps. The new system needed to integrate with these rather than replace them.

Scale Requirements

  • Peak load: 120 contracts per hour during business hours (Monday–Friday, 8am–6pm)
  • Monthly volume: ~2,400 contracts with expected 40% YoY growth
  • Concurrency requirement: support 20 simultaneous contract processing jobs without queue buildup
  • Availability: 99.5% uptime during business hours
  • Storage: retain full audit trail (inputs, LLM prompts, raw outputs) for 7 years per legal requirement
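The concurrency requirement above maps directly onto a bounded worker pool. A minimal sketch, with illustrative names (`run_with_limit` is not from the production codebase, which consumed jobs from SQS):

```python
import asyncio

MAX_CONCURRENT_JOBS = 20  # from the scale requirement above

async def run_with_limit(jobs, worker, limit=MAX_CONCURRENT_JOBS):
    """Process `jobs` concurrently, never exceeding `limit` jobs in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        async with sem:
            return await worker(job)

    return await asyncio.gather(*(guarded(j) for j in jobs))
```

The semaphore caps in-flight work so a burst of SQS messages cannot overwhelm downstream Bedrock quotas.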

Architecture Decision

Options Evaluated

Four approaches were evaluated before selecting the final architecture:

Option A: Direct LLM integration per field. One LLM call per field to extract. 15–20 calls per contract. Pros: simple to implement, easy to test each extraction independently. Cons: 15–20x API cost per contract, very high latency (sequential calls at 2–5 seconds each = 30–100 seconds just for extraction, before risk assessment).

Option B: Single-prompt extraction. One LLM call with all 15–20 fields requested in a single structured output schema. Pros: lowest cost and latency. Cons: performance degraded significantly on contracts over 15 pages (context window constraints) and accuracy dropped for complex nested fields like termination clauses that required reasoning, not just extraction.

Option C: Agentic pipeline with specialized steps. A multi-step workflow where: (1) a document segmentation agent identifies and labels contract sections, (2) a field extraction agent pulls structured data from relevant sections, (3) a risk assessment agent reasons over the full contract. Pros: better accuracy on complex contracts, smaller context per step. Cons: higher implementation complexity, orchestration overhead.

Option D: Hybrid agentic with tool-augmented extraction. A single extraction agent with access to tools for semantic search over the contract sections. The agent calls the search tool to retrieve relevant sections before answering each extraction question. Pros: handles long contracts without full-document context, maintains accuracy on complex fields. Cons: more LLM calls than Option B (but fewer than Option A).

Decision Criteria

| Criterion | Weight | Option A | Option B | Option C | Option D |
|---|---|---|---|---|---|
| Accuracy on test set | 40% | 88% | 79% | 91% | 90% |
| Avg. processing time | 25% | 85s | 22s | 68s | 41s |
| Cost per contract | 20% | High | Low | Medium | Medium |
| Implementation complexity | 15% | Low | Low | High | Medium |

Option D scored highest overall. The key tiebreaker was Option C's complexity — the team was three engineers and needed to ship in six weeks.

Final Architecture

┌─────────────────────────────────────────────────────────┐
│  Contract Ingestion                                     │
│  S3 Upload → SQS Message → Lambda Trigger               │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│  Document Preparation                                   │
│  Lambda: PDF→Text (Textract), chunking, embedding,      │
│  store chunks in pgvector (RDS)                         │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│  Agentic Extraction (ECS Task)                          │
│  LangGraph workflow on Bedrock (Claude 3.5 Sonnet)      │
│  Tool: semantic_search(query) → relevant contract chunks│
│  Tool: get_section(section_name) → verbatim text        │
│  Output: structured JSON (15–20 fields + risk score)    │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│  Validation & Storage                                   │
│  Lambda: Pydantic validation, confidence scoring        │
│  Write to RDS: structured fields + audit trail          │
│  Flag low-confidence fields for human review            │
└─────────────────────────────────────────────────────────┘

Key AWS services: S3, SQS, Lambda, ECS Fargate (for LangGraph workflow runner), Amazon Bedrock, RDS PostgreSQL with pgvector extension, CloudWatch for metrics.


Implementation

Phase 1: Foundation

The first two weeks focused on the document preparation pipeline and getting LangGraph running reliably on ECS Fargate. The most important decision here: run the LangGraph orchestration on ECS (not Lambda) because Lambda's 15-minute timeout and 10GB memory limit were borderline for complex contracts.

```python
# ECS task: document preparation
import asyncio

import boto3
import psycopg2
from langchain_aws import BedrockEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

async def prepare_document(s3_key: str, contract_id: str):
    # Extract text from the PDF via an async Textract job
    textract = boto3.client('textract')
    response = textract.start_document_text_detection(
        DocumentLocation={'S3Object': {'Bucket': BUCKET, 'Name': s3_key}}
    )
    job_id = response['JobId']
    text = await wait_for_textract(job_id)

    # Chunk into ~512-character segments with 50-character overlap
    # (RecursiveCharacterTextSplitter counts characters, not tokens)
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
    chunks = splitter.split_text(text)

    # Embed and store in pgvector
    embedder = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
    embeddings = await embedder.aembed_documents(chunks)

    conn = psycopg2.connect(DATABASE_URL)
    cur = conn.cursor()
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        cur.execute(
            """INSERT INTO contract_chunks
               (contract_id, chunk_index, text, embedding)
               VALUES (%s, %s, %s, %s)""",
            (contract_id, i, chunk, embedding)
        )
    conn.commit()
    conn.close()
```
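The preparation step awaits a `wait_for_textract` helper that isn't shown. A plausible sketch: poll the job status, then page through the result blocks. The `client` parameter is injected here for testability; `get_document_text_detection` is the real Textract API for this job type.

```python
import asyncio

async def wait_for_textract(job_id: str, client, poll_seconds: float = 2.0,
                            max_wait: float = 300.0) -> str:
    """Poll an async Textract text-detection job until it finishes, then
    page through the result blocks and join the detected LINEs."""
    waited = 0.0
    while True:
        status = client.get_document_text_detection(JobId=job_id)["JobStatus"]
        if status == "SUCCEEDED":
            break
        if status == "FAILED":
            raise RuntimeError(f"Textract job {job_id} failed")
        if waited >= max_wait:
            raise TimeoutError(f"Textract job {job_id} exceeded {max_wait}s")
        await asyncio.sleep(poll_seconds)
        waited += poll_seconds

    # Page through results; large documents return multiple pages of blocks
    lines: list[str] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_text_detection(**kwargs)
        lines += [b["Text"] for b in page["Blocks"] if b["BlockType"] == "LINE"]
        next_token = page.get("NextToken")
        if not next_token:
            return "\n".join(lines)
```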

Phase 2: Core Features

The extraction agent used LangGraph's StateGraph with a simple two-node graph: an extract node that called the LLM with tool access, and a validate node that ran Pydantic validation on the output.

```python
import json
from typing import Optional

from langchain_aws import ChatBedrock
from langchain_core.tools import tool
from langgraph.graph import StateGraph, END
from pydantic import BaseModel

class ContractFields(BaseModel):
    effective_date: Optional[str]
    parties: list[str]
    governing_law: Optional[str]
    payment_terms: Optional[str]
    termination_notice_days: Optional[int]
    auto_renewal: Optional[bool]
    liability_cap: Optional[str]
    # ... 10 more fields
    risk_score: int  # 1-10
    risk_factors: list[str]

class ExtractionState(BaseModel):
    contract_id: str
    extracted: Optional[dict] = None
    validation_errors: list[str] = []
    retry_count: int = 0

@tool
def semantic_search(query: str, contract_id: str) -> str:
    """Search contract chunks by semantic similarity."""
    # pgvector cosine-distance search; `db` and `embed` are app-level
    # helpers (connection pool + Titan embedding call)
    results = db.execute(
        """SELECT text FROM contract_chunks
           WHERE contract_id = %s
           ORDER BY embedding <=> %s::vector
           LIMIT 3""",
        (contract_id, embed(query))
    )
    return "\n\n---\n\n".join(r[0] for r in results)

llm = ChatBedrock(
    model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
    model_kwargs={"max_tokens": 4096, "temperature": 0},
)
llm_with_tools = llm.bind_tools([semantic_search])

def extract_node(state: ExtractionState) -> ExtractionState:
    messages = [
        ("system", EXTRACTION_SYSTEM_PROMPT),
        ("human", f"Extract all fields from contract {state.contract_id}. "
                  f"Use the semantic_search tool to find relevant sections."),
    ]
    response = llm_with_tools.invoke(messages)
    messages.append(response)
    # Keep resolving tool calls until the model returns a final answer
    while response.tool_calls:
        for tool_call in response.tool_calls:
            result = semantic_search.invoke(
                {**tool_call['args'], 'contract_id': state.contract_id}
            )
            messages.append({"role": "tool", "content": result,
                             "tool_call_id": tool_call['id']})
        response = llm_with_tools.invoke(messages)
        messages.append(response)

    # The system prompt constrains the model to emit bare JSON
    state.extracted = json.loads(response.content)
    return state

graph = StateGraph(ExtractionState)
graph.add_node("extract", extract_node)
graph.add_node("validate", validate_node)
graph.set_entry_point("extract")
graph.add_edge("extract", "validate")
graph.add_edge("validate", END)
workflow = graph.compile()
```
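The graph wires in a `validate_node` that isn't shown. A sketch of what it might look like, using a trimmed three-field version of the models (production validated the full field set):

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class ContractFields(BaseModel):
    # trimmed to three fields for illustration
    effective_date: Optional[str] = None
    parties: list[str]
    risk_score: int  # 1-10

class ExtractionState(BaseModel):
    contract_id: str
    extracted: Optional[dict] = None
    validation_errors: list[str] = []
    retry_count: int = 0

def validate_node(state: ExtractionState) -> ExtractionState:
    """Quality gate: strict validation before anything is written to RDS.
    Failures are recorded so the job can be retried or flagged for review."""
    try:
        ContractFields.model_validate(state.extracted or {})
        state.validation_errors = []
    except ValidationError as exc:
        state.validation_errors = [f"{e['loc']}: {e['msg']}" for e in exc.errors()]
        state.retry_count += 1
    return state
```

Recording errors on the state, rather than raising, lets the graph route failed extractions to a retry or human-review edge.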

Phase 3: Optimization

After two weeks in production, three optimization opportunities emerged from the metrics:

1. Parallel field extraction reduced latency by 38%. The original extraction called the LLM once for all fields. Restructuring to call it three times in parallel — financial terms, parties/dates, and risk assessment as separate concurrent tasks — reduced p95 latency from 67s to 41s.

```python
import asyncio

async def extract_parallel(contract_id: str) -> ContractFields:
    financial_task = asyncio.create_task(
        extract_section("financial", contract_id)
    )
    parties_task = asyncio.create_task(
        extract_section("parties_and_dates", contract_id)
    )
    risk_task = asyncio.create_task(
        assess_risk(contract_id)
    )

    financial, parties, risk = await asyncio.gather(
        financial_task, parties_task, risk_task
    )
    return merge_results(financial, parties, risk)
```

2. Bedrock on-demand vs. provisioned throughput. At peak load (120 contracts/hour), on-demand Bedrock throttled 8% of requests. Switching to provisioned throughput for 20 model units eliminated throttling and reduced the 429 retry overhead.

3. Chunk cache in ElastiCache. Contract embeddings were re-computed on retry. Adding a Redis cache (ElastiCache) for the embedding lookup eliminated re-embedding on retries and shaved 3–4 seconds off retried jobs.
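A cache-aside sketch of that lookup. The names (`embedding_cache_key`, `get_or_embed`) are illustrative, not from the production codebase; `cache` is any redis.Redis-compatible client:

```python
import hashlib
import json

def embedding_cache_key(contract_id: str, chunk_text: str) -> str:
    """Key on a content hash so edited chunks never hit a stale entry."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()[:16]
    return f"emb:{contract_id}:{digest}"

def get_or_embed(cache, contract_id, chunk_text, embed_fn, ttl=86400):
    """Cache-aside: return the cached vector, or compute and store it.
    `cache` needs redis-style get / set(key, value, ex=ttl)."""
    key = embedding_cache_key(contract_id, chunk_text)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    vector = embed_fn(chunk_text)
    cache.set(key, json.dumps(vector), ex=ttl)
    return vector
```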



Results & Metrics

Performance Gains

After three months in production with 7,200 contracts processed:

| Metric | Before | After | Change |
|---|---|---|---|
| Field extraction accuracy | 70% | 91.4% | +21.4pp |
| Manual review queue rate | 30% | 7.2% | -22.8pp |
| p50 processing time | 18s | 38s | +20s |
| p95 processing time | 24s | 58s | +34s |
| Contracts requiring escalation | 30% | 7.2% | -22.8pp |

Latency increased, as expected — the 90-second SLA was met with margin. The accuracy improvement was the primary success metric and exceeded the 92% target in months two and three (month one was 88.2% during the warm-up period).

Cost Impact

| Cost Category | Before | After | Change |
|---|---|---|---|
| Paralegal review labor | $18,000/mo | $4,320/mo | -76% |
| Infrastructure (AWS) | $2,100/mo | $5,800/mo | +176% |
| Net monthly cost | $20,100/mo | $10,120/mo | -50% |

LLM inference costs (Bedrock) ran approximately $0.85 per contract with the initial model. After switching intermediate steps to Claude Haiku for lower-complexity extractions, this dropped to $0.52 per contract.

Developer Productivity

The engineering team spent approximately 280 hours on implementation and three months of iteration. The payback period on engineering time, at the $9,980/month net savings, was approximately 4 weeks after the system reached production accuracy targets.

The unexpected productivity gain: the extraction system's structured output became the foundation for two additional downstream features (contract comparison and renewal reminders) that would have required separate engineering work without the structured data store.


Lessons Learned

What Worked

Bedrock for data residency. Using Amazon Bedrock exclusively meant zero legal review friction. Privacy and compliance teams approved the architecture in two days — a process that had previously taken six weeks for third-party AI vendors.

pgvector in RDS. The decision to use pgvector in the existing RDS instance rather than a standalone vector database (Pinecone, Weaviate) was correct at this scale. One less service to operate, and the SQL join between contract metadata and vector search results was trivial.

LangSmith for debugging. Even though LangSmith required data to leave the VPC (for the traces), the team used it in staging only for debugging. The step-by-step trace visibility cut debugging time by roughly 60% during development.

Pydantic validation as the quality gate. Every extraction went through strict Pydantic validation before writing to the database. Fields that failed validation were flagged for human review rather than silently written as incorrect data. This preserved data integrity and gave the team a clear quality signal for prompt iteration.

What Surprised Us

Context window mattered less than expected. The initial assumption was that long contracts would degrade accuracy due to context window constraints. In practice, the semantic search tool allowed the agent to retrieve relevant sections without full-document context, and accuracy on 40+ page contracts was comparable to 5-page contracts (within 2 percentage points).

Retry rates were higher than expected. Bedrock throttled significantly more at peak hours than the published limits suggested. Real-world throughput limits were approximately 70% of the documented values during business hours. This required building more aggressive backoff logic than initially planned.
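That backoff logic can be sketched as full-jitter exponential backoff around the Bedrock call. The names here (`backoff_delay`, `call_with_retries`) are illustrative, not the production API:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_attempts: int = 6,
                      is_throttle=lambda e: "429" in str(e) or "Throttling" in str(e)):
    """Retry `fn` on throttling errors with jittered backoff; re-raise anything else."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

Full jitter matters at peak load: without it, 20 concurrent workers throttled at the same instant all retry at the same instant and throttle again.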

Prompt iteration was continuous. The team expected to write the extraction prompts once and move on. In practice, new contract types surfaced edge cases every two weeks. Building a prompt versioning system in week one turned out to be critical — without it, rolling back a bad prompt update would have required a code deploy.
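A prompt versioning system can be as small as a registry with publish, get, and rollback. A minimal in-memory sketch (the production version stored versions in an RDS table, which is what made rollback possible without a deploy):

```python
class PromptRegistry:
    """Minimal in-memory sketch of versioned prompts with rollback."""

    def __init__(self):
        self._versions: dict[str, list[str]] = {}  # name -> all versions
        self._active: dict[str, int] = {}          # name -> active index

    def publish(self, name: str, text: str) -> int:
        """Store a new version, make it active, return its version number."""
        self._versions.setdefault(name, []).append(text)
        self._active[name] = len(self._versions[name]) - 1
        return self._active[name]

    def rollback(self, name: str, version: int) -> None:
        """Point `name` back at an earlier version without losing any history."""
        if not 0 <= version < len(self._versions.get(name, [])):
            raise ValueError(f"no version {version} for prompt {name!r}")
        self._active[name] = version

    def get(self, name: str) -> str:
        return self._versions[name][self._active[name]]
```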

Key Takeaways

  1. Semantic search over full-document context is the right tradeoff for long documents. Accuracy comparable to full context at a fraction of the token cost.

  2. Plan for 60–70% of documented provider throughput limits in production. Build your concurrency model around this, not the published ceiling.

  3. The audit trail requirement, while initially seen as overhead, became a product feature. Customers valued being able to see exactly what the AI extracted and from which clause. It reduced objections to AI-based processing.

  4. Structured output validation is the highest-leverage reliability improvement. It costs almost nothing to add and eliminates an entire class of silent data quality failures.


What We'd Do Differently

Architecture Changes

Start with provisioned throughput on day one. The team spent three weeks debugging intermittent throttling before purchasing provisioned Bedrock throughput. At the expected volume, provisioned throughput would have been cost-justified from launch.

Add a human-review feedback loop earlier. The cases that the agent escalated to human review were never fed back into prompt improvement. After month three, the team realized that the manual review queue was a goldmine of training signal — every case where the human corrected the agent's output was a prompt improvement opportunity. Building the feedback mechanism took one additional sprint and immediately started improving accuracy on the most problematic contract types.
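Even the simplest version of that feedback loop, counting which fields humans correct most often, tells you where to spend prompt iteration effort. A sketch (the `reviews` shape is an assumption, not the production schema):

```python
from collections import defaultdict

def correction_rates(reviews):
    """`reviews`: iterable of (field, model_value, human_value) tuples.
    Returns field -> fraction of reviews where the human changed the value."""
    seen = defaultdict(int)
    corrected = defaultdict(int)
    for field, model_value, human_value in reviews:
        seen[field] += 1
        if model_value != human_value:
            corrected[field] += 1
    return {f: corrected[f] / seen[f] for f in seen}
```

Sorting this map descending gives a ranked backlog of the fields whose prompts need work.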

Use ECS task definition versioning from the start. Mid-production updates to the extraction workflow required carefully coordinated deploys. A blue/green deployment pattern with ECS task definition versioning would have made workflow updates safer.

Process Improvements

Define accuracy benchmarks before implementation, not after. The team measured accuracy against a held-out test set of 200 contracts — but this test set was assembled after development began. Defining it before would have caught a data distribution gap: the test set underrepresented multi-party agreements, which were 15% of production volume and had 83% accuracy vs. 93% for standard bilateral contracts.

Write the runbook in parallel with the code. The first three production incidents each took 2–3 hours to resolve because the on-call engineer was not the engineer who built the relevant component. A runbook written during development would have cut resolution time by at least half.


Conclusion

This migration validated a specific architectural bet: a tool-augmented single agent with semantic search outperforms both the brute-force approach of one LLM call per field and the monolithic single-prompt approach for document extraction at production scale. The hybrid architecture (Option D) delivered 91.4% average accuracy, exceeding the 92% target in months two and three, while keeping processing time well within the 90-second SLA and cutting net operational cost by 50%.

Three takeaways transfer directly to other production agentic deployments. First, plan for 60–70% of your LLM provider's documented throughput limits — real-world capacity under sustained load is consistently lower than published figures. Second, structured output validation with Pydantic is the single highest-leverage reliability improvement you can make; it costs almost nothing and eliminates silent data quality failures entirely. Third, build the human-review feedback loop from day one. Every case the agent escalates is training signal for prompt improvement, and the teams that capture this signal systematically are the ones whose accuracy curves keep climbing after launch.
