The Challenge
Business Context
This case study documents a real-world migration of a document processing platform from a rule-based extraction pipeline to an agentic AI workflow running on AWS. The platform handled contract review for a mid-sized legal services company — approximately 2,400 contracts per month, each requiring extraction of 15–20 structured fields (parties, dates, governing law, termination clauses, payment terms) and a risk assessment summary.
The previous system: a combination of regex patterns, Apache Tika for text extraction, and a lookup-based classification model trained on 3,000 labeled examples. It handled 70% of contracts accurately with no human review. The remaining 30% required a paralegal to correct the output — a bottleneck that was growing linearly with business volume and represented roughly $18,000/month in labor cost.
The business case for agentic AI: eliminate the majority of the manual review queue, reduce per-contract processing cost, and enable the company to handle 3x contract volume without proportional headcount growth. The target was 92% accuracy (up from 70%) with a processing time under 90 seconds per contract.
Technical Constraints
Several constraints shaped the architecture:
Data residency. Contracts contained privileged legal information. Cloud provider AI services were acceptable; third-party SaaS AI was not. This ruled out direct integrations with OpenAI's API for production — all LLM inference had to run through AWS Bedrock, keeping data within the AWS account boundary and VPC.
Latency SLA. The existing system processed contracts in under 20 seconds. A move to LLM-based processing would inherently increase latency. The business accepted up to 90 seconds for the initial implementation, with a target of under 60 seconds after optimization.
Auditability. Every field extracted and every risk assessment generated needed a provenance record: which document sections informed the extraction, what the LLM was asked, and what it returned. This was a compliance requirement, not a nice-to-have.
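A provenance record like this can be captured as a small structured object per extracted field. The sketch below is illustrative only — the field names and shape are assumptions, not the production schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """One audit entry per extracted field (illustrative shape, not the production schema)."""
    contract_id: str
    field_name: str
    source_sections: list       # section IDs that informed the extraction
    prompt: str                 # exact prompt sent to the LLM
    raw_response: str           # unmodified model output
    extracted_value: str
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = ProvenanceRecord(
    contract_id="C-1042",
    field_name="governing_law",
    source_sections=["sec-12"],
    prompt="What is the governing law of this agreement?",
    raw_response="The agreement is governed by the laws of Delaware.",
    extracted_value="Delaware",
)
```

Writing one such record per field per contract keeps the 7-year audit trail queryable rather than buried in raw logs.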
Existing infrastructure. The team was AWS-native — already using SQS for job queuing, S3 for document storage, RDS PostgreSQL for structured data, and Lambda for lightweight processing steps. The new system needed to integrate with these rather than replace them.
Scale Requirements
- Peak load: 120 contracts per hour during business hours (Monday–Friday, 8am–6pm)
- Monthly volume: ~2,400 contracts with expected 40% YoY growth
- Concurrency requirement: support 20 simultaneous contract processing jobs without queue buildup
- Availability: 99.5% uptime during business hours
- Storage: retain full audit trail (inputs, LLM prompts, raw outputs) for 7 years per legal requirement
Architecture Decision
Options Evaluated
Four approaches were evaluated before selecting the final architecture:
Option A: Direct LLM integration per field. One LLM call per field, so 15–20 calls per contract. Pros: simple to implement, easy to test each extraction independently. Cons: 15–20x API cost per contract, very high latency (sequential calls at 2–5 seconds each = 30–100 seconds just for extraction, before risk assessment).
Option B: Single-prompt extraction. One LLM call with all 15–20 fields requested in a single structured output schema. Pros: lowest cost and latency. Cons: performance degraded significantly on contracts over 15 pages (context window constraints) and accuracy dropped for complex nested fields like termination clauses that required reasoning, not just extraction.
Option C: Agentic pipeline with specialized steps. A multi-step workflow where: (1) a document segmentation agent identifies and labels contract sections, (2) a field extraction agent pulls structured data from relevant sections, (3) a risk assessment agent reasons over the full contract. Pros: better accuracy on complex contracts, smaller context per step. Cons: higher implementation complexity, orchestration overhead.
Option D: Hybrid agentic with tool-augmented extraction. A single extraction agent with access to tools for semantic search over the contract sections. The agent calls the search tool to retrieve relevant sections before answering each extraction question. Pros: handles long contracts without full-document context, maintains accuracy on complex fields. Cons: more LLM calls than Option B (but fewer than Option A).
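The core loop of Option D — retrieve relevant sections, then extract from that narrow context — can be sketched as follows. The search and LLM functions here are toy stand-ins (in the real system the search tool queries pgvector and inference runs through Amazon Bedrock):

```python
# Hypothetical stand-ins for the real semantic search tool and Bedrock call.
SECTIONS = {
    "sec-1": "This Agreement is between Acme Corp and Beta LLC.",
    "sec-7": "Either party may terminate with 30 days written notice.",
}

def search_tool(query: str) -> list:
    """Toy keyword match standing in for semantic search over contract sections."""
    words = query.lower().split()
    return [text for text in SECTIONS.values()
            if any(w in text.lower() for w in words)]

def llm_extract(question: str, context: list) -> str:
    """Stub for the Bedrock extraction call; echoes the retrieved context."""
    return context[0] if context else "NOT_FOUND"

def extract_field(question: str) -> str:
    # The agent retrieves relevant sections first, then answers from that
    # narrow context instead of feeding the full document to the model.
    context = search_tool(question)
    return llm_extract(question, context)

answer = extract_field("terminate notice")
```

Because each extraction question sees only its retrieved sections, context size stays bounded regardless of contract length — the property that made Option D viable for 40+ page contracts.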
Decision Criteria
| Criterion | Weight | Option A | Option B | Option C | Option D |
|---|---|---|---|---|---|
| Accuracy on test set | 40% | 88% | 79% | 91% | 90% |
| Avg. processing time | 25% | 85s | 22s | 68s | 41s |
| Cost per contract | 20% | High | Low | Medium | Medium |
| Implementation complexity | 15% | Low | Low | High | Medium |
Option D scored highest overall. The key tiebreaker was Option C's complexity — the team was three engineers and needed to ship in six weeks.
Final Architecture
Key AWS services: S3, SQS, Lambda, ECS Fargate (for LangGraph workflow runner), Amazon Bedrock, RDS PostgreSQL with pgvector extension, CloudWatch for metrics.
Implementation
Phase 1: Foundation
The first two weeks focused on the document preparation pipeline and getting LangGraph running reliably on ECS Fargate. The most important decision here: run the LangGraph orchestration on ECS (not Lambda) because Lambda's 15-minute timeout and 10GB memory limit were borderline for complex contracts.
Phase 2: Core Features
The extraction agent used LangGraph's StateGraph with a simple two-node graph: an extract node that called the LLM with tool access, and a validate node that ran Pydantic validation on the output.
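The control flow of that two-node graph looks roughly like this. This is a plain-Python sketch of the same extract-then-validate shape (the production version used LangGraph's StateGraph; the node names here mirror that graph, and the payloads are invented):

```python
def extract_node(state: dict) -> dict:
    # Stub for the tool-augmented LLM call; returns invented field values.
    state["raw"] = {"governing_law": "Delaware", "notice_days": "30"}
    return state

def validate_node(state: dict) -> dict:
    # Minimal validation gate standing in for the Pydantic model:
    # invalid output flags the job for human review instead of being written.
    raw = state["raw"]
    state["valid"] = raw.get("notice_days", "").isdigit()
    state["needs_review"] = not state["valid"]
    return state

# Run the two nodes in sequence, as the graph edges would.
state = validate_node(extract_node({"contract_id": "C-1042"}))
```

Keeping validation as its own node (rather than inline in the extract step) made it easy to route failed validations to the review queue as a distinct graph edge.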
Phase 3: Optimization
After two weeks in production, three optimization opportunities emerged from the metrics:
1. Parallel field extraction reduced latency by 38%. The original extraction called the LLM once for all fields. Restructuring to call it three times in parallel — financial terms, parties/dates, and risk assessment as separate concurrent tasks — reduced p95 latency from 67s to 41s.
2. Bedrock on-demand vs. provisioned throughput. At peak load (120 contracts/hour), on-demand Bedrock throttled 8% of requests. Switching to provisioned throughput for 20 model units eliminated throttling and reduced the 429 retry overhead.
3. Chunk cache in ElastiCache. Contract embeddings were re-computed on retry. Adding a Redis cache (ElastiCache) for the embedding lookup eliminated re-embedding cost on retries entirely and shaved 3–4 seconds off retried jobs.
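The parallel split in item 1 can be sketched with asyncio. The three extraction groups are the ones named above; the LLM call is a stub with a token sleep standing in for real latency:

```python
import asyncio

async def extract_group(name: str) -> dict:
    """Stub for one Bedrock extraction call covering a group of fields."""
    await asyncio.sleep(0.01)       # stands in for seconds of LLM latency
    return {name: "extracted"}      # placeholder payload

async def extract_contract() -> dict:
    # The three groups run concurrently, so wall-clock time is the slowest
    # single call rather than the sum of all three.
    groups = ["financial_terms", "parties_dates", "risk_assessment"]
    results = await asyncio.gather(*(extract_group(g) for g in groups))
    merged = {}
    for r in results:
        merged.update(r)
    return merged

fields = asyncio.run(extract_contract())
```

The same pattern works with `concurrent.futures` for synchronous SDK clients; the key point is that the groups are independent, so nothing forces them to run sequentially.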
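The cache-aside pattern in item 3 keys on a content hash so identical chunks hit the cache even across retried jobs. A minimal sketch, with a dict standing in for ElastiCache/Redis and a stub for the embedding call:

```python
import hashlib

cache: dict = {}   # dict standing in for the ElastiCache/Redis lookup

def embed(text: str) -> list:
    """Stub for the Bedrock embedding call (the expensive part skipped on retries)."""
    return [float(len(text))]

def cached_embed(text: str) -> list:
    # Cache-aside: compute only on a miss, keyed by content hash so the
    # same chunk embedded by a retried job reuses the stored vector.
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = embed(text)
    return cache[key]

cached_embed("termination clause")   # miss: computes and stores
cached_embed("termination clause")   # hit: no recompute
```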
Results & Metrics
Performance Gains
After three months in production with 7,200 contracts processed:
| Metric | Before | After | Change |
|---|---|---|---|
| Field extraction accuracy | 70% | 91.4% | +21.4pp |
| Manual review queue rate | 30% | 7.2% | -22.8pp |
| p50 processing time | 18s | 38s | +20s |
| p95 processing time | 24s | 58s | +34s |
Latency increased, as expected — the 90-second SLA was met with margin. The accuracy improvement was the primary success metric and exceeded the 92% target in months two and three (month one was 88.2% during the warm-up period).
Cost Impact
| Cost Category | Before | After | Change |
|---|---|---|---|
| Paralegal review labor | $18,000/mo | $4,320/mo | -76% |
| Infrastructure (AWS) | $2,100/mo | $5,800/mo | +176% |
| Net monthly cost | $20,100/mo | $10,120/mo | -50% |
LLM inference costs (Bedrock) ran approximately $0.85 per contract at the initial model size. After switching intermediate steps to Claude Haiku for lower-complexity extractions, this dropped to $0.52 per contract.
Developer Productivity
The engineering team spent approximately 280 hours on implementation and three months of iteration. The payback period on engineering time, at the $9,980/month net savings, was approximately 4 weeks after the system reached production accuracy targets.
The unexpected productivity gain: the extraction system's structured output became the foundation for two additional downstream features (contract comparison and renewal reminders) that would have required separate engineering work without the structured data store.
Lessons Learned
What Worked
Bedrock for data residency. Using Amazon Bedrock exclusively meant zero legal review friction. Privacy and compliance teams approved the architecture in two days — a process that had previously taken six weeks for third-party AI vendors.
pgvector in RDS. The decision to use pgvector in the existing RDS instance rather than a standalone vector database (Pinecone, Weaviate) was correct at this scale. One less service to operate, and the SQL join between contract metadata and vector search results was trivial.
LangSmith for debugging. Even though LangSmith required data to leave the VPC (for the traces), the team used it in staging only for debugging. The step-by-step trace visibility cut debugging time by roughly 60% during development.
Pydantic validation as the quality gate. Every extraction went through strict Pydantic validation before writing to the database. Fields that failed validation were flagged for human review rather than silently written as incorrect data. This preserved data integrity and gave the team a clear quality signal for prompt iteration.
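A minimal version of that gate, assuming Pydantic v2 (field names and the review-routing shape are illustrative, not the production models):

```python
from pydantic import BaseModel, ValidationError

class ContractFields(BaseModel):
    governing_law: str
    notice_days: int    # coerces "30" -> 30, rejects "thirty"

def gate(raw: dict):
    """Return (validated_fields, needs_human_review).

    Failed validations route to the review queue instead of the database,
    so bad extractions are never silently written.
    """
    try:
        return ContractFields(**raw).model_dump(), False
    except ValidationError:
        return None, True

ok, review = gate({"governing_law": "Delaware", "notice_days": "30"})
bad, review2 = gate({"governing_law": "Delaware", "notice_days": "thirty"})
```

The `needs_human_review` flag doubles as the quality signal mentioned above: counting which fields fail validation most often tells you exactly which prompts to iterate on.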
What Surprised Us
Context window mattered less than expected. The initial assumption was that long contracts would degrade accuracy due to context window constraints. In practice, the semantic search tool allowed the agent to retrieve relevant sections without full-document context, and accuracy on 40+ page contracts was comparable to 5-page contracts (within 2 percentage points).
Retry rates were higher than expected. Bedrock throttled significantly more at peak hours than the published limits suggested. Real-world throughput limits were approximately 70% of the documented values during business hours. This required building more aggressive backoff logic than initially planned.
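The "more aggressive backoff" amounted to capped exponential backoff with full jitter around the Bedrock call. A sketch (for brevity, any exception from the wrapped call is treated as retryable here; in production only throttling errors such as boto3's ThrottlingException should retry):

```python
import random
import time

def with_backoff(call, max_attempts: int = 6, base: float = 1.0, cap: float = 30.0):
    """Retry `call` with full-jitter exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential,
            # which spreads retries out instead of synchronizing them.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

attempts = {"n": 0}
def flaky():
    # Simulates a call that is throttled twice before succeeding.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = with_backoff(flaky, base=0.01, cap=0.05)
```

Full jitter matters under sustained load: plain exponential backoff makes throttled workers retry in lockstep and re-trigger the throttle.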
Prompt iteration was continuous. The team expected to write the extraction prompts once and move on. In practice, new contract types surfaced edge cases every two weeks. Building a prompt versioning system in week one turned out to be critical — without it, rolling back a bad prompt update would have required a code deploy.
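The essence of that prompt versioning system is a versioned store with an active pointer that can move backward without a deploy. A minimal in-memory sketch (the production version kept versions in the database):

```python
class PromptStore:
    """Minimal in-memory prompt version store with instant rollback."""

    def __init__(self):
        self._versions: dict = {}   # name -> list of prompt texts
        self._active: dict = {}     # name -> index of the active version

    def publish(self, name: str, text: str) -> int:
        self._versions.setdefault(name, []).append(text)
        self._active[name] = len(self._versions[name]) - 1
        return self._active[name]

    def rollback(self, name: str) -> int:
        # Step the active pointer back one version; no code deploy needed.
        if self._active[name] > 0:
            self._active[name] -= 1
        return self._active[name]

    def get(self, name: str) -> str:
        return self._versions[name][self._active[name]]

store = PromptStore()
store.publish("extract_parties", "v1: list all parties to the agreement")
store.publish("extract_parties", "v2: list parties with their roles")
store.rollback("extract_parties")   # bad update? revert immediately
```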
Key Takeaways
- Semantic search over full-document context is the right tradeoff for long documents. Accuracy comparable to full context at a fraction of the token cost.
- Plan for 60–70% of documented provider throughput limits in production. Build your concurrency model around this, not the published ceiling.
- The audit trail requirement, while initially seen as overhead, became a product feature. Customers valued being able to see exactly what the AI extracted and from which clause. It reduced objections to AI-based processing.
- Structured output validation is the highest-leverage reliability improvement. It costs almost nothing to add and eliminates an entire class of silent data quality failures.
What We'd Do Differently
Architecture Changes
Start with provisioned throughput on day one. The team spent three weeks debugging intermittent throttling before purchasing provisioned Bedrock throughput. At the expected volume, provisioned throughput would have been cost-justified from launch.
Add a human-review feedback loop earlier. The cases that the agent escalated to human review were never fed back into prompt improvement. After month three, the team realized that the manual review queue was a goldmine of training signal — every case where the human corrected the agent's output was a prompt improvement opportunity. Building the feedback mechanism took one additional sprint and immediately started improving accuracy on the most problematic contract types.
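The mechanism is simple to sketch: log every human override from the review queue, grouped by contract type, and rank types by correction volume to prioritize prompt iteration. The contract types and field names below are invented for illustration:

```python
from collections import defaultdict

# contract_type -> list of (field, agent_value, human_value) overrides
corrections = defaultdict(list)

def record_correction(contract_type: str, field: str,
                      agent_value: str, human_value: str) -> None:
    """Capture a human override from the review queue as training signal."""
    if agent_value != human_value:
        corrections[contract_type].append((field, agent_value, human_value))

def worst_contract_types(top_n: int = 3) -> list:
    # Rank contract types by correction count: these are the prompts to fix first.
    return sorted(corrections, key=lambda t: len(corrections[t]), reverse=True)[:top_n]

record_correction("multi_party", "termination_notice", "30 days", "60 days")
record_correction("multi_party", "governing_law", "Delaware", "New York")
record_correction("bilateral", "payment_terms", "net 30", "net 30")  # match: not logged
```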
Use ECS task definition versioning from the start. Mid-production updates to the extraction workflow required carefully coordinated deploys. A blue/green deployment pattern with ECS task definition versioning would have made workflow updates safer.
Process Improvements
Define accuracy benchmarks before implementation, not after. The team measured accuracy against a held-out test set of 200 contracts — but this test set was assembled after development began. Defining it before would have caught a data distribution gap: the test set underrepresented multi-party agreements, which were 15% of production volume and had 83% accuracy vs. 93% for standard bilateral contracts.
Write the runbook in parallel with the code. The first three production incidents each took 2–3 hours to resolve because the on-call engineer was not the engineer who built the relevant component. A runbook written during development would have cut resolution time by at least half.
Conclusion
This migration validated a specific architectural bet: a tool-augmented single agent with semantic search outperforms both the brute-force approach of one LLM call per field and the monolithic single-prompt approach for document extraction at production scale. The hybrid architecture (Option D) delivered 91.4% accuracy — exceeding the 92% target in months two and three — while keeping processing time well within the 90-second SLA and cutting net operational cost by 50%.
Three takeaways transfer directly to other production agentic deployments. First, plan for 60–70% of your LLM provider's documented throughput limits — real-world capacity under sustained load is consistently lower than published figures. Second, structured output validation with Pydantic is the single highest-leverage reliability improvement you can make; it costs almost nothing and eliminates silent data quality failures entirely. Third, build the human-review feedback loop from day one. Every case the agent escalates is training signal for prompt improvement, and the teams that capture this signal systematically are the ones whose accuracy curves keep climbing after launch.