AI Architecture

LLM Fine-Tuning at Scale: Lessons from Production

Real-world lessons from fine-tuning an LLM for production use, including architecture decisions, measurable results, and an honest retrospective.

Muneer Puthiya Purayil · 9 min read

In mid-2024, we fine-tuned a Llama 3.1 8B model for automated insurance claims processing. The project took 6 weeks from concept to production, reduced manual review time by 62%, and achieved 94.3% accuracy on structured extraction tasks — up from 78% with prompt engineering alone. Here is what actually happened.

The Problem

Our insurance client processed 15,000 claims per month. Each claim document required extracting 23 structured fields: claimant information, incident details, damage descriptions, policy numbers, and coverage determinations. Human reviewers spent an average of 12 minutes per claim. At $35/hour fully loaded, that was $105,000/month in review labor.

The initial approach — GPT-4 with few-shot prompting — achieved 78% field-level accuracy. The remaining 22% required human correction, which took nearly as long as reviewing from scratch because reviewers had to verify every field to find the errors.

Architecture Decisions

Why Fine-Tuning Over RAG

We evaluated RAG first: a system that retrieved similar historical claims and used them as context for GPT-4. Accuracy improved to 83%, but at $0.08 per claim in API costs ($1,200/month) and 8-12 seconds of latency per document. Latency was the bigger problem: reviewers waiting 10 seconds per claim broke their workflow.

Fine-tuning a smaller model offered: (1) sub-second inference, (2) $0.001 per claim on self-hosted infrastructure, and (3) data never leaving our infrastructure — critical for PHI/PII compliance.

Model Selection

We chose Llama 3.1 8B over larger alternatives. The reasoning:

  • 8B parameters fit on a single A10G GPU for both training and inference
  • The claims extraction task was narrow enough that a smaller model could specialize effectively
  • Iteration speed mattered: 2-hour training runs vs 12+ hours for 70B models

Training Process

Data Preparation

We started with 8,000 historical claims that had been manually reviewed and corrected. The data pipeline:

```python
import json


def process_claim_to_training(
    claim_document: str,
    verified_extraction: dict,
) -> dict:
    instruction = (
        "Extract all structured fields from the following insurance claim document. "
        "Return a JSON object with the following fields: claimant_name, claimant_dob, "
        "policy_number, incident_date, incident_type, damage_description, "
        "estimated_amount, coverage_determination, and all other standard fields."
    )

    return {
        "instruction": instruction,
        "input": claim_document,
        "output": json.dumps(verified_extraction, indent=2),
    }


def split_dataset(examples: list[dict], train_ratio: float = 0.9):
    split_idx = int(len(examples) * train_ratio)
    return examples[:split_idx], examples[split_idx:]
```

After deduplication and quality filtering (removing claims with incomplete extractions), we had 6,847 training examples and 761 evaluation examples.
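The deduplication and quality filter can be sketched roughly as follows. The hash-based dedup and the specific required-field check are assumptions for illustration; the exact criteria we used were more involved.

```python
import hashlib
import json

# Hypothetical subset of fields a verified extraction must contain
REQUIRED_FIELDS = {"claimant_name", "policy_number", "incident_date"}


def filter_and_dedup(examples: list[dict]) -> list[dict]:
    seen: set[str] = set()
    kept = []
    for ex in examples:
        # Drop claims whose verified extraction is missing required fields
        extraction = json.loads(ex["output"])
        if not REQUIRED_FIELDS.issubset(extraction):
            continue
        # Deduplicate on a hash of the raw claim document text
        digest = hashlib.sha256(ex["input"].encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(ex)
    return kept
```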

Training Iterations

Iteration 1: Baseline LoRA fine-tuning. 6,847 examples, default hyperparameters (lr=2e-4, r=16, 3 epochs). Field-level accuracy: 87.2%. The model learned the output format perfectly but struggled with ambiguous damage descriptions and multi-vehicle incidents.

Iteration 2: Targeted data augmentation. We added 400 examples specifically for the failure modes: complex damage descriptions (127 examples), multi-party incidents (89 examples), and edge cases in coverage determination (184 examples). Accuracy: 91.8%.

Iteration 3: Hyperparameter tuning. We reduced the learning rate to 1e-4, increased LoRA rank to 32, and trained for 5 epochs. Accuracy: 93.1%.

Iteration 4: Chain-of-thought formatting. We reformatted outputs to include reasoning before the JSON extraction. Accuracy: 94.3%. The reasoning chain helped the model handle ambiguous cases by forcing explicit intermediate steps.
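The iteration 4 reformatting can be illustrated with a small sketch. The exact reasoning template and section markers are assumptions; what matters is that the reasoning precedes the JSON and the JSON can be recovered at inference time.

```python
import json


def format_cot_output(reasoning: str, extraction: dict) -> str:
    # Put an explicit reasoning section before the JSON so the model must
    # commit to intermediate steps before emitting the final extraction
    return (
        "Reasoning:\n"
        f"{reasoning.strip()}\n\n"
        "Extraction:\n"
        f"{json.dumps(extraction, indent=2)}"
    )


def parse_cot_output(model_output: str) -> dict:
    # Recover the JSON object after the "Extraction:" marker at inference time
    _, _, json_part = model_output.partition("Extraction:")
    return json.loads(json_part)
```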

Final training configuration:

```python
from peft import LoraConfig
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/claims-v4",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Total training cost across all iterations: $47 in GPU compute (A10G on AWS).

Evaluation

```python
def normalize(value) -> str:
    # Simple normalizer so formatting differences (case, whitespace)
    # don't count as field mismatches
    return str(value).strip().lower()


def evaluate_claims_extraction(
    predictions: list[dict],
    ground_truth: list[dict],
) -> dict:
    field_accuracies: dict[str, list[int]] = {}
    total_fields = 0
    correct_fields = 0

    for pred, truth in zip(predictions, ground_truth):
        for field in truth:
            total_fields += 1
            if field in pred and normalize(pred[field]) == normalize(truth[field]):
                correct_fields += 1
                field_accuracies.setdefault(field, []).append(1)
            else:
                field_accuracies.setdefault(field, []).append(0)

    per_field = {
        field: sum(scores) / len(scores)
        for field, scores in field_accuracies.items()
    }

    return {
        "overall_accuracy": correct_fields / total_fields,
        "per_field_accuracy": per_field,
        "worst_fields": sorted(per_field.items(), key=lambda x: x[1])[:5],
    }
```

Results on the held-out evaluation set:

| Metric | Prompt Engineering | RAG + GPT-4 | Fine-tuned 8B |
|---|---|---|---|
| Field-level accuracy | 78.0% | 83.2% | 94.3% |
| Latency per claim | 8.2s | 11.4s | 0.6s |
| Cost per claim | $0.06 | $0.08 | $0.001 |
| Monthly inference cost | $900 | $1,200 | $15 |
| Format compliance | 92% | 95% | 99.7% |

The accuracy improvement from 78% to 94.3% crossed a critical threshold: at 94%+ accuracy, reviewers could trust the extraction and only verify flagged low-confidence fields, reducing review time from 12 minutes to 4.5 minutes per claim.
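Flagging low-confidence fields can be done with token log-probabilities from the serving stack. The sketch below is an assumption about how such routing could work; the 0.90 threshold and the per-field grouping of logprobs are hypothetical, not our production values.

```python
import math

# Hypothetical cutoff below which a field is routed to human review
CONFIDENCE_THRESHOLD = 0.90


def flag_low_confidence(field_token_logprobs: dict[str, list[float]]) -> dict[str, bool]:
    """Map each field name to True if it needs human review.

    field_token_logprobs: per-field log-probabilities of the tokens the
    model generated for that field's value.
    """
    flags = {}
    for field, logprobs in field_token_logprobs.items():
        # Geometric-mean token probability as a crude field-level confidence
        avg_prob = math.exp(sum(logprobs) / len(logprobs)) if logprobs else 0.0
        flags[field] = avg_prob < CONFIDENCE_THRESHOLD
    return flags
```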


Production Deployment

We deployed with vLLM behind a FastAPI service:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="outputs/claims-v4-merged",
    max_model_len=4096,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(
    temperature=0.1,
    max_tokens=1024,
    top_p=0.95,
)
```

The service runs on a single g5.xlarge instance ($1.01/hour, roughly $730/month). The model handles 15,000 claims per month in batch processing with room to spare; peak throughput is approximately 40 claims per minute.

Measurable Results

After three months in production:

  • Review time reduced from 12 min to 4.5 min per claim (62% reduction)
  • Monthly labor cost reduced from $105,000 to $39,375 ($65,625 savings)
  • Infrastructure cost: $730/month (vLLM serving instance)
  • Net monthly savings: $64,895
  • Project ROI: roughly 129x (annual savings of $778,740 against a $47 training spend plus about $6,000 of engineering time)

What Went Wrong

Initial data quality issues. The first 8,000 historical claims had inconsistent formatting — some reviewers used abbreviations, others wrote full descriptions. We spent two weeks standardizing output formats before training produced usable results.
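The standardization pass looked roughly like this. The canonical-value table shown is a hypothetical fragment for one field; the real mapping covered every field with free-text variation.

```python
# Hypothetical canonicalization table for one field; the real one was
# built per-field from the reviewers' observed abbreviations
CANONICAL_INCIDENT_TYPES = {
    "mvc": "motor_vehicle_collision",
    "motor vehicle collision": "motor_vehicle_collision",
    "theft": "theft",
}


def standardize_extraction(extraction: dict) -> dict:
    out = dict(extraction)
    raw = str(out.get("incident_type", "")).strip().lower()
    # Fall back to the lowercased raw value when no mapping exists
    out["incident_type"] = CANONICAL_INCIDENT_TYPES.get(raw, raw)
    return out
```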

Overfitting during iteration 3. Our first attempt at iteration 3 increased training to 5 epochs without reducing the learning rate, and the model memorized training examples. We discovered this when accuracy on the eval set hit 96% while production accuracy dropped to 89%. Reducing the learning rate (to the 1e-4 in the final config) and adding more diverse examples fixed this.

Latency variance. While average latency was 0.6 seconds, claims with very long documents (10+ pages) took 3-4 seconds due to the 4096 token context limit. We had to implement document chunking for long claims, which added engineering complexity.
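The chunking logic can be sketched as follows. The character budget (approximating the 4096-token context) and the overlap size are illustrative assumptions; merging per-chunk extractions back into one record added further complexity not shown here.

```python
def chunk_document(text: str, max_chars: int = 12_000, overlap: int = 500) -> list[str]:
    """Split a long claim document into overlapping windows.

    max_chars approximates the 4096-token context budget in characters;
    the overlap keeps fields that straddle a boundary visible in at
    least one chunk.
    """
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks
```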

Honest Retrospective

Would we fine-tune again? Absolutely. The cost savings are unambiguous, and the accuracy improvement over prompting was significant for this specific task.

What we'd do differently:

  1. Standardize training data formatting before the first training run, not after wasting two iterations.
  2. Start with 500 hand-curated examples instead of 8,000 historical ones. Quality trumps quantity.
  3. Build the evaluation pipeline first, before the first training run.
  4. Implement confidence scoring from the start to route low-confidence extractions to human review.

Conclusion

Fine-tuning a relatively small model (8B parameters) on a focused task with quality training data delivered production results that exceeded both prompt engineering and RAG approaches. The total project cost was negligible compared to the monthly savings, and the iteration cycle (4 training runs over 6 weeks) was manageable for a two-person team.

The key insight is that fine-tuning works best for narrow, well-defined tasks where you have access to verified ground truth data. Claims extraction is an ideal use case: clear input format, defined output schema, and thousands of historically verified examples. For broader, open-ended tasks, prompt engineering or RAG may still be the pragmatic choice.



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
