AI Architecture

LLM Fine-Tuning at Scale: Lessons from Production

Real-world lessons from fine-tuning an LLM for production use, including architecture decisions, measurable results, and an honest retrospective.

Muneer Puthiya Purayil · 9 min read

In mid-2024, we fine-tuned a Llama 3.1 8B model for automated insurance claims processing. The project took 6 weeks from concept to production, reduced manual review time by 62%, and achieved 94.3% accuracy on structured extraction tasks — up from 78% with prompt engineering alone. Here is what actually happened.

The Problem

Our insurance client processed 15,000 claims per month. Each claim document required extracting 23 structured fields: claimant information, incident details, damage descriptions, policy numbers, and coverage determinations. Human reviewers spent an average of 12 minutes per claim. At $35/hour fully loaded, that was $105,000/month in review labor.

The initial approach — GPT-4 with few-shot prompting — achieved 78% field-level accuracy. The remaining 22% required human correction, which took nearly as long as reviewing from scratch because reviewers had to verify every field to find the errors.

Architecture Decisions

Why Fine-Tuning Over RAG

We evaluated RAG first: a system that retrieved similar historical claims and used them as context for GPT-4. Accuracy improved to 83%, but at $0.08 per claim in API costs ($1,200/month) and 8-12 seconds of latency per document. Latency was the bigger problem: reviewers waiting 10 seconds per claim broke their workflow.

Fine-tuning a smaller model offered: (1) sub-second inference, (2) $0.001 per claim on self-hosted infrastructure, and (3) data never leaving our infrastructure — critical for PHI/PII compliance.

Model Selection

We chose Llama 3.1 8B over larger alternatives. The reasoning:

  • 8B parameters fit on a single A10G GPU for both training and inference
  • The claims extraction task was narrow enough that a smaller model could specialize effectively
  • Iteration speed mattered: 2-hour training runs vs 12+ hours for 70B models

Training Process

Data Preparation

We started with 8,000 historical claims that had been manually reviewed and corrected. The data pipeline:

```python
import json


def process_claim_to_training(
    claim_document: str,
    verified_extraction: dict,
) -> dict:
    instruction = (
        "Extract all structured fields from the following insurance claim document. "
        "Return a JSON object with the following fields: claimant_name, claimant_dob, "
        "policy_number, incident_date, incident_type, damage_description, "
        "estimated_amount, coverage_determination, and all other standard fields."
    )

    return {
        "instruction": instruction,
        "input": claim_document,
        "output": json.dumps(verified_extraction, indent=2),
    }


def split_dataset(examples: list[dict], train_ratio: float = 0.9):
    split_idx = int(len(examples) * train_ratio)
    return examples[:split_idx], examples[split_idx:]
```

After deduplication and quality filtering (removing claims with incomplete extractions), we had 6,847 training examples and 761 evaluation examples.
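The deduplication and quality filter can be sketched roughly as follows. The hash-based dedup and the specific required-field check are assumptions for illustration; the exact criteria we used were more involved.

```python
import hashlib
import json

# Hypothetical subset of fields a verified extraction must contain
REQUIRED_FIELDS = {"claimant_name", "policy_number", "incident_date"}


def filter_and_dedup(examples: list[dict]) -> list[dict]:
    seen: set[str] = set()
    kept = []
    for ex in examples:
        # Drop claims whose verified extraction is missing required fields
        extraction = json.loads(ex["output"])
        if not REQUIRED_FIELDS.issubset(extraction):
            continue
        # Deduplicate on a hash of the raw claim document text
        digest = hashlib.sha256(ex["input"].encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(ex)
    return kept
```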

Training Iterations

Iteration 1: Baseline LoRA fine-tuning. 6,847 examples, default hyperparameters (lr=2e-4, r=16, 3 epochs). Field-level accuracy: 87.2%. The model learned the output format perfectly but struggled with ambiguous damage descriptions and multi-vehicle incidents.

Iteration 2: Targeted data augmentation. We added 400 examples specifically for the failure modes: complex damage descriptions (127 examples), multi-party incidents (89 examples), and edge cases in coverage determination (184 examples). Accuracy: 91.8%.

Iteration 3: Hyperparameter tuning. We reduced the learning rate to 1e-4, increased LoRA rank to 32, and trained for 5 epochs. Accuracy: 93.1%.

Iteration 4: Chain-of-thought formatting. We reformatted outputs to include reasoning before the JSON extraction. Accuracy: 94.3%. The reasoning chain helped the model handle ambiguous cases by forcing explicit intermediate steps.
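The iteration 4 reformatting can be illustrated with a small sketch. The exact reasoning template and section markers are assumptions; what matters is that the reasoning precedes the JSON and the JSON can be recovered at inference time.

```python
import json


def format_cot_output(reasoning: str, extraction: dict) -> str:
    # Put an explicit reasoning section before the JSON so the model must
    # commit to intermediate steps before emitting the final extraction
    return (
        "Reasoning:\n"
        f"{reasoning.strip()}\n\n"
        "Extraction:\n"
        f"{json.dumps(extraction, indent=2)}"
    )


def parse_cot_output(model_output: str) -> dict:
    # Recover the JSON object after the "Extraction:" marker at inference time
    _, _, json_part = model_output.partition("Extraction:")
    return json.loads(json_part)
```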

Final training configuration:

```python
from peft import LoraConfig
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/claims-v4",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Total training cost across all iterations: $47 in GPU compute (A10G on AWS).

Evaluation

```python
def normalize(value) -> str:
    # Simple normalizer so formatting differences (case, whitespace)
    # don't count as field mismatches
    return str(value).strip().lower()


def evaluate_claims_extraction(
    predictions: list[dict],
    ground_truth: list[dict],
) -> dict:
    field_accuracies: dict[str, list[int]] = {}
    total_fields = 0
    correct_fields = 0

    for pred, truth in zip(predictions, ground_truth):
        for field in truth:
            total_fields += 1
            if field in pred and normalize(pred[field]) == normalize(truth[field]):
                correct_fields += 1
                field_accuracies.setdefault(field, []).append(1)
            else:
                field_accuracies.setdefault(field, []).append(0)

    per_field = {
        field: sum(scores) / len(scores)
        for field, scores in field_accuracies.items()
    }

    return {
        "overall_accuracy": correct_fields / total_fields,
        "per_field_accuracy": per_field,
        "worst_fields": sorted(per_field.items(), key=lambda x: x[1])[:5],
    }
```

Results on the held-out evaluation set:

| Metric | Prompt Engineering | RAG + GPT-4 | Fine-tuned 8B |
|---|---|---|---|
| Field-level accuracy | 78.0% | 83.2% | 94.3% |
| Latency per claim | 8.2s | 11.4s | 0.6s |
| Cost per claim | $0.06 | $0.08 | $0.001 |
| Monthly inference cost | $900 | $1,200 | $15 |
| Format compliance | 92% | 95% | 99.7% |

The accuracy improvement from 78% to 94.3% crossed a critical threshold: at 94%+ accuracy, reviewers could trust the extraction and only verify flagged low-confidence fields, reducing review time from 12 minutes to 4.5 minutes per claim.
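Flagging low-confidence fields can be done with token log-probabilities from the serving stack. The sketch below is an assumption about how such routing could work; the 0.90 threshold and the per-field grouping of logprobs are hypothetical, not our production values.

```python
import math

# Hypothetical cutoff below which a field is routed to human review
CONFIDENCE_THRESHOLD = 0.90


def flag_low_confidence(field_token_logprobs: dict[str, list[float]]) -> dict[str, bool]:
    """Map each field name to True if it needs human review.

    field_token_logprobs: per-field log-probabilities of the tokens the
    model generated for that field's value.
    """
    flags = {}
    for field, logprobs in field_token_logprobs.items():
        # Geometric-mean token probability as a crude field-level confidence
        avg_prob = math.exp(sum(logprobs) / len(logprobs)) if logprobs else 0.0
        flags[field] = avg_prob < CONFIDENCE_THRESHOLD
    return flags
```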


Production Deployment

We deployed with vLLM behind a FastAPI service:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="outputs/claims-v4-merged",
    max_model_len=4096,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(
    temperature=0.1,
    max_tokens=1024,
    top_p=0.95,
)
```

The service runs on a single g5.xlarge instance ($1.01/hour, roughly $730/month). The model handles 15,000 claims per month in batch processing with room to spare; peak throughput is approximately 40 claims per minute.

Measurable Results

After three months in production:

  • Review time reduced from 12 min to 4.5 min per claim (62% reduction)
  • Monthly labor cost reduced from $105,000 to $39,375 ($65,625 savings)
  • Infrastructure cost: $730/month (vLLM serving instance)
  • Net monthly savings: $64,895
  • Project ROI: roughly 129x (annual savings of $778,740 against a $47 training spend plus about $6,000 of engineering time)

What Went Wrong

Initial data quality issues. The first 8,000 historical claims had inconsistent formatting — some reviewers used abbreviations, others wrote full descriptions. We spent two weeks standardizing output formats before training produced usable results.
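The standardization pass looked roughly like this. The canonical-value table shown is a hypothetical fragment for one field; the real mapping covered every field with free-text variation.

```python
# Hypothetical canonicalization table for one field; the real one was
# built per-field from the reviewers' observed abbreviations
CANONICAL_INCIDENT_TYPES = {
    "mvc": "motor_vehicle_collision",
    "motor vehicle collision": "motor_vehicle_collision",
    "theft": "theft",
}


def standardize_extraction(extraction: dict) -> dict:
    out = dict(extraction)
    raw = str(out.get("incident_type", "")).strip().lower()
    # Fall back to the lowercased raw value when no mapping exists
    out["incident_type"] = CANONICAL_INCIDENT_TYPES.get(raw, raw)
    return out
```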

Overfitting during iteration 3. Our first attempt at iteration 3 increased training to 5 epochs without reducing the learning rate, and the model memorized training examples. We discovered this when accuracy on the eval set hit 96% while production accuracy dropped to 89%. Reducing the learning rate (to the 1e-4 in the final config) and adding more diverse examples fixed this.

Latency variance. While average latency was 0.6 seconds, claims with very long documents (10+ pages) took 3-4 seconds due to the 4096 token context limit. We had to implement document chunking for long claims, which added engineering complexity.
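The chunking logic can be sketched as follows. The character budget (approximating the 4096-token context) and the overlap size are illustrative assumptions; merging per-chunk extractions back into one record added further complexity not shown here.

```python
def chunk_document(text: str, max_chars: int = 12_000, overlap: int = 500) -> list[str]:
    """Split a long claim document into overlapping windows.

    max_chars approximates the 4096-token context budget in characters;
    the overlap keeps fields that straddle a boundary visible in at
    least one chunk.
    """
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks
```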

Honest Retrospective

Would we fine-tune again? Absolutely. The cost savings are unambiguous, and the accuracy improvement over prompting was significant for this specific task.

What we'd do differently:

  1. Standardize training data formatting before the first training run, not after wasting two iterations.
  2. Start with 500 hand-curated examples instead of 8,000 historical ones. Quality trumps quantity.
  3. Build the evaluation pipeline first, before the first training run.
  4. Implement confidence scoring from the start to route low-confidence extractions to human review.

Conclusion

Fine-tuning a relatively small model (8B parameters) on a focused task with quality training data delivered production results that exceeded both prompt engineering and RAG approaches. The total project cost was negligible compared to the monthly savings, and the iteration cycle (4 training runs over 6 weeks) was manageable for a two-person team.

The key insight is that fine-tuning works best for narrow, well-defined tasks where you have access to verified ground truth data. Claims extraction is an ideal use case: clear input format, defined output schema, and thousands of historically verified examples. For broader, open-ended tasks, prompt engineering or RAG may still be the pragmatic choice.



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
