In mid-2024, we fine-tuned a Llama 3.1 8B model for automated insurance claims processing. The project took 6 weeks from concept to production, reduced manual review time by 62%, and achieved 94.3% accuracy on structured extraction tasks — up from 78% with prompt engineering alone. Here is what actually happened.
The Problem
Our insurance client processed 15,000 claims per month. Each claim document required extracting 23 structured fields: claimant information, incident details, damage descriptions, policy numbers, and coverage determinations. Human reviewers spent an average of 12 minutes per claim. At $35/hour fully loaded, that was $105,000/month in review labor.
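The extraction target can be pictured as a flat JSON schema with a validator over it. This is an illustrative sketch only: the field names below are hypothetical stand-ins, not the client's actual 23-field schema.

```python
# Hypothetical subset of the 23-field extraction schema; field names are
# illustrative, not the client's actual schema.
CLAIM_SCHEMA = {
    "claimant_name": str,
    "claimant_policy_number": str,
    "incident_date": str,           # e.g. ISO 8601
    "incident_location": str,
    "damage_description": str,
    "estimated_damage_usd": float,
    "coverage_determination": str,  # e.g. "covered" / "excluded" / "needs_review"
}

def validate_extraction(record: dict) -> list[str]:
    """Return the names of fields that are missing or have the wrong type."""
    errors = []
    for field, expected_type in CLAIM_SCHEMA.items():
        if field not in record or not isinstance(record[field], expected_type):
            errors.append(field)
    return errors
```

A validator like this is also what makes the "format compliance" metric later in the post measurable per claim.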
The initial approach — GPT-4 with few-shot prompting — achieved 78% field-level accuracy. The remaining 22% required human correction, which took nearly as long as reviewing from scratch because reviewers had to verify every field to find the errors.
Architecture Decisions
Why Fine-Tuning Over RAG
RAG was evaluated first. We built a system that retrieved similar historical claims and used them as context for GPT-4. Accuracy improved to 83%, but at $0.08 per claim in API costs ($1,200/month) and 8-12 second latency per document. The latency was the bigger problem — reviewers waiting 10 seconds per claim broke their workflow.
Fine-tuning a smaller model offered: (1) sub-second inference, (2) $0.001 per claim on self-hosted infrastructure, and (3) data never leaving our infrastructure — critical for PHI/PII compliance.
Model Selection
We chose Llama 3.1 8B over larger alternatives. The reasoning:
- 8B parameters fit on a single A10G GPU for both training and inference
- The claims extraction task was narrow enough that a smaller model could specialize effectively
- Iteration speed mattered: 2-hour training runs vs 12+ hours for 70B models
Training Process
Data Preparation
We started with 8,000 historical claims that had been manually reviewed and corrected. After deduplication and quality filtering (removing claims with incomplete extractions), we had 6,847 training examples and 761 evaluation examples.
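The deduplication and quality-filter step can be sketched as follows. This is a minimal reconstruction, not the production pipeline; `REQUIRED_FIELDS` and the record layout are illustrative assumptions.

```python
import hashlib

# Illustrative: the real pipeline checked all 23 fields.
REQUIRED_FIELDS = ["policy_number", "incident_date", "coverage_determination"]

def clean_claims(claims: list[dict]) -> list[dict]:
    """Drop exact-duplicate documents and claims with incomplete extractions."""
    seen_hashes = set()
    kept = []
    for claim in claims:
        doc_hash = hashlib.sha256(claim["document_text"].encode()).hexdigest()
        if doc_hash in seen_hashes:
            continue  # duplicate document
        seen_hashes.add(doc_hash)
        labels = claim["labels"]
        if any(not labels.get(field) for field in REQUIRED_FIELDS):
            continue  # incomplete extraction: missing or empty required field
        kept.append(claim)
    return kept
```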
Training Iterations
Iteration 1: Baseline LoRA fine-tuning. 6,847 examples, default hyperparameters (lr=2e-4, r=16, 3 epochs). Field-level accuracy: 87.2%. The model learned the output format perfectly but struggled with ambiguous damage descriptions and multi-vehicle incidents.
Iteration 2: Targeted data augmentation. We added 400 examples specifically for the failure modes: complex damage descriptions (127 examples), multi-party incidents (89 examples), and edge cases in coverage determination (184 examples). Accuracy: 91.8%.
Iteration 3: Hyperparameter tuning. We reduced the learning rate to 1e-4, increased LoRA rank to 32, and trained for 5 epochs. Accuracy: 93.1%.
Iteration 4: Chain-of-thought formatting. We reformatted outputs to include reasoning before the JSON extraction. Accuracy: 94.3%. The reasoning chain helped the model handle ambiguous cases by forcing explicit intermediate steps.
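The reformatted training target looked roughly like this. The `Reasoning:`/`Extraction:` delimiters are an assumed reconstruction, not the exact production format:

```python
import json

def format_target(reasoning: str, fields: dict) -> str:
    """Training target: free-text reasoning first, then the JSON extraction.

    The delimiter strings are assumptions for illustration.
    """
    return f"Reasoning: {reasoning}\n\nExtraction:\n{json.dumps(fields, indent=2)}"

def parse_target(output: str) -> dict:
    """Recover the JSON extraction from a reasoning-then-JSON completion."""
    json_part = output.split("Extraction:", 1)[1]
    return json.loads(json_part)
```

Putting the reasoning before the JSON matters: the model commits to intermediate conclusions (e.g. which vehicle belongs to which party) before it has to fill in fields.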
Final training configuration:
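In outline, the final run combined the iteration-3 hyperparameters with the chain-of-thought targets. Values not stated in the text (`lora_alpha`, dropout, batch size) are assumptions:

```python
# Final LoRA fine-tuning configuration (sketch). Only learning_rate,
# lora_rank, and epochs are stated in the text; the rest are assumptions.
FINAL_CONFIG = {
    "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "learning_rate": 1e-4,        # reduced from the 2e-4 baseline
    "lora_rank": 32,              # increased from r=16
    "lora_alpha": 64,             # assumption: 2x rank
    "lora_dropout": 0.05,         # assumption
    "epochs": 5,
    "max_seq_len": 4096,          # matches the serving context limit
    "per_device_batch_size": 4,   # assumption
    "output_format": "chain-of-thought reasoning, then JSON",
}
```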
Total training cost across all iterations: $47 in GPU compute (A10G on AWS).
Evaluation
Results on the held-out evaluation set:
| Metric | Prompt Engineering | RAG + GPT-4 | Fine-tuned 8B |
|---|---|---|---|
| Field-level accuracy | 78.0% | 83.2% | 94.3% |
| Latency per claim | 8.2s | 11.4s | 0.6s |
| Cost per claim | $0.06 | $0.08 | $0.001 |
| Monthly inference cost | $900 | $1,200 | $15 |
| Format compliance | 92% | 95% | 99.7% |
The accuracy improvement from 78% to 94.3% crossed a critical threshold: at 94%+ accuracy, reviewers could trust the extraction and only verify flagged low-confidence fields, reducing review time from 12 minutes to 4.5 minutes per claim.
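The flag-and-route step can be sketched as below. The 0.9 threshold and the source of per-field confidences are assumptions; in practice confidences might come from, e.g., mean token log-probabilities over each field's output span.

```python
def route_fields(extraction: dict, confidences: dict, threshold: float = 0.9):
    """Split extracted fields into auto-accepted and flagged-for-review.

    `confidences` maps field name -> score in [0, 1]; fields with no
    score are conservatively flagged. Threshold is an assumed value.
    """
    accepted, flagged = {}, {}
    for field, value in extraction.items():
        if confidences.get(field, 0.0) >= threshold:
            accepted[field] = value
        else:
            flagged[field] = value
    return accepted, flagged
```

Reviewers then verify only the `flagged` dict rather than all 23 fields, which is where the 12-minute-to-4.5-minute reduction comes from.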
Production Deployment
We deployed with vLLM behind a FastAPI service:
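In outline, the service runs one generate call per claim and parses the trailing JSON. This sketch stubs out the engine so the request-side logic stands alone; in production the `generate` callable wraps vLLM's batched inference behind a FastAPI endpoint, and the prompt template below is an assumption.

```python
import json

# Assumed prompt template; the production prompt is not shown in the post.
PROMPT_TEMPLATE = (
    "Extract the claim fields as JSON.\n\nClaim document:\n{doc}\n\n"
)

def extract_claim(document_text: str, generate) -> dict:
    """Run one claim through the model and parse the extraction.

    `generate` is injected (in production: a wrapper around the vLLM
    engine) so this logic is testable without a GPU.
    """
    completion = generate(PROMPT_TEMPLATE.format(doc=document_text))
    # Per the training format, the model emits reasoning first,
    # then "Extraction:" followed by a JSON object.
    json_part = completion.split("Extraction:", 1)[1]
    return json.loads(json_part)
```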
Running on a single g5.xlarge instance ($1.01/hour, $730/month). The model handles 15,000 claims per month in batch processing with room to spare — peak throughput is approximately 40 claims per minute.
Measurable Results
After three months in production:
- Review time reduced from 12 min to 4.5 min per claim (62% reduction)
- Monthly labor cost reduced from $105,000 to $39,375 ($65,625 savings)
- Infrastructure cost: $730/month (vLLM serving instance)
- Net monthly savings: $64,895
- Project ROI: roughly 129x (annual savings of $778,740 on a $6,047 investment: $47 in training compute plus about $6,000 in engineering time)
What Went Wrong
Initial data quality issues. The first 8,000 historical claims had inconsistent formatting — some reviewers used abbreviations, others wrote full descriptions. We spent two weeks standardizing output formats before training produced usable results.
Overfitting on iteration 3. When we first increased to 5 epochs while keeping the 2e-4 learning rate, the model memorized training examples. We discovered this when accuracy on the eval set hit 96% while production accuracy dropped to 89%. Adding more diverse examples and reducing the learning rate to 1e-4 fixed it.
Latency variance. While average latency was 0.6 seconds, claims with very long documents (10+ pages) took 3-4 seconds due to the 4096 token context limit. We had to implement document chunking for long claims, which added engineering complexity.
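The chunking workaround, in sketch form. Character counts stand in for a proper tokenizer here, and the merge rule (first non-empty value wins) is an assumption about how per-chunk extractions were combined:

```python
def chunk_document(text: str, max_chars: int = 12000, overlap: int = 500) -> list[str]:
    """Split a long claim document into overlapping chunks.

    Uses character counts as a stand-in for token counts; overlap keeps
    fields that straddle a boundary visible in at least one chunk.
    """
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks

def merge_extractions(per_chunk: list[dict]) -> dict:
    """Combine per-chunk field extractions; first non-empty value wins."""
    merged = {}
    for fields in per_chunk:
        for key, value in fields.items():
            if key not in merged or not merged[key]:
                merged[key] = value
    return merged
```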
Honest Retrospective
Would we fine-tune again? Absolutely. The cost savings are unambiguous, and the accuracy improvement over prompting was significant for this specific task.
What we'd do differently:
- Standardize training data formatting before the first training run, not after wasting two iterations.
- Start with 500 hand-curated examples instead of 8,000 historical ones. Quality trumps quantity.
- Build the evaluation pipeline first, before the first training run.
- Implement confidence scoring from the start to route low-confidence extractions to human review.
Conclusion
Fine-tuning a relatively small model (8B parameters) on a focused task with quality training data delivered production results that exceeded both prompt engineering and RAG approaches. The total project cost was negligible compared to the monthly savings, and the iteration cycle (4 training runs over 6 weeks) was manageable for a two-person team.
The key insight is that fine-tuning works best for narrow, well-defined tasks where you have access to verified ground truth data. Claims extraction is an ideal use case: clear input format, defined output schema, and thousands of historically verified examples. For broader, open-ended tasks, prompt engineering or RAG may still be the pragmatic choice.