
Complete Guide to Production LLM Fine-Tuning with Python

A comprehensive guide to fine-tuning LLMs for production with Python, covering architecture, code examples, and production-ready patterns.

Muneer Puthiya Purayil · 19 min read

Python is the default language for LLM fine-tuning — the entire ecosystem (Hugging Face Transformers, PyTorch, DeepSpeed) is Python-first. This guide covers the complete pipeline: data preparation, training configuration, evaluation, and deployment, with production-ready code for each stage.

Environment Setup

bash
pip install torch transformers datasets peft trl accelerate bitsandbytes
pip install vllm  # for inference
pip install mlflow wandb  # for experiment tracking
pip install presidio-analyzer presidio-anonymizer  # for PII detection

Pin versions in production:

txt
torch==2.3.0
transformers==4.44.0
datasets==2.20.0
peft==0.12.0
trl==0.9.0
accelerate==0.33.0
bitsandbytes==0.43.0

Data Pipeline

Dataset Format

The standard format for instruction fine-tuning is a JSONL file with instruction-input-output triples:

json
{"instruction": "Summarize this customer support ticket", "input": "Customer reports that...", "output": "The customer is experiencing..."}

For chat-style fine-tuning, use the conversation format:

json
{"messages": [{"role": "system", "content": "You are a helpful assistant"}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Data Processing Pipeline

python
import json
import hashlib
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

class DataPipeline:
    def __init__(self, model_name: str, max_length: int = 2048):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.max_length = max_length

    def load_and_validate(self, data_path: str) -> list[dict]:
        examples = []
        issues = []
        seen_hashes = set()

        with open(data_path) as f:
            for line_num, line in enumerate(f, 1):
                try:
                    item = json.loads(line.strip())
                except json.JSONDecodeError:
                    issues.append(f"Line {line_num}: invalid JSON")
                    continue

                required = {"instruction", "input", "output"}
                if not required.issubset(item.keys()):
                    missing = required - set(item.keys())
                    issues.append(f"Line {line_num}: missing fields {missing}")
                    continue

                content_hash = hashlib.md5(
                    f"{item['instruction']}|{item['output']}".encode()
                ).hexdigest()
                if content_hash in seen_hashes:
                    issues.append(f"Line {line_num}: duplicate content")
                    continue
                seen_hashes.add(content_hash)

                examples.append(item)

        if issues:
            print(f"Data validation: {len(issues)} issues found")
            for issue in issues[:10]:
                print(f"  - {issue}")

        print(f"Loaded {len(examples)} valid examples from {data_path}")
        return examples

    def format_for_training(self, examples: list[dict]) -> Dataset:
        formatted = []
        for ex in examples:
            if ex["input"]:
                text = (
                    f"### Instruction:\n{ex['instruction']}\n\n"
                    f"### Input:\n{ex['input']}\n\n"
                    f"### Response:\n{ex['output']}"
                )
            else:
                text = (
                    f"### Instruction:\n{ex['instruction']}\n\n"
                    f"### Response:\n{ex['output']}"
                )
            formatted.append({"text": text})
        return Dataset.from_list(formatted)

    def format_chat_for_training(self, examples: list[dict]) -> Dataset:
        formatted = []
        for ex in examples:
            messages = ex.get("messages", [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": ex.get("instruction", "") + "\n" + ex.get("input", "")},
                {"role": "assistant", "content": ex["output"]},
            ])
            text = self.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=False
            )
            formatted.append({"text": text})
        return Dataset.from_list(formatted)

    def create_splits(
        self, dataset: Dataset, eval_ratio: float = 0.05
    ) -> DatasetDict:
        split = dataset.train_test_split(test_size=eval_ratio, seed=42)
        return DatasetDict({
            "train": split["train"],
            "eval": split["test"],
        })

Training Configuration

python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

def train_lora(
    base_model: str,
    train_dataset: Dataset,
    eval_dataset: Dataset,
    output_dir: str,
    lora_r: int = 16,
    lora_alpha: int = 32,
    learning_rate: float = 2e-4,
    num_epochs: int = 3,
    batch_size: int = 4,
    gradient_accumulation: int = 4,
    max_seq_length: int = 2048,
):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        # Requires the flash-attn package; drop this argument if it is not installed
        attn_implementation="flash_attention_2",
    )

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        bias="none",
    )

    model = get_peft_model(model, lora_config)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation,
        learning_rate=learning_rate,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        eval_strategy="epoch",  # renamed from evaluation_strategy in transformers 4.41
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        bf16=True,
        gradient_checkpointing=True,
        max_grad_norm=1.0,
        report_to="wandb",
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        args=training_args,
        tokenizer=tokenizer,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
    )

    trainer.train()
    trainer.save_model(output_dir)

    return trainer
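One detail worth making explicit: with gradient accumulation, each optimizer step sees per_device_train_batch_size × gradient_accumulation_steps × number of GPUs examples, so the defaults above amount to an effective batch of 16 on one GPU. A trivial helper (name is illustrative) is handy to log alongside each run:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * grad_accum * num_gpus

# Defaults from train_lora above, single GPU:
print(effective_batch_size(4, 4))  # → 16
```

If you change any of the three knobs, adjust the learning rate with the effective batch in mind rather than the per-device value alone.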

QLoRA for Memory-Constrained Training

python
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

def train_qlora(
    base_model: str,
    train_dataset: Dataset,
    eval_dataset: Dataset,
    output_dir: str,
):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # Rest of training is identical to LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )

    model = get_peft_model(model, lora_config)
    # ... same TrainingArguments and SFTTrainer setup

QLoRA vs LoRA: QLoRA quantizes the base model to 4-bit, reducing GPU memory for the weights from ~16GB (16-bit 8B model) to ~4GB. The trade-off is a 5-10% quality reduction and roughly 20% slower training due to quantization/dequantization overhead. Use QLoRA when your GPU has less than 24GB VRAM; use LoRA when you have 40GB+ VRAM.
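The weight-memory arithmetic behind those numbers is simple: parameter count × bits ÷ 8 bytes per parameter. A back-of-the-envelope helper (illustrative only; it ignores activations, gradients, optimizer state, and the KV cache, which add substantially on top):

```python
def base_weights_gb(num_params_billion: float, bits: int) -> float:
    """GB needed just to hold the base model weights at a given precision.

    num_params_billion * 1e9 params * (bits / 8) bytes each, divided by 1e9
    bytes per GB, simplifies to num_params_billion * bits / 8.
    """
    return num_params_billion * bits / 8

print(base_weights_gb(8, 16))  # → 16.0 GB for an 8B model in bf16
print(base_weights_gb(8, 4))   # → 4.0 GB with 4-bit quantization
```

Running the actual training still needs headroom beyond the weights, which is why the 24GB threshold above leaves a comfortable margin over the 4GB quantized figure.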


Model Merging and Export

python
from peft import PeftModel

def merge_adapter(
    base_model: str,
    adapter_path: str,
    output_path: str,
    push_to_hub: bool = False,
    hub_repo: str | None = None,
):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="cpu",
    )

    model = PeftModel.from_pretrained(model, adapter_path)
    merged = model.merge_and_unload()
    merged.save_pretrained(output_path)
    tokenizer.save_pretrained(output_path)

    if push_to_hub and hub_repo:
        merged.push_to_hub(hub_repo)
        tokenizer.push_to_hub(hub_repo)

    print(f"Merged model saved to {output_path}")
    return output_path

Evaluation Framework

python
import json
import re
from dataclasses import dataclass

import torch

@dataclass
class EvalMetrics:
    accuracy: float
    format_compliance: float
    avg_similarity: float
    total_examples: int
    per_category: dict

def evaluate_model(
    model,
    tokenizer,
    eval_data: list[dict],
    max_new_tokens: int = 512,
) -> EvalMetrics:
    correct = 0
    format_ok = 0
    similarities = []
    per_category = {}

    for item in eval_data:
        prompt = item["prompt"]
        expected = item["expected"]
        category = item.get("category", "general")

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.1,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
            )

        generated = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        ).strip()

        is_correct = normalize_text(generated) == normalize_text(expected)
        is_format_ok = check_output_format(generated, item.get("format_spec"))

        if is_correct:
            correct += 1
        if is_format_ok:
            format_ok += 1

        sim = compute_similarity(generated, expected)
        similarities.append(sim)

        if category not in per_category:
            per_category[category] = {"correct": 0, "total": 0}
        per_category[category]["total"] += 1
        if is_correct:
            per_category[category]["correct"] += 1

    total = len(eval_data)
    return EvalMetrics(
        accuracy=correct / total,
        format_compliance=format_ok / total,
        avg_similarity=sum(similarities) / len(similarities),
        total_examples=total,
        per_category={
            k: v["correct"] / v["total"]
            for k, v in per_category.items()
        },
    )

def normalize_text(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def compute_similarity(a: str, b: str) -> float:
    from difflib import SequenceMatcher
    return SequenceMatcher(None, normalize_text(a), normalize_text(b)).ratio()

def check_output_format(output: str, format_spec: dict | None) -> bool:
    if not format_spec:
        return True
    if format_spec.get("type") == "json":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    if format_spec.get("type") == "markdown":
        return output.startswith("#") or "##" in output
    return True
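Keep in mind that the similarity metric here is surface-level string overlap, not semantic similarity: it forgives whitespace and casing but scores paraphrases poorly. A quick standalone check (the two helpers are duplicated from the framework above so this snippet runs on its own):

```python
import re
from difflib import SequenceMatcher

def normalize_text(text: str) -> str:
    # Lowercase and collapse all whitespace runs to a single space
    return re.sub(r"\s+", " ", text.strip().lower())

def compute_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize_text(a), normalize_text(b)).ratio()

# Whitespace and casing differences are ignored entirely...
print(compute_similarity("Hello\n  WORLD", "hello world"))  # → 1.0
# ...but a paraphrase with the same meaning scores low
print(compute_similarity("the cat sat", "a feline was seated"))
```

For tasks where wording legitimately varies, consider supplementing this with an embedding-based similarity or an LLM-as-judge pass; the exact-match and format checks remain useful as cheap first-line signals.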

Inference and Serving

vLLM for Production

python
from vllm import LLM, SamplingParams

class InferenceService:
    def __init__(self, model_path: str, max_model_len: int = 4096):
        self.llm = LLM(
            model=model_path,
            max_model_len=max_model_len,
            gpu_memory_utilization=0.9,
            dtype="bfloat16",
        )
        self.sampling_params = SamplingParams(
            temperature=0.1,
            max_tokens=512,
            top_p=0.95,
            stop=["### Instruction:", "\n\n\n"],
        )

    def generate(self, prompts: list[str]) -> list[str]:
        outputs = self.llm.generate(prompts, self.sampling_params)
        return [output.outputs[0].text.strip() for output in outputs]

    def generate_single(self, prompt: str) -> str:
        return self.generate([prompt])[0]

OpenAI-Compatible API

bash
python -m vllm.entrypoints.openai.api_server \
    --model outputs/merged-model \
    --max-model-len 4096 \
    --dtype bfloat16 \
    --port 8000

This provides /v1/chat/completions and /v1/completions endpoints compatible with the OpenAI Python SDK:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="outputs/merged-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Extract the key entities from this text..."},
    ],
    temperature=0.1,
    max_tokens=512,
)
print(response.choices[0].message.content)

Conclusion

The Python LLM fine-tuning stack — Transformers, PEFT, TRL, and vLLM — provides a complete pipeline from raw data to production inference. The key to success is not in sophisticated training configurations but in data quality, iterative evaluation, and proper serving infrastructure. Start with the simplest configuration (LoRA, default hyperparameters, 500 examples), evaluate rigorously, and add complexity only where evaluation shows room for improvement.

The most common failure mode is over-investing in training infrastructure before the data pipeline and evaluation framework are solid. Get the data right first — the training will follow.
