
Complete Guide to Production LLM Fine-Tuning with Python

A comprehensive guide to fine-tuning LLMs for production with Python, covering architecture, code examples, and production-ready patterns.

Muneer Puthiya Purayil · 19 min read

Python is the default language for LLM fine-tuning — the entire ecosystem (Hugging Face Transformers, PyTorch, DeepSpeed) is Python-first. This guide covers the complete pipeline: data preparation, training configuration, evaluation, and deployment, with production-ready code for each stage.

Environment Setup

bash
pip install torch transformers datasets peft trl accelerate bitsandbytes
pip install vllm  # for inference
pip install mlflow wandb  # for experiment tracking
pip install presidio-analyzer presidio-anonymizer  # for PII detection

Pin versions in production:

txt
torch==2.3.0
transformers==4.44.0
datasets==2.20.0
peft==0.12.0
trl==0.9.0
accelerate==0.33.0
bitsandbytes==0.43.0

Data Pipeline

Dataset Format

The standard format for instruction fine-tuning is a JSONL file with instruction-input-output triples:

json
{"instruction": "Summarize this customer support ticket", "input": "Customer reports that...", "output": "The customer is experiencing..."}

For chat-style fine-tuning, use the conversation format:

json
{"messages": [{"role": "system", "content": "You are a helpful assistant"}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Data Processing Pipeline

python
import json
import hashlib
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

class DataPipeline:
    def __init__(self, model_name: str, max_length: int = 2048):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.max_length = max_length

    def load_and_validate(self, data_path: str) -> list[dict]:
        examples = []
        issues = []
        seen_hashes = set()

        with open(data_path) as f:
            for line_num, line in enumerate(f, 1):
                try:
                    item = json.loads(line.strip())
                except json.JSONDecodeError:
                    issues.append(f"Line {line_num}: invalid JSON")
                    continue

                required = {"instruction", "input", "output"}
                if not required.issubset(item.keys()):
                    missing = required - set(item.keys())
                    issues.append(f"Line {line_num}: missing fields {missing}")
                    continue

                content_hash = hashlib.md5(
                    f"{item['instruction']}|{item['output']}".encode()
                ).hexdigest()
                if content_hash in seen_hashes:
                    issues.append(f"Line {line_num}: duplicate content")
                    continue
                seen_hashes.add(content_hash)

                examples.append(item)

        if issues:
            print(f"Data validation: {len(issues)} issues found")
            for issue in issues[:10]:
                print(f"  - {issue}")

        print(f"Loaded {len(examples)} valid examples from {data_path}")
        return examples

    def format_for_training(self, examples: list[dict]) -> Dataset:
        formatted = []
        for ex in examples:
            if ex["input"]:
                text = (
                    f"### Instruction:\n{ex['instruction']}\n\n"
                    f"### Input:\n{ex['input']}\n\n"
                    f"### Response:\n{ex['output']}"
                )
            else:
                text = (
                    f"### Instruction:\n{ex['instruction']}\n\n"
                    f"### Response:\n{ex['output']}"
                )
            formatted.append({"text": text})
        return Dataset.from_list(formatted)

    def format_chat_for_training(self, examples: list[dict]) -> Dataset:
        formatted = []
        for ex in examples:
            messages = ex.get("messages", [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": ex.get("instruction", "") + "\n" + ex.get("input", "")},
                {"role": "assistant", "content": ex["output"]},
            ])
            text = self.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=False
            )
            formatted.append({"text": text})
        return Dataset.from_list(formatted)

    def create_splits(
        self, dataset: Dataset, eval_ratio: float = 0.05
    ) -> DatasetDict:
        split = dataset.train_test_split(test_size=eval_ratio, seed=42)
        return DatasetDict({
            "train": split["train"],
            "eval": split["test"],
        })

Training Configuration

python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

def train_lora(
    base_model: str,
    train_dataset: Dataset,
    eval_dataset: Dataset,
    output_dir: str,
    lora_r: int = 16,
    lora_alpha: int = 32,
    learning_rate: float = 2e-4,
    num_epochs: int = 3,
    batch_size: int = 4,
    gradient_accumulation: int = 4,
    max_seq_length: int = 2048,
):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        # Requires the flash-attn package; drop this argument if it is not installed
        attn_implementation="flash_attention_2",
    )

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        bias="none",
    )

    model = get_peft_model(model, lora_config)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation,
        learning_rate=learning_rate,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        eval_strategy="epoch",  # renamed from evaluation_strategy in transformers 4.41
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        bf16=True,
        gradient_checkpointing=True,
        max_grad_norm=1.0,
        report_to="wandb",
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        args=training_args,
        tokenizer=tokenizer,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
    )

    trainer.train()
    trainer.save_model(output_dir)

    return trainer
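One detail worth making explicit: with gradient accumulation, each optimizer step sees per_device_train_batch_size × gradient_accumulation_steps × number of GPUs examples, so the defaults above amount to an effective batch of 16 on one GPU. A trivial helper (name is illustrative) is handy to log alongside each run:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * grad_accum * num_gpus

# Defaults from train_lora above, single GPU:
print(effective_batch_size(4, 4))  # → 16
```

If you change any of the three knobs, adjust the learning rate with the effective batch in mind rather than the per-device value alone.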

QLoRA for Memory-Constrained Training

python
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

def train_qlora(
    base_model: str,
    train_dataset: Dataset,
    eval_dataset: Dataset,
    output_dir: str,
):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # Rest of training is identical to LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )

    model = get_peft_model(model, lora_config)
    # ... same TrainingArguments and SFTTrainer setup

QLoRA vs LoRA: QLoRA quantizes the base model to 4-bit, reducing GPU memory for the weights from ~16GB (16-bit 8B model) to ~4GB. The trade-off is a 5-10% quality reduction and roughly 20% slower training due to quantization/dequantization overhead. Use QLoRA when your GPU has less than 24GB VRAM; use LoRA when you have 40GB+ VRAM.
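The weight-memory arithmetic behind those numbers is simple: parameter count × bits ÷ 8 bytes per parameter. A back-of-the-envelope helper (illustrative only; it ignores activations, gradients, optimizer state, and the KV cache, which add substantially on top):

```python
def base_weights_gb(num_params_billion: float, bits: int) -> float:
    """GB needed just to hold the base model weights at a given precision.

    num_params_billion * 1e9 params * (bits / 8) bytes each, divided by 1e9
    bytes per GB, simplifies to num_params_billion * bits / 8.
    """
    return num_params_billion * bits / 8

print(base_weights_gb(8, 16))  # → 16.0 GB for an 8B model in bf16
print(base_weights_gb(8, 4))   # → 4.0 GB with 4-bit quantization
```

Running the actual training still needs headroom beyond the weights, which is why the 24GB threshold above leaves a comfortable margin over the 4GB quantized figure.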


Model Merging and Export

python
from peft import PeftModel

def merge_adapter(
    base_model: str,
    adapter_path: str,
    output_path: str,
    push_to_hub: bool = False,
    hub_repo: str | None = None,
):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="cpu",
    )

    model = PeftModel.from_pretrained(model, adapter_path)
    merged = model.merge_and_unload()
    merged.save_pretrained(output_path)
    tokenizer.save_pretrained(output_path)

    if push_to_hub and hub_repo:
        merged.push_to_hub(hub_repo)
        tokenizer.push_to_hub(hub_repo)

    print(f"Merged model saved to {output_path}")
    return output_path

Evaluation Framework

python
import json
import re
from dataclasses import dataclass

import torch

@dataclass
class EvalMetrics:
    accuracy: float
    format_compliance: float
    avg_similarity: float
    total_examples: int
    per_category: dict

def evaluate_model(
    model,
    tokenizer,
    eval_data: list[dict],
    max_new_tokens: int = 512,
) -> EvalMetrics:
    correct = 0
    format_ok = 0
    similarities = []
    per_category = {}

    for item in eval_data:
        prompt = item["prompt"]
        expected = item["expected"]
        category = item.get("category", "general")

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.1,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
            )

        generated = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        ).strip()

        is_correct = normalize_text(generated) == normalize_text(expected)
        is_format_ok = check_output_format(generated, item.get("format_spec"))

        if is_correct:
            correct += 1
        if is_format_ok:
            format_ok += 1

        sim = compute_similarity(generated, expected)
        similarities.append(sim)

        if category not in per_category:
            per_category[category] = {"correct": 0, "total": 0}
        per_category[category]["total"] += 1
        if is_correct:
            per_category[category]["correct"] += 1

    total = len(eval_data)
    return EvalMetrics(
        accuracy=correct / total,
        format_compliance=format_ok / total,
        avg_similarity=sum(similarities) / len(similarities),
        total_examples=total,
        per_category={
            k: v["correct"] / v["total"]
            for k, v in per_category.items()
        },
    )

def normalize_text(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def compute_similarity(a: str, b: str) -> float:
    from difflib import SequenceMatcher
    return SequenceMatcher(None, normalize_text(a), normalize_text(b)).ratio()

def check_output_format(output: str, format_spec: dict | None) -> bool:
    if not format_spec:
        return True
    if format_spec.get("type") == "json":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    if format_spec.get("type") == "markdown":
        return output.startswith("#") or "##" in output
    return True
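Keep in mind that the similarity metric here is surface-level string overlap, not semantic similarity: it forgives whitespace and casing but scores paraphrases poorly. A quick standalone check (the two helpers are duplicated from the framework above so this snippet runs on its own):

```python
import re
from difflib import SequenceMatcher

def normalize_text(text: str) -> str:
    # Lowercase and collapse all whitespace runs to a single space
    return re.sub(r"\s+", " ", text.strip().lower())

def compute_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize_text(a), normalize_text(b)).ratio()

# Whitespace and casing differences are ignored entirely...
print(compute_similarity("Hello\n  WORLD", "hello world"))  # → 1.0
# ...but a paraphrase with the same meaning scores low
print(compute_similarity("the cat sat", "a feline was seated"))
```

For tasks where wording legitimately varies, consider supplementing this with an embedding-based similarity or an LLM-as-judge pass; the exact-match and format checks remain useful as cheap first-line signals.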

Inference and Serving

vLLM for Production

python
from vllm import LLM, SamplingParams

class InferenceService:
    def __init__(self, model_path: str, max_model_len: int = 4096):
        self.llm = LLM(
            model=model_path,
            max_model_len=max_model_len,
            gpu_memory_utilization=0.9,
            dtype="bfloat16",
        )
        self.sampling_params = SamplingParams(
            temperature=0.1,
            max_tokens=512,
            top_p=0.95,
            stop=["### Instruction:", "\n\n\n"],
        )

    def generate(self, prompts: list[str]) -> list[str]:
        outputs = self.llm.generate(prompts, self.sampling_params)
        return [output.outputs[0].text.strip() for output in outputs]

    def generate_single(self, prompt: str) -> str:
        return self.generate([prompt])[0]

OpenAI-Compatible API

bash
python -m vllm.entrypoints.openai.api_server \
    --model outputs/merged-model \
    --max-model-len 4096 \
    --dtype bfloat16 \
    --port 8000

This provides /v1/chat/completions and /v1/completions endpoints compatible with the OpenAI Python SDK:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="outputs/merged-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Extract the key entities from this text..."},
    ],
    temperature=0.1,
    max_tokens=512,
)
print(response.choices[0].message.content)

Conclusion

The Python LLM fine-tuning stack — Transformers, PEFT, TRL, and vLLM — provides a complete pipeline from raw data to production inference. The key to success is not in sophisticated training configurations but in data quality, iterative evaluation, and proper serving infrastructure. Start with the simplest configuration (LoRA, default hyperparameters, 500 examples), evaluate rigorously, and add complexity only where evaluation shows room for improvement.

The most common failure mode is over-investing in training infrastructure before the data pipeline and evaluation framework are solid. Get the data right first — the training will follow.
