
LLM Fine-Tuning Production Best Practices for High Scale Teams

Battle-tested practices for production LLM fine-tuning at high scale, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 13 min read

High-scale LLM fine-tuning operates under constraints that most teams never encounter: datasets exceeding GPU memory, multi-node distributed training, and the need to run hundreds of experiments per week. These practices address the operational challenges of fine-tuning at scale — where a single training run consumes $500+ in compute and a misconfigured hyperparameter wastes a full day.

Distributed Training Architecture

Multi-GPU Training with DeepSpeed

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model, TaskType

def setup_distributed_training(
    base_model: str,
    dataset_path: str,
    output_dir: str,
    num_gpus: int = 8,
):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=32,
        lora_alpha=64,
        lora_dropout=0.05,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    )

    model = get_peft_model(model, lora_config)

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        logging_steps=5,
        save_strategy="steps",
        save_steps=200,
        evaluation_strategy="steps",
        eval_steps=200,
        bf16=True,
        deepspeed="configs/ds_config_zero3.json",
        gradient_checkpointing=True,
        dataloader_num_workers=4,
        dataloader_pin_memory=True,
    )

    return model, tokenizer, training_args
```

DeepSpeed ZeRO-3 Configuration

```json
{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none"
        },
        "offload_param": {
            "device": "none"
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_scatter": true,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": true
}
```

ZeRO-3 partitions model parameters, gradients, and optimizer states across GPUs. For a 70B model, the bf16 parameters and gradients alone total ~280GB; sharded across 8 GPUs, that drops to ~35GB per GPU, with optimizer state for the trainable LoRA parameters adding comparatively little. This enables fine-tuning models that exceed the memory of any single GPU.
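That back-of-envelope can be made explicit. A rough sketch that counts only bf16 parameters and gradients (activations, LoRA optimizer state, and CUDA allocator overhead come on top):

```python
def zero3_per_gpu_gb(num_params: float, num_gpus: int, bytes_per_param: int = 2) -> float:
    """Rough per-GPU memory for parameters + gradients under ZeRO-3.

    ZeRO-3 shards both tensors across all ranks, so each GPU holds
    roughly (params + grads) / num_gpus. Activations and optimizer
    state for the trainable parameters are not counted here.
    """
    params_gb = num_params * bytes_per_param / 1e9
    grads_gb = num_params * bytes_per_param / 1e9
    return (params_gb + grads_gb) / num_gpus

# 70B model, single GPU: ~280 GB for bf16 weights + gradients alone
# 70B model sharded over 8 GPUs: ~35 GB per GPU
single = zero3_per_gpu_gb(70e9, num_gpus=1)
sharded = zero3_per_gpu_gb(70e9, num_gpus=8)
```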

Data Pipeline for Large Datasets

Streaming Data Processing

```python
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
from functools import partial

def create_training_dataset(
    data_path: str,
    tokenizer: AutoTokenizer,
    max_length: int = 2048,
    num_proc: int = 16,
) -> DatasetDict:
    dataset = load_dataset("json", data_files=data_path, split="train")

    def format_and_tokenize(examples, tokenizer, max_length):
        texts = []
        for instruction, input_text, output in zip(
            examples["instruction"],
            examples["input"],
            examples["output"],
        ):
            if input_text:
                prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
            else:
                prompt = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
            texts.append(prompt)

        tokenized = tokenizer(
            texts,
            truncation=True,
            max_length=max_length,
            padding=False,
        )

        # With padding=False there are no pad tokens to mask here -- and since
        # pad_token was set to eos_token, masking pad ids would also hide the
        # EOS the model must learn to emit. Copy input_ids and let the data
        # collator mask any padding it adds at batch time.
        tokenized["labels"] = [list(ids) for ids in tokenized["input_ids"]]
        return tokenized

    process_fn = partial(
        format_and_tokenize,
        tokenizer=tokenizer,
        max_length=max_length,
    )

    tokenized = dataset.map(
        process_fn,
        batched=True,
        num_proc=num_proc,
        remove_columns=dataset.column_names,
        desc="Tokenizing",
    )

    split = tokenized.train_test_split(test_size=0.05, seed=42)
    return DatasetDict({
        "train": split["train"],
        "eval": split["test"],
    })
```

At scale, tokenization becomes a bottleneck. Using num_proc=16 parallelizes tokenization across CPU cores. For datasets exceeding 100GB, use the streaming=True parameter and process data in chunks to avoid loading the entire dataset into memory.
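The chunked pattern behind streaming=True can be illustrated with the standard library alone — a minimal sketch (stream_jsonl is an illustrative name, not a datasets API) of the same idea: read lazily, process a batch, discard it, move on:

```python
import json
from typing import Iterator

def stream_jsonl(path: str, batch_size: int = 1000) -> Iterator[list]:
    """Yield batches of records without loading the whole file into memory.

    This mirrors what streaming=True does conceptually: the full dataset
    never materializes in RAM, only the current batch.
    """
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # flush the final partial batch
```

With Hugging Face datasets you would instead pass `streaming=True` to `load_dataset` and map your tokenizer over the resulting IterableDataset (note that `num_proc` does not apply in streaming mode).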

Hyperparameter Optimization at Scale

```python
import optuna
from optuna.integration import MLflowCallback

def objective(trial: optuna.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    lora_r = trial.suggest_categorical("lora_r", [8, 16, 32, 64])
    lora_alpha = trial.suggest_categorical("lora_alpha", [16, 32, 64, 128])
    warmup_ratio = trial.suggest_float("warmup_ratio", 0.03, 0.15)
    gradient_accumulation = trial.suggest_categorical(
        "gradient_accumulation_steps", [4, 8, 16]
    )

    effective_batch_size = 2 * gradient_accumulation * 8  # per_device * accum * gpus

    # NOTE: lora_r / lora_alpha must be threaded into the LoraConfig inside
    # setup_distributed_training for these suggestions to take effect.
    model, tokenizer, training_args = setup_distributed_training(
        base_model="meta-llama/Llama-3.1-8B",
        dataset_path="data/training.jsonl",
        output_dir=f"outputs/trial_{trial.number}",
    )

    training_args.learning_rate = learning_rate
    training_args.warmup_ratio = warmup_ratio
    training_args.gradient_accumulation_steps = gradient_accumulation

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,  # built by create_training_dataset above
        eval_dataset=eval_dataset,
    )

    trainer.train()
    eval_results = trainer.evaluate()

    return eval_results["eval_loss"]

study = optuna.create_study(
    direction="minimize",
    study_name="llm-finetune-hpo",
    storage="postgresql://user:pass@localhost/optuna",
    load_if_exists=True,
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
)

mlflow_callback = MLflowCallback(
    tracking_uri="http://mlflow.internal:5000",
    metric_name="eval_loss",
)

study.optimize(
    objective,
    n_trials=50,
    callbacks=[mlflow_callback],
)

print(f"Best trial: {study.best_trial.params}")
print(f"Best eval loss: {study.best_trial.value}")
```

Running 50 HPO trials on 8 GPUs each costs roughly $2,500-5,000 in compute. Optuna's median pruner terminates underperforming trials early, typically saving 40-60% of compute. For high-scale teams, this is the difference between daily and weekly iteration cycles.
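Note that the median pruner only acts when trials report intermediate values (via `trial.report(value, step)` and `trial.should_prune()` inside a training callback). The decision rule itself is simple; a minimal sketch of the comparison MedianPruner applies at each step, for a lower-is-better metric like eval loss:

```python
from statistics import median

def should_prune(current_value: float, completed_values_at_step: list) -> bool:
    """Median-pruning rule: stop a trial whose intermediate value is worse
    than the median of previous trials' values at the same step.

    Illustrative reimplementation of the decision only; real Optuna also
    applies n_startup_trials and n_warmup_steps before pruning anything.
    """
    if not completed_values_at_step:
        return False  # nothing to compare against yet
    return current_value > median(completed_values_at_step)
```

In practice, a Trainer callback would call `trial.report(eval_loss, state.global_step)` on each evaluation and raise `optuna.TrialPruned` when `trial.should_prune()` returns True.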

Training Monitoring and Alerting

```python
import math

import torch
import wandb
from transformers import TrainerCallback

class ScaleTrainingCallback(TrainerCallback):
    def __init__(self, alert_threshold_loss: float = 5.0):
        self.alert_threshold = alert_threshold_loss
        self.loss_history = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            current_loss = logs["loss"]
            self.loss_history.append(current_loss)

            if current_loss > self.alert_threshold:
                wandb.alert(
                    title="Training Loss Spike",
                    text=f"Loss {current_loss:.4f} exceeds threshold {self.alert_threshold}",
                    level=wandb.AlertLevel.WARN,
                )

            if len(self.loss_history) >= 50:
                recent_avg = sum(self.loss_history[-10:]) / 10
                earlier_avg = sum(self.loss_history[-50:-40]) / 10
                if recent_avg > earlier_avg * 1.1:
                    wandb.alert(
                        title="Training Divergence Detected",
                        text=f"Recent loss {recent_avg:.4f} is 10%+ higher than earlier {earlier_avg:.4f}",
                        level=wandb.AlertLevel.ERROR,
                    )
                    control.should_training_stop = True

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics:
            wandb.log({
                "eval/loss": metrics.get("eval_loss"),
                # Cross-entropy loss is in nats, so perplexity is exp(loss), not 2**loss.
                "eval/perplexity": math.exp(metrics.get("eval_loss", 0)),
                "training/gpu_memory_allocated": torch.cuda.memory_allocated() / 1e9,
                "training/gpu_memory_reserved": torch.cuda.memory_reserved() / 1e9,
            })
```

At high scale, unattended training runs must self-diagnose problems. Loss spikes, divergence, and GPU memory leaks should trigger alerts within minutes — not be discovered 8 hours later when the training budget is spent.


Efficient Model Merging and Serving

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def merge_and_export(
    base_model: str,
    adapter_path: str,
    output_path: str,
):
    """Merge LoRA adapter into base model for efficient serving."""
    config = PeftConfig.from_pretrained(adapter_path)

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="cpu",
    )
    model = PeftModel.from_pretrained(model, adapter_path)
    merged = model.merge_and_unload()

    merged.save_pretrained(output_path)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer.save_pretrained(output_path)

    return output_path

def convert_to_gguf(model_path: str, output_path: str, quantization: str = "q4_k_m"):
    """Convert to GGUF format for efficient CPU/edge inference.

    convert_hf_to_gguf.py only emits unquantized types (f32/f16/bf16/q8_0);
    k-quants such as q4_k_m require a second pass through llama-quantize.
    """
    import subprocess
    f16_path = output_path + ".f16.gguf"
    subprocess.run([
        "python", "llama.cpp/convert_hf_to_gguf.py",
        model_path,
        "--outtype", "f16",
        "--outfile", f16_path,
    ], check=True)
    subprocess.run([
        "llama.cpp/llama-quantize", f16_path, output_path, quantization,
    ], check=True)
```

Anti-Patterns to Avoid

Running single-GPU training when multi-GPU is available. At high scale, a training run on 1 GPU that takes 24 hours can run on 8 GPUs in 3-4 hours. The linear scaling isn't perfect due to communication overhead, but the time savings justify the parallel GPU cost.
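A typical multi-GPU launch for the setup above might look like this (train.py is a hypothetical entrypoint; adjust paths to your project):

```shell
# One process per GPU; HF Trainer picks up the DeepSpeed config passed
# through TrainingArguments(deepspeed=...).
torchrun --nproc_per_node=8 train.py

# Equivalent launch via the DeepSpeed launcher:
deepspeed --num_gpus=8 train.py
```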

Not using gradient checkpointing. For models above 7B parameters, gradient checkpointing trades 30% slower training for 60% less GPU memory. Without it, you need twice as many GPUs to train the same model.

Ignoring data quality for dataset size. At high scale, teams are tempted to autogenerate millions of training examples. A dataset of 50,000 expert-curated examples consistently outperforms 500,000 synthetic examples. Invest in data quality before data volume.

Fixed learning rate schedules. Cosine schedules with warmup adapt better than fixed or step-decay schedules across different dataset sizes and model architectures. The warmup phase prevents early training instability; the cosine decay prevents overfitting in later epochs.
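The shape that `lr_scheduler_type="cosine"` with `warmup_ratio` produces can be sketched in a few lines (an illustrative reimplementation, not the library's code):

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_ratio: float = 0.1) -> float:
    """Cosine schedule with linear warmup: ramp linearly to peak_lr over
    the warmup phase, then decay along a half cosine toward zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```

The warmup slope is what prevents the early-training instability mentioned above; the cosine tail shrinks step sizes exactly when overfitting risk is highest.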

Not checkpointing frequently enough. A GPU failure at hour 7 of an 8-hour training run without checkpoints means restarting from scratch. Save checkpoints every 200-500 steps and configure auto-resume from the latest checkpoint.
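Auto-resume needs a way to locate the newest checkpoint. transformers ships `get_last_checkpoint` for this, and the logic is simple enough to sketch (`latest_checkpoint` here is an illustrative stand-in):

```python
import os
import re
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Find the newest Trainer checkpoint (checkpoint-<step> directories).

    Pass the result to trainer.train(resume_from_checkpoint=...) to
    continue after a node failure instead of restarting from scratch.
    """
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, os.path.join(output_dir, name)
    return best_path
```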

Production Checklist

  • DeepSpeed ZeRO-3 configured for multi-GPU training
  • Flash Attention 2 enabled for memory-efficient attention
  • Gradient checkpointing enabled for large models
  • Data pipeline parallelized with num_proc matching CPU cores
  • HPO framework (Optuna) with pruning for efficient search
  • Training callbacks for loss monitoring and divergence detection
  • Checkpoint saves every 200-500 steps with auto-resume
  • W&B or MLflow tracking for all experiments
  • Model merging pipeline for LoRA adapter → full model
  • Quantization pipeline (GGUF, GPTQ, AWQ) for deployment variants
  • Cost tracking per experiment and per GPU-hour
  • Automated data quality validation before training start
  • GPU utilization monitoring (target >85% device utilization; ~40-50% MFU is a strong result for large models)
  • Dead GPU detection and automatic job restart
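GPU efficiency is often tracked as model FLOPs utilization (MFU): achieved training FLOPs as a fraction of hardware peak. A rough estimator using the standard ~6N FLOPs-per-token rule for forward + backward (the A100 peak figure in the example is an assumed datasheet value):

```python
def estimate_mfu(num_params: float, tokens_per_second: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU estimate: achieved FLOPs/s over aggregate hardware peak.

    Uses the ~6 * N FLOPs-per-token approximation for one forward +
    backward pass of an N-parameter transformer.
    """
    achieved = 6 * num_params * tokens_per_second
    return achieved / (num_gpus * peak_flops_per_gpu)

# Example: an 8B model at 26,000 tokens/s across 8 A100s (~312 TFLOPS bf16
# peak each, assumed) works out to roughly 50% MFU.
mfu = estimate_mfu(8e9, 26_000, num_gpus=8, peak_flops_per_gpu=312e12)
```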

Conclusion

High-scale LLM fine-tuning is as much about infrastructure efficiency as it is about model quality. The teams that iterate fastest — running 10+ experiments per day instead of 1 per week — consistently produce better models. This speed comes from distributed training infrastructure (DeepSpeed ZeRO-3, Flash Attention), automated hyperparameter search (Optuna with pruning), and robust monitoring that catches problems in minutes rather than hours.

The most expensive mistake at high scale is not a single failed training run — it's running the wrong experiment for too long. Invest in experiment tracking, early stopping, and HPO infrastructure to ensure every GPU-hour produces actionable signal about what makes your model better.
