Training and Inference Optimization
This page maps the core engineering stack behind modern model training and inference systems. The goal is not buzzword familiarity. The goal is operational judgment.
PyTorch
PyTorch is still the default execution layer for modern model training. You need to understand tensors, autograd, nn.Module composition, dataloaders, checkpointing, and distributed wrappers. Even when higher-level frameworks appear, serious debugging still drops you back into raw PyTorch.
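A minimal sketch of that raw loop, with a placeholder model and a synthetic batch standing in for a real dataloader:

```python
import torch
from torch import nn

# Illustrative only: module composition, autograd, optimizer step, checkpointing.
model = nn.Sequential(nn.Linear(512, 1024), nn.GELU(), nn.Linear(1024, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 512)             # stand-in for a dataloader batch
y = torch.randint(0, 10, (32,))

logits = model(x)                     # forward pass builds the autograd graph
loss = loss_fn(logits, y)
loss.backward()                       # autograd populates .grad on each parameter
optimizer.step()
optimizer.zero_grad(set_to_none=True)

# Checkpointing: save model and optimizer state so training can resume.
torch.save({"model": model.state_dict(), "optim": optimizer.state_dict()}, "ckpt.pt")
```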
Distributed training
Once models or datasets grow, single-device training stops being enough. You need to understand data parallelism, tensor parallelism, pipeline parallelism, gradient synchronization cost, checkpoint sharding, and fault recovery. Distributed training is not just “more GPUs”; it is systems engineering under bandwidth constraints.
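A hedged sketch of the simplest case, data parallelism with DistributedDataParallel, assuming a `torchrun` launch that sets the rank environment variables:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(512, 512).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced in backward

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 512, device=local_rank)     # each rank sees its own data shard
    loss = model(x).pow(2).mean()
    loss.backward()                                # overlaps gradient all-reduce with compute
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`. Tensor and pipeline parallelism add model-level sharding on top of this and bring their own communication patterns.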
Mixed precision
Mixed precision cuts memory usage and increases throughput, but only if you understand numerical stability. You should know fp16, bf16, loss scaling, gradient overflow, accumulation precision, and when fp32 fallback is necessary.
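A sketch of the fp16 path with autocast and loss scaling; with bf16 the scaler is usually unnecessary, since its wider exponent range avoids most overflow problems:

```python
import torch
from torch import nn

model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 512, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).pow(2).mean()      # matmuls run in fp16, reductions stay in fp32

scaler.scale(loss).backward()          # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)                 # unscales grads, skips the step on overflow
scaler.update()                        # adjusts the scale factor for the next step
optimizer.zero_grad(set_to_none=True)
```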
LoRA / PEFT / fine-tuning
Most real teams do not pretrain from scratch; they adapt existing models. You need to understand full fine-tuning versus PEFT, rank selection, target modules, optimizer state cost, and why a base model that is mismatched to the task can make LoRA look worse than it really is.
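A toy LoRA wrapper, not the `peft` library API, just an illustration of the low-rank idea and of why the rank controls trainable-parameter cost:

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update B @ A scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the pretrained layer
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")   # ~65K trainable vs ~16.8M frozen
```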
Inference optimization
Inference is where product reality hits model ambition. Throughput, latency, prompt length, batching strategy, quantization, memory reuse, and cache policy all matter. Training gets headlines, but inference cost determines whether a product survives.
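A back-of-the-envelope example of why decode throughput at small batch sizes is usually memory-bandwidth bound; the parameter count, weight precision, and bandwidth figure below are assumptions:

```python
# Every generated token must read all model weights from HBM.
params = 7e9                 # assumed 7B-parameter model
bytes_per_param = 1          # int8 weight-only quantization
hbm_bandwidth = 2.0e12       # bytes/s, rough figure for a modern data-center GPU

weight_bytes_per_token = params * bytes_per_param
max_tokens_per_sec = hbm_bandwidth / weight_bytes_per_token
print(f"upper bound: ~{max_tokens_per_sec:.0f} tokens/s per sequence")   # ~286

# Batching amortizes the weight reads across requests, which is why throughput
# scales with batch size until the KV cache or compute becomes the limit.
```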
vLLM / tensor parallel / KV cache
Modern serving stacks like vLLM exist because naive autoregressive inference wastes memory and leaves GPUs underutilized. You need to understand paged attention, tensor-parallel layout, KV cache growth, and why long-context serving can become a memory bottleneck before it becomes a compute bottleneck.
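A rough KV-cache sizing exercise, assuming a Llama-2-7B-like shape; the point is that cache growth, not FLOPs, is often what caps batch size at long context:

```python
# Per token, each layer stores one K and one V vector per KV head.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_elem = 2           # fp16 / bf16

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")                  # ~512 KiB

batch, seq_len = 32, 4096
total_gib = batch * seq_len * kv_bytes_per_token / 1024**3
print(f"{total_gib:.1f} GiB of KV cache at batch={batch}, seq={seq_len}")  # ~64 GiB
```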
Batching and serving
Serving is not just “host the model”. You need dynamic batching, queue control, timeout strategy, concurrency limits, admission control, and fallback behavior. A good serving system protects latency while preserving GPU utilization.
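A sketch of the dynamic-batching core, with `run_model` as a placeholder for the actual inference call; real systems add admission control, queue-depth caps, and fallbacks on top:

```python
import asyncio

MAX_BATCH = 16       # cap batch size to protect latency
MAX_WAIT_S = 0.01    # how long a request may wait for batch-mates

queue: asyncio.Queue = asyncio.Queue()

async def handle_request(prompt: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def batch_worker(run_model):
    while True:
        prompt, fut = await queue.get()
        batch = [(prompt, fut)]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        outputs = run_model([p for p, _ in batch])   # one batched forward pass
        for (_, f), out in zip(batch, outputs):
            f.set_result(out)
```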
Profiling
If you do not profile, you are guessing. You should be able to inspect kernel time, dataloader stalls, host-device transfer overhead, memory peaks, layer hotspots, and token-per-second throughput. Profiling converts hand-wavy intuition into actual optimization work.
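A sketch of profiling a single step with `torch.profiler`; the model and batch are placeholders:

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

model = nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,          # track allocations alongside kernel time
    record_shapes=True,
) as prof:
    loss = model(x).pow(2).mean()
    loss.backward()

# Top operators by GPU time; export_chrome_trace("trace.json") gives a timeline view.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```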
GPU memory management
Modern AI engineering is often memory engineering. You need to reason about activations, optimizer states, gradients, checkpoints, KV cache, fragmentation, sequence length, micro-batches, and memory offloading. A lot of “model scaling” is just learning how not to run out of VRAM.
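A quick budget for where training memory goes, using the common rule of thumb of roughly 16 bytes per parameter for Adam with mixed precision; the 7B figure is an assumption:

```python
# bf16 weights (2) + bf16 grads (2) + fp32 master weights (4) + Adam m and v (8)
params = 7e9
states_gib = params * (2 + 2 + 4 + 8) / 1024**3
print(f"~{states_gib:.0f} GiB for weights, grads, and optimizer state")   # ~104 GiB

# Activations come on top and scale with batch size, sequence length, and depth,
# which is why gradient checkpointing, micro-batching, and offloading exist.
```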