Hardware Guide
Machine-specific setup, tuning, and gotchas for every GPU in our curriculum. Stop wasting hours on environment errors — start from a known-working baseline.
Jump to: Comparison · M4 Pro · RTX 4090 · A100 · 8× L40S · Mixed Precision · Memory Opt · Profiling
Machine Comparison
| Machine | VRAM | Bandwidth | FP16 TFLOPS | BF16 | Best Use Case |
|---|---|---|---|---|---|
| Mac M4 Pro 128GB | 128 GB (unified) | 273 GB/s | ~14.7 TFLOPS | ✅ (via MPS) | Prototyping, fine-tuning ≤13B models |
| RTX 4090 24GB | 24 GB GDDR6X | 1,008 GB/s | 82.6 TFLOPS | ✅ | Local training, RL, fine-tuning ≤7B |
| A100 80GB | 80 GB HBM2e | 2,000 GB/s | 312 TFLOPS | ✅ (native, fast) | LLM training/fine-tuning, long-context |
| 8× L40S (cluster) | 48 GB × 8 = 384 GB | 864 GB/s × 8 | 91.6 × 8 TFLOPS | ✅ | Large-scale multi-GPU training, FSDP |
* M4 Pro TFLOPS are for the Neural Engine; MPS compute is lower but memory bandwidth advantage is significant. L40S uses PCIe interconnect (not NVLink) — see multi-GPU section for implications.
Mac M4 Pro 128GB
Apple Silicon · MPS backend · Unified memory architecture
Why unified memory matters
Unlike discrete GPUs where VRAM is separate from system RAM, the M4 Pro shares 128 GB between CPU and GPU at 273 GB/s. A 70B model in 4-bit (≈35 GB) fits on one machine, and because there is no PCIe hop between host and device, the GPU never stalls waiting for data transfers.
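A quick way to see this in practice is to allocate more than any consumer GPU's VRAM and check what the Metal driver reports. A minimal sketch; the 30 GB figure is arbitrary and assumes no other heavy memory users are running:
import torch
device = torch.device("mps")
# ~30 GB of fp16 zeros: larger than any consumer discrete GPU's VRAM,
# but a comfortable slice of the 128 GB unified pool
x = torch.zeros(30 * 1024**3 // 2, dtype=torch.float16, device=device)
print(f"Driver-allocated: {torch.mps.driver_allocated_memory() / 1e9:.1f} GB")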
Environment Setup
# 1. Install Homebrew (if not present)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# 2. Install miniforge (Apple-native conda — do NOT use Anaconda on M-series)
brew install miniforge
conda init zsh  # or bash

# 3. Create environment
conda create -n ml python=3.11 -y
conda activate ml

# 4. Install PyTorch (stable 2.x supports MPS; the nightly index has the newest MPS ops)
pip install torch torchvision torchaudio

# 5. Verify MPS is available
python -c "import torch; print(torch.backends.mps.is_available())"  # → True
PyTorch Configuration
import torch
# Detect and select device
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
# Move model to MPS
model = MyModel().to(device)
# Move tensors to MPS
x = torch.randn(32, 3, 224, 224).to(device)
y = model(x)  # runs on Metal Performance Shaders
Workloads That Fit
Prototyping and fine-tuning of models up to ~13B are comfortable; in 4-bit, even a 70B model (≈35 GB) fits for inference thanks to the 128 GB unified pool.
Performance Tips
# Use float32 or float16 — float64 not supported on MPS
model = model.to(torch.float16) # halves memory, usually fine
# Set env var to enable MPS memory growth (avoid OOM on large batches)
import os
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0" # 0.0 = unlimited
# For inference, use torch.no_grad() to skip graph construction
with torch.no_grad():
    output = model(input_tensor)
# MLX is often faster than PyTorch MPS for inference-only workloads
# pip install mlx
# Use mlx.core instead of torch for pure inference pipelines (minimal sketch below)
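A minimal MLX sketch for comparison, assuming mlx is installed via pip; MLX arrays live in unified memory and evaluate lazily until mx.eval is called, and the matmul sizes here are arbitrary:
import mlx.core as mx
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))
c = a @ b      # builds a lazy compute graph; nothing runs yet
mx.eval(c)     # forces evaluation on the GPU
print(c.shape) # 4096 x 4096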
⚠️ Known Gotchas
- ▸ No float64: Operations requiring float64 silently fall back to CPU. Use model.double() only when CPU fallback is acceptable.
- ▸ Partial op support: Some complex ops (certain scatter operations, some sparse tensor ops) fall back to CPU mid-forward-pass. Profile to detect.
- ▸ No torch.compile (MPS): As of PyTorch 2.3, torch.compile does not support the MPS backend. Skip it.
- ▸ AMP with MPS: Use torch.autocast("mps", dtype=torch.float16) — not the CUDA variant.
- ▸ Memory pressure: MPS shares memory with the OS. Close Chrome/other apps when training large models — you're competing for the same pool.
RTX 4090 24GB
Ada Lovelace · CUDA 12.x · Best local GPU for the money
Environment Setup
# Ubuntu 22.04 recommended for RTX 4090
# 1. Install CUDA Toolkit 12.1+ (match your driver version)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update && sudo apt-get -y install cuda-12-1

# Verify
nvcc --version   # should show 12.1+
nvidia-smi       # should show RTX 4090 with driver 530+

# 2. Create environment
conda create -n ml python=3.11 -y && conda activate ml

# 3. Install PyTorch with CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install Flash Attention 2 (significant speedup for transformers)
pip install flash-attn --no-build-isolation
# Takes ~10 minutes to compile — this is normal

# Verify CUDA
python -c "import torch; print(torch.cuda.get_device_name(0))"
# → NVIDIA GeForce RTX 4090
PyTorch Configuration
import torch
from torch.cuda.amp import autocast, GradScaler
device = torch.device("cuda")
model = MyModel().to(device)
# torch.compile with max-autotune (20-40% speedup after warmup)
# First 2-3 batches are slow (compilation) — this is normal
model = torch.compile(model, mode="max-autotune")
# AMP training loop (fp16 on RTX 4090 — NOT bf16, use float16 for Ada)
scaler = GradScaler()
for batch_idx, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with autocast(dtype=torch.float16):  # fp16 forward
        logits = model(x)
        loss = criterion(logits, y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
VRAM Budget Planning
# Rule of thumb for transformer fine-tuning on RTX 4090 (24 GB):
# Model params × 4 bytes (fp32 weights)
# + Model params × 4 bytes (fp32 gradients)
# + Model params × 8 bytes (Adam states: m + v)
# + Activations (batch-size-dependent)
#
# Example: 7B model in fp16 full fine-tune
# Weights: 7e9 × 2 = 14 GB (fp16)
# Gradients: 7e9 × 2 = 14 GB → doesn't fit!
#
# Solution 1: LoRA / QLoRA (only tune ~1% of params)
pip install bitsandbytes peft
# 7B in 4-bit + LoRA ≈ 6 GB — fits easily
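# A hedged QLoRA sketch with peft (adapter hyperparameters and target modules are
# illustrative, not tuned; any HF causal LM works in place of the Llama-2 example):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the 7B params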
# Solution 2: Gradient checkpointing (trades compute for memory)
model.gradient_checkpointing_enable()
# Recomputes activations during backward; ~30% slower but ~60% less VRAM
# Solution 3: Gradient accumulation (simulate larger batches)
accum_steps = 8 # effective batch = batch_size × 8
for i, (x, y) in enumerate(loader):
    with autocast(dtype=torch.float16):
        loss = model(x, labels=y).loss / accum_steps
    scaler.scale(loss).backward()
    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
Performance Tips
# Enable TF32 (default on Ampere+, verify it's on)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# Use Flash Attention 2 in your transformer
from flash_attn import flash_attn_qkvpacked_func
# Or if using HuggingFace:
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
attn_implementation="flash_attention_2",
torch_dtype=torch.float16,
device_map="auto"
)
# Increase DataLoader workers (saturate GPU)
loader = DataLoader(dataset, num_workers=8, pin_memory=True, prefetch_factor=2)
# Monitor VRAM in real-time
watch -n 0.5 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv
⚠️ Known Gotchas
- ▸ Use float16, NOT bfloat16: Ada Lovelace supports both, but fp16 tensor cores are faster. bfloat16 is preferred on A100/H100.
- ▸ torch.compile warmup: First few forward passes are slow (compilation). Don't benchmark before step ~5.
- ▸ Flash Attention requires contiguous memory: Ensure Q/K/V tensors are contiguous before passing to flash_attn.
- ▸ CUDA OOM is not always VRAM: Sometimes it's fragmentation. Try torch.cuda.empty_cache() between runs in a notebook, or tune the caching allocator (see the sketch below).
- ▸ nvcc mismatch: PyTorch CUDA version must match the installed toolkit for custom CUDA extensions (like flash-attn).
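For the fragmentation case, the CUDA caching allocator can be tuned before launch; a hedged sketch, where the values are illustrative starting points rather than recommendations:
# Shell, before launching training
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Or in Python, before the first CUDA allocation
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"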
A100 80GB (Cloud)
Ampere · HBM2e · Native BF16 · Lambda Labs / RunPod
A100 vs H100 — when does it matter?
H100 SXM is 3× faster for FP8 training and has NVLink 4.0 (900 GB/s vs 600 GB/s). But at ~2× the price, A100 80G is the workhorse for most LLM fine-tuning jobs. Choose H100 only when you're training 70B+ from scratch or need maximum throughput. For fine-tuning <70B, A100 80G is usually the right call economically.
Cloud Setup (Lambda Labs)
# Lambda Labs: https://lambdalabs.com/service/gpu-cloud
# Select: A100 80G SXM — ~$1.10/hr on-demand as of 2025

# SSH into instance
ssh ubuntu@<your-instance-ip>

# Lambda images come with CUDA 12.x and PyTorch pre-installed. Verify:
python -c "import torch; print(torch.__version__, torch.cuda.get_device_name(0))"
# → 2.x.x NVIDIA A100-SXM4-80GB

# If starting fresh:
conda create -n ml python=3.11 -y && conda activate ml
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn --no-build-isolation  # 5-10 min compile

# RunPod alternative: use the "RunPod PyTorch" template
# It ships with torch pre-compiled, saving 10-15 minutes on launch
PyTorch Configuration — Use BF16, Not FP16
import torch
device = torch.device("cuda")
model = MyModel().to(device)
# A100 has native bfloat16 hardware support
# bfloat16 = same dynamic range as float32 (8-bit exponent), lower precision (7-bit mantissa)
# This means: no GradScaler needed! bfloat16 doesn't overflow like fp16.
# BF16 training loop — simpler than FP16 AMP
model = model.to(torch.bfloat16)
for x, y in loader:
    x = x.to(device, dtype=torch.bfloat16)
    y = y.to(device)
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(x, labels=y).loss
    loss.backward()  # No GradScaler needed with bfloat16
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
# TF32 is enabled by default on A100 — verify
print(torch.backends.cuda.matmul.allow_tf32) # → True
# TF32 uses a 10-bit mantissa with the full 8-bit exponent for matmuls; a large speedup over full FP32 with negligible accuracy loss
Workload Sizing (80GB VRAM)
# What fits in 80GB for fine-tuning:
#
# Full fine-tune (bf16):
# 13B model → ~104 GB (needs ZeRO-3, FSDP, or offloading — too large for a single 80 GB card)
# 7B model → ~56 GB ✅ fits with 24GB headroom for activations
#
# QLoRA / 4-bit quantized:
# 70B model → ~35 GB ✅ fits!
# Llama 3 70B + bitsandbytes 4-bit + LoRA → typical: 42-48 GB total
#
# Batch size rule of thumb for 7B fine-tune:
# Activations per token ≈ 2 × n_layers × hidden_dim × 2 bytes (bf16)
# For Llama-7B: 2 × 32 × 4096 × 2 ≈ 0.5 MB/token
# Sequence length 2048, batch 8 → 8 GB activations
# → Comfortable at batch=8, seq=2048 with gradient checkpointing OFF
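# The rule of thumb above as a tiny helper (illustrative only; real activation memory
# also depends on the attention implementation and whether checkpointing is enabled):
def estimate_activation_gib(n_layers, hidden_dim, seq_len, batch, bytes_per_el=2):
    per_token = 2 * n_layers * hidden_dim * bytes_per_el  # bytes per token
    return per_token * seq_len * batch / 2**30
print(estimate_activation_gib(32, 4096, 2048, 8))  # → 8.0, the ~8 GB figure above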
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-70b-hf",
quantization_config=bnb_config,
device_map="auto"
)
⚠️ Known Gotchas
- ▸ Don't use FP16 GradScaler on A100: BF16 doesn't overflow so the scaler is unnecessary and adds complexity.
- ▸ Lambda Labs billing: You're billed from instance start, not when you SSH in. Terminate (don't just stop) instances you're not using.
- ▸ Persistent storage is separate: Instance disk is ephemeral on Lambda. Use Persistent Storage ($0.20/GB/mo) for checkpoints.
- ▸ A100 PCIe vs SXM: A100-PCIe cards communicate over PCIe 4.0 (~64 GB/s) unless NVLink bridges are installed; A100-SXM has 600 GB/s NVLink. For multi-GPU, always prefer SXM variants — significant impact on gradient sync.
8× L40S Multi-GPU Cluster
Ada Lovelace · 48GB × 8 · PCIe interconnect · FSDP/DDP
⚠️ Critical: L40S Uses PCIe, Not NVLink
Unlike A100/H100 SXM which use NVLink (600–900 GB/s GPU-to-GPU), the L40S uses PCIe 4.0 for inter-GPU communication (~64 GB/s bidirectional). This is ~10× less bandwidth. For large gradient tensors, this means all-reduce can become the bottleneck. Design your training to minimise communication (FSDP, gradient accumulation, larger micro-batches).
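To check whether all-reduce really is the bottleneck on a given box, a small micro-benchmark sketch (the file name and the 1 GiB tensor size are arbitrary; launch with torchrun):
# bench_allreduce.py
import os, time
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
x = torch.randn(256 * 1024**2, device=f"cuda:{local_rank}")  # 1 GiB of fp32
torch.cuda.synchronize()
start = time.time()
for _ in range(10):
    dist.all_reduce(x)  # sums the tensor across all GPUs
torch.cuda.synchronize()
if dist.get_rank() == 0:
    print(f"all_reduce of 1 GiB: {(time.time() - start) / 10 * 1000:.1f} ms/iter")
dist.destroy_process_group()
# Launch: torchrun --nproc_per_node=8 bench_allreduce.py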
DDP vs FSDP Decision Guide
| | DDP | FSDP |
|---|---|---|
| Model fits on 1 GPU? | ✅ Use DDP | ✅ Still works, but adds overhead |
| Model > 1 GPU VRAM? | ❌ OOM | ✅ Required (shards params across GPUs) |
| Communication cost | Lower (only gradients) | Higher (params + grads + optim states) |
| Code complexity | Simple — 5 lines | More config, but HuggingFace Trainer handles it |
| Best for L40S PCIe? | Yes, when model fits | Yes, with reduce-scatter to minimize comms |
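As the table notes, the HuggingFace Trainer can drive FSDP from configuration alone. A minimal sketch, assuming a recent transformers release (the exact fsdp_config keys vary slightly across versions; the model, dataset, and hyperparameters are placeholders):
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,
    fsdp="full_shard auto_wrap",  # shard params, grads, and optimizer states
    fsdp_config={"transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]},
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset: your Dataset
trainer.train()
# Launch: torchrun --nproc_per_node=8 train.py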
DDP Setup
# Launch with torchrun (replaces torch.distributed.launch)
torchrun --nproc_per_node=8 --nnodes=1 train.py
# train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
dist.init_process_group(backend="nccl") # NCCL is fastest for GPU-to-GPU
rank = dist.get_rank() # 0–7
local_rank = int(os.environ["LOCAL_RANK"])
world_size = dist.get_world_size() # 8
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")
model = MyModel().to(device)
model = DDP(model, device_ids=[local_rank])
# DistributedSampler ensures each GPU sees different data
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, sampler=sampler, batch_size=32, num_workers=4)
# Don't forget to set sampler epoch each epoch
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # ensures shuffling differs per epoch
    for x, y in loader:
        ...
dist.destroy_process_group()
FSDP Setup (for 70B+ models)
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.fully_sharded_data_parallel import (
CPUOffload, BackwardPrefetch, MixedPrecision
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
# FSDP shards parameters, gradients, and optimizer states across GPUs
# With 8× L40S (48GB each = 384GB total), you can fit a 70B model:
# 70B × 2 bytes (bf16) = 140 GB
# + optimizer states (AdamW): 70B × 8 bytes = 560 GB
# → Too large for one GPU, FSDP distributes it
fsdp_config = dict(
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    ),
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # overlap comms with compute
    cpu_offload=CPUOffload(offload_params=False),     # True if OOM persists
)
model = FSDP(model, **fsdp_config)
NCCL Tuning for PCIe Topology
# L40S on PCIe: tune NCCL to avoid redundant copies
export NCCL_P2P_DISABLE=0         # keep GPU peer-to-peer over PCIe enabled
export NCCL_IB_DISABLE=1          # no InfiniBand on typical cloud nodes
export NCCL_SOCKET_IFNAME=eth0    # use primary network interface
export NCCL_DEBUG=WARN            # INFO is verbose; use only for debugging
export NCCL_ALGO=Ring             # Ring all-reduce suits PCIe topology
export NCCL_BUFFSIZE=16777216     # 16MB NCCL buffer (default is 4MB)

# Per-GPU memory budget math for 8× L40S:
# Total pool: 8 × 48 GB = 384 GB
# For 70B model in bf16:
#   Parameters: 70B × 2 = 140 GB → 140/8 = 17.5 GB/GPU
#   Gradients:  70B × 2 = 140 GB → 17.5 GB/GPU
#   Adam m,v:   70B × 8 = 560 GB → 70 GB/GPU ← bottleneck
#   Total: 105 GB/GPU → DOES NOT FIT with standard Adam
# → Use 8-bit Adam (bitsandbytes) or Adafactor (factorized second-moment state)
pip install bitsandbytes

import bitsandbytes as bnb
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)
⚠️ Known Gotchas
- ▸ L40S ≠ NVLink: Never assume NVLink on L40S. PCIe means all-reduce is ~10× slower per-byte. Gradient accumulation (16+ steps) is your friend.
- ▸ Gradient sync overhead: With large models, gradient all-reduce can take 30-50% of step time. Use the model.no_sync() context during accumulation steps.
- ▸ FSDP checkpoint saving: Must use the FULL_STATE_DICT policy to save a consolidated checkpoint — otherwise you get 8 shards (see the sketch after this list).
- ▸ torchrun vs old launch: Always use torchrun. The old -m torch.distributed.launch is deprecated since PyTorch 1.9.
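A minimal sketch of consolidated checkpoint saving, assuming the FSDP-wrapped model and the torch/dist imports from the setup above; the output path is a placeholder:
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)
# Gather full, unsharded parameters on rank 0; offload to CPU to avoid a VRAM spike
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    cpu_state = model.state_dict()
if dist.get_rank() == 0:
    torch.save(cpu_state, "checkpoint_consolidated.pt")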
Mixed Precision Guide
Decision Tree: Which Precision?
1. On A100/H100? → Use bfloat16. No GradScaler, better dynamic range, same speed.
2. On RTX 4090 / consumer GPU? → Use float16 with GradScaler. Better tensor core utilisation than bf16 on Ada.
3. On M4 Pro MPS? → Use float16 with torch.autocast("mps"). No GradScaler needed for most workloads.
4. Training instability / NaN loss? → First try GradScaler if using fp16. If still unstable, switch to bf16 — its wider exponent range is more numerically stable.
5. Inference only? → Use torch.no_grad() + fp16 or bf16. No scaler needed ever for inference.
# FP16 training (RTX 4090, consumer GPUs)
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for x, y in loader:
    optimizer.zero_grad()
    with autocast(dtype=torch.float16):
        loss = model(x, labels=y).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
# ---
# BF16 training (A100, H100 — simpler, no scaler)
for x, y in loader:
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(x, labels=y).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
Memory Optimization Techniques
Gradient Checkpointing
Instead of storing all activations during the forward pass (needed for backward), recompute them on-the-fly during backward. Reduces activation memory by ~60% at the cost of ~33% longer backward pass (one extra forward per layer).
# HuggingFace models: one-line toggle (plain nn.Modules use torch.utils.checkpoint, see below)
model.gradient_checkpointing_enable()
# When loading with from_pretrained, also disable the KV-cache:
model = AutoModelForCausalLM.from_pretrained(
"...", use_cache=False # must disable KV-cache when checkpointing
)
model.gradient_checkpointing_enable()
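# For a plain nn.Module (no HuggingFace), wrap expensive sub-modules with
# torch.utils.checkpoint yourself. A minimal sketch; block is a hypothetical sub-module:
import torch
import torch.utils.checkpoint as cp
class CheckpointedNet(torch.nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block
    def forward(self, x):
        # activations inside self.block are recomputed during the backward pass
        return cp.checkpoint(self.block, x, use_reentrant=False)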
# Verify memory savings:
# Without: 7B model training ≈ 80 GB (activations dominate at long seqlens)
# With: 7B model training ≈ 32 GB (activations recomputed per layer)
Gradient Accumulation Math
Simulate a larger effective batch without materialising all gradients at once.
# Effective batch size = micro_batch_size × accumulation_steps × num_gpus
# Example: micro_batch=4, accum=8, gpus=8 → effective batch = 256
accum_steps = 8
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Scale loss to average across accumulation steps
    with autocast(dtype=torch.bfloat16):
        loss = model(x, labels=y).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
# With DDP: use no_sync() to skip gradient sync on intermediate steps
import contextlib
for step, (x, y) in enumerate(loader):
    is_sync_step = (step + 1) % accum_steps == 0
    ctx = contextlib.nullcontext() if is_sync_step else model.no_sync()
    with ctx:
        loss = model(x, labels=y).loss / accum_steps
        loss.backward()
CPU Offloading
# ZeRO-3 with CPU offload via DeepSpeed
# Moves optimizer states + parameters to CPU RAM when not in use
# Enables fitting models that don't fit in GPU VRAM at all
pip install deepspeed
# ds_config.json
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": { "device": "cpu", "pin_memory": true },
"offload_param": { "device": "cpu", "pin_memory": true }
},
"bf16": { "enabled": true }
}
# 70B model on single A100 80GB with ZeRO-3 + CPU offload is feasible
# (slow, but possible for inference/small batch fine-tuning)
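One common way to wire this config in is through the HuggingFace Trainer; a hedged sketch in which model, train_dataset, and the config path are placeholders:
from transformers import Trainer, TrainingArguments
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="ds_config.json",  # the ZeRO-3 + CPU-offload config above
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
# Launch: deepspeed --num_gpus=1 train.py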
Profiling Tools
torch.profiler
Built-in profiler. Generates Chrome trace JSON for kernel-level analysis.
from torch.profiler import profile, record_function, ProfilerActivity
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/profiler'),
record_shapes=True, with_stack=True
) as prof:
    for step, (x, y) in enumerate(loader):
        with record_function("forward"):
            loss = model(x, labels=y).loss
        with record_function("backward"):
            loss.backward()
        if step >= 5: break  # profile a few steps, not the whole run
# Open Chrome → chrome://tracing → load ./log/profiler/*.json
nvidia-smi
Quick VRAM and GPU utilisation monitoring.
# Continuous monitoring every 0.5s
watch -n 0.5 nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.free,temperature.gpu --format=csv,noheader

# Log to file for post-training analysis
nvidia-smi dmon -s u -d 1 > gpu_stats.txt
torch.cuda.memory_summary()
Detailed breakdown of CUDA memory allocation.
# After an OOM or suspicious memory usage:
print(torch.cuda.memory_summary(device=0))
# Shows: allocated, reserved, peak, fragmentation info
# Track peak memory
torch.cuda.reset_peak_memory_stats()
# ... run your training step ...
peak = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak VRAM: {peak:.2f} GB")Weights & BiasesLog system metrics alongside training metrics automatically.
import wandb
wandb.init(project="my-training-run")
# wandb.init() already streams system/GPU metrics in the background; log training metrics explicitly
wandb.log({
"loss": loss.item(),
"gpu_memory_gb": torch.cuda.memory_allocated() / 1e9,
"learning_rate": scheduler.get_last_lr()[0],
}, step=global_step)
Cloud Provider Comparison
| Provider | GPUs Available | On-Demand | Spot | Notes |
|---|---|---|---|---|
| Lambda Labs | A100, H100, A10 | $1.10–$2.49/hr | N/A | Cheapest on-demand A100, simple UI, persistent storage |
| RunPod | A100, H100, L40S | $1.49–$3.49/hr | $0.79–$1.49/hr | Good UI, templates, community cloud |
| Vast.ai | A100, 4090, L40S | Varies | $0.40–$1.20/hr | Cheapest spot, P2P marketplace, verify host reliability |
| CoreWeave | A100, H100, L40S | $2.06–$4.25/hr | N/A | Kubernetes-native, scales to 1000s of GPUs, enterprise SLAs |
| Google Colab | T4, A100 (Pro+) | Free / $12/mo | N/A | Great for notebooks, session limits, no persistent env |
Prices as of 2025 — verify current rates before committing to long runs. Always set a budget alert. An 8×A100 left running overnight is ~$200.