Math Foundations for Machine Learning
This page turns abstract math into practical ML intuition. Instead of a formal, proof-first presentation, each section connects the concept to PyTorch code, model behavior, and common engineering decisions.
Linear Algebra
A vector is not just an array. In machine learning it represents a direction and magnitude inside feature space. Word embeddings, image features, and hidden states are all vectors.
Matrix multiplication is the language of neural networks. When you compute y = W @ x, you are applying a learned linear transformation that rotates, scales, and mixes information.
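As a quick check of that framing, here is a minimal sketch (the layer sizes are arbitrary, chosen only for illustration): a bias-free nn.Linear is literally y = W @ x.

import torch

layer = torch.nn.Linear(4, 3, bias=False)            # stores a learned 3x4 weight matrix W
x = torch.randn(4)
print(torch.allclose(layer(x), layer.weight @ x))    # True: the layer computes y = W @ x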
SVD and low-rank approximation matter because modern techniques like LoRA exploit the fact that useful updates often live in a low-dimensional subspace.
Key Equations
y = W @ x
score = Q @ Kᵀ / √d_k
A = U Σ Vᵀ
‖x‖₂ = √(xᵀx)
PyTorch Implementation
import torch

# scaled dot-product attention scores
q = torch.randn(8, 64)
k = torch.randn(8, 64)
scores = q @ k.T / q.shape[-1] ** 0.5

# rank-8 SVD approximation of a 32x64 matrix
A = torch.randn(32, 64)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
A_approx = (U[:, :8] * S[:8]) @ Vh[:8, :]
print(torch.linalg.norm(A - A_approx))
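To connect the low-rank idea to LoRA-style updates, here is a minimal sketch; the dimensions and rank are illustrative assumptions, not values from any particular model.

import torch

d_out, d_in, r = 64, 128, 8          # illustrative sizes, with rank r << min(d_out, d_in)
W = torch.randn(d_out, d_in)         # frozen pretrained weight
B = torch.zeros(d_out, r)            # trainable factor, initialized to zero
A = torch.randn(r, d_in) * 0.01      # trainable factor

x = torch.randn(16, d_in)
y = x @ (W + B @ A).T                # the update B @ A has rank at most r
print(y.shape)                       # torch.Size([16, 64])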
Without linear algebra, concepts like attention, embeddings, LoRA, PCA, and tensor shape reasoning all feel mysterious.
Differential Calculus
Derivatives measure how a small change in input affects the output. In ML that means how changing a parameter changes the loss.
The chain rule is the heart of backpropagation. Deep learning works because gradients can flow backward through a composed computation graph.
Curvature matters too. Even if you never compute a full Hessian, optimizer behavior makes more sense once you understand local geometry.
Key Equations
f'(x) = lim_{h→0} (f(x+h) − f(x)) / h
w ← w − α ∇L(w)
dL/dx = dL/dy · dy/dx
PyTorch Implementation
import torch

x = torch.tensor([3.0], requires_grad=True)
y = (2 * x + 1) ** 3
y.backward()
print(x.grad)   # 6 * (2x + 1)^2 = 294 at x = 3
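Curvature can be probed with the same machinery. Here is a minimal sketch of a second derivative for the toy function above (not something you would compute for a full model Hessian):

import torch

x = torch.tensor([3.0], requires_grad=True)
y = (2 * x + 1) ** 3
g, = torch.autograd.grad(y, x, create_graph=True)   # first derivative: 6 * (2x + 1)^2
h, = torch.autograd.grad(g, x)                      # second derivative: 24 * (2x + 1)
print(g.item(), h.item())                           # 294.0 168.0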
If you understand calculus, autograd stops feeling magical and training bugs become much easier to diagnose.
Probability & Statistics
Machine learning is full of uncertainty. Model outputs often represent probabilities, not certainties.
Cross-entropy is not just a library function. It is the negative log-likelihood objective behind classification.
Regularization, Bayesian intuition, sampling, and policy gradients all become easier once probability feels native.
Key Equations
L = -Σ y log p̂
θ* = argmax Σ log P(x|θ)
D_KL(P ‖ Q) = Σ P(x) log(P(x)/Q(x))
PyTorch Implementation
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
labels = torch.tensor([0])
loss = F.cross_entropy(logits, labels)   # negative log-likelihood of the true class
print(loss)
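To make the negative log-likelihood connection concrete, the same loss can be computed by hand; a minimal sketch reusing the logits above:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
labels = torch.tensor([0])
log_probs = F.log_softmax(logits, dim=-1)        # log p̂ for every class
nll = -log_probs[0, labels[0]]                   # pick out the true class
print(nll, F.cross_entropy(logits, labels))      # the two values match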
Probability unifies classification, uncertainty, regularization, generation, and reinforcement learning.
Information Theory
Entropy measures uncertainty. Low entropy means confident predictions; high entropy means the model is spread across many outcomes.
Cross-entropy links the true distribution and model distribution. That is why it is central in classification and language modeling.
Perplexity gives a readable measure of language-model uncertainty and predictive sharpness.
Key Equations
H(X) = -Σ p(x) log p(x)
H(p, q) = -Σ p(x) log q(x)
PP = exp(H)
PyTorch Implementation
import torch
def entropy(probs):
    return -(probs * torch.log(probs + 1e-8)).sum()
print(entropy(torch.tensor([0.25, 0.25, 0.25, 0.25])))
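Perplexity follows directly from the average cross-entropy. A minimal sketch, assuming a toy language-model setup with an illustrative vocabulary of 4 tokens:

import torch
import torch.nn.functional as F

logits = torch.randn(5, 4)                  # 5 positions, vocab size 4 (illustrative)
targets = torch.randint(0, 4, (5,))         # "true" next tokens
avg_nll = F.cross_entropy(logits, targets)  # mean cross-entropy in nats
print(torch.exp(avg_nll))                   # perplexity = exp(H)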
Information theory explains why cross-entropy works and how to interpret uncertainty in modern AI systems.
Optimization
Training a model means solving a high-dimensional optimization problem under noise, approximation, and hardware constraints.
Learning rate schedules, weight decay, momentum, and adaptive optimizers are not trivia. They shape whether training converges at all.
A lot of practical ML is optimization engineering disguised as architecture work.
Key Equations
w ← w − α g
v ← βv + g, w ← w − αv
AdamW: adaptive per-parameter step sizes with decoupled weight decay
PyTorch Implementation
import torch
model = torch.nn.Linear(128, 10)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
for step in range(10):
    loss = model(torch.randn(32, 128)).pow(2).mean()   # dummy loss on random inputs
    opt.zero_grad()
    loss.backward()
    opt.step()
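Learning-rate schedules slot into the same loop. Here is a minimal sketch with cosine annealing; the schedule length and hyperparameters are illustrative, not a recommendation:

import torch

model = torch.nn.Linear(128, 10)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)   # decay over 100 steps

for step in range(100):
    loss = model(torch.randn(32, 128)).pow(2).mean()   # dummy loss on random inputs
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                                       # advance the schedule once per step
print(opt.param_groups[0]["lr"])                       # learning rate has decayed toward zero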
Even a great architecture fails if optimization is unstable, under-tuned, or misunderstood.
Common Pitfalls
These mistakes do not produce immediate errors. They quietly create wrong results, weak intuition, or unstable training.
Memorising formulas without operational meaning
If you cannot tie a formula to tensors, losses, or model behavior, it will not help you debug real systems.
Ignoring shapes while learning math
A lot of practical understanding comes from knowing what dimensions each object carries in code.
Treating optimizers as black boxes
Many training failures are optimization failures, not architecture failures.
Separating math from implementation
The fastest way to really learn is to pair each concept with a short PyTorch experiment.
Apply this math in the mini-GPT project
The mini-GPT project uses attention (linear algebra), backprop through transformer blocks (calculus), cross-entropy training (probability + information theory), and AdamW with cosine decay (optimization). This page is the theory map behind that implementation.
View mini-GPT project →

BERT Fine-tuning
MLE, cross-entropy, AdamW with warmup, and representation learning in one workflow.
Explore project →

PPO MuJoCo Agent
Policy gradients, KL divergence, entropy bonus, and GAE in a real RL setting.
Explore project →