🧮 Math Foundations for Machine Learning

This page turns abstract math into practical ML intuition. Rather than leading with formal proofs, each section connects the concepts to PyTorch code, model behavior, and common engineering decisions.

📐Foundation

Linear Algebra

A vector is not just an array. In machine learning it represents a direction and magnitude inside feature space. Word embeddings, image features, and hidden states are all vectors.
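
To make that concrete, here is a minimal sketch (with made-up 4-dimensional "embeddings") showing that similarity between vectors is just a normalized dot product:

import torch

a = torch.tensor([1.0, 0.0, 2.0, 1.0])   # hypothetical embedding vector
b = torch.tensor([0.5, 0.1, 1.8, 0.9])   # another hypothetical embedding

# The dot product measures alignment; dividing by the norms gives cosine similarity in [-1, 1]
cos_sim = (a @ b) / (a.norm() * b.norm())
print(cos_sim)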

Matrix multiplication is the language of neural networks. When you compute y = W @ x, you are applying a learned linear transformation that rotates, scales, and mixes information.

SVD and low-rank approximation matter because modern techniques like LoRA exploit the fact that useful updates often live in a low-dimensional subspace.
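
A minimal sketch of that idea with toy shapes (not a real LoRA implementation): a rank-r update B @ A has the same shape as the full weight matrix but far fewer parameters.

import torch

d_out, d_in, r = 512, 512, 8      # toy dimensions; r is the assumed low rank
W = torch.randn(d_out, d_in)      # frozen base weight
A = torch.randn(r, d_in) * 0.01   # small trainable factor
B = torch.zeros(d_out, r)         # second trainable factor, initialized to zero

delta = B @ A                     # rank-r update, same shape as W
W_adapted = W + delta

print(W.numel(), A.numel() + B.numel())   # 262144 parameters in W vs 8192 in the low-rank factors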

Key Equations

Matrix-vector product
y = W @ x
Attention score
score = Q @ Kᵀ / √d_k
SVD decomposition
A = U Σ Vᵀ
L2 norm
‖x‖₂ = √(xᵀx)

PyTorch Implementation

import torch

q = torch.randn(8, 64)                    # 8 query vectors of dimension d_k = 64
k = torch.randn(8, 64)                    # 8 key vectors
scores = q @ k.T / q.shape[-1] ** 0.5     # scaled dot-product attention scores, shape (8, 8)

A = torch.randn(32, 64)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
A_approx = (U[:, :8] * S[:8]) @ Vh[:8, :]   # keep the top 8 singular values: a rank-8 approximation
print(torch.linalg.norm(A - A_approx))      # reconstruction error of the low-rank approximation

Topics in this section

Vectors and dot products
Geometric intuition behind embeddings and similarity
Matrix multiplication
The core operation in every dense layer and attention projection
Low-rank approximation
Important for LoRA, compression, and representation structure

Why this matters for ML

Without linear algebra, attention, embeddings, LoRA, PCA, and tensor shape reasoning all feel mysterious.

Core

Differential Calculus

Derivatives measure how a small change in input affects the output. In ML that means how changing a parameter changes the loss.

The chain rule is the heart of backpropagation. Deep learning works because gradients can flow backward through a composed computation graph.
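
As a quick check, here is a sketch (with an arbitrary composed function) of applying the chain rule by hand and comparing it with what autograd computes:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2           # inner function
z = torch.sin(y)     # outer function
z.backward()

# Chain rule by hand: dz/dx = cos(y) * dy/dx = cos(x^2) * 2x
manual = torch.cos(x.detach() ** 2) * 2 * x.detach()
print(x.grad, manual)   # the two values match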

Curvature matters too. Even if you never compute a full Hessian, optimizer behavior makes more sense once you understand local geometry.
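
A minimal sketch of probing that local geometry with a double backward pass (a toy scalar loss, nothing model-specific):

import torch

w = torch.tensor(1.5, requires_grad=True)
loss = (w - 3.0) ** 4   # toy loss with non-constant curvature

# First derivative, keeping the graph so it can be differentiated again
(grad,) = torch.autograd.grad(loss, w, create_graph=True)
# Second derivative: the 1-D analogue of the Hessian, i.e. local curvature
(curvature,) = torch.autograd.grad(grad, w)

print(grad, curvature)   # d/dw = 4(w-3)^3 = -13.5, d²/dw² = 12(w-3)^2 = 27.0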

Key Equations

Derivative
f'(x) = lim_{h→0} (f(x+h)-f(x))/h
Gradient descent
w ← w − α ∇L(w)
Chain rule
dL/dx = dL/dy · dy/dx

PyTorch Implementation

import torch

x = torch.tensor([3.0], requires_grad=True)
y = (2 * x + 1) ** 3   # composite function: cube of an affine map
y.backward()           # autograd applies the chain rule backward through the graph
print(x.grad)          # dy/dx = 3(2x+1)^2 * 2 = 294 at x = 3

Topics in this section

Derivatives
How sensitive the loss is to each parameter
Chain rule
Why backprop is possible in deep networks
Gradient flow
Explains exploding, vanishing, and clipping

Why this matters for ML

If you understand calculus, autograd stops feeling magical and training bugs become much easier to diagnose.

🎲Core

Probability & Statistics

Machine learning is full of uncertainty. Model outputs often represent probabilities, not certainties.

Cross-entropy is not just a library function. It is the negative log-likelihood objective behind classification.
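
A small sketch (with arbitrary logits) makes that explicit: cross-entropy is the negative log-probability the model assigns to the correct class.

import torch
import torch.nn.functional as F

logits = torch.tensor([[0.5, 1.5, 0.2]])    # made-up scores for 3 classes
label = torch.tensor([1])                   # the true class

log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[0, label]                  # negative log-likelihood of the true class
print(nll, F.cross_entropy(logits, label))  # the two values agree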

Regularization, Bayesian intuition, sampling, and policy gradients all become easier once probability feels native.

Key Equations

Cross-entropy
L = -Σ y log p̂
MLE objective
θ* = argmax Σ log P(x|θ)
KL divergence
D_KL(P ‖ Q) = Σ P(x) log(P(x)/Q(x))

PyTorch Implementation

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])   # raw scores for 3 classes, batch of 1
labels = torch.tensor([0])                 # index of the true class
loss = F.cross_entropy(logits, labels)     # softmax + negative log-likelihood in one call
print(loss)
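
The KL divergence from the equations above can be computed the same way; here is a sketch with two made-up categorical distributions:

import torch

p = torch.tensor([0.7, 0.2, 0.1])   # "true" distribution
q = torch.tensor([0.5, 0.3, 0.2])   # model distribution

kl = (p * (p / q).log()).sum()      # D_KL(P ‖ Q) = Σ p(x) log(p(x)/q(x))
print(kl)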

Topics in this section

Likelihood
The statistical meaning behind common training objectives
Distributions
Normal, categorical, Bernoulli, and sampling intuition
KL divergence
Shows up in VAEs, PPO, distillation, and calibration

Why this matters for ML

Probability unifies classification, uncertainty, regularization, generation, and reinforcement learning.

📡Advanced

Information Theory

Entropy measures uncertainty. Low entropy means confident predictions; high entropy means the probability mass is spread across many outcomes.

Cross-entropy measures how well the model distribution matches the true distribution. That is why it is central to classification and language modeling.

Perplexity gives a readable measure of language-model uncertainty and predictive sharpness.

Key Equations

Entropy
H(X) = -Σ p(x) log p(x)
Cross-entropy
H(p, q) = -Σ p(x) log q(x)
Perplexity
PP = exp(H)

PyTorch Implementation

import torch

def entropy(probs):
    # Shannon entropy in nats; the small constant avoids log(0)
    return -(probs * torch.log(probs + 1e-8)).sum()

print(entropy(torch.tensor([0.25, 0.25, 0.25, 0.25])))   # uniform distribution: maximum entropy, log(4) ≈ 1.386
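
Building on the same idea, perplexity is the exponential of the average cross-entropy; here is a sketch with a toy batch of random "language model" predictions:

import torch
import torch.nn.functional as F

logits = torch.randn(5, 10)              # 5 token positions, vocabulary of 10
targets = torch.randint(0, 10, (5,))     # random "correct" tokens

nll = F.cross_entropy(logits, targets)   # average per-token cross-entropy in nats
perplexity = torch.exp(nll)              # PP = exp(H)
print(nll, perplexity)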

Topics in this section

Entropy
A clean measure of uncertainty
Cross-entropy
The training loss behind classifiers and LMs
Perplexity
A practical metric for language models

Why this matters for ML

Information theory explains why cross-entropy works and how to interpret uncertainty in modern AI systems.

⛰️Advanced

Optimization

Training a model means solving a high-dimensional optimization problem under noise, approximation, and hardware constraints.

Learning rate schedules, weight decay, momentum, and adaptive optimizers are not trivia. They shape whether training converges at all.

A lot of practical ML is optimization engineering disguised as architecture work.

Key Equations

SGD update
w ← w − α g
Momentum
v ← βv + g,   w ← w − αv
AdamW intuition
adaptive step sizes + decoupled weight decay

PyTorch Implementation

import torch

model = torch.nn.Linear(128, 10)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
for step in range(10):
    loss = model(torch.randn(32, 128)).pow(2).mean()   # dummy loss on random inputs
    opt.zero_grad()                                    # clear gradients from the previous step
    loss.backward()                                    # compute gradients
    opt.step()                                         # AdamW update with decoupled weight decay
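
The schedules mentioned above slot directly into this loop; here is a sketch of the same setup with a cosine learning-rate decay added (hyperparameters are arbitrary):

import torch

model = torch.nn.Linear(128, 10)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)   # decay over 100 steps

for step in range(100):
    loss = model(torch.randn(32, 128)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                        # move the learning rate along the cosine curve

print(opt.param_groups[0]["lr"])        # learning rate after decay, close to zero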

Topics in this section

Learning rate
The single most important hyperparameter in many runs
Weight decay
Controls capacity and affects generalization
Adaptive optimizers
Adam, AdamW, RMSProp, and when they help

Why this matters for ML

Even a great architecture fails if optimization is unstable, under-tuned, or misunderstood.

Common Pitfalls

These mistakes do not produce immediate errors. They quietly create wrong results, weak intuition, or unstable training.

⚠️ Memorizing formulas without operational meaning

If you cannot tie a formula to tensors, losses, or model behavior, it will not help you debug real systems.

⚠️ Ignoring shapes while learning math

A lot of practical understanding comes from knowing what dimensions each object carries in code.

⚠️ Treating optimizers as black boxes

Many training failures are optimization failures, not architecture failures.

⚠️ Separating math from implementation

The fastest way to really learn is to pair each concept with a short PyTorch experiment.

🚀 Apply this math in the mini-GPT project

The mini-GPT project uses attention (linear algebra), backprop through transformer blocks (calculus), cross-entropy training (probability + information theory), and AdamW with cosine decay (optimization). This page is the theory map behind that implementation.

View mini-GPT project →

BERT Fine-tuning

MLE, cross-entropy, AdamW with warmup, and representation learning in one workflow.

Explore project →

PPO MuJoCo Agent

Policy gradients, KL divergence, entropy bonus, and GAE in a real RL setting.

Explore project →