🧮 Math Foundations for Machine Learning

This page turns abstract math into practical ML intuition. Rather than leading with formal proofs, each section connects the concepts to PyTorch code, model behavior, and common engineering decisions.

📐Foundation

Linear Algebra

A vector is not just an array. In machine learning it represents a direction and magnitude inside feature space. Word embeddings, image features, and hidden states are all vectors.
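
To make that concrete, here is a minimal sketch (with made-up 4-dimensional "embeddings") showing that similarity between vectors is just a normalized dot product:

import torch

a = torch.tensor([1.0, 0.0, 2.0, 1.0])   # hypothetical embedding vector
b = torch.tensor([0.5, 0.1, 1.8, 0.9])   # another hypothetical embedding

# The dot product measures alignment; dividing by the norms gives cosine similarity in [-1, 1]
cos_sim = (a @ b) / (a.norm() * b.norm())
print(cos_sim)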

Matrix multiplication is the language of neural networks. When you compute y = W @ x, you are applying a learned linear transformation that rotates, scales, and mixes information.

SVD and low-rank approximation matter because modern techniques like LoRA exploit the fact that useful updates often live in a low-dimensional subspace.
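
A minimal sketch of that idea with toy shapes (not a real LoRA implementation): a rank-r update B @ A has the same shape as the full weight matrix but far fewer parameters.

import torch

d_out, d_in, r = 512, 512, 8      # toy dimensions; r is the assumed low rank
W = torch.randn(d_out, d_in)      # frozen base weight
A = torch.randn(r, d_in) * 0.01   # small trainable factor
B = torch.zeros(d_out, r)         # second trainable factor, initialized to zero

delta = B @ A                     # rank-r update, same shape as W
W_adapted = W + delta

print(W.numel(), A.numel() + B.numel())   # 262144 parameters in W vs 8192 in the low-rank factors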

Key Equations

Matrix-vector product
y = W @ x
Attention score
score = Q @ Kᵀ / √d_k
SVD decomposition
A = U Σ Vᵀ
L2 norm
‖x‖₂ = √(xᵀx)

PyTorch Implementation

import torch

q = torch.randn(8, 64)                    # 8 query vectors of dimension d_k = 64
k = torch.randn(8, 64)                    # 8 key vectors
scores = q @ k.T / q.shape[-1] ** 0.5     # scaled dot-product attention scores, shape (8, 8)

A = torch.randn(32, 64)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
A_approx = (U[:, :8] * S[:8]) @ Vh[:8, :]   # keep the top 8 singular values: a rank-8 approximation
print(torch.linalg.norm(A - A_approx))      # reconstruction error of the low-rank approximation

Topics in this section

Vectors and dot products
Geometric intuition behind embeddings and similarity
Matrix multiplication
The core operation in every dense layer and attention projection
Low-rank approximation
Important for LoRA, compression, and representation structure

Why this matters for ML

Without linear algebra, attention, embeddings, LoRA, PCA, and tensor shape reasoning all feel mysterious.

Core

Differential Calculus

Derivatives measure how a small change in input affects the output. In ML that means how changing a parameter changes the loss.

The chain rule is the heart of backpropagation. Deep learning works because gradients can flow backward through a composed computation graph.
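
As a quick check, here is a sketch (with an arbitrary composed function) of applying the chain rule by hand and comparing it with what autograd computes:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2           # inner function
z = torch.sin(y)     # outer function
z.backward()

# Chain rule by hand: dz/dx = cos(y) * dy/dx = cos(x^2) * 2x
manual = torch.cos(x.detach() ** 2) * 2 * x.detach()
print(x.grad, manual)   # the two values match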

Curvature matters too. Even if you never compute a full Hessian, optimizer behavior makes more sense once you understand local geometry.
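
A minimal sketch of probing that local geometry with a double backward pass (a toy scalar loss, nothing model-specific):

import torch

w = torch.tensor(1.5, requires_grad=True)
loss = (w - 3.0) ** 4   # toy loss with non-constant curvature

# First derivative, keeping the graph so it can be differentiated again
(grad,) = torch.autograd.grad(loss, w, create_graph=True)
# Second derivative: the 1-D analogue of the Hessian, i.e. local curvature
(curvature,) = torch.autograd.grad(grad, w)

print(grad, curvature)   # d/dw = 4(w-3)^3 = -13.5, d²/dw² = 12(w-3)^2 = 27.0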

Key Equations

Derivative
f'(x) = lim_{h→0} (f(x+h)-f(x))/h
Gradient descent
w ← w − α ∇L(w)
Chain rule
dL/dx = dL/dy · dy/dx

PyTorch Implementation

import torch

x = torch.tensor([3.0], requires_grad=True)
y = (2 * x + 1) ** 3   # composite function: cube of an affine map
y.backward()           # autograd applies the chain rule backward through the graph
print(x.grad)          # dy/dx = 3(2x+1)^2 * 2 = 294 at x = 3

Topics in this section

Derivatives
How sensitive the loss is to each parameter
Chain rule
Why backprop is possible in deep networks
Gradient flow
Explains exploding, vanishing, and clipping

Why this matters for ML

If you understand calculus, autograd stops feeling magical and training bugs become much easier to diagnose.

🎲Core

Probability & Statistics

Machine learning is full of uncertainty. Model outputs often represent probabilities, not certainties.

Cross-entropy is not just a library function. It is the negative log-likelihood objective behind classification.
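
A small sketch (with arbitrary logits) makes that explicit: cross-entropy is the negative log-probability the model assigns to the correct class.

import torch
import torch.nn.functional as F

logits = torch.tensor([[0.5, 1.5, 0.2]])    # made-up scores for 3 classes
label = torch.tensor([1])                   # the true class

log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[0, label]                  # negative log-likelihood of the true class
print(nll, F.cross_entropy(logits, label))  # the two values agree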

Regularization, Bayesian intuition, sampling, and policy gradients all become easier once probability feels native.

Key Equations

Cross-entropy
L = -Σ y log p̂
MLE objective
θ* = argmax Σ log P(x|θ)
KL divergence
D_KL(P ‖ Q) = Σ P(x) log(P(x)/Q(x))

PyTorch Implementation

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])   # raw scores for 3 classes, batch of 1
labels = torch.tensor([0])                 # index of the true class
loss = F.cross_entropy(logits, labels)     # softmax + negative log-likelihood in one call
print(loss)
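
The KL divergence from the equations above can be computed the same way; here is a sketch with two made-up categorical distributions:

import torch

p = torch.tensor([0.7, 0.2, 0.1])   # "true" distribution
q = torch.tensor([0.5, 0.3, 0.2])   # model distribution

kl = (p * (p / q).log()).sum()      # D_KL(P ‖ Q) = Σ p(x) log(p(x)/q(x))
print(kl)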

Topics in this section

Likelihood
The statistical meaning behind common training objectives
Distributions
Normal, categorical, Bernoulli, and sampling intuition
KL divergence
Shows up in VAEs, PPO, distillation, and calibration

Why this matters for ML

Probability unifies classification, uncertainty, regularization, generation, and reinforcement learning.

📡Advanced

Information Theory

Entropy measures uncertainty. Low entropy means confident predictions; high entropy means the probability mass is spread across many outcomes.

Cross-entropy measures how well the model distribution matches the true distribution. That is why it is central to classification and language modeling.

Perplexity gives a readable measure of language-model uncertainty and predictive sharpness.

Key Equations

Entropy
H(X) = -Σ p(x) log p(x)
Cross-entropy
H(p, q) = -Σ p(x) log q(x)
Perplexity
PP = exp(H)

PyTorch Implementation

import torch

def entropy(probs):
    # Shannon entropy in nats; the small constant avoids log(0)
    return -(probs * torch.log(probs + 1e-8)).sum()

print(entropy(torch.tensor([0.25, 0.25, 0.25, 0.25])))   # uniform distribution: maximum entropy, log(4) ≈ 1.386
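
Building on the same idea, perplexity is the exponential of the average cross-entropy; here is a sketch with a toy batch of random "language model" predictions:

import torch
import torch.nn.functional as F

logits = torch.randn(5, 10)              # 5 token positions, vocabulary of 10
targets = torch.randint(0, 10, (5,))     # random "correct" tokens

nll = F.cross_entropy(logits, targets)   # average per-token cross-entropy in nats
perplexity = torch.exp(nll)              # PP = exp(H)
print(nll, perplexity)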

Topics in this section

Entropy
A clean measure of uncertainty
Cross-entropy
The training loss behind classifiers and LMs
Perplexity
A practical metric for language models

Why this matters for ML

Information theory explains why cross-entropy works and how to interpret uncertainty in modern AI systems.

⛰️Advanced

Optimization

Training a model means solving a high-dimensional optimization problem under noise, approximation, and hardware constraints.

Learning rate schedules, weight decay, momentum, and adaptive optimizers are not trivia. They shape whether training converges at all.

A lot of practical ML is optimization engineering disguised as architecture work.

Key Equations

SGD update
w ← w − α g
Momentum
v ← βv + g,   w ← w − αv
AdamW intuition
adaptive step sizes + decoupled weight decay

PyTorch Implementation

import torch

model = torch.nn.Linear(128, 10)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
for step in range(10):
    loss = model(torch.randn(32, 128)).pow(2).mean()   # dummy loss on random inputs
    opt.zero_grad()                                    # clear gradients from the previous step
    loss.backward()                                    # compute gradients
    opt.step()                                         # AdamW update with decoupled weight decay
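
The schedules mentioned above slot directly into this loop; here is a sketch of the same setup with a cosine learning-rate decay added (hyperparameters are arbitrary):

import torch

model = torch.nn.Linear(128, 10)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)   # decay over 100 steps

for step in range(100):
    loss = model(torch.randn(32, 128)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                        # move the learning rate along the cosine curve

print(opt.param_groups[0]["lr"])        # learning rate after decay, close to zero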

Topics in this section

Learning rate
The single most important hyperparameter in many runs
Weight decay
Controls capacity and affects generalization
Adaptive optimizers
Adam, AdamW, RMSProp, and when they help

Why this matters for ML

Even a great architecture fails if optimization is unstable, under-tuned, or misunderstood.

Common Pitfalls

These mistakes do not produce immediate errors. They quietly create wrong results, weak intuition, or unstable training.

⚠️ Memorizing formulas without operational meaning

If you cannot tie a formula to tensors, losses, or model behavior, it will not help you debug real systems.

⚠️ Ignoring shapes while learning math

A lot of practical understanding comes from knowing what dimensions each object carries in code.

⚠️ Treating optimizers as black boxes

Many training failures are optimization failures, not architecture failures.

⚠️ Separating math from implementation

The fastest way to really learn is to pair each concept with a short PyTorch experiment.

🚀 Apply this math in the mini-GPT project

The mini-GPT project uses attention (linear algebra), backprop through transformer blocks (calculus), cross-entropy training (probability + information theory), and AdamW with cosine decay (optimization). This page is the theory map behind that implementation.

View mini-GPT project →

BERT Fine-tuning

MLE, cross-entropy, AdamW with warmup, and representation learning in one workflow.

Explore project →

PPO MuJoCo Agent

Policy gradients, KL divergence, entropy bonus, and GAE in a real RL setting.

Explore project →