Dot Product and Vector Norms#
Dot Product#
For vectors \(x, y \in \mathbb{R}^n\), the dot product (also called the inner product in this context) is defined as

\[
x^T y = \sum_{i=1}^n x_i y_i.
\]
For vectors \(x, y \in \mathbb{C}^n\), we use the conjugate transpose:

\[
x^H y = \sum_{i=1}^n \overline{x}_i \, y_i,
\]
where \(\overline{x}_i\) denotes the complex conjugate of \(x_i\).
Computing a dot product requires \(O(n)\) operations.
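As a quick illustration, here is a minimal sketch assuming NumPy is available; `np.vdot` conjugates its first argument, matching the \(x^H y\) convention above.

```python
import numpy as np

# Real case: x^T y is a single O(n) pass over the entries.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -5.0, 6.0])
print(x @ y)                      # 1*4 + 2*(-5) + 3*6 = 12.0

# Complex case: conjugate the first vector before multiplying.
u = np.array([1 + 1j, 2 - 1j])
v = np.array([3 + 0j, 1 + 2j])
print(np.vdot(u, v))              # sum of conj(u_i) * v_i
print(np.sum(np.conj(u) * v))     # same value, written out explicitly
```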
Orthogonality#
Two vectors \(x\) and \(y\) are orthogonal if

\[
x^T y = 0
\]
(in the complex case, \(x^H y = 0\)).
This definition naturally extends to subspaces.
Orthogonal Complements#
For a subspace \(S \subset \mathbb{R}^n\), the orthogonal complement \(S^\perp\) is defined as:

\[
S^\perp = \{ v \in \mathbb{R}^n : v^T s = 0 \ \text{for all } s \in S \}.
\]
Key properties:
\(S^\perp\) is itself a subspace of \(\mathbb{R}^n\).
Any vector \(v \in \mathbb{R}^n\) can be uniquely written as the sum of a vector in \(S\) and a vector in \(S^\perp\).
The dimensions satisfy:

\[
\dim S + \dim S^\perp = n.
\]
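To make the decomposition concrete, here is a small NumPy sketch (the matrix `A` and vector `v` are arbitrary illustrative choices): \(S\) is taken as the column space of `A`, and an orthonormal basis from a thin QR factorization gives the orthogonal projection onto \(S\).

```python
import numpy as np

# S = span of the columns of A (a 2-dimensional subspace of R^4, chosen arbitrarily).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [0.0, 0.0]])
v = np.array([1.0, 2.0, 3.0, 4.0])

# Orthogonal projector onto S via a thin QR factorization of A.
Q, _ = np.linalg.qr(A)                # columns of Q: orthonormal basis of S
v_S = Q @ (Q.T @ v)                   # component in S
v_perp = v - v_S                      # component in S^perp

print(A.T @ v_perp)                   # ~0: v_perp is orthogonal to every column of A
print(np.allclose(v, v_S + v_perp))   # True: v = v_S + v_perp
```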
Vector Norms#
The dot product allows us to define the Euclidean norm (or 2-norm) of a vector \(x\):

\[
\|x\|_2 = \sqrt{x^T x} = \left( \sum_{i=1}^n x_i^2 \right)^{1/2}.
\]
General Definition of a Norm#
A norm is a function \(\|\cdot\|\) mapping vectors to non-negative real numbers that satisfies:
Positive definiteness: \(\|x\| = 0 \iff x = 0\).
Homogeneity: \(\|\alpha x\| = |\alpha| \, \|x\|\) for all scalars \(\alpha\).
Triangle inequality: \(\|x + y\| \leq \|x\| + \|y\|\).
Common Vector Norms#
1-norm (Manhattan norm):

\[
\|x\|_1 = \sum_{i=1}^n |x_i|
\]

2-norm (Euclidean norm):

\[
\|x\|_2 = \left( \sum_{i=1}^n x_i^2 \right)^{1/2}
\]

Infinity norm (max norm):

\[
\|x\|_\infty = \max_{1 \le i \le n} |x_i|
\]

p-norm (for \(p \ge 1\)):

\[
\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}
\]
Convention: In this book, \(\|x\|\) means \(\|x\|_2\) unless otherwise stated.
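All of these norms can be computed with `np.linalg.norm` (a minimal sketch, NumPy assumed); note that the default, with no `ord` argument, is the 2-norm, consistent with the convention above.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0])   # example vector

print(np.linalg.norm(x, 1))        # 1-norm:  |3| + |-4| + |0| + |1| = 8.0
print(np.linalg.norm(x))           # 2-norm (default): sqrt(9 + 16 + 0 + 1) ≈ 5.099
print(np.linalg.norm(x, np.inf))   # infinity norm: max |x_i| = 4.0
print(np.linalg.norm(x, 3))        # general p-norm with p = 3
```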
Equivalence Between Norms#
In a finite-dimensional vector space like \(\mathbb{R}^n\), all norms are equivalent. This means that for any two vector norms \(\|\cdot\|_a\) and \(\|\cdot\|_b\), there exist positive constants \(c_1\) and \(c_2\) such that for every vector \(\mathbf{x}\):

\[
c_1 \|\mathbf{x}\|_b \leq \|\mathbf{x}\|_a \leq c_2 \|\mathbf{x}\|_b.
\]
This property is powerful because it guarantees that if a sequence of vectors converges in one norm, it converges in all norms. 📐
Specific Inequalities Between 1, 2, and Infinity Norms#
For any \(x \in \mathbb{R}^n\):

\[
\|x\|_\infty \leq \|x\|_2 \leq \sqrt{n}\, \|x\|_\infty, \qquad
\|x\|_\infty \leq \|x\|_1 \leq n\, \|x\|_\infty, \qquad
\|x\|_2 \leq \|x\|_1 \leq \sqrt{n}\, \|x\|_2.
\]
These follow from the Cauchy–Schwarz inequality (see below) and basic properties of maxima and sums.
To illustrate why the dimension \(n\) is critical, consider the vector \(\mathbf{x} = [1, 1, \dots, 1]^T \in \mathbb{R}^n\).
\(\|\mathbf{x}\|_\infty = 1\)
\(\|\mathbf{x}\|_2 = \sqrt{1^2 + \dots + 1^2} = \sqrt{n}\)
\(\|\mathbf{x}\|_1 = 1 + \dots + 1 = n\)
These values exactly match the scaling factors in the inequalities.
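A short numerical check of this example (NumPy assumed), confirming that the all-ones vector attains the \(\sqrt{n}\) and \(n\) scaling factors exactly:

```python
import numpy as np

n = 10
x = np.ones(n)                          # the all-ones vector from the example

inf_norm = np.linalg.norm(x, np.inf)    # 1
two_norm = np.linalg.norm(x)            # sqrt(n)
one_norm = np.linalg.norm(x, 1)         # n

print(inf_norm, two_norm, one_norm)
# Equality is attained in ||x||_2 <= sqrt(n) ||x||_inf and ||x||_1 <= n ||x||_inf:
print(np.isclose(two_norm, np.sqrt(n) * inf_norm),
      np.isclose(one_norm, n * inf_norm))
```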
General Ordering of p-Norms#
For any vector \(\mathbf{x}\), the value of its \(p\)-norm is a non-increasing function of \(p\). This provides a simple and elegant ordering.
For any \(p > q \ge 1\):

\[
\|\mathbf{x}\|_p \leq \|\mathbf{x}\|_q.
\]
This leads to the most frequently cited chain of inequalities:

\[
\|\mathbf{x}\|_\infty \leq \|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_1.
\]
The \(\infty\)-norm is the limit of the \(p\)-norm as \(p \to \infty\), making it the smallest of all \(p\)-norms:

\[
\|\mathbf{x}\|_\infty = \lim_{p \to \infty} \|\mathbf{x}\|_p \leq \|\mathbf{x}\|_p \quad \text{for all } p \geq 1.
\]
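A quick sketch of this monotonicity (NumPy assumed, with an arbitrary random vector), showing the \(p\)-norm decreasing toward the \(\infty\)-norm as \(p\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)            # any vector works

# ||x||_p is non-increasing as p grows, with ||x||_inf as the limit.
for p in [1, 1.5, 2, 3, 10, 100]:
    print(p, np.linalg.norm(x, p))
print("inf", np.linalg.norm(x, np.inf))
```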
Geometric Interpretation#
The unit ball of a norm \(\|\cdot\|\) is the set:

\[
B = \{ x \in \mathbb{R}^n : \|x\| \leq 1 \}.
\]
For the 2-norm, the unit ball is a sphere (circle in \(\mathbb{R}^2\)).
For the 1-norm, the unit ball in \(\mathbb{R}^2\) is a diamond shape.
For the infinity norm, the unit ball is a square in \(\mathbb{R}^2\).
As \(p\) increases from \(1\) to \(\infty\), the \(p\)-norm unit ball transitions smoothly from the diamond to the square.
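The transition can be visualized by tracing the boundary \(\|x\|_p = 1\) in \(\mathbb{R}^2\); the sketch below (NumPy and Matplotlib assumed) normalizes each direction vector to unit \(p\)-norm.

```python
import numpy as np
import matplotlib.pyplot as plt

# Boundary of the p-norm unit ball in R^2: scale each direction u(theta) to unit p-norm.
theta = np.linspace(0, 2 * np.pi, 400)
U = np.column_stack([np.cos(theta), np.sin(theta)])

fig, ax = plt.subplots(figsize=(5, 5))
for p in [1, 2, 4, np.inf]:
    norms = np.linalg.norm(U, ord=p, axis=1)
    B = U / norms[:, None]                 # points with ||x||_p = 1
    ax.plot(B[:, 0], B[:, 1], label=f"p = {p}")

ax.set_aspect("equal")
ax.legend()
plt.show()
```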
Why Different Norms Matter#
Geometric Interpretation in Optimization: The choice of norm is critical in optimization and machine learning, as it defines the geometry of the “constraint region” for your solution. In many problems, we minimize a loss function subject to a constraint that the solution vector’s norm must be small (\( \|x\|_p \le C\)). The optimal solution is often found where the level curves of the loss function first touch the boundary of this constraint region (the “norm ball”).
Inducing Sparsity (\(p \le 1\)):
The L1-norm (\(\|x\|_1\)) constraint region is shaped like a diamond (or hyper-diamond in higher dimensions), with sharp corners that lie on the axes.
Because of these corners, the expanding level curves of the loss function are most likely to make contact at a point on an axis.
A point on an axis means that the other components of the vector are zero. This is why L1 regularization (used in LASSO) produces sparse solutions, which is useful for feature selection.
Encouraging Small, Non-Zero Values (\(p \ge 2\)):
The L2-norm (\(\|x\|_2\)) constraint region is a circle (or hypersphere), which is perfectly round and has no corners. The solution can occur anywhere on its surface.
This norm penalizes large values heavily (\(x_i^2\)), so it tends to find solutions where all components are small and non-zero rather than forcing some to be exactly zero. This is the basis for Ridge Regression.
For norms with \(p>2\), the penalty on large components is even more severe, further encouraging solutions where all entries have similar, non-zero magnitudes.
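A small illustrative comparison of these two behaviors (scikit-learn assumed available; the synthetic data and penalty strengths below are arbitrary choices): the L1-penalized fit zeros out most coefficients, while the L2-penalized fit merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge   # scikit-learn assumed available

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_coef = np.zeros(10)
true_coef[[0, 3]] = [2.0, -1.5]                 # only two features actually matter
y = X @ true_coef + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)              # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)             # squared L2 penalty

print(np.round(lasso.coef_, 3))   # most entries exactly 0 (sparse)
print(np.round(ridge.coef_, 3))   # entries small but generally non-zero
```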
Cauchy–Schwarz and Hölder Inequalities#
Hölder’s inequality:
For \(x, y \in \mathbb{R}^n\) and \(p, q \geq 1\) such that \(\frac{1}{p} + \frac{1}{q} = 1\),

\[
|x^T y| \leq \|x\|_p \, \|y\|_q.
\]
Cauchy–Schwarz inequality:
Special case \(p = q = 2\):

\[
|x^T y| \leq \|x\|_2 \, \|y\|_2.
\]
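A quick numerical sanity check of both bounds (NumPy assumed; the random vectors and the Hölder pair \(p = 3\), \(q = 3/2\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(6)
y = rng.standard_normal(6)

lhs = abs(x @ y)

# Cauchy–Schwarz (p = q = 2) and one Hölder pair (p = 3, q = 3/2).
print(lhs <= np.linalg.norm(x) * np.linalg.norm(y))
print(lhs <= np.linalg.norm(x, 3) * np.linalg.norm(y, 1.5))
```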
Why is This Bound Important?
Proving Algorithm Convergence: Many iterative algorithms in NLA work by generating a sequence of vectors that get progressively closer to a solution. The Cauchy-Schwarz inequality is often used to prove that the error at each step is decreasing, guaranteeing that the algorithm will eventually converge.
Geometric Insight: The inequality has a clear geometric meaning: writing \(x^T y = \|x\|_2 \|y\|_2 \cos\theta\), where \(\theta\) is the angle between \(x\) and \(y\), the magnitude of the dot product is largest when the vectors are perfectly aligned (\(\theta = 0\) or \(\pi\)) and is zero when they are orthogonal. This intuition is invaluable when developing new algorithms.
Application: Pythagorean Theorem#
If \(x, y \in \mathbb{R}^n\) are orthogonal (\(x^T y = 0\)), then:

\[
\|x + y\|_2^2 = \|x\|_2^2 + \|y\|_2^2.
\]
This follows immediately from expanding \(\|x+y\|_2^2\) using the dot product definition.
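A one-line numerical confirmation (NumPy assumed, with a hand-picked orthogonal pair):

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0])
y = np.array([-2.0, 1.0, 5.0])       # x^T y = -2 + 2 + 0 = 0, so x is orthogonal to y

lhs = np.linalg.norm(x + y) ** 2
rhs = np.linalg.norm(x) ** 2 + np.linalg.norm(y) ** 2
print(x @ y, np.isclose(lhs, rhs))   # 0.0 True
```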
Summary Table:

| Norm | Formula | Unit Ball Shape (\(\mathbb{R}^2\)) |
|---|---|---|
| 1-norm | \(\sum_{i=1}^n \lvert x_i\rvert\) | Diamond |
| 2-norm | \((\sum_{i=1}^n x_i^2)^{1/2}\) | Circle |
| Infinity-norm | \(\max_i \lvert x_i\rvert\) | Square |
| p-norm | \((\sum_{i=1}^n \lvert x_i\rvert^p)^{1/p}\) | Smooth transition from diamond to square as \(p\) increases |