Method of normal equations

The method of normal equations solves the least-squares problem $\min_{x} \| A x - b \|_{2}$ by solving the square linear system

$$(A^{T} A) x = A^{T} b$$
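
A brief sketch of where this system comes from, taking the underlying problem to be the minimization of the least-squares objective $f(x) = \| A x - b \|_{2}^{2}$ (the symbol $f$ is introduced here only for illustration). Expanding $f$ and setting its gradient to zero gives:

$$
\begin{aligned}
f(x) &= x^{T} A^{T} A x - 2 x^{T} A^{T} b + b^{T} b, \\
\nabla f(x) &= 2 A^{T} A x - 2 A^{T} b = 0
\quad \Longleftrightarrow \quad (A^{T} A) x = A^{T} b.
\end{aligned}
$$

Equivalently, the residual $b - A x$ is orthogonal to every column of $A$.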

The solution is:

$$x = (A^{T} A)^{-1} A^{T} b$$

When $A$ has full column rank, the matrix $A^{T} A$ is symmetric positive definite (SPD), so the system can be solved using a Cholesky factorization.
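
The steps can be sketched in plain Python as follows. This is a minimal illustration, not a production solver: the function name `normal_equations_cholesky` and the nested-list matrix layout are assumptions made for this example, and in practice one would call an optimized library routine instead of writing the loops by hand.

```python
import math


def normal_equations_cholesky(A, b):
    """Solve min ||Ax - b||_2 via the normal equations and Cholesky.

    A is an m-by-n nested list with full column rank, b a list of length m.
    (Hypothetical helper written for illustration only.)
    """
    m, n = len(A), len(A[0])

    # Step 1: form G = A^T A (n-by-n) and c = A^T b (length n).
    G = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    c = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]

    # Step 2: Cholesky factorization G = L L^T, with L lower triangular.
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = G[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("A^T A is not numerically positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (G[i][j] - s) / L[j][j]

    # Step 3: forward substitution, L y = c.
    y = [0.0] * n
    for i in range(n):
        y[i] = (c[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]

    # Step 4: back substitution, L^T x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


# Fit a line c0 + c1 * t through the points (0, 1), (1, 3), (2, 5), (3, 7).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 7.0]
x = normal_equations_cholesky(A, b)
print(x)  # approximately [1.0, 2.0], since the data lie exactly on 1 + 2t
```

The `ValueError` branch is where a rank-deficient (or numerically rank-deficient) $A$ shows up: the Cholesky factorization breaks down when a pivot is not positive.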

This method is fastest when $A$ is very tall and skinny ($m \gg n$), provided $A$ is well conditioned.

One of the main drawbacks is that forming $A^{T} A$ squares the condition number. Indeed we can prove that

$$\kappa(A^{T} A) = \| A^{T} A \|_{2} \, \| (A^{T} A)^{-1} \|_{2} = \kappa(A)^{2}$$
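
One way to see this is through the singular value decomposition. The sketch below assumes $A$ has full column rank and writes its thin SVD as $A = U \Sigma V^{T}$, with singular values $\sigma_{1} \geq \dots \geq \sigma_{n} > 0$ (this notation is introduced here for the argument):

$$
\begin{aligned}
A^{T} A &= V \Sigma U^{T} U \Sigma V^{T} = V \Sigma^{2} V^{T}, \\
\| A^{T} A \|_{2} &= \sigma_{1}^{2}, \qquad \| (A^{T} A)^{-1} \|_{2} = \sigma_{n}^{-2}, \\
\kappa(A^{T} A) &= \frac{\sigma_{1}^{2}}{\sigma_{n}^{2}} = \left( \frac{\sigma_{1}}{\sigma_{n}} \right)^{2} = \kappa(A)^{2}.
\end{aligned}
$$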

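So the condition number of $A^{T} A$ is the square of $κ (A)$. For example, if $κ (A) = 10^{4}$ then $κ (A^{T} A) = 10^{8}$, and roughly twice as many digits of accuracy can be lost.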
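
A small floating-point experiment makes the effect concrete. The matrix below is a standard textbook-style example chosen for illustration (it is not from these notes), and `gram` and `delta` are names introduced here. $A$ has full column rank, but $κ (A)^{2}$ exceeds the reciprocal of machine precision, so the computed $A^{T} A$ is exactly singular in double precision.

```python
import math
import sys


def gram(A):
    """Return A^T A for an m-by-n nested list A (illustrative helper)."""
    m, n = len(A), len(A[0])
    return [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
            for i in range(n)]


delta = 1e-9
# Columns are linearly independent, but nearly parallel.
A = [[1.0, 1.0],
     [delta, 0.0],
     [0.0, delta]]

G = gram(A)
# In exact arithmetic G = [[1 + delta^2, 1], [1, 1 + delta^2]], but
# delta^2 = 1e-18 is below machine precision, so 1 + delta^2 rounds to 1.
det_G = G[0][0] * G[1][1] - G[0][1] * G[1][0]

# The exact eigenvalues of A^T A are 2 + delta^2 and delta^2, so
# kappa(A) = sqrt(2 + delta^2) / delta.
kappa_A = math.sqrt(2.0 + delta ** 2) / delta

print(G)      # [[1.0, 1.0], [1.0, 1.0]]
print(det_G)  # 0.0
print(kappa_A ** 2 > 1.0 / sys.float_info.epsilon)  # True
```

Here $κ (A)$ is about $1.4 \times 10^{9}$, which is well within reach of double precision, yet the information that distinguishes the two columns is lost as soon as $A^{T} A$ is formed.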

This method requires $A^{T} A$ to be non-singular. This is equivalent to saying that $A$ should be full column rank.
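
The equivalence follows from a one-line identity, sketched here for any vector $x$:

$$
x^{T} (A^{T} A) x = (A x)^{T} (A x) = \| A x \|_{2}^{2} \geq 0,
\qquad \text{so} \qquad
A^{T} A x = 0 \iff A x = 0.
$$

Hence $A^{T} A$ and $A$ have the same null space. $A^{T} A$ is non-singular exactly when $A x = 0$ forces $x = 0$, that is, when the columns of $A$ are linearly independent. The same identity shows that $A^{T} A$ is then positive definite.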

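The computational cost is $O (m n^{2})$: forming $A^{T} A$ takes about $m n^{2}$ flops (exploiting symmetry) and the Cholesky factorization about $n^{3} / 3$ flops, which is the smaller term since $m \geq n$.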

Intuitive explanation

$A^{T} A$ : This product represents the “correlation” of the columns of $A$ with each other. It captures how the columns of $A$ interact and overlap.
$A^{T} b$ : This term represents the “correlation” of the columns of $A$ with the target vector $b$. It tells us how much each column of $A$ contributes to explaining $b$.
$(A^{T} A)^{-1}$ : Inverting $A^{T} A$ is like “decorrelating” the columns of $A$. It accounts for any redundancy or overlap in the columns of $A$.
Final multiplication: $(A^{T} A)^{-1} A^{T} b$ combines the decorrelated version of $A$ with its correlation to $b$, giving us the optimal coefficients $x$.

📓 CME 302