A geometric derivation of Conjugate Gradients

Author: Rajat Vadiraj Dwaraknath

Thanks to Anjan Dwaraknath for helpful discussions.

Problem setup

We wish to solve the linear system $Ax = b$, where $A \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix. We use $x_\star = A^{-1}b$ to denote the exact solution.

We are working in the computational model where we have access to the matrix $A$ only through matrix-vector products. That is, we have a method to compute $Av$ for any vector $v$. We also have access to the vector $b$.

Since $A$ is symmetric positive definite, it induces an inner product given by $\langle u, v \rangle_A = u^\top A v$. Although we don't have access to the solution $x_\star$, we do have access to $A x_\star = b$. So, we can compute the expression $\langle v, x_\star \rangle_A = v^\top A x_\star = v^\top b$ for any $v$. In other words, we can compute $A$-inner products with the solution $x_\star$ for any vector $v$.
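
As a quick sanity check, here is a minimal NumPy sketch of this observation (the random test matrix, its size, and the variable names are illustrative assumptions, not part of the derivation): the $A$-inner product of any $v$ with the unknown solution $x_\star$ can be evaluated as $v^\top b$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # a symmetric positive definite test matrix
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)   # used only to verify the identity
v = rng.standard_normal(n)

lhs = v @ A @ x_star             # <v, x_star>_A, needs the unknown solution
rhs = v @ b                      # computable from b alone
print(np.isclose(lhs, rhs))      # True
```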

Solution attempt

Motivated by this, we can posit a method to solve $Ax = b$ by working in the $A$-inner product as follows:

  • Compute an $A$-orthogonal basis for $\mathbb{R}^n$, which we denote $p_1, \dots, p_n$, via Gram-Schmidt in the $A$-inner product. Note that for this CG derivation, we don't need these basis vectors to be normalized.
  • $A$-project the solution $x_\star$ using this basis:

$$x_\star = \sum_{i=1}^{n} \frac{\langle p_i, x_\star \rangle_A}{\langle p_i, p_i \rangle_A} p_i = \sum_{i=1}^{n} \frac{p_i^\top b}{p_i^\top A p_i} p_i.$$

This also naturally leads to an approximation scheme by truncating the sum in the projection:

$$x_k = \sum_{i=1}^{k} \frac{p_i^\top b}{p_i^\top A p_i} p_i.$$

Therefore, the sequence of approximations $x_1, x_2, \dots$ can be interpreted as the $A$-projections of the solution onto the sequence of increasing subspaces $V_k = \mathrm{span}(p_1, \dots, p_k)$. We can iteratively update the approximations by noticing that:

$$x_k = x_{k-1} + \frac{p_k^\top b}{p_k^\top A p_k} p_k.$$

Since the approximations are projections, we can also use the variational characterization of projection as finding the vector in the subspace that is closest to $x_\star$ in the $A$-norm:

$$x_k = \operatorname*{arg\,min}_{x \in V_k} \|x - x_\star\|_A.$$

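
To make the projection and its variational characterization concrete, here is a small NumPy sketch (the test problem, the random choice of starting vectors for Gram-Schmidt, and the subspace dimension are all illustrative assumptions): the truncated sum over an $A$-orthogonal basis coincides with the $A$-norm-closest point in $V_k$, computed here by solving the small normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 4

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # symmetric positive definite
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

# Gram-Schmidt in the A-inner product on some linearly independent vectors.
P = []
for v in rng.standard_normal((n, n)):
    p = v.copy()
    for q in P:
        p -= (q @ A @ p) / (q @ A @ q) * q   # remove the A-component along q
    P.append(p)
P = np.column_stack(P)

# Truncated projection: x_k = sum_{i<=k} (p_i^T b / p_i^T A p_i) p_i.
Pk = P[:, :k]
x_k = Pk @ ((Pk.T @ b) / np.diag(Pk.T @ A @ Pk))

# Variational characterization: minimize ||x - x_star||_A over x in span(Pk),
# i.e. solve the normal equations (Pk^T A Pk) c = Pk^T A x_star = Pk^T b.
c = np.linalg.solve(Pk.T @ A @ Pk, Pk.T @ b)
x_min = Pk @ c

print(np.allclose(x_k, x_min))   # True: the two characterizations agree
```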
Notice that there is some freedom in the choice of the basis $p_1, \dots, p_n$ in this method. We use this freedom in a specific way to arrive at the Conjugate Gradients method.

Connecting to Conjugate Gradients

The conjugate gradients method does exactly the above procedure, but for a very specific choice of the orthogonal basis $p_1, \dots, p_n$. Specifically, it requires that these vectors span the Krylov sequence of $A$ with starting vector $b$. More precisely, CG chooses $p_1, \dots, p_n$ such that

$$\mathrm{span}(p_1, \dots, p_k) = \mathcal{K}_k := \mathrm{span}(b, Ab, A^2 b, \dots, A^{k-1} b) \quad \text{for all } k.$$

With this choice, the variational characterization of the successive approximations becomes:

$$x_k = \operatorname*{arg\,min}_{x \in \mathcal{K}_k} \|x - x_\star\|_A,$$

which is exactly the starting definition of CG!

What remains now is to find an efficient way of computing the $A$-orthogonal basis $p_1, \dots, p_n$. It turns out that choosing the successive approximation subspaces to be the Krylov sequence allows us to compute the $p_k$ using a short recurrence, by connecting to the Lanczos iteration.

The three-term recurrence for $p_k$

To compute the $A$-orthogonal basis $p_1, \dots, p_n$, we can perform Gram-Schmidt in the $A$-inner product on some vectors $v_1, \dots, v_n$ that span the Krylov sequence. However, Gram-Schmidt is pretty slow: to compute $p_k$ we need to $A$-project out the components of $v_k$ along $p_1, \dots, p_{k-1}$, and each projection needs $O(n^2)$ time since we need to compute an $A$-inner product, which requires a multiplication by $A$. So the total time to compute the basis is $O(n^4)$. It would be nice if we only needed to project out a few components instead of all $k-1$ at each step. We can achieve this with a smart choice of the starting vectors $v_1, \dots, v_n$.
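
Here is a sketch of the naive approach in NumPy (the problem size and the normalization of the Krylov vectors are illustrative assumptions): at step $k$ we project out $k-1$ components, and every projection performs a multiplication by $A$, which is where the cost piles up.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# Starting vectors spanning the Krylov sequence: b, Ab, A^2 b, ...
# (normalized only so that the entries do not blow up).
V = [b / np.linalg.norm(b)]
for _ in range(n - 1):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))

# Naive A-Gram-Schmidt: step k projects out all k-1 previous components.
P, matvecs = [], 0
for v in V:
    p = v.copy()
    for q in P:
        Aq = A @ q                       # one multiplication by A per projection
        matvecs += 1
        p -= (Aq @ p) / (Aq @ q) * q
    P.append(p)
P = np.column_stack(P)

G = P.T @ A @ P                          # should be (numerically) diagonal
print("matvecs:", matvecs)               # n(n-1)/2 of them, each O(n^2) time
print("max off-diagonal:", np.abs(G - np.diag(np.diag(G))).max())
```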

A good choice of starting vectors would be one where they are already close to being $A$-orthogonal. Putting the vectors into a matrix $V = [v_1 \; \cdots \; v_n]$, we want the pairwise $A$-inner products of the columns of $V$ to be close to zero. More precisely, we want $V^\top A V$ to be close to a diagonal matrix.

We have seen such a $V$ in the context of the Lanczos iteration. Specifically, the vectors $q_1, \dots, q_n$ generated by Lanczos span the Krylov sequence and also have the property that $Q^\top A Q$ is a tridiagonal matrix, where $Q = [q_1 \; \cdots \; q_n]$. This means that

$$\langle q_i, q_j \rangle_A = q_i^\top A q_j = 0 \quad \text{whenever } |i - j| > 1.$$

In other words, $q_i$ is already $A$-orthogonal to every $q_j$ except its immediate neighbors.
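
Below is a sketch of the Lanczos iteration in NumPy (the helper name `lanczos`, the test problem, and its size are illustrative assumptions) that checks the tridiagonal property numerically.

```python
import numpy as np

def lanczos(A, b, m):
    """Return the first m Lanczos vectors of A started at b, as columns of Q."""
    n = len(b)
    Q = np.zeros((n, m))
    q = b / np.linalg.norm(b)
    Q[:, 0] = q
    beta, q_prev = 0.0, np.zeros(n)
    for j in range(m - 1):
        w = A @ q - beta * q_prev        # three-term Lanczos recurrence
        alpha = q @ w
        w = w - alpha * q
        beta = np.linalg.norm(w)
        q_prev, q = q, w / beta
        Q[:, j + 1] = q
    return Q

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

Q = lanczos(A, b, n)
T = Q.T @ A @ Q
# Entries more than one position off the diagonal should be (numerically) zero.
far = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > 1
print(np.abs(T[far]).max())
```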

Therefore, we choose $v_k = q_k$ for all $k$. When performing $A$-Gram-Schmidt on $q_1, \dots, q_n$, we only need to project out the component of $q_{k+1}$ along $p_k$, since $q_{k+1}$ is already $A$-orthogonal to $\mathrm{span}(q_1, \dots, q_{k-1}) = \mathrm{span}(p_1, \dots, p_{k-1})$. Therefore, we can write the Gram-Schmidt step as follows:

$$p_{k+1} = q_{k+1} - \frac{\langle q_{k+1}, p_k \rangle_A}{\langle p_k, p_k \rangle_A} p_k.$$

This step only requires $O(n^2)$ compute since we are only doing one $A$-projection. Therefore, the total time to compute $p_1, \dots, p_n$ is $O(n^3)$, which is much faster than the previous $O(n^4)$ time. Note that this includes the time to run the Lanczos iteration to generate $q_1, \dots, q_n$, since that also takes $O(n^3)$ time.
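
A sketch of the short recurrence in NumPy (this reuses the `lanczos` helper from the previous sketch, which is an assumption of this snippet; the test problem is again illustrative): each step removes only the component along the most recent $p_k$ and still produces an $A$-orthogonal basis.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

Q = lanczos(A, b, n)                     # Lanczos vectors from the sketch above

P = [Q[:, 0]]
for k in range(1, n):
    p_prev = P[-1]
    Ap = A @ p_prev                      # the only multiplication by A at this step
    P.append(Q[:, k] - (Ap @ Q[:, k]) / (Ap @ p_prev) * p_prev)
P = np.column_stack(P)

G = P.T @ A @ P                          # diagonal: the p_k are A-orthogonal
print(np.abs(G - np.diag(np.diag(G))).max())
```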

Therefore, the total time to compute the approximations is also $O(n^3)$ (or $O(kn^2)$ if we stop after $k$ steps), since we can iteratively update them to obtain $x_1, x_2, \dots$ as described before, and each of these updates also takes only $O(n^2)$ time.

Bringing in the residuals

This version of CG might seem a bit different from the usual implementation, since there is no mention of the residual vectors $r_k = b - A x_k$. We can easily bring these into the picture by noticing an important fact: the residuals are scaled versions of the vectors generated by Lanczos. More precisely, $r_k$ is parallel to $q_{k+1}$ for all $k$.

Note that there is an off-by-one in the indices between $r_k$ and $q_{k+1}$ simply because the Lanczos vectors start at $q_1 = b / \|b\|_2$, but the corresponding residual is $r_0 = b - A x_0 = b$ (taking $x_0 = 0$).

We prove this statement by showing that the $r_k$ form an orthogonal basis for the Krylov sequence, and then use the fact that this orthogonal basis must be unique up to scaling to get the result.

First, observe that $r_k \in \mathcal{K}_{k+1}$. This is because $r_k = b - A x_k$, $x_k \in \mathcal{K}_k$, and

$$b \in \mathcal{K}_1 \subseteq \mathcal{K}_{k+1}, \qquad A \mathcal{K}_k \subseteq \mathcal{K}_{k+1}.$$

Now, we can rewrite the residual as

$$r_k = b - A x_k = A(x_\star - x_k) = A e_k,$$

where $e_k = x_\star - x_k$ is the error in the approximation $x_k$. Now, since $x_k$ is the $A$-projection of $x_\star$ onto the subspace $\mathcal{K}_k$, we know by the defining property of the projection that the error must be $A$-orthogonal to $\mathcal{K}_k$. That is,

$$\langle e_k, v \rangle_A = e_k^\top A v = 0 \quad \text{for all } v \in \mathcal{K}_k.$$

Combining this with $r_j \in \mathcal{K}_{j+1} \subseteq \mathcal{K}_k$ for all $j < k$ gives us:

$$r_j^\top r_k = r_j^\top A e_k = \langle r_j, e_k \rangle_A = 0 \quad \text{for all } j < k.$$

But the orthogonality relation is symmetric, so we can extend the result to:

$$r_j^\top r_k = 0 \quad \text{for all } j \neq k.$$

Therefore, $r_0, r_1, \dots, r_{k-1}$ is an orthogonal set of vectors, and since these are $k$ linearly independent vectors inside the $k$-dimensional subspace $\mathcal{K}_k$,

$$\mathrm{span}(r_0, r_1, \dots, r_{k-1}) = \mathcal{K}_k.$$

This means that $r_k$ must be parallel to $q_{k+1}$ from Lanczos, since we have found an orthogonal basis for the Krylov sequence, and we can show by induction that such a basis is unique up to scaling factors. We have therefore shown the required statement.

If we put the residuals into a matrix as follows: $R = [r_0 \; r_1 \; \cdots \; r_{n-1}]$, the above result says that $R^\top A R$ is a tridiagonal matrix.
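
As a numerical check of this claim, here is a NumPy sketch (forming each $x_k$ by explicitly solving the small projection system is just for illustration and is not how CG proceeds; the test problem is again an assumption): the residuals of the Krylov-projected iterates are mutually orthogonal, and $R^\top A R$ comes out tridiagonal.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 7
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# A basis of the Krylov sequence, orthonormalized purely for numerical sanity
# (the spans, and hence the projections, are unchanged).
K = np.zeros((n, n))
K[:, 0] = b / np.linalg.norm(b)
for j in range(1, n):
    w = A @ K[:, j - 1]
    K[:, j] = w / np.linalg.norm(w)
K, _ = np.linalg.qr(K)

R = [b]                                   # r_0 = b, since x_0 = 0
for k in range(1, n):
    Kk = K[:, :k]
    c = np.linalg.solve(Kk.T @ A @ Kk, Kk.T @ b)
    x_k = Kk @ c                          # A-projection of x_star onto K_k
    R.append(b - A @ x_k)
R = np.column_stack(R)

far = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > 1
print("max off-diagonal of R^T R:", np.abs(R.T @ R - np.diag(np.diag(R.T @ R))).max())
print("max far-off-diagonal of R^T A R:", np.abs((R.T @ A @ R)[far]).max())
```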

Now, we can instead use $v_{k+1} = r_k$ for all $k$ as our starting vectors for Gram-Schmidt in the $A$-inner product when finding the $p_k$. The short recurrence for $p_{k+1}$ can now be written as:

$$p_{k+1} = r_k - \frac{\langle r_k, p_k \rangle_A}{\langle p_k, p_k \rangle_A} p_k.$$

Additionally, we can find $r_k$ either directly as $r_k = b - A x_k$, or by using the update for $x_k$ in terms of $p_k$ to get a recurrence as well:

$$r_k = r_{k-1} - \frac{p_k^\top b}{p_k^\top A p_k} A p_k.$$

Putting it all together

We can now combine the short recurrence for $p_{k+1}$ with the iterative updates for $x_k$ and $r_k$ to get a more familiar version of the Conjugate Gradients method (we replace $q_{k+1}$ with $r_k$ and also write the $A$-inner products explicitly):

$$\begin{aligned}
\alpha_k &= \frac{p_k^\top b}{p_k^\top A p_k}, \\
x_k &= x_{k-1} + \alpha_k p_k, \\
r_k &= r_{k-1} - \alpha_k A p_k, \\
\beta_k &= \frac{r_k^\top A p_k}{p_k^\top A p_k}, \\
p_{k+1} &= r_k - \beta_k p_k,
\end{aligned}$$

with $x_0 = 0$, $r_0 = b$, and $p_1 = r_0 = b$.
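
Putting the recurrences into code gives a compact implementation. This is a sketch in NumPy under the assumptions above ($x_0 = 0$, a dense test matrix, plus a simple relative-residual stopping rule added for practicality), not a production solver; in exact arithmetic it is equivalent to the textbook CG updates, since $p_k^\top b = p_k^\top r_{k-1}$ by $A$-orthogonality of $p_k$ to $x_{k-1}$.

```python
import numpy as np

def conjugate_gradients(A, b, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, following the
    recurrences derived above (x_0 = 0, r_0 = b, p_1 = r_0)."""
    x = np.zeros_like(b)
    r = b.copy()
    p = b.copy()
    for _ in range(len(b)):
        Ap = A @ p                            # the single matvec per iteration
        alpha = (p @ b) / (p @ Ap)            # alpha_k = p_k^T b / <p_k, p_k>_A
        x = x + alpha * p                     # x_k = x_{k-1} + alpha_k p_k
        r = r - alpha * Ap                    # r_k = r_{k-1} - alpha_k A p_k
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break                             # stop once the residual is tiny
        beta = (r @ Ap) / (p @ Ap)            # beta_k = <r_k, p_k>_A / <p_k, p_k>_A
        p = r - beta * p                      # p_{k+1} = r_k - beta_k p_k
    return x

rng = np.random.default_rng(6)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # symmetric positive definite
b = rng.standard_normal(n)

x = conjugate_gradients(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # small relative residual
```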

Summary

  • We want to solve $Ax = b$ using only matrix-vector products with $A$. Since we can compute inner products with $b = A x_\star$, we can compute $A$-inner products with $x_\star$.
  • Suppose we have an $A$-orthogonal basis $p_1, \dots, p_n$ for the Krylov sequence starting with $b$.
  • We can obtain the approximate solution $x_k$ by $A$-projecting $x_\star$ onto $\mathcal{K}_k = \mathrm{span}(p_1, \dots, p_k)$.
  • To compute the basis $p_1, \dots, p_n$, do Gram-Schmidt in the $A$-inner product with the residuals $r_k$ as the starting vectors.
  • This choice leads to a short recurrence for $p_{k+1}$, resulting in an efficient algorithm for computing $x_k$.