A geometric derivation of Conjugate Gradients

Author: Rajat Vadiraj Dwaraknath

Thanks to Anjan Dwaraknath for helpful discussions.

Problem setup

We wish to solve the linear system $Ax = b$, where $A \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix. We use $x_\star = A^{-1}b$ to denote the exact solution.

We are working in the computational model where we have access to the matrix $A$ only through matrix-vector products. That is, we have a method to compute $Av$ for any vector $v$. We also have access to the vector $b$.

Since $A$ is symmetric positive definite, it induces an inner product given by $\langle u, v \rangle_A = u^\top A v$. Although we don't have access to the solution $x_\star$, we do have access to $A x_\star = b$. So, we can compute the expression $\langle v, x_\star \rangle_A = v^\top A x_\star = v^\top b$ for any $v$. In other words, we can compute $A$-inner products with the solution $x_\star$ for any vector $v$.
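
As a quick sanity check, here is a minimal NumPy sketch of this observation (the random test matrix, its size, and the variable names are illustrative assumptions, not part of the derivation): the $A$-inner product of any $v$ with the unknown solution $x_\star$ can be evaluated as $v^\top b$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # a symmetric positive definite test matrix
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)   # used only to verify the identity
v = rng.standard_normal(n)

lhs = v @ A @ x_star             # <v, x_star>_A, needs the unknown solution
rhs = v @ b                      # computable from b alone
print(np.isclose(lhs, rhs))      # True
```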

Solution attempt

Motivated by this, we can posit a method to solve $Ax = b$ by working in the $A$-inner product as follows:

  • Compute an $A$-orthogonal basis for $\mathbb{R}^n$, which we denote $p_1, \dots, p_n$, via Gram-Schmidt in the $A$-inner product. Note that for this CG derivation, we don't need these basis vectors to be normalized.
  • $A$-project the solution $x_\star$ using this basis:

$$x_\star = \sum_{i=1}^{n} \frac{\langle p_i, x_\star \rangle_A}{\langle p_i, p_i \rangle_A} p_i = \sum_{i=1}^{n} \frac{p_i^\top b}{p_i^\top A p_i} p_i.$$

This also naturally leads to an approximation scheme by truncating the sum in the projection:

$$x_k = \sum_{i=1}^{k} \frac{p_i^\top b}{p_i^\top A p_i} p_i.$$

Therefore, the sequence of approximations $x_1, x_2, \dots$ can be interpreted as the $A$-projections of the solution onto the sequence of increasing subspaces $V_k = \mathrm{span}(p_1, \dots, p_k)$. We can iteratively update the approximations by noticing that:

$$x_k = x_{k-1} + \frac{p_k^\top b}{p_k^\top A p_k} p_k.$$

Since the approximations are projections, we can also use the variational characterization of projection as finding the vector in the subspace that is closest to $x_\star$ in the $A$-norm:

$$x_k = \operatorname*{arg\,min}_{x \in V_k} \|x - x_\star\|_A.$$

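
To make the projection and its variational characterization concrete, here is a small NumPy sketch (the test problem, the random choice of starting vectors for Gram-Schmidt, and the subspace dimension are all illustrative assumptions): the truncated sum over an $A$-orthogonal basis coincides with the $A$-norm-closest point in $V_k$, computed here by solving the small normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 4

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # symmetric positive definite
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

# Gram-Schmidt in the A-inner product on some linearly independent vectors.
P = []
for v in rng.standard_normal((n, n)):
    p = v.copy()
    for q in P:
        p -= (q @ A @ p) / (q @ A @ q) * q   # remove the A-component along q
    P.append(p)
P = np.column_stack(P)

# Truncated projection: x_k = sum_{i<=k} (p_i^T b / p_i^T A p_i) p_i.
Pk = P[:, :k]
x_k = Pk @ ((Pk.T @ b) / np.diag(Pk.T @ A @ Pk))

# Variational characterization: minimize ||x - x_star||_A over x in span(Pk),
# i.e. solve the normal equations (Pk^T A Pk) c = Pk^T A x_star = Pk^T b.
c = np.linalg.solve(Pk.T @ A @ Pk, Pk.T @ b)
x_min = Pk @ c

print(np.allclose(x_k, x_min))   # True: the two characterizations agree
```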
Notice that there is some freedom in the choice of the basis $p_1, \dots, p_n$ in this method. We use this freedom in a specific way to arrive at the Conjugate Gradients method.

Connecting to Conjugate Gradients

The conjugate gradients method does exactly the above procedure, but for a very specific choice of the orthogonal basis $p_1, \dots, p_n$. Specifically, it requires that these vectors span the Krylov sequence of $A$ with starting vector $b$. More precisely, CG chooses $p_1, \dots, p_n$ such that

$$\mathrm{span}(p_1, \dots, p_k) = \mathcal{K}_k := \mathrm{span}(b, Ab, A^2 b, \dots, A^{k-1} b) \quad \text{for all } k.$$

With this choice, the variational characterization of the successive approximations becomes:

$$x_k = \operatorname*{arg\,min}_{x \in \mathcal{K}_k} \|x - x_\star\|_A,$$

which is exactly the starting definition of CG!

What remains now is to find an efficient way of computing the $A$-orthogonal basis $p_1, \dots, p_n$. It turns out that choosing the successive approximation subspaces to be the Krylov sequence allows us to compute the $p_k$ using a short recurrence, by connecting to the Lanczos iteration.

The three-term recurrence for $p_k$

To compute the $A$-orthogonal basis $p_1, \dots, p_n$, we can perform Gram-Schmidt in the $A$-inner product on some vectors $v_1, \dots, v_n$ that span the Krylov sequence. However, Gram-Schmidt is pretty slow: to compute $p_k$ we need to $A$-project out the components of $v_k$ along $p_1, \dots, p_{k-1}$, and each projection needs $O(n^2)$ time since we need to compute an $A$-inner product, which requires a multiplication by $A$. So the total time to compute the basis is $O(n^4)$. It would be nice if we only needed to project out a few components instead of all $k-1$ at each step. We can achieve this with a smart choice of the starting vectors $v_1, \dots, v_n$.
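
Here is a sketch of the naive approach in NumPy (the problem size and the normalization of the Krylov vectors are illustrative assumptions): at step $k$ we project out $k-1$ components, and every projection performs a multiplication by $A$, which is where the cost piles up.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# Starting vectors spanning the Krylov sequence: b, Ab, A^2 b, ...
# (normalized only so that the entries do not blow up).
V = [b / np.linalg.norm(b)]
for _ in range(n - 1):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))

# Naive A-Gram-Schmidt: step k projects out all k-1 previous components.
P, matvecs = [], 0
for v in V:
    p = v.copy()
    for q in P:
        Aq = A @ q                       # one multiplication by A per projection
        matvecs += 1
        p -= (Aq @ p) / (Aq @ q) * q
    P.append(p)
P = np.column_stack(P)

G = P.T @ A @ P                          # should be (numerically) diagonal
print("matvecs:", matvecs)               # n(n-1)/2 of them, each O(n^2) time
print("max off-diagonal:", np.abs(G - np.diag(np.diag(G))).max())
```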

A good choice of starting vectors would be one where they are already close to being $A$-orthogonal. Putting the vectors into a matrix $V = [v_1 \; \cdots \; v_n]$, we want the pairwise $A$-inner products of the columns of $V$ to be close to zero. More precisely, we want $V^\top A V$ to be close to a diagonal matrix.

We have seen such a $V$ in the context of the Lanczos iteration. Specifically, the vectors $q_1, \dots, q_n$ generated by Lanczos span the Krylov sequence and also have the property that $Q^\top A Q$ is a tridiagonal matrix, where $Q = [q_1 \; \cdots \; q_n]$. This means that

$$\langle q_i, q_j \rangle_A = q_i^\top A q_j = 0 \quad \text{whenever } |i - j| > 1.$$

In other words, $q_i$ is already $A$-orthogonal to every $q_j$ except its immediate neighbors.
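
Below is a sketch of the Lanczos iteration in NumPy (the helper name `lanczos`, the test problem, and its size are illustrative assumptions) that checks the tridiagonal property numerically.

```python
import numpy as np

def lanczos(A, b, m):
    """Return the first m Lanczos vectors of A started at b, as columns of Q."""
    n = len(b)
    Q = np.zeros((n, m))
    q = b / np.linalg.norm(b)
    Q[:, 0] = q
    beta, q_prev = 0.0, np.zeros(n)
    for j in range(m - 1):
        w = A @ q - beta * q_prev        # three-term Lanczos recurrence
        alpha = q @ w
        w = w - alpha * q
        beta = np.linalg.norm(w)
        q_prev, q = q, w / beta
        Q[:, j + 1] = q
    return Q

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

Q = lanczos(A, b, n)
T = Q.T @ A @ Q
# Entries more than one position off the diagonal should be (numerically) zero.
far = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > 1
print(np.abs(T[far]).max())
```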

Therefore, we choose $v_k = q_k$ for all $k$. When performing $A$-Gram-Schmidt on $q_1, \dots, q_n$, we only need to project out the component of $q_{k+1}$ along $p_k$, since $q_{k+1}$ is already $A$-orthogonal to $\mathrm{span}(q_1, \dots, q_{k-1}) = \mathrm{span}(p_1, \dots, p_{k-1})$. Therefore, we can write the Gram-Schmidt step as follows:

$$p_{k+1} = q_{k+1} - \frac{\langle q_{k+1}, p_k \rangle_A}{\langle p_k, p_k \rangle_A} p_k.$$

This step only requires $O(n^2)$ compute since we are only doing one $A$-projection. Therefore, the total time to compute $p_1, \dots, p_n$ is $O(n^3)$, which is much faster than the previous $O(n^4)$ time. Note that this includes the time to run the Lanczos iteration to generate $q_1, \dots, q_n$, since that also takes $O(n^3)$ time.
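
A sketch of the short recurrence in NumPy (this reuses the `lanczos` helper from the previous sketch, which is an assumption of this snippet; the test problem is again illustrative): each step removes only the component along the most recent $p_k$ and still produces an $A$-orthogonal basis.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

Q = lanczos(A, b, n)                     # Lanczos vectors from the sketch above

P = [Q[:, 0]]
for k in range(1, n):
    p_prev = P[-1]
    Ap = A @ p_prev                      # the only multiplication by A at this step
    P.append(Q[:, k] - (Ap @ Q[:, k]) / (Ap @ p_prev) * p_prev)
P = np.column_stack(P)

G = P.T @ A @ P                          # diagonal: the p_k are A-orthogonal
print(np.abs(G - np.diag(np.diag(G))).max())
```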

Therefore, the total time to compute the approximations is also $O(n^3)$ (or $O(kn^2)$ if we stop after $k$ steps), since we can iteratively update them to obtain $x_1, x_2, \dots$ as described before, and each of these updates also takes only $O(n^2)$ time.

Bringing in the residuals

This version of CG might seem a bit different from the usual implementation, since there is no mention of the residual vectors $r_k = b - A x_k$. We can easily bring these into the picture by noticing an important fact: the residuals are scaled versions of the vectors generated by Lanczos. More precisely, $r_k$ is parallel to $q_{k+1}$ for all $k$.

Note that there is an off-by-one in the indices between $r_k$ and $q_{k+1}$ simply because the Lanczos vectors start at $q_1 = b / \|b\|_2$, but the corresponding residual is $r_0 = b - A x_0 = b$ (taking $x_0 = 0$).

We prove this statement by showing that the $r_k$ form an orthogonal basis for the Krylov sequence, and then use the fact that this orthogonal basis must be unique up to scaling to get the result.

First, observe that $r_k \in \mathcal{K}_{k+1}$. This is because $r_k = b - A x_k$, $x_k \in \mathcal{K}_k$, and

$$b \in \mathcal{K}_1 \subseteq \mathcal{K}_{k+1}, \qquad A \mathcal{K}_k \subseteq \mathcal{K}_{k+1}.$$

Now, we can rewrite the residual as

$$r_k = b - A x_k = A(x_\star - x_k) = A e_k,$$

where $e_k = x_\star - x_k$ is the error in the approximation $x_k$. Now, since $x_k$ is the $A$-projection of $x_\star$ onto the subspace $\mathcal{K}_k$, we know by the defining property of the projection that the error must be $A$-orthogonal to $\mathcal{K}_k$. That is,

$$\langle e_k, v \rangle_A = e_k^\top A v = 0 \quad \text{for all } v \in \mathcal{K}_k.$$

Combining this with $r_j \in \mathcal{K}_{j+1} \subseteq \mathcal{K}_k$ for all $j < k$ gives us:

$$r_j^\top r_k = r_j^\top A e_k = \langle r_j, e_k \rangle_A = 0 \quad \text{for all } j < k.$$

But the orthogonality relation is symmetric, so we can extend the result to:

$$r_j^\top r_k = 0 \quad \text{for all } j \neq k.$$

Therefore, $r_0, r_1, \dots, r_{k-1}$ is an orthogonal set of vectors, and since these are $k$ linearly independent vectors inside the $k$-dimensional subspace $\mathcal{K}_k$,

$$\mathrm{span}(r_0, r_1, \dots, r_{k-1}) = \mathcal{K}_k.$$

This means that $r_k$ must be parallel to $q_{k+1}$ from Lanczos, since we have found an orthogonal basis for the Krylov sequence, and we can show by induction that such a basis is unique up to scaling factors. We have therefore shown the required statement.

If we put the residuals into a matrix as follows: $R = [r_0 \; r_1 \; \cdots \; r_{n-1}]$, the above result says that $R^\top A R$ is a tridiagonal matrix.
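
As a numerical check of this claim, here is a NumPy sketch (forming each $x_k$ by explicitly solving the small projection system is just for illustration and is not how CG proceeds; the test problem is again an assumption): the residuals of the Krylov-projected iterates are mutually orthogonal, and $R^\top A R$ comes out tridiagonal.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 7
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# A basis of the Krylov sequence, orthonormalized purely for numerical sanity
# (the spans, and hence the projections, are unchanged).
K = np.zeros((n, n))
K[:, 0] = b / np.linalg.norm(b)
for j in range(1, n):
    w = A @ K[:, j - 1]
    K[:, j] = w / np.linalg.norm(w)
K, _ = np.linalg.qr(K)

R = [b]                                   # r_0 = b, since x_0 = 0
for k in range(1, n):
    Kk = K[:, :k]
    c = np.linalg.solve(Kk.T @ A @ Kk, Kk.T @ b)
    x_k = Kk @ c                          # A-projection of x_star onto K_k
    R.append(b - A @ x_k)
R = np.column_stack(R)

far = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > 1
print("max off-diagonal of R^T R:", np.abs(R.T @ R - np.diag(np.diag(R.T @ R))).max())
print("max far-off-diagonal of R^T A R:", np.abs((R.T @ A @ R)[far]).max())
```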

Now, we can instead use $v_{k+1} = r_k$ for all $k$ as our starting vectors for Gram-Schmidt in the $A$-inner product when finding the $p_k$. The short recurrence for $p_{k+1}$ can now be written as:

$$p_{k+1} = r_k - \frac{\langle r_k, p_k \rangle_A}{\langle p_k, p_k \rangle_A} p_k.$$

Additionally, we can find $r_k$ either directly as $r_k = b - A x_k$, or by using the update for $x_k$ in terms of $p_k$ to get a recurrence as well:

$$r_k = r_{k-1} - \frac{p_k^\top b}{p_k^\top A p_k} A p_k.$$

Putting it all together

We can now combine the short recurrence for $p_{k+1}$ with the iterative updates for $x_k$ and $r_k$ to get a more familiar version of the Conjugate Gradients method (we replace $q_{k+1}$ with $r_k$ and also write the $A$-inner products explicitly):

$$\begin{aligned}
\alpha_k &= \frac{p_k^\top b}{p_k^\top A p_k}, \\
x_k &= x_{k-1} + \alpha_k p_k, \\
r_k &= r_{k-1} - \alpha_k A p_k, \\
\beta_k &= \frac{r_k^\top A p_k}{p_k^\top A p_k}, \\
p_{k+1} &= r_k - \beta_k p_k,
\end{aligned}$$

with $x_0 = 0$, $r_0 = b$, and $p_1 = r_0 = b$.
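
Putting the recurrences into code gives a compact implementation. This is a sketch in NumPy under the assumptions above ($x_0 = 0$, a dense test matrix, plus a simple relative-residual stopping rule added for practicality), not a production solver; in exact arithmetic it is equivalent to the textbook CG updates, since $p_k^\top b = p_k^\top r_{k-1}$ by $A$-orthogonality of $p_k$ to $x_{k-1}$.

```python
import numpy as np

def conjugate_gradients(A, b, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, following the
    recurrences derived above (x_0 = 0, r_0 = b, p_1 = r_0)."""
    x = np.zeros_like(b)
    r = b.copy()
    p = b.copy()
    for _ in range(len(b)):
        Ap = A @ p                            # the single matvec per iteration
        alpha = (p @ b) / (p @ Ap)            # alpha_k = p_k^T b / <p_k, p_k>_A
        x = x + alpha * p                     # x_k = x_{k-1} + alpha_k p_k
        r = r - alpha * Ap                    # r_k = r_{k-1} - alpha_k A p_k
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break                             # stop once the residual is tiny
        beta = (r @ Ap) / (p @ Ap)            # beta_k = <r_k, p_k>_A / <p_k, p_k>_A
        p = r - beta * p                      # p_{k+1} = r_k - beta_k p_k
    return x

rng = np.random.default_rng(6)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # symmetric positive definite
b = rng.standard_normal(n)

x = conjugate_gradients(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # small relative residual
```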

Summary

  • We want to solve $Ax = b$ using only matrix-vector products with $A$. Since we can compute inner products with $b = A x_\star$, we can compute $A$-inner products with $x_\star$.
  • Suppose we have an $A$-orthogonal basis $p_1, \dots, p_n$ for the Krylov sequence starting with $b$.
  • We can obtain the approximate solution $x_k$ by $A$-projecting $x_\star$ onto $\mathcal{K}_k = \mathrm{span}(p_1, \dots, p_k)$.
  • To compute the basis $p_1, \dots, p_n$, do Gram-Schmidt in the $A$-inner product with the residuals $r_k$ as the starting vectors.
  • This choice leads to a short recurrence for $p_{k+1}$, resulting in an efficient algorithm for computing $x_k$.