This derivation of the Conjugate Gradient (CG) algorithm is based on Rajat’s derivation. We made it self-contained so that everything you need to know is on this page. We recommend focusing your studies on this derivation. There are many ways to derive CG. Of course, in the end, all these derivations are equivalent, but they will appear superficially different.

We consider a symmetric positive definite (SPD) matrix $A \in \mathbb{R}^{n \times n}$ and the linear system $Ax = b$. We assume that $A$ is sparse and search for an iterative method to solve the system.
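
To make the setting concrete, here is a minimal setup sketch in Python (NumPy/SciPy). The 1D Laplacian used here is just one convenient example of a sparse SPD matrix; it is my own choice, not something prescribed by this derivation.

```python
import numpy as np
import scipy.sparse as sp

n = 100
# Example of a sparse SPD matrix: the 1D Laplacian (tridiagonal).
# Its eigenvalues 2 - 2*cos(k*pi/(n+1)) are all positive, so it is SPD.
A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Symmetry check (dense conversion is only for this small illustration).
assert np.allclose(A.toarray(), A.toarray().T)
```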

As we saw previously, we can use a Lanczos process to define the Krylov subspace $\mathcal{K}_k = \mathrm{span}\{b, Ab, \dots, A^{k-1}b\}$ and the sequence of orthogonal vectors $q_1, \dots, q_k$. For this problem, we always assume that

$$q_1 = \frac{b}{\|b\|_2}, \qquad \mathrm{span}\{q_1, \dots, q_k\} = \mathcal{K}_k.$$

Let’s denote by $p_1, p_2, \dots$ a sequence of vectors such that

$$\mathrm{span}\{p_1, \dots, p_k\} = \mathcal{K}_k.$$

We can expand our solution in that basis:

$$x = \sum_{i=1}^{n} y_i \, p_i.$$

If we have a method to calculate $p_k$ and $y_k$, then our iterative solution update is very simple:

$$x_k = x_{k-1} + y_k \, p_k, \qquad x_0 = 0.$$

Attempt 1. Let’s assume that the vectors $p_i$ are orthogonal. In fact, in that case, we have just chosen $p_i = q_i$. We then have $x = \sum_i y_i \, q_i$. Since

$$q_i^T q_j = \delta_{ij},$$

we have

$$y_i = q_i^T x.$$

In principle this works well, but we do not know $x$! So even if we can calculate the sequence $q_i$, there is no obvious way to calculate $y_i = q_i^T x$.

Attempt 2. However, there is another equation that we can use. Replace $x$ by $Ax = b$:

$$p_i^T A x = p_i^T b.$$

This is the starting point of the entire CG algorithm! We know $b$. So if we know $p_i$, we can calculate $p_i^T A x = p_i^T b$. From there, the entire algorithm can be derived.

Recall that $x = \sum_j y_j \, p_j$. If we multiply to the left by $p_i^T A$, we get

$$p_i^T A x = \sum_j y_j \, p_i^T A p_j = p_i^T b.$$

Although we can compute $p_i^T b$, we now have to deal with the terms $p_i^T A p_j$ if we want to calculate $y_i$.

In attempt 1, we had chosen the $p_i$ to be orthogonal ($p_i = q_i$). However, other choices are possible. In attempt 2, we choose the $p_i$ such that

$$P^T A P = D,$$

where $D$ is diagonal and $P$ is the matrix whose columns are the $p_i$. We will denote by $d_i = p_i^T A p_i$ the diagonal entries. Note that, since $A$ is SPD, we have $d_i > 0$.

Let us denote by $P_k$ the first $k$ columns of $P$ and by $D_k$ the diagonal matrix with the first $k$ entries of $D$. The solution is then given by:

$$x = P D^{-1} P^T b, \qquad x_k = P_k D_k^{-1} P_k^T b = \sum_{i=1}^{k} \frac{p_i^T b}{d_i} \, p_i.$$
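
As a quick sanity check of this formula, here is a sketch (the test matrix, the starting basis, and all variable names are my own choices, not part of the derivation): it builds an $A$-orthogonal basis of $\mathbb{R}^n$ by Gram-Schmidt in the $A$-inner product and verifies that $P D^{-1} P^T b$ recovers the exact solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # small dense SPD matrix for the check
b = rng.standard_normal(n)

# Build an A-orthogonal basis p_1, ..., p_n by (modified) Gram-Schmidt
# in the A-inner product, starting from the canonical basis.
P = np.zeros((n, n))
for j in range(n):
    p = np.eye(n)[:, j].copy()
    for i in range(j):
        p -= (P[:, i] @ A @ p) / (P[:, i] @ A @ P[:, i]) * P[:, i]
    P[:, j] = p

d = np.array([P[:, i] @ A @ P[:, i] for i in range(n)])  # diagonal of P^T A P
x = P @ ((P.T @ b) / d)                                  # x = P D^{-1} P^T b
assert np.allclose(A @ x, b)                             # exact solution recovered
```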

What does "$P^T A P$ diagonal" mean? When looking at $P^T A P$, we are looking at a special dot product that uses the matrix $A$. For example, the $(i,j)$ entry of $P^T A P$ is

$$(P^T A P)_{ij} = p_i^T A p_j.$$

This dot product has the following interpretation. Recall that $A$ is SPD. So

$$A = V \Lambda V^T,$$

where $V$ is orthogonal and $\Lambda$ is diagonal with $\lambda_i > 0$. So

$$x^T A y = (V^T x)^T \, \Lambda \, (V^T y).$$

This dot product has three steps:

  1. Multiply the vectors by $V^T$. That means apply a series of reflections. This is like rotating the frame of reference.
  2. Multiply by the diagonal matrix $\Lambda$. This is a rescaling of the axes.
  3. Apply the usual dot product.

So, essentially, the main thing we are doing is applying a rescaling using $\Lambda$.
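
A quick numerical illustration of this interpretation (a sketch; the random SPD matrix and the names are mine): the eigendecomposition turns $x^T A y$ into a rotated, rescaled version of the usual dot product.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)       # SPD
x = rng.standard_normal(n)
y = rng.standard_normal(n)

lam, V = np.linalg.eigh(A)        # A = V diag(lam) V^T with lam > 0, V orthogonal
assert np.all(lam > 0)

# Rotate/reflect with V^T, rescale by lam, then apply the usual dot product.
lhs = x @ A @ y
rhs = (V.T @ x) @ (lam * (V.T @ y))
assert np.isclose(lhs, rhs)
```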

When we say

$$P^T A P = D \ \text{is diagonal},$$

we simply mean that the sequence $p_i$ is orthogonal with respect to the dot product defined by $A$. This is a very natural choice.

Note that we could require that

$$p_i^T A p_i = 1.$$

But, for computational reasons, another normalization of the $p_i$ will be used.

This new dot product allows us to define a new norm, the $A$-norm:

$$\|x\|_A = \sqrt{x^T A x}.$$

Summary: the CG algorithm builds a sequence of vectors $p_i$ such that

$$\mathrm{span}\{p_1, \dots, p_k\} = \mathcal{K}_k, \qquad p_i^T A p_j = 0 \ \text{ for } i \neq j.$$

The exact solution is written as:

$$x = \sum_i y_i \, p_i.$$

We calculate $y_k$ using

$$y_k = \frac{p_k^T b}{p_k^T A p_k} = \frac{p_k^T b}{d_k},$$

or, in matrix form,

$$x_k = P_k D_k^{-1} P_k^T b.$$

Then we update the solution using

$$x_k = x_{k-1} + y_k \, p_k.$$

Least-squares problem and projection. We can further interpret the solution in a least-squares sense. From the $A$-orthogonality of the $p_i$, we deduce that $x - x_k$ is $A$-orthogonal to $\mathcal{K}_k = \mathrm{span}\{p_1, \dots, p_k\}$. This can be also verified from

$$p_i^T A (x - x_k) = p_i^T b - p_i^T A x_k = p_i^T b - y_i \, d_i = 0, \qquad i \le k.$$

We recognize that we are solving a least-squares problem using the $A$-norm:

$$x_k = \arg\min_{z \in \mathcal{K}_k} \|x - z\|_A.$$

CG produces the approximation $x_k$ in the Krylov subspace $\mathcal{K}_k$ that is closest to the true solution $x$ in the $A$-norm.
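
The following sketch illustrates this projection property numerically (it forms the Krylov basis explicitly, which is fine for a tiny example but is not how CG works; all names are mine): the $A$-norm projection of $x$ onto $\mathcal{K}_k$ leaves an error that is $A$-orthogonal to $\mathcal{K}_k$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # SPD
b = rng.standard_normal(n)
x_true = np.linalg.solve(A, b)

# Explicit Krylov basis of K_k = span{b, Ab, ..., A^{k-1} b}, orthonormalized
# with QR only to keep the small linear solve below well conditioned.
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(k)])
K, _ = np.linalg.qr(K)

# A-norm projection of x_true onto K_k: minimize ||x - K c||_A over c.
# Normal equations in the A-inner product: (K^T A K) c = K^T A x = K^T b.
c = np.linalg.solve(K.T @ A @ K, K.T @ b)
x_k = K @ c

# The error x - x_k is A-orthogonal to the Krylov subspace.
assert np.allclose(K.T @ A @ (x_true - x_k), 0.0, atol=1e-8)
```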

Computing the sequence $p_k$. In principle, using the Lanczos process, we can compute the sequence $p_k$. However, there is a more efficient approach that uses the sequence of residual vectors:

$$r_k = b - A x_k, \qquad r_0 = b.$$

Recall the definition of the subspace $\mathcal{K}_k$:

$$\mathcal{K}_k = \mathrm{span}\{b, Ab, \dots, A^{k-1} b\}.$$

Using this, since $x_k \in \mathcal{K}_k$, we have $A x_k \in \mathcal{K}_{k+1}$. Thus, $r_k = b - A x_k \in \mathcal{K}_{k+1}$. Since

$$\mathcal{K}_{k+1} = \mathrm{span}\{p_1, \dots, p_{k+1}\},$$

we can derive the following important connection between the residuals and the vectors $p_i$:

$$r_k \in \mathrm{span}\{p_1, \dots, p_{k+1}\}.$$

Below, we prove some important results involving the residuals $r_k$.

The residual $r_k$ is orthogonal to $\mathcal{K}_k$. We now prove a key result: $r_k \perp \mathcal{K}_k$.

Proof. Assume that $i \le k$. Then:

$$p_i^T r_k = p_i^T b - p_i^T A x_k = p_i^T b - \sum_{j=1}^{k} y_j \, p_i^T A p_j = p_i^T b - y_i \, d_i = 0.$$

Moreover, for $i > k$:

$$p_i^T r_k = p_i^T b - p_i^T A x_k = p_i^T b.$$

In summary, define

$$R_k = [\, r_0, r_1, \dots, r_{k-1} \,].$$

We can write:

$$P_k^T R_k = L_k,$$

where $L_k$ is lower triangular, and $(L_k)_{ij} = p_i^T r_{j-1} = p_i^T b$ for $i \ge j$. (Recall that column $j$ of $R_k$ is $r_{j-1}$.)

In addition, since

$$\mathrm{span}\{r_0, \dots, r_{k-1}\} = \mathrm{span}\{p_1, \dots, p_k\} = \mathcal{K}_k,$$

we also have that there exists an upper triangular matrix $U_k$ such that

$$R_k = P_k U_k.$$

The matrix $U_k$ is very important and we will come back to it later.

Three-term recurrence. We now prove that $p_{k+1}$ is a linear combination of $r_k$ and $p_k$. From this result, we derive a short and computationally efficient recurrence formula for $p_{k+1}$.

From $p_k \in \mathcal{K}_k$ and $A \mathcal{K}_k \subset \mathcal{K}_{k+1}$, we have:

$$A p_k \in \mathcal{K}_{k+1}.$$

Since

$$\mathcal{K}_{k+1} = \mathrm{span}\{p_1, \dots, p_{k+1}\},$$

it follows that $A p_k$ is a linear combination of $p_1, \dots, p_{k+1}$. This can be expressed using matrix notation:

$$A P_k = P_{k+1} H_k,$$

where $H_k$ is a $(k+1) \times k$ upper Hessenberg matrix: $(H_k)_{ij} = 0$ for $i > j + 1$.

Next, consider the matrix $P_k^T A R_k$. On the one hand, using $R_k = P_k U_k$:

$$P_k^T A R_k = P_k^T A P_k U_k = D_k U_k.$$

On the other hand, using the symmetry of $A$ and $A P_k = P_{k+1} H_k$:

$$P_k^T A R_k = (A P_k)^T R_k = H_k^T \, P_{k+1}^T R_k = H_k^T L_{k+1,k},$$

where $L_{k+1,k} = P_{k+1}^T R_k$ is lower triangular (its $(i,j)$ entry $p_i^T r_{j-1}$ is zero for $i < j$). Since $L_{k+1,k}$ is lower triangular and $H_k^T$ is lower Hessenberg (i.e., all entries above the first superdiagonal are zero: $(H_k^T)_{ij} = 0$ for $j > i+1$), their product is lower Hessenberg as well.

Proof. Here is a more detailed proof. Consider:

$$(H_k^T L_{k+1,k})_{ij} = \sum_{m=1}^{k+1} (H_k^T)_{im} \, (L_{k+1,k})_{mj}.$$

We have $(H_k^T)_{im} = 0$ if $m > i+1$, and $(L_{k+1,k})_{mj} = 0$ if $m < j$. So $(H_k^T L_{k+1,k})_{ij} = 0$ if $j > i+1$: the product $H_k^T L_{k+1,k}$ is lower Hessenberg.

In conclusion, $P_k^T A R_k$ is both lower Hessenberg (second expression above) and upper triangular (first expression, since $D_k U_k$ is upper triangular). Since $D_k$ is diagonal, this implies that $P_k^T A R_k$ and $U_k$ have only two non-zero diagonals in their upper triangular part. We say that these matrices are upper bi-diagonal.

Since $R_k = P_k U_k$ with $U_k$ upper bi-diagonal, we have proved that (recall that column $k+1$ of $R_{k+1}$ is $r_k$):

$$r_k = (U_{k+1})_{k,k+1} \, p_k + (U_{k+1})_{k+1,k+1} \, p_{k+1}.$$

Moreover, since $(U_{k+1})_{k+1,k+1} \neq 0$ ($U_{k+1}$ is invertible because $R_{k+1}$ and $P_{k+1}$ span the same subspace $\mathcal{K}_{k+1}$), we have:

$$p_{k+1} = \frac{1}{(U_{k+1})_{k+1,k+1}} \Big( r_k - (U_{k+1})_{k,k+1} \, p_k \Big).$$

We now derive a short recurrence relation for $p_{k+1}$:

$$p_{k+1} = \sigma_{k+1} \, r_k + \tau_{k+1} \, p_k, \qquad \sigma_{k+1} \neq 0.$$

At this point, we have not yet chosen the normalization for $p_{k+1}$; the $A$-orthogonality conditions only define $p_{k+1}$ up to a scaling. To simplify, we choose the following normalization:

$$p_{k+1} = r_k + \tau_{k+1} \, p_k.$$

This is the key three-term recurrence relation to update $p_{k+1}$ in CG.

With this normalization, the $p_k$ are not normalized to have unit $A$-norm. But this normalization turns out to be computationally more efficient.

With this choice, $U_k$ is unit upper bi-diagonal. This means that $(U_k)_{ii} = 1$ and $(U_k)_{k-1,k} = -\tau_k$; we will derive a simple formula for $\tau_k$ below. All other entries in $U_k$ are zero.

Updating the residual vectors. We are now almost done with the complete CG algorithm. We have formulas to update $x_k$ and $p_{k+1}$. The formula to update $r_k$ can be derived from the solution update:

$$x_k = x_{k-1} + y_k \, p_k.$$

Recall that

$$r_k = b - A x_k.$$

Multiply the solution update by $-A$, add $b$, and simplify to get:

$$r_k = r_{k-1} - y_k \, A p_k.$$

This is the equation CG uses to update the residual vectors.
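
Putting the three updates together, one CG step costs a single matrix-vector product $A p_k$ plus a few vector operations. Here is a sketch of one step (the coefficient formulas for $y_k$ and $\tau_{k+1}$ are the ones derived in the next paragraphs; the function and variable names are mine):

```python
import numpy as np

def cg_step(A, x, r, p, rho):
    """One CG iteration. rho is r^T r for the incoming residual r = b - A x."""
    Ap = A @ p                       # the single matrix-vector product of the step
    y = rho / (p @ Ap)               # y_k (derived below)
    x = x + y * p                    # x_k = x_{k-1} + y_k p_k
    r = r - y * Ap                   # r_k = r_{k-1} - y_k A p_k  (reuses Ap)
    rho_new = r @ r
    p = r + (rho_new / rho) * p      # p_{k+1} = r_k + tau_{k+1} p_k (derived below)
    return x, r, p, rho_new
```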

The residual vectors are orthogonal to each other. There are a few more simplifications needed to make the method as computationally efficient as possible. We have already seen that $P_k^T R_k = L_k$ is lower triangular and that $R_k = P_k U_k$ with $U_k$ upper triangular. Now, we prove that the residuals are orthogonal to each other.

We have:

$$R_k^T R_k = (P_k U_k)^T R_k = U_k^T \, (P_k^T R_k) = U_k^T L_k.$$

Since $U_k^T$ and $L_k$ are lower triangular matrices, $U_k^T L_k$ is also lower triangular. However, $R_k^T R_k$ is symmetric (and also positive definite). A matrix that is both triangular and symmetric must be diagonal.

Therefore, $R_k^T R_k$ is diagonal, and we have proved that the residuals are orthogonal to each other.
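
Here is a small numerical check of these structural results (a sketch with my own names): it builds the $p_i$ by Gram-Schmidt in the $A$-inner product applied to the Krylov vectors, forms the iterates $x_j$ and residuals $r_j$ as above, and verifies that $P_k^T A P_k$ and $R_k^T R_k$ are diagonal, $P_k^T R_k$ is lower triangular, and $P_k^T A R_k$ is upper bi-diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # SPD
b = rng.standard_normal(n)

# A-orthogonal directions p_1, ..., p_k obtained by Gram-Schmidt (in the
# A-inner product) applied to the Krylov vectors b, Ab, A^2 b, ...
P = np.zeros((n, k))
v = b.copy()
for j in range(k):
    p = v.copy()
    for i in range(j):
        p -= (P[:, i] @ A @ p) / (P[:, i] @ A @ P[:, i]) * P[:, i]
    P[:, j] = p / np.linalg.norm(p)          # scaling of p_j is irrelevant here
    v = A @ v
    v /= np.linalg.norm(v)                   # rescaling keeps the same spans

d = np.array([P[:, j] @ A @ P[:, j] for j in range(k)])
X = np.cumsum(P * (P.T @ b / d), axis=1)     # column j holds x_{j+1}
R = np.column_stack([b] + [b - A @ X[:, j] for j in range(k - 1)])  # r_0, ..., r_{k-1}

def offdiag(B):
    return np.abs(B - np.diag(np.diag(B))).max()

assert offdiag(P.T @ A @ P) < 1e-8               # the p_i are A-orthogonal
assert offdiag(R.T @ R) < 1e-8                   # the residuals are orthogonal
assert np.abs(np.triu(P.T @ R, 1)).max() < 1e-8  # P^T R is lower triangular
B = P.T @ A @ R
assert np.abs(np.tril(B, -1)).max() < 1e-8       # zero below the diagonal ...
assert np.abs(np.triu(B, 2)).max() < 1e-8        # ... and above the first superdiagonal
```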

Final simplifications. We now derive the final formulas for the CG algorithm. Recall that:

$$y_k = \frac{p_k^T b}{d_k}, \qquad d_k = p_k^T A p_k.$$

However, $p_k^T b = p_k^T r_{k-1}$, since $r_{k-1} = b - A x_{k-1}$ and $p_k$ is $A$-orthogonal to $x_{k-1} \in \mathrm{span}\{p_1, \dots, p_{k-1}\}$. Thus:

$$p_k^T r_{k-1} = (r_{k-1} + \tau_k \, p_{k-1})^T r_{k-1} = r_{k-1}^T r_{k-1},$$

where we used the recurrence $p_k = r_{k-1} + \tau_k \, p_{k-1}$ and the orthogonality $p_{k-1}^T r_{k-1} = 0$. Hence, we obtain:

$$y_k = \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k}.$$

Similarly, we simplify $\tau_{k+1}$. The $A$-orthogonality condition $p_k^T A p_{k+1} = 0$ applied to $p_{k+1} = r_k + \tau_{k+1} \, p_k$ gives:

$$\tau_{k+1} = - \frac{r_k^T A p_k}{p_k^T A p_k} = - \frac{r_k^T A p_k}{d_k}.$$

From previous relations, we know:

$$r_k = r_{k-1} - y_k \, A p_k \quad \Longrightarrow \quad A p_k = \frac{r_{k-1} - r_k}{y_k}.$$

Since $y_k \, d_k = r_{k-1}^T r_{k-1}$, and using the orthogonality of the residuals ($r_k^T r_{k-1} = 0$), we compute:

$$\tau_{k+1} = - \frac{r_k^T (r_{k-1} - r_k)}{y_k \, d_k} = \frac{r_k^T r_k}{y_k \, d_k} = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}}.$$

This is an amazingly simple expression! Below, we denote by $\rho_k$ the squared norm of the residual:

$$\rho_k = r_k^T r_k, \qquad y_k = \frac{\rho_{k-1}}{p_k^T A p_k}, \qquad \tau_{k+1} = \frac{\rho_k}{\rho_{k-1}}.$$

The Conjugate Gradient Algorithm. The complete CG algorithm is as follows. Start with

$$x_0 = 0, \qquad r_0 = b, \qquad p_1 = r_0, \qquad \rho_0 = r_0^T r_0.$$

Then iterate, starting from $k = 1$:

$$\begin{aligned}
y_k &= \frac{\rho_{k-1}}{p_k^T A p_k}, \\
x_k &= x_{k-1} + y_k \, p_k, \\
r_k &= r_{k-1} - y_k \, A p_k, \\
\rho_k &= r_k^T r_k, \qquad \tau_{k+1} = \frac{\rho_k}{\rho_{k-1}}, \\
p_{k+1} &= r_k + \tau_{k+1} \, p_k.
\end{aligned}$$

This recurrence is the computationally most efficient implementation of the CG algorithm. It relies on a single sparse matrix-vector product with $A$ per iteration and just a few vector operations. The CG algorithm is one of the most efficient iterative methods for solving linear systems. But note that it only applies to SPD matrices.
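
A direct transcription of this recurrence into Python (a sketch; the stopping criterion, the relative tolerance, and the function name are my own choices):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxiter=None):
    """Solve A x = b for SPD A using the recurrence above.
    A only needs to support the product A @ v (dense, sparse, or LinearOperator)."""
    b = np.asarray(b, dtype=float)
    maxiter = b.shape[0] if maxiter is None else maxiter
    x = np.zeros_like(b)             # x_0 = 0
    r = b.copy()                     # r_0 = b
    p = r.copy()                     # p_1 = r_0
    rho = r @ r                      # rho_0 = r_0^T r_0
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        if np.sqrt(rho) <= tol * bnorm:
            break
        Ap = A @ p                       # the only matrix-vector product per iteration
        y = rho / (p @ Ap)               # y_k = rho_{k-1} / (p_k^T A p_k)
        x += y * p                       # x_k = x_{k-1} + y_k p_k
        r -= y * Ap                      # r_k = r_{k-1} - y_k A p_k
        rho_new = r @ r                  # rho_k
        p = r + (rho_new / rho) * p      # p_{k+1} = r_k + tau_{k+1} p_k
        rho = rho_new
    return x
```

For the sparse SPD example at the top of the page, `conjugate_gradient(A, b)` returns a solution with a small relative residual. In exact arithmetic, CG terminates after at most $n$ iterations, and in practice far fewer when the spectrum of $A$ is favorable.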

Summary of key equations. We list all the key results we have used in our derivation of the CG algorithm:

We have:

$$x_k = x_{k-1} + y_k \, p_k, \qquad r_k = r_{k-1} - y_k \, A p_k, \qquad p_{k+1} = r_k + \tau_{k+1} \, p_k,$$

$$y_k = \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k}, \qquad \tau_{k+1} = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}},$$

and $r_k = b - A x_k$. For any vector $z$, $\|z\|_A = \sqrt{z^T A z}$.

Key orthogonality relations:

  • The vectors $p_i$ are $A$-orthogonal: $p_i^T A p_j = 0$ for $i \neq j$.
  • The residual $r_k$ is orthogonal to $\mathcal{K}_k$.
  • The residuals $r_i$ are orthogonal to each other.

Key optimality relation:

  • The CG algorithm produces the approximation $x_k$ in the Krylov subspace $\mathcal{K}_k$ that is closest to the true solution $x$ in the $A$-norm.

Orthogonality relations using matrix notation. We can formalize some of these relations using matrix notation.

First, observe the following result: the vector $q_{k+1}$ is orthogonal to $\mathcal{K}_k$. Similarly, $r_k$ is orthogonal to $\mathcal{K}_k$. Both vectors belong to $\mathcal{K}_{k+1}$, and the orthogonal complement of $\mathcal{K}_k$ inside $\mathcal{K}_{k+1}$ has dimension 1. This implies that $q_{k+1}$ and $r_k$ are parallel:

$$q_{k+1} = \pm \frac{r_k}{\|r_k\|_2}.$$

From the Lanczos process, we know that $T_k = Q_k^T A Q_k$ is symmetric tri-diagonal. This follows from the construction of the sequence $q_i$. By definition, $Q_k^T A Q_k$ is expected to be upper Hessenberg. Since it is symmetric, it must be tri-diagonal.

Using the relationship above, we also deduce that $R_k^T A R_k$ is symmetric tri-diagonal. This further implies that $r_i$ and $r_j$ are $A$-orthogonal if $|i - j| > 1$.

Additionally, we previously showed that $R_k^T R_k$ is diagonal and that $U_k$ is upper bi-diagonal.

In summary:

  • Diagonal: $P_k^T A P_k$, $R_k^T R_k$
  • Symmetric tri-diagonal: $Q_k^T A Q_k$, $R_k^T A R_k$
  • Upper bi-diagonal: $U_k$ (where $R_k = P_k U_k$)

Why is it called the Conjugate Gradient algorithm? The name comes from the fact that the directions $p_k$ are $A$-orthogonal, or conjugate, to each other.

Moreover, consider the loss function:

$$\phi(z) = \frac{1}{2} z^T A z - b^T z.$$

The gradient of $\phi$ with respect to $z$ is:

$$\nabla \phi(z) = A z - b.$$

This shows that the residual $r_k = b - A x_k = -\nabla \phi(x_k)$ is parallel to the gradient of the loss function at $x_k$. Recall the update equation:

$$p_{k+1} = r_k + \tau_{k+1} \, p_k.$$

The new direction $p_{k+1}$ is the optimal direction to update $x_k$. This equation shows that $p_{k+1}$ is a linear combination of the (negative) gradient at $x_k$ and the previous direction $p_k$.

This is why the CG algorithm is called the Conjugate Gradient algorithm.
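
A short numerical check of this gradient computation (central finite differences; the example matrix and names are mine). For a quadratic loss, central differences are exact up to rounding, so the agreement is tight.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # SPD
b = rng.standard_normal(n)
z = rng.standard_normal(n)

phi = lambda w: 0.5 * w @ A @ w - b @ w   # the loss function
grad = A @ z - b                          # claimed gradient; note grad = -(b - A z) = -r

eps = 1e-6
fd = np.array([(phi(z + eps * e) - phi(z - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(fd, grad, atol=1e-5)
```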

Connection to the Lanczos process. Recall that the Lanczos process generates the orthogonal vectors $q_k$ and the tridiagonal matrix $T_k = Q_k^T A Q_k$. How is $T_k$ related to CG?

We showed above that $r_k$ is orthogonal to $\mathcal{K}_k$. This implies:

$$q_{k+1} = \pm \frac{r_k}{\|r_k\|_2}, \qquad \text{i.e.,} \qquad Q_k = R_k \Delta_k^{-1}, \quad \Delta_k = \mathrm{diag}\big(\pm \|r_0\|_2, \dots, \pm \|r_{k-1}\|_2\big).$$

Since $T_k = Q_k^T A Q_k$, we substitute and obtain:

$$T_k = \Delta_k^{-1} \, R_k^T A R_k \, \Delta_k^{-1}.$$

Thus, we recover the Lanczos matrix $T_k$ from the residuals computed by CG!
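
A numerical check of this connection (a sketch; the Lanczos loop below uses explicit re-orthogonalization for simplicity, and since $q_{k+1} = \pm r_k / \|r_k\|_2$ only up to a sign, the comparison is made on absolute values):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 12, 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # SPD
b = rng.standard_normal(n)

# Lanczos started with q_1 = b / ||b||_2 (Gram-Schmidt form, for clarity).
Q = np.zeros((n, k))
Q[:, 0] = b / np.linalg.norm(b)
for j in range(1, k):
    w = A @ Q[:, j - 1]
    w -= Q[:, :j] @ (Q[:, :j].T @ w)      # orthogonalize against previous q's
    Q[:, j] = w / np.linalg.norm(w)
T_lanczos = Q.T @ A @ Q                   # symmetric tridiagonal

# CG: collect the residuals r_0, ..., r_{k-1}.
x = np.zeros(n); r = b.copy(); p = r.copy(); rho = r @ r
R = [r.copy()]
for _ in range(k - 1):
    Ap = A @ p
    y = rho / (p @ Ap)
    x = x + y * p
    r = r - y * Ap
    rho_new = r @ r
    p = r + (rho_new / rho) * p
    rho = rho_new
    R.append(r.copy())
R = np.column_stack(R)

Delta_inv = np.diag(1.0 / np.linalg.norm(R, axis=0))
T_cg = Delta_inv @ (R.T @ A @ R) @ Delta_inv
assert np.allclose(np.abs(T_lanczos), np.abs(T_cg), atol=1e-8)
```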