This derivation of the Conjugate Gradient (CG) algorithm is based on Rajat's derivation. We made it self-contained so that everything you need to know is on this page. We recommend focusing your studies on this derivation. There are many ways to derive CG. Of course, in the end, all these derivations are equivalent, but they will appear superficially different.
We consider a symmetric positive definite (SPD) matrix $A \in \mathbb{R}^{n \times n}$ and the linear system $Ax = b$. We assume that $A$ is sparse and search for an iterative method to solve the system.
As we saw previously, we can use a Lanczos process to define the Krylov subspace $\mathcal{K}_k = \operatorname{span}\{b, Ab, \dots, A^{k-1} b\}$ and the sequence of orthogonal vectors $q_1, \dots, q_k$. For this problem, we always assume that the initial guess is $x_0 = 0$.
Let's denote by $p_1, p_2, \dots, p_n$ a sequence of vectors such that
$$\operatorname{span}\{p_1, \dots, p_k\} = \mathcal{K}_k \quad \text{for all } k.$$
We can expand our solution in that basis:
$$x = \sum_{i=1}^{n} \alpha_i\, p_i.$$
If we have a method to calculate $p_k$ and $\alpha_k$, then our iterative solution update is very simple:
$$x_k = x_{k-1} + \alpha_k\, p_k.$$
Attempt 1. Let's assume that the vectors $p_i$ are orthonormal. In fact, in that case, we have just chosen $p_i = q_i$. We then have
$$x = \sum_i \alpha_i\, p_i.$$
Since
$$p_j^T p_i = \delta_{ij},$$
we have
$$\alpha_i = p_i^T x.$$
In principle this works well but we do not know $x$! So even if we can calculate the sequence $p_i$, there is no obvious way to calculate $\alpha_i$.
Attempt 2. However, there is another equation that we can use. Replace the unknown quantity $p_j^T x$ by $p_j^T A x$:
$$p_j^T A x = p_j^T b.$$
This is the starting point of the entire CG algorithm! We know $b$. So if we know $p_j$, we can calculate $p_j^T A x$. From there, the entire algorithm can be derived.
Recall that $x = \sum_i \alpha_i\, p_i$. If we multiply to the left by $p_j^T A$, we get
$$p_j^T A x = \sum_i \alpha_i\, p_j^T A p_i = p_j^T b.$$
Although we can compute $p_j^T b$, we now have to deal with the terms $p_j^T A p_i$ if we want to calculate $\alpha_j$.
In Attempt 1, we had chosen $p_i = q_i$. However, other choices are possible. In Attempt 2, we choose the vectors $p_i$ such that
$$P^T A P = D, \qquad P = [\, p_1, \dots, p_n \,],$$
where $D$ is diagonal. We will denote by $d_i = p_i^T A p_i$ the diagonal entries. Note that, since $A$ is SPD, we have $d_i > 0$.
Let us denote by $P_k$ the first $k$ columns of $P$ and by $D_k$ the diagonal matrix with the first $k$ entries of $D$. The solution is then given by:
$$x_k = P_k D_k^{-1} P_k^T b = \sum_{i=1}^{k} \frac{p_i^T b}{p_i^T A p_i}\, p_i.$$
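To make this concrete, here is a minimal numerical sketch (assuming NumPy; the random test matrix, seed, and tolerances are illustrative choices, not part of the derivation). It builds an $A$-orthogonal basis by Gram–Schmidt in the $A$-inner product and checks that $P D^{-1} P^T b$ solves $Ax = b$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Illustrative SPD matrix and right-hand side.
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# Build an A-orthogonal basis P by Gram-Schmidt in the A-inner product,
# starting from a random basis.
P = rng.standard_normal((n, n))
for j in range(n):
    for i in range(j):
        P[:, j] -= (P[:, i] @ A @ P[:, j]) / (P[:, i] @ A @ P[:, i]) * P[:, i]

D = P.T @ A @ P                      # diagonal, up to roundoff
assert np.allclose(D, np.diag(np.diag(D)), atol=1e-8)

x = P @ np.linalg.solve(D, P.T @ b)  # x = P D^{-1} P^T b
assert np.allclose(A @ x, b)
```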
What does diagonal mean? When looking at $P^T A P$, we are looking at a special dot product that uses the matrix $A$. For example, the $(i,j)$ entry of $P^T A P$ is
$$p_i^T A p_j.$$
This dot product has the following interpretation. Recall that $A$ is SPD. So
$$A = V \Lambda V^T,$$
where $V$ is orthogonal and $\Lambda$ is diagonal with $\lambda_i > 0$. So
$$u^T A v = (V^T u)^T\, \Lambda\, (V^T v).$$
This dot product has three steps:
- Multiply the vectors by $V^T$. That means apply a series of reflections. This is like rotating the frame of reference.
- Multiply by the diagonal matrix $\Lambda$. This is a rescaling of the axes.
- Apply the usual dot product.
So, essentially, the main thing we are doing is applying a rescaling using $\Lambda$.
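Here is a small numerical check of this interpretation (a sketch assuming NumPy; the matrix, vectors, and seed are made up for illustration): the $A$-dot product equals an ordinary dot product after rotating by $V^T$ and rescaling by $\Lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Illustrative SPD matrix and two vectors.
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
u, v = rng.standard_normal(n), rng.standard_normal(n)

# Eigendecomposition A = V diag(lam) V^T, with V orthogonal and lam > 0.
lam, V = np.linalg.eigh(A)

# Step 1: rotate by V^T.  Step 2: rescale by lam.  Step 3: ordinary dot product.
lhs = u @ A @ v
rhs = (V.T @ u) @ np.diag(lam) @ (V.T @ v)
assert np.isclose(lhs, rhs)
```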
When we say
$$p_i^T A p_j = 0 \quad \text{for } i \neq j,$$
we simply mean that the sequence $(p_i)$ is orthogonal with respect to the dot product defined by $A$. This is a very natural choice.
Note that we could require that
$$p_i^T A p_i = 1.$$
But, for computational reasons, another normalization of the $p_i$ will be used.
This new dot product allows us to define a new norm, the $A$-norm:
$$\|u\|_A = \sqrt{u^T A u}.$$
Summary: the CG algorithm builds a sequence of vectors $p_i$ such that
$$\operatorname{span}\{p_1, \dots, p_k\} = \mathcal{K}_k, \qquad p_i^T A p_j = 0 \ \text{ for } i \neq j.$$
The exact solution is written as:
$$x = \sum_{i=1}^{n} \alpha_i\, p_i.$$
We calculate $\alpha_i$ using
$$\alpha_i = \frac{p_i^T A x}{p_i^T A p_i},$$
or
$$\alpha_i = \frac{p_i^T b}{p_i^T A p_i}.$$
Then we update the solution using
$$x_k = x_{k-1} + \alpha_k\, p_k.$$
Least-squares problem and projection. We can further interpret the solution $x_k$ in a least-squares sense. From the $A$-orthogonality of the $p_i$, we deduce that $x - x_k$ is $A$-orthogonal to $\mathcal{K}_k$. This can also be verified from
$$p_j^T A (x - x_k) = p_j^T b - p_j^T A x_k = p_j^T b - \alpha_j\, p_j^T A p_j = 0, \qquad j \le k.$$
We recognize that we are solving a least-squares problem using the $A$-norm:
$$x_k = \arg\min_{y \in \mathcal{K}_k} \|x - y\|_A.$$
CG produces the approximation $x_k$ in the Krylov subspace $\mathcal{K}_k$ that is closest to the true solution $x$ in the $A$-norm.
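The optimality statement can also be checked numerically. Below is a sketch (assuming NumPy; the test problem, the subspace dimension, and the tolerances are illustrative): we solve the $A$-norm least-squares problem over $\mathcal{K}_k$ through its normal equations $K^T A K\, c = K^T b$ and verify both the $A$-orthogonality of the error and the optimality against random elements of $\mathcal{K}_k$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3

# Illustrative SPD system.
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)            # exact solution, for comparison only

# Krylov basis K = [b, Ab, A^2 b, ...] (columns scaled for conditioning).
K = np.empty((n, k))
K[:, 0] = b / np.linalg.norm(b)
for j in range(1, k):
    w = A @ K[:, j - 1]
    K[:, j] = w / np.linalg.norm(w)

# Minimize ||x - K c||_A over c:  K^T A K c = K^T A x = K^T b.
c = np.linalg.solve(K.T @ A @ K, K.T @ b)
x_k = K @ c

a_norm = lambda u: np.sqrt(u @ A @ u)

# The error x - x_k is A-orthogonal to the Krylov subspace ...
assert np.allclose(K.T @ A @ (x - x_k), 0, atol=1e-8)

# ... and x_k beats any other element of K_k in the A-norm.
for _ in range(100):
    y = K @ rng.standard_normal(k)
    assert a_norm(x - x_k) <= a_norm(x - y) + 1e-9
```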
Computing the sequence $p_k$. In principle, using the Lanczos process, we can compute the sequence $p_k$. However, there is a more efficient approach that uses the sequence of residual vectors:
$$r_k = b - A x_k.$$
Recall the definition of the subspace $\mathcal{K}_k$:
$$\mathcal{K}_k = \operatorname{span}\{b, Ab, \dots, A^{k-1} b\}.$$
Using this, since $x_k \in \mathcal{K}_k$, we have $A x_k \in \mathcal{K}_{k+1}$. Thus, $r_k = b - A x_k \in \mathcal{K}_{k+1}$. Since
$$\mathcal{K}_{k+1} = \operatorname{span}\{p_1, \dots, p_{k+1}\},$$
we can derive the following important connection between the residuals and the vectors $p_i$:
$$r_k \in \operatorname{span}\{p_1, \dots, p_{k+1}\}.$$
Below, we prove some important results involving the residuals $r_k$.
The residual $r_k$ is orthogonal to $\mathcal{K}_k$. We now prove a key result: $p_j^T r_k = 0$ for all $j \le k$, that is, $r_k \perp \mathcal{K}_k$.
Proof. Assume that $j \le k$. Then:
$$p_j^T r_k = p_j^T b - p_j^T A x_k = p_j^T b - \sum_{i=1}^{k} \alpha_i\, p_j^T A p_i = p_j^T b - \alpha_j\, p_j^T A p_j = 0,$$
since $\alpha_j = p_j^T b / (p_j^T A p_j)$.
Moreover, for $j > k$:
$$p_j^T r_k = p_j^T b - \sum_{i=1}^{k} \alpha_i\, p_j^T A p_i = p_j^T b.$$
In summary, define
$$R = [\, r_0, r_1, \dots, r_{n-1} \,].$$
We can write:
$$P^T R = L,$$
where $L$ is lower triangular, and $l_{jk} = p_j^T r_{k-1} = p_j^T b$ for $j \ge k$. (Recall that column $j$ of $P$ is $p_j$ and column $k$ of $R$ is $r_{k-1}$.)
In addition, since
$$r_{k-1} \in \mathcal{K}_k = \operatorname{span}\{p_1, \dots, p_k\},$$
we also have that there exists an upper triangular matrix $U$ such that
$$R = P\, U.$$
The matrix $U$ is very important and we will come back to it later.
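Here is a sketch verifying these structural claims numerically (assuming NumPy; the test matrix, seed, and tolerances are illustrative). It builds the $A$-orthogonal directions by Gram–Schmidt in the $A$-inner product applied to the Krylov vectors, forms the iterates $x_k$ and residuals $r_k$, and checks that $P^T R$ is lower triangular with the entries on and below the diagonal equal to $p_j^T b$, and that $U = P^{-1} R$ is upper triangular.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# A-orthogonal directions p_1, ..., p_n from the Krylov vectors b, Ab, A^2 b, ...
P = np.empty((n, n))
w = b.copy()
for j in range(n):
    for i in range(j):
        w -= (P[:, i] @ A @ w) / (P[:, i] @ A @ P[:, i]) * P[:, i]
    P[:, j] = w / np.linalg.norm(w)   # scaling is irrelevant here, but helps numerically
    w = A @ P[:, j]                   # next Krylov direction, to be A-orthogonalized

# Iterates x_1, ..., x_n and residuals r_0, ..., r_{n-1} (x_0 = 0, so r_0 = b).
alphas = (P.T @ b) / np.einsum('ij,ij->j', P, A @ P)
X = np.cumsum(alphas * P, axis=1)                      # column k-1 holds x_k
R = np.column_stack([b] + [b - A @ X[:, k - 1] for k in range(1, n)])

L_mat = P.T @ R
assert np.allclose(np.triu(L_mat, 1), 0, atol=1e-8)    # strictly upper part vanishes
for j in range(n):                                     # on/below the diagonal: p_j^T b
    assert np.allclose(L_mat[j, :j + 1], P[:, j] @ b, atol=1e-8)

U = np.linalg.solve(P, R)                              # R = P U
assert np.allclose(np.tril(U, -1), 0, atol=1e-8)       # U is upper triangular
```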
Three-term recurrence. We now prove that $p_{k+1}$ is a linear combination of $r_k$ and $p_k$. From this result, we derive a short and computationally efficient recurrence formula for $p_{k+1}$.
From $p_k \in \mathcal{K}_k$ and $A \mathcal{K}_k \subseteq \mathcal{K}_{k+1}$, we have: $A p_k \in \mathcal{K}_{k+1}$. Since
$$\mathcal{K}_{k+1} = \operatorname{span}\{p_1, \dots, p_{k+1}\},$$
it follows that $A p_k \in \operatorname{span}\{p_1, \dots, p_{k+1}\}$. This can be expressed using matrix notation:
$$A P = P H,$$
where $H$ is an upper Hessenberg matrix ($h_{ij} = 0$ for $i > j + 1$).
Next, consider the matrix $P^T A R$:
$$P^T A R = (A P)^T R = H^T (P^T R) = H^T L,$$
where $L = P^T R$ is the lower triangular matrix introduced above. Since $L$ is lower triangular and $H^T$ is lower Hessenberg (i.e., all entries above the first superdiagonal are zero: $(H^T)_{ij} = 0$ for $j > i + 1$), their product is lower Hessenberg as well.
Proof. Here is a more detailed proof. Consider:
$$(H^T L)_{ij} = \sum_m (H^T)_{im}\, L_{mj}.$$
We have $(H^T)_{im} = 0$ if $m > i + 1$, and $L_{mj} = 0$ if $m < j$. So $(H^T L)_{ij} = 0$ if $j > i + 1$, that is, $H^T L$ is lower Hessenberg.
In conclusion, since $P^T A R = P^T A P\, U = D U$, the matrix $D U$ is both lower Hessenberg and upper triangular. Since $D$ is diagonal, this implies that $D U$ and $U$ have only two non-zero diagonals in their upper triangular part. We say that these matrices are upper bi-diagonal.
Since $R = P U$ and $U$ is upper bi-diagonal, we have proved that (recall that column $k+1$ of $R$ is $r_k$ and column $j$ of $P$ is $p_j$):
$$r_k = u_{k,k+1}\, p_k + u_{k+1,k+1}\, p_{k+1}, \qquad k \ge 1.$$
Moreover, since $r_0$ is the first column of $R$, we have:
$$r_0 = u_{1,1}\, p_1.$$
We now derive a short recurrence relation for $p_{k+1}$:
$$p_{k+1} = \frac{1}{u_{k+1,k+1}} \left( r_k - u_{k,k+1}\, p_k \right).$$
At this point, we have not yet chosen the normalization for the vectors $p_k$. To simplify, we choose the following normalization:
$$u_{k,k} = 1 \ \text{ for all } k, \qquad \text{so that} \qquad p_1 = r_0 = b, \qquad p_{k+1} = r_k + \beta_k\, p_k, \quad \beta_k = -u_{k,k+1}.$$
This is the key three-term recurrence relation used to update $p_k$ in CG.
With this normalization, the $p_k$ are not normalized to have unit $A$-norm. But this normalization turns out to be computationally more efficient.
With this choice, $U$ is unit upper bi-diagonal. This means that $u_{k,k} = 1$. We will show below that $u_{k,k+1} = -\beta_k$ with $\beta_k = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}}$. All other entries in $U$ are zero.
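Concretely, with this normalization, for $n = 4$ the matrix $U$ in $R = P U$ has the form
$$U = \begin{pmatrix} 1 & -\beta_1 & & \\ & 1 & -\beta_2 & \\ & & 1 & -\beta_3 \\ & & & 1 \end{pmatrix},$$
so that reading off column $k+1$ of $R = P U$ gives $r_k = p_{k+1} - \beta_k\, p_k$, i.e., $p_{k+1} = r_k + \beta_k\, p_k$.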
Updating the residual vectors. We are now almost done with the complete CG algorithm. We have formulas to update $x_k$ and $p_{k+1}$. The formula to update $r_k$ can be derived from $x_k = x_{k-1} + \alpha_k\, p_k$:
Recall that
$$r_k = b - A x_k.$$
Multiply the update formula for $x_k$ by $A$ and simplify to get:
$$r_k = r_{k-1} - \alpha_k\, A p_k.$$
This is the equation CG uses to update the residual vectors.
The residual vectors are orthogonal to each other. There are a few more simplifications needed to make the method as computationally efficient as possible. We have already seen that $P^T R = L$ is lower triangular and that $R = P U$ with $U$ upper triangular. Now, we prove that the residuals are orthogonal to each other.
We have:
$$R^T R = (P U)^T R = U^T (P^T R) = U^T L.$$
Since $U^T$ and $L$ are lower triangular matrices, $U^T L$ is also lower triangular. However, $R^T R$ is symmetric (and also positive definite). A matrix that is both triangular and symmetric must be diagonal.
Therefore, $R^T R$ is diagonal, and we have proved that the residuals are orthogonal to each other.
Final simplifications. We now derive the final formulas for the CG algorithm. Recall that:
$$\alpha_k = \frac{p_k^T b}{p_k^T A p_k}.$$
However, $p_k^T b = p_k^T r_{k-1}$, since $b = r_{k-1} + A x_{k-1}$ and $x_{k-1} \in \operatorname{span}\{p_1, \dots, p_{k-1}\}$ is $A$-orthogonal to $p_k$. Thus:
$$p_k^T b = p_k^T r_{k-1} = (r_{k-1} + \beta_{k-1}\, p_{k-1})^T r_{k-1} = r_{k-1}^T r_{k-1},$$
where we used $p_k = r_{k-1} + \beta_{k-1}\, p_{k-1}$ and $p_{k-1}^T r_{k-1} = 0$. Hence, we obtain:
$$\alpha_k = \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k}.$$
Similarly, we simplify the formula for $\beta_k$.
From previous relations, we know:
$$p_{k+1} = r_k + \beta_k\, p_k, \qquad p_{k+1}^T A p_k = 0 \quad\Longrightarrow\quad \beta_k = -\frac{r_k^T A p_k}{p_k^T A p_k}.$$
Since $A p_k = \frac{1}{\alpha_k}\,(r_{k-1} - r_k)$, and using the orthogonality of the residuals, we compute:
$$r_k^T A p_k = \frac{1}{\alpha_k}\left( r_k^T r_{k-1} - r_k^T r_k \right) = -\frac{r_k^T r_k}{\alpha_k}, \qquad\text{so that}\qquad \beta_k = \frac{r_k^T r_k}{\alpha_k\, p_k^T A p_k} = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}},$$
where we used $\alpha_k\, p_k^T A p_k = r_{k-1}^T r_{k-1}$. This is an amazingly simple expression! Below, we denote by $\beta_k$ the ratio:
$$\beta_k = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}}.$$
The Conjugate Gradient Algorithm. The complete CG algorithm is as follows. Start with
$$x_0 = 0, \qquad r_0 = b, \qquad p_1 = b.$$
Then iterate, starting from $k = 1$:
$$\begin{aligned}
\alpha_k &= \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k}, \\
x_k &= x_{k-1} + \alpha_k\, p_k, \\
r_k &= r_{k-1} - \alpha_k\, A p_k, \\
\beta_k &= \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}}, \\
p_{k+1} &= r_k + \beta_k\, p_k.
\end{aligned}$$
This recurrence is the most computationally efficient implementation of the CG algorithm. It relies on sparse matrix-vector products with $A$ and just a few vector operations. The CG algorithm is one of the most efficient iterative methods for solving linear systems. But note that it only applies to SPD matrices.
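As a concrete sketch, here is a direct NumPy translation of these recurrences (the function name, stopping tolerance, and random test problem below are illustrative choices, not part of the derivation):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for SPD A using the recurrences derived above."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)                    # x_0 = 0
    r = b.copy()                       # r_0 = b
    p = b.copy()                       # p_1 = b
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p                     # the only matrix-vector product per iteration
        alpha = rs_old / (p @ Ap)      # alpha_k = r_{k-1}^T r_{k-1} / (p_k^T A p_k)
        x += alpha * p                 # x_k = x_{k-1} + alpha_k p_k
        r -= alpha * Ap                # r_k = r_{k-1} - alpha_k A p_k
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # stop once the residual is small enough
            break
        beta = rs_new / rs_old         # beta_k = r_k^T r_k / (r_{k-1}^T r_{k-1})
        p = r + beta * p               # p_{k+1} = r_k + beta_k p_k
        rs_old = rs_new
    return x

# Quick check on a random SPD system (illustrative).
rng = np.random.default_rng(4)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))       # roughly the stopping tolerance or smaller
```

In exact arithmetic the loop terminates in at most $n$ iterations; in practice the stopping test on $\|r_k\|$ ends it much earlier for well-conditioned problems.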
Summary of key equations. We list all the key results we have used in our derivation of the CG algorithm:
We have:
$$\mathcal{K}_k = \operatorname{span}\{b, Ab, \dots, A^{k-1} b\} = \operatorname{span}\{p_1, \dots, p_k\} = \operatorname{span}\{r_0, \dots, r_{k-1}\},$$
and $r_k = b - A x_k$. For any vector $u$, $\|u\|_A = \sqrt{u^T A u}$.
Key orthogonality relations:
- The vectors $p_i$ are $A$-orthogonal: $p_i^T A p_j = 0$ for $i \neq j$.
- The residual $r_k$ is orthogonal to $\mathcal{K}_k$: $p_j^T r_k = 0$ for $j \le k$.
- The residuals are orthogonal to each other: $r_i^T r_j = 0$ for $i \neq j$.
Key optimality relation:
- The CG algorithm produces the approximation $x_k$ in the Krylov subspace $\mathcal{K}_k$ that is closest to the true solution $x$ in the $A$-norm: $x_k = \arg\min_{y \in \mathcal{K}_k} \|x - y\|_A$.
Orthogonality relations using matrix notation. We can formalize some of these relations using matrix notation.
First, observe the following result: the residual $r_k$ is orthogonal to $\mathcal{K}_k$. Similarly, the Lanczos vector $q_{k+1}$ is orthogonal to $\mathcal{K}_k$. Since both vectors belong to $\mathcal{K}_{k+1}$, this implies that $r_k$ and $q_{k+1}$ are parallel:
$$q_{k+1} = \pm\, \frac{r_k}{\|r_k\|_2}.$$
From the Lanczos process, we know that $T_k = Q_k^T A Q_k$ (with $Q_k = [\, q_1, \dots, q_k \,]$) is symmetric tri-diagonal. This follows from the construction of the sequence $q_i$. By definition, $Q_k^T A Q_k$ is expected to be upper Hessenberg. Since it is symmetric, it must be tri-diagonal.
Using the relationship above, we also deduce that $R^T A R$ is symmetric tri-diagonal. This further implies that $r_i$ and $r_j$ are $A$-orthogonal if $|i - j| > 1$.
Additionally, we previously showed that $R^T R$ is diagonal and that the matrix $U$ in $R = P U$ is upper bi-diagonal.
In summary:
- Diagonal: $P^T A P$, $R^T R$
- Symmetric tri-diagonal: $Q^T A Q$, $R^T A R$
- Upper bi-diagonal: $U$, where $R = P U$
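The following sketch checks these relations numerically by running a few steps of the CG recurrences and collecting the directions and residuals (assuming NumPy; the test problem, the number of steps, and the tolerances are illustrative). It checks $P^T A P$ and $R^T R$ diagonal, $R^T A R$ tridiagonal, and $P^T A R = D U$ upper bi-diagonal (so $U$ itself is upper bi-diagonal, since $D$ is diagonal).

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 6

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# Run k steps of the CG recurrences, storing the directions and residuals.
x = np.zeros(n)
r = b.copy()
p = b.copy()
P_cols, R_cols = [], []
for _ in range(k):
    P_cols.append(p.copy())            # p_1, ..., p_k
    R_cols.append(r.copy())            # r_0, ..., r_{k-1}
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x += alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new

P = np.column_stack(P_cols)
R = np.column_stack(R_cols)

def outside_band(X, lower, upper):
    """Everything outside the band of diagonals lower..upper (0 = main diagonal)."""
    return X - np.triu(np.tril(X, upper), lower)

assert np.allclose(outside_band(P.T @ A @ P, 0, 0), 0, atol=1e-8)   # diagonal
assert np.allclose(outside_band(R.T @ R, 0, 0), 0, atol=1e-8)       # diagonal
assert np.allclose(outside_band(R.T @ A @ R, -1, 1), 0, atol=1e-8)  # tridiagonal
assert np.allclose(outside_band(P.T @ A @ R, 0, 1), 0, atol=1e-8)   # upper bi-diagonal
```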
Why is it called the Conjugate Gradient algorithm? The name comes from the fact that the directions $p_k$ are $A$-orthogonal, or conjugate, to each other.
Moreover, consider the loss function:
$$\phi(y) = \frac{1}{2}\, y^T A y - b^T y.$$
The gradient of $\phi$ with respect to $y$ is:
$$\nabla \phi(y) = A y - b.$$
This shows that the residual $r_k = b - A x_k = -\nabla \phi(x_k)$ is parallel to the gradient of the loss function at $x_k$. Recall the update equation:
$$p_{k+1} = r_k + \beta_k\, p_k.$$
The new direction $p_{k+1}$ is the optimal direction to update $x_k$. This equation shows that $p_{k+1}$ is a linear combination of the gradient at $x_k$ and the previous direction $p_k$.
This is why the CG algorithm is called the Conjugate Gradient algorithm.
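As a quick sanity check of the gradient identity (a sketch assuming NumPy; the quadratic, the point $y$, and the finite-difference step are illustrative), the finite-difference gradient of $\phi$ matches $A y - b = -r$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

phi = lambda y: 0.5 * y @ A @ y - b @ y     # the quadratic loss

y = rng.standard_normal(n)
residual = b - A @ y

# Central-difference gradient of phi at y; it should equal A y - b = -residual.
eps = 1e-6
grad_fd = np.array([(phi(y + eps * e) - phi(y - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
assert np.allclose(grad_fd, -residual, atol=1e-4)
```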
Connection to the Lanczos process. Recall that the Lanczos process generates the orthonormal vectors $q_1, \dots, q_k$ (the columns of $Q_k$) and the tri-diagonal matrix $T_k = Q_k^T A Q_k$. How is $T_k$ related to CG?
We showed above that $r_k$ is orthogonal to $\mathcal{K}_k$. This implies:
$$Q_k^T r_k = Q_k^T (b - A x_k) = 0.$$
Since $x_k \in \mathcal{K}_k$, we substitute $x_k = Q_k y_k$ and obtain:
$$(Q_k^T A Q_k)\, y_k = Q_k^T b.$$
Thus, we recover the Lanczos matrix $T_k = Q_k^T A Q_k$!
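To close the loop, here is a sketch (assuming NumPy; the test problem, the number of steps, and the tolerances are illustrative) that runs a plain Lanczos iteration without reorthogonalization, checks that $T_k = Q_k^T A Q_k$ is tridiagonal, solves $T_k y_k = Q_k^T b$, and verifies that $Q_k y_k$ matches the CG iterate $x_k$ and that $r_k$ is parallel to $q_{k+1}$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 30, 6

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# Lanczos: orthonormal basis q_1, ..., q_{k+1} of the Krylov subspaces.
Q = np.zeros((n, k + 1))
Q[:, 0] = b / np.linalg.norm(b)
beta_prev = 0.0
for j in range(k):
    w = A @ Q[:, j] - (beta_prev * Q[:, j - 1] if j > 0 else 0.0)
    alpha = Q[:, j] @ w
    w -= alpha * Q[:, j]
    beta_prev = np.linalg.norm(w)
    Q[:, j + 1] = w / beta_prev
Qk = Q[:, :k]

Tk = Qk.T @ A @ Qk
assert np.allclose(Tk - np.triu(np.tril(Tk, 1), -1), 0, atol=1e-8)  # tridiagonal

# Galerkin condition: x_k = Q_k y_k with T_k y_k = Q_k^T b.
x_lanczos = Qk @ np.linalg.solve(Tk, Qk.T @ b)

# k steps of the CG recurrences.
x = np.zeros(n)
r = b.copy()
p = b.copy()
for _ in range(k):
    Ap = A @ p
    a = (r @ r) / (p @ Ap)
    x += a * p
    r_new = r - a * Ap
    p = r_new + (r_new @ r_new) / (r @ r) * p
    r = r_new

assert np.allclose(x, x_lanczos, atol=1e-8)            # same iterate x_k
cosine = (r @ Q[:, k]) / np.linalg.norm(r)
assert np.isclose(abs(cosine), 1.0, atol=1e-6)         # r_k is parallel to q_{k+1}
```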