Convergence of the Arnoldi Process#

We have introduced the Arnoldi iteration as a practical algorithm for building a small Hessenberg matrix \(H_k\). We now explore the deeper theoretical connections between this algorithm and the fundamental structure it explores: the Krylov subspace.

The Krylov Subspace#

For a matrix \(A \in \mathbb{R}^{n \times n}\) and a starting vector \(q_1\), the \(k\)-dimensional Krylov subspace is defined as the space spanned by the first \(k\) vectors in the “power sequence”:

\[ \mathcal{K}_k(A, q_1) = \text{span}\{q_1, Aq_1, A^2q_1, \dots, A^{k-1}q_1\} \]

This subspace is, in essence, the set of all vectors we can “reach” from \(q_1\) by applying \(A\) up to \(k-1\) times.
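To make this concrete, here is a minimal NumPy sketch (the matrix, dimensions, and starting vector are arbitrary choices for illustration) that builds the power sequence column by column:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4
A = rng.standard_normal((n, n))
q1 = rng.standard_normal(n)

# Stack q1, A q1, A^2 q1, ..., A^{k-1} q1 as the columns of the Krylov matrix.
cols = [q1]
for _ in range(k - 1):
    cols.append(A @ cols[-1])
K = np.column_stack(cols)

# For a generic starting vector the subspace has full dimension k.
print(np.linalg.matrix_rank(K))  # 4
```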

Arnoldi as a Krylov Basis#

The most important theoretical property of the Arnoldi iteration is that it provides a stable, orthonormal basis for this exact subspace.

Theorem 26 (Arnoldi Produces a Krylov Basis)

The basis \(Q_k = [q_1, \dots, q_k]\) generated by the \(k\)-step Arnoldi algorithm (assuming no breakdown) is an orthonormal basis for the \(k\)-dimensional Krylov subspace \(\mathcal{K}_k(A, q_1)\).

Proof. We will prove by induction that \(\text{span}\{q_1, \dots, q_j\} = \text{span}\{q_1, Aq_1, \dots, A^{j-1}q_1\}\) for all \(j=1, \dots, k\).

Let \(\mathcal{Q}_j = \text{span}\{q_1, \dots, q_j\}\) and \(\mathcal{K}_j = \text{span}\{q_1, Aq_1, \dots, A^{j-1}q_1\}\).

  • Base Case (j=1): \(\mathcal{Q}_1 = \text{span}\{q_1\}\) and \(\mathcal{K}_1 = \text{span}\{q_1\}\). They are equal.

  • Inductive Hypothesis: Assume \(\mathcal{Q}_j = \mathcal{K}_j\) for some \(j < k\).

  • Inductive Step: We must show \(\mathcal{Q}_{j+1} = \mathcal{K}_{j+1}\).

This requires two inclusions:

  1. Show \(\mathcal{Q}_{j+1} \subseteq \mathcal{K}_{j+1}\):

    \(\mathcal{Q}_{j+1} = \text{span}\{\mathcal{Q}_j, q_{j+1}\}\). By the inductive hypothesis, \(\mathcal{Q}_j = \mathcal{K}_j \subseteq \mathcal{K}_{j+1}\). We only need to show \(q_{j+1} \in \mathcal{K}_{j+1}\). The Arnoldi recurrence is

    \[ q_{j+1} = (h_{j+1,j})^{-1} (A q_j - \sum_{i=1}^{j} h_{ij} q_i). \]

    By the IH, both \(q_j\) and the \(q_i\) terms are in \(\mathcal{Q}_j = \mathcal{K}_j\).

    • Since \(q_i \in \mathcal{K}_j \subseteq \mathcal{K}_{j+1}\), the sum \(\sum h_{ij} q_i\) is in \(\mathcal{K}_{j+1}\).

    • Since \(q_j \in \mathcal{K}_j\),

    \[ A q_j \in A(\mathcal{K}_j) = A(\text{span}\{q_1, \dots, A^{j-1}q_1\}) = \text{span}\{Aq_1, \dots, A^j q_1\} \subseteq \mathcal{K}_{j+1}. \]
    • Therefore, \(q_{j+1}\) is a linear combination of vectors in \(\mathcal{K}_{j+1}\), so \(q_{j+1} \in \mathcal{K}_{j+1}\).

    • This proves \(\mathcal{Q}_{j+1} \subseteq \mathcal{K}_{j+1}\).

  2. Show \(\mathcal{K}_{j+1} \subseteq \mathcal{Q}_{j+1}\):

    \(\mathcal{K}_{j+1} = \text{span}\{\mathcal{K}_j, A^j q_1\}\). By the IH, \(\mathcal{K}_j = \mathcal{Q}_j \subseteq \mathcal{Q}_{j+1}\). We only need to show \(A^j q_1 \in \mathcal{Q}_{j+1}\).

    We prove by induction on \(m\) that \(A^m q_1 \in \mathcal{Q}_{m+1}\) for \(m = 0, \dots, j\):

    • Base: \(A^0 q_1 = q_1 \in \mathcal{Q}_1\).

    • Step: Assume \(A^m q_1 \in \mathcal{Q}_{m+1}\) for some \(m < j\). From the Arnoldi relation, \(A q_i \in \mathcal{Q}_{i+1}\) for any \(i\). Since \(A^m q_1 \in \mathcal{Q}_{m+1}\), we have \(A^{m+1} q_1 = A(A^m q_1) \in A(\mathcal{Q}_{m+1}) \subseteq \mathcal{Q}_{m+2}\).

    Therefore \(A^j q_1 \in \mathcal{Q}_{j+1}\), proving \(\mathcal{K}_{j+1} \subseteq \mathcal{Q}_{j+1}\).

Since we have shown inclusion in both directions, we conclude \(\mathcal{Q}_{j+1} = \mathcal{K}_{j+1}\).
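The theorem is easy to check numerically. The sketch below implements the \(k\)-step Arnoldi iteration in plain NumPy (a didactic version, not library code) and verifies that appending the Krylov generators to \(Q_k\) adds no new directions:

```python
import numpy as np

def arnoldi(A, q1, k):
    """k-step Arnoldi: returns Q (n x (k+1)) and H ((k+1) x k)."""
    n = len(q1)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = q1 / np.linalg.norm(q1)
    for j in range(k):
        v = A @ Q[:, j]
        for i in range(j + 1):                 # Gram-Schmidt against q_1, ..., q_{j+1}
            H[i, j] = Q[:, i] @ v
            v -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(v)
        Q[:, j + 1] = v / H[j + 1, j]          # assumes no breakdown
    return Q, H

rng = np.random.default_rng(1)
n, k = 10, 5
A = rng.standard_normal((n, n))
q1 = rng.standard_normal(n)

Q, H = arnoldi(A, q1, k)
K = np.column_stack([np.linalg.matrix_power(A, j) @ q1 for j in range(k)])

# Equal spans <=> augmenting Q_k with K_k adds no new directions.
print(np.linalg.matrix_rank(np.hstack([Q[:, :k], K])))  # 5, i.e. k
```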

Arnoldi as QR Factorization#

This equivalence has a powerful matrix interpretation. Let’s define the Krylov matrix \(K_k\) whose columns are the generators of the Krylov subspace:

\[ K_k = [q_1, Aq_1, A^2 q_1, \dots, A^{k-1} q_1] \]

Our theorem \(\mathcal{K}_k = \text{span}(Q_k)\) means that every column of \(K_k\) can be written as a linear combination of the columns of \(Q_k\). This gives the matrix relationship:

\[ K_k = Q_k R_k \]

What is \(R_k\)? The \(j\)-th column of \(K_k\) is \(A^{j-1}q_1\). From our proof, we know \(A^{j-1}q_1 \in \mathcal{K}_j = \text{span}\{q_1, \dots, q_j\}\). This means that to build the \(j\)-th column of \(K_k\), we only need the first \(j\) columns of \(Q_k\). This is the exact definition of an upper triangular matrix \(R_k\).

Key Insight: The Arnoldi iteration is implicitly performing a QR factorization of the Krylov matrix \(K_k\).

This is of great practical importance. The columns of \(K_k\) (the power sequence \(A^j q_1\)) converge quickly to the direction of the dominant eigenvector, which makes the Krylov matrix \(K_k\) extremely ill-conditioned; a standard Gram-Schmidt \(QR\) factorization of \(K_k\) would fail due to catastrophic cancellation. The Arnoldi iteration, by orthogonalizing “as it goes,” is a numerically stable way to compute this same basis.
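Continuing the previous sketch (reusing `arnoldi`, `A`, `q1`, `Q`, and `k` from it), we can confirm both claims: the coefficient matrix \(R_k = Q_k^T K_k\) is upper triangular, so \(K_k = Q_k R_k\) is a genuine QR factorization, and the condition number of \(K_k\) is already large for modest \(k\):

```python
q1n = q1 / np.linalg.norm(q1)
K = np.column_stack([np.linalg.matrix_power(A, j) @ q1n for j in range(k)])

R = Q[:, :k].T @ K                    # coefficients of K_k's columns in the basis Q_k
print(np.allclose(K, Q[:, :k] @ R))   # True: K_k = Q_k R_k
print(np.allclose(R, np.triu(R)))     # True: R_k is upper triangular
print(np.linalg.cond(K))              # large, and it grows rapidly with k
```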

The Polynomial Connection#

The definition of \(\mathcal{K}_k\) leads to one additional insight. Any vector \(v \in \mathcal{K}_k\) is a linear combination of the basis vectors:

\[ v = c_0 q_1 + c_1 A q_1 + c_2 A^2 q_1 + \dots + c_{k-1} A^{k-1} q_1 \]

We can factor out \(q_1\) to the right:

\[ v = (c_0 I + c_1 A + c_2 A^2 + \dots + c_{k-1} A^{k-1}) q_1 \]

This can be written as \(v = p(A) q_1\), where \(p(z) = \sum_{i=0}^{k-1} c_i z^i\) is a polynomial of degree at most \(k-1\).

Thus, the Krylov subspace \(\mathcal{K}_k\) is the space of all vectors that can be formed by applying a polynomial of \(A\) of degree \(< k\) to the starting vector \(q_1\).

Important

The Arnoldi-based eigenvalue algorithm is therefore doing the following:

It searches for the “best” eigenvector approximations (the Ritz vectors) within the subspace of all vectors of the form \(p(A) q_1\), where \(p\) is a polynomial of degree at most \(k-1\).
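A short self-contained sketch makes this identification concrete (the coefficients \(c_i\) are arbitrary): applying the polynomial via Horner's rule lands exactly in the Krylov subspace, and expressing the result in the power basis recovers the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 4
A = rng.standard_normal((n, n))
q1 = rng.standard_normal(n)
c = rng.standard_normal(k)          # coefficients c_0, ..., c_{k-1}

# v = (c_0 I + c_1 A + ... + c_{k-1} A^{k-1}) q1, evaluated by Horner's rule
v = c[-1] * q1
for ci in c[-2::-1]:
    v = A @ v + ci * q1

# v lies in K_k: express it in the power basis and recover the same coefficients.
K = np.column_stack([np.linalg.matrix_power(A, j) @ q1 for j in range(k)])
coeffs, *_ = np.linalg.lstsq(K, v, rcond=None)
print(np.allclose(K @ coeffs, v), np.allclose(coeffs, c))  # True True
```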

Optimal Polynomial Approximation and Eigenvalue Convergence#

In this section, we reframe the Arnoldi process as a problem of optimal polynomial approximation.

The Cayley-Hamilton Theorem#

Polynomials of \(A\) can be used to interpret the convergence of the Arnoldi process. The fundamental connection is provided by the Cayley-Hamilton theorem.

Definition 2 (Characteristic Polynomial)

The characteristic polynomial of \(A \in \mathbb{C}^{n \times n}\) is

\[ p_A(z) = \det(zI - A) = z^n + c_{n-1}z^{n-1} + \cdots + c_1 z + c_0. \]

Theorem 27 (Cayley-Hamilton Theorem)

Every square matrix satisfies its characteristic equation:

\[ p_A(A) = 0. \]

Proof (diagonalizable case). Assume \(A = X \Lambda X^{-1}\) where \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)\). For any polynomial \(p(z)\), we have

\[ p(A) = X p(\Lambda) X^{-1}, \]

where \(p(\Lambda) = \text{diag}(p(\lambda_1), \ldots, p(\lambda_n))\).

The characteristic polynomial can be written as

\[ p_A(z) = \prod_{i=1}^n (z - \lambda_i). \]

Therefore,

\[ [p_A(\Lambda)]_{ii} = p_A(\lambda_i) = \prod_{j=1}^n (\lambda_i - \lambda_j) = 0, \]

since the factor with \(j = i\) vanishes.

Hence \(p_A(\Lambda) = 0\), which implies

\[ p_A(A) = X p_A(\Lambda) X^{-1} = 0. \]

Remark 5. The Cayley-Hamilton theorem holds for all matrices, not just diagonalizable ones. The general proof uses the adjugate matrix or Jordan canonical form.
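A quick numerical check (a sketch; `np.poly` returns the characteristic polynomial's coefficients, highest degree first): evaluating \(p_A\) at \(A\) by Horner's rule with matrix products gives the zero matrix up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))

# Coefficients of det(zI - A), highest degree first (monic).
coeffs = np.poly(A)

# Evaluate p_A(A) by Horner's rule with matrix products.
P = np.zeros((n, n))
for c in coeffs:
    P = P @ A + c * np.eye(n)

print(np.linalg.norm(P))   # ~ 1e-13: zero up to rounding error
```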

Definition 3 (Minimal Polynomial)

The minimal polynomial \(m_A(z)\) is the monic polynomial of smallest degree such that \(m_A(A) = 0\).

Proposition 1 (Properties of the Minimal Polynomial)

If \(A\) is diagonalizable with distinct eigenvalues \(\lambda_1, \ldots, \lambda_d\), then

\[ m_A(z) = \prod_{i=1}^d (z - \lambda_i). \]
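A small worked example (a sketch with a hand-picked diagonal matrix): with a repeated eigenvalue, the minimal polynomial has strictly smaller degree than the characteristic polynomial, yet it still annihilates \(A\).

```python
import numpy as np

A = np.diag([2.0, 2.0, 3.0])   # eigenvalue 2 is repeated, so d = 2 < n = 3
I = np.eye(3)

# m_A(z) = (z - 2)(z - 3) annihilates A, although p_A(z) = (z - 2)^2 (z - 3).
print(np.linalg.norm((A - 2 * I) @ (A - 3 * I)))   # 0.0
```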

Arnoldi’s Optimality Property#

Let’s explore in what sense the Arnoldi process makes \(p(A) q_1\) small.

Theorem 28 (Arnoldi Optimality)

Let \(H_k\) be the \(k \times k\) upper Hessenberg matrix from the Arnoldi process, and let

\[ p_k(z) = \det(zI - H_k) = z^k + c_{k-1}z^{k-1} + \cdots + c_0 \]

be its characteristic polynomial (which is monic of degree \(k\)). Then \(p_k\) minimizes

\[ \| p(A) q_1 \|_2 \]

among all monic polynomials \(p(z) = z^k + d_{k-1}z^{k-1} + \cdots + d_0\) of degree \(k\).

Proof. From the Arnoldi relation \(A Q_k = Q_k H_k + h_{k+1,k} q_{k+1} e_k^T\), we have

\[ A Q_k = Q_{k+1} \tilde{H}_k, \]

where \(\tilde{H}_k \in \mathbb{C}^{(k+1) \times k}\) is the extended Hessenberg matrix.

Key observation: If we ran the Arnoldi process to completion on an \(n \times n\) matrix \(A\) (assuming no breakdown), we would obtain the full decomposition

\[ A Q = Q H, \]

where \(Q\) is unitary and \(H\) is upper Hessenberg. Here \(Q_k\) consists of the first \(k\) columns of \(Q\), \(H_k\) is the leading \(k \times k\) block of \(H\), and \(\tilde{H}_k\) is the leading \((k+1) \times k\) block of \(H\).

For any polynomial \(p(z)\), we have

\[ p(A) Q = Q p(H). \]

Considering only the first column (\(q_1 = Q e_1\)):

\[ p(A) q_1 = Q p(H) e_1. \]

Because \(H\) is upper Hessenberg, \(H^j e_1\) has nonzero entries only in its first \(j+1\) positions. Hence, for any polynomial \(p\) of degree at most \(k\), the vector \(p(H) e_1\) vanishes below entry \(k+1\), and its first \(k+1\) entries depend only on the leading \((k+1) \times k\) block \(\tilde{H}_k\). We therefore have

\[ p(A) q_1 = Q_{k+1} [p(H) e_1]_{1:k+1}, \]

where \([p(H) e_1]_{1:k+1}\) denotes the first \(k+1\) entries of \(p(H) e_1\).

For the characteristic polynomial \(p_k(z) = \det(zI - H_k)\) of the \(k \times k\) matrix \(H_k\), by Cayley-Hamilton, \(p_k(H_k) = 0\). The bordered structure of \(H\) implies that

\[\begin{split} p_k(H) e_1 = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \beta \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \end{split}\]

where the first \(k\) entries are zero, all entries below position \(k+1\) vanish, and \(\beta = \prod_{j=1}^{k} h_{j+1,j}\), which is nonzero as long as no breakdown has occurred. Therefore,

\[\begin{split} p_k(A) q_1 = Q_{k+1} \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \beta \end{bmatrix} = \beta q_{k+1} \perp \text{span}(Q_k). \end{split}\]

Since \(Q_{k+1}\) has orthonormal columns,

\[ \| p_k(A) q_1 \|_2 = |\beta|. \]

For any other monic polynomial \(p(z) = z^k + d_{k-1}z^{k-1} + \cdots + d_0\), we can write

\[ p(z) - p_k(z) = q_{k-1}(z) \]

for some polynomial \(q_{k-1}\) of degree at most \(k-1\). Then

\[ p(A) q_1 = p_k(A) q_1 + q_{k-1}(A) q_1. \]

Since \(q_{k-1}(A) q_1 \in \mathcal{K}_k(A, q_1) = \text{span}(Q_k)\) and \(p_k(A) q_1 \perp \text{span}(Q_k)\), by the Pythagorean theorem:

\[ \| p(A) q_1 \|_2^2 = \| p_k(A) q_1 \|_2^2 + \| q_{k-1}(A) q_1 \|_2^2 \geq \| p_k(A) q_1 \|_2^2. \]
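Theorem 28 can also be checked numerically. The sketch below (reusing `arnoldi` from the earlier sketch) compares \(\|p_k(A)q_1\|_2\) for the characteristic polynomial of \(H_k\) against randomly perturbed monic competitors of the same degree:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 12, 4
A = rng.standard_normal((n, n))
q1 = rng.standard_normal(n)
q1 /= np.linalg.norm(q1)

Q, H = arnoldi(A, q1, k)               # from the earlier sketch
pk = np.poly(H[:k, :k])                # monic characteristic polynomial of H_k

def monic_apply(coeffs, A, v):
    """Evaluate p(A) v by Horner's rule; coeffs listed highest degree first."""
    w = np.zeros_like(v)
    for c in coeffs:
        w = A @ w + c * v
    return w

best = np.linalg.norm(monic_apply(pk, A, q1))
for _ in range(1000):                  # random competing monic polynomials
    p = pk + np.concatenate([[0.0], rng.standard_normal(k)])  # keep leading 1
    assert np.linalg.norm(monic_apply(p, A, q1)) >= best - 1e-8
print("char. poly of H_k attains the minimum:", best)
```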

Connection to Eigenvalue Approximation#

The optimality theorem is the key to understanding why the Ritz values (the eigenvalues of \(H_k\)) converge to the eigenvalues of \(A\). The theorem states that the Arnoldi process finds a monic polynomial \(p_k(z)\) that makes the vector \(p_k(A)q_1\) as small as possible in the 2-norm.

Let’s analyze what this means for the roots of \(p_k(z)\). The roots of the characteristic polynomial \(p_k(z) = \det(zI - H_k)\) are precisely the Ritz values, which we denote by \(\mu_1, \dots, \mu_k\). Thus, we can write:

\[ p_k(z) = \prod_{j=1}^k (z - \mu_j) \]

To see how this relates to the eigenvalues of \(A\), let’s assume \(A\) is diagonalizable, so \(A = X \Lambda X^{-1}\) where \(\Lambda = \text{diag}(\lambda_1, \dots, \lambda_n)\). We can express the starting vector \(q_1\) in the basis of eigenvectors: \(q_1 = Xw\) for some vector of coefficients \(w\).

Now, we can evaluate the vector \(p_k(A)q_1\):

\[\begin{split} \begin{align} p_k(A) q_1 &= p_k(X \Lambda X^{-1}) (Xw) \\ &= X p_k(\Lambda) X^{-1} X w \\ &= X p_k(\Lambda) w \end{align} \end{split}\]

The matrix \(p_k(\Lambda)\) is a diagonal matrix whose entries are the polynomial \(p_k\) evaluated at each eigenvalue of \(A\):

\[ p_k(\Lambda) = \text{diag}\left( p_k(\lambda_1), \dots, p_k(\lambda_n) \right) = \text{diag}\left( \prod_{j=1}^k (\lambda_i - \mu_j) \right)_{i=1}^n \]

This gives us an expression for the norm that the Arnoldi process minimizes:

\[ \| p_k(A) q_1 \|_2 = \| X p_k(\Lambda) w \|_2 \]

The vector \(p_k(\Lambda)w\) has components \([p_k(\lambda_1)w_1, \dots, p_k(\lambda_n)w_n]^T\). Since \(X\) is invertible, \(\|p_k(\Lambda)w\|_2 \le \|X^{-1}\|_2 \, \|p_k(A)q_1\|_2\), so if the norm \(\|p_k(A)q_1\|_2\) is small, the vector \(p_k(\Lambda)w\) must also be small. This means that for each \(i\), the product \(|p_k(\lambda_i) w_i| = |w_i \prod_{j=1}^k (\lambda_i - \mu_j)|\) must be small.

If a particular coefficient \(w_i\) is not small (i.e., the starting vector \(q_1\) has a significant component in the direction of the eigenvector \(x_i\)), then the product \(\prod_{j=1}^k (\lambda_i - \mu_j)\) must be small. This can only happen if at least one Ritz value \(\mu_j\) is close to the eigenvalue \(\lambda_i\).
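The sketch below illustrates this convergence (reusing `arnoldi` from the earlier sketch; the planted dominant eigenvalues are an arbitrary construction for illustration): as \(k\) grows, the outermost Ritz value tracks the dominant eigenvalue of \(A\).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
A = rng.standard_normal((n, n)) / np.sqrt(n)   # bulk spectrum near the unit disk
A += np.diag([3.0, 2.5] + [0.0] * (n - 2))     # plant two dominant eigenvalues
q1 = rng.standard_normal(n)

lam = np.linalg.eigvals(A)
lam_max = lam[np.argmax(np.abs(lam))]

for k in (5, 10, 20, 40):
    _, H = arnoldi(A, q1, k)                   # from the earlier sketch
    ritz = np.linalg.eigvals(H[:k, :k])
    mu = ritz[np.argmax(np.abs(ritz))]
    print(k, abs(mu - lam_max))                # error typically shrinks as k grows
```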

Rigorous Statement for Non-Hermitian Matrices#

We will need the notion of left and right eigenvectors.

  • The right eigenvector \(x_i\) satisfies \(A x_i = \lambda_i x_i\).

  • The left eigenvector \(y_i\) satisfies \(y_i^H A = \lambda_i y_i^H\).

The condition number of the eigenvalue \(\lambda_i\) is \(1/\delta_i\), where \(\delta_i = |y_i^H x_i|\) (assuming \(\|x_i\|_2 = \|y_i\|_2 = 1\)). If the left and right eigenvectors are nearly orthogonal, \(\delta_i\) is close to zero, and the eigenvalue is called “ill-conditioned.”

Theorem 29 (Non-Hermitian Convergence)

Let \((\mu, \tilde{x})\) be an approximate eigenpair (e.g., a Ritz pair) with \(\|\tilde{x}\|_2 = 1\), and let \(r = A\tilde{x} - \mu\tilde{x}\) be the residual.

Then, there exists an eigenvalue \(\lambda_i\) of \(A\) such that:

\[ |\lambda_i - \mu| \le \frac{\|r\|_2}{\delta_i} + O(\|r\|_2^2) \]

where \(\delta_i = |y_i^H x_i|\) is the eigenvector-pair cosine for the eigenvalue \(\lambda_i\).

Ignoring the higher-order terms, this gives the approximation:

\[ \underbrace{|\lambda_i - \mu|}_{\text{Eigenvalue Error}} \lessapprox \frac{\overbrace{\|r\|_2}^{\text{Residual Norm}}}{\underbrace{\delta_i}_{\text{Eigenvector Cosine}}} \]
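The quantity \(\delta_i\) is easy to compute with SciPy's `scipy.linalg.eig`, which can return left eigenvectors. A small sketch with a hand-picked non-normal matrix (the entry 100 is an arbitrary coupling that pushes the left and right eigenvectors toward orthogonality):

```python
import numpy as np
from scipy.linalg import eig

A = np.array([[1.0, 100.0],
              [0.0, 2.0]])

lam, Y, X = eig(A, left=True, right=True)   # Y: left, X: right eigenvectors
for i in range(2):
    x = X[:, i] / np.linalg.norm(X[:, i])
    y = Y[:, i] / np.linalg.norm(Y[:, i])
    delta = abs(y.conj() @ x)               # delta_i = |y_i^H x_i|
    print(f"lambda = {lam[i].real:.1f}  delta = {delta:.2e}  1/delta = {1/delta:.2e}")
```

Here both eigenvalues have \(\delta_i \approx 10^{-2}\), so each carries a condition number near \(100\): a residual of size \(10^{-6}\) only guarantees an eigenvalue error of about \(10^{-4}\).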

⚠️ Challenges with This Bound#

This “eigenvalue-based” bound, while correct, is often insufficient to explain the complex convergence of the Arnoldi method. The primary challenge is that the bound \(|\lambda_i - \mu| \lessapprox \frac{\|r\|_2}{\delta_i}\) can be very large (i.e., pessimistic) even when the residual norm \(\|r\|_2\) is small. If an eigenvalue \(\lambda_i\) is ill-conditioned, its eigenvector-pair cosine \(\delta_i\) will be very small. This means that even a tiny residual \(\|r\|_2\) (which is guaranteed by a small \(h_{k+1,k}\)) does not guarantee a small eigenvalue error.

To gain a more precise understanding, we need a framework that moves beyond the eigenvalues themselves and considers the behavior of the resolvent \((zI - A)^{-1}\). This leads to the theory of pseudospectra, which provides a much more accurate and descriptive tool for analyzing Arnoldi convergence.

Pseudo-Spectral Interpretation#

The theory of pseudospectra, developed by Trefethen and Embree, provides the modern framework for understanding the behavior of the Arnoldi method for non-Hermitian matrices. It successfully explains the complex, and often seemingly erratic, convergence patterns that eigenvalue-based theories cannot.

Here is an explanation of the concept and its direct relationship to the Arnoldi method and its residual.

What is the Pseudospectrum?#

The spectrum (the set of eigenvalues) of a non-normal matrix can be a “misleading” indicator of the matrix’s behavior. The pseudospectrum provides a more descriptive picture.

The pseudospectrum has two common, equivalent definitions:

Definition 4 (Pseudospectrum (Almost Eigenvalues))

The pseudospectrum is the set of complex numbers that are “almost” eigenvalues. A number \(z\) is an \(\epsilon\)-almost eigenvalue if there exists a unit vector \(v\) such that \(\|(A - zI)v\|_2 \le \epsilon\). This is equivalent to stating that the smallest singular value satisfies \(\sigma_{\min}(A-zI) \le \epsilon\), or equivalently, \(\|(zI-A)^{-1}\|_2 \ge 1/\epsilon\).

Definition 5 (Pseudospectrum (Nearby Matrix Eigenvalues))

The \(\epsilon\)-pseudospectrum of a matrix \(A\), denoted \(\Lambda_\epsilon(A)\), is the set of all eigenvalues of matrices that are \(\epsilon\)-close to \(A\).

\[ \Lambda_\epsilon(A) = \{ z \in \mathbb{C} \mid z \in \Lambda(A+E) \text{ for some } E \text{ with } \|E\| \le \epsilon \} \]

For a normal (e.g., Hermitian) matrix, the \(\epsilon\)-pseudospectrum is simply the union of \(\epsilon\)-sized disks around the true eigenvalues. For a non-normal matrix, \(\Lambda_\epsilon(A)\) can be dramatically larger, forming complex regions in the plane that reveal the matrix’s sensitivity to perturbation.
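The first definition translates directly into a sketch for computing the pseudospectrum on a grid (a brute-force approach, fine for small matrices; contour plotting is omitted): evaluate \(\sigma_{\min}(zI - A)\) at each grid point and compare with \(\epsilon\).

```python
import numpy as np

# z lies in the eps-pseudospectrum iff sigma_min(zI - A) <= eps.
A = np.array([[1.0, 100.0],
              [0.0, 2.0]])
n = A.shape[0]

xs = np.linspace(-2, 5, 60)
ys = np.linspace(-3, 3, 60)
smin = np.empty((len(ys), len(xs)))
for i, y in enumerate(ys):
    for j, x in enumerate(xs):
        z = x + 1j * y
        smin[i, j] = np.linalg.svd(z * np.eye(n) - A, compute_uv=False)[-1]

eps = 1e-1
print("grid points inside Lambda_eps(A):", np.count_nonzero(smin <= eps))
```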

Trefethen-Embree Theory: Arnoldi Computes the Pseudospectrum#

The core idea of the Trefethen and Embree theory is that the Arnoldi method’s Ritz values are not “trying” to find the eigenvalues of \(A\). Instead, the Arnoldi method computes approximations to the pseudospectra of \(A\).

This perspective explains the “challenges” of non-Hermitian matrices:

  • Erratic Convergence: Ritz values that appear, seem to converge, and then vanish (“phantom” eigenvalues) are not failures. They are simply the algorithm successfully resolving the complex shapes of the pseudospectrum.

  • Ill-Conditioning: Scalar condition numbers (such as \(1/\delta_i\) above, or the global factor \(\kappa(X)\) when \(A = X\Lambda X^{-1}\)) compress non-normality into a single number. Pseudospectra provide a more detailed, local picture of this non-normality across the complex plane.

The Relationship Between Ritz Values and the Residual Norm#

This is the most crucial part of the theory. There is an exact, “backward error” relationship that proves that every Ritz value is an exact eigenvalue of a slightly perturbed matrix.

This relationship directly connects the Ritz values (\(\mu_j^{(k)}\)) to the residual norm (\(h_{k+1,k}\)).

The Arnoldi Relation:

At step \(k\), the Arnoldi iteration produces the decomposition: \(AQ_k = Q_k H_k + h_{k+1,k} q_{k+1} e_k^T\)

  • \(Q_k\) is the \(n \times k\) matrix of orthonormal basis vectors.

  • \(H_k\) is the \(k \times k\) upper Hessenberg matrix.

  • The Ritz values (\(\mu_j^{(k)}\)) are the eigenvalues of \(H_k\).

  • The term \(h_{k+1,k} q_{k+1} e_k^T\) is the residual matrix. Its norm \(\|AQ_k - Q_k H_k\|_2\) equals \(h_{k+1,k}\), since \(q_{k+1}\) and \(e_k\) are unit vectors.

The Backward Error Proof:

Let \(\mu\) be a Ritz value. By definition, it is an eigenvalue of \(H_k\), so there is a corresponding eigenvector \(y\) such that \(H_k y = \mu y\).

  1. Let \(x = Q_k y\). This \(x\) is the “Ritz vector” corresponding to \(\mu\).

  2. Let’s look at the residual for this specific Ritz pair \((A - \mu I)x\):

    • \(Ax - \mu x = A(Q_k y) - \mu(Q_k y)\)

    • Substitute the Arnoldi relation: \(Ax = (Q_k H_k + h_{k+1,k} q_{k+1} e_k^T) y\)

    • Substitute \(H_k y = \mu y\): \(Ax = Q_k(\mu y) + h_{k+1,k} q_{k+1} (e_k^T y)\)

    • This simplifies to: \(Ax - \mu x = (h_{k+1,k} \cdot (e_k^T y)) q_{k+1}\)

  3. This equation shows that the residual for the Ritz pair \((\mu, x)\) is confined to the direction of the next basis vector \(q_{k+1}\).

  4. We can now define a perturbation matrix \(E\) such that \((A+E)x = \mu x\).

    • \(E = - (h_{k+1,k} \cdot (e_k^T y)) q_{k+1} x^H / \|x\|^2\)

    • Since \(\|x\|_2 = \|Q_k y\|_2 = \|y\|_2\), the norm of this rank-one perturbation is exactly: \(\|E\|_2 = h_{k+1,k} |e_k^T y| / \|y\|_2\)

    • Because \(|e_k^T y| \le \|y\|_2\), this is often simplified to the bound \(\|E\|_2 \le h_{k+1,k}\).

Conclusion

This proves that any Ritz value \(\mu\) is the exact eigenvalue of a perturbed matrix \(A+E\), where the norm of the perturbation \(\|E\|\) is directly proportional to the Arnoldi residual norm \(h_{k+1,k}\).

By the second definition of the pseudospectrum, this means:

\(\mu \in \Lambda_\epsilon(A)\) where \(\epsilon = h_{k+1,k} |(e_k^T y)| / \|y\| \le h_{k+1,k}\)

In short: The Arnoldi method’s Ritz values are not “approximations” to the pseudospectrum. They are, by construction, guaranteed members of the \(\epsilon\)-pseudospectrum, where the perturbation norm \(\epsilon\) is bounded by the Arnoldi residual norm \(h_{k+1,k}\).
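This backward-error identity can be verified directly (a sketch reusing `arnoldi` from the earlier sketch; one Ritz pair is checked, but the same holds for all of them):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 30, 6
A = rng.standard_normal((n, n))
q1 = rng.standard_normal(n)

Q, H = arnoldi(A, q1, k)                      # from the earlier sketch
Hk, h = H[:k, :k], H[k, k - 1]                # H_k and h_{k+1,k}
mus, Ys = np.linalg.eig(Hk)

mu, y = mus[0], Ys[:, 0]                      # one Ritz pair of H_k
x = Q[:, :k] @ y                              # the corresponding Ritz vector

# E from the proof: (A + E) x = mu x exactly.
E = -h * y[-1] * np.outer(Q[:, k], x.conj()) / np.linalg.norm(x) ** 2

print(np.linalg.norm((A + E) @ x - mu * x))   # ~ 0: mu is an exact eigenvalue of A+E
print(np.linalg.norm(E, 2), "<=", h)          # ||E||_2 <= h_{k+1,k}
```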

Summary#

The Arnoldi process constructs a sequence of monic polynomials whose application to \(q_1\) has minimal norm. When this norm is small, the roots of these polynomials (the Ritz values) approximate eigenvalues of \(A\), with the quality of approximation depending on the spectral properties of \(A\) and the initial vector \(q_1\).