Quasi-Newton Methods

See Also: Unconstrained Optimization

Quasi-Newton methods, or variable metric methods, can be used when the Hessian matrix is difficult or time-consuming to evaluate. Instead of obtaining an estimate of the Hessian matrix at a single point, these methods gradually build up an approximate Hessian matrix by using gradient information from some or all of the previous iterates \(x_k\) visited by the algorithm. Given the current iterate \(x_k\), and the approximate Hessian matrix \(B_k\) at \(x_k\), the linear system
\[B_kd_k = -\nabla f(x_k)\] is solved to generate a direction \(d_k\). The next iterate is then found by performing a line search along \(d_k\) and setting \(x_{k+1} = x_k + \alpha_k d_k\). The question is then: How can we use the function and gradient information from points \(x_k\) and \(x_{k+1}\) to improve the quality of the approximate Hessian matrix \(B_k\)? In other words, how do we obtain the new approximate Hessian matrix \(B_{k+1}\) from the previous approximation \(B_k\)?
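To make the structure of the iteration concrete, the following Python/NumPy sketch implements one version of this loop. The names `f`, `grad`, `line_search`, and `update` are assumed, user-supplied routines (the objective, its gradient, a step-length procedure, and a Hessian-update rule); they are placeholders rather than the API of any particular code.

```python
import numpy as np

def quasi_newton(f, grad, x0, update, line_search, tol=1e-6, max_iter=200):
    """Generic quasi-Newton iteration (sketch).

    `update(B, s, y)` returns the new Hessian approximation B_{k+1};
    `line_search(f, grad, x, d)` returns a step length alpha_k.
    Both are assumed to be supplied by the caller.
    """
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                 # initial Hessian approximation B_0
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(B, -g)     # solve B_k d_k = -grad f(x_k)
        alpha = line_search(f, grad, x, d)
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g    # step s_k and gradient change y_k
        B = update(B, s, y)            # build B_{k+1} from gradient information
        x, g = x_new, g_new
    return x
```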

The answer to this question rests on what is sometimes called the fundamental theorem of integral calculus. If we define
\[s_k = x_{k+1} - x_k, \quad\quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),\] then this theorem implies that
\[ \left\{ \int_0^1 \nabla^2 f(x_k + ts_k) dt \right\} s_k = y_k.\]

The matrix in braces can be interpreted as the average of the Hessian matrix on the line segment \([x_k, x_k + s_k]\). This result states that when this matrix is multiplied by the vector \(s_k\), the resulting vector is \(y_k\). In view of these observations, we can make \(B_{k+1}\) mimic the behavior of \(\nabla^2f\) by enforcing the quasi-Newton condition
\[B_{k+1}s_k = y_k.\]
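For example, if \(f\) is the strictly convex quadratic \(f(x) = \tfrac{1}{2}x^T A x - b^T x\), then \(\nabla f(x) = Ax - b\) and the Hessian is the constant matrix \(A\), so the average in braces is simply \(A\) and the identity reduces to
\[A s_k = y_k.\]
The quasi-Newton condition thus asks \(B_{k+1}\) to reproduce the action of the (average) Hessian along the most recent step.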

This condition can be satisfied by making a simple low-rank update to \(B_k\). The most commonly used family of updates is the Broyden class of rank-two updates, which have the form
\[B_{k+1} = B_k - \frac{B_k s_k(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} + \phi_k [s_k^T B_k s_k]v_k v_k^T,\] where \(\phi_k \in [0,1]\) and
\[v_k = \left[ \frac{y_k}{y_k^T s_k} - \frac{B_k s_k}{s_k^T B_k s_k} \right].\]
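A minimal NumPy sketch of this rank-two update is given below; the routine name `broyden_update` is an illustrative helper, and the parameter `phi` plays the role of \(\phi_k\).

```python
import numpy as np

def broyden_update(B, s, y, phi=0.0):
    """Broyden-class rank-two update of the Hessian approximation B.

    phi = 0 gives the BFGS update, phi = 1 the DFP update.
    Assumes y @ s > 0 so the update is well defined.
    """
    Bs = B @ s
    sBs = s @ Bs                      # s_k^T B_k s_k
    ys = y @ s                        # y_k^T s_k
    v = y / ys - Bs / sBs             # the vector v_k
    return (B
            - np.outer(Bs, Bs) / sBs  # subtract B_k s_k (B_k s_k)^T / s_k^T B_k s_k
            + np.outer(y, y) / ys     # add y_k y_k^T / y_k^T s_k
            + phi * sBs * np.outer(v, v))
```

One can verify numerically that the result satisfies the quasi-Newton condition: for any `phi`, `broyden_update(B, s, y, phi) @ s` reproduces `y` up to rounding error, because \(v_k^T s_k = 0\).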

The choice \(\phi_k = 0\) gives the Broyden-Fletcher-Goldfarb-Shanno update, which practical experience and some theoretical analysis have shown to be the method of choice in most circumstances. The Davidon-Fletcher-Powell update, which was proposed earlier, is obtained by setting \(\phi_k = 1\). These two update formulae are known universally by their initials, BFGS and DFP, respectively.

Updates in the Broyden class preserve the positive definiteness of \(B_k\) as long as \(y_k^T s_k > 0\). Although this condition holds automatically if \(f\) is strictly convex, it can be enforced for general functions by requiring that \(\alpha_k\) satisfy the curvature condition
\[\nabla f(x_k + \alpha_k d_k)^T d_k \geq c\, \nabla f(x_k)^T d_k, \quad c \in (0,1),\]
which is part of the Wolfe conditions used by most line-search procedures. Some codes avoid enforcing the curvature condition by skipping the update if \(y_k^T s_k \leq 0\).
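A minimal sketch of this update-skipping safeguard, assuming NumPy and any Broyden-class `update` routine such as the one sketched above (the relative tolerance `eps` is an illustrative choice, not a prescribed value):

```python
import numpy as np

def safeguarded_update(B, s, y, update, eps=1e-8):
    """Apply the quasi-Newton `update` only when y^T s is safely positive.

    Skipping the update when y^T s <= eps * ||s|| * ||y|| preserves the
    positive definiteness of B; `update` is any Broyden-class routine.
    """
    if y @ s > eps * np.linalg.norm(s) * np.linalg.norm(y):
        return update(B, s, y)
    return B  # skip the update and reuse the current approximation
```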

GAUSS, IMSL, MATLAB, NAG(FORTRAN), NAG(C), OPTIMA, and PROC NLP implement quasi-Newton methods. These codes differ in the choice of update (usually BFGS), the line-search procedure, and the way in which \(B_k\) is stored and updated. We can update \(B_k\) either by updating the Cholesky factorization of \(B_k\) or by updating the inverse of \(B_k\). In either case, the cost of computing the search direction by solving the system
\[B_k d_k = -\nabla f(x_k)\] is on the order of \(n^2\) operations. Updating the Cholesky factorization is widely regarded as more reliable, while updating the inverse of \(B_k\) is less complicated. Indeed, if we define
\[H_k = B_k^{-1},\] then a BFGS update of \(B_k\) is equivalent to the following update of \(H_k\):
\[H_{k+1} = \left( I - \frac{s_k y_k^T}{y_k^T s_k} \right) H_k \left( I - \frac{y_k s_k^T}{y_k^T s_k} \right) + \frac{s_k s_k^T}{y_k^T s_k}.\]
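In code, this update can be applied directly to a stored inverse approximation. The sketch below assumes NumPy; `bfgs_inverse_update` is an illustrative name rather than an established routine.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update applied to the inverse Hessian approximation H = B^{-1}."""
    rho = 1.0 / (y @ s)                      # 1 / y_k^T s_k
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)             # I - s_k y_k^T / y_k^T s_k
    return V @ H @ V.T + rho * np.outer(s, s)
```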

When we store \(H_k\) explicitly, the direction \(d_k\) is obtained from the matrix-vector product
\[d_k = -H_k \nabla f(x_k).\]
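The two storage strategies described above thus lead to two different ways of forming \(d_k\). A minimal sketch of both, assuming NumPy and SciPy are available; note that production codes which keep a Cholesky factorization update the factors after each rank-two change, rather than refactorizing from scratch at \(O(n^3)\) cost as is done here for clarity.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def direction_from_inverse(H, g):
    """Search direction from an explicitly stored inverse approximation H_k."""
    return -H @ g                      # d_k = -H_k grad f(x_k)

def direction_from_cholesky(B, g):
    """Search direction from B_k via its Cholesky factorization."""
    c, lower = cho_factor(B)           # B_k = L L^T (B_k assumed positive definite)
    return cho_solve((c, lower), -g)   # solve B_k d_k = -grad f(x_k)
```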

The availability of quasi-Newton methods renders steepest-descent methods obsolete. Both types of algorithms require only first derivatives, and both require a line search. The quasi-Newton algorithms require slightly more operations to calculate an iterate and somewhat more storage, but in almost all cases, these additional costs are outweighed by the advantage of superior convergence.

At first glance, quasi-Newton methods may seem unsuitable for large problems because the approximate Hessian matrices and inverse Hessian matrices are generally dense. This is not the case, however, because explicit storage of \(B_k\) or \(H_k\) as \(n \times n\) matrices is not necessary. For example, the above expression for the BFGS update of \(H_k\) makes it clear that we can compute
\[H_k \nabla f(x_k)\] if we know the initial matrix \(H_0\); the subsequent update vectors \(s_i, y_i\); and their inner products \(y_i^T s_i\) for \(0 \leq i < k\). If \(H_0\) is chosen to be a diagonal matrix, the necessary information can be stored in about \(2nk\) words of memory. Limited-memory quasi-Newton methods use these ideas to cut down on storage for large problems. They store only the \(s_i\) and \(y_i\) vectors from the previous \(m\) iterates (typically, \(m = 5\)) and compute the vector \(d_k\) by a recursion that requires roughly \(16nm\) operations. The L-BFGS code is an implementation of the limited-memory BFGS algorithm. The codes M1QN2 and M1QN3 are the same as L-BFGS, except that they allow the user to specify a preconditioning technique.
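The computation of \(d_k\) from the stored pairs is usually organized as the so-called two-loop recursion. A minimal sketch, under the assumption that the \(m\) most recent pairs are kept in Python lists (oldest first) and that \(H_0 = \gamma I\) for a positive scalar \(\gamma\):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list, gamma=1.0):
    """Two-loop recursion: compute d_k = -H_k grad f(x_k) from stored pairs.

    s_list, y_list hold the m most recent (s_i, y_i) pairs, oldest first;
    gamma scales the initial matrix H_0 = gamma * I.
    """
    q = g.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q = q - alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    r = gamma * q                                          # apply H_0
    for (s, y), rho, alpha in zip(zip(s_list, y_list),     # oldest to newest
                                  reversed(rhos), reversed(alphas)):
        beta = rho * (y @ r)
        r = r + (alpha - beta) * s
    return -r                                              # d_k = -H_k grad f(x_k)
```

A common choice for the scaling is \(\gamma = s_{k-1}^T y_{k-1} / y_{k-1}^T y_{k-1}\), which adjusts \(H_0\) to the most recent curvature information.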