Difference Approximations

See Also: Unconstrained Optimization

For unconstrained optimization problems in which the exact Hessian matrix is not available, the same algorithms can be used with the Hessian matrix replaced by an approximation. One method for approximating the Hessian matrix is to use difference approximations. Difference approximation methods exploit the fact that each column of the Hessian can be approximated by taking the difference between two instances of the gradient vector evaluated at two nearby points. For sparse Hessians, often many columns of the Hessian can be approximated with a single gradient evaluation by choosing the evaluation points judiciously.

If forward differences are used, then the \(i\)th column of the Hessian matrix is replaced by
\[\frac{\nabla f(x_k + h_i e_i) - \nabla f(x_k)}{h_i}\] for some suitable choice of difference parameter \(h_i\). Here, \(e_i\) is the vector with \(1\) in the \(i\)th position and zeros elsewhere. Similarly, if central differences are used, the \(i\)th column is replaced by
\[\frac{\nabla f(x_k + h_i e_i) - \nabla f(x_k - h_i e_i)}{2h_i}.\]
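Both formulas translate directly into column-by-column code. The sketch below (Python with NumPy, chosen purely for illustration; the function names, the symmetrization step, and the user-supplied `grad` routine are assumptions rather than part of any particular solver) builds the full approximate Hessian from \(n\) extra gradient evaluations for forward differences, or \(2n\) for central differences.

```python
import numpy as np

def forward_diff_hessian(grad, x, h):
    """Forward-difference Hessian approximation: n extra gradient evaluations."""
    n = x.size
    g0 = grad(x)                          # gradient at the base point x_k
    H = np.empty((n, n))
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        # i-th column: (grad(x + h_i e_i) - grad(x)) / h_i
        H[:, i] = (grad(x + h[i] * e_i) - g0) / h[i]
    return 0.5 * (H + H.T)                # symmetrize the approximation

def central_diff_hessian(grad, x, h):
    """Central-difference Hessian approximation: 2n gradient evaluations, higher accuracy."""
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        # i-th column: (grad(x + h_i e_i) - grad(x - h_i e_i)) / (2 h_i)
        H[:, i] = (grad(x + h[i] * e_i) - grad(x - h[i] * e_i)) / (2.0 * h[i])
    return 0.5 * (H + H.T)
```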

An appropriate choice of the difference parameter \(h_i\) can be difficult. Rounding errors overwhelm the calculation if \(h_i\) is too small, while truncation errors dominate if \(h_i\) is too large. Newton codes rely on forward differences, since they often yield sufficient accuracy for reasonable values of \(h_i\). Central differences are more accurate, but they require twice the work (\(2n\) gradient evaluations versus \(n\)).
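One common rule of thumb, sketched here only as an illustration rather than a prescription from the text above, balances these two error sources by taking \(h_i\) proportional to the square root of the machine precision, scaled by the size of \(x_i\):

```python
import numpy as np

def forward_difference_steps(x):
    # Roughly balance rounding error (which favours larger h_i) against
    # truncation error (which favours smaller h_i):
    # h_i ~ sqrt(machine epsilon) * max(|x_i|, 1).
    eps = np.finfo(float).eps             # about 2.2e-16 in double precision
    return np.sqrt(eps) * np.maximum(np.abs(x), 1.0)
```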

Variants of Newton’s method for problems with a large number of variables cannot use the above techniques to approximate the Hessian matrix because the cost of \(n\) gradient evaluations is prohibitive. For problems with a sparse Hessian matrix, however, it is possible to use specialized techniques based on graph coloring that allow difference approximations to the Hessian matrix to be computed efficiently. For example, if the Hessian matrix has bandwidth \(2b+1\), then only \(b+1\) gradient evaluations are required.
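For the banded case, one way the \(b+1\)-evaluation count can be realized is the substitution idea usually credited to Powell and Toint: perturb every \((b+1)\)st variable at once, then recover individual entries by exploiting symmetry. The sketch below is an illustrative assumption throughout (the function name, the fixed step `h`, and the dense storage of `H` are not taken from any particular code).

```python
import numpy as np

def banded_hessian_forward_diff(grad, x, b, h=1e-6):
    """Symmetric Hessian of half-bandwidth b (bandwidth 2b+1) from b+1
    forward-difference gradient evaluations, plus one gradient at x."""
    n = x.size
    g0 = grad(x)
    p = b + 1                              # number of column groups
    y = np.empty((p, n))
    for k in range(p):
        d = np.zeros(n)
        d[k::p] = h                        # perturb all columns of group k at once
        y[k] = (grad(x + d) - g0) / h      # ~ sum of the Hessian columns in group k
    H = np.zeros((n, n))
    # Substitution: sweep rows from the bottom up and peel off entries
    # already recovered in lower rows.
    for i in range(n - 1, -1, -1):
        H[i, i] = y[i % p][i]              # only column i of its group meets row i
        for j in range(i - 1, max(-1, i - b - 1), -1):
            H[i, j] = y[j % p][i]
            if j + p < n:                  # remove the other in-band column of the group
                H[i, j] -= H[j + p, i]
            H[j, i] = H[i, j]              # fill the upper triangle by symmetry
    return H
```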

VE08 is designed to solve optimization problems with a large number of variables where \(f\) is a partially separable function; that is, \(f\) can be written in the form
\[f(x) = \sum_{i=1}^m f_i(x),\] where each function \(f_i: \mathbb{R}^n \rightarrow \mathbb{R}\) has an invariant subspace
\[N_i = \{\, w \in \mathbb{R}^n : f_i(x+w) = f_i(x) \text{ for all } x \in \mathbb{R}^n \,\}\] whose dimension is large relative to the number of variables \(n\). This is the case, in particular, if \(f_i\) depends only on a small number (typically, fewer than ten) of the components of \(x\). Functions with sparse Hessian matrices are partially separable. Indeed, most functions that arise in large-scale problems are partially separable. An advantage of algorithms designed for these problems is that techniques for approximating a dense Hessian matrix (for example, forward differences) can be used to approximate the nontrivial part of the element Hessian matrix \(\nabla^2 f_i(x)\). Approximations to \(\nabla^2 f(x)\) can be obtained by summing the approximations to \(\nabla^2 f_i(x)\).
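As a concrete illustration of that last point, the sketch below (an assumption in every detail: the element-gradient callables `elem_grads`, the index sets `idx`, and the fixed step `h` are hypothetical, not part of VE08) approximates each small element Hessian by dense forward differences and scatter-adds it into the full matrix.

```python
import numpy as np

def partially_separable_hessian(elem_grads, idx, x, h=1e-6):
    """Assemble an approximation to the Hessian of f(x) = sum_i f_i(x),
    where elem_grads[i](z) returns the gradient of f_i with respect to its
    own variables z = x[idx[i]]."""
    n = x.size
    H = np.zeros((n, n))
    for gi, cols in zip(elem_grads, idx):
        z = x[cols]                        # the few variables f_i depends on
        m = len(cols)
        g0 = gi(z)
        Hi = np.empty((m, m))
        for j in range(m):
            zp = z.copy()
            zp[j] += h
            Hi[:, j] = (gi(zp) - g0) / h   # forward-difference column of the element Hessian
        Hi = 0.5 * (Hi + Hi.T)             # symmetrize the small dense block
        H[np.ix_(cols, cols)] += Hi        # scatter-add into the full Hessian
    return H
```

Because each element Hessian is small and dense, the per-element cost is only a handful of gradient evaluations of \(f_i\), independent of \(n\).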