Unconstrained Optimization

Unconstrained optimization considers the problem of minimizing an objective function that depends on real variables, with no restrictions on their values. Mathematically, let \(x \in \mathcal{R}^n\) be a real vector with \(n \geq 1\) components and let \(f : \mathcal{R}^n \rightarrow \mathcal{R}\) be a smooth function. Then the unconstrained optimization problem is \[\min_x \; f(x).\]

Unconstrained optimization problems arise directly in some applications, but they also arise indirectly from reformulations of constrained optimization problems. Often it is practical to replace the constraints of an optimization problem with penalty terms in the objective function and to solve the problem as an unconstrained problem.
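
For example, an equality constraint \(c(x) = 0\) can be replaced by a quadratic penalty term, giving the unconstrained problem \[\min_x \; f(x) + \mu \, c(x)^2\] for some penalty parameter \(\mu > 0\); as \(\mu\) is increased, minimizers of the penalized problem are driven toward satisfying the original constraint.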

Algorithms

An important aspect of continuous optimization (constrained and unconstrained) is whether the functions are smooth, by which we mean that the second derivatives exist and are continuous. There has been extensive study and development of algorithms for the unconstrained optimization of smooth functions. At a high level, algorithms for unconstrained minimization follow this general structure:

  • Choose a starting point \(x_0\).
  • Beginning at \(x_0\), generate a sequence of iterates \(\{x_k\}_{k=0}^{\infty}\) with non-increasing values of \(f\) until a solution point with sufficient accuracy is found or until no further progress can be made.

To generate the next iterate \(x_{k+1}\), the algorithm uses information about the function at \(x_k\) and possibly earlier iterates.
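
As a minimal sketch of this general structure (not any particular method), the following Python loop uses a placeholder steepest-descent step with simple halving; the callables f and grad, which return the objective value and its gradient, are assumed to be supplied by the user.

```python
import numpy as np

def minimize_sketch(f, grad, x0, tol=1e-6, max_iter=1000):
    """Generic unconstrained minimization loop: generate iterates with
    non-increasing f values until the gradient is small enough or no
    further progress can be made."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # solution point accurate enough
            break
        fx, step = f(x), 1.0
        while f(x - step * g) >= fx and step > 1e-12:
            step *= 0.5                  # shrink the step until f decreases
        if step <= 1e-12:                # no further progress can be made
            break
        x = x - step * g
    return x

# Example usage on a simple quadratic: minimum at the origin.
x_min = minimize_sketch(lambda x: float(x @ x), lambda x: 2.0 * x,
                        np.array([3.0, -4.0]))
```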

Newton's Method

Newton's Method gives rise to a wide and important class of algorithms that require computation of the gradient vector
\[\nabla f(x) = \left(\partial_1 f(x), \ldots , \partial_n f(x) \right)^T\]
and the Hessian matrix
\[\nabla^2 f(x) = \left[ \partial_i \, \partial_j f(x) \right].\]

Although the computation or approximation of the Hessian can be a time-consuming operation, there are many problems for which this computation is justified.

Wikipedia Link to Newton's Method in Optimization

Newton's method forms a quadratic model of the objective function around the current iterate. The model function is defined by
\[q_k(s) = f(x_k) + \nabla f(x_k)^T s + \frac{1}{2} s^T \nabla^2 f(x_k) s.\]

In the basic Newton method, the next iterate is obtained from the minimizer of \(q_k(s)\). When the Hessian matrix, \(\nabla^2 f(x_k)\), is positive definite, the quadratic model has a unique minimizer \(s_k\) that can be obtained by solving the symmetric \(n \times n\) linear system:
\[\nabla^2 f(x_k) \, s_k = - \nabla f(x_k).\]
The next iterate is then
\[x_{k+1} = x_k + s_k.\]
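
In code, one step of this basic method is just a symmetric linear solve. The sketch below is illustrative only; grad and hess are hypothetical callables returning \(\nabla f\) and \(\nabla^2 f\) at a point.

```python
import numpy as np

def basic_newton_step(grad, hess, x_k):
    """One step of the basic Newton method: solve
    hess(x_k) s_k = -grad(x_k) and set x_{k+1} = x_k + s_k."""
    s_k = np.linalg.solve(hess(x_k), -grad(x_k))   # symmetric n x n system
    return x_k + s_k
```
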
Convergence is guaranteed if the starting point is sufficiently close to a local minimizer \(x^*\) at which the Hessian is positive definite. Moreover, the rate of convergence is quadratic, that is,
\[ \| x_{k+1} - x^* \| \leq \beta \| x_k - x^* \|^2\]
for some positive constant \(\beta\). In most circumstances, however, the basic Newton method has to be modified to achieve convergence. There are two fundamental strategies for moving from \(x_k\) to \(x_{k+1}\): line search and trust region. Most algorithms follow one of these two strategies.

  • The line-search method modifies the search direction to obtain another downhill, or descent, direction for \(f\). It then tries different step lengths along this direction until it finds a step that not only decreases \(f\) but also achieves at least a small fraction of this direction's potential. A sketch of a simple backtracking line search appears after this list.

    Wikipedia Link to Line Search

  • Trust-region methods use the original quadratic model function, but they constrain the new iterate to stay within a local neighborhood of the current iterate. To find the step, it is necessary to minimize the quadratic function subject to staying in this neighborhood, which is generally ellipsoidal in shape.

    Wikipedia Link to Trust Region
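
As a concrete illustration of the line-search strategy in the first bullet above, here is a minimal backtracking sketch that enforces a sufficient-decrease (Armijo) condition; the callables f and grad and the constants alpha0, rho, and c are illustrative choices rather than prescribed values.

```python
import numpy as np

def backtracking_line_search(f, grad, x, p, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink the step length along a descent direction p until f decreases
    by at least a small fraction c of the decrease predicted by the
    directional derivative (assumes grad(x) @ p < 0)."""
    alpha = alpha0
    fx = f(x)
    slope = grad(x) @ p            # directional derivative along p
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho               # reduce the step and try again
    return alpha
```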

Line-search and trust-region techniques are suitable if the number of variables \(n\) is not too large, because the cost per iteration is of order \(n^3\). Codes for problems with a large number of variables tend to use truncated Newton methods, which usually settle for an approximate minimizer of the quadratic model.

Wikipedia Link to Truncated Newton Method
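
One common way to realize a truncated Newton method is to apply the linear conjugate gradient method to the Newton system and stop it early. The sketch below assumes the gradient vector grad_k and the Hessian matrix hess_k at the current iterate are given; in large-scale codes the product hess_k @ p would typically be computed without ever forming the Hessian.

```python
import numpy as np

def truncated_newton_direction(grad_k, hess_k, max_iter=10, tol=1e-8):
    """Approximately solve hess_k s = -grad_k with linear conjugate
    gradients, truncating after a few iterations or when a direction of
    non-positive curvature is encountered."""
    s = np.zeros(grad_k.shape)
    r = -np.asarray(grad_k, dtype=float)       # residual of the Newton system
    p = r.copy()                               # current conjugate direction
    rr = r @ r
    for _ in range(max_iter):
        Hp = hess_k @ p
        curvature = p @ Hp
        if curvature <= 0.0:                   # model not convex along p
            return s if s.any() else -grad_k   # fall back to steepest descent
        alpha = rr / curvature
        s = s + alpha * p
        r = r - alpha * Hp
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:              # system solved accurately enough
            break
        p = r + (rr_new / rr) * p              # next conjugate direction
        rr = rr_new
    return s
```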

Methods with Hessian Approximations

If computing the exact Hessian matrix is not practical, the same algorithms can be used with a reasonable approximation of the Hessian matrix. Two types of methods use approximations to the Hessian in place of the exact Hessian.

  • One approach is to use difference approximations to the exact Hessian. Difference approximations exploit the fact that each column of the Hessian can be approximated by taking the difference between two instances of the gradient vector evaluated at two nearby points. For sparse Hessians, it is often possible to approximate many columns of the Hessian with a single gradient evaluation by choosing the evaluation points judiciously.
  • Quasi-Newton Methods build up an approximation to the Hessian by keeping track of the gradient differences along each step taken by the algorithm. Various conditions are imposed on the approximate Hessian. For example, its behavior along the step just taken is forced to mimic the behavior of the exact Hessian, and it is usually kept positive definite. A sketch of the widely used BFGS update appears after this list.

    Wikipedia Link to Quasi-Newton Method
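
The best-known quasi-Newton scheme is the BFGS update. The sketch below (a minimal illustration, not a complete algorithm) updates an approximation H to the inverse Hessian from the step s just taken and the corresponding change in the gradient y; in practice the update is skipped or damped when \(y^T s\) is not sufficiently positive, so that H stays positive definite.

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of an inverse-Hessian approximation H.

    s = x_{k+1} - x_k is the step just taken and y is the change in the
    gradient; the updated matrix satisfies the secant condition H_new y = s,
    so its behavior along the step mimics that of the exact Hessian."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```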

Other Methods for Unconstrained Optimization

There are two other approaches for unconstrained problems that are not so closely related to Newton's method.

  • Nonlinear conjugate gradient methods are motivated by the success of the linear conjugate gradient method in minimizing quadratic functions with positive definite Hessians. They use search directions that combine the negative gradient direction with another direction, chosen so that the search will take place along a direction not previously explored by the algorithm. This property is guaranteed at least for the quadratic case, for which the minimizer is found exactly in at most \(n\) iterations. For nonlinear problems, performance can be problematic, but these methods do have the advantage that they require only gradient evaluations and do not use much storage. A sketch of the Fletcher-Reeves direction update appears after this list.

    Wikipedia Link to Nonlinear Conjugate Gradient Method

  • The nonlinear simplex method (not to be confused with the simplex method for linear programming) requires neither gradient nor Hessian evaluations. Instead, it performs a pattern search based only on function values. Because it makes little use of information about \(f\), it typically requires a great many iterations to find a solution that is even in the ballpark. It can be useful when \(f\) is nonsmooth or when derivatives are impossible to find, but it is unfortunately often used when one of the algorithms above would be more appropriate.
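
As an illustration of the first bullet, the Fletcher-Reeves variant combines the new negative gradient with the previous search direction as sketched below; a complete method would also need a line search and periodic restarts. For the nonlinear simplex method in the second bullet, a library routine such as scipy.optimize.minimize(f, x0, method='Nelder-Mead') is typically used rather than a hand-written implementation.

```python
import numpy as np

def fletcher_reeves_direction(grad_new, grad_old, d_old):
    """Fletcher-Reeves update of the nonlinear conjugate gradient search
    direction: on a quadratic with exact line searches, successive
    directions generated this way are conjugate."""
    beta = (grad_new @ grad_new) / (grad_old @ grad_old)
    return -grad_new + beta * d_old
```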

Related Problems

  • The Nonlinear Least-Squares Problem is a special case of unconstrained optimization. It arises in many practical problems, especially in data-fitting applications. The objective function \(f\) has the form
    \[f(x) = \frac{1}{2} \sum_{j=1}^{m} r_j^2(x),\]
    where each \(r_j\) is a smooth function from \(\mathcal{R}^n\) to \(\mathcal{R}\). The special form of \(f\) and its derivatives has been exploited to develop efficient algorithms for minimizing \(f\); a sketch of the Gauss-Newton step appears after this list.
  • The problem of solving a system of Nonlinear Equations is related to unconstrained optimization in that a number of algorithms for nonlinear equations proceed by minimizing a sum of squares. It often arises in problems involving physical systems. In nonlinear equations, there is no objective function to optimize but instead the goal is to find values of the variables that satisfy a set of \(n\) equality constraints.
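
As an illustration of how the least-squares structure is exploited, a Gauss-Newton step replaces the Hessian of \(f\) by \(J^T J\), where \(J\) is the Jacobian of the residual vector \(r(x)\). The sketch below uses hypothetical callables residual and jacobian supplied by the user.

```python
import numpy as np

def gauss_newton_step(residual, jacobian, x):
    """One Gauss-Newton step for f(x) = (1/2) * sum_j r_j(x)^2.

    Dropping the second-order residual terms from the Hessian gives the
    normal equations  J^T J s = -J^T r, solved here in least-squares form."""
    r = residual(x)
    J = jacobian(x)
    s, *_ = np.linalg.lstsq(J, -r, rcond=None)   # solve J s ≈ -r
    return x + s
```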
