Newton Methods

See Also: Constrained Optimization, Bound-Constrained Optimization

Many codes for bound constrained optimization implement line-search and trust-region versions of unconstrained minimization algorithms, so the discussion here is brief, emphasizing the differences between the unconstrained and bound-constrained cases.

A line-search method for bound-constrained problems generates a sequence of iterates by setting \(x_{k+1} = x_k + \alpha_k d_k\), where \(x_k\) is a feasible approximation to the solution, \(d_k\) is a search direction, and \(\alpha_k > 0\) is the step length. The direction \(d_k\) is obtained as an approximate minimizer of the subproblem
\[\min_d \left\{ \nabla f(x_k)^T d + \frac{1}{2} d^T B_k d \; : \; d_i = 0 \; \forall i \in W_k\right\} \quad\quad(1.1)\] where \(W_k\) is the "working" set and \(B_k\) is an approximation to the Hessian matrix of \(f(x)\) at \(x_k\). All variables in the working set \(W_k\) are fixed during this iteration, while all other variables are in the free set \(F_k\). We can express this subproblem in terms of the free variables by noting that it is equivalent to the unconstrained problem
\[\min_w \left\{ g_k^T w + \frac{1}{2} w^T A_k w \; : \; w \in R^{m_k} \right\}\] where \(m_k\) is the number of free variables, \(A_k\) is the matrix obtained from \(B_k\) by taking those rows and columns whose indices correspond to the free variables, and \(g_k\) is obtained from \(\nabla f(x_k)\) by taking the components whose indices are in the free set \(F_k\).
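The reduction to the free variables is straightforward to express in code. The sketch below (a minimal illustration assuming NumPy; the function names and the convention that the working set is given as an index array are assumptions, not part of any particular solver) extracts \(g_k\) and \(A_k\) from the full gradient and Hessian approximation, and recovers the full-space direction \(d_k\) by solving the reduced Newton system and setting the working-set components to zero:

```python
import numpy as np

def reduced_subproblem(grad, B, working_set):
    """Restrict the quadratic model to the free variables.

    grad        -- gradient of f at x_k (length n)
    B           -- Hessian approximation B_k (n x n)
    working_set -- indices of the variables fixed at their bounds (W_k)
    Returns the free indices F_k, the reduced gradient g_k,
    and the reduced Hessian A_k.
    """
    free = np.setdiff1d(np.arange(grad.size), working_set)  # F_k
    return free, grad[free], B[np.ix_(free, free)]

def newton_direction(grad, B, working_set):
    """Solve A_k w = -g_k on the free variables and scatter the
    result into a full-space direction with d_i = 0 for i in W_k.
    Assumes A_k is positive definite (see the indefinite case below)."""
    free, g_k, A_k = reduced_subproblem(grad, B, working_set)
    d = np.zeros_like(grad, dtype=float)
    d[free] = np.linalg.solve(A_k, -g_k)
    return d
```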

The main requirement on \(W_k\) is that \(d_k\) be a feasible direction, that is, \(x_k + \alpha d_k\) satisfies the constraints for all \(\alpha > 0\) sufficiently small. This is certainly the case if \(W_k = A(x_k)\), where
\[A(x_k) = \{ i : x_i = l_i\} \cup \{ i : x_i = u_i \}\] is the set of "active" constraints at \(x_k\). As long as progress is being made with the current \(W_k\), the next working set \(W_{k+1}\) is obtained by merging \(A(x_{k+1})\) with \(W_k\). This updating process continues until the function cannot be reduced much further with the current working set. At this point, the classical strategy is to drop a constraint in \(W_k\) for which \(\partial_i f(x_k)\) has the wrong sign, that is, \(i \in W_k\) but \(i \not\in B(x_k)\), where the binding set
\[B(x_k) = \{ i : x_i = l_i, \partial_i f(x_k) \geq 0 \} \cup \{ i : x_i = u_i, \partial_i f(x_k) \leq 0 \}\] is defined as before. In general, it is advantageous to drop more than one constraint, in the hope that the algorithm will make more rapid progress towards the optimal binding set. However, all dropping strategies are constrained by the requirement that the solution \(d_k\) of the subproblem be a feasible direction.
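As a concrete illustration, the active set \(A(x)\), the binding set \(B(x)\), and a simple dropping rule can be computed as follows (a sketch assuming NumPy; the tolerance `tol` replaces the exact comparisons \(x_i = l_i\) and \(x_i = u_i\) in floating point and is an assumption):

```python
import numpy as np

def active_set(x, l, u, tol=1e-10):
    """A(x): indices of variables at their lower or upper bound."""
    return np.where((x <= l + tol) | (x >= u - tol))[0]

def binding_set(x, grad, l, u, tol=1e-10):
    """B(x): active indices whose gradient component has the sign
    that keeps the variable at its bound."""
    at_lower = (x <= l + tol) & (grad >= 0)
    at_upper = (x >= u - tol) & (grad <= 0)
    return np.where(at_lower | at_upper)[0]

def drop_constraints(working_set, x, grad, l, u):
    """Drop every index in W_k that is not in B(x_k), i.e. every
    constraint for which the partial derivative has the wrong sign."""
    return np.intersect1d(working_set, binding_set(x, grad, l, u))
```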

An implementation of a line-search method based on subproblem (1.1) must cater to the situation in which the reduced Hessian matrix \(A_k\) is indefinite, because in this case the subproblem is unbounded below and has no solution. This situation may arise, for example, if \(B_k\) is the exact Hessian matrix or an approximation obtained by differencing the gradient. Here it is necessary to specify \(d_k\) by other means; for example, we can use a modified Cholesky factorization.
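A common way to realize this is to shift \(A_k\) by a multiple of the identity until a Cholesky factorization succeeds; the sketch below (assuming NumPy, with the starting shift chosen from the smallest diagonal entry) is a simplified stand-in for a full modified Cholesky factorization, which would instead perturb individual pivots during the factorization:

```python
import numpy as np

def shifted_cholesky_direction(g_k, A_k, beta=1e-3):
    """Compute a descent direction for the reduced subproblem even
    when A_k is indefinite, by factoring A_k + tau*I for increasing
    tau >= 0 until the Cholesky factorization succeeds."""
    m = A_k.shape[0]
    min_diag = np.min(np.diag(A_k))
    tau = 0.0 if min_diag > 0 else beta - min_diag
    while True:
        try:
            L = np.linalg.cholesky(A_k + tau * np.eye(m))
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, beta)
    # Solve (A_k + tau*I) w = -g_k using the triangular factor L.
    return np.linalg.solve(L.T, np.linalg.solve(L, -g_k))
```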

Quasi-Newton methods for bound-constrained problems update an approximation to the reduced Hessian matrix since, as already noted, only the reduced Hessian matrix is likely to be positive definite. The updating process is not entirely satisfactory because there are situations in which a positive definite update that satisfies the quasi-Newton condition does not exist. Moreover, complications arise because the dimension of the reduced matrix changes when the working set \(W_k\) changes. Quasi-Newton methods are usually beneficial when the working set remains fixed during consecutive iterations.
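For a BFGS update of the reduced matrix, the standard safeguard is to skip the update when \(s_k^T y_k \le 0\), since a positive definite update satisfying the quasi-Newton condition \(A_{k+1} s_k = y_k\) exists only when \(s_k^T y_k > 0\). Below is a minimal sketch, applicable when the working set (and hence the dimension of the reduced matrix) is unchanged between iterations; the skip threshold is an assumption:

```python
import numpy as np

def update_reduced_bfgs(A, s, y, skip_tol=1e-8):
    """BFGS update of the reduced Hessian approximation A from the
    step s and gradient difference y on the free variables."""
    sy = s @ y
    if sy <= skip_tol * np.linalg.norm(s) * np.linalg.norm(y):
        return A  # no positive definite update exists; keep A
    As = A @ s
    return A - np.outer(As, As) / (s @ As) + np.outer(y, y) / sy
```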

The choice of line-search parameter \(\alpha_k\) is quite similar to the unconstrained case. If subproblem (1.1) has a solution \(d_k\) and \(x_k + d_k\) violates one of the constraints, then we compute the largest \(\mu_k \in (0,1)\) such that
\[x_k + \mu_k d_k\] is feasible. A standard strategy for choosing \(\alpha_k\) is to seek an \(\alpha_k \in (0,\mu_k]\) that satisfies the sufficient decrease and curvature conditions. We are guaranteed the existence of such an \(\alpha_k\) unless \(\mu_k\) satisfies the sufficient decrease condition and
\[\nabla f(x_k + \mu_k d_k)^T d_k < 0.\]

This situation is likely to happen if, for example, \(f(x)\) is strictly decreasing on the line segment \([x_k, x_k+\mu_k d_k]\). In this case it is safe to set \(\alpha_k = \mu_k\).
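The step computation can be sketched as follows (assuming NumPy; the backtracking search enforces only the sufficient decrease condition, while a full implementation would also test the curvature condition, but the sketch does return \(\alpha_k = \mu_k\) in the safe case described above):

```python
import numpy as np

def max_feasible_step(x, d, l, u):
    """Largest mu in (0, 1] such that x + mu*d stays in [l, u].
    Components with i in W_k have d_i = 0 and impose no limit."""
    mu = 1.0
    for i in range(x.size):
        if d[i] > 0:
            mu = min(mu, (u[i] - x[i]) / d[i])
        elif d[i] < 0:
            mu = min(mu, (l[i] - x[i]) / d[i])
    return mu

def truncated_line_search(f, grad_f, x, d, l, u, c1=1e-4, rho=0.5):
    """Backtracking search for alpha in (0, mu_k] satisfying the
    sufficient decrease condition, starting from alpha = mu_k."""
    mu = max_feasible_step(x, d, l, u)
    fx, slope = f(x), grad_f(x) @ d
    alpha = mu
    while f(x + alpha * d) > fx + c1 * alpha * slope and alpha > 1e-16:
        alpha *= rho
    return alpha
```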