Sequential Quadratic Programming

See Also: Constrained Optimization, Nonlinear Programming

Sequential quadratic programming (SQP) is one of the most effective methods for nonlinearly constrained optimization problems. The method generates steps by solving quadratic subproblems; it can be used in both line search and trust-region frameworks. SQP is appropriate for both small and large problems, and it is well suited to solving problems with significant nonlinearities.

The SQP method can be viewed as a generalization of Newton’s method for unconstrained optimization in that it finds a step away from the current point by minimizing a quadratic model of the problem. A number of software packages (NPSOL, NLPQL, OPSYC, OPTIMA, MATLAB, and SQP) are based on this approach. In its purest form, the SQP algorithm replaces the objective function with the quadratic approximation
\[q_k(d) = \nabla f(x_k)^T d + \frac{1}{2}d^T\nabla^2_{xx} \mathcal{L}(x_k, \lambda_k)d\] and replaces the constraint functions by linear approximations.
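
For equality-constrained problems this connection with Newton's method can be made explicit: the optimality conditions of the quadratic subproblem are exactly the Newton equations for the KKT system of the original problem. Writing \(A_k\) for the Jacobian of the equality constraints and \(c(x_k)\) for the vector of their values at \(x_k\), the step \(d_k\) and the new multiplier estimate \(\lambda_{k+1}\) satisfy
\[\begin{pmatrix} \nabla^2_{xx}\mathcal{L}(x_k, \lambda_k) & A_k^T \\ A_k & 0 \end{pmatrix} \begin{pmatrix} d_k \\ \lambda_{k+1} \end{pmatrix} = - \begin{pmatrix} \nabla f(x_k) \\ c(x_k) \end{pmatrix}.\]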

Recall the general form of a nonlinearly constrained optimization problem:
\[ \begin{array}{lllll}
\mbox{minimize} & f(x) & & & \\
\mbox{subject to} & c_i(x) & = & 0 & \forall i \in \mathcal{E} \\
& c_i(x) & \leq & 0 & \forall i \in \mathcal{I}
\end{array}
\] The step \(d_k\) is calculated by solving the quadratic subprogram
\[ \min_d \; \{\, q_k(d) \;:\; c_i(x_k) + \nabla c_i(x_k)^T d \leq 0, \; i \in \mathcal{I}; \quad c_i(x_k) + \nabla c_i(x_k)^T d = 0, \; i \in \mathcal{E} \,\}.\]
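
As a small illustration (not drawn from any of the codes above), consider minimizing \(f(x) = x_1^2 + x_2^2\) subject to the single equality constraint \(c_1(x) = x_1 + x_2 - 1 = 0\). At \(x_k = (1, 0)\) the Hessian of the Lagrangian is \(2I\) (the constraint is linear), so the subproblem is
\[\min_d \; \{\, 2d_1 + d_1^2 + d_2^2 \;:\; d_1 + d_2 = 0 \,\},\]
whose solution \(d_k = (-\tfrac{1}{2}, \tfrac{1}{2})\) takes the iterate to \(x_{k+1} = (\tfrac{1}{2}, \tfrac{1}{2})\), the exact minimizer. Since the original problem is itself a quadratic program, a single SQP step solves it.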

The local convergence properties of the SQP approach are well understood when \((x^*, \lambda^*)\) satisfies the second-order sufficiency conditions. If the starting point \(x_0\) is sufficiently close to \(x^*\), and the Lagrange multiplier estimates \(\{ \lambda_k \}\) remain sufficiently close to \(\lambda^*\), then the sequence generated by setting \(x_{k+1} = x_k + d_k\) converges to \(x^*\) at a second-order rate. These assurances cannot be made in other cases. Indeed, codes based on this approach must modify the subproblem when the quadratic \(q_k\) is unbounded below on the feasible set or when the feasible region is empty.

The Lagrange multiplier estimates that are needed to set up the second-order term in \(q_k\) can be obtained by solving an auxiliary problem or by simply using the optimal multipliers for the quadratic subproblem at the previous iteration. Although the first approach can lead to more accurate estimates, most codes use the second approach.

The strategy based on solving the quadratic subproblem makes the decision about which of the inequality constraints appear to be active at the solution internally during the solution of the quadratic program. A somewhat different algorithm is obtained by making this decision prior to formulating the quadratic program. This variant explicitly maintains a working set \(\mathcal{W}_k\) of apparently active indices and solves the following quadratic programming problem
\[\min_d \; \{\, q_k(d) \;:\; c_i(x_k) + \nabla c_i(x_k)^T d = 0, \; i \in \mathcal{W}_k \,\}\] to find the step \(d_k\). The contents of \(\mathcal{W}_k\) are updated at each iteration by examining the Lagrange multipliers for the subproblem and by examining the values of \(c_i(x_{k+1})\) at the new iterate \(x_{k+1}\) for \(i \notin \mathcal{W}_k\). This approach is usually called the EQP (equality-based QP) variant of SQP to distinguish it from the IQP (inequality-based QP) variant described above.
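
The subproblem of the EQP variant can be solved directly as a linear (KKT) system. The following sketch is a minimal illustration, assuming \(B_k\) (an exact or approximate Hessian of the Lagrangian) is positive definite on the null space of the working-set Jacobian; the function name and interface are hypothetical and not taken from any of the codes mentioned above.

```python
import numpy as np

def eqp_step(grad_f, B, c_W, A_W):
    """One EQP step: minimize grad_f^T d + 0.5 d^T B d
    subject to c_W + A_W d = 0 (constraints in the working set W_k).

    Returns the step d and the working-set multipliers, obtained from
    the KKT system  [ B  A_W^T ; A_W  0 ] [ d ; lambda ] = -[ grad_f ; c_W ].
    """
    n, m = B.shape[0], A_W.shape[0]
    K = np.block([[B, A_W.T],
                  [A_W, np.zeros((m, m))]])
    rhs = -np.concatenate([grad_f, c_W])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]

# Toy example: minimize x1^2 + x2^2 subject to x1 + x2 - 1 = 0, from x_k = (1, 0).
x_k = np.array([1.0, 0.0])
grad_f = 2 * x_k                    # gradient of f at x_k
B = 2 * np.eye(2)                   # Hessian of the Lagrangian (exact here)
A_W = np.array([[1.0, 1.0]])        # Jacobian of the working-set constraints
c_W = np.array([x_k.sum() - 1.0])   # constraint values at x_k
d, lam = eqp_step(grad_f, B, c_W, A_W)
print(x_k + d, lam)                 # -> [0.5 0.5] and the multiplier estimate [-1.]
```

The multipliers returned by the subproblem are exactly the estimates \(\lambda_{k+1}\) that, as noted above, most codes carry over to the next iteration.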

The sequential QP approach outlined above requires the computation of \(\nabla_{xx}^2\mathcal{L}(x_k, \lambda_k)\). Most codes replace this matrix with the BFGS approximation \(B_k\), which is updated at each iteration. An obvious update strategy (consistent with the BFGS update for unconstrained optimization) would be to define
\[s_k = x_{k+1} - x_k, \quad\quad\quad y_k = \nabla_x\mathcal{L}(x_{k+1}, \lambda_k) - \nabla_x\mathcal{L}(x_k, \lambda_k)\] and update the matrix \(B_k\) by using the BFGS formula
\[B_{k+1} = B_k - \frac{B_k s_k s^T_k B_k}{s^T_k B_k s_k} + \frac{y_k y^T_k}{y^T_k s_k}.\]

However, one of the properties that make Broyden-class methods appealing for unconstrained problems, namely that \(B_k\) remains positive definite, is no longer assured, since \(\nabla_{xx}^2 \mathcal{L}(x^*, \lambda^*)\) is usually positive definite only on a subspace. This difficulty may be overcome by modifying \(y_k\): whenever \(y_k^T s_k\) is not sufficiently positive, \(y_k\) is reset to \(y_k \leftarrow \theta_k y_k + (1 - \theta_k) B_k s_k\), where \(\theta_k \in [0, 1)\) is the number closest to 1 such that \(y_k^T s_k \geq \sigma s^T_k B_k s_k\) for some \(\sigma \in (0,1)\). The SQP and NLPQL codes use an approach of this type.
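
A minimal sketch of this damped update is given below, assuming dense matrices and a typical but arbitrary value \(\sigma = 0.2\); the function name is illustrative and not taken from the SQP or NLPQL codes.

```python
import numpy as np

def damped_bfgs_update(B, s, y, sigma=0.2):
    """Update the Hessian approximation B from the pair (s, y),
    damping y toward B s whenever y^T s is not sufficiently positive
    so that the updated matrix remains positive definite."""
    Bs = B @ s
    sBs = s @ Bs          # assumes B is positive definite, so sBs > 0
    ys = y @ s
    if ys < sigma * sBs:
        # Largest theta in [0, 1) for which the modified y satisfies
        # y^T s >= sigma * s^T B s.
        theta = (1.0 - sigma) * sBs / (sBs - ys)
        y = theta * y + (1.0 - theta) * Bs
        ys = y @ s
    return B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / ys
```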

The convergence properties of the basic sequential QP algorithm can be improved by using a line search. The choice of how far to move along the direction generated by the subproblem is not as clear as in the unconstrained case, where we simply choose a step length that approximately minimizes \(f\) along the search direction. For constrained problems, we would like the next iterate not only to decrease \(f\) but also to come closer to satisfying the constraints. Often these two aims conflict, so it is necessary to weigh their relative importance and define a merit or penalty function, which serves as a criterion for deciding whether one point is better than another. The \(l_1\) merit function
\[\mathcal{P}_1(x; \nu) = f(x) + \sum_{i \in \mathcal{E}} \nu_i |c_i(x)| + \sum_{i \in \mathcal{I}} \nu_i \max(c_i(x), 0), \] where \(\nu_i > 0\) are penalty parameters, is used in the NLPQL, MATLAB, and SQP codes, while the augmented Lagrangian merit function
\[\mathcal{L}_A(x, \lambda; \nu) = f(x) + \sum_{i \in \mathcal{E}} \lambda_i c_i(x) + \frac{1}{2} \sum_{i \in \mathcal{E}} \nu_i c_i^2(x) + \frac{1}{2} \sum_{i \in \mathcal{I}} \psi_i (x, \lambda; \nu),\] where
\[\psi_i (x, \lambda; \nu) = \frac{1}{\nu_i} \left\{ \max \{ 0, \lambda_i + \nu_i c_i(x) \}^2 - \lambda_i^2 \right\}\] is used in the NLPQL, NPSOL, and OPTIMA codes. The OPSYC code for equality-constrained problems (for which \(\mathcal{I} = \emptyset\)) uses the merit function
\[f(x) + \sum_{i \in \mathcal{E}} \lambda_i c_i(x) + \left( \sum_{i \in \mathcal{E}} \nu_i c_i^2(x) \right)^\frac{1}{2}\] which combines features of \(\mathcal{P}_1\) and \(\mathcal{L}_A\).
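
For concreteness, the two merit functions can be evaluated as below. This is only an illustrative sketch, assuming the equality and inequality constraints are supplied as separate callables returning vectors and that the sign conventions match those above; the function names do not come from any of the packages listed.

```python
import numpy as np

def l1_merit(f, c_eq, c_ineq, x, nu_eq, nu_ineq):
    """l1 penalty function P_1(x; nu)."""
    return (f(x)
            + np.sum(nu_eq * np.abs(c_eq(x)))
            + np.sum(nu_ineq * np.maximum(c_ineq(x), 0.0)))

def augmented_lagrangian_merit(f, c_eq, c_ineq, x, lam_eq, lam_ineq, nu_eq, nu_ineq):
    """Augmented Lagrangian merit function L_A(x, lambda; nu)."""
    ce, ci = c_eq(x), c_ineq(x)
    psi = (np.maximum(0.0, lam_ineq + nu_ineq * ci) ** 2 - lam_ineq ** 2) / nu_ineq
    return (f(x)
            + lam_eq @ ce
            + 0.5 * np.sum(nu_eq * ce ** 2)
            + 0.5 * np.sum(psi))
```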

An important property of the \(l_1\) merit function is that if \((x^*, \lambda^*)\) satisfies the second-order sufficiency condition, then \(x^*\) is a local minimizer of \(\mathcal{P}_1\), provided the penalty parameters are chosen so that \(\nu_i > |\lambda_i^*|\). Although this is an attractive property, the use of \(\mathcal{P}_1\) requires care. The main difficulty is that \(\mathcal{P}_1\) is not differentiable at any \(x\) with \(c_i(x) = 0\). Another difficulty is that although \(x^*\) is a local minimizer of \(\mathcal{P}_1\), it is still possible for the function to be unbounded below. Thus, minimizing \(\mathcal{P}_1\) does not always lead to a solution of the constrained problem.
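
A one-dimensional example (not from any of the codes) makes the threshold \(\nu_i > |\lambda_i^*|\) concrete: for the problem of minimizing \(f(x) = x\) subject to \(x = 0\), the optimal multiplier is \(\lambda^* = -1\), and
\[\mathcal{P}_1(x; \nu) = x + \nu |x|\]
has the solution \(x^* = 0\) as a strict minimizer exactly when \(\nu > 1 = |\lambda^*|\), whereas for \(\nu < 1\) the merit function is unbounded below as \(x \to -\infty\).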

The merit function \(\mathcal{L}_A\) has similar properties. If \((x^*, \lambda^*)\) satisfies the second-order sufficiency condition and \(\lambda = \lambda^*\), then \(x^*\) is a local minimizer of \(\mathcal{L}_A\), provided the penalty parameters \(\nu_i\) are sufficiently large. If \(\lambda \neq \lambda^*\), then we can say only that \(\mathcal{L}_A\) has a minimizer \(x(\lambda)\) near \(x^*\) and that \(x(\lambda)\) approaches \(x^*\) as \(\lambda\) converges to \(\lambda^*\). Note that in contrast to \(\mathcal{P}_1\), the merit function \(\mathcal{L}_A\) is differentiable. The Hessian matrix of \(\mathcal{L}_A\) is discontinuous at any \(x\) with \(\lambda_i + \nu_i c_i(x) = 0\) for \(i \in \mathcal{I}\), but, at least in the case \(\mathcal{I}_0^* = \emptyset\), these points tend to occur far from the solution.

The use of these merit functions by NLPQL is typical of other codes. Given an iterate \(x_k\) and the search direction \(d_k\), NLPQL sets \(x_{k+1} = x_k + \alpha_k d_k\), where the step length \(\alpha_k\) approximately minimizes \(\mathcal{P}_1(x_k + \alpha d_k; \nu)\). If the merit function \(\mathcal{L}_A\) is selected, the step length \(\alpha_k\) is chosen to approximately minimize \(\mathcal{L}_A(x_k + \alpha d_k, \lambda_k + \alpha(\lambda_{k+1} - \lambda_k); \nu)\), where \(d_k\) is a solution of the quadratic programming subproblem given above and \(\lambda_{k+1}\) is the associated Lagrange multiplier.
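
The sketch below illustrates a backtracking strategy of this kind for the \(l_1\) merit function. It is a minimal example only: the quantity \(D\) is an upper bound on the directional derivative of \(\mathcal{P}_1\) along \(d_k\) (valid when \(d_k\) satisfies the linearized constraints), and actual codes such as NLPQL use more elaborate step-length rules.

```python
import numpy as np

def l1_linesearch(merit, x, d, grad_f, penalty_viol,
                  eta=1e-4, tau=0.5, alpha_min=1e-10):
    """Backtracking line search on the l1 merit function P_1.

    merit(x)     -- evaluates P_1(x; nu)
    grad_f       -- gradient of f at x
    penalty_viol -- sum_E nu_i |c_i(x)| + sum_I nu_i max(c_i(x), 0)

    Accepts the first alpha satisfying the Armijo-type test
        P_1(x + alpha d) <= P_1(x) + eta * alpha * D,
    with D = grad_f^T d - penalty_viol.
    """
    phi0 = merit(x)
    D = grad_f @ d - penalty_viol
    alpha = 1.0
    while merit(x + alpha * d) > phi0 + eta * alpha * D:
        alpha *= tau
        if alpha < alpha_min:
            break   # no acceptable step; a code would then adjust nu or d
    return alpha, x + alpha * d
```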