**Summary**:

### Case Study Contents

## Problem Statement

### Machine Learning

Machine learning is a type of artificial intelligence in which the focus is on the development of computer programs that can learn from data and adapt to changes. There are many types of machine learning algorithms; they can be classified based on the desired output of the algorithm or the type of input available for training.

Suppose that you wanted a computational device that for a given input would compute a certain output. You could directly design such a device or, depending on the situation, you could use a machine learning algorithm to create one, given that you can provide examples of input and output pairs. In some cases, machine learning algorithms can create such devices when we do not know how to make them directly ourselves!

In supervised machine learning, inputs are given to a modifiable algorithm, which can produce an output; the input, the output of the modifiable algorithm, and the correct output are given to a modifier algorithm, which can modify the modifiable algorithm. We do not know how to get a computer to learn how to program like we can, but we do know how to get computers to learn restricted kinds of modifiable algorithms such as some kinds of statistical calculations.

### Support Vector Machines

One kind of modifiable algorithm (or classifier) for which we have modifying (or learning) algorithms is a support vector machine. A support vector machine classifies inputs as belonging to one of two different different classes of outputs. It does this by calculating whether or not the input is above or below a classifying hyperplane. In two dimensions, the classifying hyperplane is a line, so an input is classified based on whether the input is above or below a certain line. The input to a support vector machine must therefore be a point in space, or a vector of numerical information. There are other classifiers of this kind as well, and a support vector machine is a particular kind of one. The classifying hyperplane of a support vector machine is the best such hyperplane. The best such hyperplane has two components. First, support vector machines will classify inputs even when the inputs are not linearly separable, in which case there is no hyperplane that classifies perfectly. In such cases, the classifying hyperplane of a support vector machine will have minimal error. Secondly, the classifying hyperplane of a support vector machine will be in between the two classes of data as much as possible. One way of understanding this is to imagine the case where the inputs are linearly separable, so that the calculation can be learned perfectly. The are an infinite number of such hyperplanes that classify the inputs perfectly. The classifying hyperplane of a support vector machine would be the one that was as far in between the two groups as possible.

The learning problem for support vector machines is a quadratic programming problem, and so the learning algorithm for support vector machines is a quadratic programming algorithm applied to this specific problem.

## Mathematical Formulation

Let

a = number of dimensions

P = a pxa matrix of positive points

N = a nxa matrix of negative points

m = vector used to represent the classifying hyperplane mx + b

b = variable used to represent the classifying hyperplane mx + b

e = vector of ones of some length

\( ((x)_+)_y=\max\{x_y,0\}, y = 1, 2, 3, ... \)

**Minimization of the degree of misclassification**

\[ \min_{mb}\frac{1}{p}||(-Pm+e_pb+e_p)_+||_1+\frac{1}{n}||(Nm-e_nb+e_n)_+||_1 \]

This expression is the minimization of the sum of the degree of misclassification of positive and negative points. The reason that the \(- Pm + eb\) and \(Nm - eb\) have negated signs with respect to each other is that we want the distances to the hyperplane of the positive points that were classified as negative whereas we want the distances to the hyperplane of the negative points that were classified as positive. Once that is calculated, the + operation sets all negative values in this vector equal to 0. The negative values are the distances to the hyperplane of the points that were classified correctly. The one-norm of these vectors then returns the sum of all such distances, and the 1/p and 1/n terms make the importance of the misclassification of positive points equal to the misclassification of negative points, even if there are more of one kind of points than the other. This + operation and the 1-norm cannot be explicitly expressed in linear programs but are calculated in the following equivalent linear program.

**Linear Program**

\[ \min_{mbcd}\frac{1}{p}e_p'c+\frac{1}{n}e_n'd \]

\[ Pm-e_pb+c\geq e_p \]

\[ -Nm+e_nb+d\geq e_n \]

\[ c\geq 0 \]

\[ d\geq 0 \]

Here, these new \(c\) and \(d\) vectors of variables, which are constrained to be nonnegative, are the primary component of what is explictly being minimized. For the constraints to hold, \(c\) and \(d\) only need to be increased from 0 so that the left hand sides of the main inequalities are greater than 1 when \(Pm - eb\) and \(-Nm + eb\) are less than 1 and thus misclassified. In these cases, the \(c\) or \(d\) variable will be increased just enough so that the constraint will hold. The degree that it had to be increased is equal to the degree that the point was misclassified. These formulations have not yet taken into account the separation margin, or the precise degree that the hyperplane is in between the positive and negative points. This separation margin is equal to \(2 / \|m\|_2\). Since this is proportional to \(m'm\), a multiple of \(m'm\) is added to the objective. The particular multiple of \(m'm\) that is added to the objective depends on the parameter \(f\). As \(f\) is increased, the effect of \(m'm\) on the objective is increased, and so the value of \(m\) will usually decrease. The reason for this parameter is to have the ability to avoid large weights to some degree. Large weights are empirically associated with overfitting, a problem in machine learning where the machine learning algorithm tends to memorize the training data and not generalize to unseen data well. As the objective is now a quadratic expression, the model is now a quadratic program.

**Full Quadratic Program**

\[ \min_{mbcd}\frac{1}{p}e_p'c+\frac{1}{n}e_n'd+\frac{f}{2}m'm \]

\[ Pm-e_pb+c\geq e_p \]

\[ -Nm+e_nb+d\geq e_n \]

\[ c\geq 0 \]

\[ d\geq 0 \]

## GAMS Implementation

## Try the demo!

**Using the program**