# E3: Maximum Likelihood Estimation with Probit Model (Binary Dependent Variable Case)

## Problem statement

In statistics, a probit model (binary dependent variable case) is a type of regression in which the dependent variable can take only two values (0/1), for example, married or not married. The name is a blend of "probability" and "unit". The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific category.

As an example, consider the purchase of fluid milk by Mexican households as it relates to the concern about the lack of an adequate intake of calcium, especially by children. We can apply probit regression on the Encuesta Nacional de Ingresos y Gastos de los Hogares (ENIGH) data (2002). For descriptions of variables in the data file, see here.

Assume that for each observation $t$, the net utility gained from the consumption of fluid milk, $U_t^*$, which is not observable, is related to a set of exogenous variables $x_t$ (an $I \times 1$ vector, where $I$ is the total number of exogenous variables). We are then interested in the coefficients $\beta$, which describe this relationship in the following latent model (as well as in the related probit model), assuming the error term $\mu_t$ follows a standard normal distribution, i.e., $\mu_t \sim N(0,1)$:

\begin{equation}
U_t^* = x'_t\beta + \mu_t.
\end{equation}

This latent model is equivalent to the probit model
\begin{equation}
y_t = x'_t\beta + \mu_t,
\end{equation}

when the relationship between latent utility variable $U_t^*$ and the observable response (0/1) variable of whether a household purchases fluid milk, $y_t$, satisfies:

\begin{equation}
y_t = \left\{ \begin{aligned} 1 & \quad \text{if } U^*_t > 0 \\ 0 & \quad \text{otherwise}. \end{aligned} \right.
\end{equation}

Note that in the above model, the $j^{th}$ element of the coefficient vector $\beta$, $\beta_j$ ($j \in \{1,2,\dots, I\}$), determines the direction of the change in the conditional probability $\Pr(y_t = 1|x_t)$ when $x^j_t$ (the $j^{th}$ element of vector $x_t$) changes. Because the model is nonlinear, the marginal effect of $x^j_t$ on that probability is $\beta_j$ scaled by the standard normal density evaluated at $x'_t\beta$, so it varies with $x_t$. To further develop this regression model, in addition to i.i.d. normally distributed error terms, we assume that the conditional probability takes the normal form:
$$\Pr(y_t = 1|x_t) = \Phi(x'_t\beta),$$
where $\Phi(\cdot)$ is the standard normal CDF.
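
This normal form can also be read off the latent model. The following short derivation is a sketch that uses only the threshold rule for $y_t$, the assumption $\mu_t \sim N(0,1)$, and the symmetry of the standard normal distribution:

$$\begin{aligned}
\Pr(y_t = 1|x_t) &= \Pr(U_t^* > 0 \,|\, x_t) \\
&= \Pr(\mu_t > -x'_t\beta) \\
&= 1 - \Phi(-x'_t\beta) \\
&= \Phi(x'_t\beta).
\end{aligned}$$

By the same argument, $\Pr(y_t = 0|x_t) = 1 - \Phi(x'_t\beta)$.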

## Mathematical Formulation

A standard statistical textbook such as Greene (2011) would show that the estimator $\hat{\beta}$ could be calculated through maximizing the following log-likelihood function $\ln\mathcal{L}(\beta)$:
$$\hat{\beta} = \arg\max_{\beta}\left[\ln\mathcal{L}(\beta)\right] = \arg\max_{\beta}\left[\sum_t\left(y_t\ln\Phi(x'_t\beta) + (1-y_t)\ln\left(1-\Phi(x'_t\beta)\right)\right)\right].$$
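
To make the objective concrete, the log-likelihood can be evaluated directly for a given $\beta$. The sketch below is not the GAMS model from this case study; it is a minimal pure-Python illustration in which the function names (`norm_cdf`, `probit_loglik`) and the four-row toy data set are hypothetical and chosen only for this example.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z), via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def probit_loglik(beta, X, y):
    """Sum over t of y_t*ln Phi(x_t'beta) + (1 - y_t)*ln(1 - Phi(x_t'beta))."""
    total = 0.0
    for x_t, y_t in zip(X, y):
        index = sum(b * x for b, x in zip(beta, x_t))  # x_t' beta
        # 1 - Phi(z) equals Phi(-z); using Phi(-z) avoids cancellation for large z.
        total += math.log(norm_cdf(index) if y_t == 1 else norm_cdf(-index))
    return total

# Hypothetical toy data: each row is (1, x), so beta[0] acts as an intercept.
X = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 1, 0, 1]

ll_zero = probit_loglik([0.0, 0.0], X, y)   # every probability is 0.5 here
ll_other = probit_loglik([0.0, 0.5], X, y)  # a positive slope fits this data better
```

An optimizer searches for the $\beta$ that makes this value as large as possible; in the toy data, moving the slope from 0 to 0.5 already raises the log-likelihood.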

In order to report standard regression outcomes such as t-statistics, p-values, and others as calculated in Example 1, we need the estimated covariance matrix of the estimator $\hat{\beta}$, i.e., $\hat{V}_{\hat{\beta}}$, which is based on the inverse Hessian matrix according to Greene (2011),
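$$\hat{V}_{\hat{\beta}} = (-\hat{H})^{-1},$$
where $\hat{H} = \nabla^2\ln\mathcal{L}(\beta)_{|\hat{\beta}}$ is the estimated Hessian of the log-likelihood function $\ln\mathcal{L}(\beta)$ at the solution point $\hat{\beta}$. The sign is needed because $\hat{H}$ is negative definite at a maximum of the log-likelihood, so $-\hat{H}$ is the positive definite matrix whose inverse serves as a covariance estimate.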
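
The path from the Hessian to the reported statistics can be sketched in a few lines. The example below is an illustration under stated assumptions, not the method used by the GAMS model: it fits a two-parameter probit (intercept and one regressor) by Newton's method, takes the covariance as the inverse of the negative Hessian of the log-likelihood (which is positive definite at a maximum), and computes p-values from the asymptotic normal approximation. The eight-point data set and all function names are hypothetical.

```python
import math

def pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def score_and_hessian(beta, xs, ys):
    """Gradient and Hessian of ln L(beta) for x_t = (1, x)."""
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        row = (1.0, x)
        index = beta[0] + beta[1] * x          # x_t' beta
        q = 2 * y - 1                          # +1 when y_t = 1, -1 when y_t = 0
        lam = q * pdf(index) / cdf(q * index)  # derivative of term t w.r.t. the index
        w = lam * (lam + index)                # minus the second derivative of term t
        for i in range(2):
            g[i] += lam * row[i]
            for j in range(2):
                H[i][j] -= w * row[i] * row[j]
    return g, H

def neg_inverse(H):
    """Return (-H)^{-1} for a 2x2 matrix H."""
    a, b, c, d = -H[0][0], -H[0][1], -H[1][0], -H[1][1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_probit(xs, ys, tol=1e-10, max_iter=50):
    """Newton's method: beta <- beta + (-H)^{-1} g, starting from zero."""
    beta = [0.0, 0.0]
    for _ in range(max_iter):
        g, H = score_and_hessian(beta, xs, ys)
        V = neg_inverse(H)
        step = [V[i][0] * g[0] + V[i][1] * g[1] for i in range(2)]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    _, H = score_and_hessian(beta, xs, ys)
    return beta, neg_inverse(H)                # estimate and its covariance matrix

# Hypothetical toy data (not the ENIGH data); the classes overlap, so the MLE is finite.
xs = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

beta_hat, cov = fit_probit(xs, ys)
std_err = [math.sqrt(cov[i][i]) for i in range(2)]
z_stat = [b / s for b, s in zip(beta_hat, std_err)]
# Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|)).
p_values = [math.erfc(abs(z) / math.sqrt(2.0)) for z in z_stat]
```

The 2x2 inverse is written out by hand only to keep the sketch dependency-free; with more regressors a linear algebra library would take its place.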

GAMS provides a mechanism to generate the Hessian matrix $H$ at the solution point. As the GAMS maximum likelihood example with gdx data, probit_gdx.gms, shows, we rely on the convertd solver with the options dictmap and hessian. These options generate a dictionary map from the solver to GAMS and the Hessian matrix at the solution point, and save them in the data files dictmap.gdx and hessian.gdx, respectively. Combining the information from these two files provides the Hessian matrix $H$ at the solution point $\hat{\beta}$.

## Demo

This demo provides two data input options for estimation and reports regression statistics based on a probit regression model. The reported statistics include the coefficient estimates, standard errors, t-values, and p-values (for the null hypothesis that a coefficient is zero) at the estimated point. For the best results, we recommend using Firefox for this interactive case study.

### Option 1: Data in a text file

Users who have access to the data needed in the estimation should create a text file with the data, for example, the fluid milk consumption data in Mexico (ENIGH, 2002). See fluid_data.txt. User-provided data files must satisfy the following restriction:

• The column that contains dependent variable data must be indexed by y.

Note that the estimated variables in the probit model are indexed by the names of the explanatory variables in the data.

Users can then download a sample GAMS model file, probit_txt.gms (probit model with text input), and modify it to solve their own estimation problems. Users should specify their own set definitions (sets “t” and “n” in the sample), include their own table of data (as described above), and run the modified model to obtain the estimation results.

### Option 2: Data in a GAMS data exchange (gdx) file

Users who have access to the data in a GAMS data exchange (gdx) file can use one of the following two methods.

• Method 1: Solve using the NEOS Server
Users can click on the “Solve with NEOS” button to find estimation results based on the default gdx file, i.e., the fluid milk data from ENIGH (2002). See fluid.gdx. Alternatively, users can upload their own data by clicking on the button next to “Upload GDX File” and then “Solve with NEOS”. User-provided gdx files must satisfy the same restrictions as listed above in Option 1.

Clicking on the “Reset” button will clear the solution.

• Method 2: Calculate the regression statistics locally
Users who have access to GAMS can download the GAMS model file, probit_gdx.gms (probit model with gdx input), and solve the model locally with the following command:

• gams probit_gdx --in=mydata

where mydata.gdx is a data file provided by the user. The gdx file must satisfy the restrictions as described above in Option 1.

## Model

A printable version of the probit model is here: probit_gdx.gms for gdx input and probit_txt.gms for text input.