# Determining Protein Structure

## Introduction

Proteins are one of the major macro molecules that are present in all biological organisms. They engage into virtually every process within biological systems, such as catalyzing biochemical reactions, transporting and storing chemical compounds, signaling and translating the information from other proteins, maintaining the structures of biological components (e.g. cells, tissues), converting chemical energy into mechanical energy causing muscular movement, and generating immune responses to the harmful foreign bodies within the organism. The function that a protein assumes depends on its structure. Therefore, protein structure determination is of utmost importance for drug design and protein design studies.

X-Ray Crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy are the major experimental techniques for 3D protein structure determination. X-Ray Crystallography has been a more commonly used technique and obtaining protein structure information is a routine, highly automated procedure. Yet, it requires crystallization of the protein, which can take months. On the other hand, NMR Spectroscopy allows the study of a protein nearly under physiological conditions. However, it is hard to automate the NMR Spectroscopy experiments. An important bottleneck in NMR protein structure determination is the assignment of NMR spectrum peaks to the underlying atoms. This bottleneck can be represented as an assignment problem with the help of a homologous protein (protein with a similar structure as the protein under experimentation), which is called Structure Based Assignment (SBA) problem.

In this case study, we present two formulations from [1] and [2] for this problem and an interactive demo that solves the NMR SBA problem with the preferred formulation solver. The interactive demo uses the GAMS model representations for the formulations presented in the following sections.

## Problem Statement

The NMR SBA problem in [1], [2] and [3] is constructed by the Nuclear Vector Replacement (NVR) framework. The goal is to find a mapping between the set of peaks and the set of amino acids that minimizes the total mapping cost. Each peak-amino matching has an assignment probability, which is then converted into an assignment cost.

The available assignments are restricted by the Nuclear Overhauser Effect constraints that make the problem considerably harder to solve. In the NMR SBA problem, each peak pair has a binary relation called a NOE relation, i.e. for any given two peaks, they either have an NOE relation or not. The amino acids also have a similar binary relation, i.e. for any given two amino acids, the distance between the (amide) protons of the amino acids is either less than a threshold value (NTH) or not. The NOE constraints imply that for any given pair of peak – amino acid assignments (e.g. $$p_i \rightarrow a_i$$ and $$p_j \rightarrow a_j$$, if $$p_i$$ and $$p_j$$ have an NOE relation, then the distance of the protons of the amino acids that are assigned to those peaks, $$a_i$$ and $$a_j$$, must be less than the threshold value.

In the NOE constraints illustration figure, peaks 1 and 2 have an edge in between implying they have an NOE relation, and they are mapped to amino acids 1 and 2, respectively. Amino acids 1 and 2 also have an edge in between implying their distance from each other is under the threshold value. Hence, assigning peaks 1 and 2 to amino acids 1 and 2, respectively, is feasible. On the other hand, peaks 1 and 3 also have an NOE relation. However, the amino acids that are assigned to them, amino acids 1 and 4 do not have an edge in between, which means the distance between their amide protons is more than the threshold value. Thus, the assignments of peak 1 to amino acid 1 and peak 3 to amino acid 4 cause infeasibility.

## Mathematical Formulation

### Binary Integer Programming Formulation

The NMR SBA problem can be formulated as a Binary Integer Program (BIP) as follows:

Parameters
$$P$$ set of peaks
$$A$$ set of amino acids
$$NOE(i)$$ set of peaks that have an NOE relation with peak $$i$$, $$\forall i \in P$$set of amino acids
$$N$$ number of peaks to be assigned
$$NTH$$ distance threshold for an NOE relation
$$c_{ij}$$ cost of assigning peak $$i$$ to amino acid $$j$$, $$\forall i \in P$$, $$\forall j \in A$$
$$d_{kl}$$ distance between amide proteins of amino acids $$k$$ and $$l$$, $$\forall k, l \in A$$
$$b_{kl}$$ $$\left\{ \begin{array}{ll} 1 & \mbox{if \(d_{kl}$$ < $$NTH$$, $$\forall k, l \in A$$} \\
0 & \mbox{otherwise}
\end{array} \right. \)

Decision Variables
$$x_{ij}$$ $$\left\{ \begin{array}{ll} 1 & \mbox{if peak \(i$$ is assigned to amino acid $$j$$, $$\forall i \in P$$, $$\forall j \in A$$} \\
0 & \mbox{otherwise}
\end{array} \right. \)

Model
Minimize

$$\sum_{i \in P} \sum_{j \in A} c_{ij} x_{ij}$$

subject to:

$$\sum_{i \in P} x_{ij} \leq 1, \quad \forall j \in A$$
$$\sum_{i \in A}x_{ij} \leq 1, \quad \forall j \in P$$
$$\sum_{i \in P} \sum_{j \in A} x_{ij} = N$$
$$x_{ij}+x_{kl} \leq b_{jl}+1, \quad \forall j, l \in A, \forall i, k \in P, \forall k \in NOE(i)$$
$$x_{ij} \in \mathcal{B}, \forall i \in P, \forall j \in A$$

The first two constraints ensure that each NMR peak is assigned to at most one amino acid and, similarly, each amino acid is assigned to at most one peak. The third constraint determines the number of peak-amino acid assignments. Although $$N$$ is usually equal to the number of peaks, in rare cases, mapping all of the peaks could be infeasible. In such cases, $$N$$ allows us to obtain a partial solution. The next set of constraints are the NOE constraints. Finally, the last constraints restrict the variable only to take binary values.

### Mixed Integer Nonlinear Programming Formulation

We can also model this problem as Mixed Integer Nonlinear Program (MINLP). In this formulation, instead of having a large number of constraints for NOE violations, we add a penalty term to the objective function for each violation. We choose the penalty term as the maximum non infinity assignment cost so that any solution with an NOE violation will be less favorable.

Parameters
$$BD(j)$$ $$\{l \in A \ | \ d_{jl}$$ < $$NTH \}$$

Decision Variables
$$x_{ij}$$ $$\left\{ \begin{array}{ll} 1 & \mbox{if peak \(i$$ is assigned to amino acid $$j$$, $$\forall i \in P$$, $$\forall j \in A$$} \\
0 & \mbox{otherwise}
\end{array} \right. \)

Model
Minimize

$$\sum_{i \in P} \sum_{j \in A} c_{ij} x_{ij}+ \sum_{i \in P} \sum_{j \in A} \sum_{k \in NOE(i)} \sum_{l \in BD(j)} px_{ij}x_{kl}$$

subject to:

$$\sum_{i \in P}x_{ij} \leq 1, \quad \forall j \in A$$
$$\sum_{i \in A} x_{ij} \leq 1, \quad \forall j \in P$$
$$\sum_{i \in P}\sum_{j \in A} x_{ij} = N$$
$$x_{ij} \in \mathcal{B}, \quad \forall i \in P, \forall j \in A$$

In the model above, the objective function minimizes the total score associated with the assignment of NMR peaks to amino acids and the additional score (penalty) resulting from NOE relation violations. The first and second sets of constraints guarantee that each amino acid is assigned to at most one NMR peak and each NMR peak is assigned to at most one amino acid. Similarly, we determine the number of peak-amino acid assignments by the third constraint, and finally, the last constraints make sure that all the variables are binary.

## Interactive

The demo allows the user to select from a set of default proteins and specify a problem formulation to solve real NMR SBA problems.

• Select the model formulation: From the drop down list, select the formulation to solve the problem: IP for the binary integer programming formulation or MINLP for the mixed integer nonlinear programming formulation.
• Select the protein: From the drop down list, select one of the default proteins. The binary distances, the NOE relations, and the cost matrices will be read from the server. The number of peaks and the number of amino acids for the default proteins are as follows: Poly-n (31, 31); 1GB1, 1PGB, and 2BG1 (55, 55); FF2 (55, 80); 1UBI (70, 72).
• Solve with NEOS: Click the button to submit the problem to NEOS. Due to the size of the problem, it may take a couple of minutes for the solver to return the results.

Select the model formulation:
Select the protein:

## References

[1] Apaydın et. al. 2010. Nvr-bip: Nuclear vector replacement using binary integer programming for NMR structure-based assignments. The Computer Journal.

[2] Cavuslar, G., Catay B., Apaydin M. S. 2011. A Tabu Search Approach for the NMR Protein Structure-Based Assignment Problem. Working Paper/Technical Report. Sabanci University. ID:SU_FENS_2011/0001

[3] Apaydin, M.S., Conitzer V., Donald B.R. 2008. Structure-based protein NMR assignments using native structural ensembles. Journal of Biomolecular NMR, 40(4):263–276.