Identification of Common Structural Motifs from Proteases, Kinases, and Phosphatases Using a New Structure Comparison Method

Protein 3-D structures are functionally more conserved than sequences, which creates the need for a computational tool that compares protein structures accurately at both the global and local levels. We have developed a novel geometry-based method for protein 3-D structure comparison using the concept of Triangular Spatial Relationship (TSR). Each protein is represented as a vector of integers, denoted as “keys”, where each integer represents a triangle formed by the Cα atoms of three amino acids in the protein. Our method is independent of translation and rotation. Analysis of the keys provides deeper insight into the structure-function relationships of proteins. We identified two common keys: 3803315 (Ile-Leu-Leu) and 7903915 (Val-Ile-Leu). Nearly 100% of serine proteases, kinases, and phosphatases have one of these two keys, and each key has its own specific Theta and MaxDist values. In addition, we observed shorter MaxDist in the triangles formed by nonpolar amino acids (e.g. Val,


Short Communication
The well-accepted fact that protein structures are more conserved than sequences heightens the need for an accurate method to describe the 3-D relationships between proteins. Structural alignment or comparison captures information not detectable in a protein's sequence because of the nature of protein folding: two amino acids that are far apart in a protein sequence may be brought close together in the folded structure. Existing comparison methods include maximal common subgraph detection [1], the Ullmann subgraph isomorphism algorithm [2], and geometric hashing [3] among geometry-based approaches; Monte Carlo [4] and Combinatorial Extension [5] algorithms among distance-based approaches; and a genetic algorithm among secondary structure-based approaches [6]. Dynamic programming algorithms have been used in both distance-based [7][8][9] and secondary structure-based [8,10] comparisons. All existing methods have limitations, so further effort is needed to develop better structural comparison methods.
We have developed a completely different method in which each protein is represented by a vector of non-negative integers.

Methods
DOI: 10.26717/BJSTR.2020.26.004411
The detailed procedure for calculating "keys" will be reported elsewhere. We observed from the literature that not all keys are equally interesting; keys with a MaxDist less than or equal to a certain distance are more likely to be of interest. Therefore, we first filtered keys by MaxDist less than or equal to 18 Å and then found the common keys among the proteins in a protein class by performing a simple set overlap on the sets of keys of the individual proteins. We started the experiment with kinases, although any class could have been picked first. The filter MaxDist less than or equal to 18 Å left too many keys, so we tightened the cutoff from 18 Å in steps of 1 Å. We repeated the process and stopped at MaxDist less than or equal to 11 Å, where we found two useful keys, 3803315 and 7903915, in kinases. We then searched for these two keys in two ways:
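The filtering and set-overlap procedure described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: it reads "set overlap" as a plain set intersection, assumes each key has a single MaxDist value per protein, and uses invented protein names and MaxDist values. The names `common_keys`, `scan_thresholds`, and `toy_kinases` are hypothetical.

```python
from typing import Dict, Set

# Hypothetical input: protein ID -> {key: MaxDist in angstroms}.
# The key computation itself is not described here, so all values are invented.
KeyTable = Dict[str, Dict[int, float]]


def common_keys(proteins: KeyTable, max_dist: float) -> Set[int]:
    """Return keys with MaxDist <= max_dist that occur in every protein."""
    filtered = [
        {key for key, dist in keys.items() if dist <= max_dist}
        for keys in proteins.values()
    ]
    return set.intersection(*filtered) if filtered else set()


def scan_thresholds(proteins: KeyTable, start: int = 18, stop: int = 11) -> Dict[int, Set[int]]:
    """Lower the MaxDist cutoff from start to stop in steps of 1 angstrom."""
    return {t: common_keys(proteins, t) for t in range(start, stop - 1, -1)}


# Toy data only; these MaxDist values are not measurements.
toy_kinases: KeyTable = {
    "kinase_A": {3803315: 10.2, 7903915: 10.8, 1234567: 15.0},
    "kinase_B": {3803315: 10.4, 7903915: 10.6, 1234567: 16.5, 7654321: 12.0},
    "kinase_C": {3803315: 10.1, 7903915: 10.9, 1234567: 14.0, 7654321: 12.5},
}

results = scan_thresholds(toy_kinases)
for cutoff, keys in results.items():
    print(cutoff, sorted(keys))
```

In this toy data, tightening the cutoff shrinks the common set, which mirrors the reason given above for lowering the threshold step by step.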

Results and Discussion
Our Method is Independent of Translation and Rotation
Before studying protein structural comparison, we examined whether our method is independent of translation and rotation. We chose one protein from the PDB (PDB ID: 2HAK, Chain A) [11], rotated it by 35° and/or translated it by 5 Å, and found that the original structure and all of its transformed copies yielded identical keys (Figure 1a). This analysis indicates that our method treats a structure as identical no matter how it is rotated or translated. Next, we investigated the effect of the Theta and MaxDist bin numbers on key frequencies. We predicted that with larger bin numbers, two triangles with similar geometries but a sufficient difference in angle or length are less likely to generate the same key. As predicted, the number of keys with high occurrence frequencies decreases as the Theta or MaxDist bin number increases (Figures 1b-1e).
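The reasoning behind both observations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' key calculation: MaxDist is taken to be the longest side of the triangle, the binning rule (`max_dist_bin`, with a hypothetical 20 Å upper limit) is invented for the example, and the coordinates, rotation angle, and shift are illustrative. Because side lengths depend only on pairwise distances, any rigid rotation or translation leaves them unchanged, and a finer binning separates triangles that a coarser binning merges.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[Point, Point, Point]


def rotate_z(p: Point, degrees: float) -> Point:
    """Rotate a point about the z-axis by the given angle."""
    a = math.radians(degrees)
    x, y, z = p
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)


def translate(p: Point, shift: Point) -> Point:
    """Shift a point by a fixed vector."""
    return (p[0] + shift[0], p[1] + shift[1], p[2] + shift[2])


def side_lengths(tri: Triangle) -> List[float]:
    """Sorted side lengths; they depend only on pairwise distances."""
    a, b, c = tri
    return sorted([math.dist(a, b), math.dist(b, c), math.dist(a, c)])


def max_dist_bin(tri: Triangle, n_bins: int, limit: float = 20.0) -> int:
    """Hypothetical binning: map the longest side onto n_bins equal bins in [0, limit)."""
    return int(max(side_lengths(tri)) / limit * n_bins)


# Illustrative coordinates standing in for three C-alpha atoms.
tri_1: Triangle = ((0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 5.1, 0.0))
tri_2: Triangle = ((0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 5.5, 0.0))

# Rigid transformation: rotate, then translate every vertex of tri_1.
p0, p1, p2 = (translate(rotate_z(p, 35.0), (5.0, 0.0, 0.0)) for p in tri_1)
moved: Triangle = (p0, p1, p2)

same_geometry = all(
    math.isclose(u, v, abs_tol=1e-9)
    for u, v in zip(side_lengths(tri_1), side_lengths(moved))
)
print("invariant under rotation and translation:", same_geometry)
print("10 bins: ", max_dist_bin(tri_1, 10), max_dist_bin(tri_2, 10))
print("100 bins:", max_dist_bin(tri_1, 100), max_dist_bin(tri_2, 100))
```

With exact arithmetic the bin of a transformed triangle never changes; in floating point, a value sitting exactly on a bin boundary could flip bins, so the example coordinates were chosen away from boundaries.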

Identification of Common Keys for Proteins
We were able to identify common keys within subclasses of serine proteases and within subclasses of kinases and phosphatases. This motivated us to search for keys common to serine proteases, kinases, and phosphatases. We found two such keys: 3803315 (Ile-Leu-Leu) and 7903915 (Val-Ile-Leu). Nearly 100% of serine proteases, kinases, and phosphatases have one of these two keys (Figure 2a). In four random samples, more than 80% of proteins have one of the two keys (Figure 2a). The average frequency of these two keys is between 11 and 12 (Figure 2b). A representative structure in which two 3803315 triangles and two 7903915 triangles are formed by seven amino acids is shown in Figure 2c. Six of the seven amino acids are located in a β pleated sheet and the remaining one is in an α helix (Figure 2d).
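The two summary statistics reported above, the percentage of proteins carrying at least one of the two keys and the average key frequency, can be computed as in the following sketch. It is an illustration only: the protein names and key lists are invented, and "average frequency" is interpreted here as the mean number of occurrences of the two keys per protein, which is an assumption about how the reported value was obtained. The function names `coverage` and `average_frequency` are hypothetical.

```python
from typing import Dict, Iterable, List

TARGET_KEYS = (3803315, 7903915)


def coverage(proteins: Dict[str, List[int]], targets: Iterable[int]) -> float:
    """Percentage of proteins that contain at least one target key."""
    if not proteins:
        raise ValueError("no proteins given")
    target_set = set(targets)
    hits = sum(1 for keys in proteins.values() if target_set.intersection(keys))
    return 100.0 * hits / len(proteins)


def average_frequency(proteins: Dict[str, List[int]], targets: Iterable[int]) -> float:
    """Mean number of target-key occurrences per protein (assumed definition)."""
    if not proteins:
        raise ValueError("no proteins given")
    target_set = set(targets)
    total = sum(
        sum(1 for key in keys if key in target_set)
        for keys in proteins.values()
    )
    return total / len(proteins)


# Invented key lists; a key may occur several times within one protein.
toy_proteins: Dict[str, List[int]] = {
    "protein_1": [3803315, 3803315, 7903915, 1111111],
    "protein_2": [7903915, 2222222],
    "protein_3": [3333333],
    "protein_4": [3803315],
}

pct = coverage(toy_proteins, TARGET_KEYS)
avg = average_frequency(toy_proteins, TARGET_KEYS)
print(f"coverage: {pct:.1f}%  average frequency: {avg:.2f}")
```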
We do not know the function of these two keys in protein folding. However, our analysis shows that 3803315 and 7903915 each have specific Theta and MaxDist values (Figures 3a & 3b). Hydrophobic amino acids are most likely to be found in the core of globular proteins. Supporting evidence comes from the observation that MaxDist is shorter in triangles formed by nonpolar amino acids (e.g. Val, Ile, Leu) than in triangles containing charged amino acids (Arg, Lys, Glu, Asp) (Figure 3c). Because the core could play a more important role in protein folding than the protein surface, we suspect that the initial folding process, or some stages during the folding of globular proteins, could start from hydrophobic interactions between the side chains of Val, Ile, or Leu.