From Sequence to Protein Folding Variations

Protein folding remains a major challenge in the life sciences in the post-genomic era. In particular, a large number of protein sequences are known, but most of them still lack 3D structural information. A novel approach has been developed to overcome this hurdle. Based on the fold of each set of 5 amino acids, the folding conformation of any protein 3D structure can be completely described by the Protein Folding Shape Code (PFSC). Furthermore, all possible folding shapes for 5 amino acids were gathered and expressed in PFSC as a digitized description. Finally, along a protein sequence from the N-terminus to the C-terminus, the folding variations can be presented in the Protein Folding Variation Matrix (PFVM), from which possible conformations can then be assembled.


Introduction
Rich protein structural data have accumulated, both as sequences carrying one-dimensional information and as conformations carrying 3D structures. A large number of protein sequences are available in databases today. The order of amino acids in some protein sequences can be determined by LC-MS experiments.
However, a huge number of protein sequences can be obtained directly by transcription of DNA and subsequent translation of mRNA, following the development of genomics. Over 167,000,000 protein sequences have accumulated in the Universal Protein Resource (UniProt), of which about 560,000 have been manually annotated. Furthermore, knowledge of conformations is significant for biological function, because protein folding plays as important a role as sequence in protein function. The protein folding conformation can be determined directly from given 3D structural data. Protein 3D structures may be determined by either experimental measurements or computational approaches.
Experimentally, protein structures can be measured by Nuclear Magnetic Resonance (NMR), X-ray crystallography or Transmission Electron Cryomicroscopy (CryoTEM), and so far over 155,000 3D structures are available in the Protein Data Bank (PDB). However, fewer than 0.5% of proteins have 3D structural data relative to the hundreds of millions of known protein sequences. Moreover, experimental structure determination cannot keep pace with the growth in the number of one-dimensional protein sequences. Computational approaches have therefore become an important methodology for predicting protein 3D structures. So far, various methodologies for protein structure prediction have been developed [3][4][5][6][7][8][9][10][11][12][13][14][15][16].
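The gap between sequence and structure data can be checked with simple arithmetic on the two counts quoted above. The sketch below is only a rough illustration: it assumes each PDB entry corresponds to one distinct UniProt sequence, which overstates the true coverage, and the variable names are illustrative.

```python
# Counts quoted in the text; both are lower bounds ("over ...").
pdb_structures = 155_000
uniprot_sequences = 167_000_000

# Assumption: one PDB entry per distinct sequence (an optimistic simplification).
coverage_percent = 100 * pdb_structures / uniprot_sequences

print(f"Structural coverage: {coverage_percent:.3f}%")  # prints "Structural coverage: 0.093%"
```

Under this simplification the coverage is roughly 0.09%, consistent with the statement that fewer than 0.5% of proteins have 3D structural data.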

Challenges in Protein Folding Conformation
In general, a thorough resolution of protein folding faces several challenges. First, over 167,000,000 proteins in UniProt are known only as one-dimensional sequences without 3D structural information. Dealing with such a gigantic number of protein sequences is clearly hard to accomplish by either traditional experimental measurements or computational approaches. Paradoxically, demanding higher accuracy from experiments and computational approaches makes the goal even more difficult to achieve. Second, any single protein sequence may fold into an astronomical number of conformations, which makes the task still more difficult. In principle, the number of possible foldings of a protein is essentially a function of the order of its amino acids. Scientists have put much effort into overcoming this difficulty; however, the rule or correlation relating protein folding to the order of amino acids remains unknown. Third, even if an astronomical number of conformations were obtained, it is hard to imagine how to present or analyze such a gigantic amount of structural data. Faced with so many obstacles, it is not surprising that protein folding was named one of the 100 hard scientific questions of this century by Science in 2005 [6][7][8][9][10].
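To give a sense of what "astronomical" means here, the following sketch counts conformations under a deliberately crude model. It assumes, purely for illustration, that every residue independently adopts one of a small number of discrete states; neither the model nor the values 100 and 3 come from the work described here.

```python
def conformation_count(n_residues: int, states_per_residue: int) -> int:
    """Number of conformations if each residue independently takes one of
    `states_per_residue` discrete states (a hypothetical simplification)."""
    return states_per_residue ** n_residues

# Hypothetical example: a 100-residue chain with 3 states per residue.
count = conformation_count(100, 3)
print(f"about 10^{len(str(count)) - 1} conformations")  # prints "about 10^47 conformations"
```

Even with only three states per residue, a modest 100-residue chain yields a number of conformations far too large to enumerate or inspect one by one, which is the presentation and analysis problem raised in the third challenge.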

Protein Structure Fingerprint Technology
To handle a gigantic amount of protein structural data, protein structure fingerprint technology has been developed to overcome these hurdles. In this novel approach, all possible conformations can be constructed with PFVM [11][12][13][14][15][16].
The procedure to obtain PFVM is briefly illustrated in Figure 1 and Table 1.
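As stated in the abstract, the folding description is built from sets of 5 amino acids taken along the sequence from the N-terminus to the C-terminus. A minimal sketch of that first step is given below. It assumes the 5-residue windows overlap and advance one residue at a time, which is not specified here, and it does not assign PFSC letters or build the PFVM itself. The function name and the example sequence are hypothetical.

```python
from typing import Iterator, Tuple

def five_residue_windows(sequence: str, width: int = 5) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start position, fragment) for each window of `width`
    residues, moving from the N-terminus to the C-terminus.
    Assumption: windows overlap and advance by one residue per step."""
    for start in range(len(sequence) - width + 1):
        yield start + 1, sequence[start:start + width]

# Hypothetical 10-residue sequence in one-letter amino acid codes.
example_sequence = "MKTAYIAKQR"
windows = list(five_residue_windows(example_sequence))

for position, fragment in windows:
    print(position, fragment)
```

Under this assumption a sequence of n residues gives n - 4 windows, so the 10-residue example produces 6 fragments, each of which would then be looked up against the possible folding shapes for 5 amino acids.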

Conclusion
With protein structure fingerprint technology, protein folding variations can be obtained directly from sequence and are clearly displayed by PFVM. Given an amino acid sequence as input, the PFVM is returned as output, freely accessible on the web server www.micropht.com.