The Influences Of Protein Coding Sequences On Protein Folding Rates

Posted on: 2012-02-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R F Li
GTID: 1100330335973041
Subject: Theoretical Physics
Abstract/Summary:
Protein folding is a central problem in twenty-first-century biophysics, and discovering its mechanism remains a great challenge in molecular biology. A key step is to identify the factors that are related to protein folding rates. At present it is commonly accepted that the folding rate of a protein is mainly influenced by its amino acid sequence, its structure, the environment and the temperature. We propose that protein coding sequences also contain important information that may influence protein structure and function, and that this information plays a key role in regulating protein folding. In this work we search protein coding sequences for the information that can influence protein folding rates.
A protein coding sequence contains abundant information beyond that of the hereditary language. We consider synonymous codon usage to be one of the most important sources of this information. One function of synonymous codon usage is to regulate mRNA structures, whose common unit is the palindrome, and palindromes, especially those with special properties, have been found to have extremely important biological functions. Many diseases are also related to palindromes, which are frequently found in human cancer cells. Analyzing in depth how palindromes form and revealing their biological mechanism has therefore become one of the main problems in understanding the information and function of RNA sequences. We consider that palindromes contain information about both mRNA sequence and mRNA structure. Thus, if protein folding rates are influenced by the protein coding sequence, this influence must be reflected in palindromes and in synonymous codon usage. The main contributions are summarized as follows:
1. Study on the relationship between the folding rates and the average polarities of amino acid segments.
The folding rates and the average polarities of α-helix and β-strand segments in proteases of E. coli and of four viruses (SARS, HIV, HCV and HBV) were calculated, and the relationship between folding rate and average polarity was analyzed separately for the α-helix and β-strand segments. The folding rates are significantly correlated with the average polarities for both kinds of segments, in both the E. coli protease and the viral proteins. For α-helix segments, the average polarity correlates positively with the folding rate; for β-strand segments, it correlates negatively. It is concluded that the average polarity of amino acids plays an important role in protein folding.
2. The influence of the palindrome GC content and palindrome density in protein coding sequences on the relationship between the protein coding sequence and the average polarities of amino acid segments. Based on the relationship between the folding rates and the average polarities for the two kinds of amino acid segments of the four viruses (SARS, HIV, HCV and HBV), the parameters palindrome GC content and palindrome density of the protein coding segments were defined, and their influence on the relationship between folding rate and average polarity was studied. The results show that the folding rates correlate positively with both the palindrome GC content and the palindrome density. Our analysis indicates that the influence of palindromes is related to the folding rates of the peptide chains but not to the polarities, and that it comes from the complexity and variability of palindrome structures or from synonymous codon usage, not from the translation information. These results mean that protein coding sequences do carry information that can influence the folding rates of peptide chains or protein structures.
3. To test the influence of mRNA sequence and structure on protein folding rates, the GC content of the protein coding sequence was chosen for a preliminary analysis. We obtained the GC content of the protein coding sequences (CGC), which represents the vocabulary of mRNA, for the 13 all-β proteins given by Gromiha, and added it to Gromiha's regression equation for predicting protein folding rates in order to examine its effect on the folding process for the same 13 all-β proteins. Compared with Gromiha's results, the linear correlation coefficient between the experimental and predicted folding rates increased from 0.96 to 0.98, the population variance decreased from 0.50 to 0.27, and the chi-square value decreased from 3.53 to 3.35; the p-value of the chi-square test is 0.01 for Gromiha's model and 0.008 for our model. The result shows that the new parameter is valuable and that protein folding rates are indeed influenced by the GC content of the corresponding protein coding sequence. Further analysis indicates that this effect comes mostly from the information in synonymous codon usage, not from the translation information from codons to amino acids.
4. As discussed above, adding the GC content of the protein coding sequence to Gromiha's regression equation improved the predictions significantly. However, the jackknife test (the P value for the coding-sequence GC content term is 0.087) showed that the GC content of the protein coding sequence does not contain all the information that influences protein folding rates; in particular, it does not contain information about mRNA structure. We therefore defined the palindrome GC content (PGC) of the protein coding sequence, which contains information about both mRNA sequence and mRNA structure.
In this part, the parameter CGC was therefore replaced by the parameter PGC, and the same linear regression analysis was carried out. Compared with Gromiha's results, the linear correlation coefficient between the experimental and predicted folding rates increased from 0.96 to 0.99, the population variance decreased from 0.50 to 0.27, and the chi-square value decreased from 3.53 (p=0.01) to 2.86 (p=0.004); the new results also passed the jackknife test. The results are thus further improved, which means that the palindrome GC content has a stronger effect on protein folding rates. Further analysis indicates that this effect comes mostly from synonymous codon usage and from the information in the palindrome structure itself.
5. A sample of 18 all-α proteins, 18 all-β proteins and 18 mixed-class proteins was selected, and the correlation between synonymous codon usage in the protein coding sequences and the protein folding rates was analyzed directly. The results show that 5 codons for the all-α proteins, 8 codons for the all-β proteins and 4 codons for the mixed-class proteins are related to the folding rates. The influence of synonymous codon usage on folding rates differs between protein classes. For example, for the amino acid Glu, the codons GAG and GAA are both significantly correlated with the folding rates of the all-β proteins and of the mixed-class proteins, but the influences of the two codons on the folding rates are opposite for the all-β proteins and for the mixed-class proteins. For the amino acid Arg, the influences of GAG and GAA on the folding rates are likewise opposite for the all-α proteins and for the mixed-class proteins.
6. We also selected as parameters of the protein coding sequence D1 (the single-base information redundancy), which describes the vocabulary of the hereditary language, D2 (the adjacent-base correlated information redundancy), which describes its phraseology, and the derived parameter X (X = D2/(D1 + D2)). Based on a larger protein data set, linear regressions between the folding rates and these vocabulary and phraseology parameters of the corresponding coding sequences were analyzed. The results indicate that for two-state proteins among the all-α and all-β proteins, the folding rates correlate significantly with the D2 and X values of the corresponding coding sequences; for the all-α proteins in particular, the correlation coefficient reached 0.84. For multi-state proteins among the α-β proteins, the folding rates show a significant negative linear correlation with the GC content of the corresponding coding sequences, and further analysis indicates that part of the influence of GC content comes from the third bases of codons. This again shows that protein folding rates are influenced by synonymous codon usage.
7. The palindromes in the genomes of HIV (HIV-1), HCV, SARS and several other coronaviruses were obtained and compared. The distribution of palindromes was determined and some special palindromes were found. Comparing the special palindromes in several highly pathogenic viruses, we found that they are distinctive in GC content, length or location, and that they are always located at key points in the viral sequences. We therefore think that these special palindromes are not ordinary sequences but structures with vital functions, and we conjecture that they carry important information that influences protein function.
We therefore think that palindromes can indeed be used as a parameter for studying the relationship between mRNA and protein.
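The abstract names three sequence-level parameters (coding-sequence GC content CGC, palindrome GC content PGC and palindrome density) without giving formulas. The following Python sketch shows one way such quantities could be computed. Everything in it is an assumption made for illustration: a palindrome is taken to be a contiguous segment equal to its own reverse complement (no loop or spacer), the minimum length of 4 is arbitrary, PGC is taken as the GC fraction of the bases covered by palindromes, and density as the fraction of bases covered. The function names are hypothetical and the dissertation's own definitions may differ.

```python
from typing import Dict, List, Tuple

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_content(seq: str) -> float:
    """GC fraction at third codon positions, assuming seq starts in frame."""
    return gc_content(seq.upper()[2::3])


def find_palindromes(seq: str, min_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) spans of maximal reverse-complement palindromes.

    Assumed definition: a contiguous segment equal to its own reverse
    complement. Such segments always have even length, so each one is
    found by expanding outwards from a gap between two adjacent bases.
    """
    spans = []
    n = len(seq)
    for center in range(1, n):
        left, right = center - 1, center
        # Extend while the outer bases pair with each other.
        while left >= 0 and right < n and COMPLEMENT.get(seq[left]) == seq[right]:
            left -= 1
            right += 1
        start, end = left + 1, right  # half-open span [start, end)
        if end - start >= min_len:
            spans.append((start, end))
    return spans


def palindrome_metrics(seq: str, min_len: int = 4) -> Dict[str, float]:
    """Compute CGC, PGC and palindrome density under the assumptions above."""
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA input
    covered = set()
    for start, end in find_palindromes(seq, min_len):
        covered.update(range(start, end))
    palindrome_bases = "".join(seq[i] for i in sorted(covered))
    return {
        "cgc": gc_content(seq),
        "pgc": gc_content(palindrome_bases),
        "density": len(covered) / len(seq) if seq else 0.0,
    }
```

For instance, `palindrome_metrics("AAGAATTCAA")` finds the single palindrome GAAUUC, so six of the ten bases are covered. Counting covered bases rather than palindromes keeps overlapping palindromes from being counted twice, which is a design choice of this sketch and not something stated in the abstract.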
Keywords/Search Tags: Protein coding sequence, Protein folding rates, Correlation, Palindrome structure, GC content of palindromes, Palindrome density, Information redundancy
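The abstract uses the information redundancies D1 and D2 and the ratio X = D2/(D1 + D2) but does not define D1 and D2. A minimal Python sketch follows, assuming one common entropy-based formulation: D1 = 2 - H1 and D2 = 2·H1 - H2, where H1 is the Shannon entropy (in bits) of single-base frequencies and H2 that of adjacent base pairs. These formulas and the function names are assumptions for illustration, not taken from the dissertation.

```python
import math
from collections import Counter
from typing import Tuple


def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_redundancy(seq: str) -> Tuple[float, float, float]:
    """Return (D1, D2, X) for a nucleotide sequence.

    Assumed definitions (not given in the abstract):
      D1 = 2 - H1        single-base information redundancy
      D2 = 2*H1 - H2     adjacent-base correlated information redundancy
      X  = D2 / (D1 + D2)
    """
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    h1 = shannon_entropy(Counter(seq))
    # Overlapping adjacent pairs: positions (0,1), (1,2), ...
    h2 = shannon_entropy(Counter(seq[i:i + 2] for i in range(len(seq) - 1)))
    d1 = 2.0 - h1
    d2 = 2.0 * h1 - h2
    total = d1 + d2
    x = d2 / total if total > 0 else float("nan")
    return d1, d2, x
```

Under these assumed definitions, a sequence with equal base frequencies has D1 = 0, so all of its redundancy is carried by adjacent-base correlations and X approaches 1, whereas a homopolymer such as `"AAAA"` has D1 = 2 and D2 = 0.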