| It has long been known that, in most species, synonymous codons are used with different frequencies (codon bias) and that the order in which codons are used within a protein-coding sequence is far from random. There are 61 sense codons and therefore 3721 possible codon pairs (excluding pairs that contain a stop codon). Previous studies have established that the pattern of codon pair usage in a given genome is also nonrandom, and that this codon pair bias (CPB) is a species-specific feature that is independent of codon bias. It is still not clear why some codon pairs are used more frequently than others. Previous experimental analyses suggest that translation may act as a selective force on codon pair preference within coding sequences: how well tRNAs fit together in the A and P sites of the ribosome may influence the efficiency of translation, so codon pair bias may have a component dictated by tRNA properties rather than simply by codon properties.

Analysis of codon pair usage in different organisms, and its application in bioinformatics and evolutionary studies, is important for investigating gene expression and genome evolution. The CPB value has been applied to individual genes and individual genomes to measure codon pair bias, but never codon pair by codon pair over an entire transcriptome. In this study, using methods from genomics and bioinformatics, we carried out the following work:

1. Analysis of codon pair bias in 478 organisms, codon pair by codon pair, over entire transcriptomes

The aim of this part is to analyze codon pair bias, codon pair by codon pair, across all coding sequences (CDS) in 478 organisms from all three domains of life, and to identify general rules of codon pair usage.
Consensus coding sequences (CCDS) for Homo sapiens (human) and Mus musculus (mouse), as well as coding sequences (CDS) for Rattus rattus (rat), Bos taurus (cow), Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Escherichia coli and other organisms, were downloaded from NCBI and UCSC. We developed several programs in Java, Python and R to carry out the genome-wide analyses in this study. Using these in-house programs, we computed the codon pair score (CPS) for each of the 3721 possible codon pairs. The CPS of a given codon pair is the natural logarithm of the ratio of its observed to its expected frequency over all coding sequences in a given genome. Positive and negative CPS values correspond to statistically over- and under-represented codon pairs, respectively. The codon pair bias (CPB) of an entire CDS with N codons (not including the stop codon) was then calculated as the arithmetic mean of the CPS values of its N − 1 codon pairs, and the CPS profile of the i-th CDS in a given genome is the vector of all its CPS values. For a particular species, the 5′ and 3′ portions of all CDSs were aligned according to their start and stop positions, respectively, and averaged head (the first 120 codon pairs) and tail (the last 120 codon pairs) CPS profiles were calculated by taking the mean of the CPS values at each codon pair position in the respective alignment.

We calculated the total CPB for each CDS in the human, mouse, rat, cow, D. melanogaster, C. elegans, S. cerevisiae, S. pombe and E. coli genomes. Specifically, in the human genome, the CPB distribution for a set of 17,635 CDSs is shifted towards positive values, with a mean score of 0.075.

We next inspected the averaged head and tail CPS profiles of all the CDSs in a given genome.
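The CPS and CPB definitions above can be illustrated with a minimal Python sketch. It is not the code used in this study. In particular, the text does not spell out how the expected frequency is obtained, so the sketch assumes the commonly used normalization in which the expected count of a codon pair is derived from the counts of its two codons, their two amino acids and the corresponding amino acid pair. The function names are hypothetical, and input sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Split a CDS into codons, dropping a trailing stop codon if present."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if codons and CODON_TO_AA.get(codons[-1]) == "*":
        codons = codons[:-1]
    return codons


def codon_pair_scores(cds_list):
    """CPS = ln(observed / expected) for every codon pair seen in cds_list.

    ASSUMPTION: expected = N_A * N_B / (N_X * N_Y) * N_XY, where A, B are the
    codons and X, Y the amino acids they encode. Pairs that never occur are
    omitted because their score would be ln(0).
    """
    codon_n, aa_n = Counter(), Counter()
    pair_n, aa_pair_n = Counter(), Counter()
    for cds in cds_list:
        codons = split_codons(cds.upper())
        codon_n.update(codons)
        aa_n.update(CODON_TO_AA[c] for c in codons)
        for a, b in zip(codons, codons[1:]):
            pair_n[(a, b)] += 1
            aa_pair_n[(CODON_TO_AA[a], CODON_TO_AA[b])] += 1
    scores = {}
    for (a, b), observed in pair_n.items():
        x, y = CODON_TO_AA[a], CODON_TO_AA[b]
        expected = codon_n[a] * codon_n[b] / (aa_n[x] * aa_n[y]) * aa_pair_n[(x, y)]
        scores[(a, b)] = math.log(observed / expected)
    return scores


def cps_profile(cds, scores):
    """Vector of CPS values along one CDS (one value per adjacent codon pair)."""
    codons = split_codons(cds.upper())
    return [scores[(a, b)] for a, b in zip(codons, codons[1:])]


def cpb(cds, scores):
    """CPB of one CDS: arithmetic mean of its CPS profile (needs >= 2 codons)."""
    profile = cps_profile(cds, scores)
    return sum(profile) / len(profile)
```

In this sketch the scores are computed once per genome from all CDSs, and each CDS is then scored against that genome-wide table, which mirrors the order of the steps described above.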
Remarkably, in nearly all species examined (441 out of 478), the CPS values are relatively low near the 5′ end of the mRNA and increase rapidly as the distance from the start codon grows. We call this effect a "codon pair ramp".

To determine the typical length of the codon pair ramp, we smoothed the averaged CPS profiles by calculating the mean value of the CPS profile within a sliding window of 10 codon pairs. The length of the ramp was then defined as the region in which the mean CPS value is significantly lower (Kolmogorov-Smirnov test, P < 0.05) than the mean of all 12 sliding windows. We found that the codon pair ramp is about 20 to 50 codon pairs long in almost all species examined. In the human genome the ramp is 40 codon pairs long, and the average CPS value in this region is 0.067, ~7% lower than the mean value of the first 120 codon pairs, which is 0.072. By contrast, the average CPS value in the region between the 50th and the 120th codon pair is 0.076, ~6% higher than the mean value of the first 120 codon pairs.

We calculated the CPB value of the first 40 codon pairs for each individual CDS in a given genome. While the average CPB value for all CDSs in human is 0.075, the mean value for the first 40 codon pairs of all CDSs is 0.066, and the CPB value of the first 40 codon pairs in each CDS is significantly lower than the CPB value of the entire sequence (paired t-test, P < 2.2e-16).

We also found lower CPS values in the tail parts (the last 120 codon pairs) of coding sequences in 413 out of 478 species studied, such as human, mouse, rat, cow, D. melanogaster, C. elegans and E. coli, while in 69 out of 478 species studied, such as S. cerevisiae, S. pombe and A. fumigatus, no such tail ramp appears to exist. Of the 413 species possessing both a head and a tail codon pair ramp, in 375 the head ramp is at least as long as the tail ramp.

2. Comparing CPB between wild-type profiles and random profiles

To verify that the observed codon pair profile is not a trivial consequence of lining up all CDSs in a given genome by their start/stop codons, we developed a program in R with the seqinr package (http://seqinr.r-forge.r-project.org/) to generate random sequences for each CDS in E. coli, human and S. cerevisiae. Using this R program, we randomly shuffled each CDS in a given genome. The shuffling was done using two alternative methods: a) random permutation of the codons occurring in a CDS while preserving the exact count of each codon (codon randomization), and b) random selection of synonymous codons for each amino acid while preserving the amino acid sequence and codon usage of a given CDS (synonymous codon randomization). Both procedures were repeated 50 times, and the averaged CPS profiles of the random sequences of a given species were computed using the CPS value of each codon pair derived from the wild-type genome.

Average CPS values in these two profiles are negative, which means that the codon pairs in random sequences are statistically under-represented compared with wild-type sequences. Such negative values are expected because the codon pair usage of wild-type sequences is not random, and not all combinations of two codons are used as frequently in wild-type sequences as in random sequences. Moreover, while the codon pair ramp near the 5′ end of the mRNA is present in the averaged profile of the wild-type coding sequences of a given genome, randomized sequences do not show this effect, indicating that the observed profiles are not a trivial consequence of lining up all CDSs in a given genome by their start/stop codons.

3. Analysis of the correlation between the CPB ramp and translation speed

The aim of this part is to analyze the correlation between codon pair usage and translation speed, especially in the CPB ramp region.

Using several in-house Java and Python programs, we compared the tRNA adaptation index (tAI) to the CPB of each CDS in human, mouse, rat, cow, D. melanogaster, C. elegans, S. cerevisiae, S. pombe and E. coli. The tAI value of a given transcript reflects its adaptation to the tRNA pool in a given genome. tAI is a number between 0 and 1, with higher values corresponding to higher translation speed. A significant, albeit weak, positive correlation between these two values was indeed found in human (Spearman's ρ = 0.298, P < 2.2e-16) and other species, which supports the view that one possible force shaping codon pair bias is optimization of translation speed through adaptation to the tRNA pool.

We also calculated an averaged tAI profile for each codon pair position in a given genome. In this case, the tAI value of each codon pair was calculated as the geometric mean of the tAI values of the two codons comprising the pair. For all CDSs in a given genome, we compared the average CPS value at each codon pair position along the coding sequences to the average tAI value at the same position. We observed a strong positive correlation between average CPS values and average tAI values at each codon pair position in the codon pair ramp regions of human, cow, D. melanogaster, C. elegans, S. pombe and E. coli. For example, in human the CPS profile has a strong and significant correlation (Spearman's ρ = 0.651, P < 9.177e-06) with the translation speed profile among the first 40 codon pairs. However, no significant correlation was found between CPS and tAI values for the 40th to 120th codon pairs (Spearman's ρ = -0.032, P = 0.776) in human. In mouse, rat and S. cerevisiae we did not find any correlation between CPB and tAI in the ramp region.
However, in S. cerevisiae, when considering the first 120 codon pairs, we found a weak but significant positive CPB/tAI correlation (Spearman's ρ = 0.242, P = 0.0078).

The tight connection between codon pair bias and translation speed in the codon pair ramp region suggests that under-represented codon pairs slow down early elongation steps and thereby reduce the rate of translation in the vicinity of the translation initiation region. These findings are also consistent with the notion that translation initiation or early elongation, and not the global elongation rate, is rate-limiting for gene expression. Interestingly, however, in mouse and rat we did not find any correlation between CPS and tAI. We speculate that in these organisms selection for mRNA stability, rather than translational selection, may affect codon pair preference as well as codon usage.

4. The effect of codon pair usage on the translation of GFP genes

In this part, we used the sequences of 154 green fluorescent protein (GFP) genes to test the effect of codon pair usage on translation. The sequences of 154 genes that varied randomly in their codon usage but encoded the same GFP, together with normalized fluorescence levels for pGK8 (T7 promoter, no leader sequence) reflecting their expression levels in E. coli, were obtained from the work of Kudla et al.

The average CPB value of these genes is -0.098, lower than that of E. coli's endogenous genes (0.077). As expected, we did not find a codon pair ramp in these data, nor was the CPB value of each complete gene sequence significantly correlated with fluorescence levels (Spearman's ρ = -0.106, P > 0.19). However, when considering only the first 40 codon pairs of each sequence (average CPS = -0.112), we found a significant negative correlation (Spearman's ρ = -0.256, P < 0.01) between CPB and fluorescence levels.
Furthermore, in the 25% of GFP constructs with the highest CPB values in the first 40 codon pairs (37 constructs), fluorescence levels correlate significantly and strongly (Spearman's ρ = 0.514, P < 0.01) with the codon pair bias of the first 40 codon pairs. These results suggest that the expression level of each gene is related to local rather than global codon pair usage.

In summary, in this study several programs were developed in Java, Python and R, and a broad survey of codon pair bias, codon pair by codon pair, near the translation initiation region of all protein-coding sequences in 478 organisms from all three domains of life has been completed. We found that in nearly all species there is a general tendency for reduced CPB near the 5′ end of protein-coding sequences, which we refer to as the "codon pair ramp" in this study. This ramp consists of the first 20 to 50 codon pair positions of the protein-coding sequence, where codon pair bias is relatively low. Our finding of a strong connection between codon pair bias and translation speed confirms the important role played by the nucleotide sequence near the 5′ end of mRNAs in controlling early elongation. The source code of all programs developed and used in this study is freely available. The statistical evidence presented in this work remains to be experimentally verified and explained; doing so would further our knowledge of how information stored in DNA sequences determines diverse cellular processes. |
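The averaging and smoothing steps used to delimit the codon pair ramp can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the study's own code. The function names are hypothetical. The windows are taken to be consecutive and non-overlapping, which is inferred from the "12 sliding windows" over 120 codon pairs. CDSs shorter than a given position are simply skipped at that position, which is also an assumption. The Kolmogorov-Smirnov test that decides where the ramp ends is not reproduced here.

```python
def position_means(profiles, length=120):
    """Average CPS at each of the first `length` codon pair positions.

    `profiles` is a list of per-CDS CPS profiles aligned at the start codon.
    ASSUMPTION: a CDS contributes only to positions it actually reaches.
    """
    means = []
    for pos in range(length):
        values = [p[pos] for p in profiles if len(p) > pos]
        if not values:
            break  # no CDS is long enough to cover this position
        means.append(sum(values) / len(values))
    return means


def window_means(profile, window=10):
    """Mean CPS of each consecutive window of `window` codon pairs.

    ASSUMPTION: windows do not overlap, so a 120-position profile yields
    12 window means. A trailing partial window is ignored.
    """
    return [
        sum(profile[i:i + window]) / window
        for i in range(0, len(profile) - window + 1, window)
    ]
```

Under these assumptions, the ramp would correspond to the leading windows whose means fall significantly below the overall level of the smoothed profile. For a tail profile, the same functions could be applied to profiles reversed so that they align at the stop codon.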