| The inherent laws of the k-mer spectrum in genomic sequences have recently become a major research focus. The k-mer spectrum of a species' genome sequence is constant, and k-mer spectra differ in regular ways among species. The distributions of k-mer spectra (k > 6) in higher mammals exhibit multiple peaks, while those in lower organisms display a single peak. Building on this knowledge, we studied the inherent laws of the three-peak distribution of the 8-mer spectrum in human genomic sequences and found that the three types of CG 8-mers evolve independently, which we refer to as the independent selection law of the genome. We characterized the evolutionary features of the three CG 8-mer spectra in detail, validated the independent selection law in the human genome, and proposed a hypothesis about the biological functions of the three CG 8-mers. We also studied the distribution characteristics and biological functions of the three CG 8-mers in nucleosome-occupied sequences and CpG island sequences. Finally, we analyzed the distance conservation and sequence differences of 8-mers containing the CG dinucleotide in nucleosome-occupied sequences between the human and mouse genomes. Our main findings are as follows: 1. The three-peak distribution of the 8-mer spectrum in human genomic sequences was analyzed. The three peaks arise because the motifs within each peak were formed through different evolutionary selections, and these motifs constitute the composition patterns of genomic sequences. The 8-mer spectra of the whole-genome sequences, intergenic sequences, introns, and coding sequences of the human genome were computed. Except for the coding sequences, the 8-mer spectra of the other three sequence types all show three-peak distributions. To isolate the motifs in the three peaks, an XY dinucleotide (16 types) classification method was proposed, and the full set of 8-mers was divided into XY2, XY1, and XY0 subsets 
according to whether the 8-mer contains two, one, or no XY dinucleotides. We found that only under the CG dinucleotide classification do the CG0, CG1, and CG2 8-mer subsets each form an independent single-peak spectrum, so that the entire 8-mer spectrum of the genome divides cleanly into three categories. This behavior is not observed under the other 15 XY dinucleotide classifications. We term this CG-specific separation the independent selection law of genomic sequences. 2. The distribution characteristics of the 8-mer spectra of the three CG 8-mer subsets were analyzed. First, the positions of the three CG 8-mer spectra were investigated. Relative to the 8-mer spectrum of random sequences, the positions of the three CG 8-mer spectra are clearly separated. The CG2 8-mer spectrum lies at the low-frequency end and is farthest from the center of the random spectrum, followed by the CG1 8-mer spectrum, while the CG0 8-mer spectrum lies close to the center of the random spectrum. Second, the conservation (monochromaticity) of the spectrum distributions of the three CG 8-mer subsets was analyzed. Based on the standard deviation of the spectrum distributions of the three CG 8-mer subsets, the CG2 8-mer subset shows the strongest conservation, followed by the CG1 8-mer subset, and the CG0 8-mer subset shows the weakest. These results demonstrate three features of the independent selection law in genomic sequences: (1) The 8-mers in the CG1 and CG2 subsets are the consequence of directed genome evolution, while the 8-mers in the CG0 subset are the consequence of random evolution. (2) The three CG 8-mer subsets exhibit evolutionary separation. (3) The 8-mer spectra of the CG2 and CG1 subsets are significantly conserved. We also analyzed the spectrum characteristics of the XY subsets under the other 15 XY dinucleotide classifications and found that they do not satisfy these three features. The 
independent selection law states that any DNA sequence consists of the three independently evolved CG motif subsets, whose content and distribution characteristics determine the biological functions of the DNA sequence. 3. Based on the features of the independent selection law, the experimental results, and the theoretical analysis of functional sequences, we believe that the three CG 8-mer subsets have different biological functions. We therefore proposed the hypothesis that the motifs in the CG2 subset are the core motifs of CpG island sequences, while the motifs in the CG1 and CG0 subsets reflect the diversity of CpG island sequences. Likewise, the motifs in the CG1 subset are the main elements of nucleosome-binding motifs, while the motifs in the CG2 and CG0 subsets reflect the diversity of nucleosome-occupied sequences. To test this hypothesis, the 8-mer information of the three CG subsets was characterized separately in the nucleosome-occupied sequences and the CpG island sequences of the human genome. ROC analysis showed that the CG1 8-mer subset contains the most preferred motifs in nucleosome-occupied sequences, followed by the CG2 8-mer subset. In nucleosome-absent sequences, the CG0 8-mer subset is the most preferred, while the CG1 and CG2 8-mer subsets are not preferred. In CpG island sequences, the CG2 8-mer subset is the most preferred, followed by the CG1 8-mer subset. The CG0 8-mer subset is found predominantly in non-CpG island sequences. Thus, our results support the hypothesis. 4. Having validated that 8-mers containing the CG dinucleotide are the preferred motifs in nucleosome-occupied sequences, we analyzed the characteristics of nucleosome-occupied and nucleosome-absent sequences, including the frequency and distribution of k-mers (k = 1, 2, 3) and the G+C content. Our results showed that the frequency of single bases in nucleosome-occupied sequences is almost uniform, while the 
frequency of A/T in nucleosome-absent sequences is significantly higher than that of C/G, and the G+C content of nucleosome-occupied sequences is significantly higher than that of nucleosome-absent sequences. No other analyzed sequence characteristic differs significantly between nucleosome-occupied and nucleosome-absent sequences. In general, common sequence-information analysis methods cannot effectively reveal the core sequence characteristics of nucleosome-occupied sequences. This confirms that studying functional sequences on the basis of the independent selection law of the genome is effective and feasible. 5. The relationship between pairs of motifs containing the CG dinucleotide was examined from the perspective of conservation. Using the nucleosome-occupied and nucleosome-absent sequences of the human and mouse genomes, we statistically analyzed the differences in the distance distributions and distance variances of pairs of 8-mers containing the CG dinucleotide. The aim was to characterize the conservation of the motif pairs between human and mouse sequences by analyzing the distance between a pair of CG 8-mers. The results show that the average distance difference between CG 8-mer pairs in nucleosome-occupied sequences is significantly smaller than in nucleosome-absent sequences, and the variance of the distribution of average distance differences of CG 8-mer pairs in nucleosome-occupied sequences is also significantly smaller than in nucleosome-absent sequences. This suggests that, between the human and mouse genomes, the distribution of CG 8-mer pairs is more strongly conserved in nucleosome-occupied sequences than in nucleosome-absent sequences. |
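
The XY dinucleotide classification and the per-subset 8-mer spectra described in finding 1 can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the spectrum is taken here to be the number of distinct 8-mers observed at each occurrence count, and 8-mers containing more than two XY dinucleotides are assigned to the XY2 subset, since the abstract mentions only two, one, or none.

```python
from collections import Counter
from itertools import product

K = 8  # motif length used throughout the abstract


def count_dinucleotide(kmer, dinuc="CG"):
    """Count occurrences of `dinuc` in `kmer` (overlaps allowed, e.g. AA in AAA)."""
    return sum(1 for i in range(len(kmer) - 1) if kmer[i:i + 2] == dinuc)


def classify(kmer, dinuc="CG"):
    """Return 0, 1, or 2 for the XY0, XY1, XY2 subset.

    Assumption: k-mers with more than two occurrences are placed in class 2.
    """
    return min(count_dinucleotide(kmer, dinuc), 2)


def kmer_counts(seq, k=K):
    """Count every overlapping k-mer in `seq`, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def subset_spectra(seq, dinuc="CG", k=K):
    """Map each class (0, 1, 2) to a Counter {occurrence count: number of k-mers}.

    All 4**k possible k-mers are enumerated so that unobserved motifs
    contribute to the bin at occurrence count 0.
    """
    counts = kmer_counts(seq, k)
    spectra = {0: Counter(), 1: Counter(), 2: Counter()}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        spectra[classify(kmer, dinuc)][counts.get(kmer, 0)] += 1
    return spectra
```

Calling `subset_spectra(genome_sequence)` on a long sequence would give three histograms that can be plotted separately and compared with the spectrum of a shuffled or random sequence; passing another dinucleotide such as `dinuc="AT"` applies the same partition to one of the other 15 XY classifications.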