Theoretical Analysis And Prediction Of Nucleosome Positioning Based On Sequence Information

Posted on: 2015-01-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Q Xing
Full Text: PDF
GTID: 1260330428982697
Subject: Biophysics
Abstract/Summary:
Epigenetics is an important frontier and a research hotspot of the post-genomic era, and nucleosome positioning is one of its major areas. As the basic building block of higher-order chromosome structure, the nucleosome not only provides a means of packaging genomic DNA but is also involved in gene expression and regulation. By controlling DNA accessibility, nucleosomes regulate various biological processes, such as DNA transcription, DNA replication, DNA recombination, DNA repair, mRNA splicing, and disease development. Investigating nucleosome positioning across eukaryotic genomes can help elucidate how higher-order chromosome structure forms and can help uncover the intricacies of gene expression and regulation.

Nucleosome positioning along the genome is determined by many factors, including DNA sequence preferences, chromatin remodeling complexes, the transcriptional machinery, histone post-translational modifications, and histone variants. Intrinsic DNA sequence preferences of the nucleosome have been shown to be the most important of the factors recognized so far. Many theoretical and experimental studies of nucleosome positioning based on DNA sequence exist. However, the majority of theoretical studies have focused on the core DNA wrapped around the histone octamer and paid little attention to the linker DNA between nucleosomes. In this work, the sequence characteristics of core DNA and linker DNA retrieved from high-resolution nucleosome position data were analyzed statistically. Based on DNA sequence signals, a novel position-correlation scoring function (PCSF) and a support vector machine (SVM) were developed to predict nucleosome positioning in eukaryotic genomes.

Firstly, the distribution of k-mers (k = 1, 2, ..., 6) and the sequence bias parameter Mk(i) (k = 1, 2, ..., 6) were analyzed systematically in core and linker sequences across S. cerevisiae.
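As a rough illustration of the k-mer distribution analysis described above, the following sketch counts relative k-mer frequencies in two sets of sequences and compares them. It is not the dissertation's Mk(i) parameter, whose definition is not given in this abstract; the function name `kmer_frequencies` and the toy core and linker sequences are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequences, k):
    """Relative frequency of each of the 4**k k-mers over a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Hypothetical toy data, NOT taken from any real nucleosome map.
core = ["GCGCATGCGC", "CCGGATCCGG"]
linker = ["AAATTTAAAT", "TTTTAAAATA"]

core_freq = kmer_frequencies(core, 2)
linker_freq = kmer_frequencies(linker, 2)

# List dinucleotides by how much more frequent they are in the toy linker set.
for kmer in sorted(core_freq, key=lambda m: linker_freq[m] - core_freq[m], reverse=True)[:4]:
    print(kmer, round(linker_freq[kmer], 3), round(core_freq[kmer], 3))
```

On real data the two sequence sets would come from an experimental nucleosome map, and the same function could be called for each k from 1 to 6.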
Oligonucleotides composed of A and T are more enriched in linker DNA than in core DNA. The higher the A+T content, the more rigid the sequence; thus, the lower content of oligonucleotides composed of A and T in core DNA facilitates DNA wrapping. The bias of the k-mer frequency Mk(i) (k = 1, 2, ..., 6) in linker regions is drastically higher than that in core regions. In other words, the k-mer frequency bias, or sequence conservation, is stronger in linker regions than in nucleosome core regions. This result provides an important clue for predicting nucleosome occupancy using the sequence bias parameter.

Secondly, information redundancy Dk describes the vocabulary composition and grammatical structure of the genetic language. The values of Dk calculated across the genomes of S. cerevisiae, D. melanogaster, and C. elegans indicated that Dk in core DNA differs significantly from that in linker DNA. The law that short-range correlation of nucleotides is dominant was validated in both nucleosome and linker DNA sequences. This result probably explains why most theoretical models based on oligonucleotide or k-mer frequencies predict nucleosome positioning with high accuracy. We also confirmed that the difference in information content between core DNA and linker DNA is universal: neither the difference in sequence length nor the difference in dataset construction method between core DNA and linker DNA alters it.

Thirdly, power spectrum analysis is a popular method for detecting periodicity in DNA sequences. The power spectra of core DNA and linker DNA across the S. cerevisiae, D. melanogaster, and C. elegans genomes showed that the 3-nt and 10-nt periodicities are obvious in nucleosome DNA regions and are stronger than those in linker DNA regions for all three model organisms.
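To make the idea of detecting 3-nt and 10-nt periodicity concrete, here is a minimal sketch of one common formulation of a DNA power spectrum: each base is turned into a 0/1 indicator sequence, and the squared Fourier magnitudes of the four indicator sequences are summed at the frequency 1/period. The abstract does not specify the exact formulation or normalization used in the dissertation, so the function `power_at_period`, the division by sequence length, and the toy sequences are all assumptions.

```python
import cmath
import math


def power_at_period(seq, period):
    """Summed indicator-sequence spectral power of a DNA string at frequency 1/period."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    freq = 1.0 / period
    total = 0.0
    for base in "ACGT":
        # Fourier sum over the positions where this base occurs.
        s = sum(
            cmath.exp(-2j * math.pi * freq * n)
            for n, ch in enumerate(seq)
            if ch == base
        )
        total += abs(s) ** 2
    # Length normalization is an illustrative choice, not taken from the source.
    return total / len(seq)


# Hypothetical toy sequences with exact 3-nt and 10-nt repeats.
three_periodic = "ACG" * 20
ten_periodic = "AAGCTTCGGA" * 6

for name, seq in [("3-nt repeat", three_periodic), ("10-nt repeat", ten_periodic)]:
    print(name, round(power_at_period(seq, 3), 3), round(power_at_period(seq, 10), 3))
```

For real core and linker datasets, the power at each period would typically be averaged over all sequences in a set before comparing the two sets.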
In addition, the species-specific power spectra were shown.

Fourthly, to further clarify the effect of nucleotide correlation on nucleosome positioning, the parameter Fk (k = 0, ..., 98) was calculated to examine the base correlations corresponding to the 16 dinucleotides in core DNA and linker DNA. Using 1,584-element (99 × 16) vectors as input, the SVM was used to classify core and linker DNA regions in Homo sapiens, Oryzias latipes, C. elegans, Candida albicans, and S. cerevisiae. This model performed well, with an average total accuracy of 76.05% and an average MCC of 0.4876 across the five organisms.

Finally, a novel PCSF algorithm based on the bias of the 4-mer frequency M4(i) in linker sequences was developed to distinguish nucleosome from linker sequences. Five-fold cross-validation demonstrated that this algorithm performed well, with a mean sensitivity of 94.42% and a specificity of 94.35%. Next, the algorithm was used to predict nucleosome occupancy throughout the S. cerevisiae genome, and a high Pearson correlation coefficient of 0.761 with the in vitro experimental nucleosome positioning map of the 16 chromosomes was obtained. In addition, the predicted nucleosome profiles surrounding specific genes are notably similar to experimental maps of nucleosome organization in vitro and in vivo. Analysis of the PCSF-predicted nucleosome occupancy profiles in the vicinity of TSS, TTS, and ACS confirmed pronounced nucleosome-depleted regions. The results suggest that intrinsic DNA sequence preferences in linker regions have a significant impact on nucleosome occupancy and that the PCSF algorithm is an effective tool for predicting nucleosome positioning.
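The performance measures quoted above (total accuracy, sensitivity, specificity, and MCC) follow from a binary confusion matrix using their standard textbook definitions. The sketch below shows those definitions; the function name `classification_metrics` and the example counts are hypothetical and are not results from the dissertation.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 instead of raising when a class is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),  # fraction of positives recovered
        "specificity": ratio(tn, tn + fp),  # fraction of negatives recovered
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation coefficient
    }


# Hypothetical counts: nucleosome sequences as positives, linker sequences as negatives.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC accounts for all four cells of the confusion matrix, which is one reason it is often reported alongside accuracy when class sizes differ.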
Keywords/Search Tags: nucleosome positioning, sequence bias, information content, periodic signal, power spectrum analysis, position-correlation scoring function, support vector machine