Modeling And Discovering For Motifs Of Gene Promoter Sequences

Posted on: 2013-09-28  Degree: Master  Type: Thesis
Country: China  Candidate: B Tian  Full Text: PDF
GTID: 2250330392970611  Subject: Computer Science and Technology
Abstract/Summary:
Modeling and discovering motifs in gene promoter sequences is important for understanding gene expression and tissue-specific regulation. Studies show that combining multiple tools is a good way to obtain comprehensive results, from which candidate motifs are then derived by reducing redundancy. However, reducing redundancy requires a stable motif modeling method and a reliable similarity scoring method. In this paper, we develop a new method for modeling motifs and a probability-based similarity scoring schema for motif comparison.

Because different motifs may differ in length, we consider both positional information content and pairwise nucleotide dependence, and we model motifs by extracting features from this information. Positional information content describes the significance of a position, and it has been shown that nucleotides at different positions are interdependent. We compare two motif modeling methods using data from the JASPAR database.

To improve the accuracy of motif comparison, we develop a probability-based similarity scoring schema (PS3) that considers both the probability that two motifs come from a common source and the probability that they come from independent sources. We then classify two datasets, each containing 25 different classes. The results show that PS3 performs best among the four methods compared.

Merging similar motifs into a new one has the drawback that the distribution of nucleotides within a position tends toward uniform, so we cluster motifs without merging similar ones. We also give solutions to some key problems that arise when clustering to reduce redundancy. First, we compare two clustering procedures and show that the latter outperforms the former. Then, we cluster 1417 motifs from 76 human tissue-specific genes of CardiacMyocyte and obtain 38 motifs as the outcome. To analyze these motifs, we first connect them to known motifs using the online tool STAMP, then find the GO terms that map to these motifs using 2852 human tissue-specific genes and 13275 GO terms. The results show that the percentage of overlap is 60%, supporting the reliability of the clustering results.

Given the disadvantages of existing methods for reducing redundancy, this paper develops a new method for modeling motifs and a probability-based similarity scoring schema for motif comparison. By connecting candidate motifs to known motifs and GO terms, we confirm that the motifs generated by our methods are reliable.
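
As an illustration of the positional information content mentioned in the abstract, the following is a minimal sketch using the standard Shannon-based definition for a DNA motif column, assuming a uniform background over the four nucleotides. The abstract does not give the exact formula used in the thesis, so the function names, the count-matrix layout (one list of A, C, G, T counts per position), and the pseudocount parameter are all assumptions made for this example.

```python
import math


def position_information_content(column_counts, pseudocount=0.0):
    """Information content (in bits) of one motif position.

    Assumes a uniform background, so the maximum is log2(alphabet size),
    i.e. 2 bits for DNA. `column_counts` holds the counts for A, C, G, T.
    """
    alphabet_size = len(column_counts)
    total = sum(column_counts) + pseudocount * alphabet_size
    ic = math.log2(alphabet_size)
    for count in column_counts:
        p = (count + pseudocount) / total
        if p > 0:
            # Adding p * log2(p) subtracts the Shannon entropy of the column.
            ic += p * math.log2(p)
    return ic


def motif_information_profile(count_matrix, pseudocount=0.0):
    """Per-position information content for a whole motif (hypothetical helper)."""
    return [position_information_content(col, pseudocount) for col in count_matrix]


if __name__ == "__main__":
    # Toy motif with three positions: fully conserved, two-way split, uniform.
    toy_motif = [
        [10, 0, 0, 0],
        [5, 5, 0, 0],
        [5, 5, 5, 5],
    ]
    print(motif_information_profile(toy_motif))
```

A position with a uniform nucleotide distribution carries no information under this definition, which is one way to see why merging similar motifs, as discussed in the abstract, can weaken a motif model.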
Keywords/Search Tags: Sequence pattern modeling, Similarity scoring method, Motif discovery, Promoter sequence