Recombination Hotspots And Protein Fold Recognition Based On Sequence Information

Posted on: 2018-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: R Wang
GTID: 2310330533969233
Subject: Computer Science and Technology
Abstract/Summary:
Bioinformatics is an interdisciplinary field spanning computer science and biology that applies computational methods to biological problems. In recent years, advances in sequencing technology have produced rapidly growing volumes of DNA and protein sequence data. A sequence is the primary structure of a biological macromolecule, and sequence information can reflect characteristics of its spatial structure. An open question is how to extract more information about the structure and function of biological macromolecules from these primary sequences. This dissertation combines sequence-based feature extraction with machine learning methods to address two problems in transcription and protein structure identification research: recombination hotspot identification and protein fold recognition.

Recombination hotspots play an important role in the evolution of organisms, and identifying them helps in studying the function of DNA and proteins. To improve recognition accuracy, researchers have used many kinds of sequence-based features, among which the kmer feature is commonly used. However, as the kmer length increases, the feature vectors become sparse, because many kmers appear only once or not at all, which may cause overfitting. To overcome this disadvantage, the gapped kmer feature has been used for related bioinformatics problems. This dissertation applies the gapped kmer feature to recombination hotspot identification for the first time. Using the gapped kmer kernel proposed by Ghandi, it combines the gapped kmer feature with a support vector machine to construct the SVM-GKM predictor for recombination hotspot identification. The experimental results show that SVM-GKM performs better than the other methods compared.

The fold is the secondary structure of a protein, which plays an important role in both the sequence structure and the multi-level structure of the protein. It is also important for protein function research. The main challenge in protein fold recognition is improving the identification rate. To address it, this dissertation proposes two improvements. The first is to obtain profile-based protein sequences through data preprocessing. The second is to fuse multiple features to capture more sequence information, because a single feature cannot contain the full information. On this basis, the dissertation constructs a predictor called PP-MF to predict the protein fold. PP-MF employs five features: the gapped kmer feature, the auto-cross covariance feature, the bi-gram feature, the pseudo amino acid composition feature, and the five attribute characteristics feature. Experiments on two datasets show that PP-MF achieves higher accuracy than most existing protein fold recognition methods.
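The sparsity argument for kmers, and the gapped kmer alternative, can be illustrated with a small sketch. It is not the dissertation's implementation: the function names `kmer_counts` and `gapped_kmer_counts`, the parameters `l` (window length) and `k` (number of informative positions), and the toy sequence are all assumptions made for illustration. It also counts features explicitly, whereas the gapped kmer kernel attributed to Ghandi computes similarities without enumerating them.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every contiguous kmer of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_kmer_counts(seq, l, k):
    """Count gapped kmers: windows of length l with k informative positions.

    The remaining l - k positions are masked with '-' (illustrative notation).
    """
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        # Enumerate every choice of k positions to keep within the window.
        for keep in combinations(range(l), k):
            keep_set = set(keep)
            masked = "".join(
                c if j in keep_set else "-" for j, c in enumerate(window)
            )
            counts[masked] += 1
    return counts


if __name__ == "__main__":
    seq = "ACGTTGCAACGTAGCTAGCTTACG"  # hypothetical toy DNA sequence
    for k in (2, 4, 6):
        observed = len(kmer_counts(seq, k))
        possible = 4 ** k  # size of the full kmer feature space for DNA
        print(f"k={k}: {observed} of {possible} possible kmers observed")
    gapped = gapped_kmer_counts(seq, l=6, k=3)
    print(f"l=6, k=3: {len(gapped)} distinct gapped kmers observed")
```

Running the demo shows the fraction of observed kmers shrinking as k grows, which is the sparsity the abstract describes. Masking positions lets several distinct windows share a feature, so gapped kmer counts tend to be less sparse for the same window length.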
Keywords/Search Tags:bioinformatics, recombination hotspot, fold recognition, gapped kmer, support vector machine, multi-features