Predicting Functional Sites Based On Support Vector Machine And Extreme Learning Machine

Posted on: 2016-03-18  Degree: Master  Type: Thesis
Country: China  Candidate: Q Li
GTID: 2310330512971039  Subject: Applied Mathematics
Abstract/Summary:
Identifying gene functional sites is a central research problem in bioinformatics. Existing approaches differ in how they select the consensus region and how they predict the functional sites. In this thesis, we propose a new method for identifying gene functional sites based on the extreme learning machine and the support vector machine, together with a new consensus formula for determining the consensus region. First, we determine the consensus region by defining a conservative strength formula for each site. Second, we extract sequence position features of the consensus region using the multi-scale component and the adjacent and non-adjacent position correlating weight matrix. We also extract the component features of the upstream and downstream sequences using the incremental diversity of sequence components. Finally, we construct support vector machine and extreme learning machine classifiers to integrate the feature information.

Splice site prediction: To recognize splice sites, we first quantitatively determine the consensus region and the upstream and downstream sequences by calculating the conservative strength of each site. We then represent each sequence by a five-dimensional feature vector extracted with the adjacent and non-adjacent position correlating weight matrix and the incremental diversity of components. Using a support vector machine classifier with 5-fold cross-validation, we apply this feature vector to the HS3D datasets with positive-to-negative ratios of 1:1 and 1:10. On the 1:1 dataset, the best Matthews correlation coefficients for donor and acceptor sites are 0.924 and 0.947, respectively; on the 1:10 dataset they are 0.754 and 0.734, respectively. Compared with existing methods, our method yields a marked improvement, especially in the prediction accuracy for acceptor sites. Because the support vector machine converges very slowly on large datasets, we also introduce the extreme learning machine classifier, which combines good generalization capability with fast learning speed. The results on the 1:10 dataset show that the extreme learning machine and the support vector machine perform comparably.

Promoter recognition: In this part, we use 1,400 promoters from the Eukaryotic Promoter Database (EPD) as positive samples, and 1,290 coding sequences (CDS) and 1,264 intron sequences as negative samples. As in splice site prediction, we first obtain the consensus region from the conservative strength of each site. We then represent each sequence by a thirteen-dimensional feature vector built from the correlation information of the consensus region, together with the component information and the CpG island information of the whole sequence. Using a support vector machine classifier with 5-fold cross-validation, we apply this feature vector to promoter prediction. The best Matthews correlation coefficient is 0.975 for promoters versus CDS and 0.946 for promoters versus introns. These results show that our method improves the accuracy of splice site and promoter recognition and outperforms the other methods compared.
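
The abstract reports all results as Matthews correlation coefficients (MCC), including those on the 1:10 dataset, where negatives outnumber positives. The sketch below is only an illustration, not the thesis code: it computes MCC from binary labels and contrasts it with plain accuracy on a made-up 1:10 toy set. The function names and the toy predictions are assumptions introduced for this example, and returning 0 when the denominator vanishes is a common convention rather than something stated in the source.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = site, 0 = non-site)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def matthews_cc(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention assumed here: an undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical 1:10 toy set: 10 true sites followed by 100 non-sites.
y_true = [1] * 10 + [0] * 100

# Trivial predictor that labels everything as a non-site.
all_negative = [0] * 110

# Made-up predictor: finds 8 of the 10 sites and raises 5 false alarms.
useful = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 95

for name, pred in [("all-negative", all_negative), ("useful", useful)]:
    print(name, round(accuracy(y_true, pred), 3), round(matthews_cc(y_true, pred), 3))
```

On this toy set the trivial all-negative predictor still scores about 0.91 accuracy but an MCC of 0, which is one reason MCC is a reasonable summary for imbalanced data such as a 1:10 set.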
Keywords/Search Tags: Gene functional sites, Support vector machine, Extreme learning machine, Splice sites, Promoter