
Research Of The DNA Sequence Classification Algorithm Based On Machine Learning

Posted on: 2018-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Dong
Full Text: PDF
GTID: 2310330515972329
Subject: Computer application technology
Abstract/Summary:
With the development of the Human Genome Project and sequencing technology, massive amounts of biological data have been produced. Mining useful information from these data is a problem facing researchers, and bioinformatics emerged against this background. At present, DNA sequence classification has become an important issue in bioinformatics, and it is the problem studied in this dissertation. In recent years, with the development of sequencing technology, the scale of DNA sequence data has kept increasing, and machine learning algorithms play an increasingly important role in DNA sequence analysis. Machine learning builds effective learning models based on mathematical statistics and algorithm theory; these models can capture complex patterns hidden in large amounts of DNA sequence data and make decisions based on them. Therefore, this dissertation uses machine learning algorithms to classify DNA sequences.

After analyzing machine learning classification algorithms, the K-nearest neighbor algorithm was chosen to classify DNA sequences. The K-nearest neighbor algorithm holds an important place among machine learning classification algorithms: it is theoretically mature and also one of the simplest machine learning algorithms. It classifies using the training samples themselves, without needing additional data to describe them. Because the algorithm considers only the training sequences close to the sequence being classified, it is less affected by noisy data. During classification, the algorithm directly uses the relationship between the training samples and the test samples, which not only reduces classification error but also reduces the adverse effect of feature selection on the classification results. Because of these advantages, it has been widely used in many fields, such as text classification and biological information.

In this dissertation, the K-nearest neighbor algorithm is used to classify DNA sequences. First, each DNA sequence is transformed into data that the K-nearest neighbor algorithm can process; that is, feature information is extracted from the DNA sequence so that the original sequence data are converted into numerical features. Three feature extraction methods are fused to form the feature vector of a DNA sequence: one based on single-base content, one based on double-base content, and one based on three-base content. For double-base and three-base content, a rolling algorithm is used to calculate the frequency of each combination in a DNA sequence. Second, feature selection is carried out on the fused feature vectors. This dissertation adopts the idea of dimensionality reduction to select features, and the feature selection algorithms are principal component analysis and kernel principal component analysis. Finally, the K-nearest neighbor algorithm is used to classify the test samples, and the classification accuracy is verified by comparing the predicted category of each test sample with its real category.

Classification performance is evaluated not only by classification accuracy but also by response time. Using kernel principal component analysis for feature selection achieves good results, which indicates that the K-nearest neighbor algorithm combined with kernel principal component analysis is better suited to DNA sequence classification. The dissertation also analyzes the factors that affect the performance of the DNA sequence classifier: the sample size and the proportion of training samples to test samples. As for sample size, classification accuracy does not keep rising as the sample size decreases, which shows that a suitable sample size should be chosen according to the specific situation. As for the proportion of training samples to test samples, the experimental results show that with a ratio of 3:1 the model can learn effectively.
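The feature extraction and classification steps described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names (`kmer_frequencies`, `fused_features`, `knn_predict`), the Euclidean distance, the value k=3, and the toy sequences and labels are all assumptions made for illustration. The "rolling algorithm" is interpreted here as a sliding window that advances one base at a time, and the PCA / kernel PCA dimensionality reduction step is omitted.

```python
from collections import Counter
from itertools import product
import math

BASES = "ACGT"

def kmer_frequencies(seq, k):
    """Relative frequency of every length-k base combination in seq.

    Windows are taken with a sliding step of one base (an assumed reading
    of the abstract's "rolling algorithm"). Windows containing symbols
    other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

def fused_features(seq):
    """Concatenate 1-, 2- and 3-base frequencies: 4 + 16 + 64 = 84 values."""
    return [f for k in (1, 2, 3) for f in kmer_frequencies(seq, k)]

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors nearest to x.

    Euclidean distance is an assumption; the abstract does not name a metric.
    """
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data, invented purely to exercise the functions.
train = [
    ("ATATATTTAATA", "AT-rich"),
    ("TTATAAATATTA", "AT-rich"),
    ("GCGCGGCCGCGC", "GC-rich"),
    ("CCGCGGGCGCCG", "GC-rich"),
]
train_X = [fused_features(s) for s, _ in train]
train_y = [label for _, label in train]

print(knn_predict(train_X, train_y, fused_features("GGCGCCGCGGCG"), k=3))
```

In a fuller pipeline, the 84-dimensional vectors would first be reduced with principal component analysis or kernel principal component analysis, as the abstract describes, before being passed to the classifier.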
Keywords/Search Tags:Bioinformatics, DNA sequence classification, machine learning, K-nearest neighbor algorithm, Kernel principal component analysis