Machine Learning Algorithm Integration And Its Application For Sequence Classification Problems

Posted on: 2021-05-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F X Meng
Full Text: PDF
GTID: 1367330632953405
Subject: Management Science and Engineering
Abstract/Summary:
Classification is an important issue in statistics and management research. Scientific classification is an important basis for data mining, statistical prediction, and scientific decision-making. In classification problems, sequence data is an important research object. In today's information society and the era of big data, human beings continue to generate and accumulate massive sequence data resources in production, daily life, and scientific research. Fully mining the information behind sequence data is of great significance for a scientific understanding of the natural world and the laws of economic and social development, for better guidance and management of economic and social activities, and for better statistical forecasting and management decisions. Because machine learning has powerful data processing and self-learning capabilities, it can handle massive, high-dimensional, and complex sequence data mining problems that traditional methods cannot cope with. Machine learning has therefore become an important research direction in disciplines such as management science and engineering and computer science.

With the development of new-generation high-throughput gene sequencing technology, the volume of gene sequence data has exploded. Relying on biological methods alone to study gene sequence data has great limitations, leaving many genetic problems without an accurate scientific understanding. Around this Spring Festival, an unknown new coronavirus (COVID-19) caused a major pneumonia epidemic, which has made genetic problems a hot topic in recent interdisciplinary research around the world. In January of this year, the National Natural Science Foundation of China issued an emergency project guide on "Basic Research on Traceability, Pathogenesis, and Prevention of New Coronavirus (2019-nCoV)" [1] to encourage interdisciplinary research and the use of new research paradigms to systematically solve scientific problems. In the study of genetic problems, accurate classification of gene
sequences is an important foundation and prerequisite. For this reason, this thesis studies sequence classification problems based on machine learning theories and methods. There are three key scientific problems to be solved: the first is the algorithm optimization of mapping non-numerical sequence data and mining its spectrum information; the second is the integration and innovation of machine learning algorithms for sequence classification; the third is the performance evaluation of different classification models and the credibility evaluation of classification results. At the application level, the thesis focuses on the classification of gene sequences. Several gene sequence classification methods and machine learning algorithm integration models are given, and the classification performance of the various models is compared and evaluated by constructing an AAA comprehensive fuzzy evaluation model.

The thesis starts from the problem of sequence classification, machine learning theory, and bioinformatics theory, and systematically reviews the research progress of current machine learning algorithms in data mining and bioinformatics. Considering the shortcomings of existing research, it takes machine learning algorithm integration as the entry point for studying sequence classification. By further sorting out and analyzing the research questions and methods, the research objectives, contents, and ideas of the thesis are clarified. The research is conducted at both the theoretical and applied levels. At the theoretical level, the thesis focuses on the integration, optimization, and modeling of machine learning algorithms. It uses a progressive, step-by-step approach to systematically study the optimization of sequence data feature representation and spectrum information mining algorithms, the integrated learning of bootstrap sampling and SVR, the
integration of hidden Markov models with discrete-time dynamic Bayesian networks and the credibility evaluation of their prediction probabilities, and the integration of BP neural networks and genetic algorithms. At the application level, the thesis focuses on the basic problem of exon classification in gene sequences. Based on the models and optimized algorithms constructed at the theoretical level, different gene sequences are classified and discriminated, and the classification performance of the different models is compared and analyzed.

The innovation of this thesis is mainly reflected in the following four aspects.

First, for the mapping of non-real-valued sequences and the mining of spectrum information, three "domain transformation" mapping methods are compared, with theoretical proofs. Through domain transformation, the spectrum information of sequence data can be mined more effectively, so that the spectrum signal can be used more intuitively to study regularities in the sequence data. On this basis, a fast algorithm for mining gene sequence spectrum information based on the idea of sparse optimization is proposed. The algorithm significantly improves both gene sequence data storage and spectrum information calculation. In terms of data storage, it can reduce the number of computer storage units used by up to 50%. In terms of spectrum information mining, the complexity of the algorithm is reduced and its efficiency is improved. Simulation results show that the computation time of the power spectrum and the SNR is reduced by 83.18% and 61.33%, respectively.

Second, for the problem of how to select a training data set from small samples for SVM, an algorithm model based on interactive integrated learning of bootstrap sampling and SVR is constructed. The algorithm model is a classification method based on the optimal threshold of spectrum information. Through interactive integrated learning, we
can not only reduce the number of samples required, but also avoid or mitigate the poor training of the SVM regression model caused by improper selection of the training set, so that a good training model and good classification predictions can still be obtained with few samples. To demonstrate the performance of the algorithm model, the thesis applies it to solving for the optimal spectrum threshold of gene exons in different species and establishes a multi-objective optimal threshold decision model. Simulation results show that the algorithm model is feasible and effective, with an average test accuracy of more than 90%.

Third, for the reliability evaluation of the prediction probabilities and classification results of hidden Markov models, an integrated algorithm model of a dynamic Bayesian network and a hidden Markov model is constructed. The algorithm model is a classification method based on prediction probability. First, a comprehensive reliability evaluation model for prediction probability is designed based on event tree and fault tree risk importance indices. Then, a three-state hidden Markov model of gene exons is constructed. Finally, the performance of gene sequence classification is further improved by integrating the discrete-time Bayesian network with the hidden Markov model. For model solving and simulation, a hybrid of the forward algorithm and the EM algorithm is designed and simulated. The results show that the algorithm locates the start and end points of exons more accurately, locates and discriminates exons at single-base resolution, and makes the classification results more accurate.

Fourth, for the problem that improperly chosen initial parameters make a BP network prone to being trapped in local optima, an algorithm model based on integrated learning of a BP neural network and a genetic algorithm is constructed. The algorithm model is a method based on global
optimization, which can search for a global optimum without precise logical reasoning. The genetic algorithm optimizes the most critical parameters of the BP neural network, its connection weights and thresholds, which improves learning efficiency, avoids the tendency of the BP neural network to fall into local optima, and achieves a global search for the optimal solution, making the classification results more accurate. Simulation experiments show that the algorithm model yields better classification results.
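
The spectrum information mining described under the first innovation can be illustrated with a minimal sketch. The abstract does not name the three mapping methods, so this sketch assumes the widely used binary indicator (Voss) mapping. The function names and the SNR definition used here (power at frequency index N/3 divided by the mean power over the nonzero frequencies) are illustrative assumptions, not the thesis's algorithm, and the sparse fast algorithm is not reproduced.

```python
import numpy as np


def indicator_mapping(seq: str) -> np.ndarray:
    """Map a DNA string to a 4 x N binary array with rows for A, C, G, T.

    Assumption: Voss-style indicator mapping; the thesis may use others.
    """
    seq = seq.upper()
    return np.array([[1.0 if base == nt else 0.0 for base in seq]
                     for nt in "ACGT"])


def power_spectrum(seq: str) -> np.ndarray:
    """Total power spectrum: sum of |DFT|^2 over the four indicator rows."""
    u = indicator_mapping(seq)
    return (np.abs(np.fft.fft(u, axis=1)) ** 2).sum(axis=0)


def snr_at_one_third(seq: str) -> float:
    """Power at index N/3 divided by the mean power over k = 1..N-1.

    Assumption: this ratio is one common way to quantify 3-base
    periodicity; the thesis's exact SNR definition is not given.
    """
    n = len(seq)
    if n < 3 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    p = power_spectrum(seq)
    return float(p[n // 3] / p[1:].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    periodic = "ATG" * 100
    random_seq = "".join(rng.choice(list("ACGT"), size=300))
    print("periodic:", snr_at_one_third(periodic))
    print("random:  ", snr_at_one_third(random_seq))
```

A strongly 3-periodic sequence concentrates its power at index N/3, so its ratio is far larger than that of a random sequence of the same length. A threshold on this kind of value is what a spectrum-based exon classifier would tune.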
Keywords/Search Tags: Sequence classification, Machine learning, Bootstrap sampling, DTBN model, HMM model