Research On Key Techniques Of SNP Selection And Diagnostic Model For Schizophrenia

Posted on: 2020-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: X B Lu
Full Text: PDF
GTID: 2404330596996916
Subject: Computer Science and Technology
Abstract/Summary:
A single nucleotide polymorphism (SNP) is a DNA sequence polymorphism caused by the variation of a single nucleotide at the genomic level. As an important type of genetic variation data, SNP data are well suited to the study of complex traits and the genetic dissection of diseases, and research on SNP data has become one of the most important topics in bioinformatics. However, there is considerable redundancy among SNP sites. To use SNP data for the diagnosis of complex diseases, a representative subset of SNPs must be selected. With the rapid development of machine learning, the selection of SNP subsets can be treated as a feature selection problem. This thesis therefore applies feature selection techniques and classification models to SNP selection and to the diagnosis of schizophrenia. First, an informative SNP subset selection method based on K-MIM is proposed. Then, a diagnostic model for schizophrenia based on XGBoost is designed. The specific work is as follows:

(1) To address the strong correlation between SNPs, a new algorithm called K-MIM is proposed to group SNPs. K-MIM introduces mutual information into K-Means and proposes a new distance metric that uses mutual information to measure the correlation between features. This overcomes the inability of the Euclidean distance to capture the intrinsic relationship between SNPs. In addition, because the cluster center updating mechanism of K-Means fails under the new distance metric, a new updating mechanism is proposed. Since the distance from a sample to the cluster center is approximately an increasing function of the sum of the distances from that sample to the other samples, the SNP with the smallest total distance to the other SNPs in the cluster is used as the cluster center in place of the original one. The experimental results show that K-MIM reconstructs non-informative SNPs better than K-Means and other improved K-Means algorithms. Compared with informative SNP selection methods such as MCMR and ReliefF, the new informative SNP selection method increases classification accuracy by an average of 1.83% and 3.33% on the two data sets, respectively. The informative SNP selection method based on the K-MIM algorithm therefore has a clear advantage in selecting informative SNP subsets.

(2) When calculating pheromone accumulation, the original ant colony algorithm assumes that a shorter informative SNP subset reconstructs non-informative SNPs better. This thesis therefore proposes a new pheromone accumulation mechanism that considers the length and the quality of a solution at the same time, by introducing the prediction error of the informative SNP subset on the non-informative SNP subset. Meanwhile, to keep the algorithm from falling into a local optimum, the redundancy of informative SNP subsets is introduced into the pheromone volatilization mechanism so that pheromones volatilize adaptively. The experimental results show that the improved ant colony algorithm reconstructs non-informative SNPs better than ACO, PSO and GA. Compared with informative SNP selection methods such as MCMR and ReliefF, the new informative SNP selection method increases classification accuracy by an average of 1.33% and 1.11% on the two data sets, respectively. The improved ant colony algorithm therefore enhances the advantage of the informative SNP selection method based on the K-MIM algorithm.

(3) For the diagnosis of schizophrenia, a cost-sensitive XGBoost is proposed to address the fact that the cost of misdiagnosis differs between patients diagnosed as healthy persons and healthy persons diagnosed as patients. Since the misclassification cost of the dataset is unknown, the algorithm assigns each sample an adaptive misclassification cost weight based on its prediction error and the mean error of all samples, to reduce the possibility of diagnosing a patient as a healthy person. At the same time, the depth of the tree is added to the regularization term of the objective function to prevent overfitting. In the classification experiment, the cost-sensitive XGBoost achieves essentially the same classification accuracy as the XGBoost, SVM and neural network algorithms. In the experiment counting misclassifications, compared with XGBoost, the number of patients misdiagnosed as healthy persons on the two datasets is reduced by 7.5% and 6.67%, respectively, which reduces the possibility of diagnosing patients as healthy persons.
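
The grouping step in (1) can be illustrated with a minimal Python sketch. The abstract does not give the exact form of the K-MIM distance, so the normalized form d = 1 - I(X;Y) / max(H(X), H(Y)) used here is an assumption, as are the function names (`mi_distance_matrix`, `k_mim_sketch`) and the random initialization. The center update follows the description above: within each cluster, the SNP with the smallest total distance to the other SNPs becomes the new center.

```python
import numpy as np

def entropy(x):
    """Shannon entropy (in bits) of a discrete vector."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete vectors."""
    joint = np.stack([x, y], axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_xy = float(-(p * np.log2(p)).sum())
    return entropy(x) + entropy(y) - h_xy

def mi_distance_matrix(snps):
    """Pairwise distances between SNP columns (samples are rows).

    Assumed form: d = 1 - I(X;Y) / max(H(X), H(Y)).
    The metric actually used in the thesis may differ.
    """
    n = snps.shape[1]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            h = max(entropy(snps[:, i]), entropy(snps[:, j]))
            mi = mutual_information(snps[:, i], snps[:, j])
            nmi = mi / h if h > 0 else 0.0
            d[i, j] = d[j, i] = 1.0 - nmi
    return d

def k_mim_sketch(snps, k, n_iter=20, seed=0):
    """Cluster SNP columns; each center is itself an SNP (a medoid)."""
    rng = np.random.default_rng(seed)
    d = mi_distance_matrix(snps)
    # Hypothetical initialization: k distinct SNPs chosen at random.
    centers = rng.choice(snps.shape[1], size=k, replace=False)
    for _ in range(n_iter):
        # Assign every SNP to its nearest center.
        labels = np.argmin(d[:, centers], axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old center for an empty cluster
            # New center: member with the smallest summed distance
            # to the other members of the same cluster.
            within = d[np.ix_(members, members)].sum(axis=1)
            new_centers[c] = members[np.argmin(within)]
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(d[:, centers], axis=1)
    return centers, labels
```

Under this assumed metric, two SNPs that are relabelled copies of each other have distance close to 0 and two statistically independent SNPs have distance close to 1, which is the kind of dependence a Euclidean distance on genotype codes does not capture. The returned centers could then serve as candidate informative SNPs, one per group.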
Keywords/Search Tags: SNP, feature selection, cluster, ant colony algorithm, XGBoost, schizophrenia