
Study On The Identification Method Of Sheep Varieties Based On The Whole Genome SNP Locus

Posted on: 2019-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Liu
GTID: 2393330566967157
Subject: Electronic and communication engineering
Abstract/Summary:
Variety identification has many practical applications and considerable research value, yet the identification of sheep breeds has received little attention in recent years. With the rapid development of biotechnology, more and more gene expression data on sheep have been measured, which makes it possible to classify sheep of unknown breed from genetic data. The thesis uses gene chip technology to determine sheep genes. Because gene chip technology is developing quickly, thousands of genetic data points can be acquired rapidly and simultaneously. The number of samples, however, cannot keep pace with the amount of genetic data, so the data are high-dimensional with small sample sizes. Single nucleotide polymorphism (SNP) data are one type of genetic data and share this disadvantage. How to process and analyze such data effectively has become a focus for many researchers.

The main goal of the thesis is to select highly informative SNPs so that sheep of unknown breed can be classified correctly. The study faces two main difficulties: (1) how to select effective features from high-dimensional data; (2) how to choose a suitable and efficient classification algorithm. The thesis first applies a traditional LSDL statistical method to a small number of samples in order to explore the patterns in the data, identify problems, and then improve the implementation of the method, with the ultimate aim of reducing complexity and improving computational efficiency. An improvement is then proposed to address the problems of the traditional algorithms: because traditional analysis methods do not take the small-sample, high-dimensional nature of the data into account, they cannot make good use of the relationships within the data.

The thesis combines principal component analysis (PCA) with several classifiers, which improves classification accuracy, makes the method more general, and saves computation time. First, PCA extracts the features used for classification and reduces the dimensionality of the original data. These features are then fed to four classifiers: K nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and back propagation neural network. The classifiers use the filtered SNP data as classification features to identify the correct breed. The algorithm used in the thesis can transform SNP data expressed in vector form into matrix form, which makes better use of the correlation and structural characteristics among SNP data. Finally, the thesis shows that, of the four classification algorithms, random forest best optimizes the number of SNP sites used for breed classification, thereby reducing classification cost and improving classification efficiency.
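The PCA-then-classify step described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes genotypes coded as minor-allele counts (0, 1, 2), uses invented toy data with hypothetical labels `breed_A` and `breed_B`, computes PCA by power iteration in plain Python, and shows only the KNN classifier of the four named. All function and variable names are illustrative.

```python
import math
import random
from collections import Counter

# Toy genotypes coded as minor-allele counts (0, 1, 2); rows are sheep,
# columns are SNPs. Values and breed labels are invented for illustration.
TRAIN = [
    [2, 2, 2, 2, 0, 1, 2, 1],
    [2, 2, 1, 2, 1, 0, 1, 2],
    [2, 1, 2, 2, 2, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 2, 1],
    [0, 0, 1, 0, 1, 0, 1, 2],
    [0, 1, 0, 0, 2, 1, 0, 0],
]
LABELS = ["breed_A"] * 3 + ["breed_B"] * 3


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def column_means(X):
    n = len(X)
    return [sum(col) / n for col in zip(*X)]


def center(X, means):
    """Subtract the per-SNP training mean from every sample."""
    return [[x - m for x, m in zip(row, means)] for row in X]


def principal_components(Xc, k, iters=200, seed=0):
    """Top-k principal directions via power iteration with deflation.

    Each step computes X^T (X v), so the p x p covariance matrix is never
    formed; this helps when SNPs (p) far outnumber samples (n).
    """
    rng = random.Random(seed)
    p = len(Xc[0])
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for _ in range(iters):
            scores = [dot(row, v) for row in Xc]        # X v
            w = [dot(col, scores) for col in zip(*Xc)]  # X^T (X v)
            for c in comps:                             # remove earlier PCs
                proj = dot(w, c)
                w = [a - proj * b for a, b in zip(w, c)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                break
            v = [a / norm for a in w]
        comps.append(v)
    return comps


def project(Xc, comps):
    """Coordinates of each centered sample on the principal directions."""
    return [[dot(row, c) for c in comps] for row in Xc]


def knn_predict(train_scores, labels, query_score, k=3):
    """Majority vote among the k nearest training samples (Euclidean)."""
    dists = sorted(
        (math.dist(s, query_score), lab) for s, lab in zip(train_scores, labels)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]


# Fit PCA on the training genotypes only, then reuse means and components.
MEANS = column_means(TRAIN)
COMPS = principal_components(center(TRAIN, MEANS), k=2)
TRAIN_SCORES = project(center(TRAIN, MEANS), COMPS)


def classify(genotype, k=3):
    score = project(center([genotype], MEANS), COMPS)[0]
    return knn_predict(TRAIN_SCORES, LABELS, score, k)


print(classify([2, 2, 2, 1, 1, 1, 1, 1]))
print(classify([0, 0, 0, 1, 1, 1, 1, 1]))
```

In this toy set the first four SNPs differ between the two breeds while the last four follow the same pattern in both, so the first principal component captures the breed difference. With real chip data the number of retained components and the choice of classifier would need to be tuned, for example by cross-validation.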
Keywords/Search Tags: SNP, Variety Identification, PCA, Random Forest, Support Vector Machine, Back Propagation Neural Network