Research And Implementation Of Biological Feature Selection Algorithm Based On Hierarchical Clustering

Posted on: 2020-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: F Li
Full Text: PDF
GTID: 2370330575979893
Subject: Software engineering
Abstract/Summary:
With the rapid development of medical technology, computer technology, and high-throughput data storage technology, large quantities of biomedical data are generated every year. In disease research, how to extract valuable information from massive disease data has long been an active topic in data mining and machine learning. As microarray technology has matured, the expression levels of all genes in a biological sample can be measured easily, yielding gene expression profile data. Such data contain a large amount of gene-related information. If the biomarkers that matter for a disease can be found and studied, this will not only advance research on that disease but also provide new ideas for its diagnosis and treatment. These biomarkers are often differentially expressed between disease samples and control samples. Combining data mining with machine learning classifiers to learn sample features and perform predictive analysis is an effective and important way to find biomarkers of biomedical value.

The human body has tens of thousands of genes. From the perspective of systems biology, genes with similar expression patterns tend to have similar functions. These functionally similar genes work together to form a gene functional subsystem. Within each subsystem, a small number of genes play a key regulatory role, while most play a supporting role. These key regulatory genes are among the biomarkers of greatest research value, but how to use machine learning and data mining effectively to mine the genes that play a key role in disease pathology remains a challenge.

On the other hand, samples of specific disease categories may be difficult to collect, which is likely to cause class imbalance. In addition, the number of samples in gene expression profile data is often much smaller than the number of genes, which results in the "small n, large p" problem. Both factors interfere substantially with the performance of machine learning classifiers. Compared with class-balanced data, it is harder to learn a good classifier from imbalanced data. If features are used directly for model training without screening and dimension reduction, the result is not only high complexity and low performance but also overfitting. Feature selection is one of the important solutions to these problems: a feature selection algorithm evaluates the features in the feature set, filters out irrelevant and redundant ones, and improves the performance of the predictive model while reducing the feature dimension.

In response to these problems, this thesis proposes a method that draws on systems biology. Hierarchical clustering groups genes with similar expression patterns, dynamic pruning (dynamic tree cut) selects appropriate clusters, and features are ranked by their relevance to the class labels to select an initial feature subset, which greatly reduces the feature dimension. Then, during recursive feature elimination, a feature in the subset can be replaced by another feature from the same cluster, and the feature subset with relatively good performance is selected with the help of an embedded classifier. The experimental results show that the algorithm achieves relatively good classification performance with fewer features and that, compared with similar algorithms, it has relatively good stability. Among the features the algorithm selected on psoriasis data, some have already been reported in the literature as closely related to psoriasis, while others have no relevant literature so far; the latter may be of important reference value for related medical research.
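
The cluster-then-rank step described in the abstract can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the distance is 1 minus the Pearson correlation, the linkage is average linkage, a fixed cut height stands in for the dynamic tree cut, and relevance is the absolute Pearson correlation with a binary class label. All names (`cluster_genes`, `select_initial_subset`, `cut_height`, `top_k`) and the toy data are hypothetical. The recursive elimination stage with an embedded classifier is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_genes(expr, cut_height):
    """Average-linkage agglomerative clustering of genes.

    expr maps a gene name to its expression values (one per sample).
    Distance between two genes is 1 - Pearson correlation. Merging stops
    once the closest pair of clusters is farther apart than cut_height
    (assumption: a fixed-height cut, NOT the dynamic tree cut).
    """
    genes = list(expr)
    dist = {(g, h): 1.0 - pearson(expr[g], expr[h])
            for g in genes for h in genes if g != h}
    clusters = [[g] for g in genes]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[g, h] for g in a for h in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > cut_height:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def select_initial_subset(expr, labels, cut_height, top_k):
    """Pick one representative gene per cluster, ranked by relevance.

    Returns the top_k representatives and the clusters themselves, each
    sorted from most to least relevant gene.
    """
    relevance = {g: abs(pearson(v, labels)) for g, v in expr.items()}
    ranked = [sorted(c, key=relevance.get, reverse=True)
              for c in cluster_genes(expr, cut_height)]
    ranked.sort(key=lambda c: relevance[c[0]], reverse=True)
    return [c[0] for c in ranked[:top_k]], ranked


# Hypothetical toy data: 5 genes measured in 6 samples (3 control, 3 disease).
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1, 1, 1, 3, 3, 3],  # strongly class-related
    "g2": [1, 2, 1, 3, 4, 3],  # co-expressed with g1
    "g3": [1, 3, 2, 2, 4, 3],  # moderately class-related
    "g4": [1, 3, 2, 1, 3, 2],  # co-expressed with g3, not class-related
    "g5": [2, 1, 3, 3, 1, 2],  # unrelated to the class and to other genes
}

subset, ranked_clusters = select_initial_subset(expr, labels, cut_height=0.3, top_k=2)
print(subset)
print(ranked_clusters)
```

Keeping the ranked clusters alongside the selected subset is what would let a later elimination stage swap a selected gene for another member of its cluster, as the abstract describes.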
Keywords/Search Tags: systems biology, gene expression profiling, hierarchical clustering, dynamic tree cut, feature selection