
Research On Feature Selection Methods Based On Neighborhood Rough Sets For Tumor Gene Expression Data

Posted on: 2020-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Zhang
GTID: 2404330578967721
Subject: Computer Science and Technology
Abstract:
Rough set theory is an effective mathematical tool for characterizing the fuzziness, uncertainty and incompleteness of knowledge. However, classical rough set theory relies on a strict equivalence relation, which limits its use on gene expression datasets that are high-dimensional, small-sample and continuous-valued. As an extended rough set model, neighborhood rough set theory is widely applied in artificial intelligence, data mining, pattern recognition and related fields. For mixed tumor gene expression datasets, this thesis studies several uncertainty measures based on the neighborhood relation in neighborhood decision systems. By combining them with dimensionality reduction techniques from machine learning, it proposes several feature selection algorithms for tumor gene expression data within neighborhood rough set models. These feature selection algorithms are used to classify tumor genes, and their effectiveness is verified through theoretical analysis and experiments. The main work of this thesis is summarized as follows:

(1) Traditional rough sets process continuous data by discretization, which loses important information and reduces classification accuracy. The neighborhood rough set model can granulate continuous tumor gene data through the neighborhood relation and thereby retain the classification information in the data. This thesis therefore proposes a gene selection method based on Fisher linear discriminant and the neighborhood dependency degree. First, the Fisher linear discriminant method is used for preliminary dimensionality reduction of the gene expression datasets, yielding candidate gene subsets. Second, in the neighborhood decision system, the neighborhood roughness is defined from the neighborhood precision and introduced into the dependency degree to construct the neighborhood dependency degree, which measures the knowledge roughness of the neighborhood decision system. Based on the internal and external importance of attributes, a feature selection method based on the neighborhood dependency degree is constructed for the neighborhood decision system. Finally, a gene selection algorithm based on Fisher linear discriminant and the neighborhood dependency degree is proposed to eliminate redundant genes from the candidate gene subset and obtain the optimal gene subset. Experimental results on four standard tumor gene datasets show that the algorithm can effectively select the optimal tumor gene subset and obtain higher classification accuracy.

(2) Heuristic attribute reduction methods that rely on the monotonicity of the evaluation function have certain defects in neighborhood rough sets and cannot obtain better attribute reduction results. To address this problem, neighborhood entropy-based uncertainty measures are first studied in the neighborhood decision system. The neighborhood credibility degree and the neighborhood coverage degree are defined and introduced into the decision neighborhood entropy and the neighborhood mutual information, so that these measures fully reflect the decision ability of the attributes in the neighborhood decision system; their properties and the relationships between these uncertainty measures are also derived. Then, the decision neighborhood entropy and the neighborhood mutual information are proved to be nonmonotonic through theoretical analysis, and a feature selection method using neighborhood entropy-based uncertainty measures is designed. Finally, a heuristic nonmonotonic feature selection algorithm for neighborhood decision systems is proposed and combined with the Fisher score to reduce the dimensionality and improve the classification performance of gene expression datasets. Ten-fold cross-validation experiments on ten public tumor gene datasets show that the algorithm not only reduces the dimensionality of tumor gene datasets effectively but also obtains better classification accuracy.

(3) Traditional knowledge reduction methods based on rough set theory mostly study the influence of attributes on the classification subsets in the domain from either the algebra view or the information view alone, and therefore do not provide a more comprehensive measurement mechanism. To solve this problem, a tumor gene selection method based on the neighborhood approximate decision entropy is proposed. First, exploiting the strong complementarity between the algebraic and the information-theoretic definitions of attributes, the neighborhood approximation accuracy is combined with the neighborhood entropy, and a new average neighborhood entropy is defined. Then, the neighborhood approximate decision entropy is constructed to handle the uncertainty and noise of the neighborhood decision system and to fully reflect the decision-making ability of attributes. Finally, a tumor gene selection algorithm based on the neighborhood approximate decision entropy is proposed for the neighborhood decision system to improve the classification performance on complex datasets. Simulation results on seven open tumor gene datasets show that this method can effectively select attribute subsets with high classification performance.
Keywords: Rough set theory, Neighborhood rough sets, Feature selection, Uncertainty measure, Tumor gene expression data
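The two-stage idea of method (1) in the abstract, a Fisher-based pre-filter followed by greedy selection driven by a neighborhood dependency measure, can be illustrated with a minimal sketch. It is not the thesis's algorithm. It assumes a per-gene Fisher score in place of the Fisher linear discriminant step, and it uses the classical neighborhood dependency degree (the share of samples whose δ-neighborhood contains a single class) rather than the roughness-adjusted measure the thesis defines. The function names, the `delta` and `top_k` parameters and the toy data are all hypothetical.

```python
from math import dist, inf


def fisher_score(X, y, j):
    """Between-class over within-class scatter of feature j (assumed pre-filter)."""
    values = [row[j] for row in X]
    overall = sum(values) / len(values)
    between = within = 0.0
    for label in set(y):
        group = [v for v, t in zip(values, y) if t == label]
        mean = sum(group) / len(group)
        between += len(group) * (mean - overall) ** 2
        within += sum((v - mean) ** 2 for v in group)
    if within == 0.0:
        return inf if between > 0.0 else 0.0
    return between / within


def fisher_rank(X, y):
    """Feature indices sorted by decreasing Fisher score."""
    return sorted(range(len(X[0])), key=lambda j: fisher_score(X, y, j), reverse=True)


def neighborhood_dependency(X, y, features, delta):
    """Classical dependency: fraction of samples whose delta-neighborhood
    (Euclidean distance on the chosen features) holds only one class."""
    if not features:
        return 0.0
    proj = [[row[j] for j in features] for row in X]
    consistent = 0
    for i, p in enumerate(proj):
        if all(y[k] == y[i] for k, q in enumerate(proj) if dist(p, q) <= delta):
            consistent += 1
    return consistent / len(X)


def select_features(X, y, delta, top_k=None):
    """Greedy forward selection over the top_k Fisher-ranked candidates.
    Stops as soon as no candidate raises the dependency."""
    candidates = fisher_rank(X, y)[:top_k]
    selected, current = [], 0.0
    while candidates:
        gains = {j: neighborhood_dependency(X, y, selected + [j], delta)
                 for j in candidates}
        best = max(candidates, key=gains.get)
        if gains[best] <= current:
            break
        selected.append(best)
        candidates.remove(best)
        current = gains[best]
    return selected


# Toy data (hypothetical): 6 samples, 3 "genes", 2 classes.
# Only gene 0 separates the classes.
X = [
    [0.10, 0.50, 0.9],
    [0.15, 0.90, 0.1],
    [0.20, 0.10, 0.5],
    [0.80, 0.55, 0.2],
    [0.85, 0.15, 0.8],
    [0.90, 0.95, 0.4],
]
y = [0, 0, 0, 1, 1, 1]

selected = select_features(X, y, delta=0.2, top_k=2)
print(selected)  # → [0]
```

No discretization is involved: the neighborhood radius `delta` plays the role that equivalence classes play in classical rough sets, which is why continuous expression values can be used directly. In this sketch the stopping rule relies on the dependency not decreasing as features are added; the nonmonotonic measures of method (2) would need a different stopping criterion.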