
Feature Screening For Ultrahigh-dimensional Categorical Data

Posted on: 2021-03-08    Degree: Master    Type: Thesis
Country: China    Candidate: S H Yin    Full Text: PDF
GTID: 2370330611997970    Subject: Probability theory and mathematical statistics
Abstract/Summary:
The remarkable improvement in technology and computing power has overcome many challenges in areas such as genetics, industry, and finance that were previously hard to solve because traditional methods were inapplicable and technology was limited. Modern technology allows ultrahigh-dimensional data to be collected at relatively low cost in many fields of scientific research. Note that ultrahigh-dimensional data differs from both low-dimensional and high-dimensional data. Mathematically, low-dimensional data means that the dimension of the predictors is smaller than the sample size (p < n). High-dimensional data refers to a dimension growing with the sample size at a polynomial rate (p = O(n^α), α > 0). Ultrahigh-dimensional data allows the dimension to grow exponentially with the sample size (log(p) = O(n^ξ), ξ > 0). Ultrahigh-dimensional applications such as single nucleotide polymorphism (SNP) studies, DNA microarrays, stock trading, and climate change have become frequent in daily life. A typical example is genomics. In the gene selection problem, the variables are gene expression coefficients corresponding to the abundance of mRNA in a sample (e.g. a tissue biopsy) for a number of patients. A typical classification task is to separate healthy patients from cancer patients based on their gene expression "profile". In general, fewer than 100 patient samples are available for training and testing, but the number of predictors in the raw data is often greater than 10,000. This type of data has two distinct characteristics. First, the dimension of the predictors is far larger than the sample size. Second, because of sparsity, only a few variables, the active predictors, play an important role in practical problems such as classification, regression, and clustering. These characteristics are consistent with what we call ultrahigh-dimensional data. However, it is impossible to include all predictors because of the "curse of dimensionality". That is, the computation cost
increases exponentially with the dimensionality. Traditional variable selection methods involve an NP-hard combinatorial optimization problem. Discriminant analysis becomes as bad as random guessing, and some high-dimensional penalty methods suffer from high computation cost, inaccurate statistical results, and algorithmic instability. Therefore, feature screening for ultrahigh-dimensional data has become a hot topic in recent years. The core idea of feature screening is to exclude the features that are clearly unrelated to the response, thereby reducing the dimension. Most importantly, feature screening methods satisfy the sure screening property, which means that all active predictors are selected with probability tending to 1 as the sample size goes to infinity. After feature screening, we obtain the active predictors, reduce the model dimension to a moderate scale, cut the computation cost, and increase the interpretability of the model. There are several typical feature screening methods, such as Sure Independence Screening (SIS), Sure Independent Ranking Screening (SIRS), Kolmogorov Filtering (KF), Pairwise Sure Independence Screening (PSIS), sure independence screening based on distance correlation (DC-SIS), and sure independence screening based on the empirical conditional distribution function (MV-SIS). However, each has its shortcomings. SIS is only available for linear regression models. SIRS is inapplicable to a nominal response. DC-SIS requires strict conditions to obtain the sure screening property. KF applies only to binary classification. PSIS is based on expectations, so it performs poorly for heavy-tailed data. MV-SIS allows a diverging number of classes and is robust to heavy-tailed distributions of the predictors, but it loses screening efficiency in multi-class classification with a small sample size. We propose an extended screening and ranking index based on MV, denoted by eMV, and construct an extended screening procedure (eMV-SIS) to reduce ultrahigh-dimensional categorical data, whose response is categorical and whose covariates are
continuous. eMV differs from MV by using a higher power, ∫[F(x|Y) − F(x)]^(2α) dF(x) with α ∈ ℕ+, α < ∞, instead of the square ∫[F(x|Y) − F(x)]^2 dF(x), of the difference between the conditional distribution function of x given Y and the unconditional distribution function of x, to measure the dependence between the response and the predictors; its sample estimator is also easily available. When α = 1, eMV reduces to MV. The proposed approach possesses the sure screening property under three specific conditions. We evaluate the performance of the proposed method through Monte Carlo simulations with finite samples. Given a statistical model with known active features, eMV-SIS is applied to screen out the active predictors from the raw data. Three criteria score the simulation performance of eMV-SIS. We compare the proposal with other representative feature screening methods, especially MV-SIS. The results show that eMV-SIS has superior performance in both binary and multi-class linear discriminant analysis problems. Meanwhile, we explore the range of the power α
in eMV to avoid overly small computed values degrading the screening accuracy. Given a proper value of α, we also design three complicated multiple-index models and additionally record the computation time. The results indicate that eMV-SIS performs best in accuracy but is slightly more expensive than MV-SIS in computation. We further study eMV-SIS by analyzing two ultrahigh-dimensional categorical cancer datasets. After choosing appropriate values of α and randomly splitting the data into training and testing sets, we construct different feature screening procedures and compare them with MV-SIS. The first stage reduces the dimensionality of the training data using feature screening methods. The second stage uses the reduced training data to fit classification models such as penalized logistic regression, support vector machines (SVM), and sparse linear discriminant analysis (SDA). We then apply the correspondingly reduced testing data to the fitted models and assess their performance by the average training and testing error under different estimated model scales. Results from numerical studies and real data analysis show that the extended screening procedure improves on the original one. eMV-SIS retains the merits of MV-SIS. First, it does not require any assumption about a specific model or its parameters. Second, it is robust to heavy-tailed distributions and potential outliers of the predictors. Third, it allows the categorical response to have a diverging number of classes of order O(n^κ) for some κ > 0. eMV-SIS has the additional advantages that its computational cost is acceptable and that it can screen out the true signals efficiently in multi-class problems with an insufficient sample size.
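To make the index concrete, the following is a minimal numerical sketch of the sample eMV statistic and the resulting ranking step, using the plain empirical-CDF plug-in estimator suggested by the definition above. The function names, the default cutoff d = ⌈n/log n⌉, and other details are illustrative assumptions, not the thesis's exact implementation; with alpha = 1 the statistic reduces to the MV index.

```python
import numpy as np

def emv(x, y, alpha=1):
    """Sample eMV between a continuous predictor x and a categorical
    response y: sum over classes r of p_r * mean_j [F_r(x_j) - F(x_j)]^(2*alpha),
    where F_r and F are the conditional and unconditional empirical CDFs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    n = len(x)
    # Unconditional empirical CDF evaluated at each observation.
    F = np.searchsorted(np.sort(x), x, side="right") / n
    stat = 0.0
    for r in np.unique(y):
        mask = (y == r)
        p_r = mask.mean()
        # Conditional empirical CDF F(x | Y = r) evaluated at each observation.
        Fr = np.searchsorted(np.sort(x[mask]), x, side="right") / mask.sum()
        stat += p_r * np.mean((Fr - F) ** (2 * alpha))
    return stat

def screen(X, y, d=None, alpha=1):
    """Rank predictors by eMV and keep the top d (default: ceil(n / log n))."""
    n, p = X.shape
    if d is None:
        d = int(np.ceil(n / np.log(n)))
    scores = np.array([emv(X[:, j], y, alpha) for j in range(p)])
    return np.argsort(scores)[::-1][:d], scores
```

As a quick sanity check, a predictor whose distribution shifts across classes receives a much larger eMV score than pure-noise predictors, so it survives the screening step.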
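The two-stage screen-then-classify protocol used in the real-data study can be sketched as follows. This is a hedged, self-contained illustration of the structure only: a simple standardized mean-difference score stands in for the eMV utility, and a nearest-centroid rule stands in for the penalized logistic regression, SVM, and SDA classifiers named above. Screening is done on the training split alone, and the identical feature reduction is then applied to the test split.

```python
import numpy as np

def two_stage(X_tr, y_tr, X_te, y_te, d):
    """Stage 1: rank predictors on the training data by a marginal utility
    (a stand-in for eMV here) and keep the top d.
    Stage 2: fit a classifier (nearest centroid here) on the reduced
    training data and report the test misclassification rate."""
    # Stage 1: standardized mean difference as the marginal utility.
    mu0 = X_tr[y_tr == 0].mean(axis=0)
    mu1 = X_tr[y_tr == 1].mean(axis=0)
    utility = np.abs(mu1 - mu0) / (X_tr.std(axis=0) + 1e-12)
    keep = np.argsort(utility)[::-1][:d]
    # Stage 2: nearest-centroid classifier on the retained features.
    c0 = X_tr[y_tr == 0][:, keep].mean(axis=0)
    c1 = X_tr[y_tr == 1][:, keep].mean(axis=0)
    Z = X_te[:, keep]
    pred = (np.linalg.norm(Z - c1, axis=1) <
            np.linalg.norm(Z - c0, axis=1)).astype(int)
    return keep, float((pred != y_te).mean())
```

The key design point mirrored from the text is that the test data never influence the screening stage, so the reported test error honestly reflects both the screening and the fitted classifier.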
Keywords/Search Tags: ultrahigh-dimensional categorical data, feature screening, sure screening property, linear discriminant analysis