
Statistical Inference Based On Ultra-high Dimensional, Highly Censored And Unstructured Data

Posted on: 2024-03-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Cheng
Full Text: PDF
GTID: 1520307310971639
Subject: Applied Statistics
Abstract/Summary:
The twenty-first century is the era of data science. Beyond its characteristic volume, velocity and variety, big data also poses the challenge of complex structure. With the rise of artificial intelligence technology, the application of machine learning methods to analyze and model complex data has drawn significant attention both domestically and internationally. In this dissertation, we use machine learning methods to study the high-dimensional, highly censored and unstructured problems of complex data. The main research contents are as follows.

First, we propose a model-free feature screening method for ultra-high dimensional classification data. The method does not require a classification model to be specified in advance and reduces the dimensionality from ultra-high to relatively high. Under fairly mild assumptions, we show that the proposed screening method has the sure independence screening property as well as ranking consistency. The effectiveness and superiority of the proposed approach are verified through extensive simulations and experiments with real data. Furthermore, we modify the screening statistics so that they can be used for feature screening of ultra-high dimensional complete data and survival data.

Second, we perform variable selection on the screened high-dimensional survival data. Because sure screening only guarantees that all important variables are retained with probability tending to one, the retained features may still contain redundant information. This dissertation therefore evaluates the importance of each feature using the importance score index of the survival tree and screens out redundant features that performed poorly in the previous layer. The representation ability of the model is steadily enhanced through the layer-by-layer learning mechanism of deep survival forests. The proposed deep survival forest shows strong prediction performance on a broad range of real-life datasets.

In addition, survival data often have a high censoring rate, which leaves little valid sample information available for modeling. We therefore fully exploit the covariate information of the censored samples when modeling highly censored survival data, using semi-supervised learning and data transduction techniques. In a cascade deep forest framework, we treat each decision tree as a tool for partitioning the sample feature space, and the survival time of censored samples in a leaf node is obtained by transduction from the survival times of the uncensored samples in that node. This cascade learning framework effectively captures the covariate information of highly censored samples and significantly improves the prediction accuracy of the model.

Finally, as digitalization continues to accelerate, unstructured data (text, images, sound, and video) have continued to grow at an exponential rate. As a non-parametric deep model, the recurrent neural network is a powerful tool for modeling unstructured time-series data. In this dissertation, we theoretically investigate the generalization ability of recurrent neural networks based on the Rademacher complexity and a covering number theorem for the parameter matrix. Compared with existing work, our analysis does not assume an upper bound on the activation function of the hidden layer, and it yields a tighter generalization bound for recurrent neural networks under some assumptions. In addition, based on empirical risk minimization theory, we establish an estimation error bound for recurrent neural networks. These theoretical results offer guidance for applying such models to unstructured time-series data.
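The leaf-level transduction step described above for highly censored data can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `transduce_leaf_times`, the use of the leaf median as the transduced value, and the rule that an imputed time is never below the censoring time are all assumptions made for illustration. Leaf assignments are taken as given, as if produced by one already fitted tree.

```python
from collections import defaultdict
from statistics import median


def transduce_leaf_times(leaf_ids, times, events):
    """Impute survival times for censored samples from uncensored leaf-mates.

    leaf_ids: leaf index of each sample in one (hypothetical) fitted tree
    times:    observed time (event time, or censoring time if censored)
    events:   1 if the event was observed, 0 if the sample is censored
    """
    # Collect the observed event times that fall in each leaf.
    event_times = defaultdict(list)
    for leaf, t, e in zip(leaf_ids, times, events):
        if e == 1:
            event_times[leaf].append(t)

    imputed = []
    for leaf, t, e in zip(leaf_ids, times, events):
        if e == 1 or not event_times[leaf]:
            # Uncensored sample, or no uncensored leaf-mate: keep observed time.
            imputed.append(t)
        else:
            # Assumption: use the leaf median, never below the censoring time.
            imputed.append(max(t, median(event_times[leaf])))
    return imputed


# Toy data: six samples spread over three leaves.
leaf_ids = [0, 0, 0, 1, 1, 2]
times = [5.0, 9.0, 3.0, 4.0, 6.0, 7.0]
events = [1, 1, 0, 0, 1, 0]
print(transduce_leaf_times(leaf_ids, times, events))
```

In a cascade framework, the imputed times from one layer would presumably serve as pseudo-labels for the next layer. That wiring is left out here because the abstract does not specify it.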
Keywords/Search Tags:Ultra-high dimensionality, High censoring, Unstructured data, Deep survival forests, Recurrent neural networks