Several Researches On Modern Statistical Methods And Their Applications In Data Analysis

Posted on: 2022-06-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: A Shan
GTID: 1480306608972489
Subject: Probability theory and mathematical statistics
Abstract/Summary:
The world today is in the stage of the fourth industrial revolution, the era of intelligence, and "Big Data" is one of the defining characteristics of this era. Its huge size, complex structure, variety of types and strong timeliness bring both opportunities and challenges to statistical research. This dissertation discusses several issues in the fields of statistical algorithms, time series analysis and survival analysis. We focus on the following topics: Bayesian inference for the finite mixture regression model based on a non-iterative algorithm; efficient approximation of statistical significance in local trend analysis of dependent time series; and statistical inference on the median residual life model with censored length-biased data.

Finite mixture regression (FMR) models are powerful statistical tools for exploring the relationship between a response variable and a set of explanatory variables from several latent homogeneous groups. The aim of FMR is to identify the group to which an observation belongs, and to reveal the dependence between the response and the predictor variables within the same group after classification. The finite mixture regression model with normal errors (FMNR) is the earliest and most widely used mixture regression model in practice. In recent years, many authors have extended the finite mixture normal regression model to mixture regression models based on other error distributions. The classical methods for these mixture models are mainly based on Gibbs sampling for Bayesian analysis and on the EM algorithm for finding the maximum likelihood estimator (MLE) from the frequentist perspective. The crucial technique in these methods is to employ a group of latent variables that indicate the group to which an observation belongs, and thereby formulate a missing data structure.

Although the EM algorithm and Markov chain Monte Carlo (MCMC) algorithms are widely used for mixture models, they still have some weak points that should not be overlooked. For the EM algorithm, the standard error of an estimated parameter is conventionally calculated as the square root of the corresponding diagonal element of the asymptotic covariance matrix, motivated by the central limit theorem, but when the sample size is small or even moderate this approximation may be unreasonable. For Gibbs sampling and other MCMC algorithms, the samples used for statistical inference are generated iteratively, so the accuracy of parameter estimation may decrease because of the dependence among samples. Besides, although there are several tests and methods for studying the convergence of the generated Markov chain, no procedure can check convincingly whether convergence has been reached when the iteration terminates. It is therefore worthwhile to develop algorithms that are more effective and computationally feasible for complex missing data problems.

In Chapter 2, we propose an effective Bayesian inference procedure for the finite mixture regression model from a non-iterative perspective. We first introduce a group of latent multinomially distributed mixture component variables to formulate a missing data structure, and then combine the EM algorithm, the inverse Bayes formula (IBF), and the sampling/importance resampling (SIR) algorithm into a non-iterative sampling algorithm. Finally, we use IBF sampling to generate i.i.d. samples from the posterior distributions and use these samples directly to estimate the parameters. We conducted simulation studies to evaluate the performance of the algorithm in comparison with the EM algorithm and Gibbs sampling. The results show that the IBF algorithm estimates parameters more accurately than the EM algorithm and Gibbs sampling, and that it runs much faster than Gibbs sampling. The algorithm is then applied to the classical tone perception data set, where the analysis supports the practicability and advantages of the IBF algorithm, and the choice of algorithm is discussed. IBF sampling can be directly extended to other mixture regression models, such as those driven by Student-t or Laplace error distributions.

Time series data are an important resource for exploring the dynamic changes of biological systems. Owing to the rapid development of molecular biology technology and the significant reduction of sequencing costs, a large amount of biological time series data has been generated in molecular biological research over the past decade, and identifying patterns of association between biological factors can further deepen the understanding of the functions of biological systems and the interactions between them. Among the statistical methods used for time series, local similarity analysis (LSA) has been used extensively to identify correlations between factors. The similarity measure in local similarity analysis is the difference between two time series, which can be sequences of, for example, gene expression levels or OTU abundances. As suggested by Ji and Tan (2004), the degree of similarity shown by rising, unchanged, or falling trends in time series data can be taken as another indicator of the correlation among biological factors, which is known as local trend analysis (LTA). In LTA, local similarity analysis is performed on the transformed trend sequence, and the corresponding similarity measure is referred to as the local trend score. LTA has been widely adopted in many biological fields; nevertheless, evaluating the statistical significance of local trend analysis through a permutation test is time-consuming. By extending the statistical significance evaluation method of local similarity analysis to local trend analysis, Xia et al. (2015) developed a statistical significance evaluation method for local trend analysis. However, this method is effective only when the original sequences are independent and identically distributed.

In Chapter 3, building on this and prior studies, we improve the approximation method proposed by Xia et al. and propose a general method of statistical significance evaluation for local trend analysis, called Stationary Theoretical Local Trend Analysis (STLTA). First, the original sequence is discretized into a trend sequence and the local trend score is calculated. Then, according to the spectral decomposition theory of matrices, the variance of the trend sequence is estimated for different state spaces. Finally, in combination with the limit theory of Markov chain local similarity analysis, the limit distribution of the local trend score is obtained, and the approximate p-value of the local trend score is calculated. We demonstrate the effectiveness of the new method through extensive simulations and real-data analyses on the "MPHM" and "PML" data sets.

In biomedical studies, the residual life defined at time t has found widespread application in various life tests. In current research, the commonly used measures of residual life are the mean residual life and the median residual life. The mean residual life has been widely used in survival analysis; however, when the distribution of the target variable is highly skewed or heavy-tailed, the mean residual life may not be obtainable. In this case, a long-term survivor will have a significant impact on the mean, because the mean is very sensitive to outliers, so the mean residual life may not exist in some cases. The median residual life model is more flexible and robust than the mean residual life model. This shows in two respects. First, compared with the mean, the median is relatively insensitive to outliers. Second, several different survival functions can correspond to the same median residual life, that is, the relationship between them is many-to-one, whereas the correspondence between mean residual life models and survival functions is one-to-one, which shows that the median residual life model is less restrictive than the mean residual life model. Therefore, statistical inference based on the median residual life model will be more robust. In addition, by analyzing the quantiles of the residual life more extensively, we can provide a more complete way of inferring the probability distribution of the residual life.

Inspired by existing research on quantile and median residual life, in Chapter 4 we propose an estimation method for the median residual life model with censored length-biased data, and obtain point and interval estimates of the median residual life. It is proved that the statistic $-2\log R(m_0(t))$ proposed in this dissertation converges weakly to $\chi^2_1$ when the censoring variable C satisfies $P(C \ge \tau) > 0$. The proposed method is also compared with the bootstrap method and the asymptotic normal interval estimation method through simulation, and recommendations are given based on the comparison results. Finally, an analysis of the classical Channing House data set shows the effectiveness of the new method.
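
As a minimal illustration of why the median residual life is the more robust summary, the sketch below computes the empirical mean and median residual life at time t from fully observed lifetimes. It deliberately ignores censoring and length bias, so it is not the estimator proposed in Chapter 4. The function names `residual_lifetimes`, `mean_residual_life` and `median_residual_life`, and the toy data, are assumptions made for illustration only.

```python
import numpy as np


def residual_lifetimes(lifetimes, t):
    """Return T - t for every subject still alive at time t (T > t)."""
    lifetimes = np.asarray(lifetimes, dtype=float)
    survivors = lifetimes[lifetimes > t]
    if survivors.size == 0:
        raise ValueError("no subject survives beyond time t")
    return survivors - t


def mean_residual_life(lifetimes, t):
    """Empirical mean of T - t given T > t (assumes uncensored data)."""
    return float(np.mean(residual_lifetimes(lifetimes, t)))


def median_residual_life(lifetimes, t):
    """Empirical median of T - t given T > t (assumes uncensored data)."""
    return float(np.median(residual_lifetimes(lifetimes, t)))


# Hypothetical toy lifetimes; the last value plays the long-term survivor.
lifetimes = [2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
t = 1.0
print(mean_residual_life(lifetimes, t))    # pulled upward by the outlier
print(median_residual_life(lifetimes, t))  # barely affected by the outlier
```

On this toy sample the single long-term survivor dominates the mean, while the median stays close to the bulk of the residual lifetimes, which matches the robustness argument made above.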
Keywords/Search Tags: Finite Mixtures Regression, Non-iterative Sampling, Inverse Bayesian Formula, Local Trend Analysis, Dependent Time Series, Statistical Significance, Median Residual Life, Length-biased Data, Empirical Likelihood