| In recent years, with the continuous development of big data technology, methods of data acquisition and storage have multiplied, and data types have become complex and diverse. When the response is multivariate correlated data, that is, the response is a vector whose components are correlated with one another, directly applying the traditional linear model or generalized linear model is often not feasible. It is therefore necessary to introduce a generalized linear model for multivariate correlated response vectors, the vector generalized linear model. As the application scenarios of the vector generalized linear model expand, high-dimensional data increasingly arise in practice, which makes traditional statistical methods and theory harder to apply and makes variable selection difficult. Mainstream variable selection methods have been studied extensively for the linear model and the generalized linear model, forming a relatively complete theoretical system, but theoretical guidance for the vector generalized linear model is still lacking, so it is necessary to study variable selection in this setting. This paper studies the variable selection problem for the vector generalized linear model. To handle strongly correlated covariates, a random search method based on the Gibbs sampler is proposed within the vector generalized linear model framework. The main research work is as follows: (1) A random search algorithm is established by combining the Gibbs sampler with the BIC model selection criterion. With a large number of explanatory variables, the number of candidate models is very large. All candidate models are collected into a set, and a one-to-one mapping of the candidate models is established to facilitate Gibbs sampling. The conditional transition probabilities used in sampling are constructed from the BIC criterion. A stationary Markov chain is obtained by Gibbs sampling, and the optimal model is
selected according to the resulting Markov chain of candidate models. Finally, numerical simulation is used to verify the feasibility of the proposed algorithm. Compared with the traditional full subset search, this method improves the efficiency of variable selection and avoids a prohibitive computational burden. Compared with coefficient shrinkage methods, it is better suited to the vector generalized linear model framework. (2) The random search method is applied in numerical simulations to vector generalized linear models with different data structures. First, the accuracy of the algorithm is analyzed on simulated data. Continuous data and discrete data with a binary response vector structure are randomly generated by computer, with strongly correlated covariates. The corresponding vector generalized linear models are then fitted and combined with the random search method to generate a Markov chain of candidate models. The marginal probability of each variable in the Markov chain is computed, an appropriate threshold is set, and the independent variables that enter the model are selected. Finally, the parameters of the optimal model are estimated and compared with those from the classical coefficient shrinkage method, the Lasso. (3) To further illustrate that the algorithm is more feasible than the full subset search when the number of independent variables is large, the algorithm is applied to a set of real data. First, cross-sectional data on the New Zealand population are analyzed, with correlated binary data on hypertension and heart disease selected as the response variables. Thirteen independent variables in the dataset that may be related to the response variables, such as age, height and weight, are included as candidate variables. A vector generalized linear model is established, and the Markov chain of candidate models is generated
by combining it with the established random search method. The marginal probability of each variable in the Markov chain is computed, an appropriate threshold is set, and the independent variables that enter the model are selected. Then the optimal model for the same 13 variables is obtained by the full subset search and compared with the model chosen by the random search method. Finally, 37 normally distributed independent variables unrelated to the response vector are added to the data as noise variables. The vector generalized linear model is rebuilt for the augmented data, and the random search method is applied again. The results show that the full subset search becomes infeasible with this many independent variables, whereas the random search method remains feasible. From the above analysis, the following main conclusions are reached: (1) The simulation experiments show that, for the vector generalized linear model, the model selected by the random search method based on Gibbs sampling is closer to the true model than the one selected by the traditional Lasso method, and the resulting parameter estimates are more accurate. (2) The empirical analysis shows that, for the vector generalized linear model, the optimal model obtained by the random search method based on Gibbs sampling coincides with the optimal model obtained by the traditional full subset search, and that the random search method is more practical than the full subset search when the number of independent variables is large. |
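
The search procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It makes several assumptions: a univariate Gaussian linear model stands in for the vector generalized linear model, the conditional inclusion probability is taken proportional to exp(-BIC/2) (the abstract says only that it is built from BIC), and the 0.5 threshold is illustrative. The function names `bic_gaussian`, `gibbs_bic_search` and `marginal_inclusion` are hypothetical.

```python
import numpy as np


def bic_gaussian(X, y, gamma):
    """BIC of a least-squares fit using the columns flagged by gamma.

    Assumption: a Gaussian linear model replaces the vector GLM here;
    for a VGLM this would be the BIC of the fitted vector model.
    """
    n = len(y)
    cols = np.flatnonzero(gamma)
    design = np.column_stack([np.ones(n), X[:, cols]])  # intercept always kept
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)


def gibbs_bic_search(X, y, n_iter=300, burn_in=100, seed=0):
    """Gibbs sampling over inclusion indicators gamma in {0, 1}^p."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    gamma = np.zeros(p, dtype=int)  # each candidate model is one 0/1 vector
    chain = []
    for it in range(n_iter):
        for j in range(p):
            gamma[j] = 1
            bic_in = bic_gaussian(X, y, gamma)
            gamma[j] = 0
            bic_out = bic_gaussian(X, y, gamma)
            # Assumed form: P(gamma_j = 1 | rest) proportional to exp(-BIC / 2).
            delta = np.clip(0.5 * (bic_in - bic_out), -50.0, 50.0)
            prob_in = 1.0 / (1.0 + np.exp(delta))
            gamma[j] = int(rng.random() < prob_in)
        if it >= burn_in:
            chain.append(gamma.copy())
    return np.array(chain)


def marginal_inclusion(chain):
    """Share of retained draws in which each variable is in the model."""
    return chain.mean(axis=0)


# Synthetic data with strongly correlated covariates (pairwise correlation 0.7).
rng = np.random.default_rng(1)
n, p = 200, 8
shared = rng.normal(size=(n, 1))
X = np.sqrt(0.7) * shared + np.sqrt(0.3) * rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)  # true model: columns 0 and 3

chain = gibbs_bic_search(X, y)
inclusion = marginal_inclusion(chain)
selected = np.flatnonzero(inclusion > 0.5)  # illustrative threshold, not from the paper
print(np.round(inclusion, 2), selected)
```

Each sweep fits only two models per variable, so the cost grows linearly with the number of candidate variables per iteration. A full subset search must instead fit every subset: 2^13 = 8192 models for the 13 original variables, but 2^50 (about 1.1 × 10^15) once the 37 noise variables are added.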