| Accurate prediction of corporate credit risk (CCR) is of great significance to building a healthy credit economy in developed markets. With the continuous development of the economy in recent years, companies face more severe competition and difficulties in the process of transformation and upgrading. This process gives rise to many financial risks, which make CCR problems increasingly prominent. Evaluating and predicting CCR comprehensively and accurately is the core task of risk prevention for many companies, and it is also a necessary requirement for continuously improving their management. However, with the deepening of informatization, networking, digitization and intelligentization in economic and social development, great changes have taken place in the sources and types of CCR datasets, and the datasets used for predicting CCR are now characterized by high dimension and low sample size (HDLSS). Current methods for predicting CCR are suited only to low-dimensional, large-sample datasets, which makes the original prediction algorithms inadequate or even unusable. Therefore, this dissertation first analyzes the limitations of the current methods, builds a new framework of full indicators for prediction that takes external factors into consideration, then designs prediction algorithms based on deep neural networks (DNN) and ensemble machine learning methods, and finally proposes and validates corresponding solutions for CCR prediction on real HDLSS datasets. The main contents of this dissertation cover the following three aspects. First, a new framework of full indicators for CCR prediction is built. With the rapid development of science and technology in the era of big data and continuous innovation in network information technology, big data with its five
V’s (Volume, Velocity, Variety, Value and Veracity) has attracted widespread attention. More and more evidence indicates that many external factors may also have an important impact on CCR prediction. In the era of networked information, public-opinion data (for example, from Google or Facebook) contain a great deal of information about the public’s perception and evaluation of CCR. These timely data can reflect CCR from a behavioral perspective. In addition, the strong contagion of CCR along supply chains adversely affects the actual operations of the target companies. Public-opinion and supply chain indicators can therefore be taken into account when building a new framework of full indicators for predicting CCR. Accordingly, this dissertation combines the external and internal factors affecting a company’s credit risk, integrates network public-opinion indicators and supply chain indicators, and constructs a new framework of full indicators for CCR prediction. The framework is evaluated on the 441 financial data provided by Standard & Poor’s Rating Company, network public-opinion data provided by Google and Facebook, and supply chain data from the Bloomberg Database. The evaluation compares the prediction ability and prediction efficiency of 12 different combinations of feature indicators. The new framework of full indicators is found to have the best prediction ability, and deleting any indicator variable reduces the prediction ability of the dataset. This shows that the external indicators included in this dissertation can further improve CCR prediction accuracy. The new framework makes full use of the timely warnings provided by network public opinion, which also helps to prevent the harm caused by supply chain risk spillover and to avoid the occurrence of systemic CCR. Second, the
research develops an Ensembled-Dynamic Weights Restricted-DNN (EDR-DNN) prediction model for HDLSS CCR datasets. Once new information sources are added and the CCR dataset takes on an HDLSS structure, how to select, from a wide variety of algorithms, one suited to prediction on HDLSS datasets has attracted much attention. A systematic literature review indicates that research on HDLSS and CCR suffers from the following limitations: (1) studies of datasets whose sample size is 5 to 10 times the number of dimensions are rare; (2) existing research addresses either high dimensionality or small sample size, but not both simultaneously in a balanced way; and (3) the rationality of some techniques used in the algorithmic models, such as adding L1 regularization after batch normalization, is often called into question. In view of these limitations, this dissertation distills ideas for solving HDLSS problems in CCR prediction from existing research and constructs an algorithm model suited to HDLSS datasets in CCR prediction, namely the EDR-DNN model. It is proved theoretically for the first time that L1 regularization alone fails in deep neural networks, especially when combined with a batch normalization layer. The EDR-DNN model has the following traits: (1) it constrains L1 regularization by adding an L2 constraint on top of the single L1 term; and (2) it not only runs an NN model in parallel with an SVM model, but also integrates an LR model on a fully connected layer in order to adjust the dynamic weights of the parameters between the models. This integration approach helps to introduce transfer learning methods as a whole and to further optimize the dynamic weights of the parameters between the models, so as to handle effectively datasets that are both high-dimensional and
small in sample size. Moreover, this model yields an AUC (area under the curve) of 80.12%, which is higher than that obtained with combinations of feature indicators, traditional machine learning algorithms and ensemble algorithms, such as NN and SVM. Compared with the hybrid ensemble algorithm that has the highest prediction accuracy, the prediction accuracy of this model is also nearly 5% higher. The comparative analysis verifies that the EDR-DNN model constructed in this dissertation predicts better than the other models. Third, the research develops a time decay long short-term memory (TD-LSTM) imputation model for CCR prediction on HDLSS datasets. A review of research on missing-value imputation finds that previous studies place the time function immediately before the selected algorithm model, which cannot effectively detect the state before and after missing-value imputation. Although the calculation of the time function itself provides useful information such as the location of missing values and the time intervals, it still lacks flexibility for global application in the model. Moreover, existing methods for imputing missing values in HDLSS and CCR data are outdated and poorly targeted at the missing values of irregular time series datasets, HDLSS datasets in particular. To address these shortcomings, this dissertation proposes the TD-LSTM model for imputation, which improves on previous work in the following four aspects: (1) on top of the original LSTM gates, a time attenuation gate and a refresh gate are added, which can better detect and process the missing-value state of irregular time series; (2) a time decay function is introduced to capture the relationship among the input variables, the hidden variables and the corresponding time intervals of missing values. Based on combinations of different weights and changes in the time variables, its
assignment can take more forms, which increases the flexibility of the time decay function's expression; (3) the SENet module is introduced to improve the generalization ability of the overall model in dealing with HDLSS problems; and (4) two algorithm models are stacked and an autoencoder is introduced to address the HDLSS problem effectively. In the empirical stage, this dissertation first compares the imputation performance of the TD-LSTM model with that of traditional imputation methods on artificial irregular time series datasets. The comparison shows that the TD-LSTM model achieves the best imputation performance. The TD-LSTM model and the EDR-DNN model are then applied jointly, and the prediction accuracy reaches 84.15%, which is higher than that obtained by approaches relying on combinations of feature indicators, by traditional machine learning algorithms, and by the EDR-DNN model alone. The imputation performance of the TD-LSTM model and the prediction ability of the EDR-DNN model are thus verified. This dissertation therefore provides a practical method for CCR prediction with missing values, and also offers a reference paradigm for reasonable missing-value repair and effective prediction on HDLSS datasets in further research. To sum up, research on CCR prediction for HDLSS datasets can offer new ideas and new approaches to prediction with massive data in the era of big data. The new framework of full indicators, which includes external factors for CCR prediction, and the accompanying paradigm for designing the predictive algorithm could provide a valuable reference for new approaches to economic risk prediction, especially CCR prediction. |
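
As an illustration of the first EDR-DNN trait described in the abstract (adding an L2 constraint on top of single L1 regularization), the following is a minimal Python sketch of a combined L1 + L2 penalty. The function name `l1_l2_penalty`, the coefficients `l1` and `l2`, and the example weights are hypothetical and are not taken from the dissertation; the actual EDR-DNN formulation may differ.

```python
from typing import Sequence


def l1_l2_penalty(weights: Sequence[float], l1: float = 0.01, l2: float = 0.01) -> float:
    """Return a combined L1 + L2 penalty over a flat sequence of weights.

    Assumption: the penalty takes the common elastic-net form
    l1 * sum(|w|) + l2 * sum(w^2). The abstract does not state the
    exact form used in EDR-DNN, so this is only an illustration.
    """
    l1_term = sum(abs(w) for w in weights)   # sparsity-inducing term
    l2_term = sum(w * w for w in weights)    # term that bounds weight magnitude
    return l1 * l1_term + l2 * l2_term


# Hypothetical weights, chosen only to exercise the function.
example_weights = [0.5, -1.0, 0.0, 2.0]
example_penalty = l1_l2_penalty(example_weights, l1=0.01, l2=0.01)
```

In a training loop, a value such as `example_penalty` would typically be added to the prediction loss, with `l1` and `l2` tuned as hyperparameters.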