
Research On The Balance Between Accuracy And Generalization In Data-driven Modeling

Posted on: 2021-02-03    Degree: Master    Type: Thesis
Country: China    Candidate: X Li    Full Text: PDF
GTID: 2370330611953443    Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Data-driven modeling can uncover hidden patterns behind data without requiring a priori knowledge, and it is a central topic in the era of big data. In contrast to earlier approaches that estimate parameters for a predetermined model structure, symbolic regression, one of the most important data-driven modeling methods, can simultaneously discover explicit mathematical expressions and compute their parameters, achieving a high degree of fitting accuracy. However, higher accuracy usually implies a more complex model; overfitting to the training samples reduces the ability to adapt to unknown samples, and generalization is difficult to guarantee. It is therefore worthwhile to study the balance between accuracy and generalization.

The first step analyzes the generalization ability of a model by asking how the complexity of a symbolic-regression model can be measured reasonably. According to Occam's razor, the simpler the model, the more likely it is to approach the laws implicit in the data, and the stronger its generalization ability. Building on existing definitions of model complexity, a mixture of the syntactic complexity and the semantic complexity of nonlinear models is used to evaluate a model. This gives a reasonable measure of complexity both in the model's structure and in its expressed behavior, and with this measure simpler expression structures and smoother response surfaces can be obtained. Experiments on benchmark datasets demonstrate that a reasonable evaluation of model complexity helps improve the generalization ability of the model.

Generalization is further enhanced by a method that evaluates the importance of the fitted data during model training. According to how differently each training point behaves during the fitting process, a dynamic weight is assigned to every data point and used as a reference to set an allowable error within a certain range for each sample. The model trained on the basis of these dynamic sample weights is thus a uniquely determined mathematical model with an uncertainty band at each data point: the more important the point, the smaller the band, i.e., the smaller the allowable error.

To balance accuracy and generalization, the final model is obtained with multi-objective optimization and ensemble learning, drawing on the structural risk minimization principle from machine learning. Structural risk minimization considers both empirical risk and confidence risk. In this study, a multi-objective optimization approach is adopted with the sum of errors representing the empirical risk and the model complexity approximating the confidence risk as the two optimization objectives. The solutions distributed on the Pareto front are optimized, and these mutually independent solutions are fused into a final model by ensemble learning.

Finally, the proposed method is applicable as a general framework to most symbolic regression algorithms. It is applied to a particle swarm optimization algorithm for solving symbolic regression problems (PSSR), and comparative experiments on mainstream datasets validate the effectiveness of the proposed method.
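The abstract does not give the exact formulas for the two complexity terms, but the idea of mixing a structural measure with a smoothness measure can be sketched as follows. In this illustrative Python sketch, syntactic complexity is taken as the node count of the expression tree and semantic complexity as the roughness of the predicted curve on a sample grid; the mixing weight alpha, the grid, and the use of SymPy are assumptions, not the thesis's exact definitions.

    import numpy as np
    import sympy as sp

    def syntactic_complexity(expr):
        """Node count of the expression tree, used here as the structural (syntactic) measure."""
        return 1 + sum(syntactic_complexity(a) for a in expr.args)

    def semantic_complexity(expr, x, grid):
        """Roughness of the model response on a sample grid, used here as the semantic measure:
        mean squared second-order finite difference of the predicted curve."""
        f = sp.lambdify(x, expr, "numpy")
        y = np.asarray(f(grid), dtype=float)
        if y.ndim == 0:                      # constant expression: perfectly smooth
            return 0.0
        return float(np.mean(np.diff(y, 2) ** 2))

    def mixed_complexity(expr, x, grid, alpha=0.5):
        """Mixture of syntactic and semantic complexity; alpha is a hypothetical mixing weight."""
        return alpha * syntactic_complexity(expr) + (1 - alpha) * semantic_complexity(expr, x, grid)

    # Example: x**3 is structurally simple but has more curvature than 2*x + 1.
    x = sp.Symbol("x")
    grid = np.linspace(-1, 1, 50)
    print(mixed_complexity(x**3, x, grid), mixed_complexity(2*x + 1, x, grid))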
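The dynamic sample-weighting step might, under the assumptions stated in the comments, look roughly like the sketch below. The exponential update rule, the learning rate eta, and the tolerance formula are hypothetical; only the direction "the more important the point, the smaller the allowable error" comes from the abstract.

    import numpy as np

    def update_sample_weights(weights, residuals, eta=0.1):
        """One dynamic re-weighting step. The multiplicative-exponential rule (points with
        larger residuals are treated as more important) is an assumption, not the thesis rule."""
        weights = weights * np.exp(eta * np.abs(residuals))
        return weights / weights.sum()

    def allowable_errors(weights, eps_max=0.1):
        """Per-sample tolerance band: the larger the weight (importance), the smaller the
        allowable error for that sample."""
        return eps_max * (1.0 - weights / weights.max())

    # Usage: weights start uniform and are updated from the residuals of the current best model.
    n = 20
    w = np.full(n, 1.0 / n)
    residuals = np.random.randn(n)           # placeholder residuals
    w = update_sample_weights(w, residuals)
    tol = allowable_errors(w)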
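The structural-risk-minimization setup described above, with the sum of errors as the empirical risk, the model complexity standing in for the confidence risk, and an ensemble fusion of the Pareto-front solutions, might be expressed as in this minimal sketch. The predict-function interface and the unweighted averaging are assumptions for illustration.

    import numpy as np

    def srm_objectives(predict, X, y, complexity):
        """Two objectives in the structural-risk-minimization sense:
        empirical risk = sum of absolute errors, confidence risk ~ model complexity."""
        empirical_risk = float(np.sum(np.abs(predict(X) - y)))
        return empirical_risk, complexity

    def ensemble_predict(pareto_predictors, X):
        """Fuse the mutually independent Pareto-front solutions into one prediction;
        a plain unweighted average is assumed, the thesis may weight them differently."""
        return np.mean([predict(X) for predict in pareto_predictors], axis=0)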
Keywords/Search Tags:Data-driven modeling, Symbolic regression, Model complexity, Structural risk minimization, Multi-objective optimization