Establishing Interpretability For Support Vector Regression And Its Application

Posted on: 2011-02-13 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: X S Tan | Full Text: PDF
GTID: 1100360302994374 | Subject: Agricultural Entomology and Pest Control
Abstract/Summary:
Regression analysis, whose main tasks are prediction and explanation, plays an important role in data analysis. Multiple linear regression (MLR), stepwise linear regression (SLR), partial least squares regression (PLS), and quadratic polynomial regression (QPR) are based on the traditional empirical risk minimization principle and often use statistics such as the correlation coefficient R and the F statistic to test model validity. In addition, to rank the factors by importance and significance, a t-test or F-test is applied to the regression coefficients of the linear, quadratic, and interaction terms. These traditional models are therefore easier to interpret, but they are limited on high-dimensional, nonlinear, and small-sample problems. Moreover, the linear and quadratic terms often give contradictory rankings of factor significance. Artificial neural networks (ANN) have good nonlinear modeling ability but many defects, such as uncertain model structure, poor interpretability, a tendency to over-train or under-train, over-fitting, and local minima. Support vector regression (SVR), which is based on statistical learning theory and is developing rapidly, offers nonlinear modeling, strong generalization ability, high prediction precision, and resistance to over-fitting. However, the poor interpretability of SVR has not yet been resolved.

To address this poor interpretability, a complete set of model tests and factor analysis methods was established for SVR, based on the F-test and on the interpretability of the QPR model. It includes significance tests of the regression model and of single-factor importance, single-factor effect and sensitivity analysis, significance tests of two-factor interactions, and so on. After validation on two data sets, it was applied to screening drought-resistance indexes and to the relationship between cotton bollworm pupal development duration and temperature. Finally, it was used to guide three optimization experiments.
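The abstract does not give the formulas behind its F-based significance tests, so the following is only a minimal sketch of how such tests are usually computed for a regression model. It assumes the ordinary least-squares conventions (regression sum of squares taken as SST - SSE, residual degrees of freedom n - p - 1), which may differ from what the dissertation uses for SVR. The function names `model_f` and `partial_f` are hypothetical.

```python
from typing import Sequence


def model_f(y: Sequence[float], y_hat: Sequence[float], p: int) -> float:
    """Overall F statistic for a fitted model with p factors.

    Assumption: regression sum of squares = SST - SSE and residual
    degrees of freedom = n - p - 1, as in ordinary least squares.
    """
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)               # total sum of squares
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual sum of squares
    return ((sst - sse) / p) / (sse / (n - p - 1))


def partial_f(sse_full: float, sse_reduced: float,
              n: int, p_full: int, q: int = 1) -> float:
    """Partial F statistic for removing q factors from a p_full-factor model.

    sse_full is the residual sum of squares with all factors,
    sse_reduced the residual sum of squares after removing q of them.
    """
    return ((sse_reduced - sse_full) / q) / (sse_full / (n - p_full - 1))
```

Under these assumptions, a single factor's importance could be assessed by refitting the model without that factor and passing the two residual sums of squares to `partial_f`. A larger value suggests the factor contributes more to the fit.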
The results were as follows:

1) The interpretability system was verified with data from previous studies (a quantitative structure-property relationship (QSPR) for 76 anionic surfactants and two prescription optimization examples). The explanatory results were consistent with those of the stepwise linear regression model and the quadratic polynomial stepwise regression model, although subtle differences existed. This indicates that the explanatory system is reasonable, and the support vector regression model showed superiority over the reference models. More reasonable interpretations can therefore be expected from SVR.

2) The explanatory system was applied to two agriculture-related examples. (ⅰ) With the survival percentage under repeated drought as the target and support vector regression (SVR) as the nonlinear screening tool, six integrated indicators were selected from 24 morphological and physiological indicators in 15 paddy rice cultivars. In order of importance, they were: plant height** > proline** > malondialdehyde** > leaf age** > area of the first leaf under the central leaf** > ascorbic acid**. The SVR model with the six integrated indicators showed a distinct improvement in fitting and prediction precision over the MLR and SLR models. Considering ease of measurement, an SVR model using only six morphological indicators (shoot dry weight, area of the second leaf under the central leaf, root-shoot ratio, leaf age, leaf fresh weight, and area of the first leaf under the central leaf) was also feasible. (ⅱ) The improved support vector regression was applied to study in depth the relationship between cotton bollworm pupal development duration and temperature.
The results indicated that the SVR model built on all samples (observed data) performed better in both fitting and leave-one-out prediction (coefficient of determination R2 of 0.998 and 0.996, respectively) than the traditional nonlinear models (such as the Logan, Lactin, and Wang models). In addition, the three basic temperatures of the pupal stage were more credible, and the order of importance of the factors affecting cotton bollworm pupal development duration was obtained (temperature** > female/male pupae** > constant/variable temperature**). Finally, samples were selected uniformly for independent prediction, with 20 samples used as the training set. The SVR model achieved an independent-prediction R2 of 0.981, and when the training set was further reduced to 12 samples, R2 fell only to 0.964. By contrast, the best traditional model with 20 samples, the Lactin model, achieved an independent-prediction R2 of only 0.958. Moreover, the improved SVR resolved the poor explanatory ability and showed greater superiority on small sample sets than the traditional nonlinear models. This provides theoretical guidance for forecasting pest outbreaks and for artificial insect breeding.

3) The interpretative system was used to guide formulation optimization. (ⅰ) It was applied to optimize a complicated nine-factor fermentation medium for a variant of Escherichia coli that produces glutamate decarboxylase, which transforms glutamic acid into gamma-aminobutyric acid in vitro. After tests of 28 schemes, OD630, an activity index of glutamate decarboxylase, increased from 1.528 in the initial medium to 2.303 in the optimal medium. The best fermentation conditions were: beef extract 5 g/L, peptone 10 g/L, NaCl 3 g/L, glutamate 2.3 g/L, glucose 2 g/L, KH2PO4 3 g/L, MgSO4 0.6 g/L, pH 6.8, and fermentation time 20 h.
(ⅱ) It was applied to optimize the technological parameters of ethanol production from raw cassava. The parameters for the best ethanol yield were: solid-liquid ratio 1:1.8, initial pH 3.5, fermentation temperature 32℃, yeast inoculation amount 3.5×10^7 cells/mL, (NH4)2SO4 0.5 g, rotational speed 140 rpm, dosages of glucoamylase, α-amylase, and cellulase of 200 u/g, 12 u/g, and 25 u/g, and fermentation cycle 120 h. Under these conditions the ethanol concentration reached 15.7%. The parameters for the best raw-material conversion rate were: solid-liquid ratio 1:2.5, pH 4, temperature 36℃, yeast inoculation amount 5.5×10^7 cells/mL, (NH4)2SO4 3 g, rotational speed 160 rpm, dosages of glucoamylase, α-amylase, and cellulase of 170 u/g, 10 u/g, and 25 u/g, and fermentation cycle 120 h, giving a raw-material conversion rate of 38.63%. By comparison, the initial formulation gave an ethanol yield of 9.2% and a raw-material conversion rate of 24.76%.

(ⅲ) It was applied to optimize a complicated six-factor artificial diet for the cotton bollworm, Helicoverpa armigera (Hübner). After only 14 schemes, the mean pupal weight increased from 0.2436 g with the initial benchmark formulation to 0.3044 g with the optimal prescription. The optimal artificial diet was: soybean powder 172 g, wheat bran 14.4 g, yeast extract 68 g, sucrose 21.2 g, rapeseed oil 2 drops, and VC (vitamin C) 40 tablets. The approach was therefore not only more efficient than the reference models but also better than the UD-SVR formulation optimization established in earlier experiments, and the number of experiments was reduced.

In summary, the established SVR explanatory system resolved the problem of poor interpretability and provided a precise, well-guided, interpretable, and efficient solution for the experimental design and analysis of multilevel formula optimization.
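The leave-one-out R2 quoted for the pupal development model can be illustrated with a short sketch. This is not the dissertation's code. The stand-in model is a simple least-squares line rather than SVR, R2 is assumed to be 1 - SSE/SST (the abstract does not say which definition it uses), and the names `fit_line` and `loo_r2` are hypothetical.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], float]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Stand-in model (assumption): least-squares straight line, not SVR."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def loo_r2(xs: Sequence[float], ys: Sequence[float],
           fit: Callable[[Sequence[float], Sequence[float]], Model] = fit_line) -> float:
    """Leave-one-out R2: each sample is predicted by a model trained without it."""
    xs, ys = list(xs), list(ys)
    preds: List[float] = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # drop sample i from the training set
        train_y = ys[:i] + ys[i + 1:]
        preds.append(fit(train_x, train_y)(xs[i]))
    mean_y = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst
```

Any fitting routine with the same shape as `fit_line` (training inputs and targets in, a prediction function out) could be passed as `fit`, so an SVR fit could be evaluated the same way.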
Keywords/Search Tags: support vector regression, explanatory systems, quantitative structure-activity relationship, drought resistance index, developmental period, formula optimization