Application Of Novel Hybrid Feature Selection With Machine Learning Methods On Chemical Engineering Data

Posted on: 2017-01-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z F Liu
Full Text: PDF
GTID: 1361330596452623
Subject: Chemical Engineering and Technology
Abstract/Summary:
Chemical engineering processes now produce huge amounts of highly complex data, owing to the widespread use of monitoring instruments and extensive experimental work. The main problem with chemical engineering data is how to handle its large scale and the unintuitive relationships within it. As a consequence, feature (variable) selection has become a hot topic in many areas of chemical engineering research. The present work aims to address the shortcomings of existing feature selection methods. In view of the complexity and non-linearity of chemical engineering data, a set of novel feature selection and enhanced optimization methods is proposed and applied to several chemical engineering fields. The results not only increase the accuracy and simplicity of the models but also help researchers understand the processes.

To meet the demand for rapid feature selection, a novel sequential feature elimination approach based on the Boruta algorithm has been developed. Exploiting the stability of Boruta's variable importance evaluation, it couples the algorithm with an efficient greedy search to reduce the number of features in the original data set step by step. This produces a series of feature subsets of different lengths from which decision makers can choose. In the application to chemical biodegradability research, 16 feature subsets were obtained and used to build random forest models for predicting the biodegradability of chemicals. The model built on the optimal feature subset improves accuracy on the external validation data set by 1.4 percentage points compared with previous results. In the application to a data set collected from a real CO2 removal process, the approach successfully reduced the number of operational variables required to predict all 3 response variables to 4. Meanwhile, the models built by the
selected operational variables possess optimal accuracy on the validation data sets. The feature selection approach not only simplifies the model inputs but also lays a foundation for further optimization of the entire process.

Wrapper approaches to feature selection can meet the demand for high prediction accuracy in chemical engineering; however, overfitting is one of the main factors that degrade the predictive ability of wrappers. To avoid this problem, a novel wrapper approach has been developed. It combines self-organizing maps (SOM) and random forest to cluster the features into several groups, from which representative features are then selected. These representative features are forced into the final feature subset so that it covers as much of the original information as possible. With the representative features fixed, the operator parameters of the genetic algorithm (GA) can be tuned appropriately to control the depth of the search. Applied to the chemical biodegradability data set, the novel SOM-RF wrapper resists overfitting very well when the mutation and crossover probabilities in the GA are set to 0.3 and 0.2, respectively, as shown by a comparison with 6 different search scenarios. Meanwhile, the prediction accuracy increases from the 0.877 reported by previous researchers to 0.893 with the SOM-RF wrapper.

According to Occam's razor, one source of overfitting in wrapper methods is the use of an excessive number of features as model inputs, so a multiobjective wrapper approach is adopted to resist overfitting. The multiobjective wrapper considers model complexity and generalization ability at the same time, and can therefore produce a curve showing how prediction accuracy varies with the number of features in the subset. Given the merits of the
multiobjective wrapper, it is applied to selecting feature subsets for a quantitative structure-property relationship (QSPR) model developed to predict the octane number of pure components. The data set includes a number of oxygenates and nitrogen compounds so that a universal model fitting multiple types of molecules can be constructed. Because an excessive number of redundant descriptors degrades the performance of QSPR models, a two-step feature selection procedure is developed. In the first step, filter methods based on the Pearson correlation coefficient and the Boruta algorithm are applied in succession to remove irrelevant features from the data set. In the second step, a multiobjective wrapper, rather than a single-objective wrapper, is applied to the filtered data set to refine the results and avoid overfitting. The second step yields a curve relating the length of each feature subset to its generalization capability. Finally, feature subsets of 12 and 23 descriptors are selected for predicting the research octane number (RON) and the motor octane number (MON), respectively, and support vector machine regression models are built on them. The mean absolute errors for RON and MON fall below 4 and 4.4 units, respectively. Compared with previous studies that covered only hydrocarbons, the error for MON is reduced by 1.3 units while the performance on RON remains at the same level.

Because overfitting can still arise during the search process of the multiobjective wrapper, a novel weighted-sum objective function is developed. It is combined with random forest and considers the internal validation performance on the training set and the generalization capability over a "selection" set at the same time. Meanwhile, a novel two-phase
multiobjective wrapper is developed, because traditional methods take too long to run: they resist overfitting by repeating the calculation to average out randomness. In the first phase, linear discriminant analysis is combined with the multiobjective optimization method NSGA-II to produce candidate solutions. The second phase uses a non-linear classifier to refine the solutions further. Because the linear classifier is used in the most time-consuming optimization process, the optimization time is greatly reduced. Meanwhile, the refining process in the second phase is combined with the weighted-sum objective function to ensure the predictive capability of the selected solutions. The novel approach is applied to the chemical biodegradability data and produces two important results. First, an optimal prediction accuracy of 0.894 is achieved by a subset of 19 descriptors, which is close to the accuracy of 0.893 obtained by the SOM-RF wrapper in Chapter 3 but is more stable. Second, two shorter feature subsets with satisfactory predictive capability are obtained; they achieve a precision above 0.88 using only 5 or 6 descriptors, greatly simplifying the original model inputs.

Feature selection is essentially an optimization problem that relies on optimization algorithms, so it is necessary to study hybrid algorithms that improve search capability. A hybrid method combining a genetic algorithm and pattern search is applied to the optimization of a Lurgi-type methanol plant. By optimizing the shell temperature trajectories and the recycle rate of carbon dioxide, the production rate of the reactor is increased by 2.53% while the recycle rate of carbon dioxide remains at 5%. This improves the economic benefits while decreasing CO2 emissions. Hybrid optimization algorithms also have broad prospects for improving the effectiveness of feature selection.

In summary, appropriate strategies of feature
selection hybridized with advanced machine learning algorithms can improve prediction precision while reducing model complexity. They can turn the "black box" of huge amounts of chemical data into a "gray" one, and thereby help researchers uncover the complicated mechanisms of chemical engineering, which in turn helps turn the "black box" completely "white".
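
The curve relating subset length to prediction accuracy, which the multiobjective wrapper is described as producing, can be illustrated with a minimal Python sketch. This is an assumption-laden illustration and not the dissertation's implementation: the function name `pareto_front` and the candidate values are hypothetical, and a real wrapper would obtain each accuracy by training and validating a model on the corresponding feature subset.

```python
from typing import List, Tuple

# Hypothetical representation: (number of features, validation accuracy).
Candidate = Tuple[int, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the non-dominated candidates, ordered by feature count.

    A candidate is dominated if another one uses no more features and
    reaches at least the same accuracy (and differs in at least one).
    """
    front: List[Candidate] = []
    best_acc = float("-inf")
    # Sort by feature count ascending; within ties, best accuracy first.
    for n_features, acc in sorted(candidates, key=lambda c: (c[0], -c[1])):
        # Keep a larger subset only if it strictly improves accuracy.
        if acc > best_acc:
            front.append((n_features, acc))
            best_acc = acc
    return front


if __name__ == "__main__":
    # Illustrative values only, not results from the dissertation.
    evaluated = [(5, 0.80), (6, 0.84), (6, 0.82), (10, 0.83),
                 (12, 0.88), (19, 0.90), (25, 0.89)]
    print(pareto_front(evaluated))
    # [(5, 0.8), (6, 0.84), (12, 0.88), (19, 0.9)]
```

Reading the returned list from left to right gives the trade-off curve: each step shows how much accuracy is gained by accepting a longer feature subset, which is the information a decision maker needs to pick a short subset with acceptable accuracy.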
Keywords/Search Tags: Feature (Variable) selection, Machine learning, Multi-objective optimization, Overfitting, Hybrid strategy