| The quality of hot-rolled strip steel is a key factor in an enterprise's customer service level and economic performance. Enterprises have long sought to use large-scale production data to predict product defects accurately and to improve process parameters and product quality during production. In this paper, the process parameters of hot-rolled strip steel production at an iron and steel enterprise are analyzed with machine learning methods to uncover the implicit relationship between process parameter values and product quality and to extract the key process parameter features. The research problem can be framed as a binary classification problem on an imbalanced data set from the iron and steel production process. Combining data analysis methods from statistics and machine learning, the research process is divided into four steps: data preprocessing, feature selection, model building and parameter tuning, and result evaluation. (1) In data preprocessing, the original data set is processed by missing-value filling, single-value processing, outlier cleaning, and duplicate-value processing to produce a usable data set. (2) Feature selection reduces the dimensionality of the high-dimensional data set, extracts the features that have the greatest impact on the target variable, and removes redundant features to facilitate building the classification model. (3) In selecting and constructing the classifier, the random forest algorithm is chosen as the base classifier, and an improved algorithm is proposed that comprises optimization for the imbalanced data set, optimization of the node-splitting algorithm, and the use of mutual information to construct the random feature subsets in the random forest. (4) Finally, the accuracy of the classification model is verified by stratified K-fold cross-validation, and the classification results are presented with a confusion matrix and ROC curve. The paper concludes by introducing data visualization and analysis software, developed on the basis of the open-source software Orange Canvas, that supports the entire data analysis process from source data to final evaluation results. |
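The evaluation step described in the abstract (stratified K-fold cross-validation summarized by a confusion matrix) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data (10 defective and 90 normal samples with one numeric feature), the helper names `stratified_k_fold`, `confusion_counts`, and `fit_threshold`, and the toy threshold classifier that stands in for the improved random forest.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Split sample indices into k folds that keep roughly the class ratio of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        # Deal the indices of each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts


def fit_threshold(xs, ys):
    """Toy stand-in classifier (hypothetical): midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


# Synthetic, imbalanced data (assumed): 10 defective (1) and 90 normal (0) samples.
rng = random.Random(42)
labels = [1] * 10 + [0] * 90
values = [rng.gauss(2.0 if y == 1 else 0.0, 1.0) for y in labels]

folds = stratified_k_fold(labels, k=5)
totals = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Fit on the training folds only, then predict on the held-out fold.
    threshold = fit_threshold([values[i] for i in train_idx],
                              [labels[i] for i in train_idx])
    preds = [1 if values[i] > threshold else 0 for i in test_idx]
    counts = confusion_counts([labels[i] for i in test_idx], preds)
    for key in totals:
        totals[key] += counts[key]

# Confusion matrix accumulated over all five held-out folds.
print(totals)
```

Stratifying matters on an imbalanced data set because a plain random split can leave a fold with few or no defective samples, which makes per-fold recall and ROC estimates unstable. Here every fold holds exactly 2 of the 10 defective samples.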