With the rapid development of information technology, data mining has become ubiquitous as an interdisciplinary subject. At the same time, as the country attaches increasing importance to education, smart education, a new application of information technology in the field of education, has also become a hot topic. Schools and educational institutions have recognized the importance of smart education and have adopted new methods such as distance education and interactive learning to diversify education. Distance education breaks geographical and time constraints, but it also brings a series of problems, such as a high dropout rate. It is therefore necessary to apply data mining technology to smart education, remedy its deficiencies, and promote the construction of a smart management system. In data mining research, most of the data used is imbalanced. If the imbalance is too severe, the prediction performance of the model is greatly reduced. Based on the actual operational educational data of an Open University, this thesis studies the construction of imbalanced data classification models at both the data level and the algorithm level, with the aim of providing early warnings to relevant teachers and students and promoting the construction of the Open University’s smart education management system. The research content of this thesis is summarized in the following four points. First, this thesis selects and organizes multi-source data according to the research objectives, and applies data preprocessing steps such as data integration, data cleaning, data conversion, and feature selection to obtain an imbalanced performance data set and an imbalanced student loss data set. Second, this thesis uses SPSS Modeler for a preliminary screening of algorithms. Based on the experimental results and comprehensive model evaluation indicators, the best data mining algorithm is selected from the six
classification algorithms C5.0, Bayesian network, LR, SVM, RF, and ANN. Third, an improved random forest algorithm based on repeated random undersampling and a new Gini coefficient, CostGini, is proposed. This improved algorithm addresses the imbalanced data classification problem at both the data level and the algorithm level. The repeated random undersampling method resamples the imbalanced training data to generate multiple balanced training subsets, which mitigates the loss of majority-class sample information; the new Gini coefficient, CostGini, takes the misclassification cost of samples into account. The model based on the improved algorithm effectively improves prediction performance. Finally, model deployment achieves the goal of early warning for relevant teachers and students. It provides technical support for teachers to intervene in student learning in a timely manner and for students to improve their learning behavior in a timely manner. In this way, smart management is realized and the construction of the Open University’s smart education management system is promoted.
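
The repeated random undersampling step described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the function name `repeated_random_undersample`, its parameters, and the toy 90/10 label split are hypothetical and chosen only for illustration, and the CostGini criterion is not shown because its definition is not given here. The sketch assumes that each balanced subset keeps every minority-class sample and draws an equally sized random sample from each larger class.

```python
import random
from collections import Counter


def repeated_random_undersample(labels, n_subsets, seed=0):
    """Return n_subsets lists of sample indices, each balanced across classes.

    Hypothetical helper for illustration only. Each subset contains all
    samples of the smallest class plus an equally sized random draw
    (without replacement) from every other class.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    minority_size = min(len(idxs) for idxs in by_class.values())

    subsets = []
    for _ in range(n_subsets):
        subset = []
        for idxs in by_class.values():
            # For the minority class this selects every sample;
            # for larger classes it selects a fresh random subset.
            subset.extend(rng.sample(idxs, minority_size))
        rng.shuffle(subset)
        subsets.append(subset)
    return subsets


# Toy imbalanced labels (assumed): 90 majority samples (0), 10 minority samples (1).
labels = [0] * 90 + [1] * 10
subsets = repeated_random_undersample(labels, n_subsets=5)

for subset in subsets:
    print(Counter(labels[i] for i in subset))
```

Because each subset draws a different random portion of the majority class, the subsets taken together cover more majority-class samples than a single undersampled set would, which is the motivation stated above for repeating the sampling. In a random forest setting, one plausible design is to train each tree (or group of trees) on a different balanced subset.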