ETL And GBDT Based Parallel Duplicate Removal Method For Question Bank

Posted on: 2017-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
GTID: 2417330569998651
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, modern society and education have entered a highly information-driven age, and "Internet + education" has profoundly changed the field of education. The mass production of educational resources has also led to a relatively high degree of duplication on education platforms. To reduce storage space and improve retrieval efficiency and user experience, this paper presents a parallel deduplication method for test questions based on ETL and GBDT. In the experiments, models trained on different feature combinations after ETL preprocessing were compared; the results show that a GBDT model combining features such as simhash achieves good deduplication results, and that computing on a Hadoop cluster increases computation speed, scales out, and handles large-scale data. The main work of this paper is as follows:
1) Because the form and format of question content are not uniform, this paper designs an ETL-based preprocessing pipeline that schedules and preprocesses the data in the question database, providing the data source for extracting the textual features of the question bank.
2) To solve the duplicate removal problem, this paper designs a GBDT-based training model that extracts and combines simhash, LCS, Jaccard, and TF-IDF features and then computes similarity by calling the trained model, which improves the accuracy of duplicate removal.
3) Using Hadoop Streaming with Python, the test data and the trained model are stored in HDFS, and similarity is computed in parallel on a Hadoop cluster, which greatly speeds up question deduplication and supports large-scale data.
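The pairwise similarity features named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the whitespace tokenizer, the 64-bit unweighted simhash built from MD5, and the normalization of each feature to the range 0 to 1 are all assumptions made for illustration. The TF-IDF feature is left out because it needs corpus-wide statistics. In a setup like the one described, a vector of this kind would be computed for each candidate question pair and passed to a trained GBDT classifier.

```python
import hashlib


def tokens(text):
    """Lowercase whitespace tokenization (an assumption; Chinese text would need a word segmenter)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity of the two token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lcs_ratio(a, b):
    """Longest common subsequence length divided by the longer sequence length."""
    if not a or not b:
        return 1.0 if not a and not b else 0.0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / max(len(a), len(b))


def simhash(toks, bits=64):
    """Unweighted simhash fingerprint built from MD5 token hashes (hypothetical choice of hash)."""
    counts = [0] * bits
    for t in toks:
        h = int.from_bytes(hashlib.md5(t.encode("utf-8")).digest()[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def simhash_similarity(a, b, bits=64):
    """One minus the normalized Hamming distance between the two fingerprints."""
    distance = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 1.0 - distance / bits


def pair_features(q1, q2):
    """Feature vector for one question pair: [jaccard, lcs_ratio, simhash_similarity]."""
    a, b = tokens(q1), tokens(q2)
    return [jaccard(a, b), lcs_ratio(a, b), simhash_similarity(a, b)]
```

Calling `pair_features(q1, q2)` on two question strings returns three values between 0 and 1, with higher values indicating more similar text. Because each pair is scored independently, a function like this could run inside a Hadoop Streaming mapper that reads question pairs from standard input, which is one plausible reading of the parallel step in item 3.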
Keywords/Search Tags:duplicate removal, ETL, feature selection, GBDT, hadoop