
Research On Truth Discovery Based On Ensemble Learning And Object Difficulty

Posted on: 2023-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: K Wang
GTID: 2568307076985339
Subject: Computer Science and Technology
Abstract/Summary:
With the spread of mobile sensing devices, the volume of data on social networks has grown explosively, and society has entered an era of information overload. The Internet is full of conflicting data, which seriously hinders people who want to obtain information about objects of interest. The goal of truth discovery is to process such conflicting data efficiently and obtain the truth for each object. Many truth discovery methods have been proposed, built on different theoretical models or on different aspects of the problem such as sources, objects, and claims. However, no existing method outperforms all others in every application scenario, so a small number of ensemble truth discovery methods have been proposed. These existing ensemble methods leave several problems unsolved:

(1) The only two existing ensemble truth discovery methods target categorical data, and ensemble methods for continuous data have received little study.
(2) Existing ensemble truth discovery methods fall short in one of three ways. Some treat each (source, method) pair as a virtual data source without considering the reliability of the method separately. Others combine the results of different algorithms directly, without using helpful intermediate results such as source reliability. Still others integrate the results of different algorithms with an existing truth discovery method, whose limited performance reduces the accuracy of the integrated result.
(3) Most existing methods ignore object difficulty. The only method that addresses it considers the error factor of claims on categorical datasets, and it models difficulty in a single way. Integrating such methods reduces the accuracy of truth discovery.

To address these problems, this thesis draws on techniques from machine learning, probability theory, and stochastic processes. The main contributions are as follows.

First, to address ensemble truth discovery on continuous data and the modeling of object difficulty, and inspired by the Bagging and Stacking methods in ensemble learning, this thesis proposes a parallel ensemble truth discovery framework. Object difficulty is modeled according to the characteristics of continuous data, and several different traditional truth discovery algorithms serve as base algorithms. For a given dataset, each base algorithm computes the estimated value and the difficulty of every object. Once all base algorithms have finished, their results are used as the input to central truth discovery. To obtain the final estimate of each object in this step, the thesis proposes a new algorithm, DETD, which takes the reliability of each base algorithm into account when aggregating the results. Comprehensive experiments on two real-world datasets show that the parallel framework estimates object truths effectively and improves the accuracy of truth discovery.

Second, to address underused source reliability and the limited performance of existing methods, and inspired by the Boosting and Stacking methods in ensemble learning, this thesis proposes a serial ensemble truth discovery framework. It models object difficulty, uses different traditional truth discovery algorithms as base algorithms, and runs them one after another. Each base algorithm computes not only the estimated value and difficulty of each object but also the reliability of every source, and the source reliability estimated by one base algorithm initializes the source reliability of the next. This effectively reduces the influence of uniform source reliability initialization on the results. After all base algorithms have run in sequence, their results are used as the input to central truth discovery, for which the thesis proposes a new algorithm, SETD. Experiments on the two real-world datasets show that SETD outperforms all base methods.

Finally, this thesis designs and develops an ensemble truth discovery system that implements both proposed frameworks. Its main functions are user registration and login, upload of the original dataset and the ground truth dataset, base algorithm selection, ensemble mode selection, truth calculation, and result download. Users can register and log in to the website, upload local datasets, select traditional truth discovery algorithms as base algorithms, choose the parallel or serial ensemble mode, and download the dataset produced by ensemble truth discovery.
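
The parallel and serial ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the DETD or SETD algorithm, whose details the abstract does not give. The three base algorithms (`plain_mean`, `weighted_mean`, `weighted_median`), the loss-based weight update, the use of claim spread as a stand-in for object difficulty, and the sample `claims` data are all assumptions made for illustration. The central step simply treats each base algorithm as a virtual source and reuses the same iterative routine, so the learned weights play the role of base algorithm reliability.

```python
import math
from statistics import pstdev


def plain_mean(pairs):
    # Baseline: ignore source weights entirely.
    return sum(v for v, _ in pairs) / len(pairs)


def weighted_mean(pairs):
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total


def weighted_median(pairs):
    half = sum(w for _, w in pairs) / 2.0
    acc = 0.0
    ordered = sorted(pairs)
    for v, w in ordered:
        acc += w
        if acc >= half:
            return v
    return ordered[-1][0]


def run_base(claims, combine, init_weights=None, iters=10):
    """Iterative truth discovery on continuous claims.

    claims: {object: {source: value}}; combine: [(value, weight)] -> estimate.
    Returns (estimated truths, source weights).
    """
    sources = sorted({s for c in claims.values() for s in c})
    weights = dict(init_weights) if init_weights else {s: 1.0 for s in sources}
    # Assumption: spread of claims per object as a crude difficulty proxy.
    difficulty = {o: pstdev(list(c.values())) or 1.0 for o, c in claims.items()}
    truths = {}
    for _ in range(iters):
        truths = {o: combine([(v, weights[s]) for s, v in c.items()])
                  for o, c in claims.items()}
        loss = {s: 1e-9 for s in sources}
        for o, c in claims.items():
            for s, v in c.items():
                # Errors on difficult (high-spread) objects are penalized less.
                loss[s] += ((v - truths[o]) / difficulty[o]) ** 2
        total = sum(loss.values())
        weights = {s: -math.log(loss[s] / total) + 1e-9 for s in sources}
    return truths, weights


def ensemble(claims, combiners, serial=False):
    """Run every base algorithm, then aggregate their outputs centrally."""
    outputs, carried = {}, None
    for name, combine in combiners.items():
        truths, weights = run_base(claims, combine, init_weights=carried)
        if serial:
            # Serial mode: the next base algorithm starts from these weights
            # instead of a uniform initialization.
            carried = weights
        outputs[name] = truths
    # Central step: each base algorithm is treated as a virtual source.
    central = {o: {name: outputs[name][o] for name in outputs} for o in claims}
    return run_base(central, weighted_mean)


# Hypothetical data: source s3 disagrees strongly with s1 and s2.
claims = {
    "temp_A": {"s1": 20.1, "s2": 19.9, "s3": 27.0},
    "temp_B": {"s1": 15.0, "s2": 15.2, "s3": 9.0},
    "temp_C": {"s1": 30.2, "s2": 29.8, "s3": 35.5},
}
combiners = {
    "weighted_mean": weighted_mean,
    "weighted_median": weighted_median,
    "plain_mean": plain_mean,
}

parallel_truths, parallel_reliability = ensemble(claims, combiners)
serial_truths, serial_reliability = ensemble(claims, combiners, serial=True)
```

In this sketch the only difference between the two modes is the `carried` variable: in parallel mode every base algorithm starts from uniform source weights, while in serial mode each one inherits the weights estimated by its predecessor.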
Keywords/Search Tags: Data mining, Truth discovery, Ensemble learning, Object difficulty, Source reliability initialization