
Microblogging Advertising Publisher Identification Based On Similarity Calculation And Semi-Supervised Clustering Method

Posted on: 2019-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Zhao
GTID: 2348330545485236
Subject: Software engineering
Abstract/Summary:
With its rapid development, microblogging has become one of the most mainstream forms of self-media in China. Massive microblogging data is very important for public opinion research, and research results based on it are also promising for sociology and journalism. However, advertising content in the microblog space keeps growing, and it seriously degrades both the experience of ordinary users and the related research work. This thesis proposes a method for identifying microblogging advertising publishers. Working at the user level, it introduces the concept of the core microblog, extracts features through similarity calculation, timing regularity calculation and other methods, and clusters those features with the semi-supervised clustering algorithm C-DBSCAN to identify advertising publishers. The main work of this thesis includes:

1. The concept of the core microblog sequence. Advertising publishers mix a large number of ordinary microblogs in with their advertising microblogs. To address this, the thesis proposes the concept of core microblogs and filters each user's core topics out of the large volume of mixed microblogs.
2. A text vector representation based on TF-IDF weighting. The thesis studies word vector generation algorithms based on neural probabilistic language models, such as Word2vec, and proposes two text vector representations: a simple mean of the word vectors and a TF-IDF-weighted representation.
3. An improved WMD (Word Mover's Distance) similarity measure. The thesis studies existing text similarity measures and proposes an improved WMD model for calculating text similarity, in which the weight of each word is computed by TF-IDF and used as the weight of the transfer distance.
4. The C-DBSCAN clustering method based on pairwise constraints. The thesis modifies the DBSCAN clustering algorithm to make use of labeled data, yielding the semi-supervised clustering algorithm C-DBSCAN, which guides the clustering process through pairwise constraints.

The thesis designs its experiments on a dataset of 3.1 million microblog records. The experimental results show that the proposed method is reasonable and effective.
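As an illustration of item 2, the following is a minimal sketch of a TF-IDF-weighted text vector built from word vectors. It is not the thesis's implementation: the function names, the smoothed IDF formula, and the two-dimensional toy embeddings are all assumptions made for the example. In practice the embeddings would come from a trained model such as Word2vec.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency per token (smoothing is an assumption)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def weighted_text_vector(doc, embeddings, idf):
    """TF-IDF-weighted mean of the word vectors of the tokens in `doc`."""
    # Only tokens that have both an embedding and an IDF value contribute.
    tf = Counter(tok for tok in doc if tok in embeddings and tok in idf)
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        w = (count / len(doc)) * idf[tok]  # term frequency times IDF
        weight_sum += w
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
    if weight_sum == 0.0:
        return total  # no known tokens: zero vector
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical toy data: tokenized posts and made-up 2-D word vectors.
docs = [
    ["buy", "discount", "now"],
    ["discount", "sale", "now"],
    ["weather", "nice", "now"],
]
embeddings = {
    "buy": [1.0, 0.0], "discount": [0.9, 0.1], "sale": [0.8, 0.2],
    "weather": [0.0, 1.0], "nice": [0.1, 0.9], "now": [0.5, 0.5],
}

idf = idf_table(docs)
vectors = [weighted_text_vector(d, embeddings, idf) for d in docs]
ad_vs_ad = cosine(vectors[0], vectors[1])
ad_vs_chat = cosine(vectors[0], vectors[2])
```

With these toy inputs, the two advertising-like posts end up closer to each other than to the ordinary post, because the word "now", which appears in every post, receives the lowest IDF weight and so contributes least to each text vector.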
Keywords/Search Tags: Weibo Advertising, WMD, DBSCAN, Similarity Calculation, Feature Extraction