Research And Application Of Topic Model For Short Texts Based On Part-of-Speech Feature And Semantic Enhancement

Posted on: 2020-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: M Chen
Full Text: PDF
GTID: 2428330575958036
Subject: Computer technology
Abstract/Summary:
Short text media have gradually become an important source of information in daily life. Mining latent topics from short text corpora is crucial for content-based analysis tasks. Compared with traditional long texts, such as news reports and scientific literature, short texts are usually brief, casual, real-time, and massive in volume, which makes topic analysis for short texts a challenging problem. Existing topic models for short texts often fail to analyze a topic comprehensively and specifically, and they do not handle the sparsity problem of short texts well. Moreover, these models mainly target fixed short text corpora: they are offline models and cannot handle real-time short text streams in practical scenarios. In addition, the training algorithms of topic models for short texts are often stand-alone. Because short texts are relatively cheap to generate and user participation is high, short text data sets are usually large, so stand-alone training algorithms perform very poorly.

To address these problems, this thesis proposes a topic model for short texts based on part-of-speech features and semantic enhancement, with both offline and online modes. We also design and implement parallel training algorithms for topic models for short texts in large-scale settings. In addition, these research results are applied to the production system of the Jiangsu citizen hotline service platform. The primary contributions of this thesis are as follows:

(1) For fixed short text corpora, we propose an offline topic model for short texts based on part-of-speech features and semantic enhancement, called PFE-DMM. By introducing a feature word distribution for each topic, PFE-DMM can effectively describe different aspects of a topic. At the same time, PFE-DMM promotes semantically related words under the same topic for specific part-of-speech features, which alleviates the sparsity problem of short texts.

(2) For real-time short text streams, we propose an online topic model for short texts, called OPFE-DMM, which is based on PFE-DMM. By dividing the short text stream into time slices and capturing topic coherence across slices with a historical contribution factor, the model can track the evolution of a topic.

(3) To improve the training efficiency of topic models on large-scale corpora, we design and implement a Spark-based parallel training algorithm for the topic models proposed in this thesis. The algorithm is further optimized to maintain model accuracy while reducing training time.

(4) Based on the real needs of a citizen hotline service platform in Jiangsu Province, we build an efficient system for large-scale text analysis using the key techniques proposed above, which verifies the validity of the proposed topic model for short texts.
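The abstract does not give the formula behind the historical contribution factor, so the following is only a minimal sketch of one common way such a factor could work: the topic-word counts learned on one time slice are scaled and added to a symmetric prior for the next slice. The names `slice_stream`, `evolve_prior`, `base_beta`, and `history_weight` are hypothetical, and slicing by document count rather than by timestamp is an assumption made for brevity; none of this is taken from OPFE-DMM itself.

```python
from collections import Counter


def slice_stream(docs, slice_size):
    """Split a stream of documents into consecutive slices of at most slice_size.

    Assumption: slices are formed by document count; a real stream would
    more likely be cut by timestamp.
    """
    return [docs[i:i + slice_size] for i in range(0, len(docs), slice_size)]


def evolve_prior(prev_topic_word_counts, base_beta, history_weight, vocab):
    """Build per-topic word priors for the next slice.

    Hypothetical update rule: prior = base_beta + history_weight * previous count.
    A history_weight of 0 forgets the past; larger values keep topics closer
    to what was learned on the previous slice.
    """
    return [
        {word: base_beta + history_weight * counts.get(word, 0) for word in vocab}
        for counts in prev_topic_word_counts
    ]


# Illustrative data only: one topic learned on the previous slice.
previous_counts = [Counter({"bus": 4, "road": 1})]
priors = evolve_prior(previous_counts, base_beta=0.1, history_weight=0.5,
                      vocab=["bus", "road", "water"])
slices = slice_stream(["d1", "d2", "d3", "d4", "d5"], slice_size=2)
```

Under this assumed rule, a word that was frequent in a topic during the previous slice starts the next slice with a larger prior in that topic, which is one way a model could keep topics aligned over time while still letting them drift.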
Keywords/Search Tags:Topic model, Short text, Text mining, Natural language processing, Parallel computing