
12345 Mayor Public Telephone Text Clustering Based On K-means

Posted on: 2022-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: F X Cao
GTID: 2517306491460184
Subject: Statistics
Abstract/Summary:
As citizens' appeals to the 12345 Mayor Public Telephone hotline have increased, the volume of hotline data has grown on a large scale. This thesis therefore applies keyword extraction and text clustering to this large-scale data in order to mine the themes of citizens' appeals in depth. The main work is as follows:

(1) Word vector representation. Word vectors are the basis of text clustering. To obtain a good representation, the word2vec method is trained on the text data to produce distributed vector representations of words. The experimental results show that these word vectors alleviate feature sparseness and the inability to compute word similarity, and provide a good text representation for the subsequent text clustering task.

(2) Keyword extraction. Keyword extraction is treated as a classification problem. First, TF-IDF is used to construct a candidate keyword set. Then the deep pre-trained language model Bidirectional Encoder Representations from Transformers (BERT) is fine-tuned for keyword extraction. Because 12345 Mayor Public Telephone texts are short and contain many phrases and varied local nouns, extracting keywords directly produces a lot of noise. This thesis therefore uses chi-square statistics to build a stop word list for the mayor's public hotline. Compared with not removing stop words, removing the stop words selected by chi-square statistics increases the F1 score of the keyword extraction task by 6%.

(3) Text clustering. At present, the 12345 Mayor Public Telephone uses first-level and second-level industry categories as the criteria for identifying citizens' demands. The first-level categories are too general to give an intuitive understanding of an event, which leads to superficial decision-making. The second-level categories, in contrast, are too complicated and overlap with one another, and some do not directly reflect citizens' demands, which makes it impossible to grasp the key points in decision-making. In response to these problems, this thesis uses the K-means method to perform cluster analysis on the text and obtain its topics. The results show that text clustering can resolve the problems of excessive and overlapping classifications and can directly obtain the topics of citizens' appeals.
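
The clustering step described in (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the texts are already tokenized, it swaps the word2vec representation for simple TF-IDF bag-of-words vectors so that it runs with the standard library only, and the toy documents and the helper names (tfidf_matrix, kmeans, top_terms) are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows): one L2-normalised TF-IDF row per tokenised doc."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for term, count in Counter(doc).items():
            # Smoothed IDF, always positive (an assumption of this sketch).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            row[index[term]] = (count / len(doc)) * idf
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # converged: assignments did not change
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(centroid, vocab, n=3):
    """Highest-weighted vocabulary terms of a centroid, as a rough topic label."""
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return [term for _, term in ranked[:n]]


# Hypothetical, pre-tokenised hotline texts on two themes.
docs = [
    ["water", "supply", "cut", "off"],
    ["water", "pipe", "burst"],
    ["water", "bill", "too", "high"],
    ["noise", "construction", "night"],
    ["noise", "karaoke", "late"],
    ["noise", "factory", "complaint"],
]
vocab, rows = tfidf_matrix(docs)
labels, centroids = kmeans(rows, k=2)
for j, centroid in enumerate(centroids):
    print(j, top_terms(centroid, vocab))
```

Reading the highest-weighted terms of each centroid is one simple way to name the topic of a cluster. With averaged word2vec vectors in place of the TF-IDF rows, the kmeans function would be used the same way, but the centroid dimensions would no longer map directly to vocabulary terms.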
Keywords/Search Tags: keyword extraction, BERT, appeal data, text clustering