Chinese Text Keyword Extraction Algorithm Based On Graph And LDA

Posted on: 2020-03-08

Degree: Master

Type: Thesis

Country: China

Candidate: Q Guo

GTID: 2428330572973667

Subject: Information and Communication Engineering

Abstract/Summary:

Scientific and technological progress in the information age is accompanied by the generation of massive amounts of data. Extracting valuable, critical information from such complex and redundant content is extremely important, and this is the purpose of data mining. Text is one of the most influential forms of information we encounter, and extracting keywords is one way to let readers grasp the content of a text quickly. However, manual keyword extraction is time-consuming and laborious, and it cannot keep pace with the rate at which texts are generated. The thesis therefore studies keyword extraction algorithms for Chinese text and designs them from two perspectives: statistical features and semantic features. The main work of the thesis is as follows:

(1) The TextRank algorithm relies on a co-occurrence window to establish connections between candidate words and does not make full use of the information in the document, which leads to relatively poor keyword extraction results. To address this, a Chinese text keyword extraction algorithm based on word-sentence collaboration (WSC-Rank) is proposed. Built on the graph model, the algorithm uses more statistical features: it considers the distribution of words across sentences and combines it with the importance of sentences to build a word-sentence matrix, from which the keywords are extracted. The experimental results show that the algorithm significantly improves Precision, Recall and F1-measure compared with TextRank, SingleRank and HMM-Rank when the number of extracted keywords is small. However, the algorithm sacrifices time efficiency: its average running time is nearly 3 times that of SingleRank.

(2) Because the word-sentence collaboration algorithm does not perform as well as SingleRank when extracting one to three keywords, the thesis combines semantic features with the graph model and proposes a Chinese text keyword extraction algorithm based on the LDA topic model (LDA-Rank). Unlike WSC-Rank, it calculates the topic relatedness of each candidate word to the document, so that the damping factor in the graph model varies with the topic relatedness of the candidate word. The experimental results show that the algorithm avoids the weakness of WSC-Rank, namely low Precision when few keywords are extracted, and that it achieves higher Recall and F1-measure than the other algorithms when more keywords are extracted. The average running time of LDA-Rank is slightly higher than that of WSC-Rank but lower than that of Word2vec-Rank.
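To make the graph-based baseline concrete, the following is a minimal sketch of TextRank-style ranking over a co-occurrence window. It is not the thesis's implementation. The function names, the window size, and the optional `per_word_damping` argument are illustrative assumptions; the last one only hints at how a damping factor could differ per candidate word, and the abstract does not give the formula that maps topic relatedness to a damping value. The input is assumed to be already segmented into words and filtered to candidate words, since Chinese text needs a word segmenter first.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link words that appear within `window` positions of each other.

    Returns an undirected weighted graph as {word: {neighbor: weight}}.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return {word: dict(neighbors) for word, neighbors in graph.items()}


def rank_words(graph, damping=0.85, per_word_damping=None, iterations=50):
    """Iterate the weighted TextRank update for a fixed number of rounds.

    `per_word_damping` is a hypothetical hook (an assumption, not from the
    thesis): a dict that overrides the global damping factor for some words.
    """
    per_word_damping = per_word_damping or {}
    scores = {word: 1.0 for word in graph}
    out_weight = {word: sum(graph[word].values()) for word in graph}
    for _ in range(iterations):
        new_scores = {}
        for word in graph:
            d = per_word_damping.get(word, damping)
            # Each neighbor passes on a share of its score proportional
            # to the weight of the edge it shares with `word`.
            incoming = sum(
                graph[nbr][word] / out_weight[nbr] * scores[nbr]
                for nbr in graph[word]
            )
            new_scores[word] = (1.0 - d) + d * incoming
        scores = new_scores
    return scores


def top_keywords(tokens, k=3, window=2, **rank_options):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = rank_words(cooccurrence_graph(tokens, window), **rank_options)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    # Hypothetical pre-segmented candidate words from one document.
    sample = ["graph", "model", "graph", "keyword", "graph", "topic"]
    print(top_keywords(sample, k=2, window=1))
```

Because edges come only from the co-occurrence window, a word's score depends solely on its local neighbors. This is the limitation the abstract attributes to TextRank, and it is what the sentence-level and topic-level features are meant to address.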

Keywords/Search Tags:

keyword extraction, graph model, word-sentence collaboration, LDA topic model

Related items

1	Research On Keyword Extraction Based On Latent Topic Model And New Word Discovery
2	Research On Text Extraction Method Based On Key Sentence And Keyword Association
3	Research On Keyword Extraction Method Based On Document Topical Structure And Word Graph Iteration
4	Research On Keyword Extraction Method Based On Semantics Features
5	The Research And Implementation Of Network Hotspot Analysis Based On Hierarchical Topic Model
6	Research On Keyword Extraction Algorithm Based On Neural Topic Model
7	Construction Of Topic Model Based On Keyword Vector And Visual Analysis Of Comment Data
8	Complex Text Keyword Mining Method Based On Graph Embedding Model
9	Automatic Keyword Extraction Algorithms Based On Word Embedding And Multiple Features Fusion
10	Design And Implementation Of Technology News Analysis System Based On Topic Model