| Scientific and technological progress in the information age generates massive amounts of data. Extracting valuable, critical information from such complex and redundant content is extremely important, and this is the purpose of data mining. Text is one of the most influential forms of information we encounter, and extracting keywords is one way to let readers grasp the content of a text quickly. However, manual keyword extraction is time-consuming and laborious, and it cannot keep pace with the speed at which texts are produced. This thesis therefore studies keyword extraction algorithms for Chinese text and designs them from two perspectives: statistical features and semantic features. The main work of the thesis is as follows: (1) The TextRank algorithm relies on a co-occurrence window to connect candidate words and does not make full use of the information in the document, which leads to relatively poor keyword extraction results. To address this, a Chinese text keyword extraction algorithm based on word-sentence collaboration (WSC-Rank) is proposed. Built on the graph model, the algorithm uses more statistical features: it considers the distribution of words across sentences and combines it with sentence importance to build a word-sentence matrix, which drives the keyword extraction process. Experimental results show that the algorithm significantly improves Precision, Recall and F1-measure over TextRank, SingleRank and HMM-Rank when the number of extracted keywords is small. This gain comes at the cost of time efficiency: the average running time of the algorithm is nearly 3 times that of SingleRank. (2) Because the word-sentence collaboration algorithm does not match SingleRank when extracting one to three keywords, the thesis combines semantic features with the graph model and proposes a Chinese text keyword extraction algorithm based on the LDA topic model (LDA-Rank). Unlike WSC-Rank, it calculates the topic relatedness of each candidate word to the document, so that the damping factor in the graph model varies with the topic relatedness of the candidate word. Experimental results show that the algorithm avoids WSC-Rank's weakness of low Precision when extracting few keywords, and it achieves higher Recall and F1-measure than the other algorithms when more keywords are extracted. The average running time of LDA-Rank is slightly higher than that of WSC-Rank but lower than that of Word2vec-Rank. |
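
The graph-model idea summarized above can be illustrated with a minimal sketch. It is not the thesis's implementation: it shows a generic TextRank-style ranking over a co-occurrence window, with an optional per-word damping factor standing in for the topic-dependent damping described in (2). The function names, the window size, the default damping of 0.85, and the way a damping value would be derived from topic relatedness are all assumptions made for illustration. The input is assumed to be a list of already segmented Chinese words.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=3):
    """Link distinct words that fall inside the same sliding window of `window` tokens."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_words(graph, damping=None, default_damping=0.85, iterations=50):
    """PageRank-style scoring; `damping` may map a word to its own damping factor.

    Assumption: a topic model would supply the per-word values. Words
    missing from `damping` fall back to `default_damping`.
    """
    damping = damping or {}
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        updated = {}
        for word in graph:
            d = damping.get(word, default_damping)
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[n] / len(graph[n]) for n in graph[word])
            updated[word] = (1.0 - d) + d * incoming
        scores = updated
    return scores


def top_keywords(tokens, k=3, window=3, damping=None):
    """Return the k highest-scoring words, breaking ties alphabetically."""
    graph = cooccurrence_graph(tokens, window)
    scores = rank_words(graph, damping)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

With `damping=None` the sketch reduces to a plain co-occurrence ranking in the spirit of TextRank. Passing a dictionary of per-word values shows how a damping factor that changes with each candidate word alters the ranking without changing the graph itself.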