With the rapid development of the information age, a massive amount of Chinese text is generated on the domestic Internet every day. While this wealth of data has flooded into people's view, it has also brought the problem of information overload, so the ability to process massive text collections and let readers quickly find the content they need has become increasingly important; automatic text summarization addresses this need. In the TextRank algorithm, the quality of the graph model has a major influence on the generated summary: the "voting" mechanism used to compute node weights must account for both the importance of a node itself and the importance of the nodes it is connected to. The graph model therefore needs a richer representation of its nodes and a deeper treatment of the relationships encoded by its edges. Based on this observation, this thesis proposes several improvements to the algorithm in order to generate higher-quality summaries.

When building the graph model, this thesis represents nodes with sentence vectors produced by the BERT model, which has stronger feature-extraction ability and expresses the semantic information of the text better. Cosine similarity between these sentence vectors is then used to measure sentence similarity, replacing the content-overlap measure used by the original algorithm.

Because nodes of different importance affect other nodes to different degrees, this thesis also presents a method for computing edge weights that fuses sentence features. Taking both the semantic and the structural information of the text into account, the similarity matrix is combined with sentence position, sentence keyword, sentence clue-word, and sentence title features to optimize the edge weights between nodes, improving on the original algorithm, in which edge weights depend only on sentence similarity.

To address semantic duplication among the extracted summary sentences, this thesis introduces an improved maximal marginal relevance (MMR) algorithm to rerank the candidate sentences, reducing redundancy and thereby improving the quality of the extracted summary.
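As a rough illustration of the graph construction described above (BERT sentence vectors as nodes, cosine similarity as edge weights, PageRank-style scoring), the following minimal Python sketch assumes the `transformers` and `networkx` packages and the public `bert-base-chinese` checkpoint; the mean pooling and the pruning of non-positive similarities are illustrative choices, not necessarily the exact configuration used in the thesis.

```python
# Sketch: BERT sentence vectors as graph nodes, cosine similarity as edge
# weights, PageRank scoring of sentences. Model name and pooling are
# illustrative assumptions.
import numpy as np
import networkx as nx
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def sentence_vectors(sentences):
    """Encode each sentence with BERT and mean-pool the last hidden states."""
    vectors = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
            hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)
            vectors.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(vectors)

def rank_sentences(sentences):
    """Build a cosine-similarity graph over sentence vectors and run PageRank."""
    vecs = sentence_vectors(sentences)
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = normed @ normed.T                 # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)              # no self-loops
    sim = np.clip(sim, 0.0, None)           # keep only non-negative similarities
    graph = nx.from_numpy_array(sim)        # edges carry the similarity as "weight"
    scores = nx.pagerank(graph, weight="weight")
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```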
For keyword extraction, this thesis combines the words in the title with keywords extracted from the body text to build a keyword table, which is then used when computing keyword weights; the traditional algorithm extracts keywords from word co-occurrence alone, so this optimizes the extraction step. Combining the Word2Vec model with the TextRank algorithm, external semantic features are introduced to generate word vectors, and keywords are extracted by taking into account word type, word frequency, word position, and word span together with the similarity relationships between word vectors.

In the experiments, this thesis first compares the importance of the sentence position, sentence keyword, sentence clue-word, and sentence title features and determines the weight factor of each feature. To verify the performance of the improved keyword extraction algorithm, it is compared with the TF-IDF algorithm and the traditional TextRank algorithm by computing precision, recall, and F1 score for different numbers of extracted keywords. The quality of the summaries produced by the improved algorithm is evaluated with ROUGE metrics and compared with the Lead-N method, the MMR algorithm, and the TextRank algorithm. Both the improved keyword extraction algorithm and the improved summary extraction algorithm are evaluated on the NLPCC 2017 dataset, and the results show that they outperform the traditional methods on this dataset.
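The redundancy-reduction step described earlier can be illustrated in the same spirit. The sketch below is a generic maximal-marginal-relevance reranker over the candidate summary sentences; the trade-off parameter `lambda_`, the cosine redundancy measure, and the reuse of the sentence vectors from the previous sketch are assumptions for illustration, not the thesis's exact formulation.

```python
# Sketch: MMR-style reranking of candidate summary sentences, balancing
# TextRank relevance against similarity to the sentences already selected.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mmr_select(candidates, vectors, relevance, k, lambda_=0.7):
    """Pick k sentence indices; `candidates` are ordered by TextRank score,
    `vectors` are the sentence embeddings, `relevance` maps index -> score."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Redundancy = highest similarity to any already-selected sentence.
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```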