
Research And Implementation Of Text Summarization Technology Based On Machine Learning

Posted on: 2021-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: Q Li
Full Text: PDF
GTID: 2428330620964107
Subject: Engineering
Abstract/Summary:
With the rapid development of the Internet, the amount of information on the network has increased significantly, and extracting and summarizing information from this flood has become correspondingly important. Research on extracting important and distinct information from large volumes of text is therefore of particular value. Summary extraction can pull representative text out of an article, which greatly reduces redundancy and benefits text analysis. This thesis therefore studies automatic summarization based on machine learning. It designs an extractive algorithm based on LDA and Doc2Vec and a generative summarization algorithm based on text semantics and text structure, and uses both in the design and implementation of an information extraction system. The main work is as follows:

(1) First, the significance and background of automatic summarization are described, and the research status and development of automatic summarization in China and abroad are reviewed. Text preprocessing and the basic theory are introduced. For extractive methods, the principles and procedures of TextRank and of text clustering algorithms are presented; for generative methods, the Seq2Seq model is presented. The ROUGE metric for evaluating summary quality is also introduced. These algorithms are later compared experimentally with the algorithms designed in this thesis.

(2) Based on the principle of extracting sentences from the original text to represent its central meaning, an algorithm combining LDA and Doc2Vec is designed. The LDA model is mainly used to identify the topics of the text and of the sentences in an article. The Doc2Vec model converts each sentence into a semantic sentence vector. An information entropy model determines which sentences are selected to form the summary. Experiments are carried out on a publicly released Chinese short-text dataset, and the summaries are evaluated with ROUGE. The results indicate that the algorithm improves the quality of the extracted summaries. The summary length is determined by the size of each article, so the algorithm can be applied to articles of different sizes, and it is more effective than approaches that add further hand-crafted factors. At the corpus level, the algorithm takes the domain of the data into account, while also considering the specifics of each article: the length of the summary and the sentences selected are determined by the topics that each article contains. The algorithm thus attends both to the data as a whole and to the different emphasis of each article. However, if no single sentence in the original text represents its central meaning, the extractive algorithm cannot achieve the intended extraction quality, so generative summarization is considered to address this problem.

(3) Generative summarization imitates human thinking: after understanding the content of the original text, the algorithm summarizes its main meaning. A generative algorithm that combines text semantics and text structure is designed in this thesis. Text semantics and text structure, both based on the characteristics of Chinese, serve as the network input. For text structure, five factors are mainly considered: the number of keywords, the number of entities, the sentence length, the number of summary keywords, and the similarity to key sentences. Building on the seq2seq model, an attention mechanism is added to improve the quality of the generated summaries, together with an attention mechanism that removes repeated information. Experiments on a Chinese short-text dataset show that the algorithm generates more effective summaries in terms of evaluation scores.

Finally, the design and implementation of an information extraction system is completed. The summarization algorithms described above are integrated into the system, and the functions of the system are demonstrated.
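The abstract names ROUGE as the evaluation metric but does not say which variant or implementation the thesis uses. As a rough illustration only, the following minimal sketch computes ROUGE-N from clipped n-gram overlap between a candidate summary and one reference summary. The function names `ngrams` and `rouge_n` and the example sentences are hypothetical and are not taken from the thesis. For Chinese text, the tokens could be single characters or segmented words; English words are used here for readability.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall, precision and F1 for one candidate/reference pair.

    Both arguments are lists of tokens. Overlap counts are clipped, so an
    n-gram is matched at most as often as it occurs in the reference.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical example pair (not data from the thesis).
reference = "the cat sat on the mat".split()
candidate = "the cat on the mat".split()

scores = rouge_n(candidate, reference, n=1)
bigram_scores = rouge_n(candidate, reference, n=2)
```

Recall is the fraction of reference n-grams that the candidate recovers, and precision is the fraction of candidate n-grams that appear in the reference. Which of these (or F1) the thesis reports is not stated in the abstract.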
Keywords/Search Tags: text summarization technology, machine learning, extractive text summarization, generative text summarization
Related items