| With the continuing development of network information technology, the volume of online data has grown explosively. Text, one of the most important information carriers on the Internet today, is growing in volume accordingly. How to obtain valuable information quickly and effectively from large amounts of text therefore has important research value and practical significance. Text clustering can automatically uncover the knowledge hidden in text and provides a method for effectively classifying text information. However, text clustering in the traditional single-machine serial mode offers neither the efficiency nor the scalability needed for large-scale text processing. To address these problems, this thesis first uses Python to extract text keywords based on word frequency statistics. Through English text preprocessing, text feature selection, and text modeling, it implements automatic keyword extraction for English text based on TF-IDF weights, which provides the basis for the subsequent text mining. Second, this thesis implements a parallel DBSCAN text clustering algorithm on the Spark distributed computing platform. It describes the parallelization strategy and the implementation of the parallel algorithm, and analyzes the algorithm on real data. Through data partitioning and in-memory computation, the algorithm effectively improves the speed of text clustering. It also finds clusters of arbitrary shape and reduces the impact of noise data on the clustering results, yielding good clustering quality. Finally, this thesis integrates the above research and, using text data of scientific research results as an example, designs and implements a prototype text processing and clustering system based on the Spark platform. The objectives and requirements of the system are analyzed, and its architecture and functional modules are designed and implemented. The prototype system includes key functional modules such as information acquisition, text processing, text modeling, and text clustering. It completes text processing and clustering on scientific research theses, applying the research in this thesis to a real-world scenario. |