A Research About DBSCAN Text Clustering Based On Spark Platform

Posted on:2019-03-18

Degree:Master

Type:Thesis

Country:China

Candidate:M Guo

GTID:2428330593950127

Subject:Electronic and communication engineering

Abstract/Summary:

With the continuous development of network information technology, online data has grown explosively. Text, one of the most important information carriers on the Internet today, has grown in volume accordingly. How to obtain valuable information from large amounts of text quickly and effectively is of both research value and practical significance. Text clustering can automatically uncover hidden knowledge in text and provides a method for classifying text information effectively. However, text clustering in the traditional single-machine serial mode offers neither the efficiency nor the scalability that large-scale text processing requires. To address these problems, this thesis first uses Python to extract text keywords based on word-frequency statistics. Through English text preprocessing, text feature selection, and text modeling, keywords are extracted automatically from English text according to their TF-IDF weights, which provides the basis for the subsequent text mining. Second, this thesis implements a parallel DBSCAN text clustering algorithm on the Spark distributed computing platform. It describes the parallel strategy and the implementation of the parallel algorithm, and analyzes the algorithm on real data. The algorithm speeds up text clustering through data partitioning and in-memory computation. It also finds clusters of arbitrary shape and reduces the impact of noise data on the clustering results, producing clusters of good quality. Finally, this thesis integrates the above work and, using text data on scientific research results as an example, designs and implements a prototype text processing and clustering system based on the Spark platform. The objectives and requirements of the system are analyzed, and its architecture and functional modules are designed and implemented. The prototype system includes key functional modules such as information acquisition, text processing, text modeling, and text clustering. It completes the text processing and clustering of scientific research papers, applying the research in this thesis to a practical setting.
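
The pipeline summarized in the abstract (TF-IDF weighting followed by density-based clustering) can be illustrated with a minimal single-machine sketch in Python. This is not the thesis's Spark implementation: the data partitioning and parallel merge steps are omitted. The function names (`tfidf_vectors`, `cosine_distance`, `dbscan`), the whitespace tokenization, the cosine distance, the toy documents, and the values of `eps` and `min_pts` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

NOISE = -1  # label for points that belong to no cluster


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]  # assumed tokenizer
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vec = {t: (c / len(tokens)) * math.log(n_docs / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity for two L2-normalised sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(points, eps, min_pts, dist):
    """Plain DBSCAN; min_pts counts the point itself."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
    return labels


# Toy corpus (hypothetical): three related documents and one outlier.
docs = [
    "spark cluster parallel computing",
    "spark parallel computing platform",
    "parallel computing spark cluster platform",
    "tea garden flower",
]
labels = dbscan(tfidf_vectors(docs), eps=0.5, min_pts=2, dist=cosine_distance)
print(labels)  # [0, 0, 0, -1]
```

In this toy run the second document is not within `eps` of the first, but it joins the same cluster through the third document, which is a core point. The last document shares no terms with the others and is labeled as noise, which mirrors the noise handling the abstract attributes to DBSCAN.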

Keywords/Search Tags:

Text clustering, DBSCAN, Spark, Distributed computing

Related items

1	Improvement Of Spark-based Multi-density Clustering Algorithm And Its Application In Text Mining
2	Research On Customs Commodity Risk Tax Detection Based On Spark Platform
3	Research On Adaptive Parameter Of DBSCAN Algorithm And Its Application On Spark Platform
4	KDSG-DBSCAN:A High Performance DBSCAN Algorithm Based On K-D Tree And Spark GraphX
5	The Research On Web Text Clustering Based On DBSCAN Optimized Algorithm
6	Research On Parallization Of DBSCAN Clustering Algorithm For Spatial Data Mining Based On Spark Platform
7	Parallel Research On Data Mining Algorithm Based On YARN And Spark Framework
8	Research On Text Clustering Algorithm Based On DBSCAN
9	Research And Application On Distributed Clustering And Incremental Clustering Based On DBSCAN
10	Research On Improved DBSCAN Algorithm Based On Spark Platform