Research And Implementation Of Parallel Text Clustering Based On MapReduce

Posted on: 2018-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: C H Xu
GTID: 2348330512983222
Subject: Engineering
Abstract/Summary:
The Internet produces massive amounts of data every day, including text, video, and images. Text accounts for a large proportion of this data and contains much important information, so text mining has great value in business, medicine, and scientific research. Text clustering is an unsupervised text mining method that divides a set of texts into multiple clusters. It is applied in many fields: it supports redundancy elimination in natural language processing, for example, and search engines use it to present concise, convenient search results. However, traditional text clustering struggles to handle large-scale text effectively. To meet this challenge, the thesis studies text clustering based on MapReduce.

Affinity Propagation (AP) is an efficient clustering technique for datasets with many instances. It selects cluster centers through message passing between objects and does not take the number of clusters as a parameter. However, AP can oscillate, and its preference value must be set in advance. The thesis proposes a method to overcome these shortcomings of AP and applies it to text clustering.

The main content of the thesis is as follows:

1. Based on an intensive study of text preprocessing technology, the thesis proposes a method that combines neural-network-based word2vec with TF-IDF to overcome the weak semantic representation of the bag-of-words model, and applies this method to text representation.

2. Cuckoo Search (CS) is a simple and efficient meta-heuristic algorithm, but it easily falls into local optima because its step size factor and discovery probability are fixed. The thesis therefore proposes an improved CS algorithm based on the best solution and Gaussian disturbance, and subsequently proposes the CSAP algorithm.

3. Spark is an in-memory implementation of the MapReduce model that provides many user-friendly programming interfaces, so users need not concentrate on writing the Map and Reduce functions themselves. Because Spark keeps intermediate results in memory, Spark programs can run up to 100 times faster than Hadoop, which makes Spark better suited to iterative algorithms. Based on Spark, the thesis proposes a parallelization scheme for the CSAP algorithm and obtains a good speedup.
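
The first contribution above combines word2vec with TF-IDF, but the abstract does not say how the two are combined. One common option, assumed here purely for illustration and not confirmed as the thesis's method, is to represent each document as the TF-IDF-weighted average of its word vectors. The sketch below uses only the Python standard library; the function names, the `+ 1.0` offset in the IDF term, and the tiny 2-D `embeddings` table (a stand-in for trained word2vec vectors) are all hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Term frequency times inverse document frequency; the +1.0
            # keeps terms that occur in every document from getting weight 0.
            term: (count / len(doc)) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in tf.items()
        })
    return weights


def document_vector(weights, embeddings, dim):
    """TF-IDF-weighted average of the word vectors of one document."""
    total = 0.0
    vector = [0.0] * dim
    for term, weight in weights.items():
        if term not in embeddings:  # skip out-of-vocabulary terms
            continue
        total += weight
        for i, value in enumerate(embeddings[term]):
            vector[i] += weight * value
    if total == 0.0:
        return vector  # no known terms: fall back to the zero vector
    return [value / total for value in vector]


# Hypothetical 2-D embeddings standing in for trained word2vec vectors.
embeddings = {"spark": [1.0, 0.0], "cluster": [0.0, 1.0], "text": [0.0, -1.0]}
docs = [["spark", "cluster"], ["spark", "text"]]

vectors = [document_vector(w, embeddings, dim=2) for w in tfidf_weights(docs)]
print(vectors)
```

In this toy corpus "spark" occurs in both documents, so it receives a lower weight than "cluster" or "text", and each document vector is pulled toward its more distinctive term. The resulting dense vectors could then be fed to a similarity-based clustering method such as AP.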
Keywords/Search Tags: text clustering, CS, AP, MapReduce, Spark