
Text Clustering and Summary Extraction on Big Data

Posted on: 2016-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: F Y Meng
Full Text: PDF
GTID: 2298330467491852
Subject: Signal and Information Processing
Abstract/Summary:
With the widespread use of big data, data mining has become increasingly important. The aim of data mining is to discover implicit, valid, valuable, and understandable patterns in large amounts of unorganized data, and then to draw conclusions about trends and correlations that give users problem-solving support. Clustering is a key data mining technique for static data analysis. Unlike text classification, clustering is an unsupervised learning method, and it is widely used. Summary extraction, after feature selection and vectorization, can be transformed into a clustering problem. In this thesis, we study and improve several clustering methods and extend clustering to the summary extraction problem. The main work is as follows.

For clustering, current methods fall mainly into three categories: hierarchical clustering, partition-based clustering, and density- and grid-based clustering. We improve two kinds of method. First, we improve hierarchical clustering by using a max-heap to reduce its time complexity. We then focus on CLIQUE, which divides the data space into grids and processes grids instead of individual data points. Grid-based algorithms are efficient and can handle high-dimensional data, but they share an inherent drawback: they take no account of how the data are distributed within a grid. We compare the three kinds of clustering algorithm experimentally on real text data and, to address this drawback of CLIQUE, propose a new clustering method based on multi-splitting grids (CBMG). In CBMG, grids are further split into cells in order to discover the data distribution within each grid, so CBMG can handle the case where the data in one grid belong to different clusters. Experiments demonstrate the effectiveness of CBMG.

Clustering has a wide range of applications.
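The max-heap idea mentioned above for hierarchical clustering can be illustrated with a minimal sketch. The abstract does not state which linkage criterion or similarity measure the thesis uses, so single linkage and cosine similarity are assumptions here, and the function names `cosine` and `single_link_clusters` are hypothetical. All pairwise similarities go into a heap once, so the most similar pair is popped in logarithmic time instead of being found by rescanning every pair at each merge.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def single_link_clusters(vectors, k):
    """Merge the most similar clusters until only k remain (single linkage, assumed)."""
    n = len(vectors)
    # heapq is a min-heap, so similarities are negated to pop the maximum first.
    heap = [(-cosine(vectors[i], vectors[j]), i, j)
            for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)

    parent = list(range(n))  # union-find forest over point indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    remaining = n
    while heap and remaining > k:
        _, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri != rj:  # pairs already in the same cluster are skipped
            parent[rj] = ri
            remaining -= 1

    # Collect point indices by cluster root.
    groups = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())
```

Under these assumptions, building the heap costs O(n^2) and each pop costs O(log n), giving O(n^2 log n) overall. Other linkage criteria would need cluster-level similarities to be pushed back onto the heap after each merge.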
Query-based summary extraction can be treated as a clustering problem. Summarization methods are mainly either extractive or generative (abstractive). We use the extractive approach and compute features for each sentence. These features draw on text feature selection, query expansion, sentence length, sentence position, and title words. We then generate a vector for each sentence, transform the summary extraction problem into a clustering problem, and solve it with methods such as hierarchical clustering.
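As an illustration of how the sentence features listed above might be turned into vectors, here is a minimal sketch. The abstract gives no formulas, so the feature definitions (overlap ratios, length and position normalised to the range 0 to 1) and the names `tokens` and `sentence_features` are assumptions. Query expansion and text feature selection are omitted.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens; a stand-in for real tokenisation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_features(sentences, query, title):
    """Return one [query_overlap, title_overlap, length, position] vector per sentence."""
    q, t = set(tokens(query)), set(tokens(title))
    lengths = [len(tokens(s)) for s in sentences]
    max_len = max(lengths, default=0) or 1  # avoid division by zero
    n = len(sentences)
    vectors = []
    for i, s in enumerate(sentences):
        words = set(tokens(s))
        query_overlap = len(words & q) / len(q) if q else 0.0
        title_overlap = len(words & t) / len(t) if t else 0.0
        length = lengths[i] / max_len
        position = 1.0 - i / n  # assumed: earlier sentences score higher
        vectors.append([query_overlap, title_overlap, length, position])
    return vectors
```

The resulting vectors could then be passed to a clustering routine such as hierarchical clustering, with one representative sentence chosen per cluster to form the summary.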
Keywords/Search Tags: clustering, feature selection, multi-splitting grid, summary extraction