Text Clustering And Summary Extraction On Big Data

Posted on:2016-04-01

Degree:Master

Type:Thesis

Country:China

Candidate:F Y Meng

Full Text:PDF

GTID:2298330467491852

Subject:Signal and Information Processing

Abstract/Summary:

With the widespread use of big data, data mining has become increasingly important. The aim of data mining is to discover implicit, valid, valuable, and understandable patterns in large amounts of unorganized data, and then to draw conclusions about trends and correlations that support users in solving problems.

Clustering is a key data mining technique for static data analysis. Unlike text classification, data clustering is a form of unsupervised learning, and it is widely used. Summary extraction, after feature selection and vectorization, can be transformed into a clustering problem. In this work, we studied and improved several clustering methods and extended clustering to the summary extraction problem. The main contributions are as follows.

Current clustering methods fall mainly into three categories: hierarchical clustering algorithms, partitioning-based clustering methods, and clustering algorithms based on density and grids. We improve two of these. First, we improved hierarchical clustering by using a max-heap to reduce its time complexity. We then focused on CLIQUE, which divides the data space into grids and processes grids instead of individual data points. Grid algorithms are highly efficient and can handle high-dimensional data, but they inevitably share the weakness of grid clustering: they take no account of how the data are distributed. We compare the three kinds of clustering algorithm in experiments on real text data and, to address this drawback of CLIQUE, propose a new clustering method based on a multi-splitting grid (CBMG). In the CBMG algorithm, grids are further split into cells in order to discover the data distribution within each grid, so CBMG can easily handle the case where the data in one grid belong to different clusters. Experiments confirm the effectiveness of CBMG.

Clustering has a wide range of applications.
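
The grid idea described above, processing grids rather than individual points, can be illustrated with a minimal sketch. It is not the CBMG algorithm and it omits CLIQUE's subspace search: it only bins points into equal-sized grids, keeps the dense ones, and joins face-adjacent dense grids into clusters. The function name `grid_cluster` and the parameters `cell_size` and `min_points` are hypothetical, chosen for illustration.

```python
from collections import defaultdict, deque


def grid_cluster(points, cell_size, min_points):
    """Label points by connected groups of dense grids (simplified sketch)."""
    # Assign each point to the grid that contains it.
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(x // cell_size) for x in point)
        cells[key].append(idx)

    # A grid is "dense" if it holds at least min_points points (assumed rule).
    dense = {key for key, members in cells.items() if len(members) >= min_points}

    labels = [-1] * len(points)  # -1 marks points that fall in sparse grids
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search over dense grids that share a face.
        queue = deque([start])
        seen.add(start)
        while queue:
            cell = queue.popleft()
            for idx in cells[cell]:
                labels[idx] = cluster_id
            for dim in range(len(cell)):
                for step in (-1, 1):
                    neighbor = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if neighbor in dense and neighbor not in seen:
                        seen.add(neighbor)
                        queue.append(neighbor)
        cluster_id += 1
    return labels
```

Because every point in a grid receives the same label, this sketch cannot separate two clusters that meet inside one grid, which is the limitation the abstract says CBMG addresses by splitting grids into smaller cells.
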
Query-based summary extraction can be treated as a clustering problem. Summarization methods are mainly either extractive or generative. In this work, we used the extractive approach and computed features for each sentence. The features draw on text feature selection, query expansion, sentence length, sentence position, and title words. We then generate a vector for each sentence, which transforms the summary extraction problem into a clustering problem, and solve it with methods such as hierarchical clustering.
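
As a rough illustration of the pipeline described above, the sketch below turns each sentence into a small feature vector and then groups the vectors with heap-driven agglomerative clustering. Everything specific here is an assumption rather than the thesis's method: the four features and their scaling, the centroid linkage, the Euclidean distance, and the names `sentence_features` and `cluster_sentences`. Query expansion is not implemented, and Python's `heapq` min-heap on distances stands in for a max-heap on similarities.

```python
import heapq
import math


def sentence_features(sentence, index, total, query_terms, title_terms, max_len):
    """Build a 4-dimensional feature vector for one sentence (assumed features)."""
    words = sentence.lower().split()
    unique = set(words)
    return [
        len(unique & query_terms) / max(len(query_terms), 1),  # query overlap
        len(unique & title_terms) / max(len(title_terms), 1),  # title-word overlap
        len(words) / max_len,                                  # relative length
        1.0 - index / total,                                   # earlier scores higher
    ]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_sentences(vectors, k):
    """Merge the closest clusters until k remain; returns lists of vector indices."""
    active = {i: [i] for i in range(len(vectors))}
    centroid = {i: list(v) for i, v in enumerate(vectors)}
    # The heap holds (distance, cluster_id, cluster_id) for candidate merges.
    heap = [
        (euclidean(vectors[i], vectors[j]), i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    heapq.heapify(heap)
    next_id = len(vectors)
    while len(active) > k and heap:
        _, a, b = heapq.heappop(heap)
        if a not in active or b not in active:
            continue  # stale entry: one side was already merged away
        members = active.pop(a) + active.pop(b)
        dims = len(vectors[0])
        centroid[next_id] = [
            sum(vectors[m][d] for m in members) / len(members) for d in range(dims)
        ]
        # Push distances from the new cluster to every remaining cluster.
        for other in active:
            dist = euclidean(centroid[next_id], centroid[other])
            heapq.heappush(heap, (dist, other, next_id))
        active[next_id] = members
        next_id += 1
    return sorted(sorted(members) for members in active.values())
```

One way to obtain a summary from the result, again only as an assumption, would be to take from each cluster the sentence with the highest query-overlap feature.
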

Keywords/Search Tags:

clustering, feature selection, multi-splitting grid, summary extraction

Related items

1	Grid Clustering Algorithm
2	The Research Of An Energy-aware Multi-sink Routing Protocol Based On Grid Clustering
3	Research On Clustering Algorithm Based On Grid Point Density Estimation
4	Parameter Grid Clustering Algorithm
5	Multi-Density Clustering And Outlier Recognition Algorithm Based On Grid Adjacency Relation
6	The Research On Non-Spherical Clustering Algorithm Based On Grid Partition
7	The Research Of Grid-based Parallel Clustering Algorithm And Clustering For Data Stream
8	A Multi-density Gradient Grid Clustering Algorithm Based On The Optimal Division
9	Research On Data Stream Clustering Algorithm Based On Similarity And Grid Partition Optimization
10	Study On Grid-based Clustering Algorithms