| Studies have found that applying domain knowledge to a variety of information-processing technologies, such as information extraction, information retrieval, and data mining, can yield very good results. The key lies in the existence of a domain knowledge base. Today, domain experts are heavily involved in constructing such knowledge bases, which is time-consuming. With the arrival of the Big Data era, information in text form is growing exponentially, and finding the needed information accurately and quickly in this ocean of data to construct a domain knowledge base is a challenge researchers now face. Compared with traditional approaches to constructing a domain knowledge base, this paper proposes a more accurate and faster construction solution, and studies two of its key technologies in depth.

1. Text mining is a technology that obtains knowledge from textual information. However, the sparsity and high dimensionality of text data reduce the accuracy of text feature extraction. Feature reduction lowers the dimensionality by removing irrelevant, redundant, and noisy features, and is one of the key technologies for constructing a domain knowledge base. Traditional dimension-reduction methods consider only the statistical information of words while ignoring the semantic information they carry, so the selected feature set often cannot represent the meaning of the text accurately and completely. Since Chinese contains many polysemous words and synonyms, more attention should be paid to the semantic information of words. This paper proposes a dimension-reduction approach aided by a semantic base: features are projected into HowNet's low-dimensional semantic space, and semantic similarity is computed to merge synonyms and near-synonyms. Experiments show that the proposed method effectively reduces the dimensionality of the feature space.
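The synonym-merging step described above can be sketched as a greedy grouping over a semantic-similarity function. This is a minimal illustration, not the paper's implementation: the `sim` function and its scores here are a toy stand-in for a real HowNet-based similarity computed from sememe trees, and the threshold value is assumed.

```python
def merge_synonym_features(features, similarity, threshold=0.8):
    """Greedily merge features whose pairwise semantic similarity
    exceeds `threshold`, keeping one representative per group."""
    representatives = []   # one surviving feature per merged group
    mapping = {}           # original feature -> its representative
    for f in features:
        for rep in representatives:
            if similarity(f, rep) >= threshold:
                mapping[f] = rep
                break
        else:
            representatives.append(f)
            mapping[f] = f
    return mapping

# Toy similarity table standing in for HowNet scores (hypothetical values).
_sim = {frozenset(("电脑", "计算机")): 0.95,
        frozenset(("电脑", "网络")): 0.20,
        frozenset(("计算机", "网络")): 0.20}

def sim(a, b):
    return 1.0 if a == b else _sim.get(frozenset((a, b)), 0.0)

mapping = merge_synonym_features(["电脑", "计算机", "网络"], sim)
# "计算机" is merged into "电脑"; the feature space shrinks from 3 terms to 2
```

After merging, the term-frequency weight of each merged group would be accumulated onto its representative, which is what shrinks the text-vector dimensionality.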
Because TF-IDF neglects other word features and semantic influence on keywords, this paper further proposes a keyword-extraction algorithm that combines TF-IDF with word position, word span, and HowNet-based semantic weights for the domain. Experiments show that the proposed method achieves higher precision and recall.

2. Text clustering is an important text-mining technology and another key technology for constructing a domain knowledge base. The paper first experiments with serial text clustering; the results show that it cannot complete the task within an acceptable time on massive data. To solve this problem, we study the basic architecture of the open-source distributed platform Hadoop and its key technologies: the HDFS distributed file system and the MapReduce programming model. We then design a distributed parallel text-clustering algorithm on the Hadoop platform in several parts: parallel construction of text vectors, parallel computation of the similarity matrix, parallel matrix multiplication, and parallel data partitioning. Experiments show that the distributed parallel text-clustering algorithm is feasible for massive, high-dimensional data sets, greatly reduces the running time, and achieves higher precision and recall.

The experiments take the field of Big Data itself as the experimental object and construct its domain knowledge base. On top of this knowledge base, a term-management system is built, providing domain-terminology and navigation services. |
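As an illustration of the MapReduce programming model underlying the parallel text-vector construction, the following is a minimal pure-Python simulation of the two phases, not actual Hadoop code; the corpus, tokenization, and function names are assumptions for the sketch.

```python
from collections import defaultdict

def map_phase(doc_id, tokens):
    """Map: each document split emits ((term, doc_id), 1) key-value pairs."""
    for term in tokens:
        yield (term, doc_id), 1

def reduce_phase(pairs):
    """Reduce: sum the counts per (term, doc_id) key into term frequencies."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

# Hypothetical pre-tokenized corpus: doc_id -> token list.
corpus = {0: ["big", "data", "big"], 1: ["data", "mining"]}

intermediate = [kv for doc_id, toks in corpus.items()
                for kv in map_phase(doc_id, toks)]
tf = reduce_phase(intermediate)

# Assemble sparse document vectors: doc_id -> {term: frequency}.
vectors = defaultdict(dict)
for (term, doc_id), count in tf.items():
    vectors[doc_id][term] = count
# vectors[0] == {"big": 2, "data": 1}
```

In a real Hadoop job the map and reduce functions would run on separate nodes over HDFS splits, and the shuffle phase would group the intermediate pairs by key; the sparse vectors produced here are the input to the subsequent parallel similarity-matrix computation.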