| Studies have found that applying domain knowledge to a variety of information-processing technologies, such as information extraction, information retrieval, and data mining, can yield very good results. The key lies in the existence of a domain knowledge base. Today, domain experts are heavily involved in constructing such knowledge bases, which is time-consuming. With the arrival of the Big Data era, information in text form is growing exponentially, and finding the needed information accurately and quickly in this ocean of data to construct a domain knowledge base is a challenge researchers now face. Compared with traditional approaches to constructing a domain knowledge base, this paper proposes a more accurate and faster construction solution, and studies two of its key technologies in depth.

1. Text mining is a technology that obtains knowledge from textual information. However, the sparsity and high dimensionality of text data reduce the accuracy of text feature extraction. Feature reduction lowers the dimensionality by removing irrelevant, redundant, and noisy features, and is one of the key technologies for constructing a domain knowledge base. Traditional dimension-reduction methods consider only the statistical information of words while ignoring the semantic information they carry, so the selected feature set often cannot represent the meaning of the text accurately and completely. Since Chinese contains many polysemous words and synonyms, more attention should be paid to the semantic information of words. This paper proposes a dimension-reduction approach aided by a semantic base: features are projected into HowNet's low-dimensional semantic space, and semantic similarity is computed to merge synonyms and near-synonyms. Experiments show that the proposed method effectively reduces the dimensionality of the feature space.
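The synonym-merging step described above can be sketched as a greedy grouping over a semantic-similarity function. This is a minimal illustration, not the paper's implementation: the `sim` function and its scores here are a toy stand-in for a real HowNet-based similarity computed from sememe trees, and the threshold value is assumed.

```python
def merge_synonym_features(features, similarity, threshold=0.8):
    """Greedily merge features whose pairwise semantic similarity
    exceeds `threshold`, keeping one representative per group."""
    representatives = []   # one surviving feature per merged group
    mapping = {}           # original feature -> its representative
    for f in features:
        for rep in representatives:
            if similarity(f, rep) >= threshold:
                mapping[f] = rep
                break
        else:
            representatives.append(f)
            mapping[f] = f
    return mapping

# Toy similarity table standing in for HowNet scores (hypothetical values).
_sim = {frozenset(("电脑", "计算机")): 0.95,
        frozenset(("电脑", "网络")): 0.20,
        frozenset(("计算机", "网络")): 0.20}

def sim(a, b):
    return 1.0 if a == b else _sim.get(frozenset((a, b)), 0.0)

mapping = merge_synonym_features(["电脑", "计算机", "网络"], sim)
# "计算机" is merged into "电脑"; the feature space shrinks from 3 terms to 2
```

After merging, the term-frequency weight of each merged group would be accumulated onto its representative, which is what shrinks the text-vector dimensionality.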
Because TF-IDF neglects other word features and semantic influence on keywords, this paper further proposes a keyword-extraction algorithm that combines TF-IDF with word position, word span, and HowNet-based semantic weights for the domain. Experiments show that the proposed method achieves higher precision and recall.

2. Text clustering is an important text-mining technology and another key technology for constructing a domain knowledge base. The paper first experiments with serial text clustering; the results show that it cannot complete the task within an acceptable time on massive data. To solve this problem, we study the basic architecture of the open-source distributed platform Hadoop and its key technologies: the HDFS distributed file system and the MapReduce programming model. We then design a distributed parallel text-clustering algorithm on the Hadoop platform in several parts: parallel construction of text vectors, parallel computation of the similarity matrix, parallel matrix multiplication, and parallel data partitioning. Experiments show that the distributed parallel text-clustering algorithm is feasible for massive, high-dimensional data sets, greatly reduces the running time, and achieves higher precision and recall.

The experiments take the field of Big Data itself as the experimental object and construct its domain knowledge base. On top of this knowledge base, a term-management system is built, providing domain-terminology and navigation services. |
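As an illustration of the MapReduce programming model underlying the parallel text-vector construction, the following is a minimal pure-Python simulation of the two phases, not actual Hadoop code; the corpus, tokenization, and function names are assumptions for the sketch.

```python
from collections import defaultdict

def map_phase(doc_id, tokens):
    """Map: each document split emits ((term, doc_id), 1) key-value pairs."""
    for term in tokens:
        yield (term, doc_id), 1

def reduce_phase(pairs):
    """Reduce: sum the counts per (term, doc_id) key into term frequencies."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

# Hypothetical pre-tokenized corpus: doc_id -> token list.
corpus = {0: ["big", "data", "big"], 1: ["data", "mining"]}

intermediate = [kv for doc_id, toks in corpus.items()
                for kv in map_phase(doc_id, toks)]
tf = reduce_phase(intermediate)

# Assemble sparse document vectors: doc_id -> {term: frequency}.
vectors = defaultdict(dict)
for (term, doc_id), count in tf.items():
    vectors[doc_id][term] = count
# vectors[0] == {"big": 2, "data": 1}
```

In a real Hadoop job the map and reduce functions would run on separate nodes over HDFS splits, and the shuffle phase would group the intermediate pairs by key; the sparse vectors produced here are the input to the subsequent parallel similarity-matrix computation.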