
Design and Implementation of a Patent Document Storage Platform Based on Hadoop

Posted on: 2017-04-23 | Degree: Master | Type: Thesis
Country: China | Candidate: Y J Li | Full Text: PDF
GTID: 2348330512468192 | Subject: Engineering
Abstract/Summary:
Over the past 100 years, science and technology have advanced by leaps and bounds, and patent information has grown rapidly with them; after a century of accumulation, the global body of patent documents is enormous. As a massive information resource, patent literature plays an important role in scientific research and technological development. Patent document data is large in volume, complex in content, and unstructured, so traditional data storage methods fall far short of demand. Research on the efficient storage and retrieval of patent documents is therefore of great significance.

Based on the characteristics of patent documents and their data, this paper analyzes the requirements of a patent document storage platform and designs the architecture and main classes of such a platform on Hadoop. It designs the HDFS storage structure and optimizes the storage format for small files. The structure of the HBase database is designed according to the content of patent documents. Based on the platform's storage conditions, a storage model built on a K-means clustering algorithm for massive data is proposed. In addition, a data query module is designed, and an upload module is designed to remove the limits on file upload.

The paper analyzes the characteristics of patent documents and improves the TF-IDF formula by assigning feature words different weights according to their position in the document. This improves classification performance, and the text space vectors are computed in parallel with Map/Reduce. The dimension of the space vectors is reduced to cut noise interference and clustering time. Because current mainstream clustering algorithms suffer from low precision and slow convergence, a K-means clustering algorithm based on the collected text is put forward, and the space vectors are then clustered and stored. The SequenceFile format used for small-file storage is optimized to implement the mapping between HBase and HDFS, save storage space, and improve the retrieval efficiency of HDFS.

A two-level index is built from the patent directory index and the cluster centers; it judges the user's search intent to narrow the search scope and thereby improves retrieval efficiency. Translation ambiguity is resolved with dictionary-based translation and iterative disambiguation by a Bayes classification algorithm. The cluster-center index is used to reduce the computation range of the Bayes classification algorithm, which enables cross-language retrieval and improves its efficiency. Input keywords are optimized to support general search, multi-keyword search, and cross-language retrieval. Finally, the feasibility of the platform is demonstrated by experiment.

Taking the optimal data storage structure and the Hadoop distributed framework (the HBase non-relational database, the HDFS distributed file system, and the Map/Reduce programming model) as its research objects, this paper studies the efficient storage and retrieval of patent document data as part of the project "massive patent document cloud computing application storage platform".
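
The position-weighted TF-IDF idea mentioned in the abstract can be illustrated with a minimal single-process sketch. The abstract gives neither the actual formula nor the weight values, so the section names, the numbers in `POSITION_WEIGHTS`, the smoothed IDF variant, and all function names below are assumptions made for illustration only; the Map/Reduce parallelization described in the thesis is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical position weights: the abstract does not state the real values.
POSITION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.5, "description": 1.0}


def weighted_term_frequency(doc):
    """Return normalized term frequencies for one document.

    `doc` maps a section name to a list of tokens. Each occurrence of a
    token counts with the weight of the section it appears in.
    """
    counts = Counter()
    for section, tokens in doc.items():
        weight = POSITION_WEIGHTS.get(section, 1.0)
        for token in tokens:
            counts[token] += weight
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(docs):
    """Return a smoothed IDF, log((1 + N) / (1 + df)) + 1, for every term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, regardless of section.
        vocab = {token for tokens in doc.values() for token in tokens}
        doc_freq.update(vocab)
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    idf = inverse_document_frequency(docs)
    vectors = []
    for doc in docs:
        tf = weighted_term_frequency(doc)
        vectors.append({term: value * idf[term] for term, value in tf.items()})
    return vectors


if __name__ == "__main__":
    docs = [
        {"title": ["hadoop"], "description": ["storage"]},
        {"title": ["storage"], "description": ["hadoop"]},
    ]
    for vector in tfidf_vectors(docs):
        print(vector)
```

In this sketch a term that appears in a title contributes more to the document vector than the same term in the description, which is the effect the abstract attributes to position-based weighting. The resulting sparse vectors could then be fed to a dimensionality-reduction and clustering step.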
Keywords/Search Tags:Hadoop, K-means clustering algorithm optimization, Text quantization, Patent document storage