| Academic resources on the Internet are growing rapidly, and processing, analyzing, and storing this information has become an urgent problem. Word segmentation of Chinese and English text is the foundation of text information processing; it aims to convert a continuous character sequence into a word sequence according to a given criterion. Statistical machine learning methods, in particular Conditional Random Fields (CRF) based on character tagging, are the main approaches to Chinese word segmentation. For English, the most effective methods are rule-based, such as stemmers. In our workgroup, the Chinese and English word segmentation module is the basis of data processing and analysis for the academic resource recommendation system, and it plays an important role in the search and recommendation modules. A distributed storage system can better serve the data storage, retrieval, and disaster recovery needs of that recommendation system. This thesis studies both word segmentation and distributed storage, and then designs and implements a word segmentation and distributed storage system for academic resources. Specifically, the main contributions of this thesis are as follows. (1) A Chinese word segmentation system for academic resources is designed and implemented based on the Stanford Word Segmenter. We first study Chinese word segmentation technology, obtain a Chinese corpus, and train a model with CRF++, and then develop a Chinese word segmenter. In comparison with the CRF++-based segmenter, the Stanford Word Segmenter shows high accuracy and good stability, and it extends well to academic resource information. (2) An English word segmentation system based on Lucene is designed and implemented for the academic resource recommendation system. We first build the English word segmentation module and then use it in a multi-threaded word segmentation system. (3) A data manipulation interface is developed on top of HBase. We study and design the storage and retrieval mechanisms of HBase, and then complete the academic resource storage scheme. Experiments on the academic resource system show that the scheme supports massive data storage and offers high safety and fast retrieval. |
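
To make the character-tagging formulation of Chinese word segmentation mentioned above concrete, the following minimal sketch shows how a segmented sentence maps to per-character labels and back. It assumes the common four-tag B/M/E/S scheme (begin, middle, end, single); the abstract does not state which tag set is used, and the function names `words_to_tags` and `tags_to_words` are hypothetical helpers for illustration only. A CRF model would be trained to predict the tag sequence; this sketch only covers the encoding and decoding steps around such a model.

```python
def words_to_tags(words):
    """Encode a segmented sentence as (character, tag) pairs.

    Tags follow the assumed B/M/E/S scheme:
    B = first character of a multi-character word, M = middle,
    E = last character, S = single-character word.
    """
    pairs = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            for ch in word[1:-1]:
                pairs.append((ch, "M"))
            pairs.append((word[-1], "E"))
    return pairs


def tags_to_words(chars, tags):
    """Decode a character sequence and its predicted tags into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # keep a trailing fragment if the tags end mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["学术资源", "的", "推荐"]
    pairs = words_to_tags(segmented)
    chars = "".join(ch for ch, _ in pairs)
    tags = [tag for _, tag in pairs]
    print(pairs)
    print(tags_to_words(chars, tags))
```

Training data for a character-tagging segmenter is typically produced by applying an encoding of this kind to a manually segmented corpus, and the decoder is applied to the tags the model predicts for unsegmented text.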