Storage And Processing Of Primary Education Resources Based On The Hadoop

Posted on: 2016-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: J W Fang
Full Text: PDF
GTID: 2348330476455756
Subject: Computer application technology
Abstract/Summary:
Primary education resources contain a wealth of knowledge. Obtaining rich semantic information from these resources and building a knowledge map of primary education is important for extending the knowledge base of humanoid intelligent systems and improving their level of intelligence. Building such a knowledge map requires large amounts of supporting data, so a database of primary education resources is its essential foundation. This thesis is supported by the 863 project "The key technologies of humanoid intelligent knowledge understanding and reasoning oriented to primary education" (2015AA015403) and addresses three aspects of primary education resources: acquisition, storage, and processing. The major work of this thesis is as follows:
1) A database of primary education resources was built. The resources were collected from the Internet with a simple distributed Web crawler designed in this thesis on top of the Scrapy framework. The collected data were then processed with the Hadoop MapReduce framework, and the processed data were stored in an HBase database.
2) A storage schema for primary education resources was proposed. These resources consist of a large number of small files, so they are not suited to being stored directly in HDFS. The proposed schema first merges the resource files into a smaller number of larger files and then stores these larger files in HDFS as SequenceFiles.
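The merging idea behind the storage schema can be illustrated with a minimal sketch. It packs many small files into one container of (file name, content) records, which is the same key/value idea a SequenceFile uses. The record layout, the function names `pack_small_files` and `unpack_small_files`, and the sample pages are assumptions made for illustration; this is not Hadoop's actual SequenceFile format and not the thesis's implementation.

```python
import io
import struct


def pack_small_files(files, out):
    """Append each (name, payload) pair to `out` as one length-prefixed record."""
    for name, payload in files:
        key = name.encode("utf-8")
        # Header: 4-byte big-endian key length, then 4-byte value length.
        out.write(struct.pack(">II", len(key), len(payload)))
        out.write(key)
        out.write(payload)


def unpack_small_files(src):
    """Yield (name, payload) pairs back from a container written above."""
    while True:
        header = src.read(8)
        if not header:
            break  # clean end of container
        if len(header) != 8:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(">II", header)
        name = src.read(key_len).decode("utf-8")
        payload = src.read(value_len)
        yield name, payload


# Hypothetical small resource files, keyed by file name.
pages = [
    ("lesson1.html", b"<p>1 + 1 = 2</p>"),
    ("lesson2.html", b"<p>2 + 2 = 4</p>"),
]

container = io.BytesIO()  # stands in for one large file stored in HDFS
pack_small_files(pages, container)
container.seek(0)
restored = list(unpack_small_files(container))
```

Keeping the file name as the record key means a single large file can still be split back into its original resources during later processing.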
The experimental results showed that the proposed storage schema saves storage space and improves the processing efficiency of primary education resources.
3) The text extraction algorithm based on the row block distribution function of a page was improved. The original algorithm wrongly treats blocks of links as body text. To solve this problem, the improved algorithm applies two constraints during text extraction, the number of punctuation marks and the ratio of link-text characters to total characters, and it also handles compressed files. The experimental results showed that the improved algorithm effectively avoids extracting link blocks as body text.
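The two constraints described for the improved text extraction can be sketched roughly as follows. The sketch slides a window of `k` lines over the page, uses the window's text length as the value of the row block distribution function, and accepts a window only if it also has enough punctuation and a low share of link text. All thresholds, the window size, the function names, and the sample page are assumptions chosen for illustration; the abstract does not state the values used in the thesis, and handling of compressed files is left out.

```python
import re

TAG = re.compile(r"<[^>]+>")
LINK = re.compile(r"<a\b[^>]*>(.*?)</a>", re.I | re.S)
NOISE = re.compile(r"<(script|style)\b.*?</\1>|<!--.*?-->", re.I | re.S)
PUNCT = set(",.;:!?，。；：！？、")


def line_stats(html_line):
    """Return (visible text, number of characters inside <a> tags) for one line."""
    link_chars = sum(len(TAG.sub("", m).strip()) for m in LINK.findall(html_line))
    return TAG.sub("", html_line).strip(), link_chars


def extract_text(html, k=3, min_len=40, min_punct=2, max_link_ratio=0.3):
    """Keep lines that fall in a k-line window passing all three checks.

    Assumes each <a>...</a> element sits on a single line.
    """
    lines = [line_stats(line) for line in NOISE.sub("", html).splitlines()]
    keep = [False] * len(lines)
    for i in range(len(lines)):
        block = lines[i:i + k]
        # Row block distribution function: text length of the window at line i.
        total = sum(len(text) for text, _ in block)
        if total == 0 or total < min_len:
            continue
        punct = sum(ch in PUNCT for text, _ in block for ch in text)
        link_ratio = sum(n for _, n in block) / total
        # The two added constraints: enough punctuation, little link text.
        if punct >= min_punct and link_ratio <= max_link_ratio:
            for j in range(i, min(i + k, len(lines))):
                keep[j] = True
    return "\n".join(text for (text, _), kept in zip(lines, keep) if kept and text)


# Hypothetical page: a navigation block of links followed by body text.
sample = "\n".join([
    "<html><body>",
    '<a href="/math">Math</a> <a href="/chinese">Chinese</a> <a href="/english">English</a>',
    '<a href="/grade1">Grade 1</a> <a href="/grade2">Grade 2</a> <a href="/grade3">Grade 3</a>',
    '<a href="/about">About us</a> <a href="/contact">Contact</a>',
    "",
    "",
    "<p>A fraction describes a part of a whole. The top number is the numerator.</p>",
    "<p>The bottom number is the denominator, and it cannot be zero.</p>",
    "</body></html>",
])

body = extract_text(sample)
print(body)
```

Calling `extract_text(sample, min_punct=0, max_link_ratio=1.0)` disables both constraints, and the navigation links then pass the length check alone, which mirrors the failure of the original algorithm that the abstract describes.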
Keywords/Search Tags: Primary Education Resources, Small File Storage, Distribution Function of Row Block