
Meteorological Text Categorization Feature Selection Method And Its Implementation On MapReduce

Posted on: 2016-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: C X Jin
Full Text: PDF
GTID: 2180330470969725
Subject: Meteorological information technology and security
Abstract/Summary:
As society develops, the amount of information is growing explosively. In the meteorological sector, every province, municipality, autonomous region, and special administrative region has built a website for meteorological services, and there are now more than 1,200 such websites. Because of this enormous volume of meteorological text, how to extract useful information from it has become a research hotspot. Text categorization can retrieve relevant, timely information from large document collections, which makes it a key technology for extracting relevant information from massive meteorological text. Feature selection, in turn, is a core technique for addressing the high computational complexity and low precision caused by the high dimensionality of the term space. Taking feature selection as its entry point, this thesis proposes a modified feature selection algorithm based on the chi-square statistic, which addresses the classical algorithm's failure to jointly evaluate term frequency and term distribution. The algorithm is validated on common text categorization data sets and on a meteorological text data set. In addition, because the volume of text makes the running time on a single computer prohibitively long, we propose a meteorological text categorization scheme built on the MapReduce parallel computing framework. Experiments on the meteorological text data set confirm that distributed parallel computing improves classification efficiency. This thesis completes the following work:

(1) Proposing a modified chi-square statistic algorithm based on term frequency and term distribution. After studying the principles of classical feature selection algorithms and analyzing their shortcomings, the thesis proposes a modified algorithm based on the conventional chi-square statistic. The modified algorithm measures term distribution with the sample variance and revises the chi-square evaluation function using the maximum term frequency, so that both term frequency and term distribution contribute to the score of a candidate feature word. Experiments on classical text categorization data sets and on meteorological text show that the method improves categorization performance.

(2) Designing and implementing meteorological text categorization based on MapReduce. The thesis processes meteorological text in parallel using the MapReduce parallel computing framework on Hadoop, an open-source platform. The method covers not only the parallel implementation of the categorization algorithm but also meteorological text preprocessing, the TFDCHI algorithm, and a distributed parallel computing scheme for the text representation and computation stages, with tasks executed independently of one another as far as possible. A comparison of experimental results shows that the distributed parallel processing method achieves higher classification efficiency.

(3) Collecting text from the meteorological service websites of the China Meteorological Bureau (CMB) and of all provinces, municipalities, autonomous regions, and special administrative regions. The collected text is then preprocessed and its structure analyzed, producing a meteorological text data set that is convenient to classify.
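For orientation, the classical chi-square statistic that item (1) takes as its starting point can be sketched in a map/reduce style. This is only an illustration of the conventional baseline, not the thesis's TFDCHI algorithm, whose exact variance and maximum-term-frequency corrections are not given in this abstract. The function names (`map_document`, `reduce_counts`, `chi_square`), the toy corpus, and the category labels are all hypothetical, and the map and reduce steps run in-process here as stand-ins for Hadoop jobs.

```python
from collections import Counter
from itertools import chain


def map_document(category, tokens):
    """Map step: emit ((term, category), 1) once per distinct term in a document."""
    for term in set(tokens):
        yield (term, category), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (term, category) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def chi_square(term, category, df, docs_per_category):
    """Classical chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = sum(docs_per_category.values())
    a = df.get((term, category), 0)
    term_total = sum(count for (t, _), count in df.items() if t == term)
    b = term_total - a
    c = docs_per_category[category] - a
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical toy corpus: (category, tokens) pairs, already tokenized.
docs = [
    ("warning", ["typhoon", "alert", "rain"]),
    ("warning", ["typhoon", "wind"]),
    ("forecast", ["rain", "sunny"]),
    ("forecast", ["sunny", "wind"]),
]

# Document frequencies per (term, category), computed map/reduce style.
df = reduce_counts(chain.from_iterable(map_document(c, t) for c, t in docs))
docs_per_category = Counter(c for c, _ in docs)

scores = {
    term: chi_square(term, "warning", df, docs_per_category)
    for term in ("typhoon", "rain")
}
print(scores)  # {'typhoon': 4.0, 'rain': 0.0}
```

Note that this baseline counts only whether a term occurs in a document, so it ignores how often the term occurs and how evenly it is spread across documents. That is the gap the thesis describes addressing with term frequency and sample variance.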
Keywords/Search Tags:meteorological text categorization, feature selection, chi-square statistics, term frequency, term distribution, MapReduce