| With the development of commerce and the Internet, large-scale businesses and business systems are growing rapidly. The common-use self-service system has been widely adopted along with the rapid development of the aviation business, so the number of travelers who use it to handle air-travel services is increasing daily. Because of the complexity of the business processes and the large number of users, the system produces vast amounts of log files every day, and these files contain valuable customer data. Analyzing the log files can turn this latent user data into valuable information. However, traditional methods for handling large-scale log data have become increasingly inadequate, so research on distributed, parallel log processing methods is crucial. The MapReduce programming model, which runs on the distributed computing platform Hadoop, has become the first choice for large-scale log analysis because of its simplicity, ease of use, applicability, and capacity for large-scale data processing. As the distributed computing platform for MapReduce, Hadoop consists of MapReduce and HDFS. Hadoop not only makes it easy to organize computer resources into one's own distributed computing platform, but also takes advantage of the computing and storage capacity of a distributed cluster to complete the analysis and processing of massive data. As the theoretical basis of log analysis and the basis for implementing log analysis methods, data mining technology extracts potentially useful information and knowledge from large amounts of incomplete, noisy, real-world application data. Based on in-depth research and on the characteristics of the log files produced by the self-service check-in system, this thesis puts forward a set of log file data preprocessing methods and analysis methods suitable for distributed and parallel processing. 
The data preprocessing methods adjust the format and content of the data to be analyzed through a series of operations on the original log files (data cleaning, integration, transformation, and reduction), thus effectively reducing the size of the data handled in distributed processing and improving the efficiency of log analysis. The distributed and parallel processing methods efficiently complete the analysis of the massive log files and then extract valuable customer and business data from the analyzed data, providing strong support for business planning and development. This thesis focuses on distributed log analysis methods. Building on a thorough study of log analysis approaches and of distributed and parallel computing technology, it designs a distributed parallel log analysis system platform that is based on a B/S (browser/server) architecture and combines J2EE technology with Hadoop. Experiments verify that, when analyzing large log files, the system greatly improves analysis efficiency compared with a traditional serial analysis system. Based on this distributed log analysis system, the thesis implements the whole log analysis process. Users can upload log files to the server and then select the corresponding elements for data preprocessing. The system automatically delivers the preprocessed log files to the distributed computing nodes for analysis. Finally, the results of the analysis are shown as charts on the system's web page. Users can also export the analyzed data to an Excel spreadsheet and email it to the relevant development and operations personnel, to help the airlines develop strategies for new business. |
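
The two stages described above, preprocessing followed by MapReduce-style analysis, can be illustrated with a minimal single-process sketch. It is not the thesis's implementation and does not use Hadoop. The log format, the sample records, and all function names (`preprocess`, `map_phase`, `shuffle`, `reduce_phase`, `analyse`) are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical log format (an assumption, not taken from the thesis):
# "<timestamp> <kiosk_id> <event>"
RAW_LOG = [
    "2012-03-01T08:00:01 K01 CHECKIN_OK",
    "2012-03-01T08:00:05 K02 CHECKIN_FAIL",
    "",                      # blank line, dropped during cleaning
    "corrupted record",      # malformed line, dropped during cleaning
    "2012-03-01T08:01:10 K01 CHECKIN_OK",
]


def preprocess(lines):
    """Cleaning and reduction: keep well-formed records and only the needed fields."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            _timestamp, kiosk, event = parts
            yield kiosk, event


def map_phase(records):
    """Map: emit a ((kiosk, event), 1) pair for every record."""
    for kiosk, event in records:
        yield (kiosk, event), 1


def shuffle(pairs):
    """Shuffle: group values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def analyse(lines):
    """Run preprocessing, then map, shuffle, and reduce in sequence."""
    return reduce_phase(shuffle(map_phase(preprocess(lines))))


counts = analyse(RAW_LOG)
```

In a real Hadoop job the map and reduce functions would run on separate nodes over HDFS blocks. Here they run in sequence, only to show how dropping malformed records first shrinks the input that the distributed stage has to process.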