Design And Implementation Of Distributed Computing System Based On Hadoop

Posted on: 2016-02-18, Degree: Master, Type: Thesis
Country: China, Candidate: K Z Guo, Full Text: PDF
GTID: 2308330470978586, Subject: Software engineering
Abstract/Summary:
This thesis studies how to process large amounts of data. The goal is to build a distributed computing system that is small, low-cost, high-performance, and low-risk. The system combines distributed computing with parallel computing, is developed for Windows, and runs on a small, secure local area network to process large amounts of data quickly. It addresses two weaknesses of traditional computing frameworks: poor scalability and the computing bottleneck of a single machine. Building on a study of the Hadoop framework, the thesis presents two services: a distributed storage service and a distributed computing service. The system uses a three-layer architecture to support the running cluster: Master is responsible for global information control, Job is responsible for scheduling tasks, and Task is responsible for data storage and computation. Users can upload data in a custom format; the original data is not split a second time, so computation on certain native data types can be supported. The system opens an algorithm API and extends the traditional MapReduce framework so that a computation can carry additional data sources, which makes data exchange easier. The Reduce step is localized, which saves the time needed to transfer intermediate Map data over the network. The merge interface of the scheduling service produces the final result of the computation, which spreads the computing load and makes full use of the machines in the cluster. Users embed custom algorithms by adding a dynamic link library (DLL file); the algorithms are stored on the Task nodes in advance, which saves startup time. The cluster machines are assumed to be reasonably reliable, which simplifies the disaster recovery design. Hardware resources are used preemptively during computation: when a single machine executes a task, it loads the cache according to a configuration file to speed up the next computation.
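The localized Reduce step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation (which targets Windows and loads algorithms from DLLs); the function names `map_block`, `local_reduce`, `run_task_node`, and `merge_results` are hypothetical, and word counting is used only as a stand-in algorithm. The point is that each Task node reduces its own Map output, so only small partial results need to reach the scheduling service for merging.

```python
from collections import Counter


def map_block(block: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one data block."""
    return [(word.lower(), 1) for word in block.split()]


def local_reduce(pairs: list[tuple[str, int]]) -> Counter:
    """Reduce step run on the same node as Map, so the pairs never cross the network."""
    counts: Counter = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


def run_task_node(blocks: list[str]) -> Counter:
    """Hypothetical Task node: map and reduce every locally stored block."""
    partial: Counter = Counter()
    for block in blocks:
        partial.update(local_reduce(map_block(block)))
    return partial


def merge_results(partials: list[Counter]) -> Counter:
    """Hypothetical merge interface: combine per-node partial results into the final result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two simulated Task nodes, each holding its own data blocks.
node_a = run_task_node(["the quick brown fox", "the lazy dog"])
node_b = run_task_node(["The fox jumps"])
result = merge_results([node_a, node_b])
```

In this sketch only `node_a` and `node_b` (one small counter per node) would be transferred, rather than every intermediate `(word, 1)` pair.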
The number of data blocks determines the number of threads; the threads can run concurrently in a thread-safe way, which gives high concurrency and makes full use of the CPU. Finally, two kinds of test algorithms were built for the implemented distributed computing system: a wordcount algorithm for comparison with a Hadoop cluster, and an image alignment algorithm that demonstrates data interoperability. Analysis of the test data shows that the system can respond in real time to computing requests on small amounts of data and that, as a small cluster running in a secure network environment, it provides enough computing power for general small and medium-sized needs, achieving the intended design goals.
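The rule that the number of data blocks determines the number of threads can be sketched as follows. This is an illustrative Python example under stated assumptions, not the thesis's code: `split_into_blocks`, `count_block`, and `count_words` are hypothetical names, and the block size is arbitrary. Each thread writes only to its own counter, so no locking is needed. Note that in CPython the global interpreter lock limits CPU parallelism for pure-Python work, so the sketch shows the structure rather than the speedup a native implementation would get.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines: list[str], block_size: int) -> list[list[str]]:
    """Cut the input into fixed-size data blocks (the last block may be shorter)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def count_block(block: list[str]) -> Counter:
    """Count words in one block; each thread fills its own Counter, so no state is shared."""
    counts: Counter = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def count_words(lines: list[str], block_size: int = 2) -> Counter:
    """Start one worker thread per data block, then merge the per-block counts."""
    blocks = split_into_blocks(lines, block_size)
    if not blocks:
        return Counter()
    # The thread count is taken directly from the number of blocks.
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        partials = list(pool.map(count_block, blocks))
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["a b a", "b c", "a", "c c d", "d"]
totals = count_words(lines, block_size=2)
```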
Keywords/Search Tags: Distributed Computing, Hadoop, Real Time Response