| With its advantages in distributed clustering, Hadoop provides an efficient solution for storing and processing large-scale data. As an important big data processing platform, Hadoop must deliver high efficiency and performance. However, installing and deploying a Hadoop cluster has limitations: the large number of configuration files and nodes to configure, together with the complexity of the deployment process, means that distributed cluster deployment frequently ends in failure. At the same time, the cluster cannot allocate resources dynamically and effectively. In trying to make full use of Hadoop Yarn’s resources, queues often preempt computing resources, which causes important tasks to block. Research on deployment and performance optimization for the Hadoop platform is therefore necessary and meaningful. Currently, rapid deployment and automatic scaling of the Hadoop platform rely mainly on traditional virtual machine (VM) technology. Traditional virtualization cannot match the resource utilization of the physical host: it incurs considerable overhead in resource utilization, startup speed and performance, and it makes it difficult to configure files and to automate the creation and deployment of services flexibly. In this thesis, traditional virtualization of the Hadoop platform is studied, and a Hadoop platform deployment and optimization scheme based on container virtualization technology is proposed. The main research work includes: (1) In view of the disadvantages of traditional virtual machines, such as low resource utilization, slow startup and large performance overhead, a rapid deployment plan for the Hadoop platform is proposed and designed. Based on the lightweight virtualization that underlies containers, through a Dockerfile and automated build scripts, with serf applied for node management and dnsmasq used as 
a lightweight DNS server, network communication between containers is optimized and file sharing is achieved, which improves the efficiency of Hadoop cluster deployment and enhances its availability. (2) Because host resource utilization is low, deployment and expansion are complex, and resource isolation cannot be adjusted dynamically or respond quickly to business problems, a Yarn on Swarm cluster architecture is proposed. Swarm performs the underlying resource scheduling, and each Docker container runs a NodeManager whose resource allocation is specified down to memory size and CPU count. Changing the number of containers dynamically changes the resources available to Yarn, which enables dynamic, on-demand elastic scaling; through this finer-grained allocation, Hadoop Yarn resources can be used reasonably. (3) A performance optimization system for the Hadoop platform is designed and implemented. First, a self-developed flat network plug-in is adopted to solve the key issue of network communication in cluster deployment, so that all container networks are connected at layer 2. Second, a Dockerfile and automation scripts are designed to solve the difficulty of configuring files and of automating the creation and deployment of services flexibly and quickly; a container resource visualization interface is designed and implemented based on Shipyard; and a Jenkins-based automated build and continuous integration module is designed and implemented. Together these greatly improve the efficiency of cluster supervision. In this thesis, a deployment and optimization scheme for the Hadoop platform based on container virtualization technology is researched, designed, implemented and tested. Its underlying virtualization approach, two-tier resource scheduling, and friendly and efficient visual control of the container cluster and the Hadoop cluster bring fine-grained resource allocation. The research, design, implementation and 
testing verify the scheme's feasibility and efficiency in rapid deployment, automatic scaling and resource optimization. |
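
The scaling idea in item (2), where each Docker container runs one NodeManager with a fixed memory size and CPU count and the Yarn resource pool is resized by changing the number of containers, can be illustrated with a minimal sketch. This is not the thesis's implementation: the names `NodeManagerSpec`, `cluster_capacity` and `containers_needed`, and the example figures, are hypothetical and chosen only to show the arithmetic.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeManagerSpec:
    """Resources given to one NodeManager container (hypothetical values)."""
    memory_mb: int
    vcores: int


def cluster_capacity(spec: NodeManagerSpec, containers: int) -> tuple[int, int]:
    """Total (memory_mb, vcores) Yarn can schedule with this many containers."""
    return spec.memory_mb * containers, spec.vcores * containers


def containers_needed(spec: NodeManagerSpec, demand_memory_mb: int,
                      demand_vcores: int) -> int:
    """Smallest container count that covers both memory and CPU demand."""
    by_memory = math.ceil(demand_memory_mb / spec.memory_mb)
    by_cpu = math.ceil(demand_vcores / spec.vcores)
    # Keep at least one NodeManager running even when demand is zero.
    return max(by_memory, by_cpu, 1)


if __name__ == "__main__":
    spec = NodeManagerSpec(memory_mb=4096, vcores=2)
    n = containers_needed(spec, demand_memory_mb=10000, demand_vcores=3)
    print(n, cluster_capacity(spec, n))
```

Under these assumptions, a scheduler would compare `containers_needed` with the number of containers currently running and ask the container orchestrator to start or stop NodeManager containers to close the gap.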