| In the 21st century we have entered a new period: the era of big data. Data are now regarded as a source of wealth, and big data have rapidly promoted the development of cloud computing. Cloud computing has become a new business model and has attracted growing attention from industry, academia, and society. The "cloud" offers a new dimension to fixed and mobile users worldwide, providing computing resources in the form of infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). These resources are delivered over the Internet, and users can choose to pay by usage or by allocated resources. Because resource usage is uncertain, sizing the resource capacity of clusters and applications built on a cloud platform is a double-edged sword: it can lead to either inadequate or excessive supply. For cloud tenants, requesting more resources than needed wastes resources and raises costs; for the cloud service provider, over-supplying resources to tenants lowers overall resource utilization. Resource scheduling problems in cloud computing are, moreover, considered to be as difficult as non-deterministic polynomial (NP) optimization problems. To improve resource utilization, this paper carries out research at two levels: scheduling inside the cluster and the scale of the cluster. (1) I examined the Hadoop architecture, the MapReduce computing framework, and the HDFS file system, and then studied the three scheduling algorithms supported by Hadoop. This study revealed shortcomings in the existing algorithms. To address them, a self-learning method was applied to resource scheduling, and a feature-weighted Naive Bayesian scheduling algorithm was proposed. 
The experimental results show that, when running WordCount jobs, the feature-weighted Naive Bayesian scheduling algorithm takes less time and achieves higher resource utilization than Hadoop's default scheduling algorithm. (2) For a Hadoop cluster as a whole, an under-supply of resources leads to resource saturation and an over-supply leads to waste. Combining the OpenStack cloud platform with the Hadoop big-data tool, a system that dynamically adjusts the scale of the cluster has been designed. The system is composed of three modules: monitoring, scheduling, and virtual machine management. In the scheduling module, timer-based adjustment handles only jobs that are periodic and stable, while threshold-based adjustment can handle almost all cases but delays the supply of resources. This paper therefore proposes a time series workload forecasting algorithm based on SVM. Because the accuracy of the forecast has a crucial influence on decision-making, both the SVM algorithm and the ARMA algorithm are used to predict the workload time series. The experimental results show that the SVM model predicts more accurately than the ARMA model for the growth and irregular workload patterns. |
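
The abstract does not give the formulation of the feature-weighted Naive Bayesian scheduler, so the following is only a minimal sketch of the general idea, under stated assumptions. It uses the common weighted form in which each feature contributes `w_i * log P(x_i | c)` to the class score, and it learns from observed outcomes one at a time. The class name `WeightedNaiveBayes`, the two features (CPU load and data locality), the labels `good`/`bad`, and the weight values are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


class WeightedNaiveBayes:
    """Categorical Naive Bayes where feature i contributes w_i * log P(x_i | c)."""

    def __init__(self, weights, alpha=1.0):
        self.weights = weights                   # one weight per feature (hypothetical values)
        self.alpha = alpha                       # Laplace smoothing constant
        self.class_counts = Counter()            # label -> number of observations
        self.feature_counts = Counter()          # (feature index, value, label) -> count
        self.values = [set() for _ in weights]   # distinct values seen per feature

    def update(self, features, label):
        """Record one observed scheduling outcome (the self-learning step)."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[(i, value, label)] += 1
            self.values[i].add(value)

    def log_score(self, features, label):
        """Unnormalised log posterior: log P(c) + sum_i w_i * log P(x_i | c)."""
        total = sum(self.class_counts.values())
        n_classes = len(self.class_counts)
        prior = (self.class_counts[label] + self.alpha) / (total + self.alpha * n_classes)
        score = math.log(prior)
        for i, value in enumerate(features):
            num = self.feature_counts[(i, value, label)] + self.alpha
            den = self.class_counts[label] + self.alpha * max(len(self.values[i]), 1)
            score += self.weights[i] * math.log(num / den)
        return score

    def predict(self, features):
        """Return the label with the highest weighted score."""
        return max(self.class_counts, key=lambda c: self.log_score(features, c))


# Hypothetical history of (node features, outcome) pairs for past task assignments.
history = [
    (("cpu_high", "local"), "good"),
    (("cpu_high", "remote"), "bad"),
    (("cpu_low", "local"), "good"),
    (("cpu_low", "remote"), "good"),
    (("cpu_high", "remote"), "bad"),
]

# Data locality (second feature) is given twice the weight of CPU load here.
model = WeightedNaiveBayes(weights=[1.0, 2.0])
for features, label in history:
    model.update(features, label)

busy_remote = model.predict(("cpu_high", "remote"))
idle_local = model.predict(("cpu_low", "local"))
print(busy_remote, idle_local)
```

A scheduler built along these lines would presumably call `predict` for each candidate node before assigning a task and call `update` once the outcome is known, so the classifier adapts as the cluster runs. How the weights are chosen is the part specific to the proposed algorithm and is left open in this sketch.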