| With the development of computing and network technology, data centers and computing centers built on distributed and parallel computing systems are widely used in industry, business, technology, the military, and other fields. In these applications, a large, complex task is decomposed into several computational subtasks that are processed in parallel, and the partial results are then combined to obtain the final result. An effective task scheduling mechanism is therefore the key factor in the performance and efficiency of a distributed computing system during task decomposition and computation. An unreasonable scheduling method seriously degrades the computing capability of the system: it reduces parallel efficiency and may even prevent the system from gaining any benefit from parallel computation. For these reasons, task scheduling has always been a core issue in distributed systems, network systems, and cloud computing systems, and it remains an active research topic. However, as system scale and computing capability continue to grow, stability and reliability have become the key issues in ensuring the smooth execution of parallel applications. For example, in the Tianhe-2 supercomputer, Google data centers, and other large-scale clusters, the complexity of the applications and the high power consumption make the distributed system extremely prone to failure. A complete set of reliability guarantee mechanisms is therefore particularly important, and highly reliable scheduling algorithms are one important means of providing it. Under the theme “Guarantee performance, improve reliability”, this thesis studies in depth how to ensure the reliability of heterogeneous distributed computing systems while using computing resources efficiently. 
This thesis divides tasks into two types: real-time tasks and non-real-time tasks. A scheduling strategy with both high reliability and high performance is achieved through primary-version and backup-version (main version and sub version) scheduling. The main contents of this thesis are as follows: (1) For the reliable scheduling of real-time tasks in heterogeneous distributed computing systems, a scheduling algorithm based on the reliability cost of computing nodes and communication links (DRCAMD) is proposed. By setting weights, the method adjusts the weight function of the system and thereby balances the different requirements that different users place on scheduling performance and reliability. For the scheduling of real-time tasks with dependencies, a new scheduling analysis method is also proposed that does not need to consider the various overlapping states of the primary and backup versions. Experimental results show that the algorithm retains its advantages in reliability and performance when a certain number of computing nodes and communication links have failed. (2) To solve the reliable scheduling problem for mixed-criticality tasks, we propose a reliability analysis method for a two-stage reliable scheduling algorithm that is based on the primary-version and backup-version scheduling strategy combined with task-level degradation. The first stage dispatches the mixed-criticality tasks awaiting scheduling according to their scheduling priority, with higher-priority tasks dispatched first. During scheduling, replica overlapping is used to reduce the system overhead caused by copying replicas. The second stage handles the tasks that have been scheduled onto the heterogeneous processors: tasks whose scheduling requirements cannot be met are degraded. 
Experimental results show that MCRSS outperforms the other algorithms and is especially suitable for heterogeneous cluster environments in which task rates vary widely and nodes dynamically join or leave the cluster. It gives the system strong flexibility and reliability. (3) For the scheduling of DAG tasks with precedence dependencies, a scheduling algorithm based on the earliest finish time of the backup-version task (EFTBT) is proposed. By analyzing the scheduling state of the primary-version task, the method derives, under different conditions, the earliest finish time of the backup task and the constraints on the target processor to which it can be scheduled, and the validity of these constraints is proved. The method achieves better scheduling performance. In addition, scientific workflow applications often require several DAG tasks to be scheduled at the same time. To address the unfair treatment of multiple subsequent tasks in this setting, a multi-DAG scheduling strategy (MDDL) based on a hierarchical idea is proposed, which solves the problem effectively. Experimental results show that the two algorithms improve scheduling performance more effectively than the classical algorithms. (4) In view of the heterogeneous and dynamic characteristics of large-scale heterogeneous distributed computing systems, a reliability-cost-based scheduling strategy for DAG tasks with dependencies is proposed. The strategy uses the earliest finish time of the backup task (EFTBT). A communication model and implementation better suited to actual application requirements are proposed, and a fault characteristics analysis method for heterogeneous distributed computing systems is established. 
On this basis, a fault-tolerant scheduling algorithm (RAPA) based on the communication contention model is proposed. Experimental results show that EFTBT and HEFT have better performance and reliability than the classical algorithm. |