| In the era of big data, processing massive, high-velocity data in real time to extract its value quickly has become an important challenge in computer systems. Distributed stream processing systems have become the cornerstone of real-time big data computing, and many such systems have been developed and widely deployed in industry. These systems fall into two categories: batched stream processing systems, which discretize stream data into micro-batches, and continuous stream processing systems, which process stream data record by record in a pipelined manner. In both, task scheduling and data grouping are the keys to achieving high throughput and low latency. However, current scheduling mechanisms do not take the characteristics of job deployment and runtime into account and therefore cannot schedule tasks and data efficiently, which seriously degrades the performance of streaming applications. How to exploit these characteristics to optimize task scheduling and data grouping is thus a critical problem. First, batched stream processing systems usually run in heterogeneous environments because of hardware upgrades and resource sharing. Stragglers are widespread in these systems and heavily impact performance. Traditional straggler mitigation mechanisms are reactive post-scheduling methods. They react too late to straggler tasks in batched stream processing systems, resulting in long job completion times and high resource overhead. To solve this problem, a pre-scheduling straggler mitigation framework exploits the characteristics of recurring jobs, identifies potential stragglers by analyzing the execution information of historical jobs, and evaluates each node’s capacity with the Iterative Learning Control (ILC) model. It then pre-schedules job input data to each node before the next batch
to mitigate potential stragglers. Second, by analyzing the performance of a real streaming application in Tencent production clusters, we found that batched stream processing systems have not only heterogeneous resources but also heterogeneous tasks. Unfortunately, current implementations, designed for homogeneous environments, schedule tasks according to data locality and free slots while ignoring task size and node capacity, so they perform poorly in heterogeneous environments. Meanwhile, some alternative optimizations are compensation mechanisms that act only after a task has fallen behind; they cannot efficiently mitigate the effects of heterogeneity on performance. To address this issue, a scheduling framework that is aware of task size and node capacity pre-steals large tasks from slow nodes to fast nodes. It schedules tasks on a largest-task-first basis and then fills free slots with small tasks matched to each node’s capacity. Finally, data grouping has a great impact on the performance of continuous stream processing systems. Existing stream grouping strategies fall into four categories (raw and blind, data-skewness-aware, cluster-heterogeneity-aware, and dynamic load-aware), but none of them considers the network distance between two communicating operators. In fact, traffic over different network channels affects performance significantly because the underlying communication mechanisms differ. How to group messages according to network distance to improve performance is therefore a critical problem. To solve it, a network-aware grouping framework identifies the resource locations of two communicating operators and then introduces weighted grouping, which decides the number of tuples sent over each network channel by assigning each channel a weight and a priority. It adopts dynamic weight control to
adjust each network channel’s weight and priority online by periodically analyzing runtime information. In summary, by combining the characteristics of job deployment and runtime to optimize task scheduling and data grouping in distributed stream processing systems, a series of runtime-aware scheduling strategies is proposed to improve the performance of streaming applications. |
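
The weighted grouping and dynamic weight control described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the framework itself: the channel labels, the `Channel` and `WeightedGrouping` names, the use of smooth weighted round-robin to turn weights into per-tuple decisions, and the rule that a channel's new weight is the inverse of its observed latency. The per-channel priority mentioned in the abstract is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    name: str             # hypothetical label, e.g. "same-process"
    weight: float         # relative share of tuples this channel should get
    current: float = 0.0  # running credit used by the round-robin


class WeightedGrouping:
    """Distributes tuples over channels in proportion to their weights."""

    def __init__(self, channels):
        self.channels = list(channels)

    def choose(self):
        """Pick the channel for the next tuple (smooth weighted round-robin)."""
        total = sum(c.weight for c in self.channels)
        for c in self.channels:
            c.current += c.weight
        # The channel with the most accumulated credit sends the tuple.
        best = max(self.channels, key=lambda c: c.current)
        best.current -= total
        return best.name

    def update_weights(self, latency_ms):
        """Assumed control rule: weight is the inverse of observed latency."""
        for c in self.channels:
            c.weight = 1.0 / latency_ms[c.name]
            c.current = 0.0  # restart the cycle under the new weights


grouping = WeightedGrouping([
    Channel("same-process", 3.0),
    Channel("same-node", 2.0),
    Channel("cross-node", 1.0),
])
first_cycle = [grouping.choose() for _ in range(6)]

# Periodic runtime analysis (latencies here are made-up sample values).
grouping.update_weights(
    {"same-process": 1.0, "same-node": 2.0, "cross-node": 4.0}
)
second_cycle = [grouping.choose() for _ in range(7)]
```

In this sketch, closer channels start with larger weights and therefore carry more tuples per cycle; calling `update_weights` periodically plays the role of online adjustment. A real system would likely smooth the measurements and bound the weights rather than recompute them from a single sample.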