Research On Real-time Data Warehouse Optimization Technology Based On Flink

Posted on: 2024-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: R J Wang
Full Text: PDF
GTID: 2568307079970939
Subject: Electronic information
Abstract/Summary:
With the continued development of Internet technology, the volume of real-time data has grown rapidly, and the Real Time Data Warehouse (RTDW) has become an important means of data analysis. An RTDW aggregates real-time data through the Flink stream processing engine and can synchronize data at second-level or minute-level latency, so that its ETL process meets managers' requirements for data freshness. However, stream tasks are allocated resources once and then run for a long time, so the workload they will face is difficult to predict, and the stream processing engine needs a suitable scheduling algorithm to schedule them elastically. Meanwhile, Flink's current dynamic scaling mechanism scales operators by restarting the stream task and reassigning state, so Flink cannot process data while scaling, which costs throughput and adds latency.

To address these problems, the thesis proposes DS2-VUT, an elastic scheduling algorithm based on the Data Flow model that combines Kingman's formula with the DS2 elastic scheduling algorithm. DS2-VUT takes as input the user-specified maximum waiting time for operator instances and the coefficients of variation of the input data arrival and of the processing speed, and from these determines the maximum utilization of each operator in the stream task. This enables DS2-VUT to make precise scale-up or scale-down decisions for workloads with different fluctuation magnitudes, so that the stream processing engine improves the real-time performance of stream tasks while wasting as few physical resources as possible.

To reduce Flink's delay during scaling, the thesis also implements a dynamic scaling mechanism on Flink based on control events. The mechanism incorporates the jump consistent hash partitioning algorithm, which uses the keys of the input data as seeds for random number generation, so that no additional key remapping occurs when the number of operator instances changes. This reduces the amount of state migrated between operator instances and the number of operator instances blocked during scaling, allows Flink to keep processing input data while scaling, and gives the Flink-based real-time data warehouse better throughput and real-time performance.

To validate the design, the thesis tests the DS2-VUT elastic scheduling algorithm and the control event-based dynamic scaling mechanism using the Word Count task and the Yahoo Streaming Benchmark, simulating a Flink compute node with a resource-constrained Docker container. Experimental results show that, compared with the DS2 algorithm, DS2-VUT increases the total number of operator instances of the stream task by only one to two for low-volatility input data and by two to three for high-volatility input data. In both cases DS2-VUT improves the real-time performance of the stream task while wasting as few physical resources as possible. Moreover, compared with Flink's native mechanism, the control event-based dynamic scaling mechanism reduces the amount of state migrated during scaling and speeds up scaling, and the modified Flink can continue to process real-time data while scaling, which improves Flink's throughput during elastic scheduling.
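The abstract does not spell out how DS2-VUT turns a maximum waiting time into a maximum utilization, so the following is only a minimal sketch of one plausible reading. It assumes the standard form of Kingman's approximation for a single-server queue, W ≈ (ρ / (1 − ρ)) · ((c_a² + c_s²) / 2) · τ, solved for the utilization ρ, and a DS2-style instance count obtained by dividing the input rate by the usable per-instance rate. The function names, parameters, and sizing rule are hypothetical and are not taken from the thesis.

```python
import math


def max_utilization(max_wait, mean_service_time, cv_arrival, cv_service):
    """Largest utilization rho for which Kingman's approximation keeps the
    mean queueing delay at or below max_wait (assumed inversion, not the
    thesis's exact formula)."""
    # Variability term of Kingman's formula: (c_a^2 + c_s^2) / 2.
    variability = (cv_arrival ** 2 + cv_service ** 2) / 2.0
    if variability == 0:
        # No variability: the approximation predicts no queueing below saturation.
        return 1.0
    # Solve rho / (1 - rho) * variability * tau = max_wait for rho.
    return max_wait / (max_wait + variability * mean_service_time)


def required_instances(input_rate, per_instance_rate, rho_max):
    """Hypothetical DS2-style sizing: each instance may only be loaded up to
    rho_max of its measured processing rate."""
    return max(1, math.ceil(input_rate / (per_instance_rate * rho_max)))


if __name__ == "__main__":
    # Illustrative numbers only: 10 ms wait budget, 1 ms mean service time.
    steady = max_utilization(0.010, 0.001, cv_arrival=1.0, cv_service=1.0)
    bursty = max_utilization(0.010, 0.001, cv_arrival=2.0, cv_service=1.0)
    print(required_instances(9000, 1000, steady))
    print(required_instances(9000, 1000, bursty))
```

Under these assumptions, a burstier input (larger coefficient of variation) lowers the permitted utilization and therefore raises the instance count, which is consistent in direction with the abstract's report that DS2-VUT adds more instances for high-volatility input than for low-volatility input.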
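The key-remapping property described for the scaling mechanism can be illustrated with the published jump consistent hash algorithm (Lamping and Veach), in which the key seeds a linear congruential generator. This is a sketch of the general algorithm only, not the thesis's Flink integration; the integer keys are an assumption for illustration, and a real partitioner would first hash arbitrary keys to a 64-bit integer.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= 0xFFFFFFFFFFFFFFFF
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # The key acts as the seed of a 64-bit linear congruential generator.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        # Jump forward to the next bucket count at which this key would move.
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def moved_keys(keys, old_n, new_n):
    """Return the keys whose bucket changes when scaling from old_n to new_n."""
    return [k for k in keys
            if jump_consistent_hash(k, old_n) != jump_consistent_hash(k, new_n)]


if __name__ == "__main__":
    keys = range(10_000)
    moved = moved_keys(keys, 4, 5)
    # Every key that moves lands on the newly added instance (bucket 4).
    print(len(moved), all(jump_consistent_hash(k, 5) == 4 for k in moved))
```

When the instance count grows from n to n + 1, a key either stays where it was or moves to the new instance, so state only needs to migrate toward the added instance. That is the property the abstract relies on to limit state migration and blocking during scaling.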
Keywords/Search Tags: Flink, Real Time Data Warehouse, Elastic Scheduling, Dynamic Scaling