With the advent of the big data era, distributed computing engines have attracted growing attention. Apache Flink is an in-memory distributed computing engine that fully supports stream processing. It treats batch processing as a limiting case of stream processing and applies stream-processing concepts to batch workloads, which offers a new approach to data analysis. The traditional association rule algorithms Apriori, FP-Growth, and Eclat each have limitations, so choosing an appropriate association rule mining algorithm and improving it is one research focus of this thesis. Electric multiple units (EMUs) have accumulated a large amount of data in daily operation and maintenance. How to extract knowledge from these data to guide EMU operation and maintenance and improve reliability has become an urgent problem. This thesis improves the Eclat algorithm on the Flink platform and applies the improved algorithm to EMU fault association mining. The main work includes:

(1) A decision strategy based on comparing specific elements is proposed to judge quickly whether an intersection operation can yield a frequent itemset. With this criterion added to the Eclat algorithm, intersection operations that cannot produce frequent itemsets are skipped, which reduces the number of iterations and improves the efficiency of the algorithm. The algorithm is implemented both before and after the improvement, and comparative experiments on public data sets in the Flink local execution environment verify the effectiveness of the improved method.

(2) A data preprocessing method, field digitization, is proposed to convert the complex text in EMU data into simple positive integers and to record this one-to-one mapping. After field digitization, different field types correspond to different continuous intervals, so field types can be filtered by simple numerical comparison. Digitizing the data sets both reduces memory consumption during computation and improves the computational efficiency of the algorithm.

(3) A filtering strategy based on field digitization and the research purpose is proposed to filter out frequent itemsets that contain no fault information. By pruning the frequent itemsets, this strategy reduces the number of itemsets that enter the iterative intersection operations and improves the efficiency of the algorithm. The effectiveness of the improved method is verified by comparative experiments on the preprocessed EMU data.

(4) A Flink cluster in Flink on YARN mode is deployed to provide environment support for parallel processing of large-scale EMU data sets. Flink exposes a parallelism setting, and parallel execution is obtained by setting it to a value greater than 1. The parallelism is adjusted and the experiments are repeated to explore the relationship between parallelism and the computing efficiency of the platform. Map and Reduce functions are also written for the MapReduce platform to compare the computational efficiency of the two platforms under the same conditions.
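
The ideas in items (1) to (3) can be illustrated with a small single-machine Python sketch. It is not the thesis implementation and does not use Flink. The abstract does not state the exact skip criterion, so the end-element bound check in `may_be_frequent` is only one possible stand-in. The field types, the interval width, the sample records, and all function names are hypothetical, and the fault filter is applied here to the final result purely for simplicity.

```python
FIELD_TYPES = ("model", "component", "fault")  # hypothetical field types
INTERVAL = 1000  # hypothetical width of each field type's code interval


class Digitizer:
    """Assigns each distinct (field type, text) pair a positive integer code."""

    def __init__(self):
        self.code_of = {}  # (field_type, text) -> code
        self.text_of = {}  # code -> (field_type, text), the recorded mapping
        # Field type k owns the codes k*INTERVAL + 1 .. (k+1)*INTERVAL.
        self._next = {ft: k * INTERVAL + 1 for k, ft in enumerate(FIELD_TYPES)}

    def encode(self, field_type, text):
        key = (field_type, text)
        if key not in self.code_of:
            code = self._next[field_type]
            if code > (FIELD_TYPES.index(field_type) + 1) * INTERVAL:
                raise ValueError(f"interval for {field_type!r} is full")
            self._next[field_type] = code + 1
            self.code_of[key] = code
            self.text_of[code] = key
        return self.code_of[key]


def is_fault(code):
    """Field-type test by numeric comparison against the fault interval."""
    k = FIELD_TYPES.index("fault")
    return k * INTERVAL < code <= (k + 1) * INTERVAL


def may_be_frequent(a, b, min_support):
    """Assumed skip test on sorted tid lists: common tids must lie between
    the larger first element and the smaller last element."""
    if not a or not b:
        return False
    lo, hi = max(a[0], b[0]), min(a[-1], b[-1])
    return hi - lo + 1 >= min_support


def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending tid lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def eclat(level, min_support, out):
    """Depth-first Eclat over one equivalence class of (itemset, tid list)."""
    for i, (items_a, tids_a) in enumerate(level):
        out[items_a] = len(tids_a)
        next_level = []
        for items_b, tids_b in level[i + 1:]:
            if not may_be_frequent(tids_a, tids_b, min_support):
                continue  # intersection skipped without being computed
            tids = intersect_sorted(tids_a, tids_b)
            if len(tids) >= min_support:
                next_level.append((items_a + (items_b[-1],), tids))
        if next_level:
            eclat(next_level, min_support, out)


def mine(transactions, min_support):
    """Builds the vertical layout and returns {itemset: support}."""
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in sorted(set(items)):
            tidlists.setdefault(item, []).append(tid)
    roots = [((item,), tids) for item, tids in sorted(tidlists.items())
             if len(tids) >= min_support]
    out = {}
    eclat(roots, min_support, out)
    return out


# Invented sample records, for illustration only.
records = [
    {"model": "A", "component": "door", "fault": "F1"},
    {"model": "A", "component": "door", "fault": "F1"},
    {"model": "B", "component": "brake", "fault": "F2"},
    {"model": "A", "component": "door", "fault": "F1"},
    {"model": "B", "component": "door", "fault": "F2"},
]
digitizer = Digitizer()
transactions = [[digitizer.encode(ft, text) for ft, text in r.items()]
                for r in records]
frequent = mine(transactions, min_support=2)
# Keep only itemsets that contain at least one fault code.
with_fault = {s: n for s, n in frequent.items() if any(map(is_fault, s))}
```

The recorded mapping `digitizer.text_of` turns the integer codes in `with_fault` back into the original field text. Because the skip test is only a necessary condition, it never discards an itemset that the full intersection would have found frequent.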