Font Size: a A A

Implementation Of A Distributed Streaming Graph Partitioning System

Posted on:2020-03-25Degree:MasterType:Thesis
Country:ChinaCandidate:R Z LuoFull Text:PDF
GTID:2370330575955037Subject:Software engineering
Abstract/Summary:PDF Full Text Request
With the development of large-scale graph computing and graph storage,com-puting and storage capabilities of standalone machines are inadequate.Dividing graph data into multiple sub-graphs and distributed processing them is a common solution.To divide a graph is challenging.On the one hand,a good partitioning algorithm needs to strike a balance between reducing subgraph connectivity and balancing the size of each subgraphs.On the other hand,the graph partitioning system should also have strong scalability,reliability,availability,and performance.The graph partitioning problem can be formalized as a(k,v)balance graph par-titioning problem,which divides the graph G =(V,E)into k components whose size is no more than vn/k.Currently,common graph division algorithms can be classified into four categories according to online or offline,edge partition or vertex partition.A-mong them,the online edge partition algorithm is a popular research direction because of its good real-time performance and good performance on power-law degree graphs.Providing efficient and reliable engineering implementation for these algorithms is a valuable topic.We introduced a reliable distributed streaming graph partitioning system which consists of the streaming graph partitioning subsystem and graph storage subsystem.The first subsystem is designed based on the HoVerCut[1]algorithm proposed by Saj-jad in 2016.HoVerCut supports horizontal and vertical scaling through sub-partitioners,which synchronizes with each other through a global shared state.However,the al-gorithm has high communication latency.This thesis discusses the communication performance problems in the HoVerCut algorithm,and proposes three optimization measures to reduce the redundant communication scale and optimize the IO model.In addition,for improving scalability,a variety of databases can be used as back-end stor-age for the shared state.Because the shared state is the only way for sub-partitioners to synchronize with each other,so the shared state can cause single point of failure.We design and implement a graph storage subsystem based on Raft consensus algorithm to improve the reliability and availability of the shared state.The Bloom-filter algorithm is used to pre-screen the incoming edges.In order to further improve the performance of the shared state module,this thesis also adds a layer of cache to the graph stor-age subsystem.The cache is maintained by the cache aside pattern and provides fault recovery through the Raft layer.We implement a distributed streaming graph partitioning system by C++,and tests the effectiveness and efficiency of the system based on GoogleTest.It is verified by experiments that after using the optimized HoVerCut algorithm,our system can im-prove the partitioning speed without reducing the quality of the partitioning result.As the size of the dataset increases,running time of the optimized algorithm grows much slower than the original one,and processing time of each window in the optimization algorithm is close.With support from the graph storage subsystem,the shared state is much more reliable and available.The major contr:ibution of this thesis can be summarized as follows:· Discussed and optimized the original algorithm by reducing communication re-dundancy and improving IO model,thus improving performance of the algorith-m.· Aiming at the single point failure problem of shared state in HoVerCut algo-rithm,a graph storage subsystem is implemented based on the Raft consensus algorithm,which can improve the availability and reliability of shared state.A cache is maintained for some of the frequently accessed data,thereby reducing read requests from the shared state to further improve read efficiency.· Propose an edge filtering mechanism based on the shared state module,and intro-duce the Bloom-filter algorithm to pre-screen the input edges to further improve efficiency.
Keywords/Search Tags:Distributed system, Streaming graph partitioning, Graph database, Raft
PDF Full Text Request
Related items