Font Size: a A A

Hypergraph Parallel Computation Framework For Large-scale Complex Data Processing

Posted on:2015-04-22Degree:MasterType:Thesis
Country:ChinaCandidate:X Y QinFull Text:PDF
GTID:2180330452964156Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Nowadays, big data which is used to describe and defne the massive data generat-ed in the era of information explosion, has been mentioned more and more often. Withthe coming of the era of cloud computing, big data is able to demonstrate its power.In the past, people used to adopt the approach of sampling which only compute on asmall subset of data. Nowadays, thanks to the technique of big data, users can performcomputation on the the full data, which requires the support of an efective distributedcomputing environment essentially. However, previous research focus on the low-levelfeld which is hard to provide an consistent primitive, and forced user to write diferentlow-level programs for diferent applications.MapReduceabstractionsolvedthisissueefectivelybyprovidingsimpleprogram-ming model which allows users to focus on high-level application instead of low-levelprogram. However, as a result of this kind of simplicity, it’s hard to compute on com-plex data, users have to write a bunch of MapReduce jobs which result in the inef-ciency of execution. A new model is required which should support the representa-tion of complex data and provide simple programming model. Graph-based parallelcomputation frameworks solved this issue, it represents data by graph and provides ahigh-levelprogrammingmodelbyoperationsdefnedonthevertices. Atthesametime,the low-level optimizations of these frameworks include the graph representation andpartitioning etc. Representative of these frameworks includes Pregel and GraphLab.Pregel is based on BSP model which make use of the super-step abstraction, computa-tions in a super-step will not be executed until the last one has been performed. Thisstrategy may cause the issue that some computing nodes may become the bottleneck ofthe entire system. GraphLab provides the execution of synchronous and asynchronous, the algorithms which do not need the synchronization of super-step will be executedefectively. However, the graph-based framework are all based on simple graph whichmeans one edge in the graph can only connect two vertices. The drawback of thesemodels is that it’s hard to represent the group information which often exists in thesocial network, and will cause information loss.Thus, in order to solve the issues mentioned above, we propose a hypergraph-based parallel computation framework which named HGPC, the data are representedbyhypergraphwhichmeansanedgecalledhyperedgeinthisgraphcanconnecttwoandmore vertices. Hyperedge stands for the concept of group naturally. In this paper, wedescribethedatarepresentationinHGPC,andproposeasimpleprogrammodelnamedMAN.Additionally, wediscusstwohypergraphpartitioningapproacheswhichadoptedin HGPC, they are randomized greedy strategy and parallel multi-level partitioningalgorithm.InordertoverifytheefectivenessofHGPCframework, wefetcheddatafromsinaweibo. For the purpose of fetching data from social network like sina weibo efectivelyand fexibly, we also design and implement a scalable distributed crawling frameworknamed COLA. The experiments demonstrate the efectiveness and scalability of CO-LA.WedescribeandimplementthelinkpredictionalgorithminhypergraphsbasedonHGPC. The experiment which make use of the fetched data demonstrates that, HGPCis able to describe the algorithm based on hypergraph model sufciently, and executedthese algorithms efectively.
Keywords/Search Tags:big data, hypergraph, parallel computing, graph-based computation framework
PDF Full Text Request
Related items