Due to the explosion in data volumes and in the number of machines available to process them, data analytics frameworks have become crucial components in developing new technologies and generating new discoveries. Resource scheduling remains a key building block of modern data analytics frameworks. As data volumes increase, jobs from many users and applications, each consisting of many tasks, contend for the same pool of shared resources. Consequently, today's cluster schedulers must handle multiple resource types, consider jobs with complex structures, and allow job-specific constraints. Furthermore, schedulers must provide performance isolation between different users and groups through fair resource sharing, while ensuring performance and efficiency.

We first present the design of a multi-resource packing cluster scheduler called Tetris, which efficiently packs tasks onto machines based on their requirements across all resource types. Doing so avoids resource fragmentation as well as over-allocation of resources that are not explicitly allocated. Tetris combines packing heuristics with heuristics that improve average job completion time, and shows that achieving desired levels of fairness can coexist with improving cluster performance.

Given that users of data analytics jobs observe the outcome of performance isolation only when their jobs complete, and care less about instantaneous fair-share guarantees, we explore an altruistic, long-term scheduling approach called Carbyne, in which jobs yield fractions of their allocated resources without impacting their own completion times. The leftover resources donated through altruism provide an additional degree of freedom in cluster scheduling that can be used to further improve secondary objectives.

Although a variety of frameworks for expressing and running large-scale data analytics exist today, they share a common attribute: they are compute-centric in nature, so key details of job execution depend on the physical structure of the data-parallel computation. Driven by these observations, we introduce a fast and flexible data analytics framework called F2, which separates computation from data, making them equal first-class citizens. It then enables data-driven computation: changes to execution logic, degrees of parallelism, and task scheduling decisions are all triggered by the relevant intermediate data becoming available.
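
To make the packing idea concrete, below is a minimal Python sketch of a multi-resource alignment-score heuristic in the spirit of Tetris: each runnable task that fits on a machine is scored by the dot product of its demand vector and the machine's free-resource vector, with a small bias toward tasks from jobs with little remaining work to also help average job completion time. The class names and the `delay_weight` knob are illustrative assumptions, not Tetris's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    demand: tuple[float, ...]   # per-resource demand, e.g. (cpu, mem, disk, net)
    remaining_work: float       # estimated work left in the parent job

def fits(task: Task, available: tuple[float, ...]) -> bool:
    # A task may be placed only if no resource would be over-allocated.
    return all(d <= a for d, a in zip(task.demand, available))

def alignment_score(task: Task, available: tuple[float, ...]) -> float:
    # Dot product of demand and free resources: higher scores favor tasks
    # whose demands line up with what the machine has left, which reduces
    # fragmentation.
    return sum(d * a for d, a in zip(task.demand, available))

def pick_task(pending: list[Task], available: tuple[float, ...],
              delay_weight: float = 0.1) -> Task | None:
    # Combine the packing score with a shortest-remaining-work bias,
    # mirroring the trade-off between packing and average job completion
    # time; delay_weight is an illustrative knob, not a value from Tetris.
    candidates = [t for t in pending if fits(t, available)]
    if not candidates:
        return None
    return max(candidates,
               key=lambda t: alignment_score(t, available)
                             - delay_weight * t.remaining_work)
```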
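
The altruism idea can likewise be sketched, assuming a hypothetical `altruistic_share` helper: at each allocation step, a job keeps only the portion of its fair share it needs to preserve its current estimated completion time and donates the surplus as leftover, which the scheduler can then redistribute (for example, toward jobs closest to finishing) to improve secondary objectives such as average job completion time. This is purely illustrative and not Carbyne's actual interface.

```python
def altruistic_share(fair_share: dict[str, float],
                     needed_now: dict[str, float]) -> tuple[dict, dict]:
    # Split a job's instantaneous fair share into the part it keeps
    # (capped by what it needs right now to stay on schedule) and the
    # part it yields as leftover for the scheduler to redistribute.
    keep = {r: min(fair_share[r], needed_now.get(r, 0.0)) for r in fair_share}
    leftover = {r: fair_share[r] - keep[r] for r in fair_share}
    return keep, leftover

# Usage: a job entitled to 8 cores / 32 GB that only needs 2 cores / 8 GB
# to preserve its completion time donates 6 cores / 24 GB as leftover.
keep, leftover = altruistic_share({"cpu": 8, "mem": 32}, {"cpu": 2, "mem": 8})
```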
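
One ingredient of data-driven computation, choosing parallelism from the data rather than from a fixed physical plan, can be sketched as follows: the number of consumer tasks for a stage is decided only after an intermediate partition materializes and its size is known. `Partition` and `TARGET_BYTES_PER_TASK` are hypothetical names for illustration, not part of F2's API.

```python
from dataclasses import dataclass

TARGET_BYTES_PER_TASK = 256 * 1024 * 1024  # illustrative per-task input size

@dataclass
class Partition:
    size_bytes: int  # known only once the intermediate data is produced

def plan_consumers(partition: Partition) -> int:
    # Data-driven parallelism: pick the downstream task count from the
    # observed size of the materialized partition (ceiling division),
    # instead of fixing it up front in the compute-centric plan.
    return max(1, -(-partition.size_bytes // TARGET_BYTES_PER_TASK))
```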