Research And Implementation Of DSP Data Warehouse Optimization Based On Spark

Posted on:2018-11-19

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhang

Full Text:PDF

GTID:2348330515996690

Subject:Engineering

Abstract/Summary:

Modern society is shaped by the rapid development of computer and information technology. Driven by the "Internet plus" trend, industries are growing quickly and generating large volumes of Internet data across many fields. Data warehouses turn business data into information that supports decision-making at every level of an enterprise, and enterprise development is increasingly tied to that data, so there is an urgent need for optimization methods and technical support that help enterprises handle big data. Hadoop and Spark are currently the most popular big data computing frameworks, and most companies learn and adopt them to meet their business needs. Against this background, this thesis presents the research and design of a Spark-based data warehouse optimization for the DSP (Demand-Side Platform) advertising industry. After a rigorous analysis of the data warehouse workflow, and with the goal of improving data processing efficiency across the whole workflow, the optimization proceeds progressively along three aspects: framework workflow, data storage, and data processing. In the framework workflow, data sent from the data sources to Hadoop and Spark passes through Kafka, a high-throughput distributed publish-subscribe messaging system, which allows online and offline messages to be handled quickly and in a unified way. To address slow data storage, Spark Streaming reads and writes data through a combination of the open-source database HBase and HDFS (Hadoop Distributed File System), and connection partitioning speeds up data access. In data processing, a sampling-based aggregation algorithm is used to address data skew, where the amount of data assigned to each task is uneven. Comparative experiments show that, for ordinary or non-skewed data, the optimized data warehouse takes roughly 10% less time overall than the traditional data warehouse workflow, and that system throughput and storage performance improve. The thesis also proposes a new algorithm that aggregates data quickly and improves the execution efficiency of the whole data warehouse.
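The abstract does not spell out the sampling aggregation algorithm, so the following is only a minimal sketch of one common sampling-based approach to data skew, not the thesis's own method. It samples keys to estimate which ones dominate, spreads those hot keys over several salted sub-keys, aggregates partially, and then merges the partial results. The function names (`find_hot_keys`, `skew_aware_sum`), the thresholds, and the `ad_*` keys are all hypothetical, and plain Python stands in for Spark so the example runs on its own.

```python
import random
from collections import Counter, defaultdict


def find_hot_keys(records, sample_fraction=0.1, hot_share=0.2, seed=0):
    """Estimate which keys dominate the data by inspecting a random sample.

    A key is treated as hot when it makes up at least `hot_share` of the
    sampled records. Both thresholds are illustrative assumptions.
    """
    rng = random.Random(seed)
    sample = [key for key, _ in records if rng.random() < sample_fraction]
    if not sample:
        return set()
    counts = Counter(sample)
    return {key for key, n in counts.items() if n / len(sample) >= hot_share}


def skew_aware_sum(records, num_salts=4, **sampling_options):
    """Sum values per key, splitting hot keys over `num_salts` sub-keys."""
    hot = find_hot_keys(records, **sampling_options)

    # Stage 1: partial aggregation. Records of a hot key are spread over
    # several (key, salt) buckets, so no single bucket receives all of them.
    partial = defaultdict(int)
    for i, (key, value) in enumerate(records):
        salt = i % num_salts if key in hot else 0
        partial[(key, salt)] += value

    # Stage 2: drop the salt and merge the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)


# Hypothetical skewed input: one advertisement key dominates the records.
records = [("ad_1", 1)] * 900 + [("ad_2", 1)] * 60 + [("ad_3", 1)] * 40
totals = skew_aware_sum(records)
```

In a distributed setting, each `(key, salt)` bucket in stage 1 would correspond to work that can be placed on a different task, which is what evens out the task sizes; the final totals are the same as those of a plain per-key sum.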

Keywords/Search Tags:

data warehouse, Spark, Kafka, HBase, data skew

Related items

1	Research Of Data Skew On Spark Based On Improved Partition Method
2	Research On Partition Loading Balance Based On Spark Data Skew
3	Research On And Application Of The Solution For Spark Data Skew Scenarios
4	Research Of Performance Optimization For Data Skew Based On High-speed Networks
5	Spark Task Scheduling With Data Skew And Deadline Constraints
6	Research And Implementation On Anti-skew Spark Intermediate Data Partition Mechanism
7	The Research Of Log Processing Platform Based On Apache Kafka
8	Research On Data Skew Optimization In Spark Computing Framework
9	Research And Optimization Of Adaptive Techniques For Mitigating Skew In Spark
10	A Key-Value Skew Model Based Dynamic Data Partitioning Algorithm In Spark