| Modern society is shaped by the rapid development of computer and information technology. Driven by the "Internet Plus" trend, industries are growing quickly and generating large volumes of Internet data across many different fields. Data warehouses built on business data support decision-making at every level of an enterprise, and enterprise development is increasingly tied to that data, so optimization methods and technical support for handling large-scale data are urgently needed. Hadoop and Spark are currently the most popular big data computing frameworks, and most companies learn and use them to meet their business needs. Against this background, this paper presents the research and design of a Spark-based data warehouse optimization for the DSP (Demand-Side Platform) advertising industry. After a rigorous analysis of the data warehouse workflow, and with the aim of improving data processing efficiency across the whole workflow, we optimize progressively along three aspects: framework workflow, data storage, and data processing. In the data warehouse framework, when data is sent from the data sources to Hadoop and Spark, we use Kafka, a distributed publish-subscribe messaging system with high throughput, which allows online and offline messages to be handled quickly and uniformly. To address slow data storage, Spark Streaming reads and writes data through a combination of the open-source database HBase and HDFS (Hadoop Distributed File System), and partitioned connections speed up data access. In the data processing stage, a sampling aggregation algorithm is used to address data skew, in which the amount of data assigned to each task is uneven. Comparative experiments show that, for ordinary data or non-skewed data, the optimized data warehouse takes roughly 10% less time overall than the traditional data warehouse workflow, and it improves system throughput and storage performance. The proposed algorithm aggregates data quickly and improves the execution efficiency of the whole data warehouse. |
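
The abstract does not describe the sampling aggregation algorithm in detail, so the following is only a minimal sketch of one common way to combine sampling with aggregation to reduce skew, and it should not be read as the paper's method. It is written in plain Python rather than Spark so that it runs on its own. The function names `find_hot_keys` and `skew_aware_sum`, the thresholds, and the key names such as `ad_1` are all hypothetical. The idea is to sample the records to estimate which keys dominate, spread each dominant key over several salted sub-keys so that no single task receives all of its records, aggregate per sub-key, and then merge the partial results.

```python
import random
from collections import Counter, defaultdict


def find_hot_keys(records, sample_fraction=0.1, hot_share=0.2, seed=0):
    """Estimate skewed keys from a random sample of (key, value) records.

    A key is treated as hot when it accounts for at least `hot_share`
    of the sampled records. Both thresholds are illustrative assumptions.
    """
    rng = random.Random(seed)
    sample = [key for key, _ in records if rng.random() < sample_fraction]
    if not sample:
        return set()
    counts = Counter(sample)
    return {key for key, n in counts.items() if n / len(sample) >= hot_share}


def skew_aware_sum(records, hot_keys, num_salts=4, seed=0):
    """Sum values per key in two stages, salting only the hot keys."""
    rng = random.Random(seed)

    # Stage 1: partial sums per (key, salt). Hot keys are spread across
    # `num_salts` buckets; all other keys stay in a single bucket.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        partial[(key, salt)] += value

    # Stage 2: drop the salt and merge the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)
```

In a Spark job, the two stages would typically correspond to two successive keyed aggregations, with the salted key used for the first shuffle and the original key for the second. The sample fraction and the number of salts trade estimation cost against how evenly the hot keys are spread.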