| With the development of information technology, researchers must process and analyze massive amounts of raw data in the course of their work, which is a complicated and computation-intensive task. They therefore usually divide the entire processing into multiple steps, organize these steps according to the dependencies between data and data handlers, and assemble them into scientific workflows that run automatically or semi-automatically on computers. The execution of a scientific workflow generates massive amounts of intermediate data, which are usually large in scale and have complicated dependencies. Some important data are reused by researchers or shared among research institutions. The execution of scientific workflows therefore requires high-performance computing resources and massive storage resources. With the development of distributed technology, cloud computing, following grid and cluster computing, provides a new platform for scientific workflows. The cloud environment not only offers massive storage resources and high-performance computing resources but also makes it easier for scientists in different regions to collaborate on engineering projects. The intermediate data generated by a cloud scientific workflow can either be stored, which consumes storage resources, or deleted, in which case regenerating them on reuse consumes computing resources. How to improve the efficiency and intelligence of cloud scientific workflows and manage these intermediate data reasonably and effectively has therefore become a challenging problem. In addition, a cloud scientific workflow serves multiple research institutions or researchers at the same time, so multiple requested datasets may need to be regenerated simultaneously or within a short period of time. It is unreasonable to process these requests separately, so a multi-request-oriented data management
method is needed to improve the operational efficiency of cloud scientific workflows and reduce service costs. Based on the above analysis, this paper studies the data regeneration and storage optimization problem of cloud scientific workflows. The work of this paper can be summarized as follows: 1. Traditional data management methods are oriented to a single request, whereas in practice the system often processes multiple requests at the same time. To address the intermediate data storage problem in scientific workflows, a multi-request data regeneration strategy is proposed that regenerates multiple requested datasets at minimum computational cost. Based on this strategy, an optimization model of intermediate data storage in scientific workflows is constructed, and an enumeration algorithm and a genetic algorithm are designed to solve it. Experiments and comparisons show that the proposed multi-request data management method is effective. 2. During the research, we found that the calculation of the multi-request data regeneration cost involves repeated computation. To address this problem, the data regeneration method and the cost calculation process are analyzed. The optimality of the data regeneration method is first proved, and an improved method for calculating the multi-request data regeneration cost is then proposed. Experiments designed to verify the proposed method show that the improved calculation yields a lower and more accurate cost than the original multi-request method. 3. By analyzing the shortcomings of the traditional scientific workflow model based on the data dependency graph, a more flexible scientific workflow data management model based on the data flow graph is given. After
analysis, the scientific workflow model based on the data flow graph resolves the ambiguity and execution difficulties of traditional models. Experiments designed to verify the new model show that the scientific workflow data management model based on the data flow graph is more accurate. |
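
The repeated computation mentioned in contribution 2 can be illustrated with a minimal Python sketch. It is not the method developed in this paper: the graph `PREDS`, the costs `COST`, and all function names are hypothetical, and the cost model (a deleted dataset is regenerated from its nearest stored ancestors) is an assumption made only for illustration. The sketch compares summing per-request regeneration costs with counting each dataset that must be recomputed only once.

```python
from typing import Dict, List, Set

# Hypothetical data dependency graph: dataset -> direct predecessors.
PREDS: Dict[str, List[str]] = {
    "d1": [],
    "d2": ["d1"],
    "d3": ["d2"],
    "d4": ["d2"],
}

# Hypothetical cost of computing each dataset from its direct predecessors.
COST: Dict[str, float] = {"d1": 5.0, "d2": 8.0, "d3": 3.0, "d4": 4.0}


def datasets_to_regenerate(
    target: str, stored: Set[str], preds: Dict[str, List[str]]
) -> Set[str]:
    """Return the target plus every deleted ancestor that must be recomputed."""
    needed: Set[str] = set()
    stack = [target]
    while stack:
        d = stack.pop()
        if d in needed or d in stored:
            continue  # already counted, or available from storage
        needed.add(d)
        stack.extend(preds[d])
    return needed


def single_request_cost(
    target: str, stored: Set[str],
    preds: Dict[str, List[str]], cost: Dict[str, float],
) -> float:
    """Regeneration cost of one requested dataset (assumed cost model)."""
    return sum(cost[d] for d in datasets_to_regenerate(target, stored, preds))


def naive_multi_request_cost(
    targets: List[str], stored: Set[str],
    preds: Dict[str, List[str]], cost: Dict[str, float],
) -> float:
    """Handle each request separately; shared ancestors are counted repeatedly."""
    return sum(single_request_cost(t, stored, preds, cost) for t in targets)


def shared_multi_request_cost(
    targets: List[str], stored: Set[str],
    preds: Dict[str, List[str]], cost: Dict[str, float],
) -> float:
    """Handle the requests together; each dataset is counted once."""
    needed: Set[str] = set()
    for t in targets:
        needed |= datasets_to_regenerate(t, stored, preds)
    return sum(cost[d] for d in needed)


if __name__ == "__main__":
    stored = {"d1"}          # only d1 is kept in storage
    requests = ["d3", "d4"]  # both depend on the deleted dataset d2
    print(naive_multi_request_cost(requests, stored, PREDS, COST))   # 23.0
    print(shared_multi_request_cost(requests, stored, PREDS, COST))  # 15.0
```

In this toy example both requests depend on the deleted dataset `d2`, so the separate treatment counts its cost twice (23.0), while the joint treatment counts it once (15.0).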