
Design and Implementation of Large-Scale Model Training Support Platform

Posted on: 2023-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Ren
Full Text: PDF
GTID: 2558306845496154
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and of information technology, recommendation systems have emerged and gradually become an indispensable part of daily life. As recommendation systems evolve, their deep learning models keep improving and the amount of recommended content grows exponentially. Sparse parameters are a typical characteristic of recommendation business scenarios. Facing the dual challenges of rapidly growing content and increasingly complex business scale, almost every Internet company working on recommendation needs its own large-scale distributed training system for sparse-parameter deep learning models, both to support large-scale model training and to improve training performance. How to provide a better solution for distributed training of large-scale sparse-parameter models in recommendation scenarios, and how to optimize the process from offline training to online production serving, is therefore a meaningful research topic. The company where I worked as an intern operates in the recommendation field and has developed a large-scale model training platform for its recommendation business scenarios. The platform offers algorithm users large-scale distributed training of sparse-parameter deep learning models, shields them from the underlying development difficulties so that they can focus on training the model itself, and also provides task management and cluster management functions. It further provides a complete solution for the whole process from offline model training to online production serving, supporting both the company's distributed training of large-scale sparse-parameter models and the development of its recommendation system.
During project development, I surveyed the state of research on recommendation business scenarios and the relevant industry background, and studied the technologies behind existing industry solutions, including a TensorFlow-based sparse-domain isolation scheme with dynamic feature support, a distributed cluster training scheme on YARN, and solutions for online production serving. In the requirements analysis stage, in-depth research into the actual business scenarios determined the real needs of algorithm users. In the general design stage, a specific solution was given for each module of the platform, from the offline side to the online side. In the detailed design stage, the functions of each module were developed and coded using the technologies selected during general design. After development, functional and non-functional tests were carried out on each module. The platform is now running online in the company and works well in use. The development team can continue to upgrade and maintain the platform and develop new requirements based on user feedback and suggestions, so as to further improve it.
Keywords/Search Tags: Recommendation, Large-Scale Model, Distributed Training, Sparse Parameter