| In machine learning, many studies involve solving optimization problems, and stochastic gradient descent (SGD) is one of the most popular optimization algorithms. It still has some shortcomings, such as: the variance of its gradient estimates causes slow convergence; its step-size requirements are relatively strict; and it easily gets stuck at saddle points. Addressing these problems in order to keep improving such algorithms is the general trend. Most convergence analyses of SGD and its variants are carried out under convex or strongly convex assumptions. In fact, optimization problems in deep learning are often nonconvex, so this paper studies nonconvex optimization problems, discussing two common types of optimization problems in deep learning: stochastic optimization problems and finite-sum optimization problems. In recent years, researchers have proposed a series of variance reduction algorithms to improve SGD. They adopt specific forms of stochastic gradient estimation that reduce the variance of the stochastic gradient and accelerate the convergence of the algorithm. As an extension of variance reduction methods, the probabilistic gradient estimator (PAGE) uses a probabilistic switching rule, which avoids the double-loop structure common in variance reduction algorithms. In addition, the well-known momentum technique can improve the performance of algorithms in stochastic optimization. It incorporates past information into the current update through an exponentially decaying weighted moving average, which can also accelerate the convergence of the algorithm. Therefore, this paper combines probabilistic gradient estimation algorithms with momentum at two different positions. Momentum versions of the corresponding probabilistic gradient estimation algorithms are proposed, and their convergence analyses are carried out for the two types of optimization problems. Furthermore, we also discuss the stochastic gradient complexity for the finite-sum optimization problem. It is worth mentioning that the algorithms proposed in this article can be viewed as a framework: when specific parameters are selected within the framework, the algorithms degenerate into existing algorithms, and the convergence results are consistent with existing theory. In other words, those existing algorithms can be seen as special cases of the algorithms in this article. |
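For concreteness, the sketch below illustrates in Python one way a momentum version of a PAGE-style estimator can be organized on a toy finite-sum least-squares problem: with probability p the full gradient is recomputed, otherwise the previous estimate is updated recursively from a single sampled component, and the result is folded into an exponentially decaying weighted moving average before taking the step. The probability p, the momentum weight beta, the step size eta, and the particular position at which momentum is applied are illustrative assumptions, not the paper's exact algorithms.

```python
import numpy as np

# Illustrative sketch only: a PAGE-style probabilistic gradient estimator combined
# with an exponentially decaying momentum average, on a toy finite-sum problem
# f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.  The parameters p, beta, eta and
# the placement of the momentum average are assumptions made for illustration.

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def full_grad(x):
    # gradient of the finite-sum objective averaged over all n components
    return A.T @ (A @ x - b) / n

def sample_grad(x, i):
    # stochastic gradient of a single component i
    return A[i] * (A[i] @ x - b[i])

def page_momentum(eta=0.05, p=0.2, beta=0.9, T=500):
    x = np.zeros(d)
    x_prev = x.copy()
    g = full_grad(x)          # initialize the estimator with a full gradient
    v = g.copy()              # momentum buffer (exponential moving average)
    for _ in range(T):
        if rng.random() < p:
            # with probability p: recompute the full gradient
            g = full_grad(x)
        else:
            # with probability 1 - p: cheap recursive update using one sample
            i = rng.integers(n)
            g = g + sample_grad(x, i) - sample_grad(x_prev, i)
        v = beta * v + (1.0 - beta) * g   # exponentially decaying weighted average
        x_prev = x.copy()
        x = x - eta * v
    return x

x_hat = page_momentum()
print("final objective:", 0.5 * np.mean((A @ x_hat - b) ** 2))
```

In this sketch the probabilistic switch replaces the double-loop (epoch plus inner loop) structure of classical variance reduction methods with a single loop, and setting beta to 0 recovers a plain PAGE-style update, mirroring how the proposed framework degenerates into existing algorithms for particular parameter choices.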