Reinforcement learning is widely used for sequential decision-making problems. Traditional reinforcement learning stores historical experience in a dictionary that maps each observed state to the expected cumulative reward of taking each action in that state. When the model encounters a state again, it only needs to look up the action with the largest cumulative reward for that state in the dictionary. However, the discretization imposed by the dictionary gives the model poor generalization ability: it must learn essentially by enumeration, which incurs a high trial-and-error cost and low learning efficiency.

The objective function modeled by a Gaussian process is usually continuous; because the feedback at nearby states differs little, the points on the objective function are assumed to follow a Gaussian distribution. This gives the Gaussian process stronger generalization ability than a dictionary for storing empirical knowledge and reduces the number of points that must be explored. Reducing the number of probe points alone, however, is not enough to improve learning efficiency; we also need to decide where those sampling opportunities should be spent, so Bayesian optimization is used to select the most informative points. This paper therefore adopts a reinforcement learning model based on Gaussian processes and Bayesian optimization, called BO-GP-Q, which finds the maximum with as few samples as possible and thus addresses the problems of high trial-and-error cost and low learning efficiency.

However, the assumption of a stationary environment is very restrictive. Many real-world problems are non-stationary: market systems such as the stock market and the foreign exchange market may be affected by agent behavior and various other underlying conditions. A non-stationary environment changes over time, which is one of the most common situations in reinforcement learning. When a traditional reinforcement learning method is applied in a non-stationary environment, learning efficiency drops, because once the environment changes the knowledge already learned becomes useless and the agent must learn the new environment from scratch. The difficulty is that the model does not know when the environment changes; even if the environment returns to one it has already learned, the agent must learn it again, because the strategies learned in other environments interfere with its re-adaptation. At the same time, existing research on reinforcement learning in non-stationary environments either requires prior knowledge of the environment or, when it does not, targets only a specific type of non-stationary environment; a general method applicable to all non-stationary environments is lacking.

This paper therefore designs forgetting strategies that allow the reinforcement learning model to forget interfering information and adapt to a non-stationary environment, so that the optimal policy can be obtained quickly and the model is applicable to all non-stationary environments without prior knowledge. The BO-GP-Q model can solve reinforcement learning problems efficiently by finding the maximum with as few samples as possible, but by itself it cannot adapt to non-stationary environments. The forgetting strategies designed in this paper make BO-GP-Q adapt to all non-stationary environments; the resulting algorithm is general and is a first exploration of this research direction.
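To make the mechanism described above concrete, the sketch below shows how a Gaussian-process surrogate of the Q-function combined with a UCB acquisition rule could drive action selection: the agent samples randomly while its model is still empty, and afterwards picks the action whose posterior mean plus a multiple of the standard deviation is largest. The class name, its parameters (kappa, n_random_init), and the scalar state encoding are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of BO-GP-Q-style action selection (illustrative only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

class GPQAgent:
    def __init__(self, actions, kappa=2.0, n_random_init=5):
        self.actions = actions              # discrete action set
        self.kappa = kappa                  # UCB exploration weight (assumed)
        self.n_random_init = n_random_init  # random samples before the GP is used (assumed)
        self.X, self.y = [], []             # observed (state, action) pairs and returns
        self.gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                           normalize_y=True)

    def select_action(self, state):
        # While the model is still empty, sample actions at random.
        if len(self.X) < self.n_random_init:
            return np.random.choice(self.actions)
        # Otherwise pick the action with the largest UCB score (mean + kappa * std).
        # A scalar state is assumed here for simplicity.
        candidates = np.array([[state, a] for a in self.actions], dtype=float)
        mean, std = self.gp.predict(candidates, return_std=True)
        return self.actions[int(np.argmax(mean + self.kappa * std))]

    def update(self, state, action, ret):
        # Store the observed return and refit the GP surrogate of the Q-function.
        self.X.append([state, action])
        self.y.append(ret)
        self.gp.fit(np.array(self.X, dtype=float), np.array(self.y))
```

In a non-stationary environment, the pairs accumulated in self.X and self.y are exactly the stored experience that the forgetting strategies discussed next would operate on.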
Because this reinforcement learning model samples randomly while the model is still empty and then samples by UCB, which selects the points whose mean plus a multiple of the standard deviation is largest, the forgetting strategies are designed from the perspective of the mean and the standard deviation. Six conjectures are verified experimentally.

First, because of interference from useless knowledge learned in other environments, the performance of a non-forgetting reinforcement learning model in a non-stationary environment deteriorates as the number of experienced environments grows, and once every environment has been experienced the performance does not improve further.

Second, a completely forgetting model in a non-stationary environment generally starts from scratch in each environment and can learn a stable effect; there are also cases in which the model learned in the previous environment happens to adapt to the new environment.

Third, with time-based forgetting an optimal time window can be found such that, using only the data within this window, the model forgets the interfering information to the greatest extent and adapts quickly to every environment in the non-stationary setting. This strategy assumes, however, that each environment lasts long enough for a stable model to be learned from scratch; not every non-stationary problem satisfies this assumption, so the strategy has limitations.

Fourth, points with a large mean strongly influence which points the model selects, while points with a small mean indicate positions with poor returns. Forgetting the points with a large mean and keeping those with a small mean therefore quickly removes the interfering information of the non-stationary environment and, at the same time, avoids repeatedly exploring poor positions, reducing the learning cost.

Fifth, the policy suggested by points with a large standard deviation may be wrong, so forgetting those points reduces the interference in the model and allows it to adapt quickly to the changing environment.

Sixth, since forgetting points with a large mean or a large standard deviation both help the model adapt to a non-stationary environment, forgetting the points with a large value of mean plus a multiple of the standard deviation allows the model to obtain the optimal policy in a non-stationary environment quickly.

The efficiency of the different forgetting strategies is compared using model evaluation metrics. When the model complexity is too low the model underfits and cannot learn a stable policy, and when it is too high the model overfits. This paper therefore compares the effect of each forgetting strategy at three levels of model complexity (low, medium, and high) and obtains the optimal forgetting strategy at each level, so that the best forgetting strategy can be chosen for non-stationary environments ranging from simple to complex problems.
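As a rough illustration of how these forgetting strategies could be applied to the experience stored by the GP surrogate, the sketch below prunes the training set either by a time window or by the UCB score (mean plus a multiple of the standard deviation). The function names, the window size, and the fraction of points to forget are assumptions made for this example, not the paper's specification.

```python
# Minimal sketch of the forgetting strategies (illustrative only).
import numpy as np

def forget_by_time(X, y, t, window):
    """Keep only the observations whose timestamps fall inside the last `window` steps."""
    keep = t >= t.max() - window
    return X[keep], y[keep], t[keep]

def forget_by_ucb(X, y, gp, kappa=2.0, fraction=0.2):
    """Drop the `fraction` of stored points with the largest UCB score (mean + kappa * std).

    Setting kappa = 0 corresponds to forgetting points with a large mean only,
    while ranking by std alone corresponds to forgetting points with a large
    standard deviation.
    """
    mean, std = gp.predict(X, return_std=True)
    score = mean + kappa * std
    n_drop = int(fraction * len(X))
    keep = np.argsort(score)[:len(X) - n_drop]   # keep the points with the smallest scores
    return X[keep], y[keep]
```

After pruning, the GP surrogate would be refit on the surviving points before the next UCB sampling step, so that the forgotten observations no longer influence action selection.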