In the era of big data, the explosive growth of information on the Internet makes it difficult for users to filter out the information they are actually interested in. Recommender systems, as an information filtering technique, are often used to solve this problem. A recommender system mines a user's latent preferences from the user's historical behavior data and then recommends items that match those preferences. Traditional offline recommendation methods learn static recommendation models from users' historical offline data. These methods cannot achieve ideal results in scenarios where the item set or user set changes rapidly (e.g., news recommendation): the learned static model can neither handle the cold-start problem nor track changes in user preferences in real time. To address these limitations of offline recommender systems, research on online recommender systems has become increasingly extensive. An online recommender system can update its recommendation model in real time according to users' feedback on the recommended items, thereby improving recommendation quality. This thesis focuses on the online recommendation scenario and studies personalized recommendation algorithms in that setting. The specific research results of this thesis are as follows:

(1) Online recommendation needs to update the recommendation model in real time according to user feedback, and its interactive nature matches the reinforcement learning setting. The multi-armed bandit algorithm is a simple reinforcement learning method: it exploits the sequential interaction characteristic of reinforcement learning while avoiding the complexity and computational cost of other reinforcement learning algorithms, and has therefore become a research hotspot in online recommendation. This thesis introduces how to convert online
recommendation tasks into reinforcement learning tasks, surveys online recommendation methods based on multi-armed bandits, and summarizes the main research directions of bandit-based online recommendation together with the corresponding research progress.

(2) To deal with the lack of feedback on newly arrived users and on item popularity in online recommenders (the cold-start problem), this thesis proposes adaptive dynamic clustering bandit algorithms for online recommender systems, ADCB and ADCB+, based on adaptively splitting and merging clusters. They incrementally perform both user-level re-assignment and cluster-level readjustment in each recommendation round to efficiently and effectively learn individual preferences and their clustering structure. In particular, the proposed ADCB+ method further exploits both the accumulated cluster preference parameters and each individual's personalized features through an adaptive weighting of the two influences according to the number of user interactions. Experiments on three real datasets consistently show that the proposed ADCB and ADCB+ schemes outperform several existing dynamic-clustering-based online recommendation methods.

(3) Most existing bandit algorithms impose a fixed assumption on the recommendation environment: both user preferences and the dependencies between users are static over time. In reality, however, this assumption hardly holds, because users' interests and dependencies keep changing, which inevitably leads to suboptimal performance of recommender systems in practice. This thesis proposes NCUCB, an online recommendation algorithm based on non-stationary contextual bandits. It addresses online recommendation in a non-stationary environment, that is, in scenarios where user interests vary over time. The algorithm maintains a globally shared pool of bandit models, and as the system interacts with users it additionally creates, updates, and deletes the
global bandit model according to each user's interaction state. The system detects whether a user's interests have changed using a change-point detector that estimates confidence from the user's current rewards. At the same time, a model selector is maintained to choose an appropriate bandit model to serve each user, considering both how well the model matches the user's recent historical data and the model's popularity among all users. The effectiveness of the proposed algorithm is verified on three real datasets.
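The bandit formulation underlying all three contributions treats items as arms and user click feedback as rewards. As a minimal, self-contained sketch of that framing (a plain UCB1 loop, not an implementation of the thesis's ADCB or NCUCB algorithms; the Bernoulli click model and the `arm_means` parameter are illustrative assumptions):

```python
import math
import random

def ucb1_recommend(n_rounds, arm_means, alpha=2.0, seed=0):
    """Minimal UCB1 loop: arms play the role of items, and a Bernoulli
    reward simulates a user's click/no-click feedback on each round."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # times each item was recommended
    totals = [0.0] * n_arms  # accumulated click feedback per item
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1      # recommend each item once to initialize
        else:
            # Pick the item with the highest upper confidence bound:
            # empirical mean plus an exploration bonus that shrinks
            # as an item gathers more feedback.
            arm = max(range(n_arms),
                      key=lambda a: totals[a] / counts[a]
                      + math.sqrt(alpha * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1     # online update from the user's feedback
        totals[arm] += reward
    return counts

counts = ucb1_recommend(5000, [0.2, 0.5, 0.8])
# The item with the highest click probability ends up recommended most often.
```

The thesis's methods extend this basic loop with, respectively, adaptive user clustering (ADCB/ADCB+) and change-point detection over a pool of models (NCUCB).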