With the rapid development of communication technology, the contradiction between the demands of massive numbers of users and increasingly scarce communication resources has become more and more prominent. Device-to-device (D2D) communication is a direct-connection technology that effectively relieves the load on the base station. In a D2D system, however, unreasonable waveform parameter decisions usually lead to wasted resources, reduced system throughput, and even communication interruption. Waveform parameter decision improves the communication performance of the system by reasonably adjusting the waveform parameters of the wireless signal to adapt to different channels. Existing waveform parameter decision methods, however, still suffer from a large amount of computation, few decision parameters, and poor generalization ability. Reinforcement learning learns by interacting directly with the environment, without prior knowledge, and is therefore well suited to decision-making problems. This thesis studies reinforcement-learning-based waveform parameter decision algorithms for D2D communication systems. The specific research contents are as follows:

Firstly, a new reinforcement-learning-based waveform parameter decision model for D2D systems is proposed. Compared with the traditional waveform parameter decision model, the new model covers more parameters and has stronger generalization ability. To address the slow convergence of decisions on D2D users' access frequency and transmission power, a distributed decision algorithm based on Actor-Critic (AC), named M-AC, is proposed, in which each user is assigned two neural networks. Simulation results show that the M-AC algorithm effectively improves system throughput and converges faster. In the AC network, however, the policy gradient is updated from the joint-action reward, ignoring the contribution of each individual user's action, which can leave system throughput low. The AC algorithm is therefore further improved by introducing credit assignment, so that each user's individual action reward value is taken into account. Simulation results show that the improved C-AC algorithm further increases system throughput.

Secondly, aiming at the need for channel estimation and the poor generalization ability of traditional modulation mode and coding rate decision, a decision algorithm based on Q-learning and Sarsa(λ) is proposed. Compared with the traditional Adaptive Modulation and Coding (AMC) technique, the proposed algorithm does not need to estimate the channel: it makes parameter decisions directly from the actual system throughput and can adapt to different channel environments. To address the redundant fluctuation and slow convergence in the decision process caused by the large exploration action space, the algorithm is improved by dynamically shrinking the action space. Simulation results show that the improved algorithm has a smaller mean square error, that the improved Sarsa(λ) algorithm converges faster than Q-learning, and that the resulting system throughput exceeds that of the Modulation and Coding Scheme (MCS) index table. Finally, to address the large initial mean square error caused by the system's random initial state, the improved Sarsa(0.1) algorithm is further combined with the MCS index table. Experimental results show that the initial mean square error of the optimized algorithm is effectively reduced and convergence is faster.
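The per-user Actor-Critic structure described above can be sketched minimally in Python. This is an illustrative toy, not the thesis's implementation: it uses a tabular softmax actor and a TD(0) critic in place of the two neural networks, and a hypothetical one-state, two-action "frequency choice" game where two users share a joint reward, as in M-AC.

```python
import numpy as np

rng = np.random.default_rng(0)

class UserAC:
    """One D2D user's actor-critic pair: a softmax actor over actions and a
    critic estimating the state value (tabular stand-ins for the two networks)."""
    def __init__(self, n_states, n_actions, alpha_a=0.05, alpha_c=0.1, gamma=0.9):
        self.theta = np.zeros((n_states, n_actions))  # actor preferences
        self.w = np.zeros(n_states)                   # critic state values
        self.alpha_a, self.alpha_c, self.gamma = alpha_a, alpha_c, gamma

    def act(self, s):
        prefs = self.theta[s]
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs)), probs

    def update(self, s, a, r, s_next, probs):
        td_error = r + self.gamma * self.w[s_next] - self.w[s]  # critic TD error
        self.w[s] += self.alpha_c * td_error                    # critic update
        grad = -probs
        grad[a] += 1.0                                          # grad of log-softmax
        self.theta[s] += self.alpha_a * td_error * grad         # actor policy-gradient step

# Toy joint-reward game (an assumption of this sketch): two users pick one of
# two frequencies; the shared reward is higher when they avoid colliding.
users = [UserAC(1, 2), UserAC(1, 2)]
for _ in range(3000):
    acts, probs = zip(*(u.act(0) for u in users))
    r = 1.0 if acts[0] != acts[1] else 0.2      # joint-action reward (M-AC style)
    for u, a, p in zip(users, acts, probs):
        u.update(0, a, r, 0, p)
print([u.act(0)[1].round(2) for u in users])
```

Replacing the shared reward `r` with a per-user reward term is, in this sketch, where the credit-assignment idea of C-AC would enter: each user's gradient step then reflects its own action's contribution rather than the joint outcome alone.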
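The Sarsa(λ)-based modulation-and-coding decision, including the dynamic action-space reduction, can be sketched as follows. The six-entry rate table, the logistic success model, and the shrink-after-N-steps rule are illustrative assumptions standing in for the thesis's simulation; the key property from the abstract is preserved: the learner sees only the realized throughput, never an explicit channel estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical link model: six MCS options with increasing rate but an
# increasing SNR requirement (values are illustrative, not the thesis's).
RATES = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0])
REQ_SNR = np.array([2.0, 5.0, 8.0, 12.0, 16.0, 20.0])
SNR = 12.0

def send(a):
    """Realized throughput of one transmission with MCS index a."""
    p_ok = 1.0 / (1.0 + np.exp(REQ_SNR[a] - SNR))  # success prob. vs. SNR margin
    return RATES[a] if rng.random() < p_ok else 0.0

def sarsa_lambda(steps=6000, alpha=0.1, gamma=0.5, lam=0.1, eps=0.1, shrink_at=3000):
    n = len(RATES)
    Q, e = np.zeros(n), np.zeros(n)
    allowed = np.arange(n)                          # current (possibly reduced) action space

    def pick():
        if rng.random() < eps:
            return int(rng.choice(allowed))
        return int(allowed[np.argmax(Q[allowed])])

    a = pick()
    for t in range(steps):
        r = send(a)                                 # reward = actual throughput
        a_next = pick()
        delta = r + gamma * Q[a_next] - Q[a]        # Sarsa TD error
        e[a] += 1.0                                 # accumulating eligibility trace
        Q += alpha * delta * e
        e *= gamma * lam
        if t == shrink_at:                          # dynamic action-space reduction:
            best = int(np.argmax(Q))                # keep only the neighbourhood of
            allowed = np.arange(max(0, best - 1), min(n, best + 2))  # the current best MCS
        a = a_next
    return Q

Q = sarsa_lambda()
print(int(np.argmax(Q)))   # greedy MCS index after learning
```

Shrinking `allowed` around the current best index is one simple way to realize the "dynamically reduce the action space" improvement: it removes clearly inferior MCS indices from exploration, which is what curbs the redundant fluctuation the abstract refers to.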
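The final optimization, combining the learner with the MCS index table, amounts to warm-starting the Q-table so that the very first greedy decisions follow the table's recommendation instead of a random initial state, which is what reduces the initial mean square error. A sketch under a hypothetical four-bucket table (the mapping and rate values are illustrative):

```python
import numpy as np

# Hypothetical MCS index table: measured SNR bucket -> suggested MCS index.
MCS_TABLE = {0: 0, 1: 1, 2: 3, 3: 5}
RATES = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0])

def warm_start_q(n_states, n_actions):
    """Initialize the Q-table so that, before any learning, the greedy
    action in each SNR bucket is the table-recommended MCS index."""
    Q = np.zeros((n_states, n_actions))
    for s, a in MCS_TABLE.items():
        Q[s, a] = RATES[a]          # small optimistic bump for the table's pick
    return Q

Q = warm_start_q(4, 6)
print([int(np.argmax(Q[s])) for s in range(4)])  # → [0, 1, 3, 5]
```

Learning then proceeds as before, so the agent can still move away from the table's suggestion when the actual throughput says otherwise; the table only fixes the starting point.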