| As the cyberspace security situation grows increasingly severe, efficiently discovering vulnerabilities and locating potential threats in programs as early as possible has become a major problem for security researchers. As a mainstream technology in vulnerability discovery, fuzzing has found a large number of vulnerabilities in various fields over the past decades, proving its high practical value. However, traditional fuzzing mutates samples largely blindly, which generates a great many invalid mutated samples and seriously reduces the efficiency of fuzz testing. Our investigation found that existing work using reinforcement learning to improve traditional fuzzing relies solely on value-based reinforcement learning algorithms to guide the direction of mutation, which makes it difficult to handle the high-dimensional state and action spaces that arise in fuzz testing. Against this background, this paper proposes a method that improves traditional fuzzing with the deep deterministic policy gradient (DDPG) reinforcement learning algorithm. It first models traditional fuzzing as a Markov decision process: the program's input sample is the environment state, the mutation function is the action, and code coverage is the reward. It then solves this process with DDPG, which combines value-based and policy-based learning, to obtain an optimal action selection policy. The learned policy selects mutation actions intelligently based on the current input sample, mitigating the blindness of traditional mutation and favoring mutated samples with the largest code coverage reward, which reduces the number of invalid samples and ultimately improves the efficiency of traditional fuzzing. We have designed and implemented a general-purpose fuzz testing system based on reinforcement learning, called RLFUZZ. Experimental results on the LAVA-M dataset show that, compared with the random mutation of traditional fuzz testing and with mutation guided by value-based reinforcement learning algorithms such as DQN in existing work, mutation guided by DDPG performs better in terms of code coverage and the distribution of effective samples. We also tested RLFUZZ on real software and successfully found new vulnerabilities, which preliminarily verifies the validity and feasibility of this research. |
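
The Markov decision process formulation described above (input sample as state, mutation function as action, code coverage as reward) can be illustrated with a minimal sketch. This is not the RLFUZZ implementation: the mutation operators, the `toy_coverage` stand-in for real instrumentation, and the `FuzzEnv` class are all hypothetical names introduced here for illustration, and a random policy is used as a placeholder where a DDPG agent would choose the action.

```python
import random
from typing import Callable, List, Set, Tuple

# Hypothetical mutation operators forming the action set (illustrative only).
def flip_bit(data: bytes, rng: random.Random) -> bytes:
    if not data:
        return data
    buf = bytearray(data)
    i = rng.randrange(len(buf))
    buf[i] ^= 1 << rng.randrange(8)
    return bytes(buf)

def insert_byte(data: bytes, rng: random.Random) -> bytes:
    buf = bytearray(data)
    buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))
    return bytes(buf)

def delete_byte(data: bytes, rng: random.Random) -> bytes:
    if len(data) <= 1:
        return data
    buf = bytearray(data)
    del buf[rng.randrange(len(buf))]
    return bytes(buf)

MUTATIONS: List[Callable[[bytes, random.Random], bytes]] = [
    flip_bit, insert_byte, delete_byte,
]

TOTAL_BRANCHES = 4

def toy_coverage(data: bytes) -> Set[int]:
    """Assumed stand-in for coverage instrumentation: IDs of 'branches' hit."""
    hit = {0}
    if len(data) > 0 and data[0] == ord("F"):
        hit.add(1)
        if len(data) > 1 and data[1] == ord("U"):
            hit.add(2)
    if len(data) > 4:
        hit.add(3)
    return hit

class FuzzEnv:
    """State = current input sample, action = mutation index, reward = coverage."""

    def __init__(self, seed_input: bytes, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.seed_input = seed_input
        self.state = seed_input

    def reset(self) -> bytes:
        self.state = self.seed_input
        return self.state

    def step(self, action: int) -> Tuple[bytes, float]:
        # Apply the chosen mutation, then reward the fraction of branches covered.
        self.state = MUTATIONS[action](self.state, self.rng)
        reward = len(toy_coverage(self.state)) / TOTAL_BRANCHES
        return self.state, reward

env = FuzzEnv(b"FUZZ")
state = env.reset()
for _ in range(5):
    action = env.rng.randrange(len(MUTATIONS))  # placeholder for a learned policy
    state, reward = env.step(action)
```

In a learning setup, the random choice of `action` would be replaced by the output of the trained policy, and the toy coverage function by coverage measured on the instrumented target program.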