| The Internet’s rapid growth has driven a dramatic increase in the number of network applications and software systems, and with it an immense increase in the amount of code needed to build them. Code comments are an important part of programming work: they help engineers quickly understand what code means and make software maintenance and extension more efficient. Nevertheless, some code still lacks comments, or its comments are unclear. Automatic code annotation methods are used to address this problem. These methods can be divided into information-retrieval-based, template-based, and neural-network-based approaches. In recent years, neural network methods have gradually become mainstream because of their high efficiency and good performance. Automatic code annotation faces two main issues: the differences between code and natural language, and out-of-vocabulary (OOV) words in the code. This thesis proposes two new methods that use an improved Encoder-Decoder framework to annotate code automatically. They address both issues by narrowing the gap between code and natural language and by reducing the impact of OOV words on code annotation, thereby improving annotation accuracy. The main work of this thesis is as follows: (1) To build a model with a stronger understanding of source code and annotations, an automatic code annotation model based on the Encoder-Decoder framework is proposed from the perspective of extracting general representations of the input. The model makes full use of the feature extraction ability of pre-trained models: CodeBERT, pre-trained on massive code corpora, is selected as the encoder, and a 6-layer Transformer decoder is selected as the decoder. The source code is fed into the encoder to obtain semantic vectors, which the decoder decodes into the predicted annotation sequence. At the same time, to address the differences between code and natural language, parameter sharing is used to narrow the
word-vector representation gap between source code and annotations. (2) To construct a model that generalizes better, the model is extended. In practical applications, differences in programmers' coding habits and the OOV phenomenon in code place higher demands on the model's generalization ability. The model must not only extract high-quality feature representations but also perform well in new application environments. This thesis therefore extends the model by adding adversarial training to the encoder as a means of regularization. Specifically, after the source code passes through the embedding layer, an appropriate perturbation is added to the resulting embeddings, which reduces overfitting to the training data, enhances generalization ability, and further improves the robustness of the model. (3) The effectiveness of the model is verified on a publicly available dataset. Comparative experiments were designed on this dataset, with BLEU, ROUGE, and METEOR selected as evaluation metrics for comparison with baseline and improved methods. The results show that the proposed model has a stronger feature extraction ability and that the improved model, which uses adversarial training as a means of regularization, is more robust. The final experimental results verify the effectiveness and reliability of the proposed method. With the proposed model, source code can be annotated automatically and efficiently when annotations are missing or of poor quality, effectively improving programmers’ understanding of the code and their programming efficiency. |
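
The abstract says that a perturbation is added to the embeddings produced by the embedding layer, but it does not say how the perturbation is computed. The sketch below illustrates one common choice, a perturbation along the L2-normalized gradient of the loss with respect to the embedding (in the style of the Fast Gradient Method). This choice, the function names `fgm_perturbation` and `perturb_embedding`, the value of `epsilon`, and the toy vectors are all assumptions made for illustration and are not taken from the thesis.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return r = epsilon * g / ||g||_2 for a flat gradient vector.

    Assumption: an FGM-style perturbation; the thesis may use another scheme.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # No gradient signal: leave the embedding unperturbed.
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]


def perturb_embedding(embedding, grad, epsilon=1.0):
    """Add the perturbation to one embedding vector, element by element."""
    r = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, r)]


# Toy values standing in for one token embedding and the gradient of the
# training loss with respect to that embedding (both hypothetical).
embedding = [0.5, -1.0, 2.0]
grad = [3.0, 0.0, 4.0]

adversarial = perturb_embedding(embedding, grad, epsilon=0.1)
print(adversarial)
```

In a training loop of this kind, the model would typically be run once on the clean embeddings to obtain the gradient and once more on the perturbed embeddings, with both losses contributing to the parameter update. The perturbation norm `epsilon` controls how strong the regularization is.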
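
BLEU, one of the evaluation metrics named in the abstract, is built on clipped n-gram precision between a generated annotation and a reference annotation. The sketch below shows only that building block, under the assumption of a single reference and whitespace tokenization. It is not the full BLEU score (it omits the brevity penalty and the geometric mean over n-gram orders), and the helper names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Hypothetical reference comment and model output for one code snippet.
reference = "return the sum of two numbers".split()
candidate = "return the sum of numbers".split()

p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
print(p1, p2)
```

For reported results, an established implementation of BLEU, ROUGE, and METEOR is preferable to a hand-written one, since tokenization and smoothing choices noticeably affect the scores.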