Research On Encoder-Decoder Based On Auxiliary Syntax Information For Code Comment Generation

Posted on: 2022-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Liang
GTID: 2518306539969309
Subject: Computer Science and Technology

Abstract/Summary:
Code comment generation is the task of automatically generating natural language comments that describe the function or semantics of a given piece of source code. In today's Internet era, where the amount of code is increasing rapidly, code comments are particularly important for program maintenance. Writing complicated code while also keeping the corresponding comments up to date is a heavy workload for developers. Code comment generation can reduce the burden of editing comments, improve developer efficiency, and has broad application prospects.

The Encoder-Decoder framework is the mainstream framework in existing work. In this framework, the encoder encodes the source code into an implicit representation, and the decoder decodes the code comment from the representation provided by the encoder. However, existing work still faces the following challenges:
(1) It is difficult to obtain a representation of the source code that contains rich semantics.
(2) It is challenging to extract the complex and changeable key information in the source code. Existing work mainly relies on a copying mechanism based on pointer networks. However, because the grammatical structure of source code is complex and changeable, the efficiency and accuracy of the copying mechanism are limited.
(3) The evaluation metrics used in the training phase and the test phase are inconsistent. This inconsistency biases model training and affects the final performance of the model.

Our work starts from the encoder-decoder framework. The research contents and innovations include:
(1) To obtain a source code encoding with richer semantics, we introduce the syntax information of the source code in the encoding stage and propose a Syntax-Associated Encoder. In addition to text and structure information, source code also contains its own syntax information. To mine and introduce this information, the Syntax-Associated Encoder uses the syntax types of the nodes in the abstract syntax tree of the source code. The encoder is based on Tree-LSTM and encodes the abstract syntax tree according to the syntax type of each node, which enhances the distinction between subtrees so that the resulting encoding of the abstract syntax tree contains richer semantic information.
(2) To improve the efficiency and accuracy of extracting key information from the source code input, we propose a Syntax-Restricted Decoder. During copying, the decoder applies a syntax-information-based node selection strategy and a time-window-based copy-decaying generation strategy. The node selection strategy filters out irrelevant candidate nodes according to the syntax type of the abstract syntax tree nodes, which reduces the computational cost and the interference of redundant nodes and improves the efficiency and accuracy of copying. The copy-decaying generation strategy applies a probabilistic penalty to nodes that have been copied within a certain time window, which inhibits frequent copying of the same node, increases the chance that other candidate nodes are selected, and improves copying accuracy.
(3) To address both the lack of labels in the intermediate stage caused by introducing the copying mechanism and the inconsistency of evaluation metrics between the training phase and the test phase, we propose a hierarchical reinforcement learning method. The method uses a unified reward signal as feedback for the actions sampled in both stages, so that labels for the intermediate stage are not needed during training, and we theoretically demonstrate the effectiveness of this design. In addition, the evaluation metrics used in the test phase can be introduced into the training phase as the reward signal, which keeps the metrics in the training phase and the test phase consistent.

We demonstrate the effectiveness of the model through an analysis of experimental results on four programming datasets: WikiSQL, CoNaLa, Django, and ATIS. We then analyze the two strategies of the decoder in more detail. Finally, to further evaluate the contribution of each part of the model to the overall performance, we conduct an ablation study on the development set of the WikiSQL dataset. In addition, we conduct a case study that examines the effectiveness of the model on samples it generates.
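The two copying strategies of the Syntax-Restricted Decoder can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact form of the penalty, the window length, or the set of syntax types that may be copied, so the function names, the `allowed_types` set, the subtractive `penalty` on the copy score, and the `window` size below are all assumptions made for illustration.

```python
import math
from collections import deque


def copy_distribution(scores, node_types, allowed_types, recent, penalty=2.0):
    """Softmax over AST nodes after syntax filtering and copy decay.

    scores        -- raw copy score for each AST node
    node_types    -- syntax type of each AST node
    allowed_types -- syntax types eligible for copying (assumed set)
    recent        -- indices of nodes copied inside the current time window
    penalty       -- amount subtracted per recent copy (assumed form)
    """
    adjusted = []
    for i, (score, node_type) in enumerate(zip(scores, node_types)):
        if node_type not in allowed_types:
            adjusted.append(None)  # filtered out by syntax type
        else:
            adjusted.append(score - penalty * recent.count(i))
    kept = [a for a in adjusted if a is not None]
    if not kept:
        raise ValueError("no candidate node has an allowed syntax type")
    top = max(kept)  # subtract the max for numerical stability
    exps = [0.0 if a is None else math.exp(a - top) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_copy(scores_per_step, node_types, allowed_types, window=2, penalty=2.0):
    """Pick one node per step; nodes copied in the last `window` steps are penalized."""
    recent = deque(maxlen=window)  # old copies fall out of the window automatically
    chosen = []
    for scores in scores_per_step:
        probs = copy_distribution(scores, node_types, allowed_types, recent, penalty)
        best = max(range(len(probs)), key=probs.__getitem__)
        chosen.append(best)
        recent.append(best)
    return chosen


node_types = ["Name", "Keyword", "Str", "Name"]
scores = [3.0, 5.0, 2.5, 1.2]
print(greedy_copy([scores] * 4, node_types, {"Name", "Str"}))
```

In this toy input the highest-scoring node has type `Keyword` and is never a candidate, and with the decay switched off (`penalty=0.0`) the same node would be copied at every step.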
Keywords/Search Tags: code comment generation, syntax information, encoder-decoder, copying mechanism, reinforcement learning