
Pre-training For Program Understanding And Generation

Posted on: 2024-02-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D Y Guo
Full Text: PDF
GTID: 1528307349985539
Subject: Computer Science and Technology

Abstract/Summary:
Program understanding and generation is a critical area of research within artificial intelligence and software engineering, which aims to enable computers to automatically analyze, process, and generate programs to assist humans in completing complex tasks. In recent years, significant advancements in pre-training techniques have been achieved in the field of program understanding and generation. These techniques can help programmers reduce coding errors, improve code quality, and increase development efficiency. Additionally, they can empower individuals to perform complex programming tasks using natural language, lowering the learning curve and enhancing work efficiency. Therefore, research on program understanding and generation methods has significant application prospects and societal value.

However, some problems remain to be solved in current pre-trained models for program understanding and generation. In terms of data, existing code pre-trained models have not yet taken full advantage of the rich information provided by code structures such as data flow graphs and abstract syntax trees, nor have they fully considered the external relationships between code fragments, such as function calls and code reuse. This makes it difficult for the models to accurately understand and generate programs. In terms of models, it is essential to build a unified model that adapts to different task requirements to reduce costs, and to further optimize the computational complexity of the model to improve efficiency, thus better supporting applications in real-world scenarios. To address these issues, this paper conducts research in the following four aspects to improve the performance and efficiency of code pre-trained models in program understanding and generation:

1. To address the insufficient utilization of code structures, this paper proposes a code-structure-based program understanding method. This method first serializes the graph-structured data flow, while using graph-guided attention mechanisms to preserve its structural information, enabling the model to understand and capture the information of the graph structure from the serialized data flow. In addition, this paper introduces two graph-aware pre-training tasks, edge prediction and node alignment, to further learn code semantics from a large amount of graph-structured data, thereby improving model performance. Experimental results show that exploiting code structure information and graph-aware pre-training tasks enhances pre-trained models in program understanding, achieving better performance on several program understanding tasks.

2. To address the insufficient handling of code relatedness, this paper proposes a retrieval-enhanced program generation method. This method assists the model in generating code by retrieving related data from a database based on the code context. To this end, this paper proposes two retrieval methods, based on a variational auto-encoder and on contrastive learning, for different scenarios, and employs a meta-learning approach to use the retrieved data to assist code generation. Experimental results show that the retrieval-enhanced code generation method makes more effective use of related code data, improves the quality of generated code, and achieves better performance on several program generation tasks.

3. To address the poor generality of model structures, this paper proposes a unified code pre-trained model. The model uses a masked attention matrix to control its behavior and is pre-trained under various tasks to support different types of downstream tasks. Besides, in order to enhance the model's ability to understand and generate programs, this paper introduces code comments based on natural language and code structures based on abstract syntax trees during pre-training, encouraging the model to learn the semantic and syntactic information of code. The model is evaluated on nine datasets for program understanding and generation tasks and achieves better performance on most of them.

4. To address the high computational complexity, this paper proposes an efficient optimization method. This method leverages the dependencies between code fragments and introduces four sparse attention patterns to optimize the computation of existing code pre-trained models. These attention patterns reduce the model's computational complexity from quadratic to linear, effectively decreasing memory consumption and improving inference speed. Experimental results show that the optimized model achieves better performance and efficiency on long code understanding and generation tasks.
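To illustrate how a sparse attention pattern can reduce the attended pairs from quadratic to linear in the sequence length, the sketch below builds a simple sliding-window attention mask. This is a hypothetical illustration of one common sparse pattern, not the dissertation's implementation: the function name, the window parameter `w`, and the choice of pattern are assumptions (the dissertation describes four patterns derived from code dependencies).

```python
import numpy as np

def sliding_window_mask(n, w):
    """Build an n-by-n sparse attention mask (1 = may attend, 0 = blocked).

    Each token attends only to tokens within w positions on either side,
    so the number of attended pairs is at most n * (2*w + 1): linear in n,
    instead of the n*n pairs of full (quadratic) attention.

    Illustrative sketch only; names and the pattern itself are assumptions.
    """
    positions = np.arange(n)
    # |i - j| <= w keeps a diagonal band of width 2*w + 1
    return (np.abs(positions[:, None] - positions[None, :]) <= w).astype(int)

mask = sliding_window_mask(6, 1)
# Each row has at most 2*w + 1 = 3 ones, so memory and compute for the
# masked attention grow as O(n * w) rather than O(n^2).
```

In practice such a mask would be combined with the attention scores (e.g., by setting blocked positions to negative infinity before the softmax), so blocked pairs contribute nothing to the result.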
Keywords/Search Tags: Program Understanding and Generation, Code Pre-trained Model, Code Representation Learning, Code Structure, Attention Matrix