
Code Representation Learning Based On Convolutional Autoencoder

Posted on: 2024-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: W T Li
Full Text: PDF
GTID: 2568307052996179
Subject: Electronic information
Abstract/Summary:
In today's information-based society, applications are ubiquitous and improve the efficiency of people's productive lives, but developing them is time-consuming and labour-intensive. Developers must constantly write new code, update and iterate software functionality according to user needs, and continuously maintain systems after they are deployed. To improve developers' efficiency, many researchers have worked on programming language processing tasks, hoping to make the development process more efficient and secure by using deep learning methods to assist developers. For deep-learning-based programming language processing, learning code representations is a key and fundamental task, and the quality of the learned representations directly determines the performance of downstream programming language processing tasks. Since the code naturalness hypothesis proposed by Hindle et al. revealed that programming and natural languages obey similar statistical regularities, researchers in recent years have drawn on the development of deep learning in natural language processing and applied many models that perform well on natural language tasks to programming language tasks. Although some success has been achieved, two problems remain in current code representation learning.

(1) Most existing code representation models do not make full use of the rich structural information in programming languages. Models based on abstract syntax trees, for example, can model the structure of the syntax tree as a whole but still ignore the variability of subtree structures within the tree. Syntax tree structures vary from language to language, and subtrees representing different semantics within the same syntax tree have different structures; modelling them with a uniform neural network cannot capture this variability.

(2) Current work in programming language processing typically trains models end-to-end on a supervised dataset for a particular task. Although developers contribute large amounts of code metadata to code repositories, including comments, description documents, and pull-request submission information, building supervised datasets requires cleaning and annotating this code, which can only be done by professionals with a computing background. The time and labour costs of building a dataset are therefore extremely high. Moreover, these supervised models usually perform well only on specific tasks or datasets and tend to generalise weakly when the dataset or task changes.

This paper makes two main contributions to address these issues.

(1) Trans2bin, a tool that converts abstract syntax trees from multinary to binary form based on custom semantic rules. A set of syntax tree conversion rules is proposed for the abstract syntax tree structures of languages such as C/C++ and Java, combined with the program semantics represented by the subtrees, to convert irregular multinary syntax tree structures into standard binary tree structures. The tool captures the differences in syntax tree structures during the pre-processing phase, preserving as much structural information about the program code as possible.

(2) CNN Autoencoder, a convolutional autoencoder code representation model based on abstract syntax trees, which learns code representations without supervision by means of convolutional and deconvolutional neural networks. The model is trained on large unsupervised datasets of C/C++ and Java, and fine-tuned on supervised datasets for code classification, code clone detection, and cross-language code search to verify its generalisation ability.

The experiments demonstrate that Trans2bin fully captures the variability of subtree structures, simplifies the subsequent modelling process of the code representation model, and learns vector representations of program code more effectively than other syntax tree pre-processing tools. Compared with other unsupervised code representation models, CNN Autoencoder fully learns the structure and position of subtrees during the convolution and deconvolution of abstract syntax trees, and performs better on three tasks: code classification, code clone detection, and cross-language code search.
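The subtree variability discussed above can be seen directly with Python's built-in `ast` module (used here purely for illustration; the thesis itself works with C/C++ and Java syntax trees):

```python
import ast

# Two statements whose root node type is the same (Assign) but whose
# right-hand subtrees have different structures: a constant vs. a call.
tree_a = ast.parse("x = 1").body[0]
tree_b = ast.parse("x = f(1, 2)").body[0]

def shape(node):
    """Return the nested node-type structure of an AST subtree."""
    return (type(node).__name__,
            [shape(c) for c in ast.iter_child_nodes(node)])

print(shape(tree_a))
print(shape(tree_b))
```

A single network architecture applied uniformly to both subtrees sees the same root label but must still accommodate their differing internal shapes.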
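Trans2bin's custom semantic rules are not public, but the generic idea of encoding a multinary tree as a binary tree can be sketched with the classic first-child/next-sibling transformation; this is a simplifying assumption, as the thesis's rules additionally take subtree semantics into account:

```python
class NaryNode:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

class BinNode:
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left    # first child in the original tree
        self.right = right  # next sibling in the original tree

def to_binary(node, siblings=()):
    """First-child/next-sibling encoding of an n-ary tree as a binary tree."""
    if node is None:
        return None
    left = (to_binary(node.children[0], tuple(node.children[1:]))
            if node.children else None)
    right = to_binary(siblings[0], tuple(siblings[1:])) if siblings else None
    return BinNode(node.label, left, right)

# A ternary call node becomes a binary chain: left edge = first child,
# right edges = successive siblings. Labels here are hypothetical.
root = NaryNode("Call", [NaryNode("Func"), NaryNode("Arg1"), NaryNode("Arg2")])
bt = to_binary(root)
```

Every node in the result has at most two children, so a fixed binary convolution window can slide over it without padding for variable arity.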
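The reconstruction objective behind a convolutional autoencoder over a sequence of AST node embeddings can be sketched as follows; the dimensions, tied weights, and toy data are illustrative assumptions, not the thesis's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """'Valid' 1-D convolution: x (T, d_in), w (k, d_in, d_out) -> (T-k+1, d_out)."""
    T = x.shape[0]
    k = w.shape[0]
    return np.stack([np.einsum("kd,kdo->o", x[t:t + k], w)
                     for t in range(T - k + 1)])

def deconv1d(h, w):
    """Transposed convolution: h (T-k+1, d_out), w (k, d_in, d_out) -> (T, d_in)."""
    Th = h.shape[0]
    k, d_in, _ = w.shape
    x = np.zeros((Th + k - 1, d_in))
    for t in range(Th):
        x[t:t + k] += np.einsum("o,kdo->kd", h[t], w)
    return x

# Toy "linearised AST": 6 nodes with 4-dimensional embeddings.
x = rng.normal(size=(6, 4))
w_enc = rng.normal(size=(3, 4, 8)) * 0.1   # kernel size 3, 8 latent channels
h = np.maximum(conv1d(x, w_enc), 0)        # encoder with ReLU
x_hat = deconv1d(h, w_enc)                 # decoder (tied weights for brevity)
loss = np.mean((x - x_hat) ** 2)           # reconstruction objective
```

Minimising the reconstruction loss over a large unlabelled corpus is what lets the encoder's intermediate activations serve as code representations without any task-specific supervision.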
Keywords/Search Tags: Abstract Syntax Tree, CNN Autoencoder, Code Representation, Unsupervised Learning