| Code search is an important research area at the intersection of natural language processing and software engineering. Efficient code search algorithms facilitate the reuse of high-quality software and thus improve the productivity of software developers. Given a natural language description of a code snippet's functionality as input, the code search task is to retrieve relevant code snippets from a large code base. The gap between the high-level intent of a text query and the low-level implementation of a code fragment makes code search challenging. Existing code search methods fall into five main categories: (i) keyword-based code search, (ii) program feature-based code search, (iii) query reconstruction-based code search, (iv) classical machine learning-based code search, and (v) deep learning-based code search. Among them, deep learning-based methods have received much attention because they improve the semantic understanding of code and queries. However, they still face two challenges. First, existing models do not explicitly model any interaction between code and queries until the last step of computing their similarity; as a result, they cannot explore both the global and the local similarity between code and queries. Second, code fragments differ from natural language sequences in their structural features; most approaches characterize the structural semantics of source code using abstract syntax trees (AST) and their variants, but the data flow and control flow information in abstract syntax trees is incomplete, so extracting the structural semantics of source code completely becomes a key challenge for code search. To address the first challenge, this thesis aims to explore the matching relationship between code and queries; to address the second, it aims to effectively characterize the structural semantics
of code fragments. To this end, this thesis makes two contributions to improve code search performance: (i) This thesis proposes Graph CS, a code search method based on relational graph convolutional networks, to address the shortcoming that existing code search methods ignore the matching relationship between queries and code. The method represents code fragments as code graphs built from abstract syntax trees, represents queries as text graphs built from constituency parse trees, and applies relational graph convolutional networks to embed both the code graphs and the text graphs. The proposed node-level matching strategy applies a multi-view matching function to align all node pairs across the code graph and the text graph, and updates the node embeddings with the matched features, in order to explore the fine-grained matching relationship between code graph nodes and text graph nodes. The proposed graph-level matching strategy applies a neural tensor network to measure the relationship between two embedding vectors along multiple dimensions, in order to explore the global similarity between the code graph and the text graph. Experimental results on two publicly available datasets show that Graph CS significantly improves code search performance over the baseline model. (ii) This thesis proposes Ad Graph CS, a code search method based on program dependency graph representation learning, to address the problem that existing code search methods cannot completely extract structural semantics from abstract syntax trees. Ad Graph CS constructs code graphs from program dependency graphs and defines dependency rules that mark the control dependency and data dependency edges in a program and supplement node types, which effectively reduces the size of the code graphs and enhances the representation of data flow in code. The graph node initialization module proposed in Ad Graph CS applies a Transformer encoder to capture the contextual information of program statements. Furthermore,
Ad Graph CS proposes a graph neural network module combined with self-attention hierarchical pooling: the graph-level embedding vector, obtained by hierarchically aggregating and updating graph node information, characterizes the structural features of code fragments and thus enables complete extraction of their structural semantics. Ad Graph CS also uses a pre-trained ALBERT model as the encoder for natural language sentences; pre-training on a large corpus helps the model accurately understand the intent of the query text. The experimental results demonstrate that extracting structural features from program dependency graphs is effective, that the designed embedding module improves the understanding of code semantics, and that the model performs better on each dataset. |
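
The graph-level matching step described above, in which a neural tensor network measures the relationship between two embedding vectors along multiple dimensions, can be illustrated with a minimal sketch. This is not the thesis implementation: the helper name `ntn_scores`, the parameter names `W`, `V`, and `b`, the tanh activation, and the toy values are all assumptions, following the commonly used neural tensor network form in which the k-th score is tanh(h_code^T W_k h_query + V_k [h_code; h_query] + b_k).

```python
import math


def ntn_scores(h_code, h_query, W, V, b):
    """Return K interaction scores between two graph-level embeddings.

    Hypothetical helper for illustration only (not from the thesis).
      h_code, h_query: lists of length d
      W: K slices, each a d x d matrix (bilinear term)
      V: K rows, each of length 2d (linear term on the concatenation)
      b: list of K biases
    """
    d = len(h_code)
    concat = h_code + h_query  # [h_code; h_query]
    scores = []
    for W_k, V_k, b_k in zip(W, V, b):
        # Bilinear term: h_code^T W_k h_query
        bilinear = sum(
            h_code[i] * W_k[i][j] * h_query[j]
            for i in range(d)
            for j in range(d)
        )
        # Linear term: V_k . [h_code; h_query]
        linear = sum(v * x for v, x in zip(V_k, concat))
        scores.append(math.tanh(bilinear + linear + b_k))
    return scores


# Toy example with d = 2 and K = 2 (all values are illustrative assumptions).
h_code = [1.0, 0.0]
h_query = [0.0, 1.0]
W = [
    [[0.0, 1.0], [0.0, 0.0]],  # slice 0 couples code dim 0 with query dim 1
    [[1.0, 0.0], [0.0, 1.0]],  # slice 1 is the identity (plain dot product)
]
V = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0]
scores = ntn_scores(h_code, h_query, W, V, b)
```

Each of the K slices acts as a separate "view" of how the two embeddings interact. In a full model the K scores would typically be fed into further layers to produce a single similarity value, and the parameters would be learned; both parts are omitted here.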