Source Code Representation Technology For Similarity Measurement

Posted on:2022-09-05

Degree:Master

Type:Thesis

Country:China

Candidate:Z P Zou

Full Text:PDF

GTID:2518306725992919

Subject:Master of Engineering

Abstract/Summary:


With the development of big data and the accumulation of a large amount of knowledge in data, software systems have gradually shifted from informatization to intelligence, as in intelligent software engineering. Source code comprehension supports many intelligent software engineering tasks, including code classification, defect detection, clone detection, and code retrieval. However, existing comprehension methods cannot fully capture code semantics from the literal text, and their handling of code syntax is complex and not robust: they downplay or even ignore the extraction of code syntax. Furthermore, methods based on the abstract syntax tree (AST) are disturbed by considerable noise, which significantly reduces the performance of code comprehension.

In essence, source code comprehension is the process of making a piece of code text express its function. We can find a mapping that represents the code as a fixed-length, low-dimensional, dense vector, which can then be used to measure the similarity between code and text in the semantic space under the vector space model. We first convert the source code into an AST that is independent of the programming language, then construct a sequence of path pairs based on our definitions and algorithms to obtain features that combine syntax and semantics. In this thesis, we propose a hybrid encoder consisting of a sub-token encoder for semantics and a path pair encoder for syntax. The sub-token encoder encodes text fragments in the source code into a vector, which strengthens the semantic features. We adopt a simpler static path encoder to handle the syntax of the code, which leads to more robust and accurate syntax comprehension. In particular, we also propose a dynamic path fusion method named Self-attention based Path Fusion (SPF). As a result, syntax features can be fused more effectively, the noise in existing AST encoding methods can be greatly reduced, the accuracy of code comprehension can be improved, the problem scale can be reduced, and code encoding can be made more efficient.

We evaluate our methods on two tasks: 1) method name generation and 2) semantic matching between code and text. The experimental results show that all our methods clearly outperform the baseline on both tasks. Relying on the robustness and efficiency of the RNN, the sequential features between AST nodes are captured more precisely, which improves syntax feature extraction. The proposed dynamic SPF method, attached to the RNN, suppresses the interference of noise in the data and automatically obtains a high-quality fused feature. Compared with the baseline, the metrics on the two tasks improve by about 20% and 60%. In addition, using a variant of the loss function and API corpora, we further optimize the proposed SPF into a better code comprehension method.
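The idea of splitting identifiers into sub-tokens and comparing code with text in a vector space can be illustrated with a minimal sketch. It is not the thesis's encoder: sparse bag-of-sub-token counts stand in for the learned dense vectors, and the function names (`split_subtokens`, `bag_of_subtokens`, `cosine_similarity`) and sample strings are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def split_subtokens(identifier: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase sub-tokens."""
    # Acronym runs (e.g. "HTTP"), capitalized or lowercase words, and digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def bag_of_subtokens(text: str) -> Counter:
    """Count the sub-tokens of every identifier-like word in code or text.

    Assumption: a sparse count vector is used here in place of the
    learned dense vector described in the abstract.
    """
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
    return Counter(tok for word in words for tok in split_subtokens(word))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical code snippet and two candidate descriptions.
code = "def readFileLines(file_path): return open(file_path).readlines()"
related = "read lines from a file"
unrelated = "sort numbers in ascending order"

code_vec = bag_of_subtokens(code)
print(cosine_similarity(code_vec, bag_of_subtokens(related)))
print(cosine_similarity(code_vec, bag_of_subtokens(unrelated)))
```

Because sub-tokens such as `read`, `file`, and `lines` appear in both the code and the related description, that pair scores higher than the unrelated one. A learned encoder would replace the count vectors while keeping the same similarity measurement.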

Keywords/Search Tags:

Code Comprehension, Intelligent Software Engineering, Semantic Matching, Similarity Measurement


Related items

1	The Research On Thesis Similarity Algorithm Based Combination On VSM And Semantic Comprehension
2	The Research Of A Textual Semantic Comprehension Model Based On Human Cognitive Process (HTSC)
3	C Code Similarity Measurement Algorithm Based On Levenshtein Distance
4	Research And Application Of Wordnet-Based Semantic Similarity Measurement
5	Research On Code Comprehension Based On Deep Learning
6	Design And Implementation Of Intelligent Customer Service Multi-service Semantic Comprehension System Based On Grammar Rules Matching
7	Research Of Geospatial Semantic Integration And Semantic Similarity Measurement Based On Knowledge
8	Research On The Interactive General Tools For Program Comprehension
9	Study On Semantic Matching And Sentence Reasoning For Machine Reading Comprehension
10	Research On Interpretable Reading Comprehension Based On More Accurate Evidence Training