| As programming languages diversify and evolve, some large-scale software is increasingly developed in multiple languages in order to combine their respective advantages. Multilingual code search can help developers write and maintain such code and improve their efficiency. Existing code search work usually considers only the syntactic structure or textual information of code, and most of it targets a single programming language, so it cannot meet the needs of multilingual code search. These limitations largely stem from differences in syntax and language mechanisms between programming languages. To reduce these differences and improve multilingual code search performance, this paper proposes a semantic-graph-based multilingual code search model. By modeling code semantics in depth, the model maps semantically similar code into a unified semantic vector space, so that it can overcome differences in how multilingual code is expressed and locate the code users need more accurately. The main contributions of this paper are as follows. (1) To address the differences in how code is represented across languages, this paper proposes a basic multilingual code representation model based on semantic graphs. It uses the intermediate representation (IR) of the code to extract data flow and control flow, constructs a code semantic graph that accurately represents code semantics and eliminates differences in the surface form of multilingual code, and uses graph neural networks to mine the semantic information in these graphs. In addition, to overcome the compilation problems that arise when obtaining IR for different programming languages, this paper implements a code fragment packaging tool that supports both C and Java. By wrapping code fragments with the dependencies they lack, the tool raises the compilation pass rates of C and Java code to 49% and 69%, respectively. (2) To further eliminate differences in the representation of multilingual code, this paper proposes a code representation enhancement model based on contrastive learning. The model mutates code in a variety of ways and uses contrastive learning to make the model focus on the semantic features shared by code in different languages, thereby further improving the consistency of code representations across programming languages. (3) This paper conducts extensive experiments to demonstrate the effectiveness of the proposed models and design. It compares the basic and enhanced multilingual code representation models with multiple related works on the three most commonly used programming languages (C, Java, and Python). The experimental results show that both the basic model and the contrastive-learning-based enhancement model achieve significant performance improvements over the baseline models on single-language and multilingual tasks. For the basic model, the average MRR on the three single-language tasks reached 0.485, which is 7% to 31% higher than the baseline models. On the multilingual task, the basic model was 41.2% and 79.5% higher than the baselines. The enhanced model improved performance on multilingual tasks by a further 5.1% over the basic model. |
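
To make the reported MRR figures easier to interpret, the following is a minimal sketch of how mean reciprocal rank can be computed for a vector-space code search. It is not the paper's implementation. It assumes that queries and code snippets have already been embedded into a shared vector space and that candidates are ranked by cosine similarity (an assumed similarity measure). The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of_target(query: Sequence[float],
                   candidates: List[Sequence[float]],
                   target_index: int) -> int:
    """1-based rank of the correct candidate when sorted by similarity."""
    scores = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)),
                   key=lambda i: scores[i], reverse=True)
    return order.index(target_index) + 1


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy data (hypothetical): two query vectors and three code vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]
targets = [0, 2]  # index of the correct code snippet for each query

ranks = [rank_of_target(q, candidates, t) for q, t in zip(queries, targets)]
mrr = mean_reciprocal_rank(ranks)
print(ranks, mrr)
```

In this toy setup the first query ranks its target first and the second query ranks its target second, so the MRR is (1 + 1/2) / 2 = 0.75. An MRR of 0.485 therefore corresponds, roughly, to the correct snippet appearing around the second position on average.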