With the rapid development of the software industry, software requirements are updated more and more frequently, and the volume of software source code has grown rapidly. When requirements change, manually locating the code that needs to be modified among massive source files is difficult and tedious. In this situation, establishing a mapping between software requirements and source code can effectively improve development efficiency. Existing research usually relies on information retrieval and machine learning techniques to establish such mappings; however, these methods require manual participation and are extremely costly. With the development of deep learning, text representation has made it possible to construct the mapping between requirements and code automatically and accurately. Text representation builds word vectors through contextual modeling and quantifies the semantic relationship between heterogeneous data via the geometric distance between vectors, alleviating the semantic gap between natural language and code. Word segmentation is the basic step of word vector representation. However, because new domain terms in software engineering emerge endlessly, general-purpose segmentation tools struggle to segment them correctly, so existing word vector models cannot learn the semantics of these new words. The resulting missing vectors for domain-specific new words in turn hinder semantic understanding of the requirement text. In addition, because code is inherently complex, current code representation methods suffer from incomplete feature extraction and information loss. To address these issues, our work is as follows:

(1) To accurately identify new domain terms in domain texts, we propose a new word discovery algorithm based on N-grams and word vector pruning. First, the N-gram model is used to pre-segment the domain corpus. Combined with
a variety of statistical features, the candidate segments are filtered. A word vector pruning algorithm based on the BERT model is then designed to prune candidate new words according to vector similarity. Finally, commonly used dictionaries are applied as a filter to obtain the domain new words. Comparative experiments show that the algorithm can effectively discover new domain terms. To further mine richer domain knowledge from domain texts, we propose a domain lexicon construction method based on K-means clustering: by clustering the vectorized text data, domain vocabulary is separated from everyday words, and the domain lexicon is obtained after manual screening. This lexicon then serves as external knowledge for optimizing the word vector representations.

(2) To enhance the semantic representation of requirement and code vectors, we propose two pre-training strategies to optimize the word vector representation model. First, subword embeddings based on positional information address the out-of-vocabulary problem and enrich the model's vector representations of low-frequency words. Second, a vector representation method that fuses domain knowledge increases the weight of important domain words and strengthens semantic understanding of the text. The pre-trained word vectors are used to initialize the vector representation layer of the requirements-to-code mapping model. Ablation experiments show that both pre-training strategies effectively improve the mapping model's performance.

(3) To address the mismatch between the high-level intent expressed by requirements and the low-level implementation details of code, we design a software requirements and source code mapping model based on transfer learning. The method first uses static program analysis to extract code information at different granularities, and uses the pre-trained enhanced word
embedding model to represent requirement descriptions and code fragments as vectors, realizing transfer learning at the word vector level. A multi-layer neural network based on the self-attention mechanism then captures sequence-level semantic information of requirements and code, realizes a joint semantic representation of the heterogeneous data, and generates feature vectors for requirements and code; whether a requirement and a code fragment are related is judged by comparing the similarity between their vectors. Finally, a requirements-to-code mapping tool is designed and implemented on top of the model. Comparison with other tools on 18 open-source projects verifies the tool's effectiveness and feasibility.
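The statistical filtering of N-gram candidates described in contribution (1) can be illustrated with two measures commonly used for new word discovery: pointwise mutual information (internal cohesion) and left/right branch entropy (contextual flexibility). The abstract does not name the specific features or thresholds used, so the function names and threshold values below are illustrative assumptions, not the thesis's actual configuration:

```python
import math

def pmi(bigram_count, w1_count, w2_count, total):
    """Pointwise mutual information of a two-token candidate.

    High PMI means the tokens co-occur far more often than chance,
    suggesting the candidate is a cohesive unit (e.g. a domain term).
    """
    p_xy = bigram_count / total
    p_x = w1_count / total
    p_y = w2_count / total
    return math.log2(p_xy / (p_x * p_y))

def branch_entropy(neighbor_counts):
    """Entropy of the tokens adjacent to a candidate (one side).

    High entropy means the candidate appears in many different
    contexts, i.e. it behaves like a free-standing word.
    """
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())

def is_candidate_new_word(bigram_count, w1_count, w2_count, total,
                          left_neighbors, right_neighbors,
                          pmi_threshold=3.0, entropy_threshold=1.0):
    """Keep a candidate only if it is both cohesive (high PMI) and
    flexible in context (high entropy on both sides).
    Thresholds here are illustrative, not from the thesis."""
    return (pmi(bigram_count, w1_count, w2_count, total) >= pmi_threshold
            and branch_entropy(left_neighbors) >= entropy_threshold
            and branch_entropy(right_neighbors) >= entropy_threshold)
```

Candidates surviving this filter would then go through the BERT-based vector pruning and dictionary filtering described above.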
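The final mapping decision in contribution (3) rests on comparing the similarity of feature vectors. A minimal sketch of that idea is shown below, using mean pooling as a stand-in for the self-attention encoder and an arbitrary cosine-similarity threshold; both simplifications are assumptions for illustration and do not reflect the thesis's actual network or decision rule:

```python
import numpy as np

def mean_pool(token_vectors):
    """Collapse a sequence of token embeddings into one feature vector.

    In the thesis this role is played by a multi-layer self-attention
    network; mean pooling is a deliberately simple stand-in.
    """
    return np.mean(np.asarray(token_vectors, dtype=float), axis=0)

def cosine_similarity(u, v):
    """Geometric closeness of two vectors, in [-1, 1]."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_linked(req_vec, code_vec, threshold=0.5):
    """Judge whether a requirement and a code fragment are related.
    The 0.5 threshold is illustrative, not taken from the thesis."""
    return cosine_similarity(req_vec, code_vec) >= threshold
```

For example, a requirement vector and a code vector pointing in nearly the same direction yield a similarity close to 1 and are reported as a mapping pair, while orthogonal vectors yield a similarity near 0 and are rejected.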