| With the advent of the big data era, digital resources on the Internet are growing rapidly, and plagiarism is becoming increasingly rampant. Manual detection can no longer cope with the massive volume of online resources, so research on automatic plagiarism detection methods has become crucial. Traditional plagiarism detection methods lack deep semantic analysis of the text and cannot cope with complex paraphrasing. Recent research, which uses deep learning techniques to extract deep semantic features, has achieved some improvement in plagiarism detection. However, existing plagiarism detection methods still have the following shortcomings. First, compared with traditional methods, deep learning-based methods require more processing time and are inefficient when facing massive amounts of data. Second, most existing methods treat the sentence as the unit of plagiarism and compare independent sentences in pairs. This approach ignores contextual information and cannot handle complex cases such as one sentence split into several, several sentences merged into one, or several sentences rewritten as several others. Third, the sequential nature of text features is not considered, which leads to ineffective feature extraction and integration. To solve these problems, this thesis proposes a novel plagiarism detection method. The method is divided into three stages: paragraph-level, sentence-level, and post-processing. In the paragraph-level stage, a new similarity factor called IPFMGS is designed to measure the similarity between paragraphs. By comparing paragraphs with each other and selecting the plagiarized paragraphs, this stage keeps detection efficient and improves filtering effectiveness. In the sentence-level stage, first, a multi-sentence semantic feature extraction and fusion network is proposed, which uses convolutional neural networks to fuse the semantics of multiple sentences, comprehensively capturing plagiarism
features under various complex situations. Second, multiple features are extracted using a single-sentence semantic feature extractor and a lexical feature extractor. Third, a Bi-LSTM sequence model fuses the extracted features and combines them with contextual features to detect plagiarized sentences, integrating the features effectively. In the post-processing stage, unlike existing methods that directly merge sentence pairs, this thesis proposes a plagiarism fragment matching algorithm to determine the correspondence between plagiarism fragments. The proposed method was evaluated experimentally on three datasets (PAN12, PAN13, and PAN14), and the results show that it outperforms existing detection methods. |
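
The coarse-to-fine idea behind the paragraph-level and sentence-level stages can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not define IPFMGS, so plain Jaccard overlap of word sets is used here as a stand-in similarity, and the function names, the naive sentence splitter, and the threshold of 0.2 are all hypothetical choices made for illustration. The sketch only shows how paragraph filtering reduces the number of sentence pairs that a more expensive sentence-level model would have to examine.

```python
import re
from itertools import product


def tokens(text):
    """Lowercased word set; a deliberately crude text representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets (a stand-in, NOT the IPFMGS factor)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_paragraph_pairs(suspicious, source, threshold=0.2):
    """Stage 1: keep (i, j) paragraph index pairs whose similarity reaches the threshold."""
    susp_tok = [tokens(p) for p in suspicious]
    src_tok = [tokens(p) for p in source]
    return [
        (i, j)
        for i, j in product(range(len(susp_tok)), range(len(src_tok)))
        if jaccard(susp_tok[i], src_tok[j]) >= threshold
    ]


def split_sentences(paragraph):
    """Naive splitter on ., ! and ? (assumes no abbreviations)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def sentence_pairs_to_check(suspicious, source, threshold=0.2):
    """Stage 2 input: sentence pairs drawn only from surviving paragraph pairs."""
    pairs = []
    for i, j in candidate_paragraph_pairs(suspicious, source, threshold):
        pairs.extend(product(split_sentences(suspicious[i]), split_sentences(source[j])))
    return pairs


if __name__ == "__main__":
    suspicious = [
        "Neural networks learn features from data. They need many examples.",
        "The weather was cold and rainy all week.",
    ]
    source = [
        "Neural networks learn useful features from raw data. They need many labelled examples.",
        "Stock prices fell sharply on Monday.",
    ]
    print(candidate_paragraph_pairs(suspicious, source))
    print(len(sentence_pairs_to_check(suspicious, source)))
```

In this toy input only one of the four paragraph pairs survives the filter, so four sentence pairs are passed on instead of the nine that exhaustive pairwise comparison would produce. The sentence-level model, the Bi-LSTM fusion, and the fragment matching step are outside the scope of the sketch.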