| In recent years, plagiarism has drawn wide public attention, and plagiarism-related incidents have frequently appeared among trending news topics. Plagiarism not only infringes on the legitimate rights and interests of the original author but also has a harmful social impact, and it should be firmly resisted. A large proportion of plagiarized works are documents, so document duplicate-checking technology has important practical value for curbing document plagiarism. However, traditional document plagiarism detection can only find literal similarity between documents and cannot measure their semantic similarity well. The model proposed in this paper represents a document by learning its deep semantic features and its shallow layout features separately, and it supports incremental learning. A two-stage model computation is used to speed up queries for similar documents. The work completed in this paper is as follows: (1) Based on the characteristics of computer science experiment report documents, a feature extraction algorithm tailored to these documents is proposed. XLNet is used to obtain a vector representation of the document's deep text semantic features, and a convolutional neural network and a gated recurrent unit are used to obtain a vector representation of its shallow layout features. The final document feature vector is then obtained by training a siamese network on the binary classification task of judging whether two documents have a plagiarism relationship. (2) When the above model is applied to document data from a new discipline, "catastrophic forgetting" occurs. Using knowledge distillation and data replay, this paper achieves incremental learning, so that the model performs similarly on the two datasets and training becomes more stable. (3) The large number of model parameters causes slow training, high memory consumption, and inefficient real-time document similarity queries. To address these problems, knowledge distillation, in which a teacher model with more parameters guides the learning of a student model with fewer parameters, is used to slim down the model. In addition, the two-stage model computation speeds up queries for similar documents, improving system performance and efficiency while preserving the quality of the query results. (4) A document plagiarism detection system based on a semantic neural network is designed and implemented; the implementation of each function in the system is described in detail and illustrated with diagrams. Testing verifies the system's performance and the correctness of its modules. In summary, this paper implements the acquisition and preprocessing of computer science experiment report documents, the modeling of their deep text semantic features and shallow layout features, incremental learning of the document model, fast search and display of similar documents, and convenient system management functions. The resulting document plagiarism detection system based on a semantic neural network has some practical application value. |
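
The teacher-student idea in point (3) can be illustrated with a small sketch. This is not the paper's implementation: it assumes the commonly used soft-target formulation of knowledge distillation, in which the student is trained on a weighted sum of a temperature-softened KL term against the teacher and an ordinary cross-entropy term against the true label. The function names, the `temperature` and `alpha` parameters, and the two-class logit layout (plagiarism / no plagiarism) are all hypothetical choices made for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher) term and a hard (label) term.

    temperature and alpha are illustrative hyperparameters,
    not values taken from the paper.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)

    # Soft term: KL(teacher || student) on the softened distributions,
    # scaled by T^2 so its gradient magnitude stays comparable across T.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    soft = (temperature ** 2) * kl

    # Hard term: cross-entropy of the student against the true label (T = 1).
    hard = -log_softmax(student_logits)[label]

    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    # Hypothetical logits for one document pair: index 0 = "plagiarism".
    teacher = [2.0, -2.0]
    student = [0.3, -0.1]
    print(distillation_loss(student, teacher, label=0))
```

When the student's logits match the teacher's, the soft term vanishes and only the label term remains, which is why the teacher acts as a guide and not as a replacement for the labeled plagiarism pairs.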