| With the development of the Internet, plagiarism has become increasingly serious, and plagiarism detection has become a focus of academic research. People can obtain material to copy in many ways and, more seriously, this gives rise to intellectual property theft. Plagiarism causes widespread harm as a social phenomenon, and plagiarism detection can effectively curb it. Existing research on plagiarism detection covers three main areas: plagiarism corpus acquisition, plagiarism source retrieval, and plagiarism text alignment. This paper carries out the following work in these three areas. First, plagiarism corpora are currently acquired mainly by manual work. To address the quality and time-efficiency problems of that approach, this paper proposes a corpus acquisition method for plagiarism detection based on a text alignment algorithm, which obtains plagiarism data automatically and provides base data for plagiarism detection research. A corpus acquisition framework and the text alignment algorithm it relies on are presented, and the data obtained are statistically analyzed and evaluated. Second, existing source retrieval methods based on heuristic search lack theoretical support and depend only on expert experience. To address this, this paper studies a source retrieval filtering model based on supervised learning, gives the source retrieval framework and filtering algorithm, discusses the source retrieval performance of filtering methods based on learning to rank and on classification, and compares in detail the effects of various features on source retrieval performance. In constructing the filtering model, the features and the supervised learning algorithm with the best retrieval performance are selected. Third, text alignment methods based on word matching perform well at detecting lightly obfuscated plagiarism but perform poorly when faced with the various forms of heavily obfuscated plagiarism. To solve this problem, a semantic-based text alignment method is proposed: semantic information is introduced into plagiarism detection, the distributed representation of words is analyzed, and a semantic-based text alignment model is given. Experiments show that the filtering model and the seed search model constructed in this paper make up for shortcomings in current research, improve the overall performance of plagiarism detection, and provide new methods and research directions for the source retrieval filtering task and the text alignment seed search task. |