Design and Implementation of Duplicate Document Detection Based on Similarity Estimation

Posted on: 2015-02-08  Degree: Master  Type: Thesis
Country: China  Candidate: X Pan  Full Text: PDF
GTID: 2308330473950526  Subject: Software engineering
Abstract/Summary:
With the development of network application technology, the amount of similar information on the Internet is growing exponentially. Large numbers of similar documents not only consume considerable network storage space but also degrade the user experience. The openness of information platforms and the ease of obtaining digital text have led to an escalation of academic misconduct, such as copying papers and even illegal plagiarism, and this misconduct has serious consequences. Similarity estimation techniques, applied to the design and implementation of document copy detection systems, therefore have important technical significance and application value for improving the efficiency of information retrieval and protecting intellectual property rights.
This thesis studies the theories and methods of document similarity estimation in depth, and then designs and builds a document similarity detection system that aims to detect document similarity quickly and accurately in a massive-data environment. The main work of the thesis is as follows.
The document similarity detection system provides three functions: document information preprocessing, document similarity computation, and result presentation and export. It is based on Minwise Hash and focuses on problems such as document clustering, the similarity estimation algorithm, highlighting (coloring) similarity evidence, generating similarity reports, and statistical analysis of data.
Following the waterfall model of software engineering throughout the design process, the thesis presents the business requirements, functional requirements, non-functional requirements, system architecture design, system functional design, and database design of the document similarity detection system.
The thesis then describes the implementation environment, the interface design, the code of the key functional modules, and the functional and performance test results for the main modules. The resulting system has several advantages: operation is more user-friendly, and both the extraction efficiency for various document types (e.g., PDF, Word) and the computation efficiency are significantly improved. ...
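The abstract names Minwise Hash as the basis of the similarity computation but does not reproduce the thesis's code. The following is only a minimal sketch of the general MinHash idea, not the author's implementation: each document is split into word shingles, a fixed number of hash functions is applied, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the shingle sets. The function names, the shingle length `k`, the number of hash functions, and the use of seeded BLAKE2b digests as hash functions are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed parameter)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or nothing if the text is empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item; the seed selects the hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """MinHash signature: the minimum hash value of the set under each hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)
    print("exact Jaccard:   ", jaccard(set_a, set_b))
    print("MinHash estimate:", estimate_similarity(sig_a, sig_b))
```

Signatures have a fixed length regardless of document size, so in a sketch like this two documents can be compared without revisiting their full text; increasing `num_hashes` reduces the variance of the estimate at the cost of more hashing.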
Keywords/Search Tags: document similarity detection, Minwise Hash, similarity measure, fingerprint