| In essence, checking programming assignments for duplication means detecting similarity between pieces of program code, and a duplication rate can be measured through similarity detection. This thesis studies duplicate checking for Python programming assignments, and its method can be used in a duplicate-checking system for teaching Python. Although the system targets Python, its design principles and methods are general, so it is equally applicable to other programming languages. The thesis surveys the theories and methods of plagiarism detection systems at home and abroad and identifies the advantages and disadvantages of the different systems. The methodology of the system combines attribute counting with structure measurement. Drawing on the application of the vector space model in text classification, the thesis studies the theory and methods of the vector space model and uses it to transform program code into feature vectors, so that the similarity between feature vectors reflects the similarity of the program code. The thesis also studies regular-expression matching, abstract syntax trees, and the theory and methods of several common feature extraction algorithms, and implements two feature extraction algorithms, one based on the program code and one based on the abstract syntax tree. It further examines several common feature weighting methods and adopts an improved inverse document frequency weighting algorithm. 
This weighting algorithm is used to weight the feature vectors. The thesis also studies several common vector similarity measures and designs a new vector similarity calculation method suited to this system. Using the two feature extraction methods, one based on the program code and one based on the abstract syntax tree, we apply the cosine of the angle, the correlation coefficient, and a divergence indicator, respectively, to detect program similarity at three different levels, and compare the results with those of the MOSS system. To verify the feasibility of the duplicate-checking system, program code from actual teaching was also selected, tested, and compared with the results of manual comparison. Analysis of the final test results shows that the system can recognize plagiarism at different levels and make the corresponding judgments. Moreover, comparison with the MOSS system shows that the duplicate-checking system performs better when it extracts features from the abstract syntax tree and calculates similarity from the degree of difference. |
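
As a rough illustration of the pipeline summarized above, the following minimal Python sketch counts abstract syntax tree node types as features, weights them with inverse document frequency, and compares two programs by the cosine of the angle between their feature vectors. It is not the thesis's implementation: the function names (`ast_features`, `idf_weights`, `weighted_vector`, `cosine_similarity`) are hypothetical, the feature set is an assumption, and the weighting is a common smoothed IDF variant rather than the improved algorithm adopted in the thesis.

```python
import ast
import math
from collections import Counter
from typing import Dict, List


def ast_features(source: str) -> Counter:
    """Count how often each AST node type occurs in a Python program."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def idf_weights(docs: List[Counter]) -> Dict[str, float]:
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1 (assumed variant)."""
    n = len(docs)
    df: Counter = Counter()
    for doc in docs:
        df.update(doc.keys())  # each program counts once per feature
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def weighted_vector(doc: Counter, idf: Dict[str, float]) -> Dict[str, float]:
    """Multiply each raw feature count by its IDF weight."""
    return {term: tf * idf.get(term, 0.0) for term, tf in doc.items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse feature vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Three toy submissions: an original, a copy with renamed identifiers,
# and an unrelated program.
original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
different = "class Point:\n    def __init__(self, x, y):\n        self.x = x\n        self.y = y\n"

counts = [ast_features(src) for src in (original, renamed, different)]
idf = idf_weights(counts)
vectors = [weighted_vector(c, idf) for c in counts]

sim_copy = cosine_similarity(vectors[0], vectors[1])
sim_other = cosine_similarity(vectors[0], vectors[2])
print(sim_copy, sim_other)
```

Because node-type counts ignore identifier names, the renamed copy yields the same feature vector as the original, while the unrelated program scores lower. A real system would need richer structural features and a calibrated decision threshold.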