Authorship analysis: Discovering the author of a software document

Posted on: 2007-04-05
Degree: Ph.D
Type: Dissertation
University: University of Louisiana at Lafayette
Candidate: Fanguy, Philip J
Full Text: PDF
GTID: 1445390005968699
Subject: Computer Science
Abstract/Summary:
The purpose of this research is to study and develop methods of determining the author of software programs. The authorship techniques developed analyze a collection of the author's past programs to extract information to choose the likeliest author of a suspect program.
Two approaches were attempted to accomplish this task. First, traditional software complexity measures were examined, using analysis of variance, to identify differences between authors; however, no significant differences were detected using these measures alone (with the exception of measures on the comments of the programs). The second approach created lists of characterizing terms (words and phrases within the programs) for each author that can be used to identify the author in any future programs written by the author. The focus of this dissertation is on these term selection techniques and their ability to choose the most influential terms for author identification. The techniques were developed and tested on programs using the C++ programming language (obtained from intermediate level programming classes).
Five term selection techniques were attempted. The Probability Technique selects terms that are used more often by one programmer than the group of programmers. The Rank Technique selects terms that are ranked relatively higher for a programmer than for the group of programmers. The Quintile Technique groups the terms into six bins according to rank and selects the terms in a bin that few programmers use. The Probability Deviation Technique selects terms for each author with probabilities a number of deviations above the mean of probabilities for all authors. The final technique, the Bayesian Inference Ratio Technique, uses a ratio that compares the probability that the term is used by the programmer to the probability that the term is used by any programmer to determine if a term is selected.
Of these term selection techniques, the Bayesian Technique produces the best results in terms of Authorship Accuracy (percentage of correctly identified suspect programs) and terms selected. The Bayesian Technique was validated in further studies on additional sets of programs. Term selection techniques, in general, and the Bayesian Technique specifically, are considered to be valid author identification methods.
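The ratio idea behind the Bayesian Inference Ratio Technique can be illustrated with a small sketch. The abstract does not give the exact ratio, the selection threshold, or how a suspect program is scored, so everything below is an assumption made for illustration: term probability is taken as the fraction of programs containing the term, a term is selected when the author-to-overall ratio reaches a hypothetical `threshold`, and the suspect program is assigned to the author whose selected terms it overlaps most. The function names and the toy data are likewise hypothetical.

```python
from collections import Counter


def term_probabilities(programs):
    """Fraction of programs in which each term appears at least once."""
    counts = Counter()
    for terms in programs:
        counts.update(set(terms))
    return {term: n / len(programs) for term, n in counts.items()}


def select_terms(programs_by_author, threshold=1.5):
    """Select, per author, terms whose author/overall probability ratio
    is at least `threshold` (an assumed cutoff, not from the dissertation)."""
    all_programs = [p for ps in programs_by_author.values() for p in ps]
    overall = term_probabilities(all_programs)
    selected = {}
    for author, programs in programs_by_author.items():
        own = term_probabilities(programs)
        selected[author] = {
            term for term, prob in own.items()
            if prob / overall[term] >= threshold
        }
    return selected


def likeliest_author(selected, suspect_terms):
    """Pick the author whose selected terms overlap the suspect program most
    (an assumed scoring rule; ties go to the alphabetically first author)."""
    suspect = set(suspect_terms)
    return max(sorted(selected), key=lambda a: len(selected[a] & suspect))


# Toy data: each program is reduced to a list of terms (hypothetical).
programs_by_author = {
    "alice": [["idx", "tmp", "endl"], ["idx", "endl", "cnt"]],
    "bob": [["i", "printf", "tmp"], ["i", "printf", "endl"]],
}
selected = select_terms(programs_by_author)
guess = likeliest_author(selected, ["idx", "endl", "foo"])
```

In this toy data, terms shared by both authors (`tmp`, `endl`) have ratios below the cutoff and are dropped, which is the intuition behind selecting only terms that distinguish one programmer from the group.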
Keywords/Search Tags: Author, Programs, Term selection techniques, Software