| Stylistics is a branch of linguistics that is used for writing guidance and for recognizing anonymous authors. With the development of statistics and the advent of the computer age, computational stylistics emerged from the combination of statistical methods and the study of writing style, and it has been widely applied to authorship identification. The main applications of authorship identification technology include predicting or identifying anonymous authors, cultural protection, public opinion monitoring, and copyright protection. This thesis therefore uses computational stylistics to identify authorship style, aiming to provide a reference for the recent development of copyright protection for network literature. The purpose of the thesis is to use computational stylistics to study an author’s writing style and, from that analysis, to judge the similarity of articles by author. The focus is on how to build an authorship style identification model and how to improve the identification rate by optimizing features. The thesis first analyzes the traditional stop-word features and the principles of Chinese word segmentation and part-of-speech tagging. Then, taking the characteristics of authorship style identification into account, it proposes an improved method that mines relevant and stable grammatical features by applying the Apriori algorithm, a data mining method, to part-of-speech sequences; the resulting part-of-speech sequence features performed well in author identification. Works by the contemporaneous martial arts novelists Jin Yong, Gu Long, Dongfang Yu, and Liang Yusheng serve as the first experimental data set, used to verify feature performance. Works by Wang Duyu, a writer with an independent style, and by Ni Lan, Feng Ge, and Li Liang, writers whose style is similar to Jin Yong’s, serve as the second experimental data set, used to verify the authorship style identification model. Random Forest (RF), a representative ensemble learning method, and logistic regression, a representative classification model, were chosen as the classifiers for constructing the model. Based on the results of feature verification, tag-based stop-word features and mixed multi-level features are used to further improve the feature set. After the authorship style identification model was obtained, its performance was verified through independence and similarity analysis on the second data set. |
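
The part-of-speech sequence mining step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes that sequences are contiguous tag n-grams, that support is the fraction of sentences containing a sequence, and that the sentences have already been segmented and tagged by some tagger. The function name `frequent_pos_sequences`, the tag labels, and the thresholds are all hypothetical.

```python
from collections import Counter


def frequent_pos_sequences(tagged_sentences, min_support=0.5, max_len=4):
    """Level-wise (Apriori-style) mining of contiguous POS-tag sequences.

    tagged_sentences: list of POS-tag lists, one list per sentence
                      (assumed to come from any segmenter/tagger).
    Returns a dict mapping each frequent tag tuple to its support,
    i.e. the fraction of sentences that contain it.
    """
    n = len(tagged_sentences)
    if n == 0:
        return {}

    result = {}
    frequent_prev = None  # frequent sequences of length k - 1

    for k in range(1, max_len + 1):
        counts = Counter()
        for tags in tagged_sentences:
            seen = set()  # count each sequence at most once per sentence
            for i in range(len(tags) - k + 1):
                gram = tuple(tags[i:i + k])
                # Apriori pruning: a sequence can only be frequent if both
                # of its (k-1)-length contiguous sub-sequences are frequent.
                if k > 1 and (gram[:-1] not in frequent_prev
                              or gram[1:] not in frequent_prev):
                    continue
                seen.add(gram)
            counts.update(seen)

        frequent = {g: c / n for g, c in counts.items() if c / n >= min_support}
        if not frequent:
            break  # no longer sequence can be frequent either
        result.update(frequent)
        frequent_prev = frequent

    return result
```

The supports of the returned sequences could then serve as a per-document feature vector for a classifier such as Random Forest or logistic regression; how the thesis actually encodes these features is not stated in the abstract.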