| With the continued development of the internet, the volume of data has grown exponentially. While people enjoy the convenience that data brings to their lives, they also suffer from the proliferation of misinformation. Authorship identification technology can identify the authors of online content, detect fake and plagiarized articles, and trace the source of spam, and it plays an important role in maintaining a healthy internet ecosystem. Existing authorship identification models are designed for English. Because Chinese and English differ in grammar and language composition, applying English authorship identification models to Chinese texts introduces significant bias. In addition, existing representation learning methods fail to prevent common features (i.e., features irrelevant to classification) from influencing text features, which yields impure text features and ultimately reduces classification accuracy. This paper therefore addresses these issues in two ways: (1) A fine-grained Chinese authorship identification model is proposed to reduce the bias introduced when English authorship identification models are applied to Chinese texts. First, the BERT model is used to extract contextual features of the text. Then, parallel convolutional layers extract fine-grained features of words that are 1 to 4 characters long. Finally, a self-attention mechanism assigns weights to the windows of the parallel convolutional layers, thereby improving the model's identification performance. Experimental results show that, compared with baseline models such as BERT, TextCNN, and recurrent neural networks, the proposed model improves the accuracy of Chinese authorship identification by 2.22%, 8.44%, and 8.1%, respectively. (2) For Chinese text classification tasks such as authorship identification, there are limits to how far classification accuracy can be improved solely 
through optimizing the model structure. To address this issue, this paper proposes an Inverted Attention Orthogonal Projection Module (IAOPM), which improves representation learning by reducing the impact of common features on text features, thereby improving the accuracy of text classification tasks such as authorship identification. IAOPM uses inverted attention (IA) to iteratively reverse the attention map on the text features to obtain purer common features. Because common features do not help classification and can even degrade performance, the text features are then projected onto the direction orthogonal to the common features, which yields purer text features that are better suited to classification. Unlike existing orthogonal projection methods, IAOPM extracts common features within a single network, without any branch networks, which makes orthogonal projection more flexible. Furthermore, an orthogonal loss is designed and used during training to ensure the quality of the common features, giving IAOPM better purification performance than the original method. Experiments show that the IAOPM-based text classification model achieves average accuracy improvements of 1.02%, 0.44%, and 0.52% on multiple text classification datasets over baseline models, self-attention mechanisms, and the original orthogonal projection methods, respectively. In summary, this paper proposes two effective methods for Chinese authorship identification. These methods not only make more efficient use of the datasets but also significantly improve the accuracy and generalization ability of the models. By improving on existing techniques and algorithms, this paper provides practical solutions for text classification tasks such as Chinese authorship identification. |
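
The orthogonal projection step described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes a single text feature vector and a single common feature vector of the same length, and the names `dot` and `project_out` are hypothetical. It only shows the standard linear-algebra operation of removing from one vector its component along another, which leaves a result orthogonal to the common feature.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(text_feature: List[float],
                common_feature: List[float],
                eps: float = 1e-12) -> List[float]:
    """Return text_feature minus its component along common_feature.

    Hypothetical helper for illustration only. The result is orthogonal
    to common_feature (up to floating-point error).
    """
    denom = dot(common_feature, common_feature)
    if denom < eps:
        # A (near-)zero common feature defines no direction to remove.
        return list(text_feature)
    scale = dot(text_feature, common_feature) / denom
    return [t - scale * c for t, c in zip(text_feature, common_feature)]


if __name__ == "__main__":
    purified = project_out([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
    print(purified)
    print(dot(purified, [1.0, 1.0, 1.0]))
```

In a trained model the vectors would be learned representations, and the common feature would come from the inverted attention step rather than being supplied by hand.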