Ancient Chinese (Literary Chinese) is the epitome of thousands of years of Chinese culture. Classical literary texts are difficult to understand, and the body of ancient texts handed down is enormous, so considerable effort is needed to organize them and uncover their value. It is therefore necessary to introduce efficient natural language processing (NLP) technology to process, understand, and study these documents. Pre-trained language models have achieved tremendous success in NLP, for both English and Chinese. However, Classical Chinese writing differs significantly from modern Chinese, making general-purpose modern Chinese pre-trained language models unsuitable. This article proposes WYWLM (Wen Yan Wen Language Model), which applies several pre-training techniques on large-scale corpora to address the characteristics of Classical Chinese, such as short sentences, concise vocabulary, well-organized text, and frequent quotations. A new pre-training task based on contrastive learning is introduced: using dictionaries as a medium, it leverages vast amounts of modern Chinese text so that the model learns better representations of Chinese characters and words. A style-bridging decoder is added to strengthen the language model and narrow the gap between Classical Chinese and modern Chinese. Additionally, a Classical Chinese dictionary containing character/word definitions and sources is used to inject knowledge into the language model.

Evaluation benchmarks such as GLUE, SuperGLUE, and CLUE play an important role in pre-trained language model research by allowing researchers to assess the performance of their models. However, existing benchmarks are not suitable for Classical Chinese (文言文). To enable researchers to evaluate pre-trained language models for Classical Chinese within a standardized framework, this paper proposes a dedicated NLP evaluation benchmark called WYWEB (Classical Chinese Web). WYWEB consists of eight tasks, including sentence classification, sequence labeling, reading comprehension, and machine translation, and allows researchers in the Classical Chinese domain to assess the capabilities of their models against a unified standard.

Multiple pre-trained models for Classical Chinese, together with WYWLM, were evaluated on WYWEB. The results show that WYWEB can measure the performance of pre-trained models along multiple dimensions; WYWLM achieved the best score, demonstrating that the pre-training methods designed for Classical Chinese are effective. These techniques will serve as the backend of a classical literature reader and will expose RESTful interfaces to users. Building on this research, the WYWEB dataset and the WYWLM model will be open-sourced, contributing to the Classical Chinese NLP research community.
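For concreteness, below is a minimal sketch of how a dictionary-mediated contrastive pre-training objective of the kind described above might look; it is not the paper's implementation. It assumes an InfoNCE-style loss with in-batch negatives that pairs a Classical Chinese word (encoded in context) with its modern Chinese dictionary definition; all names (DictContrastiveHead, word_proj, def_proj, etc.) are hypothetical.

```python
# Illustrative sketch (not the paper's code): a dictionary-mediated
# contrastive objective in the InfoNCE style. A Classical Chinese word
# in context and its modern Chinese dictionary definition are encoded
# separately; matching pairs are pulled together and in-batch
# non-matching pairs are pushed apart. All names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictContrastiveHead(nn.Module):
    def __init__(self, hidden_size: int = 768, proj_size: int = 256,
                 temperature: float = 0.05):
        super().__init__()
        # Separate projections for the classical-word view and the
        # modern-definition view of the same lexical item.
        self.word_proj = nn.Linear(hidden_size, proj_size)
        self.def_proj = nn.Linear(hidden_size, proj_size)
        self.temperature = temperature

    def forward(self, word_repr: torch.Tensor,
                def_repr: torch.Tensor) -> torch.Tensor:
        # word_repr: (B, H) encoder states of classical words in context
        # def_repr:  (B, H) encoder states of their dictionary definitions
        z_w = F.normalize(self.word_proj(word_repr), dim=-1)
        z_d = F.normalize(self.def_proj(def_repr), dim=-1)
        # (B, B) cosine-similarity logits; the diagonal holds positives.
        logits = z_w @ z_d.t() / self.temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: word->definition and definition->word.
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))

# Usage with placeholder encoder outputs:
head = DictContrastiveHead()
word_repr = torch.randn(32, 768)  # e.g. pooled token states of each word
def_repr = torch.randn(32, 768)   # e.g. [CLS] states of the definitions
loss = head(word_repr, def_repr)
loss.backward()
```

The symmetric loss treats word-to-definition and definition-to-word retrieval equally, and the temperature and projection size shown are merely typical choices for sentence-level contrastive learning, not values reported by the paper.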