| In recent years, artificial intelligence (AI) has developed rapidly alongside broader technological progress. Against this backdrop, the Supreme People’s Court of China has called for vigorously strengthening the development and application of smart judicial systems. Natural language processing (NLP) is now commonplace in the legal field, for example in case text classification and automatic legal question answering. These downstream NLP tasks all rest on a basic task, namely Chinese word segmentation. The legal field contains a large body of specialized vocabulary that is continually updated as society develops. As a result, existing segmentation tools are mature in general domains but still have many problems in specialized domains. The main remedy is to discover new words in the specialized domain and add them to the segmentation lexicon. However, datasets for new word discovery rely on manual labeling and are difficult to build at scale. As for embedding new words, existing word embedding models require large-scale training corpora and require each word to appear frequently enough. In the legal field, new words emerge frequently, and there is no guarantee that every new word has enough related text to train a word embedding model. To address these issues, this paper makes the following contributions: 1. To address the lack of new word discovery data in specialized domains, this paper proposes an adversarial transfer learning method that takes part-of-speech annotation on a general corpus as the source domain and new word discovery in the legal field as the target domain. BERT is used for encoding, and the private and shared features of the tasks are extracted in three parts. To improve feature fusion, a two-layer BiLSTM combined with neural adapters is used. A bilinear attention mechanism is added after the multi-head attention mechanism so that features conducive to new word discovery can be extracted from the shared features. 2. To address the problem of new word embedding, this paper proposes a new word embedding algorithm that integrates character features, subword semantics, and context information. First, character-level phonetic features are extracted from the target word, and n-gram subword semantics are obtained. Then, a random feature attention mechanism generates a vector for each context of the target word, which preserves the ability to capture dependencies in the input sequences while reducing model complexity. Multiple context vectors are then aggregated to enrich the semantic representation of the word embedding. Finally, a meta-learning method enables the model trained on a general domain to adapt quickly to a specialized corpus. 3. A new word discovery system for the legal field is designed, covering user login and registration, new word discovery, and word embedding generation. The system lets users upload their own training data for model training and run new word discovery and word embedding generation with the trained model. The system also calls backend interfaces to render a two-dimensional visualization of the word embeddings. |
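
The subword and context-aggregation steps of the second contribution can be illustrated with a minimal sketch. It is not the paper's implementation: the boundary markers `<` and `>` follow a fastText-style convention, and the helper names `char_ngrams`, `mean_vector`, and `compose_embedding`, the lookup table `subword_table`, and the use of a plain arithmetic mean for aggregation are all assumptions made for illustration. The abstract does not state how subword and context vectors are actually combined, and the phonetic features, random feature attention, and meta-learning steps are omitted here.

```python
from typing import Dict, List, Sequence


def char_ngrams(word: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return the character n-grams of `word` for n in [n_min, n_max]."""
    # Boundary markers are a fastText-style convention (assumption).
    marked = f"<{word}>"
    grams: List[str] = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def compose_embedding(
    word: str,
    subword_table: Dict[str, List[float]],
    context_vectors: Sequence[Sequence[float]],
) -> List[float]:
    """Combine known subword vectors with aggregated context vectors.

    Hypothetical combination rule: average the subword part and the
    context part with equal weight.
    """
    known = [subword_table[g] for g in char_ngrams(word) if g in subword_table]
    parts = [mean_vector(context_vectors)]
    if known:
        parts.append(mean_vector(known))
    return mean_vector(parts)


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings would be learned.
    table = {"取保": [1.0, 0.0], "候审": [0.0, 1.0]}
    contexts = [[1.0, 1.0], [3.0, 3.0]]
    print(char_ngrams("取保候审"))
    print(compose_embedding("取保候审", table, contexts))
```

Equal-weight averaging is chosen only to keep the sketch short. A learned weighting or attention over the context vectors would be a natural replacement.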