| As the amount of information in natural language increases sharply, processing and understanding this information becomes increasingly important. Named Entity Recognition (NER) aims to extract entities such as place names and person names from natural language, and plays an important role in understanding and processing natural language. As deep learning techniques have been applied to NER tasks, model accuracy has continued to improve, but model size has also grown rapidly. Large models (1) are difficult to run on edge devices with limited computing power, and (2) even in the cloud, where computing power is sufficient, suffer from slow inference, high power consumption, and high cost. This thesis uses knowledge distillation and unlabeled datasets to improve the performance of a small model by transferring the knowledge learned by a large model. With the model compressed by a factor of 10 to 20, the accuracy of the compressed model remains close to that of the original model, making it possible to run on devices with limited computing power. The main contributions of this thesis are as follows: (1) A method based on sample entity richness is proposed to mine suitable augmentation samples from dirty datasets. Dirty data comes from complex sources and may contain garbled characters and non-natural-language data. Using such data directly yields only limited improvement in model generalization, and manual cleaning requires a prohibitive amount of labor. The proposed mining method selects suitable samples by judging whether a sample contains task-related entities. Based on this mining method, a clean augmentation dataset is built. (2) A method of using proxy samples, constructed by combining augmentation samples and training samples, is proposed. The distribution of the augmentation dataset differs from that of the training dataset. In this thesis, the augmentation samples are shifted toward the training set through a linear combination at the word-vector level, generating proxy samples that are closer to the
distribution of the training set. Training on these proxy samples effectively reduces the model bias caused by the differing data distributions. (3) A method of using a balanced batch, containing equal numbers of training samples and augmentation samples, is proposed. Deep-learning-based algorithms rely on mini-batch stochastic gradient descent (SGD) to find suitable model parameters. In gradient descent, the losses of all samples in a batch are averaged. By combining augmentation samples and training samples in one batch, more information is introduced and the noise of the batch is reduced. (4) A sample weighting method is proposed to balance the weights of different samples. According to the relationship between the sample label and the teacher model's prediction, a small weight is applied to samples for which the two differ too much, and samples whose prediction and label do not match are removed. In this way, noisy samples and difficult samples are removed, which makes it easier for the student model to imitate the output of the teacher model. The proposed methods have been verified on multiple datasets and achieve the following results when the model is compressed by a factor of 10 to 30: (1) Compared with direct compression, augmentation sample mining improves the F1 of the compressed model by 0.6% to 1.6%. (2) Using proxy samples combined with sample mining improves the F1 of the compressed model by 0.7% to 1.8%. (3) With all four methods combined, the F1 of the compressed model improves by 4.8% to 10.08%; compared with the uncompressed model, the F1 loss is usually within 1%, and in some cases the compressed model exceeds the uncompressed large model by about 0.5%. Experiments with the prototype system show that deploying the compressed model locally and the uncompressed model in the cloud provides high-quality service regardless of whether the user device is connected to the cloud service. |
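
As a rough illustration of contributions (2) and (3), the following Python sketch shows one way a proxy sample could be formed by linearly combining word vectors, and how a balanced batch could be drawn. It is a minimal sketch under stated assumptions, not the implementation used in the thesis: the function names, the mixing weight `lam`, and the requirement that both samples are already padded to the same sequence length are all hypothetical choices made for the example.

```python
import numpy as np


def make_proxy_embeddings(aug_emb, train_emb, lam=0.7):
    """Linearly combine the word vectors of an augmentation sample
    with those of a training sample.

    Assumptions (not from the thesis): both inputs have shape
    (seq_len, dim) and are already padded to the same length;
    `lam` is the weight kept on the augmentation sample.
    """
    aug_emb = np.asarray(aug_emb, dtype=float)
    train_emb = np.asarray(train_emb, dtype=float)
    if aug_emb.shape != train_emb.shape:
        raise ValueError("both samples must have the same (seq_len, dim) shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # A smaller lam pulls the proxy sample closer to the training sample.
    return lam * aug_emb + (1.0 - lam) * train_emb


def balanced_batch(train_items, aug_items, batch_size, rng):
    """Draw a batch with equal numbers of training and augmentation samples.

    Assumption: sampling is uniform and without replacement within a batch.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for a balanced batch")
    half = batch_size // 2
    train_idx = rng.choice(len(train_items), size=half, replace=False)
    aug_idx = rng.choice(len(aug_items), size=half, replace=False)
    return [train_items[i] for i in train_idx] + [aug_items[i] for i in aug_idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    aug = rng.normal(size=(5, 8))    # stand-in word vectors, augmentation sample
    train = rng.normal(size=(5, 8))  # stand-in word vectors, training sample
    proxy = make_proxy_embeddings(aug, train, lam=0.7)
    print(proxy.shape)
    print(balanced_batch(list(range(100)), list(range(100, 200)), 8, rng))
```

In practice the mixing weight and the pairing of augmentation samples with training samples would need to be tuned; the sketch fixes them only to keep the example short.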