The word embedding layer is a fundamental component of many natural language processing models and often accounts for a large share of their parameters. Previous approaches to compressing word embeddings include low-rank matrix factorization and vector quantization. While these methods effectively reduce the parameter size of the word embedding layer, they have two limitations. First, they do not exploit the long-tail distribution of word frequencies for compression. Second, they do not address the trade-off between compression rate and model accuracy: typically, several sets of compression-rate parameters are selected at random and evaluated experimentally, and the set with the smallest accuracy loss is kept as the final result. Relying solely on such trial-and-error to select the compression rate is not accurate enough. A more effective method is therefore needed to address these challenges and achieve a better balance between compression rate and model accuracy.

To compress the word embedding layer, this study considers the importance of each word, measured by its frequency of occurrence in the corpus. Because word frequencies follow Zipf's law, the words are sorted by frequency and divided into multiple groups, each compressed with a different parameter size. After grouping, vector quantization, low-rank matrix factorization, product quantization, and differentiable quantization are employed as compression methods. Setting the splitting points manually, however, is suboptimal. To achieve better grouping and training, and to let the computer assist in finding the optimal splitting points, this study proposes a joint training objective that considers compression rate and task performance simultaneously. With this objective, a suitable combination can be found that maintains good task performance while achieving a moderate compression rate. The problem is thus cast as a network architecture search problem, which can be solved with architecture search methods from automated machine learning. Experimental results show that, compared with basic low-rank matrix factorization, the proposed method further compresses the parameter size of the word embedding layer to 22.6%, significantly reducing the model's size and improving its efficiency.

To test this word embedding compression scheme in the BERT pre-trained model, this study improves BERT based on the automated search for the optimal word embedding compression scheme, and proposes a comprehensive, concise, and effective compression scheme for BERT. For BERT's word embedding layer, automated machine learning is used to learn the splitting points and the parameter size of each group, yielding the optimal embedding compression scheme. When processing longer sentences, BERT faces the challenge of excessive attention computation. Existing solutions include sparse Transformers, sliding windows, and dilated sliding windows, but these designs are complex and pose challenges for practical implementation. To address this problem, the study proposes a new comprehensive compression scheme that exploits the importance of local context: through recursive chunking, the computational complexity of BERT's attention on long sentences is significantly reduced. Because this compression may lead to a loss in model accuracy, knowledge distillation is adopted, allowing the compressed smaller model to learn from the knowledge of the larger model and thereby recover performance. Experimental comparisons demonstrate the feasibility and effectiveness of the compression scheme.
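As a minimal illustration of the frequency-grouped idea described above, the sketch below computes the parameter count when each frequency group receives its own embedding dimension and compares it against a full-dimension baseline. All group boundaries and dimensions here are illustrative assumptions, not the configuration or splitting points actually learned in this study.

```python
# Hypothetical sketch of frequency-grouped embedding sizes.
# Group boundaries and dimensions below are illustrative assumptions only.

def grouped_param_count(vocab_size, split_points, group_dims):
    """Parameter count when each frequency group has its own embedding dim.

    Words are assumed sorted by descending frequency (Zipf's law), so the
    first group holds the most frequent words and gets the largest dimension.
    """
    counts = []
    prev = 0
    for split, dim in zip(split_points + [vocab_size], group_dims):
        counts.append((split - prev) * dim)
        prev = split
    return sum(counts)

vocab, full_dim = 30000, 768
# e.g. top 2k words keep 768 dims, next 8k get 192, the long tail gets 48
grouped = grouped_param_count(vocab, [2000, 10000], [768, 192, 48])
baseline = vocab * full_dim
print(f"compression ratio: {grouped / baseline:.1%}")  # → 17.5%
```

In the study itself, the splitting points and per-group sizes are not fixed by hand as above; they are treated as searchable choices, with the joint objective trading off this compression ratio against downstream task performance.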