Research On Algorithm Of Address Text Similarity

Posted on:2024-03-17

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Shi

GTID:2568307103474154

Subject:Electronic information

Abstract/Summary:

Addresses play an important role in daily life. Because addresses are expressed in complex and diverse ways, they must be matched exactly against the standard addresses in a database so that addresses can be unified and standardized across application scenarios. Building on the CCKS address similarity dataset and the BERT (Bidirectional Encoder Representations from Transformers) language model, this thesis addresses missing data and data imbalance in the address similarity dataset, as well as the shortcomings of the self-attention model in capturing dependencies. Its optimizations and innovations focus on data preprocessing, model architecture, and model ensembling. The main work and results are as follows:

(1) A variety of address data preprocessing methods are proposed. One of them completes incomplete or irregular addresses: it uses an Aho-Corasick automaton to match and filter the input address and output candidate completions, and then, to avoid introducing erroneous information, decides whether to complete the address based on the number of candidates. The proposed preprocessing methods also simplify complex samples that contain multiple label difference points by factorizing the contentious points, expand the data through relationship transitivity, and reduce data redundancy through group deduplication. The experimental data show that the proposed preprocessing methods improve model performance by, among other things, enriching the information in the training data and reducing its complexity.

(2) A self-attention optimization method is proposed. It targets two problems: the narrow range of information attended to by the original self-attention mechanism, and the monotonous, rigid information attended to by existing optimization methods based on the Gaussian distribution. The proposed mechanism adds the Gaussian distribution as an additive term to the original self-attention mechanism. To control the expectation and value of the Gaussian distribution, it combines the characteristics of the Gaussian distribution, the relationships between words in the self-attention mechanism, the broad representation of relative position, and the theory that different network structures may extract similar features. The experimental data show that the proposed method improves model performance by adaptively correcting the information that the self-attention mechanism attends to.

(3) A data partition method for model ensembling is proposed. It targets the problem that the data imbalance in the address similarity dataset is further magnified during model ensembling. When sub-datasets are generated, labels that account for a low proportion of the data are excluded from the cross-grouping partition; instead, the data with these labels is added in full to every sub-dataset, while data with the other labels is partitioned. The experimental data show that the proposed data partition method improves the performance of the ensemble model by optimizing the training data distribution of the sub-models from the bottom up.
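
For orientation on item (2), Gaussian-based variants of self-attention are commonly written with the Gaussian entering as an additive bias on the attention logits. The form below is that generic formulation, not necessarily the one used in the thesis; the symbols \mu_i (center for query position i) and \sigma_i (width) are assumptions introduced here, and the abstract does not state how the thesis computes them.

```latex
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + G\right) V,
\qquad
G_{ij} = -\frac{(j - \mu_i)^{2}}{2\sigma_i^{2}}
```

Because G_{ij} \le 0 and equals zero only at j = \mu_i, the bias leaves the logit at the center unchanged and progressively suppresses positions farther from it.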
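
The partition rule in item (3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name `partition_for_ensemble`, the `minority_ratio` threshold, and the use of k interleaved folds for the "cross-grouping" step are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def partition_for_ensemble(samples, k=5, minority_ratio=0.1, seed=0):
    """Build k (train, valid) splits for k sub-models.

    samples: list of (item, label) tuples.
    Labels whose share of the data is below minority_ratio (an assumed
    threshold) are not partitioned; their samples join every training
    split in full. All other samples are split into k folds.
    """
    # Group samples by label to measure each label's proportion.
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    total = len(samples)
    minority, majority = [], []
    for group in by_label.values():
        if len(group) / total < minority_ratio:
            minority.extend(group)
        else:
            majority.extend(group)

    # Cross-grouping partition applied to the majority-label data only.
    rng = random.Random(seed)
    rng.shuffle(majority)
    folds = [majority[i::k] for i in range(k)]

    splits = []
    for i in range(k):
        # Train on the other k-1 folds plus ALL minority-label samples.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        train.extend(minority)
        splits.append((train, folds[i]))
    return splits
```

With this scheme each sub-model sees every low-frequency sample during training, so ensembling does not further dilute labels that are already scarce. One consequence of the sketch is that validation folds contain no minority-label samples; whether the thesis evaluates them separately is not stated in the abstract.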

Keywords/Search Tags:

Address, Text Similarity, Data Preprocessing, Self-attention Mechanism, Data Partition Method of Model Ensemble

Related items

1	Research On Data To Text Generation Based On Deep Learning
2	Chinese Text Sentiment Analysis Method Based On Text Data Enhancement And ELECTRA Language Model
3	Address Data Application Study In Matching And Consolidation
4	Research On Short Text Similarity Based On Deep Learning
5	Research On Short Text Similarity Algorithm Based On BiLSTM And Attention Mechanism
6	Research On Text Semantic Similarity Based On Deep Learning
7	Research And Improvement Of Text Similarity Calculation Method
8	Research On Text Similarity Recognition Based On LSTM
9	Attention-based BertCNN: A Method For Text Similarity Calculation
10	Research On Long Text Classification Algorithm Via Multi-model Fusion With Attention Mechanism