Research On Algorithm Of Address Text Similarity

Posted on: 2024-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Shi
Full Text: PDF
GTID: 2568307103474154
Subject: Electronic information
Abstract/Summary:
Addresses play an important role in daily life. Address expressions are complex and diverse, so they must be matched exactly against the standard addresses in a database in order to unify and standardize addresses across application scenarios. This thesis builds on the CCKS address similarity dataset and the BERT (Bidirectional Encoder Representations from Transformers) language model. To address the missing and imbalanced data in the dataset and the weakness of the self-attention model in capturing dependencies, its optimizations and innovations focus on data preprocessing, model architecture, and model ensembling. The main work and results are as follows:

(1) Several data preprocessing methods for addresses are proposed. One of them completes incomplete or irregular addresses: it uses an Aho-Corasick automaton to match and filter the input address and output the candidate addresses that could complete it, and then, to avoid introducing erroneous information, decides whether to complete the address based on the number of candidates. The proposed preprocessing methods also simplify complex samples that contain multiple label difference points by factorizing the contentious points, expand the data through relationship transitivity, and reduce data redundancy through group deduplication. The experimental data show that the proposed preprocessing methods improve model performance by, among other things, enriching the information in the training data and reducing its complexity.

(2) A self-attention optimization method is proposed. It targets two problems: the narrow range of information that the original self-attention mechanism attends to, and the monotonous, rigid information attended to by existing optimization methods based on the Gaussian distribution. The proposed mechanism adds the Gaussian distribution as an additive term to the original self-attention mechanism. To control the expectation and value of the Gaussian distribution, it combines the characteristics of the Gaussian distribution, the relationships between words in the self-attention mechanism, the broad representation of relative position, and the theory that different network structures may extract similar features. The experimental data show that the proposed method improves model performance by adaptively correcting the information that the self-attention mechanism attends to.

(3) A data partition method for model ensembling is proposed. It targets the data imbalance in the address similarity dataset, which is further magnified during model ensembling. When the sub-datasets are generated, labels that account for a small proportion of the data do not take part in the cross-grouping data partition; instead, all data with these labels are added in full to every sub-dataset, while data with the other labels are partitioned. The experimental data show that the proposed method improves the performance of the ensemble model by optimizing the training data distribution of the sub-models from the bottom up.
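
The partition rule described in (3) can be illustrated with a short Python sketch. This is not the thesis's implementation, only one plausible reading of the abstract: the function name `partition_for_ensemble`, the `minority_threshold` cutoff, the choice of k-fold style leave-one-fold-out grouping, and the label names in the usage example are all assumptions made here for illustration.

```python
import random
from collections import Counter


def partition_for_ensemble(samples, k=5, minority_threshold=0.2, seed=0):
    """Build k training sub-datasets from (text_pair, label) samples.

    Assumed rule: labels whose share of the data is below
    `minority_threshold` (a hypothetical cutoff) are not partitioned and
    are copied in full into every sub-dataset. The remaining labels are
    split into k folds, and sub-dataset i leaves out fold i.
    """
    counts = Counter(label for _, label in samples)
    total = len(samples)
    minority = {lab for lab, n in counts.items() if n / total < minority_threshold}

    kept_whole = [s for s in samples if s[1] in minority]
    to_split = [s for s in samples if s[1] not in minority]

    # Shuffle only the data that will be partitioned, then deal it into k folds.
    rng = random.Random(seed)
    rng.shuffle(to_split)
    folds = [to_split[i::k] for i in range(k)]

    sub_datasets = []
    for held_out in range(k):
        part = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        # Every sub-dataset receives the full minority-label data.
        sub_datasets.append(part + kept_whole)
    return sub_datasets, minority


if __name__ == "__main__":
    # Toy data with made-up label names and proportions.
    data = (
        [(f"pair{i}", "not_match") for i in range(50)]
        + [(f"pair{i}", "partial_match") for i in range(50, 90)]
        + [(f"pair{i}", "exact_match") for i in range(90, 100)]
    )
    subsets, minority = partition_for_ensemble(data, k=5)
    print("labels kept whole:", minority)
    for i, sub in enumerate(subsets):
        print(i, Counter(label for _, label in sub))
```

Under this reading, each sub-model sees a different slice of the high-volume labels but the same complete set of low-volume labels, so the partition does not thin out the rare class any further. The sketch does not stratify the folds by label; a real pipeline might do so.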
Keywords/Search Tags: Address, Text Similarity, Data Preprocessing, Self-attention Mechanism, Data Partition Method of Model Ensemble