As the first step of natural language processing, natural language representation plays a decisive role in all subsequent analysis. Encoding Chinese natural language provides a basis for tasks such as sentence similarity, information retrieval, and text classification. Research on natural language coding is currently dominated by deep learning and pre-trained models. However, deep learning approaches train slowly, occupy considerable storage space, offer poor interpretability, and do not clearly distinguish the semantics of different words that appear in the same sentence pattern. Synonym Cilin, although its codes cover 77,343 words, cannot be updated in time to include new and trending words, so any similarity calculation that involves an unincluded word returns 0. To address these problems, and given the important position that modern Chinese semantic primitives hold in research oriented toward natural language processing, this paper improves natural language encoding on the basis of semantic primitives and their genetic mechanism, and applies the result to word and sentence similarity computation. First, a natural language coding algorithm based on semantic primitives is proposed. The algorithm targets words that are not included in Synonym Cilin (Extended) and uses Cilin as the basic primitive library. The interpretation of such a word is retrieved from Baidu Encyclopedia with a web crawler, the TextRank keyword extraction algorithm extracts semantic primitives from that interpretation, the Cilin code of each primitive is looked up, and these codes together form the coding vector of the word. Second, a similarity algorithm based on the genetic characteristics of semantic primitives is proposed. The similarity of two unincluded words is obtained from their coding vectors and from a coding distance defined on the structure of Cilin codes. The similarity of two sentences is analyzed from both semantics and syntactic dependency, and the method is verified on different sentences. The experimental results show that the proposed coding algorithm enables similarity analysis for unincluded words, which remedies the defect that their similarity would otherwise be 0, and that the resulting lexical similarities are reasonable. On sentence similarity, the method performs better than other algorithms.
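
As an illustration of how a coding vector and a code-based distance might fit together, the following Python sketch may help. It is not the paper's algorithm, since the abstract does not give the formulas. It assumes only the publicly documented layout of Synonym Cilin (Extended) codes: eight characters encoding five hierarchy levels plus a trailing flag, which the sketch ignores. The scoring rule (fraction of shared levels), the averaged best-match aggregation, the function names, and the example codes are all hypothetical choices made for illustration.

```python
from typing import List

# End offsets of the five hierarchy levels in an 8-character Cilin (Extended)
# code such as "Aa01A01=": letter, letter, two digits, letter, two digits.
# The eighth character is a flag and is ignored in this sketch.
LEVEL_ENDS = (1, 2, 4, 5, 7)


def shared_levels(code_a: str, code_b: str) -> int:
    """Count how many leading hierarchy levels two codes have in common."""
    depth = 0
    for end in LEVEL_ENDS:
        if code_a[:end] != code_b[:end]:
            break
        depth += 1
    return depth


def code_similarity(code_a: str, code_b: str) -> float:
    """Hypothetical score: fraction of the five levels shared, in [0, 1]."""
    return shared_levels(code_a, code_b) / len(LEVEL_ENDS)


def word_similarity(vec_a: List[str], vec_b: List[str]) -> float:
    """Hypothetical aggregation: symmetric average of best-match code scores.

    Each argument is a coding vector, i.e. the list of Cilin codes of the
    primitives extracted for one word.
    """
    if not vec_a or not vec_b:
        return 0.0

    def directed(src: List[str], dst: List[str]) -> float:
        # For every code in src, take its best match in dst, then average.
        return sum(max(code_similarity(s, d) for d in dst) for s in src) / len(src)

    return (directed(vec_a, vec_b) + directed(vec_b, vec_a)) / 2


if __name__ == "__main__":
    # Made-up codes standing in for the primitives of two unincluded words.
    word_1 = ["Aa01A01=", "Bb02C03="]
    word_2 = ["Aa01A02="]
    print(word_similarity(word_1, word_2))
```

Because each primitive contributes its own code, two words absent from Cilin can still receive a nonzero score whenever their primitives share upper levels of the hierarchy, which is the gap the abstract says the proposed coding fills.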