
Research On Critical Technologies Of Sentence And Paragraph Alignment In English-Thai Bilingual Language In ASEAN Region

Posted on: 2023-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: C H Zhang
GTID: 2555306836964279
Subject: Cyberspace security
Abstract/Summary:
With the deepening of exchanges between China and ASEAN, the economic development, network security, and geopolitical security posture of ASEAN have gradually attracted China's attention. To help researchers study network public opinion in ASEAN, it is necessary to build ASEAN-related parallel corpora. Such corpora can effectively improve machine translation, word sense disambiguation, and cross-language entity alignment among ASEAN languages, and can strengthen information processing for less widely used languages. Cross-language sentence alignment is the key technology for building these corpora. Three problems stand in the way. First, few high-quality corpus resources are available for ASEAN languages, so there is not enough parallel data for model training. Second, there is no public evaluation data set, so it is difficult to evaluate models effectively. Third, corpora of different granularity call for different alignment techniques. To address these problems, this thesis focuses on Thai and studies sentence alignment and paragraph alignment techniques in depth. The main work and research results are as follows:

The English-Thai sentence alignment task requires a large quantity of training resources, and it is difficult to obtain sufficient parallel corpus data directly from the Internet for model training. Two methods are proposed to address this. The first uses cross-language word vectors to provide prior knowledge and combines them with a Siamese network to obtain a cross-language sentence vector model. The second obtains cross-language sentence vectors through knowledge distillation, and then judges whether sentences in different languages are similar in meaning according to the similarity of their encoded sentence vectors. Both methods effectively improve the accuracy of sentence alignment when corpus data is limited.

A paragraph is a collection of sentences that are closely related semantically. This thesis improves on the general paragraph vector method for paragraph alignment and proposes a multi-feature paragraph alignment method. First, the selection of paragraph feature sentences and n-gram splicing are considered together to obtain a candidate set of feature sentences. Then the similarity between feature sentences is calculated, and the paragraph whose feature sentences have the highest similarity is taken as the parallel paragraph. Compared with the traditional encoding method, this method improves the F1 value by about 1 percentage point.
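
The feature-sentence matching idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that each sentence has already been encoded into a vector by some cross-language sentence encoder (not shown), that "n-gram splicing" means combining n consecutive sentences, and that averaging their vectors is an acceptable stand-in for encoding the spliced text. The function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def feature_vectors(paragraph: Sequence[Vector], n: int = 2) -> List[List[float]]:
    """Candidate feature set: every sentence vector, plus the mean of every
    window of n consecutive sentences (an assumed stand-in for n-gram splicing)."""
    candidates = [list(vector) for vector in paragraph]
    for start in range(len(paragraph) - n + 1):
        candidates.append(mean_vector(paragraph[start:start + n]))
    return candidates


def paragraph_similarity(src: Sequence[Vector], tgt: Sequence[Vector], n: int = 2) -> float:
    """Highest cosine similarity between any pair of candidate features."""
    return max(
        (cosine(u, v) for u in feature_vectors(src, n) for v in feature_vectors(tgt, n)),
        default=0.0,  # an empty paragraph has no candidates
    )


def align_paragraphs(src_paragraphs, tgt_paragraphs, n: int = 2) -> List[int]:
    """For each source paragraph, return the index of the target paragraph
    with the highest feature similarity."""
    return [
        max(range(len(tgt_paragraphs)),
            key=lambda j: paragraph_similarity(src, tgt_paragraphs[j], n))
        for src in src_paragraphs
    ]


# Toy 3-dimensional "sentence vectors"; real ones would come from an encoder.
english = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
]
thai = [
    [[0.0, 0.9, 0.1]],
    [[0.95, 0.05, 0.0], [0.1, 0.0, 0.9]],
]
alignment = align_paragraphs(english, thai)
print(alignment)
```

With these toy vectors, the first English paragraph is matched to the second Thai paragraph and the second English paragraph to the first. Taking the best-matching feature pair, rather than one vector per paragraph, keeps a paragraph from being penalized when only part of it has a close counterpart. A real system would also need a similarity threshold to reject paragraphs that have no parallel at all.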
Keywords/Search Tags: network opinion, parallel corpus, English-Thai, sentence alignment, paragraph alignment