
Quantitative Research On Language Similarity Relationship In East Asia-Pacific Region

Posted on: 2018-01-14  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Z J Zhao
GTID: 1365330515991343  Subject: Chinese Ethnic Language and Literature
Abstract/Summary:
From the mid-19th century to the early 20th century, historical linguistics successfully resolved the genealogical relationships among most European languages, and genealogical classification was extended to the languages of the world. For Asian languages, research began with the historical relationships among the Indo-Chinese languages and gradually established the Sino-Tibetan language family. The genealogical classification of Sino-Tibetan has been studied for about 200 years and has produced a series of controversial classifications. These involve the languages of mainland East Asia, peninsular Southeast Asia, and the region extending southeast across the South Pacific, including the Dong-Tai, Miao-Yao, Tibeto-Burman, Chinese, Austroasiatic, and Austronesian languages. Many proposals have been put forward, for example the Thai-Kadai, Sino-Tibetan, Austronesian, Austroasiatic, Austro-Thai, and Sino-Austronesian language families. Scholars have found it difficult to reach consensus on these proposals; the disputes concern both the affiliation of specific languages and the relationships between language families.

Traditional language classification relies on experience and cannot quantify the degree of relationship between languages. Lexicostatistics (etymological statistics) rests on the selection of cognates, which depends on expert judgement, so the method is not fully objective and remains controversial. In response to the disputes in past research on Sino-Tibetan classification, this dissertation aims, following the principles of computational linguistics, to establish an objective and repeatable language classification system by computational means. Using computer technology and statistical methods, language relationships are studied on the basis of a mathematical model implemented as a dedicated program, so that the study of language relationships becomes formal, algorithmic, and automatic. Linguistic distance is calculated objectively from the differences between the varieties themselves.

In recent years, Levenshtein Distance has proven effective for measuring linguistic distances between languages or dialects, and it is used in several linguistic fields, such as computational linguistics and dialectology. Kessler (1995) first applied Levenshtein Distance to measure the linguistic distances between Irish Gaelic dialects. The approach has since been applied in many studies of languages and dialects, for example Dutch dialects, Sardinian, Norwegian, the Scandinavian languages, and German. Levenshtein Distance has also been applied to the Indo-European, Austronesian, Turkic, Indo-Iranian, Mayan, Mixe-Zoque, Otomanguean, Huitotoan-Ocaina, Tacanan, Chocoan, Muskogean, and Austro-Asiatic language families. It has produced good results in the practice of the Max Planck Institute in Germany and has proven effective for measuring distances between Western languages.

The Levenshtein Distance between two strings is defined as the minimum number of edits needed to transform one string into the other. When words are treated as sequences of phonetic transcription symbols, the Levenshtein Distance between two words in two languages is the minimum number of edits needed to transform one transcription into the other. We assume that this reflects both how pronunciation differences are perceived and the changes that occur during language evolution. In this way, the two pronunciations of any cognate in two different languages can be compared, and the distance between any two languages can be calculated from the distances between their cognates.

Greenhill, however, questions language classification based on Levenshtein Distance. Greenhill (2011) tests its performance in classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification produced by Levenshtein Distance with that of the comparative method shows that the Levenshtein Distance classification is correct only 40% of the time. Standardizing the orthography improves performance, but only to a maximum of 65% accuracy within language subgroups. Greenhill concludes that Levenshtein Distance fails to identify language relationships accurately, and that the major cause of this poor performance is that Levenshtein Distance is linguistically naive.

Building on Greenhill's conclusions, this dissertation revises the traditional Levenshtein Distance algorithm using the Almeida & Braun Articulation System and thereby improves the performance of Levenshtein Distance language classification. The revised algorithm is then validated on six Indo-European languages and seven languages of the Tibetan branch. The validation shows that the classifications produced by the revised Levenshtein Distance are very similar to those of the traditional historical-comparative method. The revised algorithm is therefore feasible, its classification is credible and objective, and it can be used to calculate and classify the relationships between languages. This systematic quantitative method is algorithmic and automatic and does not depend on subjective judgement.

Finally, the dissertation applies the classification system to the language relationships of the Sino-Tibetan language family and classifies 77 languages or dialects of the East Asian mainland and the Southeast Asia-Pacific region. It presents its own classification and offers some views of its own. The results of the quantitative classification of the 77 languages show that the revised Levenshtein Distance classification method proposed here can be applied to the study of East Asian languages, and can be extended to all languages and dialects in China to yield a comprehensive, accurate, and relatively scientific classification.
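
The definition of Levenshtein Distance given in the abstract, and the idea of deriving a language-level distance from word-level distances, can be illustrated with a minimal Python sketch. It is not the dissertation's program. The function names, the normalization by the longer transcription, the averaging over shared concepts, and the pluggable `sub_cost` hook (where a feature-based substitution cost such as one derived from the Almeida & Braun system could be supplied) are all assumptions made for illustration. Each character is treated as one phonetic segment, and the default cost is the plain unit cost of the traditional algorithm.

```python
from typing import Callable, Mapping, Sequence


def unit_cost(a: str, b: str) -> float:
    """Traditional substitution cost: 0 for identical segments, 1 otherwise."""
    return 0.0 if a == b else 1.0


def levenshtein(src: Sequence[str], tgt: Sequence[str],
                sub_cost: Callable[[str, str], float] = unit_cost) -> float:
    """Minimum total cost of insertions, deletions and substitutions
    needed to turn the segment sequence src into tgt."""
    # prev[j] is the distance between the src prefix seen so far and tgt[:j].
    prev = [float(j) for j in range(len(tgt) + 1)]
    for i, s in enumerate(src, start=1):
        curr = [float(i)]
        for j, t in enumerate(tgt, start=1):
            curr.append(min(prev[j] + 1.0,              # delete s
                            curr[j - 1] + 1.0,          # insert t
                            prev[j - 1] + sub_cost(s, t)))  # substitute
        prev = curr
    return prev[-1]


def normalized_distance(src: Sequence[str], tgt: Sequence[str],
                        sub_cost: Callable[[str, str], float] = unit_cost) -> float:
    """Word distance divided by the longer length (an assumed normalization)."""
    longest = max(len(src), len(tgt))
    return levenshtein(src, tgt, sub_cost) / longest if longest else 0.0


def language_distance(words_a: Mapping[str, Sequence[str]],
                      words_b: Mapping[str, Sequence[str]],
                      sub_cost: Callable[[str, str], float] = unit_cost) -> float:
    """Mean normalized distance over concepts present in both word lists
    (averaging is an assumption; the abstract does not state the formula)."""
    shared = sorted(words_a.keys() & words_b.keys())
    if not shared:
        raise ValueError("the two word lists share no concepts")
    total = sum(normalized_distance(words_a[c], words_b[c], sub_cost)
                for c in shared)
    return total / len(shared)


# Hypothetical toy word lists, not real language data.
toy_a = {"one": "abc", "two": "abcd"}
toy_b = {"one": "abd", "two": "abcd"}
toy_distance = language_distance(toy_a, toy_b)
```

A matrix of such pairwise language distances is the kind of input that a clustering or tree-building step would take to produce a phylogenetic tree; that step is not shown here.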
Keywords/Search Tags:Sino-Tibetan Language Family, Almeida&Braun Articulation System, Levenshtein Distance, Swadesh Core Words, Phylogenetic tree