Quantitative Word Order Typology: A Crosslinguistic Study Based on Large-Scale Treebanks

Posted on: 2023-07-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J W Yan
GTID: 1528306782998369
Subject: Foreign Language and Literature
Abstract/Summary:
As a popular field of linguistics, typology mainly focuses on language universals and linguistic classifications. With the development of information technology and computer science, large-scale databases, quantitative methods, and multidimensional perspectives are playing an increasingly important role in linguistic typology. Although word order typology has developed and matured, it still merits further exploration for the following reasons. First, the basic word order typology proposed in Greenberg (1963) marks the beginning of modern typology. In the context of the “quantitative turn” in linguistics, it is particularly important to return to the “probabilistic nature” of the basic word order typology and to investigate typological patterns and universals across language families. Second, the Indo-European languages have received extensive attention in the basic word order typology, but there is still a lack of research on the typological classification of cross-genus languages based on quantitative word-order indicators. Such studies can better examine the role of word-order features in the typological classification of languages and explore the relationship between genealogical and typological classifications. Finally, within the Indo-European languages, the Slavic ones are known for their “free word order and rich morphological features”. This “complexity trade-off” relationship between word order and morphology is an important manifestation of language as a “human-driven complex adaptive system”. Therefore, further exploration of the complex system of cross-group languages can provide a better picture of how humans effectively encode linguistic information for efficient communication and explain language individualities and variations.

Based on the above, three research questions are examined from both discrete and continuous perspectives: (1) Can the implicational universals of Greenberg’s basic word order typology be validated at the cross-family
level? (2) Can an accurate linguistic typological classification of the Indo-European languages be achieved at the cross-genus level? (3) Can the synergistic evolution of word orders of the Slavic languages be explained at the cross-group level?

The results show that statistical and quantitative methods based on large-scale treebanks, from both discrete and continuous perspectives, can reflect the quantitative features of languages across families, genera, and groups, and display the universal patterns, typological classifications, and dynamic evolutions of languages at different levels.

(1) In terms of language universals at the cross-family level, we compared the dominant orders of 74 languages based on large-scale treebanks with their dominant orders in The World Atlas of Language Structures (WALS). We found no statistically significant difference between the two, and treebanks are capable of providing typological information for languages not covered in WALS. This result illustrates the feasibility and advantages of extracting word-order information from treebanks. Based on the discrete results, we tested the implicational universals of Greenberg’s basic word order typology and found that the first four implicational universals are well confirmed, while the fifth one is violated. Specifically, two languages that meet the antecedents of Universal 5 fail its consequent. This verification of the cross-family universals from the discrete perspective gives an intuitive understanding of how treebank-based data can be applied in the typological study of language universals. Then, we examined the word order universals from a continuous perspective. The results show that, except for the fifth universal, the probabilities of the word-order pairs involved in each universal are closely correlated. Moreover, mixed-effects models can be used to predict implicational universals, and the predictions also indicate that the fifth one is falsified. It means
that the implicational universals only indicate a tendency, rather than an absolute rule. The above analysis calls for a more rigorous and scientific approach to language universals, and provides a reference for exploring the implicational universals based on quantitative approaches and multidimensional perspectives.

(2) As for the typological classifications of languages at the cross-genus level, we focused on the quantitative features of word order among four genera of Indo-European languages (Romance, Germanic, Indic and Slavic) based on 11 Parallel Universal Dependencies treebanks. From a discrete perspective, we found that the most frequent word orders in the cross-genus treebanks are highly consistent with the word orders in Greenberg’s basic word order typology. They are all located at the top of the long-tailed Zipf distribution. In addition, the dominant orders of the five major binary word-order relations are largely consistent with the dominant orders given in WALS. This result further demonstrates the feasibility of adopting treebanks for the discrete typological classification of cross-genus languages. From a continuous perspective, we achieved reasonable typological classifications of the Indo-European languages based on both the frequency of the binary word orders and the dependency direction of the binary word-order combinations, and the resulting typological classifications are very close to the traditional genealogical classifications. However, the word order freedom indicators based on the binary word orders did not provide a good typological classification of the Indo-European languages. The results suggest that there are both commonalities and differences in the word-order features of sub-genus languages, and that the typological classifications of languages vary under different typological parameters.

(3) Regarding the synergetic evolutions of languages at the cross-group level, we used 24 cross-group treebanks of 13 Slavic languages as the target dataset
and 10 treebanks of 4 non-Slavic languages as the baseline dataset to address the issue that “the word order freedom of binary word orders did not achieve a good typological classification of languages”. First, we found that large-scale treebanks can provide new evidence and reference for the dominant order of the ternary word-order relations (subject-verb-object order) of Slavic languages, viz., all Slavic languages in our sample demonstrate a strong tendency to be SVO dominant. Second, based on the word order freedom of the ternary word-order relations and the morphological richness of the treebanks adopted, we found that the ternary word-order relations can better capture the flexibility of word orders in Slavic languages and verify the proposition of traditional typologists and linguists that “the word order of Slavic languages is more flexible”. Meanwhile, the word order freedom and morphological richness of the Slavic languages are highly correlated. In other words, the “complexity trade-off” hypothesis is confirmed. Moreover, one noteworthy point is that, compared to the modern Slavic languages, the ancient ones are morphologically more complex and syntactically freer. In other words, modern Slavic languages are less morphologically marked and more rigid in their word orders. This result suggests that language is a dynamic and synergistic system and that human beings tend to follow the “least effort principle” when encoding language for efficient communication, providing an instantiation of language as a self-regulated and self-adapted human-driven system.

The results show that quantitative approaches can capture the word-order features of languages, demonstrating language universals, typological classifications and synergetic evolutions in various dimensions. Although it differs from traditional typological research methods, exploring the typological features of word order based on large-scale treebanks has demonstrated great vitality and potential. Moreover, word
order, as one component of the human-driven complex adaptive system of language, has implicational universals across families, typological features across genera, and synergetic dynamics with other components across groups. This is precisely what word order typology can provide, namely, an accurate and thorough understanding of languages in terms of the verification of linguistic universals, the typological classification of languages, and the explanation of language variations.

This study examines the typological features of the basic word orders from the macroscopic to the microscopic level. Based on three levels of genealogical classification, it captures cross-linguistic universals and differences, enriches the research dimensions of word order typology, expands the boundaries of quantitative typology, and provides a comprehensive and systematic examination of word order types across language families, genera, and groups. Specifically, it delves into the widely accepted typological universals across language families, reveals the commonality and uniqueness of typological features across language genera, and analyzes the synergistic evolution and dynamic equilibrium of languages across language groups. This study is significant in revealing the language universals of word orders, showing the commonalities and differences of languages, explaining the dynamic evolutions of languages, and promoting the “quantitative turn” of typology. Meanwhile, it is of value in explaining how human beings encode language as a complex adaptive system, and in revealing the “probabilistic nature” of languages. It may provide important references and broad possibilities for linguistic research, especially for cross-linguistic and cross-domain natural language processing and machine learning. It may also guide the rational development and utilization of typological knowledge in the field of artificial intelligence in the future.
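
The abstract does not state how dominant order or word order freedom was computed. As a minimal illustrative sketch only, the Python below shows one common way such quantities could be derived from dependency annotations: counting whether a dependent precedes or follows its head, picking a dominant order, and using Shannon entropy as a freedom measure. The tuple format, the function names, the two-to-one dominance threshold, and the choice of entropy are all assumptions made for illustration, not the dissertation's method.

```python
import math
from collections import Counter

def order_counts(sentences, relation):
    """Count dependent-before-head vs dependent-after-head cases of one relation.

    `sentences` is a list of sentences; each sentence is a list of
    (token_id, head_id, deprel) tuples (a hypothetical, simplified format).
    """
    counts = Counter()
    for sentence in sentences:
        for token_id, head_id, deprel in sentence:
            if deprel == relation:
                key = "dep-before-head" if token_id < head_id else "dep-after-head"
                counts[key] += 1
    return counts

def dominant_order(counts, threshold=2.0):
    """Return the most frequent order if it is at least `threshold` times as
    frequent as the runner-up; otherwise None (no dominant order).
    The threshold value is an illustrative assumption."""
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[0][1] >= threshold * ranked[1][1]:
        return ranked[0][0]
    return None

def order_entropy(counts):
    """Shannon entropy (in bits) of the order distribution.
    0 means a fully rigid order; higher values mean freer order."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

# Toy data: three sentences with the object after the verb, one with it before.
toy = [
    [(1, 2, "nsubj"), (2, 0, "root"), (3, 2, "obj")],
    [(1, 2, "nsubj"), (2, 0, "root"), (3, 2, "obj")],
    [(1, 2, "nsubj"), (2, 0, "root"), (3, 2, "obj")],
    [(1, 3, "nsubj"), (2, 3, "obj"), (3, 0, "root")],
]
obj_counts = order_counts(toy, "obj")
print(obj_counts, dominant_order(obj_counts), round(order_entropy(obj_counts), 3))
```

Under these assumptions, a language whose objects almost always follow the verb would show near-zero entropy for the object relation, while a language with flexible placement would approach 1 bit.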
Keywords/Search Tags:Cross-linguistic, Treebank, Word order typology, Quantitative linguistics, Universal, Typological classification, Synergetic evolution