
Automatic Identification And Extraction Of English Verb Patterns: A Study Based On The Clustering Of Concordances

Posted on: 2016-03-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Yu
GTID: 1225330467491162
Subject: Foreign Linguistics and Applied Linguistics
Abstract/Summary:
Patterned language is pervasive in texts. Summarizing and extracting language patterns is of great significance to language studies, lexicography and language education. Traditional manual pattern extraction is time-consuming, labor-intensive and inapplicable to large-scale corpora, and previous studies on automatic pattern extraction have not shown promising results. This study therefore aims to extract English verb patterns automatically, based on similarity measurement and clustering of concordances. It addresses two research questions: (1) What factors influence the clustering of concordances, and how should the number of concordance groups be set? (2) What are the precision and recall of the automatically extracted English verb patterns, and what factors influence them?

Based on Pattern Grammar (Hunston & Francis 2000) and the Verb Pattern List (Francis et al. 1996), this study summarizes the essential elements of verb patterns to build feature sets as the starting point for the automatic classification of concordances. The five-step procedure is as follows:
1) Extract concordances from POS-tagged corpora.
2) Summarize the essential elements in verb pattern lists to build feature sets.
3) Transform the linguistic information in concordances into features.
4) Measure the similarity between every pair of concordances and cluster the concordances into groups automatically.
5) Extract the features shared within each group of concordances and generate a verb pattern list.

Data for model debugging and testing are extracted from the written part of the BNC (90 million tokens). The debugging data consist of 8,000 sampled concordances (1,000 lines for each of 8 verbs: appeal, complain, end, give, hold, insist, persuade, protect), which are used to summarize the transformation rules for pattern elements.
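Steps 3 to 5 of the five-step procedure can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the toy concordances, the feature names (`R1:on` and so on), the use of Jaccard similarity, and the single-link threshold grouping, which stands in here for whatever clustering algorithm the study actually uses.

```python
from itertools import combinations

# Hypothetical right contexts of the node verb "insist" as (word, POS tag) pairs.
# The tags and the feature scheme are illustrative, not the study's feature set.
concordances = [
    [("on", "PRP"), ("paying", "VVG")],
    [("on", "PRP"), ("leaving", "VVG")],
    [("that", "CJT"), ("he", "PNP")],
    [("that", "CJT"), ("they", "PNP")],
]

def to_features(right_context):
    """Step 3: turn the first two right-context tokens into a feature set."""
    features = set()
    for position, (word, tag) in enumerate(right_context[:2], start=1):
        # Closed-class items keep their word form; other items keep only the tag.
        if tag in {"PRP", "CJT"}:
            features.add(f"R{position}:{word}")
        else:
            features.add(f"R{position}:{tag}")
    return frozenset(features)

def jaccard(a, b):
    """Step 4a: similarity of two feature sets (assumed measure)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(feature_sets, threshold=0.5):
    """Step 4b: single-link grouping of concordances with similarity >= threshold."""
    parent = list(range(len(feature_sets)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in combinations(range(len(feature_sets)), 2):
        if jaccard(feature_sets[i], feature_sets[j]) >= threshold:
            parent[find(j)] = find(i)
    groups = {}
    for i in range(len(feature_sets)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def shared_pattern(group, feature_sets):
    """Step 5: the features shared by every concordance in a group."""
    return sorted(frozenset.intersection(*(feature_sets[i] for i in group)))

feature_sets = [to_features(c) for c in concordances]
groups = cluster(feature_sets)
patterns = [shared_pattern(g, feature_sets) for g in groups]
print(groups, patterns)
```

On this toy input the two groups correspond roughly to "V on -ing" and "V that"; real concordances would need a far richer feature set and transformation rules.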
To calculate the precision of automatic pattern extraction, 5,365 concordances for 6 verbs of different frequencies (admit, agree, argue, claim, lead, tell) are sampled from PDEV's online database, which provides concordances manually categorized according to the pattern types (n≥5) defined by the research team led by Patrick Hanks. The concordances of each verb are then clustered into groups. Finally, the automatically extracted patterns are compared with the manually labeled ones to calculate the precision of clustering.

To explore the best method of setting K, the concordances with manual pattern labels (the testing set) are clustered twice. First, the number of concordance groups (K) is set according to the manual classification. Second, K is set based on the internal validity measure of KMeans.

The analysis of the verb patterns automatically extracted from concordances in the two data sets shows that:
1) The verb itself, the number of concordances and the degree of heterogeneity of the concordances jointly influence the precision of concordance clustering. Selecting K based on the internal validity measure is much more flexible and open, and yields better results and higher precision.
2) The average precision of automatic pattern extraction is 90.99% and 95.91% respectively in the two clusterings of concordances, higher than the average precision achieved in previous studies (81%). Parentheses and special sentence structures are the main factors that affect the precision and recall of the automatic extraction of English verb patterns.

The automatic pattern identification and extraction model proposed in this study is feasible for the exhaustive, automatic analysis of verbs in large-scale corpora and can be broadly applied to the analysis of other word classes.
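The idea of setting K from an internal validity measure, and of scoring the clusters against manual labels, can be sketched as follows. The abstract does not say which internal measure or which precision formula is used, so the mean silhouette coefficient and cluster purity below are assumptions, as are the toy feature vectors, the gold labels and the deterministic farthest-first initialisation.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient (assumed internal validity measure)."""
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not same:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)

def purity(labels, gold):
    """Share of concordances carrying the majority manual label of their cluster."""
    correct = 0
    for cluster_id in set(labels):
        in_cluster = [g for l, g in zip(labels, gold) if l == cluster_id]
        correct += max(in_cluster.count(g) for g in set(in_cluster))
    return correct / len(labels)

# Hypothetical binary feature vectors for six concordances; the last column is noise.
points = [
    (1, 1, 0, 0, 0, 0), (1, 1, 0, 0, 0, 1),
    (0, 0, 1, 1, 0, 0), (0, 0, 1, 1, 0, 1),
    (0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 1),
]
gold = ["A", "A", "B", "B", "C", "C"]  # hypothetical manual pattern labels

scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(best_k, purity(kmeans(points, best_k), gold))
```

In this sketch K is chosen without looking at the manual labels, which are used only afterwards for scoring. That mirrors the second clustering run described above, where K is not fixed by the manual classification.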
Keywords/Search Tags: Pattern Grammar, concordances, similarity, clustering, pattern extraction