
Constructing A Model For The Automatic Identification Of Move Structure In English Research Article Abstracts

Posted on: 2017-04-03; Degree: Doctor; Type: Dissertation
Country: China; Candidate: X Liu
GTID: 1225330482985522; Subject: Foreign Linguistics and Applied Linguistics
Abstract/Summary:
In the era of Big Data, it is of great significance to explore the knowledge structure and discover the trends of a research area. It is widely acknowledged that abstracts are an accessible and important source of information for Knowledge Discovery. Nevertheless, existing data mining techniques cannot identify key information moves, let alone anchor the key information inside these moves. This research gap calls for the construction of a model that automatically identifies move structures.

Research on automatic text categorization in Natural Language Processing has suggested three types of models, each with its own strengths and limitations. The first type, commonly referred to as "bag of words" models, is based on word frequency and statistical methods. It can exhaust all term features, but it is prone to sparse features because it lacks feature selection. The second type is based on linguistic rules; it avoids the sparse-feature problem but cannot exhaust all features. The third type, which integrates both "bag of words" and contextual features, performs better, but it can only handle structured abstracts; its performance on unstructured abstracts is unsatisfactory.

In response to this situation, this study aims to construct a model that can automatically identify the move structure of more types of English abstracts with better performance. Knowledge and techniques from several disciplines were employed, including Corpus Linguistics, Natural Language Processing, Information Retrieval, and statistics. In addition, theories and concepts from Linguistics (e.g. Move Analysis) were drawn on to compensate for the limitations of the existing models.

There are four stages in constructing the model:

(1) Data preparation and pre-processing.
At this stage, we downloaded from the Web of Science database the abstracts of all the English research articles of Applied Linguistics published from 1993 to 2014. Altogether, 440 texts were collected after book reviews, conference articles, and editorials were filtered out. The data were cleaned and pre-processed with POS tagging and parsing.

(2) Manual annotation. The manual annotation was carried out by highly experienced researchers in this field, and the whole process took a full year. The coding scheme, with six moves and the sentence as the analytical unit, integrated the top-down (i.e. coding based on existing coding schemes) and bottom-up (i.e. coding without existing coding schemes) approaches. Two coders then coded all the texts independently and achieved a high degree of agreement (Kappa=0.785). Finally, they discussed and resolved every difference between their codings in order to reach full agreement.

(3) Feature extraction and model construction. After manual annotation, various features were extracted to predict the move structure. The effective predictive features were identified and used to construct the model with a Conditional Random Field (CRF) classifier.

(4) Model evaluation. At the evaluation stage, we used 10-fold cross-validation to evaluate the system. The data were randomly divided into 10 sets; in each round, 9 sets were used to train the classifier and the remaining set was used to test it. The final performance was the average over the ten rounds. In addition, the final performance was compared with that of previous models in order to explore the advantages and shortcomings of this model.

This study contributes to current research in three main respects. First, it contributes to genre studies by being data-based: unlike traditional genre studies, it approaches genre on the basis of big data.
Second, this research validated the effectiveness of four features proposed by existing models. In addition, it identified three new features and confirmed that these features, retrieved with the corpus approach, are more effective predictors than those obtained in traditional ways. Among the three feature orientations, meaning-oriented features are the strongest predictors of moves (F=0.609), form-oriented features are the weakest (F=0.317), and contextual features lie in between (F=0.428). Third, this study constructed an effective model for the identification of move structure. The performance of this model (F=0.7819) is the best so far among the existing models. Its performance on informative abstracts (F=0.8218) is 4.5% higher than that of the best existing model. To ensure comparability, we trained the "bag of words" model (AntMover) on the same data set; our model performed 23% better than AntMover.

Constructing a model for the automatic identification of move structure in English research article abstracts is a necessary step for Knowledge Discovery to anchor the key moves and, further, to pinpoint the key information inside each move. In addition, automatic identification moves beyond the manual analysis of move structure that has long prevailed in ESP, which could help theoretical and empirical studies of move analysis develop into a more comprehensive, multi-perspective, and multi-dimensional field of study through integration with other research areas.
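
The inter-coder agreement check in stage (2) and the evaluation procedure in stage (4) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the implementation used in the dissertation: all function names are hypothetical, the F score is assumed to be macro-averaged over move labels (the abstract does not say how F was averaged), and `train_fn` is a placeholder for whatever classifier is trained on each fold.

```python
import random
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same sentences."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Agreement expected by chance, from each coder's label distribution.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for k random, disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over the labels present in gold.

    Assumption: macro-averaging; the source does not specify the average.
    """
    scores = []
    for label in set(gold):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validate(items, labels, train_fn, k=10, seed=0):
    """Average macro-F1 over k folds.

    train_fn(train_items, train_labels) must return a predict(item) function;
    it is a hypothetical stand-in for the classifier trained in each round.
    """
    fold_scores = []
    for train, test in k_fold_splits(len(items), k, seed):
        predict = train_fn([items[i] for i in train], [labels[i] for i in train])
        predicted = [predict(items[i]) for i in test]
        fold_scores.append(macro_f1([labels[i] for i in test], predicted))
    return sum(fold_scores) / len(fold_scores)
```

With 440 texts and `k=10`, each round in this sketch trains on 396 texts and tests on the remaining 44. The sketch splits by item; if sentences are the unit of classification, splitting by abstract rather than by sentence would keep sentences from one abstract out of both the training and test sets, though the abstract does not state which was done.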
Keywords/Search Tags: genre analysis, move structure, automatic identification, English abstracts