
Rater Bias Studies in Online TEM 4 Essay Marking

Posted on: 2011-11-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Lu    Full Text: PDF
GTID: 1115330332959114    Subject: English Language and Literature
Abstract/Summary:
With online essay marking now accepted as the norm in the direct testing of writing, the Test for English Majors Testing Center introduced two major reforms in the 2009 Test for English Majors (Grade Four). One remarkable change related to scoring was the implementation of online scoring in TEM 4 essay marking in place of "paper-and-pencil" scoring. Online scoring is defined as a process in which the essay scripts are scanned and the images transmitted electronically to an image server at the test control center; the images are then distributed electronically and marked on screen by raters, and marks are captured electronically without manual intervention.

The other, equally remarkable, change in TEM 4 essay marking was the replacement of the holistic marking scheme with an analytic marking scheme. Until 2009, the marking scheme used in TEM 4 essay marking was a 15-point holistic scale with five bands (3-5, 6-8, 9-11, 12-14, 15). The brief descriptions for the five bands were Effective Communication with Accuracies for Band 5, Good Communication with Few Inaccuracies for Band 4, Passable Communication with Some Inaccuracies for Band 3, Problematic Communication with Frequent Inaccuracies for Band 2, and Almost No Communication for Band 1. The new 15-point rating scale put to use in operational rating sessions in 2009 takes a different approach, evaluating candidates' essays in terms of Ideas & Arguments (1-7), Language Use (1-6) and Mechanics (1-2). The preliminary research conducted by Dr. Li Qinghua in 2008 and 2009 attests to the reliability and validity of the new analytic rating scale.
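For illustration only, the new analytic scale can be pictured as a small data structure whose three category scores sum to the 15-point total. The sketch below is not part of the dissertation: the category names and score ranges follow the description above, while the function and variable names are invented.

```python
# Illustrative sketch of the 2009 TEM 4 analytic scale described above.
# Category names and score ranges come from the abstract; everything else is hypothetical.

ANALYTIC_SCALE = {
    "Ideas & Arguments": (1, 7),
    "Language Use": (1, 6),
    "Mechanics": (1, 2),
}

def total_score(sub_scores: dict) -> int:
    """Check each category score against its range and return the 15-point total."""
    total = 0
    for category, (low, high) in ANALYTIC_SCALE.items():
        score = sub_scores[category]
        if not low <= score <= high:
            raise ValueError(f"{category} score {score} outside range {low}-{high}")
        total += score
    return total

# An essay judged 5 / 4 / 2 on the three categories receives 11 out of 15.
print(total_score({"Ideas & Arguments": 5, "Language Use": 4, "Mechanics": 2}))
```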
The present investigation of TEM 4 rater behavior, especially rater bias patterns, constitutes an integral component of the rating validation study of the newly adopted TEM 4 online essay marking. It is anticipated that the present study will uncover possible rater bias patterns resulting from the interactions of different facets in the online TEM 4 essay marking procedure, and that it will shed some light on the theoretical basis as well as operational models for more effective TEM 4 rater training, which has been strongly emphasized by the TEM Testing Center ever since the first TEM 4 rating session. The chief implications of the present study lie in more efficient quality control of essay marking and in the achievement of test fairness in such large-scale, high-stakes tests as TEM 4.

The current research aims to answer the following research questions:
1. How do TEM 4 essay raters evaluate online essay marking?
2. How do TEM 4 essay raters evaluate online rater training?
3. What are the manifestations of TEM 4 essay rater bias?
4. How do the rater-bias-related facets in the online rating process interact with each other?
5. How effective is the online rater training program in reducing rater bias?
6. What distinctions can be perceived between online essay marking and "paper-and-pencil" essay marking?

Previous research closely associated with rater bias studies in online TEM 4 essay marking comprises investigations into the testing of ESL (English as a Second Language) writing, studies of essay rater behavior, studies of rater bias, and research on rater training and online essay marking. It is precisely the research findings and theories in these fields that furnish the exploratory framework of the current study.

According to a multitude of investigations, ESL writing tests have evolved from indirect to direct testing of writing. The former is characterized by selected responses such as true-false statements and multiple-choice questions, controlled writing, and sentence-combination tests; the latter elicits samples of writing by requiring test-takers to complete writing tasks. The concept of communicative language competence is believed to have given rise to the direct testing of writing. According to Hamp-Lyons (1991:7), it was during the 1970s that there was increasing emphasis on language as communication, and the 1980s witnessed developments in task-based learning and assessment. However, the direct testing of writing is not without dispute and disfavor. One of the basic arguments against the direct assessment of writing has been its unreliability, and the issue of validity has also drawn attention. Researchers like Jacobs et al. (1981) helped to establish the prestige of the direct assessment of writing. According to Jacobs et al. (1981:3), the direct testing of writing emphasizes to learners the importance of language for communication, rather than for mechanical or meaningless "language-like" behavior, and promotes a closer match between what is (or should be) taught and what is (or should be) tested, i.e. communicative skills for language use.

The types of scoring schemes (in some cases also termed rating scales) used in ESL/EFL writing tests vary with the purposes of the tests. Among the widely acknowledged scales/schemes are the holistic scoring scheme, the analytic scoring scheme and the primary trait scoring scheme, all formulated in accordance with different testing conceptions and purposes. The holistic scoring scheme has the advantage of being very rapid, which makes it possible for each essay to be marked by more than one rater, thus increasing reliability; the epitome of such a scale is the Test of Written English Scoring Guide for TOEFL. An inherent problem with the holistic scoring scheme, however, is that it cannot provide diagnostic information, especially about the uneven development of subskills in individuals. One danger in depending on such a rating scale, Weir (1993:163) states, is that a marker's impression of the overall quality might be affected by just one or two aspects of the work. In some ways this is similar to the halo effect, i.e. the possibility that the rating of one criterion has a knock-on effect on the rating of the next. A remedy is another type of marking scheme, the analytic scoring scale, representatives of which include Anderson's Scheme and the ESL Composition Profile designed by Jacobs et al. in 1981. Although analytic scoring may take longer than holistic scoring, it has the added advantage of "performing a certain diagnostic role in delineating students' strengths and weaknesses in written production" (Weir, 1993:164). A primary trait scoring approach, finally, is conceived to be highly performance-oriented and task-specific.
Primary trait scoring, according to Hamp-Lyons (1991:246), is based on the view that one can only judge whether a writing sample is good or not by reference to its exact context, and that appropriate scoring criteria should be developed for each prompt.

For a long time, the rating of ESL/EFL writing has depended on human raters and is therefore inescapably accompanied by subjectivity on the part of raters, resulting in what is generally referred to as the rater effect. Jacobs et al. (1981:3) point out that the direct testing of writing utilizes the important intuitive, albeit subjective, resources of other participants in the communication process, the readers of written discourse, who must be the ultimate judges of the success or failure of the writer's communicative efforts. One of the three sources of inexactness, according to Mathews (1985:90), lies in the mark schemes themselves and in the variety of interpretation and application of the mark schemes by the markers. Lumley and McNamara (1995:55) claim that it has long been recognized (for at least a century, in fact) that variability in test scores associated with rater factors is extensive. Unreliable raters are described by Alderson (1995:128) as those who change their standards during marking, apply criteria inconsistently, or do not agree with other examiners' marks. From a psychometric perspective, Lumley (2005:23) summarizes, rater variability, or error, can be characterized in various ways: rater severity or leniency, where a rater consistently rates higher or lower than the performance merits, or than other raters; the halo effect, where a rater fails to distinguish between conceptually distinct and independent aspects of a student's composition; central tendency, where the rater tends to give ratings clustered closely around the mid-point of the scale; restriction of range, where raters effectively do not use the entire scale, making it hard to distinguish between test takers; and randomness or inconsistency in rating behavior.

Numerous measuring devices have traditionally been used to monitor intra-rater and inter-rater reliability, including double marking, multiple marking, the calculation of the mean and the standard deviation, the analysis of variance, and the calculation of the reliability coefficient (alpha) and the correlation coefficient.
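As a rough illustration of those traditional devices (not drawn from the dissertation itself), the sketch below applies per-rater means and standard deviations, pairwise correlations, and Cronbach's alpha to a small, invented matrix of 15-point scores; the data and variable names are hypothetical.

```python
# Hedged sketch of the traditional reliability devices named above, applied to an
# invented ratings matrix: rows are essays, columns are raters, cells are 15-point scores.
import numpy as np

ratings = np.array([
    [11, 12, 10],
    [ 8,  9,  8],
    [14, 13, 14],
    [ 6,  8,  7],
    [10, 10,  9],
], dtype=float)

# Per-rater mean and standard deviation: crude indicators of severity/leniency
# and of restriction of range or central tendency.
print("rater means:", ratings.mean(axis=0))
print("rater SDs:  ", ratings.std(axis=0, ddof=1))

# Inter-rater reliability: pairwise Pearson correlations between raters' scores.
print("correlations:\n", np.corrcoef(ratings, rowvar=False))

# Cronbach's alpha, treating the raters as "items" scoring the same essays.
k = ratings.shape[1]
item_variances = ratings.var(axis=0, ddof=1).sum()
total_variance = ratings.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances / total_variance)
print("Cronbach's alpha:", round(alpha, 3))
```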
The online rating systems developed over roughly the past fifteen years have proved effective in both detecting and minimizing rater-related measurement errors, errors which have been universally viewed as factors persistently affecting the reliability and validity of writing tests. Research on such measurement errors, which once focused only on the reliability of writing tests, has come to incorporate studies of both scoring reliability and scoring validity. Investigations into the exact sources of variability in essay marking used to centre on scrutinizing test results; it is the present-day discussions of scoring validation that take an integrated approach by examining the elements involved in the rating process (candidates' ability, the rating scale, the rating mode, the context of rating, rater behavior, rater training, etc.). Approaches to investigating the rating process, Lumley (2005:25) concludes, have centered on two main areas, focusing on what is claimed to be observable and knowable: factors influencing the rater, and the process or sequence of steps the rater adopts in rating.

Presumably more integrated are studies of rater behavior, which inspect all the possible facets exerting an influence on essay raters in terms of their interactions. Insights into rater bias, particularly its sources, have derived from the understanding that raters interact continually with numerous facets during the rating process. Researchers like Diederich et al., Vaughan, and Hamp-Lyons found that different raters responded to different facets of writing and did so with some internal consistency. Their studies also showed that essay raters responded to cultural differences in essays and did so differentially, in ways that appeared to be partially attributable to the raters' experiential backgrounds and to their response to the students' linguistic/rhetorical backgrounds. The variability of rater behavior may also be accounted for by such background variables as gender, race, discipline, geographic origin, and the amount of exposure to the writing of nonnative users of English. To Lumley (2005:312), rating is at one level a rule-bound, socially governed procedure that relies upon a rating scale and the rater training which supports it, but it retains an indeterminate component as a result of the complexity of raters' reactions to individual texts. The task raters face is to reconcile their impression of the text, the specific features of the text, and the wording of the rating scale, thereby producing a set of scores; the indeterminacy is manifest in the variations in the scores given by raters. Lumley (2005:141) identifies three kinds of behaviors related to the stages of the rating procedure: management behaviors, reading behaviors, and rating behaviors.

The concept of rater bias derived from the realization that raters have idiosyncratic responses to different facets in the rating process. Lumley and McNamara (1995:56) submit that raters may display particular patterns of harshness or leniency in relation to only one group of candidates, not others, or in relation to particular tasks, not others, or on one rating occasion, not the next; that is, there may be an interaction involving a rater and some other aspect of the assessment setting. Such an interaction is termed bias. Engelhard (1994:98) defines it as the tendency on the part of raters to consistently provide ratings that are lower or higher than is warranted by student performance. To Schaefer (2008:466), the term bias refers to rater severity or leniency in scoring. The study of bias is thus, observes McNamara (1996:143), the study of interaction effects, e.g. systematic interaction between particular raters and particular candidates, or between particular raters and particular tasks/items.
A number of rater bias studies have investigated the interactions between raters and rating domains/items, raters and candidates, raters and reading strategies, raters and rater backgrounds, raters and rating occasions, and raters and rater training programs. Progress in the exploration of rater bias is also attributable to rapid developments in statistics and computer software (e.g. FACETS). In addition to more conventional devices such as the calculation of inter- and intra-rater reliability coefficients, new techniques are available for inquiring into the interrelationships of the many facets. Some researchers, for instance, employ think-aloud protocols as a methodology to identify and categorize rater bias patterns; others rely on Item Response Theory and the Generalizability approach; and still others have analyzed rater bias through many-facet Rasch measurement, advocated by Linacre in 1989. The many-facet Rasch model (MFRM) is a logistic latent trait model of probabilities which calibrates the difficulty of test items and the ability of test takers independently of each other, but places them within a common frame of reference. MFRM expands the basic Rasch model by enabling researchers to add the facet of judge severity (or another facet of interest) to person ability and item difficulty and place them on the same logit scale for comparison. The computer software FACETS, based on MFRM, has proved highly efficient in rater bias analysis. The bias analysis function unique to FACETS, as Wigglesworth (1993:307) comments, provides an assessment of each rater with respect to their rating characteristics and also an indication of whether the performance of each rater with respect to each item has been consistent.

The contributions of rater bias analysis to the objectivity and fairness of the measurement of writing ability are immense. Bias analysis, according to Wigglesworth (1993:309), identifies systematic subpatterns of behavior which may arise from the interaction of a particular rater with some aspect of the rating situation. One practical application of rater bias analysis is to provide a means of effective rater training, a vital procedure that serves to ensure scoring reliability and validity in essay marking. Rater bias analysis, Schaefer (2008:466) remarks, can help researchers explore and understand the sources of rater bias, thus contributing to improvements in rater training and rating scale development. In spite of the unanimity among researchers with regard to the necessity of rater training, views diverge on the effectiveness of essay rater training. Hamp-Lyons (1990:81) admits that indications are that the situation is not that simple: the context in which training occurs, the type of training given, the extent to which training is monitored, the extent to which reading is monitored, and the feedback given to readers all play an important part in maintaining both the reliability and the validity of the scoring of essays. Lumley and McNamara (1995:57) focus on the practicality of rater training and point out that rater training can reduce, but by no means eliminate, the extent of rater variability, and that its main contribution is to reduce the random error in rater judgments.
In other words, rater training is successful in making raters more self-consistent. Though there have been scattered reports on the effectiveness of rater training, investigations of the issue have never been systematic. The effectiveness of rater training is still relatively little understood, and there have been numerous calls for expanded research in this area. Four seminal questions raised by McNamara (1996:230) are to be addressed in the evaluation of any essay rater training program:
1. What is the effect of rater training?
2. Do differences in rater harshness survive training?
3. Do rater characteristics persist over time?
4. What is the effect of background in raters?
The design and implementation of online essay marking is believed to facilitate rater training activities and thereby reinforce the effectiveness of training programs.

The present rater bias study in TEM 4 essay marking is an attempt to explore the bias patterns of TEM 4 essay raters and to reveal the sources of rater bias by scrutinizing the detailed rating data provided by the online rating sessions. The investigations are conducted from the perspective of interactions between raters and the facets that supposedly affect rater performance. A more practical issue addressed in the study is the formulation of a working model for more effective rater training. The research framework consists of two components: the first investigates raters' perceptions of and attitudes toward the newly implemented online TEM 4 essay marking system and the online training program; the second focuses on TEM 4 rater bias patterns and sources of bias. The basic rationale for the present research is to rely on both secondary research, in the form of a survey of literature pertinent to rater bias, and primary research, in the form of sampling data analysis. The empirical investigations consist of a questionnaire survey of 2009 TEM 4 essay raters and the analysis of sampling data collected from the 2009 TEM 4 online rating sessions and from an experiment conducted in November 2009.

The test used as the context for this study is Section A (Composition) of the writing component of the Test for English Majors (Grade Four), a large-scale, criterion-referenced test administered to check the implementation of China's National English Language Teaching Syllabus. Since the first test in 1990, TEM 4 has been compulsory for English language majors across China and has been administered once a year in May. The Shanghai Office of the National English Language Teaching Committee and the TEM Testing Center have been responsible for the arrangements involved in the annual marking. The writing component of TEM 4 amounts to 25% of the total score, with Section A (Composition) taking up 15% and Section B (Note-writing) the remaining 10%. The 2009 TEM 4 composition-writing task requires candidates to write a composition of about 200 words on the following topic prompt: "Tourism is a booming business in China. However, some people worry that too many tourists may bring harm to the environment, while others don't think so. What is your opinion?"

The questionnaire survey of 70 TEM 4 essay raters reveals strong professional qualifications on the part of all 2009 TEM 4 raters, who are university/college EFL instructors holding a bachelor's degree (11.4% of the raters), a master's degree (72.9%) or a doctoral degree (15.7%) in English language and literature.
The majority of the raters (94.3%) have a minimum of 10 years' EFL teaching experience in such fields as English reading and writing, English grammar, translation, linguistics, English/American literature, cultural studies, and cross-cultural communication. It is in their experience as TEM 4 essay raters and as raters in other large-scale, high-stakes writing assessments that the raters differ, with only 28.65% of them having participated in TEM 4 essay marking sessions in previous years. Responses to the survey questions indicate highly favorable perceptions of both the TEM 4 online essay marking mechanism and the online rater training program. Close to 93% of the raters agree that online essay marking is relatively easy to operate, and more than 90% find the new rating mode helpful in managing the tempo of marking. More than 85% of the respondents view online rating as effective in improving marking efficiency, clarifying possible misinterpretation and misuse of the rating scale, and minimizing measurement errors. According to a comparative study conducted in November 2009, the TEM 4 online essay marking mechanism outperformed the more conventional "paper-and-pencil" marking procedure in that online scoring facilitated the discrimination of test-takers' writing ability, the decrease in raters' overall variability in severity/leniency, the improvement of raters' internal self-consistency, and the reduction of rater bias.

The current investigation of TEM 4 rater bias patterns relies heavily on FACETS 3.58, a computer program for constructing linear measures from qualitatively ordered counts by means of many-facet Rasch analysis. The many-facet Rasch model suited to the present study is as follows:

log(P_nijk / P_nij(k-1)) = B_n - D_i - C_j - F_k

where
P_nijk is the probability of writer n being awarded a rating of k on item (category) i by judge j,
P_nij(k-1) is the probability of writer n being awarded a rating of k-1 on item (category) i by judge j,
B_n is the ability of writer n,
D_i is the difficulty of item (category) i,
C_j is the severity of judge (rater) j, and
F_k is the difficulty of the step up from category k-1 to category k.

According to Linacre (1991:8), persons, items (categories) and judges are facets, each with a number of elements of its own. For each element, FACETS provides a measure (a linear quantity), its standard error (precision) and five fit statistics. The fit statistics enable the diagnosis of aberrant observations and idiosyncratic elements, and results are presented in tables and graphs. A notable feature of FACETS is that it can also quantify discrepant interactions between elements of different facets. Once the different measures have been estimated from a data set, differential facet functioning, equivalent to differential item functioning or "item bias", can be investigated automatically: a judge's bias on one item, or an item's bias against a group of persons, can be identified and its size and statistical significance estimated.
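To make the model above concrete, the following sketch (an illustration under assumed facet measures, not the FACETS implementation) converts a writer ability, a category difficulty, a rater severity and a set of step difficulties into category probabilities and an expected rating.

```python
# Minimal sketch of the many-facet Rasch model stated above.
# B: writer ability, D: category (item) difficulty, C: rater severity,
# F: step difficulties F_1..F_K (F_0 is fixed at 0 by convention).
# All numbers below are invented for illustration.
import math

def category_probabilities(B, D, C, F):
    """Return P(rating = 0..K) under log(P_k / P_(k-1)) = B - D - C - F_k."""
    steps = [0.0] + list(F)          # prepend F_0 = 0
    cumulative, running = [], 0.0
    for f_k in steps:
        running += B - D - C - f_k   # cumulative sum of step logits up to category k
        cumulative.append(running)
    expos = [math.exp(c) for c in cumulative]
    denom = sum(expos)
    return [e / denom for e in expos]

# Example: an able writer (B = 1.2 logits), a relatively easy category (D = -0.3),
# a harsh rater (C = 0.8), and three steps for a hypothetical 0-3 sub-scale.
probs = category_probabilities(B=1.2, D=-0.3, C=0.8, F=[-0.5, 0.1, 0.6])
print("category probabilities:", [round(p, 3) for p in probs])
print("expected rating:", round(sum(k * p for k, p in enumerate(probs)), 2))
```

Raising the rater severity C lowers the expected rating in the same way that lowering the writer ability B does, which is the sense in which severity, ability and difficulty are placed on a common logit scale.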
The analysis of sampling data collected from 10 TEM 4 raters' assessments of 400 essays reveals considerable variability in relation to the candidate facet, the rater facet and the rating category facet. The marked and complex interactions between raters and these facets involve principally rater-category bias and rater-candidate bias. The rating scale used in the operational TEM 4 rating sessions contains three rating categories: Ideas & Arguments, Language Use, and Mechanics. Of all the possible rater-category interactions, six percent manifest significant bias, while fifty-three percent of the rater-candidate interactions demonstrate significant bias. The bias analysis, processed by the computer software FACETS 3.58, yields recurring patterns of bias in terms of bias magnitude (size) and bias direction.

Subpatterns connected with bias magnitude surface in that raters are found to be most biased towards the Mechanics rating category, less biased towards Language Use, and still less biased towards Ideas & Arguments. The other apparent magnitude-related pattern is that the highest percentage of significantly biased rater-candidate interactions is found among candidates with "polar" ability estimates; in other words, candidates whose ability is extremely high or low are more apt to incur significant rater bias. Examination of the bias direction uncovers an involuntary "compensation" strategy in raters' use of the rating categories. In the sample analysis, some raters who exhibit bias interactions for Ideas & Arguments and/or Language Use, and for Language Use and/or Mechanics, tend to reverse their severity pattern from one category to the other. Although the exact cause of such rating behavior still eludes the researcher of the current study, a plausible interpretation is that the rater subliminally compensates for being over-severe or over-lenient with a particular rating category. What also becomes apparent from the bias direction analysis is the tendency for raters to show more severe bias towards candidates with higher ability estimates. Attempts to identify the bias sources support the assumption that there is a certain correlation between rater bias patterns and rater background, particularly raters' teaching experience and rating experience.

According to the analysis of the sampling data collected from three online TEM 4 rater training sessions, the training program has fulfilled a substantial part of its objective in that it has succeeded in reducing the variability of raters' overall severity/leniency and raters' bias magnitude. Nevertheless, the statistics provided by the FACETS bias analysis indicate that the rater training program is still wanting in effectiveness. Despite the meticulously designed training activities, there are still manifestations of significant variability in overall rater severity/leniency, and significant rater-candidate and rater-category bias. In addition, the feedback provided to raters after the initial training session might have aggravated the prevalent mental strain of TEM 4 essay raters, the consequence of which was a slight increase in central tendency in rating.

It becomes imperative, therefore, that the TEM 4 rater training program be re-examined and that its effectiveness be re-addressed. The present research advocates a more integrated rater training approach based on detailed rater bias analysis. In the first place, both the initial training session and frequent ongoing training sessions are to be equally stressed, for it is presumably the ongoing sessions that preclude fluctuations of severity/leniency in operational rating. Secondly, a more interaction-oriented training program will give raters access to individualized feedback on the trial rating sessions, including such statistics as the group mean, the benchmark mean, rater bias size, rater bias direction, and rater bias target.
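As an illustration of how a bias size and its significance might be reported in such feedback, the sketch below standardizes the residuals between the scores one rater awarded on one category and the model-expected scores; all inputs are invented, and this is only an approximation of the kind of statistic reported in bias analysis, not the actual FACETS algorithm.

```python
# Rough sketch of a rater-by-category bias statistic: sum the residuals between
# observed and model-expected scores and standardize by the model variance.
# A |z| above roughly 2 is conventionally read as a significant bias interaction.
# Observed/expected/variance values here are hypothetical.
import numpy as np

def bias_z(observed, expected, variances):
    """Standardized residual for one rater-category pairing across essays."""
    observed, expected, variances = map(np.asarray, (observed, expected, variances))
    return (observed - expected).sum() / np.sqrt(variances.sum())

observed  = [2, 1, 1, 2, 1]                  # scores one rater actually awarded on Mechanics
expected  = [1.4, 1.5, 1.3, 1.6, 1.5]        # scores expected under the fitted model
variances = [0.24, 0.25, 0.21, 0.24, 0.25]   # model variance of each expected score
z = bias_z(observed, expected, variances)
print(f"bias z = {z:.2f}", "(significant)" if abs(z) > 2 else "(not significant)")
```

A positive z would indicate that the rater scored the category more leniently than the model predicts for those candidates; a negative z, more severely.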
Accompanying such statistics should be descriptions of and comments on the sample essays selected for the trial rating activities. A more interaction-oriented training program will also allow for ample discussion among raters, experienced or inexperienced, for the sake of more uniform interpretations of the rating scale in use and of the salient features of the sample essays. Meanwhile, the selection of sample essays for trial rating activities requires caution and expertise on the part of test rating administrators; it is suggested that sample essays be selected so that they demonstrate different features of writing as well as various characteristics of test candidates, and it is advisable that the analytic rating scale be used for diagnostic purposes. Finally, an effective training program should be based on a better understanding of rater characteristics, derived from conscious efforts to create rater profiles and from scrupulous analysis of rating behavior, especially rater bias analysis.

The Test for English Majors (Grade Four) has seen a surge in the number of both participating schools and test-takers. Only 155 universities and colleges participated in the test in 1992, but the number of participating schools had grown to 798 by 2009; in 1992 only 8,554 candidates took the test, but the number of test-takers skyrocketed to 260,000 in 2009. Originally regarded as a means of checking the implementation of the national teaching syllabus, TEM 4 is now also used in the evaluation of English programs in many universities and colleges in China. Given the ever-growing population of test-takers and the immense social impact of TEM 4, it is of paramount importance to standardize essay raters so as to ensure test fairness. In that sense, it is legitimate to view the present study as a meaningful endeavor to address the reliability and validity of online TEM 4 essay marking.
Keywords/Search Tags:online essay marking, rating variability, rater effect, interaction, rater bias, rater training