Speaking assessments commonly use multiple tasks (e.g., read aloud, individual presentation) or task types (e.g., integrated or interactive tasks). However, inadequate attention has been paid to the communication contexts defined by different tasks or task types, even though context can affect the test construct through the interaction of construct and context (Bachman, 2007; Chapelle, 1998; Fulcher, 2015). As McNamara, Hill, and May (2002) note, the ways in which rating scales are constructed and interpreted represent the de facto test construct of a speaking assessment. The rating scale is also important because of its function in mediating task- and rater-related factors in performance assessment (Schoonen, 2005). Taken together, these arguments suggest that in speaking assessments involving multiple tasks, rating scales not only reflect the intended test construct but also shape performance ratings, both in score outcomes and in rater behaviors. These rating issues determine the extent to which the intended construct is operationalized in testing practice. From a unitary perspective on validity (Bachman, 1990; Chapelle, 1999; Messick, 1995), examining the inferences drawn from test scores under different rating scales is in effect an investigation of construct validity under different rating conditions. Yet few, if any, empirical studies have addressed these issues, and most previous research on rating scales has been conducted in the context of essay scoring rather than speaking assessment.

This study therefore investigated the role that rating scales play in mediating context-related factors in oral English tests, such as task characteristics or types of oral communication. The research focused on the degree to which the mediating effects of rating scales influence test scores, facets of measurement, and rater judgments. Building on these findings, the study further explored how rating scales affect construct validity in oral proficiency assessments.

Drawing on interactionalist construct theory (Bachman, 2007; Chalhoub-Deville, 2003; Chapelle, 1998; He & Young, 1998), the study compared how raters performed when using three types of rating scale that assume different degrees of interaction between the construct to be measured and the task features of the College English Test-Spoken English Test Band 4 (CET-SET4). Focusing on the same set of core criteria, the raters scored the test takers’ performances three times: over the entire test using test-based analytic scales (analytic scoring), on individual tasks using task-based holistic scales (task-based scoring), and on non-interactive versus interactive communication using task type-based analytic scales (task type-based scoring).
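To make the three scoring designs concrete, the sketch below shows how one examinee’s ratings could be organized under each scheme. This is a minimal illustration only: the criterion labels follow the five levels named in the study, but the task labels, the data layout, and any score values are assumptions rather than the actual CET-SET4 rubrics.

```python
# Hypothetical data layout for the three scoring designs (not the CET-SET4 rubrics).

CRITERIA = ["phonological", "grammatical", "discoursal", "ideational", "pragmatic"]
TASKS = ["task_1", "task_2", "task_3"]            # assumed task labels
TASK_TYPES = ["non_interactive", "interactive"]   # as distinguished in the study

# Analytic scoring: one score per core criterion over the entire test.
analytic = {criterion: None for criterion in CRITERIA}

# Task-based scoring: one holistic score per task.
task_based = {task: None for task in TASKS}

# Task type-based scoring: one score per criterion within each task type.
task_type_based = {(task_type, criterion): None
                   for task_type in TASK_TYPES
                   for criterion in CRITERIA}

print(len(analytic), len(task_based), len(task_type_based))  # 5, 3, 10 scores per examinee
```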
Under a unitary concept of validity, test validation is the process of justifying the inferences made from test scores. Following this understanding, the study adopted an argument-based approach to validating rating practices. One overarching question and three sub-questions were raised, pertaining to the inferences of “evaluation” (i.e., the assessment of examinee performance with reference to the rating scales), “generalization” (i.e., the examination of score consistency across parallel sets of tasks and raters), and “explanation” (i.e., the investigation of the link between test scores and the intended construct).

Overarching question: In what ways and to what extent can rating scales affect the construct validity of the CET-SET4?

RQ1. How do rating scales affect the assessment of examinee performance? (evaluation)
1a. How do scores derived from different rating scales correlate with each other?
1b. How do scores derived from different rating scales differ from each other?
1c. How do rating scales affect the spread of examinee ability?
1d. How do rating scales affect the functioning of rating criteria and subscales?
1e. How do rating scales affect rater bias toward rating criteria and examinees?
1f. How do rating scales affect raters’ confidence in making decisions?

RQ2. How do rating scales affect score consistency across tasks and raters? (generalization)
2a. How do rating scales affect rater consistency and severity?
2b. How do rating scales affect the relative contributions of measurement facets to score variability?
2c. How do rating scales affect score dependability in rating practice?

RQ3. How do rating scales affect score meaning and interpretation? (explanation)
3a. How do raters perceive the interaction between the construct to be measured and the context of oral communication?
3b. How do the interactions between rating scale, performance context, and test taker performance affect raters’ rating processes?

The study used a multistage evaluation design and a mixed-methods approach (Creswell, 2015; Creswell & Plano Clark, 2011) to investigate these effects. Stage 1 (theory conceptualization) began with a review of the relevant literature and then defined the speaking construct from an interactionalist perspective (Chapelle, 1998). This definition was subsequently applied to the development of rating criteria for the three types of rating scale. This stage also laid the theoretical foundation for developing the analytic, task-based, and task type-based scoring scales, which were aligned with the minimal (Chapelle, 1998), moderate (Chalhoub-Deville, 2003), and strong (He & Young, 1998) claims of interactionalist construct theory, respectively. In addition, the research framework was constructed by integrating the research questions into an argument-based framework for validating rating practices (Knoch & Chapelle, 2017), which helped to relate rating issues to the task of validating the test construct.

In stage 2 (instrument development), the rating criteria and rubrics for the three types of rating scale were developed on the basis of these theoretical underpinnings and the rating scales used by major oral English tests. For all three types of rating scale, learner factors at the phonological, grammatical, discoursal, ideational (content), and pragmatic levels served as the core criteria. Guideline questions for the semi-structured interviews were also drafted, with a specific focus on the raters’ experiences and processes in using the different rating scales.

In stages 3-4 (pilot study of research instruments and methods), these instruments were piloted. Using a fully-crossed design, the small-scale pilot study triangulated the quantitative and qualitative data, and a tentative discussion was conducted of the effects of rating scales on the construct validity of speaking assessments. In stages 5-6 (instrument revision and the main study), the rating scales and interview guidelines were revised on the basis of the pilot study to better meet the research objectives, and the questionnaires for the main study were developed.

In the main study, six raters with extensive experience in rating speaking performance rated 166 speech samples of responses to the CET-SET4 using the three types of rating scale. The raters were divided into two groups, with a fully-crossed, counter-balanced design used within each group (one plausible counter-balancing scheme is sketched below). Among the 166 speech samples, 34 were selected as anchor samples. After each rating session, questionnaire surveys and semi-structured interviews were conducted to elicit the raters’ feedback on their rating experiences and foci when using the three types of rating scale.
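The study does not spell out how the counter-balancing was implemented. The following is a minimal sketch of one plausible scheme, assuming that what is counter-balanced is the order in which the three scale types are applied across rating sessions; the group compositions, rater labels, and rotation are hypothetical.

```python
SCALE_TYPES = ["analytic", "task_based", "task_type_based"]

# Hypothetical composition of the two rater groups (six raters in total,
# as in the main study; the actual assignments are not reported).
groups = {"group_1": ["rater_1", "rater_2", "rater_3"],
          "group_2": ["rater_4", "rater_5", "rater_6"]}

for group, raters in groups.items():
    for i, rater in enumerate(raters):
        # Latin-square-style rotation: each rater in a group starts with a
        # different scale type, so practice and fatigue effects are not
        # confounded with any single scale type.
        order = SCALE_TYPES[i:] + SCALE_TYPES[:i]
        print(f"{group} {rater}: {' -> '.join(order)}")
```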
In stages 7-8 (data analyses and discussion), the rating data were analyzed using statistical techniques based on classical test theory (e.g., descriptive analysis, internal correlation analysis, paired-samples correlation analysis, and paired-samples t-tests) and many-facet Rasch measurement (MFRM) as a means of answering RQ1. Generalizability (G) theory and MFRM were applied to the test scores to inform the discussion of the issues raised by RQ2. On the qualitative side, the interview data on the raters’ rating processes were analyzed with reference to an interactionalist analytical framework for examining the process of rating speaking performance (adapted from Cumming et al., 2002). The qualitative findings from the questionnaire surveys and interviews were mainly used in the discussions concerning RQ3 and the issue of rater confidence in RQ1.
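The score-level comparisons based on classical test theory can be illustrated with a short sketch. The data below are synthetic and the variable names hypothetical; the study’s actual analyses were run on the ratings of the 166 speech samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-examinee scores under two of the three scale types
# (stand-ins for the real ratings of the 166 CET-SET4 samples).
analytic_scores = rng.normal(10.0, 2.0, size=166)
task_type_scores = analytic_scores + rng.normal(0.5, 1.0, size=166)

# Paired-samples correlation: do the two scales rank examinees similarly?
r, p_r = stats.pearsonr(analytic_scores, task_type_scores)

# Paired-samples t-test: do the two scales differ in mean score level?
t, p_t = stats.ttest_rel(analytic_scores, task_type_scores)

print(f"correlation r = {r:.2f} (p = {p_r:.3f})")
print(f"paired t = {t:.2f} (p = {p_t:.3f})")
```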
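For reference, the measurement models behind the MFRM and G-theory analyses can be stated compactly. The formulations below are standard ones; the facet structure actually estimated in the study (e.g., whether tasks or criteria enter as the third facet) may differ. A common MFRM specification with examinee, rater, and criterion (or task) facets is

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_i - D_j - F_k,$$

where $B_n$ is the ability of examinee $n$, $C_i$ the severity of rater $i$, $D_j$ the difficulty of criterion (or task) $j$, and $F_k$ the difficulty of scale step $k$ relative to step $k-1$. For a fully crossed person × rater × criterion design, the G-theory dependability index is

$$\Phi = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_r + \sigma^2_{pr}}{n_r} + \dfrac{\sigma^2_c + \sigma^2_{pc}}{n_c} + \dfrac{\sigma^2_{rc} + \sigma^2_{prc,e}}{n_r n_c}},$$

where $\sigma^2_p$ is the person variance component, the remaining terms are the rater, criterion, and interaction components, and $n_r$ and $n_c$ are the numbers of raters and criteria assumed in the decision study (e.g., single versus double rating corresponds to $n_r = 1$ versus $n_r = 2$).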
For RQ1, the data analyses revealed various degrees of score correlation and difference between the three types of rating scale. A significant divergence was identified between the test scores generated by task type-based scoring and those generated by analytic or task-based scoring. Task-based scoring showed the weakest halo effect. The widest spread of examinee ability was found under the measurement system for task type-based scoring, whereas task-based scoring was the least effective method for discriminating among levels of oral English proficiency. The scale steps of the three types of rating scale were mainly used as expected. Analyses of the interactions between raters, examinees, and rating criteria (tasks) revealed that the contribution of rater bias to score variation deserves close attention. Specifically, for all three types of rating scale, the raters were substantially more consistent in relation to the examinees than in relation to the rating criteria (or tasks). The raters differed most in rating severity across the various criteria of analytic scoring, and they showed the most substantial tendencies toward bias in relation to examinee ability when using task type-based scoring scales; for example, they tended to grow harsher or more lenient as examinee ability increased. Regarding rater confidence, the raters were more confident in making decisions when using analytic and task-based scoring scales. Nonetheless, a higher level of rater confidence did not necessarily imply higher score reliability in rating practice.

For RQ2, the raters scored most reliably and consistently when using test-based analytic scales, and they showed the most variation with task type-based scoring. In addition to the main effect of examinee ability, the rater-by-examinee-by-criterion interaction (together with other undifferentiated errors) and the rater-by-examinee interaction also accounted for large percentages of the total score variance for all three types of rating scale. Regarding score generalizability in the context of the CET-SET4, the raters assigned scores most consistently when using test-based analytic scales with double ratings; score generalizability was most problematic for task-based scoring with a single rating.

For RQ3, the raters tended to overlook the discoursal-level criteria for all three types of rating scale. The rating criteria at the phonological, ideational (content), and pragmatic levels were found to be most relevant to the context of performance. In particular, the focus of the raters’ rating practices varied with the rating scale in use: they applied a “self-monitoring focus” on their own rating behaviors with analytic scoring scales, an “ideational focus” on the test takers’ topic development with task-based scoring scales, and a “communicative focus” on the interaction between test takers in interactive communication with task type-based scoring scales.

These findings helped to reveal the main factors that can affect or even jeopardize test construct validity. For analytic scoring, the sources of such issues included the halo effect and the raters’ decisions on macrostrategies for rating. For task-based scoring, one major source of potential construct jeopardy was the repeated score penalty on core aspects of the speaking construct (e.g., phonological ability) that are essential to performance on all tasks; a further factor was the weighting of scores on the ideational-level criteria. For task type-based scoring, the sources included the co-construction of speaking ability in interactive communication and repeated score penalties on core aspects of the speaking construct in both non-interactive and interactive communication tasks.

Three major limitations were identified. First, because the CET-SET4 is primarily designed to assess whether test takers meet the basic requirements of the college English curriculum in China, the study lacked speech samples from advanced English learners in China. Second, in terms of sources of qualitative data, the study used only questionnaire surveys and semi-structured interviews, owing to constraints on time and budget; future research may include more proficient English speakers and employ verbal protocol analyses to glean richer information on raters’ rating practices. Finally, this study is essentially exploratory, and the scale of the empirical work is limited. Future studies should employ larger numbers of participants and a wider variety of techniques to generate more convincing results of greater significance to both testing theory and practice.

In summary, this study builds on interactionalist construct theory to conduct pioneering research on the rating of speaking performance. It provides new insights into the definition and operationalization of the speaking construct with controlled interactions between learner factors and contextual variables. By using an argument-based approach to validating rating processes, the study also sets an example for relating rating issues to test construct validation.