After the State Council issued the "Implementation Opinions on Deepening the Reform of the Examination and Enrollment System" on September 3, 2014, colleges and universities across China began to implement college English placement testing and placement teaching on a wide scale. The placement test is the main basis for placement teaching, yet the placement tests of different institutions are developed by different organizers: there is no common test development monitoring mechanism across these local tests, no unified test specification, and no equating treatment between test items. Because the test syllabuses and rating scales they use differ, their scores may not be comparable. In April 2018, the Ministry of Education announced a unified national English language standard, China's Standards of English Language Ability (CSE). A unified language scale facilitates the development of rating scales and the interpretation of scores (Jin et al., 2017; Lin, 2015; Jianda Liu, 2015). The CSE is China's first set of English language ability standards for English learners.

Researchers have conducted many studies on the CSE, involving the relationship between the CSE and teaching, the linking of language tests to the CSE, validation frameworks for linking language tests to language standards, and the application of the CSE to formative assessment. Among the studies on the CSE-Speaking Scales (CSE-SS), some discuss the principles and methods of constructing the scales, some discuss oral proficiency descriptors, and some focus on linking speaking tests to the CSE-SS. Specifically, the speaking tests linked to the CSE-SS include international test programs such as TOEFL iBT, IELTS, and Aptis; independently developed nationwide tests such as the speaking tests of College English Test Band 4 and Band 6 (CET-SET4 and CET-SET6); and school-based speaking teaching and testing systems whose feasibility and effectiveness have been justified. Evidently, research on applying the CSE-SS to the improvement and refinement of rating instruments is still insufficient, and there has been no empirical research applying the CSE-SS to the development of a speaking rating scale for a school-based English placement test. Moreover, the CSE does not currently provide specific descriptions of contextual factors or test tasks, so more research on contextual factors is needed to verify its applicability. Interactionalist construct theory defines the construct in terms of the interaction between language competence and context and can thus provide a comparatively full picture of communicative ability (Chapelle, 1998). The argument-based validation approach overcomes the limitations of traditional validation methods, which cannot examine validity systematically or multi-dimensionally, and provides a set of operable validation procedures (Knoch & Chapelle, 2018).

To verify the applicability of the CSE to China's school-based tests, to make the scores of Hunan University's independently developed College English Placement Test (CEPT) more comparable, and to make the interpretation of CEPT scores easier to understand and accept, this study employed a mixed-methods multistage evaluation design (Creswell & Clark, 2011). Based mainly on Chapelle's (1998) interactionalist construct theory and with reference to the CSE-SS, it developed three new CEPT speaking rating scales: a task-based holistic rating scale, a task-based analytic rating scale, and a task type-based analytic rating scale.
Following the argument-based validation approach, the study collected quantitative and qualitative data to validate the three new speaking rating scales in rating practice and to explore how the score meaning and interpretation of the textual CEPT speaking score report and the three newly developed verbal reports affect feedback on teaching and learning.

The research questions are as follows. Research Question 1: Which parameters of the existing (old) CEPT speaking rating scale are consistent with those proposed in the CSE-SS? Which parameters proposed in the CSE-SS were underrepresented or missing in the old CEPT speaking rating scale? How do the newly developed (new) speaking rating scales in this study strengthen or incorporate these parameters? Research Question 2: The new speaking rating scales emphasize the constructs in the CSE; how valid are they in rating practice, and how can their validity be established? Research Question 3: What impact do the score meaning and interpretation produced by the new speaking rating scales have on teaching and learning feedback? Specifically, what is the backwash effect of the textual report and the verbal reports on teaching and learning?

To explore Research Question 1, the study first compared the language standard parameters of the CSE, the Canadian Language Benchmarks (CLB), and the old CEPT speaking rating scale, tabulating the levels and descriptors of the CSE speaking standards, the CLB speaking benchmarks, and the old scale for comparative analysis. Because the old CEPT speaking rating scale was developed with reference to the CLB speaking benchmarks, the study needed to identify the similar and different parameters between the CSE and the CLB and, at the same time, any CSE parameters missing from the old scale, so as to establish which CSE parameters needed strengthening or inclusion in the new CEPT speaking rating scales, and to modify or develop the scales accordingly.

To explore Research Question 2, the study drew on the argument-based validation framework proposed by Knoch and Chapelle (2018), focusing on the evaluation, generalization, and explanation inferences; it proposed the relevant assumptions and collected evidence to back them, so as to validate the rating scales in actual rating practice. To validate the three rating scales developed with reference to the CSE speaking standards, two rating experiments were designed and administered: a pilot test and a main experiment. (1) The pilot test tried out the newly developed rating scales and the analysis methods in a small-scale fully crossed experiment (5 raters, 30 examinees), combining quantitative and qualitative data to examine the rating validity of the scales. (2) The main experiment refined the newly developed rating scales and completed their validation. Based on the pilot results, the rating scales and the interview guide for raters were revised, and a rater-oriented questionnaire was developed. In the main experiment, 6 raters with rich experience in rating speaking were invited to score 74 speech samples using the three rating scales. The samples were drawn from the CEPT speaking test and stored in a newly developed network scoring system. The 6 raters were divided into two groups, and data were collected using a fully crossed, counter-balanced experimental design within each group (Linacre, 1994). Among the 74 speech samples, 26 were anchor samples. After rating each examinee, the rater immediately produced a verbal report on the examinee's performance, and after each round of scoring the raters were interviewed about their experience of using each rating scale (a schematic sketch of this rating design follows below).
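To make the main-experiment design concrete, the following is a minimal illustrative sketch, not the study's actual code or data, of how such a rating plan can be laid out: two groups of three raters, fully crossed within each group, 26 anchor samples rated by all raters to link the groups, and the order of the three scales counter-balanced across rounds. The sample IDs, the even split of the 48 non-anchor samples, and the assignment of scale orders to raters are all assumptions made for illustration.

```python
# Illustrative sketch (not the study's actual code) of the main-experiment
# rating design: 6 raters in two groups, fully crossed within each group,
# 26 anchor samples rated by every rater to link the two groups, and the
# order of the three rating scales counter-balanced across scoring rounds.
from itertools import permutations

RATERS = ["R1", "R2", "R3", "R4", "R5", "R6"]
GROUPS = {"A": RATERS[:3], "B": RATERS[3:]}
SCALES = ["task_holistic", "task_analytic", "tasktype_analytic"]

samples = [f"S{i:02d}" for i in range(1, 75)]   # 74 speech samples
anchors = samples[:26]                          # 26 anchor samples (shared)
rest = samples[26:]
group_samples = {"A": anchors + rest[:24],      # assumed even split of the
                 "B": anchors + rest[24:]}      # 48 non-anchor samples

# Counter-balance the scale order: each rater within a group follows a
# different permutation of the three scales across the three rounds.
orders = list(permutations(SCALES))[:3]

assignments = []                                # (rater, round, scale, sample)
for group, raters in GROUPS.items():
    for rater, order in zip(raters, orders):
        for rnd, scale in enumerate(order, start=1):
            for sample in group_samples[group]: # fully crossed within group
                assignments.append((rater, rnd, scale, sample))

print(len(assignments))   # 6 raters x 3 scales x 50 samples = 900 ratings
```

The anchor samples are what allow rater severity to be placed on a common metric across the two groups in the subsequent many-facet Rasch (FACETS) analysis.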
To explore Research Question 3, the study designed two questionnaires, one for teachers and one for students, and surveyed their perceptions of four different performance reports. The respondents were the 74 examinees in the main experiment and the teachers of their related courses. Of the four score reports, one was a textual score report, the official score report of the school-based CEPT speaking test (in PDF format, see Appendix V); the other three were verbal score reports (audio plus Excel transcripts) taken from the raters' verbal reports on the examinees' performance in the main experiment. By comparing the amount of information the different score reports provide, as well as their impact on teachers' teaching and students' learning, the study could establish the impact of the score meaning and interpretation generated by the CSE-referenced speaking rating scales on teaching and learning feedback.

The findings on Research Question 1 are as follows. The first-level and second-level parameters of the old CEPT speaking rating scale and the CSE-SS are the same, while the third-level parameters differ somewhat. First, for "oral communication activities", the old CEPT speaking rating scale has fewer third-level parameters than the CSE-SS. Second, for "oral communication strategies", most of the third-level parameters of the CSE-SS are missing from the old scale. Third, for "oral language knowledge", the second-level parameters of the CSE-SS and the old scale are the same, comprising "linguistic knowledge", "discourse knowledge", and "pragmatic knowledge", and the third-level parameters of the two scales are similar, including "accuracy", "fluency", "coherence", "flexibility", and "appropriacy".

The findings on Research Question 2 are as follows. (1) Evaluation inference. The three types of rating scales correlated with and differed from each other to various degrees. In task-based holistic scoring, the internal correlations among the dimension scores were the lowest, indicating that task-based holistic scoring showed the weakest halo effect (the sketch below illustrates this kind of correlation check).
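As a hedged illustration of this halo-effect check, with simulated scores rather than the study's data, one can compute the average off-diagonal correlation among examinees' dimension scores; a scale whose dimensions correlate less strongly shows a weaker halo effect.

```python
# Hedged illustration (simulated scores, not the study's data) of the
# halo-effect check: the lower the average correlation among a scale's
# dimension scores, the weaker the suspected halo effect.
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_dims = 74, 5                   # e.g. phonology, grammar,
ability = rng.normal(3, 1, (n_examinees, 1))  # ideation, discourse, pragmatics
dims = ability + rng.normal(0, 0.8, (n_examinees, n_dims))

corr = np.corrcoef(dims, rowvar=False)        # dimension-by-dimension correlations
off_diag = corr[~np.eye(n_dims, dtype=bool)]  # drop the diagonal of 1s
print(f"mean inter-dimension r = {off_diag.mean():.2f}")
```

Comparing this mean correlation across the three scales, each computed on the same examinees, is one simple way to rank them by halo effect, as the evaluation inference does here.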
The scores obtained from the task type-based analytic rating scale differed significantly from those obtained with the task-based holistic and the task-based analytic rating scales. Task type-based analytic scoring displayed the widest spread of examinee ability measures, followed by task-based analytic scoring and then task-based holistic scoring, indicating that task type-based analytic scoring discriminated best among examinees. The levels, dimensions, and subskill settings of the three rating scales were generally in line with expectations; although there were several unexpected uses of scale steps, the raters' use of the scale steps of the three scales was generally as expected. The bias analysis showed that the raters' rating bias affected score changes. Specifically, with all three rating scales, raters were more consistent in severity when evaluating examinees' ability than when applying the rating criteria. In evaluating examinees' ability, raters showed the most obvious and systematic bias in task type-based analytic scoring: as an examinee's ability estimate increased, the rating bias became more severe or more lenient. In applying the rating criteria, rater severity varied significantly and the bias was obvious; in particular, raters' severity differed most when using the criteria of the task-based holistic rating scale. As for scoring confidence, raters were most confident when using the task-based analytic rating scale, followed by the task-based holistic scale, and least confident with the task type-based analytic scale; however, raters' scoring confidence did not always reflect their actual scoring accuracy.

(2) Generalization inference. Raters achieved the highest scoring reliability and consistency with the task-based holistic rating scale, and showed the largest discrepancies in severity with the task type-based analytic rating scale. Apart from the examinee facet, the factors with the greatest impact on score variability for all three rating scales were the rater-examinee-criterion interaction (confounded with other undifferentiated error) and the rater-examinee interaction. As for score generalization in the criterion-referenced College English Placement Test, whether in single, double, or triple rating mode, dependability was highest with task-based holistic scoring and lowest with task type-based analytic scoring. Task-based holistic scoring enjoyed very high dependability in single or double rating mode, task-based analytic scoring enjoyed comparatively high dependability in double rating mode, and task type-based analytic scoring enjoyed comparatively high dependability in triple rating mode (the sketch below shows how dependability of this kind rises with the number of raters).
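To illustrate the logic behind these dependability comparisons, the following sketch computes the G-theory dependability (phi) coefficient for one-, two-, and three-rater designs in a simple persons-by-raters study. The variance components are hypothetical placeholders, not the study's GENOVA estimates, and the study's actual design also crossed criteria; the point is only that absolute error variance shrinks in proportion to the number of raters, so dependability rises with each added rating.

```python
# Illustrative G-theory computation with hypothetical variance components
# (the study's own estimates came from GENOVA, and its design also crossed
# rating criteria; this persons-x-raters sketch keeps only the rater facet).
var_person = 0.60     # universe-score variance (true examinee differences)
var_rater = 0.05      # rater main effect (overall severity differences)
var_pr_resid = 0.20   # person-x-rater interaction confounded with residual

for n_raters in (1, 2, 3):                    # single, double, triple rating
    abs_error = (var_rater + var_pr_resid) / n_raters
    phi = var_person / (var_person + abs_error)
    print(f"{n_raters} rater(s): phi = {phi:.3f}")
```

A decision study of this kind is what lets a program trade off rating cost against dependability, for example accepting single rating for a holistic scale that already reaches high phi while requiring triple rating for a task type-based analytic scale.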
(3) Explanation inference. With all three rating scales, raters attended to the five core criteria of phonology, grammar, ideation, discourse, and pragmatics. The rating criteria at the phonological and pragmatic levels were the most closely tied to the communicative context (the task), which supported the reasonableness of using context-specific rating scales. When scoring with a context-specific rating scale, raters focused on the various dimensions or subskills of the scale's underlying language ability model, and different rating scales led raters to different scoring focuses.

The findings on Research Question 3 are as follows. The speaking rating scales developed in this study emphasize the constructs in the CSE, and they reinforce and incorporate more of the parameters proposed by the CSE speaking standards. Results showed that (1) the newly developed speaking scales provided more information than the old one: the task-based analytic rating scale provided the most information, followed by the task-based holistic rating scale, then the task type-based analytic rating scale, and finally the old CEPT speaking rating scale; and (2) the scores and interpretations produced by the new scales had a positive impact on teachers' teaching and students' learning, with teachers and students reporting that they would teach or study in a targeted way with reference to the score report.

The significance of this research is as follows. (1) The newly developed rating scales demonstrated the applicability of the CSE to the school-based CEPT, enabling the CEPT score report to provide a more meaningful interpretation of the placement test and clearer guidance for placement teaching. (2) The study compared the parameters of the CSE and the CLB, established which CSE parameters needed reinforcing or including in the new CEPT speaking rating scales, and identified them in the new scales; it is the first study to apply the CSE to school-based placement tests in this way. (3) It is the first empirical study to combine the CSE with interactionalist construct theory to develop speaking rating scales: by adjusting the learner factors and contextual factors of different rating scales, three different types of rating scales were developed, which expands the contextual validity of the CSE and opens up new ideas for its application. (4) The study explores the theory, methods, and route of validation, provides a theoretical and methodological reference for improving the validation of subjective tests, and makes speaking assessment more scientific. (5) The research combined three language testing theories, namely Classical Test Theory (CTT), Generalizability Theory (GT), and Item Response Theory (IRT), with the grounded theory methods of social science, and used a variety of software, including SPSS, GENOVA, FACETS, NVivo, and CiteSpace, to analyze the data required for validation, undertaking substantial exploratory work in integrating quantitative and qualitative research in the humanities and social sciences. (6) The study provides a reference basis for modifying and optimizing the independently developed school-based CEPT in line with the CSE parameters.

As for future work, the "extrapolation", "decision", and "consequence" inferences of the argument-based validation framework, which have so far received less attention, deserve more extensive, in-depth, continuous, and solid empirical research. In improving the placement test system, pair-work or even group-work tasks could be added as the final task of the test to measure examinees' speaking ability more comprehensively.