Research | Aligning the GSE Young Learners to the CSE
By 21ST | Source: Pearson | Date: 2020-09-07

Executive Summary

Since its publication in 2001, the Common European Framework of Reference for Languages (CEFR) has spread beyond the borders of Europe to inform language teaching and assessment around the world.

Pearson expanded the CEFR scale in both breadth and depth to create the Global Scale of English (GSE) Learning Objectives, designed to provide teachers and learners with more granular, progressive indications of language proficiency from below A1 to language mastery.

In China, there has been a significant drive over the last five years to develop a contextualized Chinese Scale of English proficiency (CSE) in order to streamline the teaching, learning and policy of English as a second language across the primary, secondary and tertiary education sectors. The intention is to create a unified national scale rather than continue with differing regional approaches.

This paper reports a research project to link the GSE and CSE Learning Objective standards using Comparative Judgement (CJ) methodology. CJ is increasingly being used to set, maintain and compare standards, and it avoids many of the pitfalls of traditional alignment methods, which are often beset with heuristic problems because they require raters to make absolute judgements.

An alignment of GSE Young Learners Learning Objectives to CSE levels 1-4 is demonstrated through this research.

Background to GSE Young Learners

The Global Scale of English (GSE) is a standardized, granular English language proficiency scale that runs from 10 to 90 and is psychometrically aligned to the Common European Framework of Reference for Languages (CEFR). The GSE serves as a standard against which English language courses and assessments worldwide can be benchmarked, offering a truly global and shared understanding of proficiency levels.

The GSE Learning Objectives are mapped to the Global Scale of English and describe what a learner can do at different levels of proficiency on the scale. Using the Global Scale of English, teachers can match a student to the right course materials to suit their exact level and learning goals, and track their progression granularly.

The project to develop GSE Learning Objectives builds upon the research carried out by Brian North (North, 2000) and the Council of Europe in creating the Common European Framework of Reference for Languages (Council of Europe, 2001). This research targeted adult and young adult learners and provides a solid framework for extending the set of learning objectives to include additional learning objectives (Can Do statements) specific to particular adult audiences. As part of the GSE project, Pearson has developed additional GSE Learning Objectives for both Academic and Professional English.

The CEFR, however, was never created with the youngest learners in mind, although many have tried to adapt it with varying degrees of success. This is why Pearson English have carried out research, following the model of the CEFR, to create a similar proficiency framework that specifically targets learners aged 6 to 14.

The Global Scale of English itself has been aligned to the CEFR following the psychometric principles and procedures used in developing the CEFR – and all new GSE Learning Objectives for Young Learners are given a GSE value on this same scale. In this way, learners can chart their proficiency and progress across ages and stages of development – from primary school to higher education and learning in the workplace.

In developing the GSE Learning Objectives for Young Learners, Pearson has created learning objectives that support a granular definition of language proficiency – enabling teachers to establish clear learning goals for their students, parents to understand more clearly what their children are learning, and perhaps most importantly, ensuring that learners are aware of the small increases being made in their proficiency. All students – and especially young learners – are much more motivated when they can see that progress is being made.

Background to China’s Standards of English Language Ability: the CSE

In China, there has been a significant drive over the last five years to develop a contextualized Chinese Scale of English proficiency (CSE) in order to streamline the teaching, learning and policy of English as a second language across the primary, secondary and tertiary education sectors. The intention was to create a unified national scale rather than continue with differing regional approaches.

In 2014, the State Council of China issued a document entitled Deepening the Reforms on the Educational Exams and the Enrolment Systems. This document described the need to develop a foreign language assessment framework to improve the quality of language tests, enhance the communication between teaching, learning and assessment, and raise the overall effectiveness and efficiency of foreign language education in China. To enact these aims, the National Education Examinations Authority (NEEA), endorsed by the Ministry of Education, China, initiated a nationwide project to develop an English language proficiency scale, known as China’s Standards of English Language Ability (CSE). The aims were to:
  1. define and describe the English proficiency of English learners in China;
  2. provide references and guidelines for English learning, teaching and assessment;
  3. enrich the existing body of language proficiency scales for alignments on a global basis (Liu, 2015).
The overarching theoretical framework and the CSE standards were published in their translated forms in December 2018.

The theoretical framework sets out an ambitious model of proficiency based on a use-oriented approach, as proposed by the Communicative Language Ability (CLA) model developed by Bachman and Palmer (1996). A key principle is to reflect the use of English in the context of China, which is found mainly in educational settings. The theoretical grounding of the CLA model therefore differs in some respects from the action-oriented model of the CEFR, which was developed in a more fluid European environment of multiple language usage. The GSE is also underpinned by an action-oriented approach to language, so there are some differences between the overarching constructs of the CSE and the GSE. That said, there are far more similarities between the GSE and the CSE than between either scale and the CEFR. Both the GSE and the CSE are intended to provide teachers with a balanced and pragmatic guide to the development of language learning from the primary stages of education to post-degree professional settings.

This paper is focused on the alignment of the CSE and GSE Learning standards for young learners, and therefore concentrates on CSE levels 1-4.

Pearson’s research on the alignment of the CSE and adult GSE Learning Objectives (CSE levels 4-9) will be reported separately.

Purpose of alignment

There are various forms of alignment that can be carried out in the area of language learning and assessment. In the main, these include:
  • Alignment of different content standards
  • Alignment of content standards to the performance standards of tests
  • Alignment of the performance standards across tests
In the context of the CSE and the GSE, the only common features at this time are the content standards. A number of assessments are linked to the GSE (and CEFR) scales; the CSE, however, has not yet been realized in terms of corresponding assessments.

Therefore, this alignment study focuses on the underpinning learning progression through the two frameworks of the CSE and the GSE. As discussed, there are some differences between the two frameworks, in particular the inclusion of interpretation and translation strands in the CSE. However, the remaining strands reflect very similar language constructs. This alignment study is therefore important, as it develops our understanding of the relationship between the CLA-based model of the CSE and the expansive, comprehensive action-oriented model of the GSE. The alignment will be of use to teachers, schools and policy makers who are interested in the relationship between Pearson’s Young Learners GSE scale and China’s Standards of English Language Ability.

Overview of the Research

Traditional alignment studies require judges to make absolute judgements. A number of such methodologies are suggested in the Council of Europe Manual (2009). While this document is extremely useful in providing a range of methods of alignment between content and performance standards, there continue to be two main issues with using these methods for objective measurement alignment studies:
  1. The very variety of alignment methods, processes and procedures results in a lack of equivalence in terms of alignment. In other words, depending on the method used, alignment outcomes may be different.
  2. Nearly all alignment methods rely on human judges making absolute judgements when rating content standards or test items. Such judgements are characterized by uncertainty and are known to be subject to bias and heuristics (e.g., Eckes, 2012; Kahneman, Slovic, & Tversky, 1982).
It is well documented that while absolute judgments may show a reasonable degree of reliability and agreement, they may also conceal discrepancies in how the underlying criteria are applied (e.g., Smith, 2000; or Harsch & Martin, 2013).

To avoid the issues above, we chose to use a Comparative Judgement methodology. Comparative Judgement (CJ) is not a particularly new process, but its application in an online environment has transformed its efficiency and efficacy. CJ is increasingly being used as an alternative to rubric-based marking, and in particular for standard setting and for comparative analyses of different content standards or tests. CJ has been used to a limited extent in the realm of English language assessment (Council of Europe, 2015) and framework alignment (Jones, 2009).

Comparative Judgement method

Overview of the method

Comparative Judgement (CJ) methodology is based on the idea that people are better suited to making relative judgements than to making absolute judgements (Thurstone, 1927).

Traditional rating exercises rely on absolute judgements. That is, judges assign each learning objective descriptor a rating on a scale such as the GSE or CSE. It is very possible for this process to be successful, but it takes a great deal of training to ensure that all judges interpret and apply the scale in the same way. It also takes time for each judge to consider each descriptor and be certain of its exact place on the rating scale. 

In contrast, CJ exercises are based on relative judgements. Judges are presented with a pair of descriptors and asked to identify which one describes a more difficult skill. These judgements are quick and intuitive. Because judges base their decisions on their experience as language education experts rather than on any specific framework, descriptors from different frameworks can be compared directly.

The judgements are analysed statistically using the Bradley-Terry model (Bradley & Terry, 1952), which is mathematically equivalent to the Rasch model (Rasch, 1960/1980), to establish a scale of descriptor difficulty. This scale describes the difficulty of all descriptors in the study, including descriptors from different frameworks. One of the primary benefits of the common scale produced through CJ is that neither framework is “centered” during the study. Judges are not asked to consider one framework through the lens of the other. Rather, judges produce an independent scale that describes both frameworks together and can be used to express the relationship between them.
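The study’s actual analysis used the extended Bradley-Terry model in R’s sirt package (noted later in this paper). Purely as an illustration of the underlying idea, the following is a minimal Python sketch that fits a basic Bradley-Terry model to pairwise “more difficult” judgements using the classic Zermelo fixed-point update; all data and names here are hypothetical, not the study’s.

```python
import math

def fit_bradley_terry(items, comparisons, iters=200):
    """Estimate Bradley-Terry log-difficulties from pairwise judgements.

    comparisons: (winner, loser) pairs, where the winner is the descriptor
    judged MORE difficult. Assumes every item wins and loses at least once.
    Returns log-strengths centred so the geometric-mean strength is 1,
    since a CJ scale has an arbitrary origin. Illustrative sketch only.
    """
    wins = {i: 0 for i in items}
    pair_n = {}  # number of times each unordered pair was compared
    for w, l in comparisons:
        wins[w] += 1
        key = frozenset((w, l))
        pair_n[key] = pair_n.get(key, 0) + 1

    # Zermelo fixed-point update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n = pair_n.get(frozenset((i, j)), 0)
                if n:
                    denom += n / (p[i] + p[j])
            new[i] = wins[i] / denom if denom else p[i]
        gm = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        p = {i: v / gm for i, v in new.items()}  # re-centre each pass
    return {i: math.log(v) for i, v in p.items()}
```

Descriptors that consistently win “more difficult” comparisons end up higher on the resulting scale regardless of which framework they come from, which is what allows two frameworks to be placed on one common scale.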

Design of the study

In this study, we compared the difficulty of descriptors from China’s Standards of English Language Ability (CSE) and the Global Scale of English for Young Learners (GSE).

The 23 language experts selected to act as judges in this study were based in China, had familiarity with the CSE, and had between 5 and 15 years of experience teaching English in the young learner context. No standardization activity was required, as judges were not intended to reference any specific framework when making comparisons. Judges were provided with a detailed set of instructions explaining the methodology and an interactive demonstration to familiarize them with the web-based platform used to run the exercise, No More Marking (Wheadon, 2019).

Judges were given this simple set of instructions to guide their judgements:

‘Two descriptors will appear on your screen and you will decide which one describes a more difficult skill.

 Take enough time to read the descriptors and absorb their meaning, and then make a choice based on your expert opinion and experience. There is no need to reference any documents or external materials for this task. Some of the descriptors come from alternative international standards. You may find the wording and grammar different in style - please judge them according to the essence of what is being described.’

The sample included 1,554 descriptors in total: 665 CSE and 889 GSE. The sample was drawn from CSE levels 1-4 and GSE Young Learners 10-66 and was balanced across reading, writing, listening, and speaking skills. No translations were used. Descriptors were presented in their original language. Figure 1 shows an example of a comparison presented to judges in the web-based platform.

Figure 1 – Example comparison

The descriptors were divided into four groups so that each CJ activity contained descriptors related to the same skill. Table 1 summarizes the number of descriptors and comparisons for each of the four activities.
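The paper does not describe No More Marking’s internal pairing algorithm. The sketch below shows one hypothetical way a per-skill pairing plan could be generated so that every descriptor accrues roughly the same target number of judgements; it is not the platform’s actual scheme.

```python
import random

def make_comparison_pairs(descriptors, judgements_per_descriptor=10, seed=0):
    """Build a random pairing plan for one skill's CJ activity.

    Each round shuffles the descriptors and pairs neighbours, so every
    descriptor is judged once per round (an odd leftover sits a round out).
    Hypothetical scheme - not the actual platform algorithm.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(judgements_per_descriptor):
        pool = list(descriptors)
        rng.shuffle(pool)
        for i in range(0, len(pool) - 1, 2):
            pairs.append((pool[i], pool[i + 1]))
    return pairs
```

With 1,554 descriptors split into four skill groups and roughly 91 judgements per descriptor, a plan like this would generate on the order of 70,000 comparisons in total.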

Table 1 – Number of descriptors and comparisons for each activity 

Each descriptor was judged on average approximately 91 times, which far exceeds the minimum requirement of 10 judgements per descriptor (Verhavert, Bouwer, Donche, & De Maeyer, 2019). We were able to recruit more judges than expected, effectively doubling our already robust judging plan and enabling us to achieve high levels of reliability.

For each activity, the judgement data was analyzed to produce a scale of difficulty for each skill.  As a result, each descriptor has two important pieces of data:
  1. The intended difficulty of the descriptor in its original framework
  2. A difficulty estimate calculated from the CJ activities
The relationship between these two pieces of data enables us to establish an alignment. Using this relationship, linear transformation functions were produced to predict CSE levels for each point on the GSE Young Learners scale of 10 to 66.
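In outline, the transformation works by fitting an intended-difficulty versus CJ-difficulty line for each framework, then passing any GSE value through the shared CJ axis to a CSE value. Here is a minimal Python sketch; the coefficients in the example below are illustrative inventions, not the study’s fitted values.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def gse_to_cse(gse_value, gse_fit, cse_fit):
    """Map a GSE value to the CSE scale via the common CJ difficulty axis.

    gse_fit / cse_fit are (slope, intercept) of each framework's
    intended-difficulty ~ CJ-difficulty line.
    """
    a_g, b_g = gse_fit            # gse = a_g * cj + b_g
    a_c, b_c = cse_fit            # cse = a_c * cj + b_c
    cj = (gse_value - b_g) / a_g  # invert the GSE line
    return a_c * cj + b_c         # read off the CSE line
```

For example, with invented fits gse = 13*cj + 30 and cse = 1*cj + 2, a GSE value of 43 maps to CJ difficulty 1 and hence CSE level 3, mirroring the Listening correspondence the paper reports.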

Results of this study

Success of Comparative Judgement alignment

Analysis of the CJ data was carried out in RStudio using the extended Bradley-Terry model in the Supplementary Item Response Theory Models (sirt) package (R Core Team, 2018; Robitzsch, 2019).


Fit statistics can be calculated both for the judges and descriptors used in the exercise. These statistics give an indication of levels of consensus within the CJ study. For individual judges, they indicate how consistent the judge was with the consensus of other judges.

In this study, we excluded judges with infit greater than two standard deviations above the mean infit, as this indicates that they may have judged inconsistently or did not align with the consensus of the other judges.  Ninety-two percent of judges had acceptable infit statistics across the four exercises. Some judges were flagged as misfitting in specific exercises: Judge 8 in Listening, Judge 3 in Reading, and Judges 4 and 15 in Speaking and Writing. Figure 2 shows the infit values for all judges, with the excluded judges in red.
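The exclusion rule - drop any judge whose infit exceeds the mean by more than two standard deviations - can be stated directly in code. A minimal Python sketch with hypothetical infit values:

```python
def flag_misfitting_judges(infit_by_judge):
    """Return judges whose infit exceeds mean + 2 standard deviations."""
    vals = list(infit_by_judge.values())
    n = len(vals)
    mean = sum(vals) / n
    sd = (sum((v - mean) ** 2 for v in vals) / (n - 1)) ** 0.5  # sample SD
    cutoff = mean + 2 * sd
    return {j for j, v in infit_by_judge.items() if v > cutoff}
```

A judge flagged this way disagreed with the consensus often enough that their comparisons would add noise rather than information to the scale.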

Figure 2 – Judge Infit Reliability


Scale separation reliability (SSR) is a measure of the spread of descriptors along the scale of difficulty in relation to the estimated error in the descriptor difficulty values. The greater the separation between descriptors and the smaller the estimated error, the closer reliability is to the ideal value of 1. This gives an indication of how confident we can be that the rank order of descriptors produced in the exercise is accurate and reproducible.
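Under the usual formulation, SSR is the proportion of observed variance in the descriptor estimates that reflects true spread rather than estimation error. A minimal Python sketch of that formula, assuming this standard definition matches the one used in the study:

```python
def scale_separation_reliability(estimates, standard_errors):
    """SSR = (observed variance - mean squared standard error) / observed variance.

    Approaches 1 when descriptors are well separated relative to the
    uncertainty in their difficulty estimates.
    """
    n = len(estimates)
    mean = sum(estimates) / n
    obs_var = sum((e - mean) ** 2 for e in estimates) / (n - 1)
    mse = sum(se ** 2 for se in standard_errors) / n
    return (obs_var - mse) / obs_var
```

Widely spread estimates with small standard errors give values near 1, which is the pattern behind the 0.945 to 0.950 range reported in Table 2.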

Table 2 shows the reliability estimates for each of the four exercises after removing the misfitting judges. The values range from 0.945 to 0.950 and are evidence of high reliability. This indicates that the judges were able to construct a reliable scale of difficulty for CSE and GSE descriptors.

Table 2 - Scale Separation Reliability for Each Activity

Descriptor Difficulty Estimates

Analysis of the CJ data produces a logit scale of difficulty for the descriptors in each of the four exercises. Each descriptor used in the CJ is given a difficulty estimate along this scale. These CJ difficulty estimates can then be related to the descriptors’ intended difficulty, as defined by the GSE and CSE frameworks. Table 3 shows the correlations between the descriptors’ intended difficulty and the difficulty values produced in the CJ exercises. There is a strong relationship between the scale produced in the CJ and the two frameworks.

Table 3 - Correlations between Intended Difficulty and CJ Difficulty by CJ Activity

This means that we can be confident the judges in the CJ produced a common scale that describes both GSE and CSE, and replicates the hierarchy of difficulty seen in both of the original frameworks. This enables us to align the scales by using the linear relationship between the intended difficulty of the descriptors and the CJ difficulty values.

Figure 3 shows the linear relationships between each descriptor’s CJ difficulty and its intended difficulty on either the GSE or CSE scale for each of the four skills. Each descriptor is represented by a dot. The line running through each plot is a linear function that describes the relationship between intended difficulty and CJ difficulty.

Figure 3 – Comparing Intended Descriptor Difficulty with CJ Difficulty

By combining these linear functions, we can say, for example, that CSE level 3 Listening descriptors are generally about 1 on the CJ difficulty scale. GSE Listening descriptors that are rated the same level of difficulty on the CJ scale are generally about 43 on the GSE scale. Therefore, CSE Level 3 and GSE 43 describe similar levels of difficulty for Listening skills.

GSE Young Learners values for CSE levels

For each skill, a set of linear transformation functions was used to identify the GSE values that are equivalent to the thresholds for CSE levels. These values were averaged across all four skills to produce Table 4, the official alignment between the CSE and GSE Young Learners learning objectives.

Table 4 – Alignment between CSE and GSE Young Learners


This paper has described the process and results of aligning the Young Learner GSE Learning Objectives to those of CSE levels 1-4.

The reliability of the judgements using CJ was strikingly high (0.945 to 0.950), and the China-based raters reported the process to be straightforward and intuitive.

The next stage of this research will be for Pearson to triangulate the alignment from this study using other alignment methods. It is clear however, that CJ is a powerful and efficient method for linking and alignment purposes.

By Dr. Rose Clesham & Sarah Hughes from Pearson
