13 Jul 2018

How do the SAT Scales Really Work?

Submitted by Karl Hagen
Reddit and Twitter have been in an uproar about the results from the June 2018 SAT. Many students are upset that missing only a handful of questions has dropped their scores precipitously, and have been expressing their outrage with the hashtag #rescorejunesat.

I completely get why they're upset. It's crushing to work hard studying for a test only to get a score much lower than you anticipated. But many of their specific complaints are based on assumptions that are simply factually incorrect. Even some of the test-prep outfits who've commented on the situation have made incorrect statements. (I'm looking at you, Princeton Review.) For reasons that I'll explain at the end, rescoring the test is a stupid idea. Assuming that the June test has significant flaws, and there's prima facie evidence that it has, the proper remedy is to cancel the scores and offer a free retest.

Before explaining how score scales are actually calculated, I want to make a few points to correct the most common misconceptions.

Your raw scores don't mean jack

Many who are upset point out that they missed fewer questions on the June test than on a previous one and yet their scores either didn't change or went down. What such students are doing is comparing their raw scores (the number of questions they got correct). But this is totally irrelevant and deeply misleading.

To see why, imagine an extreme case: You take two different math tests, each of which has 120 points. One was written for 6th graders, and the other is a Putnam Competition exam. Would you expect to get the same number of points on both? Unless you're a future Fields Medalist, certainly not.

Of course, you'll object that those two tests are so different that they can't be compared in any fair way, and you'd be right. But the basic point is this: if you take two tests of differing difficulties, the number of questions you get right will naturally be different.

Fine, you say, but shouldn't each SAT be the same difficulty? College Board must be incompetent if they can't manage that. If you've never tried to develop standardized test questions, it's easy to underestimate the challenges involved in creating high-quality questions. Before you express too much scorn, consider the following:

The only way you can ensure that two versions of a test have exactly the same difficulty is to give exactly the same test, with exactly the same questions, each of which uses exactly the same wording. For a high-stakes test like the SAT, that's a non-starter. Security considerations mean that new questions have to be developed all the time, and these new questions will each have their own statistical profiles, even if they are asking about the same topic.

Even highly experienced content experts have trouble predicting exactly how hard individual questions will turn out to be. Items have to be pre-tested on students that reflect the normal test-taking population. This gives (among other things) estimates of their difficulty, but these estimates always have an inherent margin of error in them for a number of different reasons. For example, these questions are tested on a sample group in an environment where the group's motivations may not be exactly the same as on the operational test.

This means there's no way to know for certain exactly how the questions will behave until the complete test is given operationally. No matter how hard the developers try to hit a difficulty profile, there will always be unpredictable variations. It's because of these inevitable variations of difficulty from form to form of a test that you cannot validly compare your raw scores from one month of the test to another. The scaled score exists so you can make the comparison. For example, a math score of 600 ought to have more or less the same interpretation, within a given margin of error, whatever date the SAT was taken. Without the scaled score, the test results would be useless to colleges because they couldn't compare tests taken on different dates.

Contra the conspiratorial assertions on Reddit, there's no reason to think the folks assembling the June SAT set out to make an unusually easy test. The uncertainties inherent in the test-assembly process mean that sometimes you'll see tests whose raw difficulty is harder or easier than normal. The drop-off on the math test, which many are complaining about (1 missed question drops you from 800 to 770), although larger than anything we've seen since the test was revised in 2016, was routine on the older test.

There is no curve, at least not the way you think

When people talk about creating a curve for a test, they usually have in mind the process of fitting the scores given to a normal distribution. Anyone who tells you that the SAT "gives out" a fixed number of scores at different score points is assuming this sort of curve, but they're wrong, and even a little bit of thought will show that there's no way such a procedure would work.

Consider that significantly different groups of people take the test in different months. The spring administrations are mostly juniors, with some sophomores. The fall administrations are a mix of students, including seniors who have procrastinated or who weren't happy with their scores the first time around. As a result, there are noticeable differences in the overall ability of students from administration to administration. If scaled scores were created by simply "curving" the test to the mean and standard deviation of the people taking the test, results wouldn't be comparable from test date to test date.
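
To make that concrete, here is a minimal sketch with made-up numbers. The function name curve_to_group and the target mean and standard deviation of 500 and 100 are illustrative assumptions, not anything College Board uses. Two groups take the same form, and each group is "curved" to its own mean and standard deviation.

```python
from statistics import mean, pstdev


def curve_to_group(raw_scores, target_mean=500.0, target_sd=100.0):
    """Hypothetical 'curve': rescale one group's raw scores to a fixed mean and SD."""
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]


# Made-up raw scores from two administrations of the SAME form.
# The fall group is stronger overall than the spring group.
spring = [20, 25, 30, 35, 40]
fall = [30, 35, 40, 45, 50]

spring_scaled = curve_to_group(spring)
fall_scaled = curve_to_group(fall)

# The same raw score of 40, on the same questions, in each administration.
score_spring_40 = spring_scaled[spring.index(40)]
score_fall_40 = fall_scaled[fall.index(40)]
print(round(score_spring_40), round(score_fall_40))
```

The identical performance comes out well above the target mean in the weaker group and exactly at it in the stronger group, which is precisely the kind of date-to-date inconsistency a scaled score is supposed to prevent.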

It's true that if you look at the raw-to-scale score mappings on a graph, it looks curvy, and yes, there are polynomial functions involved in the equating process, but that's not what most people mean by a "curve."

Scaling isn't the same as equating

Even people who don't talk about "curving" a test often talk about "scaling" it. What they usually mean, though, is what's known as "equating." This can lead to confusion if you're trying to make sense of the SAT's technical documentation, where these terms mean different things.

Scaling refers to the process of establishing a basic score scale. This is done once when a test is introduced. Although the scale will be reviewed periodically to check for problems like drift that can creep in when you use the same scale over many years, rescaling a test is a rare event because doing so means that the old and the new scores aren't comparable without a concordance table. Before the current overhaul in 2016, the SAT had done this only once in the past, in 1994, after the previous scaling had been used for nearly 50 years.

Equating refers to the process of fitting the results of a new test form to the base scale. This is the process that was used to create the specific score-conversion tables for the June 2018 test that people are complaining about.

So how are scaled scores actually generated?

First, you have to understand that the SAT is a norm-referenced test. A criterion-referenced test, by contrast, measures whether you possess a particular skill or set of skills. When you take a norm-referenced test, you're not being measured against some absolute benchmark of what skills you have or against those who are taking the test at the same time as you. Instead, your performance is calibrated against a reference group that was established when the score scales were originally calculated.

The new SAT, introduced in 2016, is radically different from earlier editions, and so the reference group was changed along with the test. The current reference group is derived from a sample of high school juniors and seniors who voluntarily participated in a pilot SAT study before the test became operational. (It was probably unavoidable that they took this approach, but there are real hazards to it, which I'll discuss another time.)

During the pilot study, that reference group took a test, which became the base form for the new SAT. Their results were used to calculate the base scales. So what we have is a mapping for this reference group between a particular raw score on the base form and the scaled score. Although raw-to-scaled-score conversion tables are published as integer values, the actual mapping is done with polynomial functions and doesn't assume integer inputs or outputs. These reference scales imply that a particular percentile rank of the reference group is associated with a particular scaled score.

To equate a new form of the test to the base form, the SAT uses a common-item, nonequivalent group design. The common-item part means that some of the test takers will also receive old items that have been calibrated to the base form. For students who take the no-essay version of the SAT, there is a fifth section that has such questions. The nonequivalent-group part means that the groups are assumed to be different, and probably differ in their ability levels. (These differences can't be too great, though, or the equating will break down.) The new group is randomly selected from a subset of test takers, usually with weighting to match the demographic characteristics of the original group.

In this design, the new group gets two raw scores: one for the new items and one for the common items. By comparing the results of the two groups on the common items, we can infer how the reference group would have done on the new form if they had taken it. There are several ways to calculate this equivalence, and College Board generally checks multiple possibilities, but the most common method is chained equipercentile equating.

I won't describe all the details here, but the basic procedure is that we create a function that maps the score on the operational (new) test to the common-item score by matching equivalent percentile ranks within the new group. For example, we might find that a reading-test score of 35 on the new test is at the 55th percentile, and the 55th percentile corresponds to a raw score on the common items of 23.854 (notice that with the functions involved, we're not normally going to get integer values in the outputs). We also create a function that maps the common-item score for the reference group to the base raw score. We then chain these two functions together by running the same common-item score through the reference-group function. For example, we might find that a reference-group common-item score of 23.854 has a percentile rank of 59.413, which corresponds to a base raw score of 36.723. In turn, we use this value to look up the appropriate scaled score. At the end of the whole process, we produce a table containing only the new raw score and the final scaled score, which will also be rounded.
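
For readers who think better in code, here is a rough sketch of the chaining idea. It is not College Board's implementation: the function names, the toy data, and the simplified percentile-rank definition (midpoint ranks with straight-line interpolation, no smoothing) are all my own assumptions for illustration, and the final lookup from base raw score to scaled score is left out.

```python
from bisect import bisect_right
from collections import Counter


def percentile_ranks(scores, max_score):
    """Midpoint percentile rank (0-100) at each integer score 0..max_score."""
    counts = Counter(scores)
    n = len(scores)
    ranks, below = [], 0
    for s in range(max_score + 1):
        ranks.append(100.0 * (below + 0.5 * counts[s]) / n)
        below += counts[s]
    return ranks


def interp(x, xs, ys):
    """Piecewise-linear interpolation over non-decreasing xs, clamped at both ends."""
    if x <= xs[0]:
        return float(ys[0])
    if x >= xs[-1]:
        return float(ys[-1])
    i = bisect_right(xs, x)  # first index whose xs value exceeds x
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def equipercentile(x, from_scores, from_max, to_scores, to_max):
    """Within ONE group, map score x on one test to the score on another test
    that has the same percentile rank."""
    from_pr = percentile_ranks(from_scores, from_max)
    to_pr = percentile_ranks(to_scores, to_max)
    p = interp(x, list(range(from_max + 1)), from_pr)  # score -> percentile rank
    return interp(p, to_pr, list(range(to_max + 1)))   # percentile rank -> score


def chained_equate(x, new_form, new_common, ref_common, ref_base, maxes):
    """New-form raw score -> common-item score (new group),
    then common-item score -> base-form raw score (reference group)."""
    new_max, common_max, base_max = maxes
    v = equipercentile(x, new_form, new_max, new_common, common_max)
    return equipercentile(v, ref_common, common_max, ref_base, base_max)


# Made-up data. The reference group took the base form plus the common items.
ref_base = [0, 1, 1, 2, 2, 2, 3, 3, 4]    # base-form raw scores, max 4
ref_common = [0, 1, 1, 1, 1, 1, 2, 2, 2]  # common-item raw scores, max 2
# The new group has identical common-item results (same ability) but took an
# easier form on which everyone scored one raw point higher.
new_form = [s + 1 for s in ref_base]      # new-form raw scores, max 5
new_common = list(ref_common)

equated = chained_equate(3, new_form, new_common, ref_common, ref_base, (5, 2, 4))
print(equated)
```

With these toy numbers, a raw 3 on the easier new form equates to about a raw 2 on the base form, so one extra correct answer on the easy form buys nothing. Real equating works with far larger samples and smoothed distributions, but the two-step chain is the same shape.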

The result of this complex process is that the new test can be aligned to the reference group even though the new group's ability level differs from the reference group's and the new test's difficulty differs from the base test's. But crucially, the equating stops working well if either one departs too severely from the reference point.

Score-scale tables are not created ahead of time

Both the Princeton Review Blog and many redditors have been asserting that tests are equated ahead of time, in the sense that the tables which convert raw to scaled scores are pre-calculated. As a result, the complaint goes that CB must have seen the oddity in the scale but chose to administer the test anyway. I can forgive the redditors, most of whom are students who understand nothing about psychometrics, for not understanding how the scoring process works. Princeton Review's mistake is less defensible. I'm going to give them the benefit of the doubt and assume that they've become confused by the terminology surrounding scaling and equating.

Scaling, as I said above, is done not just ahead of time but at the beginning of the testing program. Equating to establish the score tables is done after the test has been administered for the first time. Once those tables are created, though, it's not done another time if the test is reused. For example, international tests are typically repeats of US tests, and simply use the same scale tables originally established. People often criticize College Board for recycling tests this way, because it can create security headaches when the form gets leaked ahead of time. But adequately equating an SAT with an international population, most of whom are ESL, presents severe challenges of its own and is likely a major reason why College Board has persisted in this practice even though they've been burned repeatedly by test pirates revealing questions online.

There is another kind of statistical measurement, which does involve a different form of equating, that is done before the operational test. I mentioned above that all questions are pre-tested for difficulty and other features before they are used for real. Difficulty of questions isn't just measured as a percentage-correct rate. To do so would mean that the difficulty estimate could be biased by the ability of the test group. Instead, question difficulty is measured on an "equated delta" scale. Again, I'm skipping the details about the math involved, but the basic idea is that the difficulty of the individual problem is converted so that it reflects an estimate of how hard the reference group should find it. The methods for that equating are conceptually similar to those used to find the score scale, but a bit simpler (linear rather than equipercentile).
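
As a rough illustration only, here is what a linear delta equating could look like. I'm assuming the conventional delta index (13 minus 4 times the normal deviate of the proportion correct, so harder items get higher deltas) and a simple mean-sigma line fitted on anchor items. The function names and all the numbers are hypothetical, and the method used operationally may differ in its details.

```python
from statistics import NormalDist, mean, pstdev


def delta(p_correct):
    """Conventional delta index: 13 - 4 * z, where z is the standard normal
    deviate of the proportion correct. Harder items get higher deltas."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)


def mean_sigma_line(observed, reference):
    """Slope and intercept of the line that matches the mean and SD of the
    anchor items' observed deltas to their reference-group deltas."""
    slope = pstdev(reference) / pstdev(observed)
    intercept = mean(reference) - slope * mean(observed)
    return slope, intercept


# Made-up anchor items: proportion correct in a (strong) pretest group.
anchor_p_pretest = [0.90, 0.70, 0.50]
anchor_delta_pretest = [delta(p) for p in anchor_p_pretest]
# Made-up reference-group deltas for the same items: each one point harder,
# because the pretest group happens to be stronger than the reference group.
anchor_delta_reference = [d + 1.0 for d in anchor_delta_pretest]

slope, intercept = mean_sigma_line(anchor_delta_pretest, anchor_delta_reference)

# A brand-new item that half of the pretest group answered correctly.
new_item_observed = delta(0.50)
new_item_equated = slope * new_item_observed + intercept
print(new_item_observed, new_item_equated)
```

The new item looks average to the strong pretest group, but after the adjustment it is rated about one delta point harder, which is the estimate for the reference group.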

The key thing to note, however, is that this pre-test equating establishes the estimated difficulty of individual questions rather than the score tables that are at issue with the June test.

Why is rescoring stupid?

By asking College Board to "rescore" the test, students are in effect asking for new score tables. The question is how should they be generated? The math by which these scores are generated is well understood and validated. Changing the results arbitrarily to make students happier would undermine the whole point of equating and make comparison less valid.

The Princeton Review blog post about this issue is correct that very easy tests are a concern to high-performing students because one careless mistake will have a more pronounced effect on their scores than it would on a harder test. But an even more serious potential problem comes from the fact that equating only works if the new test is within spitting distance of the reference. If two tests are far enough apart, you can still do the math for the equating, but you won't have valid results. To know if that's the case for this test, we need much more technical data on the equating. Releasing the conditional standard errors of measurement (CSEMs) would be a good place to start.

Regardless of whether the score tables are appropriate, this test does have unparalleled problems. Two questions were dropped from the reading test and two from the writing test. Such omissions come when College Board discovers a flaw in a question when they have the full test data in front of them, or when a test-taker notices an ambiguity and complains. Individual questions get dropped occasionally, but in all my years tutoring for the SAT, I have never seen a single test have more than one such question. Four on one test is absurd, and it strongly suggests that important quality-control measures were overlooked. College Board was accused in 2016 by a whistleblowing insider of doing just that sort of corner-cutting. Could the same thing be happening here?

Where do I get my information?

Some people have asked me how I learned all this. College Board (and ETS for the old SAT) have always published a lot of technical information on the SAT, which includes statements about the type of equating done. The SAT has used equipercentile equating for decades, and the recent technical manual for the new SAT confirms that they're still doing so. Also see the scaling report, which explains the details of how scaling was performed. Further, the methods for scaling and equating aren't unique to College Board. ACT uses them too, and so do all the significant standardized tests out there. The fact that these are standard practices lets us infer how College Board is doing things even if they are silent on some of the details. If you want a general explanation of equating, the Wikipedia page on the topic is a decent start. For a more technical introduction, see Michael J. Kolen and Robert L. Brennan, Test Equating, Scaling, and Linking: Methods and Practices.