15 Jul 2018

How can you tell if the June 2018 test is excessively flawed?

Submitted by Karl Hagen
TLDR: The evidence needed to tell isn't public at this point. It would be much easier to tell if College Board released specific technical data on the test.

College Board has made it clear they're not going to rescore the June 2018 test. For those who feel victimized by the scale, that's infuriating. But from my perspective, the demand these students are making is the wrong one. By asking College Board to rescore the test, they're essentially saying "raise my score arbitrarily." It would be arbitrary because there's no reason to think that College Board applied the equating procedures incorrectly. (The software they use to do this work is mostly automated.) As a result, fudging the score scales effectively means awarding students a higher score than is justified by the evidence for this test. Note, I'm not saying individual students might not actually deserve a higher score; I'm saying the evidence of this test doesn't tell us one way or another. Previously I explained how score scales really work. Today I want to consider what evidence we need to conclude that a particular test is sufficiently out of whack that we need to take corrective action.

No test is ever perfect. The right question to ask about the June SAT isn't whether it has flaws; it clearly does. The fundamental question is whether its flaws are significant enough that College Board needs to take corrective action. My position is that, although there is some suggestive evidence of a problem, the real data that we could use to judge isn't yet public.

One thing that is not valid evidence of a problem is a personal anecdote of how much you studied between retakes, only to see the same or slightly lower score. I believe you when you tell me that you studied hard. What I have no way of knowing is whether you studied effectively. Did you actually improve your skills on the concepts tested, or were you engaged in make-work? You may have reached a plateau.

Assume, for the sake of argument, that we have a test that validly measures some skill and that you retake this test without improving this skill. In that scenario, we expect that your scores will be within the test's margin of error of each other. That is, the second score might be a little higher or lower, but it ought to be close. A traditional way to imagine it is that you have a "true score" for a particular domain (that could be mathematics, or reading, or whatever else is tested). This true score is the value we would get, hypothetically, if we could somehow measure your knowledge perfectly. Your observed scores are approximations to that true score, affected by various sources of error that are inherent in the test-taking process. These include how you were feeling that day (Were you stressed out? Did you get enough sleep? Did the student in the seat next to you have some irritating habit that distracted you?) as well as variations in the test content (Did you get lucky and see a passage from a text that you've already read? Were you unlucky and encounter that one math problem type that always gives you trouble?).

We have no a priori way to tell how one observed score relates to your true score. For all we know, your first score may have been an overestimate, and the second simply regressed toward the mean.

It's true that on average, students who take the test a second time see increases in their score. What's hidden in that statistic is just how large the variation is among students. That is, while many students do increase their scores, a significant number stay the same, or even drop a bit. In the past, I've worked for two different test-prep companies that carefully tracked test-retest results of students on official SATs. What I found at both places was that the standard deviation of the test-retest score change was about twice the mean increase. So yes, more went up than down, but a substantial minority saw little change, or even some decline.
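As a rough illustration of what "standard deviation about twice the mean increase" implies, here is a minimal sketch in Python. It assumes score changes are normally distributed, which is a simplification and not something the retest data above establishes, and the 30-point mean gain is a hypothetical number chosen only to make the arithmetic concrete.

```python
import math


def normal_cdf(x, mean, sd):
    """Probability that a normal variable with this mean and SD is <= x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Hypothetical mean gain, for illustration only.
mean_gain = 30.0
# The relationship described above: SD of the change is about twice the mean.
sd_gain = 2.0 * mean_gain

# Share of retakers whose score stays flat or drops (change <= 0).
share_no_gain = normal_cdf(0.0, mean_gain, sd_gain)
# Share of retakers who drop by at least the size of the mean gain.
share_big_drop = normal_cdf(-mean_gain, mean_gain, sd_gain)

print(f"No gain or a drop: {share_no_gain:.1%}")
print(f"Drop of {mean_gain:.0f}+ points: {share_big_drop:.1%}")
```

Under these assumptions, roughly three in ten retakers would see no gain or a decline even though the average change is positive. The first share depends only on the 2-to-1 ratio, not on the particular mean chosen.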

The upshot is that individual experiences of minimal score change are consistent with the hypothesis of a fair test. They're also consistent with the hypothesis that these students actually did improve but the test is unfairly disguising that fact. My point is simply that we cannot tell which is a better explanation from isolated anecdotes.

A better suggestion that there's a problem with the test comes from the unusual nature of the math scale. I haven't seen the complete scale anywhere, but based on student reports, a raw score of 50 on the math test (i.e., 8 incorrect questions) resulted in a scaled score of 650, where for a more typical SAT, the same raw score would get you something like a 730. For comparison, I've pulled the raw-to-scale conversions for the top end of the Math Test that are found in the SAT Technical Manual Appendix (Table A-6.3.2), which are an approximation of the base raw-to-scaled conversion that was established during the original scaling. (You can think of this as the approximate target they're trying to hit with each newly developed form.) I juxtaposed them with the values for the June 2018 test so that you can compare what raw score translates to the same scaled score.

June '18 Raw   Base Raw   Scale   Percentile
50             42         650     84
51             43         660     86
-              44         670     89
52             45         680     91
53             46         690     92
54             47         700     94
-              48         710     95
55             49         720     96
-              50         730     97
-              51         740     98
56             52         750     99
-              53         760     99
-              54         770     99
57             55         770     99
-              56         780     99+
-              57         790     99+
58             58         800     99+

Percentile ranks are for the college-bound group, not the nationally representative sample, and come from the same appendix.

Clearly the June test is noticeably easier, and the equating reflects this. But is this really a problem? Gaps in the scaled scores aren't ideal, but they're also not necessarily an indication of a problem. If you compare the scales from the previous version of the SAT (in use between 2005 and 2016), drops from 800 to 770 were extremely common. Indeed, the scale would sometimes drop from 800 to 760, or even 750. What is extreme, though, is the variation of scaled scores with respect to the raw scores. I pulled my data on old score scales for the previous version of the SAT. (It's fair to compare the math scales in this respect because the number of available raw-score points is approximately the same on both old and new SATs.) If you look at the range of scaled scores to which one particular raw-score point maps, the biggest difference in all the released tests I have is 50 points. At the high end of the scale, it's 40 points. At a raw score of 50, by contrast, there's an 80-point difference between the June form (650) and the base form (730). The released test that is closest in difficulty to the June test is still 40 points higher at this raw-score point, and the most extreme difference is a 90-point gap.
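The gap at each raw score can be recomputed directly from the table. The sketch below is in Python. The June 2018 values are the student-reported figures from the table, so they are only as reliable as those reports, and the variable names are just for illustration.

```python
# Raw score -> scaled score, top end of the Math Test only.
# June 2018 values are the student-reported ones from the table above.
june_2018 = {50: 650, 51: 660, 52: 680, 53: 690, 54: 700,
             55: 720, 56: 750, 57: 770, 58: 800}

# Base conversion as given in the table (Technical Manual, Table A-6.3.2).
base_form = {42: 650, 43: 660, 44: 670, 45: 680, 46: 690, 47: 700,
             48: 710, 49: 720, 50: 730, 51: 740, 52: 750, 53: 760,
             54: 770, 55: 770, 56: 780, 57: 790, 58: 800}

# How many scaled points lower the June form is at the same raw score.
gaps = {raw: base_form[raw] - june_2018[raw] for raw in sorted(june_2018)}

for raw, gap in gaps.items():
    print(f"raw {raw}: June {june_2018[raw]}, base {base_form[raw]}, gap {gap}")

largest_gap = max(gaps.values())
print("Largest gap in this range:", largest_gap)
```

Within the range shown, the gap is widest (80 points) at raw scores of 50 and 51 and narrows to zero at a perfect raw score of 58.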

This is certainly an unprecedented difference, but is it an unfair one? Tests that are significantly easier than intended can disadvantage students at the high end of the scale because the ability to distinguish among them is reduced. So it's incumbent on College Board to provide specific, credible evidence that the equating is OK.
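One rough way to see the reduced ability to distinguish students is to count how many raw-score points cover the 650 to 800 range on each form, and how much one additional question is worth. This Python sketch again relies on the student-reported June values from the table; the helper name `steps` is hypothetical.

```python
# Scaled scores for consecutive raw scores, taken from the table.
june_scaled = [650, 660, 680, 690, 700, 720, 750, 770, 800]      # raw 50..58
base_scaled = [650, 660, 670, 680, 690, 700, 710, 720, 730,
               740, 750, 760, 770, 770, 780, 790, 800]           # raw 42..58


def steps(scaled):
    """Scaled-score change for each one-point increase in raw score."""
    return [b - a for a, b in zip(scaled, scaled[1:])]


june_steps = steps(june_scaled)
base_steps = steps(base_scaled)

print("Raw-score points covering 650-800:",
      len(june_scaled), "(June) vs", len(base_scaled), "(base)")
print("Largest jump for one question:",
      max(june_steps), "(June) vs", max(base_steps), "(base)")
print("Cost of the first missed question:",
      june_steps[-1], "(June) vs", base_steps[-1], "(base)")
```

On these figures, the June form spreads the same 150 scaled points over 9 raw-score points where the base form uses 17, so a single question moves a score by as much as 30 points, against 10 on the base form.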

Here are the data I'd want from College Board to check if what they're claiming about the equating is true:

1. The conditional standard errors of measurement (CSEM). These provide a check on how much random error creeps into the equating. All tests have some measurement error, and it is always larger at the scale extremes because fewer students achieve those scores and therefore the measurements are less precise, but if College Board is correct that the test is acceptable, the CSEMs should be similar to those for other tests.

2. First- and second-order equity plots. These plots are statistical ways of showing whether a new form is doing its job fairly. The first-order equity plot tells you whether the conditional expected scale scores are similar across different forms of the test. If they're not similar, then a student with the same true score would get significantly different observed scores on different versions of the test. A second-order equity plot tells you whether the CSEMs are similar for alternative forms. Once again, large discrepancies between the forms are evidence of a problem.

Since the recommended practice is to run these tests on each equating, it's very likely that College Board has already done this work. They simply have to release it.

In addition to information on the equating, I would also want to know summary statistics about the items. In particular, College Board should release tables showing

1. The item difficulty distribution and average item difficulty (i.e., how hard were the problems)
2. The item discrimination distribution and average item discrimination (i.e., how well did the items distinguish higher-performing students from lower-performing ones)
3. The differential item functioning (DIF) summary tables (i.e., how many items performed significantly differently among different groups of test takers such as male vs. female, white vs. black, etc.)
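For readers unfamiliar with the first two statistics, here is a minimal sketch of how they are commonly computed in classical test theory: difficulty as the proportion of students answering an item correctly, and discrimination as the correlation between an item and the score on the remaining items. The response matrix is made up for illustration, the function names are hypothetical, and nothing here is claimed to be the procedure College Board actually uses. DIF analysis is more involved and isn't sketched.

```python
import math

# Made-up data: rows are students, columns are items (1 = correct, 0 = wrong).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with nonzero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def item_difficulty(data, item):
    """Proportion of students who answered the item correctly."""
    return sum(row[item] for row in data) / len(data)


def item_discrimination(data, item):
    """Correlation between the item and the total score on all other items."""
    item_scores = [row[item] for row in data]
    rest_scores = [sum(row) - row[item] for row in data]
    return pearson(item_scores, rest_scores)


n_items = len(responses[0])
difficulties = [item_difficulty(responses, i) for i in range(n_items)]
discriminations = [item_discrimination(responses, i) for i in range(n_items)]

for i in range(n_items):
    print(f"item {i}: difficulty {difficulties[i]:.2f}, "
          f"discrimination {discriminations[i]:.2f}")
```

A higher difficulty value here means an easier item (more students got it right), and a discrimination near zero or below would flag an item that fails to separate stronger from weaker students.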

Finally, there are several general concerns about the quality of the test which need to be addressed.

1. What were the specific, technical reasons that each of the four dropped questions was omitted from the test?
2. Were any operational questions edited significantly (i.e., underwent non-cosmetic changes) after pretesting and used without being retested? (A whistleblower accused College Board of doing exactly this in 2016.)