16 Jul 2018

Thoughts on College Board's Response to #rescoreJuneSAT

Submitted by Karl Hagen
College Board responded yesterday to the #rescoreJuneSAT complaints with a FAQ page. The document has been very carefully worded, presumably by PR professionals: it makes technically accurate statements while dodging the serious substance of the complaints.

The statements about equating are general but accurate, as far as they go. Equating is standard practice among all testing programs. Its purpose is to ensure fairness for all students, and it would be unfair not to equate. I've written in detail about how equating works if you want to understand what's really involved in the process. If you'd rather not get into the details, suffice it to say that some very smart statisticians have worked on this problem over the last 60-70 years, and most of the naive objections you're likely to make, and which upset students have been making on social media, are actually addressed in the methodology.
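To see the basic idea, here is a minimal sketch of linear equating, the simplest member of the family. The numbers are invented for illustration, and College Board's actual procedures are more sophisticated (equipercentile and IRT-based methods), but the principle is the same: a score on one form is mapped to the score on another form that occupies the same relative position.

```python
# Minimal sketch of linear equating. All numbers are hypothetical; operational
# SAT equating uses more sophisticated equipercentile/IRT methods, but the
# core idea is the same: map score x on form X to the score on form Y that
# sits the same number of standard deviations from the mean:
#     y = mu_Y + (sigma_Y / sigma_X) * (x - mu_X)

def linear_equate(x, mu_x, sigma_x, mu_y, sigma_y):
    """Convert raw score x on form X to the equivalent raw score on form Y."""
    return mu_y + (sigma_y / sigma_x) * (x - mu_x)

# Suppose form X (harder) has raw mean 30, SD 8, and form Y (easier) has
# raw mean 34, SD 8. Then a raw 38 on the harder form is treated as
# equivalent to a raw 42 on the easier one.
print(linear_equate(38, 30, 8, 34, 8))  # -> 42.0
```

This is why a student who happens to sit for a harder form is not penalized: the conversion compensates for the difficulty difference before scaled scores are reported.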

But, and this is the crucial thing that College Board omits from its response, equating has its limits. It can accurately correct for differences between tests only within certain ranges. If an individual test is much harder or easier than the target, it is no longer a matter of indifference to a test taker which form they take.
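A toy illustration of the failure mode, with invented conversion tables: when a form is much easier than the target, high raw scores bunch together near the ceiling, so the equating function must spread a few raw points across a wide scaled range, and a single careless error costs far more than it would on a typical form.

```python
# Hypothetical illustration of why equating breaks down when a form is much
# easier than the target (all numbers invented for the example, not actual
# SAT conversion tables).

def scaled_drop_per_raw_point(conversion):
    """Scaled-score cost of each successive raw point lost from the top."""
    return [conversion[i] - conversion[i + 1] for i in range(len(conversion) - 1)]

# Top of a typical conversion table vs. an unusually easy form's table,
# listing scaled scores for raw max, max-1, max-2, ...
typical = [800, 790, 780, 770, 760]
easy    = [800, 770, 750, 730, 710]

print(scaled_drop_per_raw_point(typical))  # -> [10, 10, 10, 10]
print(scaled_drop_per_raw_point(easy))     # -> [30, 20, 20, 20]
```

The mapping in the second table is still a legitimate equating in the formal sense, but for students near the top of the scale the two forms are plainly not interchangeable.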

Another point of silence that may or may not be significant: nowhere does the FAQ state that the equating for this test was adequate. It tells us what equating is in general but makes no specific assertion that the equating was fair for this particular test. Perhaps that's just an oversight, but in effect College Board is asking people to trust that the equating is appropriate without providing any evidence to support that assertion. There are, however, widely accepted statistical procedures that can detect a lack of equity, and those data could certainly be released if College Board chose to do so.

The FAQ also addresses the issue of the questions that were omitted from the test. Once again, their factual statements are strictly speaking correct. Questions are occasionally dropped from a test, and there is enough redundancy that you can omit some questions and still have a valid measurement. It's also true that these omissions in and of themselves are not directly related to the equating, as it's perfectly possible to equate two forms of a test that differ in length by one or two questions.

This response, though, misses the real concerns that these omissions raise:

First, it doesn't mention why questions are omitted. In practice, this only occurs when a significant flaw turns up in a problem. This could be an undetected ambiguity that renders more than one possible answer correct. It could be a statistical anomaly that renders the problem useless. In any case, when College Board uses the anodyne phrase "for any reason" to explain why problems are removed, they are dancing around the fact that the only valid reasons to drop questions involve some sort of College Board screw-up.

Historically, questions have been removed from the SAT only very rarely. Back when ETS made the tests, it happened once every few years. My impression (which I haven't tried to quantify) is that there has been a slight uptick in removed questions since College Board took over direct development of the test, but I have never, in all the years I've been helping students prepare for the SAT, encountered a single test with more than one such question. The June test had four, two in reading and two in writing.

Now, having developed a lot of standardized test questions myself, I'm relatively sympathetic to screw-ups. Writing bullet-proof questions is very hard, and one reason that commercial test-prep material has a deservedly low reputation for quality is that it usually doesn't go through a rigorous editorial process. But that process, which includes reviews by committees of content experts, multiple editors, and psychometricians, is just what College Board is supposed to be doing. If done carefully, a flawed question ought to make it through to the operational form only once in a blue moon.

However, four such questions got through the screening process. That you can still produce a technically adequate equating after dropping those questions cannot disguise the fact that there appears to have been an extremely serious lapse in quality control. It's logical to ask whether the same sloppiness extended to other questions on the same test, and it's fair to ask whether any corners were cut in the test-development process.

The most obvious way I can see for so many bad questions to make it onto an operational form is if some questions were put on the form without adequate pretesting. All questions are supposed to be pretested. Pretesting not only helps estimate their difficulty; it can also turn up flaws in a question, which can then lead to a rewrite to try to make it better. But after any edits to a question, it ought to go through the same pretesting procedure again, as even minor changes in wording can affect the difficulty and other qualities of the item in significant ways. Since there have been insider accusations that College Board has been doing exactly this in the recent past, it's reasonable to speculate that something similar may be going on here.

If quality control was compromised, in this way or in some other, then the issue of the four omitted questions does overlap with concerns about equating the math test. The validity of the equating is threatened if two forms differ too much in difficulty, and one way you can get into trouble assembling a form is if your estimates for question difficulties are significantly off. One possible scenario is that some math questions were revised but not retested, leading to questions that were much easier than expected.
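A toy sketch of that scenario, with invented numbers: a form's expected raw-score mean is roughly the sum of its items' p-values (the proportion of test takers answering each item correctly). If a block of revised items turns out much easier than the stale pretest estimates used to assemble the form, the whole form overshoots its difficulty target, and the equating has to stretch to compensate.

```python
# Toy sketch of how off pretest difficulty estimates throw off form assembly.
# All numbers are hypothetical. A form's expected raw mean is approximately
# the sum of its items' p-values (proportion answering correctly).

pretest_p = [0.55] * 40 + [0.45] * 18   # estimates used to build the form
actual_p  = [0.55] * 40 + [0.70] * 18   # revised items came out much easier

expected_mean = sum(pretest_p)  # difficulty target the form was built to hit
actual_mean = sum(actual_p)     # difficulty the form actually delivered

print(round(expected_mean, 1), round(actual_mean, 1))  # -> 30.1 34.6
```

A gap of several raw points between the intended and delivered form difficulty is exactly the sort of discrepancy that pushes an equating toward the edge of the range where it works well.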

College Board needs to make clear statements about these issues, and release the technical documentation to support their assertions.