29 May 2018

Is the SAT Essay Fundamentally Broken?

Submitted by Karl Hagen
The essay on the revised SAT marks a dramatic departure from the old essay. The old essay was a 25-minute, holistically scored exercise in persuasive writing. The new essay is a 50-minute, analytically scored rhetorical analysis.

When the format of the new SAT essay was announced, I admit I was cheered at the prospect. The new prompt has strong face validity, in that it seems to ask for the sort of skills we associate much more closely with college writing and it is much less susceptible to coaching techniques like memorizing predigested examples. Students who can successfully perform this task are probably much better positioned to succeed in college classes that demand analytical writing.

That said, over the last few years I've become increasingly skeptical that the scores on the essay provide much meaningful information. And I find it remarkable that as of today, more than two years after the new SAT became operational, College Board still has not published a validity study. Indeed, even its annual report contains significantly less detailed data than the old reports did.

The limited data available on the essay scores suggest two conclusions to me, which I will treat in separate posts. First, the essay task is likely inappropriately difficult for many students. Second, there are asymmetrical patterns in grading that suggest readers are overly reluctant to give the highest score, so much so that getting an 8 on any component of the essay is more likely the result of chance than of a consensus judgment of actual ability.

Is the Essay Too Difficult?

As I write this, I'm aware of two public sources of statistics on the new SAT essay. The first is the 2017 SAT Suite of Assessments Annual Report, which includes counts of the most recent essay score for students in the class of 2017 who took the new SAT before graduation. The second is the statistical report for Maine's April 2017 school-day administration. I'll look at the Maine report, which is much more granular, in the next post. For now, I want to focus on the score distribution in the national sample. The report just gives counts and percentages of the different scores, but here they are graphed as percentages. I also calculated the first four moments of the distributions.

[Figure: SAT Essay Subscore Distribution (percentages of SAT Essay subscores)]

SAT Essay Subscores

                  reading   analysis   writing
mean                 4.94       3.85      4.90
sd                   1.27       1.47      1.29
skewness            -0.26       0.36     -0.28
excess kurtosis     -0.22      -0.72     -0.30
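
For readers who want to reproduce this kind of summary from a published score table, here is a minimal sketch of computing the first four moments from a frequency distribution. The function name and the example distribution are hypothetical (the numbers are made up for illustration and are not the College Board figures), and the sketch uses population formulas, which may differ slightly from whatever software produced the table above.

```python
import math

def moments_from_frequencies(freqs):
    """Return (mean, sd, skewness, excess kurtosis) for a score distribution.

    freqs maps each score to its count or percentage; any positive
    weights work because they are normalized first. Population
    (not sample) formulas are used, which is an assumption.
    """
    total = sum(freqs.values())
    probs = {score: weight / total for score, weight in freqs.items()}
    mean = sum(s * p for s, p in probs.items())
    variance = sum(p * (s - mean) ** 2 for s, p in probs.items())
    sd = math.sqrt(variance)
    # Standardized third and fourth central moments.
    skewness = sum(p * (s - mean) ** 3 for s, p in probs.items()) / sd ** 3
    kurtosis = sum(p * (s - mean) ** 4 for s, p in probs.items()) / sd ** 4
    return mean, sd, skewness, kurtosis - 3  # subtract 3 for *excess* kurtosis

# Hypothetical, symmetric distribution over the 2-8 subscore scale.
made_up = {2: 10, 3: 10, 4: 20, 5: 20, 6: 20, 7: 10, 8: 10}
mean, sd, skew, excess_kurt = moments_from_frequencies(made_up)
print(round(mean, 2), round(sd, 2), round(skew, 2), round(excess_kurt, 2))
```

A symmetric distribution like the made-up one has zero skewness; a positive value, as in the analysis column, means the scores pile up at the low end with a longer tail toward the high scores.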

The reading and writing scores approximate the distribution we would expect to see if the score scales are functioning appropriately, although the skewness and excess kurtosis numbers show that none of the three subscores are normally distributed. The distribution of analysis scores, however, sticks out. It is glaringly obvious that many students don't have a clue how to perform a rhetorical analysis of a passage. The most common analysis score is 2, i.e., both readers gave a 1 to the analysis, which the rubric characterizes as "demonstrates little or no understanding of the analytic task."

In broad strokes, this performance reflects what I see with students I tutor when they first attempt this assignment. Unless they've taken AP English Language, which has a similar assignment, they tend to merely summarize the passage without providing any actual analysis, and so the abundance of 2s for the analysis subscore doesn't surprise me. This sort of analytical task cannot be done without specific training or practice, and clearly a significant number of students do not have that training: 42.7% of students had at least one reader give them a 1 on the analysis. And keep in mind that these scores reflect the most recent score of each student, so those students who took the SAT, bombed the essay, and then retook it after practicing how to write it, do not have their original scores included in this data set.
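
To see why a reported subscore says something about the individual readers, it helps to enumerate the reader pairs behind each total. The sketch below assumes that each subscore is the sum of two readers' scores on a 1 to 4 scale and that the two readers' final scores never differ by more than one point. That second condition is an assumption on my part, and the function names are hypothetical.

```python
from itertools import product

def reader_pairs(max_gap=1):
    """Map each 2-8 subscore to the reader-score pairs that can produce it.

    Assumes two readers scoring 1-4 whose scores differ by at most
    max_gap points (an assumption, not something the report states).
    """
    pairs = {}
    for low, high in product(range(1, 5), repeat=2):
        if low <= high and high - low <= max_gap:
            pairs.setdefault(low + high, []).append((low, high))
    return pairs

def scores_implying_a_one(max_gap=1):
    """Subscores for which every possible reader pair includes a 1."""
    return sorted(
        total
        for total, combos in reader_pairs(max_gap).items()
        if all(1 in combo for combo in combos)
    )

print(reader_pairs())           # each total maps to exactly one pair
print(scores_implying_a_one())  # [2, 3]
```

Under these assumptions, the share of students with at least one reader score of 1 is simply the share scoring 2 plus the share scoring 3, and an 8 requires both readers to award a 4.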

The analysis score is so strongly skewed that it looks like we're seeing the sort of floor effect that commonly occurs when you give people a test that is too hard for them. Scores that suffer from the floor effect (or the related ceiling effect) are of diminished utility because they lump together large numbers of people whose abilities actually differ in ways that are of interest.

Now you could argue that students should master the skill of writing a rhetorical analysis, and so low scores merely indicate a gap in their education that should be rectified. This line of thought would say, in effect, that the benchmarks are fine and we simply need to teach students better. And since the SAT is primarily intended as a tool for admissions officers to help them figure out how well prepared an applicant is to succeed at their school, you could say that the students at the lower end of the score scale simply aren't cut out for the more elite schools that tend to require the essay. Those schools wouldn't be concerned about the floor effect because they're only going to be admitting those with high scores. There are at least three reasons to question this argument.

First, while rhetorical analysis is a useful skill, what's the empirical justification for thinking that this writing task is more predictive of college success than the universe of other writing assignments that could be given? It's worth remembering that the analytical task in this essay is one that has traditionally been taught in college composition classes. Indeed, the specific form of the essay assignment is adapted from one that has long appeared on the AP English Language test. That means this particular assignment is a mode of writing that has traditionally been considered collegiate level, and it is often not taught in ordinary English classes. As a consequence, there is a strong likelihood that students who haven't taken AP Lang and who can't afford paid test preparation are at a severe disadvantage. (Free test prep like Khan Academy is useless for the essay. You need feedback from an experienced writing teacher to really help on the essay.) This outcome is exactly the opposite of the stated intention behind the revision to the SAT and tends to reinforce inequality rather than level it.

Second, the SAT isn't just used for college admissions. Some states or school districts administer the test to all students as part of their overall testing accountability program. For this purpose, any score that shows a severe floor effect is of negligible usefulness. Testing students on a writing mode they haven't been exposed to wastes everyone's time. It's a bit like administering an AP Calculus exam to students to try to determine their algebra skills.

Finally, it's not even clear from the data that the high scores provide useful information about the people the scores are supposed to matter most for: high achieving students applying to elite institutions.

The rarity of 8 scores is notable (1.38%, 0.65%, and 1.31% for reading, analysis, and writing, respectively), considering that individual graders only have four score points to choose from and that the scoring rubric isn't worded to suggest you need to write like Ta-Nehisi Coates to get a top score. I've already mentioned that the score distributions don't follow a normal distribution, but the negative excess kurtosis indicates that these distributions have "thin tails," i.e., that there are fewer scores at the tails than would be the case if the scores were normally distributed. Is it really the case that only a tiny percentage of high school students can write a superior analytical essay, or is this score distribution the artifact of the grading process?

The national report doesn't give us enough data to decide, but there are some clues in the Maine statistical report which suggest that the latter is the more likely explanation. I will explore those data and some of their implications in my next post.