30 Mar 2013

How much is an SAT essay worth to your score?

Submitted by Karl Hagen
[Note: This article refers to the old SAT essay, which was made obsolete in 2016]

As you probably know, the 200-800 score for the SAT Writing test is a composite score, based on a combination of an essay and multiple-choice questions. Students (and instructors) often ask me exactly how much the essay counts towards the overall score. Finding an answer to this question is rather tricky, particularly since the score that's reported to you is rounded to the nearest 10 points. (Internally, the ETS psychometricians use unrounded scales for their calculations. The scores are rounded before they are reported so people are less likely to place unwarranted significance on small differences in scores.)

One practical consequence of rounding is that, depending on your multiple-choice raw score, a single point difference in your essay score can mean a difference to your scaled score of anywhere from 0 to 30 points. Such an answer is unsatisfying, however, so I set out to derive a more specific answer by inferring the unrounded contribution of particular essay scores based on the score scales released with publicly available tests.

[Update (12/6/2013): I've added more scales to my analysis (nearly doubling the data set) and refined my estimation algorithm so that it gives a more precise result for many scales. The numbers below reflect that revised analysis, which only refines the earlier analysis slightly.]

Here are average amounts, to the nearest tenth, that a particular essay score contributes to your overall composite writing score in comparison to an essay score of 0, which is given for omitted or off-topic essays (the normal range of scores is 2-12). These values were calculated from the score tables for 64 different essays on released tests from 2005-2013.

For reasons that I'll explain below, this breakdown likely is not exactly how ETS calculates things, but thinking of the essay as points added to a base score is conceptually more straightforward.

Essay Score | Average Scaled-Score Contribution | Min | Max | Range | St. Dev. | Nominal Score

The College Board states that the essay is worth about 30% of the writing score, a weighting that implies a 12 essay should translate to 180 scaled-score points [.3(800-200)=180], and the "nominal score" column shows how many scaled points an essay would be worth if it were simply given an equal portion of that total amount. As you can see, though, that's clearly not how the essay value is calculated, although these nominal amounts are within the observed range for most essay scores. The precise amount will be calculated based on the observed distribution of scores for that essay.
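The arithmetic behind the nominal column can be sketched in a few lines of Python. This is only an illustration: it assumes that "an equal portion" means each essay point is worth the same fixed share of the 180 points, and the names `essay_share` and `nominal_contribution` are invented for this example.

```python
# Assumption: the essay's nominal share is 30% of the 200-800 range,
# split evenly across the 12 possible essay points.
ESSAY_WEIGHT_PERCENT = 30
SCALE_MIN, SCALE_MAX = 200, 800
MAX_ESSAY_SCORE = 12

# 30% of the 600-point range.
essay_share = ESSAY_WEIGHT_PERCENT * (SCALE_MAX - SCALE_MIN) / 100


def nominal_contribution(essay_score):
    """Scaled points an essay would be worth under an even per-point split."""
    return essay_share * essay_score / MAX_ESSAY_SCORE


for score in (2, 6, 12):
    print(score, nominal_contribution(score))
```

Under this reading every essay point is nominally worth 15 scaled-score points, which is the yardstick the observed contributions are being compared against.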

Notice that there's more variability in the extreme scores on both ends, but especially at the high end, than there is for the middle scores. The percentages of 11s and 12s seem to vary significantly more from test to test than any other score point.

Consider the numbers in terms of incremental payoff, i.e., how many extra scaled-score points you get, on average, for raising your essay score by one point. The biggest jump comes between the 8 and the 9 essay, which is worth on average 23.7 points to your score, followed by the step between 7 and 8 (20.1 points). The smallest payoff comes in the step from 0 to 2 (8.6 points). Indeed, for some tests, there's essentially no practical difference, after rounding, between a 0 and a 2 on the essay. The next smallest payoff is the step between 4 and 5 (only 11.8 points). The other steps average about 15-16 scaled-score points for a one-point increase in the essay score.

So, how did I get these numbers?

I started by assuming that ETS would follow normal psychometric practice in creating a composite scale. That means that the composite writing score should be calculated in one of two ways: either by adding weighted raw scores to produce a composite raw score, which is then translated to a scaled score, or by assigning separate scales to the multiple-choice portion and the essay portion and adding the two to produce the final composite scale.

Based on the pattern of numbers in the scale tables, I strongly suspected that the second method was the one used, but I checked them both out to be sure. I could not find any set of numbers that could explain the observed score tables under the assumption that weighted raw scores were summed, but I found solutions for every score table I tried under the second method.

Under this method, we can think of a composite scaled score as the sum of a multiple-choice scaled score (NB: not the same as the multiple-choice subscore reported on the test report) and an essay scaled score. In other words, $S_{m,e} = S_m + S_e + 200$, where $S_m$ is the scaled-score contribution for a multiple-choice raw score of $m$, and $S_e$ is the scaled-score contribution for an essay score of $e$. Values of $S_m$ and $S_e$ need not be integers, or positive, but they must be monotonically increasing.
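To see how rounding interacts with this additive model, here is a minimal Python sketch. It assumes the reported score is the unrounded sum rounded to the nearest 10 with halves rounding up, which is a guess about the convention. The function names, the grid of $S_m$ values, and the essay baseline of 0 are all hypothetical. The two step sizes (8.6 and 23.7) are the averages quoted above.

```python
import math


def round10(x):
    """Round to the nearest 10; halves round up (an assumed convention)."""
    return int(math.floor(x / 10 + 0.5)) * 10


def composite(s_m, s_e, base=200):
    """Reported score under the additive model: round(S_m + S_e + 200)."""
    return round10(base + s_m + s_e)


def reported_gains(step, s_m_values, s_e_low=0.0):
    """Distinct reported-score gains produced by one essay step of size
    `step`, collected across many multiple-choice contributions."""
    return {
        composite(s_m, s_e_low + step) - composite(s_m, s_e_low)
        for s_m in s_m_values
    }


# Hypothetical multiple-choice contributions, spaced 0.5 apart.
s_m_grid = [300 + 0.5 * k for k in range(40)]

small_step = reported_gains(8.6, s_m_grid)   # average 0 -> 2 step
large_step = reported_gains(23.7, s_m_grid)  # average 8 -> 9 step
print(sorted(small_step), sorted(large_step))
```

With these inputs the 8.6-point step shows up as either 0 or 10 reported points, and the 23.7-point step as either 20 or 30, depending only on where the multiple-choice contribution happens to fall relative to a rounding boundary.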

I then wrote a routine in R to search for a set of values for $S_m$ and $S_e$ that produces the observed scores in the table. Typically, a range of values for each score point will work, so the routine was written to converge on a solution at the midpoint of the range of workable values. As a result, the specific numbers derived for a particular test are probably not exactly right, but they should be within a point or two of the true values. I'm not certain that my routine was the best way to do things, and it was modestly sensitive to the initial conditions, but I did get a solution for every scale that I tried, so it seemed to do the job adequately.
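The R routine itself is not reproduced here. As a rough illustration of the same kind of search, the Python sketch below swaps in a different, standard technique: each cell of the table becomes a pair of difference constraints, and Bellman-Ford relaxation finds one feasible assignment. Unlike the routine described above, it does not converge on the midpoint of the workable range. The half-up rounding convention, the function names, and the toy table are all assumptions made for the example.

```python
import math


def round10(x):
    """Round to the nearest 10; halves round up (an assumed convention)."""
    return int(math.floor(x / 10 + 0.5)) * 10


def fit_additive_scale(table, base=200, eps=1e-6):
    """Find S_m, S_e with round10(S_m + S_e + base) == table[m][e].

    Returns (s_m, s_e) with s_e[0] shifted to 0, or None if no
    monotonically non-decreasing solution exists.
    """
    n_m, n_e = len(table), len(table[0])
    n = n_m + n_e
    # Node m holds x_m = S_m; node n_m + e holds y_e = -S_e.
    # An edge (u, v, w) encodes the constraint value[v] - value[u] <= w.
    edges = []
    for m in range(n_m):
        for e in range(n_e):
            t = table[m][e] - base
            edges.append((n_m + e, m, t + 5 - eps))     # S_m + S_e <= t + 5 - eps
            edges.append((m, n_m + e, -(t - 5 + eps)))  # S_m + S_e >= t - 5 + eps
    for m in range(n_m - 1):
        edges.append((m + 1, m, 0.0))                   # S_m <= S_(m+1)
    for e in range(n_e - 1):
        edges.append((n_m + e, n_m + e + 1, 0.0))       # S_e <= S_(e+1)

    dist = [0.0] * n  # same as a virtual source with zero-weight edges
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    else:
        return None  # still relaxing after n passes: constraints conflict

    s_m = dist[:n_m]
    s_e = [-d for d in dist[n_m:]]
    shift = s_e[0]  # re-baseline so the first essay column contributes 0
    return [v + shift for v in s_m], [v - shift for v in s_e]


def reproduces(table, s_m, s_e, base=200):
    """Check that the fitted values regenerate every cell of the table."""
    return all(
        round10(base + s_m[m] + s_e[e]) == table[m][e]
        for m in range(len(table))
        for e in range(len(table[0]))
    )


# Toy table built from made-up "true" contributions (not real SAT data).
true_s_m = [3.1 + 9.7 * m for m in range(8)]
true_s_e = [0.0, 8.6, 24.1, 39.9, 51.7]
toy_table = [[round10(200 + a + b) for b in true_s_e] for a in true_s_m]

fit = fit_additive_scale(toy_table)
print(fit is not None and reproduces(toy_table, *fit))
```

The fitted values generally differ from the made-up ones used to build the table, because many assignments round to the same cells. That is the same ambiguity that makes any per-test estimate good only to within a point or two.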

The scaled-score point differences between essay scores are not constant within a single test. In other words, there is no linear equation based on $e$ that will give workable values for $S_e$. For that reason, the scaled-score contribution for the essay can't be based directly on a simple linear transformation.

The exact procedure used to derive a specific $S_e$ for a particular test remains obscure to me. It's almost certainly based on the percentile ranks of essay scores for that test, but is the data smoothed? Is it, for example, transformed to a normal distribution, and if so, what are the parameters of the target distribution? Is the 70-30 weighting a nominal or an effective weighting? (If the latter, the actual weights of the two components will vary depending on their variance and covariance.) I can't answer any of these questions from the score tables alone, nor have I found any literature that answers them.

In the table above, I took the zero essay as the baseline, and for students it's probably most natural to conceive of non-zero essay scores as adding points. But it seems more likely that ETS calculates from the mean. In the terms I used above, then, $S_m$ would represent the multiple-choice scale at the mean essay score (which is normally around 7.2), and values of $S_e$ will be negative for essay scores below the mean.
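The choice of baseline does not change any reported score, which is why the score tables alone cannot distinguish the two conventions. As a short sketch, with $c$ an arbitrary constant introduced only for illustration:

$$S_{m,e} = S_m + S_e + 200 = (S_m + c) + (S_e - c) + 200$$

Choosing $c$ so that $S_e - c = 0$ at the mean essay score re-centers the essay contributions on the mean and shifts every multiple-choice contribution up by the same amount, leaving each composite unchanged.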

If my supposition is right, it provides a natural explanation for an oddity that I observed occasionally when there are two or three different composite writing tables in a single test booklet. (A few different essays are typically used with the same multiple-choice questions. For example, on the Saturday test dates, different essays are used for the eastern and western halves of the United States. A separate composite table goes with each essay.) Occasionally, the columns for essays with a score of 0 will differ in a few cells for the same multiple-choice score. If $S_e$ is calculated relative to the essay mean, the zero essay isn't special. It has a point value just like any other essay score, and if the distribution of zero essays differs enough among the different essays, the zero-essay column can vary. The fact that this is a relatively rare event suggests that the number of zero essays tends to be fairly stable.

This procedure leads to an apparently perverse result: you can get a different composite writing score based on which essay you don't write. If you think through the situation, however, keeping in mind that the purpose of a scaled score is to allow comparisons among students who took different versions of the test, this outcome can be justified.

Setting $S_{e0} = 0$ makes sense only if everyone who received that score would have been indifferent to the particular topic they saw. For example, perhaps they decided to skip the essay no matter what it was.

For many students who receive a 0 on the essay, though, this outcome will be affected by the prompt itself. If you consider the universe of potential essay topics, it seems likely that some essay topics may be more likely to provoke students not to respond at all, or to write off topic. If you are a student presented with a highly unusual topic, you might be more likely just to give up and skip the essay, or to write on a completely different subject, than if you received a more pedestrian topic. Under those circumstances, the act of omitting one essay rather than another actually could merit a different scaled score.



So, briefly, what is a good SAT score, eh?

That depends on what sort of school you're looking at.