Machine Scoring of Student Essays: Truth and Consequences

Machine Scoring of Student Essays: Truth and Consequences. Edited by Patricia Freitag Ericsson and Richard Haswell. 2006. Logan, UT: Utah State University Press.

Genre: Teacher resource

233 pages, not counting the extensive bibliography and glossary.

Introduction and Summary

This book requires a little background . . .

Back in the day, large-scale writing assessment was fun. I am not making this up. The scoring was done by teachers (some current, some retired), and they returned year after year to be part of the festivities and to share their insight and wisdom with one another. No one ever wanted to miss the scoring sessions, and they phoned months in advance (not making this up either) to ensure that they would be part of the team for the upcoming year. It certainly wasn’t the money—which was pathetically small. It was the camaraderie, and the opportunity to learn about student writing.

During our weeks together, we read many papers aloud and discussed them at great length. We imagined the faces behind those papers, and talked about students we would love to coach or teach. We identified recurring problems and discussed ways of addressing them—and the thinking of those early teacher/raters formed the foundation for the 6 trait workshops that evolved through the years.

We didn’t read student writing every single minute. As anyone who teaches knows, the mind needs diversity to stay sharp. So we took hourly breaks. We also read to one another from the greatest writers we could identify—people like Maya Angelou, Larry McMurtry, Garrison Keillor, Dylan Thomas, Nikki Giovanni, Sandra Cisneros, Ernest Hemingway, and many, many others. This is how we taught ourselves to think more deeply about things like detail, voice, word choice, fluency, and creative use of conventions. When a student paper was especially stunning (and many were—contrary to the rumor, large numbers of students actually write brilliantly), people would ask, “Can we have a copy of that?” They wanted to celebrate the work of these students by sharing it aloud with family and friends. It was a way to keep the writing alive. At the end of each season, scorers hugged one another and headed off to a place called The Mad Greek for outrageously good sandwiches and a goodbye beer. The most frequent comment was this: “I can’t wait to get back into the classroom and use what I have learned.”

Our reading goal was 12 papers per hour (most papers averaged a page and a half to two pages). We didn’t push people to read faster (despite budgetary pressures) because we wanted them to take enough time to comprehend what each student was saying—enough time to notice the detail, appreciate the word choice, marvel at the originality, feel touched by the voice.

Fast forward . . . Nothing like this happens now. Raters work mostly for professional agencies and publishing houses. They are “trained” in record time through a blitz method that simply asks prospective raters to match sample papers with anchor papers—and if enough matches are achieved, they are deemed trained. Raters may be teachers, but sadly (and understandably), many teachers want nothing to do with this process. And so, raters are just as likely to be people with some writing or editing experience—or anyone with a four-year degree who can meet the requirements. They are often urged to read at breakneck speeds of 20 papers per hour, with bonuses awarded to those who can read faster than this (rates as high as 30 papers per hour are not unheard of, and speed is rewarded).

As a sidebar, let me say that while it is quite possible to read a riveting novel at this pace (you’ve probably done it yourself), reading student writing is another matter altogether. When you read a student’s work, you’re not reading for plot. You’re reading to assess, and that means paying very close attention to the central meaning and how it’s developed, the organizational design of the whole, the use of words and phrases, the beginning, the ending, transitions, and the general flow. I submit it is impossible to do this in any fair and consistent way at a speed of 20+ papers per hour (three minutes, at most, per paper). And I mention this because of the current furor over machine scoring. If you’ve been living in a cave, maybe you missed it—but there’s outrage right now over the possibility that student essays are being (or might be) scored by computers programmed to do just that.

Let me be clear. This outrages me, too. I am wholly, one hundred percent, against automated scoring. At the same time, it also occurs to me that human scoring, as currently conducted, is anything but a viable alternative. In fact, it’s about as close to automated scoring as you can get and still have humans involved. In order for human scoring to make any serious difference, and to be significantly different from automated scoring, the following things would need to be true:

  • The raters need to be teachers—the people we trust to assess students’ writing the other 99.9% of the time
  • These teacher/raters need sufficient training to become thoroughly familiar with the criteria used to assess the writing—and they need to agree whole-heartedly with those criteria (In fact, it’s best if they are the ones who developed them)
  • Training needs to occur not just for a short period at the beginning of the assessment scoring period, but throughout the scoring process
  • Teacher/raters need frequent opportunities to talk with one another, to share perceptions and observations, so that everyone’s insight and awareness are sharpened
  • The pace needs to be realistic—ideally, no more than 12 papers per hour—so that readers can truly pay attention to what they are reading and respond appropriately and thoughtfully to something as complex as writing

OK, let’s take off the rose-colored glasses. These things happened once—a long, long time ago, before some people made a painful discovery: Assessing student writing in this way, while both engaging and rewarding (it’s some of the best professional development available), is incredibly expensive. It isn’t going to happen again. Readers will be pushed to the limits of their brains, eyeballs, and ability to sit. And whether, under these conditions, they can (at least consistently) provide feedback that is of any more value than machine scoring is highly questionable. What do we do about this?

Before you leap to answer this question, have a look at a fascinating book: Machine Scoring of Student Essays: Truth and Consequences, edited by Patricia Freitag Ericsson and Richard Haswell. The book contains 16 essays by a variety of language arts and assessment specialists, exploring issues like the following:

  • Can machines really understand the meaning of text?
  • What traits do computers focus on in assessing writing?
  • What are their limitations?
  • Is it possible for savvy students to fool the machine—and if so, how?
  • How do students react when they find out their essays are machine scored? Do they mind? Do some actually prefer it?
  • Is computer analysis of writing of any value in instruction?
  • What do we stand to lose as an educational community by allowing machines to score our students’ writing?

This is an important book not only because of the current controversy surrounding writing assessment associated with the Common Core, but also because a number of colleges and community colleges are already making use of automated scoring programs. As of this writing, the Common Core alliances do not plan to use machine scoring for extended pieces of writing. To learn more about this, check out the article “Automated Essay Scoring (AES) and the Common Core State Standards” by John Wood (May 20, 2013). It is available online. You may also want to read the new NCTE (National Council of Teachers of English) position paper on automated scoring, “Machine Scoring Fails the Test,” approved April 2013. You can find this and many valuable related links at www.ncte.org/positions/statements/machine_scoring

Because of budgetary considerations, however, we have no guarantees that machine scoring will not become more acceptable in the future, or that it will not increasingly be used to assess college placement essays—or even everyday college classroom writing.

What’s at stake here . . . As you peruse these or related articles, please ask yourself this: What do we value when it comes to writing instruction? What are we measuring here? Whether student essays are read by machines or mind-numbed people imitating machines, we are certainly not looking in depth at idea development, organizational design, creative use of language, or the ability to capture and hold the attention of an audience. Do we not care about these things anymore? For that matter, when we ask students (or anyone) to write for 20 or 30 minutes on a cold topic that is of no interest to them (except for purposes of the test), a topic about which they have no time to think or do research, what are we hoping to discover? How many people can write cogent sentences under pressure? This is not writing. So let’s stop pretending we are measuring writing at all in this ridiculous fashion. Writing involves planning, thinking, reflecting, reading aloud (to oneself or others), and a veritable symphony of drafting and revision. We can write grocery lists and thank you notes in 20 minutes. We cannot write essays. We cannot write anything representative of our capabilities.

Current approaches to writing assessment are a charade. Machine scoring is but one small part of a much bigger problem. To assess writing effectively, we need to—

  • Figure out what we value—what we’d truly like to see from our student writers
  • Set up an assessment that measures what we want students to do, not just what they can do in 20 minutes
  • Come up with a way of scoring results that matches our assessment approach

As long as we put budgets ahead of student learning, “writing” assessment results will be bitterly disappointing—and good writing teachers will have no idea what to do about it because they will recognize that we are measuring something other than writing.

Highlights of the Book

This book raises a number of interesting issues related to automated scoring—and the possibility that computer programs might be used instructionally (heaven forbid) in the future. (If this doesn’t scare you, it should.) Here are a few highlights that caught my attention:

  1. The making of meaning. Patricia Freitag Ericsson’s essay “The Meaning of Meaning” (Chapter 2) explores the nature of meaning itself. How do computers achieve their so-called natural language processing, or making sense of human speech? Is it a thinking process, as we imagine actual thinking to be, or more of a mathematical process in which words are added to other words, rather like Legos, to form some programmed “whole” with a predictable and formulaic meaning? As she tells us, “If composition is about making meaning—for both the writer and the reader—then scoring machines are deadly. Writing for an asocial machine that ‘understands’ a text only as an equation of word + word + word strikes a death blow to the understanding of writing and composing as a meaning-making activity” (p. 37). Within this same essay, Ericsson cautions us about the very real possibility of social, ethnic, or racial bias in a machine that is insensitive to variations in language or dialect. A human can be (often is) motivated by a wish or need to understand another human being, even if that means climbing some linguistic mountains. The machine does not share this wish.
  2. Word meanings.  Putting aside the meaning of the text as a whole, consider individual words for a moment. In English, a given word—even a simple word such as go, hand, or see—can have numerous meanings, depending on context. In “Can’t Touch This” (Chapter 3), Chris M. Anson maintains that countless misunderstandings can arise from the difficulty inherent in programming a computer to “understand” all these possible meanings. Simply programming the computer with variant definitions is not enough because interpretation of text (ask anyone who’s ever taught vocabulary or ESL) “requires knowledge of the word’s surrounding sentential and discursive context” (p. 43). Consider these sentences: The burgers are ready to eat. Grandma is ready to eat. Or think how a simple word like hand is used in the following sentences: Hand me the book, Let me give you a hand, Let’s give that dancer a hand, Give me your hand in marriage, He’s a good hand at cooking, That jacket is a hand-me-down, Deal me another hand. Now imagine that your writing will be assessed by a machine that essentially can’t distinguish among these very different usages.
  3. Length. I once read an article by a man who claimed he could “score” a student’s paper from across the room, and guarantee almost a perfect match with machine scores. Actually, this is not as magical (or even difficult) as it sounds. As you will discover throughout the book, machines favor length (up to a point—they don’t respond well to multi-page complexity). Given two essays, one running half a page, and one a full page, it is almost inevitable that the full-page essay will score higher if automated scoring is used. Now admittedly, some human raters share this bias. (Where else, after all, would the computers get it?) But if three decades of writing assessment experience has taught me anything, it’s that length per se is a poor and unreliable predictor of quality. One of the finest essays we ever received from an Oregon eighth grader ran only four or five sentences—it summed up the memories the student could call up simply by touching the pins he’d fixed to an old baseball cap. Instead of saying, “Ah—I wish you’d written more!” we said to ourselves, “How did you manage to do that in so few words?”
  4. Correctness. Second favorite trait of computer raters? Correctness, of course! Computers are just made for spotting (and counting) errors—and this preoccupation with correctness is also a recurrent theme of the book. Should errors matter as much as meaning? Well, that’s another whole topic . . . for now, let’s focus on how good computers really are at this error hunting business. Unfortunately, even something so apparently foolproof is anything but. First of all, machines do not, reputation aside, catch every error. With respect to spelling, they do fairly well, but they’re still working to master grammar, usage, and punctuation. And suppose you want to get creative?! Or super emphatic!! Expect a lower score if you blend or repeat punctuation marks. Want to do away with capitals or punctuation altogether? You’ll need to imitate e. e. cummings or Faulkner on your own time. Computers are not good at adapting. They’re not out to make friends. They’re out to assess you—which means, in part, to catch you making an error. Edmund Jones describes a research project in which he edited students’ work (after it was first machine scored) to see if the editing would improve the scores (103). It did, a little—but only sometimes. Corrected spelling and capitalization seemed to boost scores far more than changes to such things as wordiness, faulty antecedents, comma splices, or misplaced modifiers (104-105). In other words, like most of us, computers have pet peeves (programmed in, of course). Thus, when writing is assessed by a computer, a high score in conventions doesn’t necessarily mean your conventions are impeccable; it just means you were foxy enough not to make mistakes the computer was programmed to recognize. Clever you.
  5. Big words. As with length and conventions, computers are programmed to favor big words. (Apparently they are immune to the implied advice of Hemingway, who famously said he knew all the big words but chose not to use them.) We must remember that computers are very, very good at counting. By comparison, we humans have no skill at this whatsoever (though I’ve certainly known teachers who could not be stopped when it came to counting errors in student work). It is simple for the computer to tally the number of letters or syllables per word, words per sentence, sentences per paragraph, instances of word or phrase usage, and so on (a small sketch of this kind of tallying appears just after point 6 below). The moral of this story, when automated scoring is in your future, is this: never say big when you could say voluminous, amazing when you could say prodigious, or convert when you could say digitize. The prodigious digitization of voluminous data suggests that computers reward big words with high scores. It’s that simple.
  6. Organization. For years, in every workshop or class I have taught, I have stated my belief that organization is the most difficult of the six traits to score, to teach—or to master. That’s because it’s so exquisitely complex. Where to begin? What to say next? How long to continue? How to end? Anyone can come up with a random list of details—but arranging those details so they become a mystery story, documentary script, motivational speech, or argument? That takes thought. Sometimes it takes genius. And following the organizational design of a writer who doesn’t set out obvious sign posts—my first point, my second point—takes incredible concentration and the ability to follow someone else’s thinking. As we’ve seen already, computers are best suited to the simple tasks: identifying specific words or phrases, counting sentences, and so forth. The complexity of organizational design is beyond them.

Edmund Jones tested one computer program’s awareness of organization by scoring a sample of writing, then shuffling the sentences (so their new random order made virtually no sense as a whole) and rescoring the piece (109-111). The scores were essentially the same for organization, but curiously enough, dropped slightly for conventions. Well, once a conventions fanatic, always a conventions fanatic. I should add that through the years, I’ve known many human raters to lower scores in organization simply because, as they put it, “I couldn’t follow this.” Apparently computers cannot follow human thinking either—at least not all the time. But again we must remember, computers’ shortcomings and apparent biases are reflections not only of a machine’s inherent limitations but also of the preferences and capabilities of the programmers.
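To make the counting idea concrete, here is a minimal sketch, in Python, of the kind of surface tallying described in point 5 above. Everything in it is made up for illustration: the feature names, the thresholds, and the sample paragraph are assumptions, not features of any actual scoring product. Notice that shuffling the sentences changes none of these tallies, which is one plausible explanation for the Jones result: a scorer built largely on counts of this sort literally cannot see sentence order.

```python
import random
import re

def surface_features(text):
    """Tally the kinds of surface features a program can count easily.
    Feature names and thresholds are illustrative only."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": round(sum(len(w) for w in words) / max(len(words), 1), 2),
        "long_word_ratio": round(sum(1 for w in words if len(w) >= 8) / max(len(words), 1), 2),
        "avg_sentence_length": round(len(words) / max(len(sentences), 1), 2),
    }

# A made-up four-sentence sample; any short paragraph would do.
essay = ("My grandfather kept every ticket stub in a shoebox. "
         "Each stub marked a game we watched together. "
         "When I open the box, I can hear the crowd again. "
         "I do not need a scrapbook.")

# Shuffle the sentences so the paragraph no longer reads as a coherent whole.
pieces = [s.strip() + "." for s in essay.split(".") if s.strip()]
random.shuffle(pieces)
scrambled = " ".join(pieces)

print(surface_features(essay))
print(surface_features(scrambled))  # identical tallies: sentence order never enters the count
```

Real scoring engines are certainly more elaborate than this, but whatever else they add, counts of this kind remain blind to order, to meaning, and to voice.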

  7. What happened to voice? It’s no surprise—and several essays note this—that computers cannot score voice, nor its first cousin style. They don’t get it. Computers have no emotional response to writing, no appreciation, no feelings of horror, joy, loathing, terror, ecstasy, amusement. No feelings at all. But—does this matter? After all, we could argue that if students present us with good information and organize it well and show enough mastery of conventions to aid readability and preclude confusion, well then surely that’s enough. Is it, though?

Think of the last time you recommended a book to a friend. No—think of the last six times. The last six books you loved. Do you remember what you said as you held that book out to your friend? Did you say, “Flawless conventions. Well done.” Probably not. How about, “The guy’s an organizational genius.” Also unlikely. Maybe you did say, “I learned so much” or “You can’t believe the description!” Well, information and description are components of ideas, and ideas matter. But when it comes to singling out the writing we love most of all, the writing that touches us, the writing we remember forever, nothing trumps voice.

You were responding to voice if you’ve ever said something like this: I couldn’t put it down, I was riveted, I got the chills, I could hardly bear to read it, You won’t believe this story, I’ve never read another book quite like it, I laughed out loud, It made me cry, I felt as if the author were talking right to me, I never thought I’d like this topic, but this author made me care about it, I’m going to read it again, You have to read this. There are two fundamental reasons for writing: to provide essential information, or to touch another human soul. If we discard voice, we’re only left with one—and then good luck motivating young writers.

  8. Marching backward. The use of computers to assess (and by extension, teach—yes, some people do advocate this) student writing seems on the face of it so advanced—well, in a kind of space-age sort of way. After all, robots are building automobiles and operating on people. So surely computers can assign scores to short essays. In my favorite essay from the book (“Why Less Is Not More,” Chapter 15, pp. 211ff), author William Condon explores what we might lose by advocating (indeed, even allowing) automated scoring. He sums things up this way: “In other words, instead of a step forward, or even marking time, machine scoring represents a step backward, into an era when writing proficiency was determined by indirect tests” (212). That is a brilliant connection—and defines the divide. Those who favor machine scoring, or justify it because it saves time and effort, also tend to be those who see no problem with assessing writing by asking multiple-choice questions about writing: e.g., Which of the following sentences contains a grammatical error? Which of the following would provide the most effective lead for the paragraph that follows? Multiple-choice testing is by nature reactive. Writing is generative. A generative process cannot be measured by multiple-choice assessment because there is absolutely no way to anticipate the infinite array of possible responses. Only a non-writer could seriously think otherwise.

Valuable assessment—what Condon calls “robust” assessment—must take us inside the classroom, where the writing is created. We have to look at such things as the student’s understanding and application of process. Does the writer choose topics? Lay out a plan for research or some way of gathering information? Does the writer know how to use feedback judiciously, without ignoring it or feeling overwhelmed? Can the writer integrate drafting and revising effectively, using self-assessment to discover when the writing is working well? These are the kinds of questions we want answered when we design writing assessment that makes a difference.

Condon also points out that machines are usually used to score writing that is created under artificial conditions: Students are assigned prompts thought up by someone who doesn’t even know them, and asked to write under severe time constraints with limited or no access to resources, and usually with no opportunity to share their writing or revise it (except in the most superficial manner). What they produce under these conditions cannot ever be representative of what they can do when true writing process is allowed to take its course. As Condon tells us, “Such a sample, however it is scored, cannot tell us much about a student’s writing ability, because the sample’s validity is so narrow that it cannot test very much of the construct” (213).

Perhaps it’s fair to say then that machine scoring within this very limited context is not the end of the world after all—since the writing the machine is scoring is as artificial as the scoring procedure itself.

Condon is an advocate (as am I) of portfolio-based assessment, but confesses that to assess students’ writing in this humane and intelligent manner, we would have to leave our robotic scoring machines behind: “No automated-scoring program can assess a portfolio: the samples are too long, the topics often differ widely, and student writers have had time to think, to work up original approaches, and to explore source materials that help promote more complex thinking” (219). He adds that human assessment allows for what is arguably the most critical component of any writing assessment: conversation about the writing. Think about that. Think about what it means to give that up.

In the end

To appreciate the implications of automated scoring, we have to imagine ourselves as writers, working long hours, revising, picturing that one good listener—that mysterious person Mem Fox always called The Watcher—waiting to read what we have written. What would you want that person to say to you about your writing? Write it down. I mean it. Take it with you next time you have writing conferences with your students. Share it with them, so they can see the humanness that underlies the writing we all do—no matter how experienced or professional we may be. Now contrast what your heart longs to hear with the “feedback” you’re likely to receive from a computer: “Good work, guest student!” (53) This would be much funnier if it weren’t real.

There’s no getting around the fact, however, that at the classroom level, responding to student writing takes enormous chunks of time. You have to more or less devote your life (or at least a big portion of it) to this endeavor: reading, responding in your mind, writing responses, meeting with students to talk writing. Richard H. Haswell concludes Chapter 4, “Automatons and Automated Scoring,” with these comments: “In all honesty, the art of getting inside the black box of the student essay is hard work. In the reading of student writing, everyone needs to be reengaged and stimulated with the difficult, which is the only path to the good, as that most hieratic of poets José Lezama Lima once said. If we do not embrace difficulty in this part of our job, easy evaluation will drive out good evaluation every time” (78).

5 Things You Can Do

  1. Oppose automated scoring. Speak out. Form a discussion group. Read one chapter from this book a week and discuss it with friends. Find out how writing samples for your state or district will be scored and be sure you approve.
  2. Advocate the scoring of writing samples by teachers—real, live, human teachers. The discussion that comes from this experience is seen by an overwhelming majority of teachers as a true learning opportunity, with lessons that translate directly into classroom practice. This approach can be made affordable through sampling—assessing a selected portion of responses.
  3. Advocate true writing assessment in which students write to self-selected topics, and develop those topics over time, with opportunities for sharing, reflection, and revision. In other words, advocate the assessment of writing, not the meaningless assessment of quick responses. For details on precisely how to set up such an assessment, see Donald Graves’ brilliant book, Testing Is Not Teaching.
  4. Offer professional development to help teachers learn how to respond effectively to students’ writing, and learn ways of teaching students how to be writing coaches, so that they can respond effectively to one another. Encourage teachers in other content areas to share responsibility for writing so that not all writing is done in English, literature, or writing classes.
  5. Share at least one part of one essay in this book with your students (regardless of age). Ask them to write an argument supporting or opposing machine scoring of student writing. Encourage them (especially older students) to do some serious research on this topic, through online articles and interviews with teachers, administrators, parents, and if possible, testing specialists. What advantages and disadvantages do they uncover?

Coming up on Gurus . . .

I (Vicki) will be reviewing a delightful little book called Still Writing: The Perils and Pleasures of a Creative Life by Dani Shapiro. Refreshingly enough, it has nothing to do with standards or automated scoring. It has to do with the real work of becoming a writer. If you enjoyed Anne Lamott’s Bird by Bird, I think you’ll like it. Are you thinking about professional development in writing for the coming school year? We can help. Let us design a seminar or a series of classroom demos to meet your needs at the classroom, building, or district level. We can incorporate any combination of the following: Common Core Standards for writing, the 6 traits, effective approaches to dealing with genre, and the best in literature for young people (including emphasis on reading to write). Please contact us for details or with questions at any time: 503-579-3034. Thanks for stopping by. Come back—and bring friends. And remember . . . Give every child a voice.