Sunday, October 26, 2014

Why ETS Might Not Want to Let Les Perelman In

Update on October 27, 2014.

In a discussion on WPA-L, Les Perelman responded to this post with two e-mails that shed more light on the nature of the proposal he submitted and his research goals. I've added those below.

In "Is MIT researcher being censored by Educational Testing Service?", Valerie Strauss has a post by Les Perelman, of MIT, detailing what he describes as a censorship condition imposed by Educational Testing Service (ETS) on research he proposed to them. ETS would give Les access to E-Rater, their technology for automatically assessing writing, only if he agreed to have his findings reviewed and to correct any errors ETS might find. If Les opted not to take their corrections, he could publish his findings but could not mention ETS or its products by name. The post includes a response by ETS asserting that the practice, which they do not dispute, is not censorship.

So my post here is not about whether ETS's conditions are censorship, but about something else: why automated testing companies might choose to make it difficult for Les to have access to their products, and how other researchers might opt to test the claims of these products.

Automated Writing Scoring Companies Do Not Trust Les Perelman

In his post, Les cites the model of consumer watchdogs -- "All I want to do is what organizations like Consumers Union and the Underwriters Laboratory do all the time: determine 1) if an advertised product meets its claims and 2) whether or not it is defective."

Les observes that ETS is selective in applying the research policy under discussion: "Over the next few months, I discovered that the provisions ETS had told me were common practice were not consistently applied. Around the same time, another researcher had applied to use Criterion and had no problem gaining access."

In addition to ETS, Les also asked Pearson for access to their automated essay scoring technology. He writes:
Pearson Educational Technologies wouldn't even reply to my request to test their WriteToLearn® software, and Peter Foltz, a Pearson Vice President, was quoted in the 2012 New York Times article as justifying Pearson’s refusal to give me access to their product because “He wants to show why it doesn't work.”
I'm not surprised that ETS and Pearson (LightSide Labs is a welcome exception on this front) will not give Les easy access to their technologies. Les's goal is in fact what Foltz says it is: to prove that, quoting from Les's post, "computer generated nonsense could receive high scores from Automated Essay Scoring (AES) computers," or put another way, that the software can be tricked into giving high scores for bad writing.

Les's goal, in their view, is not to test their products with the kind of prose most writers who use their products will write. So from their point of view, the review he proposes isn't of the kind Consumer Reports does -- using the products under the conditions they're designed for -- but is instead to show that their software can be tricked with a program that students will not be using. And if that is in fact their view, then I'm not surprised access isn't forthcoming.

I am surprised that ETS didn't just say as much in their rebuttal, if that is their reasoning. From the point of view of these companies, my guess is that they see Les as a provocateur, and not as an open-minded, curious researcher. Les's work is invaluable, and I love reading it. But I can see why these companies might not want to give him access at this stage. It may be cowardly on their part, but it also has a logic to it: very few of us make things easy for those we think are out to get us.

So, Can Automated Essay Technology Be Reviewed Openly and Fairly?

But back to testing this technology, which does need to be studied and tested. Les invokes the work of Consumers Union, publisher of Consumer Reports, and Underwriters Laboratory, a product safety organization, saying his research is akin to theirs.

But Consumer Reports doesn't ask for permission or access to products; it goes out and buys the stuff it will test. That specifically frees it from the kind of entanglements Les's proposal instigated.

So a Consumer Reports model that looked at ETS would not be the kind of project Les has proposed, which uses software specifically designed to fool the technology. It would instead use the technology under the kinds of conditions, and with the kinds of writing, that ETS claims to have designed Criterion for.

A research project could be done this way: access to Criterion is purchased for students, and the professors who will be using it go through the required Criterion training. The teachers would teach with Criterion as part of the course mix, with students using it to do the work of the course and the writing emerging for submission to the software under course conditions. The study can include a comparison of feedback from Word on the same draft submitted to Criterion, as Les proposed. But the basis of the study would be using the product the way it was designed to be used.

And ETS doesn't have to be told at all that the product is being studied in a class test of this kind.

I know that when I did a back-of-the-envelope analysis of grammar checkers using a single student essay -- accessing a variation of e-rater/Criterion by paying for a WriteCheck account -- Word, Grammarly (another platform I checked), and E-Rater as expressed in WriteCheck all got some things wrong. See the following two links for a summary of that:

But even that quick study doesn't get at what a class test would reveal about how the technology affects students and teachers, how they need to adapt or how the classroom feedback ecology and workloads are shifted.

The important question about automated writing assessment technologies is less about how accurate they are compared with one another, and more about how their presence in classrooms and in the hands of novice writers may hurt or help the teaching and learning of writing. And to know that, it's important to study the technology under the conditions where teaching and learning happen.

Responses from Les Perelman

First e-mail

I think you are being too kind to ETS and the other companies, Lightside excepted (we both have considerable respect for Elijah). My purpose with BABEL and my other experiments is to demonstrate that overall AES does not work. Various testing companies make absurd claims in volumes like the recent one edited by Burnstein and Shermis (2013). MT Schmidt in that volume states “IntelliMetric is theoretically grounded in a cognitive model often referred to as a “brain-based” or “mind-based” model of information processing and understanding.” I am trying to refute that claim and others.

Secondary to my claim that AES does not work are several other claims. First, students could game AES machines like E-rater, which already grades high-stakes tests like the GRE, simply by memorizing word lists and peppering them throughout their paper with no regard for making meaning. I have done that already with the E-rater GRE online test. There is a second human reader for machine-graded tests like the TOEFL and the GRE, but given the scoring conditions, I would be interested to see if human readers would catch such strategies. E-rater gives a score that is a continuous variable (e.g. 4.6), while humans are limited to integers. The two scores go to another human reader only if the difference is greater than 1.5 points on a 6-point scale. If the second human reader’s score is between the scores, the three scores are averaged. Otherwise the outlier score is thrown out. ETS has a research report on E-rater but does not present the crucial statistic: when there are outliers, what percentage of the time is the E-rater score the outlier? Given that Pearson wants to use their AES scoring engine as the second reader for the PARCC Common Core tests, these questions are extremely relevant.
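
[An aside from me, not part of Les's e-mail: the score resolution rule described in the paragraph above can be sketched in a few lines of Python. This is only an illustration of the rule as the e-mail states it, not ETS's actual procedure. The function name, the return format, and three details are my assumptions: that scores within 1.5 points of each other are simply averaged, that a second human score equal to one of the first two counts as "between" them, and that when a score is thrown out it is the one farther from the second human score, with the remaining two averaged.]

```python
def resolve_score(machine, human, second_human=None, threshold=1.5):
    """Return (final_score, dropped), where dropped is None, "machine", or "human".

    Hypothetical sketch of the adjudication rule described in the e-mail;
    the steps marked "Assumption" are not stated in the source.
    """
    if abs(machine - human) <= threshold:
        # Assumption: scores that agree closely enough are averaged.
        return (machine + human) / 2, None

    if second_human is None:
        raise ValueError("scores differ by more than the threshold; "
                         "a second human score is needed")

    low, high = sorted((machine, human))
    if low <= second_human <= high:
        # Stated in the e-mail: all three scores are averaged.
        return (machine + human + second_human) / 3, None

    # Stated in the e-mail: the outlier is thrown out.
    # Assumption: the outlier is the score farther from the second human
    # score, and the two remaining scores are averaged.
    if abs(machine - second_human) > abs(human - second_human):
        return (human + second_human) / 2, "machine"
    return (machine + second_human) / 2, "human"


if __name__ == "__main__":
    # E-rater says 5.5, the first human says 3, the second human says 2.
    print(resolve_score(5.5, 3, second_human=2))
```

Counting how often `dropped` comes back as `"machine"` over a set of adjudicated essays would give the statistic Les says is missing from the ETS report.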

Moreover, the specific study I was proposing was going to use student papers from the ASAP study to observe how well e-rater compares to MS Word. It would have been similar to the excellent study comparing instructors to e-rater in the current issue of Assessing Writing. In a phone conference with ETS, I even offered to show them the results afterwards before dissemination and discuss it with them. The only condition I would not accept was censorship.

As for the Consumer Reports model, I mentioned buying Criterion and using it in a class as you suggested during the conference call with ETS, and was told that the Terms-of-Use agreement with ETS for classroom use prohibits published research without ETS’s permission.

I believe that there may be a few legitimate uses for AES (and Elijah may well find them). However, I also am convinced by arguments from people like Noam Chomsky and other cognitive scientists that our knowledge of semantics is way too insufficient for most uses of computers to evaluate writing.

Les Perelman, Ph.D
Excerpt from a second e-mail from Les, written after I offered to post the first here:
I did not mention the source of the papers in the OpEd, although I did in the proposal, because most people reading the Washington Post would not know what the ASAP study was and it would take too much verbal real estate to explain. 
I did say, however, "I submitted a detailed proposal to compare the accuracy of Criterion to that of the Microsoft® Word Spelling and Grammar tool. I would conduct the study with a colleague from MIT who has a Ph.D. in linguistics from MIT and who worked with Noam Chomsky and Morris Halle, the founders of modern linguistics.” 
In my conference with them I told them I would agree to use Criterion for no other purpose.  That was not good enough for them.
