A colleague at Macmillan Education arranged a meeting in New York for editors with Art Graesser, a psychology professor at the University of Memphis. Art researches and designs in the Memphis Intelligent Tutoring Systems Center (MITSC), which is part of the Advanced Distributed Learning Center for Intelligent Tutoring Systems Research and Development (ADL-CITSRD), a government partnership located in the FedEx Institute of Technology (FIT, and yes, there will be an acronym test at the end of this post; it is 50% of your grade). The purpose of the meeting was to learn about different approaches to automated writing assessment.
On the way to discussing automated assessment of writing, Art described some other projects from MITSC:
AutoTutor (http://www.memphis.edu/mitsc/capabilities/team-memphis-projects/autotutor/index.php), where, as a student works at a self-paced tutorial, two "agents," software coded to track what the student is doing (or not doing), are triggered by student actions in the tutorial. So a student might make a mistake in identifying a key idea, and the first agent might trigger a text or audio message asking the student a question. The second agent might comment on the first agent's question or on the student's response, creating a kind of learning dialog among the two agents and the student about the item under study. The Turing Test becomes Turing Tutors. Now, if this sounds bizarre, wait: the research from Art and his team shows that students who study in the tutoring software do slightly better on shallow knowledge (recall, definition, summary) of content than students who read only, but do significantly better at deeper knowledge (reasoning, synthesis, and communication). The acts of dialog, of drawing student attention to thinking in new ways, of answering or at least considering the questions the agents pose, lead to deeper learning.
That's not surprising on the face of it, but what's powerful is the creation of software that helps a lone learner come to the kind of deeper engagement necessary for deeper learning.
And that's the nub of Art's work -- deeper learning through deeper engagement via dialog and writing.
Art and his colleagues at Memphis are also doing research with the Center for the Study of Adult Literacy (CSAL) at Georgia State University, which, like MITSC, has won grants from the Institute of Education Sciences (IES), a federal initiative that studies learning sciences. I'm linking to both CSAL and IES because they're sites worth visiting, doing a lot of useful research that we can draw on for validation and direction of editorial initiatives.
On writing, one of the first things you'll want to check out is the work MITSC has been doing with cohesion metrics of written text. To quote from the project, it involves the creation of a tool that generates "Automated Cohesion and Coherence Scores to Predict Text Readability and Facilitate Comprehension," affectionately dubbed, and this will be on the test so pay attention, Coh-Metrix. To put this in simpler terms, Coh-Metrix measures readability, only in ways more sophisticated than Lexile, Gunning-Fog, and, perhaps best known (because it's built into MS Word), Flesch-Kincaid. You can get a good explanation of what Coh-Metrix measures here -- http://cohmetrix.memphis.edu/cohmetrixpr/cohmetrix3.html -- but what you might want most to do is to go to their Text Easability Assessor (TEA) -- http://tea.cohmetrix.com/ -- create an account and have a TEA party* with some prose of your own.
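For a sense of what the simpler readability measures do, here is a minimal Python sketch of the Flesch-Kincaid grade level, which uses only average sentence length and average syllables per word. It is not how Coh-Metrix works, and it is not the MS Word implementation. The syllable counter is a rough heuristic of my own, and the function names are made up for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Crudely treat a trailing "e" as silent ("made"), but not in "-le" ("table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level from sentence length and syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

Short sentences of short words score low and long sentences of long words score high, and that is all the formula knows. Nothing in it looks at how ideas connect from one sentence to the next, which is the gap cohesion measures aim to fill.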
So. On to auto assessing writing. In the discussion, Art described three broad ways to auto assess text:
- Compare the text to an ideal and score it for how close it comes -- We can do this crudely already with one-word answers, for example. If a student writes a word into an answer and it matches the word we've designated as correct (correctly spelled, since our limited engine requires an exact match), the student gets full points for the question. The software used in automated assessment allows answers more sophisticated than a single, correctly written word and more nuanced scoring than right or wrong.
- Use a cluster of answers and map to them. That is, instead of comparing to a single ideal, a range of responses (A, B, C, and D answers, say) might be available, and student submissions are compared for features that match somewhere in the cluster. So if the writing has features associated with an A (vocabulary, length, and other measures), the writing is scored an A, and so on.
- C-Rater level (C-Rater is an ETS tool that we see in use in writing courses as Criterion). Here, the software is trained on prompts and on a corpus of sample student writing in response to those prompts. The prompts are designed so that submissions will fall into the range of samples given (so a bit of the first two approaches above happens), but in addition to using that corpus for the analysis, C-Rater also uses Latent Semantic Analysis, a means of analyzing the submitted text in more sophisticated ways.
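To make the first two approaches concrete, here is a rough Python sketch. It stands in plain word overlap (cosine similarity over word counts) for the richer features real tools use, and it does not attempt Latent Semantic Analysis at all. The function names and the idea of storing graded samples in a dictionary are my own assumptions, not anything Art described.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_against_ideal(answer: str, ideal: str) -> float:
    """Approach 1: how close does the answer come to a single ideal answer?"""
    return cosine_similarity(bag_of_words(answer), bag_of_words(ideal))


def grade_by_cluster(answer: str, graded_samples: dict[str, list[str]]) -> str:
    """Approach 2: return the grade of the most similar sample answer."""
    answer_vec = bag_of_words(answer)
    best_grade, best_sim = None, -1.0
    for grade, samples in graded_samples.items():
        for sample in samples:
            sim = cosine_similarity(answer_vec, bag_of_words(sample))
            if sim > best_sim:
                best_grade, best_sim = grade, sim
    if best_grade is None:
        raise ValueError("graded_samples must contain at least one sample")
    return best_grade
```

Word overlap rewards a student for reusing the ideal answer's vocabulary and gives no credit for a correct answer phrased in different words. That weakness is the reason the third approach brings in semantic analysis.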
The psychology team and the biology team have done experiments with MITSC using the Latent Semantic Analysis (LSA) engine they've designed. The process went something like this:
1. A textbook was turned over to MITSC as .txt files.
2. Those files were scanned and described for the latent semantic features using the engine.
3. Wikipedia's entries on psychology were also scanned, to extend the corpus and to provide a richer semantic matrix. This creates an LSA space, a corpus of writing against which student answers to short-answer questions are analyzed and scored.
4. In an experiment, 6 questions were used and student answers were auto scored and compared to human scores. Now this process had some significant steps I won't get into, and the answers were short answer -- 20 - 100 words or so. But the results were interesting in two ways:
A. The scores on the first three test questions, the ones used to train the machine to score, matched human raters.
B. The machine was able to score the second three questions accurately, without being trained on them. That is, the algorithm designed on the first three test questions carried over and worked on the second set of three.
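I don't know which statistics the team used to decide that machine scores "matched" human scores. As a sketch of how such a comparison is commonly done, here are two simple measures in Python: the share of answers where the two scores are identical, and the Pearson correlation between them. The function names are mine, and neither is claimed to be the measure used in the experiment.

```python
import math


def exact_agreement(machine: list[float], human: list[float]) -> float:
    """Fraction of answers where machine and human gave the same score."""
    if len(machine) != len(human) or not machine:
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(1 for m, h in zip(machine, human) if m == h)
    return matches / len(machine)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation: +1 means the scores rise and fall together."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / math.sqrt(var_x * var_y)
```

Running the same measures separately on the trained questions and on the untrained ones would show how much accuracy, if any, is lost when the scoring carries over to new questions.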
The potential here is tremendous: the ability to deploy, in all texts, questions that draw on the text being read for accuracy and a range of answers. Imagine this in a system where questions occur as students read, engaging them at key points to help them pause. And imagine the data that might result, where a student sees not only how they did, but what the class averages are. Imagine a tutoring agent stepping in from time to time as answers come in, helping students learn more deeply based on the answers. That is, with a well-designed automated assessment tool, one doesn't have to give a score alone; the score could trigger a dialogic tutoring agent, or suggest a study group of fellow students to chat with, or trigger a message to an on-campus tutor.
It's not the score so much as what is done with the score to foster further learning. The key is that the score comes from writing, an act that research shows deepens learning more fully than just reading, just highlighting, or just answering multiple-choice questions (even in LC).
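Here is one way the "what is done with the score" idea might look in code: a small Python sketch that maps a score to a follow-up action. Everything in it is hypothetical, including the 0-to-1 score scale, the 0.4 and 0.6 cutoffs, and the rule that compares a student to the class average. None of it comes from Art's systems.

```python
from enum import Enum


class FollowUp(Enum):
    NONE = "no follow-up needed"
    TUTORING_AGENT = "start a dialog with a tutoring agent"
    STUDY_GROUP = "suggest a study group of fellow students"
    CAMPUS_TUTOR = "message an on-campus tutor"


def choose_follow_up(score: float, class_average: float) -> FollowUp:
    """Pick a follow-up from a 0.0-1.0 score. Cutoffs are made up for illustration."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < 0.4:
        # Struggling badly: bring in a human.
        return FollowUp.CAMPUS_TUTOR
    if score < 0.6:
        # Partly there: a dialog might surface the missing idea.
        return FollowUp.TUTORING_AGENT
    if score < class_average:
        # Passing but behind peers: talking it over may help.
        return FollowUp.STUDY_GROUP
    return FollowUp.NONE
```

For example, `choose_follow_up(0.5, class_average=0.7)` returns `FollowUp.TUTORING_AGENT`. Choosing cutoffs that actually help students would take real data and careful testing.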
My understanding is that these questions were designed for convergent thinking (see the graphic here -- Faultless Facilitation – Leveraging De Bono’s Six Thinking Hats -- on that) and not divergent.
There's more, but I'm out of time.