Thursday, June 26, 2014

A Look at the Automated Analysis of Constructed Responses, Or Why Jared is Thin

Note: Like the post on Art Graesser's work, this entry comes from notes taken at a meeting with innovative professors doing great research on using automated responses to student writing, in this case, writing to learn. One of the privileges of being in publishing is meeting professors with great ideas.


So: Jared, the Subway guy, lost a lot of weight. Where did the weight go?

That's the kind of question that requires biology students not simply to recall key concepts but to explain a process, and unlike a multiple choice question, it does not let them scan the choices for hints. It asks them to construct a response that answers the question. So, where did Jared's weight go?  I won't tell you the answer, but it was the illustrative question the presenters used to explore "How Can Automated Assessment of Constructed Responses (AACR) Provide Automatic Evaluation of Written Formative Assessment in LaunchPad?"

The presenters were (with descriptions pulled from http://create4stem.msu.edu/project/aacr):
Mark Urban-Lurain, Associate Professor and Associate Director of the Center for Engineering Education in the College of Engineering, Michigan State University, directs the technology development and implementation of AACR.

John Merrill, Director of the Biological Sciences Program, Michigan State University, led the development of the core biology curriculum and provides disciplinary expertise in the biology portions of the work plan, coordinating with faculty who teach introductory biology courses to implement the materials in their courses.

John began by stating the essential problem: multiple choice questions are not adequate measures of what students know; writing -- in this project, short answers of roughly one to 40 words -- reveals student understanding (or misunderstanding) better. However, in large lecture courses, sometimes with four or five hundred students, even skimming, let alone reading, sorting, scoring, and forming a composite picture of what students know, doesn't work. So John and Mark embarked on a project to have students write and then use machine analysis not only to score the writing but to sort it into categories, and from those categories reveal where students hold misconceptions or misunderstandings. That gives instructors two things: a better gauge of student learning and a way to see broad trends, and thus to adjust their teaching to whatever students need clarified. Currently AACR is an instructor tool: instructors download questions that AACR can score, sort, and report on; deliver the questions to students; extract the answers as a spreadsheet; and upload that spreadsheet to AACR, where the AACR team runs the answers and delivers a report back to the instructor.  Cumbersome, yes, but the team has won a $6 million NSF grant both to improve how AACR works and -- this is key to the mission -- to create a means for faculty development, with advice flowing to and from faculty communities of users on what kinds of changes to make to their teaching in response to the data AACR provides.
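To make the "extract answers as a spreadsheet" step concrete, here is a minimal sketch in Python. The column names and file layout are my own assumptions for illustration; the post doesn't specify what format AACR actually expects.

```python
# Minimal sketch of exporting student answers to a spreadsheet (CSV) for upload.
# The student_id / question_id / response columns are hypothetical, not AACR's
# documented format.
import csv

responses = [
    {"student_id": "S001", "question_id": "Q1",
     "response": "The weight leaves as CO2 exhaled during respiration."},
    {"student_id": "S002", "question_id": "Q1",
     "response": "He sweated it out and lost it as waste."},
]

with open("aacr_upload.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["student_id", "question_id", "response"])
    writer.writeheader()
    writer.writerows(responses)
```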

Also attending the presentation were James Morris, first listed author of Biology: How Life Works (http://www.macmillanhighered.com/Catalog/product/biologyhowlifeworks-firstedition-morris), and Melissa Michael, the lead assessment author on the book. Both concurred on the power of writing to foster learning and to reveal, better than multiple choice questions, what students do and do not know.  James told a story about how he sometimes uses both a multiple choice question and a short answer question on the same material on the same test, and the results can be wildly different: students get more right on the MC question, but the written responses show they don't really know the subject matter, or at the very least cannot articulate it on their own. An MC question might contain language that triggers recall of lecture notes or textbook phrasing, but when students are asked to write, and thus have to produce that language on their own, things fall apart.

So, as to writing. Note the acronym AACR and the phrase "constructed responses."  While we might use the term open-ended responses, from our experience designing surveys and assessments, Merrill and Urban-Lurain used "constructed responses" to emphasize two things. First, students have to 'construct' a response, but the questions are not open ended the way survey items are, with options like "other" or "tell us more"; instead the questions seek specific responses tied to key concepts in the course. Second, fully considered as a concept in learning, a constructed response need not be text only (though AACR handles only written text) but can include images, artifacts, data, tables, and so on.

Earlier I posted on the meeting we had with Art Graesser and his use of Latent Semantic Analysis. The AACR project doesn't use that approach. Instead, it automates the analysis of responses by doing a word analysis -- the presence of key words, mostly nouns, in students' answers -- and matching the prevalence of those words against prior student answers. The machine is trained to score student writing and to categorize answers according to core ideas in, for this demo, biology, so that teachers can see which students are using language that correlates with understanding and which are not, and, where they are not, what they are misunderstanding. The software is built on SPSS (http://www.ibm.com/software/analytics/spss/), statistical analysis and predictive software now owned by IBM, chosen in part because it's designed for non-specialists to use but is robust enough for their purposes. They applied NodeXL, a program that creates association graphs from Excel data, to produce concept clusters and association graphs (a word used a lot gets a bigger ball, and a word it frequently appears with in student answers gets a thicker line to that word's ball; go to http://www.nodexlgraphgallery.org/Pages/Default.aspx to see NodeXL graph images and get a sense). The images you see at NodeXL are more complex than the graph from a single AACR question, so AACR's data is easier to read from the graph, but John Merrill noted that even so, and even with science professors, teaching instructors how to read the graph, understand what it says about student learning, and then come up with a teaching response to what they see is necessary work.
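For a rough sense of what the keyword co-occurrence graph captures, here's a sketch in Python using networkx as a stand-in for the SPSS-plus-NodeXL workflow. The keyword list and sample answers are my own assumptions; AACR's actual noun extraction and weighting are more sophisticated than this.

```python
# Toy analogue of the keyword co-occurrence graph built with SPSS + NodeXL:
# bigger node count = word appears in more answers; heavier edge weight =
# two words co-occur in more answers. Keywords and answers are illustrative.
from itertools import combinations
import networkx as nx

KEYWORDS = {"co2", "respiration", "metabolism", "sweat", "urine", "fat", "energy"}

answers = [
    "the fat is broken down and leaves as co2 through respiration",
    "his metabolism burned the fat for energy and he exhaled co2",
    "the weight left his body as sweat and urine",
]

G = nx.Graph()
for answer in answers:
    present = sorted(KEYWORDS & set(answer.split()))
    for word in present:                      # node size ~ word frequency
        G.add_node(word)
        G.nodes[word]["count"] = G.nodes[word].get("count", 0) + 1
    for a, b in combinations(present, 2):     # edge weight ~ co-occurrence
        w = G.get_edge_data(a, b, {}).get("weight", 0)
        G.add_edge(a, b, weight=w + 1)

for a, b, data in G.edges(data=True):
    print(f"{a} -- {b}: co-occurs in {data['weight']} answer(s)")
```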

Here's my understanding of how things worked:

SPSS was fed WordNet, an open source lexical database funded by the National Science Foundation (http://wordnet.princeton.edu/). WordNet links and associates words not just by meaning, like a thesaurus, but by concepts and varied meanings, so it is a richer lexical matrix. Merrill and Urban-Lurain also added terms and meanings particular to biology, so that specialized language, or specialized meanings of common words (such as 'mean' in statistics), that students might use in their answers were accommodated.
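As a small illustration of what WordNet provides and what "adding terms" might look like, here is a sketch using the NLTK interface to WordNet. The glossary entries are assumptions of mine, not the actual additions the AACR team made, and they fed WordNet into SPSS rather than using Python.

```python
# Sketch: look up WordNet senses for a word, layering a hand-added biology
# glossary on top. Requires: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

# Domain-specific senses a stock dictionary may miss (illustrative examples).
BIOLOGY_GLOSSARY = {
    "respiration": "cellular process that releases energy and produces CO2",
    "mean": "in statistics, the arithmetic average",
}

def senses(word):
    """Return the glossary sense first, then WordNet definitions."""
    results = []
    if word in BIOLOGY_GLOSSARY:
        results.append(("glossary", BIOLOGY_GLOSSARY[word]))
    for synset in wn.synsets(word):
        results.append((synset.name(), synset.definition()))
    return results

for name, definition in senses("respiration"):
    print(name, "->", definition)
```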

As student answers were added, SPSS produced a report that pulled and analyzed key words -- nouns -- in the answers and took a first pass at suggesting categories/concepts for those words. Merrill and his team of biology professors then fine-tuned that, correcting the categories and which words should be associated with them. Categories were then tied to concepts key to the course.  So, for example, on where Jared's weight went, 37.3% of answers correctly attributed the loss to cellular processes (metabolic rate, calorie burning, and weight leaving as CO2), while another 37% attributed it to physiological actions (sweat, urine, feces, and other means of departure). Guess which is right, or rather, where the student answers should be clustering. With those results, computed quickly and more accurately than a professor could manage by reading, categorizing, and counting answers by hand, a teacher sees at a glance whether students have the key concepts down, and with greater accuracy than an easier-to-auto-score multiple choice question provides. Very powerful information.
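To show the kind of bucketing and percentages involved, here is a deliberately simplified sketch that sorts answers into the two categories from the Jared question by keyword presence. The keyword lists and sample answers are assumptions for illustration only; the real system does far more than substring matching.

```python
# Simplified illustration: bucket answers into cellular vs. physiological
# categories by keyword hits, then report the share of answers per category.
from collections import Counter

CATEGORIES = {
    "cellular": ["co2", "metabol", "respiration", "calorie"],
    "physiological": ["sweat", "urine", "feces", "waste"],
}

def categorize(answer):
    text = answer.lower()
    hits = [cat for cat, keys in CATEGORIES.items() if any(k in text for k in keys)]
    return hits or ["uncategorized"]

answers = [
    "the fat is exhaled as CO2 after being metabolized",
    "it left his body as sweat and urine",
    "he just ate less",
]

counts = Counter(cat for a in answers for cat in categorize(a))
for cat, n in counts.items():
    print(f"{cat}: {n / len(answers):.0%} of answers")
```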

Questions were drafted and student responses were uploaded. For the sample question we studied, 374 student responses were gathered and two things happened: first, human readers applied a rubric and scored them; then the answers, categories, and concepts were tweaked so that SPSS could give a predictive score -- note the word predictive -- that says, essentially, based on the vocabulary we see, the software predicts a human reader, reading the full answer, would give it score X. Over time that prediction matched human scorers 83% of the time or better for high and low scores (of the three levels the humans used) but only 43% on mid-range scores (where humans also show the widest variance).
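The kind of comparison behind those 83% and 43% figures is easy to sketch: per score level, how often does the machine's predicted score match the human rubric score? The score lists below are invented purely to show the calculation, not real AACR data.

```python
# Toy per-level agreement check between predicted and human rubric scores.
# Both score lists are made up for illustration.
from collections import defaultdict

human     = ["high", "low", "mid", "high", "mid", "low", "high", "mid"]
predicted = ["high", "low", "low", "high", "mid", "low", "high", "high"]

totals, matches = defaultdict(int), defaultdict(int)
for h, p in zip(human, predicted):
    totals[h] += 1
    if h == p:
        matches[h] += 1

for level in ("high", "mid", "low"):
    print(f"{level}: {matches[level] / totals[level]:.0%} agreement "
          f"({matches[level]}/{totals[level]})")
```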

The labor intensity is in question authoring and in adding vocabulary (though both would subside if more instructors used the program and contributed material, one of the goals of the NSF grant, a kind of crowdsourcing to get more questions in). The labor also comes in establishing predictive outcomes from SPSS that match the scores of normed human graders (humans trained to apply a rubric consistently, so readers give the same, or nearly the same [it depends on the rubric], score to a given sample of student constructed response). It took 374 items scored by humans, for example, to account for the range of responses and lexical variation in student work. That's a lot of norming for one question. Multiply that by just five questions per chapter to go with a book, and you can see a tremendously labor-intensive process.

But that said, consider the pedagogical benefits and outcomes possible, and the potential to adapt the machine not only to score and report to an instructor but also to give adaptive information directly back to students (imagine a LearningCurve made of "constructed responses" instead of just multiple choice questions, and you can see where this might go).

Right now the technology is young, despite ten years of research, and the NSF grant is only six months or so (out of five years) in. So there's time to see where this goes and what experiments the biology team can try. My own concern is for the humanities, where our textbook prices, the point of sale for our pedagogy, are significantly lower, so the labor to build the questions the current methodology requires is something we couldn't afford. But boy, they got a lot right, and it'll be cool to see where this goes and whether biology and other science books in MHE can run experiments.
