Note: Like the post on Art Graesser's work, this entry comes from notes taken from a meeting with innovative professors doing great research on using automated response to writing, in this case, writing to learn. One of the privileges of being in publishing is meeting professors with great ideas.
So: Jared, the Subway guy, lost a lot of weight. Where did the weight go?
That's the kind of question that requires biology students not simply to recall key concepts but to explain a process, and unlike a multiple choice question, it doesn't let the student look at answer choices for hints. It asks them to construct a response that answers the question. So, where did Jared's weight go? I won't tell you the answer, but it was the illustrative question used to explore "How Can Automated Assessment of Constructed Responses (AACR) Provide Automatic Evaluation of Written Formative Assessment in LaunchPad?"
The presenters were (with descriptions pulled from http://create4stem.msu.edu/project/aacr):
Mark
Urban-Lurain, Associate Professor and Associate Director of the Center
for Engineering Education in the College of Engineering, Michigan State
University, directs the technology development and implementation of
AACR.
John
Merrill, Director of the Biological Sciences Program, Michigan State U,
led the development of the core biology curriculum and provides
disciplinary expertise in the biology portions of the work plan,
coordinating with faculty who teach introductory biology courses to
implement the materials in their courses.
John began by stating the essential problem: multiple choice questions are not adequate measures of what students know; writing -- in this project, short answers of one to 40 words or so -- reveals student understanding (or misunderstanding) better. However, in large lecture courses, sometimes with four or five hundred students, even skimming, let alone reading, sorting, scoring, and building a composite picture of what students know, doesn't work. So John and Mark embarked on a project to have students write and then use machine analysis not only to score the writing but to sort it into categories, and from those categories reveal where students have misconceptions or misunderstandings. That gives instructors two things: a better gauge of student learning and the means to see broad trends, and thus to adjust their teaching to whatever students need clarification and help with.
Currently AACR is an instructor tool: instructors download questions that AACR can score, sort, and report on; deliver the questions to students; extract the answers as a spreadsheet; and upload that spreadsheet to AACR, whose team runs the answers and delivers a report back to the instructor. Cumbersome, yes, but the team has won a $6 million NSF grant both to improve how AACR works and -- this is key to the mission -- to create a means for faculty development, with advice provided to and from faculty communities of users on what kinds of changes to make to their teaching in response to the data AACR provides.
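To make the spreadsheet step concrete, here is a minimal sketch of what reading an exported answer file might look like. The file name and column names ("student_id", "response") are my own hypothetical choices, not AACR's actual format.

```python
# A minimal sketch of the upload step: reading student responses from a
# spreadsheet export. Column names and file name are hypothetical.
import csv

def load_responses(path):
    """Return a list of (student_id, response_text) pairs from a CSV export."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rows.append((row["student_id"], row["response"].strip()))
    return rows

if __name__ == "__main__":
    responses = load_responses("biology_q1_responses.csv")
    print(f"Loaded {len(responses)} constructed responses")
```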
Also attending the presentation were James Morris, first listed author of Biology: How Life Works (http://www.macmillanhighered.com/Catalog/product/biologyhowlifeworks-firstedition-morris), and Melissa Michael, the lead assessment author on the book. Both James and Melissa concurred on the power of writing to foster learning and to reveal, better than multiple choice questions, what students know and do not. James told a story about how he sometimes uses both a multiple choice question and a short answer question on the same test, and the results differ wildly: students get more right on the MC question, but the written responses show they don't really know the subject matter, or at the very least cannot articulate it on their own. An MC question might contain language that triggers recall of lecture notes or textbook phrasing, but when students are asked to write, and thus to produce that language on their own, things fall apart.
So, as to writing: note the acronym AACR and the phrase "constructed responses." While we might say "open-ended responses," borrowing from the surveys and assessments we're used to designing, Merrill and Urban-Lurain used "constructed responses" to emphasize two things. First, students have to 'construct' a response, but the questions are not open ended the way survey prompts like "other" or "tell us more" are; instead, the questions seek specific responses tied to key concepts in the course. Second, when fully considered as a concept in learning, a constructed response might not be text only (though AACR handles text-only written responses) but can include the use of images, artifacts, data, tables, and so on.
Earlier I posted on the meeting we had with Art Graesser and his use of Latent Semantic Analysis. The AACR project doesn't use that approach. Instead, they automate the analysis of responses through a word analysis -- looking for the presence of key words, particularly nouns, in students' answers -- and matching the prevalence of those words against prior student answers. The machine is trained to score student writing and to categorize answers according to core ideas in, for this demo, biology, so that teachers can see which students are using language that correlates with understanding and which are not, and, where students are not, what they are misunderstanding. The software is built on a program called SPSS (http://www.ibm.com/software/analytics/spss/), statistical analytic and predictive software now owned by IBM. They chose it in part because it's designed for nonspecialists to use but is also robust enough for their purposes. They applied NodeXL, a program that creates associational graphs from Excel data, to produce concept clusters and association graphs: a word used a lot gets a bigger ball, and a word it frequently appears with in student answers gets a thicker line to that word's ball. (Go to http://www.nodexlgraphgallery.org/Pages/Default.aspx to see NodeXL graph images and get a sense of them.) The images in the NodeXL gallery are more complex than a graph from a single AACR question, whose data is easier to read, but John Merrill noted that even so, and even for science professors, instructors need to be taught how to read the graph, understand what it says about student learning, and then come up with a response in their teaching to address what they see.
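To illustrate the word-analysis idea, here is a rough sketch of counting key-term frequency and co-occurrence across answers, which is roughly what the NodeXL-style graphs visualize (bigger ball = more frequent word, thicker line = more frequent co-occurrence). This is my own Python sketch, not the SPSS/NodeXL pipeline the team actually uses, and the key-term list is a hypothetical one for the Jared question.

```python
# Count how often key nouns appear in student answers and how often pairs of
# them appear together: roughly the data behind the association graphs
# (node size ~ word frequency, edge thickness ~ co-occurrence).
from collections import Counter
from itertools import combinations

# Hypothetical key terms an instructor might track for "where did the weight go?"
KEY_TERMS = {"carbon", "dioxide", "co2", "sweat", "urine", "fat", "cells",
             "metabolism", "calories", "exhale"}

def term_stats(answers):
    """Return (term frequencies, pair co-occurrence counts) across answers."""
    freq, cooc = Counter(), Counter()
    for answer in answers:
        words = {w.strip(".,").lower() for w in answer.split()}
        present = sorted(words & KEY_TERMS)
        freq.update(present)
        cooc.update(combinations(present, 2))
    return freq, cooc

answers = [
    "The weight leaves as carbon dioxide when fat is broken down.",
    "He lost it through sweat and urine.",
    "Cells burn fat and exhale CO2.",
]
freq, cooc = term_stats(answers)
print(freq.most_common(5))   # node sizes
print(cooc.most_common(5))   # edge weights
```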
Here's my understanding of how things worked:
SPSS was fed WordNet, an open source dictionary funded by the National Science Foundation (http://wordnet.princeton.edu/). WordNet links and associates words not just by meaning, like a thesaurus, but by concepts and varied senses, making for a richer lexical matrix. Merrill and Urban-Lurain also added to the dictionary terms and meanings particular to biology, so that the occasional specialized language, or specialized meanings of common words (such as 'mean' in statistics), that students might use in their answers was accommodated.
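To give a sense of what WordNet adds, here is a small sketch using NLTK's WordNet interface (my choice of tool, not theirs; they fed WordNet into SPSS) to pull related lemmas for a word, so that, say, "perspiration" in a student answer can count toward the same idea as "sweat."

```python
# Sketch of WordNet-style lexical expansion via NLTK (requires the WordNet
# corpus: nltk.download("wordnet")). Not the AACR pipeline, just the idea.
from nltk.corpus import wordnet as wn

def related_lemmas(word):
    """Collect lemma names from all WordNet synsets of a word."""
    lemmas = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            lemmas.add(lemma.name().replace("_", " ").lower())
    return lemmas

print(related_lemmas("sweat"))   # includes e.g. 'perspiration', 'sudor', ...
```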
As student answers were added, SPSS produced a report that pulled and analyzed key words -- nouns -- in student answers and made a first pass at suggesting categories/concepts for those words. Merrill and his team of biology professors then fine-tuned that, correcting the categories and the words associated with them. Categories were then tied to concepts key to the course. So, for example, on where Jared's weight went, 37.3% of answers were associated with cellular explanations (metabolic rate, calorie burning, and weight leaving as CO2) and another 37% with physiological actions (sweat, urine, feces, and other means of departure). Guess which is right, or rather, where the student answers should be clustering. With those results, computed quickly and more accurately than a professor could manage by reading, categorizing, and counting answers by hand, a teacher sees at a glance whether students have the key concepts down, and with greater accuracy than an easier-to-auto-score multiple choice question provides. Very powerful information.
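Here is a simplified sketch of how that kind of category report might be computed: assign each answer to concept categories based on its vocabulary, then report the percentage of the class falling into each. The category names and keyword lists are hypothetical stand-ins for the ones the biology team built in SPSS.

```python
# Assign answers to concept categories by keyword presence and report the
# share of answers in each category. Categories and keywords are illustrative.
CATEGORIES = {
    "cellular/metabolic": {"co2", "carbon", "dioxide", "metabolism",
                           "calories", "cells", "exhale"},
    "physiological excretion": {"sweat", "urine", "feces", "perspiration"},
}

def categorize(answer):
    words = {w.strip(".,").lower() for w in answer.split()}
    return [name for name, terms in CATEGORIES.items() if words & terms]

def category_report(answers):
    counts = {name: 0 for name in CATEGORIES}
    for answer in answers:
        for name in categorize(answer):
            counts[name] += 1
    total = len(answers)
    return {name: 100.0 * n / total for name, n in counts.items()}

answers = [
    "Fat is converted to CO2 and exhaled.",
    "He sweated it out and lost it as urine.",
    "His cells burned calories from fat.",
]
print(category_report(answers))   # e.g. {'cellular/metabolic': 66.7, ...}
```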
Questions were drafted and student responses were uploaded. For the sample question we studied, 374 student responses were gathered, and two things happened: first, human readers applied a rubric and scored them; then the answers, categories, and concepts were tweaked so that SPSS could give a predictive score -- note the word predictive -- that says, essentially, based on the vocabulary we see, the software predicts a human reader, reading the full answer, would give it score X. Over time that prediction matched human scorers in the 83% or higher range for high and low scores (of the three levels the humans used) but only 43% on mid-range scores (where humans also show the widest variance).
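For a sense of what that predictive step involves, here is a hedged sketch: train a simple bag-of-words classifier on human-scored answers, then check how often its predictions agree with the human rubric score at each level. The AACR team does this inside SPSS with a far richer approach; scikit-learn and the tiny example data here are my stand-ins.

```python
# Train a bag-of-words scorer on rubric-scored answers and measure per-level
# agreement with the human scores. Illustrative data, not AACR's.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_answers = [
    "Fat leaves the body as carbon dioxide from cellular metabolism.",  # high
    "The weight is exhaled as CO2 after cells burn fat.",               # high
    "Calories are burned so the fat shrinks somehow.",                  # mid
    "His metabolism got faster so he got lighter.",                     # mid
    "He sweated the fat out.",                                          # low
    "It left as urine and feces.",                                      # low
]
train_scores = ["high", "high", "mid", "mid", "low", "low"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_answers)
model = LogisticRegression(max_iter=1000).fit(X, train_scores)

def agreement_by_level(answers, human_scores):
    """Fraction of machine predictions matching the human score, per level."""
    preds = model.predict(vectorizer.transform(answers))
    stats = {}
    for level in set(human_scores):
        pairs = [(p, h) for p, h in zip(preds, human_scores) if h == level]
        stats[level] = sum(p == h for p, h in pairs) / len(pairs)
    return stats

print(agreement_by_level(train_answers, train_scores))
```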
The labor intensity lies in question authoring and in adding vocabulary (though both would subside if more instructors used the program and contributed material, one of the goals of the NSF grant: a kind of crowdsourcing to get more questions in). The labor also comes in establishing predictive outcomes from SPSS that match the scores of normed human graders (humans trained to apply a rubric consistently, so that readers give the same, or nearly the same [it depends on the rubric], score to a given sample of student constructed response). It took 374 items scored by humans, for example, to account for the range of responses and lexical variation in student work. That's a lot of norming for one question. Multiply that by just five questions per chapter to go with a book, and one can see a tremendously labor intensive process.
But that said, consider the pedagogical benefits and outcomes possible, and the possibility of adapting the machine not only to score and report to an instructor but also to give information directly back to students -- adaptive information. Imagine a LearningCurve made of "constructed responses" instead of just multiple choice questions, and you can see where this might go.
Right now the technology is young, despite ten years of research, and the NSF grant is only six months or so (out of five years) in. So there's time to see where this goes, and maybe there are experiments the biology team can try. My own concern is for the humanities, where our textbook prices -- the point of sale for the pedagogy -- are significantly lower, and so the labor to build the questions the current methodology requires would be something we couldn't afford. But boy, they got a lot right, and it'll be cool to see where this goes and whether biology and other science books in MHE can do experiments.