A Word on Using Test Scores to Evaluate Teachers
As you’re probably aware, Tennessee is rolling out [pdf] its new evaluation system this year, along with new tenure rules based on that system. I’ve written about the use of value-added scores and data in school decisions (and not just in tenure and firing decisions, either — restricting the data to those uses is a giant mistake and a misuse of valuable data), and about how we should evaluate teachers who lack test data. It’s worth continuing the conversation on this topic, though, because there was a very interesting (and provocative) piece written in Scientific American a while back that I’ve been meaning to get to. Now, don’t get all offended, but the author compares using test scores to evaluate and fire teachers to eugenics.
I know. Eugenics. Uncomfortable.
Still, though, I think the article makes some incredibly interesting points, even if it paints with a broad brush. It’s titled “Deselection of the Bottom 8%: Lessons from Eugenics for Modern School Reform.” Here’s the part I want to talk about:
I do not wish to impugn the statistical techniques themselves, or doubt progress in measuring what we aim to measure. However, in each moment, a refinement of the science of testing has been mistaken for readiness to apply to public policy and specific individual cases. A strong general relationship between conveniently measurable variables becomes riddled with errors when applied to individual personnel decisions. As these tools leave the lab (or the economist’s model) and enter policy reality, the uncertainty magnifies the bias and corruption that science is supposed to prevent. Whether using early IQ tests to reject immigrants at Ellis Island, or using Value-Added Measures (VAM) scores to fire or reward teachers, policymakers, convinced they are using the latest microscope, are later seen holding a distorted mirror.
There is a very important point here, that cheerleaders of TVAAS and other such value-added data systems overlook: Broad data systems like these are great at identifying state-wide, system-wide, district-wide, and maybe even school-wide trends and concerns, but they’re often crap when it comes to individual teachers/students. Not always, but certainly often enough to give us pause. From one of my earlier posts:
TVAAS isn’t perfect. It’s a good data system, but it has flaws. Studies have shown [pdf] that teachers’ ratings can fluctuate wildly, even for teachers presumed to be good ones (“Only a few studies have explored the stability of teacher effects (Ballou, 2005; Aaronson et al., 2007; Koedel and Betts, 2007). Such studies find that teacher effects are quite unstable.” (p. 9)).
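You can see why this instability happens without any reference to a particular system. Here’s a toy simulation (my own illustration, not the TVAAS model — the teacher counts and noise levels are assumptions for the sake of the sketch): each teacher has a stable “true” effect, but each year’s value-added estimate adds sampling noise from a finite class of students. When the noise is comparable in size to the real differences between teachers, a large share of those flagged in the bottom 20% one year won’t be flagged the next.

```python
# Toy simulation of year-to-year instability in value-added-style estimates.
# All parameters here are illustrative assumptions, not features of any real system.
import random

random.seed(42)

N_TEACHERS = 1000
SIGNAL = 1.0  # assumed spread of true teacher effects (std dev)
NOISE = 1.0   # assumed estimation noise in any single year (std dev)

# Each teacher's stable underlying effect
true_effect = [random.gauss(0, SIGNAL) for _ in range(N_TEACHERS)]

def yearly_estimate():
    # One year's estimate = true effect plus that year's sampling noise
    return [t + random.gauss(0, NOISE) for t in true_effect]

def bottom_quintile(scores):
    # Indices of the lowest-scoring 20% of teachers
    cutoff = sorted(scores)[N_TEACHERS // 5]
    return {i for i, s in enumerate(scores) if s < cutoff}

year1 = yearly_estimate()
year2 = yearly_estimate()

flagged1 = bottom_quintile(year1)
flagged2 = bottom_quintile(year2)

overlap = len(flagged1 & flagged2) / len(flagged1)
print(f"Share of bottom-20% teachers flagged in both years: {overlap:.0%}")
```

With noise as large as the true signal, only around half of the “worst” teachers in year one stay in the bottom quintile in year two, even though every teacher’s real effectiveness never changed. The system-wide averages, meanwhile, are quite stable, which is exactly the aggregate-versus-individual gap the article describes.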
That’s why I’m somewhat hesitant to make firing decisions with that data, especially right now, when TVAAS has demonstrable problems (and is controlled under contract by a secretive, private company to boot). From yet another earlier post:
I hesitate, lest I be labeled repetitious, but Rule #1 is: Data should not be used in a punitive manner.
Some of the more market-oriented educational researchers out there would disagree (I think). These are the folks clamoring for more teacher firings, for relaxing the regulations and red tape required to become a teacher, and for other market-driven reforms that would ease both entry into and exit from the profession. Many believe (and I don’t think I’m creating a straw man here) that we can hire and fire our way to success: with enough supply out there, we can keep weeding out the bad teachers and hiring new ones until, eventually, we keep the good teachers and sort out the chaff. This assumes, however, that the teacher pipeline is effectively infinite and that there’s little room for teachers to improve with experience. Both of those assumptions fly in the face of what I know about the profession: 1) even if it becomes easier to enter teaching (and I do support alternative licensure and the reform of traditional teacher preparation), there aren’t enough people out there, at current salary and benefit levels, to fill the void left by all the mediocre-to-bad teachers we would have to fire, and 2) experience matters: teachers can get better if you give them the support and opportunities they need.
I still support the idea that student performance needs to be a part of evaluating teachers, that teacher development, planning, and support needs to be strongly informed by student performance, and that pay needs to be linked to, in some way (though not wholly determined by), student performance. It’s simply that we need to be reflective on how we use data, and understand its limitations.
Overconfidence and a lack of self-reflection can kill any good thing. To take one example, look at subprime lending. It started as a way to allow more lower- and middle-income folks to buy homes. Then it got out of control, people stopped being rigorous and stopped examining their behavior and assumptions, and it ended up decimating the entire world economy. Let’s not let the same thing happen with the use of data in our schools.
P.S. For great resources on subprime lending and the economic crisis, I always turn to NPR’s Planet Money and This American Life (especially these three episodes), as well as the books All The Devils Are Here (by Bethany McLean and Joe Nocera) and The Big Short (by Michael Lewis, of Moneyball fame), both of which are excellent reads.
Note: The image above comes from this blog, written by David B. Cohen, and is a great read on using value-added data to evaluate teachers.