A frequent theme throughout the workshop was that when test-based indicators are used to make important decisions, especially ones that affect individual teachers, administrators, or students, the results must be held to higher standards of reliability and validity than when the stakes are lower. However, drawing the line between high and low stakes is not always straightforward. Furthermore, in a particular evaluation, stakes are often different for various stakeholders, such as students, teachers, and principals.
Participants generally referred to exploratory research as a low-stakes use and school or teacher accountability as high-stakes uses. Using value-added results for school or teacher improvement, or for program evaluation, fell somewhere in between, depending on the particular circumstances. (Interestingly, the Florida merit pay program proved very unpopular after it was discovered that teachers in the most affluent schools were the ones benefiting the most; most of the participating districts turned down the additional money after its first year of implementation.) For example, as Derek Briggs pointed out, using a value-added model for program evaluation could be high stakes if the studies were part of the What Works Clearinghouse, sponsored by the U.S. Department of Education. In any case, it is important for designers of an evaluation system to first set out the standards for the properties they desire of the evaluation model and then ask whether value-added approaches satisfy them. For example, if one wants transparency to enable personnel actions to be fully defensible, a very complex value-added model may well fail to meet the requirement.
If one wants all schools in a state to be assessed using the same tests and with adjustments for background factors, value-added approaches do meet the requirement. To date, there is little relevant research in education on the incentives created by value-added evaluation systems and the effects on school culture, teacher practice, and student outcomes.
The workshop therefore addressed the issue of the possible consequences of using value-added models for high-stakes purposes by looking at high-quality studies about their use in other contexts. Ashish Jha presented a paper on the use of an adjusted status model (see footnote 4, Chapter 1) in New York State for the purpose of improving health care.
The New York Department of Health began to publicly report the performance of both hospitals and individual surgeons through its Cardiac Surgery Reporting System (CSRS). Assessments of the performance of about 31 hospitals and their surgeons, as measured by risk-adjusted mortality rates, were freely available to New York citizens. In this application, the statistical model adjusted for patient risk, in a manner similar to the way models in education adjust for student characteristics.
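To make the analogy concrete, here is a minimal sketch, using made-up patient data and hypothetical variable names, of how risk adjustment of this general kind can be computed; it illustrates the idea, not the actual CSRS methodology.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 3000

# Hypothetical patient-level data: one row per cardiac surgery patient.
patients = pd.DataFrame({
    "died": rng.binomial(1, 0.04, n),                  # in-hospital death (0/1)
    "age": rng.normal(67, 9, n),
    "ejection_fraction": rng.normal(50, 10, n),
    "prior_mi": rng.binomial(1, 0.3, n),
    "hospital": rng.choice([f"H{i:02d}" for i in range(31)], n),
})

# Patient-level risk model: probability of death given risk factors only.
risk_model = smf.logit("died ~ age + ejection_fraction + prior_mi",
                       data=patients).fit(disp=False)
patients["expected_death"] = risk_model.predict(patients)

# Risk-adjusted mortality rate per hospital: (observed / expected) * statewide rate.
statewide_rate = patients["died"].mean()
by_hospital = patients.groupby("hospital").agg(
    observed=("died", "sum"), expected=("expected_death", "sum"))
by_hospital["risk_adjusted_rate"] = (
    by_hospital["observed"] / by_hospital["expected"] * statewide_rate)
print(by_hospital.sort_values("risk_adjusted_rate"))
```

Because the expected deaths come from the patient-level model, each hospital is compared against the outcomes predicted for its own case mix, much as student background adjustments work in education models.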
Prior to the introduction of CSRS, the risk-adjusted in-hospital mortality rate for patients undergoing heart surgery was about 4 percent. Empirical evaluations of CSRS, as well as anecdotal evidence, indicate that a number of surgeons with high adjusted mortality rates stopped practicing in New York after public reporting began.
Poor-performing surgeons were four times more likely than their peers to stop practicing in New York; however, many simply moved to neighboring states. Several of the hospitals with the worst mortality rates revamped their cardiac surgery programs. This was precisely what was hoped for by the state and, from this point of view, the CSRS program was a success.
However, there were reports of unintended consequences of this intervention. Some studies indicated that surgeons were less likely to operate on sicker patients, although others contradicted this claim. Finally, one study conducted by Jha and colleagues found that the introduction of CSRS had a significant deleterious effect on access to surgery for African American patients. The proportion of African American patients dropped, presumably because surgeons perceived them as high risk and therefore were less willing to perform surgery on them.
It took almost a decade before the racial composition of patients reverted to pre-CSRS proportions. This health care example illustrates that, if value-added models are to be used in an education accountability context, with the intention of changing the behavior of teachers and administrators, one can expect both intended and unintended consequences.
The adjustment process should be clearly explained, and an incentive structure should be put into place that minimizes perverse incentives. The system is likely to be most effective if teachers believe the measure treats them fairly in the sense of holding them accountable for things that are under their control. Workshop participants noted a few ways that test-based accountability systems have had unintended consequences in the education context.
For example, Ladd described South Carolina, which experimented with a growth model (not a value-added model). It was hoped that the growth model would be more appropriate and useful than the status model that had been used previously.
The status model was regarded as faulty because the results largely reflected socioeconomic status (SES). It was found, however, that the growth model results still favored schools serving more advantaged students, which were then more likely to be eligible for rewards than schools serving low-income students and minority students.
State and school officials were concerned. In response, they created a school classification system based mainly on the average SES of the students in the schools.
Schools were then compared only with other schools in the same category, with rewards equitably distributed across categories. This was widely regarded as fair. However, one result was that schools at the boundaries had an incentive to try to get into a lower SES classification in order to increase their chances of receiving a reward. Sean Reardon pointed out a similar situation based on the use of a value-added model in San Diego (Koedel and Betts). Test scores from fourth grade students, along with their matched test scores from third and second grade, indicated that teachers were showing the greatest gains among low-performing students.
Possible explanations were that the best teachers were concentrated in the classes with students with the lowest initial skills (which was unlikely), or that there was a ceiling effect or some other consequence of test scaling, such that low-performing students were able to show much greater gains than higher-performing students.
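A small simulation can make the test-scaling explanation concrete. The numbers and the ceiling below are invented; the point is only that a hard ceiling compresses the measured gains of initially high-scoring students even when true growth is the same for everyone.

```python
# Illustrative simulation (not the San Diego data): with a test ceiling,
# students who start near the top cannot show large gains, so measured growth
# looks larger for students who start low even when true growth is identical.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_start = rng.normal(500, 100, n)      # "true" achievement in year 1
true_growth = rng.normal(40, 10, n)       # identical average true growth for everyone

ceiling = 650                              # maximum score the test can report
observed_y1 = np.minimum(true_start, ceiling)
observed_y2 = np.minimum(true_start + true_growth, ceiling)
observed_gain = observed_y2 - observed_y1

low = true_start < 400                     # initially low-performing students
high = true_start > 600                    # initially high-performing students
print(f"mean observed gain, low performers:  {observed_gain[low].mean():.1f}")
print(f"mean observed gain, high performers: {observed_gain[high].mean():.1f}")
# The low group shows close to the full 40-point gain; the high group's gain is
# compressed by the ceiling, so a value-added model run on these scores would
# favour teachers of initially low-performing students.
```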
It was difficult to determine the exact cause, but had the model been implemented for teacher pay or accountability purposes, teachers would have had an incentive to move to schools serving students with low SES, where they could achieve the greatest score gains: "If I think I am a really good teacher with this population of students, then the league [tables] make me want to move to a school where I teach that population of students, so that I rank relatively high in that league."
Adam Gamoran suggested that the jury has not reached a verdict on whether a performance-based incentive system that was intended to motivate teachers to improve would be better than the current system, which rewards teachers on the basis of experience and professional qualifications.
However, he noted that the current system also has problematic incentives: it provides incentives for all teachers, regardless of their effectiveness, to stay in teaching, because the longer they stay, the more their salary increases. After several years of teaching, teachers reach the point at which there are huge benefits for persisting and substantial costs to leaving. An alternative is a system that rewards more effective teachers and encourages less effective ones to leave. A value-added model that evaluates teachers has the potential to become part of such a system.
At the moment, such a system is problematic, in part because of the imprecision of value-added teacher estimates.
Gamoran speculated that a pay-for-performance system for teachers based on current value-added models would probably result in short-term improvement, because teachers would work harder for a bonus. He judged that the long-term effects are less clear, however, due to the imprecision of the models under some conditions. If receiving the bonus comes to look largely like a matter of chance, the system will lose its incentive power. Why bother to try hard? Why bother to seek out new strategies? Just trust to luck to get the bonus one year if not another.
Several workshop participants made the point that, even without strong, tangible rewards or sanctions for teachers or administrators, an accountability system will still induce incentives. There is the effect of competition: if a principal saw other principals receiving rewards and he or she did not get one, that tended to be enough to change behavior.
The incentives created a dramatic shift in internal norms and cultures in the workplace and achieved the desired result. Value-added models are not necessarily the best choice for all policy purposes; indeed, no single evaluation model is. Another issue is that value-added results are usually normative: Schools or teachers are characterized as performing either above or below average compared with other units in the analysis, such as teachers in the same school, district, or perhaps state.
In other words, estimates of value-added have meaning only in comparison to average estimated effectiveness. This is different from current systems of state accountability that are criterion-referenced, in which performance is described in relation to a standard set by the state such as the proficient level.
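A toy numerical contrast may help; the figures below are hypothetical and are only meant to show how a normative reading and a criterion-referenced reading of results can diverge.

```python
# Toy illustration (hypothetical numbers): value-added estimates are centred
# on the average school, while a criterion-referenced system asks whether
# students reach a fixed standard.
import numpy as np

# Estimated school effects from a value-added model are deviations from the
# average school, so they sum to (roughly) zero by construction.
value_added = np.array([+12.0, +4.0, 0.0, -4.0, -12.0])

# Mean attainment of the same five schools against a fixed proficiency cut.
mean_score = np.array([320, 345, 355, 360, 380])
proficient_cut = 350

print(value_added > 0)                 # "above average" schools: the first two
print(mean_score >= proficient_cut)    # schools meeting the standard: the last three
# Here the highest value-added school still falls short of the standard, and a
# below-average school exceeds it, so the two framings can reward different schools.
```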
Dan McCaffrey explained that if the policy goal is for all students to reach a certain acceptable level of achievement, then it may not be appropriate to reward schools that are adding great value but whose students still fall short of that level.

Value-added models clearly have many potential uses in education. At the workshop, there was little concern about using them for exploratory research or to identify teachers who might benefit most from professional development.
In fact, one participant argued that these types of low-stakes uses were needed to increase understanding about the strengths and limitations of different value-added approaches and to set the stage for their possible use for higher stakes purposes in the future. There was a great deal of concern expressed, however, about using these models alone for high-stakes decisions—such as whether a school is in need of improvement or whether a teacher deserves a bonus, tenure, or promotion—given the current state of knowledge about the accuracy of value-added estimates.
Most participants acknowledged that they would be uncomfortable basing almost any high-stakes decision on a single measure or indicator, such as a status determination. Of course, there can be disagreement as to whether this is a reasonable or appropriate goal.

Value-added methods refer to efforts to estimate the relative contributions of specific teachers, schools, or programs to student test performance.
In recent years, these methods have attracted considerable attention because of their potential applicability for educational accountability, teacher pay-for-performance systems, school and teacher improvement, program evaluation, and research. Value-added methods involve complex statistical models applied to test data of varying quality. Accordingly, there are many technical challenges to ascertaining the degree to which the output of these models provides the desired estimates.
Despite a substantial amount of research over the last decade and a half, overcoming these challenges has proven to be very difficult, and many questions remain unanswered--at a time when there is strong interest in implementing value-added models in a variety of settings. The National Research Council and the National Academy of Education held a workshop, summarized in this volume, to help identify areas of emerging consensus and areas of disagreement regarding appropriate uses of value-added methods, in an effort to provide research-based guidance to policy makers who are facing decisions about whether to proceed in this direction.
By statistical model, I mean which factors are included: prior attainment, gender, etc. If the model changes, the scores change. But there are other factors which may not be so clear cut.
Under the old CVA measure the government used to produce, there was a perverse incentive to over-identify pupils as having SEN, because it lowered benchmark scores. Then we have the question of whether to include school characteristics; where a characteristic is clearly related to pupil outcomes, it would make sense to include it in a CVA measure. You can see contextualised value added scores for your school on a range of measures, including Attainment 8, in FFT Aspire.
Want to compare Progress 8 scores for schools with similar pupil intakes? Other school-level measures, such as the percentage of disadvantaged pupils in the school cohort, could also be included. Rather than comparing pupils in a school to similar pupils in the rest of England, we would now be comparing similar pupils in similar schools.
This is an important distinction. What if, for instance, grammar schools were able to recruit more effective teachers?
By adjusting for school characteristics we would be cancelling this out. We also have the question of how factors are included. What is the nature of the relationship between each of the factors and the outcome? To get technical for a moment, do we use linear or multilevel regression? To what extent do we allow for interactions between factors?
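As a rough illustration of the linear-versus-multilevel choice, here is a sketch using statsmodels; the file name and column names (attainment8, ks2_fine_grade, fsm_ever, idaci_score, school_id) are hypothetical stand-ins, not anyone's actual model.

```python
import pandas as pd
import statsmodels.formula.api as smf

pupils = pd.read_csv("pupils.csv")   # hypothetical file: one row per pupil

formula = "attainment8 ~ ks2_fine_grade + C(gender) + C(fsm_ever) + idaci_score"

# Single-level (OLS) model: school effects taken as mean pupil residuals.
ols_fit = smf.ols(formula, data=pupils).fit()
pupils["ols_residual"] = ols_fit.resid
ols_school_effects = pupils.groupby("school_id")["ols_residual"].mean()

# Multilevel model: same fixed effects plus a random intercept per school.
mlm_fit = smf.mixedlm(formula, data=pupils, groups=pupils["school_id"]).fit()
mlm_school_effects = pd.Series(
    {school: effects.iloc[0] for school, effects in mlm_fit.random_effects.items()}
)

# The two sets of school effects are correlated but not identical; rankings,
# especially near the middle of the distribution, can shift with the choice.
print(ols_school_effects.corr(mlm_school_effects))
```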
The point here is that there are considerable researcher degrees of freedom. I can define my model one way, test its output and think it looks reasonable. Other researchers would say the same about their models.
Differences in approaches by different researchers lead to different school CVA scores, and these differences can sometimes be large given the range of school-level scores. The chart below looks at grammar schools, which tend to do rather well under Progress 8. As you can see, the CVA scores tend to be lower, although for the most part remain positive.
Adding school-level characteristics to the model tends to reduce the scores even more. There is a broadly even split of schools with positive and negative scores. This is because we are pretty much comparing grammar schools to other grammar schools, or at least other schools with high-attaining intakes. Even between the 30th and 70th percentiles, the difference in school-level scores is equivalent to three grades across the 10 slots of Attainment 8. CVA can often lead to philosophical arguments about whether it reinforces low expectations, and technical arguments about the model used.
Perhaps this is the real reason it was abandoned. Now read the other post in this pair of blogposts, which looks at how things would change if KS4 qualifications were rescored.

There are numerous methods for testing models.
We could see which works best in terms of explaining variance and minimising bias with respect to the factors. Or we could split the data in two, derive models using the first split and work out which produces the best predictions for the second split. Different researchers would also have different ideas about which tests to use.
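Here is a sketch of that split-sample check, again with hypothetical file and column names, comparing two candidate specifications by their out-of-sample prediction error.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

pupils = pd.read_csv("pupils.csv")   # hypothetical file and column names
train, test = train_test_split(pupils, test_size=0.5, random_state=42)

# Two candidate model specifications (illustrative only).
specifications = {
    "prior attainment only": ["ks2_fine_grade"],
    "plus pupil background": ["ks2_fine_grade", "fsm_ever", "idaci_score"],
}

for name, columns in specifications.items():
    model = LinearRegression().fit(train[columns], train["attainment8"])
    predictions = model.predict(test[columns])
    rmse = mean_squared_error(test["attainment8"], predictions) ** 0.5
    print(f"{name}: out-of-sample RMSE = {rmse:.2f}")
```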
It includes mean Key Stage 2 fine grade reading and maths fitted as a quartic, gender, first language, ethnic background, percentage of school career eligible for free school meals, year of first registration at a state school in England, IDACI score of the area where a pupil lives, month of birth, SEN status in Year 6, and whether the pupil was admitted at a non-standard time.
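As a sketch only, a specification along those lines might be written as a regression formula like the one below; the variable names are hypothetical, and the interactions mentioned in the next sentence are omitted.

```python
import statsmodels.formula.api as smf

# Hypothetical column names; the real dataset's variable names and coding will differ.
cva_formula = (
    "attainment8 ~ "
    # KS2 fine grades in reading and maths, each fitted as a quartic
    "ks2_reading + I(ks2_reading**2) + I(ks2_reading**3) + I(ks2_reading**4) + "
    "ks2_maths + I(ks2_maths**2) + I(ks2_maths**3) + I(ks2_maths**4) + "
    # pupil background characteristics
    "C(gender) + C(first_language) + C(ethnic_background) + pct_career_fsm + "
    "C(year_first_registered) + idaci_score + C(month_of_birth) + "
    "C(sen_status_y6) + C(non_standard_admission)"
)

# model = smf.ols(cva_formula, data=pupils).fit()   # pupils: hypothetical DataFrame
```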
It also includes some interactions.

Each school produces a data point on the chart. Schools above the regression line are getting better-than-average GCSE outcomes for Cumbria in relation to their intake CATs scores, and schools below the line are performing more poorly.
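A minimal sketch of how a chart like that can be produced, with a hypothetical file and column names: regress school-level GCSE outcomes on intake CAT scores and look at which schools sit above or below the fitted line.

```python
# Each school's residual from the fitted line is its "above/below expectation"
# position: positive residuals sit above the line, negative residuals below it.
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

schools = pd.read_csv("cumbria_schools.csv")   # hypothetical file: one row per school
fit = smf.ols("mean_gcse_points ~ mean_cat_score", data=schools).fit()
schools["residual"] = fit.resid                 # above the line: positive residual

plt.scatter(schools["mean_cat_score"], schools["mean_gcse_points"])
plt.plot(schools["mean_cat_score"], fit.fittedvalues, color="black")
plt.xlabel("Mean CAT score (intake)")
plt.ylabel("Mean GCSE points")
plt.show()
```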