Caution! Using high stakes testing of student learning in development

This is the first of a two part series on high stakes testing. The second part can be found here.

The Center for Global Development published a Policy Paper in September by Amanda Beatty and Lant Pritchett titled From Schooling Goals to Learning Goals: How Fast Can Student Learning Improve?

This is a welcome contribution to the discussion about what might replace the Millennium Development Goals post-2015. The Paper is welcome because it foregrounds learning, moving away from schooling goals, such as enrolment and completion, and towards learning goals.

Beatty and Pritchett show that the pace of progress for developing countries in international assessments of student learning is very slow. They demonstrate how it would take ‘forever’ for some developing countries to reach current OECD levels. The data they use in their analysis are from the assessment of fourth and eighth graders in the Trends in International Mathematics and Science Study (TIMSS); the Programme for International Student Assessment (PISA), a study of 15-year-olds’ performance on mathematics, science, and reading; and the Southern and Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ) for mathematics and reading.

Some might see the use of these tests as the perfect policy mechanism in development because, on the one hand, they are thought to ‘effect’ or motivate change in the system and, on the other, are supposed to ‘detect’ whether changes in the system have actually occurred. Seen in this way, could any way be better for development policy makers intent on improving the quality of education? The short answer is ‘yes’. One better way may be to avoid such tests for these purposes!

This answer demands an explanation. Beatty and Pritchett skip lightly over the nature of the indicators of learning they have used in their analysis and ignore the negative consequences of their use in development work. In my view, this is where we must proceed with great caution. To be fair, the authors insist that they are not wedded to the metrics of these tests and they encourage the use of country-specific goals and tests. Nevertheless, they go on to ‘encourage’ countries to participate in these tests in order to facilitate comprehensive international comparisons (p.19). Here, the need for indicators threatens to trump the consequences – that can be very serious!

The authors discuss two of their concerns with using TIMSS, PISA, SACMEQ, or similar tests for international comparisons. First, these tests do not address age-group cohorts of children and are based only on sampling data of enrolled children. They argue that cohort-based learning goals should apply to all children, in or out of school. This seems sensible but will be very difficult to achieve in practice. Second, the tests only assess children in later grades whereas there are many stages from commencement to completion of compulsory education and many goals in these different stages.

These are reasonable technical concerns, but addressing them will have serious human consequences. Both concerns imply more testing of more children more often. Anyone who seriously believes this is a good educational idea should read either or both of: The Paradoxes of High Stakes Testing: How they affect students, their parents, teachers, principals, schools and society, by G. Madaus, M. Russell and J. Higgins, Charlotte, NC, Information Age Publishing, 2009 (available here) and Collateral Damage: How high stakes testing corrupts America’s schools, by S.L. Nichols and D.C. Berliner, Cambridge, Harvard Education Press, 2007 (here). These books are not popular polemics but serious, research-based texts written by some of the most respected scholars in the field of educational testing.

Madaus and his colleagues conclude in Chapter 8 that high stakes testing produces “chronic unintended negative consequences” – narrowing the curriculum, decreasing attention to non-tested subjects, corrupting test results, cheating, retaining students in grade, and increasing drop-out rates, among others. These authors consider ways in which technology might be applied to advance testing and argue for rigorous, independent monitoring of high stakes testing to ensure that benefits outweigh the harms. Nichols and Berliner go much further, concluding that they have no doubt these tests corrupt educators and harm schools. They express the hope that others will join them in demanding a moratorium on high stakes testing (p.202). Together, these books eviscerate the proposition that schools and systems can be fairly evaluated with current approaches. They provide damning evidence of the negative consequences for children, teachers and schools of current policies and the current state of knowledge that support high stakes testing.

I have two more concerns about these kinds of tests: the comparability of the indicators between countries, and the negative consequences that may flow from their use by developing countries.

The technical research literature on testing is raising doubts about the assumption that test performance is independent of the language of questioning. It suggests that linguistic, geographic, socio-economic and cultural factors have an impact that makes meaningful comparisons between countries problematic. Further, if the relationship between test performance, home, and social class factors in the USA alone calls into question the use of test scores to judge school quality (here, page 62), then how valid can these kinds of tests be for comparative purposes between diverse countries and cultures to assess development progress?

My second concern is with the consequences for the whole educational enterprise of using these kinds of tests. For countries, these are ‘high stakes’ tests. Doing well on them and maintaining a consistent standard of performance over time is a matter of national prestige that has major political consequence – the Australian government’s policy with NAPLAN (the National Assessment Program Literacy and Numeracy) and the My School web site is just one example.

Donald Campbell, an eminent American social scientist noted for his work in quantitative methodology, proposed in the 1970s what is now known as “Campbell’s law”. This law has two parts, one of which is concerned with the validity of the test indicators we use, and another that is concerned with the effects on organisations and people that work with indicators of high stakes value. Campbell’s law, as discussed here on page 4, states: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

In other words, the more we use high-stakes tests to assess students, teachers, schools, and systems, the more the corruptions and distortions that inevitably appear compromise the construct validity of the test and make scores uninterpretable. It is not difficult to imagine the flow-on effects of high stakes testing in developing countries already fighting the scourge of corruption in their education systems. The Special Issue of the journal Assessment in Education: Principles, Policy and Practice, Volume 19, Number 1, 2012 contains a review of the consequences of high stakes testing in developing countries as well as in Australia.

Is it too radical to suggest such tests should be done away with altogether? Perhaps. But, in the real world, calls for data and pressures for testing will persist and educationists and psychometricians will have to address these. Equally, those who like the idea of international comparative testing have to address concerns with the cross-cultural validity of indicators, the consequences of the testing they have in mind, and how these consequences will be managed. With our current state of knowledge, it is dangerous to think that the use of these high stakes tests can have a constructive role in the development of education. Just as Beatty and Pritchett point out that setting overambitious learning goals may be counterproductive, the evidence is accumulating that the use of high stakes indicators will be counterproductive.

In a subsequent blog, future directions on these issues will be discussed.

Robert Cannon is an Associate of the Development Policy Centre and is presently working as an evaluation specialist with the USAID funded PRIORITAS Project in Indonesian education.

Robert Cannon

Robert Cannon is a research associate with the Development Policy Centre. He has worked in educational development in university, technical and school education, most recently in Indonesia and Palestine.


  • I believe that proposed strategies for development and accounting for development progress are matters that should be subject to analysis, debate and, where available, evidence. By the tone of your response, Lant, I sense that you may not agree with this. I think this is unfortunate. It inhibits considered progress towards good development strategies for good outcomes. This is why I noted that your Paper is a welcome contribution because it foregrounds the move away from schooling goals towards learning goals, which has the potential for good outcomes. Nevertheless, there are risks when we begin to consider ways of measuring these outcomes. And this is why I argued for caution.

    You identify three key points in your comment. As I understand your first point, you say that I “cannot coherently applaud the shift of focus from “schooling” … to “learning” and then claim we can do that without ever measuring learning.” I did not assert that we should never measure learning. I acknowledged, for example, that you encouraged the use of country-specific goals and tests, a strategy that I believe is appropriate, if done well. Rather, I said that we should exercise great caution in the wider use of tests such as TIMSS and PISA for comprehensive international comparisons of learning. Further, we need to be alert to the potentially negative consequences of such types of tests for developing countries. This caution is not based entirely on them becoming high stakes for the systems but also on concerns about their cross-cultural validity.

    Secondly, you point out that “the debate about ‘high stakes’ in countries like India … is absurd because there are massively high stakes for the student tests at grades 10 and 12 and university entrance examinations already.” If you really mean debating high stakes testing is absurd, I find this troubling; existing high stakes are all the more reason to consider the stakes and figure out how to manage them. Another of your points is “that [the Indian] system itself (and schools and teachers) avoids any measurement of its performance at all … So introducing learning goals would not increase the stakes for students–that is there already–it would just create some tracking of overall performance on goals the society claims to care about.” I agree with you that tracking is necessary, but this is where the caution I argue for is also necessary, so that the stakes from tracking are understood beforehand and carefully managed and their consequences do not filter down to schools and children with even more negative effects. I fail to see why exercising caution is problematic, particularly in the kind of environment you describe. To do otherwise would be irresponsible.

    In your third point you denigrate the work of large numbers of dedicated teachers and I wonder why you do this. Yes, I am sure there are some teachers in Africa, and elsewhere in the developed world as well, who are absent too often or only work 29 minutes a day. But there are very many teachers who are, as you sardonically describe, “wonderful, intrinsically motivated teachers and doing their best at the complex task called teaching”. Where is the evidence to support your assertion about the typical Indian teacher’s commitment and that the Indian education “system itself (and schools and teachers) avoids any measurement of its performance at all”?

    I will go on “assuming” (to use your term, Lant) that the risks of high stakes testing will apply in developing countries just as they do in the developed, until proven wrong. I repeat that I believe the move to learning goals you argue for in your Paper is a positive direction to take. But I also repeat that we must be cautious about the risks that can arise when testing attainment of these goals becomes high stakes within developing countries with all the known and potentially negative consequences that will be counterproductive to good development outcomes.

  • There is some question about when to cry wolf. Clearly if there is a wolf. But what if it is dusky and I cannot really tell if it is a wolf or an Alaskan Husky? Probably better to play it safe and cry wolf.

    On the other hand, crying wolf at everything that is canine leads to lots of hub-bub and confusion. I own a Bichon-Frise who weighs about 12 pounds and is white and cuddly and looks a lot more like a sheep than a wolf (for reasons I don’t understand she has her own Facebook page at Jaya Dog, go see). Crying wolf at a Bichon-Frise just makes you seem hysterical and a little silly.

    Our paper never proposes high stakes testing, never uses the words “high stakes” and cannot, in my view, be reasonably construed as proposing tests for high stakes purposes for students, teachers or schools. Debates about “high stakes” should be reserved for when that is actually on the table and not when it is never mentioned. The author’s current attitude seems very Victorian: the current paper isn’t proposing sex but it does say things that could lead people to think about other things that might eventually lead them to think about sex, and so the paper is prurient.

    On a less facetious note, three points.

    First, in developing countries there is “high stakes” measurement–it is just that measurement is about enrollments and inputs. So the debate is not about “high stakes” or not, it is about what the “high stakes” measures that drive organizations should be. I don’t think you can coherently applaud the shift of focus from “schooling” (where there are scads of measures that drive policy making) to “learning” and then claim we can do that without ever measuring learning. This is like saying I am going to drive from New York to Kansas but never look at where I am. Odds of doing that seem pretty slim.

    Second, the debate about “high stakes” in countries like India (where I now live) is absurd because there are massively high stakes for the student tests at grades 10 and 12 and university entrance examinations already. So right now there are examinations that are crushingly high stakes for students but the system itself (and schools and teachers) avoids any measurement of its performance at all and so at times produces tragically awful schooling–especially for the poorest–with no consequences at all. So introducing learning goals would not increase the stakes for students–that is there already–it would just create some tracking of overall performance on goals the society claims to care about.

    Third, I do not believe–for many of the reasons the blog author cites–in “high stakes” or “thin accountability” metrics. That said, many of the arguments from rich country denizens about “high stakes” are that it distracts teachers from doing more worthwhile things. But in poorly performing countries the issue is not that these wonderful, intrinsically motivated teachers and doing their best at the complex task called teaching and evil green-shade guys would sully their cherished occupation with mind numbing, task-narrowing, soul-shrinking, numbers. In India it is well documented that a typical teacher is in the classroom engaged in instructional activity less than half the hours they are paid to do so. I just heard today that a recent survey in Africa (rural Uganda I believe) found teacher instructional time was 29 minutes a day. So instructional activity of any type would be a gain. In countries at high levels of performance, like Australia or the USA or Finland, I can see being very worried about high stakes testing because the system is already functioning pretty well and some combination of intrinsic motivation and thick accountability (internal and external) is working reasonably well. But this is not the problem in many countries and about them we should not assume that the same risks are present.
