The shocking truth about randomised control trials exposed!

A chlorine dispenser installed as part of a randomised control trial in Kenya and Uganda (Jonathan Kalan/DIVatUSAID/Flickr/CC BY-NC-ND 2.0)

Development debates are frequently fierce and rarely resolved. Often this makes sense: many disputes are ideologically charged, evidence is unclear, and people’s lives are at stake.

In other instances, the source of the sound and fury is hard to fathom. Randomised control trials (RCTs) are a case in point. Some eminent development thinkers proclaim their virtues, insisting they are the final word in evidence; others decry them in treatises.

I’m here to tell you both sides of this fight are wrong. Like much else in development, RCTs are remarkable, but also flawed. Here’s what you need to know.

What’s an RCT?

To simplify, RCTs involve taking a treatment and randomly giving it to some people but not others, subsequently comparing treated and non-treated groups. RCTs are best known as a way to test medicines, but in development a whole range of treatments are possible, from the effects of free bed nets to the impact of civic education on voters.

Crucially, treatments are randomly allocated. If samples are large enough, random allocation means the group of people who get the treatment will be very similar in other aspects of their lives to those who don’t. Because of this, any subsequent differences between the two groups will very likely have been caused by the treatment alone. Randomisation does not have to occur between individual people. Treatments can, for example, be randomly allocated to entire villages.
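For readers who like to see the logic in action, here is a minimal simulated sketch of random allocation. Everything in it is an illustrative assumption rather than data from a real trial: the function name `simulate_rct`, the made-up baseline characteristic, the noise levels, and the "true" treatment effect of 2.0. It only shows that, with a large sample, the treated and untreated groups start out alike, so the difference in their outcomes lands close to the true effect.

```python
import random
import statistics


def simulate_rct(n=20_000, true_effect=2.0, seed=0):
    """Simulate a two-group RCT with random allocation between individuals.

    All numbers are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    # A pre-existing characteristic (think household income) that also
    # influences the outcome we care about.
    baseline = [rng.gauss(50.0, 10.0) for _ in range(n)]

    # Randomly allocate exactly half of the sample to the treatment.
    ids = list(range(n))
    rng.shuffle(ids)
    treated = set(ids[: n // 2])

    # Outcome = baseline + treatment effect (treated only) + random noise.
    outcome = [
        baseline[i] + (true_effect if i in treated else 0.0) + rng.gauss(0.0, 5.0)
        for i in range(n)
    ]

    treated_ids = [i for i in range(n) if i in treated]
    control_ids = [i for i in range(n) if i not in treated]

    # How different were the two groups before treatment?
    baseline_gap = (
        statistics.mean(baseline[i] for i in treated_ids)
        - statistics.mean(baseline[i] for i in control_ids)
    )
    # Simple difference in mean outcomes between the two groups.
    estimated_effect = (
        statistics.mean(outcome[i] for i in treated_ids)
        - statistics.mean(outcome[i] for i in control_ids)
    )
    return {"baseline_gap": baseline_gap, "estimated_effect": estimated_effect}


if __name__ == "__main__":
    print(simulate_rct())
```

Shrinking `n` to a few dozen makes both numbers bounce around from seed to seed, which is the "if samples are large enough" caveat in practice.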

The practice of running RCTs can be more complicated, but these basics are all you need for the sake of this blog.

In aid work, RCTs can be used for two reasons: to evaluate a single aid project (an evaluation); and/or to contribute to more generalised learning (research). In the rest of this post I’ll make it clear when a particular strength or weakness matters for evaluation but not research and vice versa. If I don’t do this, assume that it’s relevant to both.

What’s so good about RCTs?

One benefit of RCTs is so prosaic it’s often forgotten. RCTs require good data. This entails making an effort to gather data well, usually beginning before the start of the actual aid project. In many evaluations evidence is an afterthought, only worried about once it’s too late. You don’t need to run an RCT to get better data but, if nothing else, running an RCT will almost certainly make you think about evidence while you can still do something about it.

Another strength of RCTs is that their findings are usually unambiguous. Compared to other approaches, particularly subjective approaches such as interviewing people who may have a vested interest in a certain outcome, the findings of RCTs are helpfully clear cut. This matters in aid, where people are often reluctant to abandon a favoured project unless the evidence is overwhelming, or where projects get cut, not because they’ve failed, but because there’s no good evidence they’ve succeeded. Clarity helps.

Findings from RCTs also eliminate other problems. If, for example, you simply compare outcomes before and after an aid project, and find an improvement, how can you be sure that the improvement was actually a result of aid, not the continuation of some pre-existing trend? RCTs address this.
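The pre-existing trend problem can be illustrated with another small simulation. This is a sketch under invented assumptions, not an analysis of any real project: the function name `compare_designs`, the background improvement of 3.0 that everyone experiences, and the true treatment effect of 1.0 are all hypothetical. The before-and-after comparison credits the project with the trend as well as its own effect, while the comparison against a randomly chosen control group does not.

```python
import random
import statistics


def compare_designs(n=20_000, trend=3.0, true_effect=1.0, seed=1):
    """Contrast a before/after comparison with an RCT-style comparison.

    All values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    # Outcome for each person before the project starts.
    before = [rng.gauss(20.0, 4.0) for _ in range(n)]
    # Random allocation: each person has a 50% chance of being treated.
    treated = [rng.random() < 0.5 for _ in range(n)]
    # Everyone improves by `trend`; only treated people also gain `true_effect`.
    after = [
        b + trend + (true_effect if t else 0.0) + rng.gauss(0.0, 2.0)
        for b, t in zip(before, treated)
    ]

    treated_before = statistics.mean(b for b, t in zip(before, treated) if t)
    treated_after = statistics.mean(a for a, t in zip(after, treated) if t)
    control_after = statistics.mean(a for a, t in zip(after, treated) if not t)

    return {
        # Mixes the background trend with the treatment effect.
        "before_after_estimate": treated_after - treated_before,
        # The control group experienced the same trend, so it cancels out.
        "rct_estimate": treated_after - control_after,
    }


if __name__ == "__main__":
    print(compare_designs())
```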

What’s wrong with RCTs?

Despite these strengths RCTs are not universally popular.

At times the complaints are spurious. The worst of the arguments against RCTs is the claim that RCTs are unethical. If you have very good evidential grounds for believing a certain treatment will work and if you have enough money to give it to everyone, denying the treatment to 50% of the relevant population just so you have a control group would be unethical. However, this almost never occurs in aid work. Usually, we don’t know what works, and often we don’t have money to treat everyone. In this case, RCTs are actually more ethical than the alternative: non-randomly allocating treatment and/or not gathering data, and not learning, thereby losing the opportunity to improve.

RCTs do have real limitations though.

Findings from individual RCTs may not be transferable. (Technically, their external validity is not guaranteed.) This is not usually an issue for evaluations, but it is an issue for research. Context matters a lot in development. Just because something works in Northern India doesn’t mean it will work in Papua New Guinea. It’s possible to combine the results of many RCTs in a way that increases the transferability of findings. But even then, it still isn’t guaranteed that something that works in many countries will work in Papua New Guinea.

Also, many development projects cannot be evaluated with RCTs. RCTs can’t usually be run on attempts to build the capacity of a country’s ministry of education, or on a national policy change, or on a large infrastructure project.

Moreover, on their own, RCTs only reveal how much impact a treatment had; they don’t reveal how it worked, or why it failed to work.

Finally, RCTs are expensive.

These are real objections. But they aren’t fatal. The first objection is merely a reminder that you need to be wary of context. The second and third objections show that RCTs aren’t the be all and end all: other research and evaluation methods are essential, sometimes as complements for RCTs, sometimes as substitutes.

And cost is really a question of priorities: is an RCT appropriate? What’s the current uncertainty? What’s the value of learning? Often the price of an RCT will be well worth paying. For NGOs, the binding constraint of cost could also be eased by government donors providing contestable funds for RCTs.

RCTs won’t answer all the questions that matter in development work, but they can answer some better than other approaches. Use them carefully when they will help. Use something else when they won’t.

Many development debates play out over decades. Some deserve to. Others wouldn’t be debates at all if we could just tolerate some nuance. The debate around RCTs falls into the latter category. The shocking truth about RCTs is that they are useful, when used appropriately.

Terence Wood

Terence Wood is a Fellow at the Development Policy Centre. His research focuses on political governance in Western Melanesia, and Australian and New Zealand aid.

17 Comments

  • Hi Terence. I came to this article via Google, and so a little late to the party.

    Something I notice about the RCT debate and the issue of non-transferability is that non-transferability is almost always taken to mean geographic / cultural transferability. But there is also temporal transferability: societies change over time, and the findings of an RCT from last year may not be applicable in the same place this year. An example of this is the current Samoan measles crisis, due in part to a medical error in vaccination made by two nurses (later jailed for negligence), which radically altered public attitudes towards vaccination.

    It seems almost certain, to me, that the applicability of any RCT on incentivising vaccination conducted before the medical error would be highly dubious after the error. The change in population perception between before and after might in fact be greater than any geographic / cultural difference between Samoa and, say, Afghanistan.

    • Hi David,
      Thanks for your comment. I agree.
      It’s worth noting, however, that this fact is true of any evaluation method, and — for that matter — any research method in social sciences.
      Thanks again for your comment.
      Terence

  • Thanks for bringing this to our attention. I think another thing to consider is that some RCTs do fail.

    One reason is insufficient buy-in from stakeholders. Given that RCTs in social sciences, generally (although not always), take at least one to two years to complete, it is important to ensure buy-in from all stakeholders. One RCT we attempted failed because a new team leader didn’t believe in RCTs, while another failed because we weren’t able to sufficiently incentivise people in the control and treatment groups to participate in follow-up interviews.

    A lack of buy-in at implementation level has also led to contamination. In one RCT we were undertaking, a certain government official believed anyone who wanted the intervention should be offered it. While we wanted to know if it worked, the government official believed it did work. Afterwards, when I asked why he went against our protocol, he stated that he would be doing a disservice to his people by not providing the intervention. This clearly relates to the ethical argument against RCTs and I certainly understood his perspective.

    There are many other reasons for failure – eg: insufficient statistical power (possibly due to small sample sizes), insufficient resources, changing priorities – to name a few.
    That all being said, I am certainly not anti-RCT. I would say I’m pro-RCT, just as I’m pro-Quasi-Experiment or pro-Participatory Evaluation … it all depends on what is most appropriate under the circumstances (which can also be subjective!).

    Thanks again for your post.

    • Thanks Dinuk,

      That’s a really good comment. It’s important to be aware that RCTs — like everything in development, and all research methods — can fail. Thanks for highlighting this. If RCTs are to be done they should be done well, and used only when appropriate. As you say, other methods may be more appropriate in particular circumstances.

      One point I’ll make in favour of RCTs is that their problems are somewhat more transparent than those of some other methods. For example, it’s easy to see if an RCT is under-powered. And some other analytical flaws are easy enough to spot. For me at least, the ways that a complex regression using panel data can be tweaked are basically opaque. Similarly, it may be impossible to tell whose voices are excluded from a qualitative piece of work and why. Or what community dynamics might skew a participatory evaluation.

      I don’t want to make this a binary. I’m not claiming RCTs are perfectly transparent and all other methods completely opaque. Obviously, this isn’t the case. Much can still be insufficiently illuminated in an RCT’s methods section. But as I thought about your comment over the weekend, I decided it was still fair to say that RCTs have something of a methodological visibility advantage over at least some of the other methods in the tool kit.

      None of this changes my agreement with your main point though: they can still fail.

      Terence

  • Thanks Terence, there is certainly some heat in this debate!

    I suggest that one of the issues underlying it is that of the unity of science – that scientific method applies equally to the study of the material and social world. I am persuaded by the debate that says we can only ever argue about this because neither side can position itself outside of discourse and proclaim ‘truth’ from a position of ‘nowhere’. We are all shaped by history, society, and culture, which in turn influences the questions we choose to ask as well as the answers we hear.

    That does not make me anti-science but rather concerned that methodology is selected according to the questions we are trying to answer. Testing new drugs may demand RCTs but will never provide what Clifford Geertz calls the ‘thick description’ critical to understanding our social and cultural worlds.

    Whatever the methodology selected, ethical research or evaluation demands informed consent. If anyone needs reminding about the importance of informed consent go back to the Cartwright Report. The gains from that were hard won and need to be protected and extended.

    As the NYT article shows, random selection is critical to RCTs and the wellness programme was fortunate in getting 500 volunteers from which participants and a control group could be selected. What’s the equivalent process in an international development context, in which donors have the cash and communities want and need support?

    Let’s take the bed-net evaluation as an example – it is possible to do an RCT but the question is why would you go to the trouble and expense? It is well known that mosquitos spread malaria so what questions would you need an RCT to answer? How would the participants be selected – would it be open to all community members in the first instance? Would the control group be offered treatment should they contract malaria during the trial? They are after all part of an experiment.

    An alternative is to identify participants using the ‘snow-ball’ method and, having gained informed consent, interview until you’ve reached ‘saturation’ point about the factors that limit access and use and those that facilitate it. A well-designed study couples this with other data collection methods. The approach does not make claims about cause and effect but rather seeks a deeper understanding of what people say they do and why (users and non-users) and makes recommendations on that understanding.

    It seems to me that participation and partnership demand approaches that involve participants in research and evaluation and that the decision to use RCTs, given how resource intensive they are, should be reserved to answer those questions that can best be answered by experiment and not because of an ideological position about the unity of science.

    • Hi Suzanne,

      Thanks for your comment. Good to hear from you again.

      For what it’s worth, I think the scientific method applies both to the material and human world, for the simple reason that we’re part of the material world. Having said that, the scientific method becomes more and more complex the further we move away from physics, simply because the number of variables and their potential interactions become much more numerous.

      And, as you say, our culture and beliefs constrain and shape the views we hold. This doesn’t mean that all views are equally right though; it just makes the task of learning harder still. (I should add that even physicists suffer some of these problems.)

      Moreover, these problems don’t mean that RCTs won’t work. Indeed, one of their strengths is the robustness of their method and the clear results it provides.

      Having said that, I don’t think RCTs can answer all the important questions. And I agree that there is an important role for qualitative research.

      On the specific case of the bed nets: I should stress that was a hypothetical example. (RCTs may have been used in their study, but I don’t know.)

      We know for obvious reasons that bed nets stop mosquitoes. Less clear, at least less clear a decade ago, were questions such as:
      1. Will people use them as required? (answer: yes, to the best of my knowledge)
      2. Did charging a small price for bed nets reduce their use? (answer: yes)
      3. Are insecticide treated bed nets more effective? (answer: yes)

      To the best of my limited knowledge many of these questions were answered without RCTs, but they were answered with empirical social science.

      Thanks for your comments.

      Terence

  • Nicely balanced blog Terence. I just wanted to clarify a couple of the ‘limitations’ you mentioned of RCTs.

    Cost of RCTs – RCTs can be done for free. This can occur when data is already going to be collected. For example, governments regularly collect data about things like who has paid their taxes. Here is an example from Latin America of an RCT that uses government administrative data: https://www.povertyactionlab.org/evaluation/role-vat-tax-enforcement

    Programs RCTs can be used on – RCTs can be used to evaluate both micro and macro (‘universal’) level policy changes (even large scale infrastructure). This can be achieved using a nudge to encourage people to take up a new service or product. Here is an example from Africa of an RCT that evaluates the impact of a large infrastructure project: https://www.povertyactionlab.org/evaluation/household-water-connections-tangier-morocco

    How programs work – RCTs can be used to determine why/how a program works. This can be achieved by having multiple treatment groups whereby some people are provided with a sub-component of the program and other people receive the whole program. If there is no difference between the impact of the sub-component and the entire program this indicates that it is just the sub-component that is making a difference. Here is an example of an RCT in South Asia that used this approach: https://www.povertyactionlab.org/evaluation/improving-immunization-rates-through-regular-camps-and-incentives-india

    • Thanks Chris,

      That’s a great comment. I agree with all of your points, but would note:

      At times you will have to collect your own data. This is unavoidable, particularly if you want quality data. That said, I think the cost is a small price to pay for learning.

      I agree, RCTs can help with learning how a project works too. I’d just add that there may be limits and that a great alternative would be to combine RCTs with process tracing as Oxfam Great Britain does. Quant and qual methods can be friends 🙂

      Thanks again.

      Terence

  • Hi Terence, I’m in agreement for the most part – RCTs have an important place.
    I’m also of the view that RCTs can be highly unethical – Eileen has given a good example, but there are many others from rural development that could be used. Data collection in RCTs also commonly involves taking the time of control groups that have received nothing from the development activity.
    Leaving that aside, RCTs generally do not control for at least one important factor – the attention and stimulation to recipients that goes with a development activity. This factor may be as important or more important than the technology/support on offer. You mentioned that RCTs can’t always pinpoint what caused the impact and this is an example of that.
    You also mentioned the need for good data. Socio-economic surveys are commonly used in RCTs – they are generally very crude instruments because they rely on complex, often confidential, information from recipients that very few of us could provide accurately.
    So still worth discussing maybe…

    • Hi Ben,

      Thanks for your comment. I think wasting the control group’s time is about as good an ethical issue as can possibly be raised (absent some non-RCT-specific ethical lapse).

      The situation is still no worse than the normal evaluation of a project that did not work, though.

      What’s more, people fill out census forms, HIES and DHS surveys for no benefit. Not quite as intrusive, of course, but it’s not as if RCTs are new in wasting people’s time for data.

      Moreover, if the treatment is effective, the compensation could be expanding it to the control group.

      Re your attention stimulation point. I agree. It will be a problem with all good evaluations though. Moreover if the attention factor proves to be critical then perhaps we’ve learnt something useful that should be built into aid more generally.

      “Socio-economic surveys are commonly used in RCTs – they are generally very crude instruments because they rely on complex, often confidential, information from recipients that very few of us could provide accurately.”

      It would be interesting to know what % of RCTs do actually rely on poor data (I don’t think surveys inevitably generate such data, but know this is a real possibility). Once again, this isn’t specific to RCTs, though. No evaluation will ever be better than the quality of its data.

      Great points. Thank you for engaging.

      Terence

  • I have two comments to make about your post.
    1. Your argument that RCTs are not unethical denies the reality of projects like the one reported on in this article. Deliberately providing “high quality” education to young children and denying others that same education purely for experimental purposes is unethical.
    “Children aged three to four years were assigned randomly to high-quality preschools that were created for the experiment or to existing petites écoles (that is, low-quality preschools)”. https://www.sciencedirect.com/science/article/abs/pii/S0272775717302637
    2. I was at an aid and development seminar a couple of years ago where many people undertaking evaluations mistakenly believed they were doing RCTs when they were obviously not. It’s a term that is being used inappropriately because bureaucrats believe it sounds good. Doing a systematic evaluation and collecting rigorous data does not require use of this method. This is especially true in education when it is almost impossible to distinguish between the many variables that can impact on student learning and performance.

    • Hi Eileen,

      You make it sound like the experiment in question denied the preschoolers in question access to ‘high quality’ preschools. As best I can tell from skimming the methods, what actually happened is that 50% of the students in the sample were given ‘high quality’ preschooling that they would have otherwise failed to receive.

      Then, from the abstract, the main finding was: “The findings show that quality of preschool education had no significant effect on children’s overall educational attainment.” (Although there may be heterogeneous effects.)

      No one was denied high quality preschooling in this experiment. What’s more the intervention was discovered not to help. (Plus or minus a few caveats).

      I fail to see how this was unethical.

      “I was at an aid and development seminar a couple of years ago where many people undertaking evaluations mistakenly believed they were doing RCTs when they were obviously not. It’s a term that is being used inappropriately because bureaucrats believe it sounds good.”

      This is disappointing to hear, I appreciate you raising the point, and I share your concern.

      “Doing a systematic evaluation and collecting rigorous data does not require use of this method.”

      I agree. I thought I said as much in the blog post, sorry if I was unclear.

      “This is especially true in education when it is almost impossible to distinguish between the many variables that can impact on student learning and performance.”

      But, within constraints, that is what RCTs do: they control effectively for the influence of other variables. I agree there are issues (listed in my post), but it’s worth giving RCTs credit for what they can do well.

      Thank you for your comment.

      Terence

      • Thanks Terence. Imagine if the money spent on creating ‘high quality preschools for the experiment’ was spent on making the low quality schools maybe even average quality? So instead of 50% of children getting high quality, maybe 80% get medium quality?
        Regarding the ‘control of variables’ – unfortunately any teacher in the world can tell you that controlling variables (in education that means children, parents, teachers) is extremely difficult. This is what non-education researchers just don’t get. The closest I saw was a researcher from a science background I was working with who suggested the only theory to explain what went on in an early years classroom was chaos theory. RCTs are designed for use in laboratories. Schools are not laboratories.

        • Hi Eileen,

          Thank you for your reply. With respect to the cost, the treatment group only involved 84 students (more on that in a second). My guess is that, compared to the cost of the education system in Mauritius, the cost of upgrading approximately three pre-schools is a small price to pay for learning that investing a lot of extra money in high quality pre-schools (at least of this type) is not likely to deliver the benefits they anticipated. Certainly, the cost of improving three pre-schools wouldn’t be enough to improve the overall quality of pre-school education in Mauritius.

          As for control variables, if the sample is large enough, the treatment and control groups will be effectively the same, regardless of the complexity of human life. This is one of the big advantages of RCTs over non-experimental data. (A sample of 174 may not be large enough, I’m not here to defend the study in its entirety, although the authors do show the treatment and control groups are balanced across an impressive array of variables in Table 1. A sample of 174 may also be under-powered more generally, but the fact that some people run poor quality RCTs doesn’t mean the method itself is flawed.)

          As for schools not being laboratories, I think this critique is perhaps more persuasive with respect to the external validity of the study. (See the blog for my views on the issue of external validity.)

          Thanks for your comment.

          Terence
