The case against randomized controlled trials
Everyone knows RCTs are the gold standard of evidence...right?
This article originally appeared in Issue 14: Risk. Subscribe to the print magazine by this Friday to receive our next issue, Work.
By Lennart Finke
In 1710, Scottish doctor John Arbuthnot presented a new proof for the existence of God.1 He had observed that for 82 years in a row, London counted more christenings of baby boys than girls. Assuming that the probability of birthing a girl is equal to that of birthing a boy, and that it varies independently over years, the probability of this outcome occurring by chance is 0.5^82. It follows, he argued, that the ratio must be governed not by random chance but by a divine unifying principle.
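Arbuthnot's arithmetic is easy to check. The short sketch below assumes, as he did, that each year is an independent fair coin flip between a boy-majority and a girl-majority year; the function name is my own and purely illustrative.

```python
from fractions import Fraction

YEARS = 82  # consecutive boy-majority years reported in the text


def prob_all_years_favor_boys(years: int, p_year: Fraction = Fraction(1, 2)) -> Fraction:
    """Exact probability that every one of `years` independent years favors boys."""
    return p_year ** years


p = prob_all_years_favor_boys(YEARS)
# Print the exact fraction alongside a floating-point approximation.
print(f"1 in {p.denominator} (about {float(p):.2e})")
```

The result is on the order of 10^-25, which is why Arbuthnot felt entitled to rule out a fair coin. What the number cannot tell him is which alternative explanation is the right one.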
Pierre-Simon Laplace revisited the data in an analysis published in 1781, and concluded more soberly that the probability of birthing a boy is simply a bit higher than one in two. Further interested in the difference in male-to-female birth proportions in various European cities, Laplace found that this proportion was 0.38% higher in London than in Paris, which he found significant.2 The same comparison between Paris and Naples yielded a probability of 1/100, which Laplace didn’t consider “sufficiently extreme for an irrevocable pronouncement.”
These early tests of quantitative hypotheses illustrate both the risks and merits of observational evidence. To be sure, as Arbuthnot showed, it’s easy enough to find that the data proves something you wanted to be true all along. Yet it is remarkable that both Arbuthnot and Laplace could obtain local records and start doing science right from their desks. In doing so, both correctly documented important phenomena without the need to invest much labor or capital to get the data.
The efficiency, simplicity, and beauty of this method of gaining knowledge have been underappreciated, especially in medicine and public health. Although it is considered ideal to obtain data in the form of a randomized controlled trial for an intervention like a drug, that form of data collection is not always possible in either field. Granted, the Paris versus London hypothesis Laplace was testing is relatively simple, but his sample size was over 1.93 million, more than the vast majority of interventional trials in the history of medicine. It is a rare randomized trial (studying a particular intervention with a sufficiently large and well-distributed sample population) that can detect a 0.3% difference in a binary random variable, but Laplace could, more than 200 years ago.
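To get a feel for how demanding such a small difference is, here is a rough sample-size sketch using the standard normal-approximation formula for comparing two proportions. The baseline proportions (51.0% versus 51.3%, an absolute gap of 0.3 percentage points), the 5% significance level, and the 80% power are my assumptions for illustration, not figures from Laplace.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control: float, p_treated: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_treated - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical proportions: a 0.3 percentage point gap around one half.
required = n_per_arm(0.510, 0.513)
print(f"Participants needed per arm: {required:,}")
```

Under these assumptions the answer comes out at several hundred thousand participants per arm, a scale few interventional trials reach but one that civil registries supply as a matter of routine.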
The advantages of the RCT have cemented it as the gold standard for interventional trials in medicine, and it remains what many laypeople think of as the one true way to do science. Yet once we understand where these advantages come from, how they interact with the economics of collecting samples, and the merits of the alternative, observational evidence emerges as the winner more often than one might think.
The slow invention of the RCT
The first written mention of a controlled trial can be found as early as the 7th century BC in the Book of Daniel 1:11-16. The eponymous prophet asks the steward of Nebuchadnezzar, the king of Babylon, for permission to eat a vegetarian diet instead of the king’s rich and possibly non-kosher food. The king’s steward worries that Daniel and his companions will waste away, but agrees to let them test the diet for 10 days, after which adherents of both diets will be evaluated by how “fair and fat” they looked. The vegetarians win, and the steward is convinced.
While other pre-modern controlled trials are thin on the ground, the 11th-century Persian philosopher Ibn Sina’s Canon of Medicine does set out rules for designing experiments, including the prescription to test patients with a “single, not a composite condition.” This is echoed in modern RCTs, in which patients who have unusual comorbidities or other unusual circumstances are excluded from trials.
A subject as touchy nearly a millennium ago as it is today is sample size. One of the earliest mentions of a large-sample comparison in medicine is found not in a description of a real trial, but in an unrealized proposal to conduct one. In one of his letters, the 14th-century Italian poet Petrarch declared the following as part of a polemic against physicians:
I solemnly affirm and believe, if a hundred or a thousand men of the same age, same temperament and habits, together with the same surroundings, were attacked at the same time by the same disease, that if one half followed the prescriptions of the doctors of the variety of those practicing at the present day, and that the other half took no medicine but relied on Nature’s instincts, I have no doubt as to which half would escape.3
Flemish doctor Jan Baptist van Helmont was somewhat more optimistic about the practice of medicine when he wrote a provocative letter in 1648, casually proposing what could be the first explicit randomization with an external source of entropy:
Let us take from the itinerants’ hospitals, from the camps or from elsewhere 200 or 500 poor people with fevers, pleurisy etc. and divide them in two: let us cast lots so that one half of them fall to me and the other half to you. I shall cure them without blood-letting or perceptible purging, you will do so according to your knowledge (nor do I even hold you to your boast of abstaining from phlebotomy or purging) and we shall see how many funerals each of us will have: the outcome of the contest shall be the reward of 300 florins deposited by each of us.4
The sample size of what may be one of the earliest recorded attempts at a clinical controlled trial was a bit smaller: 12. In 1747, the British naval doctor James Lind wanted to test the most promising cures for scurvy. He selected 12 patients with similar symptoms, all fed the same diet and all housed together in the same part of the ship. The only difference was in their treatment: cider, vinegar, “elixir vitriol” (a medical extract of alcohol and sulfuric acid), sea water, citrus fruit, or a medicinal paste recommended by the ship’s surgeon. This wasn’t quite a modern controlled trial (for one thing, there was no control group), but at the time, it hardly mattered. The citrus group were unambiguously the first to recover.
Better-designed trials remained rare until the 20th century. A positive exception is an 1898 diphtheria trial conducted by Danish doctor Johannes Fibiger. With 484 participants, this trial was larger than Lind’s. Unlike most other doctors at the time, Fibiger intentionally sought a large sample size and ensured that his participants actually had the disease he sought to treat. Most importantly, he produced comprehensive documentation of what happened in the trial.5 Fibiger’s trial also featured a true control group and the first attempt at explicit randomization in a clinical trial. Fibiger decided whether a patient would get the experimental serum based on the day they were admitted to the hospital, instead of preferentially treating sicker patients or anything else which might bias the results.
One major source of bias in clinical trials comes from the researchers themselves: Even careful and well-intentioned researchers like Fibiger have an incentive to find that their own treatments work and seek all kinds of avenues, intentional or not, to nudge results in their favor. Another source of bias is the placebo effect, where dummy treatments that a patient believes are real can sometimes have clinical effects. Today, this is addressed by double-blind trials, where neither the doctors conducting the study nor the patients themselves know who is getting a placebo or the real treatment.
In 1946, the United Kingdom’s Medical Research Council started planning our last candidate for the first-ever RCT: a trial of streptomycin for treating tuberculosis. This experiment brought together many of the elements of the RCT as we find it today: attempted randomization and concealment in treatment assignments, systematic enrollment criteria, and quantitative hypothesis testing.
Taking all the improvements in methodology together, we arrive at the commonly invoked advantage of the RCT: Treatment estimates from RCTs are unbiased, at least under reasonable assumptions.
A lesser-appreciated advantage is that the RCT is a tool for communication. By reducing the vast possibilities of data collection and analysis, it presents a canonical way to perform an experiment and write down the results. After one semester, medical students have a good grasp of how to interpret the outcome of an RCT. You can be relatively sure that a p-value in one clinical trial conveys the same information as it does in another clinical trial, at least as far as the mathematical model is concerned. Depending on the field you work in, you might even be able to skip over the methods section due to similarities across trials. Like a social media app, trial designs benefit from a network effect.
This also makes it much easier to share results with the public and regulatory bodies. Regulators were quick to recognize the power of RCTs, but initially lacked the power to require them. That changed in the U.S. after the drug thalidomide caused thousands of birth defects in Germany and Britain.6 The FDA had already barred the drug from entering the U.S. market, which gave them the political leverage to push for legislation requiring manufacturers to demonstrate the efficacy and safety of their drugs using “well-controlled investigations.”789
The resulting Kefauver-Harris Amendment, passed in 1962, made the RCT the chief way to communicate to the government that a drug is effective. Because RCTs are relatively easy to understand, it is also harder for the organizer to fake results to get a drug approved. RCTs are not fraud-proof, but their structure makes some forms of manipulation easier to detect, whereas post-hoc data analysis of observational data is easy to game.
When RCTs fail
To see how this works out in practice, let’s design an RCT answering a question from my own field of research. I work in public health and have been trying to understand ways to reduce under-five mortality in Ethiopia. One cost-effective intervention is a basic vaccine like DPT, which protects against diphtheria, pertussis and tetanus. I (and likely you) got three doses in our first six months alive, but the numbers are much lower for babies in rural Ethiopia.
One idea for bringing that number up is to increase the number of mothers who have a skilled health professional present at birth, known as “skilled birth attendance” in the literature. Plausibly, the health professional would tell the new mother about reasonable next steps for her newborn, like when to come in for immunization. So a good question might be: How does the likelihood of a newborn getting vaccinated change given skilled birth attendance?
The RCT protocol to answer this is simple enough: We find a few hundred women in labor, and tell half of them that, sorry, you’re in the control group, so good luck giving birth on your own because we’re taking the doctor with us. Then we come back six months later and record whether the child was vaccinated or not.
Needless to say, this would be a horrible thing to do, and luckily, nobody is running this trial.
Even if there were a way to solve the glaring ethical problem, this hypothetical experiment would also interrupt normal operations of the clinic, take up a lot of practitioner, researcher, and patient time, and require non-trivial labor costs for data collection.
One example with similar levels of operational intensity would be the WHO antenatal care trial of 1996, which compared the impact of different levels of care for pregnant women on various health outcomes. It collected data for 53 clinics (27 in the intervention group and 26 in the control group), and observed the outcomes of 24,000 women, crediting 256 contributors. The cost and effort required by these types of studies mean that large-scale RCTs are very rare in the field of public health. The interventional trials that do happen are typically small and have to compromise to approximate the hypothesis they really want to test.
Take the WHO trial again. It compared different protocols, but rightfully did not include an arm of women who didn’t receive any care, which would be the differential we’re trying to estimate for a straightforward cost-benefit analysis. Further, though the study covers four countries (Argentina, Cuba, Saudi Arabia, and Thailand), it’s fair to have suspicions about whether the regions would be comparable to my country of interest, Ethiopia.
We might also ask whether a typical facility implementing the tested protocol would be as diligent when the WHO is not looking over their shoulder. These kinds of experiments are very hard to blind. In evaluations of a drug candidate for a serious disease, patients naturally try to find out whether they’re in a control group to potentially seek effective care elsewhere. This is a special case of the Hawthorne effect, or when study participants change their behavior if they know they’re participating in a study. This is a problem for all kinds of interventional trials.
Unconfounding
By now, our simple question about the impact of skilled birth attendance seems near-impossible to answer. But the problem is not the inherent complexity of the question, but the limits of the RCT lens.
After all, lots of women in Ethiopia already have births attended or not attended by a skilled health worker and do or don’t vaccinate their children for DPT. We just need to ask them. Or even better, we don’t, because other researchers have already done that as an implementation of a global project called Performance Monitoring for Action. With this data, we can compare vaccination rates for babies whose birth was attended vs. not attended.
The trouble with doing that is, of course, confounding. There are so many ways a simple comparison could be misleading. For example, an educated woman is more likely to decide to give birth in a facility (the treatment), and also to have her baby vaccinated (the outcome). In other words, better outcomes might not be due to skilled birth attendance, but because of a correlated variable like education.
Confounding is often perceived as so devastating that it precludes the use of observational studies as a way of gaining knowledge. At a talk by an editor of a prestigious journal in public health, I was surprised to find that they usually discard papers based on observational evidence wholesale. I suspect the same is largely true in institutions, such as policymaking, that have to make even quicker decisions with even less scientific context.
I contend that confounding is not as threatening as commonly believed. From 1962 to 1970, when the standards of evidence for RCTs in medicine crystallized, championing RCTs made sense. Since then, however, a few factors have changed in favor of observational evidence. We now have better methods, more data, and faster computers to tackle the problem.
One extremely simple conceptual tool developed in 2016 is target trial emulation.10 As with a controlled trial, this starts with a researcher formulating a hypothesis, but instead of running an experiment, they design a hypothetical experiment that would have produced some data they have access to. As in a real experiment, specifying a protocol before looking at the “results” (even if the results already exist) makes it harder for a researcher to manipulate their data after the fact to support a particular result.
For our question about the impact of skilled birth attendance, we can take our very unethical trial design, with a few adjustments, and treat the PMA survey data as if it came from that design. Luckily, the survey data includes multiple time points following women from pregnancy to six weeks after birth to one year after birth. This lets us cleanly frame different points in time as study enrollment, treatment assignment, and observation of the outcome. If we lack an individual’s data at the third time point, this is analogous to loss to follow-up in an RCT.
Now we need to think about how the survey data differs from the trial we are emulating. The important part is that the treatment is not randomly assigned; as mentioned before, some women were more likely to have skilled birth attendance than others. If we are able to take into account those differences in receiving the treatment, we can imitate an RCT.
While a few statisticians use the term “control” for what we are about to do, we prefer “adjust,” because we didn’t actually exert control over what happened.11 Crucially, in many typical approaches, the process of adjusting is mathematically identical to what we would do in an RCT where participants were assigned to two treatments with a probability other than 50%. RCTs with non-balanced group assignment happen frequently. For example, study authors may assign more people to a treatment that they are more optimistic about, to get higher power for its effect estimate.
More computation, more insight
In order to actually adjust for all these confounders, we draw on the patients’ data. Commonly used variables in public health would be demographic information (for instance, age, education, wealth, and so on), information about the patient’s environment (access to health facilities, family), or health history. Using these, we model the patient’s likelihood of receiving treatment. Then, we check whether the model’s predicted treatment status matches up with the treatment status we actually observe.
In most cases, the more covariates we have in the data, the higher the chances we catch all confounders. (Overfitting can be a risk in prediction problems, but in public health there are usually too few rather than too many covariates in the model.) This is why, in a world with a growing wealth of data, observational evidence treated with causal methods ends up winning.
Once we have the likelihood of treatment, we can use it to emulate an RCT by incorporating it, weighting each sample by the inverse of the treatment likelihood. Take the example of whether drinking wine prevents cancer. If people with higher incomes are more likely to drink wine, this messes with our data: Wine drinking was not randomly assigned. If we happen to see a person with low income who drinks wine in our data, we get excited, because this person’s experience gives us more information: It helps separate the income effect generally from the wine effect specifically. We count their cancer status more heavily. By up-weighting these surprising cases (people who were unlikely to get the treatment, but got it anyway), we can create an RCT-like sample after the fact.
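The wine example can be played out in a small simulation. Everything in this sketch is made up for illustration: income is the only confounder, wine has no effect on cancer at all, and the probabilities are arbitrary. With a single binary confounder, the propensity can be estimated as the share of wine drinkers within each income group; with many covariates, a logistic regression would usually take that role.

```python
import random


def simulate_wine_study(n, seed=0):
    """Simulated people as (high_income, drinks_wine, cancer); wine has no true effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_income = rng.random() < 0.5
        drinks_wine = rng.random() < (0.7 if high_income else 0.2)
        cancer = rng.random() < (0.05 if high_income else 0.10)  # depends on income only
        rows.append((high_income, drinks_wine, cancer))
    return rows


def naive_difference(rows):
    """Cancer rate of wine drinkers minus that of non-drinkers, ignoring income."""
    drinkers = [c for _, w, c in rows if w]
    others = [c for _, w, c in rows if not w]
    return sum(drinkers) / len(drinkers) - sum(others) / len(others)


def ipw_difference(rows):
    """Same contrast after weighting each person by 1 / P(their own treatment | income)."""
    propensity = {}
    for level in (True, False):
        stratum = [w for h, w, _ in rows if h == level]
        propensity[level] = sum(stratum) / len(stratum)
    sums = {True: [0.0, 0.0], False: [0.0, 0.0]}  # treatment -> [weighted cases, total weight]
    for h, w, c in rows:
        weight = 1 / propensity[h] if w else 1 / (1 - propensity[h])
        sums[w][0] += weight * c
        sums[w][1] += weight
    return sums[True][0] / sums[True][1] - sums[False][0] / sums[False][1]


rows = simulate_wine_study(200_000, seed=1)
print(f"naive: {naive_difference(rows):+.4f}   weighted: {ipw_difference(rows):+.4f}")
```

In this toy world the naive comparison makes wine look protective, because drinkers are richer and the rich get less cancer. The weighted comparison lands close to zero, which is the truth that was built into the simulation. The weights are normalized within each group, a common variant that keeps the estimate stable.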
While these inverse probability methods have been around since the 80s,1213 now is a better time than ever to use them.
Doing the necessary calculations is trivial today, but recall that when RCTs were cemented as a gold standard, it wasn’t easy for medical researchers to run even a simple logistic regression. SPSS, a commonly used statistical software suite, came out in 1968, and its competitor SAS in 1972.1415 By the time that inverse probability weighting methods were developed in the early ’80s, computation and RAM limitations had eased significantly for medical researchers. However, getting access to a mainframe, plus digitizing and loading data, was a hurdle. For some results in many RCTs, a simple average is sufficient, and that does not require the computational power of a logistic regression.
Today, we can trade more computational power for better solutions, like using nonlinear methods to predict propensity, gaining a better prediction of the causal structure from the data, or simply computing bootstrapped confidence intervals. Using machine learning to estimate both the treatment and the outcome models is called double machine learning (DML), a term that comes from econometrics. DML combines orthogonalized (debiased) score functions with cross-fitting, which helps control bias from flexible first-stage machine learning estimates and reduces the risk of overfitting-related bias.
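Of the options above, the bootstrap is the simplest to show. The sketch below is a plain percentile bootstrap for a difference in outcome rates, not double machine learning; the data are made up, and the function names are mine. The idea is to resample people with replacement, recompute the estimate each time, and read the interval off the spread of the results.

```python
import random


def risk_difference(rows):
    """Outcome rate among treated rows minus the rate among untreated rows."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


def bootstrap_ci(rows, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` evaluated on resampled rows."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic(rng.choices(rows, k=len(rows)))  # resample with replacement
        for _ in range(n_boot)
    )
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Made-up data: 60 of 100 treated and 40 of 100 untreated people have the outcome.
rows = [(True, 1)] * 60 + [(True, 0)] * 40 + [(False, 1)] * 40 + [(False, 0)] * 60
low, high = bootstrap_ci(rows, risk_difference)
print(f"estimate {risk_difference(rows):.2f}, 95% interval ({low:.2f}, {high:.2f})")
```

In a weighted analysis, the statistic passed in would refit the propensity model inside every resample, so that the interval also reflects the uncertainty in the weights. That is exactly the kind of repeated computation that was out of reach on a 1970s mainframe.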
The causal methods we discussed need the assumption that we observed all confounders. This is certainly annoying, because it is in a sense untestable, but we’re not helpless either.
Barbra Dickerman, one of the researchers behind the target trial emulation framework, sat down with me and explained how target trial emulations can help correct errors in the observational literature, and how to think about unobserved confounders.
In a 2019 paper, she and colleagues demonstrated how a trial emulation with 733,804 subjects showed that statins — cholesterol-lowering drugs — do not protect against cancer, whereas other observational studies seemed to show that they did.16 The main error in those earlier analyses was classifying people based on treatment duration observed over follow-up, which creates immortal time bias: To count as a long-term statin user, a person had to remain cancer-free long enough to accumulate that exposure. Statin users couldn’t be selected for the study if they died of cancer first. By making such assumptions explicit, target trial emulation aligns the incidence estimate with the interventional literature.
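Immortal time bias is easier to believe once you have watched it appear out of nothing. The toy simulation below is not the analysis from the paper; all numbers and names are assumptions chosen for illustration. Statins have no effect on cancer here, people who start statins keep taking them until a cancer diagnosis or the end of follow-up, and "long-term user" means at least five years of use.

```python
import random

FOLLOW_UP_YEARS = 10.0
LONG_TERM_YEARS = 5.0
CANCER_RATE = 0.05  # per person-year, identical for statin users and non-users


def simulate_cohort(n, seed=0):
    """Simulated people as (initiator, years_on_statin, cancer_within_follow_up)."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        initiator = rng.random() < 0.5               # starts statins at baseline
        time_to_cancer = rng.expovariate(CANCER_RATE)
        cancer = time_to_cancer < FOLLOW_UP_YEARS
        years_on_statin = min(time_to_cancer, FOLLOW_UP_YEARS) if initiator else 0.0
        cohort.append((initiator, years_on_statin, cancer))
    return cohort


def risk(group):
    """Share of a group diagnosed with cancer during follow-up."""
    return sum(cancer for _, _, cancer in group) / len(group)


def flawed_risk_difference(cohort):
    """Classifies people by use accumulated during follow-up (immortal time bias)."""
    long_term = [p for p in cohort if p[1] >= LONG_TERM_YEARS]
    everyone_else = [p for p in cohort if p[1] < LONG_TERM_YEARS]
    return risk(long_term) - risk(everyone_else)


def emulated_risk_difference(cohort):
    """Classifies people by what they did at baseline, as a trial would."""
    initiators = [p for p in cohort if p[0]]
    non_initiators = [p for p in cohort if not p[0]]
    return risk(initiators) - risk(non_initiators)


cohort = simulate_cohort(100_000, seed=2)
print(f"flawed: {flawed_risk_difference(cohort):+.3f}   "
      f"emulated: {emulated_risk_difference(cohort):+.3f}")
```

The flawed comparison shows a large protective "effect" of a drug that does nothing, because nobody diagnosed in the first five years can ever count as a long-term user. Grouping people by what they did at baseline makes the effect vanish.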
The method can also help address questions that would be very hard to answer otherwise. For example, in Ioannou et al. (2022),17 researchers used Veterans Affairs healthcare data to emulate a comparison trial with N = 902,235 per arm between Moderna and BioNTech-Pfizer Covid vaccines. (They found that the Moderna vaccine is more effective.) While an RCT would have taken many months to produce, this analysis was done more quickly and at a fraction of the cost. And the companies weren’t going to do it themselves. Drug manufacturers generally avoid comparing their products to the closest competitor. It is often hard to tell which medicine on the market is the safest or most effective for a given condition because we lack these comparisons. Target trial emulation can help.
Alongside target trial emulation, Dickerman points to other relatively new conceptual and mathematical tools for working with observational evidence — for example, the negative control,18 a longstanding experimental practice in other fields, but somewhat novel as a conceptual way to benchmark causal methods. A drug may improve a patient’s odds of recovering from cancer, but it’s unlikely to affect their chances of being hit by a truck. If a researcher runs their whole analysis again, swapping in vehicular impacts for remission rates, and finds a positive effect, this is a good sign that something has gone wrong. Conversely, you could test whether your analysis finds that aspirin, or some other drug commonly prescribed to patients in your sample, has miraculous effects on cancer prognosis. In either case, the negative control serves as a sanity check.
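A negative control check can be sketched in a few lines. In this made-up simulation, an unmeasured trait ("cautious") makes people both more likely to take the drug and less likely to suffer an accidental injury, while the drug itself never affects injuries. The trait, the probabilities, and the function names are all hypothetical.

```python
import random


def simulate(n, confounded, seed=0):
    """Simulated people as (treated, injured); treatment never affects injury."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cautious = rng.random() < 0.5  # unmeasured trait, invisible to the analyst
        if confounded:
            treated = rng.random() < (0.7 if cautious else 0.3)
        else:
            treated = rng.random() < 0.5  # as if randomized
        injured = rng.random() < (0.02 if cautious else 0.06)
        rows.append((treated, injured))
    return rows


def negative_control_effect(rows):
    """Apparent effect of treatment on the negative control outcome."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


for label, confounded in (("confounded", True), ("as if randomized", False)):
    effect = negative_control_effect(simulate(200_000, confounded, seed=3))
    print(f"{label}: apparent effect on injuries {effect:+.4f}")
```

When treatment depends on the hidden trait, the drug appears to prevent injuries it cannot possibly influence, which is the warning sign. When treatment is assigned as if at random, the apparent effect sits near zero. A clean negative control does not prove that an analysis is unconfounded, but a dirty one is hard to explain away.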
Observing what we cannot control
You can think of these causal methods as trading sample size for quality. This means that if you have a large enough quantity of samples and can make a case that all confounders were observed, you can surpass an RCT with a smaller sample size in causal-effect estimation. Since interventional samples are much more expensive, observational evidence is sometimes the better solution.
The overvaluing of interventional evidence disproportionately harms public health, especially in low- and middle-income countries, where an unconscious choice is made between considering observational data or no data at all. Fields with low financial resources gain outsized benefits when getting observational data right. At the same time, the recent defunding of the global health sector has halted many projects, shifting the balance in favor of cost-effective research. Given these conditions, I am sure that on the margins, institutions will start relying more on observational evidence.
Funders should consider allocating more resources to the collection of observational data. It may be much more cost-effective than the alternative, but the total cost is still not zero. Investment in any given dataset is amortized over many papers by authors with no connection to each other, and as usual with coordination problems like this, large organizations are good at solving them.
One of the most interesting datasets in medicine comes from the U.S. Department of Veterans Affairs’ Million Veteran Program, which includes lifestyle survey items, gene samples, and health outcome records for N = one million participants (as of 2023, and growing). Spurred by the success of genome-wide association studies as kicked off by Ozaki et al. 2002,19 and generalizing beyond genetics and public health, I’m tempted to suggest building large datasets as an efficient way to make progress in other fields studying complex systems that I don’t know much about. The increasing availability of large datasets should make this an especially good time to reconsider observational evidence in many fields.
What I do know for sure is that in my work in Ethiopia, it would be immensely helpful to have any kind of electronic health records at the individual or household level, such as simple panel surveys with ideally at least three time points a few months apart, or even a more recent census following the last one in 2007. We only have estimates of how many people live in Ethiopia, much less where exactly they reside. Quantifying the treatment coverage, the disease burden, and the effect of one on the other is not a solved problem. The right tool to tackle it is national-scale observational data, but the investment needed to collect it only happens if we make it clear how valuable it is, and that an RCT is not always the best way to test a hypothesis.
Sometimes nature is kinder than we assume. We need not always coerce it to make it answer our questions. Instead, it can be enough to just record what happens to further science. In time, we will understand all those things we cannot control, only observe.
Lennart Finke is an AI safety researcher and statistician studying at ETH Zurich and the Harvard School of Public Health. He writes at fi-le.net.
John Arbuthnot, An Argument for Divine Providence, taken from the Constant Regularity observed in the Births of both Sexes (1710).
Pierre Simon Laplace, “Mémoire sur les probabilités” in Mémoires de l’Académie Royale des Sciences de Paris (1778), 40.
Julian C. Jamison, “The Entry of Randomized Assignment into the Social Sciences,” World Bank Group: Development Policy Department (2017).
Jan Baptist van Helmont, Ortus Medicinae (1648). Accessed via James Lind Library, https://www.jameslindlibrary.org/van-helmont-jb-1648/.
Asbjørn Hróbjartsson, Peter C. Gøtzsche, and Christian Gluud, “The controlled clinical trial turns 100 years: Fibiger’s trial of serum treatment of diphtheria,” British Medical Journal 317 (1998).
M. L. Meldrum, “A Brief History of the Randomized Controlled Trial,” Hematology/Oncology Clinics of North America 14 no. 4 (2000): 745-760, https://doi.org/10.1016/s0889-8588(05)70309-9.
“Milestones in U.S. Food and Drug Law,” FDA History, Food and Drug Administration, last modified January 30, 2023, https://www.fda.gov/about-fda/fda-history/milestones-us-food-and-drug-law.
Drug Amendments of 1962, Pub. L. No. 87-781, 76 Stat. 780 (1962).
Upjohn Company v. Finch, 303 F. Supp. 241 (W.D. Mich. 1969).
Miguel A. Hernán and James M. Robins, “Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available,” American Journal of Epidemiology 183 no. 8 (2016): 758-764, doi.org/10.1093/aje/kwv254.
Miguel A. Hernán and James M. Robins, Causal Inference: What If (CRC Press, 2020), 23.
James Robins, “A new approach to causal inference in mortality studies with a sustained exposure period,” Mathematical Modelling, vol. 7, iss. 9-12 (1986): 1393-1512, https://doi.org/10.1016/0270-0255(86)90088-6.
Paul R. Rosenbaum and Donald B. Rubin, “The central role of the propensity score in observational studies for causal effects,” Biometrika 70, iss. 1 (1983): 41-55. https://doi.org/10.1093/biomet/70.1.41.
“SAS (software),” Wikipedia, last modified February 20, 2026, https://en.wikipedia.org/w/index.php?title=SAS_(software)&action=history.
Douglas Stauber, “SPSS: 50 Years of Innovation,” IBM Community, April 5, 2018, https://community.ibm.com/community/user/blogs/douglas-stauber/2018/04/05/spss-50-years-of-innovation.
Barbra Dickerman, Xabier García-Albéniz, Roger W. Logan, Spiros Denaxas, and Miguel A. Hernán, “Avoidable flaws in observational analyses: An application to statins and cancer,” Nature Medicine 25 (2019): 1601–1606, https://doi.org/10.1038/s41591-019-0597-x.
George N. Ioannou, Emily R. Locke, Pamela K. Green, and Kristin Berry, “Comparison of Moderna versus Pfizer-BioNTech COVID-19 vaccine outcomes: A target trial emulation study in the U.S. Veterans Affairs healthcare system,” eClinicalMedicine 45 (2022): 101326, https://doi.org/10.1016/j.eclinm.2022.101326.
Marc Lipsitch, Eric Tchetgen, and Ted Cohen, “Negative Controls: A Tool for Detecting Confounding and Bias in Observational Studies,” Epidemiology 21 no. 3 (2010), https://doi.org/10.1097/EDE.0b013e3181d61eeb.
Kouichi Ozaki, Yozo Ohnishi, Aritoshi Iida, et al., “Functional SNPs in the lymphotoxin-α gene that are associated with susceptibility to myocardial infarction,” Nature Genetics 32 (2002): 650-654, https://doi.org/10.1038/ng1047.



I think the general point of this piece (there can be a tradeoff between small sample sizes in RCTs and large sample size observational data) is interesting and important, but I have serious concerns with the proposed solution (control for as much as possible). Controlling for covariates doesn't always make an estimate closer to the true causal effect -- it can often introduce new biases.
As a simple example, suppose I want to know whether smoking increases the chance of mortality. I might be worried that smokers differ systematically from non-smokers in some way, and control whatever variables I can find to account for this. If I control for whether or not a person has lung cancer, however, my estimate of the treatment effect will likely get *worse*, because I'll control away a main pathway by which smoking leads to death! This kind of covariate (one influenced by treatment and lying on the path to the outcome) is a "mediator"; like a "collider", it shouldn't be controlled for in a statistical analysis. With large datasets I think it becomes easier to control for lots of things, but harder to figure out which ones are appropriate to control for and which might be mediators, colliders, or otherwise not ok to control for.
A great thing about critical thinking is it allows us to have meaningful conversations about how different types of evidence may be collected to answer different questions, and that the final decision about the type of evidence which is collected will be determined by a range of considerations, including ethical and practical considerations. The Evidence-Based Medicine movement, which has been adopted more widely as Evidence-Based Practice, acknowledges the value and suitability of observational research. I've taught an undergraduate course for several years where we explicitly teach students that the "hierarchy of evidence" is not a rule, but an over-generalised heuristic. I still find it useful for introducing students to concepts like confounding and bias. I found this article to be weirdly antagonistic about the value of RCTs, and other triallist methodologies, in a way which is neither a charitable nor a particularly novel position to take! Which is a shame, because I think it would be more productive for us to realise the value in technologies which allow for the collection of evidence to address research questions which have been (historically) difficult to address. That is to say, rather than trying to usurp RCTs at the top of the study design pyramid, wouldn't it be better to work together and see what novel methodologies may emerge?