By John P.A. Ioannidis, MD, DSc, C.F. Rehnborg Chair in Disease Prevention, Professor of Medicine, of Health Research and Policy, of Biomedical Data Science, and of Statistics, and Co-Director, Meta-Research Innovation Center at Stanford (METRICS), Stanford University
Generating reproducible research results is not an easy task. As discussions about a reproducibility crisis become more common and occasionally heated, investigators may feel intimidated or even threatened, caught in the middle of the reproducibility wars. Some feel that the mounting pressure to deliver (both quantity and quality) may be threatening the joy of doing science and even the momentum to explore bold ideas. However, this is a gross misunderstanding. The effort to understand the shortcomings of reproducibility in our work and to find ways to improve our research standards is not some sort of externally imposed police auditing. It is a grassroots movement that stems from scientists themselves who want to improve their work, including its validity, relevance, and utility.
As has been clarified before, reproducibility of results is just one of many aspects of reproducibility. It is difficult to deal with in isolation, without also considering reproducibility of methods and reproducibility of inferences. Reproducibility of methods is usually impossible to assess, because unfortunately the triplet of software, script/code, and complete raw data is hardly ever available in complete functional form. Lack of reproducibility of inferences leads to debates, even when the evidence seems strong and well-rounded. Reproducibility of results, when considered in the context of these other two components, is unevenly pursued across disciplines. Some fields, like genetic epidemiology, have long understood the importance of routinely incorporating replication as a sine qua non in their efforts. Others still consider replication second-class, “me too” research. Nevertheless, it can be shown (see Ioannidis, Behavioral and Brain Sciences, in press) that in most circumstances replication has at least as much value as original discovery, and often more. This leads to the question: how do we reward and incentivize investigators to follow a reproducible research path?
It has been argued multiple times that the current prototype of the successful investigator is dysfunctional, ineffective, and error-prone. There is substantial variability in this regard across disciplines, but most fields continue to place emphasis on the solo, siloed investigator. Resources are spread thin, and thus each solo, siloed investigator can typically only run small studies. These investigators do a lot of data dredging and exploration to cherry-pick the best-looking results, and they use very lenient and often inappropriate statistical inference tools, e.g., P<0.05, to claim eureka moments and build a broader narrative of success. There are currently no strong incentives to support and reward investigators who want to register studies and promote data sharing and replication in their work.
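To see numerically why lenient thresholds plus data dredging are so dangerous, consider a minimal simulation (an illustrative sketch, not from the original article): an investigator who screens 100 comparisons of pure noise at P<0.05 will still harvest a handful of "discoveries" by chance alone.

```python
import random
import math

random.seed(0)  # fixed seed so the illustration is repeatable

def p_value_null(n=30):
    """One simulated 'study': compare two groups drawn from the SAME
    distribution, so any P < 0.05 it yields is a false positive."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = (sum(a) - sum(b)) / n          # difference of group means
    se = math.sqrt(2 / n)                 # known variance 1 in both groups -> z-test
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# One 'investigator' who tests 100 noise variables and reports the hits.
p_values = [p_value_null() for _ in range(100)]
false_positives = sum(p < 0.05 for p in p_values)
print(f"{false_positives} 'discoveries' out of 100 pure-noise comparisons")
```

By construction, about 5% of these null comparisons cross the P<0.05 line, so selectively reporting only the "significant" ones manufactures a narrative of success out of nothing.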
Given these circumstances, it is not surprising that whenever systematic reproducibility checks are performed, their results are not very encouraging and may even be called frustratingly dismal. One can endlessly debate post hoc whether a lack of replication means that the original research, the replication, both, or neither were wrong; whether reproducibility across science is rapidly becoming a greater problem; or whether we are simply paying more attention to a serious, long-standing problem than we did previously. We need more reproducibility efforts to properly map where we stand, so that we can decide where we want to go. We may be surprised at the low (or high) reproducibility rates of specific disciplines, and we can use this information to try to understand what causes irreproducibility and what the next steps should be.
The insights gained from this effort are also likely to help us develop and apply solutions that would improve the reproducibility of results. Some solutions have already worked in specific fields and may need to be considered in other fields as well. Other solutions are more speculative and could even be harmful. Many research practices are difficult to evaluate experimentally, but such experiments would be worthwhile whenever possible. Insights can also be gleaned from other designs and from modeling approaches. It is likely that in the field of research on research (aka meta-research) there is still some low-hanging fruit (large effects), given that the field is relatively new. But the devil can be in the details when it comes to interventions that can affect, and can be affected by, human behavior. Implementation details can make a difference.
As previously summarized, there are at least twelve families of solutions to improving research:
- Large-scale collaborative research
- Adoption of replication culture
- Registration (of studies, protocols, analysis codes, datasets, raw data, and results)
- Sharing (of data, protocols, materials, software, and other tools)
- Reproducibility practices
- Containment of conflicted sponsors and authors
- More appropriate statistical methods
- Standardization of definitions and analyses
- More stringent thresholds for claiming discoveries or “successes”
- Improvement of study design standards
- Improvements in peer review, reporting, and dissemination of research
- Better training of scientific workforce in methods and statistical literacy
How exactly these solutions are pursued can make a difference. Multiple options may exist and be useful, but much may also depend on when they are applied, in what sequence, and by which stakeholders. For example, it is widely agreed that the methods of statistical inference used across science are often not fit for purpose and are misused and misinterpreted, but there is a plethora of options for moving forward. We may need a multi-step process: a first step of temporizing measures, such as requiring more stringent thresholds for claiming statistical significance, followed by more ambitious, long-term solutions that require retraining and/or future training of the scientific workforce at large to improve statistical literacy and numeracy.
Finally, none of these solutions is likely to work well unless it is integrally linked with the way we reward scientists in hiring, promoting, and funding them. Here, a concerted effort from a wide ecosystem of scientists, institutions, and funding agencies will be required to achieve the optimal outcome.