12. Experiments

12.6. Experiments: Our Best Way of Inferring Causality, but Far from Foolproof

Victor Tan Chen; Gabriela León-Pérez; Julie Honnold; and Volkan Aytar

Two girls in scout uniforms eating cookies from the boxes each is holding.

An influential lab experiment that plied participants with chocolate chip cookies and radishes gave rise to the psychological theory of “ego depletion,” which holds that forcing people to resist temptation or make hard decisions exhausts the brain and makes it harder to exercise willpower or choose wisely later in the day (Baumeister et al. 1998). While hundreds of psychological studies built on this theory over the next two decades, two massive studies drawing on the cooperation of dozens of labs across the world stunned the research community by finding no evidence of an ego-depletion effect. cottonbro studio, via Pexels

Learning Objectives

  1. Discuss the “replication crisis” and its implications for social scientific research.
  2. Understand how publication bias tends to discourage replication efforts in social science.

We’ve gone over the many ways that experimental designs can fall short of the classical experimental ideal. We’ve also talked about the persistent problems of external validity that plague experiments, even if field experiments can offer a middle ground of sorts by producing more generalizable results. It is worth emphasizing again that even when we are able to do an experiment with randomized control groups, pretests and posttests, and a precisely manipulated treatment, we can still run into problems with internal validity.

In fact, social scientists are coming to recognize how even well-executed experiments can fall short, particularly given the growing evidence that many experimental results published in peer-reviewed journals cannot be replicated. As part of an initiative called the Reproducibility Project, researchers from around the world replicated 100 studies from three top psychology journals (Resnick 2018). More than half of these empirical studies—many of which had used the most rigorous experimental designs—failed to pass muster: the project’s researchers could not reproduce statistically significant results in line with the original studies’ findings. For whatever reason, the accountability systems built into peer-reviewed science had not succeeded in weeding out these studies earlier—perhaps because those studies had not been conducted properly, but also perhaps because some of the more striking findings had been dumb luck, and no one had bothered to ensure that they held up.

One of the most startling casualties of recent replication efforts has been the psychological theory of “ego depletion.” A research team led by Roy Baumeister and Dianne Tice seated participants before a table with a stack of just-baked chocolate chip cookies and a bowl of plain red and white radishes (Baumeister et al. 1998). The researchers found that an experimental group forced to resist the cookies and eat the radishes instead gave up on an impossible-to-solve geometric puzzle much faster than a control group given free rein to chow down on the sugary treats. This study led to a vibrant line of research around theories of “ego depletion” and “decision fatigue,” which view willpower as a muscle, with repeated instances of resisting temptations and making tough decisions sapping the brain’s limited capacity for self-control and decision-making. Hundreds of peer-reviewed studies built on this foundation—measuring the consequences of mental exhaustion for everything from altruistic behaviors to athletic performance—and the ideas trickled into public consciousness, with even Barack Obama reportedly avoiding the morning ritual of personally picking his suits to avoid decision fatigue (Resnick 2016). Yet a large-scale, systematic replication involving 2,141 participants across multiple continents found no evidence of an ego-depletion effect when the exact same test was administered at every participating site (Hagger et al. 2016). Another multilab replication involving 36 sites and 3,531 participants also failed to produce evidence of any ego-depletion effect (Vohs et al. 2021). “I’m in a dark place,” one of the first replication’s contributors, a psychologist who specializes in ego depletion, wrote on his blog (Inzlicht 2016). “I feel like the ground is moving from underneath me and I no longer know what is real and what is not.”

In response to this so-called replication crisis, some scientists have raised concerns that today’s conventions for reviewing and publishing scholarly work may be contributing to a larger problem of reproducibility, whereby the checks and balances built into the scientific system—peer review and replication—aren’t working as well as they should. In a 2016 poll of 1,500 scientists from various fields conducted by the prestigious scientific journal Nature, 70 percent said they had failed to replicate at least one other scientist’s work—and 50 percent admitted they had failed to reproduce their own work. Only a minority of these scientists, however, had ever sought to publish a replication in a journal. In general, academic journals seek to publish new, trailblazing work rather than paint-by-numbers replications, so academics have few incentives to rehash what is seen as settled science. Indeed, in their feedback, several respondents in the Nature poll brought up instances when they had succeeded in publishing failed replications in the face of pushback from journal editors and reviewers, who had asked the authors to downplay how their research contradicted the original studies.

Note that publication bias—the tendency of journals to publish work drawing provocative connections between phenomena while passing on papers that merely test earlier findings—is an issue that plagues scientific work of all kinds, not just experimental studies. In fact, one of the key advantages of experiments as a research method is that they are relatively easy to replicate. Individual experiments, particularly laboratory experiments limited in scope, often require relatively little investment in time and other resources. That’s not the case for immersive ethnographic studies or wide-ranging surveys. And any one study—even a well-designed one—can produce results that happen to be a fluke, so replication is essential to keep science honest. (By the way, this is another reason that you shouldn’t pay too much attention to every new headline about a study showing major health benefits or dangers of eating one food or another—wait to see those findings replicated before you overhaul your way of life.) It’s likely that a Reproducibility Project focused on vetting sociological studies would find even more grim news, given that sociology does not rely as heavily on experimental designs as psychology does, and even less of this sort of exacting replication occurs in our field.

We should be deeply concerned about the possibility that much social scientific research cannot be reproduced, given that replicability is one of the foundations of what makes a science a science, as we discussed in Chapter 1: Introduction. That said, it does no good to simply throw up our hands and reject science, the best method we have to observe and vet the truths of our reality. And its flaws notwithstanding, a robust experimental design remains the best scientific tool for inferring causality. Indeed, in the next two chapters, we’ll discuss the ways that sociologists use statistical methods to essentially approximate the randomized treatment and control groups in experiments. As we noted in this chapter, quasi-experiments are themselves approximations of the key internal validity-preserving features of randomized controlled trials. They try their best to infer causality, but they often fall short of the classical experimental ideal.

We’ll conclude this chapter by breaking down just how much more an experiment like the “motherhood penalty” study (Correll et al. 2007) can tell us about what is truly driving the outcomes we observe, compared with relying on observational data from surveys alone. As you probably know, women get paid less than men. A 2020 Pew Research Center report put the gap at 84 cents earned by women for every dollar earned by men (calculated based on the median hourly earnings of both full- and part-time workers). Clearly, the observational data Pew is using here—and similar surveys of payroll data—show a stark gap in how much men and women are paid, even after accounting for the number of hours they work. But what factor is decisive in creating this gap? Is this 84 percent figure a good measure of how much employers discriminate against women? How much does it reflect differences in education and skill, or years of experience, or choices to pursue occupations that pay low wages or prioritize family over career? In short, what are we really measuring here? A similar issue complicates our analysis of the pay gap between mothers and nonmothers. Mothers clearly make less money over their lifetime, according to payroll data, but we can tell more than one causal story about why that is so. Is it that mothers truly are less committed to their jobs than nonmothers—hurting their performance and resulting in fewer promotions and less pay—or is it that employers perceive mothers to be less committed and discriminate against them based on stereotypes, even when they perform just as well?
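To see exactly what that 84 percent figure does and does not capture, here is a minimal sketch in Python using made-up hourly wages (the numbers below are hypothetical, not Pew’s data). The statistic is simply a ratio of medians; nothing in the calculation adjusts for education, experience, occupation, or family responsibilities, which is why it cannot by itself tell us how much of the gap reflects discrimination.

```python
# A minimal sketch with hypothetical hourly wages (not real Pew data).
# The headline pay-gap figure is just the ratio of women's to men's
# median hourly earnings -- no adjustments of any kind.
import statistics

womens_wages = [14.50, 17.00, 19.25, 22.00, 26.75, 31.00]  # hypothetical
mens_wages = [16.00, 19.50, 23.00, 27.25, 32.50, 38.00]    # hypothetical

pay_ratio = statistics.median(womens_wages) / statistics.median(mens_wages)
print(f"Unadjusted pay ratio: {pay_ratio:.2f}")  # about 0.82 with these numbers
```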

If we just look at the patterns in cross-sectional observational data like this, we can’t say all that much. We can’t manipulate the independent variable. Our data is from one point in time. To deal with these issues, as we will explain in the next two chapters, social scientists will often use statistical controls when analyzing observational data. They use math and statistical theory to screen out explanations other than the independent variable of interest, essentially creating experimental and control groups within their samples. The problem is that—as we alluded to in our discussion of matching, which follows a similar logic—a researcher taking this approach must control for all confounders, and measure those confounders well. If they fail to do so, they can easily fall into the trap of selection bias and other threats to internal validity.
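To illustrate the logic of statistical controls, here is a minimal sketch in Python using the pandas and statsmodels libraries with simulated data (the variables, numbers, and “true” effect below are entirely hypothetical, not drawn from any actual study). It regresses wages on motherhood twice: once with no controls, and once controlling for a measured confounder.

```python
# A minimal sketch of statistical controls using simulated data.
# Every number here is hypothetical; nothing comes from an actual study.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1_000
mother = rng.integers(0, 2, n)                  # 1 = mother, 0 = nonmother
experience = rng.normal(10, 3, n) - 2 * mother  # confounder: mothers have less experience in this simulation
wage = 20 + 1.5 * experience - 1.0 * mother + rng.normal(0, 4, n)  # "true" penalty of $1/hour

df = pd.DataFrame({"wage": wage, "mother": mother, "experience": experience})

# No controls: the motherhood coefficient also absorbs the experience gap.
naive = smf.ols("wage ~ mother", data=df).fit()
# Controlling for the measured confounder: closer to the true penalty.
adjusted = smf.ols("wage ~ mother + experience", data=df).fit()

print(f"Naive estimate:    {naive.params['mother']:.2f}")     # roughly -4
print(f"Adjusted estimate: {adjusted.params['mother']:.2f}")  # roughly -1
```

In this simulation, the unadjusted estimate overstates the motherhood penalty because it bundles in the experience gap, while adding the measured confounder recovers something close to the true effect. But a confounder the researcher fails to measure and include—the usual situation with real observational data—would bias the estimate in exactly the way described above.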

In the case of the gender pay gap and the motherhood penalty, it was exceedingly hard for social scientists to measure each of the possible factors at play and convincingly argue that they had isolated the effect of employer discrimination. That is why the study by Correll and her collaborators was so ingenious: they were able to create conditions in their lab and field experimental settings that ensured the only factor being tested was employer discrimination, and not hard-to-measure differences in the skill levels of the job candidates. They used their control over the study conditions to keep everything constant (“all other things being equal”) except for the independent variable they were manipulating. The result was potent evidence that, yes, employer discrimination against mothers is real.

Although not as ideal from an internal validity standpoint, quantitative analysis of observational data from surveys and other sources can be illuminating in its own right, as we discuss in the next two chapters. What is lost in internal validity may be gained in external validity, since surveys typically elicit people’s opinions about real issues and situations rather than contrived scenarios. Rigorous statistical techniques can compensate, to a degree, for the lack of an experimental design. And ethical and pragmatic considerations often make true experiments hard to conduct, meaning that we sometimes have to make do with less-than-ideal methodological approaches. Indeed, some of the most interesting research of recent decades has come from natural experiments and quasi-experiments that creatively examine phenomena that are otherwise difficult to study in a traditional experimental fashion—the idea here being, as with field experiments, that we might trade a degree of internal validity for feasibility or generalizability.

Personally, we believe that each of these methodological approaches has its merits. While controlled experiments remain the gold standard, we social scientists should use all the tools at our disposal. By triangulating with a variety of methods, we as a scientific community get steadily closer to being able to say with confidence that changes in one variable truly do, or do not, cause changes in another.

Key Takeaways

  1. Researchers have failed to replicate findings of many influential psychological experiments, raising questions about social science’s ability to infer causality.
  2. Publication bias favors the publication of work drawing provocative connections between phenomena over papers that test earlier findings through replication, which may be contributing to the so-called replication crisis.

License


12.6. Experiments: Our Best Way of Inferring Causality, but Far from Foolproof Copyright © by Victor Tan Chen; Gabriela León-Pérez; Julie Honnold; and Volkan Aytar is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
