External Validity, or “It worked for them, why not for me?”

When assessing program impact, researchers often focus on evaluating the causal relationship between two variables and the accuracy of those results. This concept, called “internal validity,” asks the question: how likely is it that this relationship actually exists? With strong internal validity, researchers can be more confident that their findings are not simply attributable to chance.

An equally important question for policymakers is whether the observed results can be generalized to other settings, which is referred to as external validity. In other words, if we were to change characteristics such as time, setting, or population, would the causal relationship still hold?

There are a number of scenarios in which the external validity question is relevant. Take, for example, a county that conducted a rigorous five-year randomized controlled trial (RCT) to determine the impact of a family housing program. The evaluation found that the program improved adult and child well-being. Based on this finding, the county government wants to:

  • Continue to fund the program for the next five years
  • Scale the program to new neighborhoods
  • Select additional service providers to deliver the same program model
  • Expand the program to serve homeless veterans in addition to families

How will the original finding translate to these new settings? Given funding constraints, it’s unrealistic to conduct a full RCT evaluation of each program in every community. To tackle this external validity question, there are a few key steps the county can take that don’t involve an RCT:

  1. Isolate what changes and what stays constant: Clearly define the characteristics that will differ between the prior evaluation and a future implementation. Some will be unknown, but others can be predicted or approximated. A qualitative or process evaluation of the first implementation can identify which characteristics are most important to preserve.
  2. Compare results with similar studies: Look for evaluations of similar programs to assess whether findings have been consistent across multiple settings, especially those settings most analogous to the new implementation environment. Evaluation clearinghouses are a great resource for checking whether results have replicated and whether certain characteristics are associated with stronger replication.
  3. Maintain an ongoing evaluation presence: Continue to collect data on population baselines and program impact, as both will likely change in a new setting. Even if an RCT is not feasible, quasi-experimental methods can bridge some of the gaps in understanding these differences. Using Bayesian modeling to interpret the new evaluation as a continuation of the old one is a way to maximize statistical power within the new evaluation framework (see the sketch after this list).
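
To make the Bayesian idea in step 3 concrete, here is a minimal sketch assuming a simple normal-normal conjugate model, where the original RCT’s effect estimate serves as the prior and the new site’s quasi-experimental estimate serves as the new evidence. The function name and all numbers are hypothetical placeholders, not results from any actual evaluation.

```python
def normal_posterior(prior_mean, prior_se, new_mean, new_se):
    """Precision-weighted combination of a prior estimate and new
    evidence under a normal-normal conjugate model."""
    prior_prec = 1.0 / prior_se**2   # precision = 1 / variance
    new_prec = 1.0 / new_se**2
    post_prec = prior_prec + new_prec
    post_mean = (prior_mean * prior_prec + new_mean * new_prec) / post_prec
    return post_mean, post_prec**-0.5

# Hypothetical inputs: the original RCT estimated a 0.30 SD improvement
# in well-being (SE 0.10); the new site's quasi-experimental estimate is
# 0.15 SD (SE 0.20, noisier because the design is smaller and non-randomized).
mean, se = normal_posterior(prior_mean=0.30, prior_se=0.10,
                            new_mean=0.15, new_se=0.20)
print(f"Posterior effect: {mean:.2f} SD (SE {se:.2f})")  # ~0.27 SD (SE 0.09)
```

Because the posterior borrows strength from the original trial, the new site does not start from zero evidence: a smaller, cheaper study can still meaningfully update the county’s estimate of program impact.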

As the social sector continues to identify and scale promising program models, we must think critically about how best to generalize and apply learnings. Exclusive reliance on RCTs is not scalable as programs expand. Further, new RCTs often ignore prior experimental results, placing the burden of proving impact on the intervention even when prior evidence of its effectiveness exists. Continual monitoring, feedback, and learning build on prior experimentation and can enhance external validity as a program scales.

Ultimately, there will be no “one-size-fits-all” solution or “gold standard” when addressing a social challenge. Instead, an outcomes-oriented social sector requires continuous conversation, data sharing, and feedback to assess progress and impact.