Was the Program Successful?

Published: May 2025

By Amy Humke, Ph.D.
Founder, Critical Influence

"Was the program successful?"
It's a deceptively simple question I've had to answer more times than I can count. But real-world evaluation isn't tidy. Sometimes, the same program fails and then succeeds. Sometimes, a win quietly becomes a loss. This isn't a story about perfect results with a happy ending. It's a story about what it takes to make sense of messy, moving targets and how to evaluate programs when the ground keeps shifting.

Example 1: The Course That Didn't Work—Until It Did

During a course evaluation project, I was asked to assess a revised course format designed to improve student outcomes. It was thoughtfully developed, well-intentioned, and backed by faculty enthusiasm. But when we evaluated it using propensity score matching, pairing students in the new format with similar students in the old one on demographics, prior GPA, and test scores, the results were disappointing: no statistically significant difference between the groups.

If we'd stopped there, the program would've been marked a failure.

Instead, we dug in to get more information. I ran a focus group to gather qualitative feedback on the course, and we conducted surveys. The faculty reworked the delivery model, added better scaffolding, and refined how support was offered. The revised format was rerun with a new group of students, and the second analysis showed a clear difference: this time, students in the new format outperformed their peers.

When the new version launched, we ran the same evaluation design: nearest-neighbor propensity score matching on pre-treatment variables such as GPA, prior course completions, and demographics. The effect size in the second iteration exceeded Cohen's d = 0.5, indicating a moderate program impact.
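
For context, Cohen's d is just the difference in group means divided by the pooled standard deviation, so a d of 0.5 means the new-format group scored about half a standard deviation higher. A minimal Python sketch, using made-up grades rather than the actual course data:

```python
import numpy as np

def cohens_d(treated, control):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    treated, control = np.asarray(treated, dtype=float), np.asarray(control, dtype=float)
    n1, n2 = len(treated), len(control)
    # Pooled SD weights each group's variance by its degrees of freedom.
    pooled_sd = np.sqrt(((n1 - 1) * treated.var(ddof=1) +
                         (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
    return (treated.mean() - control.mean()) / pooled_sd

# Illustrative (invented) final grades for matched students in each format.
new_format = [88, 91, 84, 90, 86, 93, 89]
old_format = [82, 85, 80, 87, 83, 84, 86]
print(f"Cohen's d = {cohens_d(new_format, old_format):.2f}")
```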

Same evaluation. Better implementation. That's the power of iterative program design and evaluation methods that make fair comparisons possible, even in non-random rollouts.

Lesson: Program evaluation doesn't end with a single analysis. Sometimes, the most important thing you can do is stay open to iteration and evaluate again once the dust settles. Also, method matters: propensity score matching gave us a fair comparison in a non-random rollout.

Example 2: The A/B Test That Flipped

Another project I was involved in asked whether contacting applicants shortly after they applied would increase conversion. A/B testing was ideal because we could randomly assign applicants to receive contact or not and then track conversion.

At first, it worked; the group that got contact converted at a higher rate.

Then something unexpected happened: after a couple of months, the effect reversed. The treatment group that received the contact converted at a lower rate than the control group.

That reversal raised more questions than answers. Was it seasonality? A change in messaging? A shift in who was applying? The data pointed to a change but not the reason behind it.

We monitored the shift using control charts and tested group differences with chi-square analysis, but the story remains unfinished.
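
A control chart is a lightweight way to keep watching a metric after the experiment "ends." Below is a minimal p-chart sketch in Python with invented weekly numbers (not the actual campaign data); any weekly conversion rate that falls outside the control limits flags a shift worth investigating:

```python
import numpy as np

# Invented weekly data: applicants contacted (treatment group) and conversions.
applicants  = np.array([200, 210, 190, 205, 198, 215, 207, 201])
conversions = np.array([ 52,  55,  48,  50,  41,  38,  35,  33])

rates = conversions / applicants
p_bar = conversions.sum() / applicants.sum()        # overall conversion rate (center line)
sigma = np.sqrt(p_bar * (1 - p_bar) / applicants)   # per-week standard error
ucl = p_bar + 3 * sigma                             # upper control limit
lcl = np.clip(p_bar - 3 * sigma, 0, None)           # lower control limit (floored at 0)

for week, (r, lo, hi) in enumerate(zip(rates, lcl, ucl), start=1):
    flag = "OUT OF CONTROL" if (r < lo or r > hi) else "in control"
    print(f"week {week}: rate={r:.3f}  limits=({lo:.3f}, {hi:.3f})  {flag}")
```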

Lesson: A snapshot is never the whole story. Even in randomized trials, not everything is under your control. Evaluation must be ongoing and flexible enough to recognize when early wins don't hold.

4 Ways I've Seen Program Results Flip and What To Do Next

  1. Initial Failure, Later Success
    Example: The course redesign. The intervention didn't work in its first form. However, after improvements in delivery and support, it achieved its goal.
    What to do: Don't walk away too early. Use the data to refine the intervention and reevaluate after changes.

  2. Early Success, Later Decline
    Example: The early contact A/B test. Results looked great at first but eventually eroded.
    What to do: Build in follow-up checks. Automate longitudinal monitoring where possible.

  3. Overall Flat, Subgroup Wins
    Example scenario: A pilot support program shows no aggregate improvement, but students with below-average GPAs benefit significantly.
What to do: Break down results by meaningful subgroups. Sometimes, a program's value lies in how well it works for a specific population (see the subgroup sketch after this list).

  4. Strong Results, Low ROI
    Example scenario: A training intervention improves test scores by 5% but costs 3x more per student than comparable programs.
    What to do: Always pair statistical results with a business case. A strong effect size still needs a reasonable payoff.
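
One straightforward way to test for a subgroup effect like the one in scenario 3 is to add a treatment-by-subgroup interaction term to a regression. The sketch below uses simulated data and hypothetical column names (treated, low_gpa, outcome); the interaction coefficient captures the extra benefit for the subgroup:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 800
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "low_gpa": rng.integers(0, 2, n),  # 1 = below-average prior GPA
})
# Simulated outcome: the program only helps the low-GPA subgroup.
df["outcome"] = 70 + 5 * df["treated"] * df["low_gpa"] + rng.normal(0, 10, n)

# "treated * low_gpa" expands to main effects plus the treated:low_gpa interaction,
# which is the subgroup-specific effect.
model = smf.ols("outcome ~ treated * low_gpa", data=df).fit()
print(model.summary().tables[1])
```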

Those two experiences taught me something: the story isn't always finished, even when methods are sound and results are clear. In the first case, the disappointing result wasn't a sign to abandon the course redesign but a signal to adjust the intervention. In the second, the early success was short-lived, reminding us that even strong results shouldn't mean we stop monitoring impact.

In both cases, the evaluations reflected the programs as they were implemented at that moment, not their potential. The lesson was that the answer to a poor result isn't always to scrap the program. Similarly, the response to a good result isn't always to scale up and move on.

The real takeaway? Programs are living systems, and evaluation should be, too.

Choosing the Right Design Isn't Just About Math

The structure of your evaluation and how you assign, observe, and compare will determine what kinds of conclusions you can reasonably draw.

A/B Testing

What it does:
Compares outcomes between two groups, one receiving a treatment and one not, where participants have been randomly assigned. Randomization balances other variables across the groups, on average, so differences in outcomes can be attributed to the treatment.

How to analyze:
- Use a t-test to compare averages (e.g., average time on site).
- Use a chi-square test for binary outcomes (e.g., conversion: yes/no).
- Use logistic regression if you need to control for covariates after randomization.
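
To make that concrete, here is a minimal Python sketch of all three checks on simulated A/B data (the variable names and effect sizes are invented): a t-test for a continuous outcome, a chi-square test for conversion, and a logistic regression that adjusts for a covariate measured before randomization:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),           # random assignment
    "time_on_site": rng.normal(300, 60, n),
    "prior_visits": rng.poisson(2, n),          # covariate measured pre-randomization
})
df["time_on_site"] += 15 * df["treated"]        # simulated lift for the treated group
df["converted"] = (rng.random(n) < 0.10 + 0.04 * df["treated"]).astype(int)

# Continuous outcome: t-test on average time on site.
t, p_t = ttest_ind(df.loc[df.treated == 1, "time_on_site"],
                   df.loc[df.treated == 0, "time_on_site"])

# Binary outcome: chi-square test on the 2x2 treatment-by-conversion table.
chi2, p_chi, _, _ = chi2_contingency(pd.crosstab(df["treated"], df["converted"]))

# Covariate adjustment after randomization: logistic regression.
logit = smf.logit("converted ~ treated + prior_visits", data=df).fit(disp=0)

print(f"t-test p={p_t:.3f} | chi-square p={p_chi:.3f}")
print(logit.params)
```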

Difference-in-Differences (DiD)

What it does:
Estimates treatment effects when you can't randomize by comparing before-and-after changes between a treatment group and a comparison group. Pre-trend checks are essential to ensure the groups were on similar paths before the intervention.

How to analyze:
- Use a t-test to compare before-after differences in outcomes between groups.
- Use a chi-square test for categorical outcomes pre- and post-treatment.
- Use fixed-effects regression with a treatment × time interaction term to control for group and time effects and adjust for covariates.
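
In the simplest two-group, two-period setup, the group and period dummies act as the fixed effects and the interaction coefficient is the DiD estimate. A minimal sketch on simulated data, with the true effect set to 3:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "group": rng.integers(0, 2, 2 * n),  # 1 = treatment group, 0 = comparison group
    "post": np.repeat([0, 1], n),        # 0 = before the intervention, 1 = after
})
# Simulated outcome: a group effect, a common time trend, and a true DiD effect of 3.
df["outcome"] = (50 + 2 * df["group"] + 4 * df["post"]
                 + 3 * df["group"] * df["post"] + rng.normal(0, 5, len(df)))

# The group:post interaction coefficient is the difference-in-differences estimate.
did = smf.ols("outcome ~ group * post", data=df).fit()
print(did.params["group:post"], did.pvalues["group:post"])
```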

Propensity Score Matching (PSM)

What it does:
Mimics a randomized experiment by pairing individuals who received a treatment with similar individuals who did not, based on their propensity score.

How to do it:
- Estimate propensity scores using logistic regression or other classifiers.
- Match using nearest-neighbor, caliper, full matching, or kernel matching.
- Check covariate balance after matching using standardized mean differences (aim for SMDs < 0.1).

How to analyze:
- Use a t-test to compare continuous outcomes between matched groups.
- Use a chi-square test for categorical outcomes.
- Use regression to estimate treatment effects while adjusting for any remaining imbalance.
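
Here is a simplified end-to-end sketch on simulated data: 1:1 nearest-neighbor matching on the propensity score (with replacement), a balance check, and a plain t-test. In practice, a dedicated matching package would handle calipers, matching without replacement, and proper variance estimation:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
n = 2000
df = pd.DataFrame({
    "prior_gpa": rng.normal(3.0, 0.4, n),
    "test_score": rng.normal(75, 10, n),
})
# Non-random enrollment: stronger students are more likely to pick the new format.
enroll_prob = 1 / (1 + np.exp(-(df.prior_gpa - 3.0 + 0.02 * (df.test_score - 75))))
df["treated"] = (rng.random(n) < enroll_prob).astype(int)
df["outcome"] = (70 + 8 * df.prior_gpa + 0.2 * df.test_score
                 + 2 * df.treated + rng.normal(0, 5, n))

covariates = ["prior_gpa", "test_score"]

# 1. Estimate propensity scores with logistic regression.
df["ps"] = LogisticRegression().fit(df[covariates], df["treated"]).predict_proba(df[covariates])[:, 1]

treated, control = df[df.treated == 1], df[df.treated == 0]

# 2. Nearest-neighbor matching on the propensity score (1:1, with replacement).
nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
_, idx = nn.kneighbors(treated[["ps"]])
matched_control = control.iloc[idx.ravel()]

# 3. Check covariate balance with standardized mean differences (aim for SMD < 0.1).
for cov in covariates:
    pooled_sd = np.sqrt((treated[cov].var() + matched_control[cov].var()) / 2)
    smd = abs(treated[cov].mean() - matched_control[cov].mean()) / pooled_sd
    print(f"SMD {cov}: {smd:.3f}")

# 4. Compare outcomes between matched groups.
t, p = ttest_ind(treated["outcome"], matched_control["outcome"])
print(f"matched difference = {treated.outcome.mean() - matched_control.outcome.mean():.2f} (p={p:.3f})")
```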

Regression Discontinuity (RD)

What it does:
Estimates the treatment effect when assignment is based on a threshold (e.g., test score ≥ 80 = program eligibility).

How to analyze:
- Use a t-test to compare outcomes just above and below the cutoff.
- Use a chi-square test for binary outcomes.
- Use linear regression with a treatment indicator and the assignment variable.

Best Practice:
- Visually inspect outcomes near the cutoff using scatterplots.
- Use a narrow bandwidth around the cutoff.
- Check for smoothness in pre-treatment covariates across the threshold.
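
A minimal local linear sketch on simulated data, restricted to a narrow bandwidth around a hypothetical cutoff of 80 (the true jump is set to 4); the coefficient on the treatment indicator estimates the effect at the threshold:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 3000
score = rng.uniform(50, 100, n)               # assignment (running) variable
treated = (score >= 80).astype(int)           # eligibility cutoff at 80
outcome = 20 + 0.5 * score + 4 * treated + rng.normal(0, 5, n)
df = pd.DataFrame({"score": score, "treated": treated, "outcome": outcome})

# Keep a narrow bandwidth around the cutoff and center the running variable.
bandwidth = 5
local = df[(df.score >= 80 - bandwidth) & (df.score <= 80 + bandwidth)].copy()
local["centered"] = local["score"] - 80

# Local linear regression: the treated coefficient is the estimated jump at the cutoff.
rd = smf.ols("outcome ~ treated + centered + treated:centered", data=local).fit()
print(rd.params["treated"], rd.pvalues["treated"])
```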

Instrumental Variables (IV)

What it does:
Estimates the treatment effect when there is unmeasured confounding by using an instrumental variable, one that influences treatment but does not directly affect the outcome.

How to analyze:
- Use two-stage least squares (2SLS) regression.
- Stage 1: Predict treatment assignment using the instrument.
- Stage 2: Use the predicted treatment values to estimate the causal effect.
- Evaluate instrument strength using the first-stage F-statistic (rule of thumb: F > 10).

Best Practice:
- Choose a strong instrument related to treatment but unrelated to the outcome except through treatment.
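
A minimal manual 2SLS sketch on simulated data, where an unmeasured "ability" variable biases naive OLS but a randomized encouragement instrument recovers the true effect (set to 2). The manual two-stage coefficient is correct, but real analyses should use a dedicated 2SLS routine so the standard errors come out right:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 5000
ability = rng.normal(0, 1, n)            # unmeasured confounder
instrument = rng.integers(0, 2, n)       # e.g., random encouragement to enroll
# Treatment depends on both the instrument and the confounder.
treated = ((0.8 * instrument + 0.5 * ability + rng.normal(0, 1, n)) > 0.6).astype(int)
# Outcome depends on treatment (true effect = 2) and the confounder, not on the instrument directly.
outcome = 10 + 2 * treated + 3 * ability + rng.normal(0, 1, n)
df = pd.DataFrame({"instrument": instrument, "treated": treated, "outcome": outcome})

# Naive OLS is biased upward because 'ability' drives both treatment and outcome.
print("naive OLS:", smf.ols("outcome ~ treated", data=df).fit().params["treated"])

# Stage 1: predict treatment from the instrument; check strength with the F-statistic.
stage1 = smf.ols("treated ~ instrument", data=df).fit()
print("first-stage F:", stage1.fvalue)
df["treated_hat"] = stage1.fittedvalues

# Stage 2: regress the outcome on the predicted treatment values.
stage2 = smf.ols("outcome ~ treated_hat", data=df).fit()
print("2SLS estimate:", stage2.params["treated_hat"])
```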

Beyond Yes or No: Who Benefits, and By How Much?

As programs grow and scale, the questions we ask also evolve. Sometimes, the goal shifts from "Did it work?" to "Who did it work for, and by how much?"

That's where machine learning models can help. These tools don't replace causal inference; they extend it, especially when programs need to be targeted or personalized.

These methods don't produce p-values, but they sharpen decision-making when the goal isn't just hypothesis testing but improving outcomes at scale.

Final Thought: Truth in Motion

In a perfect world, you run a test, get a result, and make a decision. But in the real world, programs evolve. Populations shift. Implementation matters.

So don't ask: Did it work?

Ask: How well did it work, for whom, under what conditions, and is that enough to act on?

That's what real evaluation looks like.
