When the Model's Right and the Business Still Suffers, Look Here

Published: July 2025
By Amy Humke, Ph.D.
Founder, Critical Influence

When people discuss machine learning pipelines, they often make it sound like an assembly line: data goes in, predictions come out, and a model sits neatly in the middle.

But that's not how it feels when you're the data scientist working inside the pipeline.

I don't own our MLOps architecture; a dedicated team handles that. But I build within it every day and lead projects that live inside it. I use the tools that team provides, I run into the constraints they enforce, and I run into the consequences of process steps I failed to control. And I've learned, often the hard way, where things break down if you're not careful. Feature engineering, drift monitoring, and retraining schedules are not abstract principles. They're real tradeoffs that show up in production.

This article reflects on those lessons, not from an architect's perspective, but from someone embedded in the system, navigating the layers, troubleshooting the edge cases, and learning how to work with the infrastructure rather than against it.

Because a pipeline isn't just code and containers; it's a layered ecosystem. And if you want your model to last, you'd better understand how those layers interact.


1. Feature Engineering Isn't a Pre-Step; It's a Versioned System

If you treat feature engineering like a one-time Jupyter activity, you're not building a model but a prototype. In production, features aren't just data transformations; they're assets that must be versioned, reproducible, and consistent across environments.

As a data scientist working in Databricks, I don't manage the entire MLOps stack, but I've learned what it takes to make features reliable within one: versioned feature definitions, transformations that can be reproduced exactly, and values that match between training and serving.

In my environment, the Databricks Feature Store helps enforce this discipline by acting as a central registry of feature definitions. It's more than just storage; it ensures features are computed the same way during training and serving, with clear lineage and version history.
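As a rough illustration, here's a minimal sketch of that pattern with the Feature Store client. The table, column, and DataFrame names (analytics.churn_customer_features, customer_features_df, labels_df, churned) are hypothetical, and it assumes a Databricks ML runtime where databricks.feature_store is available.

```python
# Sketch only: registering engineered features and building a training set
# through the Databricks Feature Store. Table, column, and DataFrame names
# are hypothetical placeholders.
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# Register the feature definitions once; later runs write to the same table.
fs.create_table(
    name="analytics.churn_customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,          # Spark DataFrame of engineered features
    description="Customer-level churn features with lineage and versions",
)

# Training resolves features through the same registry used at serving time,
# which is what keeps the computation consistent across environments.
training_set = fs.create_training_set(
    df=labels_df,                      # Spark DataFrame: customer_id + churned label
    feature_lookups=[
        FeatureLookup(
            table_name="analytics.churn_customer_features",
            lookup_key="customer_id",
        )
    ],
    label="churned",
)
train_df = training_set.load_df()
```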

However, feature logic is only part of the story. I've learned the hard way that reproducibility lives and dies by environmental hygiene: pinned library and runtime versions, recorded alongside the model, so that a minor platform update can't quietly change the results.
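A minimal sketch of capturing that state, assuming MLflow tracking (built into Databricks) and a fitted scikit-learn estimator named model; the pinned versions are placeholders.

```python
# Sketch only: recording the training environment alongside the model so it
# can be rebuilt exactly. `model` and the pinned versions are placeholders.
import mlflow
import sklearn

with mlflow.start_run(run_name="churn_model_training"):
    mlflow.log_param("sklearn_version", sklearn.__version__)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        pip_requirements=[             # pinned, not floating
            f"scikit-learn=={sklearn.__version__}",
            "pandas==2.1.4",
            "numpy==1.26.4",
        ],
    )
```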

Bottom line: Feature engineering isn't just a step—it's a contract. Between training and serving. Between now and six months from now. And between the model and the humans who need to trust it.

If your model broke after a schema shift or started drifting after a minor platform update, there's a good chance it wasn't the model's fault—it was the pipeline's memory loss.


2. Calibration: The Last Mile of a Trustworthy Pipeline

A pipeline isn't just about delivering predictions; it's about delivering decisions people can trust. And that means calibration matters.

You can ship a model with 92% accuracy and still burn your team. Why? Because when your model says, "There's an 80% chance this user will churn," someone acts on it. That 80% isn't just a number—it's a promise.

That's why I treat calibration as a production concern, not a research nice-to-have.

In domains like admissions, healthcare, or churn mitigation, probability isn't just informative—it's operational. A well-calibrated model lets humans and machines collaborate with aligned expectations.

I'd rather ship a slightly less "accurate" model that tells the truth about uncertainty than a precise but overconfident one.

Calibration isn't a metrics tweak; it's the last mile of model integrity.
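As a sketch of what that check can look like in practice, assuming a scikit-learn classifier base_model and data splits X_train, y_train, X_val, y_val (all placeholder names):

```python
# Sketch only: measuring and improving calibration before a model ships.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss

# Wrap the estimator so its probabilities are recalibrated via cross-validation.
calibrated = CalibratedClassifierCV(base_model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Reliability check on held-out data: predicted and observed rates should track.
probs = calibrated.predict_proba(X_val)[:, 1]
observed, predicted = calibration_curve(y_val, probs, n_bins=10)
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")

# One summary number worth tracking over time (lower is better).
print("Brier score:", brier_score_loss(y_val, probs))
```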


3. Retraining Isn't a Schedule. It's a Triggered System.

Scheduled retrainings (e.g., monthly) sound great until you realize that drift doesn't follow a calendar: a fixed cadence retrains when nothing has changed and waits patiently while performance degrades mid-cycle.

My current approach puts triggers ahead of the calendar.

In Databricks, I use MLflow to track cohort performance and trigger jobs based on those metrics. It's not just "automate retrain every Friday." It's "retrain when precision drops below target for two cohorts in a row."
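A minimal sketch of that kind of trigger, assuming cohort evaluations are logged to an MLflow experiment with a "precision" metric; the experiment name, target, and retrain hook are placeholders rather than the exact production setup.

```python
# Sketch only: metric-triggered retraining instead of a fixed calendar.
import mlflow

PRECISION_TARGET = 0.80

def trigger_retrain_job():
    # Placeholder: in practice this would submit the retraining job,
    # e.g., a Databricks job or notebook run.
    print("Retraining triggered")

# Pull the two most recent cohort evaluation runs.
recent = mlflow.search_runs(
    experiment_names=["/Shared/churn_cohort_evals"],
    order_by=["attributes.start_time DESC"],
    max_results=2,
)

# Retrain only when precision misses the target two cohorts in a row.
if len(recent) == 2 and (recent["metrics.precision"] < PRECISION_TARGET).all():
    trigger_retrain_job()
else:
    print("Precision within target; no retrain needed.")
```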


4. Monitoring: The Eyes and Ears of a Fragile System

Most people consider monitoring a "did it run?" checkbox. But in ML, the danger isn't a job that fails—it's one that runs and silently breaks logic midstream.

That's why I monitor more than latency and error rates. I monitor behavior.

Some leaks only show up in production.

You can test all you want in dev and staging, but you often don't see the cracks until the model interacts with live data and real business consequences. That field you believed arrived during the model's analysis window actually lands a week late? In production, it shows up as all nulls, and performance tanks before anyone sounds the alarm.

That's why, before a model goes live, I now ask:

What production checks will catch a silent failure within 24 hours, before stakeholders do?

Those checks might include null-rate and volume thresholds on key features, alerts on shifts in the prediction distribution itself, and cohort-level performance tracking once outcomes arrive.

Because monitoring isn't just about detecting failure, it's about catching model harm before it spreads. And in ML, silence is rarely a sign of health.
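Here's a minimal sketch of two such checks, assuming a pandas DataFrame scored_df holding today's features and predictions; the column names, thresholds, and baseline are placeholders.

```python
# Sketch only: two cheap behavior checks that surface silent failures within a
# scoring cycle. `scored_df`, the columns, and the limits are placeholders.
NULL_RATE_LIMIT = 0.05        # alert if a key feature is more than 5% null
SCORE_SHIFT_LIMIT = 0.10      # alert if the mean score moves more than 0.10
BASELINE_MEAN_SCORE = 0.22    # captured when the model was validated

alerts = []

# 1. Null-rate check on the features the model actually depends on.
for col in ["tenure_days", "last_login_gap", "plan_type"]:
    null_rate = scored_df[col].isna().mean()
    if null_rate > NULL_RATE_LIMIT:
        alerts.append(f"{col} is {null_rate:.1%} null")

# 2. Prediction-distribution check: the scores themselves are a signal.
mean_score = scored_df["churn_probability"].mean()
if abs(mean_score - BASELINE_MEAN_SCORE) > SCORE_SHIFT_LIMIT:
    alerts.append(f"mean score shifted to {mean_score:.2f}")

if alerts:
    # Route this to whatever alerting channel the team already watches.
    raise RuntimeError("Model health check failed: " + "; ".join(alerts))
```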


5. Planning for Failure Is Part of the Pipeline

The first time I realized the pipeline hadn't run in five days, I was staring at an empty dashboard and an inbox full of questions.

That's when it hit me: good ML pipelines aren't just built to run—they're built to fail safely.

Failures in production don't always come with stack traces. Sometimes they show up as:

That's why I've learned to build with failure in mind. Not just model failure—system fragility across data, logic, and infrastructure.

Can You Rebuild History?

When a scheduled job fails for several days, you don't just need a fix; you need a backfill plan. That means knowing exactly which runs were missed, rebuilding features as they would have looked on each of those days, and replaying scoring in order so downstream history stays consistent.
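A minimal sketch of that replay logic; score_one_day stands in for whatever rebuilds one day's features and predictions, and the dates are placeholders.

```python
# Sketch only: replaying missed days in order after an outage. Writes are
# assumed to be idempotent per run_date so a re-run overwrites cleanly.
from datetime import date, timedelta

def score_one_day(run_date: date) -> None:
    # Placeholder: rebuild features *as of* run_date and overwrite that day's
    # predictions, so re-running the same day is safe.
    print(f"backfilling {run_date}")

last_successful = date(2025, 6, 30)   # taken from the job's own run log
today = date.today()

day = last_successful + timedelta(days=1)
while day < today:
    score_one_day(day)                # replay in order so history stays consistent
    day += timedelta(days=1)
```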

Can You Remove Just the Bad Data?

Sometimes the pipeline runs but processes invalid inputs due to schema changes, upstream typos, or failed joins. To recover without blowing away good history, good tools to have include outputs keyed by run date so a bad day can be removed surgically, idempotent writes that make re-running a day safe, and table history or time travel so a known-good state can be restored.
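For example, with predictions in a Delta table, a bad run can be deleted or the table restored without touching the rest of the history. This is a sketch with a hypothetical table name and date, assuming a Databricks notebook where spark is already defined.

```python
# Sketch only: removing one bad run from a Delta prediction table.
BAD_RUN_DATE = "2025-07-03"

# Look at recent writes and table versions before changing anything.
spark.sql("DESCRIBE HISTORY analytics.churn_predictions LIMIT 10").show(truncate=False)

# Delete only the affected run; earlier and later runs stay intact.
spark.sql(f"""
    DELETE FROM analytics.churn_predictions
    WHERE run_date = '{BAD_RUN_DATE}'
""")

# If the damage is wider than one run, Delta time travel can restore a
# known-good version of the whole table instead:
# spark.sql("RESTORE TABLE analytics.churn_predictions TO VERSION AS OF 42")
```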

Will You Know It Broke Before Someone Else Tells You?

Silent failures are often worse than loud ones. The model runs, the outputs update, but the predictions are broken, and no one knows until a stakeholder calls it out.

So now, I layer in production checks designed to catch these cases early: the same null-rate, volume, and prediction-distribution checks described above, wired to alert me before a stakeholder notices.

What Happens When the Model, or the Environment, Breaks?

Deployment should be predictable. But the truth is, it often isn't.

Sometimes a model performs well in staging and fails in production. Sometimes an environment is upgraded and can't be rolled back. Sometimes, dev stops working entirely until you reconcile it with a prod version you can't test against.

In a perfect world, I'd have access to environments I could roll back on demand and a staging environment that genuinely mirrors production.

That isn't always possible in practice, especially if the underlying platform doesn't support rollback. So I compensate by versioning everything: model artifacts, feature definitions, training data snapshots, library and runtime versions, and the configuration that ties them together.

Because if I can't roll back the system, I need to rebuild at least the state that produced the last known-good results.
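A minimal sketch of what that versioning can look like with MLflow and its model registry; the registry name, table version, and commit hash are placeholders.

```python
# Sketch only: versioning enough state to rebuild the last known-good run.
import mlflow

with mlflow.start_run(run_name="churn_model_candidate"):
    # Tie the run to the exact data and code state that produced it.
    mlflow.log_param("training_table", "analytics.churn_training")
    mlflow.log_param("training_table_version", 57)   # Delta table version used
    mlflow.log_param("git_commit", "abc1234")
    mlflow.sklearn.log_model(
        model,                                        # fitted estimator
        artifact_path="model",
        registered_model_name="churn_model",          # adds a new registry version
    )

# "Rolling back" then means loading the last known-good registry version.
known_good = mlflow.pyfunc.load_model("models:/churn_model/12")
```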

Bottom line: Most ML systems don't fail catastrophically; they fail partially and silently. And by the time you notice, the damage has already been done. Planning for failure isn't an edge case; it's a core requirement of any pipeline meant to last.


6. Scaling Smartly (And Why Bigger Isn't Always Better)

When models get large—especially in e-commerce contexts—things get messy. Here's how I scale only when needed:

Challenge → Strategy
Slow single-machine training → Use Spark for feature engineering, but offload modeling to MLflow experiments with parallelization.
Hugging Face + PyTorch model → Use TorchDistributor in Databricks, or DeepSpeed when memory is tight.
OOM errors → Use gradient checkpointing and mixed precision; tune batch size dynamically.
Wide tables (>100 columns) → Use the Photon engine in Databricks (bypasses WholeStageCodegen limits).
Memory bottlenecks → Don't increase threads blindly; test thread counts and watch for cache thrashing.

Caution: More parallelism doesn't always mean faster. I've seen multithreaded models slow down when threads compete for cache or exceed memory bandwidth.

Always test: thread counts against actual wall-clock time, memory headroom at your real batch sizes, and whether added parallelism helps or just adds contention.

Don't optimize in a vacuum. Optimize for your actual hardware, data volume, and latency budget.
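A quick benchmark harness makes that testing concrete. This sketch uses a synthetic dataset and scikit-learn's n_jobs purely as an illustration; swap in your real training call and data volumes.

```python
# Sketch only: measure, don't assume, the effect of parallelism on your own
# hardware. The data here is synthetic stand-in material.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.random((50_000, 20))
y = (rng.random(50_000) > 0.5).astype(int)

for n_jobs in (1, 2, 4, 8):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=50, n_jobs=n_jobs).fit(X, y)
    elapsed = time.perf_counter() - start
    # If doubling threads stops helping, you've hit memory or cache limits.
    print(f"n_jobs={n_jobs}: {elapsed:.1f}s")
```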


7. One More Thing: ROI Is the Real North Star

A great ML pipeline doesn't just run well—it pays rent.

That's why I push for models that are tied to a decision someone actually makes, and for impact that is measured in business terms rather than assumed.

If you can't draw a straight line from your model to a business decision, you might be modeling for fun, not impact. At the very least, when the outcome metric isn't directly measurable, be prepared to show clear, traceable use.


Final Thought

Everyone loves to show off their models. But don't ignore the pipelines—there's magic there.

Whether you're in Databricks or another stack, the principle is the same:
Build ML pipelines like systems, not scripts. Automate the boring stuff. Watch the scary stuff. And always design for what breaks when no one's looking.
