Feature Store: The Sprawl
When a tool quietly becomes a platform
Every feature store starts as “cron + Redis” and ends as a choice: either you standardize time semantics and ownership, or you build a platform team by accident.
I’ve spent over six years watching this pattern repeat—building feature stores, operating them, acquiring vendors, leading consolidation efforts. The failure modes are consistent. This post lays out the patterns so you can spot them early. In the next post, I’ll share an actionable questionnaire and concrete solutions.
Who this is for: ML platform engineers, infrastructure leads, and engineering managers evaluating build vs. buy decisions. If you’re responsible for feature infrastructure at scale, this will help you see around corners.
Key Takeaways
If you take nothing else from this post:
The MVP will work. That’s the trap. Success at small scale tells you nothing about what’s coming.
The real cost is opportunity cost. Not infrastructure spend, but what your engineers aren’t building while they develop and maintain the platform.
Know what you’re signing up for before you commit. Ask yourself tough questions and relentlessly pursue the answers.
The Seductive Simplicity of the MVP
Every feature store starts the same way.
“We just need to serve features for our model. Cron job to compute them, Redis to serve them. Simple.”
And it works. Beautifully.
You set up some basic sharding. The latency is fine. The data scientists are happy. You’ve increased velocity—the process of getting models to production is straightforward. You get it running in two weeks and everyone pats themselves on the back.
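At this stage the entire online path can genuinely fit on one screen. A minimal sketch of the cron-plus-Redis setup, assuming the redis-py client; the key layout and feature names are illustrative, not from any particular system:

```python
import redis  # redis-py; assumes a reachable Redis instance

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def materialize(features: dict[str, dict[str, float]]) -> None:
    """Called by the nightly cron job: write each user's feature values."""
    pipe = r.pipeline()
    for user_id, values in features.items():
        pipe.hset(f"features:user:{user_id}", mapping=values)
    pipe.execute()

def get_features(user_id: str) -> dict[str, str]:
    """Called by the model service at prediction time."""
    return r.hgetall(f"features:user:{user_id}")
```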
“Multi-region deployment would be nice. We need to be closer to our users. We can add it later.”
And you do. And it also works.
The MVP works precisely because it’s scoped so tightly. One team. A few use cases. Batch compute. Single region, or multiple regions with a clear partitioning strategy. Clear ownership. Of course it’s simple—you’ve avoided everything that makes feature stores hard.
The confidence you gain from the MVP is inversely proportional to its usefulness for predicting what comes next.
When Things Start to Spiral
The problems don’t arrive all at once. They creep in, each one reasonable on its own.
“We need to use these features for training, not just serving.”
You solved the online problem. Now comes the offline problem. And this is where feature stores get genuinely hard—not because of the technology, but because your engineers need to start thinking about timestamps.
Not one timestamp. Many:
When did the impression, click, or conversion happen?
When was the request made?
When was the feature computed?
When was it materialized to storage or published to Kafka?
When did it become available for serving?
When did it become available for training?
What do we do in case of outages or delays?
You’re trying to answer one question: what features did the model see (or would have seen) at prediction time?
Get this wrong and you have data leakage. Your offline metrics look great. Your business metrics, not so much.
This is point-in-time correctness. It sounds simple until you try to implement it correctly across serving, batch and streaming pipelines with late-arriving data and backfills—while ensuring you don’t have offline-online drift.
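To make "point-in-time correct" concrete, here is a minimal sketch using pandas' `merge_asof`; the column names (`user_id`, `event_ts`, `feature_ts`) are illustrative assumptions, not from any particular schema:

```python
import pandas as pd

def point_in_time_join(labels: pd.DataFrame, feature_log: pd.DataFrame) -> pd.DataFrame:
    """Attach, to each prediction event, the latest feature value that was
    already available for serving at that moment, never a later one.

    labels:      one row per prediction event (user_id, event_ts, label)
    feature_log: append-only history of feature values (user_id, feature_ts, value),
                 where feature_ts is when the value became servable
    """
    labels = labels.sort_values("event_ts")
    feature_log = feature_log.sort_values("feature_ts")
    return pd.merge_asof(
        labels,
        feature_log,
        left_on="event_ts",
        right_on="feature_ts",
        by="user_id",
        direction="backward",  # only values known at or before event_ts
    )
```

The hard part isn't this join. It's making sure `feature_ts` actually reflects availability, including late data and backfills, rather than when the underlying event happened.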
“We have N implementations of the same feature.”
Without a declarative framework, the same feature logic exists in at least four places:
The real-time service (Java, Go, Rust, C++—whatever your serving stack is)
The streaming pipeline (Flink, Spark Streaming—whatever computes fresh features)
The data warehouse (SQL or PySpark for batch training data)
Notebooks (Python, because engineers can just do things)
They will drift. They always drift. And you won’t know until someone notices the model behaving strangely in production.
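To show how subtle the drift is, here is a sketch of two "identical" seven-day click counts; the feature, column names, and timezone handling are assumptions, but the boundary mismatch is representative of what actually bites:

```python
from datetime import datetime, timedelta

import pandas as pd

# Both claim to be "clicks in the last 7 days". Assumes event_ts is a
# timezone-aware UTC column and `now` is a timezone-aware UTC datetime.

def clicks_7d_online(events: pd.DataFrame, now: datetime) -> int:
    # Rolling window: exactly the last 7 * 24 hours before the request.
    return int((events["event_ts"] > now - timedelta(days=7)).sum())

def clicks_7d_warehouse(events: pd.DataFrame, now: datetime) -> int:
    # Calendar window: the last 7 whole UTC days, the way a SQL filter like
    # `event_date >= CURRENT_DATE - 7` would compute it.
    cutoff = pd.Timestamp(now.date() - timedelta(days=7), tz="UTC")
    return int((events["event_ts"] >= cutoff).sum())
```

Both are defensible. The problem is that training uses one and serving uses the other.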
“We need lower latency.” Or: “We need larger batches.”
Now you’re evaluating multiple databases. Maybe Redis or Valkey. Maybe embedded databases like RocksDB. Maybe something custom.
The closer you move storage to compute, the lower your latency—but at the cost of higher operational burden and weaker consistency guarantees.
When you train on features from the warehouse and serve from an embedded store, what guarantees do you have that they match? What are you doing to ensure they do? Have you even thought about this problem?
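One pragmatic answer is to sample and compare. A minimal sketch of an offline-online spot check; `online_store` stands in for whatever client your serving store exposes, and its `.get(entity_id, feature)` interface is hypothetical:

```python
import random

def offline_online_mismatch_rate(
    offline_snapshot: dict[str, float],  # entity_id -> value from the warehouse
    online_store,                        # hypothetical client with .get(entity_id, feature)
    feature: str,
    sample_size: int = 100,
) -> float:
    """Spot-check a random sample of entities and return the mismatch rate,
    so it can be plotted and alerted on rather than discovered by accident."""
    keys = random.sample(list(offline_snapshot), min(sample_size, len(offline_snapshot)))
    mismatches = sum(
        1 for k in keys if online_store.get(k, feature) != offline_snapshot[k]
    )
    return mismatches / len(keys)
```

Exact equality is often too strict (floats, freshness windows), but even a crude check like this catches the worst drift.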
“We need fresher features.”
Now you’re adding Flink and exploring the wonderful world of streaming data—late events, watermarks, state management. And you’re still doing this as part of the feature store. The complexity compounds.
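For a flavor of what that signs you up for, here is a minimal Spark Structured Streaming sketch; the topic, schema, broker address, and ten-minute watermark are all assumptions, and the same ideas apply to Flink:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fresh-features").getOrCreate()

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker address
    .option("subscribe", "clicks")                     # assumed topic
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "user_id STRING, event_time TIMESTAMP").alias("e"))
    .select("e.*")
)

# Event-time aggregation: the watermark bounds state size, but it also means
# events arriving more than 10 minutes late are silently dropped. That is a
# product decision, not just an engineering one.
hourly_clicks = (
    clicks.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "1 hour"), F.col("user_id"))
    .count()
)
```

Everything around this block, checkpointing, state size, reprocessing after a bad deploy, is where the operational burden actually lives.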
“Actually, we have multiple feature stores.”
If you’re at a large company, this one hits hard. It’s not one feature store—it’s eight[1]. Which is seven more than you should have. Each team built their own. Each team religiously believes theirs is the best approach for their use case.
Now someone wants to share features. Now you need governance. Now you need internal billing so teams can charge each other. Now you have ownership wars, deprecation policies, and migration strategies. And you probably don’t want to manage your databases anymore. Congratulations—now you have to manage purchases from external providers.
Suddenly you have a platform team
Somewhere between “cron + Redis” and now, you’ve acquired a dozen engineers whose full-time job is keeping this thing running.
As usage grows, everyone discovers a pile of caveats politely known as operational knowledge:
Kafka rebalancing stops all consumption, not just the affected partition.
Hot entities bypass your sharding. One viral user, one partition on fire (see the salting sketch just after this list).
Backfills compete with production for the same resources.
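For the hot-entity problem above, the usual workaround is key salting. A rough sketch, with the fan-out factor and key layout as assumptions:

```python
import random

HOT_ENTITY_SALTS = 8  # assumed fan-out for entities known to be hot

def write_keys(entity_id: str, hot: bool) -> list[str]:
    """Writers replicate a hot entity's features to every salted key."""
    if not hot:
        return [f"features:{entity_id}"]
    return [f"features:{entity_id}:{s}" for s in range(HOT_ENTITY_SALTS)]

def read_key(entity_id: str, hot: bool) -> str:
    """Readers pick one salted copy at random, spreading load across shards."""
    if not hot:
        return f"features:{entity_id}"
    return f"features:{entity_id}:{random.randrange(HOT_ENTITY_SALTS)}"
```

Which, of course, means you now also need a way to decide which entities are hot and keep that list fresh: more operational knowledge.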
The failure mode isn’t “down”—it’s “silently wrong for weeks”.[2]
Failure Modes at a Glance
Offline-online drift — Training data doesn’t match serving. Model behavior diverges.
Point-in-time leakage — Features include future data. Offline metrics look great, production fails.
Implementation drift — Same feature, N codebases, more than one behavior.
Freshness vs. complexity — Streaming adds operational burden that compounds.
Org sprawl — Multiple feature stores, unclear ownership, unpaid governance debt (you didn’t build ownership rules, deprecation policies, or access controls upfront—now you’re paying interest).
Simple and uniform is expensive — Different use cases require different infra setups; pushing everyone to use RAM is far from optimal.
Each of these is survivable in isolation. The problem is they compound.
The Hidden Work You Won’t Budget For
I’ve never seen a feature store roadmap that covers every single one of these:
On-call. Someone needs to wake up at 3am when the pipeline fails. Who?
Backfills. Something went wrong. You need to recompute three months of features. How long does that take? How do you validate correctness? Do you have a separate compute pool for them?
Debugging training-serving skew. The model’s predictions look different in production than they did offline. Is it the features? The model? Both? Good luck.
Ownership negotiations. Who owns the shared feature that three teams depend on? Who approves changes? Who’s on the hook when it breaks?
Billing and cost attribution. If you’re sharing infrastructure across teams, someone will eventually ask “why is my team paying for this?” Chargeback models add complexity and politics.
Handling timestamps and late data. When upstream data lands 6 hours late, what happens? Do you recompute? Do you version? Do you ignore it? Have you even instrumented ingestion lag (a minimal check is sketched after this list)? How do you migrate all existing use cases to the right approach?
Testing offline-online consistency. How do you prove that what you trained on matches what you serve? Not in theory—in practice, with monitoring and alerts.
Choosing and operating storage. The right database depends on your latency budget, data size, and ops capacity. Redis? RocksDB? DynamoDB? Each has tradeoffs you’ll discover in production.
Getting everyone on board. Do you have a list of all ML use cases in your company? Are you sure all of them need, say, RAM? Are you sure people need the low latency they claim is a “non-negotiable” requirement?
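The lag check mentioned above does not need to be sophisticated. A minimal sketch, where the six-hour SLO and the `alert` hook are placeholders for your own:

```python
from datetime import datetime, timezone

FRESHNESS_SLO_SECONDS = 6 * 3600  # assumed SLO: features at most 6 hours stale

def alert(message: str) -> None:
    # Placeholder: wire this into PagerDuty, Slack, or whatever pages your on-call.
    print(message)

def check_ingestion_lag(latest_materialized_event_ts: datetime) -> float:
    """Measure how far behind real time the materialized features are,
    and alert when the gap exceeds the SLO."""
    lag = (datetime.now(timezone.utc) - latest_materialized_event_ts).total_seconds()
    if lag > FRESHNESS_SLO_SECONDS:
        alert(f"feature ingestion lag {lag / 3600:.1f}h exceeds the "
              f"{FRESHNESS_SLO_SECONDS / 3600:.0f}h SLO")
    return lag
```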
The Two-Pizza Rule for Feature Platforms
If you cannot dedicate a two-pizza team (5-12 engineers) to own, maintain, and on-call the feature store permanently—do not build it.
Feature stores aren’t “set and forget” infrastructure. They’re living platforms that require constant attention as data volumes grow and compliance requirements shift.
A story I’m not proud of
One day I was pinged on Slack and within a few minutes I discovered that our Feast-based feature store had been silently failing for months.
The problem: Feast’s materialization loaded everything into memory before writing to the warehouse. We hit a BigQuery issue with large batch ingestion[3]. The job kept failing. Nobody noticed because the platform was “temporary”—we were supposed to migrate to another team’s solution, so monitoring was never prioritized.
Features for one critical model became stale. Not for days. For months.
The fix required rewriting the pipeline for partial ingestion, setting up push-based metrics (Kubernetes jobs can’t be scraped after they complete), and building the alerting that should have existed from day one.
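For the push-based metrics part, a minimal sketch using the official Prometheus Python client and a Pushgateway; the gateway address, job name, and feature view label are assumptions, not what we actually ran:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_materialization_success(feature_view: str) -> None:
    """Push a 'last success' timestamp at the end of the batch job, since a
    short-lived Kubernetes Job is gone before Prometheus can scrape it."""
    registry = CollectorRegistry()
    last_success = Gauge(
        "feature_materialization_last_success_timestamp",
        "Unix time of the last successful materialization run",
        ["feature_view"],
        registry=registry,
    )
    last_success.labels(feature_view=feature_view).set_to_current_time()
    push_to_gateway("pushgateway.monitoring:9091",
                    job="feature-materialization", registry=registry)

# Alerting is then a rule on time() - feature_materialization_last_success_timestamp.
report_materialization_success("user_clicks_7d")  # hypothetical feature view name
```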
None of this was technically hard. It was just work. Work nobody had planned for. I wish I could simply say it was incompetence on my part, but no: it was a deliberate decision not to invest time because the migration was just around the corner.
If you’re a big company where ML is core to your revenue, you will run into all of this. The work is no longer hidden—it’s explicit and planned for. You have the resources, and the ROI is there.
But what if you’re not at that scale? You might be building a platform you cannot afford to maintain. That’s not a failure of ambition—it’s a mismatch between investment and business need.
The Buy Trap
I’ve been hard on building. Let me be equally brutal about buying.
Buying a feature store doesn’t fix your data culture. If your warehouse is a swamp of undocumented tables and broken lineage, a vendor will let you pipe that sewage into your models faster.
Vendors also bring the Integration Tax
Your infra team will hate managing a black box they can’t debug
You’ll spend months writing glue code between the vendor’s API and your 2017 event bus
Pricing rarely aligns with your usage patterns—expect a frantic meeting about a sudden $10k/month cost spike
Buying trades code debt for integration debt. The advantage is that integration debt is usually cheaper to pay off than architectural mistakes and migrations.
The Opportunity Cost
When teams argue about build vs. buy, they usually frame it as:
“How many FTEs to build and maintain this vs. what would we pay a vendor?”
This is the wrong framing.
The real question is:
What revenue-driving work are your engineers NOT doing because they’re maintaining and developing the feature store?
Every engineer debugging a pipeline failure is an engineer not shipping the feature that moves your core metrics. Every hour spent on internal billing negotiations, vendor contract fights, migrations, or being on-call is an hour not spent improving model performance.
The cost isn’t just what you’re paying. It’s also what you’re not earning.
Ask yourself:
How core is this to how we make money?
If we buy instead of build, what would those engineers work on?
If we choose to build, how many engineers would we actually need?
What’s the actual ROI of owning this ourselves?
Know what you’re getting yourself into. In the next post, I’ll dig into the questions you need to ask yourself and the solutions that are available.
Acknowledgments
Big thanks to Kamaleshwar, Mantas and Andrey for reviewing the draft and providing valuable feedback.
Footnotes
1. A real case in my career.
2. Good read.
3. https://github.com/feast-dev/feast/issues/3982 - specifically this error message.