Ever wonder why your beautifully trained machine learning model works perfectly in your Jupyter notebook but completely falls apart at 3 AM when it’s actually serving production traffic? You’re not alone. Most ML teams discover the hard way that the model code itself is only a small fraction of a real ML system, about 5% by a commonly cited estimate. The other 95% is infrastructure, data pipelines, monitoring, and a thousand things that can break in spectacularly creative ways.
In this episode, we’re diving deep into what it actually takes to build a machine learning platform that doesn’t crumble under pressure. We’re not talking high-level fluff here. This is a technical walkthrough of how companies like Netflix, Uber, and Airbnb designed their ML infrastructure to handle billions of predictions without falling over.
We’ll break down the three critical pipelines every ML platform needs: data management, model training, and production deployment. You’ll learn why training-serving skew, where the features a model sees in production differ from the ones it was trained on, is one of the most insidious bugs in ML systems, and how Google Play boosted its app install rate by 2% just by fixing it. We’ll explore why experiment tracking isn’t optional if you want any hope of reproducing your results, and how platforms like MLflow became the version control system for machine learning.
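To make training-serving skew concrete, here is a minimal sketch of one common guard: keep the feature logic in a single function shared by the training job and the serving path, then compare features logged at serving time against features recomputed offline. The function names, the feature definitions, and the sample records are all hypothetical, chosen only for illustration.

```python
import math


def build_features(raw):
    """Single source of truth for feature logic (hypothetical features).

    Both the training job and the serving path are assumed to import this
    function, so the two code paths cannot silently diverge.
    """
    return [math.log1p(raw["installs"]), raw["rating"] / 5.0]


def max_feature_skew(serving_logged, training_recomputed):
    """Largest absolute difference between logged and recomputed features."""
    worst = 0.0
    for served, trained in zip(serving_logged, training_recomputed):
        for s, t in zip(served, trained):
            worst = max(worst, abs(s - t))
    return worst


# Illustrative records, not real data.
records = [{"installs": 1200, "rating": 4.5}, {"installs": 30, "rating": 3.0}]

training_features = [build_features(r) for r in records]
serving_features = [build_features(r) for r in records]

# A serving path that reimplemented the logic and forgot to normalize rating.
skewed_serving = [[f[0], r["rating"]] for f, r in zip(serving_features, records)]

clean_skew = max_feature_skew(serving_features, training_features)
broken_skew = max_feature_skew(skewed_serving, training_features)
```

With the shared function the measured skew is zero, while the reimplemented path shows a large gap that an alert threshold could catch.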
But here’s where it gets interesting. For every component we discuss, we’re going to look at four approaches: the naive “bad” approach that everyone tries first, the “medium” approach that’s getting warmer, the “good” approach where things start working properly, and the “very good” approach you aim for when you need bulletproof systems.
We’ll cover the infrastructure nobody talks about until it breaks: how to orchestrate distributed training across GPU clusters, how hyperparameter tuning platforms like Kubeflow’s Katib can try hundreds of model configurations in parallel using Bayesian optimization, and why model registries are the bridge between your experimentation chaos and production reliability.
You’ll learn about canary deployments and how to roll out new models to 10% of traffic before betting the farm. We’ll talk about monitoring for data drift, because the world changes and yesterday’s perfect model becomes today’s garbage predictor. And we’ll discuss the fault tolerance patterns that let Netflix process trillions of events daily without the whole system collapsing when individual components fail.
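As a rough illustration of the canary idea, the sketch below routes a fixed share of requests to a new model by hashing a request ID into one of 100 buckets. This is one simple way to split traffic, not how any particular serving platform does it; `route_model`, the "canary" and "stable" labels, and the request ID format are assumptions made for the example.

```python
import hashlib

CANARY_PERCENT = 10  # share of traffic sent to the candidate model


def route_model(request_id, canary_percent=CANARY_PERCENT):
    """Deterministically pick 'canary' or 'stable' for a request.

    Hashing the ID (rather than calling random) means the same request
    or user always lands on the same model version.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


def canary_share(request_ids, canary_percent=CANARY_PERCENT):
    """Fraction of the given requests that would hit the canary."""
    hits = sum(route_model(r, canary_percent) == "canary" for r in request_ids)
    return hits / len(request_ids)


# Hypothetical request IDs, used only to check the split.
sample_ids = [f"req-{i}" for i in range(10_000)]
observed_share = canary_share(sample_ids)
```

Raising `canary_percent` step by step widens the rollout, and setting it to 0 acts as an instant rollback to the stable model.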
This isn’t for people looking for a gentle introduction to machine learning. This is for engineers in the trenches who need to understand how to build ML infrastructure that scales, how to debug models that mysteriously underperform in production, and how to set up systems that won’t require you to manually babysit every training run at 2 AM.
Whether you’re building your first ML platform from scratch or trying to figure out why your current system keeps catching fire, this episode will give you the architectural patterns and war stories you need to build something that actually works.
Let’s get into it.
References
[1] Sculley, D., Holt, G., Golovin, D., et al. (2015). “Hidden Technical Debt in Machine Learning Systems.” *Proceedings of NIPS 2015*. https://papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
[2] “MLOps as the Remedy to Tech Debt in Machine Learning.” Alectio Blog. https://alectio.com/2023/03/26/mlops-as-the-remedy-to-tech-debt-in-machine-learning/
[3] “MLOps-Reducing the technical debt of Machine Learning.” MLOps Community. https://medium.com/mlops-community/mlops-reducing-the-technical-debt-of-machine-learning-dac528ef39de
[4] “MLOps: Continuous delivery and automation pipelines in machine learning.” Google Cloud Architecture Center. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
[5] “Top End to End MLOps Platforms and Tools in 2024.” JFrog ML. https://www.qwak.com/post/top-mlops-end-to-end
[6] Rustamy, F. “Machine Learning Platforms Using Kubeflow.” Medium. https://medium.com/@faheemrustamy/machine-learning-platforms-using-kubeflow-a0a9be98f57f
[7] “Architecture | Kubeflow.” Kubeflow Documentation. https://www.kubeflow.org/docs/started/architecture/
[8] “Automating Machine Learning Pipelines on Kubernetes with Kubeflow.” IOD Blog. https://iamondemand.com/blog/automating-machine-learning-pipelines-on-kubernetes-with-kubeflow/
[9] “MLflow: A Unified Platform for Experiment Tracking and Model Management.” Medium. https://medium.com/@pi_45757/mlflow-a-unified-platform-for-experiment-tracking-and-model-management-13dd8b8356db
[10] “MLflow Tracking.” MLflow Documentation. https://mlflow.org/docs/latest/ml/tracking/
[11] “How to Build an End-To-End ML Pipeline.” Neptune.ai Blog. https://neptune.ai/blog/building-end-to-end-ml-pipeline
[12] “MLOps Architecture Guide.” Neptune.ai Blog. https://neptune.ai/blog/mlops-architecture-guide
[13] “The Evolution of the Machine Learning Platform.” Scribd Technology Blog. https://tech.scribd.com/blog/2024/evolution-of-mlplatform.html
[14] “Challenges of building high performance data pipelines for big data analytics.” Eyer.ai Blog. https://www.eyer.ai/blog/challenges-of-building-high-performance-data-pipelines-for-big-data-analytics/
[15] “Industry Spotlight - Engineering the AI Factory: Inside Netflix’s AI Infrastructure (Part 3).” Vamsi Talks Tech. https://www.vamsitalkstech.com/ai/industry-spotlight-engineering-the-ai-factory-inside-netflixs-ai-infrastructure-part-3/
[16] “Machine Learning Infrastructure.” LinkedIn Engineering. https://engineering.linkedin.com/teams/data/data-infrastructure/machine-learning-infrastructure
[17] “Model Deployment Strategies: Discover How to Boost your ML Deployment Success.” Medium. https://medium.com/@juanc.olamendy/model-deployment-strategies-discover-how-to-boost-your-ml-deployment-success-d82b320ac118
[18] “They Handle 500B Events Daily. Here’s Their Data Engineering Architecture.” Monte Carlo Data Blog. https://www.montecarlodata.com/blog-data-engineering-architecture/
[19] “What Is a Feature Store?” Tecton Blog. https://www.tecton.ai/blog/what-is-a-feature-store/
[20] “Top 3 Feature Stores To Ease Feature Management in Machine Learning.” Censius Blog. https://censius.ai/blogs/top-3-feature-stores-to-ease-feature-management-in-machine-learning
[21] “What is training-serving skew in Machine Learning?” JFrog ML Blog. https://www.qwak.com/post/training-serving-skew-in-machine-learning
[22] “Monitor models for training-serving skew with Vertex AI.” Google Cloud Blog. https://cloud.google.com/blog/topics/developers-practitioners/monitor-models-training-serving-skew-vertex-ai
[23] “Meet Michelangelo: Uber’s Machine Learning Platform.” Uber Engineering Blog. https://www.uber.com/blog/michelangelo-machine-learning-platform/
[24] “Open sourcing Feathr – LinkedIn’s feature store for productive machine learning.” LinkedIn Engineering Blog. https://engineering.linkedin.com/blog/2022/open-sourcing-feathr---linkedin-s-feature-store-for-productive-m
[25] “Getting started with Kubeflow Pipelines.” Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/getting-started-kubeflow-pipelines
[26] “Experiment Tracking with MLflow in 10 Minutes.” Towards Data Science. https://towardsdatascience.com/experiment-tracking-with-mlflow-in-10-minutes-f7c2128b8f2c/
[27] “Demystifying MLflow: A Hands-on Guide to Experiment Tracking and Model Registry.” Medium. https://dspatil.medium.com/demystifying-mlflow-a-hands-on-guide-to-experiment-tracking-and-model-registry-d99b6bfd1bda
[28] “Machine Learning (ML) Orchestration on Kubernetes using Kubeflow.” InfraCloud Blog. https://www.infracloud.io/blogs/machine-learning-orchestration-kubernetes-kubeflow/
[29] “Kubeflow: Architecture, Tutorial, and Best Practices.” Komodor Learn. https://komodor.com/learn/kubeflow-architecture-tutorial-and-best-practices/
[30] “Overview | Kubeflow.” Kubeflow Training Documentation. https://www.kubeflow.org/docs/components/training/overview/
[31] “GitHub - kubeflow/trainer: Distributed ML Training and Fine-Tuning on Kubernetes.” GitHub. https://github.com/kubeflow/trainer
[32] “An overview for Katib.” Kubeflow Documentation. https://www.kubeflow.org/docs/components/katib/overview/
[33] “Kubeflow Part 4: AutoML Experimentation in Kubeflow Using Katib.” Invisibl Blog. https://invisibl.io/blog/kubeflow-automl-experimentation-katib-kubernetes-mlops/
[34] “Hyperparameter optimization - Wikipedia.” Wikipedia. https://en.wikipedia.org/wiki/Hyperparameter_optimization
[35] “Kubeflow 1.9: New Tools for Model Management and Training Optimization.” Kubeflow Blog. https://blog.kubeflow.org/kubeflow-1.9-release/
[36] “MLflow Model Registry | MLflow.” MLflow Documentation. https://mlflow.org/docs/latest/ml/model-registry/
[37] “KServe | MLServer.” MLServer Documentation. https://docs.seldon.ai/mlserver/user-guide/deployment/kserve
[38] “Machine Learning Model Serving Tools Comparison - KServe, Seldon Core, BentoML.” Xebia Blog. https://xebia.com/blog/machine-learning-model-serving-tools-comparison-kserve-seldon-core-bentoml/
[39] “Best Tools For ML Model Serving.” Neptune.ai Blog. https://neptune.ai/blog/ml-model-serving-best-tools
[40] “Machine Learning Model Serving Overview (Seldon Core, KFServing, BentoML, MLFlow).” Medium. https://medium.com/israeli-tech-radar/machine-learning-model-serving-overview-c01a6aa3e823
[41] “Building A Declarative Real-Time Feature Engineering Framework.” DoorDash Engineering Blog. https://careersatdoordash.com/blog/building-a-declarative-real-time-feature-engineering-framework/
[42] “How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions.” KDnuggets. https://www.kdnuggets.com/2019/08/linkedin-uber-lyft-airbnb-netflix-solving-data-management-discovery-machine-learning-solutions.html
[43] “TensorFlow Extended (TFX) for data validation in practice.” Sarus Blog. https://medium.com/sarus/tensorflow-extended-tfx-for-data-validation-in-practice-2e6f061753c0
[44] “Validating Data in a Production Pipeline: The TFX Way.” Towards Data Science. https://towardsdatascience.com/validating-data-in-a-production-pipeline-the-tfx-way-9770311eb7ce/
[45] “TensorFlow Extended: Data Validation and Transform.” O’Reilly Live Events. https://www.oreilly.com/live-events/tensorflow-extended-data-validation-and-transform/0636920251866/0636920251859/
[46] “MLflow Tracking | MLflow.” MLflow Documentation. https://mlflow.org/docs/latest/ml/tracking/
[47] “MLOps Part 2: Advanced Experiment Tracking and Model Management in MLflow.” Medium. https://drlee.io/mlops-part-2-advanced-experiment-tracking-and-model-management-in-mlflow-1ca25dc2c1a7
[48] “Introduction to MLflow: Tracking, Models, and Projects.” Medium. https://medium.com/@laoluoyefolu/introduction-to-mlflow-tracking-models-and-projects-a84c4cac2335
[49] “DISTRIBUTED TRAINING IN MLOPS: Accelerate MLOps with Distributed Computing for Scalable Machine Learning.” MLOps Community. https://mlops.community/distributed-training-in-mlops-accelerate-mlops-with-distributed-computing-for-scalable-machine-learning/
[50] “What is Kubeflow?” Red Hat Topics. https://www.redhat.com/en/topics/cloud-computing/what-is-kubeflow
[51] “A Comprehensive Comparison Between Kubeflow and Airflow.” Valohai Blog. https://valohai.com/blog/kubeflow-vs-airflow/
[52] “Kubeflow vs Airflow - Which is Better For Your Business?” Hevo Learn. https://hevodata.com/learn/kubeflow-vs-airflow/
[53] “Orchestrator for ML Pipelines — Vertex AI Pipelines (Kubeflow) vs. Apache Airflow.” Medium. https://medium.com/@saeedhajebi/orchestrator-for-ml-pipelines-vertex-ai-pipelines-kubeflow-vs-apache-airflow-b4af94671c74
[54] “Why We Switched Our Data Orchestration Service.” Spotify Engineering Blog. https://engineering.atspotify.com/2022/03/why-we-switched-our-data-orchestration-service
[55] “The Winding Road to Better Machine Learning Infrastructure Through Tensorflow Extended and Kubeflow.” Spotify Engineering Blog. https://engineering.atspotify.com/2019/12/the-winding-road-to-better-machine-learning-infrastructure-through-tensorflow-extended-and-kubeflow
[56] “Building Robust ML Systems: A Guide to Fault-Tolerant Machine Learning.” Medium. https://medium.com/@hybrid.minds/building-robust-ml-systems-a-guide-to-fault-tolerant-machine-learning-f4765d23a51d
[57] “kfp.dsl package — Kubeflow Pipelines documentation.” Kubeflow Pipelines Docs. https://kubeflow-pipelines.readthedocs.io/en/1.8.16/source/kfp.dsl.html
[58] “AutoML | Hyperparameter Optimization.” AutoML.org. https://www.automl.org/hpo-overview/
[59] “Katib Architecture | Kubeflow.” Kubeflow Documentation. https://www.kubeflow.org/docs/components/katib/reference/architecture/
[60] “Bayesian Optimization - Hyperparameter tuning for TensorFlow using Katib and Kubeflow.” TFWorld Katib Tutorial. https://tfworldkatib.github.io/tutorial/katib/bayesian.html
[61] “DoorDash’s ML Platform - The Beginning.” DoorDash Engineering Blog. https://doordash.engineering/2020/04/23/doordash-ml-platform-the-beginning/
[62] “Day 60/100: Canary Deployments and A/B Testing – Safer, Smarter Model Rollouts.” Medium. https://medium.com/@sebuzdugan/day-60-100-canary-deployments-and-a-b-testing-safer-smarter-model-rollouts-d9245042baf9
[63] “KServe vs Seldon Core Comparison.” Superwise AI Blog. https://superwise.ai/blog/kserve-vs-seldon-core/
[64] “Machine learning model monitoring: Best practices.” Datadog Blog. https://www.datadoghq.com/blog/ml-model-monitoring-in-production-best-practices/
[65] “What is data drift in ML, and how to detect and handle it.” Evidently AI Blog. https://www.evidentlyai.com/ml-in-production/data-drift
[66] “Fault Tolerance in a High Volume, Distributed System.” Netflix Tech Blog. http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
[67] “A/B Testing, Canary and Shadow deployments for ML models.” LinkedIn. https://www.linkedin.com/pulse/ab-testing-canary-shadow-deployments-ml-models-qwak-com
[68] “Building Robust ML Systems: A Guide to Fault-Tolerant Machine Learning.” Medium. https://medium.com/@hybrid.minds/building-robust-ml-systems-a-guide-to-fault-tolerant-machine-learning-f4765d23a51d
[69] “Challenges of building high performance data pipelines for big data analytics.” Eyer.ai Blog. https://www.eyer.ai/blog/challenges-of-building-high-performance-data-pipelines-for-big-data-analytics/
[70] “Production ML systems: Monitoring pipelines.” Google Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/production-ml-systems/monitoring
[71] “Fault-tolerant Distributed Training with torchrun — PyTorch Tutorials.” PyTorch Documentation. https://docs.pytorch.org/tutorials/beginner/ddp_series_fault_tolerance.html
[72] “A Study of Checkpointing in Large Scale Training of Deep Neural Networks.” arXiv. https://arxiv.org/pdf/2012.00825
[73] “Distributed Checkpoint: Efficient checkpointing in large-scale jobs.” PyTorch Blog. https://pytorch.org/blog/distributed-checkpoint-efficient-checkpointing-in-large-scale-jobs/
[74] “GitHub - intelligent-machine-learning/dlrover: DLRover: An Automatic Distributed Deep Learning System.” GitHub. https://github.com/intelligent-machine-learning/dlrover
[75] “MLREL-11: Use an appropriate deployment and testing strategy.” AWS Machine Learning Lens. https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlrel-11.html
[76] “Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow.” Morioh. https://morioh.com/p/874199991459
[77] “How Netflix Uses Fault Injection To Truly Understand Their Resilience.” Coralogix Blog. https://coralogix.com/blog/how-netflix-uses-fault-injection-to-truly-understand-their-resilience/
[78] “MLflow Model Registry: Workflows, Benefits & Challenges.” lakeFS Blog. https://lakefs.io/blog/mlflow-model-registry/
[79] “Challenges of building high performance data pipelines for big data analytics.” Eyer.ai Blog. https://www.eyer.ai/blog/challenges-of-building-high-performance-data-pipelines-for-big-data-analytics/
[80] “Model Drift & Machine Learning: Concept Drift, Feature Drift, Etc.” Arize AI. https://arize.com/model-drift/
[81] “Identifying drift in ML models: Best practices for generating consistent, reliable responses.” Microsoft Tech Community. https://techcommunity.microsoft.com/blog/fasttrackforazureblog/identifying-drift-in-ml-models-best-practices-for-generating-consistent-reliable/4040531
[82] “Netflix Hystrix - Latency and Fault Tolerance for Complex Distributed Systems.” InfoQ. https://www.infoq.com/news/2012/12/netflix-hystrix-fault-tolerance/
[83] “How to build an ML platform? Lessons from 10 tech companies.” Evidently AI Blog. https://www.evidentlyai.com/blog/how-to-build-ml-platform
[84] “Architecture for MLOps using TensorFlow Extended, Vertex AI Pipelines, and Cloud Build.” Google Cloud Architecture Center. https://cloud.google.com/architecture/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build












