Machine Learning Model Deployment: From Jupyter to Production

Q: How do you deploy a machine learning model to production?

The standard production ML deployment process is: (1) Serialise the trained model (pickle, joblib, or ONNX format). (2) Wrap it in a REST API using FastAPI or Flask that accepts input features, preprocesses them, runs inference, and returns predictions. (3) Containerise the API with Docker for reproducibility. (4) Deploy to cloud infrastructure (AWS ECS, Kubernetes, or a managed service like SageMaker). (5) Add monitoring for prediction latency, error rates, and model drift.

Q: What is model drift and how do you detect it?

Model drift occurs when the statistical properties of the production data diverge from the training data, causing prediction accuracy to degrade over time. Data drift (input feature distribution changes) can be detected by monitoring feature statistics (mean, standard deviation, distribution shape) against the training baseline. Concept drift (the relationship between features and the target changes) requires ground truth labels to detect, so monitor prediction accuracy metrics as labels come in. Tools like Evidently AI, WhyLogs, and Arize AI automate drift detection.

Last updated: 2026-05-23

A model that only lives in a Jupyter notebook creates no business value. Here's the complete process for taking an ML model from experiment to production API — with real code examples.

By SpiderHunts Technologies · 23 May 2026 · 13 min read

TL;DR

ML model deployment has 5 stages: serialise → wrap in API → containerise → deploy → monitor
FastAPI + Docker is the standard production stack for Python ML models
Feature preprocessing must be serialised alongside the model — not just the model weights
Monitoring for model drift is as important as uptime monitoring — models degrade silently
Batch inference (overnight jobs) is simpler than real-time inference — start there if latency doesn't matter

The Deployment Pipeline Overview

Stage	What Happens	Tools
1. Model serialisation	Save trained model + preprocessors to disk	joblib, pickle, ONNX, MLflow
2. API wrapper	Build REST endpoint to serve predictions	FastAPI, Flask, BentoML
3. Containerisation	Package API + model into portable Docker image	Docker, Docker Compose
4. Cloud deployment	Run container in scalable cloud infrastructure	AWS ECS, Kubernetes, SageMaker, Railway
5. Monitoring	Track accuracy, latency, data drift, feature drift	Evidently AI, WhyLogs, Prometheus, Grafana

Stage 1: Model Serialisation

A common mistake in ML deployment is serialising only the trained model and forgetting the preprocessing pipeline (scalers, encoders, imputers), which must be saved and loaded along with it. If your training data was scaled with StandardScaler, production data must be transformed by that same fitted scaler, not by a new one fitted on production data.
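To make the scaler point concrete, here is a minimal sketch in plain Python (no scikit-learn) that standardises one production value two ways: with parameters fitted on training data, and with parameters wrongly refitted on production data. The spend figures and helper names are invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical monthly_spend values; invented for illustration only.
train_spend = [20.0, 30.0, 40.0, 50.0, 60.0]
prod_spend = [60.0, 70.0, 80.0, 90.0, 100.0]  # production has shifted upwards

def fit_scaler(values):
    """Return the (mean, std) that a standard scaler would learn."""
    return mean(values), pstdev(values)

def transform(value, params):
    """Standardise one value with previously fitted parameters."""
    mu, sigma = params
    return (value - mu) / sigma

train_params = fit_scaler(train_spend)  # what should be saved with the model
refit_params = fit_scaler(prod_spend)   # the mistake: fitting again in production

correct = transform(80.0, train_params)  # scaled the way the model saw data in training
wrong = transform(80.0, refit_params)    # looks "average" and hides the shift

print(f"with training scaler: {correct:.2f}")
print(f"with refit scaler:    {wrong:.2f}")
```

The same spend of 80 maps to roughly 2.83 with the training scaler and to 0.0 with the refit scaler, so the model would receive a very different input for the same customer.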

Save model + preprocessor together

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# Build pipeline — preprocessing + model in one object
pipeline = Pipeline([
 ('scaler', StandardScaler()),
 ('classifier', GradientBoostingClassifier(n_estimators=200))
])

pipeline.fit(X_train, y_train)

# Save the entire pipeline (scaler params are saved with it)
joblib.dump(pipeline, 'model_pipeline.joblib')

# Load in production — scaler params are restored
pipeline = joblib.load('model_pipeline.joblib')
prediction = pipeline.predict(new_data)

Stage 2: REST API with FastAPI

FastAPI is the standard choice for Python ML model APIs — async-native, auto-validated inputs via Pydantic, and auto-generated OpenAPI documentation.

FastAPI prediction endpoint

from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
pipeline = joblib.load('model_pipeline.joblib')  # loaded once at startup

class ChurnInput(BaseModel):
    tenure_months: int
    monthly_spend: float
    support_tickets_90d: int
    last_login_days_ago: int

class ChurnPrediction(BaseModel):
    churn_probability: float
    churn_predicted: bool

# Plain def: FastAPI runs it in a threadpool, so blocking inference
# does not stall the event loop.
@app.post("/predict/churn", response_model=ChurnPrediction)
def predict_churn(data: ChurnInput):
    # Feature order must match the column order used in training
    features = np.array([[
        data.tenure_months,
        data.monthly_spend,
        data.support_tickets_90d,
        data.last_login_days_ago
    ]])
    # Cast NumPy scalars to built-in types before building the response
    prob = float(pipeline.predict_proba(features)[0][1])
    return ChurnPrediction(
        churn_probability=round(prob, 4),
        churn_predicted=prob > 0.5
    )

Stage 3: Docker Containerisation

Docker ensures your model runs identically on any machine — dev, staging, and production. The Dockerfile captures Python version, library versions, and the model file.

Dockerfile for ML API

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model_pipeline.joblib .
COPY main.py .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
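Once the container is running, a quick smoke test confirms the endpoint answers. The sketch below uses only the standard library and assumes the container was started locally with port 8000 published (for example `docker run -p 8000:8000 your-image`, where `your-image` is a placeholder). The helper names and payload values are made up; only the path and field names come from the API above.

```python
import json
import urllib.request

def build_predict_request(base_url, payload):
    """Build (but do not send) a POST request for the churn endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict/churn",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def smoke_test(base_url, payload, timeout=5.0):
    """Send one prediction request and return the decoded JSON response."""
    request = build_predict_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Example payload; the values are invented.
sample = {
    "tenure_months": 14,
    "monthly_spend": 42.5,
    "support_tickets_90d": 3,
    "last_login_days_ago": 21,
}
request = build_predict_request("http://localhost:8000", sample)
print(request.get_method(), request.full_url)
```

With the container up, `smoke_test("http://localhost:8000", sample)` should return a dictionary containing `churn_probability` and `churn_predicted`. Running it as a post-deploy step catches a missing model file or a broken dependency before real traffic arrives.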

Stage 4: Cloud Deployment Options

Option	Monthly Cost	Complexity	Best For
Railway / Render	£5–£50	Low	Internal tools, prototypes, low traffic
AWS ECS (Fargate)	£30–£200	Medium	Production APIs, auto-scaling, AWS ecosystem
AWS SageMaker Endpoints	£80–£500+	Medium	Managed ML serving, A/B testing, model registry
Kubernetes (EKS/GKE)	£200–£2,000+	High	Multiple models, enterprise scale, custom infra
AWS Lambda (serverless)	Pay per call (near £0 at low volume)	Medium	Low-volume, sporadic inference; small models only

Stage 5: Monitoring and Model Drift

ML models degrade silently. Unlike a crashed API (immediately visible), a model that starts returning wrong predictions can go unnoticed for weeks. You need two layers of monitoring:

Infrastructure Monitoring

API latency (P50, P95, P99)
Error rates (5xx responses)
Request throughput
Memory / CPU utilisation
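As a rough illustration of the latency percentiles listed above, the following sketch computes P50, P95 and P99 from a list of request timings using the standard library. In practice a metrics system such as Prometheus would normally do this for you. The function name and sample data are assumptions.

```python
from statistics import quantiles

def latency_percentiles(latencies_ms):
    """Return P50, P95 and P99 for a list of request latencies in milliseconds."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Made-up sample: mostly fast requests with a slow tail.
sample_ms = [12.0] * 90 + [40.0] * 8 + [250.0] * 2
stats = latency_percentiles(sample_ms)
print(stats)
```

The mean of this sample is 19 ms, which hides the 250 ms tail that P99 exposes. That is why tail percentiles are tracked rather than averages.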

Model Quality Monitoring

Prediction distribution drift
Feature distribution drift
Accuracy (when labels available)
Confidence score distribution
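One way to quantify feature or prediction distribution drift is the Population Stability Index (PSI), which compares binned shares of a baseline sample against a current sample. The sketch below is a simplified pure-Python version, not what any particular monitoring tool implements. The equal-width binning, the 1e-6 floor and the sample data are assumptions.

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the training range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = bin_shares(baseline), bin_shares(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Made-up feature values: the shifted sample simulates drift.
train = [float(x % 50) for x in range(1000)]
same = [float(x % 50) for x in range(500)]
shifted = [float(x % 50) + 20.0 for x in range(500)]

print(f"no drift: {psi(train, same):.3f}")
print(f"drifted:  {psi(train, shifted):.3f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though the right threshold depends on the feature and the bin count. The same function can be applied to logged prediction scores to watch for prediction distribution drift.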

Log predictions for drift monitoring

from datetime import datetime, timezone

# Extends the Stage 2 endpoint; app, pipeline, np and the models are defined there.
@app.post("/predict/churn", response_model=ChurnPrediction)
def predict_churn(data: ChurnInput):
    features = np.array([[...]])  # same feature array as in Stage 2
    prob = float(pipeline.predict_proba(features)[0][1])

    # Log every prediction for drift monitoring
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_features": data.model_dump(),  # Pydantic v2; use data.dict() on v1
        "prediction": prob,
        "model_version": "v1.2.0"
    }
    # Write to your monitoring store (S3, BigQuery, etc.).
    # prediction_logger is a placeholder for your own logging client.
    prediction_logger.log(log_entry)

    return ChurnPrediction(churn_probability=round(prob, 4), churn_predicted=prob > 0.5)
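The `prediction_logger` object in the listing above is left undefined. As one possible stand-in, the sketch below appends each entry as a line of JSON to a local file, which a later job could ship to S3 or BigQuery. The class name, file path and sample entries are hypothetical.

```python
import json
import tempfile
import threading
from pathlib import Path

class JsonLinesPredictionLogger:
    """Append one JSON object per prediction to a local file."""

    def __init__(self, path):
        self._path = Path(path)
        self._lock = threading.Lock()  # serialise writes from concurrent requests

    def log(self, entry):
        line = json.dumps(entry, sort_keys=True)
        with self._lock, self._path.open("a", encoding="utf-8") as handle:
            handle.write(line + "\n")

    def read_all(self):
        """Load every logged entry back, e.g. for an offline drift job."""
        with self._path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]

# Demo with a temporary file and invented entries.
log_path = Path(tempfile.mkdtemp()) / "predictions.jsonl"
prediction_logger = JsonLinesPredictionLogger(log_path)
prediction_logger.log({"prediction": 0.81, "model_version": "v1.2.0"})
prediction_logger.log({"prediction": 0.12, "model_version": "v1.2.0"})
entries = prediction_logger.read_all()
print(len(entries))
```

A local file is only a starting point: it does not survive container restarts unless the path is on a mounted volume, so a production setup would more likely write to a managed store or a log pipeline.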

Batch vs Real-Time Inference

Factor	Batch Inference	Real-Time Inference
When predictions run	Scheduled (nightly, weekly)	On-demand (milliseconds)
Infrastructure needed	Scheduled script or Airflow job	Always-on REST API server
Complexity	Lower	Higher
Cost	Cheap (runs briefly, then stops)	Higher (always running)
Example use cases	Overnight churn scores, weekly demand forecasts	Fraud detection, real-time recommendations
When to start	Default choice if latency > 1 hour is acceptable	Only when instant predictions are required
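To show how little infrastructure batch inference needs, here is a sketch of a scheduled scoring job: read rows from a CSV, score them in chunks, write the results. It is deliberately model-agnostic. `fake_predict` stands in for a call to the saved pipeline, and the `customer_id` column, function names and data are assumptions.

```python
import csv
import io

def score_batch(rows, predict_fn, batch_size=1000):
    """Yield (row, score) pairs, calling predict_fn on chunks of rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from zip(batch, predict_fn(batch))
            batch = []
    if batch:  # final partial chunk
        yield from zip(batch, predict_fn(batch))

def run_batch_job(in_file, out_file, predict_fn, batch_size=1000):
    """Read customers from CSV, score them, and write customer_id + score."""
    reader = csv.DictReader(in_file)
    writer = csv.writer(out_file)
    writer.writerow(["customer_id", "churn_probability"])
    count = 0
    for row, score in score_batch(reader, predict_fn, batch_size):
        writer.writerow([row["customer_id"], f"{score:.4f}"])
        count += 1
    return count

# Stand-in for pipeline.predict_proba; a real job would load the saved pipeline.
def fake_predict(batch):
    return [min(int(r["support_tickets_90d"]) / 10, 1.0) for r in batch]

source = io.StringIO(
    "customer_id,support_tickets_90d\n"
    "c1,0\nc2,5\nc3,12\n"
)
sink = io.StringIO()
scored = run_batch_job(source, sink, fake_predict, batch_size=2)
print(sink.getvalue())
```

In a real job the in-memory buffers would be replaced with opened files, and the script would be triggered by cron or an Airflow task. Chunking keeps memory use flat regardless of how many rows the input contains.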

Frequently Asked Questions

How do you deploy a machine learning model to production?

The standard process is: (1) Serialise the trained model with joblib/pickle. (2) Wrap it in a REST API using FastAPI. (3) Containerise with Docker. (4) Deploy to cloud infrastructure (AWS ECS, Kubernetes, or SageMaker). (5) Add monitoring for prediction latency, error rates, and model drift.

What is model drift and how do you detect it?

Model drift occurs when production data diverges from training data, causing accuracy to degrade silently. Data drift (feature distribution changes) can be detected by monitoring feature statistics against the training baseline. Concept drift (the relationship between features and target changes) requires monitoring prediction accuracy as ground truth labels come in. Tools like Evidently AI and WhyLogs automate drift detection.
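For the concept drift case, a minimal sketch of accuracy monitoring is to join logged predictions with labels as they arrive and compute accuracy per time window. The record keys (`id`, `window`, `prediction`), the 0.5 threshold and the sample data are assumptions for illustration.

```python
def accuracy_by_window(predictions, labels, threshold=0.5):
    """Compute accuracy per window for predictions that already have a label.

    predictions: dicts with hypothetical keys "id", "window", "prediction".
    labels: mapping of id -> true outcome (bool), possibly incomplete.
    """
    hits, totals = {}, {}
    for p in predictions:
        if p["id"] not in labels:
            continue  # label has not arrived yet
        predicted = p["prediction"] > threshold
        totals[p["window"]] = totals.get(p["window"], 0) + 1
        hits[p["window"]] = hits.get(p["window"], 0) + (predicted == labels[p["id"]])
    return {w: hits[w] / totals[w] for w in sorted(totals)}

# Made-up log entries and late-arriving labels.
logged = [
    {"id": 1, "window": "2026-W01", "prediction": 0.9},
    {"id": 2, "window": "2026-W01", "prediction": 0.2},
    {"id": 3, "window": "2026-W02", "prediction": 0.8},
    {"id": 4, "window": "2026-W02", "prediction": 0.7},
    {"id": 5, "window": "2026-W02", "prediction": 0.1},  # no label yet
]
truth = {1: True, 2: False, 3: False, 4: True}

weekly = accuracy_by_window(logged, truth)
print(weekly)
```

A sustained drop between windows, while input feature statistics stay stable, is the pattern that points to concept drift rather than data drift.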

What is the difference between batch and real-time ML inference?

Batch inference runs the model on a large dataset at scheduled intervals — suitable for predictions that don't need to be instant (churn scores overnight, demand forecasts weekly). Real-time inference runs the model on demand as each request arrives — required for fraud detection at point of transaction or recommendations during a web session. Start with batch inference unless instant predictions are a genuine requirement.

Need to Deploy Your ML Model to Production?

We handle the full MLOps stack — from model serialisation to REST API, Docker, cloud deployment, and monitoring. Book a free consultation to discuss your model's production requirements.
