Machine Learning Model Deployment: From Jupyter to Production

A model that only lives in a Jupyter notebook creates no business value. Here's the complete process for taking an ML model from experiment to production API — with real code examples.

By SpiderHunts Technologies  ·  23 May 2026  ·  13 min read

TL;DR

  • ML model deployment has 5 stages: serialise → wrap in API → containerise → deploy → monitor
  • FastAPI + Docker is a widely used production stack for Python ML models
  • Feature preprocessing must be serialised alongside the model — not just the model weights
  • Monitoring for model drift is as important as uptime monitoring — models degrade silently
  • Batch inference (overnight jobs) is simpler than real-time inference — start there if latency doesn't matter

The Deployment Pipeline Overview

Stage | What Happens | Tools
1. Model serialisation | Save trained model + preprocessors to disk | joblib, pickle, ONNX, MLflow
2. API wrapper | Build REST endpoint to serve predictions | FastAPI, Flask, BentoML
3. Containerisation | Package API + model into portable Docker image | Docker, Docker Compose
4. Cloud deployment | Run container in scalable cloud infrastructure | AWS ECS, Kubernetes, SageMaker, Railway
5. Monitoring | Track accuracy, latency, data drift, feature drift | Evidently AI, WhyLogs, Prometheus, Grafana

Stage 1: Model Serialisation

The most common mistake in ML deployment is serialising only the model weights — forgetting that the preprocessing pipeline (scalers, encoders, imputers) must also be saved and loaded identically. If your training data was scaled with StandardScaler, production data must be scaled with the same fitted scaler, not a new one.

Save model + preprocessor together

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# Build pipeline — preprocessing + model in one object
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingClassifier(n_estimators=200))
])

pipeline.fit(X_train, y_train)

# Save the entire pipeline (scaler params are saved with it)
joblib.dump(pipeline, 'model_pipeline.joblib')

# Load in production — scaler params are restored
pipeline = joblib.load('model_pipeline.joblib')
prediction = pipeline.predict(new_data)
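
To see why the fitted preprocessor has to travel with the model, here is a minimal standard-library sketch. It does not use scikit-learn: `fit_scaler` and `transform` are hypothetical stand-ins for what StandardScaler does, and the spend figures are made up. It compares scaling production data with the restored training statistics against the mistake of fitting a new scaler on production data.

```python
import pickle
import statistics


def fit_scaler(values):
    """Learn the mean and standard deviation, as StandardScaler.fit would."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def transform(params, values):
    """Scale values with previously fitted parameters."""
    return [(v - params["mean"]) / params["std"] for v in values]


# Fit once on (made-up) training data, then serialise the fitted parameters.
train_spend = [20.0, 40.0, 60.0, 80.0]
blob = pickle.dumps(fit_scaler(train_spend))

# In production: restore the fitted parameters and reuse the training statistics.
restored = pickle.loads(blob)
prod_spend = [100.0, 120.0]
scaled_with_training_stats = transform(restored, prod_spend)

# Anti-pattern: refitting on production data silently changes the feature scale.
scaled_with_refit = transform(fit_scaler(prod_spend), prod_spend)

print(scaled_with_training_stats)
print(scaled_with_refit)
```

The two printed lists differ even though the raw inputs are identical, so a model trained on the first scale would receive inputs it has never seen if the scaler were refitted.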

Stage 2: REST API with FastAPI

FastAPI is a popular choice for Python ML model APIs: it is async-native, validates inputs automatically via Pydantic, and generates OpenAPI documentation.

FastAPI prediction endpoint

from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
pipeline = joblib.load('model_pipeline.joblib')

class ChurnInput(BaseModel):
    tenure_months: int
    monthly_spend: float
    support_tickets_90d: int
    last_login_days_ago: int

class ChurnPrediction(BaseModel):
    churn_probability: float
    churn_predicted: bool

@app.post("/predict/churn", response_model=ChurnPrediction)
async def predict_churn(data: ChurnInput):
    # Feature order must match the column order used during training
    features = np.array([[
        data.tenure_months,
        data.monthly_spend,
        data.support_tickets_90d,
        data.last_login_days_ago
    ]])
    # Cast the NumPy scalar to a built-in float so the response validates cleanly
    prob = float(pipeline.predict_proba(features)[0][1])
    return ChurnPrediction(
        churn_probability=round(prob, 4),
        churn_predicted=prob > 0.5
    )
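
A client can call this endpoint with any HTTP library. The sketch below uses only the standard library and builds the request without sending it, so it runs even when no server is up. The URL is an assumption (localhost on port 8000, the port used by the uvicorn command in the Dockerfile), `build_churn_request` is a hypothetical helper, and the customer values are made up.

```python
import json
import urllib.request

# Assumed address: adjust host and port to wherever the API is running.
API_URL = "http://localhost:8000/predict/churn"


def build_churn_request(payload: dict, url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the churn endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up customer; the keys mirror the ChurnInput fields.
example = {
    "tenure_months": 14,
    "monthly_spend": 49.99,
    "support_tickets_90d": 3,
    "last_login_days_ago": 21,
}
request = build_churn_request(example)

# With the API running, send it like this:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(json.loads(response.read()))
```

If a field is missing or has the wrong type, FastAPI rejects the request with a 422 validation error before the model is called.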

Stage 3: Docker Containerisation

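Docker packages the API, its dependencies, and the model file into a single image, so the same artefact runs in dev, staging, and production. The Dockerfile fixes the Python version and includes the model file; library versions are fixed only if requirements.txt pins them.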

Dockerfile for ML API

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model_pipeline.joblib .
COPY main.py .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Stage 4: Cloud Deployment Options

Option | Monthly Cost | Complexity | Best For
Railway / Render | £5–£50 | Low | Internal tools, prototypes, low traffic
AWS ECS (Fargate) | £30–£200 | Medium | Production APIs, auto-scaling, AWS ecosystem
AWS SageMaker Endpoints | £80–£500+ | Medium | Managed ML serving, A/B testing, model registry
Kubernetes (EKS/GKE) | £200–£2,000+ | High | Multiple models, enterprise scale, custom infra
AWS Lambda (serverless) | Pay per call (~£0) | Medium | Low-volume, sporadic inference; small models only

Stage 5: Monitoring and Model Drift

ML models degrade silently. Unlike a crashed API (immediately visible), a model that starts returning wrong predictions can go unnoticed for weeks. You need two layers of monitoring:

Infrastructure Monitoring

  • API latency (P50, P95, P99)
  • Error rates (5xx responses)
  • Request throughput
  • Memory / CPU utilisation
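
In practice a metrics stack such as Prometheus and Grafana computes latency percentiles for you, but the calculation itself is simple. The sketch below is one way to derive P50, P95, and P99 from raw request timings with the standard library; `latency_percentiles` is a hypothetical helper and the sample latencies are made up.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return P50, P95 and P99 for a list of request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 to 99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up sample: mostly fast requests with a slow tail.
sample = [12.0] * 90 + [40.0] * 8 + [250.0] * 2
stats = latency_percentiles(sample)
mean_latency = statistics.fmean(sample)

print(stats)
print(mean_latency)
```

The mean hides the slow tail that P99 exposes, which is why percentiles are tracked rather than averages.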

Model Quality Monitoring

  • Prediction distribution drift
  • Feature distribution drift
  • Accuracy (when labels available)
  • Confidence score distribution
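
Feature distribution drift can be quantified in several ways. One common choice is the Population Stability Index (PSI), sketched below with the standard library only. This is an illustration, not how Evidently AI or WhyLogs implement drift detection: `population_stability_index` is a hypothetical helper, the equal-width binning is a simplifying assumption, and the tenure values are made up.

```python
import math


def population_stability_index(baseline, current, bins=10):
    """PSI between training baseline and production values for one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    # Fall back to a width of 1.0 if the baseline is constant.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = bucket_shares(baseline)
    actual = bucket_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Made-up tenure values: the training baseline and a production sample shifted upwards.
training_tenure = [float(m) for m in range(100)]
production_tenure = [m + 30.0 for m in training_tenure]

psi_unchanged = population_stability_index(training_tenure, training_tenure)
psi_shifted = population_stability_index(training_tenure, production_tenure)
print(psi_unchanged, psi_shifted)
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these thresholds are conventions and should be tuned per feature.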

Log predictions for drift monitoring

from datetime import datetime, timezone

# Extended version of the Stage 2 endpoint (replaces it; do not register both)
@app.post("/predict/churn", response_model=ChurnPrediction)
async def predict_churn(data: ChurnInput):
    features = np.array([[...]])  # same feature array as in Stage 2
    prob = float(pipeline.predict_proba(features)[0][1])

    # Log every prediction for drift monitoring
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_features": data.dict(),  # use data.model_dump() on Pydantic v2
        "prediction": prob,
        "model_version": "v1.2.0"
    }
    # Write to your monitoring store (S3, BigQuery, etc.);
    # prediction_logger is a placeholder for your own logging client
    prediction_logger.log(log_entry)

    return ChurnPrediction(churn_probability=round(prob, 4), churn_predicted=prob > 0.5)
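
The snippet above calls prediction_logger.log(...) without defining it. One simple way to fill that gap during development is a logger that appends JSON lines to a local file, sketched below. `JsonlPredictionLogger` is a hypothetical class, not part of any library, and the file location and feature values are assumptions; in production you would more likely write to the monitoring store mentioned in the comment.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class JsonlPredictionLogger:
    """Hypothetical logger: appends one JSON object per prediction to a file."""

    def __init__(self, path):
        self.path = Path(path)

    def log(self, entry: dict) -> None:
        # Append mode keeps earlier predictions; one line per record.
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(entry) + "\n")


# A temporary directory keeps this example self-contained.
log_path = Path(tempfile.mkdtemp()) / "predictions.jsonl"
prediction_logger = JsonlPredictionLogger(log_path)

prediction_logger.log({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "input_features": {"tenure_months": 14, "monthly_spend": 49.99},
    "prediction": 0.8123,
    "model_version": "v1.2.0",
})

records = [json.loads(line) for line in log_path.read_text(encoding="utf-8").splitlines()]
print(records)
```

A line-per-record format is easy to load later for drift analysis, since each line parses independently.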

Batch vs Real-Time Inference

Factor | Batch Inference | Real-Time Inference
When predictions run | Scheduled (nightly, weekly) | On-demand (milliseconds)
Infrastructure needed | Scheduled script or Airflow job | Always-on REST API server
Complexity | Lower | Higher
Cost | Cheap (runs briefly, then stops) | Higher (always running)
Example use cases | Overnight churn scores, weekly demand forecasts | Fraud detection, real-time recommendations
When to start | Default choice if latency > 1 hour is acceptable | Only when instant predictions are required
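
A batch job can be as small as a script that reads an extract, scores it in chunks, and writes the results. The sketch below shows that loop under stated assumptions: the `customer_id` column, the inline CSV, and `stub_predict` are all hypothetical. In a real job, the predict function would wrap the loaded pipeline's predict_proba and the input would come from your warehouse or a file.

```python
import csv
import io
from itertools import islice

FEATURES = ["tenure_months", "monthly_spend", "support_tickets_90d", "last_login_days_ago"]


def score_rows(rows, predict_fn, batch_size=1000):
    """Yield (customer_id, probability) pairs, scoring rows in fixed-size chunks."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, batch_size)):
        features = [[float(row[name]) for name in FEATURES] for row in chunk]
        for row, prob in zip(chunk, predict_fn(features)):
            yield row["customer_id"], prob


def stub_predict(features):
    """Placeholder model: more support tickets means a higher score."""
    return [min(1.0, f[2] / 10) for f in features]


# Made-up nightly extract; a real job would open a file or query a table.
nightly_extract = io.StringIO(
    "customer_id,tenure_months,monthly_spend,support_tickets_90d,last_login_days_ago\n"
    "c1,14,49.99,3,21\n"
    "c2,60,19.00,0,1\n"
    "c3,2,89.50,7,45\n"
)
scores = list(score_rows(csv.DictReader(nightly_extract), stub_predict, batch_size=2))
print(scores)
```

Chunking keeps memory use flat regardless of extract size, and the same function can be called from a cron script or an Airflow task.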

Frequently Asked Questions

How do you deploy a machine learning model to production?

The standard process is: (1) Serialise the trained model together with its preprocessing pipeline using joblib/pickle. (2) Wrap it in a REST API using FastAPI. (3) Containerise with Docker. (4) Deploy to cloud infrastructure (AWS ECS, Kubernetes, or SageMaker). (5) Add monitoring for prediction latency, error rates, and model drift.

What is model drift and how do you detect it?

Model drift occurs when production data diverges from training data, causing accuracy to degrade silently. Data drift (feature distribution changes) can be detected by monitoring feature statistics against the training baseline. Concept drift (the relationship between features and target changes) requires monitoring prediction accuracy as ground truth labels come in. Tools like Evidently AI and WhyLogs automate drift detection.
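
For concept drift, the core step is joining logged predictions to ground truth labels once they arrive and tracking accuracy over time. The sketch below assumes each logged prediction carries an `id` and a `period` field (hypothetical additions to the log entry) and that labels arrive as a mapping from id to outcome; `accuracy_by_period` and all the data are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_period(logged, labels, threshold=0.5):
    """Accuracy per period for predictions whose true label is already known."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for entry in logged:
        if entry["id"] not in labels:
            continue  # label has not arrived yet
        predicted = entry["prediction"] > threshold
        totals[entry["period"]] += 1
        hits[entry["period"]] += int(predicted == labels[entry["id"]])
    return {period: hits[period] / totals[period] for period in sorted(totals)}


# Made-up logged predictions and the churn outcomes observed later.
logged_predictions = [
    {"id": "c1", "period": "2026-W01", "prediction": 0.91},
    {"id": "c2", "period": "2026-W01", "prediction": 0.12},
    {"id": "c3", "period": "2026-W02", "prediction": 0.80},
    {"id": "c4", "period": "2026-W02", "prediction": 0.30},
    {"id": "c5", "period": "2026-W02", "prediction": 0.65},  # no label yet
]
observed_churn = {"c1": True, "c2": False, "c3": False, "c4": False}

weekly_accuracy = accuracy_by_period(logged_predictions, observed_churn)
print(weekly_accuracy)
```

A sustained drop in a later period while feature distributions look stable is the pattern that points to concept drift rather than data drift.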

What is the difference between batch and real-time ML inference?

Batch inference runs the model on a large dataset at scheduled intervals — suitable for predictions that don't need to be instant (churn scores overnight, demand forecasts weekly). Real-time inference runs the model on demand as each request arrives — required for fraud detection at point of transaction or recommendations during a web session. Start with batch inference unless instant predictions are a genuine requirement.
