Machine Learning Model Deployment: From Jupyter to Production
A model that only lives in a Jupyter notebook creates no business value. Here's the complete process for taking an ML model from experiment to production API — with real code examples.
TL;DR
- ML model deployment has 5 stages: serialise → wrap in API → containerise → deploy → monitor
- FastAPI + Docker is a widely used production stack for Python ML models
- Feature preprocessing must be serialised alongside the model — not just the model weights
- Monitoring for model drift is as important as uptime monitoring — models degrade silently
- Batch inference (overnight jobs) is simpler than real-time inference — start there if latency doesn't matter
The Deployment Pipeline Overview
| Stage | What Happens | Tools |
|---|---|---|
| 1. Model serialisation | Save trained model + preprocessors to disk | joblib, pickle, ONNX, MLflow |
| 2. API wrapper | Build REST endpoint to serve predictions | FastAPI, Flask, BentoML |
| 3. Containerisation | Package API + model into portable Docker image | Docker, Docker Compose |
| 4. Cloud deployment | Run container in scalable cloud infrastructure | AWS ECS, Kubernetes, SageMaker, Railway |
| 5. Monitoring | Track accuracy, latency, data drift, feature drift | Evidently AI, WhyLogs, Prometheus, Grafana |
Stage 1: Model Serialisation
The most common mistake in ML deployment is serialising only the model weights — forgetting that the preprocessing pipeline (scalers, encoders, imputers) must also be saved and loaded identically. If your training data was scaled with StandardScaler, production data must be scaled with the same fitted scaler, not a new one.
Save model + preprocessor together
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# Build pipeline: preprocessing + model in one object
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingClassifier(n_estimators=200))
])
# X_train / y_train are your training features and labels
pipeline.fit(X_train, y_train)

# Save the entire pipeline (scaler params are saved with it)
joblib.dump(pipeline, 'model_pipeline.joblib')

# Load in production: scaler params are restored
pipeline = joblib.load('model_pipeline.joblib')
prediction = pipeline.predict(new_data)
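To see why the fitted scaler has to travel with the model, here is a minimal standard-library sketch. `fit_scaler` and `transform` are hypothetical stand-ins for StandardScaler, JSON stands in for joblib, and the spend figures are made up for illustration.

```python
import json
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the scaling parameters (mean and population std) from data."""
    return {"mean": mean(values), "std": pstdev(values) or 1.0}


def transform(params, values):
    """Standardise values with previously fitted parameters."""
    return [(v - params["mean"]) / params["std"] for v in values]


# Illustrative monthly-spend values, not real data
train_spend = [20.0, 40.0, 60.0, 80.0]
prod_spend = [80.0, 100.0, 120.0]

# Training time: fit once and serialise the fitted parameters
saved = json.dumps(fit_scaler(train_spend))

# Production: restore the training-time parameters and reuse them
restored = json.loads(saved)
correct = transform(restored, prod_spend)

# Mistake: refitting on production data moves the reference point
wrong = transform(fit_scaler(prod_spend), prod_spend)

print(correct[0], wrong[0])
```

The same customer (spend of 80) lands above the mean under the training-time parameters but below the mean under the refitted ones, so the model would receive a different input for an identical customer.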
Stage 2: REST API with FastAPI
FastAPI is a popular choice for Python ML model APIs: it is async-native, validates inputs automatically via Pydantic, and generates OpenAPI documentation for you.
FastAPI prediction endpoint
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
# Load the pipeline once at startup, not on every request
pipeline = joblib.load('model_pipeline.joblib')

class ChurnInput(BaseModel):
    tenure_months: int
    monthly_spend: float
    support_tickets_90d: int
    last_login_days_ago: int

class ChurnPrediction(BaseModel):
    churn_probability: float
    churn_predicted: bool

@app.post("/predict/churn", response_model=ChurnPrediction)
async def predict_churn(data: ChurnInput):
    # Feature order must match the order used at training time
    features = np.array([[
        data.tenure_months,
        data.monthly_spend,
        data.support_tickets_90d,
        data.last_login_days_ago
    ]])
    # Cast to a plain float so the response holds native Python types
    prob = float(pipeline.predict_proba(features)[0][1])
    return ChurnPrediction(
        churn_probability=round(prob, 4),
        churn_predicted=prob > 0.5
    )
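Calling the endpoint could look like the sketch below, which uses only the standard library. The URL assumes the API is running locally on port 8000, and the feature values and the `build_churn_request` helper are illustrative. The actual network call is left commented out so the snippet runs without a live server.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict/churn"  # assumed local address


def build_churn_request(features, url=API_URL):
    """Build a JSON POST request matching the ChurnInput schema."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up customer for illustration
example = {
    "tenure_months": 14,
    "monthly_spend": 42.5,
    "support_tickets_90d": 3,
    "last_login_days_ago": 21,
}
req = build_churn_request(example)

# With the API running, send it like this:
# with urllib.request.urlopen(req, timeout=5) as response:
#     print(json.load(response))
```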
Stage 3: Docker Containerisation
Docker lets your model run in the same environment on any machine: dev, staging, and production. The Dockerfile captures the Python version and the model file; library versions are captured only if requirements.txt pins them.
Dockerfile for ML API
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model_pipeline.joblib .
COPY main.py .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
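The image only reproduces the training environment if requirements.txt pins exact versions. One possible way to generate pinned lines from the environment the model was trained in is sketched below. The `pinned_requirements` helper and the package list are assumptions based on the imports used in this article.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return ('name==version' lines, names that are not installed)."""
    lines, missing = [], []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            missing.append(name)
    return lines, missing


# Packages the API in this article imports; adjust to your project
lines, missing = pinned_requirements(
    ["fastapi", "uvicorn", "scikit-learn", "joblib", "numpy"]
)
print("\n".join(lines))
if missing:
    print("# not installed here:", ", ".join(missing))
```

Run this in the training environment and paste the output into requirements.txt, so the container loads the serialised pipeline with the same library versions that produced it.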
Stage 4: Cloud Deployment Options
| Option | Monthly Cost | Complexity | Best For |
|---|---|---|---|
| Railway / Render | £5–£50 | Low | Internal tools, prototypes, low traffic |
| AWS ECS (Fargate) | £30–£200 | Medium | Production APIs, auto-scaling, AWS ecosystem |
| AWS SageMaker Endpoints | £80–£500+ | Medium | Managed ML serving, A/B testing, model registry |
| Kubernetes (EKS/GKE) | £200–£2,000+ | High | Multiple models, enterprise scale, custom infra |
| AWS Lambda (serverless) | Pay per call (near £0 at low volume) | Medium | Low-volume, sporadic inference; small models only |
Stage 5: Monitoring and Model Drift
ML models degrade silently. Unlike a crashed API (immediately visible), a model that starts returning wrong predictions can go unnoticed for weeks. You need two layers of monitoring:
Infrastructure Monitoring
- API latency (P50, P95, P99)
- Error rates (5xx responses)
- Request throughput
- Memory / CPU utilisation
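In practice Prometheus or Grafana would compute these percentiles for you. As a rough sketch of what P50, P95, and P99 mean, the snippet below applies the nearest-rank method to a list of request latencies. The helper names and the sample data are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Summarise request latencies at the three percentiles listed above."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Illustrative samples: one request at each latency from 1 ms to 100 ms
samples = list(range(1, 101))
summary = latency_summary(samples)
print(summary)
```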
Model Quality Monitoring
- Prediction distribution drift
- Feature distribution drift
- Accuracy (when labels available)
- Confidence score distribution
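Feature and prediction drift can be quantified in several ways. One common option is the Population Stability Index (PSI), sketched below in plain Python. This is an illustration under assumptions, not necessarily how Evidently AI or WhyLogs compute drift: the bin count, the epsilon floor, and the sample data are all choices made for this example.

```python
import math


def psi(baseline, current, bins=10):
    """Population Stability Index of `current` against `baseline`."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def bin_shares(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor at a small epsilon so empty bins do not produce log(0)
        return [max(count / len(values), 1e-6) for count in counts]

    expected, actual = bin_shares(baseline), bin_shares(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Illustrative data: a uniform training feature and a shifted production sample
training = [i / 100 for i in range(100)]
production = [min(v + 0.3, 0.99) for v in training]

print(round(psi(training, training), 4))
print(round(psi(training, production), 4))
```

A PSI of zero means the two distributions fall into the bins identically. A frequently quoted rule of thumb treats values above roughly 0.2 as worth investigating, but the right alert threshold depends on the feature and is a judgement call.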
Log predictions for drift monitoring
from datetime import datetime, timezone

@app.post("/predict/churn", response_model=ChurnPrediction)
async def predict_churn(data: ChurnInput):
    features = np.array([[...]])  # same feature vector as in Stage 2
    prob = float(pipeline.predict_proba(features)[0][1])
    # Log every prediction for drift monitoring
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_features": data.dict(),  # use data.model_dump() on Pydantic v2
        "prediction": prob,
        "model_version": "v1.2.0"
    }
    # Write to your monitoring store (S3, BigQuery, etc.);
    # prediction_logger is a placeholder for your own logging client
    prediction_logger.log(log_entry)
    return ChurnPrediction(churn_probability=round(prob, 4), churn_predicted=prob > 0.5)
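The `prediction_logger` object above is not defined in the snippet. One minimal way to provide it, assuming a local JSON Lines file is an acceptable first monitoring store, is sketched below. The `JsonlPredictionLogger` class, the temporary file path, and the sample entry are hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


class JsonlPredictionLogger:
    """Hypothetical stand-in for prediction_logger: appends one JSON object per line."""

    def __init__(self, path):
        self.path = path

    def log(self, entry):
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(entry) + "\n")


# A temporary directory keeps the example self-contained
log_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
prediction_logger = JsonlPredictionLogger(log_path)
prediction_logger.log({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "input_features": {"tenure_months": 14, "monthly_spend": 42.5},
    "prediction": 0.37,
    "model_version": "v1.2.0",
})

# Read the log back, as a drift job would
with open(log_path, encoding="utf-8") as handle:
    logged = [json.loads(line) for line in handle]
print(logged[0]["prediction"])
```

Append-only JSON Lines files are easy to ship to S3 or load into BigQuery later. For a busy API you would likely swap this for a buffered or asynchronous writer so logging does not add request latency.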
Batch vs Real-Time Inference
| Factor | Batch Inference | Real-Time Inference |
|---|---|---|
| When predictions run | Scheduled (nightly, weekly) | On-demand (milliseconds) |
| Infrastructure needed | Scheduled script or Airflow job | Always-on REST API server |
| Complexity | Lower | Higher |
| Cost | Cheap (runs briefly, then stops) | Higher (always running) |
| Example use cases | Overnight churn scores, weekly demand forecasts | Fraud detection, real-time recommendations |
| When to start | Default choice if latency > 1 hour is acceptable | Only when instant predictions are required |
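A batch job can be as small as the sketch below: read a file of customers, score each row, write the results, and exit. The `predict_churn_probability` function is a hypothetical stand-in for `pipeline.predict_proba`, and the `customer_id` column and in-memory files are assumptions made so the example runs without a trained model.

```python
import csv
import io


def predict_churn_probability(row):
    """Hypothetical stand-in for pipeline.predict_proba; not a real model."""
    return 0.8 if int(row["last_login_days_ago"]) > 30 else 0.2


def score_batch(source, destination, predict=predict_churn_probability):
    """Score every row of a CSV file object and write id + probability."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(
        destination, fieldnames=["customer_id", "churn_probability"]
    )
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow({
            "customer_id": row["customer_id"],
            "churn_probability": predict(row),
        })
        count += 1
    return count


# In-memory files stand in for the nightly input and output
source = io.StringIO("customer_id,last_login_days_ago\nc1,45\nc2,3\n")
destination = io.StringIO()
scored = score_batch(source, destination)
print(destination.getvalue())
```

Scheduling this with cron or an Airflow task gives you overnight churn scores without an always-on server.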
Frequently Asked Questions
How do you deploy a machine learning model to production?
The standard process is: (1) Serialise the trained model together with its preprocessing pipeline using joblib or pickle. (2) Wrap it in a REST API using FastAPI. (3) Containerise with Docker. (4) Deploy to cloud infrastructure (AWS ECS, Kubernetes, or SageMaker). (5) Add monitoring for prediction latency, error rates, and model drift.
What is model drift and how do you detect it?
Model drift occurs when production data diverges from training data, causing accuracy to degrade silently. Data drift (feature distribution changes) can be detected by monitoring feature statistics against the training baseline. Concept drift (the relationship between features and target changes) requires monitoring prediction accuracy as ground truth labels come in. Tools like Evidently AI and WhyLogs automate drift detection.
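Monitoring accuracy as labels arrive can be sketched as a join between logged predictions and later ground truth, grouped by time window. The record layout (`id`, `week`, `prediction`), the 0.5 threshold, and the sample data below are assumptions for illustration.

```python
def accuracy_by_window(predictions, labels, threshold=0.5):
    """Accuracy per window, using only predictions whose true label has arrived."""
    totals = {}
    for record in predictions:
        actual = labels.get(record["id"])
        if actual is None:
            continue  # ground truth not available yet
        predicted = record["prediction"] > threshold
        hits, seen = totals.get(record["week"], (0, 0))
        totals[record["week"]] = (hits + (predicted == actual), seen + 1)
    return {week: hits / seen for week, (hits, seen) in totals.items()}


# Made-up logged predictions and the churn outcomes observed later
predictions = [
    {"id": "a", "week": "week-1", "prediction": 0.9},
    {"id": "b", "week": "week-1", "prediction": 0.1},
    {"id": "c", "week": "week-2", "prediction": 0.8},
    {"id": "d", "week": "week-2", "prediction": 0.7},
    {"id": "e", "week": "week-2", "prediction": 0.3},
]
labels = {"a": True, "b": False, "c": True, "d": False}  # "e" has no label yet

accuracy = accuracy_by_window(predictions, labels)
print(accuracy)
```

A sustained drop from one window to the next, while feature distributions look stable, is a hint that the relationship between features and target has changed.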
What is the difference between batch and real-time ML inference?
Batch inference runs the model on a large dataset at scheduled intervals — suitable for predictions that don't need to be instant (churn scores overnight, demand forecasts weekly). Real-time inference runs the model on demand as each request arrives — required for fraud detection at point of transaction or recommendations during a web session. Start with batch inference unless instant predictions are a genuine requirement.