Why Getting Infrastructure Right From Day One Matters
We frequently work with SaaS startups that have landed their first major customer, need to pass a security audit, and discover that their infrastructure was clicked together in the AWS console with no automation, no environment separation, database credentials hardcoded in environment variables, no backups tested, and no monitoring. The "quick fix" that felt pragmatic at the start is now a month-long refactoring project happening in parallel with serving paying customers.
Doing infrastructure properly from day one takes an experienced cloud engineer two to three weeks. Fixing it later, under production pressure, takes three to eight weeks, and that work competes with serving paying customers. The choice is straightforward once you know what "properly" means, which is what this guide explains.
Choose Managed Services Over Self-Managed
The most impactful decision for a SaaS startup's infrastructure is choosing managed services at every layer. Self-hosting a component means you own: initial setup, OS patching, monitoring, backup, failover, version upgrades, and incident response. For a small team, each self-managed component is an operational burden that competes with shipping product.
| Layer | Self-Managed | Managed Service | Recommendation |
|---|---|---|---|
| PostgreSQL | EC2 + self-managed Postgres | AWS RDS / Aurora Serverless | Managed (RDS) |
| Redis | EC2 + self-managed Redis | AWS ElastiCache for Redis | Managed (ElastiCache) |
| Application servers | EC2 instances | ECS Fargate / EKS | ECS Fargate (for most startups) |
| Load balancer | Nginx on EC2 | AWS ALB | Managed (ALB) |
| SSL/TLS certificates | Let's Encrypt + cron renewal | AWS ACM (auto-renewal) | Managed (ACM) |
| Message queue | RabbitMQ on EC2 | AWS SQS | Managed (SQS) |
| File storage | EBS / EFS on EC2 | AWS S3 | Managed (S3) |
| DNS | Self-managed BIND / CoreDNS | Route 53 | Managed (Route 53) |
| Secrets | .env files | AWS Secrets Manager / SSM Parameter Store | Managed (Secrets Manager) |
| Container registry | Docker Hub | AWS ECR | Managed (ECR) |
Infrastructure-as-Code From Day One
Never create AWS resources by clicking in the console — or if you do for exploration, immediately codify them in Terraform. Infrastructure-as-code (IaC) gives you: a version-controlled audit log of every infrastructure change, reproducible environments (staging is identical to production, just smaller), peer review for infrastructure changes, and the ability to tear down and recreate environments in minutes.
Use Terraform modules to avoid repeating yourself. A good module structure: a vpc module (VPC, subnets, routing, NAT Gateway), an ecs-service module (ECS task definition, service, ALB target group, security groups, autoscaling), an rds module (RDS instance, subnet group, parameter group, security group), and a secrets module (Secrets Manager secrets with rotation).
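As a rough sketch of how that module layout might be wired together in a root configuration: the local `../../modules/...` paths and every input and output name below are assumptions for illustration, not a published module interface.

```hcl
# environments/prod/main.tf (hypothetical layout)
module "vpc" {
  source = "../../modules/vpc"

  name = "myapp-prod"
  cidr = "10.0.0.0/16"
}

module "rds" {
  source = "../../modules/rds"

  identifier     = "myapp-prod-postgres"
  instance_class = "db.t4g.medium"
  vpc_id         = module.vpc.vpc_id              # assumed module output
  subnet_ids     = module.vpc.database_subnet_ids # assumed module output
}

module "api" {
  source = "../../modules/ecs-service"

  name          = "api"
  vpc_id        = module.vpc.vpc_id
  subnet_ids    = module.vpc.private_subnet_ids # assumed module output
  desired_count = 2
}
```

A staging environment would call the same modules from its own directory with smaller values, which is what keeps it identical to production in shape.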
Terraform: Complete 3-Tier SaaS Infrastructure
# main.tf: complete SaaS startup infrastructure (simplified)
# Assumes AWS provider configured with eu-west-2 (London)
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }

  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "eu-west-2"
  }
}

provider "aws" { region = "eu-west-2" }
# ─── VPC ───────────────────────────────────────────────────────
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "myapp-prod"
  cidr = "10.0.0.0/16"

  azs              = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]
  private_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets   = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false # one per AZ for HA
  enable_dns_hostnames = true
  enable_dns_support   = true

  create_database_subnet_group = true

  tags = {
    Project     = "myapp"
    Environment = "prod"
    ManagedBy   = "terraform"
  }
}
# ─── ALB (Application Load Balancer) ──────────────────────────
resource "aws_lb" "main" {
  name                       = "myapp-prod-alb"
  internal                   = false
  load_balancer_type         = "application"
  security_groups            = [aws_security_group.alb.id]
  subnets                    = module.vpc.public_subnets
  enable_deletion_protection = true

  tags = { Name = "myapp-prod-alb" }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate_validation.main.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# Plain HTTP is only accepted in order to redirect it to HTTPS.
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.main.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
# ─── RDS (PostgreSQL) ─────────────────────────────────────────
resource "aws_db_instance" "main" {
  identifier     = "myapp-prod-postgres"
  engine         = "postgres"
  engine_version = "16.2"
  instance_class = "db.t4g.medium"

  allocated_storage = 100
  storage_type      = "gp3"
  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds.arn

  db_name  = "myapp"
  username = "myapp_admin"
  password = random_password.db_password.result

  db_subnet_group_name   = module.vpc.database_subnet_group
  vpc_security_group_ids = [aws_security_group.rds.id]

  multi_az                = true # HA in production
  backup_retention_period = 30   # 30 days of automated backups
  backup_window           = "02:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:06:00"

  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "myapp-prod-final-snapshot"

  performance_insights_enabled = true

  tags = { Name = "myapp-prod-postgres" }
}
# ─── ElastiCache (Redis) ─────────────────────────────────────
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "myapp-prod-redis"
  description          = "Redis for session cache and Celery broker"
  node_type            = "cache.t4g.small"
  num_cache_clusters   = 2 # primary + one replica
  port                 = 6379

  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = random_password.redis_token.result

  tags = { Name = "myapp-prod-redis" }
}
# ─── ECS Fargate (Application) ───────────────────────────────
resource "aws_ecs_cluster" "main" {
  name = "myapp-prod"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}
resource "aws_ecs_task_definition" "api" {
  family                   = "myapp-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name         = "api"
    image        = "${aws_ecr_repository.api.repository_url}:latest"
    essential    = true
    portMappings = [{ containerPort = 8000, protocol = "tcp" }]

    # Non-sensitive settings only: anything here is readable in the task definition.
    environment = [
      { name = "ENVIRONMENT", value = "production" }
    ]

    # REDIS_URL embeds the Redis auth token, so it is injected as a secret rather
    # than a plain environment variable. The redis_url secret is assumed to be
    # defined elsewhere, like db_url and app_secret.
    secrets = [
      { name = "DATABASE_URL", valueFrom = aws_secretsmanager_secret.db_url.arn },
      { name = "REDIS_URL", valueFrom = aws_secretsmanager_secret.redis_url.arn },
      { name = "SECRET_KEY", valueFrom = aws_secretsmanager_secret.app_secret.arn }
    ]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.api.name
        "awslogs-region"        = "eu-west-2"
        "awslogs-stream-prefix" = "api"
      }
    }

    # Requires curl to be installed in the container image.
    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }
  }])
}
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = module.vpc.private_subnets
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "api"
    container_port   = 8000
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # The CI/CD pipeline registers new task definition revisions on deploy;
  # ignoring the attribute stops Terraform from reverting them.
  lifecycle { ignore_changes = [task_definition] }
}
Environment Setup: Dev / Staging / Production
Use AWS Organizations with separate accounts for each environment. Production gets its own account — isolated billing, separate IAM, separate network. Developers have access to the development account, senior engineers to staging, and only the CI/CD pipeline (and on-call engineers) can deploy to production. This prevents the classic "I was debugging in prod" incident.
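The account-per-environment layout can itself be managed in Terraform. A minimal sketch, assuming it runs from the organization's management account; the account names and email addresses are placeholders.

```hcl
# Assumed: applied with credentials for the AWS Organizations management account.
locals {
  environments = {
    dev     = "aws-dev@example.com"     # placeholder addresses
    staging = "aws-staging@example.com"
    prod    = "aws-prod@example.com"
  }
}

resource "aws_organizations_account" "env" {
  for_each = local.environments

  name  = "myapp-${each.key}"
  email = each.value # each AWS account needs its own unique email address

  # Closing an account is slow and hard to undo, so make Terraform refuse to destroy it.
  lifecycle {
    prevent_destroy = true
  }
}
```

Each environment's infrastructure is then applied with credentials for its own account, so a mistake in dev cannot touch production resources.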
Apply consistent resource tagging across all environments: Project, Environment, Team, ManagedBy. Tags enable cost allocation reports that show exactly how much each environment and component costs — essential for FinOps as you scale.
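One way to keep tagging consistent is the AWS provider's `default_tags` block, which applies the same tags to every taggable resource the provider creates. A sketch that would take the place of a bare provider block; the Team value is a placeholder.

```hcl
provider "aws" {
  region = "eu-west-2"

  default_tags {
    tags = {
      Project     = "myapp"
      Environment = "prod"
      Team        = "platform" # placeholder team name
      ManagedBy   = "terraform"
    }
  }
}
```

Tags only show up in cost allocation reports after they have been activated as cost allocation tags in the billing console.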
CI/CD Pipeline: GitHub Actions + ECR + ECS
A minimal but production-ready CI/CD pipeline for a containerised SaaS application: on pull request, run tests and build the Docker image. On merge to main, push the image to ECR with a git SHA tag, run database migrations in a one-off ECS task, then update the ECS service with the new image tag. ECS's deployment circuit breaker automatically rolls back if the new tasks fail their health checks.
Use GitHub Actions OIDC to authenticate to AWS without storing long-lived access keys in GitHub secrets. You register GitHub as an IAM identity provider and let workflows assume an IAM role with short-lived tokens, which is significantly more secure than storing access keys and rotating them by hand.
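The AWS side of that OIDC setup might look like the sketch below. It assumes the GitHub OIDC identity provider has already been created in the account; the repository `my-org/myapp` and the role name are placeholders, and the permissions the role needs (ECR push, ECS deploy) are left out.

```hcl
# Look up the existing GitHub OIDC provider (assumed to be created already).
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "github_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows running on main of this (placeholder) repository may assume the role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:my-org/myapp:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "github_deploy" {
  name               = "myapp-github-deploy"
  assume_role_policy = data.aws_iam_policy_document.github_assume.json
}
```

On the GitHub side, the workflow needs the `id-token: write` permission and passes the role ARN to the `aws-actions/configure-aws-credentials` action. Pinning the `sub` condition to a branch keeps pull requests from forks from deploying.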
Secrets Management
The cardinal rule: never store credentials in plaintext environment variables, .env files committed to git, or EC2 user data scripts. These end up in logs, consoles, and repository history, are visible to anyone with access to them, and cannot be rotated without a deployment.
Use AWS Secrets Manager from day one. Store your database URL, API keys, third-party service credentials, and application secret keys here. Reference them in ECS task definitions as secret environment variables — ECS retrieves and injects them at runtime, and they never appear in your code or infrastructure configuration. Enable automatic rotation for your database password — AWS Secrets Manager handles this natively for RDS.
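A minimal sketch of the Terraform side: the secret is declared without a value, and the ECS execution role is allowed to read it. The secret name is a placeholder, and `aws_iam_role.ecs_execution` is assumed to be defined elsewhere.

```hcl
# Declares the secret only. The value is set out of band (CLI, console, or a
# rotation function), so it never lands in git.
resource "aws_secretsmanager_secret" "db_url" {
  name = "myapp/prod/database-url" # placeholder naming scheme
}

data "aws_iam_policy_document" "read_secrets" {
  statement {
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [aws_secretsmanager_secret.db_url.arn]
  }
}

# ECS uses the task execution role (not the task role) to fetch secrets at startup.
resource "aws_iam_role_policy" "ecs_execution_secrets" {
  name   = "read-app-secrets"
  role   = aws_iam_role.ecs_execution.id
  policy = data.aws_iam_policy_document.read_secrets.json
}
```

If the secret is encrypted with a customer-managed KMS key, the execution role also needs `kms:Decrypt` on that key.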
Monitoring From Day One
Structured logging: configure your application to output JSON-formatted logs (not plain text). JSON logs are structured, searchable, and filterable in CloudWatch Logs Insights. Add fields like request_id, user_id, duration_ms, status_code, and error_message to every log line. This turns your logs from a wall of text into a queryable database of application events.
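One payoff of JSON logs is that CloudWatch can match on individual fields. As an illustration, a metric filter that counts log lines with a `status_code` of 500 or higher; it assumes `status_code` is logged as a number, and the metric name and namespace are placeholders.

```hcl
resource "aws_cloudwatch_log_metric_filter" "api_5xx" {
  name           = "myapp-api-5xx"
  log_group_name = aws_cloudwatch_log_group.api.name

  # JSON filter pattern: matches only log events whose status_code field is >= 500.
  pattern = "{ $.status_code >= 500 }"

  metric_transformation {
    name      = "Api5xxCount"
    namespace = "MyApp/Api" # placeholder custom namespace
    value     = "1"         # each matching log line adds 1 to the metric
  }
}
```

The resulting metric can be graphed or alarmed on like any built-in one, which is not possible with free-text logs without fragile string matching.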
Uptime monitoring: set up an external uptime monitor (Pingdom, Better Uptime, or AWS Route 53 health checks) that alerts immediately if your production URL becomes unreachable. Internal monitoring can't tell you the service is down if the monitoring system itself is affected by the same outage.
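If you go the Route 53 route, the health check is a single resource. A sketch, assuming the application exposes `/health` publicly over HTTPS; the hostname is a placeholder.

```hcl
resource "aws_route53_health_check" "api" {
  fqdn              = "app.example.com" # placeholder production hostname
  type              = "HTTPS"
  port              = 443
  resource_path     = "/health"
  request_interval  = 30 # seconds between requests from each checker location
  failure_threshold = 3  # consecutive failures before the check is marked unhealthy

  tags = { Name = "myapp-prod-uptime" }
}
```

Route 53 publishes health check metrics to CloudWatch in us-east-1 only, so any alarm on this check has to be created in that region.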
Error tracking: integrate Sentry into your application from day one. Sentry captures unhandled exceptions with full stack traces, breadcrumbs, user context, and release tracking. The free tier covers up to 5,000 errors per month — more than enough for a startup. Alert to Slack immediately on new issues.
CloudWatch alarms: configure alarms for: ECS CPU above 80% (sustained 5 min), ECS memory above 85%, RDS CPU above 80%, RDS storage below 20% free, ALB 5xx error rate above 1%, and SQS dead-letter queue depth above 0 (any DLQ message means a failed job).
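Two of those alarms, sketched in Terraform. The SNS topic and the dead-letter queue name are placeholders; the ECS dimensions reference the cluster and service from the listing above.

```hcl
resource "aws_sns_topic" "alerts" {
  name = "myapp-prod-alerts"
}

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "myapp-prod-api-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  period              = 60
  evaluation_periods  = 5 # five consecutive 1-minute periods = sustained 5 minutes
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.api.name
  }
}

resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "myapp-prod-dlq-not-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0 # any message in the DLQ is a failed job
  period              = 300
  evaluation_periods  = 1
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    QueueName = "myapp-prod-jobs-dlq" # placeholder queue name
  }
}
```

The remaining alarms follow the same shape with a different namespace, metric, and threshold. The ALB 5xx rate is the exception, since a percentage needs a metric math expression over two metrics.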
Cost Control for Startups
Budget alerts: set up AWS Budgets with alerts at 80% and 100% of your expected monthly spend. Receive alerts via email and SNS. This is a five-minute setup that prevents bill shock.
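The same budget can be declared in Terraform so every account gets it by default. A sketch with a placeholder limit, email address, and topic name; it assumes the account is billed in USD.

```hcl
resource "aws_sns_topic" "budget_alerts" {
  name = "myapp-budget-alerts"
}

resource "aws_budgets_budget" "monthly" {
  name         = "myapp-monthly"
  budget_type  = "COST"
  time_unit    = "MONTHLY"
  limit_amount = "1000" # placeholder: your expected monthly spend
  limit_unit   = "USD"

  # One notification per threshold: 80% and 100% of actual spend.
  dynamic "notification" {
    for_each = [80, 100]

    content {
      comparison_operator        = "GREATER_THAN"
      threshold                  = notification.value
      threshold_type             = "PERCENTAGE"
      notification_type          = "ACTUAL"
      subscriber_email_addresses = ["finance@example.com"]
      subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
    }
  }
}
```

For the SNS delivery to work, the topic's access policy must allow `budgets.amazonaws.com` to publish to it; that policy is omitted here.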
Auto-stop dev environments: development databases and ECS services don't need to run 24/7. Use EventBridge Scheduler to stop RDS instances and scale ECS desired count to 0 at 7pm on weekdays, then restart them at 8am. That is 11 hours on 5 days, or 55 hours per week instead of 168. For a typical dev environment costing £200/month on-demand, this saves approximately £130/month. Storage is still billed while an RDS instance is stopped, so the saving applies to compute hours.
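A sketch of the RDS half of that schedule using EventBridge Scheduler's universal targets. The dev instance identifier and role name are placeholders, and the `DbInstanceIdentifier` input key follows the universal-target naming convention as I understand it, so check it against the Scheduler documentation before relying on it.

```hcl
locals {
  # Stop at 19:00 and start at 08:00, Monday to Friday.
  dev_db_schedules = {
    stop  = { action = "stopDBInstance", cron = "cron(0 19 ? * MON-FRI *)" }
    start = { action = "startDBInstance", cron = "cron(0 8 ? * MON-FRI *)" }
  }
}

data "aws_iam_policy_document" "scheduler_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["scheduler.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "dev_scheduler" {
  name               = "myapp-dev-scheduler"
  assume_role_policy = data.aws_iam_policy_document.scheduler_assume.json
}

resource "aws_iam_role_policy" "dev_scheduler_rds" {
  name = "stop-start-dev-db"
  role = aws_iam_role.dev_scheduler.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["rds:StopDBInstance", "rds:StartDBInstance"]
      Resource = "*" # narrow this to the dev instance ARN in practice
    }]
  })
}

resource "aws_scheduler_schedule" "dev_db" {
  for_each = local.dev_db_schedules

  name                         = "myapp-dev-db-${each.key}"
  schedule_expression          = each.value.cron
  schedule_expression_timezone = "Europe/London"

  flexible_time_window {
    mode = "OFF"
  }

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:rds:${each.value.action}"
    role_arn = aws_iam_role.dev_scheduler.arn
    input    = jsonencode({ DbInstanceIdentifier = "myapp-dev-postgres" }) # placeholder instance
  }
}
```

RDS automatically restarts an instance that has been stopped for seven days, so the weekday schedule also covers instances left stopped over a long break. Scaling the ECS service to zero can be scheduled the same way.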
Right-size from the start: resist the urge to over-provision "just in case". Start with the smallest instance that meets your needs based on load testing, not gut feel. It is easy to scale up; it is psychologically harder to scale down a resource you already provisioned (because you worry about what might break).
Building a SaaS? Get Your Infrastructure Right From Day One.
SpiderHunts Technologies sets up complete, production-ready cloud infrastructure for SaaS startups — Terraform, CI/CD, managed databases, monitoring, secrets management, and security. Done in 2 to 3 weeks. Ready to scale with you.