Cloud Engineering for SaaS Startups: Infrastructure From Day One

Getting cloud infrastructure right from the start saves months of painful refactoring. Here is the complete playbook for SaaS startups — from AWS account setup to CI/CD to monitoring.

TL;DR: Start with managed services (RDS, not self-hosted Postgres), infrastructure-as-code from day one (Terraform), environment separation (dev/staging/prod accounts), secrets in Secrets Manager (never in .env files), structured logging, uptime monitoring, and budget alerts. A well-built startup cloud stack costs £300 to £600/month to start and scales gracefully. A poorly built one costs the same but turns into a painful rebuild six months later.

Why Getting Infrastructure Right From Day One Matters

We frequently work with SaaS startups that have landed their first major customer, need to pass a security audit, and discover that their infrastructure was clicked together in the AWS console with no automation, no environment separation, database credentials hardcoded in environment variables, no backups tested, and no monitoring. The "quick fix" that felt pragmatic at the start is now a month-long refactoring project happening in parallel with serving paying customers.

The time investment to do infrastructure properly from day one is two to three weeks for an experienced cloud engineer. The time to fix it properly later, under production pressure, is three to eight weeks. The choice is straightforward if you know what "properly" means — which this guide will explain.

Choose Managed Services Over Self-Managed

The most impactful decision for a SaaS startup's infrastructure is choosing managed services at every layer. Self-hosting a component means you own: initial setup, OS patching, monitoring, backup, failover, version upgrades, and incident response. For a small team, each self-managed component is an operational burden that competes with shipping product.

| Layer | Self-Managed | Managed Service | Recommendation |
|---|---|---|---|
| PostgreSQL | EC2 + self-managed Postgres | AWS RDS / Aurora Serverless | Managed (RDS) |
| Redis | EC2 + self-managed Redis | AWS ElastiCache for Redis | Managed (ElastiCache) |
| Application servers | EC2 instances | ECS Fargate / EKS | ECS Fargate (for most startups) |
| Load balancer | Nginx on EC2 | AWS ALB | Managed (ALB) |
| SSL/TLS certificates | Let's Encrypt + cron renewal | AWS ACM (auto-renewal) | Managed (ACM) |
| Message queue | RabbitMQ on EC2 | AWS SQS | Managed (SQS) |
| File storage | EBS / EFS on EC2 | AWS S3 | Managed (S3) |
| DNS | Self-managed BIND / CoreDNS | Route 53 | Managed (Route 53) |
| Secrets | .env files, SSM Parameter Store | AWS Secrets Manager | Managed (Secrets Manager) |
| Container registry | Docker Hub | AWS ECR | Managed (ECR) |

Infrastructure-as-Code From Day One

Never create AWS resources by clicking in the console — or if you do for exploration, immediately codify them in Terraform. Infrastructure-as-code (IaC) gives you: a version-controlled audit log of every infrastructure change, reproducible environments (staging is identical to production, just smaller), peer review for infrastructure changes, and the ability to tear down and recreate environments in minutes.

Use Terraform modules to avoid repeating yourself. A good module structure: a vpc module (VPC, subnets, routing, NAT Gateway), an ecs-service module (ECS task definition, service, ALB target group, security groups, autoscaling), an rds module (RDS instance, subnet group, parameter group, security group), and a secrets module (Secrets Manager secrets with rotation).
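As a rough illustration of that layout, a root configuration might wire the modules together as sketched below. The ./modules/... paths, the input names (name, cidr, vpc_id, subnet_ids, container_port) and the outputs (vpc_id, private_subnet_ids, database_subnet_ids) are all hypothetical; they depend entirely on how you write your own modules.

```hcl
# Sketch only: module paths, inputs, and outputs are assumed, not a fixed API.
module "vpc" {
  source = "./modules/vpc"

  name = "myapp-staging"
  cidr = "10.1.0.0/16"
}

module "rds" {
  source = "./modules/rds"

  name       = "myapp-staging"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.database_subnet_ids
}

module "api_service" {
  source = "./modules/ecs-service"

  name           = "api"
  vpc_id         = module.vpc.vpc_id
  subnet_ids     = module.vpc.private_subnet_ids
  container_port = 8000
}
```

Keeping environment-specific values (names, CIDRs, instance sizes) as module inputs is what lets staging reuse the production code at a smaller size.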

Terraform: Complete 3-Tier SaaS Infrastructure

# main.tf: complete SaaS startup infrastructure (simplified)
# Assumes AWS provider configured with eu-west-2 (London)

terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket  = "myapp-terraform-state"
    key     = "prod/terraform.tfstate"
    region  = "eu-west-2"
    encrypt = true # state holds generated passwords, so encrypt it at rest
  }
}

provider "aws" { region = "eu-west-2" }

# ─── VPC ───────────────────────────────────────────────────────
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "myapp-prod"
  cidr = "10.0.0.0/16"

  azs              = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]
  private_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets   = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false # one per AZ for HA
  enable_dns_hostnames = true
  enable_dns_support   = true

  create_database_subnet_group = true

  tags = {
    Project     = "myapp"
    Environment = "prod"
    ManagedBy   = "terraform"
  }
}

# ─── ALB (Application Load Balancer) ──────────────────────────
resource "aws_lb" "main" {
  name               = "myapp-prod-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = module.vpc.public_subnets

  enable_deletion_protection = true

  tags = { Name = "myapp-prod-alb" }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate_validation.main.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.main.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

# ─── RDS (PostgreSQL) ─────────────────────────────────────────
resource "aws_db_instance" "main" {
  identifier        = "myapp-prod-postgres"
  engine            = "postgres"
  engine_version    = "16.2"
  instance_class    = "db.t4g.medium"
  allocated_storage = 100
  storage_type      = "gp3"
  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds.arn

  db_name  = "myapp"
  username = "myapp_admin"
  password = random_password.db_password.result

  db_subnet_group_name   = module.vpc.database_subnet_group
  vpc_security_group_ids = [aws_security_group.rds.id]

  multi_az                = true # HA in production
  backup_retention_period = 30   # 30 days of automated backups
  backup_window           = "02:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:06:00"

  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "myapp-prod-final-snapshot"

  performance_insights_enabled = true

  tags = { Name = "myapp-prod-postgres" }
}

# ─── ElastiCache (Redis) ─────────────────────────────────────
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "myapp-prod-redis"
  description          = "Redis for session cache and Celery broker"

  node_type          = "cache.t4g.small"
  num_cache_clusters = 2 # primary + one replica
  port               = 6379

  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = random_password.redis_token.result

  tags = { Name = "myapp-prod-redis" }
}

# ─── ECS Fargate (Application) ───────────────────────────────
resource "aws_ecs_cluster" "main" {
  name = "myapp-prod"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_ecs_task_definition" "api" {
  family                   = "myapp-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name = "api"
    # Placeholder tag: the CI/CD pipeline registers new revisions with a git SHA tag
    image     = "${aws_ecr_repository.api.repository_url}:latest"
    essential = true

    portMappings = [{ containerPort = 8000, protocol = "tcp" }]

    # Non-sensitive settings only; anything containing a credential goes in secrets
    environment = [
      { name = "ENVIRONMENT", value = "production" }
    ]

    # REDIS_URL embeds the Redis auth token, so it is injected from Secrets Manager
    # rather than written in plaintext. redis_url is an assumed secret, defined
    # alongside db_url and app_secret (not shown), holding the full rediss:// URL.
    secrets = [
      { name = "DATABASE_URL", valueFrom = aws_secretsmanager_secret.db_url.arn },
      { name = "REDIS_URL", valueFrom = aws_secretsmanager_secret.redis_url.arn },
      { name = "SECRET_KEY", valueFrom = aws_secretsmanager_secret.app_secret.arn }
    ]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.api.name
        "awslogs-region"        = "eu-west-2"
        "awslogs-stream-prefix" = "api"
      }
    }

    # Requires curl to be installed in the container image
    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }
  }])
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = module.vpc.private_subnets
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "api"
    container_port   = 8000
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # The target group must be attached to a listener before the service is created
  depends_on = [aws_lb_listener.https]

  # CI/CD registers new task definition revisions; stop Terraform reverting them
  lifecycle { ignore_changes = [task_definition] }
}
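
The configuration above references several security groups without defining them. One possible shape for two of them is sketched below, assuming the application listens on port 8000 and Postgres on 5432. The names and descriptions are illustrative, and the ALB and Redis groups would follow the same pattern.

```hcl
# Sketch: app tasks accept traffic only from the ALB; the database only from app tasks.
resource "aws_security_group" "app" {
  name        = "myapp-prod-app"
  description = "ECS tasks, reachable only from the ALB"
  vpc_id      = module.vpc.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.alb.id
  ip_protocol                  = "tcp"
  from_port                    = 8000
  to_port                      = 8000
}

# Tasks need outbound access (via the NAT gateways) to pull images and call AWS APIs.
resource "aws_vpc_security_group_egress_rule" "app_outbound" {
  security_group_id = aws_security_group.app.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1"
}

resource "aws_security_group" "rds" {
  name        = "myapp-prod-rds"
  description = "Postgres, reachable only from the app tasks"
  vpc_id      = module.vpc.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "rds_from_app" {
  security_group_id            = aws_security_group.rds.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
```

Referencing security groups rather than CIDR ranges means the rules keep working when subnets change.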

Environment Setup: Dev / Staging / Production

Use AWS Organizations with separate accounts for each environment. Production gets its own account — isolated billing, separate IAM, separate network. Developers have access to the development account, senior engineers to staging, and only the CI/CD pipeline (and on-call engineers) can deploy to production. This prevents the classic "I was debugging in prod" incident.

Apply consistent resource tagging across all environments: Project, Environment, Team, ManagedBy. Tags enable cost allocation reports that show exactly how much each environment and component costs — essential for FinOps as you scale.
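
One way to get both the account separation and the tagging without repeating yourself is to put them in the provider block, roughly as below. The variables, the terraform-deploy role name, and the Team value are assumptions for illustration; the role would need to exist in each environment's account.

```hcl
# Sketch: one provider block per environment account, tags applied to every resource.
variable "environment" {
  type        = string
  description = "dev, staging, or prod"
}

variable "account_id" {
  type        = string
  description = "AWS account ID for this environment"
}

provider "aws" {
  region = "eu-west-2"

  # Hypothetical deployment role, created once in each environment's account
  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/terraform-deploy"
  }

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = {
      Project     = "myapp"
      Environment = var.environment
      Team        = "platform"
      ManagedBy   = "terraform"
    }
  }
}
```

Tags only show up in cost reports after you activate them as cost allocation tags in the billing console.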

CI/CD Pipeline: GitHub Actions + ECR + ECS

A minimal but production-ready CI/CD pipeline for a containerised SaaS application: on pull request, run tests and build the Docker image. On merge to main, push the image to ECR with a git SHA tag, run database migrations in a one-off ECS task, then update the ECS service with the new image tag. ECS's deployment circuit breaker automatically rolls back if the new tasks fail their health checks.
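
The ECR repository that pipeline pushes to can be sketched as follows. This assumes the pipeline pushes only git SHA tags and never a moving tag such as latest, which is why tags are set to immutable here; the repository name and the 50-image retention count are placeholders.

```hcl
# Sketch: a repository for SHA-tagged images, assuming no moving tags are pushed.
resource "aws_ecr_repository" "api" {
  name                 = "myapp-api"
  image_tag_mutability = "IMMUTABLE" # a given SHA tag always points at the same image

  image_scanning_configuration {
    scan_on_push = true
  }
}

# Expire old images so storage costs do not grow without bound.
resource "aws_ecr_lifecycle_policy" "api" {
  repository = aws_ecr_repository.api.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep the 50 most recent images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 50
      }
      action = { type = "expire" }
    }]
  })
}
```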

Use GitHub Actions OIDC to authenticate to AWS without storing long-lived access keys in GitHub secrets. You register GitHub as an IAM OIDC identity provider and allow specific repositories and branches to assume an IAM role using short-lived tokens, which is significantly more secure than storing and rotating access keys.
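
A minimal sketch of the AWS side of that setup is below. The repository example-org/myapp, the role name, and the use of the hashicorp/tls provider to look up the certificate thumbprint are assumptions; recent AWS provider releases make thumbprint_list optional, so check the version you are pinned to.

```hcl
# Sketch: trust GitHub's OIDC issuer and let one repository's main branch assume a role.
data "tls_certificate" "github" {
  url = "https://token.actions.githubusercontent.com/.well-known/openid-configuration"
}

resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.github.certificates[0].sha1_fingerprint]
}

data "aws_iam_policy_document" "github_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Hypothetical repository; restrict to the branch that is allowed to deploy
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/myapp:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "github_deploy" {
  name               = "myapp-github-deploy"
  assume_role_policy = data.aws_iam_policy_document.github_assume.json
}
```

The sub condition is the important part: without it, any repository on GitHub could request a token that satisfies the trust policy.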

Secrets Management

The cardinal rule: never store credentials in plaintext environment variables, .env files committed to git, or EC2 user data scripts. These leak into logs and console views, are visible to anyone with access to the repository or the instance, and cannot be rotated without a deployment.

Use AWS Secrets Manager from day one. Store your database URL, API keys, third-party service credentials, and application secret keys there. Reference them in the secrets section of your ECS task definitions: ECS retrieves the values and injects them as environment variables when a task starts, so only the secret's ARN appears in the task definition, never the value. Because injection happens at task start, running tasks pick up a changed secret only after they are restarted. Enable automatic rotation for your database password; AWS Secrets Manager handles this natively for RDS.
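
As a sketch of what sits behind a reference like aws_secretsmanager_secret.db_url, assuming the RDS instance and random_password resource from the listing above: the secret name and policy name are placeholders, and note that a value written this way is also stored in Terraform state, so the state bucket must be encrypted and access-controlled.

```hcl
# Sketch: a secret holding the database URL, plus permission for ECS to read it.
resource "aws_secretsmanager_secret" "db_url" {
  name = "myapp/prod/database-url"
}

resource "aws_secretsmanager_secret_version" "db_url" {
  secret_id = aws_secretsmanager_secret.db_url.id

  # endpoint is "host:port"; urlencode guards against special characters in the password
  secret_string = "postgresql://${aws_db_instance.main.username}:${urlencode(random_password.db_password.result)}@${aws_db_instance.main.endpoint}/${aws_db_instance.main.db_name}"
}

# The ECS execution role fetches secrets at task start, so it needs read access.
data "aws_iam_policy_document" "read_secrets" {
  statement {
    actions = ["secretsmanager:GetSecretValue"]
    resources = [
      aws_secretsmanager_secret.db_url.arn,
      aws_secretsmanager_secret.app_secret.arn,
    ]
  }
}

resource "aws_iam_role_policy" "ecs_execution_secrets" {
  name   = "read-app-secrets"
  role   = aws_iam_role.ecs_execution.id
  policy = data.aws_iam_policy_document.read_secrets.json
}
```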

Monitoring From Day One

Structured logging: configure your application to output JSON-formatted logs (not plain text). JSON logs are structured, searchable, and filterable in CloudWatch Logs Insights. Add fields like request_id, user_id, duration_ms, status_code, and error_message to every log line. This turns your logs from a wall of text into a queryable database of application events.
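
Once logs are JSON, CloudWatch can turn fields into metrics. The sketch below defines the log group referenced by the task definition and counts responses whose status_code is 500 or above; it assumes your application logs status_code as a number, and the log group name, retention period, and metric namespace are placeholders.

```hcl
# Sketch: log group with bounded retention, plus a metric derived from a JSON field.
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/myapp-api"
  retention_in_days = 30 # without this, logs are kept (and billed) forever
}

resource "aws_cloudwatch_log_metric_filter" "api_5xx" {
  name           = "api-5xx-responses"
  log_group_name = aws_cloudwatch_log_group.api.name

  # JSON filter pattern: matches log lines whose status_code field is >= 500
  pattern = "{ $.status_code >= 500 }"

  metric_transformation {
    name      = "Api5xxCount"
    namespace = "MyApp/Api"
    value     = "1"
  }
}
```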

Uptime monitoring: set up an external uptime monitor (Pingdom, Better Uptime, or AWS Route 53 health checks) that alerts immediately if your production URL becomes unreachable. Internal monitoring can't tell you the service is down if the monitoring system itself is affected by the same outage.
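
If you choose Route 53 health checks, a minimal definition could look like this. The hostname app.example.com is a placeholder, and the sketch assumes the /health endpoint is reachable through the public load balancer.

```hcl
# Sketch: external HTTPS probe of the public endpoint from AWS health checkers.
resource "aws_route53_health_check" "app" {
  fqdn              = "app.example.com" # placeholder hostname
  type              = "HTTPS"
  port              = 443
  resource_path     = "/health"
  request_interval  = 30 # seconds between checks
  failure_threshold = 3  # consecutive failures before the check is unhealthy

  tags = { Name = "myapp-prod-uptime" }
}
```

Route 53 publishes health check metrics to CloudWatch in us-east-1 only, so any alarm on this check has to be created with a provider configured for that region.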

Error tracking: integrate Sentry into your application from day one. Sentry captures unhandled exceptions with full stack traces, breadcrumbs, user context, and release tracking. The free tier covers up to 5,000 errors per month — more than enough for a startup. Alert to Slack immediately on new issues.

CloudWatch alarms: configure alarms for: ECS CPU above 80% (sustained 5 min), ECS memory above 85%, RDS CPU above 80%, RDS storage below 20% free, ALB 5xx error rate above 1%, and SQS dead-letter queue depth above 0 (any DLQ message means a failed job).
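
Three of those alarms are sketched below. The SNS topic aws_sns_topic.alerts and the dead-letter queue aws_sqs_queue.jobs_dlq are assumed to be defined elsewhere, and the storage threshold assumes the 100 GB volume from the RDS example.

```hcl
# Sketch: ECS CPU above 80% for five consecutive one-minute periods.
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "myapp-prod-api-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.api.name
  }
}

# RDS free storage below 20% of a 100 GB volume (the metric is reported in bytes).
resource "aws_cloudwatch_metric_alarm" "rds_storage_low" {
  alarm_name          = "myapp-prod-rds-storage-low"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "LessThanThreshold"
  threshold           = 20 * 1024 * 1024 * 1024
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.identifier
  }
}

# Any message in the dead-letter queue means a job has failed.
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "myapp-prod-dlq-not-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    QueueName = aws_sqs_queue.jobs_dlq.name
  }
}
```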

Cost Control for Startups

Budget alerts: set up AWS Budgets with alerts at 80% and 100% of your expected monthly spend. Receive alerts via email and SNS. This is a five-minute setup that prevents bill shock.
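
In Terraform the same setup might look like the sketch below. The limit of 750 USD, the email address, and the aws_sns_topic.alerts topic are placeholders; AWS bills in USD, so convert your expected spend accordingly, and the SNS topic policy must allow budgets.amazonaws.com to publish to it.

```hcl
# Sketch: monthly cost budget that notifies at 80% and 100% of actual spend.
resource "aws_budgets_budget" "monthly" {
  name         = "myapp-monthly-cost"
  budget_type  = "COST"
  limit_amount = "750" # placeholder: your expected monthly spend in USD
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # One notification block per threshold
  dynamic "notification" {
    for_each = [80, 100]

    content {
      comparison_operator        = "GREATER_THAN"
      threshold                  = notification.value
      threshold_type             = "PERCENTAGE"
      notification_type          = "ACTUAL"
      subscriber_email_addresses = ["ops@example.com"]
      subscriber_sns_topic_arns  = [aws_sns_topic.alerts.arn]
    }
  }
}
```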

Auto-stop dev environments: development databases and ECS services don't need to run 24/7. Use EventBridge Scheduler to stop RDS instances and scale ECS desired count to 0 at 7pm on weekdays and restart them at 8am. For a typical dev environment costing £200/month on-demand, this saves approximately £130/month: the environment runs 55 hours per week (11 hours on each of 5 weekdays) instead of 168, roughly a third of the time. Storage is still billed while an RDS instance is stopped, so the saving applies to compute.
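
A sketch of the RDS half of that schedule follows, using EventBridge Scheduler's universal targets to call the RDS API directly. Treat the target ARNs and the input key as assumptions to verify against the Scheduler documentation; the instance identifier myapp-dev-postgres and the role aws_iam_role.scheduler (which would need rds:StopDBInstance and rds:StartDBInstance) are placeholders. Scaling the ECS service to zero would follow the same pattern with a different target.

```hcl
# Sketch: stop the dev database at 19:00 and start it at 08:00, Monday to Friday.
locals {
  dev_db_schedules = {
    stop  = { cron = "cron(0 19 ? * MON-FRI *)", action = "stopDBInstance" }
    start = { cron = "cron(0 8 ? * MON-FRI *)", action = "startDBInstance" }
  }
}

resource "aws_scheduler_schedule" "dev_db" {
  for_each = local.dev_db_schedules

  name                         = "myapp-dev-db-${each.key}"
  schedule_expression          = each.value.cron
  schedule_expression_timezone = "Europe/London"

  flexible_time_window {
    mode = "OFF" # run at the scheduled minute, no jitter window
  }

  target {
    # Universal target: Scheduler calls the RDS API action on your behalf
    arn      = "arn:aws:scheduler:::aws-sdk:rds:${each.value.action}"
    role_arn = aws_iam_role.scheduler.arn
    input    = jsonencode({ DbInstanceIdentifier = "myapp-dev-postgres" })
  }
}
```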

Right-size from the start: resist the urge to over-provision "just in case". Start with the smallest instance that meets your needs based on load testing, not gut feel. It is easy to scale up; it is psychologically harder to scale down a resource you already provisioned (because you worry about what might break).
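
Starting small is less risky when the service can grow on its own. A target-tracking autoscaling policy for the ECS service might be sketched as below; the 2 to 6 task range and the 60% CPU target are illustrative values, not recommendations derived from load testing.

```hcl
# Sketch: keep average CPU near 60% by adding or removing Fargate tasks.
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
  min_capacity       = 2
  max_capacity       = 6
}

resource "aws_appautoscaling_policy" "api_cpu" {
  name               = "api-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  resource_id        = aws_appautoscaling_target.api.resource_id

  target_tracking_scaling_policy_configuration {
    target_value = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

With autoscaling in place, add desired_count to the service's ignore_changes list so Terraform does not reset the task count on the next apply.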

Building a SaaS? Get Your Infrastructure Right From Day One.

SpiderHunts Technologies sets up complete, production-ready cloud infrastructure for SaaS startups — Terraform, CI/CD, managed databases, monitoring, secrets management, and security. Done in 2 to 3 weeks. Ready to scale with you.
