
Infrastructure as Code Best Practices for High Scale Teams

Battle-tested best practices for Infrastructure as Code tailored to high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 12 min read

High-scale infrastructure as code requires patterns that go beyond standard enterprise practices. When you're managing 10,000+ resources across multiple clouds and regions, the challenges shift from correctness to performance, state management at scale, and organizational coordination across dozens of teams.

Scalability Challenges

Standard Terraform workflows break down at high scale:

  • State files exceeding 100MB cause slow plan/apply cycles
  • Provider API rate limits throttle parallel resource creation
  • Module dependency graphs become complex enough to cause circular references
  • CI/CD pipelines take 30+ minutes for plan operations
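A quick way to gauge whether a workspace is approaching these limits is to count the resource instances its state tracks. A minimal sketch that parses Terraform's JSON state format (version 4); the inline sample document is illustrative:

```python
import json

def count_state_resources(state_json: str) -> int:
    """Count resource instances tracked in a Terraform state document."""
    state = json.loads(state_json)
    return sum(len(r.get("instances", [])) for r in state.get("resources", []))

# Tiny inline state document for illustration
sample_state = json.dumps({
    "version": 4,
    "resources": [
        {"type": "aws_vpc", "name": "main", "instances": [{}]},
        {"type": "aws_subnet", "name": "private", "instances": [{}, {}]},
    ],
})
print(count_state_resources(sample_state))  # → 3
```

Run against real state files pulled from your backend, a count trending toward the hundreds is the signal to start decomposing.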

Best Practices

1. Hierarchical State Architecture

Decompose infrastructure into layers with explicit dependency ordering:

  • Layer 0 (Foundation): AWS Organizations, account structure, DNS zones
  • Layer 1 (Networking): VPCs, transit gateways, peering connections
  • Layer 2 (Platform): EKS clusters, RDS instances, ElastiCache
  • Layer 3 (Application): Deployments, load balancers, autoscaling
  • Layer 4 (Observability): CloudWatch, Datadog integration, alerts

Each layer reads outputs from lower layers via terraform_remote_state or data sources. This prevents circular dependencies and limits blast radius to a single layer.

```hcl
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "layer-1/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  name     = "production"
  role_arn = aws_iam_role.cluster.arn

  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}
```

2. Parallel Execution with Resource Targeting

When managing 1000+ resources, use targeted applies for faster iteration:

```bash
# Only apply changes to the specific module that changed
terraform plan -target=module.service_a -out=plan.out
terraform apply plan.out

# Parallel execution across independent modules
parallel -j4 'cd {} && terraform apply -auto-approve' ::: \
  modules/service-a \
  modules/service-b \
  modules/service-c \
  modules/service-d
```

3. Custom Provider Configurations for Rate Limiting

```hcl
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Team        = var.team
    }
  }
}

# Separate provider for high-volume API calls with retry configuration
provider "aws" {
  alias  = "high_throughput"
  region = "us-east-1"

  retry_mode  = "adaptive"
  max_retries = 10
}
```
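To build intuition for what adaptive retry buys you, here is a simplified sketch of exponential backoff with full jitter, the family of strategies such retry modes are based on. This is an illustration only, not the AWS SDK's actual algorithm:

```python
import random

def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 20.0) -> list[float]:
    """Exponential backoff with full jitter (illustrative sketch).

    Each attempt waits a random duration between 0 and the capped
    exponential ceiling, which spreads retries out and avoids
    thundering-herd spikes against a rate-limited API.
    """
    return [
        random.uniform(0, min(cap, base * 2 ** attempt))
        for attempt in range(max_retries)
    ]

delays = backoff_delays(10)
print(len(delays))  # → 10
```

The jitter is the important part: ten workers retrying on a fixed schedule hit the rate limit in lockstep, while randomized delays desynchronize them.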

4. State File Optimization

At scale, state files grow large. Optimize with:

```bash
# Remove state entries for destroyed resources
terraform state rm 'module.old_service'

# Move a resource to a new module address during refactoring
terraform state mv \
  'module.monolith.aws_ecs_service.app' \
  'module.service_a.aws_ecs_service.app'
```

Automated state cleanup scripts prevent state bloat:

```python
import subprocess

def find_orphaned_resources(cloud_resources: set[str]) -> list[str]:
    """Return addresses tracked in Terraform state but missing from the cloud.

    `cloud_resources` is the set of addresses discovered from the cloud
    provider's APIs (e.g. via a separate inventory script).
    """
    result = subprocess.run(
        ["terraform", "state", "list"],
        capture_output=True, text=True, check=True,
    )
    state_resources = set(result.stdout.strip().split("\n"))

    # Resources in state but absent from the cloud are orphans
    return sorted(state_resources - cloud_resources)
```

5. Multi-Account Strategy with Terragrunt

```hcl
# terragrunt.hcl at root
remote_state {
  backend = "s3"
  config = {
    bucket         = "terraform-state-${get_aws_account_id()}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

inputs = {
  environment = basename(get_terragrunt_dir())
  account_id  = get_aws_account_id()
  region      = "us-east-1"
}
```
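A root configuration like this typically pairs with a directory tree where each leaf inherits the remote state and inputs from the root. The account and environment names below are illustrative:

```
live/
├── terragrunt.hcl                  # root config above
├── prod/
│   └── us-east-1/
│       ├── layer-1-networking/terragrunt.hcl
│       └── layer-2-platform/terragrunt.hcl
└── staging/
    └── us-east-1/
        └── layer-1-networking/terragrunt.hcl
```

Each leaf's state key falls out of its path via `path_relative_to_include()`, so teams add workspaces without ever hand-writing backend configuration.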


Anti-Patterns to Avoid

  1. Single state file for entire infrastructure — at high scale, this causes 30+ minute plan times and massive blast radius.
  2. Manual resource imports — use import blocks (Terraform 1.5+) for declarative imports that survive code review.
  3. Over-abstraction in modules — deeply nested module hierarchies (4+ levels) create debugging nightmares. Keep module depth to 2 levels maximum.
  4. Ignoring provider API limits — parallel resource creation can hit rate limits, causing intermittent failures that waste CI time.
  5. Shared workspaces across teams — each team must own their state. Cross-team dependencies flow through data sources and outputs.
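The declarative import mentioned in anti-pattern 2 looks like this (Terraform 1.5+; the bucket name and resource address are illustrative):

```hcl
# Reviewed in a PR like any other change; `terraform plan` previews the import
import {
  to = aws_s3_bucket.logs
  id = "my-log-bucket"
}

resource "aws_s3_bucket" "logs" {
  bucket = "my-log-bucket"
}
```

Unlike `terraform import` run from a laptop, the import block lives in version control, survives code review, and can be applied by CI with the same credentials as every other change.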

Checklist

  • Infrastructure decomposed into dependency layers (0-4)
  • No single state file manages > 500 resources
  • CI plan times < 10 minutes for any single workspace
  • Provider retry and rate limiting configured for high-volume operations
  • State cleanup automation runs weekly
  • Multi-account strategy with account-level isolation
  • Terragrunt or similar wrapper manages cross-workspace dependencies
  • Cost estimation integrated into plan review (Infracost)
  • Automated rollback procedure documented and tested quarterly
  • Cross-region disaster recovery for state files

Conclusion

High-scale IaC is an exercise in decomposition and parallelism. The practices that work for 100 resources break at 10,000. Layered architecture prevents circular dependencies and limits blast radius. Targeted applies and parallel execution keep CI times manageable. Provider-level rate limiting prevents intermittent failures. And aggressive state file hygiene prevents the slow degradation that makes Terraform workflows unusable over time.

The organizational challenge is equally important: at high scale, IaC must be a platform that teams consume through modules and workflows, not a monolithic configuration that a central team maintains. Self-service infrastructure provisioning through a private module registry, automated testing, and PR-based workflows lets individual teams move fast within the guardrails.



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
