
Kubernetes Production Setup Best Practices for Startup Teams

Battle-tested best practices for a production Kubernetes setup tailored to startup teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 18 min read

Startups running Kubernetes face a different set of constraints than established companies. Budget is tight, the team is small (often one or two people managing infrastructure), and speed of iteration matters more than theoretical perfection. These practices focus on getting a production-ready Kubernetes setup without over-engineering — maximizing reliability per dollar spent.

Start with Managed Kubernetes

Self-hosting the Kubernetes control plane is never the right call for a startup. EKS, GKE, or AKS eliminate the need to manage etcd, the API server, and controller managers. The $75/month cost for an EKS control plane is trivially cheap compared to the engineering hours of debugging a self-managed etcd cluster at 3 AM.

```bash
# EKS cluster with eksctl — production-ready in 15 minutes
eksctl create cluster \
  --name production \
  --region us-east-1 \
  --version 1.29 \
  --nodegroup-name general \
  --node-type t3.large \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 10 \
  --managed \
  --asg-access
```

GKE Autopilot is worth considering for very early-stage startups. It removes node management entirely, charging per pod resource request. You lose some flexibility but eliminate an entire category of operational concerns.
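If you go that route, cluster creation is a single command. A minimal sketch, assuming the gcloud CLI is authenticated and a project is configured; the cluster name and region are placeholders:

```bash
# Autopilot cluster — no node pools or node upgrades to manage
gcloud container clusters create-auto production \
  --region us-central1 \
  --release-channel regular
```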

Resource Requests: Always Set Them

The single most impactful practice for cluster stability is setting resource requests on every container. Without them, the scheduler cannot make informed placement decisions, and pods compete for resources unpredictably.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v1.2.3
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```

A common startup mistake is setting CPU limits. CPU is a compressible resource — when a pod hits its CPU limit, it gets throttled rather than killed. This throttling creates latency spikes that are difficult to diagnose. Set CPU requests for scheduling but omit CPU limits unless you have a specific reason.

Memory limits are different. Memory is incompressible — a pod exceeding its memory limit gets OOM-killed. Always set memory limits.

Namespace Strategy

Keep it simple. Three namespaces cover most startup needs:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    environment: production
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    environment: staging
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    environment: monitoring
```

Add ResourceQuotas to prevent a single namespace from consuming the entire cluster:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.memory: 32Gi
    pods: "50"
```

Ingress with Cert-Manager

Every startup needs TLS termination and routing. The nginx ingress controller plus cert-manager is the standard, well-tested combination:

```bash
# Add the chart repositories (one-time setup)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install nginx ingress
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2

# Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
```
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.yourcompany.com
      secretName: api-tls
  rules:
    - host: api.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8080
```

GitOps from Day One

ArgoCD or Flux should be your deployment mechanism from the first day. Manual kubectl apply doesn't scale, and more importantly, it doesn't provide an audit trail or easy rollbacks.

```yaml
# argocd-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/yourcompany/k8s-manifests
    targetRevision: main
    path: production/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

The overhead of setting up ArgoCD is about 2 hours. The time saved on the first rollback pays that back immediately.


Cost Optimization

Spot Instances for Non-Critical Workloads

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: background-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: background-worker
  template:
    metadata:
      labels:
        app: background-worker
    spec:
      # Schedule only onto spot nodes and tolerate their taint
      nodeSelector:
        node.kubernetes.io/lifecycle: spot
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      terminationGracePeriodSeconds: 60
      containers:
        - name: worker
          image: worker:v1.0.0
          lifecycle:
            # Pause before shutdown so in-flight work can drain
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10 && kill -SIGTERM 1"]
```

Spot instances save 60-70% on compute costs. Use them for background workers, batch jobs, and any workload that handles interruption gracefully. Keep your API servers on on-demand instances.

Karpenter for Smarter Scaling

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t3.large", "t3.xlarge", "m5.large", "m5.xlarge"]
  limits:
    cpu: 100
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
```

Karpenter provisions nodes faster than the Kubernetes Cluster Autoscaler (typically 60 seconds vs 3-5 minutes) and makes better instance selection decisions. For a startup, where every minute of unused capacity costs money, this matters.

Monitoring on a Budget

You don't need a full observability platform on day one. Start with the basics:

```bash
# kube-prometheus-stack via Helm
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.resources.requests.memory=1Gi \
  --set prometheus.prometheusSpec.resources.limits.memory=2Gi \
  --set grafana.adminPassword=changeme
```

Seven days of retention is sufficient for a startup. If you need longer-term metrics, add Thanos or Grafana Mimir later. The key dashboards to set up immediately are node resource utilization, pod restart counts, and request latency by service.
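As a starting point, queries along these lines can back those three dashboards. The node and restart metrics are kube-prometheus-stack defaults; the latency histogram name is an assumption and depends on what your services expose:

```promql
# Node CPU utilization (fraction busy, per node)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Pod restarts over the last hour, by namespace and pod
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))

# p95 request latency per service (assumes a histogram named http_request_duration_seconds)
histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```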

Anti-Patterns to Avoid

Over-engineering with service mesh. Istio adds 200Mi+ memory per sidecar and significant operational complexity. Unless you have specific mTLS, traffic management, or observability needs that can't be met by simpler tools, skip it until you have the team to manage it.

Running databases in Kubernetes. Managed databases (RDS, Cloud SQL, Atlas) are almost always the right choice for startups. StatefulSets work, but the operational overhead of managing persistent volumes, backup schedules, and failover in Kubernetes is substantial for a small team.

Creating too many environments. Production and staging are sufficient. Each additional environment costs money and maintenance time. Use feature flags instead of long-lived preview environments.

Ignoring security basics. Network policies, RBAC, and pod security standards take an afternoon to set up and prevent entire categories of incidents. The startup that skips security "to move faster" moves slower after their first breach.
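As one concrete piece of that afternoon's work, a default-deny NetworkPolicy for a namespace is only a few lines; this is a sketch for the production namespace, after which you allowlist only the traffic each service needs:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}   # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```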

Not setting up PodDisruptionBudgets. Even with 2-3 replicas, a PDB prevents cluster upgrades from taking down your service. A few lines of YAML prevent hours of downtime.
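A minimal PDB sketch, matching the api Deployment shown earlier; adjust minAvailable to your replica count:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 1   # keep at least one pod running during voluntary disruptions
  selector:
    matchLabels:
      app: api
```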

Production Checklist

  • Managed Kubernetes (EKS/GKE/AKS) — never self-hosted
  • Resource requests on every container
  • Memory limits on every container (skip CPU limits)
  • Health checks (readiness and liveness) on every container
  • Cert-manager with Let's Encrypt for TLS
  • GitOps deployment via ArgoCD or Flux
  • Spot instances for non-critical workloads
  • Karpenter or Cluster Autoscaler configured
  • kube-prometheus-stack for monitoring
  • PDBs on services with 2+ replicas
  • ResourceQuotas per namespace
  • RBAC with least-privilege access
  • Default-deny network policies
  • Pod Security Standards enforced
  • Automated backups for any stateful components
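Some of these items are a single command. For example, enforcing the restricted Pod Security Standard on a namespace (a sketch; the namespace name is a placeholder):

```bash
# Enforce the restricted Pod Security Standard on the production namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted
```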

Conclusion

A startup Kubernetes setup should optimize for reliability and operational simplicity, not theoretical completeness. Managed Kubernetes, GitOps, basic monitoring, and proper resource configuration cover 90% of what a small team needs. Every additional layer of complexity — service mesh, custom operators, multi-cluster federation — should be deferred until the team and traffic justify it.

The practices here represent approximately one week of setup work for a single engineer. After that initial investment, the ongoing operational burden is minimal: dependency updates, certificate rotations (automated via cert-manager), and responding to alerts. This foundation scales comfortably to serve thousands of requests per second across dozens of services.
