
Kubernetes Production Setup Best Practices for Startup Teams

Battle-tested best practices for a production Kubernetes setup tailored to startup teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 18 min read

Startups running Kubernetes face a different set of constraints than established companies. Budget is tight, the team is small (often one or two people managing infrastructure), and speed of iteration matters more than theoretical perfection. These practices focus on getting a production-ready Kubernetes setup without over-engineering — maximizing reliability per dollar spent.

Start with Managed Kubernetes

Self-hosting the Kubernetes control plane is never the right call for a startup. EKS, GKE, or AKS eliminate the need to manage etcd, the API server, and controller managers. The $75/month cost for an EKS control plane is trivially cheap compared to the engineering hours of debugging a self-managed etcd cluster at 3 AM.

```bash
# EKS cluster with eksctl — production-ready in 15 minutes
eksctl create cluster \
  --name production \
  --region us-east-1 \
  --version 1.29 \
  --nodegroup-name general \
  --node-type t3.large \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 10 \
  --managed \
  --asg-access
```

GKE Autopilot is worth considering for very early-stage startups. It removes node management entirely, charging per pod resource request. You lose some flexibility but eliminate an entire category of operational concerns.
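If you go that route, cluster creation is a single command. A minimal sketch, assuming the gcloud CLI is authenticated and a project is configured; the cluster name and region are placeholders:

```bash
# Autopilot cluster — no node pools or node upgrades to manage
gcloud container clusters create-auto production \
  --region us-central1 \
  --release-channel regular
```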

Resource Requests: Always Set Them

The single most impactful practice for cluster stability is setting resource requests on every container. Without them, the scheduler cannot make informed placement decisions, and pods compete for resources unpredictably.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v1.2.3
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```

A common startup mistake is setting CPU limits. CPU is a compressible resource — when a pod hits its CPU limit, it gets throttled rather than killed. This throttling creates latency spikes that are difficult to diagnose. Set CPU requests for scheduling but omit CPU limits unless you have a specific reason.

Memory limits are different. Memory is incompressible — a pod exceeding its memory limit gets OOM-killed. Always set memory limits.

Namespace Strategy

Keep it simple. Three namespaces cover most startup needs:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    environment: production
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    environment: staging
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    environment: monitoring
```

Add ResourceQuotas to prevent a single namespace from consuming the entire cluster:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.memory: 32Gi
    pods: "50"
```

Ingress with Cert-Manager

Every startup needs TLS termination and routing. The nginx ingress controller plus cert-manager is the standard, well-tested combination:

```bash
# Add the chart repositories (one-time setup)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install nginx ingress
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2

# Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
```
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.yourcompany.com
      secretName: api-tls
  rules:
    - host: api.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8080
```

GitOps from Day One

ArgoCD or Flux should be your deployment mechanism from the first day. Manual kubectl apply doesn't scale, and more importantly, it doesn't provide an audit trail or easy rollbacks.

```yaml
# argocd-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/yourcompany/k8s-manifests
    targetRevision: main
    path: production/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

The overhead of setting up ArgoCD is about 2 hours. The time saved on the first rollback pays that back immediately.


Cost Optimization

Spot Instances for Non-Critical Workloads

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: background-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: background-worker
  template:
    metadata:
      labels:
        app: background-worker
    spec:
      # Schedule only onto spot nodes and tolerate their taint
      nodeSelector:
        node.kubernetes.io/lifecycle: spot
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      terminationGracePeriodSeconds: 60
      containers:
        - name: worker
          image: worker:v1.0.0
          lifecycle:
            # Pause before shutdown so in-flight work can drain
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10 && kill -SIGTERM 1"]
```

Spot instances save 60-70% on compute costs. Use them for background workers, batch jobs, and any workload that handles interruption gracefully. Keep your API servers on on-demand instances.

Karpenter for Smarter Scaling

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t3.large", "t3.xlarge", "m5.large", "m5.xlarge"]
  limits:
    cpu: 100
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
```

Karpenter provisions nodes faster than the Kubernetes Cluster Autoscaler (typically 60 seconds vs 3-5 minutes) and makes better instance selection decisions. For a startup, where every minute of unused capacity costs money, this matters.

Monitoring on a Budget

You don't need a full observability platform on day one. Start with the basics:

```bash
# kube-prometheus-stack via Helm
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.resources.requests.memory=1Gi \
  --set prometheus.prometheusSpec.resources.limits.memory=2Gi \
  --set grafana.adminPassword=changeme
```

Seven days of retention is sufficient for a startup. If you need longer-term metrics, add Thanos or Grafana Mimir later. The key dashboards to set up immediately are node resource utilization, pod restart counts, and request latency by service.
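As a starting point, queries along these lines can back those three dashboards. The node and restart metrics are kube-prometheus-stack defaults; the latency histogram name is an assumption and depends on what your services expose:

```promql
# Node CPU utilization (fraction busy, per node)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Pod restarts over the last hour, by namespace and pod
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))

# p95 request latency per service (assumes a histogram named http_request_duration_seconds)
histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```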

Anti-Patterns to Avoid

Over-engineering with service mesh. Istio adds 200Mi+ memory per sidecar and significant operational complexity. Unless you have specific mTLS, traffic management, or observability needs that can't be met by simpler tools, skip it until you have the team to manage it.

Running databases in Kubernetes. Managed databases (RDS, Cloud SQL, Atlas) are almost always the right choice for startups. StatefulSets work, but the operational overhead of managing persistent volumes, backup schedules, and failover in Kubernetes is substantial for a small team.

Creating too many environments. Production and staging are sufficient. Each additional environment costs money and maintenance time. Use feature flags instead of long-lived preview environments.

Ignoring security basics. Network policies, RBAC, and pod security standards take an afternoon to set up and prevent entire categories of incidents. The startup that skips security "to move faster" moves slower after their first breach.
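As one concrete piece of that afternoon's work, a default-deny NetworkPolicy for a namespace is only a few lines; this is a sketch for the production namespace, after which you allowlist only the traffic each service needs:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}   # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```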

Not setting up PodDisruptionBudgets. Even with 2-3 replicas, a PDB prevents cluster upgrades from taking down your service. A few lines of YAML prevent hours of downtime.
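A minimal PDB sketch, matching the api Deployment shown earlier; adjust minAvailable to your replica count:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 1   # keep at least one pod running during voluntary disruptions
  selector:
    matchLabels:
      app: api
```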

Production Checklist

  • Managed Kubernetes (EKS/GKE/AKS) — never self-hosted
  • Resource requests on every container
  • Memory limits on every container (skip CPU limits)
  • Health checks (readiness and liveness) on every container
  • Cert-manager with Let's Encrypt for TLS
  • GitOps deployment via ArgoCD or Flux
  • Spot instances for non-critical workloads
  • Karpenter or Cluster Autoscaler configured
  • kube-prometheus-stack for monitoring
  • PDBs on services with 2+ replicas
  • ResourceQuotas per namespace
  • RBAC with least-privilege access
  • Default-deny network policies
  • Pod Security Standards enforced
  • Automated backups for any stateful components
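Some of these items are a single command. For example, enforcing the restricted Pod Security Standard on a namespace (a sketch; the namespace name is a placeholder):

```bash
# Enforce the restricted Pod Security Standard on the production namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted
```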

Conclusion

A startup Kubernetes setup should optimize for reliability and operational simplicity, not theoretical completeness. Managed Kubernetes, GitOps, basic monitoring, and proper resource configuration cover 90% of what a small team needs. Every additional layer of complexity — service mesh, custom operators, multi-cluster federation — should be deferred until the team and traffic justify it.

The practices here represent approximately one week of setup work for a single engineer. After that initial investment, the ongoing operational burden is minimal: dependency updates, certificate rotations (automated via cert-manager), and responding to alerts. This foundation scales comfortably to serve thousands of requests per second across dozens of services.
