Multi-Tenant Architecture at Scale: Lessons from Production

Real-world lessons from implementing multi-tenant architecture in production, including architecture decisions, measurable results, and an honest retrospective.

Muneer Puthiya Purayil

In 2024, we rebuilt the multi-tenant architecture of a B2B project management SaaS serving 2,400 tenants. The original architecture, a shared database with application-level tenant filtering, had produced three cross-tenant data leakage incidents in six months. This is the story of migrating to database-enforced tenant isolation under production load.

Starting Point

The application was a Rails-based project management tool on AWS. All 2,400 tenants shared a single PostgreSQL RDS instance with no row-level security. Tenant isolation relied entirely on application code: every ActiveRecord query was expected to include where(tenant_id: current_tenant.id). Three incidents in six months proved this approach insufficient:

  1. Incident 1: A developer forgot the tenant scope on a new API endpoint. 12 tenants' project data was visible to any authenticated user for 3 hours before detection.
  2. Incident 2: A background job processing queue lost tenant context when retrying failed jobs, causing file attachments to be associated with the wrong tenant.
  3. Incident 3: A search feature indexed all tenants' data without the tenant filter, exposing project names and descriptions across tenants for 2 days.

After the third incident, three enterprise customers threatened to cancel ($180,000 ARR at risk). The board approved a 3-month engineering investment to fix the architecture.

Architecture Decisions

Why PostgreSQL RLS + Schema-per-Tenant Hybrid

We evaluated three options:

  1. Fix application code (add better testing, code review): Rejected. The root cause was that isolation depended on developer discipline. More discipline wouldn't eliminate the risk.
  2. Row-Level Security on the shared schema: Implemented as phase 1. Provides database-enforced isolation without data migration.
  3. Schema-per-tenant for enterprise customers: Implemented as phase 2. Provides physical isolation for the 15 enterprise customers that demanded it.
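To make the hybrid concrete, here is a minimal sketch of how a request might pick its database session setup. It assumes enterprise tenants are routed to their own schema through search_path while everyone else stays on the shared schema under RLS; the article does not say how routing was actually done. The names Tenant, session_setup_sql, and tenant_acme are hypothetical.

```ruby
# Hypothetical sketch: per-request session setup for the hybrid model.
# Assumption: enterprise tenants are selected via search_path; the source
# does not describe the real routing mechanism.

UUID_PATTERN = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/
IDENTIFIER_PATTERN = /\A[a-z_][a-z0-9_]*\z/

# schema is nil for tenants that live in the shared schema.
Tenant = Struct.new(:id, :schema, keyword_init: true)

# Returns the SQL statements to run before any tenant query.
def session_setup_sql(tenant)
  # Validate before interpolating, since SET cannot take bind parameters.
  raise ArgumentError, "invalid tenant id" unless tenant.id.to_s.match?(UUID_PATTERN)

  statements = ["SET app.current_tenant = '#{tenant.id}'"]
  if tenant.schema
    raise ArgumentError, "invalid schema name" unless tenant.schema.match?(IDENTIFIER_PATTERN)
    statements << "SET search_path TO #{tenant.schema}"
  else
    statements << "SET search_path TO public"
  end
  statements
end

shared = Tenant.new(id: "123e4567-e89b-12d3-a456-426614174000")
enterprise = Tenant.new(id: "9b2e4567-e89b-12d3-a456-426614174111", schema: "tenant_acme")
puts session_setup_sql(shared)
puts session_setup_sql(enterprise)
```

Setting app.current_tenant for both kinds of tenant keeps the RLS policy usable as a second layer even where a dedicated schema exists, which is a design choice of this sketch rather than something the migration is known to have done.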

Migration Strategy

```sql
-- Phase 1: Add RLS to all existing tables
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;
-- FORCE applies the policies to the table owner as well
ALTER TABLE projects FORCE ROW LEVEL SECURITY;

CREATE POLICY projects_tenant_isolation ON projects
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- Repeat for all 34 tables with tenant_id
```

The RLS migration was non-destructive — it added policies without changing data. We deployed it table-by-table over two weeks, monitoring for query failures.
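Repeating the same three statements across 34 tables is easy to get wrong by hand, so a small generator can help with a table-by-table rollout. The following is only a sketch: rls_statements is a hypothetical helper, and the table list is illustrative because the article names only projects.

```ruby
# Hypothetical helper that emits the phase 1 statements for one table.
# Table names are interpolated into DDL, so they are validated first.
def rls_statements(table)
  unless table.match?(/\A[a-z_][a-z0-9_]*\z/)
    raise ArgumentError, "unexpected table name: #{table.inspect}"
  end

  [
    "ALTER TABLE #{table} ENABLE ROW LEVEL SECURITY;",
    "ALTER TABLE #{table} FORCE ROW LEVEL SECURITY;",
    "CREATE POLICY #{table}_tenant_isolation ON #{table} " \
      "USING (tenant_id = current_setting('app.current_tenant')::uuid);"
  ]
end

# Illustrative list; only "projects" appears in the original migration.
TABLES = %w[projects tasks comments]

# Print one table's statements at a time so each can ship as its own deploy.
TABLES.each do |table|
  puts "-- #{table}"
  puts rls_statements(table)
end
```

Generating the statements per table matches the incremental deployment described above, since each table's output can be applied and monitored on its own.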

Measurable Results

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cross-tenant data incidents | 3 in 6 months | 0 in 12 months | -100% |
| Enterprise customer churn risk | $180K ARR | $0 | Eliminated |
| Query latency (p50) | 12ms | 13ms | +8% |
| Query latency (p99) | 145ms | 152ms | +5% |
| Monthly infrastructure cost | $4,200 | $5,800 | +38% |

The 38% cost increase came from the schema-per-tenant infrastructure for 15 enterprise customers. The 5-8% latency increase from RLS policy evaluation (1ms at p50, 7ms at p99) was negligible for our workload.

What Went Wrong

RLS broke admin panels. Our internal admin tools queried across tenants for support and analytics. RLS correctly blocked these queries. We had to create a separate database role for admin operations that bypassed RLS, with audit logging on every cross-tenant query.
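One way to keep such a bypass accountable is to force every cross-tenant query through a wrapper that records who ran it and why. This is a sketch under assumptions, not the tooling we actually built: AuditedAdminQuery and RecordingConnection are hypothetical, and the connection is assumed to be logged in as a role that is allowed to bypass RLS (PostgreSQL offers the BYPASSRLS role attribute for this).

```ruby
require "time"

# Stand-in for a real database connection; it only records what it is sent.
class RecordingConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  def exec(sql)
    @statements << sql
    []
  end
end

# Hypothetical wrapper: nothing reaches the admin connection without an
# audit entry being written first.
class AuditedAdminQuery
  def initialize(connection:, audit_log:)
    @connection = connection
    @audit_log = audit_log
  end

  def run(sql, actor:, reason:)
    raise ArgumentError, "actor is required" if actor.to_s.strip.empty?
    raise ArgumentError, "reason is required" if reason.to_s.strip.empty?

    @audit_log << { actor: actor, reason: reason, sql: sql, at: Time.now.utc.iso8601 }
    @connection.exec(sql)
  end
end

audit_log = []
admin = AuditedAdminQuery.new(connection: RecordingConnection.new, audit_log: audit_log)
admin.run("SELECT tenant_id, count(*) FROM projects GROUP BY tenant_id",
          actor: "support@example.com", reason: "ticket triage")
puts audit_log.length
```

Writing the audit entry before executing means a failed query still leaves a trace, at the cost of logging attempts that never returned data.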

Background jobs lost tenant context. Sidekiq workers inherited the web request's tenant context through thread-local variables, but on retry, this context was lost. We had to serialize tenant_id into every job's payload and set the PostgreSQL session variable at the start of each job execution.
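The fix can be sketched in plain Ruby without Sidekiq itself. The helpers below (enqueue_payload, perform_with_tenant, RecordingConnection) are hypothetical names for illustration; the point is that tenant_id travels inside the serialized payload, so a retry sees the same value as the first attempt, and the session variable is set before any job code runs.

```ruby
require "json"

UUID_PATTERN = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/

# Stand-in for a real database connection; it only records statements.
class RecordingConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  def exec(sql)
    @statements << sql
  end
end

# Build the payload at enqueue time, with the tenant made explicit.
def enqueue_payload(job_class, tenant_id, args)
  { "class" => job_class, "tenant_id" => tenant_id, "args" => args }
end

# Run a job body with the tenant taken from the payload, never from
# thread-local state. A payload without tenant_id raises KeyError.
def perform_with_tenant(payload, connection)
  tenant_id = payload.fetch("tenant_id")
  raise ArgumentError, "invalid tenant id" unless tenant_id.to_s.match?(UUID_PATTERN)

  connection.exec("SET app.current_tenant = '#{tenant_id}'")
  begin
    yield payload.fetch("args")
  ensure
    # Clear the setting so a reused connection does not keep this tenant.
    connection.exec("RESET app.current_tenant")
  end
end

# A JSON round trip stands in for the queue: retries read the same payload.
stored = JSON.generate(enqueue_payload("AttachFileJob", "123e4567-e89b-12d3-a456-426614174000", [42]))
connection = RecordingConnection.new
perform_with_tenant(JSON.parse(stored), connection) { |args| puts "attaching file #{args.first}" }
puts connection.statements
```

In a real Sidekiq setup this logic would more likely live in client and server middleware than in each job, but that wiring is outside what the article describes.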

Schema migrations for 15 tenants. Applying migrations to 15 separate schemas increased deployment time from 30 seconds to 4 minutes. We parallelized schema migrations to bring this back to under 1 minute.
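The parallelization can be approximated with a small worker pool over the list of schemas. This is a sketch, not our deployment script: migrate_schemas_in_parallel is a hypothetical helper, the worker count is arbitrary, and the block stands in for whatever actually applies migrations to one schema.

```ruby
# Hypothetical helper: run the given block once per schema using a fixed
# number of worker threads, and collect a per-schema outcome.
def migrate_schemas_in_parallel(schemas, workers: 4, &migrate)
  pending = Queue.new
  schemas.each { |schema| pending << schema }
  outcomes = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        schema = begin
          pending.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break
        end

        begin
          migrate.call(schema)
          outcomes << [schema, :ok]
        rescue StandardError => e
          # Record the failure and keep going with the remaining schemas.
          outcomes << [schema, e]
        end
      end
    end
  end
  threads.each(&:join)

  Array.new(outcomes.size) { outcomes.pop }.to_h
end

schemas = (1..15).map { |n| "tenant_#{n}" } # illustrative schema names
results = migrate_schemas_in_parallel(schemas, workers: 5) do |schema|
  sleep 0.01 # placeholder for the real migration work
end
puts results.values.count(:ok)
```

Collecting errors per schema instead of aborting on the first one makes it visible which tenants are left on the old version, which matters once schemas can drift apart.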

Honest Retrospective

The biggest win was eliminating cross-tenant incidents entirely. In the 12 months since the migration, we have had zero data leakage events. This alone justified the investment.

What we'd do differently:

  1. Start with RLS from day one. The retrofit cost 3 engineer-months; implementing RLS at project start would have taken 2 days.
  2. Design background jobs with explicit tenant context from the beginning. Implicit context through thread-local variables is fragile.
  3. Implement cross-tenant access testing in CI before the first incident, not after three.
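One inexpensive piece of such a CI suite is a guard that fails the build when a tenant-scoped table is missing forced RLS. The sketch below works on plain data so it runs anywhere; in practice the rows would presumably come from the PostgreSQL catalog (pg_class exposes relrowsecurity and relforcerowsecurity). TableInfo and tables_missing_rls are hypothetical names, and this complements rather than replaces tests that query as one tenant and look for another tenant's rows.

```ruby
# One row per table, as a CI job might load it from the database catalog.
TableInfo = Struct.new(:name, :has_tenant_id, :rls_enabled, :rls_forced, keyword_init: true)

# Names of tenant-scoped tables that are not fully protected by RLS.
def tables_missing_rls(tables)
  tables
    .select { |t| t.has_tenant_id && !(t.rls_enabled && t.rls_forced) }
    .map(&:name)
end

# Illustrative data; only "projects" is a table named in the article.
catalog = [
  TableInfo.new(name: "projects", has_tenant_id: true, rls_enabled: true, rls_forced: true),
  TableInfo.new(name: "attachments", has_tenant_id: true, rls_enabled: true, rls_forced: false),
  TableInfo.new(name: "schema_migrations", has_tenant_id: false, rls_enabled: false, rls_forced: false)
]

missing = tables_missing_rls(catalog)
puts "Tables missing forced RLS: #{missing.join(', ')}" unless missing.empty?
```

A CI job would exit non-zero when the list is not empty, so a new table with a tenant_id column cannot ship without its policy.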

Conclusion

Multi-tenant data isolation cannot depend on application code alone. PostgreSQL RLS provides database-enforced isolation that prevents cross-tenant access regardless of application bugs, forgotten WHERE clauses, or lost context in background jobs. The performance overhead is minimal (5-8% latency increase), and the operational overhead is manageable. For any SaaS handling sensitive customer data, RLS should be enabled from day one: in our case, the retrofit cost 3 engineer-months against an estimated 2 days at project start.
