
SaaS API Design Best Practices for High Scale Teams

Battle-tested best practices for SaaS API design tailored to high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 14 min read

When your SaaS platform handles millions of API requests per day, the difference between a well-designed API and a hastily built one becomes the difference between scaling smoothly and firefighting constantly. High-scale teams face unique challenges: thundering herds during peak traffic, cascade failures across microservices, and the ever-present tension between backward compatibility and forward progress.

This guide distills battle-tested API design practices specifically for teams operating at scale. Whether you're serving 10,000 or 10 million requests per minute, these patterns will help you build APIs that remain performant, maintainable, and developer-friendly.

Design for Backward Compatibility from Day One

At scale, breaking changes are extraordinarily expensive. You cannot coordinate simultaneous updates across thousands of API consumers. Every API endpoint must be designed with evolution in mind.

Versioning Strategy

Use URI-based versioning for major breaking changes combined with additive evolution for minor updates:

typescript
// v1 controller maps internally to v2 logic with response transformation
class UserControllerV1 {
  // Static so it can be passed directly as a route handler
  static async getUser(req: Request, res: Response) {
    const user = await userService.getUser(req.params.id);
    // Transform the v2 internal model to the v1 response shape
    res.json(transformToV1Response(user));
  }
}

// Router setup with explicit versioning
const router = new Router();

// v1 - original endpoint
router.get('/api/v1/users/:id', UserControllerV1.getUser);

// v2 - breaking change (different response shape)
router.get('/api/v2/users/:id', UserControllerV2.getUser);

// Both versions coexist, served from different controllers

Additive Change Policy

Adopt an additive-only change policy for minor versions. New fields can be added to responses, but existing fields must never be removed or have their types changed:

typescript
// Version 1.0 response
interface UserResponseV1 {
  id: string;
  name: string;
  email: string;
}

// Version 1.1 response - additive only
interface UserResponseV1_1 extends UserResponseV1 {
  avatar_url: string | null; // New field, nullable
  team_id: string | null;    // New field, nullable
  // 'name' and 'email' remain unchanged
}
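
This policy is easy to enforce automatically in CI. A minimal sketch (function and schema names are hypothetical) that models a response schema as a map of field name to type and flags removals or type changes:

```python
def check_additive_compatibility(old_schema: dict, new_schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the change is additive-only."""
    violations = []
    for field, old_type in old_schema.items():
        if field not in new_schema:
            violations.append(f"removed field: {field}")
        elif new_schema[field] != old_type:
            violations.append(
                f"type changed: {field} ({old_type} -> {new_schema[field]})"
            )
    return violations

v1 = {"id": "string", "name": "string", "email": "string"}
v1_1 = {**v1, "avatar_url": "string | null", "team_id": "string | null"}
breaking = {"id": "string", "name": "int"}  # removed 'email', retyped 'name'
```

Run this against the previous release's schema on every pull request; a non-empty violation list fails the build.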

Implement Robust Rate Limiting

At high scale, rate limiting isn't optional—it's infrastructure. Without it, a single misbehaving client can degrade the experience for every other tenant.

Token Bucket with Redis

The token bucket algorithm provides the best balance between burst tolerance and sustained rate enforcement:

python
import redis
import time
from dataclasses import dataclass

@dataclass
class RateLimitResult:
    allowed: bool
    remaining: int
    retry_after: float | None
    limit: int

class TokenBucketRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.script = self.redis.register_script("""
            local key = KEYS[1]
            local capacity = tonumber(ARGV[1])
            local refill_rate = tonumber(ARGV[2])
            local now = tonumber(ARGV[3])
            local requested = tonumber(ARGV[4])

            local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
            local tokens = tonumber(bucket[1])
            local last_refill = tonumber(bucket[2])

            if tokens == nil then
                tokens = capacity
                last_refill = now
            end

            local elapsed = now - last_refill
            local new_tokens = math.min(capacity, tokens + (elapsed * refill_rate))

            if new_tokens >= requested then
                new_tokens = new_tokens - requested
                redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
                redis.call('expire', key, math.ceil(capacity / refill_rate) + 1)
                return {1, math.floor(new_tokens), 0}
            else
                redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
                local retry_after = (requested - new_tokens) / refill_rate
                return {0, math.floor(new_tokens), math.ceil(retry_after * 1000)}
            end
        """)

    def check(
        self, key: str, capacity: int, refill_rate: float
    ) -> RateLimitResult:
        now = time.time()
        result = self.script(
            keys=[f"ratelimit:{key}"],
            args=[capacity, refill_rate, now, 1]
        )
        allowed, remaining, retry_after_ms = result
        return RateLimitResult(
            allowed=bool(allowed),
            remaining=int(remaining),
            retry_after=retry_after_ms / 1000 if retry_after_ms else None,
            limit=capacity,
        )

Per-Tenant and Per-Endpoint Limits

High-scale systems need tiered rate limits—global, per-tenant, and per-endpoint:

typescript
interface RateLimitTier {
  global: { rpm: number; burst: number };
  perEndpoint: Record<string, { rpm: number; burst: number }>;
}

const RATE_LIMIT_TIERS: Record<string, RateLimitTier> = {
  free: {
    global: { rpm: 60, burst: 10 },
    perEndpoint: {
      'POST /api/v1/documents': { rpm: 10, burst: 2 },
      'GET /api/v1/search': { rpm: 30, burst: 5 },
    },
  },
  enterprise: {
    global: { rpm: 10000, burst: 500 },
    perEndpoint: {
      'POST /api/v1/documents': { rpm: 1000, burst: 100 },
      'GET /api/v1/search': { rpm: 5000, burst: 200 },
    },
  },
};
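
Resolving the effective limit for a request is then a lookup with fallback: the endpoint-specific limit wins when configured, otherwise the tier's global limit applies. A Python sketch mirroring the tier table above (values are illustrative):

```python
RATE_LIMIT_TIERS = {
    "free": {
        "global": {"rpm": 60, "burst": 10},
        "per_endpoint": {
            "POST /api/v1/documents": {"rpm": 10, "burst": 2},
        },
    },
    "enterprise": {
        "global": {"rpm": 10000, "burst": 500},
        "per_endpoint": {},
    },
}

def resolve_limit(tier: str, endpoint: str) -> dict:
    config = RATE_LIMIT_TIERS[tier]
    # Most specific limit wins; fall back to the tier-wide default
    return config["per_endpoint"].get(endpoint, config["global"])
```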

Build Idempotent Endpoints

At scale, network failures and retries are routine. Every mutating endpoint must handle duplicate requests gracefully.

Idempotency Key Pattern

go
package middleware

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

// responseRecorder captures the status code and body so successful
// responses can be cached for idempotent replay.
type responseRecorder struct {
	http.ResponseWriter
	statusCode int
	body       bytes.Buffer
}

func (r *responseRecorder) WriteHeader(code int) {
	r.statusCode = code
	r.ResponseWriter.WriteHeader(code)
}

func (r *responseRecorder) Write(b []byte) (int, error) {
	if r.statusCode == 0 {
		r.statusCode = http.StatusOK
	}
	r.body.Write(b)
	return r.ResponseWriter.Write(b)
}

type IdempotencyMiddleware struct {
	redis *redis.Client
	ttl   time.Duration
}

func NewIdempotencyMiddleware(rdb *redis.Client) *IdempotencyMiddleware {
	return &IdempotencyMiddleware{redis: rdb, ttl: 24 * time.Hour}
}

func (m *IdempotencyMiddleware) Handle(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("Idempotency-Key")
		if key == "" {
			next.ServeHTTP(w, r)
			return
		}

		cacheKey := buildCacheKey(r, key)

		// Check for existing result
		cached, err := m.redis.Get(context.Background(), cacheKey).Bytes()
		if err == nil {
			w.Header().Set("X-Idempotent-Replayed", "true")
			w.Write(cached)
			return
		}

		// Acquire lock to prevent concurrent execution
		lockKey := cacheKey + ":lock"
		acquired, _ := m.redis.SetNX(
			context.Background(), lockKey, "1", 30*time.Second,
		).Result()

		if !acquired {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "Concurrent request in progress", http.StatusConflict)
			return
		}
		defer m.redis.Del(context.Background(), lockKey)

		rec := &responseRecorder{ResponseWriter: w}
		next.ServeHTTP(rec, r)

		// Cache successful responses
		if rec.statusCode >= 200 && rec.statusCode < 300 {
			m.redis.Set(
				context.Background(), cacheKey, rec.body.Bytes(), m.ttl,
			)
		}
	})
}

func buildCacheKey(r *http.Request, idempotencyKey string) string {
	h := sha256.New()
	h.Write([]byte(r.Method + r.URL.Path + idempotencyKey))
	return "idempotency:" + hex.EncodeToString(h.Sum(nil))
}

Implement Cursor-Based Pagination

Offset pagination breaks at scale. When your tables have millions of rows, OFFSET 100000 forces the database to scan and discard 100,000 rows. Cursor-based pagination maintains consistent performance regardless of page depth.

python
from dataclasses import dataclass
from base64 import b64encode, b64decode
import json

from sqlalchemy import and_, or_

@dataclass
class CursorPage:
    items: list
    next_cursor: str | None
    has_more: bool

class CursorPaginator:
    def __init__(self, db_session, default_limit: int = 50):
        self.db = db_session
        self.default_limit = default_limit
        self.max_limit = 200

    def paginate(
        self,
        query,
        model,  # the mapped class being paginated
        cursor: str | None = None,
        limit: int | None = None,
        order_by: str = "created_at",
    ) -> CursorPage:
        limit = min(limit or self.default_limit, self.max_limit)

        if cursor:
            decoded = json.loads(b64decode(cursor))
            cursor_value = decoded["v"]
            cursor_id = decoded["id"]
            # Tuple comparison: rows strictly after (cursor_value, cursor_id)
            query = query.filter(
                or_(
                    getattr(model, order_by) < cursor_value,
                    and_(
                        getattr(model, order_by) == cursor_value,
                        model.id < cursor_id,
                    ),
                )
            )

        items = query.order_by(
            getattr(model, order_by).desc(), model.id.desc()
        ).limit(limit + 1).all()

        has_more = len(items) > limit
        items = items[:limit]

        next_cursor = None
        if has_more and items:
            last = items[-1]
            next_cursor = b64encode(
                json.dumps({
                    "v": str(getattr(last, order_by)),
                    "id": str(last.id),
                }).encode()
            ).decode()

        return CursorPage(
            items=items,
            next_cursor=next_cursor,
            has_more=has_more,
        )
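
The cursor itself should stay opaque to clients — just a base64-encoded (sort value, tie-breaking id) pair. Centralizing encode/decode, as sketched below, makes it possible to later change the payload or add an HMAC without touching callers:

```python
import json
from base64 import urlsafe_b64encode, urlsafe_b64decode

def encode_cursor(value: str, row_id: str) -> str:
    """Pack the last row's sort value and id into an opaque token."""
    payload = json.dumps({"v": value, "id": row_id}).encode()
    return urlsafe_b64encode(payload).decode()

def decode_cursor(cursor: str) -> tuple[str, str]:
    """Inverse of encode_cursor; raises on malformed input."""
    decoded = json.loads(urlsafe_b64decode(cursor))
    return decoded["v"], decoded["id"]
```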

Standardize Error Responses with RFC 7807

Consistent error formats reduce debugging time for API consumers. RFC 7807 Problem Details (since updated by RFC 9457, which keeps the same format) provides a standard structure:

typescript
interface ProblemDetail {
  type: string;
  title: string;
  status: number;
  detail: string;
  instance: string;
  errors?: ValidationError[];
  trace_id?: string;
}

function createProblemDetail(
  status: number,
  title: string,
  detail: string,
  req: Request,
  extras?: Record<string, unknown>
): ProblemDetail {
  return {
    type: `https://api.example.com/errors/${title.toLowerCase().replace(/\s+/g, '-')}`,
    title,
    status,
    detail,
    instance: req.url,
    trace_id: req.headers['x-trace-id'] as string,
    ...extras,
  };
}

// Usage in error handler
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  if (err instanceof ValidationError) {
    return res.status(422).json(
      createProblemDetail(422, 'Validation Error', err.message, req, {
        errors: err.fieldErrors,
      })
    );
  }

  if (err instanceof NotFoundError) {
    return res.status(404).json(
      createProblemDetail(404, 'Not Found', err.message, req)
    );
  }

  // Unexpected errors - don't leak internals
  console.error(`Unhandled error [${req.headers['x-trace-id']}]:`, err);
  return res.status(500).json(
    createProblemDetail(
      500,
      'Internal Server Error',
      'An unexpected error occurred. Please try again later.',
      req
    )
  );
});


Design Webhook Delivery for Reliability

At high scale, webhooks must be treated as a separate delivery system with its own guarantees. Failed deliveries must be retried with exponential backoff, and consumers must handle duplicates.

Webhook Delivery Engine

python
import asyncio
import hashlib
import hmac
import time

import httpx

class WebhookDeliveryEngine:
    MAX_RETRIES = 8
    BASE_DELAY = 1  # seconds
    TIMEOUT = 30

    def __init__(self, db, signing_secret: str):
        self.db = db
        self.signing_secret = signing_secret

    def sign_payload(self, payload: bytes, timestamp: str) -> str:
        message = f"{timestamp}.{payload.decode()}"
        return hmac.new(
            self.signing_secret.encode(),
            message.encode(),
            hashlib.sha256,
        ).hexdigest()

    async def deliver(self, event: "WebhookEvent") -> bool:
        # WebhookEvent and the record_* persistence helpers are defined elsewhere
        timestamp = str(int(time.time()))
        signature = self.sign_payload(event.payload, timestamp)

        headers = {
            "Content-Type": "application/json",
            "X-Webhook-Signature": f"sha256={signature}",
            "X-Webhook-Timestamp": timestamp,
            "X-Webhook-ID": event.id,
        }

        for attempt in range(self.MAX_RETRIES):
            try:
                async with httpx.AsyncClient() as client:
                    response = await client.post(
                        event.target_url,
                        content=event.payload,
                        headers=headers,
                        timeout=self.TIMEOUT,
                    )
                if 200 <= response.status_code < 300:
                    await self.record_success(event, attempt)
                    return True

                if response.status_code < 500:
                    await self.record_failure(
                        event, attempt, f"HTTP {response.status_code}"
                    )
                    return False  # Client error, don't retry

            except (httpx.TimeoutException, httpx.ConnectError) as e:
                await self.record_attempt(event, attempt, str(e))

            delay = self.BASE_DELAY * (2 ** attempt)
            await asyncio.sleep(delay)

        await self.record_failure(event, self.MAX_RETRIES, "Max retries exceeded")
        return False

Implement Request Coalescing for Hot Paths

When thousands of requests hit the same resource simultaneously, request coalescing prevents redundant database queries:

go
package cache

import (
	"net/http"
	"sync"

	"github.com/go-chi/chi/v5"
)

type call struct {
	wg  sync.WaitGroup
	val interface{}
	err error
}

type SingleFlight struct {
	mu    sync.Mutex
	calls map[string]*call
}

func NewSingleFlight() *SingleFlight {
	return &SingleFlight{calls: make(map[string]*call)}
}

func (sf *SingleFlight) Do(
	key string,
	fn func() (interface{}, error),
) (interface{}, error) {
	sf.mu.Lock()
	if c, ok := sf.calls[key]; ok {
		sf.mu.Unlock()
		// Another goroutine is already fetching this key; wait for its result
		c.wg.Wait()
		return c.val, c.err
	}

	c := &call{}
	c.wg.Add(1)
	sf.calls[key] = c
	sf.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	sf.mu.Lock()
	delete(sf.calls, key)
	sf.mu.Unlock()

	return c.val, c.err
}

// Usage in an API handler. UserHandler, writeError, and writeJSON are
// application-level types and helpers assumed to exist elsewhere.
func (h *UserHandler) GetUser(w http.ResponseWriter, r *http.Request) {
	userID := chi.URLParam(r, "id")

	result, err := h.singleFlight.Do("user:"+userID, func() (interface{}, error) {
		return h.userRepo.FindByID(r.Context(), userID)
	})

	if err != nil {
		writeError(w, err)
		return
	}

	writeJSON(w, http.StatusOK, result)
}

API Design Anti-Patterns at Scale

Avoid these common mistakes that cause failures under high traffic:

Unbounded list endpoints. Every list endpoint must have a maximum page size enforced server-side, regardless of what the client requests. A single GET /users?limit=1000000 can bring down your database.

Synchronous heavy operations. Any operation taking more than 500ms should be asynchronous. Return a 202 Accepted with a status URL instead of blocking the HTTP connection.
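
The pattern can be sketched as a tiny in-memory job store: submission returns 202 immediately with a job id and a pollable status URL, and a separate endpoint reports progress. The URL shape and state names are illustrative; production would back this with a durable queue:

```python
import uuid

JOBS: dict[str, dict] = {}

def submit_job(payload: dict) -> tuple[int, dict]:
    """Accept the work and return 202 with a pollable status URL."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"state": "pending", "payload": payload, "result": None}
    return 202, {"job_id": job_id, "status_url": f"/api/v1/jobs/{job_id}"}

def job_status(job_id: str) -> tuple[int, dict]:
    """Status endpoint: clients poll this instead of holding a connection open."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error": "unknown job"}
    return 200, {"state": job["state"], "result": job["result"]}
```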

N+1 query patterns in APIs. Design your API resources to support field selection and nested includes to prevent clients from making dozens of sequential requests:

http
// Bad: forces N+1 from the client
GET /api/v1/orders
GET /api/v1/orders/1/items
GET /api/v1/orders/2/items

// Good: support includes and sparse field selection
GET /api/v1/orders?include=items,customer&fields=id,total,status

Missing circuit breakers. When your API calls downstream services, always wrap calls in circuit breakers. Without them, a single slow dependency can exhaust all your connection pools.
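
A serviceable circuit breaker fits in a few dozen lines: count consecutive failures, fail fast once a threshold is hit, and let a probe request through after a cooldown. A minimal sketch (the thresholds and injectable clock are illustrative choices):

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: half-open, let one probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Production implementations usually add a half-open success threshold and per-dependency metrics, but this captures the failure-isolation core.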

High-Scale API Checklist

Use this checklist before deploying any new API endpoint:

  • Endpoint has explicit rate limits per tier
  • All mutating endpoints accept idempotency keys
  • Pagination uses cursor-based approach
  • Response follows RFC 7807 for errors
  • API version is explicit in the URI
  • Request/response schemas are validated
  • Timeouts configured for all downstream calls
  • Circuit breakers wrap external service calls
  • Metrics emit latency, error rate, and throughput
  • Load test passes at 3x expected peak traffic
  • Webhook deliveries retry with exponential backoff
  • Long operations return 202 with status endpoint
  • Response headers include rate limit metadata
  • CORS and authentication validated at gateway level

Conclusion

Building APIs for high-scale SaaS demands a fundamentally different mindset than building for a few hundred users. Every design decision must account for concurrent access, partial failures, and the reality that you cannot coordinate upgrades across your entire consumer base simultaneously.

The practices outlined here—versioning, rate limiting, idempotency, cursor pagination, standardized errors, reliable webhooks, and request coalescing—form the foundation of APIs that scale gracefully. They are not premature optimization; they are the baseline for any team operating at meaningful scale.

Start by auditing your existing endpoints against the checklist. Prioritize idempotency and rate limiting first, as these prevent the most common scale-related incidents. Then systematically address pagination and error standardization. The investment pays dividends every time you avoid a 3 AM page caused by a missing rate limit or a broken client retry loop.



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
