
SaaS API Design Best Practices for High Scale Teams

Battle-tested best practices for SaaS API design tailored to high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 14 min read

When your SaaS platform handles millions of API requests per day, the difference between a well-designed API and a hastily built one becomes the difference between scaling smoothly and firefighting constantly. High-scale teams face unique challenges: thundering herds during peak traffic, cascade failures across microservices, and the ever-present tension between backward compatibility and forward progress.

This guide distills battle-tested API design practices specifically for teams operating at scale. Whether you're serving 10,000 or 10 million requests per minute, these patterns will help you build APIs that remain performant, maintainable, and developer-friendly.

Design for Backward Compatibility from Day One

At scale, breaking changes are extraordinarily expensive. You cannot coordinate simultaneous updates across thousands of API consumers. Every API endpoint must be designed with evolution in mind.

Versioning Strategy

Use URI-based versioning for major breaking changes combined with additive evolution for minor updates:

typescript
// v1 controller maps internally to v2 logic with response transformation
class UserControllerV1 {
  // Static so it can be passed directly as a route handler
  static async getUser(req: Request, res: Response) {
    const user = await userService.getUser(req.params.id);
    // Transform the v2 internal model to the v1 response shape
    res.json(transformToV1Response(user));
  }
}

// Router setup with explicit versioning
const router = new Router();

// v1 - original endpoint
router.get('/api/v1/users/:id', UserControllerV1.getUser);

// v2 - breaking change (different response shape)
router.get('/api/v2/users/:id', UserControllerV2.getUser);

// Both versions coexist, served from different controllers

Additive Change Policy

Adopt an additive-only change policy for minor versions. New fields can be added to responses, but existing fields must never be removed or have their types changed:

typescript
// Version 1.0 response
interface UserResponseV1 {
  id: string;
  name: string;
  email: string;
}

// Version 1.1 response - additive only
interface UserResponseV1_1 extends UserResponseV1 {
  avatar_url: string | null; // New field, nullable
  team_id: string | null;    // New field, nullable
  // 'name' and 'email' remain unchanged
}
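
This policy is easy to enforce automatically in CI. A minimal sketch (function and schema names are hypothetical) that models a response schema as a map of field name to type and flags removals or type changes:

```python
def check_additive_compatibility(old_schema: dict, new_schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the change is additive-only."""
    violations = []
    for field, old_type in old_schema.items():
        if field not in new_schema:
            violations.append(f"removed field: {field}")
        elif new_schema[field] != old_type:
            violations.append(
                f"type changed: {field} ({old_type} -> {new_schema[field]})"
            )
    return violations

v1 = {"id": "string", "name": "string", "email": "string"}
v1_1 = {**v1, "avatar_url": "string | null", "team_id": "string | null"}
breaking = {"id": "string", "name": "int"}  # removed 'email', retyped 'name'
```

Run this against the previous release's schema on every pull request; a non-empty violation list fails the build.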

Implement Robust Rate Limiting

At high scale, rate limiting isn't optional—it's infrastructure. Without it, a single misbehaving client can degrade the experience for every other tenant.

Token Bucket with Redis

The token bucket algorithm provides the best balance between burst tolerance and sustained rate enforcement:

python
import redis
import time
from dataclasses import dataclass

@dataclass
class RateLimitResult:
    allowed: bool
    remaining: int
    retry_after: float | None
    limit: int

class TokenBucketRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.script = self.redis.register_script("""
            local key = KEYS[1]
            local capacity = tonumber(ARGV[1])
            local refill_rate = tonumber(ARGV[2])
            local now = tonumber(ARGV[3])
            local requested = tonumber(ARGV[4])

            local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
            local tokens = tonumber(bucket[1])
            local last_refill = tonumber(bucket[2])

            if tokens == nil then
                tokens = capacity
                last_refill = now
            end

            local elapsed = now - last_refill
            local new_tokens = math.min(capacity, tokens + (elapsed * refill_rate))

            if new_tokens >= requested then
                new_tokens = new_tokens - requested
                redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
                redis.call('expire', key, math.ceil(capacity / refill_rate) + 1)
                return {1, math.floor(new_tokens), 0}
            else
                redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
                local retry_after = (requested - new_tokens) / refill_rate
                return {0, math.floor(new_tokens), math.ceil(retry_after * 1000)}
            end
        """)

    def check(
        self, key: str, capacity: int, refill_rate: float
    ) -> RateLimitResult:
        now = time.time()
        result = self.script(
            keys=[f"ratelimit:{key}"],
            args=[capacity, refill_rate, now, 1]
        )
        allowed, remaining, retry_after_ms = result
        return RateLimitResult(
            allowed=bool(allowed),
            remaining=int(remaining),
            retry_after=retry_after_ms / 1000 if retry_after_ms else None,
            limit=capacity,
        )

Per-Tenant and Per-Endpoint Limits

High-scale systems need tiered rate limits—global, per-tenant, and per-endpoint:

typescript
interface RateLimitTier {
  global: { rpm: number; burst: number };
  perEndpoint: Record<string, { rpm: number; burst: number }>;
}

const RATE_LIMIT_TIERS: Record<string, RateLimitTier> = {
  free: {
    global: { rpm: 60, burst: 10 },
    perEndpoint: {
      'POST /api/v1/documents': { rpm: 10, burst: 2 },
      'GET /api/v1/search': { rpm: 30, burst: 5 },
    },
  },
  enterprise: {
    global: { rpm: 10000, burst: 500 },
    perEndpoint: {
      'POST /api/v1/documents': { rpm: 1000, burst: 100 },
      'GET /api/v1/search': { rpm: 5000, burst: 200 },
    },
  },
};
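
Resolving the effective limit for a request is then a lookup with fallback: the endpoint-specific limit wins when configured, otherwise the tier's global limit applies. A Python sketch mirroring the tier table above (values are illustrative):

```python
RATE_LIMIT_TIERS = {
    "free": {
        "global": {"rpm": 60, "burst": 10},
        "per_endpoint": {
            "POST /api/v1/documents": {"rpm": 10, "burst": 2},
        },
    },
    "enterprise": {
        "global": {"rpm": 10000, "burst": 500},
        "per_endpoint": {},
    },
}

def resolve_limit(tier: str, endpoint: str) -> dict:
    config = RATE_LIMIT_TIERS[tier]
    # Most specific limit wins; fall back to the tier-wide default
    return config["per_endpoint"].get(endpoint, config["global"])
```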

Build Idempotent Endpoints

At scale, network failures and retries are routine. Every mutating endpoint must handle duplicate requests gracefully.

Idempotency Key Pattern

go
package middleware

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

// responseRecorder captures the status code and body so successful
// responses can be cached for idempotent replay.
type responseRecorder struct {
	http.ResponseWriter
	statusCode int
	body       bytes.Buffer
}

func (r *responseRecorder) WriteHeader(code int) {
	r.statusCode = code
	r.ResponseWriter.WriteHeader(code)
}

func (r *responseRecorder) Write(b []byte) (int, error) {
	if r.statusCode == 0 {
		r.statusCode = http.StatusOK
	}
	r.body.Write(b)
	return r.ResponseWriter.Write(b)
}

type IdempotencyMiddleware struct {
	redis *redis.Client
	ttl   time.Duration
}

func NewIdempotencyMiddleware(rdb *redis.Client) *IdempotencyMiddleware {
	return &IdempotencyMiddleware{redis: rdb, ttl: 24 * time.Hour}
}

func (m *IdempotencyMiddleware) Handle(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("Idempotency-Key")
		if key == "" {
			next.ServeHTTP(w, r)
			return
		}

		cacheKey := buildCacheKey(r, key)

		// Check for existing result
		cached, err := m.redis.Get(context.Background(), cacheKey).Bytes()
		if err == nil {
			w.Header().Set("X-Idempotent-Replayed", "true")
			w.Write(cached)
			return
		}

		// Acquire lock to prevent concurrent execution
		lockKey := cacheKey + ":lock"
		acquired, _ := m.redis.SetNX(
			context.Background(), lockKey, "1", 30*time.Second,
		).Result()

		if !acquired {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "Concurrent request in progress", http.StatusConflict)
			return
		}
		defer m.redis.Del(context.Background(), lockKey)

		rec := &responseRecorder{ResponseWriter: w}
		next.ServeHTTP(rec, r)

		// Cache successful responses
		if rec.statusCode >= 200 && rec.statusCode < 300 {
			m.redis.Set(
				context.Background(), cacheKey, rec.body.Bytes(), m.ttl,
			)
		}
	})
}

func buildCacheKey(r *http.Request, idempotencyKey string) string {
	h := sha256.New()
	h.Write([]byte(r.Method + r.URL.Path + idempotencyKey))
	return "idempotency:" + hex.EncodeToString(h.Sum(nil))
}

Implement Cursor-Based Pagination

Offset pagination breaks at scale. When your tables have millions of rows, OFFSET 100000 forces the database to scan and discard 100,000 rows. Cursor-based pagination maintains consistent performance regardless of page depth.

python
from dataclasses import dataclass
from base64 import b64encode, b64decode
import json

from sqlalchemy import and_, or_

@dataclass
class CursorPage:
    items: list
    next_cursor: str | None
    has_more: bool

class CursorPaginator:
    def __init__(self, db_session, default_limit: int = 50):
        self.db = db_session
        self.default_limit = default_limit
        self.max_limit = 200

    def paginate(
        self,
        query,
        model,  # the mapped class being paginated
        cursor: str | None = None,
        limit: int | None = None,
        order_by: str = "created_at",
    ) -> CursorPage:
        limit = min(limit or self.default_limit, self.max_limit)

        if cursor:
            decoded = json.loads(b64decode(cursor))
            cursor_value = decoded["v"]
            cursor_id = decoded["id"]
            # Tuple comparison: rows strictly after (cursor_value, cursor_id)
            query = query.filter(
                or_(
                    getattr(model, order_by) < cursor_value,
                    and_(
                        getattr(model, order_by) == cursor_value,
                        model.id < cursor_id,
                    ),
                )
            )

        items = query.order_by(
            getattr(model, order_by).desc(), model.id.desc()
        ).limit(limit + 1).all()

        has_more = len(items) > limit
        items = items[:limit]

        next_cursor = None
        if has_more and items:
            last = items[-1]
            next_cursor = b64encode(
                json.dumps({
                    "v": str(getattr(last, order_by)),
                    "id": str(last.id),
                }).encode()
            ).decode()

        return CursorPage(
            items=items,
            next_cursor=next_cursor,
            has_more=has_more,
        )
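
The cursor itself should stay opaque to clients — just a base64-encoded (sort value, tie-breaking id) pair. Centralizing encode/decode, as sketched below, makes it possible to later change the payload or add an HMAC without touching callers:

```python
import json
from base64 import urlsafe_b64encode, urlsafe_b64decode

def encode_cursor(value: str, row_id: str) -> str:
    """Pack the last row's sort value and id into an opaque token."""
    payload = json.dumps({"v": value, "id": row_id}).encode()
    return urlsafe_b64encode(payload).decode()

def decode_cursor(cursor: str) -> tuple[str, str]:
    """Inverse of encode_cursor; raises on malformed input."""
    decoded = json.loads(urlsafe_b64decode(cursor))
    return decoded["v"], decoded["id"]
```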

Standardize Error Responses with RFC 7807

Consistent error formats reduce debugging time for API consumers. RFC 7807 Problem Details (since updated by RFC 9457, which keeps the same format) provides a standard structure:

typescript
interface ProblemDetail {
  type: string;
  title: string;
  status: number;
  detail: string;
  instance: string;
  errors?: ValidationError[];
  trace_id?: string;
}

function createProblemDetail(
  status: number,
  title: string,
  detail: string,
  req: Request,
  extras?: Record<string, unknown>
): ProblemDetail {
  return {
    type: `https://api.example.com/errors/${title.toLowerCase().replace(/\s+/g, '-')}`,
    title,
    status,
    detail,
    instance: req.url,
    trace_id: req.headers['x-trace-id'] as string,
    ...extras,
  };
}

// Usage in error handler
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  if (err instanceof ValidationError) {
    return res.status(422).json(
      createProblemDetail(422, 'Validation Error', err.message, req, {
        errors: err.fieldErrors,
      })
    );
  }

  if (err instanceof NotFoundError) {
    return res.status(404).json(
      createProblemDetail(404, 'Not Found', err.message, req)
    );
  }

  // Unexpected errors - don't leak internals
  console.error(`Unhandled error [${req.headers['x-trace-id']}]:`, err);
  return res.status(500).json(
    createProblemDetail(
      500,
      'Internal Server Error',
      'An unexpected error occurred. Please try again later.',
      req
    )
  );
});


Design Webhook Delivery for Reliability

At high scale, webhooks must be treated as a separate delivery system with its own guarantees. Failed deliveries must be retried with exponential backoff, and consumers must handle duplicates.

Webhook Delivery Engine

python
import asyncio
import hashlib
import hmac
import time

import httpx

class WebhookDeliveryEngine:
    MAX_RETRIES = 8
    BASE_DELAY = 1  # seconds
    TIMEOUT = 30

    def __init__(self, db, signing_secret: str):
        self.db = db
        self.signing_secret = signing_secret

    def sign_payload(self, payload: bytes, timestamp: str) -> str:
        message = f"{timestamp}.{payload.decode()}"
        return hmac.new(
            self.signing_secret.encode(),
            message.encode(),
            hashlib.sha256,
        ).hexdigest()

    async def deliver(self, event: "WebhookEvent") -> bool:
        # WebhookEvent and the record_* persistence helpers are defined elsewhere
        timestamp = str(int(time.time()))
        signature = self.sign_payload(event.payload, timestamp)

        headers = {
            "Content-Type": "application/json",
            "X-Webhook-Signature": f"sha256={signature}",
            "X-Webhook-Timestamp": timestamp,
            "X-Webhook-ID": event.id,
        }

        for attempt in range(self.MAX_RETRIES):
            try:
                async with httpx.AsyncClient() as client:
                    response = await client.post(
                        event.target_url,
                        content=event.payload,
                        headers=headers,
                        timeout=self.TIMEOUT,
                    )
                if 200 <= response.status_code < 300:
                    await self.record_success(event, attempt)
                    return True

                if response.status_code < 500:
                    await self.record_failure(
                        event, attempt, f"HTTP {response.status_code}"
                    )
                    return False  # Client error, don't retry

            except (httpx.TimeoutException, httpx.ConnectError) as e:
                await self.record_attempt(event, attempt, str(e))

            delay = self.BASE_DELAY * (2 ** attempt)
            await asyncio.sleep(delay)

        await self.record_failure(event, self.MAX_RETRIES, "Max retries exceeded")
        return False

Implement Request Coalescing for Hot Paths

When thousands of requests hit the same resource simultaneously, request coalescing prevents redundant database queries:

go
package cache

import (
	"net/http"
	"sync"

	"github.com/go-chi/chi/v5"
)

type call struct {
	wg  sync.WaitGroup
	val interface{}
	err error
}

type SingleFlight struct {
	mu    sync.Mutex
	calls map[string]*call
}

func NewSingleFlight() *SingleFlight {
	return &SingleFlight{calls: make(map[string]*call)}
}

func (sf *SingleFlight) Do(
	key string,
	fn func() (interface{}, error),
) (interface{}, error) {
	sf.mu.Lock()
	if c, ok := sf.calls[key]; ok {
		sf.mu.Unlock()
		// Another goroutine is already fetching this key; wait for its result
		c.wg.Wait()
		return c.val, c.err
	}

	c := &call{}
	c.wg.Add(1)
	sf.calls[key] = c
	sf.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	sf.mu.Lock()
	delete(sf.calls, key)
	sf.mu.Unlock()

	return c.val, c.err
}

// Usage in an API handler. UserHandler, writeError, and writeJSON are
// application-level types and helpers assumed to exist elsewhere.
func (h *UserHandler) GetUser(w http.ResponseWriter, r *http.Request) {
	userID := chi.URLParam(r, "id")

	result, err := h.singleFlight.Do("user:"+userID, func() (interface{}, error) {
		return h.userRepo.FindByID(r.Context(), userID)
	})

	if err != nil {
		writeError(w, err)
		return
	}

	writeJSON(w, http.StatusOK, result)
}

API Design Anti-Patterns at Scale

Avoid these common mistakes that cause failures under high traffic:

Unbounded list endpoints. Every list endpoint must have a maximum page size enforced server-side, regardless of what the client requests. A single GET /users?limit=1000000 can bring down your database.

Synchronous heavy operations. Any operation taking more than 500ms should be asynchronous. Return a 202 Accepted with a status URL instead of blocking the HTTP connection.
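
The pattern can be sketched as a tiny in-memory job store: submission returns 202 immediately with a job id and a pollable status URL, and a separate endpoint reports progress. The URL shape and state names are illustrative; production would back this with a durable queue:

```python
import uuid

JOBS: dict[str, dict] = {}

def submit_job(payload: dict) -> tuple[int, dict]:
    """Accept the work and return 202 with a pollable status URL."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"state": "pending", "payload": payload, "result": None}
    return 202, {"job_id": job_id, "status_url": f"/api/v1/jobs/{job_id}"}

def job_status(job_id: str) -> tuple[int, dict]:
    """Status endpoint: clients poll this instead of holding a connection open."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error": "unknown job"}
    return 200, {"state": job["state"], "result": job["result"]}
```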

N+1 query patterns in APIs. Design your API resources to support field selection and nested includes to prevent clients from making dozens of sequential requests:

http
// Bad: forces N+1 from the client
GET /api/v1/orders
GET /api/v1/orders/1/items
GET /api/v1/orders/2/items

// Good: support includes and sparse field selection
GET /api/v1/orders?include=items,customer&fields=id,total,status

Missing circuit breakers. When your API calls downstream services, always wrap calls in circuit breakers. Without them, a single slow dependency can exhaust all your connection pools.
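
A serviceable circuit breaker fits in a few dozen lines: count consecutive failures, fail fast once a threshold is hit, and let a probe request through after a cooldown. A minimal sketch (the thresholds and injectable clock are illustrative choices):

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: half-open, let one probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Production implementations usually add a half-open success threshold and per-dependency metrics, but this captures the failure-isolation core.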

High-Scale API Checklist

Use this checklist before deploying any new API endpoint:

  • Endpoint has explicit rate limits per tier
  • All mutating endpoints accept idempotency keys
  • Pagination uses cursor-based approach
  • Response follows RFC 7807 for errors
  • API version is explicit in the URI
  • Request/response schemas are validated
  • Timeouts configured for all downstream calls
  • Circuit breakers wrap external service calls
  • Metrics emit latency, error rate, and throughput
  • Load test passes at 3x expected peak traffic
  • Webhook deliveries retry with exponential backoff
  • Long operations return 202 with status endpoint
  • Response headers include rate limit metadata
  • CORS and authentication validated at gateway level

Conclusion

Building APIs for high-scale SaaS demands a fundamentally different mindset than building for a few hundred users. Every design decision must account for concurrent access, partial failures, and the reality that you cannot coordinate upgrades across your entire consumer base simultaneously.

The practices outlined here—versioning, rate limiting, idempotency, cursor pagination, standardized errors, reliable webhooks, and request coalescing—form the foundation of APIs that scale gracefully. They are not premature optimization; they are the baseline for any team operating at meaningful scale.

Start by auditing your existing endpoints against the checklist. Prioritize idempotency and rate limiting first, as these prevent the most common scale-related incidents. Then systematically address pagination and error standardization. The investment pays dividends every time you avoid a 3 AM page caused by a missing rate limit or a broken client retry loop.



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
