
Scalable Backend Architecture: Patterns for Growing Without Breaking

Most scaling problems don't appear at launch; they appear when the product succeeds. The system that worked perfectly with 100 daily users starts creaking with 10,000. Architecture decisions that seemed pragmatic at first (unindexed queries, in-memory sessions, slow work done inline in the request-response path) become concrete bottlenecks. This guide documents architecture patterns that let backends scale sustainably.

Diagnose first: where the bottleneck actually is

Scaling without diagnosis is throwing money away. Before adding instances or redesigning the architecture, identify the real bottleneck. The four most common locations, in rough order of likelihood (the percentages are approximate):

  • Database (70% of cases): queries without indexes, N+1 queries, lock contention, exhausted connection pool. The database scales worse than the application and is the bottleneck in most systems.
  • Network I/O and third-party latency (15%): synchronous calls to external APIs in the main request-response path, without caching or circuit breaker.
  • Application code (10%): O(n²) loops, inefficient serialization, excessive in-memory object allocation at scale.
  • Infrastructure (5%): container CPU throttling, network limits, oversaturated nodes.
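
One way to act on the list above is to record how long each segment of a request takes and rank the segments by p95 latency. The sketch below is a minimal, in-memory illustration; `SegmentTimer` and the labels `"db"` and `"app"` are hypothetical names, and in a real system the durations would normally come from an existing tracing or APM tool.

```typescript
// Hypothetical helper: collects durations (in ms) per labelled request segment.
class SegmentTimer {
  samples: Map<string, number[]> = new Map();

  // Store one measured duration under a segment label such as "db" or "app".
  record(label: string, ms: number): void {
    const list = this.samples.get(label) ?? [];
    list.push(ms);
    this.samples.set(label, list);
  }

  // Nearest-rank percentile (p between 0 and 100) of the samples for one label.
  percentile(label: string, p: number): number {
    const sorted = (this.samples.get(label) ?? []).slice().sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.max(0, rank - 1)];
  }

  // Labels ordered by p95, slowest first: the first entry is the first suspect.
  slowest(): string[] {
    return Array.from(this.samples.keys()).sort(
      (a, b) => this.percentile(b, 95) - this.percentile(a, 95)
    );
  }
}
```

Ranking by p95 rather than by the average keeps occasional slow queries visible; an average would hide them behind many fast requests.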

Connection Pooling: the first problem nobody configures correctly

An application without a connection pool opens a new database connection on every request. At low load, the overhead is manageable. At high load, the TCP handshake, TLS negotiation, and PostgreSQL authentication needed to establish each connection can account for 30-40% of total request latency. With PgBouncer as a pooling proxy, the application reuses existing server connections instead of opening new ones.

ini
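; pgbouncer.ini: configuration for a production API
; PgBouncer only treats ; or # as a comment at the start of a line,
; so each comment sits on its own line above the setting it describes.
[databases]
production = host=postgres-primary port=5432 dbname=production

[pgbouncer]
; Pool per transaction (more efficient than session pooling). Session state
; such as SET or advisory locks does not carry over between transactions.
pool_mode = transaction
; Max application connections to the proxy
max_client_conn = 1000
; Actual connections to PostgreSQL per database/user pair
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600
; Disable in production (generates I/O)
log_connections = 0
log_disconnections = 0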
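
To make the reuse idea concrete, here is a minimal in-memory sketch of what a pool does. It is not how PgBouncer is implemented; `SimplePool` and `Conn` are hypothetical names, and a real pool would queue callers when exhausted rather than throw.

```typescript
// Hypothetical stand-in for a database connection. Opening a real one
// costs a TCP handshake, TLS negotiation, and authentication.
interface Conn {
  id: number;
}

class SimplePool {
  idle: Conn[] = [];
  created = 0;
  maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  // Reuse an idle connection when one exists; open a new one only below the cap.
  acquire(): Conn {
    const reused = this.idle.pop();
    if (reused) return reused;
    if (this.created >= this.maxSize) {
      throw new Error("pool exhausted"); // a real pool would make the caller wait
    }
    this.created++;
    return { id: this.created };
  }

  // Hand the connection back to the idle list instead of closing it.
  release(conn: Conn): void {
    this.idle.push(conn);
  }
}
```

With this sketch, 100 sequential acquire/release cycles open a single connection, which is the effect `default_pool_size` relies on: many client requests share a small number of server connections.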

Layered caching: the correct architecture

Caching isn't a single tool. It's a layered strategy in which each layer has different tradeoffs between cost, complexity, and hit rate.

typescript
// Layered caching: in-process → Redis → database
// `db`, `Product`, and `NotFoundError` are assumed to be defined elsewhere in the application.
import { LRUCache } from 'lru-cache';
import Redis from 'ioredis';

const localCache = new LRUCache<string, Product>({
  max: 1000,
  ttl: 30_000,    // 30 seconds local TTL
});
const redis = new Redis(process.env.REDIS_URL!);

async function getProduct(id: string): Promise<Product> {
  // Layer 1: in-memory local cache (sub-millisecond)
  const local = localCache.get(id);
  if (local) return local;

  // Layer 2: Redis (1-2ms)
  const cached = await redis.get(`product:${id}`);
  if (cached) {
    const product = JSON.parse(cached) as Product;
    localCache.set(id, product);
    return product;
  }

  // Layer 3: database
  const product = await db.product.findUnique({ where: { id } });
  if (!product) throw new NotFoundError(`Product ${id} not found`);

  await redis.setex(`product:${id}`, 300, JSON.stringify(product));
  localCache.set(id, product);
  return product;
}

The local in-process cache (LRU) is the fastest layer but risks serving stale data in multi-instance deployments, because each instance holds its own copy. For frequently changing data (prices, inventory), use only Redis with a short TTL. For infrequently changing data (catalog, account configuration), a local in-memory cache with a 30-60 second TTL removes the Redis round trip from the hot path.

CQRS: separating reads from writes to scale each side

Command Query Responsibility Segregation separates the data model for writes (Commands) from the model for reads (Queries). In many production systems, reads outnumber writes by 10:1 or more. With CQRS, you can scale the read model independently (read-only replicas, denormalized projections) without affecting the write model.
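
A minimal sketch of the CQRS split, assuming an order domain that does not appear in the article: every name here (`OrderLine`, `OrderSummary`, `addOrderLine`, `getOrderSummary`) is hypothetical, and plain in-memory collections stand in for the write database and the read store.

```typescript
// Write model: normalized order lines, shaped for validation and updates.
interface OrderLine {
  orderId: string;
  sku: string;
  quantity: number;
  unitPrice: number;
}

// Read model: one denormalized row per order, shaped for a list screen.
interface OrderSummary {
  orderId: string;
  itemCount: number;
  total: number;
}

const orderLines: OrderLine[] = []; // stands in for the primary database
const orderSummaries = new Map<string, OrderSummary>(); // stands in for a read store

// Projection: folds each write into the read model. It runs inline here;
// in production it often runs asynchronously, so reads may lag behind writes.
function project(line: OrderLine): void {
  const current =
    orderSummaries.get(line.orderId) ?? { orderId: line.orderId, itemCount: 0, total: 0 };
  orderSummaries.set(line.orderId, {
    orderId: line.orderId,
    itemCount: current.itemCount + line.quantity,
    total: current.total + line.quantity * line.unitPrice,
  });
}

// Command: validates and writes, then updates the projection.
function addOrderLine(line: OrderLine): void {
  if (line.quantity <= 0) throw new Error("quantity must be positive");
  orderLines.push(line);
  project(line);
}

// Query: returns the precomputed summary without touching the write model.
function getOrderSummary(orderId: string): OrderSummary | undefined {
  return orderSummaries.get(orderId);
}
```

Because the query never aggregates order lines at read time, the read store can be replicated or reshaped without changing how commands are validated and written.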

Database scaling and hot spots

  • Read replicas: scaling reads with read-only replicas is simple and resolves most database scaling problems without sharding complexity.
  • Table partitioning: partition large tables (logs, events, invoices) by date range. PostgreSQL native partitioning reduces query cost and simplifies historical data archiving.
  • Optimize queries before scaling infrastructure: a missing index can generate a full table scan taking 5 seconds on a 10-million-row table. With the correct index, the same query takes 2 milliseconds.
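
As a sketch of the read-replica pattern, the router below sends writes to the primary and rotates plain reads across replicas. `ReplicaRouter` and the connection names are hypothetical, and the SELECT check is deliberately naive: statements such as SELECT ... FOR UPDATE would need to go to the primary as well.

```typescript
// Hypothetical router: `primary` and `replicas` are just connection names here.
class ReplicaRouter {
  primary: string;
  replicas: string[];
  next = 0;

  constructor(primary: string, replicas: string[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  // Choose the target for a statement. Writes, and reads that must see the
  // caller's own latest write (requireFresh), go to the primary because
  // replicas can lag. Other reads rotate across replicas (round-robin).
  route(sql: string, requireFresh: boolean = false): string {
    const isRead = /^\s*select\b/i.test(sql);
    if (!isRead || requireFresh || this.replicas.length === 0) {
      return this.primary;
    }
    const target = this.replicas[this.next % this.replicas.length];
    this.next++;
    return target;
  }
}
```

The `requireFresh` flag is the important design choice: replication is usually asynchronous, so a read issued immediately after a write may not see that write on a replica.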

Rate limiting: protecting the backend from itself

typescript
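// Rate limiting with Redis using a fixed-window counter (one key per user per calendar minute).
// This is not a true sliding window: a client can send up to twice the limit
// in a short burst that straddles a minute boundary.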
async function checkRateLimit(
  userId: string,
  limitPerMinute: number
): Promise<{ allowed: boolean; remaining: number }> {
  const key = `rate_limit:${userId}:${Math.floor(Date.now() / 60000)}`;
  const current = await redis.incr(key);
  if (current === 1) await redis.expire(key, 60);
  return {
    allowed: current <= limitPerMinute,
    remaining: Math.max(0, limitPerMinute - current),
  };
}
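
The Redis snippet above buckets requests by calendar minute. For comparison, here is a minimal in-memory sketch of a sliding-window log, which counts requests in the 60 seconds ending at each call. `SlidingWindowLimiter` is a hypothetical name and the `now` parameter exists only to make the logic testable; with Redis, the same idea is commonly built on a sorted set of timestamps (ZADD, ZREMRANGEBYSCORE, ZCARD).

```typescript
// In-memory sliding-window log limiter (illustrative, single process only).
class SlidingWindowLimiter {
  hits: Map<string, number[]> = new Map();
  limit: number;
  windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  check(userId: string, now: number = Date.now()): { allowed: boolean; remaining: number } {
    // Keep only the timestamps that still fall inside the window ending at `now`.
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > now - this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now); // rejected requests do not consume quota
    this.hits.set(userId, recent);
    return { allowed, remaining: Math.max(0, this.limit - recent.length) };
  }
}
```

The tradeoff is memory: the log stores one timestamp per accepted request, whereas the fixed-window counter stores a single integer per user per minute.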

Frequently Asked Questions

When is it time to move from a single database instance to read replicas?
When database metrics show sustained CPU above 60%, or when read query response times start to affect the API's high latency percentiles (p95, p99). The clearest signal: analytical queries or reports are affecting transactional write latency on the same instance.
Redis or Memcached for enterprise caching?
Redis in practically all cases. Redis supports advanced data structures (sorted sets, hashes, streams), has optional persistence, supports Pub/Sub, and has native clustering. Memcached is marginally faster for pure cache use cases (key-value strings), but Redis's versatility justifies the difference. In practice, most systems that start with caching eventually need Redis's additional features.
What is the N+1 query problem and how do I avoid it?
The N+1 problem occurs when you load N entities and then make 1 additional query per entity to get related data. Example: loading 100 orders and then making 100 additional queries to get the customer name for each. The solution: eager loading (JOIN in the initial query), DataLoader (query batching in GraphQL), or a denormalized read index that includes the needed related data.
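
The difference can be sketched with in-memory data, where `queryCount` stands in for database round trips. All names (`findCustomer`, `findCustomers`, `namesNPlusOne`, `namesBatched`) are hypothetical, and the SQL in the comments only indicates what each function simulates.

```typescript
interface Order {
  id: number;
  customerId: number;
}
interface Customer {
  id: number;
  name: string;
}

const customers: Customer[] = [
  { id: 1, name: "Ada" },
  { id: 2, name: "Linus" },
];
const orders: Order[] = [
  { id: 10, customerId: 1 },
  { id: 11, customerId: 2 },
  { id: 12, customerId: 1 },
];

let queryCount = 0; // counts simulated round trips to the database

// Simulates: SELECT * FROM customers WHERE id = $1
function findCustomer(id: number): Customer | undefined {
  queryCount++;
  return customers.find((c) => c.id === id);
}

// Simulates: SELECT * FROM customers WHERE id = ANY($1)
function findCustomers(ids: number[]): Customer[] {
  queryCount++;
  return customers.filter((c) => ids.includes(c.id));
}

// N+1 shape: one customer lookup per order.
function namesNPlusOne(list: Order[]): string[] {
  return list.map((o) => findCustomer(o.customerId)?.name ?? "unknown");
}

// Batched shape: collect distinct ids, fetch once, join in memory.
function namesBatched(list: Order[]): string[] {
  const ids = Array.from(new Set(list.map((o) => o.customerId)));
  const byId = new Map<number, Customer>();
  for (const c of findCustomers(ids)) byId.set(c.id, c);
  return list.map((o) => byId.get(o.customerId)?.name ?? "unknown");
}
```

For three orders, the first version issues three customer lookups and the second issues one; the gap grows linearly with the number of orders, which is why the problem tends to surface only under production data volumes.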
How do I horizontally scale a stateful API (with sessions)?
The session is the state that prevents naïve horizontal scaling. The solution: move session state out of the process to a shared external store (Redis). With Redis sessions, any API instance can handle any request — the load balancer can distribute without sticky sessions. JWT with minimal state in the token (not a full session) is the alternative if you don't want to manage a session store.
When to implement CQRS and when is it over-engineering?
CQRS is over-engineering for most simple CRUD applications. It's valuable when: read and write models have fundamentally different shapes, reads scale 10x more than writes, or you need different consistency levels for reads and writes. The criterion: if the same data model serves reads and writes equally well, CQRS adds complexity without real benefit.

Engineering Team — IQS

Software, cloud, and DevOps engineers with enterprise project experience.