Diagnose first: where the bottleneck actually is
Scaling without diagnosis is throwing money away. Before adding instances or redesigning architecture, identify the real bottleneck. The four most common locations in order of probability:
- Database (70% of cases): queries without indexes, N+1 queries, lock contention, exhausted connection pool. The database scales worse than the application and is the bottleneck in most systems.
- Network I/O and third-party latency (15%): synchronous calls to external APIs in the main request-response path, without caching or a circuit breaker.
- Application code (10%): O(n²) loops, inefficient serialization, excessive in-memory object allocation at scale.
- Infrastructure (5%): container CPU throttling, network limits, oversaturated nodes.
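One lightweight way to see which of these locations dominates a given endpoint is to time each stage of a request. Here is a minimal sketch under stated assumptions: `RequestProfiler` and the stage names are hypothetical and not from any library. In production, a tracing or APM tool usually gives the same breakdown with less effort.

```typescript
import { performance } from 'node:perf_hooks';

// Hypothetical stage names mirroring the first three categories above.
// Infrastructure limits (e.g. CPU throttling) usually show up as every stage slowing down.
type Stage = 'database' | 'external' | 'app';

class RequestProfiler {
  private readonly totals = new Map<Stage, number>();

  // Runs `fn`, adds its wall-clock duration to `stage` (even if it throws), and returns its result.
  async measure<T>(stage: Stage, fn: () => Promise<T>): Promise<T> {
    const start = performance.now();
    try {
      return await fn();
    } finally {
      const elapsedMs = performance.now() - start;
      this.totals.set(stage, (this.totals.get(stage) ?? 0) + elapsedMs);
    }
  }

  // Each stage's total time and its share of all measured time, largest first.
  breakdown(): Array<{ stage: Stage; ms: number; share: number }> {
    const sum = [...this.totals.values()].reduce((a, b) => a + b, 0);
    return [...this.totals.entries()]
      .map(([stage, ms]) => ({ stage, ms, share: sum === 0 ? 0 : ms / sum }))
      .sort((a, b) => b.ms - a.ms);
  }
}
```

Wrap the database calls, each outbound HTTP call, and CPU-heavy code in `measure(...)`, then log `breakdown()` per request. The stage that dominates across many requests is where to start.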
Connection pooling: the first problem nobody configures correctly
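An application without a connection pool opens a new database connection on every request. At low load, the overhead is manageable. At high load, the time to establish each connection (TCP handshake, TLS negotiation, and PostgreSQL authentication) can account for 30-40% of total request latency. Because PostgreSQL starts a separate backend process per connection, thousands of short-lived connections also cost the server significant memory and CPU. With PgBouncer as a pooling proxy, the application reuses a small set of existing server connections.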
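PgBouncer does this at the proxy level, but the core mechanics match the in-process pools built into most database drivers. The sketch below is a hypothetical, minimal pool, not PgBouncer's implementation. It caps the number of real connections, reuses idle ones, and queues callers beyond the cap. This is roughly analogous to how up to `max_client_conn` clients share `default_pool_size` server connections.

```typescript
// Illustrative in-process pool; `Conn` and the `connect` factory are hypothetical stand-ins for a driver.
interface Conn {
  id: number;
}

class SimplePool {
  private readonly idle: Conn[] = [];
  private readonly waiters: Array<(conn: Conn) => void> = [];
  private created = 0;

  constructor(
    private readonly maxSize: number,
    private readonly connect: () => Promise<Conn>,
  ) {}

  async acquire(): Promise<Conn> {
    // 1. Reuse an idle connection: no TCP, TLS, or authentication cost.
    const idle = this.idle.pop();
    if (idle) return idle;
    // 2. Open a new connection only while below the cap.
    if (this.created < this.maxSize) {
      this.created++; // reserve the slot before awaiting so concurrent callers respect the cap
      try {
        return await this.connect();
      } catch (err) {
        this.created--; // free the slot if the connection attempt failed
        throw err;
      }
    }
    // 3. At the cap: wait until another caller releases a connection.
    return new Promise<Conn>((resolve) => this.waiters.push(resolve));
  }

  release(conn: Conn): void {
    const next = this.waiters.shift();
    if (next) next(conn); // hand the connection straight to the oldest waiter
    else this.idle.push(conn);
  }

  get totalCreated(): number {
    return this.created;
  }
}
```

A real pool also needs acquire timeouts, connection health checks, and more careful error recovery. These are omitted here to keep the reuse-and-queue idea visible.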
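; pgbouncer.ini: configuration for a production API
[databases]
production = host=postgres-primary port=5432 dbname=production
[pgbouncer]
; example listener and auth settings (adjust the address and file path to your environment)
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; transaction mode: a server connection returns to the pool after each transaction
; (more efficient than session mode, but session state such as SET or LISTEN is not preserved)
pool_mode = transaction
; maximum client (application) connections accepted by the proxy
max_client_conn = 1000
; actual server connections to PostgreSQL per database/user pair
default_pool_size = 25
; keep at least this many server connections open
min_pool_size = 5
; extra server connections allowed once a client has waited reserve_pool_timeout seconds
reserve_pool_size = 5
reserve_pool_timeout = 3
; close server connections idle for more than 600 seconds
server_idle_timeout = 600
; connection logging disabled in production to reduce log I/O
log_connections = 0
log_disconnections = 0
Layered caching: the correct architecture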
Caching isn't a single tool; it's a layered strategy in which each layer makes different tradeoffs between cost, complexity, and hit rate.
// Layered caching: in-process → Redis → database
import { LRUCache } from 'lru-cache';
import Redis from 'ioredis';
// Assumes Prisma: `db.product.findUnique` matches its client API, and `Product` is the generated model type
import { PrismaClient, type Product } from '@prisma/client';
class NotFoundError extends Error {}
const db = new PrismaClient();
const redis = new Redis(process.env.REDIS_URL!);
const localCache = new LRUCache<string, Product>({
  max: 1000,
  ttl: 30_000, // 30 seconds local TTL
});
async function getProduct(id: string): Promise<Product> {
  // Layer 1: in-memory local cache (sub-millisecond)
  const local = localCache.get(id);
  if (local) return local;
  // Layer 2: Redis (1-2ms)
  const cached = await redis.get(`product:${id}`);
  if (cached) {
    // JSON round-trips turn Date fields into strings; revive them if Product has any
    const product = JSON.parse(cached) as Product;
    localCache.set(id, product);
    return product;
  }
  // Layer 3: database
  const product = await db.product.findUnique({ where: { id } });
  if (!product) throw new NotFoundError(`Product ${id} not found`);
  // Populate both layers: Redis for 300 seconds, local cache for 30 seconds
  await redis.setex(`product:${id}`, 300, JSON.stringify(product));
  localCache.set(id, product);
  return product;
}
CQRS: separating reads from writes to scale each side
Command Query Responsibility Segregation separates the data model for writes (Commands) from the model for reads (Queries). In many production systems, reads outnumber writes at a 10:1 ratio or more. With CQRS, you can scale the read model independently (read-only replicas, denormalized projections) without affecting the write model.
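To make the split concrete, here is a minimal in-memory sketch. Every name is hypothetical, and events are delivered synchronously for simplicity. The write side validates commands and emits events, and the read side maintains a denormalized projection shaped for one query. In a real deployment, events usually travel through a queue or log, so the read model is eventually consistent with the write model.

```typescript
interface OrderPlaced {
  type: 'OrderPlaced';
  orderId: string;
  customerId: string;
  total: number;
}

interface CustomerSummary {
  orderCount: number;
  lifetimeValue: number;
}

// Write side: validates commands against normalized state, then emits events.
class OrderCommandHandler {
  private readonly orders = new Map<string, { customerId: string; total: number }>();

  constructor(private readonly publish: (event: OrderPlaced) => void) {}

  placeOrder(orderId: string, customerId: string, total: number): void {
    if (this.orders.has(orderId)) throw new Error(`Order ${orderId} already exists`);
    if (total <= 0) throw new Error('Order total must be positive');
    this.orders.set(orderId, { customerId, total });
    this.publish({ type: 'OrderPlaced', orderId, customerId, total });
  }
}

// Read side: a denormalized projection updated from events.
class CustomerSummaryProjection {
  private readonly summaries = new Map<string, CustomerSummary>();

  apply(event: OrderPlaced): void {
    const current = this.summaries.get(event.customerId) ?? { orderCount: 0, lifetimeValue: 0 };
    this.summaries.set(event.customerId, {
      orderCount: current.orderCount + 1,
      lifetimeValue: current.lifetimeValue + event.total,
    });
  }

  // Query: a single key lookup, with no joins or aggregation at read time.
  getSummary(customerId: string): CustomerSummary {
    return this.summaries.get(customerId) ?? { orderCount: 0, lifetimeValue: 0 };
  }
}

// Wiring: the command side publishes, the read side projects.
const projection = new CustomerSummaryProjection();
const commands = new OrderCommandHandler((event) => projection.apply(event));
commands.placeOrder('o-1', 'c-1', 40);
commands.placeOrder('o-2', 'c-1', 25);
console.log(projection.getSummary('c-1')); // { orderCount: 2, lifetimeValue: 65 }
```

Because the read model is a separate structure, it can live in a different store (a replica, a cache, a search index) and scale independently of the write path.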
Database scaling and hot spots
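- Read replicas: routing reads to read-only replicas is comparatively simple and resolves most read-heavy scaling problems without the complexity of sharding. The tradeoff is replication lag. A replica may briefly serve stale data, so reads that must see a just-committed write should go to the primary.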
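- Table partitioning: partition large tables (logs, events, invoices) by date range. With PostgreSQL native (declarative) partitioning, queries filtered on the partition key scan only the relevant partitions (partition pruning). Old data can be archived by detaching or dropping a whole partition instead of running large DELETEs.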
- Optimize queries before scaling infrastructure: a missing index can force a full table scan that takes 5 seconds on a 10-million-row table. With the correct index, the same query takes 2 milliseconds.
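Monthly date-range partitions are usually created ahead of time by a migration or a scheduled job. Below is a minimal sketch of a generator for that DDL. The function name, the `invoices` table, and the `created_at` column are hypothetical, and the parent table is assumed to already be declared with `PARTITION BY RANGE`.

```typescript
// Builds the DDL for one monthly range partition (PostgreSQL 10+ declarative partitioning).
// Assumes a parent table created with, for example:
//   CREATE TABLE invoices (id bigint, created_at timestamptz NOT NULL) PARTITION BY RANGE (created_at);
// Table names are interpolated into SQL, so only simple, trusted identifiers are accepted.
const IDENTIFIER = /^[a-z_][a-z0-9_]*$/;

function monthlyPartitionDDL(table: string, year: number, month: number): string {
  if (!IDENTIFIER.test(table)) throw new Error(`Unsafe table name: ${table}`);
  if (!Number.isInteger(year)) throw new Error(`Invalid year: ${year}`);
  if (!Number.isInteger(month) || month < 1 || month > 12) throw new Error(`Invalid month: ${month}`);
  const pad = (n: number) => String(n).padStart(2, '0');
  // FROM is inclusive and TO is exclusive: [first of month, first of next month)
  const from = `${year}-${pad(month)}-01`;
  const to = month === 12 ? `${year + 1}-01-01` : `${year}-${pad(month + 1)}-01`;
  return (
    `CREATE TABLE IF NOT EXISTS ${table}_${year}_${pad(month)} ` +
    `PARTITION OF ${table} FOR VALUES FROM ('${from}') TO ('${to}');`
  );
}

console.log(monthlyPartitionDDL('invoices', 2024, 12));
// CREATE TABLE IF NOT EXISTS invoices_2024_12 PARTITION OF invoices FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
```

Archiving then becomes `ALTER TABLE invoices DETACH PARTITION invoices_2023_01;` followed by a dump or `DROP TABLE`, rather than a long-running `DELETE` on the live table.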
Rate limiting: protecting the backend from itself
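The Redis function below counts requests in fixed one-minute buckets. This is simple and cheap, but it lets a client send up to twice the limit around a bucket boundary (a full burst at the end of one minute and another at the start of the next). A sliding window avoids this by counting only the requests in the last 60 seconds at any moment. Here is a minimal single-process sketch, assuming in-memory state and a hypothetical `SlidingWindowLimiter` class. A shared deployment would typically keep the same timestamps in a Redis sorted set (ZADD, ZREMRANGEBYSCORE, ZCARD).

```typescript
// Sliding-window-log limiter: keeps the timestamps of accepted requests per key.
// Single-process, in-memory only; keys are never evicted in this sketch.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();

  constructor(
    private readonly limit: number,
    private readonly windowMs: number,
  ) {}

  check(key: string, now: number = Date.now()): { allowed: boolean; remaining: number } {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have slid out of the window (now - windowMs, now].
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now); // only accepted requests count against the limit
    this.hits.set(key, recent);
    return { allowed, remaining: Math.max(0, this.limit - recent.length) };
  }
}
```

Storing one timestamp per accepted request costs memory proportional to the limit. This is the usual tradeoff against the single counter of the fixed-window approach.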
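// Rate limiting with Redis using a fixed-window counter (one key per user per minute)
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL!);
async function checkRateLimit(
  userId: string,
  limitPerMinute: number
): Promise<{ allowed: boolean; remaining: number }> {
  // Key for the current one-minute bucket
  const key = `rate_limit:${userId}:${Math.floor(Date.now() / 60000)}`;
  // INCR and EXPIRE in one MULTI so the key can never be left without a TTL
  const results = await redis.multi().incr(key).expire(key, 60).exec();
  const incr = results?.[0];
  if (!incr || incr[0]) throw incr?.[0] ?? new Error('Rate limit transaction failed');
  const current = Number(incr[1]);
  return {
    allowed: current <= limitPerMinute,
    remaining: Math.max(0, limitPerMinute - current),
  };
}
Frequently Asked Questions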
When is it time to move from a single database instance to read replicas?
Redis or Memcached for enterprise caching?
What is N+1 query and how do I avoid it?
How do I horizontally scale a stateful API (with sessions)?
When to implement CQRS and when is it over-engineering?
Is your backend starting to show scaling problems? We can conduct a technical diagnosis and identify real bottlenecks before proposing architecture changes.