
Enterprise Observability: Beyond Traditional Monitoring

Monitoring and observability are not synonyms. Monitoring tells you something is wrong; observability tells you why. In modern distributed systems, where a transaction might pass through 8 microservices before completing, the gap between CPU dashboards and real observability can decide whether an incident is resolved in 5 minutes or 5 hours. This guide documents how to implement real observability in enterprise production environments.

The three pillars: metrics, logs, and distributed traces

The three observability pillars are complementary, not redundant. Metrics tell you there's a problem (p99 latency rose to 2 seconds). Logs tell you what happened (500 error on the /checkout endpoint). Distributed traces tell you why (the inventory database query took 1.8 seconds because there's no index on the product_sku column).

Most companies have logs. Fewer have application metrics (beyond CPU/memory). Very few have correlated distributed traces. Real observability requires all three, correlated by trace ID.
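
To make "correlated by trace ID" concrete, here is a minimal sketch of a structured log line that carries the trace identifiers. The field names (ts, level, msg, trace_id, span_id) and the helper formatLogLine are assumptions chosen for illustration, not a required schema.

```typescript
// Hypothetical shape for illustration; the field names are assumptions.
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
}

// Serializes one log event as JSON and always attaches the trace identifiers,
// so a log backend can link the line to the trace it belongs to.
function formatLogLine(
  level: "info" | "warn" | "error",
  message: string,
  ctx: TraceContext,
  timestamp: Date,
): string {
  return JSON.stringify({
    ts: timestamp.toISOString(),
    level,
    msg: message,
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}

const exampleLine = formatLogLine(
  "error",
  "500 on /checkout",
  { traceId: "4bf92f3577b34da6a3ce929d0e0e4736", spanId: "00f067aa0ba902b7" },
  new Date("2024-01-01T00:00:00.000Z"),
);
```

With every log line shaped like this, a search for one trace_id returns the log lines of every service that handled the request.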

OpenTelemetry: the standard that unifies all three pillars

OpenTelemetry (OTel) is the open-source project that standardizes application instrumentation to emit metrics, logs, and traces in a vendor-neutral format. The key advantage: instrument the application once and send data to any backend (Grafana, Jaeger, Tempo, Datadog, New Relic) without changing the code.

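```typescript
// Automatic Node.js instrumentation with the OpenTelemetry SDK.
// Load this file before any application code so the auto-instrumentations
// can patch http, database, and framework modules as they are imported.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

const sdk = new NodeSDK({
  // Traces are pushed over OTLP/HTTP to the collector.
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // Metrics are exposed on :9090/metrics for Prometheus to scrape.
  metricReader: new PrometheusExporter({ port: 9090 }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans before the process exits.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```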

Prometheus: production-grade metrics architecture

Prometheus uses a pull model: it scrapes each target's /metrics endpoint at a configured interval. For enterprise production, a single Prometheus server is not enough on its own; the correct stack is Prometheus + Thanos or Prometheus + VictoriaMetrics for long-term retention and high availability.
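
What a scrape actually reads is plain text. The sketch below renders a counter in the Prometheus text exposition format to show what a /metrics response body looks like. The helper names (renderCounter, escapeLabelValue) and the sample values are made up for illustration; in a real service a client library or the OpenTelemetry exporter produces this output.

```typescript
// One counter sample: a set of label values and the current count.
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Label values escape backslash, double quote, and newline.
function escapeLabelValue(raw: string): string {
  return raw.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders a counter family in the text exposition format:
//   # TYPE <metric> counter
//   <metric>{label="value",...} <number>
function renderCounter(metric: string, samples: CounterSample[]): string {
  const lines = [`# TYPE ${metric} counter`];
  for (const sample of samples) {
    const pairs = Object.entries(sample.labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(",");
    const labelBlock = pairs.length > 0 ? `{${pairs}}` : "";
    lines.push(`${metric}${labelBlock} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}

// Illustrative values only.
const metricsBody = renderCounter("http_requests_total", [
  { labels: { method: "GET", status: "200" }, value: 1027 },
  { labels: { method: "POST", status: "500" }, value: 3 },
]);
```

Because the counter only ever increases, Prometheus derives request rates from the difference between successive scrapes, which is what rate() does in alert expressions.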

```yaml
# PrometheusRule — SLO-based alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slos
  namespace: monitoring
spec:
  groups:
  - name: api.slos
    rules:
    # Error rate > 1% for 5 minutes
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m])) > 0.01
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "API error rate above SLO ({{ $value | humanizePercentage }})"

    # p99 latency > 500ms for 10 minutes
    - alert: HighP99Latency
      expr: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[10m])) by (le)
        ) > 0.5
      for: 10m
      labels:
        severity: warning
```

SLO-based alerts fire on symptoms that matter to users (error rate, latency), not on causes (high CPU, a restarting pod). This reduces alert noise: instead of 10 simultaneous infrastructure alerts during an incident, you get 1-2 SLO alerts confirming user impact.
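
The p99 in the HighP99Latency rule is an estimate derived from cumulative histogram buckets, not an exact measurement. The sketch below shows the underlying idea, linear interpolation inside the bucket that contains the target rank. It is a simplified illustration rather than Prometheus's exact implementation, and the bucket counts are invented.

```typescript
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  cumulative: number; // number of observations less than or equal to le
}

// Estimates quantile q (0..1) from cumulative buckets by interpolating
// linearly inside the first bucket whose count reaches the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.cumulative === 0) return NaN;
  const rank = q * last.cumulative;
  let prevLe = 0;
  let prevCount = 0;
  for (const bucket of sorted) {
    if (bucket.cumulative >= rank) {
      // The +Inf bucket has no upper bound, so report the highest finite one.
      if (bucket.le === Infinity) return prevLe;
      const inBucket = bucket.cumulative - prevCount;
      if (inBucket === 0) return prevLe;
      return prevLe + (bucket.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = bucket.le;
    prevCount = bucket.cumulative;
  }
  return prevLe;
}

// Invented data: 100 requests, most of them fast, a slow tail.
const latencyBuckets: Bucket[] = [
  { le: 0.1, cumulative: 50 },
  { le: 0.25, cumulative: 80 },
  { le: 0.5, cumulative: 95 },
  { le: 1, cumulative: 99 },
  { le: Infinity, cumulative: 100 },
];

const p99Seconds = estimateQuantile(0.99, latencyBuckets);
const wouldAlert = p99Seconds > 0.5; // same threshold as the rule above
```

The estimate can never be more precise than the bucket boundaries, so it helps to place a boundary at the SLO threshold itself (0.5 seconds here).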

Grafana Loki: logs without Elasticsearch costs

Loki is Grafana's log aggregation system, designed to integrate natively with Prometheus and Tempo. Unlike Elasticsearch, Loki doesn't index log content, only labels (namespace, pod, app, env). This can cut storage cost by as much as 10x, but every query must start with a label selector, and any filtering on log content is a scan over the selected streams at query time.
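
The "label filters upfront" requirement shows in the shape of a LogQL query: a stream selector in braces, optionally followed by a line filter. The small builder below is a hypothetical helper (buildLogQL is not part of any Loki client) that refuses to produce a query without labels. It assumes JSON-style string quoting is acceptable for the values involved.

```typescript
// Builds a LogQL query of the form {label="value", ...} |= "text".
// Throws when no label is given, mirroring the fact that Loki needs a
// stream selector before it can filter on log content.
function buildLogQL(labels: Record<string, string>, contains?: string): string {
  const matchers = Object.entries(labels).map(
    ([key, val]) => `${key}=${JSON.stringify(val)}`,
  );
  if (matchers.length === 0) {
    throw new Error("A LogQL query needs at least one label matcher");
  }
  const selector = `{${matchers.join(", ")}}`;
  return contains === undefined
    ? selector
    : `${selector} |= ${JSON.stringify(contains)}`;
}

// Label values here are examples, not names from a real cluster.
const checkoutErrors = buildLogQL(
  { namespace: "prod", app: "checkout" },
  "status=500",
);
```

The narrower the label selector, the fewer streams Loki has to scan for the line filter, which is where query cost is decided.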

Grafana Tempo: distributed traces without Jaeger costs

Tempo is Grafana's distributed tracing backend, designed for cheap storage: it needs only object storage (for example S3 or GCS), with no separate index database, and looks traces up by trace ID. The Loki integration is the most valuable feature: once the Loki data source in Grafana is configured to recognize the trace_id field, you can jump from a log line directly to the complete distributed trace of that request in a single click.
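
The jump from log to trace depends on being able to pull the trace ID out of the log line with a pattern. As a rough sketch of that step, the function below extracts a trace_id from either a JSON or a logfmt line. The regular expression and the name extractTraceId are illustrative assumptions; the pattern you configure in Grafana depends on your own log format.

```typescript
// Matches trace_id in JSON ("trace_id":"...") or logfmt (trace_id=...) lines.
// Assumes a 32-character lowercase hex trace ID.
const TRACE_ID_PATTERN = /"?trace_id"?\s*[=:]\s*"?([0-9a-f]{32})\b/;

// Returns the trace ID found in the line, or null when there is none.
function extractTraceId(logLine: string): string | null {
  return TRACE_ID_PATTERN.exec(logLine)?.[1] ?? null;
}

const fromJson = extractTraceId(
  '{"level":"error","msg":"500 on /checkout","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}',
);
const fromLogfmt = extractTraceId(
  'level=error trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="500 on /checkout"',
);
```

If services log the ID under different field names, the link silently breaks for some of them, so agreeing on one field name across teams is worth the effort.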

The full stack on Kubernetes with kube-prometheus-stack

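```bash
# Register the chart repository (skip if already added).
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Installs Prometheus, Alertmanager, and Grafana. Loki and Tempo ship in
# separate charts and are not installed by this command.
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.enabled=true \
  --set grafana.adminPassword="${GRAFANA_PASSWORD}" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```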

Frequently Asked Questions

When does it make sense to pay for Datadog or New Relic vs. the open-source stack?
Datadog or New Relic make sense when the team is too small to operate an observability stack (both are managed services), the company already has a contract with one of them, or you need APM, infrastructure, and log correlation with minimal configuration. The open-source stack (Prometheus + Grafana + Loki + Tempo) makes sense when the team can operate it, when the Datadog bill exceeds the operational cost of owning the stack, or when you need full control over retention and data.
How much does implementing full observability cost?
As a rough estimate, the open-source stack on Kubernetes (kube-prometheus-stack + Loki + Tempo) has infrastructure costs of $200-500/month for a medium production cluster. Datadog for the same cluster can cost $3,000-10,000/month depending on host count and ingested metrics. The cost difference is real, but the open-source stack requires engineering time to operate, and that time belongs in the comparison.
What is an SLO and how do I define one for my service?
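A Service Level Objective (SLO) is the reliability target the team commits to maintaining. It is written as a measurable statement, for example: '99.5% of requests to /api/checkout must complete with 2xx status in under 500ms, measured over a 30-day window.' SLOs are derived from user expectations and business SLAs. Practical rule: the internal SLO should be stricter than the SLA committed to customers, by 0.5-1 percentage points where the SLA leaves that much headroom (for example, a 99.5% internal SLO behind a 99% SLA), so the team is alerted before a contractual breach.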
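
An SLO target translates directly into an error budget, the amount of failure the target allows within the window. Here is a short worked sketch using the 99.5% over 30 days example; the function names and the traffic figures are assumptions for illustration.

```typescript
// Number of requests allowed to fail in the window without breaking the SLO.
function allowedFailures(sloTarget: number, totalRequests: number): number {
  return Math.round((1 - sloTarget) * totalRequests);
}

// Fraction of the error budget still unspent (1 = untouched, 0 = exhausted,
// negative = SLO already violated).
function budgetRemaining(
  sloTarget: number,
  totalRequests: number,
  failedRequests: number,
): number {
  const allowed = allowedFailures(sloTarget, totalRequests);
  return allowed === 0 ? 0 : (allowed - failedRequests) / allowed;
}

// The same budget expressed as minutes of full outage in the window.
function allowedDowntimeMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Assumed traffic: 1,000,000 requests in the window, 1,250 of them failed.
const failuresAllowed = allowedFailures(0.995, 1000000); // 0.5% of 1,000,000
const remaining = budgetRemaining(0.995, 1000000, 1250);
const downtimeMinutes = allowedDowntimeMinutes(0.995, 30); // 0.5% of 43,200 minutes
```

With these inputs the budget is 5,000 failed requests (or about 216 minutes of full outage), and 1,250 failures leave three quarters of it unspent.
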
How do I correlate a user error with a distributed trace?
Correlation requires two things: the frontend sends a trace ID in the request headers (or the API gateway generates one), and that trace ID is propagated to all downstream services via the W3C Trace Context traceparent header. With OpenTelemetry configured correctly, the trace_id appears in logs, in the complete trace, and, where exemplars are enabled, alongside error metrics. From a user error report, you can see exactly which internal calls their request made and which one failed.
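
For reference, a W3C traceparent header has four dash-separated fields: version, trace ID, parent span ID, and flags. The sketch below parses version 00 of the header and builds the header for an outgoing call. In practice the OpenTelemetry SDK does this for you; the function names and the span ID b7ad6b7169203331 are illustrative.

```typescript
interface TraceParent {
  traceId: string; // 32 hex characters, shared by every span in the request
  parentId: string; // 16 hex characters, the caller's span ID
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_V00 = /^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$/;

// Parses a version-00 traceparent header; returns null when it is malformed.
function parseTraceParent(header: string): TraceParent | null {
  const value = header.trim();
  if (!TRACEPARENT_V00.test(value)) return null;
  const traceId = value.slice(3, 35);
  const parentId = value.slice(36, 52);
  const flags = parseInt(value.slice(53, 55), 16);
  // All-zero IDs are defined as invalid by the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (flags & 1) === 1 };
}

function formatTraceParent(ctx: TraceParent): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

// For an outgoing call, keep the trace ID and put the current service's
// span ID in the parent position.
function childHeader(incoming: TraceParent, currentSpanId: string): string {
  return formatTraceParent({ ...incoming, parentId: currentSpanId });
}

const incomingContext = parseTraceParent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
```

The trace ID stays constant across every hop while the parent ID changes at each one, which is what lets the tracing backend rebuild the call tree.
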
What Kubernetes dashboards are most important to have from the start?
In priority order: (1) USE Method per node (Utilization, Saturation, Errors for CPU/memory/disk/network), (2) RED Method per service (Rate of requests, Errors, Duration/latency), (3) Pod resource usage vs. requests/limits to detect over/under-provisioning, (4) PVC usage to catch full disks before they become incidents.
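
To show what a RED dashboard panel computes, the sketch below derives rate, error ratio, and a p95 duration from raw request records. In production these values come from Prometheus queries over counters and histograms; the record shape, the function name summarizeRed, and the sample data are assumptions for illustration.

```typescript
interface RequestRecord {
  statusCode: number;
  durationMs: number;
}

interface RedSummary {
  ratePerSecond: number; // Rate
  errorRatio: number; // Errors, as a fraction of all requests
  p95DurationMs: number; // Duration
}

function summarizeRed(records: RequestRecord[], windowSeconds: number): RedSummary {
  if (records.length === 0) {
    return { ratePerSecond: 0, errorRatio: 0, p95DurationMs: 0 };
  }
  const errors = records.filter((r) => r.statusCode >= 500).length;
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest value with at least 95% of the
  // samples at or below it.
  const rank = Math.ceil(0.95 * durations.length);
  return {
    ratePerSecond: records.length / windowSeconds,
    errorRatio: errors / records.length,
    p95DurationMs: durations[rank - 1] ?? 0,
  };
}

// Invented sample: five requests observed over a 60-second window.
const redSummary = summarizeRed(
  [
    { statusCode: 200, durationMs: 100 },
    { statusCode: 200, durationMs: 200 },
    { statusCode: 500, durationMs: 1000 },
    { statusCode: 200, durationMs: 300 },
    { statusCode: 200, durationMs: 400 },
  ],
  60,
);
```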


Engineering Team — IQS

Software, cloud, and DevOps engineers with enterprise project experience.