The three pillars: metrics, logs, and distributed traces
The three observability pillars are complementary, not redundant. Metrics tell you there's a problem (p99 latency rose to 2 seconds). Logs tell you what happened (500 error on the /checkout endpoint). Distributed traces tell you why (the inventory database query took 1.8 seconds because there's no index on the product_sku column).
Most companies have logs. Fewer have application metrics (beyond CPU/memory). Very few have correlated distributed traces. Real observability requires all three, correlated by trace ID.
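To make "correlated by trace ID" concrete, here is a minimal sketch that reads the trace ID from a W3C `traceparent` header and stamps it onto a JSON log line. The helper names `parseTraceparent` and `logWithTrace` are hypothetical, and in an instrumented service the OpenTelemetry SDK would normally supply this context for you.

```typescript
// Sketch: attach the trace ID from a W3C `traceparent` header to a JSON log line.
// `parseTraceparent` and `logWithTrace` are hypothetical helper names.

interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// W3C Trace Context layout: version-traceid-parentid-flags, for example
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function logWithTrace(level: string, message: string, ctx: TraceContext | null): string {
  // Only add the correlation fields when a valid trace context exists.
  return JSON.stringify({
    level,
    message,
    ...(ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {}),
  });
}
```

With the same `trace_id` present in logs and traces, a backend can join the two signals for a single request.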
OpenTelemetry: the standard that unifies all three pillars
OpenTelemetry (OTel) is the open-source project that standardizes application instrumentation to emit metrics, logs, and traces in a vendor-neutral format. The key advantage: instrument the application once and send data to any backend (Grafana, Jaeger, Tempo, Datadog, New Relic) without changing the code.
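In practice, switching backends is often a matter of configuration, not code. The sketch below illustrates how an OTLP/HTTP trace endpoint is typically resolved from the standard `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` and `OTEL_EXPORTER_OTLP_ENDPOINT` environment variables. The function `resolveTraceEndpoint` is a hypothetical illustration of that precedence; real SDK exporters perform this resolution internally.

```typescript
// Sketch of OTLP/HTTP trace endpoint resolution from environment variables.
// `resolveTraceEndpoint` is a hypothetical helper, not an SDK function.

type Env = Record<string, string | undefined>;

function resolveTraceEndpoint(env: Env): string {
  // A signal-specific variable is used exactly as given.
  const specific = env['OTEL_EXPORTER_OTLP_TRACES_ENDPOINT'];
  if (specific) return specific;
  // The generic variable is a base URL; the traces path is appended to it.
  const base = env['OTEL_EXPORTER_OTLP_ENDPOINT'] || 'http://localhost:4318';
  return base.replace(/\/+$/, '') + '/v1/traces';
}
```

Calling `resolveTraceEndpoint(process.env)` with `OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318` would yield the same URL that is hard-coded in the SDK setup.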
// Automatic Node.js instrumentation with OpenTelemetry SDK
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  metricReader: new PrometheusExporter({ port: 9090 }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Prometheus: production-grade metrics architecture
Prometheus uses a pull model: it scrapes each target's /metrics endpoint at a configured interval. For enterprise production, a common choice is to pair Prometheus with Thanos or VictoriaMetrics for long-term retention and high availability.
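To show what a scrape actually returns, the following sketch renders a single counter in the Prometheus text exposition format. The helper `renderCounter` and the sample values are hypothetical; a real service would normally rely on a metrics library or the OpenTelemetry Prometheus exporter instead of formatting this by hand.

```typescript
// Sketch: render one counter in the Prometheus text exposition format.
// `renderCounter` and the sample data are hypothetical.

type Labels = Record<string, string>;

interface Sample {
  labels: Labels;
  value: number;
}

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}
```

Each scrape reads the current counter values; Prometheus then derives rates from the difference between successive scrapes, which is why functions such as `rate()` operate over a time window.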
# PrometheusRule: SLO-based alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slos
  namespace: monitoring
spec:
  groups:
    - name: api.slos
      rules:
        # Error rate > 1% for 5 minutes
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "API error rate above SLO ({{ $value | humanizePercentage }})"
        # p99 latency > 500ms for 10 minutes
        - alert: HighP99Latency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[10m])) by (le)
            ) > 0.5
          for: 10m
          labels:
            severity: warning

Grafana Loki: logs without Elasticsearch costs
Loki is Grafana's log system, designed to integrate natively with Prometheus and Tempo. Unlike Elasticsearch, Loki doesn't index log content, only labels (namespace, pod, app, env). This can reduce storage cost by roughly 10x, but it requires every query to start with label filters that narrow down which log streams are scanned.
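The "labels first" constraint shapes how queries are written. As a rough sketch, the hypothetical helper `buildLogQL` below assembles a LogQL query from a label selector followed by an optional line filter. The label names and values are illustrative only.

```typescript
// Sketch: build a LogQL query with a label selector first, then a line filter.
// `buildLogQL` is a hypothetical helper; label names mirror those in the text.

function quote(value: string): string {
  return '"' + value.replace(/\\/g, '\\\\').replace(/"/g, '\\"') + '"';
}

function buildLogQL(labels: Record<string, string>, contains?: string): string {
  const matchers = Object.entries(labels).map(([key, val]) => `${key}=${quote(val)}`);
  if (matchers.length === 0) {
    // Loki rejects queries that have no label matcher in the stream selector.
    throw new Error('at least one label matcher is required');
  }
  const selector = `{${matchers.join(', ')}}`;
  // `|=` keeps only the lines that contain the given text.
  return contains === undefined ? selector : `${selector} |= ${quote(contains)}`;
}
```

For example, `buildLogQL({ namespace: 'prod', app: 'checkout' }, 'trace_id')` produces `{namespace="prod", app="checkout"} |= "trace_id"`. The selector picks the streams using the index, and the line filter then scans only those streams.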
Grafana Tempo: distributed traces without Jaeger costs
Tempo is Grafana's distributed tracing backend, designed for cheap storage (object storage only: S3, GCS) with trace ID search. The Loki integration is the most valuable feature: from a log with a trace_id, you can jump directly to the complete distributed trace of that request in a single click.
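The log-to-trace jump boils down to extracting a trace ID from a log line and looking it up in Tempo. The sketch below assumes logs carry the ID as `trace_id=<32 hex chars>` or as a JSON `"trace_id"` field, and that `tempoBaseUrl` is a hypothetical address for Tempo's HTTP API. In Grafana itself this link is usually set up in the Loki data source configuration, not in application code.

```typescript
// Sketch: pull a trace ID out of a log line and build a Tempo lookup URL.
// Assumptions: the log format described above, and a hypothetical base URL.

const TRACE_ID = /trace_id["']?\s*[=:]\s*["']?([0-9a-f]{32})/i;

function extractTraceId(logLine: string): string | null {
  const match = TRACE_ID.exec(logLine);
  return match ? match[1].toLowerCase() : null;
}

function tempoTraceUrl(tempoBaseUrl: string, traceId: string): string {
  // Tempo serves a trace by ID under /api/traces/<traceID>.
  return `${tempoBaseUrl.replace(/\/+$/, '')}/api/traces/${traceId}`;
}
```

A log line without a recognizable ID returns `null`, which is a useful signal that the service is not yet propagating trace context into its logs.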
The full stack on Kubernetes with kube-prometheus-stack
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.enabled=true \
  --set grafana.adminPassword="${GRAFANA_PASSWORD}" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Frequently Asked Questions
When does it make sense to pay for Datadog or New Relic vs. the open-source stack?
How much does implementing full observability cost?
What is an SLO and how do I define one for my service?
How do I correlate a user error with a distributed trace?
What Kubernetes dashboards are most important to have from the start?