OpenTelemetry for Self-Hosters: Traces, Metrics, Logs Without the Datadog Bill

By SumGuy 11 min read

Datadog Wants $50K/Year. Your Raspberry Pi Doesn’t Care.

You’ve seen the Datadog pricing page. You’ve had the moment where you squint at the per-host fee, do some mental math, and quietly close the tab. For a home lab or a small self-hosted stack, paying enterprise APM prices is like hiring a private chef to make instant ramen. Technically achievable. Deeply unnecessary.

Here’s the thing: the open-source observability ecosystem has caught up. OpenTelemetry (OTel) is now the vendor-neutral standard for traces, metrics, and logs — and the tools around it (the Grafana LGTM stack, SigNoz, Coroot) give you the full picture without a monthly invoice that would make your CFO cry.

This post walks through the OTel mental model, gets a Collector running via Docker Compose, routes signals to Grafana Tempo, Loki, and Mimir, and shows you how to instrument a Python app with zero code changes. By the end you’ll have distributed tracing, log correlation, and dashboards — for the cost of electricity and mild configuration frustration.


The Mental Model: Signals, SDKs, and the Collector

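OpenTelemetry gives you three signal types:

Traces: the path of a single request through your services, broken into timed spans.
Metrics: numeric measurements aggregated over time (request rates, latencies, resource usage).
Logs: timestamped event records, which can carry trace and span IDs so they line up with traces.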

The OTel architecture has two moving parts:

SDK / Auto-instrumentation — the code-side piece. For Python, Node, Java, Go, etc., OTel provides language SDKs that produce telemetry data. The auto-instrumentation agents go further: they patch common libraries (Flask, Django, FastAPI, Express, HTTP clients, database drivers) automatically — you don’t modify your application code at all. You just run it with a wrapper.

OTel Collector — a standalone process that receives telemetry from your apps over OTLP (the OTel wire protocol), processes it (batching, filtering, enrichment), and exports to your backends. Think of it as a router that decouples your apps from your storage backends. Swap Tempo for Jaeger tomorrow? Change one line in the Collector config, not every app.

The Collector pipeline is: receivers → processors → exporters. Simple, powerful, occasionally maddening to debug the first time.
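
To make the receivers → processors → exporters flow concrete, here is a toy model in plain Python. It is not how the Collector is implemented; the record shape, the /healthz route, and the list-based "backends" are all made up for illustration.

```python
from typing import Callable, Iterable

Record = dict  # toy stand-in for one span, metric point, or log line
Processor = Callable[[list[Record]], list[Record]]
Exporter = Callable[[list[Record]], None]


def drop_health_checks(batch: list[Record]) -> list[Record]:
    """Processor: filter out noise (the /healthz route is hypothetical)."""
    return [r for r in batch if r.get("route") != "/healthz"]


def add_environment(batch: list[Record]) -> list[Record]:
    """Processor: enrich every record with a deployment.environment attribute."""
    return [{**r, "deployment.environment": "homelab"} for r in batch]


def run_pipeline(incoming: Iterable[Record],
                 processors: list[Processor],
                 exporters: list[Exporter]) -> list[Record]:
    batch = list(incoming)          # receiver: accept whatever the app sent
    for process in processors:      # processors run in the order listed
        batch = process(batch)
    for export in exporters:        # every exporter gets the same final batch
        export(batch)
    return batch


tempo_backend: list[Record] = []    # stand-ins for real storage backends
debug_output: list[Record] = []

run_pipeline(
    incoming=[{"route": "/slow"}, {"route": "/healthz"}, {"route": "/fail"}],
    processors=[drop_health_checks, add_environment],
    exporters=[tempo_backend.extend, debug_output.extend],
)
print(tempo_backend)
```

The point of the shape: apps only ever talk to the receiver, so swapping a backend means changing the exporters list and nothing else.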


The Stack We’re Building

App (auto-instrumented)
│ OTLP/gRPC :4317
OTel Collector
├── traces → Grafana Tempo :3200
├── metrics → Grafana Mimir :9009
└── logs → Grafana Loki :3100
Grafana UI :3000

Everything runs on a single Docker Compose stack. For a home lab this is plenty. For something larger, you can split the Collector and the backends across separate machines: same config, different targets.


The Collector Config

This is the most important file. Get this right and the rest is plumbing.

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  resource:
    attributes:
      - key: deployment.environment
        value: homelab
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: true
      job: true
      instance: true
      level: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    tls:
      insecure: true
  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheusremotewrite, debug]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki, debug]

A few things worth calling out:

The debug exporter in every pipeline is your best friend when things don’t show up where you expect. verbosity: basic prints a one-line summary per batch. Bump it to detailed when you’re troubleshooting a specific signal. Once everything’s working, remove debug from the exporters lists; note that normal is more verbose than basic, not less.

The memory_limiter processor goes first — always. If your Collector OOMs under a spike, you want it to start dropping data before the process dies. Dropping some traces is fine. A crashed Collector drops everything.

The resource processor stamps every metric with deployment.environment=homelab. This is optional but useful when you eventually have staging and prod pointing at the same backends.
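
Two of the easiest mistakes in this file are referencing a component that was never defined and putting memory_limiter somewhere other than first. A small check like the one below can catch both before you start the container. It is a sketch, not an official tool: check_pipelines is a hypothetical helper, and it assumes the YAML has already been loaded into a dict (for example with a YAML parser such as PyYAML). The dict here is a trimmed copy of the config above.

```python
def check_pipelines(config: dict) -> list[str]:
    """Return a list of human-readable problems found in the pipelines."""
    problems = []
    kinds = ("receivers", "processors", "exporters")
    defined = {kind: set(config.get(kind) or {}) for kind in kinds}
    pipelines = (config.get("service") or {}).get("pipelines") or {}

    for name, pipeline in pipelines.items():
        # Every component a pipeline references must exist at the top level.
        for kind in kinds:
            for component in pipeline.get(kind, []):
                if component not in defined[kind]:
                    problems.append(
                        f"{name}: {kind[:-1]} '{component}' is not defined")
        # memory_limiter only protects the process if it runs first.
        processors = pipeline.get("processors", [])
        if "memory_limiter" in processors and processors[0] != "memory_limiter":
            problems.append(
                f"{name}: memory_limiter should be the first processor")
    return problems


config = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}, "memory_limiter": {}, "resource": {}},
    "exporters": {"otlp/tempo": {}, "loki": {},
                  "prometheusremotewrite": {}, "debug": {}},
    "service": {"pipelines": {
        "traces": {"receivers": ["otlp"],
                   "processors": ["memory_limiter", "batch"],
                   "exporters": ["otlp/tempo", "debug"]},
        "metrics": {"receivers": ["otlp"],
                    "processors": ["memory_limiter", "batch", "resource"],
                    "exporters": ["prometheusremotewrite", "debug"]},
        "logs": {"receivers": ["otlp"],
                 "processors": ["memory_limiter", "batch"],
                 "exporters": ["loki", "debug"]},
    }},
}

print(check_pipelines(config))  # [] for the config in this post
```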


Docker Compose: The Full Stack

docker-compose.yml
version: "3.8"

networks:
  observability:
    driver: bridge

volumes:
  tempo-data:
  loki-data:
  mimir-data:
  grafana-data:

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.100.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    networks:
      - observability
    restart: unless-stopped

  tempo:
    image: grafana/tempo:2.4.1
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml:ro
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"
    networks:
      - observability
    restart: unless-stopped

  loki:
    image: grafana/loki:3.0.0
    command: ["-config.file=/etc/loki/local-config.yaml"]
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"
    networks:
      - observability
    restart: unless-stopped

  mimir:
    image: grafana/mimir:2.11.0
    command: ["--config.file=/etc/mimir.yaml"]
    volumes:
      - ./mimir-config.yaml:/etc/mimir.yaml:ro
      - mimir-data:/data
    ports:
      - "9009:9009"
    networks:
      - observability
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.2
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml:ro
    ports:
      - "3000:3000"
    networks:
      - observability
    depends_on:
      - tempo
      - loki
      - mimir
    restart: unless-stopped

You’ll notice anonymous auth is enabled on Grafana, with the Admin role and the login form disabled. For a home lab behind a firewall or Tailscale, that’s fine. For anything facing the internet, drop all three GF_AUTH_* variables and set up proper credentials.

The otel/opentelemetry-collector-contrib image is important — the -contrib variant includes the Loki exporter and a bunch of other community components. The base otel/opentelemetry-collector image is leaner but won’t have the Loki exporter. Use contrib.

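For the Tempo and Mimir configs (the *-config.yaml files mounted above), their respective GitHub repos have minimal examples that use filesystem storage. Copy those into the same directory as the Compose file and you’re running. Check that the Tempo config enables its OTLP gRPC receiver on port 4317 and that the Mimir config listens on 9009, since those are the ports the Collector and Grafana point at. Loki needs no extra file: the Compose file uses the local-config.yaml that ships inside the image.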

Grafana datasources provisioning file wires everything together automatically:

grafana-datasources.yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    isDefault: false
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        filterByTraceID: true
        filterBySpanID: false

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    isDefault: false

  - name: Mimir
    type: prometheus
    url: http://mimir:9009/prometheus
    isDefault: true

The tracesToLogsV2 block on the Tempo datasource is the good stuff. It lives in Grafana’s provisioning file, not in Tempo’s own config, and it tells Grafana: when you’re looking at a trace in Tempo, here’s how to jump to correlated logs in Loki. With filterByTraceID: true, Grafana filters Loki by the trace ID, so your trace IDs need to be in your log lines. The OTel Python instrumentation handles that once log correlation is switched on, which the next section does.


Instrumenting a Python App (Zero Code Changes)

Here’s a minimal Flask app. It does nothing interesting — but after instrumentation, you’ll see every HTTP request as a trace, with timing breakdowns and status codes.

app.py
from flask import Flask, jsonify
import time
import random

app = Flask(__name__)


@app.route("/")
def index():
    return jsonify({"status": "ok", "service": "demo-api"})


@app.route("/slow")
def slow():
    time.sleep(random.uniform(0.1, 0.8))
    return jsonify({"result": "eventually"})


@app.route("/fail")
def fail():
    if random.random() > 0.5:
        return jsonify({"error": "coin flip lost"}), 500
    return jsonify({"result": "lucky"})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

No OTel imports. No manual span creation. Now instrument it:

Terminal window
# Install the auto-instrumentation packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run with auto-instrumentation, pointing at the Collector
OTEL_SERVICE_NAME=demo-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
OTEL_TRACES_EXPORTER=otlp \
OTEL_METRICS_EXPORTER=otlp \
OTEL_LOGS_EXPORTER=otlp \
OTEL_PYTHON_LOG_CORRELATION=true \
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
opentelemetry-instrument python app.py

opentelemetry-bootstrap -a install detects installed packages and installs the matching OTel instrumentation libraries automatically. Flask gets opentelemetry-instrumentation-flask, requests gets opentelemetry-instrumentation-requests, and so on.

OTEL_PYTHON_LOG_CORRELATION=true injects trace_id and span_id into your log records. OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true attaches the OTel handler to Python’s root logger so those records are actually exported over OTLP; with OTEL_LOGS_EXPORTER=otlp alone, nothing reaches Loki. Together they enable the Tempo→Loki drill-down in Grafana.
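
If "injects trace_id and span_id into your log records" sounds abstract, the idea can be imitated with the standard logging module alone. This is a stand-in for illustration, not the OTel implementation: TraceContextFilter, build_logger, and the hard-coded IDs are all hypothetical, and the real instrumentation reads the IDs from the active span instead.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Hypothetical stand-in: copies the current trace/span ID onto each record."""

    def __init__(self, get_context):
        super().__init__()
        self._get_context = get_context  # callable returning (trace_id, span_id)

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = self._get_context()
        return True  # never drop the record, only annotate it


def build_logger(stream, get_context) -> logging.Logger:
    logger = logging.getLogger("demo-api")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)

    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
    handler.addFilter(TraceContextFilter(get_context))
    logger.addHandler(handler)
    return logger


# Pretend these IDs belong to the span that is currently active.
fake_context = lambda: ("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")

stream = io.StringIO()
log = build_logger(stream, fake_context)
log.info("handled /slow")
print(stream.getvalue(), end="")
# INFO trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 handled /slow
```

Once the trace ID is part of the log line, a search for that ID in Loki finds every log emitted while the trace was active, and that search is what the Grafana link performs.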

Hit the /slow and /fail endpoints a few times:

Terminal window
for i in $(seq 1 20); do
  curl -s http://localhost:8080/slow > /dev/null
  curl -s http://localhost:8080/fail > /dev/null
done

Open Grafana at http://localhost:3000, go to Explore → Tempo, and you’ll see your traces. Click one to see its spans. Notice the /slow requests taking anywhere from 100 ms to 800 ms, and roughly half of the /fail requests coming back as 500s. Click the logs icon to jump to correlated log lines in Loki. That’s distributed tracing working in your home lab.


What About Node.js?

Same idea, different wrapper. For a Node/Express app:

Terminal window
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc

# Run with the SDK auto-initialization
# (the protocol is set explicitly because port 4317 is the Collector's gRPC endpoint)
OTEL_SERVICE_NAME=demo-node \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
node --require @opentelemetry/auto-instrumentations-node/register app.js

The --require flag loads the auto-instrumentation before your application code starts. HTTP, Express, pg, redis, gRPC — all get patched automatically.


Semantic Conventions: Don’t Reinvent Attribute Names

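OTel defines standard attribute names (semantic conventions) for common concepts: http.method, http.status_code, db.system, service.name. These are standardized so dashboards and alert rules can be written generically across any service. Newer versions of the HTTP conventions rename the first two to http.request.method and http.response.status_code, so check which names your instrumentation version emits before building dashboards on them.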

The auto-instrumentation libraries follow these conventions. If you add manual instrumentation later (and you will, once you get hooked on traces), use the semantic convention constants from opentelemetry-semantic-conventions rather than making up attribute names. Your future self querying Tempo at 2 AM will appreciate it.


The One-Box Alternatives: SigNoz and Coroot

If the LGTM stack feels like assembling furniture from a bag of unmarked parts, two projects deserve your attention:

SigNoz — an open-source Datadog alternative with a single Docker Compose that includes its own ingestion pipeline, ClickHouse storage, and a polished UI. Traces, metrics, and logs in one interface with correlation built in. OTel-native from day one. If you want the experience of “open Datadog, see everything,” SigNoz is the fastest path there.

Terminal window
git clone https://github.com/SigNoz/signoz.git
cd signoz/deploy/
docker compose up -d

That’s it. Point your apps at http://your-host:4317 and you’re sending telemetry.

Coroot — takes a different angle. It uses eBPF agents to auto-discover your services and their dependencies with zero instrumentation required. Generates service maps, latency analysis, and RED metrics (Rate, Errors, Duration) automatically. The free tier is genuinely useful; the paid tier adds more. For a home lab, Coroot is the least friction path to “what is my system actually doing.”

The honest comparison: Grafana LGTM gives you the most flexibility and the largest ecosystem. SigNoz gives you a polished out-of-box experience with less config overhead. Coroot gets you value before you’ve instrumented anything. They’re not mutually exclusive — run Coroot for auto-discovery while you’re building out OTel instrumentation for explicit traces.


A Few Gotchas Worth Knowing Now

Collector version pinning matters. The -contrib image moves fast. Pin to a specific version in your Compose file (the example above uses 0.100.0) rather than latest. Breaking changes in exporter configs happen more often than you’d like.

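Tail-based sampling. Nothing in the config above samples: the batch processor only groups telemetry before export, so every trace is kept. If you turn on sampling in the SDK, it is head-based, meaning the keep-or-drop decision is made at the start of a trace, before anyone knows whether it will fail. For error-rate sampling (keep 100% of error traces, sample 10% of successful ones), you need the tail_sampling processor from the contrib image, which buffers a trace’s spans and decides once the trace is complete. If you ever run more than one Collector, all spans of a trace have to reach the same instance, typically via the loadbalancing exporter. Overkill for a home lab; relevant when your trace volume gets real.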
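
The keep-all-errors, sample-10%-of-the-rest policy is easier to reason about as code. The sketch below is a toy model of the decision, not the Collector’s implementation: keep_trace, the span dict shape, and the "status" field are assumptions made for illustration. The key property is that the decision needs the whole trace, which is exactly why it cannot be made at the head.

```python
import hashlib


def keep_trace(trace_id: str, spans: list[dict], success_rate: float = 0.10) -> bool:
    """Decide whether to keep a *completed* trace.

    Error traces are always kept. Successful traces are kept with probability
    `success_rate`, derived from a hash of the trace ID so the same trace
    always gets the same decision.
    """
    if any(span.get("status") == "error" for span in spans):
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")       # uniform in [0, 2**64)
    return bucket < int(success_rate * 2**64)


error_trace = [{"name": "GET /fail", "status": "error"}]
ok_trace = [{"name": "GET /slow", "status": "ok"}]

print(keep_trace("trace-1", error_trace))            # True: errors are always kept

kept = sum(keep_trace(f"trace-{i}", ok_trace) for i in range(10_000))
print(f"kept {kept} of 10000 successful traces")     # roughly 1000
```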

Cardinality. This bites Mimir/Prometheus users. If you label metrics with high-cardinality values (user IDs, request URLs with parameters), your storage grows fast and queries slow down. Use metricstransform or filter processors in the Collector to drop high-cardinality attributes before they hit Mimir.
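
A quick back-of-the-envelope calculation shows why this bites. Each distinct combination of label values is its own time series, so the worst case is the product of the per-label counts. The label names and counts below are made up for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


baseline = {"method": 4, "route": 3, "status_code": 5}
with_user_id = {**baseline, "user_id": 10_000}

print(series_count(baseline))      # 60
print(series_count(with_user_id))  # 600000
```

One extra label turned 60 series into 600,000 for a single metric, which is why dropping that attribute in the Collector is cheaper than paying for it in storage and query time.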

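Storage retention. Loki and Mimir keep data indefinitely until you configure retention; Tempo defaults to 14 days of block retention. The setting is named differently in each backend: limits_config.retention_period in Loki (with retention_enabled: true on the compactor), limits.compactor_blocks_retention_period in Mimir, and compactor.compaction.block_retention in Tempo. Set them early. 7 days for traces and logs and 30 days for metrics is a reasonable home lab default.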


The Decision Tree

Start here based on your situation:

The LGTM stack is the right choice if you’re already invested in Grafana and want dashboards that talk to each other. SigNoz is right if you want to stop configuring things and start watching traces. Coroot is right if you want value before you’ve instrumented anything.

Whatever you pick: don’t pay Datadog prices for a home lab. The tools exist. The Compose files are above. Your server is right there.

