Prometheus Is Not Always the Right Tool
You have Prometheus. It scrapes your exporters every 15 seconds, you have Grafana dashboards, and your alerting rules fire on sustained conditions. Life is good.
Then something weird happens: ten containers get OOM-killed within 60 seconds on the same host. Individually, each one looks like a blip. No single container breached a threshold long enough to trigger an alert. Prometheus never fired. You found out at 2 AM because your disk ran out of inodes, not because anything paged you.
This is not a Prometheus failure. It is a category mismatch. Prometheus is a pull-based, time-series database. Its mental model is: “give me the value of this metric at this point in time.” That model is excellent for CPU utilization, memory pressure, request rates. It is the wrong model for questions like “did ten bad things happen in the same 60-second window on the same host?” — because that is an event-stream problem, not a time-series problem.
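To make the mismatch concrete, here is a small Python sketch. It is not Prometheus code: the function name and the sample data are made up for illustration, and the rule is only a rough stand-in for an alert with a sustained-duration condition. It shows why a short spike followed by a restart never satisfies that kind of rule.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if `threshold` is exceeded for `min_consecutive` samples in a row.

    Rough stand-in for an alert rule with a sustained-duration condition;
    this is an illustration, not real PromQL semantics.
    """
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical memory usage (%) of one container, sampled every 15 s.
# It spikes once, gets OOM-killed, and restarts at low usage.
blip = [40, 42, 97, 5, 20, 35, 41, 40]

# A genuinely sustained problem, for comparison.
sustained = [40, 85, 91, 93, 95, 96, 97, 98]

print(sustained_breach(blip, threshold=90, min_consecutive=4))       # False
print(sustained_breach(sustained, threshold=90, min_consecutive=4))  # True
```

Ten containers each producing the first pattern would all stay under the radar, which is exactly the scenario described above.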
Riemann is the tool someone built to answer exactly that kind of question. It is also the tool most people have never heard of.
What Riemann Actually Is
Riemann was written by Kyle Kingsbury — yes, the Jepsen guy — starting around 2012. It is a JVM-based event stream processor with a configuration language written in Clojure. Events flow in via TCP, UDP, or WebSocket. You write streams in Clojure DSL that filter, aggregate, transform, and route those events to outputs: PagerDuty, Slack, InfluxDB, Graphite, email, or whatever you wire up.
The mental model is a pipeline:
[event sources] → [Riemann streams] → [outputs]
An “event” in Riemann is a map with fields: host, service, metric, state, time, ttl, tags, and arbitrary custom fields. Your app sends an event when something happens. Riemann receives it, runs it through your stream functions, and takes action.
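If you think better in code, an event can be pictured as a plain record. The Python dataclass below only illustrates the fields listed above and how a ttl relates to staleness. It is not the riemann-client API, and is_expired is a hypothetical helper added for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    """Illustrative model of the fields a Riemann event carries."""
    host: str
    service: str
    time: float                      # Unix timestamp, in seconds
    state: Optional[str] = None      # free-form, e.g. "ok", "warning", "critical"
    metric: Optional[float] = None
    ttl: Optional[float] = None      # seconds the event stays valid
    tags: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)  # arbitrary custom fields

    def is_expired(self, now: float) -> bool:
        """Hypothetical helper: an event with a ttl is stale once time + ttl has passed."""
        return self.ttl is not None and now > self.time + self.ttl


oom = Event(host="node-1", service="docker.container.oom", time=1000.0,
            state="warning", metric=1, ttl=120, tags=["docker", "oom"])

print(oom.is_expired(now=1060.0))  # False
print(oom.is_expired(now=1121.0))  # True
```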
That is the key difference: Riemann reacts to what you push, not to what it polls. If you push an event every time a container exits with OOM, Riemann can count ten of those within a 60-second rolling window and fire an alert. Prometheus cannot do that without stitching together Pushgateway, recording rules, and a lot of patience.
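The counting itself is simple once events are pushed. Here is a minimal Python sketch of a rolling 60-second counter. The class name and timestamps are invented for illustration, and it assumes events arrive in time order; it shows the idea, not how Riemann implements its windows.

```python
from collections import deque


class RollingCounter:
    """Count events whose timestamps fall within the last `window` seconds.

    Assumes timestamps are added in non-decreasing order.
    """

    def __init__(self, window):
        self.window = window
        self.times = deque()

    def add(self, timestamp):
        """Record one event and return how many events are in the window."""
        self.times.append(timestamp)
        # Evict timestamps that have aged out of the window.
        while self.times and self.times[0] <= timestamp - self.window:
            self.times.popleft()
        return len(self.times)


counter = RollingCounter(window=60)

# Ten hypothetical OOM exits, five seconds apart.
counts = [counter.add(t) for t in range(0, 50, 5)]
print(counts[-1])        # 10

# A straggler two minutes later: the earlier events have aged out.
print(counter.add(170))  # 1
```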
Running Riemann in 2026
Riemann’s Docker image is still maintained (community-driven at this point), and it runs fine on a home lab server. If you would rather build your own image from the release tarball:
```dockerfile
FROM debian:bookworm-slim

# bzip2 is needed by `tar -xjf` to unpack the release archive
RUN apt-get update && apt-get install -y \
    openjdk-17-jre-headless \
    curl \
    bzip2 \
    && rm -rf /var/lib/apt/lists/*

RUN curl -Lo /opt/riemann.tar.bz2 \
    https://github.com/riemann/riemann/releases/download/0.3.10/riemann-0.3.10.tar.bz2 \
    && tar -xjf /opt/riemann.tar.bz2 -C /opt \
    && ln -s /opt/riemann-0.3.10 /opt/riemann \
    && rm /opt/riemann.tar.bz2

COPY riemann.config /etc/riemann/riemann.config
EXPOSE 5555 5556 5557
CMD ["/opt/riemann/bin/riemann", "/etc/riemann/riemann.config"]
```
Or, if you prefer Compose:
```yaml
services:
  riemann:
    image: riemannio/riemann:0.3.10
    ports:
      - "5555:5555"       # TCP events
      - "5555:5555/udp"   # UDP events
      - "5556:5556"       # WebSocket
      - "5557:5557"       # REPL server, only if enabled in the config; keep it off untrusted networks
    volumes:
      - ./riemann.config:/etc/riemann/riemann.config:ro
    restart: unless-stopped
```
The Config DSL (and Why Clojure Is Both the Feature and the Bug)
Riemann’s config is a Clojure program. This is simultaneously its best and worst feature.
Best: you get a real programming language. Conditionals, functions, let bindings, map/filter, custom logic. PromQL is a query language bolted onto a time-series DB; Riemann config is code.
Worst: if you have never written a Clojure parenthesis in your life, the config will look like someone’s cat walked across the keyboard.
Here is a minimal config that accepts events and sends high-severity ones to Slack:
```clojure
; riemann.config: minimal working example

; clj-http and cheshire are dependencies of Riemann itself, so this sketch
; assumes they are already on the classpath.
(require '[clj-http.client :as http]
         '[cheshire.core :as json])

(logging/init {:file "/var/log/riemann/riemann.log"})

(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server {:host host}))

(periodically-expire 5)

; Slack output via a plain webhook POST.
; Riemann also ships its own slack stream if you prefer that.
(def slack-notify
  (fn [event]
    (let [msg (str "[" (:host event) "] "
                   (:service event) " - " (:state event)
                   ": " (:description event))]
      (http/post "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
                 {:body         (json/generate-string {:text msg})
                  :content-type :json}))))

; Main stream
(streams
  ; Drop anything with no host or service
  (where (and host service)
    ; Route critical events to Slack
    (where (= state "critical")
      slack-notify)

    ; Log every event that passes the filter (critical ones included)
    (fn [event] (info "event" event))))
```
Not beautiful, but it works. Now for the interesting part.
Real Example: OOM Storm Detection
This is the kind of alert Riemann was built for. You want to fire a single “OOM storm” alert when 10 or more container OOM events hit the same host within 60 seconds — not 10 separate “container died” pages.
First, your containers need to send events. You can do this from a script that watches Docker events:
```python
import json
import socket
import subprocess

# Requires the riemann-client library (install command below)
from riemann_client.client import Client
from riemann_client.transport import TCPTransport

RIEMANN_HOST = "riemann"
RIEMANN_PORT = 5555


def send_riemann_event(host, service, metric, state, description, tags=None):
    """Send one event to Riemann over TCP using riemann-client."""
    with Client(TCPTransport(RIEMANN_HOST, RIEMANN_PORT)) as client:
        client.event(
            host=host,
            service=service,
            metric_f=float(metric),  # riemann-client exposes the float metric as metric_f
            state=state,
            description=description,
            tags=tags or [],
            ttl=120,
        )


def watch_docker_events():
    """Stream Docker OOM events and forward each one to Riemann."""
    proc = subprocess.Popen(
        ["docker", "events", "--format", "{{json .}}", "--filter", "event=oom"],
        stdout=subprocess.PIPE,
        text=True,
    )
    hostname = socket.gethostname()

    for line in proc.stdout:
        try:
            ev = json.loads(line.strip())
            container = ev.get("Actor", {}).get("Attributes", {}).get("name", "unknown")
            send_riemann_event(
                host=hostname,
                service="docker.container.oom",
                metric=1,
                state="warning",
                description=f"Container OOM: {container}",
                tags=["docker", "oom", container],
            )
        except Exception as e:
            print(f"Error processing event: {e}")


if __name__ == "__main__":
    watch_docker_events()
```
Install the client:
```bash
pip install riemann-client
```
Now the Riemann side, the stream that detects the storm:
```clojure
(streams
  ; Only process OOM events
  (where (= service "docker.container.oom")

    ; Rolling window: count OOM events per host over 60 seconds
    (by [:host]
      (moving-time-window 60
        ; Turn each window into a synthetic "storm" event, or nil (which smap drops)
        (smap (fn [events]
                (let [oom-count (count events)]
                  (when (>= oom-count 10)
                    {:host        (:host (first events))
                     :service     "docker.oom.storm"
                     :metric      oom-count
                     :state       "critical"
                     :time        (:time (last events))
                     :description (str oom-count " container OOM kills in 60s")
                     :tags        ["storm" "oom" "docker"]})))
          ; Deduplicate: only fire once per storm, not once per event
          (throttle 1 300
            slack-notify
            (fn [storm-event] (info "OOM storm detected:" storm-event))))))))
```
The by [:host] partitions the stream per host, so a noisy VM does not mask a quieter one. moving-time-window 60 keeps a rolling 60-second buffer of events. smap turns each buffer into a single storm event once it holds ten or more entries, and passes nothing downstream otherwise. throttle 1 300 ensures you get at most one alert per 5 minutes per host. Your phone will thank you.
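If the Clojure is hard to read, the same logic can be modelled in plain Python. This is only a simulation of the three ideas (per-host partitioning, a rolling window, and a throttle). The class and its parameters are invented for illustration, and the throttle here is simplified to "time since the last alert", which may not match how Riemann buckets its throttle intervals.

```python
from collections import defaultdict, deque


class StormDetector:
    """Per-host rolling count with a per-host alert throttle (illustrative only)."""

    def __init__(self, window=60, threshold=10, quiet_period=300):
        self.window = window
        self.threshold = threshold
        self.quiet_period = quiet_period
        self.events = defaultdict(deque)   # host -> timestamps, one bucket per host
        self.last_alert = {}               # host -> time the last alert was sent

    def observe(self, host, timestamp):
        """Record an OOM event; return an alert string, or None if nothing fires."""
        times = self.events[host]
        times.append(timestamp)
        # Drop events that have aged out of the rolling window.
        while times and times[0] <= timestamp - self.window:
            times.popleft()

        if len(times) < self.threshold:
            return None
        last = self.last_alert.get(host)
        if last is not None and timestamp - last < self.quiet_period:
            return None                    # throttled: already alerted recently
        self.last_alert[host] = timestamp
        return f"{host}: {len(times)} container OOM kills in {self.window}s"


detector = StormDetector()
alerts = []
for t in range(12):                        # twelve OOMs on node-1, one per second
    alerts.append(detector.observe("node-1", t))
alerts.append(detector.observe("node-2", 5))   # a single OOM on another host

print([a for a in alerts if a])
# ['node-1: 10 container OOM kills in 60s']
```

Twelve events produce one alert, and the lone event on node-2 produces none, because each host has its own window and its own throttle state.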
This is genuinely hard to replicate in pure Prometheus. You could approximate it with:
- A Pushgateway receiving OOM events
- A recording rule summing them over 60s
- An alert rule firing when the sum >= 10
But you still have a cardinality problem (one series per container), stale series that Pushgateway never expires on its own, and alert deduplication that depends on Alertmanager grouping and repeat intervals rather than on the rule itself. Riemann does it in about a dozen lines of Clojure.
Wiring Outputs
Riemann has built-in outputs and a plugin ecosystem. Common ones:
InfluxDB (time-series storage for dashboards):
```clojure
(def influx
  (influxdb {:host     "influxdb"
             :port     8086
             :db       "riemann"
             :username "riemann"
             :password "secret"}))

(streams
  (where metric
    influx))
```
Graphite:
```clojure
(def graphite-out (graphite {:host "graphite" :port 2003}))
```
PagerDuty (the adapter returns a map of streams, so pick :trigger to open an incident):
```clojure
(def pd (pagerduty {:service-key "your-integration-key"}))

(streams
  (where (= state "critical")
    (:trigger pd)))
```
Email:
```clojure
; mailer returns a function that builds email streams; bind it to a new name
(def email (mailer {:host "smtp.example.com" :from "riemann@example.com"}))

(streams
  (where (= state "critical")
    (email "oncall@example.com")))
```
Honest Talk About the Ecosystem
Riemann peaked around 2015-2017. The Clojure ecosystem has not collapsed, but it has not grown the way Go or Rust tooling has. You need to be aware of a few things before you commit:
- riemann-dash (the built-in dashboard) is dated. You will want to send metrics to InfluxDB/Graphite and use Grafana instead.
- The plugin ecosystem has some unmaintained gems. Check GitHub last-commit dates before depending on any plugin.
- The Docker image is community-maintained, not from a commercial entity. The 0.3.10 release is from 2022.
- JVM cold start is 3-5 seconds. On a Pi 4 with 4 GB RAM this is fine; on a Pi Zero it is not.
- The Clojure barrier is real. If nobody on your team has touched a Lisp, budget an hour of confusion before anything makes sense.
Alternatives in the Same Niche
If Riemann’s vibe is not for you, here is where the same problem space lives in 2026:
| Tool | Approach | Clojure Required | Event Windows |
|---|---|---|---|
| Riemann | JVM stream processor | Yes | Native |
| Vector + VRL | Rust pipeline + transform language | No | Limited |
| Logstash + Watcher | ELK-native event routing | No | Good (complex config) |
| Prometheus + Pushgateway | Pull-based with push bridge | No | Approximate |
| OpenObserve | Modern stream + alerting | No | Good |
| Benthos / Redpanda Connect | Go-based stream processor | No | Good |
Vector (vector.dev) is the closest modern equivalent in spirit — events, transformations, routing, outputs. The VRL (Vector Remap Language) is more approachable than Clojure for most people and Rust performance is excellent. If you are starting fresh in 2026 and need event-stream processing without the JVM overhead, Vector is probably the move.
Riemann wins when you need complex stateful aggregation (rolling windows, per-host partitioning, deduplication) expressed in a real programming language. Vector’s alerting story is improving but not there yet for arbitrarily complex stream logic.
When NOT to Use Riemann
Save yourself the JVM overhead if:
- You have one host. Prometheus + Alertmanager is fine. Riemann is most useful when you are correlating events across multiple hosts.
- Your team is Clojure-skeptic. The config is not optional — you will be maintaining Clojure. If that word makes your team’s eyes glaze over, use Vector or just accept Prometheus’s limitations.
- You need long-term storage. Riemann is a processor, not a database. You still need InfluxDB/Prometheus/Graphite downstream.
- Your events are actually metrics. If you are thinking “I want to alert when CPU > 80% for 5 minutes,” that is a time-series problem. Prometheus wins.
The Bottom Line
Riemann is around 14 years old and looks it. The dashboard is dated, the ecosystem is quiet, and the Clojure config will earn you some side-eyes in a PR review. None of that makes it wrong.
For a specific class of problem (correlating discrete events in fast rolling windows, partitioned per host, with logic that PromQL cannot cleanly express), Riemann is still one of the most direct tools available. The OOM storm detection example above is about a dozen lines of config. The equivalent Prometheus setup is a multi-component Rube Goldberg machine with three potential failure points.
Worth knowing it exists. Worth spinning it up in a container for a weekend to see if it fits your specific pain. Not worth ripping out Prometheus for — they solve different problems and run perfectly well side by side.
Your 2 AM self will appreciate knowing there is more than one hammer in the box.