Warren's Distributed Monitoring
Principles & Practices...


Primary Principles

"Until you have reliable monitoring, you don’t have a service. What you have is a prolonged period of optimism."
    -- apologies to Laurie Voss.


  • Alerts must be actionable

  • Minimize Toil

    • Human intervention should (almost) never be needed; DevOps FTW!

  • Export all metrics

    • 'Tis better to expose and not need, than to need and not have.

    • Disk is cheap, humans are expensive, downtime doubly so.

  • "Events"/Anomalies/logs should only be exported if the event is an anomaly.

    • It's the node's job to know this.

  • Schemas are important.

  • Don't reinvent the wheel; use existing tools for well-trodden roads like TSDB, graphing, and alerting.

    • Your monitoring needs to be super-stable. Troubleshooting your monitoring tool is a fail.

  • Metrics must be at the appropriate level

    • "Number of HTTP Responses" is good, "Number of HTTP 200 Responses" is much better.

    • Anything that smells even slightly like PII is not a metric.

      • "Bob logged in" is not a metric!

        • Srsly, does this need to be mentioned?!
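
To make the "appropriate level" point concrete, here is a minimal sketch using the Python prometheus_client library; the metric name, label, and port are illustrative, not prescribed:

    # Sketch: count HTTP responses at a useful level of detail, with no PII.
    import time
    from prometheus_client import Counter, start_http_server

    # One counter, labeled by status code, so you can graph 5xx separately
    # from 200s while the total still aggregates cleanly. Recent client
    # versions expose this with the conventional _total suffix.
    HTTP_RESPONSES = Counter(
        "http_responses",
        "HTTP responses served, labeled by status code.",
        ["code"],
    )

    def record_response(status_code: int) -> None:
        # Note: no user name, URL parameters, or anything else PII-ish here.
        HTTP_RESPONSES.labels(code=str(status_code)).inc()

    if __name__ == "__main__":
        start_http_server(8000)   # the collector pulls /metrics from here
        while True:
            record_response(200)  # stand-in for real request handling
            time.sleep(1)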

Metrics vs Events

  • Metrics must be useful in the aggregate

  • Events are for human debugging, and must be useful as a single item

  • Collectors pull metrics

  • Nodes push events.


  • All exported events must be counted/metric'ed

  • Be really careful that PII doesn't sneak in

    • Sanitize all errors on the node before exporting (see the sketch after this list)

  • Events must have all the information needed to troubleshoot the cause.

    • Joins are not your friend

  • A big blob of text is not an event.

    • A raw blob can be included as well if the node is unable to parse the error
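
A hedged sketch of the event side described above: the node sanitizes before pushing, counts every exported event, and includes enough context to avoid joins. The sink URL, event fields, and regexes are illustrative assumptions, not a real API:

    # Sketch: nodes push sanitized, self-contained events and count every export.
    # EVENT_SINK_URL and the event fields are hypothetical.
    import json
    import re
    import time
    import urllib.request
    from prometheus_client import Counter

    EVENT_SINK_URL = "http://collector.example:9999/events"

    # Every exported event is also counted, so the aggregate stays a metric.
    EVENTS_EXPORTED = Counter(
        "events_exported",
        "Events pushed to the event sink, labeled by event type.",
        ["type"],
    )

    def sanitize(message: str) -> str:
        # Strip anything that smells like PII before the event leaves the node.
        message = re.sub(r"[\w.+-]+@[\w.-]+", "<email>", message)
        message = re.sub(r"\buser=\S+", "user=<redacted>", message)
        return message

    def push_event(event_type: str, message: str, **context) -> None:
        event = {
            "type": event_type,
            "message": sanitize(message),
            "time": time.time(),
            **context,  # everything needed to troubleshoot; no joins later
        }
        request = urllib.request.Request(
            EVENT_SINK_URL,
            data=json.dumps(event).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request, timeout=5)
        EVENTS_EXPORTED.labels(type=event_type).inc()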


  • Metric names are important, even more so than variable names (see the sketch after this list).

    • They must be self-describing and clear

    • They should follow best-practices and conventions.

      • If there isn't a convention, reuse the Prometheus one

      • If you violate the convention, encode this in the name

        • E.g. Time is in seconds. If you need picoseconds, the metric name must indicate this

  • A string is not a metric.

    • If you think you need one, you probably want an event, or a label on a metric, instead.

    • The primary exception to this is things like exporter version

  • Understand your metric types

    • E.g. for Prometheus: Counters, Gauges, Histograms, Summaries

    • Samples are generally float64 with millisecond-precision timestamps

  • Help text is really important, and should communicate the precision of the metric.
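
A hedged sketch of the naming, typing, and help-text points above, again with the Python prometheus_client library; every metric name and value here is illustrative:

    # Sketch: Prometheus-convention names, explicit metric types, useful help text.
    from prometheus_client import Counter, Gauge, Histogram, Info

    # Counter: monotonically increasing. Recent client versions expose it
    # with the conventional _total suffix.
    PROBE_FAILURES = Counter(
        "probe_failures",
        "Probes that failed for any reason.",
    )

    # Gauge: a value that can go up and down.
    QUEUE_DEPTH = Gauge(
        "probe_queue_depth",
        "Number of probes currently waiting to run.",
    )

    # Histogram: a distribution. Time is in seconds per convention; if you
    # genuinely needed picoseconds, the name would have to say so
    # (e.g. probe_duration_picoseconds).
    PROBE_DURATION = Histogram(
        "probe_duration_seconds",
        "Time taken by individual probes, in seconds.",
    )

    # Strings are not metrics. The conventional exception is an info-style
    # metric: a constant sample with the string carried in a label.
    EXPORTER_INFO = Info("exporter_build", "Build/version info for this exporter.")
    EXPORTER_INFO.info({"version": "1.2.3"})

Summaries follow the same declaration pattern; histograms are usually preferred because they aggregate cleanly across instances.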


Quis custodiet ipsos custodes?

This is important. If your probers are not really reliable, accurate, fast, and (correctly) detailed, they are literally worse than useless.

  • Nodes must be standalone.

    • You don't want to be troubleshooting your troubleshooting.

  • Nodes should be minimally complex.

  • Nodes must not rely on a "central source of truth".

    • They may have a config file for collector location, port number, frequency, etc., but this is only read at startup, not dynamically tunable.

      • It can be really tempting to violate this principle, but dynamic sources of truth, dynamic prober thresholds, etc. will bite you. These sorts of things belong in the collector/alerter. Trying to put them in the prober is a complexity failure.

      • Don't violate this, no matter how attractive the idea seems. It will bite you...

  • Nodes must be trivially deployable (collectors/reporting should be too)

  • Scalability is critical

    • Even in small systems, if this is not followed, the monitor perturbs the system

  • Timeliness is critical

    • Knowing that a problem happened 5 minutes ago is not useful; you need to be able to see what's happening now.

    • Nodes must expose a health check/proof of life

    • A counter of tests run is acceptable for this, as long as the tests run on the order of minutes (see the sketch after this list).

  • Exporters must expose internal state, like RAM, heap, etc. - Quis custodiet ipsos custodes?
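
A minimal sketch of a standalone prober with a proof of life, under the assumptions above; names, port, and interval are illustrative, and any config would be read once at startup:

    # Sketch: a self-contained prober that exposes its own proof of life.
    import time
    from prometheus_client import Counter, Gauge, start_http_server

    PROBES_RUN = Counter(
        "prober_probes_run",
        "Probes attempted since the prober started (proof of life).",
    )
    LAST_SUCCESS = Gauge(
        "prober_last_success_timestamp_seconds",
        "Unix time of the last successful probe.",
    )

    def probe_once() -> bool:
        # Placeholder for the actual check (TCP connect, HTTP GET, DNS lookup, ...).
        return True

    def main(port: int = 9100, interval_seconds: float = 60.0) -> None:
        # The collector pulls /metrics; on Linux the default registry also
        # exposes process_* metrics (RAM, CPU, FDs) for free, which answers
        # "quis custodiet" for the prober itself.
        start_http_server(port)
        while True:
            PROBES_RUN.inc()
            if probe_once():
                LAST_SUCCESS.set_to_current_time()
            time.sleep(interval_seconds)

    if __name__ == "__main__":
        main()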


⚠️ Don't be stingy...

This is a really common mistake. For all practical purposes, disk is free for TSDB use. As an example, Prometheus stores an average of only 1-2 bytes per sample. If you collect 1,000 samples per second, you can store roughly 15 years of metrics on a 1TB disk. If you earn just $100K/year, an hour of your time costs about as much as a 1TB disk. That means that if you waste more than about 4 minutes (60/15) per year because you don't have a metric handy, skimping on storage is a false economy. Obviously your loaded costs are higher, the cost of downtime far exceeds your salary, etc. - but that just makes the argument stronger.
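
Spelling out the arithmetic (assumptions as above: ~2 bytes per sample, 1,000 samples per second, 1TB of disk):

    # Back-of-the-envelope retention math for the numbers above.
    bytes_per_sample = 2                  # Prometheus averages ~1-2 bytes/sample
    samples_per_second = 1_000
    disk_bytes = 1_000_000_000_000        # 1 TB

    retention_seconds = disk_bytes / (bytes_per_sample * samples_per_second)
    retention_years = retention_seconds / (365.25 * 24 * 3600)
    print(f"~{retention_years:.1f} years of retention")  # ~15.8 years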

  • Don't be stingy with retention, number of metrics, etc.

  • You are monitoring for a bunch of reasons - uptime, performance, security, capacity planning, etc

    • These are clearly much more valuable than the cost of some disk/RAM/CPU/bits.

  • Metrics should be shared. Transparency good, silos bad.

    • The only exception to this is security metrics

  • Sharding/clustering the collector is important

    • You can look really dumb if you don't have metrics showing the outage because your link to the collector is implicated in the outage.


⚠️ Read the Site Reliability Engineering book - it's <$30 on Amazon and well worth the money. Srsly.

This is just a very high level overview of the ethos, along with my focus on the monitoring bits.

  • Minimize Toil

    • Logging into a machine is a fail

  • Thresholds must be visible and tunable.

    • They must progress from Inform -> Alert -> Page (see the sketch after this list)

    • Unless there is a serious incident, any time a threshold causes a page, someone messed up...
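
A toy sketch of the Inform -> Alert -> Page progression. In a real deployment this logic lives in the collector/alerter (e.g. Prometheus alerting rules); the thresholds here are made-up values:

    # Sketch: visible, tunable thresholds that escalate Inform -> Alert -> Page.
    from enum import Enum

    class Severity(Enum):
        OK = 0
        INFORM = 1   # shows up on a dashboard / daily report
        ALERT = 2    # files a ticket or notifies the team channel
        PAGE = 3     # wakes a human up

    # Tunable, visible thresholds on error ratio (illustrative values).
    THRESHOLDS = {
        Severity.INFORM: 0.01,
        Severity.ALERT: 0.05,
        Severity.PAGE: 0.25,
    }

    def classify(error_ratio: float) -> Severity:
        severity = Severity.OK
        for level in (Severity.INFORM, Severity.ALERT, Severity.PAGE):
            if error_ratio >= THRESHOLDS[level]:
                severity = level
        return severity

    assert classify(0.002) is Severity.OK
    assert classify(0.03) is Severity.ALERT
    assert classify(0.50) is Severity.PAGE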


  • An alert must be actionable and require action

  • Unless it is a sudden incident (e.g. backhoe fade), an alert must fire before a page.

    • Alerts can escalate over time


  • A page must be immediately actionable, and critical (if not, it isn't a page, it is an alert or inform)

    • It must be clear and self-contained.

    • Pretend you got the page at 3:00AM while in a cab after a 15-hour flight

      • Is the reason for the page clear?

      • Is all of the supporting info included?

      • Could you talk your computer-illiterate uncle Frank through resolving it?

      • If the answer to any of these is "No", then the page is inadequate.

    • Any page that turns out to be a false positive must be fixed so it no longer fires

      • This is not optional - pager fatigue is real

  • No more than 2 pages per day

    • A pager/alert budget must be maintained and monitored (see the sketch after this list)

    • Cascading failures bad!
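
One illustrative way to make the pager budget itself something you monitor; the names and the 24-hour window are assumptions:

    # Sketch: count pages so the pager budget is itself visible as a metric.
    import time
    from collections import deque
    from prometheus_client import Counter

    PAGES_SENT = Counter(
        "oncall_pages_sent",
        "Pages sent to a human, labeled by alert name.",
        ["alert"],
    )

    DAILY_PAGE_BUDGET = 2        # "no more than 2 pages per day"
    _recent_pages = deque()      # timestamps of recent pages

    def send_page(alert: str) -> None:
        # The actual paging call is out of scope; this just does the accounting.
        now = time.time()
        PAGES_SENT.labels(alert=alert).inc()
        _recent_pages.append(now)
        while _recent_pages and now - _recent_pages[0] > 86_400:
            _recent_pages.popleft()
        if len(_recent_pages) > DAILY_PAGE_BUDGET:
            # Over budget is itself a signal: the alerting needs fixing,
            # not just the system being monitored.
            print("Pager budget exceeded - review the pages that fired.")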


The glass / dashboard / UI / $buzzword has multiple purposes. Trying to use the same view for all purposes is a fail.

Marketing / Reporting

Das Blinkenlights

The marketing/reporting purpose only exists in some instances (and the terms are not really intended to be as pejorative as they sound 😄). These views provide a quick overview of the system status for other teams, etc. A good example of this sort of purpose/view is the "network overview" view at a conference; it shows the number of users, bandwidth, general status, etc. It should be easily digestible by someone seeing it for the first time, with only a few seconds of looking.


NOC / Dashboard

This should be a playlist of the most important dashboards, cycling through the views. It must clearly communicate the overall status at a glance, with clear markers for out of threshold values.