Warren's Distributed Monitoring
Principles & Practices...

 

Primary Principles

"Until you have reliable monitoring, you don’t have a service. What you have is a prolonged period of optimism."
    -- apologies to Laurie Voss.

 

  • Alerts must be actionable

  • Minimize Toil

    • Human intervention should (almost) never be needed; DevOps FTW!

  • Export all metrics

    • 'Tis better to expose and not need, than to need and not have.

    • Disk is cheap, humans are expensive, downtime doubly so.

  • "Events"/Anomalies/logs should only be exported if the event is an anomaly.

    • It's the node's job to know this.

  • Schemas are important.

  • Don't reinvent the wheel; use existing tools for well-trodden roads like TSDBs, graphing, and alerting.

    • Your monitoring needs to be super-stable. Troubleshooting your monitoring tool is a fail.

  • Metrics must be at the appropriate level

    • "Number of HTTP Responses" is good, "Number of HTTP 200 Responses" is much better.

    • Anything that smells even slightly like PII is not a metric.

      • "Bob logged in" is not a metric!

        • Srsly, does this need to be mentioned?!

Metrics vs Events

  • Metrics must be useful in the aggregate

  • Events are for human debugging, and must be useful as a single, standalone event

  • Collectors pull metrics

  • Nodes push events.
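
To make the pull side concrete, here is a minimal sketch of a node in Go using the Prometheus client library. The port and metric name are invented for the example; the push side (events) is sketched under Events below.

    // Pull model: the node only *exposes* /metrics; the collector scrapes it
    // on its own schedule. The node never pushes metrics anywhere.
    // Port and metric name are illustrative, not a convention.
    package main

    import (
      "log"
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promauto"
      "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // A counter the rest of the node increments as it does real work.
    var probeRuns = promauto.NewCounter(prometheus.CounterOpts{
      Name: "probe_runs_total",
      Help: "Probe runs completed since the node started.",
    })

    func main() {
      // ... probe loop calls probeRuns.Inc() ...
      http.Handle("/metrics", promhttp.Handler())
      log.Fatal(http.ListenAndServe(":9100", nil))
    }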

Events

  • All exported events must be counted/metric'ed

  • Be really careful that PII doesn't sneak in

    • Sanitize all errors on the node before exporting

  • Events must have all the information needed to troubleshoot the cause.

    • Joins are not your friend

  • A big blob of text is not an event.

    • If the error can't be parsed, the raw blob can be included as an extra field, but not as the whole event
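
A hedged sketch of what this can look like in Go; the field names, the sanitizer, and the collector URL are all invented for illustration.

    // Event export sketch: structured fields (not a text blob), sanitized on
    // the node before export, and counted as a metric. Field names, the
    // sanitize rule, and the endpoint are illustrative only.
    package events

    import (
      "bytes"
      "encoding/json"
      "net/http"
      "regexp"
      "time"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // All exported events are also counted, per the principle above.
    var eventsExported = promauto.NewCounterVec(prometheus.CounterOpts{
      Name: "probe_events_exported_total",
      Help: "Anomaly events exported by this node, by event type.",
    }, []string{"type"})

    // Event carries everything needed to troubleshoot, so no joins are needed.
    type Event struct {
      Time     time.Time `json:"time"`
      Type     string    `json:"type"`          // e.g. "dns_timeout"
      Target   string    `json:"target"`        // what was being probed
      Detail   string    `json:"detail"`        // parsed, sanitized detail
      RawError string    `json:"raw,omitempty"` // only if the error could not be parsed
    }

    var emailRe = regexp.MustCompile(`[^\s@]+@[^\s@]+`)

    // sanitize strips anything that smells like PII before it leaves the node.
    func sanitize(s string) string {
      return emailRe.ReplaceAllString(s, "<redacted>")
    }

    // Export counts the event and pushes it to the collector.
    func Export(collectorURL string, ev Event) error {
      ev.Detail = sanitize(ev.Detail)
      ev.RawError = sanitize(ev.RawError)
      eventsExported.WithLabelValues(ev.Type).Inc()

      body, err := json.Marshal(ev)
      if err != nil {
        return err
      }
      resp, err := http.Post(collectorURL, "application/json", bytes.NewReader(body))
      if err != nil {
        return err
      }
      return resp.Body.Close()
    }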

Metrics

  • Metric names are important, even more so than variable names.

    • They must be self-describing and clear

    • They should follow best-practices and conventions.

      • If there isn't a convention, reuse the Prometheus one

      • If you violate the convention, encode this in the name

        • E.g. time is in seconds by convention; if you need picoseconds, the metric name must indicate this (see the sketch after this list)

  • A string is not a metric.

    • If you think you have one, it is really either an event or a label

    • The primary exception to this is things like the exporter version

  • Understand your metric types

    • E.g. for Prometheus: Counters, Gauges, Histograms, Summaries

    • Samples are generally float64 with millisecond-precision timestamps

  • Help text is really important, and should communicate the precision of the metric.
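
A sketch of what these conventions look like with the Prometheus Go client; the metric names are invented for the example.

    // Naming/type sketch following Prometheus conventions: base units
    // (seconds), a _total suffix on counters, and help text that says what
    // is actually measured. Metric names here are illustrative.
    package metrics

    import (
      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promauto"
    )

    var (
      // Counter, labelled by status code, so "number of HTTP 200 responses"
      // is just a label selection on one metric.
      httpResponses = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "myapp_http_responses_total",
        Help: "HTTP responses served, by status code.",
      }, []string{"code"})

      // Gauge: a value that goes up and down.
      queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "myapp_work_queue_depth",
        Help: "Items currently waiting in the work queue.",
      })

      // Histogram: latency in seconds, the conventional base unit. If you
      // genuinely needed picoseconds, the name would have to say so, because
      // you would be violating the convention.
      requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "myapp_http_request_duration_seconds",
        Help:    "HTTP request latency in seconds.",
        Buckets: prometheus.DefBuckets,
      })
    )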

Nodes/Probers

Quis custodiet ipsos custodes?

This is important. If your probers are not really reliable, accurate, fast, and (correctly) detailed, they are literally worse than useless.

  • Nodes must be standalone.

    • You don't want to be troubleshooting your troubleshooting.

  • Nodes should be minimally complex.

  • Nodes must not rely on a "central source of truth".

    • They may have a config file for collector location, port number, frequency, etc., but this is only read at startup, not dynamically tunable.

      • It can be really tempting to violate this principle, but dynamic sources of truth, dynamic prober thresholds, etc. will bite you. These sorts of things belong in the collector/alerter. Trying to put them in the prober is a complexity failure.

      • Don't violate this, no matter how attractive the idea seems. It will bite you...

  • Nodes must be trivially deployable (collectors/reporting should be too)

  • Scalability is critical

    • Even in small systems, if this is not followed, the monitor perturbs the system

  • Timeliness is critical

    • Knowing that a problem happened 5 minutes ago is not useful; you need to be able to see what's happening now.

    • Nodes must expose a health check/proof of life

      • A counter of tests run is acceptable for this, as long as it ticks on the order of minutes, not hours.

  • Exporters must expose their own internal state (RAM, heap, etc.) - Quis custodiet ipsos custodes? (see the sketch below)
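
A hedged sketch of the proof-of-life counter and of exposing the exporter's own internals, using the stock Go runtime and process collectors from the Prometheus client; the port, metric name, and probe interval are invented.

    // Node self-monitoring sketch: a proof-of-life counter that ticks every
    // probe cycle, plus the stock Go runtime and process collectors
    // (heap, GC, RSS, open FDs, ...). Port, name, and interval are illustrative.
    package main

    import (
      "log"
      "net/http"
      "time"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/collectors"
      "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
      reg := prometheus.NewRegistry()

      // Quis custodiet ipsos custodes? Export the exporter's own state.
      reg.MustRegister(
        collectors.NewGoCollector(),
        collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
      )

      // Proof of life: a counter of tests run.
      testsRun := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "probe_tests_run_total",
        Help: "Probe cycles completed since the node started.",
      })
      reg.MustRegister(testsRun)

      go func() {
        for range time.Tick(1 * time.Minute) {
          // ... run the actual probes here ...
          testsRun.Inc()
        }
      }()

      http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
      log.Fatal(http.ListenAndServe(":9101", nil))
    }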

Collectors

⚠️ Don't be stingy...

This is a really common mistake. For all practical purposes, disk is free for TSDB use. As an example, Prometheus stores an average of only 1-2 bytes per sample, so if you collect 1,000 samples per second you can store roughly 15 years of metrics on a 1TB disk. If you earn just $100K/year, an hour of your time is worth roughly the price of that 1TB disk. That means each year of retention costs about 4 minutes (60 minutes / 15 years) of your time; if not having a metric handy wastes more than that per year, skimping on storage is a false economy. Obviously your loaded costs are higher, the cost of downtime far exceeds your salary, etc. - but that only makes the argument stronger.
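
The back-of-the-envelope math, as a sketch; the 2 bytes/sample, 1,000 samples/second, and disk-for-an-hour-of-salary figures are the rough assumptions from the paragraph above.

    // Retention arithmetic sketch; all inputs are rough assumptions.
    package main

    import "fmt"

    func main() {
      const (
        bytesPerSample   = 2.0  // Prometheus averages roughly 1-2 bytes/sample
        samplesPerSecond = 1000.0
        diskBytes        = 1e12 // 1 TB
        secondsPerYear   = 365.25 * 24 * 3600
      )

      bytesPerYear := bytesPerSample * samplesPerSecond * secondsPerYear
      yearsOnDisk := diskBytes / bytesPerYear

      fmt.Printf("~%.0f GB per year, ~%.1f years on a 1 TB disk\n",
        bytesPerYear/1e9, yearsOnDisk)

      // If the whole disk costs about one hour of salary, each year of
      // retention costs about this many minutes of salary:
      fmt.Printf("~%.1f minutes of salary per year of retention\n", 60/yearsOnDisk)
    }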

  • Don't be stingy with retention, number of metrics, etc.

  • You are monitoring for a bunch of reasons - uptime, performance, security, capacity planning, etc

    • These are clearly much more valuable than the cost of some disk/RAM/CPU/bits.

  • Metrics should be shared. Transparency good, silos bad.

    • The only exception to this is security metrics

  • Sharding/clustering the collector is important

    • You can look really dumb if you don't have metrics showing the outage because your link to the collector is implicated in the outage.

Operations

⚠️ Read the Site Reliability Engineering book - it's <$30 on Amazon and well worth the money. Srsly.

This is just a very high level overview of the ethos, along with my focus on the monitoring bits.

  • Minimize Toil

    • Logging into a machine is a fail

  • Thresholds must be visible and tunable.

    • They must progress from Inform -> Alert -> Page

    • Unless there is a serious incident, any time a threshold causes a page, someone messed up...
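
One way to keep the Inform -> Alert -> Page progression visible and tunable is to keep it as plain data in the collector/alerter rather than burying it in prober code; a sketch, with an invented structure and numbers.

    // Threshold sketch: severities progress Inform -> Alert -> Page, and the
    // levels are plain, inspectable data that lives in the collector/alerter
    // (never in the node). Structure and values are illustrative.
    package thresholds

    type Severity int

    const (
      OK Severity = iota
      Inform
      Alert
      Page
    )

    // Threshold holds the three escalating levels for one metric.
    type Threshold struct {
      Metric string
      Inform float64
      Alert  float64
      Page   float64
    }

    // Classify maps a current value to a severity; a value only pages after
    // it has already passed the inform and alert levels.
    func (t Threshold) Classify(v float64) Severity {
      switch {
      case v >= t.Page:
        return Page
      case v >= t.Alert:
        return Alert
      case v >= t.Inform:
        return Inform
      default:
        return OK
      }
    }

    // Example: only a clearly out-of-bounds error ratio pages.
    var errorRatio = Threshold{Metric: "http_error_ratio", Inform: 0.01, Alert: 0.05, Page: 0.20}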

Alerts

  • An alert must be actionable and require action

  • Unless it is a sudden incident (e.g. backhoe fade), an alert must fire before a page.

    • Alerts can escalate over time

Pages

  • A page must be immediately actionable, and critical (if not, it isn't a page, it is an alert or inform)

    • It must be clear and self-contained.

    • Pretend you got the page at 3:00AM while in a cab after a 15 hour flight

      • Is the reason for the page clear?

      • Is all of the supporting info included?

      • Could you talk your computer-illiterate uncle Frank through resolving it?

      • If the answer to any of these is "No", then the page is inadequate.

    • Any false-positive page must be fixed so that it no longer fires

      • This is not optional - pager fatigue is real

  • No more than 2 pages per day

    • A pager/alert budget must be maintained and monitored

    • Cascading failures bad!

Glass

The glass / dashboard / UI / $buzzword has multiple purposes. Trying to use the same view for all purposes is a fail.

Marketing / Reporting

Das Blinkenlights

The marketing/reporting purpose only exists in some instances (and the terms are not really intended to be as pejorative as they sound 😄). These views provide a quick overview of the system status for other teams, etc. A good example of this sort of purpose/view is the "network overview" view at a conference; it shows the number of users, bandwidth, general status, etc. It should be easily digestible by someone seeing it for the first time, with only a few seconds of looking.

Operations

NOC / Dashboard

This should be a playlist of the most important dashboards, cycling through the views. It must clearly communicate the overall status at a glance, with clear markers for out-of-threshold values.