Warren's Distributed Monitoring
Principles & Practices...

 

Primary Principles

"Until you have reliable monitoring, you don’t have a service. What you have is a prolonged period of optimism."
    -- apologies to Laurie Voss.

 

  • Alerts must be actionable

  • Minimize Toil

    • Human intervention should (almost) never be needed; DevOps FTW!

  • Export all metrics

    • 'Tis better to expose and not need, than to need and not have.

    • Disk is cheap, humans are expensive, downtime doubly so.

  • "Events"/Anomalies/logs should only be exported if the event is an anomaly.

    • It's the node's job to know this.

  • Schemas are important.

  • Don't reinvent the wheel; use existing tools for well-trodden roads like TSDBs, graphing, and alerting.

    • Your monitoring needs to be super-stable. Troubleshooting your monitoring tool is a fail.

  • Metrics must be at the appropriate level

    • "Number of HTTP Responses" is good, "Number of HTTP 200 Responses" is much better.

    • Anything that smells even slightly like PII is not a metric.

      • "Bob logged in" is not a metric!

        • Srsly, does this need to be mentioned?!

Metrics vs Events

  • Metrics must be useful in the aggregate

  • Events are for human debugging, and must be useful as a single, standalone event

  • Collectors pull metrics

  • Nodes push events.
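
To make the pull side concrete, here is a minimal sketch of a node in Go using the Prometheus client library. The port and metric name are invented for the example; the push side (events) is sketched under Events below.

    // Pull model: the node only *exposes* /metrics; the collector scrapes it
    // on its own schedule. The node never pushes metrics anywhere.
    // Port and metric name are illustrative, not a convention.
    package main

    import (
      "log"
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promauto"
      "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // A counter the rest of the node increments as it does real work.
    var probeRuns = promauto.NewCounter(prometheus.CounterOpts{
      Name: "probe_runs_total",
      Help: "Probe runs completed since the node started.",
    })

    func main() {
      // ... probe loop calls probeRuns.Inc() ...
      http.Handle("/metrics", promhttp.Handler())
      log.Fatal(http.ListenAndServe(":9100", nil))
    }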

Events

  • All exported events must be counted/metric'ed

  • Be really careful that PII doesn't sneak in

    • Sanitize all errors on the node before exporting

  • Events must have all the information needed to troubleshoot the cause.

    • Joins are not your friend

  • A big blob of text is not an event.

    • If the error can't be parsed, the raw blob can be included as an extra field, but not as the whole event
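
A hedged sketch of what this can look like in Go; the field names, the sanitizer, and the collector URL are all invented for illustration.

    // Event export sketch: structured fields (not a text blob), sanitized on
    // the node before export, and counted as a metric. Field names, the
    // sanitize rule, and the endpoint are illustrative only.
    package events

    import (
      "bytes"
      "encoding/json"
      "net/http"
      "regexp"
      "time"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // All exported events are also counted, per the principle above.
    var eventsExported = promauto.NewCounterVec(prometheus.CounterOpts{
      Name: "probe_events_exported_total",
      Help: "Anomaly events exported by this node, by event type.",
    }, []string{"type"})

    // Event carries everything needed to troubleshoot, so no joins are needed.
    type Event struct {
      Time     time.Time `json:"time"`
      Type     string    `json:"type"`          // e.g. "dns_timeout"
      Target   string    `json:"target"`        // what was being probed
      Detail   string    `json:"detail"`        // parsed, sanitized detail
      RawError string    `json:"raw,omitempty"` // only if the error could not be parsed
    }

    var emailRe = regexp.MustCompile(`[^\s@]+@[^\s@]+`)

    // sanitize strips anything that smells like PII before it leaves the node.
    func sanitize(s string) string {
      return emailRe.ReplaceAllString(s, "<redacted>")
    }

    // Export counts the event and pushes it to the collector.
    func Export(collectorURL string, ev Event) error {
      ev.Detail = sanitize(ev.Detail)
      ev.RawError = sanitize(ev.RawError)
      eventsExported.WithLabelValues(ev.Type).Inc()

      body, err := json.Marshal(ev)
      if err != nil {
        return err
      }
      resp, err := http.Post(collectorURL, "application/json", bytes.NewReader(body))
      if err != nil {
        return err
      }
      return resp.Body.Close()
    }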

Metrics

  • Metric names are important, even more so than variable names.

    • They must be self-describing and clear

    • They should follow best-practices and conventions.

      • If there isn't a convention, reuse the Prometheus one

      • If you violate the convention, encode this in the name

        • E.g. time is in seconds by convention; if you need picoseconds, the metric name must indicate this (see the sketch after this list)

  • A string is not a metric.

    • If you think you have one, it is really either an event or a label

    • The primary exception to this is things like the exporter version

  • Understand your metric types

    • E.g. for Prometheus: Counters, Gauges, Histograms, Summaries

    • Samples are generally float64 with millisecond-precision timestamps

  • Help text is really important, and should communicate the precision of the metric.
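
A sketch of what these conventions look like with the Prometheus Go client; the metric names are invented for the example.

    // Naming/type sketch following Prometheus conventions: base units
    // (seconds), a _total suffix on counters, and help text that says what
    // is actually measured. Metric names here are illustrative.
    package metrics

    import (
      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promauto"
    )

    var (
      // Counter, labelled by status code, so "number of HTTP 200 responses"
      // is just a label selection on one metric.
      httpResponses = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "myapp_http_responses_total",
        Help: "HTTP responses served, by status code.",
      }, []string{"code"})

      // Gauge: a value that goes up and down.
      queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "myapp_work_queue_depth",
        Help: "Items currently waiting in the work queue.",
      })

      // Histogram: latency in seconds, the conventional base unit. If you
      // genuinely needed picoseconds, the name would have to say so, because
      // you would be violating the convention.
      requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "myapp_http_request_duration_seconds",
        Help:    "HTTP request latency in seconds.",
        Buckets: prometheus.DefBuckets,
      })
    )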

Nodes/Probers

Quis custodiet ipsos custodes?

This is important. If your probers are not really reliable, accurate, fast, and (correctly) detailed, they are literally worse than useless.

  • Nodes must be standalone.

    • You don't want to be troubleshooting your troubleshooting.

  • Nodes should be minimally complex.

  • Nodes must not rely on a "central source of truth".

    • They may have a config file for collector location, port number, frequency, etc., but this is only read at startup, not dynamically tunable.

      • It can be really tempting to violate this principle, but dynamic sources of truth, dynamic prober thresholds, etc. will bite you. These sorts of things belong in the collector/alerter. Trying to put them in the prober is a complexity failure.

      • Don't violate this, no matter how attractive the idea seems. It will bite you...

  • Nodes must be trivially deployable (collectors/reporting should be too)

  • Scalability is critical

    • Even in small systems, if this is not followed, the monitor perturbs the system

  • Timeliness is critical

    • Knowing that a problem happened 5 minutes ago is not useful; you need to be able to see what's happening now.

    • Nodes must expose a health check/proof of life

      • A counter of tests run is acceptable for this, as long as it ticks on the order of minutes, not hours.

  • Exporters must expose their own internal state (RAM, heap, etc.) - Quis custodiet ipsos custodes? (see the sketch below)
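
A hedged sketch of the proof-of-life counter and of exposing the exporter's own internals, using the stock Go runtime and process collectors from the Prometheus client; the port, metric name, and probe interval are invented.

    // Node self-monitoring sketch: a proof-of-life counter that ticks every
    // probe cycle, plus the stock Go runtime and process collectors
    // (heap, GC, RSS, open FDs, ...). Port, name, and interval are illustrative.
    package main

    import (
      "log"
      "net/http"
      "time"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/collectors"
      "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
      reg := prometheus.NewRegistry()

      // Quis custodiet ipsos custodes? Export the exporter's own state.
      reg.MustRegister(
        collectors.NewGoCollector(),
        collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
      )

      // Proof of life: a counter of tests run.
      testsRun := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "probe_tests_run_total",
        Help: "Probe cycles completed since the node started.",
      })
      reg.MustRegister(testsRun)

      go func() {
        for range time.Tick(1 * time.Minute) {
          // ... run the actual probes here ...
          testsRun.Inc()
        }
      }()

      http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
      log.Fatal(http.ListenAndServe(":9101", nil))
    }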

Collectors

⚠️ Don't be stingy...

This is a really common mistake. For all practical purposes, disk is free for TSDB use. As an example, Prometheus stores an average of only 1-2 bytes per sample, so if you collect 1,000 samples per second you can store roughly 15 years of metrics on a 1TB disk. If you earn just $100K/year, an hour of your time is worth roughly the price of that 1TB disk. That means each year of retention costs about 4 minutes (60 minutes / 15 years) of your time; if not having a metric handy wastes more than that per year, skimping on storage is a false economy. Obviously your loaded costs are higher, the cost of downtime far exceeds your salary, etc. - but that only makes the argument stronger.
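
The back-of-the-envelope math, as a sketch; the 2 bytes/sample, 1,000 samples/second, and disk-for-an-hour-of-salary figures are the rough assumptions from the paragraph above.

    // Retention arithmetic sketch; all inputs are rough assumptions.
    package main

    import "fmt"

    func main() {
      const (
        bytesPerSample   = 2.0  // Prometheus averages roughly 1-2 bytes/sample
        samplesPerSecond = 1000.0
        diskBytes        = 1e12 // 1 TB
        secondsPerYear   = 365.25 * 24 * 3600
      )

      bytesPerYear := bytesPerSample * samplesPerSecond * secondsPerYear
      yearsOnDisk := diskBytes / bytesPerYear

      fmt.Printf("~%.0f GB per year, ~%.1f years on a 1 TB disk\n",
        bytesPerYear/1e9, yearsOnDisk)

      // If the whole disk costs about one hour of salary, each year of
      // retention costs about this many minutes of salary:
      fmt.Printf("~%.1f minutes of salary per year of retention\n", 60/yearsOnDisk)
    }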

  • Don't be stingy with retention, number of metrics, etc.

  • You are monitoring for a bunch of reasons - uptime, performance, security, capacity planning, etc

    • These are clearly much more valuable than the cost of some disk/RAM/CPU/bits.

  • Metrics should be shared. Transparency good, silos bad.

    • The only exception to this is security metrics

  • Sharding/clustering the collector is important

    • You can look really dumb if you don't have metrics showing the outage because your link to the collector is implicated in the outage.

Operations

⚠️ Read the Site Reliability Engineering book - it's <$30 on Amazon and well worth the money. Srsly.

This is just a very high level overview of the ethos, along with my focus on the monitoring bits.

  • Minimize Toil

    • Logging into a machine is a fail

  • Thresholds must be visible and tunable.

    • They must progress from Inform -> Alert -> Page

    • Unless there is a serious incident, any time a threshold causes a page, someone messed up...
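
One way to keep the Inform -> Alert -> Page progression visible and tunable is to keep it as plain data in the collector/alerter rather than burying it in prober code; a sketch, with an invented structure and numbers.

    // Threshold sketch: severities progress Inform -> Alert -> Page, and the
    // levels are plain, inspectable data that lives in the collector/alerter
    // (never in the node). Structure and values are illustrative.
    package thresholds

    type Severity int

    const (
      OK Severity = iota
      Inform
      Alert
      Page
    )

    // Threshold holds the three escalating levels for one metric.
    type Threshold struct {
      Metric string
      Inform float64
      Alert  float64
      Page   float64
    }

    // Classify maps a current value to a severity; a value only pages after
    // it has already passed the inform and alert levels.
    func (t Threshold) Classify(v float64) Severity {
      switch {
      case v >= t.Page:
        return Page
      case v >= t.Alert:
        return Alert
      case v >= t.Inform:
        return Inform
      default:
        return OK
      }
    }

    // Example: only a clearly out-of-bounds error ratio pages.
    var errorRatio = Threshold{Metric: "http_error_ratio", Inform: 0.01, Alert: 0.05, Page: 0.20}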

Alerts

  • An alert must be actionable and require action

  • Unless it is a sudden incident (e.g. backhoe fade), an alert must fire before a page.

    • Alerts can escalate over time

Pages

  • A page must be immediately actionable, and critical (if not, it isn't a page, it is an alert or inform)

    • It must be clear and self-contained.

    • Pretend you got the page at 3:00AM while in a cab after a 15 hour flight

      • Is the reason for the page clear?

      • Is all of the supporting info included?

      • Could you talk your computer-illiterate uncle Frank through resolving it?

      • If the answer to any of these is "No", then the page is inadequate.

    • Any false-positive page must be fixed so that it no longer fires

      • This is not optional - pager fatigue is real

  • No more than 2 pages per day

    • A pager/alert budget must be maintained and monitored

    • Cascading failures bad!

Glass

The glass / dashboard / UI / $buzzword has multiple purposes. Trying to use the same view for all purposes is a fail.

Marketing / Reporting

Das Blinkenlights

The marketing/reporting purpose only exists in some instances (and the terms are not really intended to be as pejorative as they sound 😄). These views provide a quick overview of the system status for other teams, etc. A good example of this sort of purpose/view is the "network overview" view at a conference; it shows the number of users, bandwidth, general status, etc. It should be easily digestible by someone seeing it for the first time, with only a few seconds of looking.

Operations

NOC / Dashboard

This should be a playlist of the most important dashboards, cycling through the views. It must clearly communicate the overall status at a glance, with clear markers for out-of-threshold values.