Warren's Distributed Monitoring
Principles & Practices...
Primary Principles
"Until you have reliable monitoring, you don’t have a service. What you have is a prolonged period of optimism."
-- apologies to Laurie Voss.
- Alerts must be actionable
- Minimize Toil
  - Human intervention should (almost) never be needed; DevOps FTW!
- Export all metrics
  - 'Tis better to expose and not need, than to need and not have.
  - Disk is cheap, humans are expensive, downtime doubly so.
- "Events"/Anomalies/logs should only be exported if the event is an anomaly.
  - It's the node's job to know this.
- Schemas are important.
- Don't reinvent the wheel; use existing tools for well-trodden roads, like TSDBs, graphing, alerting.
  - Your monitoring needs to be super-stable. Troubleshooting your monitoring tool is a fail.
- Metrics must be at the appropriate level
  - "Number of HTTP Responses" is good, "Number of HTTP 200 Responses" is much better (see the sketch after this list).
- Anything that smells even slightly like PII is not a metric.
  - "Bob logged in" is not a metric!
    - Srsly, does this need to be mentioned?!
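As a minimal sketch of the "appropriate level" point, assuming a Prometheus-style setup with the Python prometheus_client library (the metric name and port are illustrative, not prescriptive):

```python
# Sketch: count HTTP responses at a useful level of detail by labelling the
# counter with the status code, instead of exporting one opaque total.
import time
from prometheus_client import Counter, start_http_server

HTTP_RESPONSES = Counter(
    "http_responses_total",                        # convention: counters end in _total
    "Total HTTP responses sent, by status code.",  # help text matters
    ["code"],
)

def record_response(status_code: int) -> None:
    """Call this wherever your server finishes sending a response."""
    HTTP_RESPONSES.labels(code=str(status_code)).inc()

if __name__ == "__main__":
    start_http_server(9100)   # expose /metrics for the collector to pull
    record_response(200)
    record_response(503)
    time.sleep(300)           # keep the endpoint up so a scrape can happen
```

The collector can still sum across codes for the aggregate "number of responses", but the per-code breakdown is there the moment you need it.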
Metrics vs Events
- Metrics must be useful in the aggregate.
- Events are for human debugging, and must be useful as a single record.
- Collectors pull metrics.
- Nodes push events (see the sketch below).
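A minimal sketch of that split, assuming Python with prometheus_client for the pulled metrics and a plain HTTP POST (via requests) for the pushed events; the event sink URL is a hypothetical placeholder:

```python
# Sketch: metrics are exposed for the collector to scrape (pull);
# anomalous events are pushed to an event sink as they happen.
import requests  # assumption: events are pushed over HTTP as JSON
from prometheus_client import Counter, start_http_server

PROBE_FAILURES = Counter("probe_failures_total", "Total failed probes.")
EVENT_SINK_URL = "http://events.example.internal/ingest"  # hypothetical endpoint

def report_failure(target: str, error: str) -> None:
    PROBE_FAILURES.inc()  # every pushed event is also counted (see Events below)
    event = {"type": "probe_failure", "target": target, "error": error}
    try:
        requests.post(EVENT_SINK_URL, json=event, timeout=2)
    except requests.RequestException:
        pass  # pushing is best-effort; the counter above still records it

if __name__ == "__main__":
    start_http_server(9100)  # pull side: the collector scrapes /metrics
    report_failure("db1.example.internal", "connection refused")
```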
Events
- All exported events must be counted/metric'ed.
- Be really careful that PII doesn't sneak in.
  - Sanitize all errors on the node before exporting (see the sketch below).
- Events must be self-contained.
  - Joins are not your friend.
- A big blob of text is not an event.
  - A blob can be included as well if the error cannot be parsed.
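A minimal sketch of node-side event hygiene, using the same Python prometheus_client assumption; the scrub patterns, field names, and node name are illustrative, not a complete PII policy:

```python
# Sketch: sanitize errors on the node, keep events self-contained (no joins
# needed downstream), and count every event that gets exported.
import re
import time
from prometheus_client import Counter

EVENTS_EXPORTED = Counter("events_exported_total", "Events exported, by type.", ["type"])

# Illustrative scrub patterns -- a real deployment needs its own PII review.
SCRUB_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),   # email addresses
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),  # IPv4 addresses
]

def sanitize(text: str) -> str:
    for pattern, replacement in SCRUB_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def make_event(event_type: str, target: str, raw_error: str) -> dict:
    """Build a self-contained event: everything a human needs, no joins required."""
    EVENTS_EXPORTED.labels(type=event_type).inc()
    return {
        "type": event_type,
        "target": target,
        "error": sanitize(raw_error),
        "timestamp": time.time(),
        "prober": "probe-node-01",  # hypothetical node name
    }
```

If the raw error can't be parsed at all, the sanitized blob can ride along as an extra field next to the structured ones.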
Metrics
- Metric names are important, even more so than variable names.
  - They must be self-describing and clear.
  - They should follow best practices and conventions.
    - If there isn't a convention, reuse an existing one.
    - If you violate the convention, encode this in the name.
      - E.g. time is in seconds; if you need picoseconds, the metric name must indicate this.
- A string is not a metric.
  - If you think it is, you either:
    - do not understand the system you are monitoring, or
    - don't grok monitoring. I suggest
  - The primary exception to this is things like exporter version.
- Understand your metric types (see the sketch below).
  - E.g. for Prometheus: Counters, Gauges, Histograms, Summaries.
  - Samples are generally float64 with millisecond-precision timestamps.
- Help text is really important, and should communicate the precision of the metric.
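A minimal sketch of these naming, typing, and help-text conventions, again assuming the Python prometheus_client library; the metric names and values are illustrative:

```python
# Sketch: self-describing names with unit suffixes, appropriate metric types,
# and help text that says exactly what is being measured.
from prometheus_client import Counter, Gauge, Histogram, Info

# Counter: monotonically increasing, name ends in _total.
REQUESTS = Counter("myapp_requests_total", "Total requests handled.")

# Gauge: a value that can go up and down; the unit (bytes) is in the name.
QUEUE_BYTES = Gauge("myapp_queue_bytes", "Current size of the work queue in bytes.")

# Histogram: distributions, e.g. latency; the convention is seconds, and the
# name says so. If you ever needed picoseconds, the name would have to say that.
LATENCY = Histogram(
    "myapp_request_duration_seconds",
    "Request latency in seconds, measured from accept() to last byte written.",
)

# The rare legitimate "string": build/version info, exported as labels.
BUILD_INFO = Info("myapp_build", "Build information for this exporter.")
BUILD_INFO.info({"version": "1.4.2", "commit": "deadbeef"})  # illustrative values
```

Used like this, `REQUESTS.inc()` and `LATENCY.observe(elapsed_seconds)` keep samples numeric, and the help text carries the units and measurement boundaries for whoever reads the /metrics page later.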
Nodes/Probers
Quis custodiet ipsos custodes? ("Who watches the watchmen?")
This is important. If your probers are not really reliable, accurate, fast, and (correctly) detailed, they are literally worse than useless.
- Nodes must be standalone.
  - You don't want to be troubleshooting your troubleshooting.
- Nodes should be minimally complex.
- Nodes must not rely on a "central source of truth".
  - They may have a config file for collector location, port number, frequency, etc., but this is only read at startup, not dynamically tunable.
  - It can be really tempting to violate this principle, but dynamic sources of truth, dynamic prober thresholds, etc. will bite you. These sorts of things belong in the collector/alerter. Trying to put them in the prober is a complexity failure.
  - Don't violate this, no matter how attractive the idea seems. It will bite you...
- Nodes must be trivially deployed (collectors/reporting should be trivially deployed).
- Scalability is critical.
  - Even in small systems, if this is not followed, the monitor perturbs the system.
- Timeliness is critical.
  - Knowing that a problem happened 5 minutes ago is not useful; you need to be able to see what's happening now.
- Nodes must expose a health check/proof of life (see the sketch below).
  - A counter of tests run is acceptable for this, as long as it is updated on the order of minutes.
- Exporters must expose internal state, like RAM, heap, etc. - Quis custodiet ipsos custodes?
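A minimal sketch of a prober that follows these rules, in Python with prometheus_client; the config path, config fields, target, and port are illustrative assumptions:

```python
# Sketch: a standalone prober. Config is read once at startup (no central
# source of truth, no dynamic tuning), every probe run is counted as a
# proof of life, and results are exposed for the collector to pull.
import json
import socket
import time
from prometheus_client import Counter, Gauge, start_http_server

PROBES_RUN = Counter("prober_probes_run_total", "Probes executed (proof of life).")
PROBE_UP = Gauge("prober_target_up", "1 if the last probe of the target succeeded, else 0.")

def load_config(path: str = "/etc/prober.json") -> dict:
    """Read static config (target, port, interval) once, at startup only."""
    with open(path) as f:
        return json.load(f)

def probe_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main() -> None:
    cfg = load_config()
    # prometheus_client's default process collector also exposes RAM/CPU
    # internals (on Linux), covering "quis custodiet ipsos custodes".
    start_http_server(cfg.get("listen_port", 9115))
    while True:
        PROBES_RUN.inc()
        PROBE_UP.set(1 if probe_tcp(cfg["target_host"], cfg["target_port"]) else 0)
        time.sleep(cfg.get("interval_seconds", 30))

if __name__ == "__main__":
    main()
```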
Collectors
⚠️ Don't be stingy...
This is a really common mistake that people make. For all practical purposes, disk is free for TSDB use. As an example, Prometheus stores an average of only 1-2 bytes per sample. If you collect 1,000 samples per second, that means you can store roughly 15 years of metrics on a 1TB disk. If you earn just $100K/year, you are earning about one 1TB disk per hour. This means that if you waste more than about 4 minutes (60/15) per year because you don't have a metric handy, it's false economy. Obviously, your loaded costs are higher, the cost of downtime way exceeds your salary, etc. - but that just makes the argument stronger.
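Spelling out the back-of-envelope numbers above (same illustrative assumptions as in the text: ~2 bytes/sample, 1,000 samples/second, a $100K salary, plus an assumed ~$50 price for a 1TB disk):

```python
# Back-of-envelope: how long does 1 TB of Prometheus-style storage last,
# and what does it cost in salary-hours?
BYTES_PER_SAMPLE = 2          # Prometheus averages roughly 1-2 bytes/sample
SAMPLES_PER_SECOND = 1_000
DISK_BYTES = 1e12             # 1 TB

seconds_of_retention = DISK_BYTES / (BYTES_PER_SAMPLE * SAMPLES_PER_SECOND)
years_of_retention = seconds_of_retention / (365 * 24 * 3600)
print(f"Retention on 1 TB: ~{years_of_retention:.1f} years")   # ~15.9 years

salary_per_hour = 100_000 / 2_080   # ~$48/hour at $100K and 2,080 work-hours/year
disk_cost_dollars = 50              # assumed rough price of a 1 TB disk
hours_to_earn_disk = disk_cost_dollars / salary_per_hour
minutes_per_year_breakeven = hours_to_earn_disk * 60 / years_of_retention
print(f"Break-even: ~{minutes_per_year_breakeven:.0f} wasted minutes/year")  # ~4
```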
- Don't be stingy with retention, number of metrics, etc.
  - You are monitoring for a bunch of reasons - uptime, performance, security, capacity planning, etc.
    - These are clearly much more valuable than the cost of some disk/RAM/CPU/bits.
- Metrics should be shared. Transparency good, silos bad.
  - The only exception to this is security metrics.
- Sharding/clustering the collector is important.
  - You can look really dumb if you don't have metrics showing the outage because your link to the collector is implicated in the outage.
Operations
⚠️ Read the book - it's well worth the money. Srsly.
This is just a very high level overview of the ethos, along with my focus on the monitoring bits.
- Minimize Toil
  - Logging into a machine is a fail.
- Thresholds must be visible and tunable.
  - They must progress from Inform -> Alert -> Page (see the sketch below).
  - Unless there is a serious incident, any time a threshold causes a page, someone messed up...
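A minimal sketch of the Inform -> Alert -> Page ladder with visible, tunable thresholds; the metric and threshold values are made up, and (per the Nodes/Probers section) this logic belongs in the collector/alerter, not the prober:

```python
# Sketch: one metric, three visible thresholds, escalating severity.
from dataclasses import dataclass

@dataclass
class Thresholds:
    inform: float   # worth a ticket/dashboard note
    alert: float    # needs action during working hours
    page: float     # wake someone up

# Illustrative values: error ratio of a service (0.0 - 1.0).
ERROR_RATIO_THRESHOLDS = Thresholds(inform=0.01, alert=0.05, page=0.20)

def classify(value: float, t: Thresholds) -> str:
    """Map a metric value onto the Inform -> Alert -> Page ladder."""
    if value >= t.page:
        return "page"
    if value >= t.alert:
        return "alert"
    if value >= t.inform:
        return "inform"
    return "ok"

assert classify(0.03, ERROR_RATIO_THRESHOLDS) == "alert"
assert classify(0.50, ERROR_RATIO_THRESHOLDS) == "page"
```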
- Alerts
  - An alert must be actionable and require action.
  - Unless it is a sudden incident (e.g. backhoe fade), an alert must fire before a page.
  - Alerts can escalate over time.
- Pages
  - A page must be immediately actionable, and critical (if not, it isn't a page, it is an alert or inform).
  - It must be clear and self-contained.
    - Pretend you got the page at 3:00 AM while in a cab after a 15-hour flight:
      - Is the reason for the page clear?
      - Is all of the supporting info included?
      - Could you talk your computer-illiterate uncle Frank through resolving it?
      - If the answer to any of these is "No", then the page is inadequate.
  - Any false-positive pages must be fixed so they no longer fire.
    - This is not optional - pager fatigue is real.
  - No more than 2 pages per day.
    - A pager/alert budget must be maintained and monitored.
  - Cascading failures bad!
Glass
The glass / dashboard / UI / $buzzword has multiple purposes. Trying to use the same view for all purposes is a fail.
Marketing / Reporting
Das Blinkenlights
The marketing/reporting purpose only exists in some instances (and the terms are not really intended to be as pejorative as they sound 😄). These views provide a quick overview of the system status for other teams, etc. A good example of this sort of purpose/view is the "network overview" view at a conference; it shows the number of users, bandwidth, general status, etc. It should be easily digestible by someone seeing it for the first time, with only a few seconds of looking.
Operations
NOC / Dashboard
This should be a playlist of the most important dashboards, cycling through the views. It must clearly communicate the overall status at a glance, with clear markers for out of threshold values.