Warren's Distributed Monitoring
Principles & Practices...
Primary Principles
"Until you have reliable monitoring, you don’t have a service. What you have is a prolonged period of optimism."
-- apologies to Laurie Voss.
- Alerts must be actionable.
- Minimize Toil.
  - Human intervention should (almost) never be needed; DevOps FTW!
- Export all metrics.
  - 'Tis better to expose and not need, than to need and not have.
  - Disk is cheap, humans are expensive, downtime doubly so.
- "Events"/anomalies/logs should only be exported if the event is an anomaly.
  - It's the node's job to know this.
- Schemas are important.
- Don't reinvent the wheel; use existing tools for well-trodden roads, like TSDBs, graphing, alerting.
  - Your monitoring needs to be super-stable. Troubleshooting your monitoring tool is a fail.
- Metrics must be at the appropriate level.
  - "Number of HTTP Responses" is good; "Number of HTTP 200 Responses" is much better (see the counter sketch after this list).
- Anything that smells even slightly like PII is not a metric.
  - "Bob logged in" is not a metric!
  - Srsly, does this need to be mentioned?!
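As a minimal sketch of the "appropriate level" point, using the Python prometheus_client library (the metric and label names here are illustrative assumptions, not a prescribed schema): a single counter labeled by status code still sums to "all responses", but also makes a spike in 500s immediately visible.

```python
from prometheus_client import Counter, start_http_server

# One counter, labeled by status code: sum across labels for "all responses",
# or slice by code to spot a spike in 500s immediately.
HTTP_RESPONSES = Counter(
    "http_responses_total",
    "Total HTTP responses served, labeled by status code.",
    ["code"],
)

def record_response(status_code: int) -> None:
    """Call this once per response served."""
    HTTP_RESPONSES.labels(code=str(status_code)).inc()

if __name__ == "__main__":
    start_http_server(8000)   # expose /metrics for the collector to pull
    record_response(200)
    record_response(503)
```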
Metrics vs Events
- Metrics must be useful in the aggregate.
- Events are for human debugging, and must be useful as a single event.
- Collectors pull metrics.
- Nodes push events.
Events
- All exported events must be counted/metric'ed.
- Be really careful that PII doesn't sneak in.
  - Sanitize all errors on the node before exporting (see the sketch after this list).
- Events must be self-contained.
  - Joins are not your friend.
- A big blob of text is not an event.
  - A blob can be included as well if the node is unable to parse the error.
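To illustrate "count every exported event" and "sanitize before export", here is a minimal Python sketch with prometheus_client; the export_event/push_to_collector functions and the two regexes are hypothetical, and real PII scrubbing needs far more rigor than this.

```python
import re
from prometheus_client import Counter

# Every exported event is also counted, so rates stay visible even when the
# event pipeline itself is having a bad day.
EVENTS_EXPORTED = Counter(
    "events_exported_total",
    "Events exported by this node, labeled by event type.",
    ["event_type"],
)

# Deliberately crude scrubbers, for illustration only.
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def sanitize(text: str) -> str:
    """Scrub likely PII from an error message before it leaves the node."""
    text = _EMAIL_RE.sub("<email>", text)
    return _IPV4_RE.sub("<ip>", text)

def export_event(event_type, message, raw_blob=None):
    """Build a structured, pre-sanitized event and count it."""
    EVENTS_EXPORTED.labels(event_type=event_type).inc()
    event = {"type": event_type, "message": sanitize(message)}
    if raw_blob is not None:        # only when the node couldn't parse the error
        event["raw"] = sanitize(raw_blob)
    push_to_collector(event)

def push_to_collector(event):
    """Hypothetical transport -- nodes push events to the collector."""
    ...
```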
Metrics
- Metric names are important, even more so than variable names.
  - They must be self-describing and clear (see the naming sketch after this list).
- They should follow best practices and conventions.
  - If there isn't a convention, reuse the closest existing one.
  - If you violate the convention, encode this in the name.
    - E.g. time is in seconds. If you need picoseconds, the metric name must indicate this.
- A string is not a metric.
  - If you think it is, you either:
    - do not understand the system you are monitoring, or
    - don't grok monitoring (I suggest fixing that first).
  - The primary exception to this is things like exporter version.
- Understand your metric types (see the types sketch after this list).
  - E.g. for Prometheus: Counters, Gauges, Histograms, Summaries.
  - Samples are generally float64 with millisecond-precision timestamps.
- Help text is really important, and should communicate the precision of the metric.
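A sketch of the naming conventions and the "exporter version" exception, using the Python prometheus_client library; the metric names, version, and commit values are invented for illustration.

```python
from prometheus_client import Counter, Info

# The unit lives in the name (seconds), per the usual Prometheus convention;
# if this were picoseconds, the name would have to say so.
PROCESSING_TIME = Counter(
    "myservice_processing_seconds_total",
    "Cumulative time spent processing requests, in seconds.",
)

# The classic exception to "a string is not a metric": exporter/build metadata,
# exposed as an Info metric rather than stuffed into labels on real metrics.
BUILD_INFO = Info(
    "myservice_build",
    "Build and version information for this exporter.",
)
BUILD_INFO.info({"version": "1.4.2", "commit": "deadbeef"})   # invented values
```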
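And a sketch of the four Prometheus metric types, again with made-up names; note how every help string says what is measured and in which unit.

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: only ever goes up (resets to zero on restart).
ERRORS = Counter("myservice_errors_total", "Total errors observed, any cause.")

# Gauge: can go up and down.
QUEUE_DEPTH = Gauge("myservice_queue_depth", "Current number of queued items.")

# Histogram: bucketed observations, aggregatable across nodes for percentiles.
LATENCY = Histogram(
    "myservice_request_duration_seconds",
    "Request latency in seconds.",
    buckets=(0.005, 0.05, 0.5, 5.0),
)

# Summary: count and sum of observations (the Python client does not export
# quantiles for Summaries).
PAYLOAD = Summary("myservice_payload_bytes", "Observed payload sizes in bytes.")

ERRORS.inc()
QUEUE_DEPTH.set(17)
LATENCY.observe(0.042)
PAYLOAD.observe(2048)
```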
Nodes/Probers
Quis custodiet ipsos custodes?
This is important. If your probers are not really reliable, accurate, fast, and (correctly) detailed, they are literally worse than useless.
- Nodes must be standalone.
  - You don't want to be troubleshooting your troubleshooting.
- Nodes should be minimally complex.
- Nodes must not rely on a "central source of truth".
  - They may have a config file for collector location, port number, frequency, etc., but this is only read at startup, not dynamically tunable.
  - It can be really tempting to violate this principle, but dynamic sources of truth, dynamic prober thresholds, etc. will bite you. These sorts of things belong in the collector/alerter; trying to put them in the prober is a complexity failure.
  - Don't violate this, no matter how attractive the idea seems. It will bite you...
- Nodes must be trivially deployed (collectors/reporting should be trivially deployed).
- Scalability is critical.
  - Even in small systems, if this is not followed, the monitor perturbs the system.
- Timeliness is critical.
  - Knowing that a problem happened 5 minutes ago is not useful; you need to be able to see what's happening now.
- Nodes must expose a health check/proof of life (see the prober sketch after this list).
  - A counter of tests run is acceptable for this, as long as it updates on the order of minutes.
- Exporters must expose internal state, like RAM, heap, etc.
  - Quis custodiet ipsos custodes?
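As a sketch of these node principles together -- static config read once at startup, a proof-of-life counter, and exposure of the exporter's own internal state -- here is a minimal standalone prober using the Python prometheus_client library. The config keys, target, and probe logic are hypothetical.

```python
import json
import time
import urllib.request

from prometheus_client import Counter, Gauge, start_http_server

# Proof of life: the collector can alert if this counter stops increasing.
PROBE_RUNS = Counter("prober_runs_total", "Probe cycles completed by this node.")
PROBE_FAILURES = Counter("prober_failures_total", "Probe cycles that failed.")
LAST_PROBE_SECONDS = Gauge(
    "prober_last_probe_duration_seconds",
    "Duration of the most recent probe, in seconds.",
)

def load_config(path="prober.json"):
    """Read the static config exactly once, at startup -- never dynamically."""
    with open(path) as f:
        # e.g. {"target": "http://...", "listen_port": 9100, "interval_seconds": 60}
        return json.load(f)

def probe(target):
    """One self-contained check; no central source of truth is consulted."""
    start = time.monotonic()
    try:
        urllib.request.urlopen(target, timeout=5)
    except Exception:
        PROBE_FAILURES.inc()
    finally:
        LAST_PROBE_SECONDS.set(time.monotonic() - start)
        PROBE_RUNS.inc()

if __name__ == "__main__":
    config = load_config()
    # /metrics also carries the client's default process metrics (RSS, CPU,
    # open FDs) on supported platforms -- part of "expose internal state".
    start_http_server(config["listen_port"])
    while True:
        probe(config["target"])
        time.sleep(config["interval_seconds"])
```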
Collectors
⚠️ Don't be stingy...
This is a really common mistake that people make. For all practical purposes, disk is free for TSDB use. As an example, Prometheus stores an average of only 1-2 bytes per sample. If you collect 1000 metrics per second, that means you can store 15 years of metrics on a 1TB disk. If you earn just $100K a year, you are earning about one 1TB disk per hour. This means that if you waste more than 4 minutes (60/15) per year because you don't have a metric handy, it's a false economy. Obviously, your loaded costs are higher, the cost of downtime far exceeds your salary, etc. - but that just makes the argument stronger.
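Spelling that arithmetic out as a quick back-of-the-envelope script (the disk price and hours-per-year figures are illustrative assumptions):

```python
# Back-of-the-envelope retention math from the paragraph above.
bytes_per_sample = 2                  # Prometheus averages roughly 1-2 bytes/sample
samples_per_second = 1000
disk_bytes = 1e12                     # 1 TB

seconds_retained = disk_bytes / (bytes_per_sample * samples_per_second)
years_retained = seconds_retained / (365 * 24 * 3600)

disk_price_usd = 50                   # rough price of a 1 TB disk (assumption)
salary_usd_per_hour = 100_000 / (52 * 40)
hours_to_earn_disk = disk_price_usd / salary_usd_per_hour

minutes_per_year_of_retention = hours_to_earn_disk * 60 / years_retained

print(f"~{years_retained:.0f} years of retention")            # ~16 years
print(f"~{minutes_per_year_of_retention:.0f} min of salary "  # ~4 minutes
      f"per year of retention")
```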
- Don't be stingy with retention, number of metrics, etc.
- You are monitoring for a bunch of reasons - uptime, performance, security, capacity planning, etc.
  - These are clearly much more valuable than the cost of some disk/RAM/CPU/bits.
- Metrics should be shared. Transparency good, silos bad.
  - The only exception to this is security metrics.
- Sharding/clustering the collector is important.
  - You can look really dumb if you don't have metrics showing the outage because your link to the collector is implicated in the outage.
Operations
⚠️ Read the book - it's well worth the money. Srsly.
This is just a very high-level overview of the ethos, along with my focus on the monitoring bits.
- Minimize Toil.
  - Logging into a machine is a fail.
- Thresholds must be visible and tunable (see the sketch after this list).
  - They must progress from Inform -> Alert -> Page.
  - Unless there is a serious incident, any time a threshold causes a page, someone messed up...
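A minimal sketch of what "visible, tunable, Inform -> Alert -> Page" might look like; in a real deployment this logic lives in the collector/alerter (e.g. as alerting rules), and the threshold values and names here are made up for illustration.

```python
from enum import Enum

class Severity(Enum):
    OK = 0
    INFORM = 1   # visible on the dashboard, no notification
    ALERT = 2    # notify the team channel / ticket queue
    PAGE = 3     # wake a human up

# Thresholds are explicit, named, and live in one visible, tunable place --
# in the alerter, never baked into the prober.
ERROR_RATIO_THRESHOLDS = {
    Severity.INFORM: 0.01,
    Severity.ALERT: 0.05,
    Severity.PAGE: 0.20,
}

def classify(error_ratio: float) -> Severity:
    """Map a measured error ratio onto the Inform -> Alert -> Page ladder."""
    level = Severity.OK
    for severity in (Severity.INFORM, Severity.ALERT, Severity.PAGE):
        if error_ratio >= ERROR_RATIO_THRESHOLDS[severity]:
            level = severity
    return level

assert classify(0.02) is Severity.INFORM
assert classify(0.50) is Severity.PAGE
```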
Alerts
- An alert must be actionable and require action.
- Unless it is a sudden incident (e.g. backhoe fade), an alert must fire before a page.
  - Alerts can escalate over time.
Pages
- A page must be immediately actionable, and critical (if not, it isn't a page, it is an alert or inform).
  - It must be clear and self-contained (see the sketch after this list).
    - Pretend you got the page at 3:00AM while in a cab after a 15 hour flight:
      - Is the reason for the page clear?
      - Is all of the supporting info included?
      - Could you talk your computer-illiterate uncle Frank through resolving it?
      - If the answer to any of these is "No", then the page is inadequate.
- Any false-positive page must be fixed so it no longer fires falsely.
  - This is not optional - pager fatigue is real.
- No more than 2 pages per day.
  - A pager/alert budget must be maintained and monitored.
  - Cascading failures bad!
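One way to keep "clear and self-contained" honest is to make the page a structured object whose required fields are exactly the 3:00AM questions above. This sketch is purely illustrative; the field names, metric expression, and runbook URL are invented, not a real paging API.

```python
from dataclasses import dataclass

@dataclass
class Page:
    """A page that can be acted on from a cab at 3:00AM."""
    reason: str          # why this is critical, in one plain sentence
    evidence: str        # the supporting numbers / graph links, inline
    runbook_url: str     # step-by-step fix that uncle Frank could follow
    escalation: str      # who to call if the runbook doesn't work

    def is_adequate(self) -> bool:
        # If any of these is missing, it isn't a page yet.
        return all([self.reason, self.evidence, self.runbook_url, self.escalation])

page = Page(
    reason="checkout error ratio above 20% for 10 minutes",
    evidence="http_responses_total{code=~'5..'} at 22% vs 0.3% baseline",
    runbook_url="https://example.internal/runbooks/checkout-errors",  # invented
    escalation="secondary on-call via the usual rotation",
)
assert page.is_adequate()
```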
Glass
The glass / dashboard / UI / $buzzword has multiple purposes. Trying to use the same view for all purposes is a fail.
Marketing / Reporting
Das Blinkenlights
The marketing/reporting purpose only exists in some instances (and the terms are not really intended to be as pejorative as they sound 😄). These views provide a quick overview of the system status for other teams, etc. A good example of this sort of purpose/view is the "network overview" view at a conference; it shows the number of users, bandwidth, general status, etc. It should be easily digestible by someone seeing it for the first time, with only a few seconds of looking.
Operations
NOC / Dashboard
This should be a playlist of the most important dashboards, cycling through the views. It must clearly communicate the overall status at a glance, with clear markers for out-of-threshold values.