A couple of excellent related videos from Goto Conference 2017. Some highlights are below.
Metrics are better than nothing, but some context will make them much more useful. (My queue is filling up – is that because more things are arriving than I’d expect, or things are leaving more slowly?)
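As a rough illustration of adding context to a queue metric, here is a minimal sketch that records arrivals and departures alongside depth, so a growing queue can be explained instead of just observed. The `QueueSample` type and `explain_growth` function are hypothetical names invented for this example, and comparing against the previous interval is only one possible baseline.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    depth: int     # items waiting at the end of the interval
    arrived: int   # items enqueued during the interval
    departed: int  # items dequeued during the interval


def explain_growth(prev: QueueSample, curr: QueueSample) -> str:
    """Say why the queue grew, by comparing rates against the previous interval."""
    if curr.depth <= prev.depth:
        return "not growing"
    more_arrivals = curr.arrived > prev.arrived
    slower_departures = curr.departed < prev.departed
    if more_arrivals and slower_departures:
        return "arrivals up and departures down"
    if more_arrivals:
        return "arrivals up"
    if slower_departures:
        return "departures down"
    # Rates are unchanged, so the queue was already falling behind.
    return "rates unchanged: arrivals already exceed departures"
```

Depth alone would look the same in every one of these cases; the two extra counters are what turn the number into an answer.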
Alerts and logs are better than nothing, but centralise them (to let you see everything easily) and give some way of filtering and grouping so that you can find the signal in the noise. Also there should be some way of telling when errors have changed – these 13 systems were red, but now this 14th system has failed too. A change from very red to slightly more red is hard to spot.
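One way to make "the 14th system just went red" visible is to compare the current set of failing systems with the previous one and surface only the difference. This is a minimal sketch; the function names and system names are made up for illustration, and a real alerting tool would also need to handle flapping and acknowledgement.

```python
from __future__ import annotations


def newly_failing(previous: set[str], current: set[str]) -> set[str]:
    """Systems that are failing now but were not failing at the last check."""
    return current - previous


def recovered(previous: set[str], current: set[str]) -> set[str]:
    """Systems that were failing at the last check but are healthy now."""
    return previous - current


# Hypothetical example: 13 systems already red, then one more fails.
before = {f"system-{n}" for n in range(1, 14)}
after = before | {"system-14"}
print(newly_failing(before, after))  # {'system-14'}
```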
Debugging shouldn’t be guesswork, but rather it should be a series of questions and answers that spiral in on the root cause. Write your system so that people can ask it useful questions and get decent answers.
Don’t be fooled into thinking that micro-services have made the world simpler. There are little islands that are simpler to understand, but their interactions are complex. One micro-service instance deciding to quit because it’s overloaded will shift its load elsewhere, causing other instances to get overloaded and quit, leading to cascading failure.
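The cascade described above can be shown with a toy model. It assumes load is always spread evenly across the surviving instances, that an instance quits as soon as its share exceeds its capacity, and that nothing sheds load or adds capacity. Those are simplifying assumptions for the sketch, not a description of any particular system.

```python
def simulate_cascade(total_load: float, instances: int, capacity: float) -> int:
    """Return how many instances are still alive once the system settles."""
    alive = instances
    while alive > 0 and total_load / alive > capacity:
        # One overloaded instance quits; its share moves to the survivors,
        # which raises the per-instance load for the next check.
        alive -= 1
    return alive


# 90 units of load, each instance can handle 10.
print(simulate_cascade(90, instances=10, capacity=10))  # 10: everyone copes
print(simulate_cascade(90, instances=9, capacity=10))   # 9: exactly at capacity
print(simulate_cascade(90, instances=8, capacity=10))   # 0: total collapse
```

In this model, there is no partial failure: once the per-instance share crosses capacity, each departure makes things worse for the rest, so losing two instances out of ten takes down all ten.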
Back-ups are better than nothing. Testing your back-ups and monitoring that they’re running make them actually useful. (It’s not the back-up that’s valuable, it’s the ability to restore from a back-up that’s valuable.)
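A restore test can be as simple as unpacking the back-up somewhere disposable and checking that what comes out matches the source. The sketch below assumes the back-up is an archive that `shutil.unpack_archive` understands (such as a zip file) and that the source has not changed since the back-up was taken; `checksums` and `restore_matches_source` are hypothetical helper names.

```python
from __future__ import annotations

import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_matches_source(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with source."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        return checksums(Path(scratch)) == checksums(source)
```

Running a check like this on a schedule, and alerting when it fails or stops running, covers both the testing and the monitoring halves of the advice.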
Be careful when being permissive about the state of the world – sometimes you’ll be letting in pathological data.