We view Monitoring both as a Software Application that we develop and maintain (like our Products) and as an Application embedded into the running of our business.
This means we see Monitoring as both Tools and Processes. The people operating within the processes are just as critical as the software tools.
An alert saying something is broken is worthless without people and processes in place to react to it and properly manage the issue behind it.
Effective Monitoring cuts across the business:
- vital to stable operations and meeting Client SLAs
- vital to a strong security position and Infosec compliance
- vital to capacity management and incident management
Our 9 principles of Monitoring
- Monitor the part and the whole – Monitoring the small individual parts is easy; monitoring end-to-end processing takes a lot more care and effort.
- No alerts without a pre-determined action – If you don’t know what to do with an alert before it comes, it is purely informational. The time to decide what to do with an alert is when you put the alert in place, not at 3am.
- For the person reading the alert, assume they know nothing (and that it is 3am) – The alert needs to explain what the business impact is and what to do next.
- Eyes on screens will let you down eventually – Even the eyes of your “heroic” people who check their mail 24/7 will eventually glaze over.
- If there is a way to automatically heal the application, do it – Restart it or delete files, log what happened and investigate in the morning. As we say, eyes on screens will let you down eventually.
- You do have single points of failure – you just haven’t thought of them yet or had them fail yet.
- Monitor the monitors – Figure out which of your monitors are critical, which are important and which are informational. How much do you trust that your critical monitors will always work?
- You need a test plan – for your monitoring system (tools and processes), just like you have a test plan for any Application. When your mean time to failure is months or years, you need to test in Production that your Monitoring system will pick up failures. Beyond the alert coming out of the system, are the alerts received by the right people? Does the team know how to react to an incident? Are your out-of-hours procedures working?
- Quality not quantity – It might sound impressive that you have 1000 monitors, but that one monitor that measures what is important to the Customer is worth more than any number of OS health checks.
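A few of these principles can be checked mechanically before an alert ever fires. The Python sketch below is illustrative only and does not describe our actual tooling: the `AlertDefinition` fields, the severity names and the `stale_monitors` helper are all hypothetical. It rejects any non-informational alert that has no pre-determined action or no stated business impact, and it lists monitors that have stopped reporting, as one simple way to monitor the monitors.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical alert definition; the field names are illustrative."""
    name: str
    business_impact: str        # what the reader at 3am needs to know first
    action: str                 # the pre-determined next step
    severity: str = "critical"  # "critical", "important" or "informational"


def validate(alert: AlertDefinition) -> list[str]:
    """Return the reasons an alert should not go live (empty list means OK)."""
    problems = []
    # An alert with no known action is purely informational, so only
    # informational alerts may be defined without one.
    if alert.severity != "informational" and not alert.action.strip():
        problems.append("no pre-determined action")
    if not alert.business_impact.strip():
        problems.append("no business impact stated")
    return problems


def stale_monitors(
    last_heartbeat: dict[str, datetime],
    max_age: timedelta,
    now: datetime,
) -> list[str]:
    """List the monitors whose last heartbeat is older than max_age."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > max_age
    )


if __name__ == "__main__":
    draft = AlertDefinition(name="order-queue-stalled", business_impact="", action="")
    print(validate(draft))

    now = datetime.now(timezone.utc)
    heartbeats = {
        "disk-check": now - timedelta(minutes=2),
        "end-to-end-order": now - timedelta(minutes=30),
    }
    print(stale_monitors(heartbeats, timedelta(minutes=10), now))
```

Running `validate` when an alert is put in place, rather than when it fires, moves the decision about what to do away from 3am. The heartbeat check only helps if it runs somewhere independent of the monitors it is watching.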
What was fit for purpose in your monitoring system last year is unlikely to be fit for purpose this year: the infrastructure environment will have changed, and the failure rate the Business considers acceptable will have come down.
Part of this article was originally published here.