Status Page Conundrum [2016]

Published: May 30, 2016
Brady Catherman
Sultan of Scale

At a previous job, a major network issue took out our infrastructure, and it was going to take several hours to resolve. We really wanted to tell our users what was going on, but we couldn’t: our status page was on the same network and was therefore also down.

Developers and users alike love status pages, and thankfully most websites are deploying them! There are even status-page-as-a-service offerings out there that are gaining traction. When deploying a status page, it’s important to think through the failure modes you will be reporting and make sure that the status page’s own failure modes do not overlap with your product’s.

You should also monitor your status page. Few things are more aggravating than finding out during an incident that your status page hasn’t been loading properly, or has been serving stale cached content for a while. Even more so if it’s a third-party service.
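As a rough sketch of what such a check might look at, the function below takes the result of fetching the status page and returns a list of problems. Everything specific here is an assumption for illustration: the function name, the marker text "Status", the 15-minute freshness window, and the idea that your status page sends a meaningful Last-Modified header (many don't, in which case you would need some other freshness signal).

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def status_page_problems(http_status, headers, body, now,
                         expected_text="Status",
                         max_age=timedelta(minutes=15)):
    """Return a list of problems found; an empty list means the page looks healthy.

    `headers` is a plain dict of response headers and `now` is a
    timezone-aware datetime. Both the marker text and the freshness
    window are hypothetical defaults, not recommendations.
    """
    problems = []

    # The page has to load at all.
    if http_status != 200:
        problems.append(f"unexpected HTTP status {http_status}")

    # The page has to contain something that looks like real content.
    if expected_text not in body:
        problems.append(f"expected text {expected_text!r} not found in page")

    # The page should not be an old cached copy.
    last_modified = headers.get("Last-Modified")
    if last_modified is None:
        problems.append("no Last-Modified header, cannot judge freshness")
    else:
        age = now - parsedate_to_datetime(last_modified)
        if age > max_age:
            problems.append(f"page is stale: last modified {age} ago")

    return problems
```

Run something like this from a network that is separate from both your product and your status page, otherwise the monitor goes down along with the thing it is monitoring.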

Another tip, which is most likely the least obvious of this batch, is to monitor your status page’s hit counts. A sudden spike in hits might mean that some users are experiencing an issue that you are not seeing in your monitoring. If your page gets an average of 3–4 visitors an hour but suddenly spikes to 600, then you should investigate what is causing it.
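One simple way to express that kind of spike check is to compare the current hour against a recent average. This is only a sketch: the function name, the 10x multiplier, and the minimum of 50 hits are made-up values you would tune for your own traffic.

```python
from statistics import mean


def is_traffic_spike(recent_hourly_hits, current_hits, factor=10.0, min_hits=50):
    """Return True if current_hits is far above the recent hourly average.

    `factor` and `min_hits` are hypothetical thresholds. `min_hits` keeps
    a jump from 1 visitor to 12 on a very quiet page from paging anyone.
    """
    if not recent_hourly_hits:
        # No baseline yet, so there is nothing to compare against.
        return False
    # Floor the baseline at 1 so an all-zero history still works.
    baseline = max(mean(recent_hourly_hits), 1.0)
    return current_hits >= min_hits and current_hits > factor * baseline


# The example from above: 3-4 visitors an hour, then suddenly 600.
print(is_traffic_spike([3, 4, 3, 4], 600))
print(is_traffic_spike([3, 4, 3, 4], 5))
```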

And finally, my last tip is to monitor the status pages of the infrastructure that you depend on. It’s always better to get an alert saying that your site is down and AWS is reporting an outage in your zone than a simple “site down” alert with little context.
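Here is a small sketch of how that extra context might get folded into the alert text. It assumes you already poll each provider's status page somehow and reduce the result to a string per dependency; the function name, the "operational" value, and the dependency names are all hypothetical and do not reflect any provider's real API.

```python
def build_alert(site_is_up, dependency_statuses):
    """Return an alert message, or None if the site is up.

    `dependency_statuses` maps a dependency name to a status string that
    your own polling produced. Anything other than "operational" is
    treated as a reported problem (an assumption for this sketch).
    """
    if site_is_up:
        return None

    degraded = sorted(
        name for name, status in dependency_statuses.items()
        if status != "operational"
    )
    if degraded:
        return "site down; dependencies reporting issues: " + ", ".join(degraded)
    return "site down; no dependency is reporting an outage"


# Hypothetical dependency names, for illustration only.
print(build_alert(False, {"aws-us-east-1": "outage", "dns-provider": "operational"}))
```

Even the "no dependency is reporting an outage" case is useful, since it tells whoever is on call that the problem is more likely on your side.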


Copyright 2016 - 2024