Status Page Conundrum

At a previous job, a major network issue took out our infrastructure, and a fix was several hours away. We really wanted to tell our users what was going on, but we couldn’t: our status page was on the same network, so it was down too.

Developers and users alike love status pages, and thankfully most websites are deploying them! There are even status-page-as-a-service offerings out there that are gaining traction. When deploying a status page, it’s important to think through the failure modes that you are going to be reporting and to make sure that they do not overlap with the failure modes of the status page itself.

Your status page shouldn’t run on the same network, or on the same provider, as your service. If you run in AWS, consider using DigitalOcean or Azure.
Be careful that your status page doesn’t share hidden failure modes with your service. If it runs on Heroku, it might be running in the same AWS data center that you are.
Your status page needs to be deployed using a different authentication model. If a hacker can compromise a developer laptop and extend that into taking down both your site and your status page, you are in trouble.
This might seem obvious, but your status page shouldn’t require any resources (CSS, JavaScript, etc.) from your main site. This often gets missed when the page gets “skinned” by the design team.
Be sure that the DNS for your status page doesn’t use the same servers as your production DNS. If they share the same DNS service, then a single provider can take down both your service and your ability to inform users of issues.
Ideally, the management layer used to deploy should be different as well. If your production services are deployed with Terraform, the same run shouldn’t also be modifying the status page; otherwise one bad deploy can take everything out at the same time.
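
The shared-resources tip is one of the easier ones to check automatically. Here is a minimal sketch, assuming you already have the status page’s HTML in hand. The names `ResourceCollector` and `shared_resources`, and the example domains, are made up for illustration; the sketch only looks at `script`, `img`, and `link` tags, so treat it as a starting point rather than a complete audit.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collects the URLs of assets referenced by script, img, and link tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            self.urls.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.urls.append(attrs["href"])


def shared_resources(html, page_url, production_domain):
    """Return asset URLs on the status page that are served from production.

    page_url is the address of the status page (used to resolve relative
    URLs); production_domain is the domain of the main site. Both are
    supplied by the caller.
    """
    collector = ResourceCollector()
    collector.feed(html)
    shared = []
    for url in collector.urls:
        host = urlparse(urljoin(page_url, url)).hostname or ""
        # Match the production domain itself and any of its subdomains.
        if host == production_domain or host.endswith("." + production_domain):
            shared.append(url)
    return shared


# Hypothetical example: the stylesheet comes from the main site, the script does not.
example_html = (
    '<link rel="stylesheet" href="https://www.example.com/site.css">'
    '<script src="/status.js"></script>'
)
flagged = shared_resources(example_html, "https://status.example.net/", "example.com")
```

Anything that ends up in `flagged` is an asset that will fail to load when the main site is down, which is exactly when the status page matters most.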

You should also monitor your status page. Nothing is more aggravating than finding out during an incident that your status page hasn’t been loading right, or has been serving stale, out-of-date cached content for a while. Even more so if it’s a third-party service.
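
A rough sketch of what such a check could look like, using only the Python standard library. It assumes the page sits behind a cache or CDN that sets the standard HTTP `Age` header; if yours doesn’t, the staleness part simply never fires. The function names and the five-minute threshold are my own choices for illustration, not a recommendation.

```python
import urllib.request


def evaluate_status_page(status_code, headers, max_cache_age_seconds=300):
    """Return a list of problems; an empty list means the page looks healthy."""
    problems = []
    if status_code != 200:
        problems.append(f"unexpected HTTP status {status_code}")
    # The Age header reports how long the response has been sitting in a cache.
    age = headers.get("Age")
    if age is not None and age.strip().isdigit():
        if int(age) > max_cache_age_seconds:
            problems.append(f"served from a cache entry {age.strip()} seconds old")
    return problems


def check_status_page(url):
    """Fetch the status page at the caller-supplied URL and evaluate it."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return evaluate_status_page(response.status, response.headers)
    except OSError as exc:
        # urlopen raises for 4xx/5xx responses as well as for network
        # failures; all of those exceptions derive from OSError.
        return [f"request failed: {exc}"]
```

Run `check_status_page` on a schedule from somewhere outside both your production network and the status page’s own provider, and alert whenever the returned list is not empty.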

Another tip, which is most likely the least obvious of this batch, is to monitor your status page’s hit counts. Seeing a sudden spike in hits might mean that some users are experiencing an issue that you are not seeing in your monitoring. If your page gets an average of 3–4 visitors an hour but suddenly spikes to 600, then you should investigate what is causing this.
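
The spike check itself can be very simple. This is a minimal sketch, assuming you can already pull hourly hit counts from your web logs or analytics; the function name, the 10x multiplier, and the 50-hit floor are arbitrary values picked for illustration and would need tuning for real traffic.

```python
from statistics import mean


def is_traffic_spike(recent_hourly_hits, current_hits, multiplier=10, min_hits=50):
    """Flag the current hour if it far exceeds the recent hourly average.

    min_hits keeps tiny absolute numbers (say, 2 hits jumping to 25)
    from triggering an alert on a page that is normally very quiet.
    """
    baseline = mean(recent_hourly_hits) if recent_hourly_hits else 0
    return current_hits >= min_hits and current_hits > baseline * multiplier


# The scenario from the text: 3-4 visitors an hour, then suddenly 600.
spike = is_traffic_spike([3, 4, 3, 4], 600)
quiet = is_traffic_spike([3, 4, 3, 4], 5)
```

Here `spike` is `True` and `quiet` is `False`. A fixed multiplier over a short average is crude, but for a page with almost no normal traffic it is usually enough to catch the jump.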

And finally, my last tip is to monitor the status pages of the infrastructure that you depend on. It’s always better to get an alert saying that your site is down and AWS is reporting an outage in your zone than a simple “site down” alert with little context.
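
As a sketch of the idea, the alerting code can fold upstream status into the message before it goes out. This assumes something else has already collected each provider’s reported status into a dictionary; how you get that data (a feed, an API, scraping) depends on the provider and is not shown. The function name, the status strings, and the provider labels are all hypothetical.

```python
def build_alert(site_is_down, provider_statuses):
    """Compose alert text that includes upstream context.

    provider_statuses maps a provider label to the status it currently
    reports, for example {"AWS us-east-1": "outage"}. Any value other
    than "operational" is treated as a problem.
    """
    if not site_is_down:
        return None
    degraded = sorted(
        name for name, status in provider_statuses.items()
        if status != "operational"
    )
    if degraded:
        return "Site down; upstream issues reported by: " + ", ".join(degraded)
    return "Site down; no upstream provider is reporting an issue"


message = build_alert(True, {"AWS us-east-1": "outage", "DNS provider": "operational"})
```

In this example `message` names AWS us-east-1 as the likely culprit, which gives whoever gets paged a place to start looking.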