Keeping a website online is hard. For smaller companies it is practically impossible. Even the largest companies, with dedicated teams of engineers working on it, still struggle. Automated reliability is important, but for many companies it is more worthwhile to invest in alerting first. In order to react quickly to unexpected events, you need to know that something is going wrong.
Single server systems versus the cloud
It used to be a bit easier with a single-server architecture. There was only a single point of failure: either your machine was on, or it wasn’t. At least, that’s if you exclude the reasons why your product might not work even while switched on: misconfigured software, the data center’s internet connection, faulty memory, domain name issues. There are more things that can go wrong than you can possibly wrap your head around.
The advent of cloud software really hasn’t made it any easier. Instead of having one point at which any one of a hundred things could go wrong, you now have many, and with multiple servers the number of possible failures multiplies. Assuming a failure will happen (it inevitably will), how do you find out before your customer does?
Monitoring systems aren’t easy either. If you set up a system to monitor your servers, what monitors that system? Once it has figured out that something’s wrong, how does it let you know? What if the email doesn’t get sent?
Every process in our system has a heartbeat. At the moment there are around eighty of them. Some of ours include: “Invoices generated for GMT+2”, which has a pulse every 24 hours. “Retention statistics” pulses every 15 minutes. “Redis slave” goes off every 15 seconds. Others include “Free disk space on web3”, “Clean directory on web2”, “Job on queue completed” and “All jobs on queues less than five minutes old”.
The heartbeats are all monitored internally, and each has a warning timeout as well as a critical timeout. If the retention statistics beat is half an hour late, it is something to look into, so our warning flag gets set. I really don’t want to be woken by a call at 3am for that though. On the other hand, if the Redis slave falls more than 5 minutes behind, that could be a serious problem and warrants the call.
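To make the two timeouts concrete, here is a minimal Python sketch of how a heartbeat might be classified. The class and field names, the critical timeout for retention statistics, and the warning timeout for the Redis slave are assumptions made for illustration. Only the 30-minute warning and the 5-minute critical threshold come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Heartbeat:
    name: str
    warning: timedelta    # late enough that someone should look into it
    critical: timedelta   # late enough to justify waking someone up
    last_beat: datetime

    def status(self, now: datetime) -> str:
        """Classify the heartbeat by how long ago it last fired."""
        age = now - self.last_beat
        if age >= self.critical:
            return "CRITICAL"
        if age >= self.warning:
            return "WARNING"
        return "OK"


now = datetime(2024, 1, 1, 3, 0, 0)

# Assumed thresholds: 30 min warning (from the text), 6 h critical (made up).
retention = Heartbeat("retention-statistics", timedelta(minutes=30),
                      timedelta(hours=6), now - timedelta(minutes=45))
# Assumed thresholds: 1 min warning (made up), 5 min critical (from the text).
redis_slave = Heartbeat("redis-slave", timedelta(minutes=1),
                        timedelta(minutes=5), now - timedelta(minutes=6))

print(retention.status(now))    # WARNING
print(redis_slave.status(now))  # CRITICAL
```

Only the second result would be worth a phone call. The first can wait until morning.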
We generate a page internally with all the results, and can use multiple external services to monitor that page. If the page comes back without any errors or warnings, we can be confident that everything we track inside our system is working: every single heartbeat has been firing on schedule. If it comes back with errors, or doesn’t come back at all, that’s a problem.
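As a rough illustration of the idea, the sketch below renders a status page and then applies the check an external monitor might perform on it. The line format imitates the sample output at the end of this post. The function names and the single "CRITICAL" state are assumptions, not a description of the real implementation.

```python
from datetime import timedelta
from typing import Iterable, Optional, Tuple

# (name, time since last beat, timeout)
Beat = Tuple[str, timedelta, timedelta]


def render_page(host: str, beats: Iterable[Beat]) -> str:
    """Render one line per heartbeat, sorted by name."""
    lines = []
    for name, age, timeout in sorted(beats):
        state = "OK" if age < timeout else "CRITICAL"
        lines.append(f"{host} {state}: {name} is at {age}, timeout is {timeout}")
    return "\n".join(lines)


def page_is_healthy(body: Optional[str]) -> bool:
    """What an external checker might decide from the fetched page."""
    if not body:  # page missing or empty: treat it as a failure
        return False
    return all(" OK: " in line for line in body.splitlines())


page = render_page("web8.snapbill.com", [
    ("jobq-write", timedelta(seconds=30), timedelta(minutes=8)),
    ("batch-pager", timedelta(seconds=31, microseconds=728470), timedelta(days=1)),
])
print(page)
print(page_is_healthy(page))  # True
print(page_is_healthy(None))  # False: no response is also a problem
```

Treating a missing page as a failure is the important design choice here: it means the external monitor also catches the case where the whole system, including the internal monitoring, is down.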
Once we know there’s a problem, we have the monitoring services trigger PagerDuty directly, and they then handle waking us up if necessary. We currently do have a single monitoring point of failure, and if PagerDuty is down at the same time as us – we might have a bad time. They’ve done a huge amount of work on reliability though, and for us the first step is knowing there is a problem.
If anyone from PagerDuty (or their competitors) is listening though, it would be far better if you could monitor our heartbeat page for us. Perhaps even implement it yourself… I believe there is a viable business in this.
Further uses once it’s up
At their core, heartbeats are a very simple concept. Simple is good. With them available throughout our project, though, we’ve come up with some additional handy uses.
Removing deprecated code: Occasionally you write a piece of code that has a fixed lifetime, and you know you want to get rid of it once it’s no longer used. We used to put “// TODO” comments in, but those tend to hang around for a lot longer than anticipated. Instead we’ve started putting up heartbeats with very large timeouts. When we changed our reference code format, we knew it would be safe to delete the old code once it hadn’t been run for a month. Simple heartbeat solution: “Old reference code” with a warning timeout of 30 days. When it eventually went off four months later, we cleaned up that area of the code.
Manual tasks: Early businesses often have tasks that they would like to automate later, but frankly it’s easier to do them manually at the start. The problem is that you might forget to do something important, such as sending out past-due warnings. A 30-day heartbeat that you trigger manually each time you perform the task will ensure you never forget.
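A minimal sketch of the deprecated-code trick in Python. The in-memory store, the `OLD-` prefix and the function names are all hypothetical, chosen only to show the shape of the idea: the legacy branch beats whenever it runs, and 30 days of silence is the signal that it can go.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory store: heartbeat name -> last time it fired.
last_beat = {}


def beat(name, now):
    last_beat[name] = now


def parse_reference(code, now):
    """Accept both formats; the legacy branch records that it is still in use."""
    if code.startswith("OLD-"):  # hypothetical legacy format
        beat("old-reference-code", now)
        return code[4:]
    return code


def quiet_long_enough(name, now, quiet_for=timedelta(days=30)):
    """True once the heartbeat has been silent for the whole period."""
    seen = last_beat.get(name)
    return seen is not None and now - seen >= quiet_for


start = datetime(2024, 1, 1)
parse_reference("OLD-1234", start)  # the legacy path is still being hit
print(quiet_long_enough("old-reference-code", start + timedelta(days=10)))  # False
print(quiet_long_enough("old-reference-code", start + timedelta(days=31)))  # True
```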
Cron jobs: Email is the standard way to get alerted about a failing cron job, but most of our machines are not configured to send it. There are a lot of better long-term solutions, but a simple heartbeat at the end of one of our cron scripts, which performed file backups, let us know quickly that something was going wrong.
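One way to get the same effect without editing each script is a small wrapper that runs the job and beats only when it succeeds. This is a sketch under assumptions: `run_then_beat` and the `beat` callback are made-up names, and how the beat actually reaches the monitoring system is left out.

```python
import subprocess
import sys


def run_then_beat(command, beat):
    """Run the command; call beat() only if it exits with status 0."""
    result = subprocess.run(command)
    if result.returncode == 0:
        beat()
    return result.returncode


# Stand-in for reporting to the real heartbeat system.
beats = []


def record():
    beats.append("file-backup")


# A job that succeeds fires the heartbeat...
ok = run_then_beat([sys.executable, "-c", "pass"], record)
# ...while a failing job stays silent, so the heartbeat eventually times out.
failed = run_then_beat([sys.executable, "-c", "raise SystemExit(3)"], record)

print(ok, failed, beats)
```

Because a failure is reported by the absence of a beat, the same mechanism also covers the cases where cron never starts the job or the machine is off.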
Pausing heartbeats: Often heartbeats have failed because of something not entirely in our control. Perhaps a third-party system has gone down and attempts to log statistics are failing. Manually triggering the heartbeat with an extended warning timeout, set to when we expect the provider to come back up, gets the system back to ‘OK’ while automatically reverting to a warning later if the problem doesn’t subside.
Here’s a sample of what our heartbeat system currently looks like. Yes, it does need to be cleaned up a little:
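The pausing behaviour can be sketched as a one-off timeout override that is cleared by the next normal beat. The class and method names below are assumptions for illustration, as are the 15-minute and 4-hour values.

```python
from datetime import datetime, timedelta


class PausableHeartbeat:
    def __init__(self, name, timeout):
        self.name = name
        self.timeout = timeout
        self.last_beat = None
        self.override = None  # one-off timeout set by a manual pause

    def beat(self, now, override=None):
        """Record a beat. A normal beat (no override) clears any pause."""
        self.last_beat = now
        self.override = override

    def status(self, now):
        if self.last_beat is None:
            return "WARNING"
        limit = self.override if self.override is not None else self.timeout
        return "OK" if now - self.last_beat < limit else "WARNING"


t0 = datetime(2024, 1, 1, 12, 0)
hb = PausableHeartbeat("log-statistics", timedelta(minutes=15))

hb.beat(t0)
print(hb.status(t0 + timedelta(hours=1)))  # WARNING: the provider is down

# Manual trigger: we expect the provider back within four hours.
hb.beat(t0 + timedelta(hours=1), override=timedelta(hours=4))
print(hb.status(t0 + timedelta(hours=2)))  # OK while paused
print(hb.status(t0 + timedelta(hours=6)))  # WARNING again: still not fixed
```

Once real beats resume, the override disappears and the original 15-minute timeout applies again, so a pause cannot silently become permanent.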
web8.snapbill.com OK: batch-pager is at 0:00:31.728470, timeout is 1 day, 0:00:00
web8.snapbill.com OK: clean-working-directory is at 0:00:30.728470, timeout is 3 days, 0:00:00
web8.snapbill.com OK: jobq-write is at 0:00:30.728470, timeout is 0:08:00
web8.snapbill.com OK: provision-reserve/provision-work-0 is at 0:00:15.728470, timeout is 0:10:00
web8.snapbill.com OK: provision-reserve/provision-work-1 is at 0:00:16.728470, timeout is 0:10:00
web8.snapbill.com OK: timezone/check is at 0:08:30.728470, timeout is 1:30:00
web8.snapbill.com OK: work-reserve/0 is at 0:00:00.728470, timeout is 0:15:00
web8.snapbill.com OK: work-reserve/1 is at 0:00:00.728470, timeout is 0:15:00
web8.snapbill.com OK: work-reserve/2 is at 0:00:12.728470, timeout is 0:15:00
web8.snapbill.com OK: work-reserve/3 is at 0:00:02.728470, timeout is 0:15:00
web8.snapbill.com OK: work-reserve/4 is at 0:00:00.728470, timeout is 0:15:00
web8.snapbill.com OK: work-reserve/5 is at 0:00:00.728470, timeout is 0:15:00