Postmortem of our November 29, 2016 API Outage

Yesterday, onesignal.com became unresponsive at 22:22 PST. Service was mostly restored by 23:10 PST and fully restored by 23:16 PST, for a total of 54 minutes of downtime. We would like to apologize to all of our users for this period of unexpected downtime. At OneSignal, we strive to provide the best experience of any Push Notification provider to all of our developers. We consider unexpected downtime of any length to be unacceptable, and for this we are sorry.

The remainder of this post details what happened and how we plan to mitigate such issues in the future. It is worth mentioning that scheduled deliveries continued to go out during this period, but the click-through rate of those notifications is mostly unaccounted for, since we couldn't serve the API requests that report those clicks.

Timeline of Events

  • 22:20 PST: Our load balancer's CPU I/O wait time increases dramatically.
  • 22:22 PST: All request processing has stopped, and I/O wait is consuming 100% of our load balancer's CPU.
  • 22:30 PST (8 minutes later): Our monitoring alerts us that there is a major problem. At this point, our engineering team begins to diagnose the problem.
  • 22:38 PST (8 minutes later): We've determined that this doesn't appear to be a DoS attack, and that the load balancer server is no longer operable; a hardware failure of the boot disk is thought to be the cause of the problem. We start bringing a new load balancer online.
  • 23:03 PST (25 minutes later): A new load balancer is ready to serve requests.
  • 23:05 PST (2 minutes later): The DNS A record for onesignal.com is updated for the new load balancer.
  • 23:10 PST (5 minutes later): DNS has propagated within CloudFlare, and the new load balancer begins serving requests. We realize that the DNS records for a legacy domain (used by some older SDKs in the wild) still need to be updated.
  • 23:11 PST (1 minute later): DNS for the legacy domain is updated.
  • 23:16 PST (5 minutes later): Service is fully restored.

Discussion

Reviewing this timeline, there are several opportunities for improvement. Our alerts were slow, identifying the problem could have been much faster, provisioning a new load balancer took far too long, and waiting for DNS could have been avoided by parallelizing this activity. Beyond optimizing our resolution of such incidents, we believe a full service degradation could have been avoided in this case. Below we outline both improvements to our event resolution and improvements to our service architecture.

Monitoring

Starting from the top, there was a period of 8 minutes where we didn't even know there was a problem. Monitoring for our web servers is done through New Relic, and their statistics only update every 5 minutes. Additionally, our alert was configured to go off only after our RPM had stayed below a certain threshold for 5 minutes. The first report of low throughput took 3 minutes to arrive. The second report of low RPM in New Relic arrived after another 5 minutes; this update finally triggered the alert.

To address the issue of timely alerts, we are going to add monitoring for the web API to our internal monitoring system. We will be able to configure alerts on this to go off within as little as 10 seconds of a service degradation event.
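
To give a concrete flavor of what "within 10 seconds" looks like, here is a minimal sketch of a fast health-check loop: poll the API, count consecutive failures, and page someone quickly. The URL, thresholds, and the page_oncall command are hypothetical placeholders, not our actual monitoring configuration.

    #!/bin/bash
    # Hypothetical health-check loop: probe the API every 10 seconds and page
    # after 3 consecutive failures. The endpoint and page_oncall are placeholders.
    failures=0
    while true; do
        if curl --silent --fail --max-time 5 "https://onesignal.com/" > /dev/null; then
            failures=0
        else
            failures=$((failures + 1))
        fi
        if [ "$failures" -ge 3 ]; then
            page_oncall "onesignal.com health check failing"  # placeholder alert hook
            failures=0
        fi
        sleep 10
    done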

Problem Identification

The next issue in this timeline was the 8 minutes spent attempting to resolve the problem with the existing load balancer. Within a minute of starting this investigation, the server appeared to be completely unresponsive, so we attempted a hard reboot. A minute later there was still no sign the server was coming back. At this point, we should have cut our losses and moved on to provisioning a replacement. This could have saved a further 6 minutes.

Load Balancer Provisioning

This stage of our resolution accounted for a whopping 25 minutes of our outage. I'm going to tell you a bit about how we manage our infrastructure and then explain what went wrong here. We use Ansible (which, by the way, is an absolutely fantastic tool) for managing our infrastructure. We've got what they call playbooks for (among other things) provisioning our Rails web servers and load balancers. Provisioning a new load balancer should involve just a few steps for us (sketched after the list):

  1. Create a new server through the Rackspace control panel (Ansible is actually capable of doing this part, but we don't currently have nearly enough servers for the time investment of configuring it to be worthwhile).
  2. Add the server to our Ansible inventory
  3. Run the playbook for configuring the load balancer

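In Ansible terms, steps 2 and 3 look roughly like the following. The inventory group, hostname, address, and playbook name are illustrative placeholders rather than our actual files.

    # Step 2: add the new server to the Ansible inventory (INI format).
    # Hostname and address are placeholders.
    [loadbalancers]
    lb2.example.com ansible_host=10.0.0.12

    # Step 3: run the load balancer playbook against just the new host
    # (playbook and group names are illustrative).
    ansible-playbook -i inventory loadbalancers.yml --limit lb2.example.com
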
And indeed, this part of the process took no more than ~10 minutes. The next 15 minutes is where things went a bit sideways.

We have pretty strict iptables configurations on our Rails servers to make them unreachable to all but our edge servers (the load balancer). Remember how I mentioned we took too long to "give up" on the old load balancer server? Well, we still hadn't done that completely. If we had, a simple ansible-playbook webservers.yml --tags iptables would have disallowed the old load balancer and allowed traffic from the new one. This wasn't possible because, with the old load balancer unreachable, Ansible facts for the old server were unavailable. Among other things, these facts contain the IP addresses of each server. With no way to do this automatically, we spent time figuring out specifically which iptables rules needed to be added on each front-end server. Thankfully, Ansible helps with such ad-hoc commands by providing a way to run the same command on a group of servers. Even so, this whole process took ~15 minutes.
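
For the curious, an ad-hoc run of that kind looks something like the sketch below. The group name, address, and port are placeholders; this is the shape of the command rather than the exact rules we applied.

    # Ad-hoc sketch: open the firewall to the new load balancer's address on
    # every host in the webservers group. Address and port are placeholders.
    ansible webservers --become -m shell -a \
      "iptables -I INPUT -p tcp -s 10.0.0.12 --dport 443 -j ACCEPT"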

Once we were satisfied that the new load balancer was ready, we advanced to updating the DNS records.

DNS Updates

DNS is kind of a bummer in these situations; after submitting a change, there is nothing one can do but wait. Had we accepted the loss of the first load balancer from the beginning, we might have replaced the A record in parallel with provisioning the new server.

The second issue here was that we neglected to update the A record of our legacy domain, which meant another 6 minutes of downtime for a small subset of users. This particular issue will be easy to prevent in the future with just a bit of automation.
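
One simple form of that automation is a periodic check that every domain we serve resolves to a current load balancer address, so a forgotten record shows up as a warning rather than as user-facing downtime. In the sketch below, the legacy domain name and the addresses are placeholders.

    # Sketch: warn if any public domain resolves to an address outside the
    # current load balancer set. Domain names and addresses are placeholders.
    current_lbs="203.0.113.10 203.0.113.11"
    for domain in onesignal.com legacy.example.com; do
        for ip in $(dig +short A "$domain"); do
            if ! printf '%s\n' $current_lbs | grep -qxF "$ip"; then
                echo "WARNING: $domain resolves to $ip, not a current load balancer"
            fi
        done
    done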

Mitigation

The discussion so far has addressed how we can improve our incident resolution. As we mentioned before, we believe this mode of failure can be prevented entirely: simply having a second load balancer active would have prevented a full service degradation and dramatically reduced the duration of the partial one.

DNS allows one to specify multiple A records for a domain. This means that successive name resolution requests may return different addresses, spreading traffic across the load balancers. Since our SDK is set up to retry with exponential back-off, requests would most likely end up hitting the second load balancer after a few attempts.
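
As an illustration (with documentation-range placeholder addresses), a pair of A records and a generic exponential back-off retry loop look like the sketch below. This is not our SDK's actual retry code, just the shape of the behavior.

    # Two A records for the same name (placeholder addresses, 300-second TTL):
    #   onesignal.com.  300  IN  A  203.0.113.10
    #   onesignal.com.  300  IN  A  203.0.113.11
    #
    # Generic retry with exponential back-off; the URL is a placeholder.
    delay=1
    for attempt in 1 2 3 4 5; do
        if curl --silent --fail --max-time 10 "https://onesignal.com/" > /dev/null; then
            break
        fi
        sleep "$delay"
        delay=$((delay * 2))
    done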

Having a second load balancer also enables quicker resolution of such incidents: simply delete the A record of the faulty load balancer. This requires that the remaining load balancers be able to handle all of the traffic, but that requirement is easy to meet. The faulty server would no longer be returned in name lookups after 5 minutes (once cached records expire), and the engineering team would not be in a rush to add a replacement because other load balancers are already running.
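
Because our DNS is hosted at CloudFlare, pulling a failed balancer out of rotation can itself be a one-liner against their v4 API. The zone and record identifiers and the credentials below are placeholders; treat this as a sketch of the idea rather than our runbook.

    # Sketch: delete the A record of a failed load balancer via CloudFlare's v4 API.
    # ZONE_ID, RECORD_ID, and the credentials are placeholders.
    curl -X DELETE \
      "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
      -H "X-Auth-Email: ops@example.com" \
      -H "X-Auth-Key: $CF_API_KEY"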

Examining other failure modes

We would be remiss if we didn't take this opportunity to consider other possible failure modes that could result in similarly significant downtime. We will be auditing the rest of our system to identify such possibilities, and we will implement any mitigations necessary.

Summary

We experienced a hardware failure with our single load balancer, which resulted in 54 minutes of downtime. Our incident resolution has several opportunities for improvement, including better monitoring for more timely alerts, a shorter gap between identifying an issue and beginning its resolution, and knowing when to cut our losses. We had a pretty good idea early on that the load balancer would not be coming back; despite that, we made choices to the contrary. We estimate that the incident could have been resolved 26 minutes faster if we had just cut our losses.

This particular failure mode will be mitigated in the future by having multiple edge servers, so that when one fails, returning the service to full capacity is just a matter of removing a DNS record. We plan to implement mitigation strategies for any other failure modes we identify in a self-audit of our system.

We take unexpected downtime like this very seriously, and we are sorry for the inconvenience to our users.