OneSignal's Holiday "Freeze" Experience
Charity Majors, Co-founder and CTO of Honeycomb.io, started a thread recently on Twitter about the concept of a “holiday deployment freeze.” I thought it might be an interesting opportunity to discuss the OneSignal experience with production freezes recently, and to say I agree with her general statements on a contentious issue.
For some broader background: There’s currently an ongoing debate within the SRE and Dev/Ops communities about the idea of freezing deployments. At the most extreme end of the freeze side is “never deploy on Fridays.” At the most extreme end of the anti-freeze side is “deploy on Christmas.”
The argument for not deploying at times you don’t want the team to be engaged is obvious - if you don’t deploy, you probably won’t have an unexpected problem. Don’t cause fires right before you want to take some time off as a firefighter! The argument on the anti-freeze side is that your processes and systems should be automatic and robust enough that, even if you do deploy something that doesn’t work well on a holiday, it should be no big deal for the engineer doing the deployment to detect it early and roll it back. And hopefully your testing and development processes are good enough that the likelihood of doing a bad deploy is pretty small to begin with.
Charity makes the point that “deployment” isn’t what you should be banning; “merging” is, since your CI/CD pipeline should work such that a merge causes a deployment. So it’s nonsensical to demand a pause to “deployment”: you shouldn’t merge to begin with. She’s right, but if your CI/CD pipeline doesn’t automatically deploy on merge, your processes probably aren’t very mature and you definitely should be freezing deployments on critical days!
I’m happy to report that many of OneSignal’s backend CI/CD processes do automatically deploy on merges, so we’ve got that going for us, though we do have some outliers we’re vigorously working on. We’ve had a few experiences with production freezes recently and I thought I’d discuss our thinking, our experiences, and what we’ve learned from them.
The first came from the US elections in November. We have hundreds of media and news partners, some particularly large US-based ones, who use the OneSignal platform to send “breaking news” alerts to their users. In that business, being able to deliver messages quickly is a key strategic goal, and a matter of minutes can be the difference between getting credit for a scoop…and getting scooped. So we were very eager to not have any issues during and following the election. While the actual volume of these messages was small relative to our normal flow, we knew our availability was key to our partners and the general public at large.
Thus we decided to enforce a “Production Freeze” from Friday, October 30, through Wednesday, November 4 - perhaps longer if need be. This freeze was to be a “hard freeze” and in fact I set myself up as the gatekeeper. If you needed to deploy to production, I needed to let our media and news partners know - because I’d personally promised them we wouldn’t, barring a critical emergency.
Of course, with such a long - and indeterminate! - freeze, our developers felt pressure to get new features out the door beforehand. We were implementing some specific billing changes to our plans, and this needed some code changes to properly display to customers how they were using our service. On Thursday, the day before the freeze began, we deployed some new code around this feature. On Saturday, we realized this caused a not-quite-critical issue for a large customer of ours. Not “can’t send a message” bad, but still inconvenient and frustrating. Ultimately we made the call to hold the freeze rather than break it to ship a fix, but, man…it was a painful decision, if the correct one.
As you may recall, the final disposition of the US election was…a rather drawn-out affair. Our Wednesday post-election freeze end date came, and there was no hint of an end in sight. However, we had this almost-critical issue we needed to address. So, we communicated with our partners that we’d be extending our freeze until at least Sunday, November 8, but that we would be deploying this fix on Wednesday. Everyone seemed OK with this, and thankfully the deployment of this fix went quickly and smoothly. We were then able to continue this freeze through Sunday. With the election having been called by media outlets on Friday, November 6, we kept our promise to our partners, and were able to deliver 100% uptime during this critical window!
However, we learned that a freeze of that length was really painful. And, to be honest, our engineering team didn’t feel like it had really been necessary; they weren’t convinced we would’ve had problems without it. That experience led us into thinking about freezes for the upcoming holidays.
First, I want to note that my colleague Nick Artman has in a past life had a personal policy of deploying on Christmas Day, just to prove to everyone how robust their systems were! We definitely had internal advocates of not doing a holiday freeze. After some discussion, though, the management team decided that a freeze was probably best for the team, especially around Christmas and New Year’s Day. Because of where the holidays fall this year, we expected most of the team to take the week after Christmas off: you can take three vacation days (December 28 - 30) and, with company holidays and weekends, spend ten days not working. I’m writing this as part of the skeleton crew here this week and let me tell you, this is exactly what happened.
We learned from our previous “hard freeze” though. First, every freeze we have now is preceded by a one-day feature freeze: No new features released on this day, so we have time to notice and fix any issues before the real freeze. Second, this freeze is a soft freeze. The point is to make life easier for our SRE and Dev/Ops folks. If releasing will make your life easier - then go ahead and release. Just, please, hold off on new features for Wednesday - Sunday through the holidays.
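For the curious, here is roughly how those rules could be written down as a release check. This is only a minimal sketch in Python, not our actual tooling: the names FreezeWindow and check_release are made up for illustration, the emergency_approved flag stands in for a human gatekeeper signing off, and the December 23 - 27 holiday window is my assumption about one "Wednesday - Sunday" stretch. The election window uses the original October 30 - November 4 dates.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"


@dataclass(frozen=True)
class FreezeWindow:
    """Hypothetical description of one freeze (both dates inclusive)."""
    start: date
    end: date
    hard: bool = False  # hard freeze: nothing ships without sign-off

    @property
    def feature_freeze_day(self) -> date:
        # The one-day feature freeze that precedes the real freeze.
        return self.start - timedelta(days=1)


def check_release(window: FreezeWindow, today: date, is_new_feature: bool,
                  emergency_approved: bool = False) -> Decision:
    """Decide whether a release may go out today under the given freeze."""
    # Day before the freeze: fixes are fine, new features wait.
    if today == window.feature_freeze_day:
        return Decision.BLOCK if is_new_feature else Decision.ALLOW
    # Outside the freeze entirely: business as usual.
    if not (window.start <= today <= window.end):
        return Decision.ALLOW
    # Hard freeze: only a gatekeeper-approved emergency gets through.
    if window.hard:
        return Decision.ALLOW if emergency_approved else Decision.BLOCK
    # Soft freeze: release if it makes life easier, but hold new features.
    return Decision.BLOCK if is_new_feature else Decision.ALLOW


# Example windows: election dates are from this post, holiday dates are assumed.
election = FreezeWindow(date(2020, 10, 30), date(2020, 11, 4), hard=True)
holiday = FreezeWindow(date(2020, 12, 23), date(2020, 12, 27))

if __name__ == "__main__":
    print(check_release(election, date(2020, 10, 29), is_new_feature=True))
    print(check_release(holiday, date(2020, 12, 25), is_new_feature=False))
```

Under these rules, the billing feature we shipped the Thursday before the election freeze would have been held back, while a fix during the holiday soft freeze goes straight through. In practice the judgment call of "will releasing make your life easier" belongs to a person, which is why the sketch only encodes the parts that are mechanical.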
So far, it’s worked well. We’ve not needed to do any code releases over the holidays, which, ideally, we shouldn’t, because they’re holidays and people shouldn’t be working!
At the end of the day, in my book, that’s the reason not to release code over the holidays. In a well-run team and organization, you’re not working people so hard they feel pressure to get things done when they should be resting and enjoying time with their loved ones. No matter how good your CI/CD pipeline is!