Scaling Rust to 5 Billion Messages a Day at OneSignal

The OneSignal podcast aims to democratize digital messaging best practices, industry news, and product expertise. Today we're joined by OneSignal's CTO, Joe Wilm, who discusses the use of Rust and new technologies such as Kubernetes to scale a digital service to 5 billion-plus daily messages.

This is your host, Josh Wetzel.

Josh: Welcome to the OneSignal podcast, where we aim to educate ourselves on product, industry, and best practices as they relate to building and growing a customer messaging practice. I'm excited to have OneSignal CTO Joe Wilm as our guest to discuss various technical scaling decisions, our use of Rust, and our exploration of new technologies such as Kubernetes to continue to enable exponential growth in usage and data storage for a digital service that now delivers 5 billion-plus daily messages to more than 1 billion unique devices weekly. Welcome, Joe.

Joe: Thanks for having me.

Josh: We've worked together for 15 months now. I've been very impressed from the day I met you, really even in the interview process, with the sort of confidence in, and the sheer performance and viability of, the OneSignal architecture. I'm curious, how did you end up here?

Joe: It was really just by chance. One of my good friends, Colin Green, who actually works here now, was using the product back in 2015 when it was still called GameThrive. And he knew I was starting to look for a new position, sort of just really considering it, and introduced me to George. And you know, I had a bunch of concerns coming out of my last startup about whether this was a company that actually cares about good engineering, a company that, you know, has ambitions to really have customers and support them well, which was sorely lacking at my previous role. But, yeah. So, I got introduced to George, our CEO, and you know, things really just got moving from there. I saw a lot of potential in working with George and in the opportunity to come build really solid systems, which is what I was excited about at the time. And here we are.

Josh: You were at Lockheed Martin earlier in your career?

Joe: I was for a couple of years right after college.

Josh: How different is that environment compared to this environment, especially as you're starting to scale early on at OneSignal?

Joe: Well, I think I had a fairly unique experience at Lockheed Martin. I was working in their R&D group in Palo Alto and had a lot of leeway to work on various projects. It was a lot of full-stack development actually, but also a lot of embedded systems and reverse engineering of various pieces of hardware so we could attach them to the network, that sort of thing. But it was a very small team. There wasn't really-- it was R&D, it was very much R&D. There wasn't like a production environment to run or anything like that. We didn't have customers yet. It was very research-oriented. So, a very different environment, but both have been very exciting in their own ways. OneSignal, in particular, is exciting because we have customers, and because we have so many of them, and because we want to serve such a high-quality product and have a reliable system.

Josh: Yeah, that's a good segue. So, approximately three years ago, we announced OnePush, which is the delivery system for notifications, which we wrote in Rust. What was the decision behind using Rust at OneSignal to power OnePush?

Joe: So, to understand the decisions for OnePush, we need to look at the previous delivery system, which was written in Ruby. It was a capable system, but due to sort of the constraints of running in Ruby, it had some very hard limitations. The initial limitation is just that the Ruby VM is a single process. So like, you can have threads, but due to the interpreter lock, you can't have true parallelism. So, your options for scaling at that point are basically running more and more copies of this thing. And with Ruby, back in 2016 when we were looking at this, we were sending 10 to 20 million messages a day and a single process was capped out. And we were looking at how we were going to grow and just what the server costs were going to be behind this. And it's like, well, this isn't going to be sustainable. We're going to just burn money if we run this. So, that kind of left us with a couple of options. We knew we wanted some sort of compiled language; these scripting environments just have such a performance cost. So, at the time, the two real options for us were Go and Rust. Java wasn't a great option for us due to various reasons. And C++ is just really hard to get right when it comes to concurrency and various other things. So, we actually talk about this a little bit in our Rust at OneSignal blog post. It came down to, with Rust, we were going to be able to write a system where we could almost know at compile time that it was going to behave exactly how it should before we ever deployed it to production. And that's largely been the case since rolling it out. And this is mostly thanks to Rust's robust error handling story, which makes it easy to write correct code, and the fact that it makes it impossible to have race conditions, or data races rather, which you can still run into if you're trying to do this in, like, C++ or even Go, which is another modern language for parallel programming.
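
As an illustration of the compile-time guarantee Joe describes, here is a minimal sketch (not OnePush code): Rust will only compile this multi-threaded counter because the shared state is wrapped in Arc<Mutex<...>>; replacing that with a plain shared reference turns a would-be data race into a compile error.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared state must be wrapped so the compiler can prove it is safe
    // to touch from multiple threads. Replacing Arc<Mutex<u64>> with,
    // say, Rc<u64> or a plain &mut u64 is a compile error, not a
    // runtime data race.
    let delivered = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let delivered = Arc::clone(&delivered);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    *delivered.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    println!("delivered: {}", delivered.lock().unwrap());
}
```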

[05:04]

Josh: So, we've gone from-- at the time of this announcement, and roughly the time of deployment, 2 billion messages a week. Today, that's 35 billion. That's 16x growth. Peak spikes were about 125,000 messages a second; now we're handling spikes of 1.75 million messages sent per second. 10x-plus growth. How has the original code held up with this astronomical growth?

Joe: Well, it's been pretty phenomenal. Today, we're running only four processes of the delivery system, just one per server. And we have plenty of capacity to spare for that part of the system, at least. The challenges we run into with scaling actually are not the delivery system. It's all of the other systems that it's connected to. So, talking to Redis and Postgres and Kafka even.

Josh: It's all the storage and all of the computation around segmentation and whatnot?

Joe: Oh yeah.

Josh: So, now that we're three years in, do you think it's going to take us to the next 10x growth?

Joe: Oh, certainly. So, OnePush was made to be sort of pluggable with all of these external data sources and sinks: where notification deliveries are coming from, where the results are being sent to. Those are all sort of plugins to the application. So, the core of it, which we know performs very, very well and scales very well, is sort of set for us forever. Whereas, you know, how we're getting notifications into the system, that's something we can create a new adapter for. And basically, whatever sort of message queue or other software we want to use to bring notifications in, like, you know, we're sort of future-proof there.
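
To make the plugin idea concrete, here is a hypothetical sketch of a pluggable delivery core in Rust. Every name in it is invented for illustration; none of this is taken from OnePush, whose internals aren't shown in this interview.

```rust
// The delivery core only talks to these traits, so swapping the message
// queue or the result store means writing a new adapter, not changing
// the core. All names here are invented for illustration.

struct Notification {
    device_token: String,
    payload: String,
}

struct DeliveryResult {
    device_token: String,
    success: bool,
}

/// Where notifications to deliver come from (e.g. a Kafka topic).
trait NotificationSource {
    fn next_batch(&mut self) -> Vec<Notification>;
}

/// Where delivery results get written (e.g. Kafka, Postgres).
trait ResultSink {
    fn record(&mut self, results: &[DeliveryResult]);
}

/// An in-memory source, standing in for a real adapter.
struct VecSource(Vec<Notification>);

impl NotificationSource for VecSource {
    fn next_batch(&mut self) -> Vec<Notification> {
        std::mem::take(&mut self.0)
    }
}

/// A sink that just prints results, standing in for a real adapter.
struct StdoutSink;

impl ResultSink for StdoutSink {
    fn record(&mut self, results: &[DeliveryResult]) {
        for r in results {
            println!("{} -> success: {}", r.device_token, r.success);
        }
    }
}

fn run_delivery(source: &mut dyn NotificationSource, sink: &mut dyn ResultSink) {
    loop {
        let batch = source.next_batch();
        if batch.is_empty() {
            break;
        }
        let results: Vec<DeliveryResult> = batch
            .into_iter()
            .map(|n| DeliveryResult {
                device_token: n.device_token,
                // A real engine would perform the HTTP send here.
                success: !n.payload.is_empty(),
            })
            .collect();
        sink.record(&results);
    }
}

fn main() {
    let mut source = VecSource(vec![Notification {
        device_token: "token-1".into(),
        payload: "{\"title\":\"hello\"}".into(),
    }]);
    let mut sink = StdoutSink;
    run_delivery(&mut source, &mut sink);
}
```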

Josh: Did you come into OneSignal with Rust experience?

Joe: So I came into OneSignal with some Rust experience. My biggest project at the time was a library for writing chatbots, called Chatbot. And I don't think it's used too much anymore; it hasn't really been maintained, but there have been a few people using it. So, that's sort of the extent of the experience beforehand. But about a month into starting this job, in addition to working on OnePush, the delivery system, I also started a side project writing a terminal emulator called Alacritty, and that has become quite popular actually. Alacritty is built in Rust.

Josh: Is this what led to you becoming seen as a community leader within Rust? Is that a fair characterization?

Joe: I don't think-- I would never call myself a community leader. You know, it has given me the opportunity to interact with a lot of folks in the Rust community. And it's actually helped us to hire some very excellent engineers as well.

Josh: Yeah, yeah. I know, I've seen that as a sort of objective third party here. What do you think is the future for Rust? Will it become much more widespread? What's your prediction of the evolution?

Joe: So, Rust is innovating at a pace really exceeding any other language at this point. Well, you know, Rust itself being written in Rust helps them write, you know, a Rust compiler that's going to work and be able to support new language features much more quickly and easily, I think, than, say, you know, C++ trying to evolve, or even Go trying to evolve really. And the proof is in the pudding with this one: they just launched async-await, which is a set of language constructs for concurrent programming, really making that quite a bit easier. And that's going to really help boost Rust's popularity in the server space. And we're already starting to adopt it at OneSignal. But there are a lot of interesting projects coming out every day. And I think Rust is going to eventually come to be the default for all new sort of compiled programs where performance matters. Whereas that might be C++ today, I think the future is Rust for sure.
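
For readers who haven't seen the new syntax, here is a minimal sketch of async/await in Rust. It assumes the Tokio runtime, one common choice; the transcript doesn't say which runtime OneSignal adopted.

```rust
// Requires the tokio crate (with the "full" feature set) as a dependency.
use std::time::Duration;

async fn send(id: u32) -> u32 {
    // Stand-in for a network call to a push provider.
    tokio::time::sleep(Duration::from_millis(10)).await;
    id
}

#[tokio::main]
async fn main() {
    // Kick off several "sends" and await them together; they run
    // concurrently on the runtime instead of needing one thread each.
    let (a, b, c) = tokio::join!(send(1), send(2), send(3));
    println!("delivered: {} {} {}", a, b, c);
}
```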

Josh: Is there anything you would change in terms of that decision or how we executed that over the last three years?

Joe: Anything I would change about using Rust?

Josh: Rust or just how we implemented it?

Joe: Yeah, that's a really good question. I think we've definitely had some extra challenges due to adopting Rust at the time we did. I think you wouldn't necessarily have some of those same challenges today. But just as an example: OnePush, if you kind of get down to the basics, is a system that sends a lot of HTTP requests really fast. And at the time we started, the main library for doing this in Rust, called "hyper", just supported a synchronous API. And so, your only real option at the time for sending concurrent requests was to spawn a whole bunch of threads. Now, hyper at the time was actually working on its first asynchronous back end, which would allow us to send a bunch of HTTP requests using a single thread. But it never actually got merged into the master development branch. It was based on an async library called Rotor, which has since been deprecated in favor of a newer library, Tokio. But we ended up living on that branch for quite a while, because we couldn't really get the performance numbers we wanted just spawning tons and tons of threads.
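
As a rough sketch of the single-threaded fan-out an asynchronous HTTP client makes possible, here is an example using reqwest (a client built on top of hyper) and the futures crate. The crate choices, URLs, and concurrency limit are assumptions for illustration, not details of OnePush.

```rust
// Assumed dependencies: tokio, reqwest, futures.
use futures::{stream, StreamExt};

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let urls: Vec<String> = (0..100)
        .map(|i| format!("https://example.com/device/{}", i))
        .collect();

    // Up to 32 requests in flight at a time, all driven by the async
    // runtime rather than 100 spawned OS threads.
    let statuses: Vec<_> = stream::iter(urls)
        .map(|url| {
            let client = client.clone();
            async move { client.get(url).send().await.map(|resp| resp.status()) }
        })
        .buffer_unordered(32)
        .collect()
        .await;

    println!("completed {} requests", statuses.len());
}
```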

[10:35]

Josh: Interesting. And did we have issues, or not issues, but challenges, finding people who had experience in Rust as well?

Joe: So, we've had a lot of luck with hiring really good engineers who are interested in Rust, but we actually haven't hired any-- well, I take that back. We haven't hired very many people with extensive Rust experience. We've had one come in with a lot of Rust experience. But most of the team has actually sort of learned it on the job. And it's actually great for onboarding new engineers too, with the compiler rigorously checking everything. It helps to make sure that people are writing good code. A lot of the team has gone through the experience of fighting with the borrow checker for the first time here at OneSignal.

Josh: Let's talk a little bit about the technical decisions that got OneSignal where we are today, and specifically those growth points. So, we covered delivery and specifically Rust. What were the other growth points that required the team, specifically from a backend architecture standpoint, to make changes to handle the scale?

Joe: So, I think we can sort of look at the historical things there, the historical decisions, the things we're dealing with currently, and some of the things we're looking at for the future. Historically, the main things we've done are: one, the delivery system, and two, sharding out our database, Postgres, which has been a little bit operationally heavy. And so, one of the challenges we're working on currently is: is there sort of a scale-out version of Postgres that we can continue to use and, you know, scale the service and make our lives easier operationally as well? Or do we need to look at, you know, some other sort of database for some of our subscriber data, something like ScyllaDB for instance? So, on databases, you know, we've made some decisions in the past that have helped us scale to where we are today, and I think there are some choices we're still trying to figure out for the future on that front. One of the other pieces of technology we use is this database called Redis, which is all in memory. And so, historically, we've sort of just managed that by having replicas and a manual failover process in case of any sort of disaster. But as we're upping our operational game, we're actually in the process of rolling out Redis Enterprise, with automatic failover and better high-availability features than, you know, we were able to provide before. So, that's pretty exciting. And we're actually rolling that out on top of Kubernetes, which is another technology we're bringing in to, sort of, improve our operational posture here.

Josh: And then-- so, that leads me to the next question, which is probably a little bit part and parcel of what you were just talking about, but how do we think about, and how do you deploy, kind of, experimentation with these new technologies? Because we have a weekly all-hands where we talk about this stuff, and some of the backend engineers, kind of under your direction, or you specifically, will get up and talk about, "Hey, we're spinning up this cluster around this and we're testing it and it's going to, like, reduce our load or our utilization." And it's all impressive. And as somebody who's not an engineer, I'm like, "I'm glad those guys are on my team." But I'm curious how you think about that, how we've tested those things, and how we actually get to the decision of deploying that technology, whether it be Kubernetes or even back in the day with Rust. But like, specifically recently, how are we thinking about that? How do you approach it?

Joe: Well, with anything we're rolling out, you know, we want a change that's going to make our lives easier than they were before, or make the system more robust than it was before. So, with Kubernetes as an example: prior to that, we basically orchestrated all of our services on our servers using Ansible, which gets to be very difficult to maintain very quickly, especially as the number of services grows. This is a problem that Kubernetes solves very well. And a lot of the initial applications we're putting on there are actually just Kafka consumers, which are rather easy to move onto Kubernetes; they don't have some of the ingress challenges that, like, API services would, for example. On the database front, which I think is where things are a little bit more interesting and certainly more challenging: how do you sort of vet a new data store for your workload, and not just with, you know, sort of sandbox testing, but how do you validate it to the point that you're confident, you know, moving all of your production traffic over to a new system?

[14:58]

And so, the main way we can achieve that sort of thing is actually because a lot of our database writes are already going through Kafka, and certainly all of the subscriber updates, as far as, like, new push subscriptions, tag updates, that sort of thing. Those all get written out to Kafka. And so, what we can do is actually spin up, you know, this new data store by basically creating a new consumer group on Kafka, so, sort of starting to record all of the future updates, and then doing an import from the existing data stores into that new system, and then start consuming off of Kafka to pick up all of the updates that were missed in the meantime. And then eventually, we're back to real time. So now, we have two databases running with all of the up-to-date production data, and we can sort of start dark loading and testing some of the production queries on the new system. And then eventually, once everything has been nicely validated, you can move some production traffic over, starting with reads, and seeing that everything is performing well. Eventually, you move all production systems over. But having sort of a transaction log, external from the database itself, that allows you to load up these other data stores just by attaching another Kafka consumer is really helpful.
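
Here is a rough sketch of that migration pattern in Rust using the rdkafka crate. The crate choice, topic name, consumer group id, and helper functions are all assumptions for illustration, not details of OneSignal's actual pipeline.

```rust
// Phase 1: attach a brand-new consumer group so updates accumulate for
//          the candidate data store.
// Phase 2: bulk-import a snapshot from the existing database (stubbed).
// Phase 3: consume from Kafka to catch up to real time, then keep
//          consuming so the new store stays current for dark reads.
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{Consumer, StreamConsumer};
use rdkafka::Message;

async fn import_snapshot_from_existing_store() {
    // Bulk copy current rows from the existing database into the new store.
}

async fn apply_update_to_new_store(payload: &[u8]) {
    // Translate one subscriber update into a write against the new store.
    let _ = payload;
}

#[tokio::main]
async fn main() {
    // Phase 1: a fresh group.id means Kafka tracks offsets separately for
    // this loader, starting from the earliest retained message.
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "candidate-store-loader")
        .set("auto.offset.reset", "earliest")
        .create()
        .expect("failed to create Kafka consumer");

    consumer
        .subscribe(&["subscriber-updates"])
        .expect("failed to subscribe");

    // Phase 2: import the snapshot while updates queue up in Kafka.
    import_snapshot_from_existing_store().await;

    // Phase 3: replay the backlog, then keep applying updates in real time.
    loop {
        match consumer.recv().await {
            Ok(message) => {
                if let Some(payload) = message.payload() {
                    apply_update_to_new_store(payload).await;
                }
            }
            Err(err) => eprintln!("kafka error: {}", err),
        }
    }
}
```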

Josh: So it's a process of running the systems in parallel and in batch to see that it's working, and then, once you've validated that it's doing what you're looking for, you move everything over.

Joe: That's right.

Josh: It's a busy time for all digital services, especially with Black Friday, Cyber Monday, Giving Tuesday.

Joe: Yeah, exciting days for us.

Josh: Our customers rely on us for reliability. What technologies or languages are you considering to handle the next phase of growth? I mean, you touched on some of them there, but…

Joe: Yeah, we've touched on a few of them. So, to get very concrete, we're evaluating a few different scale-out database solutions. We're looking at Citus because it's sort of a drop-in replacement for Postgres, which is what we're using for all of the push data today. We are-- I mentioned we're bringing in Kubernetes, Redis Enterprise, those sorts of things. A lot of what we're looking at right now is basically, how do we take the remaining hard-to-scale parts of our system and make them linearly scalable? And so, on the database side, like, Citus might be an option, but we're also looking at-- all the options are on the table at this point. So, we're looking at, like, Vitess, which is sort of scale-out MySQL. It's actually MySQL under the hood at the end of the day, but they have various software to make it scale out. And ScyllaDB, which is sort of a Cassandra rewrite in C++ that's a lot more performant. They've done a really cool benchmark on their infrastructure where they were able to fetch a billion rows per second. We're delivering 5 billion notifications a day; at that rate, we could get to 5 billion per minute.

Josh: That’s cool.

Joe: That's pretty incredible. There's some interesting research to be done there. And we've kind of touched on how we'll be performing some of those experiments. And, you know, hopefully, we'll find something that satisfies all of our desires.

Josh: What's the thing that gets you most excited every day to come work at OneSignal?

Joe: Honestly, it's hard to pick one thing. It's really fun to come work with our team, who are all very excited about solving these difficult technical challenges and supporting our customers the best we can. One of the things that really struck me the other day was when we were sort of talking about a PagerDuty incident. It didn't really, like, impact-- it didn't have a broad impact at all. But one of the engineers brought up that, you know, every time they get paged, they sort of feel like they're letting down the customers. And so, I think it's really exciting to be working with a team that cares so deeply about the success of our customers, but also solving really interesting technical problems together and, you know, working as a team to make our customers and ourselves successful.

Josh: Yeah, I am not going to get teary-eyed, but I do feel like one of the things that's really interesting about this company is that there's a lot of collaboration and a sense of ownership across functions. People are very collaborative, and that's rare; as somebody who's worked in a lot of different startups and in large companies, you get very siloed, especially as you scale. So, the team has been outstanding, and you've been a huge contributor to that in terms of bringing in great technical talent. So, I really appreciate you joining us, Joe. Hopefully, we'll have more of these in the future.

Joe: Certainly. It's been fun. Thank you for having me.

Josh: Thank you guys for listening. If you enjoyed what you heard, please subscribe to the OneSignal podcast on your preferred podcast directory: Spotify, Apple, Google, TuneIn, you name it. Thank you. Have a good day.