Rust at OneSignal

Last year, we announced OnePush, our notification delivery system written in Rust.

In this post, we cover the improvements in our delivery capabilities since then, take an interactive tour of OnePush’s subsystems, and reflect on our experience shipping production Rust code. We hope you'll find it insightful!

Delivery Stats

OnePush was built to scale deliveries at OneSignal. To know whether this endeavor was a success, we collect metrics such as historical delivery counts and delivery throughput. Here's how OnePush is performing:

  • OneSignal had ~10,000 users at the start of 2016 and has over 110,000 at the time of publishing this post. (Over 10x growth!)
  • We've increased the number of daily notifications sent by 20x in the same period.
  • OnePush delivers over 2 billion notifications per week.
  • OnePush is fast - We've observed sustained deliveries up to 125,000/second and spikes up to 175,000/second.

The title image on this post is a screenshot from our live delivery monitoring. Each bar represents deliveries occurring in that second, and each vertical division denotes 5,000 deliveries. The colors represent different platforms like iOS, Android, Chrome WebPush, etc. Every single one of them was delivered by OnePush.

OnePush

OnePush comprises several subsystems for loading notifications, delivering notifications across HTTP/1.1 and HTTP/2, and processing events and results.

A self-guided tour of OnePush

The following SVG diagram shows the high level architectural components of OnePush.

A few notes about reading the diagram:

  • The squares with lightning bolts are threads.
  • The circles are connection pools for Redis and PostgreSQL.
  • Multiple boxes with the same type grouped together are for illustrative purposes. Clicking on any of them will result in the same description being displayed.
  • Squares with dotted borders represent top level generic types. That is, the implementations presented here integrate with the rest of OneSignal, and the core delivery system is independent of their implementations.

Click on any box that lights up for a description of the subsystem; the text will appear below the diagram.

[Interactive diagram: OnePush architecture. Labels: Router, HTTP/2 Client (6), Feeder, Source, R2D2 Redis, APNs Events, HTTP Driver, Notification Batch, Request Generation, Hyper Async Pool, Client (4), Terminal, Event Stream, Worker (4), Redis Reactor, R2D2 Redis, R2D2 Postgres, Events, Everything Else]


Choosing Rust

Choosing the programming language for a core system is a big decision. If not careful, one could end up with months of time invested and get stuck writing library code instead of the application itself. This is less of a concern with programming languages that have a mature ecosystem, but that's not exactly Rust just yet. On the other hand, Rust enables one to build robust, complex systems quickly and fearlessly thanks to its powerful type system and ownership rules.

Given that we now have a production system written in Rust, it's obvious which side of this trade-off we landed on. Our experience has been positive overall, and we have had fantastic results. The following sections discuss the specific pros and cons we considered for building OnePush in Rust, the risks we accepted at the outset, the successes we had, and the issues we ran into.

Reasons to not use Rust

The Rust ecosystem is young. Even if there exists a library for your purpose, it's not guaranteed to be robust enough for a production deployment. Additionally, many libraries today have a "truck factor" of 1. If the library's developer gets hit by a truck, it's going to be on you to maintain it.

Next, Rust's tooling story is weak. You can use tools like Racer and YCM to get pretty far, but they fail in a lot of cases. Good tooling is a necessity, especially for developers who are getting up to speed.

Having team members (who may be unfamiliar with Rust) contribute to the project may take a lot of "ramp-up" time. This risk has turned out to be quite real, but it hasn't stopped other members of our team from contributing patches to the project. Mentoring from team members more proficient with the language and familiar with the code base helped a lot here.

Finally, iteration times can be long. This wasn't something we anticipated up front, but build times have become onerous for us. A build from scratch now falls into the category of "go make coffee and play some ping-pong." Recompiling a couple of changes isn't exactly quick either.

Before settling on Rust, we considered writing OnePush in Go. Go has a lot going for it for this sort of application - its concurrency model is perfectly suited for managing many async TCP connections, and the ecosystem has good libraries for HTTP requests, Redis and PostgreSQL clients, and serialization. Go is also more approachable for someone unfamiliar with the language; this makes the code base more accessible to the rest of your team. Go's developer tools have also had more time to mature than Rust's.

Why choose Rust

Despite the negatives and the presence of a good alternative, Rust has a lot going for it that makes it a good choice for us. As mentioned earlier:

"Rust enables one to build robust, complex systems quickly and fearlessly thanks to its powerful type system and ownership rules."

This is huge. Being able to encode constraints of your application in the type system makes it possible to refactor, modify, or replace large swaths of code with confidence. The type system is our ultimate "move quickly and don't break things" secret weapon.

Rust's error handling model forces developers to handle every corner case. Even if there is a system with the potential to panic, it can be moved into its own thread for isolation. More recently, it has become possible to catch panics within a thread instead of only at the boundary. Languages like Go make it too easy to ignore errors.
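To make the isolation options concrete, here is a minimal sketch using only the standard library. The function names (`parse_checked`, `risky_parse`, `run_isolated`, `run_caught`) are hypothetical and are not taken from OnePush; they only illustrate errors as values, panic isolation at a thread boundary, and catching a panic inside a thread with `std::panic::catch_unwind`.

```rust
use std::num::ParseIntError;
use std::panic;
use std::thread;

/// Errors are ordinary values: the caller has to inspect the `Result`
/// before it can get at the number.
fn parse_checked(input: &str) -> Result<u32, ParseIntError> {
    input.parse::<u32>()
}

/// Hypothetical stand-in for a subsystem that may panic on bad input.
fn risky_parse(input: &str) -> u32 {
    input.parse::<u32>().expect("input was not a number")
}

/// Isolate the call in its own thread: a panic there surfaces as an `Err`
/// from `join` instead of taking down the caller.
fn run_isolated(input: &'static str) -> Result<u32, String> {
    thread::spawn(move || risky_parse(input))
        .join()
        .map_err(|_| format!("worker panicked on {:?}", input))
}

/// Catch the panic inside the current thread with `catch_unwind`.
fn run_caught(input: &'static str) -> Result<u32, String> {
    panic::catch_unwind(|| risky_parse(input))
        .map_err(|_| format!("caught panic on {:?}", input))
}
```

Calling `run_isolated("nope")` or `run_caught("nope")` returns an `Err` rather than crashing the program, although the default panic hook still prints the panic message to stderr. This assumes the default unwinding panic strategy; with `panic = "abort"` neither approach can recover.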

Next, OnePush needed to be fast. Rust makes writing multithreaded programs quite easy. The Send and Sync traits work together to ensure such programs are free from data races.
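As a small illustration of what `Send` and `Sync` buy you, the sketch below shares one counter between several worker threads. The `count_deliveries` function and its parameters are made up for this example and are not part of OnePush.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn `workers` threads that each record `per_worker` deliveries in a
/// shared counter, then return the total once every thread has finished.
fn count_deliveries(workers: u64, per_worker: u64) -> u64 {
    // `Arc<AtomicU64>` is both `Send` and `Sync`, so clones of it may be
    // moved into other threads and used concurrently.
    let delivered = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let delivered = Arc::clone(&delivered);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    delivered.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    // Joining each thread guarantees its increments are visible below.
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    delivered.load(Ordering::Relaxed)
}
```

Swapping the `Arc<AtomicU64>` for an `Rc<Cell<u64>>` makes this fail to compile, because `Rc` is not `Send` and `thread::spawn` requires its closure to be `Send`. The data race is rejected at build time instead of showing up in production.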

At the end of the day, our OnePush service is just a program optimized for sending a lot of HTTP requests. The library ecosystem offered everything we needed to build this system: an async HTTP/2 client, an async HTTP/1.1 client, a Redis client library, and a PostgreSQL client library. We are fortunate that the Rust community is full of talented and ambitious developers who have already published a great many quality libraries that suit our specific needs.

Finally, the developer leading the effort had experience with Rust and a strong preference for it. There are plenty of technologies that would have met our requirements, but deferring to a personal preference made a lot of sense. Having engineers excited about what they are working on is incredibly valuable. Such intrinsic motivation increases developer happiness and reduces burnout. Imagine going to work every day and getting to work on something you're excited about! Developer happiness is important to us as a company. Being able to provide so much of it by going with one technology over another was a no-brainer.

Risks

Aside from the general reasons not to choose Rust, we had a few additional concerns for this particular project.

As a glorified HTTP client, OnePush needed to be able to send lots of HTTP/1.1 requests very quickly. In the beginning, this wasn't quite as true because of our scale and because Android notifications could be batched into single requests. Going forward, we expected a huge increase in HTTP/1.1 outgoing request volume due to growth and the new WebPush specification with encrypted payloads. Hyper (Rust's HTTP library) had an async branch that was just a prototype when we started. We hoped that, by the time we truly needed an async client, it would be ready.

As it turned out, the initial async Rotor-based branch of Hyper never stabilized because tokio and futures were announced in August 2016. By the time we really needed the async branch, we ended up having to spend a week or two debugging, stress-testing, and fixing the Rotor-based hyper::Client. This turned out to be OK since it was a chance to give back to the Rust community.

Since we would be on the nightly channel for serde derive and clippy lints, another risk was spending a lot of time doing rustc upgrades. We avoided this situation by pinning to specific versions of the compiler and upgrading infrequently. When we did upgrade, the process required finding a recent rustc that was supported by both libraries. This will become less of an issue very soon with the advent of Macros 1.1.

Finally, Solicit (Rust's HTTP/2 library) uses three threads per connection. Although this is fine in isolation, having 20,000 connections (60,000 threads) quickly becomes expensive. We've mitigated this issue by using a short keep-alive to limit the number of active connections and by taking advantage of Apple's HTTP/2 provider API (APNs), which allows 500 requests in-flight per connection.
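The arithmetic behind that mitigation can be sketched as follows. The two constants come from the figures above; the helper functions and the 100,000 in-flight target in the note below are hypothetical and serve only to illustrate the trade-off.

```rust
/// Threads used per HTTP/2 connection, as stated above for Solicit.
const THREADS_PER_CONNECTION: u64 = 3;
/// Requests allowed in flight per connection, as stated above for APNs.
const IN_FLIGHT_PER_CONNECTION: u64 = 500;

/// Total threads needed to hold `connections` open.
fn threads_for(connections: u64) -> u64 {
    connections * THREADS_PER_CONNECTION
}

/// Connections needed to keep `in_flight` requests outstanding, rounded up.
fn connections_for(in_flight: u64) -> u64 {
    (in_flight + IN_FLIGHT_PER_CONNECTION - 1) / IN_FLIGHT_PER_CONNECTION
}
```

Under these assumptions, keeping a hypothetical 100,000 requests in flight takes `connections_for(100_000)` = 200 connections, or `threads_for(200)` = 600 threads, compared with the 60,000 threads that 20,000 open connections would need.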

Unexpected Issues

For the most part, we knew what we were getting into building such a system in Rust. However, one thorn in our side that we didn't anticipate was rust-openssl upgrades. We are stuck on an earlier version of rust-openssl since the Solicit library depends on an API that has been removed since v0.8.0. This means that we are unable to upgrade other dependencies which rely on rust-openssl until we fix the Solicit issue.

Another minor issue at one time was the limited test framework. A common feature for test frameworks is to have some setup and teardown steps that run before and after a test. We say this issue was minor because we were able to work around its absence by generating many tests declaratively with macros (discussed below).

Successes

Writing OnePush in Rust has been hugely successful for us. We've been able to easily meet our performance and scaling goals with the application. OnePush is capable of delivering over 100k notifications per second and efficiently maximizes the use of system resources. Despite being highly multithreaded, race conditions have not been an issue for us. Even better, OnePush needs very little attention. We were able to leave it running without any issues through the holiday break.

Regressions are very infrequent. There's a huge class of bugs in languages like Ruby that just aren't possible in Rust. When combined with good test coverage, it becomes difficult to break things - all thanks to Rust's fantastic type system. This isn't just about regressions either. The compiler and type system make refactoring basically fool-proof. We like to say that Rust enables belligerent refactoring - making dramatic changes and then working with the compiler to bring your project back to a working state.

The macro system has been another big win. Our favorite example of how this saves us engineering time is using macros to write tests declaratively. For example, we have a large set of tests for the Terminal. Each test takes an Event as input, and then the state of Redis and Postgres is checked for correctness after processing the event. The macro system enabled us to remove all of the boilerplate for these tests and declaratively say what the event is and what the expected outcome should be. Writing a test for this system today looks like this:

// Invoking terminal test-writing macro
push_test! {  
    // The part before the arrow ends up being the test name.
    // The `response` describes an `Event`, and the rest describes the system
    // state after processing it. There are more parameters that can be
    // specified, but the default values are acceptable in this case.
    apns_success => {
        response: apns::Response::Success,
        success: 1,
        sending_done: true
    },
    // .. and so on
}

Writing a lot of similar tests in this fashion enables us to get a lot of coverage without a lot of work. It also helps us work around the lack of features in the Rust test system (such as before/after hooks).
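For readers who haven't written such a macro, here is a minimal, self-contained sketch of the technique. It is not the real `push_test!` macro: `Response`, `Stats`, `process`, and `stats_test!` are hypothetical, heavily simplified stand-ins, and the real tests check Redis and Postgres state rather than a plain struct.

```rust
/// Hypothetical, much-simplified stand-in for a delivery response event.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Response {
    Success,
    Failure,
}

/// Hypothetical stand-in for the state checked after processing an event.
#[derive(Debug, Default, PartialEq)]
struct Stats {
    success: u32,
    failed: u32,
}

/// Process one response and return the resulting state.
fn process(response: Response) -> Stats {
    let mut stats = Stats::default();
    match response {
        Response::Success => stats.success += 1,
        Response::Failure => stats.failed += 1,
    }
    stats
}

/// Expands each `name => { ... }` entry into its own `#[test]` function.
macro_rules! stats_test {
    ($($name:ident => {
        response: $response:expr,
        success: $success:expr,
        failed: $failed:expr
    }),* $(,)?) => {
        $(
            #[test]
            fn $name() {
                let expected = Stats { success: $success, failed: $failed };
                assert_eq!(process($response), expected);
            }
        )*
    };
}

stats_test! {
    success_is_counted => { response: Response::Success, success: 1, failed: 0 },
    failure_is_counted => { response: Response::Failure, success: 0, failed: 1 },
}
```

Each entry becomes a separate test, so a failure is reported under its own name. Shared setup and teardown can live inside the macro body, which is one way to approximate before/after hooks.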

The final thing we want to comment on here is serde. This library enables adding a #[derive(Deserialize)] attribute to a struct and getting a deserialize implementation. Combined with our serde-redis library, this makes it possible to load data out of Redis like so:

/// A person has a name and an ID.
///
/// This is just some data with a derived
/// Deserialize implementation
#[derive(Deserialize)]
struct Person {  
    name: String,
    id: u64
}

// Gets a `Person` out of redis
let person: Person = redis.hgetall("person")?;

On the left-hand side of the line fetching person, there's a binding name with a type annotation. On the right-hand side, there's a call to Redis with HGETALL, and a ?. The ? is a bit of error handling: if the request is successful and deserialization works, person will be a valid Person, and the name and id fields can be used directly with the knowledge that they were returned from Redis. If something goes wrong, such as Redis being unreachable or data missing for the Person (for example, a missing id), an error is returned from the current function.

This is really powerful! We can just describe our data, add this derive attribute and then safely load the data out of Redis. To get the same effect in a dynamic language, one would need to load this dictionary out of Redis and write a bunch of boilerplate to validate that the returned fields are correct. This sort of thing makes Rust more expressive than many high-level languages.
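To show roughly what the derive and the ? save you from writing, here is a hand-written, standard-library-only sketch. It does not use serde or serde-redis: the `HashMap` stands in for an HGETALL reply, and `LoadError`, `person_from_hash`, and `greeting` are hypothetical names invented for this example.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct Person {
    name: String,
    id: u64,
}

/// Hypothetical error type covering the failure cases described above.
#[derive(Debug, PartialEq)]
enum LoadError {
    MissingField(&'static str),
    InvalidId(String),
}

/// Hand-written version of the validation that a derived `Deserialize`
/// implementation would otherwise provide: every field must be present
/// and must have the right type.
fn person_from_hash(hash: &HashMap<String, String>) -> Result<Person, LoadError> {
    let name = hash.get("name").ok_or(LoadError::MissingField("name"))?;
    let id = hash.get("id").ok_or(LoadError::MissingField("id"))?;
    let id = id
        .parse::<u64>()
        .map_err(|_| LoadError::InvalidId(id.clone()))?;
    Ok(Person { name: name.clone(), id })
}

/// The caller uses `?` in the same way as the `hgetall` line above: on
/// failure the error is returned early, otherwise `person` is fully valid.
fn greeting(hash: &HashMap<String, String>) -> Result<String, LoadError> {
    let person: Person = person_from_hash(hash)?;
    Ok(format!("{} (#{})", person.name, person.id))
}
```

With the derive, the body of `person_from_hash` is generated from the struct definition alone, so adding a field to `Person` does not require new validation code.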

Open Source

Early adoption in an ecosystem means there are lots of opportunities for open source contributions. The most notable of our contributions is a project called serde-redis, a Redis deserialization backend for serde. We've also had the opportunity to contribute several patches to Hyper's Rotor-based async client. We use that client in OnePush and have made billions of HTTP requests with it.

What's next

We've come far with OnePush, but there's still more work to do! Here are just a few of our upcoming projects related to OnePush:

  • Upgrade to Hyper's Tokio-based async implementation. We probably won't be super early adopters here since we've got an HTTP client with a lot of production miles on it right now.
  • Rework result processing to use futures. The Terminal's concurrency from threads is limited, whereas something backed by mio could have much higher throughput. This would require futures-compatible Redis and Postgres clients.
  • Replace Solicit's thread-based async client with a mio-based one. We've actually got a prototype of something from earlier in 2016.

We also have a new internal application written in Rust which we hope to blog about soon! It's a core piece of our monitoring which is responsible for collecting statistics from our production systems and storing them in InfluxDB.

Conclusion

We've had fantastic results building one of our core systems in Rust. It has delivered many billions of notifications, and it's delivering more and more each day. We hope that sharing our experience as early adopters in the Rust ecosystem will be helpful to others when making similar decisions. We've certainly found Rust to be a secret weapon for quickly building robust systems.

Like what we're doing? We're hiring!