One in five new apps made today uses OneSignal — we send over 10 billion messages every day! With so many messages and apps to support, we maintain a continual focus on improving the speed, efficiency, and reliability of our underlying infrastructure. To this end, we recently migrated one of our most trafficked endpoints, the “on session” endpoint, from our legacy Ruby on Rails codebase to Go.
What is the “on session” endpoint?
The “on session” endpoint accounts for about 30% of our overall traffic and is called every time someone opens an app that uses OneSignal (on every session, get it?). Each time this endpoint is hit, it:
- updates the last time a subscription launched the application and the number of times they have done so, or registers a new subscription if none was found
- deletes any automatic notifications configured to be reset on each session
- downloads any in-app messages that a subscription is eligible for
- optionally, updates almost any of the information on a subscription, depending on the incoming request
This is a very important endpoint for our product!
High-Level Overview of the Systems Involved
Our SDKs installed in apps send HTTP requests to the “on session” endpoint every time a subscription opens the app. Those requests are parsed, their fields validated, and then they are compared against our record of the subscription, if one exists in our database. Any valid and authorized changes are then compiled into an update and sent synchronously via Kafka. From that point, a consumer reads the messages from the relevant topic and passes the updated information along to other services, which update the subscription in our database. The endpoint then resets and deletes the appropriate automatic notifications in the database and, finally, fetches and returns any in-app messages the subscription is eligible for (we were actually able to refactor this last step as part of this migration!).
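As a rough sketch, that pipeline ordering might look like the following in Go. Everything here is illustrative (the interface, topic name, and helper functions are assumptions, not OneSignal's actual code); the point is the sequence: validate, produce the update synchronously, reset notifications, then fetch in-app messages.

```go
package main

import (
	"errors"
	"fmt"
)

// producer abstracts the synchronous Kafka send; a real implementation
// would wrap a Kafka client. This stub just records what was sent.
type producer interface {
	Produce(topic string, msg []byte) error
}

type stubProducer struct{ sent [][]byte }

func (p *stubProducer) Produce(topic string, msg []byte) error {
	p.sent = append(p.sent, msg)
	return nil
}

// onSession walks the pipeline in order: validate the compiled update,
// produce it to Kafka, reset automatic notifications, then fetch and
// return eligible in-app messages.
func onSession(p producer, update []byte) ([]string, error) {
	if len(update) == 0 {
		return nil, errors.New("empty update")
	}
	if err := p.Produce("subscription-updates", update); err != nil { // hypothetical topic name
		return nil, err
	}
	resetAutomaticNotifications()
	return fetchInAppMessages(), nil
}

func resetAutomaticNotifications() {} // database work omitted

func fetchInAppMessages() []string { return []string{"welcome"} }

func main() {
	p := &stubProducer{}
	msgs, err := onSession(p, []byte(`{"session_count":1}`))
	fmt.Println(msgs, err, len(p.sent))
}
```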
Why move the “on session” endpoint?
The “on session” endpoint has lived in our Ruby on Rails codebase since it was first created. Although it has served us well up until now, it made sense to move it to our Go codebase (called Turbine) for a few reasons:
1. Go is a much faster and more efficient language than Ruby. Part of this is because Go is compiled, whereas Ruby is interpreted. The other part is Go’s built-in concurrency through its lightweight goroutines, which lets each pod handle more requests at a time. As a result, Turbine uses less CPU and finishes its tasks faster than our Ruby codebase, handling the same amount of traffic more easily and allowing us to divert costly resources away from Rails and use them elsewhere.
2. For our team, maintaining and refactoring code written in Go is much easier than doing so in Rails. For one thing, Go’s static type system makes it easier to write initial code with fewer bugs, as many are detected automatically, and statically typed code is simpler to understand because clear function signatures denote what is taken in and returned. For another, Go has significantly less “magic” than Ruby, obscuring less of the actual work being done and making it more straightforward to read and reason about for someone new to a codebase. Finally, because Go is faster than Ruby, testing is also faster, making developing and iterating on logic a more efficient process. In fact, as a part of this migration, we were able to make a few refactors that improved efficiency and reduced dependencies! As we consider future refactors and improvements in our continued move toward service-oriented architecture, Go is an easier language for OneSignal to reason about.
3. Our product supports large businesses that sometimes need to send more than their usual volume of notifications. This initiative ensures that we have excess capacity in our Rails codebase to handle those spikes with ease.
We started this massive migration in late August of 2021, one year and four and a half months before the end of the project (but really, who’s counting? That span did include a six- or seven-month hiatus midway through). At that point, I was three months into my job at OneSignal and, more importantly, three months into my career as a software engineer.
At the time, the “on session” endpoint migration was more of a nice-to-have future improvement but not yet urgent, which is why I was given the opportunity to take it on. Thankfully, I had no idea how monumental a task this would end up being. I might have been more tentative to take it on if I had, and I’m grateful for the learning that came from jumping in without hesitation (not to mention the freed-up resources in our system! But more on that later). Throughout the process, I was forced to think deeply about how our systems worked together; I became frustrated and overwhelmed, I had to problem-solve and debug in totally new ways, and I got to surround myself with new languages and skills around reading and understanding code. This was an incredibly long, often tedious process that helped me to level up in a way that I am grateful for and taught me to reach out and lean on my team when needed.
What did the process entail?
The rollout process from zero to 100% in production went incredibly smoothly with no major setbacks or interruptions to service. The ease of the actual rollout came at the end of an extremely lengthy and involved process of documentation, translation, testing, and validation that altogether made our confidence in rolling out to production very high.
Phase one of migrating the “on session” endpoint was documenting everything (truly everything). The document in question ran over 17 pages, single-spaced, detailing every tiny nuance of the existing “on session” endpoint. At times, the care spent on this portion almost felt silly. However, it was extremely important from the beginning that this new implementation of “on session” be bug-for-bug compatible with the existing implementation. If the old implementation was doing something unexpected or even unintended, it was possible that a current customer knew about it and was depending on that behavior. Because this was such a large migration, we had absolutely no intention of changing the endpoint’s functionality in any way. Since we were migrating away from Ruby on Rails, there was magic afoot! But no, really. The “magic” that is part and parcel of Ruby on Rails meant a great deal of nuanced functionality was hidden in very small snippets of code that needed to be painstakingly parsed, understood, and documented in long form.
Once we were reasonably confident that we had identified everything that this new implementation needed to do in order to match the existing one, it was time to start the equally long process of translating. Needless to say, the “magic” that Ruby does to abstract all of the work it is actually doing made the Go implementation significantly longer.
Things that were particularly difficult:
- Parsing the incoming request! Rails does a nontrivial bit of magic to take an incoming request and parse it into a usable object. Go does not. The hand-wavy way the existing implementation could receive a request and move forward with nothing explicitly spelled out had to be expressed in exacting logic in Go. This leads to another overarching difficulty that Go presented where Ruby did not.
- Dealing with nullable values! Because dereferencing a nil pointer in Go causes a panic, before manipulating any information from the incoming request, or from many of the fields coming out of the database, we needed to match Ruby’s implicit behavior with explicit nil checks.
- The migration was put on hold for six or seven months as other projects took precedence but then picked up again with a renewed sense of urgency. This led to some of its own challenges: the implementation up to that point had to be reviewed afresh when we resumed the project, and new changes that had been made to the endpoint and to our system architecture needed to be understood and slotted in.
In the process of translating, we realized that a significant portion of translated code that queried for and retrieved the in-app messages that a subscription was eligible for could be factored out into its own separate service, the “In-App Message Service.” This was extremely exciting as it was yet another step toward smaller, more service-oriented architecture and allowed us to create a gRPC service that could be used more generally in other areas of our product.
The testing portion of migrating the “on session” endpoint involved two parts. First was proxying the existing “on session” test suite in our Ruby on Rails codebase against the new implementation that now lived in Go. To make this happen, we used rack-proxy, a request/response rewriting HTTP proxy, to send requests from the test suite to the new implementation of the endpoint. This proved a bit trickier than we first anticipated. After setting up the initial proxy connection and getting the two repositories talking to each other, we realized it wasn’t going to be quite that simple. Turbine, our Go codebase, had a testing pattern whereby the change about to be put on Kafka for processing was compared against the expected change. In our Ruby codebase, the tests were set up to synchronously update the subscription in question sans Kafka and then check the database to see that the appropriate changes were made. This is subtle, but it essentially meant there was no way to use the existing test suite against the new implementation without duplicating this testing event dispatcher. Which, of course, led to the next step in testing.
After all of the tests passed via rack-proxy, the next step in fully testing the new implementation was to translate the test suite into Turbine, where the new implementation lived. Using a Ginkgo and Gomega test setup, we transferred the entire existing test suite to ensure all functionality was, and would continue, operating as it had. An interesting issue came up during the translation: some tests that passed when proxying started failing once moved into the Go repo. This ultimately came down to a discrepancy between the fake (in-memory) version of one of our services that the Go repo used in Docker when testing and the real (Postgres) version that the Rails repo used. In theory, the two should respond to input in the same way, as the point of a mock service is to act as a simplified version of the real functionality with much less processing time and complexity. This helped us identify and rectify some of the differences between the two versions.
The validation portion of the rollout process was more convoluted than validation typically is. The “on session” endpoint is extremely write-heavy, meaning that in the course of running, it makes persistent changes to our data stores. Because of this, it was not possible to simply run the old and new implementations in parallel for a small portion of traffic and compare the results. Doing so would attempt to change each subscription twice, which would cause semantic problems as well as technical ones, such as race conditions. And in the end, it would not have proven that both implementations were doing the same thing. So we had to get a bit more creative. Along the way, we came up with several possible approaches and ultimately decided on a two-phase process.
Phase 1: Confirm that the responses returned by both implementations are the same
To test that the new implementation of “on session” performed exactly the same way as the legacy implementation, we used the Scientist Gem, which lets you compare how existing and new implementations of code behave under load with production data. It does this by running an “experiment” in which the “control” is the existing production path and the “candidate” is the new logic. All normal production traffic goes to the control exactly as is, processing and returning normally; at the same time, a configurable portion of that traffic is also sent to the candidate, whose results are compared against the control’s, with any differences recorded and the candidate’s return values ignored. This made it a great fit for our purposes: we used the Scientist Gem in our Rails repo to send a small portion of incoming “on session” traffic to the new Go implementation via rack-proxy, inserting the proxy at the very top of the Rails middleware stack so that Rails had no effect on the incoming request before it was sent to the new Go implementation.
In order to do this, we first needed to create a version of the "on session" endpoint in Turbine that acted as it should from a functional standpoint but without producing any persistent changes along the way (i.e. without changing anything in any database, cache, or production logic). To accomplish this, we added a new header to all the requests sent to the new implementation via the Scientist Gem. Then, we added logic to parse this new header and set what we called “OnSessionPersistence.” This eventually had three modes: “persistenceNone,” “persistenceVerification,” and “persistenceLive.” For this first phase, we used “persistenceNone” as we wanted to hit the endpoint and make absolutely no persistent changes. In “persistenceNone” mode, connections to Postgres, Kafka, Redis, and outside services were disabled.
We then ran the experiment, compared the responses (headers and payloads) returned by both Rails and Turbine, and sent any mismatch logs to DataDog. We initially ran the experiment on about one request per second (about 0.005% of requests). We focused solely on this first phase (run the Scientist Gem experiment with the persistenceNone setting -> compare responses -> address mismatches) for a week or two, bumping the percentage from 0.005% -> 0.5% -> 2%, before also moving on to step two. To ensure that all scenarios and corner cases were addressed, we actually kept these Scientist experiments running in the background throughout the entire rollout process.
Gradually, mismatches that needed addressing became rare and then nonexistent, which increased our confidence. We squashed bugs as they came up and addressed nuances between the two implementations that testing had not caught. These often involved nil pointer exceptions, one implementation returning nil where the other returned a nil UUID, header case sensitivity, slight differences in symbols, and the odd edge case.
Phase 2: Validate the messages being put onto Kafka to update subscriptions
The way the “on session” endpoint accomplishes the “update subscription” portion is by putting a message onto Kafka, which then gets consumed and passed along to relevant services that do the updating in the database. The response returned by the endpoint doesn't tell us very much about these updates. It only tells us that the request was successful (no errors occurred along the way), who it was successful for (which subscription), and what in-app messages that subscription is eligible for. Because of this, although our first step in validation was important, it did not validate that the updates being made on a subscription were correct. It only validated that both implementations did or didn't error in the same way, that they were operating on the correct subscription, and that they retrieved the same in-app messages. Useful information, but certainly not enough to verify that everything was working fully and accurately.
Verifying that the same updates were being made was a bit more involved. As we do not return the update being made as part of the response, the new implementation could not actually produce a message to Kafka for us to consume before we rolled it out to production, as this would attempt to update a subscription twice. So, we had to find a way to get the update message out midway through the process so that we could compare it to the update message that the old Rails implementation was producing to Kafka.
To do so, we:
1. Created a new Redis instance to store the messages that the new implementation would have put on Kafka.
2. Set the “onSessionPersistence” mode to “persistenceVerification,” which inserted the Kafka message into our new Redis instance instead of enqueueing it to Kafka. For the key, we used the Cloudflare Ray ID from the request (a unique identifier “given to every request that goes through Cloudflare”) and set the value to the message in its entirety.
3. Updated the old implementation of “on session” to smuggle the Cloudflare Ray ID from the request onto the message being put on Kafka.
4. Created a new Kafka consumer to consume messages off of the appropriate topic and check to see if a Cloudflare Ray ID was present on the message. If it was, we then checked to see if it was a valid key in the Redis instance. If so, the consumer then went on to retrieve the associated value from the Redis store in order to compare the two update messages. The first step in comparison was to normalize necessary fields, for example:
- Set the last active time on both messages to the same time if they were within 3 seconds of each other
- Order tags alphabetically so that, when compared, they wouldn’t mismatch due to inconsequential ordering differences
5. Sent all mismatches between the update messages to DataDog.
6. Fixed any differences between them. There were more than a few seemingly “silly” mismatches — those where the resulting update was the same, but the intervening message differed. For instance, the old version always set the session count to one when a new subscription was created. That work is somewhat redundant, because the database table’s default behavior when creating a new subscription is to set the session count to one unless another value is provided. The new implementation did not initially set this value explicitly, but it still ended up producing the same result. We nevertheless updated it to reduce mismatch noise and keep everything bug-for-bug compatible. We also found and fixed quite a few nil pointer exceptions.
After the prolonged verification phase, the actual rollout went quickly and smoothly. So much so that it was almost anticlimactic. We were expecting approximately 30% more traffic to Turbine than it currently handled, so we increased CPU limits accordingly before ramping the new “on session” endpoint up in production. We started by switching 0.01% of production traffic over to the new Turbine implementation. While constantly monitoring error and info logs, we spent the first week bumping rather conservatively, reaching 20% by Friday.
In this first week, we found that we needed to increase connections for each database server from two to 16 in order to keep up with the increased demand. Oddly, when we did this, we still saw shards bounded by two database connections. It took some sleuthing to figure out that our database connection pooler was only configured to hold two idle connections at once. We saw the connection count increase once we updated this configuration. We ended up leaving the number of database connections quite high, at 16 per database server, until we finished the rollout and some time had passed, allowing us to determine how many were actually required.
As we had prepped thoroughly and everything had gone so smoothly the previous week, we ended up ramping from 20%, to 50%, to 80%, to 100% within the next three days. Throughout the rollout process, we ran into the odd nil pointer exception and added more nil checks throughout, but generally, everything worked as it should from the start because of the extensive validation process that came before.
One alternative we considered for the validation process was mirroring traffic via our HTTP ingress layer to the new implementation of "on session" when we were ready to compare the Kafka messages between the two versions. Through the process, however, we realized that instead of adding this extra step, we could continue to use Scientist and achieve the same level of confidence.
In the end, we saw about a 40% reduction in CPU usage and a 40% reduction in the number of concurrent requests being handled at any given time in our Rails codebase. In conjunction with other efforts to make our systems more efficient and free up space, we were ultimately able to decommission an astonishing 50% of our Rails hardware while not adding any additional hardware to Turbine! And those results speak for themselves!