Executing gRPC Client Retries in Ruby

At OneSignal, we rely heavily on the gRPC protocol for internal communication within our service-oriented architecture. It’s a great tool not only because it's fast, but because it allows us to clearly define our service boundaries in a language-agnostic way.

We recently started extracting some core functionality out of our Ruby on Rails monolith and into more focused gRPC microservices. The reduced complexity of our Rails application is great, but it comes at the cost of more network communication, which is inherently unreliable. Sure enough, as we ramped up our first gRPC integration, we encountered a persistent low level of gRPC UNAVAILABLE errors, which can result from various networking-related blips.

After convincing ourselves that these errors weren’t caused by any obvious misconfiguration on our part, we started looking for a robust retry mechanism for Ruby gRPC clients. Unfortunately we couldn't find a library to fit this need, but after some digging we found out that the gRPC core library, which is written mostly in C++ and shared between many gRPC client implementations, has built-in support for retry policies via gRPC service configs.

Service configs give you fine-tuned control over how your client recovers from various failure modes. In this article, we will only be covering retry policies, but service configs can also be used to configure hedging policies, which allow you to send multiple redundant requests to reduce the impact of transient failures. See the service config documentation for more detailed information.

It's also worth noting that, although we chose to define service configs on our client, you can also configure them server-side using the name resolver plugin. In this setup, clients can request the service config from the server via a gRPC request. Initially, we thought this approach was preferable because it would allow retry policies to be shared between clients without duplicating configuration. However, we decided against it due to the unique state of our service map. Currently, our Ruby on Rails service is the only gRPC consumer that does not live inside the same Kubernetes cluster as the gRPC service it connects to. For all other services, we use the service mesh Linkerd, which has built-in retry support. We ultimately decided that the service mesh layer was the ideal place for configuring retry policies between these Kubernetes services. To avoid conflicts between Linkerd and server-side service configs, we opted to stick with client-side service configs for our Ruby client.

Adding a Client-Side Service Config

Assuming you have already generated a service stub class from your protobuf definitions, service configs can be passed to your service stub via the :channel_args constructor parameter. Here is an example of what that would look like:

require "json"

Test::MyService::V1::Stub.new(
  "localhost:8080",
  :this_channel_is_insecure,
  channel_args: {
    "grpc.enable_retries" => 1,
    "grpc.service_config" => JSON.generate(
      methodConfig: [
        {
          name: [
            {
              service: "test.v1.MyService",
              method: "MyRetryableMethod"
            }
          ],
          retryPolicy: {
            retryableStatusCodes: ["UNAVAILABLE"],
            maxAttempts: 3,
            initialBackoff: "0.1s",
            backoffMultiplier: 2.0,
            maxBackoff: "0.3s"
          }
        }
      ]
    )
  }
)

The two channel arguments passed here are grpc.enable_retries, which globally enables or disables retries (1 or 0, respectively), and grpc.service_config, which contains the service config itself. Somewhat surprisingly, the grpc.service_config argument must be a JSON string rather than a hash. Let's break down these configuration parameters.

The top-level entry in the service config is methodConfig. This is a list that can contain any number of retry policies in the event that you need different policies for different RPC endpoints.

The name entry in each methodConfig object is a list of objects containing the service and optional method keys. service is the fully-qualified identifier for your service (e.g. "test.v1.MyService"). If you leave out the method entry, this retry policy will apply to all RPC methods in this service. If you want more fine-tuned control (this is recommended, because not all methods can be safely retried), you can specify method, which is just a single RPC method name. To apply this retry policy to multiple methods, you must include additional service/method entries in the name list.
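To make these naming rules concrete, here is a sketch of a service config containing two policies: one scoped to two specific methods (MyOtherRetryableMethod is a made-up name for illustration), and one with no method key that acts as the default for the rest of the service:

require "json"

service_config = JSON.generate(
  methodConfig: [
    {
      # Applies only to the two methods listed here.
      name: [
        { service: "test.v1.MyService", method: "MyRetryableMethod" },
        { service: "test.v1.MyService", method: "MyOtherRetryableMethod" }
      ],
      retryPolicy: {
        retryableStatusCodes: ["UNAVAILABLE"],
        maxAttempts: 3,
        initialBackoff: "0.1s",
        backoffMultiplier: 2.0,
        maxBackoff: "0.3s"
      }
    },
    {
      # No method key, so this policy covers every other method in the service.
      name: [{ service: "test.v1.MyService" }],
      retryPolicy: {
        retryableStatusCodes: ["UNAVAILABLE"],
        maxAttempts: 2,
        initialBackoff: "0.2s",
        backoffMultiplier: 2.0,
        maxBackoff: "0.4s"
      }
    }
  ]
)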

Finally, we've arrived at retryPolicy. Let's break down each retry option:

retryableStatusCodes

This indicates the gRPC status codes that you consider retryable. Typically this list will contain at least UNAVAILABLE, but you can add other status codes as you see fit (if you’re sure they’re safe to retry — more on this later). These codes may be given in uppercase or lowercase, or as the corresponding integer values.

maxAttempts

As you can probably guess, this option limits how many times a single RPC will be attempted; note that the count includes the original attempt, so a value of 3 allows up to two retries. The backoff configuration values that follow determine how long to wait before each retry (a calculation often referred to as "exponential backoff").

initialBackoff

This specifies the delay after the first retryable status code is encountered. In our example, we used 0.1s (100 milliseconds). This value must be expressed in seconds and end with the suffix s.

backoffMultiplier

This controls how quickly the retry delay grows with each successive failure. This parameter was important for us because it protects our infrastructure from becoming overloaded with retry attempts in the event of a legitimate networking outage. In our example, the backoffMultiplier is 2.0 and our initialBackoff is 0.1s. This means that our first failure will delay for 100 milliseconds before retrying, but the delay period will double for every subsequent failure (our second failure will delay 200 milliseconds, our third will delay 400 milliseconds, and so on).

maxBackoff

Similar to initialBackoff, this is expressed as a decimal duration in seconds and caps the delay between consecutive retry attempts. For instance, in our example above, three consecutive retries would otherwise be delayed by 100, 200, and 400 milliseconds. With a maxBackoff of 0.3s, the third delay is capped at 300 milliseconds, so the sequence becomes 100, 200, and 300 milliseconds. If we had configured a higher maxAttempts, every additional retry would also be delayed by the 300 millisecond maximum.
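To make the arithmetic concrete, here is a small Ruby sketch that computes the delay before each retry from these three values. It uses a larger maxAttempts than our example so the maxBackoff cap is visible, and it ignores any jitter gRPC core may apply, so treat the output as upper bounds:

initial_backoff    = 0.1 # seconds
backoff_multiplier = 2.0
max_backoff        = 0.3 # seconds
max_attempts       = 5   # total attempts, including the original request

# One delay per retry (every attempt after the first).
(1...max_attempts).each do |retry_number|
  delay = [initial_backoff * backoff_multiplier**(retry_number - 1), max_backoff].min
  puts "retry ##{retry_number}: #{(delay * 1000).round} ms"
end
# Prints 100 ms, then 200 ms, then 300 ms for every further retry once
# maxBackoff caps the exponential growth.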

Validating the Service Config

In order to validate that these seemingly arbitrary JSON blobs were doing what we expected, we decided to create an actual gRPC application to simulate failure scenarios and see how the clients responded to them. Here is the basic protobuf schema for that service:

syntax = "proto3";

package sandbox.v1;

service SandboxService {
  rpc SimulateErrors(SimulateErrorsRequest) returns (SimulateErrorsResponse);
}

message SimulateErrorsRequest {
  string request_id = 1;
  repeated ResponseConfig responses = 2;

  message ResponseConfig {
    uint32 status_code = 1;
  }
}

message SimulateErrorsResponse {
  string request_id = 1;
  uint32 attempts = 2;
}

The idea behind the SimulateErrors RPC is that it allows you to “ask” for a specific series of responses based on a unique request_id. Here is an example using grpcurl:

grpcurl -plaintext -d '{
  "request_id": "123abc",
  "responses": [
    {"status_code": 14},
    {"status_code":  2}
  ]
}' -proto sandbox_service.proto \
  localhost:8080 sandbox.v1.SandboxService/SimulateErrors

This example produces the following responses:

# 1st request
ERROR:
  Code: Unavailable
  Message: request 1

# 2nd request
ERROR:
  Code: Unknown
  Message: request 2

# 3rd request
{
  "requestId": "123abc",
  "attempts": 3
}

It takes three separate grpcurl invocations to see these three responses, but a client configured to retry on both UNAVAILABLE and UNKNOWN errors could execute these automatically within a single RPC method invocation.
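Here is roughly what that client would look like in Ruby. The class and method names mirror what the Ruby gRPC code generator would produce from the schema above, but treat the details as illustrative:

require "json"

# Client-side service config that treats both UNAVAILABLE and UNKNOWN as
# retryable for the sandbox service.
stub = Sandbox::V1::SandboxService::Stub.new(
  "localhost:8080",
  :this_channel_is_insecure,
  channel_args: {
    "grpc.enable_retries" => 1,
    "grpc.service_config" => JSON.generate(
      methodConfig: [
        {
          name: [{ service: "sandbox.v1.SandboxService", method: "SimulateErrors" }],
          retryPolicy: {
            retryableStatusCodes: ["UNAVAILABLE", "UNKNOWN"],
            maxAttempts: 3,
            initialBackoff: "0.1s",
            backoffMultiplier: 2.0,
            maxBackoff: "0.3s"
          }
        }
      ]
    )
  }
)

# A single method invocation; the two scripted failures are retried
# transparently before the successful third attempt is returned.
response = stub.simulate_errors(
  Sandbox::V1::SimulateErrorsRequest.new(
    request_id: "123abc",
    responses: [
      Sandbox::V1::SimulateErrorsRequest::ResponseConfig.new(status_code: 14), # UNAVAILABLE
      Sandbox::V1::SimulateErrorsRequest::ResponseConfig.new(status_code: 2)   # UNKNOWN
    ]
  )
)
puts response.attempts # => 3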

The key to making this work is the request_id field, which the server uses to track individual requests and respond according to the number of times a request has been seen. Since clients retry by retransmitting the original request body, our request_id remains consistent between subsequent attempts.
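Our sandbox server's implementation isn't shown in this post, but a minimal Ruby handler along these lines captures the idea. It assumes the service classes generated from the schema above, and the in-memory counter is just for demonstration:

class SandboxServer < Sandbox::V1::SandboxService::Service
  def initialize
    # Tracks how many times each request_id has been seen. Not production-grade:
    # it is neither thread-safe nor bounded.
    @attempts = Hash.new(0)
    super()
  end

  def simulate_errors(request, _call)
    attempt = (@attempts[request.request_id] += 1)
    scripted = request.responses[attempt - 1]

    # While scripted responses remain, fail with the requested status code.
    if scripted
      raise GRPC::BadStatus.new_status_exception(scripted.status_code, "request #{attempt}")
    end

    # Once the scripted failures are exhausted, succeed and report the count.
    Sandbox::V1::SimulateErrorsResponse.new(
      request_id: request.request_id,
      attempts: attempt
    )
  end
end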

We verified each of the retry policy configuration values from our service config using this sandbox service as our gRPC backend.

“Gotchas” (Server-Side Streaming RPCs)

Most of the surprises we encountered during this investigation involved server-side streaming endpoints. As you may know, gRPC supports server-side streaming RPCs in which the server can stream multiple responses back to clients within the lifetime of a single request. These can be useful if you have a particularly large amount of data to return, or if the number of responses is potentially unbounded. Retrying these endpoints is a little more restrictive than a typical unary RPC.

Regardless of a client’s retry policies, it will only retry a server-side streaming RPC if an error is encountered before any streamed messages have been received. So, if you received the first of 100 streamed responses and then received an UNAVAILABLE error, you would be out of luck, and would have to manually reissue the request from your client in order to retry it.
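If you do need to recover after a stream has started, the fallback is application-level retry logic in the client. Here is a rough sketch using a hypothetical streaming stub method; note that it simply replays the stream from the beginning, so resuming mid-stream would require explicit support in your API (for example, an offset or cursor field in the request):

# Naive application-level retry for a server-streaming call.
def each_item_with_retry(stub, request, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    # my_streaming_method is a placeholder for your server-streaming RPC;
    # the returned enumerator raises if the stream fails partway through.
    stub.my_streaming_method(request).each do |item|
      yield item
    end
  rescue GRPC::Unavailable
    retry if attempts < max_attempts
    raise
  end
end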

The second “gotcha” we encountered with server-side streaming RPCs is that certain retry conditions can potentially abort your entire client process due to an apparent race condition in gRPC core. We were able to consistently trigger this edge case with our sandbox service; however, it’s unclear how realistic the scenario is in practice (for reasons we will outline later).

Essentially, if your gRPC server receives a request for a server-side streaming RPC and returns a gRPC error that the calling client considers retryable before streaming any responses back, the client process will die unrecoverably.

This is a compelling reason to closely audit your retryable status codes. For instance, UNKNOWN errors might seem like good candidates for retrying, but most gRPC application frameworks will return this when an exception isn’t otherwise explicitly handled. Therefore, an unexpected bug at the beginning of a streaming RPC handler could trigger this race condition on your client and have disastrous consequences.

A good rule of thumb is to only retry UNAVAILABLE errors (which no reasonable application would explicitly return from an RPC handler) and only add others after carefully considering how they might inadvertently be returned from your streaming endpoints.

Conclusion

If you find yourself needing to configure client-side retries in your Ruby gRPC clients, we hope this article will prove helpful. A well-tuned retry policy can save you considerable headaches, but you should always be careful and deliberate when making changes to avoid surprises!