Optimizing Push Notifications Timing

Andres Gutierrez
Brainz Engineering
Published in
4 min readJan 14, 2017

--

Sending push notifications to our players in a timely manner has always been priority to the engineering team here at Brainz. Every day we send thousands of notifications informing and notifying users about different events that happen in the game. Without the proper architecture to perform this delivery it can quickly become a bottleneck.

In this post, we will cover recent improvements in our delivery capabilities and reflections of our experience shipping production in Elixir/Erlang. We hope you’ll find it insightful!

Choosing Elixir
Choosing the programming language for a core system is a big decision. Elixir enables one to build robust, high-concurrent and complex systems quickly and fearlessly thanks to its powerful concurrency model inherited from Erlang that is perfectly suited for managing many concurrent TCP connections and processes.

Erlang and Elixir, are great fit for modern hardware. They have evolved to have more and more number of processors. Elixir’s programs are built in distributed manner even if they run on a single machine. This allows you to use the resources you have better.

Finally, iteration times weren’t long, we could adapt the system to the many requirements we encountered while shaping our ideas from the game industry.

Clustering
It is usually better idea to scale horizontally if possible, rather than vertically. Our push notifications solution can be executed in a cluster so we can have many nodes running in the same machine or in many servers. The idea is that you connect them in some sort of networked fashion. Handling the incoming requests is performed in some sort of shared manner.

Erlang and Elixir rely on message passing rather than function calls and sharing state, we can easily distribute charge to the many available servers. This makes it perfect for building systems, that are decentralized and fault tolerant.

The first node started becomes the master node and it subscribe to the publish/subscribe channels. More nodes can be added to the cluster by just turning them on. If a node fails it’s removed from the cluster and load is no longer assigned to it. If the master node dies, we keep a list of slave candidates to replace the master. Our replacement system allows us to replace the master in a few seconds if necessary.

Publish/Subscribe
We use Redis as a gateway to evenly distribute load from our backend servers to the delivery system. A publish–subscribe pattern is implemented where the backend server queues a payload directly to specific receivers in the cluster of push notifications.

This pattern also provides loose coupling and greater network scalability. Notification payloads are queued into multiple channels or topics in Redis (we can also have many Redis servers in case of identifying a single server
as a bottleneck).

Like many publish-subscribe messaging systems, Redis maintains feeds of messages in topics/channels. Producers write data to topics and consumers read from topics. When a message is received an O(1) lookup is performed by Redis in order to deliver the message to the registered subscriber.

During payload dispatch, many Elixir GenServer processes are started to host the aggregate root instance. It will subscribe to the many topics available and will fetch the aggregate’s event stream from the event store, to maintain its state. Any returned domain events are appended to the event stream in Elixir. The master process is always alive. So subsequent events routed to the same instance will not require rebuilding its state from the event store. The event stream allow us to control back-pressure and measure the quality of service. Messages are queued up in the Redis channels and then in a second queue available in the Elixir node.

Channels or partitions have their own pusher who reads from the log in the channel and distribute the load in the cluster to any free available node. Load are distributed using a round-robin schedule algorithm. Sending push notifications takes the same time and load in average, so isn’t necessary keep statistics about system load in each node attached to the cluster.

Messages in the queue are consumed concurrently by GenServer pushers, each node has the same amount of pushers running simultaneously. Depending on where you put your probe, you can optimize for different levels of latency and throughput, but this is something you have to define in order to have proper operational limits in our system.

Our push notifications service is comprised of several subsystems for loading notifications, delivering notifications using Amazon Simple Notification Service (SNS), and for processing events and results. Amazon’s SNS service is able to handle our load without issues. Also, it encapsulates the details of publishing events to APNS and GCM freeing us from having to implement and update the inner details of these platforms.

Conclusion & Successes
To know whether this endeavor is a success, we collect relevant metrics such as delivery counts and failures and persist them into our main data store. Writing our delivery system in Elixir has been hugely successful for us. We’ve been able to easily meet our performance and scaling goals with the application. We’ll continue improving our notification delivery system based on our findings and we’ll share them here in the blog.

--

--