Designing a Zero-Loss Kafka Data Pipeline
As Yieldbot grows, we are seeing tremendous load on our data pipeline. Today we process about 12 billion events per month, and the data pipeline is responsible for receiving, processing, and storing these events reliably. It’s well known that in distributed systems networks partition, servers die, and cables break. So how do we achieve the goal of reliability?
Apache Kafka is at the heart of Yieldbot’s data pipeline. It’s a durable, persistent queue in which immutable messages are stored across a cluster of machines in a fault-tolerant way. How Kafka is deployed is a critical piece of this puzzle when your event producers are spread across many locations. Here’s how we deploy Kafka in our data pipeline:
Adservers are the source of all of our events. We have several deployed across the US in various AWS regions. Each adserver runs a single-node Kafka cluster. The adserving process generates events and sends them to this local Kafka cluster, so events are persisted to local disk. Each adserver also runs a mirror-maker service, which is essentially a consumer and a producer running in the same process. The consumer reads events from the local Kafka cluster and hands them to the producer, which forwards them to the region Kafka cluster. Under normal conditions, events flow from the local Kafka cluster to the region Kafka cluster in near real time. If the adserver loses connectivity with the region Kafka cluster, the mirror-maker service pauses forwarding; as soon as connectivity is restored, it resumes from the point where it left off. The local cluster has a retention time of 2 days, so an adserver can stay disconnected for up to 2 days without losing any events.
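The pause-and-resume behavior described above can be sketched in a few lines. This is a toy simulation and not the actual mirror-maker code: `LocalLog`, `Forwarder`, and the `send` callable are hypothetical names, and an in-memory list stands in for the local Kafka topic. The point it illustrates is that the read position only advances after a successful send, so an outage pauses forwarding without skipping events.

```python
from typing import Callable, List


class LocalLog:
    """Stand-in for the adserver's local Kafka topic: an append-only list."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def append(self, event: str) -> None:
        self.events.append(event)


class Forwarder:
    """Stand-in for the mirror-maker: reads the local log, forwards to the region."""

    def __init__(self, log: LocalLog, send: Callable[[str], None]) -> None:
        self.log = log
        self.send = send      # hypothetical "produce to region cluster" callable
        self.committed = 0    # offset of the next event to forward

    def run_once(self) -> int:
        """Forward as many events as possible; return how many were forwarded."""
        forwarded = 0
        while self.committed < len(self.log.events):
            try:
                self.send(self.log.events[self.committed])
            except ConnectionError:
                break             # region unreachable: pause and keep the offset
            self.committed += 1   # advance only after a successful send
            forwarded += 1
        return forwarded


def demo() -> List[str]:
    """Simulate an outage between the adserver and the region cluster."""
    log = LocalLog()
    region: List[str] = []
    link = {"up": True}

    def send_to_region(event: str) -> None:
        if not link["up"]:
            raise ConnectionError("region cluster unreachable")
        region.append(event)

    forwarder = Forwarder(log, send_to_region)
    log.append("e1")
    forwarder.run_once()      # e1 is forwarded right away
    link["up"] = False
    log.append("e2")
    log.append("e3")
    forwarder.run_once()      # outage: nothing forwarded, offset unchanged
    link["up"] = True
    forwarder.run_once()      # resumes at e2, where it left off
    return region
```

In this sketch a crash between the send and the offset update would cause the event to be sent again on restart, so it leans toward duplicates over loss. Whether the real pipeline makes the same trade-off is not stated here.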
Region Kafka Setup
The region Kafka setup consists of a set of Kafka brokers and a set of mirror-makers. Their job is to receive events from all of the adservers in their region and forward them to the home Kafka setup. The region brokers have a retention period of 7 days, which means each region can stay disconnected from the home region for up to 7 days and still not lose any events.
There are 3 mirror-maker servers in this setup, both to distribute the load and for high availability. They are spread across 3 different AWS availability zones for better redundancy. The consumer part of each mirror-maker service owns a topic and is responsible for reading it and forwarding it to home Kafka via the producer part. Ownership is decided by an ephemeral znode maintained in ZooKeeper. If a mirror-maker server dies, its ZooKeeper session (held over a TCP connection) is torn down and the ephemeral znode disappears; the other mirror-maker servers discover this and one of them takes over the topic. The switch is near-instantaneous.
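The ownership handoff can be illustrated with a small in-memory model. This is not a ZooKeeper client and not our production code: `EphemeralRegistry`, the topic name `events`, and the session names `mm-1` and `mm-2` are all made up for the example. It only mimics the one property that matters here, namely that an ownership record vanishes when the session that created it ends.

```python
from typing import Dict, List, Optional


class EphemeralRegistry:
    """Toy stand-in for ZooKeeper ephemeral znodes (not a ZooKeeper client)."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}   # topic -> owning session id

    def try_acquire(self, topic: str, session: str) -> bool:
        """Create the ownership node if absent; True if `session` owns it."""
        return self._owners.setdefault(topic, session) == session

    def expire(self, session: str) -> None:
        """Session ended: every node that session created disappears."""
        self._owners = {t: s for t, s in self._owners.items() if s != session}

    def owner(self, topic: str) -> Optional[str]:
        return self._owners.get(topic)


def demo_failover() -> List[Optional[str]]:
    """Two mirror-makers compete for one topic; the first one then dies."""
    registry = EphemeralRegistry()
    history: List[Optional[str]] = []

    registry.try_acquire("events", "mm-1")   # mm-1 wins ownership
    registry.try_acquire("events", "mm-2")   # mm-2 loses and stays on standby
    history.append(registry.owner("events"))

    registry.expire("mm-1")                  # mm-1 dies, its node is removed
    registry.try_acquire("events", "mm-2")   # mm-2 retries and takes over
    history.append(registry.owner("events"))
    return history
```

With a real ZooKeeper ensemble the standby would typically be notified through a watch on the znode rather than by retrying, and the handoff would take up to the session timeout.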
Home Kafka Setup
The home Kafka setup is a set of several brokers. They receive events from the various regions and act as a centralized home for them. Consumers such as Storm topologies, Spark Streaming, and Elasticsearch read events from this centralized location. It also serves as a temporary home for the events until they are archived to permanent storage (Amazon S3).
The picture is not complete without good monitoring. The usual server and hardware level monitoring (CPU, disk, memory, etc.) is in place, and on top of that we have several application-level alerts. Sensu is used heavily at Yieldbot for our monitoring needs, and we have written several Sensu checks that monitor application-specific things. In particular, one Sensu check monitors how far the mirror-maker on each adserver is falling behind real time, and a watcher makes sure that this Sensu check is itself running. When the mirror-maker consumer lag goes above a certain threshold, alerts go out to email, Slack, and PagerDuty. We then choose between removing the adserver from our adserver discovery pool, in case of an extended outage, and simply debugging and fixing the network connectivity issue. Similarly, a Sensu check does the same job for the mirror-makers in the region Kafka setup.
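A lag check of this kind boils down to comparing two numbers against thresholds. The sketch below is an assumption about its shape, not our actual check: `check_lag` and its parameters are hypothetical, and it measures lag in offsets (messages not yet forwarded), although a check could equally measure lag in seconds. The 0/1/2 status values follow the exit-code convention that Sensu checks use for OK, WARNING, and CRITICAL.

```python
from typing import Tuple

# Exit statuses that Sensu checks conventionally report.
OK, WARNING, CRITICAL = 0, 1, 2


def check_lag(log_end_offset: int, committed_offset: int,
              warn: int, crit: int) -> Tuple[int, str]:
    """Return (status, message) for a mirror-maker consumer lag check.

    log_end_offset:   newest offset written to the source topic
    committed_offset: offset the mirror-maker consumer has reached
    warn, crit:       lag thresholds in messages (hypothetical values)
    """
    lag = max(log_end_offset - committed_offset, 0)
    if lag >= crit:
        return CRITICAL, f"CRITICAL: mirror-maker lag {lag} >= {crit}"
    if lag >= warn:
        return WARNING, f"WARNING: mirror-maker lag {lag} >= {warn}"
    return OK, f"OK: mirror-maker lag {lag}"
```

A wrapper script would print the message and exit with the status, for example `status, msg = check_lag(end, committed, warn=10_000, crit=100_000)` followed by `print(msg)` and `sys.exit(status)`. The threshold values shown are placeholders.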
We assume by default that software and hardware will fail, rather than the other way around. Therefore we have added redundancy and load balancing at every point in the data pipeline. We have monitoring checks and another set of checks that watch those monitoring checks: basically, watching the watcher. We increase capacity by adding more Kafka brokers and mirror-makers and rebalancing the Kafka topics across the new set of servers. To increase the retention period on Kafka topics, we increase the partition count, which spreads each topic's data across more brokers and leaves room on each broker to keep data longer.
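The relationship between retention, broker count, and disk can be estimated with a back-of-the-envelope formula. This is a rough sketch under stated assumptions, not a sizing tool we use: it assumes partitions are balanced evenly across brokers and ignores compression, indexes, and headroom. The function name and all input figures are illustrative only.

```python
def disk_per_broker_gb(ingest_gb_per_day: float, retention_days: float,
                       replication_factor: int, brokers: int) -> float:
    """Approximate disk each broker needs to hold the retained data.

    Assumes partitions (and therefore data) are spread evenly across brokers.
    """
    if brokers <= 0:
        raise ValueError("brokers must be positive")
    total_gb = ingest_gb_per_day * retention_days * replication_factor
    return total_gb / brokers
```

With illustrative inputs of 100 GB per day, 7 days of retention, and a replication factor of 2, four brokers would each need about `disk_per_broker_gb(100, 7, 2, 4)` = 350 GB, while eight brokers would each need about 175 GB. The same per-broker disk could then hold roughly twice the retention period.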
At Yieldbot, achieving reliability is a multi-pronged effort. With all the uncontrollable threats to reliability that exist, the key to safeguarding our data pipeline has been instituting layers of tool-based failsafes and monitoring. The way we see it, the best way to avoid having things fall through the cracks is by weaving as close to an impenetrable safety net as possible.
Kunal Nawale is a member of the Yieldbot Platform team in Boston and leads the development of key components of Yieldbot’s platform, such as the data pipeline and performance metrics database. He works closely with the Data and Engineering teams to create, advance, and scale Yieldbot technology.