Scaling Yieldbot’s Data Pipeline Architecture for Business Growth
“If it ain’t broke, why fix it?” was the question I asked myself when I was first introduced to the idea of upgrading the Yieldbot architecture. Dave, one of Yieldbot’s early engineers, was the original author of this part of the data pipeline. Internally, and lovingly, we referred to it as “Dave’s Tubes.”
Though it had been working fantastically over the past few years, Dave voiced concerns, suggesting it needed to be replaced. It was extremely refreshing that the person who fathered this architecture openly championed the need to move past his creation – no defensiveness, no ego, no hiding behind his seniority. (Getting an opportunity to work alongside these types of people and in this kind of culture is one of the perks of being a part of the #yieldbotfam. But I digress…)
Yieldbot has seen rapid expansion over the past few years, significantly growing traffic to our platform every month. We needed our data pipeline to handle this growth and we wanted to find a solution by only adding hardware – without having to worry about software changes. The current architecture while working great, lacked flexibility. Here’s a brief overview of what that architecture looked like:
The evt-producers sit on the edge and collect and forward events to the evt-archiver. The archiver knows all the producers it is connected to and, once it has heard from all the producers and they have moved on to a new hour, it archives the events that it has saved for the previous hour into S3. The archiver-monitor keeps track of the hours that the evt-archiver has archived. Once it has heard from both archivers it kicks off a job on the analytics-runner.
As you can see from the diagram above, if we needed to add five new producers, it required a tremendous amount of coordination between all these components. Additionally, the evt-archiver, the archiver-monitor, and analytics-runner are all single points of failure. It was time for a new architecture.
Apache Spark, S3, Mesos, Jake, and Secor to the Rescue!
Here’s a look at Yieldbot’s new and improved data pipeline architecture:
The evt-producers send events to Kafka topics based on the event type. Each Kafka topic has different number of partitions based on the traffic for that event type. Secor, our own adaptation of the open-source project Secor by Pinterest, reads the Kafka topics and saves the events locally. (We modified the existing project to account for more advanced upload, filtering, and reporting capabilities and also so that incoming Kafka messages could be sent to a custom S3 path.) After a configurable period of time, it uploads these local files to S3. When the traffic on certain event types increases, we simply increase the number of Kafka partitions and add more Secor instances. This takes care of increasing our archiving capacity on demand. It also addresses the single points of failure concern for the archiver. If a Secor instance crashes, another instance is brought up automatically by our Mesos Marathon framework. The new Secor instance simply starts up from where the previous one left off, since Secor uses ZooKeeper to maintain the Kafka offsets it has successfully read and archived.
Each Secor instance is responsible for reading a certain Kafka topic-partition. On every S3 upload, Secor records the hour for which the upload was done. Jake, a simple Clojure app monitors these records and, once all topic-partitions have moved onto a new hour, Jake triggers the analytics run for that hour in our Spark cluster. We also run our Spark cluster inside Mesos.
To summarize, here’s what Yieldbot gained from implementing this new architecture:
- No more component coordination or software changes to manage traffic growth
- The ability to scale, gained simply by allocating additional Secor instances
- High-availability that comes with the Mesos Marathon framework
- Analytics runs performed in Spark that can be scaled simply by increasing the size of the Spark cluster
In short, we’ve significantly reduced our growing pains. Now, when our CEO inevitably announces the addition of more publishers to our platform which always come with large increases in traffic, we’re ready for it.
Kunal Nawale is a member of the Yieldbot Platform team in Boston and leads the development of key components of Yieldbot’s platform, such as the data pipeline and performance metrics database. He works closely with the Data and Engineering teams to create, advance, and scale Yieldbot technology.