Architecting Elasticsearch Cluster Deployments
Elasticsearch is a fantastic technology built on Lucene. At Yieldbot, we use it to store and query billions of documents, populating our cluster with data from Kafka. We built a tool called Raigad that reads data from Kafka and indexes it into Elasticsearch in real time. Elasticsearch used to have a river plugin that did this job, but rivers were deprecated as of version 2.0, so we built Raigad to replace it. We will soon be open sourcing it. The indexed documents are then available for search and querying.

Elasticsearch serves two primary needs at Yieldbot. First, it answers queries for our mission critical applications. Second, it answers queries for our less critical applications, like the reporting and analytics services, which report on our ad serving performance based on ad event data. Spendcheck, one of our mission critical applications, decides whether to allow or disallow serving an ad based on various factors, which are deduced by querying Elasticsearch. This service cannot afford for the Elasticsearch cluster to be down. Our reporting and analytics services, on the other hand, can tolerate the cluster being unavailable for a little while. While still undesirable, it would not have a big (financial) impact on our business.
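Raigad is not open source yet, but its core job of turning batches of Kafka events into Elasticsearch bulk requests can be sketched. The helper below (names and document shape are illustrative, not Raigad's actual code) builds the NDJSON body that the Elasticsearch `_bulk` endpoint expects:

```python
import json

def to_bulk_payload(events, index_name):
    """Build an Elasticsearch _bulk request body (NDJSON) from event dicts.

    Each document needs an action line followed by its source line,
    and the whole payload must end with a newline.
    """
    lines = []
    for event in events:
        # Action line: tell Elasticsearch which index (and type) to write to.
        lines.append(json.dumps({"index": {"_index": index_name, "_type": "event"}}))
        # Source line: the document itself.
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"

# A consumer loop (Kafka client omitted here) would call this per batch
# and POST the payload to http://<node>:9200/_bulk.
payload = to_bulk_payload([{"ad_id": 1, "action": "click"}], "events-2016.01.01")
```

Batching many documents into one `_bulk` call, rather than indexing one document per request, is what makes this kind of pipeline keep up with high-volume Kafka topics.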
Elasticsearch has worked great for us for the past few years. However, we have seen problems: EC2 nodes going down, network partitioning, a rogue query exhausting heap memory on the nodes. Since we had several hundred indices with several thousand shards, any node going down turned the cluster yellow or red, making it partially or completely unavailable. To bring the cluster back to a healthy green state, we had to stop live indexing, let the shards recover and rebalance, and only then turn live indexing back on. Sometimes this whole process took several hours, sometimes a whole night, all while our business critical applications were suffering. Not to mention the sleepless nights and anxiety for our engineers.
We came up with the following architecture to mitigate some of these problems.
We divided our Elasticsearch deployment into several clusters according to application function.
Mission Critical Cluster
A small 6 node cluster that holds only the most recent 5 days of data. Yieldbot's mission critical services are only interested in recent data. By limiting the number of indices and shards, this cluster recovers very quickly when nodes go down. We also moved the reporting and analytics services off this cluster, so large time range queries are no longer executed here. Load dropped, and the queries that used to exhaust heap memory disappeared.
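With index-per-day naming (an assumption here; e.g. `events-2016.03.01`), keeping only 5 days of data reduces to computing which index names fall inside the window. A minimal sketch:

```python
from datetime import date, timedelta

def recent_indices(today, days=5, prefix="events-"):
    """Names of the daily indices the mission critical cluster keeps.

    Only the most recent `days` indices stay on this cluster; anything
    older can be dropped, keeping shard count (and recovery time) small.
    """
    return [
        prefix + (today - timedelta(days=n)).strftime("%Y.%m.%d")
        for n in range(days)
    ]

names = recent_indices(date(2016, 3, 5))
```

A nightly job would delete any index on the cluster whose name is not in the returned list.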
Analytics Umbrella Cluster
An umbrella of several smaller sub-clusters was created, each holding specific indices according to its function.
1. Tribe Nodes
We distributed our indices across many sub-clusters, fronted by a set of 3 nodes called tribe nodes. Tribe nodes are special Elasticsearch nodes that keep track of which index resides on which cluster. They act as front-end proxies for queries coming in from clients, and they merge and aggregate query results when responses come back from multiple clusters.
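A tribe node is configured by listing the clusters it should join in its `elasticsearch.yml`. A minimal sketch, with illustrative cluster names (the actual names and discovery settings will differ per deployment):

```yaml
# elasticsearch.yml on one of the three tribe nodes.
# The tribe node joins each sub-cluster as a client, learns the
# cluster state of each, and routes queries to the right one.
tribe:
  hot:
    cluster.name: hotevents-cluster
  cold:
    cluster.name: coldevents-cluster
  agg:
    cluster.name: aggregation-cluster
```

Clients only ever talk to the tribe nodes, so indices can be moved between sub-clusters without reconfiguring every consumer.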
2. Hotevents Cluster
This is a small 6 node cluster with compute heavy nodes. It receives the live indexing traffic and keeps only the past 3 days of indices. Its configuration settings are tuned to give the indexing processes more resources than the search processes. Every day at 3 am, a cron job merges the previous day's Lucene segments and snapshots the index to an S3 repository.
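The nightly job can be sketched as a crontab fragment against the Elasticsearch REST API. The repository name, index naming, and GNU `date` usage are assumptions; `_forcemerge` and the snapshot endpoint are the standard APIs for segment merging and snapshotting:

```
# Illustrative crontab on the hotevents cluster.
# 3:00 am: merge yesterday's Lucene segments down to one per shard.
0 3 * * * curl -XPOST 'localhost:9200/events-'$(date -d yesterday +\%Y.\%m.\%d)'/_forcemerge?max_num_segments=1'
# 3:05 am: snapshot yesterday's index into the S3-backed repository "s3_repo".
5 3 * * * curl -XPUT 'localhost:9200/_snapshot/s3_repo/events-'$(date -d yesterday +\%Y.\%m.\%d)
```

Merging before snapshotting matters: a single large segment per shard snapshots (and later restores and searches) faster than many small ones.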
3. Coldevents Cluster
This is a large cluster with storage heavy nodes. Its configuration settings are tuned to give the search processes more resources than the indexing processes. It does not receive any live indexing traffic; instead, a cron job populates it by restoring the indices previously snapshotted from the hotevents cluster.
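The restore side is the mirror image of the snapshot job, again sketched as an illustrative crontab fragment (same assumed repository and naming as above); `_restore` is the standard snapshot-restore endpoint:

```
# Illustrative crontab on the coldevents side: pull yesterday's index
# out of the shared S3 repository once the hotevents snapshot is done.
0 4 * * * curl -XPOST 'localhost:9200/_snapshot/s3_repo/events-'$(date -d yesterday +\%Y.\%m.\%d)'/_restore'
```

Because both clusters point at the same S3 repository, data moves from hot to cold without the clusters ever talking to each other directly.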
4. Aggregation Cluster
Every hour a Spark job runs aggregations over the raw event data we keep in S3. Common queries can be answered by looking up these pre-computed aggregation documents rather than asking Elasticsearch to run the same aggregations, saving compute time. These documents are stored in this separate cluster: a small 6 node cluster with resources split evenly between search and indexing processes.
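The idea behind the pre-aggregation can be shown with a much simplified stand-in for the Spark job (field names are illustrative): roll raw events up into count documents, so a common "how many clicks did ad X get" query reads one small document instead of aggregating millions of raw events.

```python
from collections import Counter

def aggregate_events(events):
    """Roll raw ad events up into pre-computed count documents.

    Counts events per (ad_id, action) pair; each resulting document
    answers one common query without any aggregation at query time.
    """
    counts = Counter((e["ad_id"], e["action"]) for e in events)
    return [
        {"ad_id": ad_id, "action": action, "count": n}
        for (ad_id, action), n in counts.items()
    ]

docs = aggregate_events([
    {"ad_id": 1, "action": "click"},
    {"ad_id": 1, "action": "click"},
    {"ad_id": 2, "action": "view"},
])
# docs would then be bulk-indexed into the aggregation cluster.
```

The trade-off is the usual one for pre-aggregation: only the query shapes you anticipated are cheap, while ad hoc queries still go to the raw event clusters.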
So what have we achieved with this architecture? Here are some of the advantages we see:
- Separating the indexing and search clusters lets us tune each cluster according to its needs.
- Memory pressure on the indexing cluster is low since it has very few indices, so its recovery time is very fast in the event of node failures. We also do not have to stop indexing, since the nodes can handle indexing traffic alongside recovery operations.
- With multiple smaller clusters, software upgrades are much smoother and quicker.
- With multiple clusters of multiple nodes each, you might expect the cost to be very high. However, since we picked node types according to each cluster's function, the overall cost has not risen much. In fact, we have doubled the retention time and storage without doubling the cost, compared to the previous architecture.
- When our traffic inevitably grows over the next few months, we will just have to increase the storage capacity on the coldevents cluster. And storage is cheap.
Designing Elasticsearch deployments is not an easy task; there are no standard recipes or solutions. Our approach was to first look at our traffic patterns, our use cases, and the issues we ran into, and then design a solution that addressed them. It would be unfair to say we came up with this solution in one shot; there were multiple iterations until we found one that worked. We have been very happy with it so far.
This solution would not have been possible without help from my amazing co-workers: Fatih Cetinkaya for pointing out the use case patterns; Dave White, our Elasticsearch wizard, for helping with the nitty gritty details of Elasticsearch; Boris Jonica for helping switch over the mission critical services; Anthony Spring for the Chef cookbook guidance; Sunita Kawane and Pratik Narode for helping with testing; Rich Shea, our CTO, for letting us innovate, fail, and innovate freely; and last but not least, Jonathan Mendez, our CEO, for building this business and making it possible for us to create technological innovations.
Kunal Nawale is a member of the Yieldbot Platform team in Boston and leads the development of key components of Yieldbot’s platform, such as the data pipeline and performance metrics database. He works closely with the Data and Engineering teams to create, advance, and scale Yieldbot technology.