How Yieldbot Uses Tools & Trust for Better Monitoring
Monitoring. We all know it’s necessary, and yet we constantly battle it. There’s no shortage of opinions on how monitoring should work, and most of the time it’s tough to reach a consensus on how to implement it most effectively.
Using the Right Tools
Here at Yieldbot, the infrastructure team has built a monitoring stack based on core company tenets of responsibility and empowerment. The developer who wrote and maintains a product is the person best able to detect and solve any adverse conditions that may arise. This not only fosters a sense of team among the application and infrastructure developers, but also gives us monitoring insights from a wider perspective.
The more detailed the information about the conditions to be detected, the better the monitoring surrounding them can be. The infrastructure team strongly believes in building simple tools that let application developers write effective monitors and deploy and maintain them autonomously. Currently, the frontline monitoring stack includes Graphite, StatsD, and Consul, with Sensu acting as a monitoring router. Notifications are handled via internally written Slack, email, and PagerDuty handlers, and the types of monitors actually employed are determined by the event or data that needs to be captured.
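The StatsD piece of that stack is simple enough to sketch directly: metrics arrive as plaintext `name:value|type` packets over UDP, where the type is `c` (counter), `g` (gauge), or `ms` (timing). The metric names below are illustrative, not Yieldbot's actual names:

```python
import socket

def format_metric(name, value, metric_type="c"):
    """Render one metric in the StatsD plaintext wire format."""
    return f"{name}:{value}|{metric_type}".encode()

def send_metric(name, value, metric_type="c", host="localhost", port=8125):
    """Fire-and-forget a single metric at a StatsD daemon over UDP."""
    payload = format_metric(name, value, metric_type)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

if __name__ == "__main__":
    # Count one event and record a timing; hypothetical metric names.
    send_metric("app.events", 1, "c")
    send_metric("app.render_time", 320, "ms")
```

Because the transport is UDP, instrumented application code never blocks or fails if the StatsD daemon happens to be down, which is part of what makes it attractive for frontline metrics.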
While standard system-level monitors such as CPU and disk checks are heavily utilized, they’ll soon give way to rate-of-change monitoring based on time-series data. In this stack, we will include StatsD, InfluxDB, Grafana, and Riemann, with Sensu once again acting as a router.
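The idea behind rate-of-change monitoring is to alert on how fast a metric is moving rather than on its absolute level: a disk at 60% that is filling at 5% per minute is more urgent than a quiet disk at 85%. A minimal illustration of the derivative such a check would threshold on, not tied to any particular tool above:

```python
def rate_of_change(samples):
    """Per-second rate between consecutive (timestamp, value) samples.
    This is the quantity a time-series alert would compare to a threshold."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates

# E.g. disk-usage samples taken every 10 seconds:
samples = [(0, 100), (10, 150), (20, 250)]
```

A threshold check would then fire when any recent rate exceeds an allowed slope, instead of waiting for the raw value to cross a static line.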
Flexibility in Monitoring
Many developers write monitors for their own products; the tools we provide give them a lot of flexibility. In some cases the developers need only expose some JSON on a given port, add a few simple lines to a Chef recipe, and a few minutes later their monitor is sitting in the dashboard. As time goes on, this may be made even simpler using internally developed tools and processes. Leveraging the existing Consul and Mesos infrastructure, monitoring can dynamically ascertain the address of the product and ensure that it is performing optimally. And if a developer changes the location or other select parameters, the monitoring can dynamically adjust itself without intervention from either the developer or the infrastructure team.
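The "expose some JSON on a given port" step can be as small as a tiny HTTP handler. A minimal sketch using only the standard library, with hypothetical metric names, port, and endpoint path, since the article doesn't specify them:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def collect_metrics():
    """Gather whatever the product wants monitored into a flat dict.
    These metric names are placeholders."""
    return {"queue_depth": 0, "events_processed": 0}

class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the metrics dict as JSON at /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = json.dumps(collect_metrics()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    """Block forever, serving metrics for the monitoring check to scrape."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

With something like this running, the remaining developer-facing work is the few lines in a Chef recipe that tell Sensu where to look.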
A good example of the flexibility of our monitoring stack at Yieldbot is our monitoring of Kafka, which sits as the backbone of moving large amounts of data at high speed across our infrastructure. In order to monitor these flows of data, developers wrote a small Python application called Bongo that runs as a service in our Mesos infrastructure. Bongo gathers the needed Kafka-related metrics, aggregates them, and exposes them on an endpoint. Due to the dynamic nature of Bongo, monitoring built with static configuration files will not work: there is no way to tell which slave Bongo will be running on at any given time.
In this case, the only configuration necessary for the check was the relative endpoint where the metric lived. The check itself will query Mesos for where Bongo is running, fetch the metrics, and push them via RabbitMQ to the Sensu server. There, StatsD takes over. Determining the location of Bongo involves nothing more than calling a library and passing a product name to a function, the result being the current location.
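A hedged sketch of that discovery step, assuming the task layout of the Mesos master's `/state` document and using placeholder names ("bongo", `/metrics`) since the article doesn't show the real ones; the RabbitMQ hand-off to Sensu is omitted:

```python
import json
from urllib.request import urlopen

def find_task(state, name):
    """Locate a running task by name in a Mesos master /state document,
    returning (hostname, port). Assumes the task was allocated one port."""
    hosts = {s["id"]: s["hostname"] for s in state.get("slaves", [])}
    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            if task.get("name") == name and task.get("state") == "TASK_RUNNING":
                # Mesos encodes port allocations as a range string, e.g. "[31002-31002]"
                ports = task["resources"]["ports"]
                port = int(ports.strip("[]").split("-")[0])
                return hosts[task["slave_id"]], port
    return None

def fetch_metrics(master_url, name, path="/metrics"):
    """Resolve the task's current location, then pull its JSON metrics."""
    with urlopen(f"{master_url}/state") as resp:
        state = json.load(resp)
    located = find_task(state, name)
    if located is None:
        raise RuntimeError(f"{name} is not running anywhere")
    host, port = located
    with urlopen(f"http://{host}:{port}{path}") as resp:
        return json.load(resp)
```

The key property is that nothing in the check is pinned to a host: if Mesos reschedules Bongo onto another slave, the next run of the check simply resolves the new location.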
A Boolean, Metrics They Are Not
But Bongo brings up an important distinction in our monitoring environment: metrics checks are not state checks. The check described above will do nothing more than fetch whatever is at the endpoint. If there are any issues with finding the endpoint or retrieving the data, the script will simply fail gracefully. Sensu will detect a non-OK status and, consequently, will not push any output to StatsD. There’s a separate state check that will validate the endpoint and the JSON data returned.
This may seem a little excessive, but with regard to monitoring, we hold to the Unix philosophy of doing one thing and doing it well. Pushing metrics and ascertaining the state of an endpoint are two distinct tasks and should ideally be treated as such. Using this design, when we have an endpoint failure we get a single notification, and all metrics checks have their own associated endpoint checks as dependencies.
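A sketch of what such a separate state check could look like, using the Nagios-style exit-code convention Sensu honors (0 = OK, 1 = WARNING, 2 = CRITICAL); the URL handling and messages are illustrative:

```python
import json
import sys
from urllib.error import URLError
from urllib.request import urlopen

def validate_payload(body):
    """Return a (status, message) pair for a raw response body."""
    try:
        json.loads(body)
    except ValueError:
        return 2, "CRITICAL: endpoint returned invalid JSON"
    return 0, "OK: endpoint returned valid JSON"

def check(url, timeout=5):
    """Answer only one question: is the endpoint up and serving valid JSON?"""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return validate_payload(resp.read())
    except URLError as exc:
        return 2, f"CRITICAL: cannot reach endpoint: {exc}"

if __name__ == "__main__" and len(sys.argv) > 1:
    status, message = check(sys.argv[1])
    print(message)
    sys.exit(status)
```

Declaring the metrics check dependent on this state check in Sensu is what collapses an endpoint failure into a single notification instead of a storm.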
Leveraging tools such as Consul and Sensu, monitoring has been able to fully adapt to and integrate with a variety of cloud providers. Generally speaking, from the standpoint of monitoring, there is no difference between AWS, Google Compute, or a colo in Gainesville, Florida. If you want it monitored, we just need to know its name and thresholds.
The rest, as they say, is magic.
Matt Jones is a Yieldbot Engineer on the Infrastructure Team. Located in our Boston office, Matt works to provide tools that enable developers to monitor and respond efficiently to events, ensuring Yieldbot technology operates as efficiently and effectively as possible.