Golang, Yieldbot, Sensu and Now You!
Network, system, and application monitoring have come a long way since the late 1990s, when many of us used simple ping monitors or prohibitively expensive solutions that required an entire team to maintain. As our systems have evolved from rows of racks containing bare metal workhorses, to large virtualized environments, to now containers and microservices, our monitoring tools and practices have had to evolve as well. Some have evolved better than others, and much of that was due to the developers’ foresight in the design and implementation.
As previously noted, Sensu is used as a core monitoring router in our environment. When we decided to move towards an infrastructure focused on microservices and containers we had to change some of the ways in which we use Sensu and empower the developers to write checks and handlers. One of those ways was the underlying language that the core-infrastructure team uses for its tooling.
Today we are ready to push a few of the early checks and handlers we wrote out to the world. For those who just want to look at the code rather than the why and how of what we did, here are the links.
- Sensu Process Checks (golang)
- Sensu Chrony Checks (golang)
- Sensu File Checks (golang)
- Sensu Sensu-Server Checks (golang)
- Sensu Elasticsearch Handler (golang)
- Sensu Slack Handler (golang)
- Sensu Email Handler (unmaintained ruby)
- Sensu Pagerduty Handler (unmaintained ruby)
- Sensu Plugins Library (golang)
RUNTIMES, INTERPRETERS, AND DEPENDENCIES…OH MY!
One of the many challenges with containers is keeping things as small and compact as possible. To that end, carrying around a Ruby runtime is not desirable. The Sensu client has its own embedded Ruby, and we rely on this, but do we really need the client installed in every container? Our current thinking is no: a single client living in each containerized environment is all that is needed. Without the client, though, we have no ability to execute checks in Ruby if needed, as much of our environment is built with either Node.js or Clojure.
We have always supported checks and handlers being written in the language the developer is most comfortable with, but there is also a need for the core team to write very specific apps in a language common to us. After some performance testing that only confirmed what we already knew to be true, we settled upon Golang last October.
AND THE WINNER IS…
Golang was chosen over other candidate languages, including Rust and C, for several reasons. Without going too in-depth: Rust was knocked out of the running due to perceived immaturity and a lack of established knowledge at Yieldbot, and C was not picked because Golang solves many of C’s design flaws, is easier to get started with, and, again, C is not currently used to any large extent at Yieldbot.
Of the many factors tipping the Golang scales, a major one was our heavy reliance on tools designed and written by HashiCorp, including Consul and Terraform. There was already a base of accumulated knowledge around Golang that we could build from and continue to foster.
There were still hurdles including development, build, and release pipelines. Once these were put in place the code and designs started to flow nicely. Many of the tools are already on their third or fourth major revision as our skill and the time spent with the language have increased in the last few months.
While most standard checks will not be run in containers, there has still been a need at times for some special snowflakes. One of those instances was the need to check the number of open file handles for a given process. This simple check was the first to incorporate Cobra and Viper to handle the common, low-level bits.
We also have several other golang checks, including ones for Chrony and Sensu. Along with Cobra for scaffolding the checks and Viper for managing their configuration via Consul KV or a yaml file, we have incorporated Goconvey for unit testing and logrus for logging check and handler data to syslog, to be consumed by an ELK stack.
There are several solutions to the golang vendoring issue, and we have tried a few, but a standard one has yet to be fully adopted and incorporated into our workflow.
WHAT ARE THE CHECKS REALLY DOING
Logging is often overlooked when writing checks, and this is understandable. Many times the developer is writing something in response to a recently solved condition and just wants to get it in place and move on. We feel checks should be more thought out and treated as independent projects, so in some cases we take things to another level.
Using logrus allows us to keep our check output clean and also makes it easier to detect and diagnose false negatives (missed alert conditions). The Sensu client log can be quite noisy on large systems; by sending specific messages to another source, we can make detecting and then troubleshooting problems easier.
The code to enable this is really quite simple.
This will check for a configuration value using Viper if it is not supplied via the command line. As can be seen in the code, the log will contain the check’s name, the hostname of the machine running the check, the version of the check, and a plain English error message.
The output below is a nice structured log that can be easily parsed using automated methods. In cases where a developer is reading the log directly, we can switch the output to ASCII, for example when writing to stderr.
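As an illustration (the field values are assumed, not taken from the original post), a structured entry from logrus's JSONFormatter looks something like this:

```json
{"check":"check-chrony","client":"web01","level":"error","msg":"chrony clock offset exceeds threshold","time":"2016-03-28T09:30:00Z","version":"0.1.0"}
```

Switching the logger's `Formatter` to `&logrus.TextFormatter{}` yields the human-readable form instead.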
WHAT DO YOU MEAN 2+2 != 4
When it comes to testing, there are several good packages, including the standard library’s testing package. For multiple reasons, including ease of use and its web interface, Goconvey is the most widely used here.
A simple check to make sure the date is converted properly from a human readable format to epoch is a good example.
The Golang function.
Let’s make sure it works.
You don’t get much easier than that, and the web interface is simply icing on the cake. Just run ./goconvey in the project root and a web browser opens with the following details: the center pane gives you your results and any debugging output, and the right and left panes give details about the tests being run, including coverage. The nice part is that it continues to run in the background as files are modified, and the tests are automatically re-run à la guard or watch. This is great when doing test-driven development.
We initially started with a Sensu handler that takes event data and drops it into Elasticsearch, both to provide developers a way to create their own custom dashboards using Kibana or Dashing and, in the future, to give them a way to visualize historical and trending data.
More on this specific topic will be forthcoming but the repo for this project can be found here.
At this point the proof of concept for Golang, and the specific use case we initially had in mind, was complete. But if you aren’t trying to break anything, you aren’t innovating, so where to go from here?
Before going further, we pulled out several of the functions and data structures needed for Sensu handlers, along with some other common functions, and created a library that works much like the sensu-plugin gem. This library gives us a single place to define the exit codes we want to monitor on, as well as a common exit function and a method to obtain check results, both modeled in part on functionality found in the aforementioned gem.
Slack is turning into a fairly ubiquitous tool in IaC circles. To this end we created a Slack handler that sends a subset of event data to a given Slack channel as an attachment.
While reusing much of the same functionality as the ES handler, we created a few new functions, such as a standard set of colors to flag important notifications. It is well known that the eye reacts to color faster than to words, and on a mobile device this is even more true. This color set will also apply to the rich, dynamic email handler that is currently being developed.
Another thing to note above is the distinction between the Sensu client and the monitored instance; this is directly related to JIT clients and SNMP checks. There are many cases where the client could be reporting results from an outside source, so we built that functionality in at the outset, along with dynamic links to the Uchiwa dashboard and to the repo containing the playbook associated with the check.
As a general rule Slack should not be used for alerts, but in some cases it makes sense. It is up to the individual developer to decide whether it fits their product.
The backbone of our alerting is email based.
This is what is currently running; it is written in Ruby with an Erubis template. It has many of the features of the Slack notification, but when fully ported to golang it will be able to do much more. A few of the requests from various teams include:
- a link to Grafana, Kibana, or some other tool giving more context to the alert
- results from any automated contextual awareness
- results from any automated remediation
- a simplified list of additional alerts on the monitored instance to provide additional context
The point of much of this data is to help decide whether an alert is actionable. Non-actionable alerts, also known as false positives, should be kept to a minimum, ideally less than 10%. Alerts in general should only be used to make a decision; adding too much data to an alert risks generating unmanageable noise, and eventually people will just ignore it.
There are still several things in the works that have not been fully fleshed out yet. Automated remediation is something we are starting to look at, along with continuing to pack more context from the client into the alert output while still maintaining an even keel with the signal-to-noise ratio. Anything that helps remove false negatives from our stream while keeping false positives to a minimum would be a welcome addition.
There are cross-platform disk, cpu, and memory checks with built-in contextual awareness in the works as well as a statsd handler.