How we made monitoring simple with TIG stack

TIG is a popular monitoring stack comprising Telegraf, InfluxDB, and Grafana.

  • Telegraf is a monitoring agent that can collect metrics and send them to a destination.
  • InfluxDB is a time series database, optimized for storing data points associated with a timestamp, and querying those data points based on a time range.
  • Grafana is a dashboard and visualization tool.

When we started our monitoring project, we inherited a TIG stack with a separate InfluxDB instance per region. It did 90% of what we needed, but it had a complicated deployment strategy and was not easy to modify. The goal was to consolidate to a single InfluxDB instance per tier (QA, Staging, Prod) for cost savings and aggregated reporting.

During this project, the Telegraf config didn’t change very much. It was pretty simple: an Influx line protocol input, and an InfluxDB output using authentication. The Telegraf config file was managed using Salt. We would change the output from the per-region InfluxDB instances to the shared instance.
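The shape of that config looks roughly like this. This is a sketch, not our actual file: the listener port, hostname, database name, and credentials below are all placeholders.

```toml
# Accept metrics in Influx line protocol from local applications.
[[inputs.influxdb_listener]]
  service_address = ":8186"

# Forward everything to the per-region InfluxDB instance
# (later, this URL points at the shared instance instead).
[[outputs.influxdb]]
  urls = ["https://influxdb.us-east-1.example.com:8086"]
  database = "telemetry"
  username = "telegraf"
  password = "$INFLUX_PASSWORD"   # injected from the secrets manager by Salt
```

Because Salt templated this file, the cutover was a one-line change to `urls` rolled out across the fleet.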

The shared InfluxDB instance would live in a US-based region, which meant higher latency from international services. This was an acceptable trade-off, and I believed that Telegraf’s internal buffering would absorb the additional latency without causing issues for the applications we monitored.

First step: Deploy the new InfluxDB instance

  • This was the simplest step: add the new instance as a new “region” alongside the existing regions.
  • We stored the credentials in the same secrets manager as the original instances.

Second step: Update Telegraf config to point to new InfluxDB instance

  • Most Telegraf configs were deployed with Salt, and a few were deployed with Ansible inside an AMI. Either way, updating the config was straightforward.
  • While updating the Telegraf config, we added tagging to indicate the origin of the telemetry. We called this the dc tag because it indicated which datacenter the metrics were coming from. This tag preserved the previous ability to report on an individual environment.
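Telegraf supports this directly via `[global_tags]`, which stamps every metric the agent emits. A minimal sketch, assuming the value is filled in per host by Salt (the region name here is illustrative):

```toml
# Applied to every metric this agent sends, so the shared InfluxDB
# instance can still be filtered/grouped per datacenter.
[global_tags]
  dc = "us-east-1"   # templated from Salt grains/pillar per host
```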

Third step: Iterate and make sure all instances are cut over to the new shared instance

  • I wrote an InfluxQL query to check for the existence of telemetry with a recent timestamp. This was the canary that told us whether a region-specific instance was still being used.
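The canary was along these lines, run against each old region-specific instance; the measurement pattern and time window are illustrative rather than our exact query:

```sql
-- If this returns any counts, something in the region is still
-- writing to the old endpoint and hasn't been cut over yet.
SELECT count(*) FROM /.*/ WHERE time > now() - 1h
```

An empty result across all measurements meant the instance was safe to decommission.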

Fourth step: Destroy the region-specific instances

  • Save money and make things simpler.
  • The end state: a single multi-tenant InfluxDB configuration per tier.

Gotchas:

  • During the migration, we also decided to refactor some Ansible, moving one of the roles from a shared base module into the specific playbooks that needed it. This surfaced a gotcha: that AMI was now using apt-get to install the latest version of Telegraf, rather than the pinned version the shared base module had been installing. To resolve this, we went through the Telegraf config and updated several config items from deprecated names to the new names used by the latest version of Telegraf.
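To give a flavor of the kind of rename involved (this particular option may or may not have been one of ours): older Telegraf configs used a singular `url` on the InfluxDB output, which newer versions deprecate in favor of `urls`:

```toml
[[outputs.influxdb]]
  # Deprecated form accepted by older Telegraf:
  # url = "https://influxdb.example.com:8086"

  # Form expected by recent versions:
  urls = ["https://influxdb.example.com:8086"]
```

Pinning the package version in the playbook (rather than installing latest) would have avoided the surprise in the first place.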