Prometheus & Ansible

a way to manage monitoring

Roman Demachkovych

Paweł Krupa

Ansible

  • simple
  • agentless
  • one basic dependency - Python
  • config in YAML and Jinja2

Prometheus

Open source, metrics-based monitoring system.

It does one thing and does it well.

Simple text format makes it easy to expose metrics to Prometheus.

The data model identifies each time series by an unordered set of key-value pairs called labels.

Scraped data is stored in local time-series database.

PromQL expression language allows easy metrics selection and aggregation.

PromQL

  • create graphs
  • set alert rules (see the rules file sketch below)
  • expose data
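
A minimal sketch of a Prometheus 2.x rules file that puts PromQL to work for aggregation and alerting; the group, rule and alert names are illustrative, while the metrics (http_requests_total, up) are the ones shown in the exposition format example later on.

groups:
  - name: example.rules
    rules:
      # Recording rule: pre-aggregate the request rate per job.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Alerting rule: fire when a scrape target has been down for 5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical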

Architecture

Caution!

If you need 100% accuracy, such as for per-request billing, Prometheus is not a good choice as the collected data will likely not be detailed and complete enough.

How to gather data?

Metrics exposition format

# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} 73334.095
http_request_duration_microseconds{handler="prometheus",quantile="0.9"} 85549.187
http_request_duration_microseconds{handler="prometheus",quantile="0.99"} 183985.353
http_request_duration_microseconds_sum{handler="prometheus"} 8.432908577878979e+09
http_request_duration_microseconds_count{handler="prometheus"} 109800
# HELP http_request_size_bytes The HTTP request sizes in bytes.
# TYPE http_request_size_bytes summary
http_request_size_bytes{handler="prometheus",quantile="0.5"} 178
http_request_size_bytes{handler="prometheus",quantile="0.9"} 178
http_request_size_bytes{handler="prometheus",quantile="0.99"} 178
http_request_size_bytes_sum{handler="prometheus"} 1.9546806e+07
http_request_size_bytes_count{handler="prometheus"} 109800
# HELP http_requests_total Total number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="prometheus",method="get"} 109800
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes summary
http_response_size_bytes{handler="prometheus",quantile="0.5"} 21881
http_response_size_bytes{handler="prometheus",quantile="0.9"} 21898
http_response_size_bytes{handler="prometheus",quantile="0.99"} 21919
http_response_size_bytes_sum{handler="prometheus"} 2.400906706e+09
http_response_size_bytes_count{handler="prometheus"} 109800
# HELP node_arp_entries ARP entries by device
# TYPE node_arp_entries gauge
node_arp_entries{device="docker0"} 1
node_arp_entries{device="eth0"} 4

Embed into software

Official client libraries:

* Go
* Java or Scala
* Python
* Ruby

Unofficial third-party client libraries:

* Bash
* C++
* Common Lisp
* Elixir
* Erlang
* Haskell
* Lua for Nginx
* Lua for Tarantool
* .NET / C#
* node.js
* PHP
* Rust

Or use metrics exporters

Core components starting at 9090

* 9090 - Prometheus server
* 9091 - Pushgateway
* 9093 - Alertmanager
* 9094 - Alertmanager clustering

Exporters starting at 9100

* 9100 - Node exporter
* 9101 - HAProxy exporter
* 9102 - StatsD exporter
* 9103 - Collectd exporter
* 9108 - Graphite exporter
* 9110 - Blackbox exporter

Write your own!

import json
import time
import urllib2
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class JenkinsCollector(object):
  def collect(self):
    metric = GaugeMetricFamily(
        'jenkins_job_last_successful_build_timestamp_seconds',
        'Jenkins build timestamp in unixtime for lastSuccessfulBuild',
        labels=["jobname"])

    result = json.load(urllib2.urlopen(
        "http://jenkins:8080/api/json?tree="
        + "jobs[name,lastSuccessfulBuild[timestamp]]"))

    for job in result['jobs']:
      name = job['name']
      # If there's a null result, we want to export a zero.
      status = job['lastSuccessfulBuild'] or {}
      metric.add_metric([name], status.get('timestamp', 0) / 1000.0)

    yield metric

if __name__ == "__main__":
  REGISTRY.register(JenkinsCollector())
  start_http_server(9118)
  while True: time.sleep(1)

Connect Prometheus

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'localhost:9100'

Visualise.

Promdash

Grafana!

Prometheus integration

  • Datasource support (provisioning sketch below)
  • Prometheus dashboard
  • PromQL autocomplete
  • Alerts
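
A minimal sketch of Grafana 5+ datasource provisioning for the Prometheus datasource; the file path is an assumption.

# e.g. /etc/grafana/provisioning/datasources/prometheus.yml (path is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090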

Alert!

Alertmanager

Alertmanager handles alerts sent by client applications such as Prometheus, Grafana, etc.

Functions

  • deduplication
  • grouping
  • routing (see the alertmanager.yml sketch below)
  • sending
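
A minimal sketch of an alertmanager.yml showing grouping and routing; the receiver names, the Slack API URL and the PagerDuty key are placeholders, not values from the talk.

global:
  slack_api_url: "https://hooks.slack.com/services/..."  # placeholder

route:
  receiver: team-slack
  group_by: ['alertname', 'environment']
  group_wait: 30s
  repeat_interval: 3h
  routes:
    # Critical alerts are routed to the on-call pager instead.
    - match:
        severity: critical
      receiver: oncall-pager

receivers:
  - name: team-slack
    slack_configs:
      - channel: '#alerts'
  - name: oncall-pager
    pagerduty_configs:
      - service_key: "..."  # placeholder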

Functions

  • silencing
  • inhibition

Alertmanager supports a mesh configuration to create a cluster for High Availability.

Warning: High Availability is under active development

Notification integrations

  • email
  • hipchat
  • pagerduty
  • pushover
  • slack
  • opsgenie
  • webhook
  • victorops

Install

Method

  • source
  • pre-compiled binary
  • docker container
  • apt-get install prometheus
  • yum install prometheus
  • any installation from package

Recommended

Don't do this!

$ cd /tmp
$ wget https://github.com/prometheus/prometheus/releases/download/v2.2.0/prometheus-2.2.0.linux-amd64.tar.gz
$ tar -xzf prometheus-2.2.0.linux-amd64.tar.gz

Binary

$ sudo chmod +x prometheus-2.2.0.linux-amd64/{prometheus,promtool}
$ sudo cp prometheus-2.2.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
$ sudo chown root:root /usr/local/bin/{prometheus,promtool}
$ sudo mkdir -p /etc/prometheus
$ sudo vim /etc/prometheus/prometheus.yml
$ promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 0 rule files found

$ prometheus --config.file "/etc/prometheus/prometheus.yml" &

Repeat for every component (prometheus, alertmanager, node_exporter, blackbox_exporter, *_exporter) on multiple nodes every month or so

Problems

  • Too many operations
  • Won't survive reboot
  • No dedicated user
  • Try changing config
  • Troublesome upgrade
  • SELinux anyone?

Manage

(aka why Ansible?)

Goals

  • Zero-configuration deployment
  • Easy management of multiple nodes
  • Error checking
  • Multiple CPU architecture support

Where is my config?

  • command line parameters
  • main configuration file (in YAML)
  • files included from the main file (e.g. alert rules or file_sd config; sketch below)
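
As a sketch, an included file_sd file is just a YAML list of target groups; the environment label here is an assumption.

# /etc/prometheus/file_sd/node.yml
- targets:
    - "localhost:9100"
  labels:
    environment: dev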

Main config

Prometheus

global:
  evaluation_interval: 15s
  scrape_interval: 15s
  scrape_timeout: 10s

  external_labels:
    environment: localhost.localdomain

scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/metrics"
    static_configs:
    - targets:
      - "localhost:9090"
  - job_name: node
    file_sd_configs:
    - files:
      - "/etc/prometheus/file_sd/node.yml"

Ansible

# Nothing.

Main config (extended)

Prometheus

global:
  evaluation_interval: 15s
  scrape_interval: 15s
  scrape_timeout: 10s

  external_labels:
    environment: localhost.localdomain

rule_files:
  - /etc/prometheus/rules/*.rules

alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets:
      - localhost:9093

scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/metrics"
    static_configs:
    - targets:
      - "localhost:9090"
  - job_name: node
    file_sd_configs:
    - files:
      - "/etc/prometheus/file_sd/node.yml"

Ansible

prometheus_alertmanager_config:
  - scheme: http
    static_configs:
      - targets:
        - "localhost:9093"

prometheus_scrape_configs:
- job_name: "node"
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd/node.yml"

prometheus_targets:
  node:
    - targets:
      - "localhost:9100"

Command line parameters

# Ansible managed file. Be wary of possible overwrites.
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
Environment="GOMAXPROCS=1"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=30d \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://demo.cloudalchemy.org:9090

SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target

Everyone makes mistakes.

  • preflight checks included in the role
  • `promtool` used in the Ansible `validate` directive (sketch below)
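
A minimal sketch of how a template task can wire promtool into the validate directive; the task name, file paths and handler are assumptions, not the role's exact implementation.

- name: Deploy prometheus.yml
  template:
    src: prometheus.yml.j2
    dest: /etc/prometheus/prometheus.yml
    owner: prometheus
    group: prometheus
    mode: "0640"
    # Ansible renders the template to a temporary path (%s) and only
    # installs it if the validate command exits 0.
    validate: "promtool check config %s"
  notify: reload prometheus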

Gathering system metrics from many nodes with multiple CPU architectures?

node_exporter!

  • One binary
  • Simple configuration with CLI flags

Ansible role bonuses:

  • versioning
  • system user management
  • CPU architecture auto-detection
  • systemd service files
  • Linux capabilities support
  • basic SELinux support

Example
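
A minimal sketch of a playbook pulling the pieces together; the Galaxy role names are assumptions based on the Cloud Alchemy project, and the host group is made up.

# playbook.yml
- hosts: monitoring
  become: true
  roles:
    - cloudalchemy.node-exporter   # role name assumed
    - cloudalchemy.prometheus      # role name assumed
  vars:
    prometheus_targets:
      node:
        - targets:
            - "localhost:9100"
          labels:
            environment: dev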

Resources