Monitoring and Alerting with Prometheus

Introduction

  • Prometheus is an open-source monitoring solution that provides metrics collection and alerting.
  • Dimensional data: time series are identified by metric name and a set of key/value pairs (labels).
    • Example: temperature{location="outside"} 90
  • Prometheus collects metrics from monitored targets by scraping their HTTP metrics endpoints.
  • Prometheus runs by default on port 9090.
  • Grafana, commonly used to visualize Prometheus data, runs by default on port 3000.
  • Basic concepts:
    • All data is stored as time series.
    • Every time series is identified by the metric name and a set of key-value pairs, called labels. Example: go_memstats_alloc_bytes{instance="localhost:9100",job="node_exporter"}
    • The time series also holds the actual data, called samples. Each sample consists of a float64 value and a millisecond-precision timestamp.
  • Prometheus configuration is stored in a YAML file. Changes can be applied without restarting Prometheus by triggering a reload with kill -SIGHUP <pid>.
  • Monitor nodes (servers)
    • To monitor nodes, you need to install the Node Exporter (node_exporter), which exposes hardware and operating system metrics of Linux machines.
  • Prometheus Architecture
    • It pulls metrics from targets and stores them in local storage (disk) on the node where it runs.
    • The scraped metrics can be queried with PromQL.
    • Alerts can be pushed to Alertmanager.
    • The retrieval component can also find targets through service discovery integrations.
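
The data model described above can be sketched in a few lines of Python. This is only an illustration of the concepts, not how Prometheus stores data internally: the class names SeriesId and TimeSeries are hypothetical, and the timestamps are made-up values. It shows that a series is identified by its metric name plus its labels (in any order), and that each sample pairs a millisecond timestamp with a float value.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SeriesId:
    """Identity of a time series: metric name plus a set of labels (hypothetical helper)."""
    name: str
    labels: tuple  # sorted (label, value) pairs, so label order does not matter

    @classmethod
    def of(cls, name, **labels):
        return cls(name, tuple(sorted(labels.items())))

    def __str__(self):
        pairs = ",".join(f'{k}="{v}"' for k, v in self.labels)
        return f"{self.name}{{{pairs}}}" if pairs else self.name


@dataclass
class TimeSeries:
    """A series identity together with its samples."""
    series_id: SeriesId
    samples: list = field(default_factory=list)  # (timestamp_ms, float value)

    def append(self, timestamp_ms, value):
        # Each sample is a millisecond timestamp and a float64-like value.
        self.samples.append((int(timestamp_ms), float(value)))


outside = SeriesId.of("temperature", location="outside")
series = TimeSeries(outside)
series.append(1_700_000_000_000, 90)    # made-up timestamps, 15 s apart
series.append(1_700_000_015_000, 91.5)

print(outside)             # temperature{location="outside"}
print(series.samples[-1])  # (1700000015000, 91.5)
```

Because the labels are part of the identity, temperature{location="inside"} would be a different series from the one above, with its own list of samples.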

Monitoring

  • Client libraries
    • Code can be instrumented with client libraries available for many programming languages.
    • You can also implement your own library that emits the Prometheus exposition format, which is a simple text-based format.
    • An entry in the exposition format contains a metric name, optional labels (name and value pairs), a value and a type, which can be one of:
      • counter: a numeric value that only goes up
      • gauge: a single numeric value that can go up and down
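      • histogram: samples observations and counts them in configurable buckets; it also provides a total count of observations and a sum of the observed values. Quantiles are calculated from the buckets at query time.
      • summary: similar to histogram (it also provides a count and a sum), but it calculates configurable quantiles on the client side over a sliding time window.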
  • Pushing metrics
    • Prometheus pulls metrics from applications by default, but it also supports pushed metrics through the Pushgateway.
    • The Pushgateway is an intermediary service: jobs push their metrics to it, and Prometheus scrapes them from it.
    • The Pushgateway never forgets metrics unless they are manually deleted.
    • It should be used only in exceptional cases, such as short-lived batch jobs, because it adds another point of failure.
  • Querying
    • Prometheus provides a functional expression language called PromQL. It provides built-in functions and can execute calculations over vectors.
    • Expressions
      • Instant vector: a set of time series containing a single sample for each time series, all sharing the same timestamp. E.g.: node_cpu_seconds_total
      • Range vector: a set of time series containing a range of data points over time for each time series. E.g.: node_cpu_seconds_total[5m]
      • Scalar: a single numeric floating point value. E.g.: -3.14
      • String: a simple string value (currently unused).
    • Operators
      • Arithmetic binary operators
        • + (addition)
        • - (subtraction)
        • * (multiplication)
        • / (division)
        • % (modulo)
        • ^ (exponentiation)
      • Comparison binary operators
        • == (equal)
        • != (not equal)
        • > (greater than)
        • < (less than)
        • >= (greater than or equal)
        • <= (less than or equal)
      • Logical/set binary operators
        • and (intersection)
        • or (union)
        • unless (complement)
      • Aggregation operators
        • sum
        • min
        • max
        • avg
        • stddev
        • stdvar
        • count
        • count_values
        • bottomk
        • topk
        • quantile
      • Label matching operators
        • =: select labels that are exactly equal to the provided string
        • !=: select labels that are not equal to the provided string
        • =~: select labels that regex-match the provided string
        • !~: select labels that do not regex-match the provided string
  • Service discovery: Prometheus can use service discovery mechanisms to find the services to scrape.
  • Exporters: they expose metrics from a third-party system or application in a format that Prometheus can scrape.
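
Two of the ideas in this section, the text exposition format and the label matching operators, can be sketched together in Python. This is a simplified illustration under stated assumptions: render_exposition and label_matches are hypothetical helper names, http_requests_total is a made-up metric, escaping of special characters in label values is omitted, and Python's re module stands in for the RE2 syntax that Prometheus uses. Regex matchers are fully anchored here, as they are in PromQL.

```python
import re


def render_exposition(name, metric_type, help_text, samples):
    """Render one metric family in the text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        selector = f"{name}{{{pairs}}}" if pairs else name
        lines.append(f"{selector} {value}")
    return "\n".join(lines) + "\n"


def label_matches(labels, label, op, pattern):
    """Apply one PromQL-style label matcher to a dict of labels.

    A missing label is treated as the empty string, and regex matchers
    must match the whole label value.
    """
    actual = labels.get(label, "")
    if op == "=":
        return actual == pattern
    if op == "!=":
        return actual != pattern
    if op == "=~":
        return re.fullmatch(pattern, actual) is not None
    if op == "!~":
        return re.fullmatch(pattern, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


# Made-up sample data for a hypothetical counter.
samples = [
    ({"method": "GET", "code": "200"}, 1027),
    ({"method": "POST", "code": "500"}, 3),
]
text = render_exposition("http_requests_total", "counter",
                         "Total HTTP requests.", samples)
print(text)

# Roughly what the selector http_requests_total{code=~"5.."} would pick.
errors = [labels for labels, _ in samples
          if label_matches(labels, "code", "=~", "5..")]
print(errors)
```

The rendered text is what an exporter or an instrumented application would serve on its metrics endpoint, and the matcher function mirrors how a selector narrows a query down to the series whose labels satisfy every condition.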

Alerting

  • Alerting in Prometheus is separated into two parts:
    • Alerting Rules in Prometheus server
    • Alertmanager (runs on port 9093)
  • Alerting rules are part of the server configuration. Keeping them in separate YAML rule files, which the main configuration includes via rule_files, is a best practice.
    • The rules are based on a PromQL expression.
  • Alertmanager handles the alerts fired by the server.
    • It deduplicates, groups and routes alerts, and can also suppress notifications:
      • Grouping: groups similar alerts into one notification
      • Inhibition: suppresses notifications for certain alerts while another specified alert is already firing
      • Silences: mute notifications for matching alerts for a given time
    • Routes forward alerts to receivers (PagerDuty, email, Slack).
    • It can be set up as a highly available cluster, in which the instances share state over a gossip (mesh) protocol.
    • Alert states
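      • Inactive: the rule's expression is not met.
      • Pending: the expression is met, but has not yet been met for the duration configured in the rule's for clause.
      • Firing: the expression has been met for the whole for duration; the alert is sent to Alertmanager, which notifies the configured channels (receivers).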
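
The alert states can be sketched as a small state machine. This is a simplified model, assuming the semantics of the for duration in Prometheus alerting rules (an alert stays pending until its expression has held for that long). The AlertRule class, the rule name HighCpu and the evaluation times are hypothetical; a real server evaluates a PromQL expression at each rule evaluation interval instead of receiving a boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical model of one alerting rule and its current state."""
    name: str
    for_seconds: float                    # how long the condition must hold before firing
    active_since: Optional[float] = None  # when the condition first became true
    state: str = "inactive"

    def evaluate(self, condition_met: bool, now: float) -> str:
        if not condition_met:
            # The condition stopped holding: reset the alert.
            self.active_since = None
            self.state = "inactive"
        else:
            if self.active_since is None:
                self.active_since = now
            held = now - self.active_since
            self.state = "firing" if held >= self.for_seconds else "pending"
        return self.state


rule = AlertRule(name="HighCpu", for_seconds=60)
# (seconds since start, did the rule expression return a result?)
evaluations = [(0, False), (15, True), (45, True), (75, True), (90, False)]
states = [rule.evaluate(met, now) for now, met in evaluations]
print(states)  # ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

With for_seconds set to 0 the model goes straight from inactive to firing, which corresponds to a rule that has no for clause.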