Monitoring and Alerting with Prometheus | guh.me

Introduction

Prometheus is an open-source monitoring solution that provides metrics collection and alerting.
Dimensional data: time series are identified by metric name and a set of key/value pairs (labels).
- temperature{location="outside"} 90
Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints.
Prometheus runs by default on port 9090.
Grafana runs by default on port 3000.
Basic concepts:
- All data is stored as time series.
- Every time series is identified by the metric name and a set of key-value pairs, called labels. Example: go_memstats_alloc_bytes{instance="localhost:9100",job="node_exporter"}
- The actual data points of a time series are called samples. Each sample consists of a float64 value and a millisecond-precision timestamp.
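The data model above can be sketched in a few lines of Python. This is only an illustration of how a metric name plus a label set identifies a series and how samples pair a timestamp with a value; the names `series_key`, `Sample` and `TimeSeriesStore` are hypothetical, and the sketch says nothing about how Prometheus actually stores data on disk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A series is identified by its metric name plus its full label set.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity; the order of the labels must not matter."""
    return (name, tuple(sorted(labels.items())))


@dataclass
class Sample:
    timestamp_ms: int  # millisecond-precision timestamp
    value: float       # float64 value


@dataclass
class TimeSeriesStore:
    """Toy in-memory store (hypothetical, for illustration only)."""
    series: Dict[SeriesKey, List[Sample]] = field(default_factory=dict)

    def append(self, name: str, labels: Dict[str, str],
               timestamp_ms: int, value: float) -> None:
        key = series_key(name, labels)
        self.series.setdefault(key, []).append(Sample(timestamp_ms, float(value)))


store = TimeSeriesStore()
# Same name and labels: both samples land in the same series.
store.append("temperature", {"location": "outside"}, 1_700_000_000_000, 90)
store.append("temperature", {"location": "outside"}, 1_700_000_015_000, 91)
# A different label value creates a separate series.
store.append("temperature", {"location": "inside"}, 1_700_000_000_000, 70)
```

Changing any label value yields a new series, which is why labels with many distinct values multiply the number of series to store.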
Prometheus configuration is stored in a YAML file. Changes can be applied without restarting Prometheus by triggering a reload with kill -SIGHUP <pid>.
Monitor nodes (servers)
- To monitor nodes, you need to install the Node Exporter (node_exporter), which exposes machine metrics of Linux machines.
Prometheus Architecture
- It pulls metrics from targets and stores them on the local node's storage (disk).
- The scraped metrics can be queried with PromQL.
- Alerts are pushed to Alertmanager.
- The retrieval component can also find targets through service discovery integrations.
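The pull model can be sketched with the Python standard library: a target exposes a /metrics HTTP endpoint, and a scraper fetches it. This is a minimal sketch, not how Prometheus itself is implemented; the metric `demo_requests_total` and the helpers `MetricsHandler`, `scrape` and `run_demo` are hypothetical names chosen for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric served by the demo target.
METRICS_BODY = (
    "# TYPE demo_requests_total counter\n"
    "demo_requests_total 42\n"
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the metrics text on /metrics and 404 on everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        payload = METRICS_BODY.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url: str) -> str:
    """Fetch a metrics endpoint the way a scraper would: a plain HTTP GET."""
    # An empty ProxyHandler ignores any proxy settings from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read().decode("utf-8")


def run_demo() -> str:
    """Start the target on a free local port, scrape it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Because the scraper initiates every request, a target that stops answering is immediately visible as a failed scrape.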

Monitoring

Client libraries
- Code can be instrumented with client libraries available for many programming languages.
- You can also implement your own libraries that use Prometheus Exposition format, which is a simple text-based format.
- Each metric in the exposition format has a metric name, optional label names and values, a sample value, and a type, which can be one of:
  - counter: a numeric value that only goes up
  - gauge: a single numeric value that can go up and down
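  - histogram: collects sample observations and counts them into configurable buckets. It also provides a total count of observations and a sum of the observed values. Quantiles are calculated from the buckets at query time.
  - summary: similar to a histogram, with a total count and a sum of observations, but it calculates configurable quantiles on the client side instead of exposing buckets.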
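As an illustration of the text format, the sketch below renders a gauge and a histogram by hand. It assumes the usual conventions of the format (`# HELP` and `# TYPE` comment lines, cumulative `_bucket` series with an `le` label, plus `_sum` and `_count`); the helper names `render_simple` and `render_histogram` and the example metrics are hypothetical. In practice a client library would do this for you.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Sequence


def escape_label_value(value: str) -> str:
    # Backslash, double quote and line feed must be escaped inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_labels(labels: Dict[str, str]) -> str:
    if not labels:
        return ""
    pairs = ",".join(
        f'{name}="{escape_label_value(value)}"'
        for name, value in sorted(labels.items())
    )
    return "{" + pairs + "}"


def render_simple(name: str, metric_type: str, help_text: str, value: float,
                  labels: Optional[Dict[str, str]] = None) -> List[str]:
    """Render a counter or gauge with a single sample."""
    return [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
        f"{name}{format_labels(labels or {})} {value}",
    ]


def render_histogram(name: str, help_text: str, bounds: Sequence[float],
                     observations: Sequence[float]) -> List[str]:
    """Render a histogram; `bounds` must be sorted upper bucket bounds."""
    # counts[i] holds observations whose smallest fitting bucket is bounds[i];
    # the extra last slot is the implicit +Inf bucket.
    counts = [0] * (len(bounds) + 1)
    for obs in observations:
        counts[bisect_left(bounds, obs)] += 1

    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} histogram"]
    running = 0
    for bound, count in zip(list(bounds) + ["+Inf"], counts):
        running += count  # buckets are cumulative: le="0.5" includes le="0.25"
        lines.append(f'{name}_bucket{{le="{bound}"}} {running}')
    lines.append(f"{name}_sum {sum(observations)}")
    lines.append(f"{name}_count {len(observations)}")
    return lines


if __name__ == "__main__":
    output = render_simple("temperature", "gauge", "Current temperature.",
                           90, {"location": "outside"})
    output += render_histogram("request_duration_seconds", "Request latency.",
                               [0.25, 0.5, 1.0], [0.125, 0.5, 0.75, 2.0])
    print("\n".join(output))
```

The cumulative buckets are what make query-time quantile estimation possible: the server only needs to know how many observations fell at or below each bound.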
Pushing metrics
- By default, Prometheus pulls metrics from applications, but it also supports pushed metrics through the Pushgateway.
- The Pushgateway is used as an intermediary service which allows you to push metrics.
- The Pushgateway never forgets metrics unless they are manually deleted.
- It should be used as an exception, because it adds another point of failure.
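A push can be sketched as a plain HTTP request. The example assumes a Pushgateway listening on its default port 9091 and uses its /metrics/job/<job> URL scheme with optional extra grouping labels; the job name, metric name and the helper `build_push_request` are hypothetical. The function only builds the request, so it can be inspected without a running gateway.

```python
import urllib.request
from typing import Dict, Optional
from urllib.parse import quote


def build_push_request(gateway_url: str, job: str, body: str,
                       grouping: Optional[Dict[str, str]] = None
                       ) -> urllib.request.Request:
    """Build (but do not send) a push for one metric group."""
    path = "/metrics/job/" + quote(job, safe="")
    # Extra grouping labels are appended as /<label>/<value> path segments.
    for label, value in (grouping or {}).items():
        path += f"/{quote(label, safe='')}/{quote(value, safe='')}"
    return urllib.request.Request(
        gateway_url.rstrip("/") + path,
        data=body.encode("utf-8"),  # text exposition format, newline-terminated
        method="PUT",               # PUT replaces every metric in the group
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


request = build_push_request(
    "http://localhost:9091",  # assumed local Pushgateway address
    "nightly_backup",
    "backup_last_success_timestamp_seconds 1700000000\n",
    {"instance": "db1"},
)
```

Sending it with `urllib.request.urlopen(request)` requires a reachable Pushgateway. A matching DELETE request on the same URL removes the group, which is the manual deletion mentioned above.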
Querying
- Prometheus provides a functional expression language called PromQL. It provides built-in functions and can execute calculations over vectors.
- Expressions
  - Instant vector: a set of time series containing a single sample for each time series, all sharing the same timestamp. E.g.: node_cpu_seconds_total
  - Range vector: a set of time series containing a range of data points over time for each time series. E.g.: node_cpu_seconds_total[5m]
  - Scalar: a single numeric floating point value. E.g.: -3.14
  - String: a string value (currently unused).
- Operators
  - Arithmetic binary operators
    - + (addition)
    - - (subtraction)
    - * (multiplication)
    - / (division)
    - % (modulo)
    - ^ (exponentiation)
  - Comparison binary operators
    - == (equal)
    - != (not equal)
    - > (greater than)
    - < (less than)
    - >= (greater than or equal)
    - <= (less than or equal)
  - Logical/set binary operators
    - and (intersection)
    - or (union)
    - unless (complement)
  - Aggregation operators
    - sum
    - min
    - max
    - avg
    - stddev
    - stdvar
    - count
    - count_values
    - bottomk
    - topk
    - quantile
  - Label matching operators
    - =: select labels that are exactly equal to the provided string
    - !=: select labels that are not equal to the provided string
    - =~: select labels that regex-match the provided string
    - !~: select labels that do not regex-match the provided string
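To make the label matchers and aggregation operators more concrete, here is a rough Python model of what a query such as sum by (cpu) (node_cpu_seconds_total{mode!="idle"}) does to an instant vector. It is a sketch under stated assumptions, not the PromQL engine: the sample values are made up, `matches`, `select` and `sum_by` are hypothetical helpers, and Python's `re` module stands in for the RE2 syntax Prometheus uses. Regex matchers are fully anchored, which `re.fullmatch` mimics.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Series = Tuple[Dict[str, str], float]

# Instant vector: one sample per series. "__name__" holds the metric name.
VECTOR: List[Series] = [
    ({"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "idle"}, 900.0),
    ({"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "user"}, 80.0),
    ({"__name__": "node_cpu_seconds_total", "cpu": "1", "mode": "idle"}, 850.0),
    ({"__name__": "node_cpu_seconds_total", "cpu": "1", "mode": "system"}, 40.0),
]


def matches(labels: Dict[str, str], matchers: List[Tuple[str, str, str]]) -> bool:
    """Return True if the label set satisfies every (label, op, pattern) matcher."""
    for name, op, pattern in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == pattern
        elif op == "!=":
            ok = actual != pattern
        elif op == "=~":
            ok = re.fullmatch(pattern, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(pattern, actual) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True


def select(vector: List[Series], matchers: List[Tuple[str, str, str]]) -> List[Series]:
    """Keep only the series whose labels satisfy all matchers."""
    return [(labels, value) for labels, value in vector if matches(labels, matchers)]


def sum_by(vector: List[Series], by: List[str]) -> Dict[tuple, float]:
    """Aggregate like `sum by (...)`: group on the listed labels, add the values."""
    totals: Dict[tuple, float] = defaultdict(float)
    for labels, value in vector:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


non_idle = select(VECTOR, [("__name__", "=", "node_cpu_seconds_total"),
                           ("mode", "!=", "idle")])
busy_per_cpu = sum_by(non_idle, ["cpu"])
```

Note that the aggregation drops every label not listed in `by`, so the result has one series per distinct cpu value.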
Service discovery: Prometheus can use service discovery mechanisms to find services to scrape.
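One of the simplest mechanisms is file-based service discovery, where Prometheus watches a JSON or YAML file listing target groups. The sketch below generates such a file's content in Python, assuming the JSON layout of a list of objects with "targets" and "labels" keys; the host names, the `env` label and the helper `file_sd_groups` are hypothetical.

```python
import json
from typing import Dict, List


def file_sd_groups(targets_by_env: Dict[str, List[str]]) -> List[dict]:
    """Turn {env: [host:port, ...]} into a list of target groups."""
    return [
        {"targets": sorted(targets), "labels": {"env": env}}
        for env, targets in sorted(targets_by_env.items())
    ]


groups = file_sd_groups({
    "prod": ["app2.example.com:9100", "app1.example.com:9100"],
    "staging": ["stage1.example.com:9100"],
})
document = json.dumps(groups, indent=2)
```

Writing `document` to a file referenced by a scrape job's file-based discovery settings would let an external tool add or remove targets without editing the main configuration.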
Exporters: they export metrics from a system or from an application.

Alerting

Alerting in Prometheus is split into two parts:
- Alerting Rules in Prometheus server
- Alertmanager (runs on port 9093)
Alerting rules are part of the server configuration. Keeping them in separate YAML rule files that the main configuration includes is a best practice.
- The rules are based on a PromQL expression.
Alertmanager handles the alerts fired by the server.
- It handles deduplication, grouping and routing of alerts:
  - Grouping: groups similar alerts into one notification
  - Inhibition: suppresses other alerts while a specified alert is already firing
  - Silences: mutes certain notifications
- Routes forward alerts to receivers (PagerDuty, email, Slack).
- It can be set up as a highly available cluster, with instances sharing state over a gossip mesh.
- Alert states
  - Inactive: the rule's condition is not met.
  - Pending: the rule's condition is met, but not yet for the duration configured in the rule's for clause.
  - Firing: alert is sent to the configured channels (receivers).
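The alert states can be modelled as a small state machine. This sketch assumes that a rule stays pending until its condition has held for the rule's configured `for` duration and then starts firing; the class `AlertRule`, its fields and the `HighCPU` example are hypothetical, and real rule evaluation happens inside the Prometheus server at each evaluation interval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    for_seconds: float                    # how long the condition must hold
    state: str = "inactive"
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, condition_true: bool, now: float) -> str:
        """Update and return the state for one evaluation at time `now`."""
        if not condition_true:
            # The condition no longer holds: reset, whatever the previous state.
            self.state = "inactive"
            self.active_since = None
            return self.state
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            self.state = "firing"
        else:
            self.state = "pending"
        return self.state


rule = AlertRule("HighCPU", for_seconds=60)
# (evaluation time in seconds, result of the rule's PromQL condition)
timeline = [(0, False), (15, True), (45, True), (75, True), (90, False)]
states = [rule.evaluate(condition, now) for now, condition in timeline]
```

In this run the condition first holds at 15 seconds, so the rule is pending until 75 seconds and fires then; a single false evaluation sends it straight back to inactive.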