Monitoring and Alerting with Prometheus
Introduction
- Prometheus is an open source monitoring solution that provides metrics collection and alerting.
- Dimensional data: time series are identified by metric name and a set of key/value pairs (labels).
temperature{location="outside"} 90
- Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints.
- Prometheus runs by default on port 9090.
- Grafana runs by default on port 3000.
- Basic concepts:
- All data is stored as time series.
- Every time series is identified by the metric name and a set of key-value pairs, called labels. Example:
go_memstats_alloc_bytes{instance="localhost:9100",job="node_exporter"}
- Each time series also holds the actual data points, called samples. A sample consists of a float64 value and a millisecond-precision timestamp.
- Prometheus configuration is stored in a YAML file. Changes can be applied without restarting Prometheus by triggering a reload, for example by sending SIGHUP to the process:
kill -SIGHUP <pid>
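The data model from the basic concepts above (a series identified by metric name plus labels, each sample a float64 value with a millisecond timestamp) can be sketched in a few lines of Python. This is an illustration only, not how Prometheus stores data internally; `Sample`, `series_id` and `storage` are hypothetical names, and the sample value is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp_ms: int  # milliseconds since the Unix epoch
    value: float       # Python floats are 64-bit, like Prometheus sample values


def series_id(name: str, labels: dict[str, str]) -> str:
    """Render a series identity: metric name plus sorted label pairs."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


# Toy in-memory storage: one series identity maps to an ordered list of samples.
storage: dict[str, list[Sample]] = {}

key = series_id(
    "go_memstats_alloc_bytes",
    {"job": "node_exporter", "instance": "localhost:9100"},
)
storage.setdefault(key, []).append(Sample(1_700_000_000_000, 2.5e6))

print(key)
# go_memstats_alloc_bytes{instance="localhost:9100",job="node_exporter"}
```

Changing any label value produces a different key, which is why each distinct label combination counts as its own time series.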
- Monitor nodes (servers)
- To monitor nodes, install the Node Exporter (node_exporter), which exposes machine-level metrics for Linux hosts.
- Prometheus Architecture
- Prometheus pulls (scrapes) metrics from targets and stores them on the local node storage (disk).
- The scraped metrics can be queried with PromQL.
- Alerts can be pushed to Alertmanager.
- The retrieval component can also discover targets through service discovery integrations.
Monitoring
- Client libraries
- Code can be instrumented with client libraries available for many programming languages.
- You can also implement your own libraries that use Prometheus Exposition format, which is a simple text-based format.
- A metric in the exposition format has a metric name, optional label name/value pairs, a value and a type. The type is one of:
- counter: a numeric value that only goes up
- gauge: a single numeric value that can go up and down
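- histogram: samples observations and counts them in configurable buckets. It also provides a total count of observations and a sum of the observed values. Quantiles can be calculated from the buckets at query time.
- summary: similar to a histogram in that it provides a count and a sum of observations, but instead of buckets it calculates configurable quantiles on the client side over a sliding time window.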
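To make the text-based exposition format and the histogram buckets described above more concrete, here is a minimal Python sketch. It is not a client library: `render_metric` and `cumulative_buckets` are hypothetical helpers, the metric names and values are invented, and label-value escaping is deliberately left out.

```python
import math


def render_metric(name, metric_type, help_text, samples):
    """Render one metric in the text exposition format.

    samples is a list of (labels_dict, value) pairs. Escaping of special
    characters in label values is omitted to keep the sketch short.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


def cumulative_buckets(observations, upper_bounds):
    """Count observations the way histogram buckets do: each bucket counts
    every observation less than or equal to its upper bound."""
    bounds = sorted(upper_bounds) + [math.inf]
    return [(b, sum(1 for o in observations if o <= b)) for b in bounds]


text = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [({"method": "get"}, 3), ({"method": "post"}, 1)],
)
print(text, end="")
# # HELP http_requests_total Total HTTP requests.
# # TYPE http_requests_total counter
# http_requests_total{method="get"} 3
# http_requests_total{method="post"} 1

buckets = cumulative_buckets([0.05, 0.2, 0.7, 3.0], [0.1, 0.5, 1.0])
print(buckets)
# [(0.1, 1), (0.5, 2), (1.0, 3), (inf, 4)]
```

Because the buckets are cumulative, the last one (the infinity bucket) always equals the total count of observations.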
- Pushing metrics
- By default, Prometheus pulls metrics from applications, but it also supports pushed metrics through the Pushgateway.
- The Pushgateway is an intermediary service: jobs push their metrics to it, and Prometheus scrapes them from it.
- The Pushgateway never forgets metrics unless they are manually deleted.
- It should be used only in exceptional cases, such as short-lived batch jobs, because it adds another point of failure.
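As a rough illustration of pushing, the sketch below builds (but does not send) an HTTP request that pushes one gauge to a Pushgateway. It assumes a gateway at `http://localhost:9091` (the usual default port) and the `/metrics/job/<job>` path; the job name `nightly_backup`, the metric name and the helper `build_push_request` are made up for the example.

```python
import urllib.request


def build_push_request(gateway_url, job, exposition_text):
    """Build an HTTP PUT that replaces all metrics pushed for this job.

    The job name is used as-is; names with special characters would need
    URL encoding, which this sketch does not handle.
    """
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=exposition_text.encode("utf-8"),
        method="PUT",
    )


# Metrics are sent in the text exposition format.
body = (
    "# TYPE batch_last_success_timestamp_seconds gauge\n"
    "batch_last_success_timestamp_seconds 1700000000\n"
)
req = build_push_request("http://localhost:9091", "nightly_backup", body)
print(req.get_method(), req.full_url)
# PUT http://localhost:9091/metrics/job/nightly_backup

# To actually send it (requires a running Pushgateway):
# urllib.request.urlopen(req)
```

Since the Pushgateway keeps pushed metrics until they are deleted, a job that is retired should also have its group removed, for example with an HTTP DELETE against the same path.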
- Querying
- Prometheus provides a functional expression language called PromQL. It provides built-in functions and can execute calculations over vectors.
- Expressions
- Instant vector: a set of time series containing a single sample for each time series, all sharing the same timestamp. E.g.:
node_cpu_seconds_total
- Range vector: a set of time series containing a range of data points over time for each time series. E.g.:
node_cpu_seconds_total[5m]
- Scalar: a single numeric floating point value. E.g.: -3.14
- String: a string value (unused).
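The difference between an instant vector and a range vector can be illustrated with a toy model of a single series. This is a simplification: real PromQL evaluation also applies a lookback window and staleness handling, and the exact window boundaries here are an assumption. The function names and sample values are hypothetical.

```python
def instant_value(samples, eval_ts):
    """Most recent (timestamp, value) at or before eval_ts, or None."""
    candidates = [s for s in samples if s[0] <= eval_ts]
    return max(candidates, key=lambda s: s[0]) if candidates else None


def range_values(samples, eval_ts, window):
    """All samples inside the window that ends at eval_ts."""
    return [s for s in samples if eval_ts - window < s[0] <= eval_ts]


# (timestamp in seconds, value) pairs for one series, scraped every 60 s
cpu = [(0, 10.0), (60, 12.0), (120, 15.0), (180, 19.0), (240, 24.0), (300, 30.0)]

# Like node_cpu_seconds_total evaluated at t=300: one sample per series
print(instant_value(cpu, 300))
# (300, 30.0)

# Like node_cpu_seconds_total[2m] evaluated at t=300: several samples per series
print(range_values(cpu, 300, 120))
# [(240, 24.0), (300, 30.0)]
```

Functions such as rate() need a range vector precisely because they compare several samples of the same series over time.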
- Operators
- Arithmetic binary operators
+ (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (exponentiation)
- Comparison binary operators
== (equal), != (not equal), > (greater than), < (less than), >= (greater than or equal), <= (less than or equal)
- Logical/set binary operators
and (intersection), or (union), unless (complement)
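The three set operators can be sketched on toy instant vectors, modelled here as dictionaries that map a label set to a value. This assumes matching on the full label set and ignores the metric name and the on/ignoring modifiers; `labelset`, `and_op`, `or_op` and `unless_op` are hypothetical names.

```python
def labelset(**labels):
    """Hashable label set used as the identity of a vector element."""
    return frozenset(labels.items())


def and_op(left, right):
    """Intersection: left elements whose label set also exists in right."""
    return {k: v for k, v in left.items() if k in right}


def or_op(left, right):
    """Union: all left elements, plus right elements not matched in left."""
    result = dict(left)
    for k, v in right.items():
        result.setdefault(k, v)
    return result


def unless_op(left, right):
    """Complement: left elements whose label set does not exist in right."""
    return {k: v for k, v in left.items() if k not in right}


a, b, c = labelset(instance="a"), labelset(instance="b"), labelset(instance="c")
left = {a: 1.0, b: 0.0}
right = {b: 5.0, c: 7.0}

print(and_op(left, right) == {b: 0.0})                   # True
print(or_op(left, right) == {a: 1.0, b: 0.0, c: 7.0})    # True
print(unless_op(left, right) == {a: 1.0})                # True
```

In every case the values come from the left-hand side when both sides have a matching element; the right-hand side only decides which elements are kept.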
- Aggregation operators
sum
min
max
avg
stddev
stdvar
count
count_values
bottomk
topk
quantile
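As an illustration of what an aggregation operator does, the sketch below mimics something like sum by (job) over a toy instant vector. It only covers the `by` form of `sum`; the helper `sum_by`, the label values and the numbers are invented for the example.

```python
from collections import defaultdict


def sum_by(vector, by):
    """Sum the values of all elements that share the same values for the
    labels listed in `by`. A missing label is treated as an empty string."""
    totals = defaultdict(float)
    for labels, value in vector:
        group = tuple((name, labels.get(name, "")) for name in by)
        totals[group] += value
    return dict(totals)


# Toy instant vector: (labels, value) per series
vector = [
    ({"job": "api", "instance": "a"}, 1.0),
    ({"job": "api", "instance": "b"}, 2.0),
    ({"job": "db", "instance": "c"}, 4.0),
]

print(sum_by(vector, ["job"]))
# {(('job', 'api'),): 3.0, (('job', 'db'),): 4.0}
```

The result has one element per group and keeps only the grouping labels, so the `instance` label is gone after aggregation.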
- Label matching operators
=: select labels that are exactly equal to the provided string
!=: select labels that are not equal to the provided string
=~: select labels that regex-match the provided string
!~: select labels that do not regex-match the provided string
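The four label matchers can be approximated in Python as follows. One detail worth noting is that regex matchers are fully anchored, which `re.fullmatch` reproduces. Prometheus uses RE2 syntax, so Python's `re` module is only a stand-in here, and `matches` and `select` are hypothetical helper names.

```python
import re


def matches(label_value, op, pattern):
    """Apply one label matcher to a label value."""
    if op == "=":
        return label_value == pattern
    if op == "!=":
        return label_value != pattern
    if op == "=~":
        return re.fullmatch(pattern, label_value) is not None
    if op == "!~":
        return re.fullmatch(pattern, label_value) is None
    raise ValueError(f"unknown matcher: {op}")


def select(series, label, op, pattern):
    """Keep the series whose label satisfies the matcher.
    A missing label is treated as an empty string."""
    return [s for s in series if matches(s.get(label, ""), op, pattern)]


series = [{"job": "node_exporter"}, {"job": "prometheus"}, {"job": "node"}]

print(select(series, "job", "=", "node"))
# [{'job': 'node'}]
print(select(series, "job", "=~", "node.*"))
# [{'job': 'node_exporter'}, {'job': 'node'}]
print(select(series, "job", "=~", "node"))
# [{'job': 'node'}]
print(select(series, "job", "!~", "node.*"))
# [{'job': 'prometheus'}]
```

Because of the anchoring, the pattern `node` does not match `node_exporter`; a trailing `.*` is needed for prefix matches.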
- Service discovery: Prometheus can use service discovery mechanisms to find services to scrape.
- Exporters: they export metrics from a system or from an application.
Alerting
- Alerting in Prometheus is separated into two parts:
- Alerting Rules in Prometheus server
- Alertmanager (runs on port 9093)
- Alerting rules live in the server configuration. You can include rules from separate YAML files, which is a best practice.
- The rules are based on a PromQL expression.
- Alertmanager handles the alerts fired by the server.
- It handles deduplication, grouping and routing of alerts:
- Grouping: groups similar alerts into one notification
- Inhibition: suppresses notifications for certain alerts while another specified alert is already firing
- Silences: mutes certain notifications
- Routes forward alerts to receivers (PagerDuty, email, Slack).
- It can be set up as a highly available service by running several instances in a mesh (cluster) configuration.
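The grouping behaviour described above can be sketched as follows: alerts that share the same values for the chosen grouping labels end up in one notification. This is a conceptual sketch, not Alertmanager's implementation; `group_alerts` and the alert labels are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the grouping labels.
    Each bucket would become a single notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "cluster": "eu", "instance": "a"},
    {"alertname": "InstanceDown", "cluster": "eu", "instance": "b"},
    {"alertname": "InstanceDown", "cluster": "us", "instance": "c"},
    {"alertname": "DiskFull", "cluster": "eu", "instance": "a"},
]

groups = group_alerts(alerts, ["alertname", "cluster"])
for key, members in groups.items():
    print(key, len(members))
# ('InstanceDown', 'eu') 2
# ('InstanceDown', 'us') 1
# ('DiskFull', 'eu') 1
```

Four alerts collapse into three notifications here; during a large outage the reduction is usually far greater.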
- Alert states
- Inactive: the rule's condition is not met.
- Pending: the rule's condition is met, but has not yet held for the duration configured in the rule's for clause.
- Firing: alert is sent to the configured channels (receivers).
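The three alert states can be modelled as a small state machine driven by rule evaluations. The sketch below assumes a rule with a hold duration (the for clause of an alerting rule) and a sequence of evaluation results; `alert_state` is a hypothetical function and the timestamps are invented.

```python
def alert_state(evaluations, for_seconds):
    """Return the alert state after a sequence of rule evaluations.

    evaluations: list of (timestamp_seconds, condition_met), oldest first.
    for_seconds: how long the condition must hold before the alert fires.
    """
    active_since = None
    state = "inactive"
    for ts, condition_met in evaluations:
        if not condition_met:
            # The condition stopped holding: reset the timer.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = ts
        state = "firing" if ts - active_since >= for_seconds else "pending"
    return state


print(alert_state([(0, False), (15, True), (30, True)], 60))
# pending
print(alert_state([(0, True), (15, True), (30, True), (45, True), (60, True)], 60))
# firing
print(alert_state([(0, True), (30, False), (60, True)], 60))
# pending
```

The third call shows why a flapping condition never fires: every evaluation where the condition is not met resets the timer.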