Collecting, Storing, and Visualizing Metrics with TimescaleDB + Prometheus + Grafana
0. Background
This is a tutorial on using TimescaleDB, Prometheus, and Grafana to store and analyze long-term metrics data. In this part, you will learn how to use metrics data to answer questions about the performance of your infrastructure.
Rather than setting up all the required infrastructure, the tutorial works with a downloadable sample dataset containing a large number of Prometheus metrics collected by monitoring a PostgreSQL database. When you use this sample dataset, the required schema is created for you automatically.
In this tutorial, you will learn how to optimize TimescaleDB for Prometheus metrics and build visualizations that help you answer common questions with those metrics, such as:
- How many active connections are there to the database?
- What is the maximum and average percent of memory usage?
- When is the database being read most often?
- How much disk space is being used?
- What is the cache hit rate?
Prometheus is an open-source, community-driven monitoring system that starts out deliberately simple. What makes Prometheus remarkable is that it solves the monitoring problem in a simple, direct way. Its guiding principle is to do one thing and do it well, which is reflected, for example, in the design of the PromQL language.
However, that principle also brings limitations. To their credit, the Prometheus developers foresaw that their product is opinionated, and they built in extensibility to allow other systems to improve on it. In turn, Prometheus users often look to other systems to augment their monitoring setup.
This is where TimescaleDB and PostgreSQL come in. In particular, Prometheus users often turn to TimescaleDB and PostgreSQL for the following reasons:
- Scalable, durable long-term data storage
- A broad ecosystem of tools that make operations easier
- The power and flexibility of SQL queries
By using Prometheus and TimescaleDB together, you can combine the simplicity of Prometheus with the reliability, power, and flexibility of TimescaleDB, and choose the approach that best fits the task at hand. For example, you can query with PromQL, with full SQL, or with both.
1. Setting up Timescale pg_prometheus and the postgresql adapter
1.1 Installing pg_prometheus
pg_prometheus project page: https://github.com/timescale/pg_prometheus
Baidu Netdisk link: https://pan.baidu.com/s/1FHSqtiFIU_afhcIGIpjozw (extraction code: hwva). You can unpack the archive into the install directory in place of running git clone.
git clone https://github.com/timescale/pg_prometheus
cd pg_prometheus
# git checkout 0.2.2
make
sudo make install
If sudo pg_config reports that the command is not found, it is because sudo does not preserve these environment variables when switching to the root user; they must be listed explicitly. Add a line under env_reset in /etc/sudoers:
Defaults env_keep+="PATH PYTHONPATH LD_LIBRARY_PATH MANPATH JAVA_HOME"
In addition, add /usr/local/pgsql/bin to the secure_path variable:
Defaults secure_path="/usr/local/pgsql/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
Verify that the pg_prometheus extension installed correctly by connecting to TimescaleDB:
$ psql -U postgres
psql (11.8)
Type "help" for help.
postgres=# SELECT name, default_version,installed_version FROM pg_available_extensions WHERE name LIKE '%prometheus%';
You should see output similar to the following:
name | default_version | installed_version
---------------+-----------------+-------------------
pg_prometheus | 0.2.2 |
(1 row)
1.2 Creating the Prometheus database
$ psql -U postgres
CREATE USER prometheus WITH LOGIN PASSWORD 'secret';
CREATE DATABASE prometheus OWNER prometheus;
\connect prometheus;
CREATE EXTENSION timescaledb;
CREATE EXTENSION pg_prometheus;
Optionally grant permissions to the database user (prometheus) that will own the Prometheus data:
-- Create the role
-- CREATE ROLE prometheus WITH LOGIN PASSWORD 'secret';
GRANT ALL PRIVILEGES ON DATABASE prometheus TO prometheus;
Create the tables.
-- Create the table using pg_prometheus only:
-- SELECT create_prometheus_table('metrics');
-- Create the table with pg_prometheus backed by the TimescaleDB extension:
SELECT create_prometheus_table('metrics', use_timescaledb=>true);
-- Configure data retention; by default, data is stored for 90 days and then deleted:
-- SELECT set_default_retention_period(180 * INTERVAL '1 day');
Inspect the tables that were created.
# \dt
List of relations
Schema | Name | Type | Owner
--------+----------------+-------+----------
public | metrics_copy | table | postgres
public | metrics_labels | table | postgres
public | metrics_values | table | postgres
(3 rows)
# \d metrics_values
Table "public.metrics_values"
Column | Type | Collation | Nullable | Default
-----------+--------------------------+-----------+----------+---------
time | timestamp with time zone | | not null |
value | double precision | | |
labels_id | integer | | |
Indexes:
"metrics_values_labels_id_idx" btree (labels_id, "time" DESC)
"metrics_values_time_idx" btree ("time" DESC)
Triggers:
ts_insert_blocker BEFORE INSERT ON metrics_values FOR EACH ROW EXECUTE PROCEDURE _timescaledb_internal.insert_blocker()
# \d metrics_labels
Table "public.metrics_labels"
Column | Type | Collation | Nullable | Default
-------------+---------+-----------+----------+--------------------------------------------
id | integer | | not null | nextval('metrics_labels_id_seq'::regclass)
metric_name | text | | not null |
labels | jsonb | | |
Indexes:
"metrics_labels_pkey" PRIMARY KEY, btree (id)
"metrics_labels_metric_name_labels_key" UNIQUE CONSTRAINT, btree (metric_name, labels)
"metrics_labels_labels_idx" gin (labels)
"metrics_labels_metric_name_idx" btree (metric_name)
# \d metrics_copy
Table "public.metrics_copy"
Column | Type | Collation | Nullable | Default
--------+-------------+-----------+----------+---------
sample | prom_sample | | not null |
Triggers:
insert_trigger BEFORE INSERT ON metrics_copy FOR EACH ROW EXECUTE PROCEDURE prometheus.insert_view_normal('metrics_values', 'metrics_labels')
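The normalized layout above can be sketched with an in-memory SQLite database standing in for PostgreSQL. This is a hypothetical simplification: labels are stored as JSON text rather than jsonb, and the hypertable machinery is omitted, but the shape of the two main tables and the join between them is the same.

```python
import json
import sqlite3

# In-memory sketch of the pg_prometheus normalized schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics_labels (
        id INTEGER PRIMARY KEY,
        metric_name TEXT NOT NULL,
        labels TEXT,
        UNIQUE (metric_name, labels)
    )
""")
conn.execute("""
    CREATE TABLE metrics_values (
        time TEXT NOT NULL,
        value REAL,
        labels_id INTEGER REFERENCES metrics_labels (id)
    )
""")

# One label set and one sample referencing it, as in the CSV files later on.
conn.execute(
    "INSERT INTO metrics_labels (id, metric_name, labels) VALUES (?, ?, ?)",
    (1, "cpu_usage_guest", json.dumps({"cpu": "cpu-total", "job": "prometheus"})),
)
conn.execute(
    "INSERT INTO metrics_values VALUES (?, ?, ?)",
    ("2020-03-18 07:59:57", 0.0, 1),
)

# A sample is resolved back to its metric by joining on labels_id.
row = conn.execute("""
    SELECT v.time, v.value, l.metric_name
    FROM metrics_values v
    JOIN metrics_labels l ON v.labels_id = l.id
""").fetchone()
print(row)  # ('2020-03-18 07:59:57', 0.0, 'cpu_usage_guest')
```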
1.3 Installing prometheus-postgresql-adapter
prometheus-postgresql-adapter project page: https://github.com/timescale/prometheus-postgresql-adapter
Baidu Netdisk link: https://pan.baidu.com/s/1NXpA-cxx7VeFh6lfXW-F8A (extraction code: 50gd)
# install dependency
sudo apt-get install golang-go
git clone https://github.com/timescale/prometheus-postgresql-adapter
cd prometheus-postgresql-adapter
# git checkout 0.6.0
make  # building with go ran into various problems
# fall back to the prebuilt binary instead
wget https://github.com/timescale/prometheus-postgresql-adapter/releases/download/v0.6.0/prometheus-postgresql-adapter-0.6.0-linux-amd64.tar.gz
tar xf prometheus-postgresql-adapter-0.6.0-linux-amd64.tar.gz
Start it locally:
./prometheus-postgresql-adapter -pg-port 5432 --pg-user postgres
The adapter prints output like the following:
{"caller":"log.go:31","config":"\u0026{remoteTimeout:30000000000 listenAddr::9201 telemetryPath:/metrics pgPrometheusConfig:{host:localhost port:5432 user:postgres password: database:postgres schema: sslMode:disable table:metrics copyTable: maxOpenConns:50 maxIdleConns:10 pgPrometheusNormalize:true pgPrometheusLogSamples:false pgPrometheusChunkInterval:43200000000000 useTimescaleDb:true dbConnectRetries:0 readOnly:false} logLevel:debug haGroupLockID:0 restElection:false prometheusTimeout:-1 electionInterval:5000000000}","level":"info","ts":"2020-06-22T03:39:09.775Z"}
{"caller":"log.go:31","level":"info","msg":"host=localhost port=5432 user=postgres dbname=postgres password='' sslmode=disable connect_timeout=10","ts":"2020-06-22T03:39:09.775Z"}
{"caller":"log.go:31","level":"info","msg":"Initialized pg_prometheus extension","ts":"2020-06-22T03:39:10.023Z"}
{"caller":"log.go:35","level":"warn","msg":"No adapter leader election. Group lock id is not set. Possible duplicate write load if running adapter in high-availability mode","ts":"2020-06-22T03:39:10.025Z"}
{"caller":"log.go:31","level":"info","msg":"Starting up...","ts":"2020-06-22T03:39:10.025Z"}
{"addr":":9201","caller":"log.go:31","level":"info","msg":"Listening","ts":"2020-06-22T03:39:10.025Z"}
1.4 Importing the data
Dataset Baidu Netdisk link: https://pan.baidu.com/s/1i-w-PjuK5Qi7kda6duGLjg (extraction code: 7hk4)
/P4/datasets/prom_data_csvs$ psql -U postgres -d prometheus
\COPY metrics_labels FROM metrics_labels_202003201742.csv DELIMITER ',' CSV HEADER;
\COPY metrics_values FROM metrics_values_202003201743.csv DELIMITER ',' CSV HEADER;
An error occurred because the DELIMITER ',' CSV HEADER options describing the CSV files were left out on the first attempt:
ERROR: invalid input syntax for integer: ""id","metric_name","labels""
CONTEXT: COPY metrics_labels, line 1, column id: ""id","metric_name","labels""
1.5 Understanding the Prometheus data model
The Prometheus data model has three core concepts. Reference: https://prometheus.io/docs/concepts/data_model/
- Time series: Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions. Every time series is uniquely identified by its metric name and optional key-value pairs called labels.
- Metric name: The metric name specifies the general feature of a system that is measured (e.g. http_requests_total - the total number of HTTP requests received).
- Metric labels: Labels enable Prometheus's dimensional data model: any given combination of labels for the same metric name identifies a particular dimensional instantiation of that metric (for example: all HTTP requests that used the method POST to the /api/tracks handler). The query language allows filtering and aggregation based on these dimensions. Changing any label value, including adding or removing a label, will create a new time series.
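The identity rule above, that a series is its metric name plus the complete label set, and any label change produces a new series, can be sketched as follows. The `series_key` helper is hypothetical, not part of Prometheus itself:

```python
# Sketch: a time series identity is the metric name plus the full label set.
def series_key(metric_name, labels):
    """Canonical identity of a Prometheus time series (hypothetical helper)."""
    # Sort the label pairs so that label order never matters.
    return (metric_name, tuple(sorted(labels.items())))

a = series_key("http_requests_total", {"method": "POST", "handler": "/api/tracks"})
b = series_key("http_requests_total", {"method": "GET", "handler": "/api/tracks"})
c = series_key("http_requests_total", {"handler": "/api/tracks", "method": "POST"})

assert a != b  # a different label value -> a different series
assert a == c  # label order does not matter
```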
A concrete example:
The metrics_values_202003201743.csv file references labels_id as a foreign key. The time column is the timestamp, value is the measured data, and labels_id identifies what that measurement means.
"time","value","labels_id"
2020-03-18 07:59:57,0,1
2020-03-18 07:59:47,0,1
2020-03-18 07:59:37,0,1
2020-03-18 07:59:27,0,1
2020-03-18 07:59:17,0,1
2020-03-18 07:59:07,0,1
2020-03-18 07:58:57,0,1
2020-03-18 07:58:47,0,1
2020-03-18 07:58:37,0,1
The metrics_labels_202003201742.csv file defines the labels. id is the identifier, metric_name is the human-readable metric name, and labels holds the metric's detailed label set. Here labels is jsonb data.
"id","metric_name","labels"
1,cpu_usage_guest,"{""cpu"": ""cpu-total"", ""job"": ""prometheus"", ""host"": ""tsdb-avthar-taxi-tutorial-9"", ""cloud"": ""timescale-aws-us-west-2"", ""project"": ""support-50d0"", ""service"": ""tsdb-avthar-taxi-tutorial"", ""instance"": ""tsdb-avthar-taxi-tutorial-support-50d0.a.timescaledb.io:9273"", ""service_type"": ""pg""}"
2,cpu_usage_guest,"{""cpu"": ""cpu0"", ""job"": ""prometheus"", ""host"": ""tsdb-avthar-taxi-tutorial-9"", ""cloud"": ""timescale-aws-us-west-2"", ""project"": ""support-50d0"", ""service"": ""tsdb-avthar-taxi-tutorial"", ""instance"": ""tsdb-avthar-taxi-tutorial-support-50d0.a.timescaledb.io:9273"", ""service_type"": ""pg""}"
After un-escaping the doubled quotes ("") in these two records, you can see clearly how labels expresses key-value pairs in JSON:
{cpu: cpu-total, job: prometheus, host: tsdb-avthar-taxi-tutorial-9, cloud: timescale-aws-us-west-2, project: support-50d0, service: tsdb-avthar-taxi-tutorial, instance: tsdb-avthar-taxi-tutorial-support-50d0.a.timescaledb.io:9273, service_type: pg}
{cpu: cpu0, job: prometheus, host: tsdb-avthar-taxi-tutorial-9, cloud: timescale-aws-us-west-2, project: support-50d0, service: tsdb-avthar-taxi-tutorial, instance: tsdb-avthar-taxi-tutorial-support-50d0.a.timescaledb.io:9273, service_type: pg}
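The doubled quotes are ordinary CSV escaping, so any CSV reader recovers the JSON intact. A quick sketch with Python's standard library, using an abbreviated version of the row above:

```python
import csv
import io
import json

# One data row from metrics_labels_202003201742.csv (abbreviated):
# the labels column is JSON with quotes doubled per CSV escaping rules.
csv_text = '''"id","metric_name","labels"
1,cpu_usage_guest,"{""cpu"": ""cpu-total"", ""job"": ""prometheus"", ""service_type"": ""pg""}"
'''

reader = csv.DictReader(io.StringIO(csv_text))
row = next(reader)
labels = json.loads(row["labels"])  # the csv module already un-doubled the quotes

print(labels["cpu"])           # cpu-total
print(labels["service_type"])  # pg
```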
Since PostgreSQL 9.3, json has been a native data type in Postgres: just like varchar or int, a table column can be declared as json. Postgres also provides jsonb, the binary form of json. The trade-off is that json is fast to write but slow to read, while jsonb is slower to write but faster to read; the two types support largely the same operators, though jsonb additionally supports GIN indexing, as on the metrics_labels.labels column above. The following data types are all variable-length:
Data type | Length limit |
---|---|
char(n) | fixed length n, padded with blanks |
varchar(n) | variable length, limited to n |
text | variable length, no limit |
jsonb | variable length, limited to 268435455 bytes |
Metric Names
For the naming conventions for metric names and labels, see the document METRIC AND LABEL NAMING.
A metric name...
- ...must comply with the data model for valid characters.
- ...should have a (single-word) application prefix relevant to the domain the metric belongs to. The prefix is sometimes referred to as namespace by client libraries. For metrics specific to an application, the prefix is usually the application name itself. Sometimes, however, metrics are more generic, like standardized metrics exported by client libraries. Examples:
  - **prometheus**_notifications_total (specific to the Prometheus server)
  - **process**_cpu_seconds_total (exported by many client libraries)
  - **http**_request_duration_seconds (for all HTTP requests)
- ...must have a single unit (i.e. do not mix seconds with milliseconds, or seconds with bytes).
- ...should use base units (e.g. seconds, bytes, meters - not milliseconds, megabytes, kilometers). See below for a list of base units.
- ...should have a suffix describing the unit, in plural form. Note that an accumulating count has total as a suffix, in addition to the unit if applicable.
  - http_request_duration_**seconds**
  - node_memory_usage_**bytes**
  - http_requests_**total** (for a unit-less accumulating count)
  - process_cpu_**seconds_total** (for an accumulating count with unit)
  - foobar_build**_info** (for a pseudo-metric that provides metadata about the running binary)
- ...should represent the same logical thing-being-measured across all label dimensions.
- request duration
- bytes of data transfer
- instantaneous resource usage as a percentage
As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not.
Labels
Use labels to differentiate the characteristics of the thing that is being measured:
- api_http_requests_total - differentiate request types: operation="create|update|delete"
- api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"
Do not put the label names in the metric name, as this introduces redundancy and will cause confusion if the respective labels are aggregated away.
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
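The multiplicative growth behind this caution can be illustrated with a small sketch; the label values below are made up for illustration:

```python
from itertools import product

# Every distinct combination of label values is its own time series,
# so the series count grows multiplicatively with label cardinality.
label_values = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "404", "500"],
    "instance": ["db-1", "db-2"],
}

series = list(product(*label_values.values()))
print(len(series))  # 4 * 3 * 2 = 24 distinct time series

# Adding a high-cardinality label such as user_id multiplies this again:
# 24 series * 1_000_000 users = 24_000_000 series -- avoid such labels.
```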
Base units
Prometheus does not have any units hard coded. For better compatibility, base units should be used. The following lists some metrics families with their base unit. The list is not exhaustive.
Family | Base unit | Remark |
---|---|---|
Time | seconds | |
Temperature | celsius | celsius is preferred over kelvin for practical reasons. kelvin is acceptable as a base unit in special cases like color temperature or where temperature has to be absolute. |
Length | meters | |
Bytes | bytes | |
Bits | bytes | To avoid confusion combining different metrics, always use bytes, even where bits appear more common. |
Percent | ratio | Values are 0–1 (rather than 0–100). ratio is only used as a suffix for names like disk_usage_ratio . The usual metric name follows the pattern A_per_B . |
Voltage | volts | |
Electric current | amperes | |
Energy | joules | |
Mass | grams | grams is preferred over kilograms to avoid issues with the kilo prefix. |
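A minimal sketch of normalizing values to base units before exposing them as metrics. The helper names are illustrative only, not a real client-library API:

```python
# Normalize common measurements to Prometheus base units.
def milliseconds_to_seconds(ms):
    """Time family: base unit is seconds."""
    return ms / 1000.0

def megabytes_to_bytes(mb):
    """Bytes family: base unit is bytes."""
    return mb * 1024 * 1024

def percent_to_ratio(pct):
    """Percent family: export a 0-1 ratio rather than 0-100."""
    return pct / 100.0

print(milliseconds_to_seconds(250))  # 0.25
print(megabytes_to_bytes(2))         # 2097152
print(percent_to_ratio(87.5))        # 0.875
```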
2. Setting up Grafana
2.1 Installing Grafana
Install it by following the Grafana installation guide. Start the server after installation.
2.2 Connecting Grafana to your TimescaleDB instance
Next, you need to configure Grafana to connect to your TimescaleDB instance.
Logging in for the first time:
- Open your web browser and go to http://localhost:3000/. Unless another port is configured, 3000 is the default HTTP port that Grafana listens on.
- On the login page, the default username and password are both admin.
- Change the login password as needed.
First select "Add data source", then choose the "PostgreSQL" option in the SQL group.
On the settings page, fill in the parameters; ours are:
- Host: localhost:5432
- Database: prometheus
- User: postgres
- Password: secret
- SSL Mode: disable
After clicking the Save & Test button, a Database Connection OK message indicates a successful connection.
3. Optimizing TimescaleDB for Prometheus metrics
There are several ways to optimize your TimescaleDB to maximize storage, query time, and overall cost efficiency:
- Use compression to save storage space
- Use data retention policies and decide how much metrics data to keep
- Use continuous aggregates to reduce how often queries hit the database
3.1 Compression
For this tutorial, compression saved us 70% of the space. With compression, you can keep raw Prometheus metrics for longer while using less disk space.
Using compression is easy. First, decide the interval after which data should be compressed (for example, after 6 hours), then add an automatic compression policy on the metrics_values hypertable, since that table will grow as more samples are collected. To do this, run the following commands in psql:
--- Add compression policy on metrics_values -- compress old data after every 6 hours
ALTER TABLE metrics_values SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'labels_id'
);
SELECT add_compress_chunks_policy('metrics_values', INTERVAL '6 hours');
3.2 Data retention
You can use data retention to determine how long to store long-term metrics.
Data retention policies are powerful because they let you keep downsampled metrics: you retain aggregates of your data for long-term analysis while discarding the underlying raw data that produced them, saving storage space.
For example, if you generate a lot of metrics data every day, you may only want to keep the raw data for a certain period (say, 15 or 30 days) and then keep aggregates of that data indefinitely. For a scheme that retains data for about 2 days, you can enable it as follows:
--Adding a data retention policy to drop chunks that only consist of data older than 2 days old
-- available on Timescale Cloud and in Timescale Community v 1.7+
SELECT add_drop_chunks_policy('metrics_values', INTERVAL '2 days', cascade_to_materializations=>FALSE);
To check that the policy was created successfully, and for information about it, run:
SELECT * FROM timescaledb_information.drop_chunks_policies;
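Conceptually, what a drop_chunks policy does can be sketched as follows: chunks whose entire time range falls past the retention cutoff are removed wholesale. The chunk boundaries below are invented for illustration:

```python
from datetime import datetime, timedelta

def chunks_to_drop(chunks, now, retention):
    """Select chunks whose data is entirely older than the retention cutoff."""
    cutoff = now - retention
    return [c for c in chunks if c["end"] <= cutoff]

now = datetime(2020, 3, 20)
chunks = [
    {"name": "chunk_1", "end": datetime(2020, 3, 16)},
    {"name": "chunk_2", "end": datetime(2020, 3, 18)},
    {"name": "chunk_3", "end": datetime(2020, 3, 20)},
]

# With a 2-day retention interval, the two oldest chunks are dropped whole.
dropped = chunks_to_drop(chunks, now, timedelta(days=2))
print([c["name"] for c in dropped])  # ['chunk_1', 'chunk_2']
```

Dropping whole chunks is why retention in TimescaleDB is cheap: it is a file-level delete, not a row-by-row DELETE.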
3.3 Continuous aggregates
In this tutorial, we use continuous aggregates to compute common aggregate queries in the background, so you save processing time whenever you need the information.
To use continuous aggregates together with a data retention policy, your aggregation interval must be less than or equal to the data retention interval. That is, if your data retention interval is DRI and your aggregation interval is AI, then AI <= DRI.
To create a five-minute (5m) continuous aggregate that rolls up the max, average, and min of the metrics before the underlying data is dropped by the retention policy we set above:
-- 5 minute rollups of metrics
-- refresh interval same as time bucket interval
CREATE VIEW metrics_5mins
WITH (timescaledb.continuous,
timescaledb.ignore_invalidation_older_than='2d',
timescaledb.refresh_lag = '-30m',
timescaledb.refresh_interval = '5m')
AS
SELECT time_bucket('5 minutes', time) as bucket,
labels_id,
avg(value) as avg,
max(value) as max,
min(value) as min
FROM
metrics_values
GROUP BY bucket, labels_id;
In the WITH clause, we specify that:
- The aggregate ignores attempts to look up underlying data older than 2 days, since that data will be deleted by the automatic retention policy we set in the data retention section above.
- On every refresh, the aggregate is up to date with the latest available data. We could also configure it to lag slightly behind real time, reducing computation, by setting timescaledb.refresh_lag to a positive value (e.g. 10m or 1d, depending on our requirements).
- The aggregated data refreshes every 5 minutes, which is the interval specified in our time_bucket statement.
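What the continuous aggregate computes can be sketched outside the database: a simplified time_bucket that truncates each timestamp down to its 5-minute bucket, followed by an avg/max/min rollup per (bucket, labels_id) group. The sample values below are invented:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def time_bucket(width, ts):
    """Simplified time_bucket: truncate ts down to a multiple of width."""
    epoch = datetime(1970, 1, 1)
    n = int((ts - epoch) / width)
    return epoch + n * width

samples = [  # (time, value, labels_id)
    (datetime(2020, 3, 18, 7, 58, 30), 2.0, 296),
    (datetime(2020, 3, 18, 7, 59, 30), 4.0, 296),
    (datetime(2020, 3, 18, 8, 1, 0), 6.0, 296),
]

# GROUP BY bucket, labels_id
groups = defaultdict(list)
for ts, value, labels_id in samples:
    groups[(time_bucket(timedelta(minutes=5), ts), labels_id)].append(value)

# avg(value), max(value), min(value) per group
rollup = {k: (sum(v) / len(v), max(v), min(v)) for k, v in groups.items()}
for (bucket, labels_id), (avg, mx, mn) in sorted(rollup.items()):
    print(bucket, labels_id, avg, mx, mn)
```

The first two samples land in the 07:55 bucket, the third in the 08:00 bucket, mirroring how the view materializes one row per bucket per label set.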
To check that the aggregate was created successfully, and for information about it, run:
SELECT view_name, refresh_lag, refresh_interval, max_interval_per_job, materialization_hypertable
FROM timescaledb_information.continuous_aggregates;
To create an hourly rollup of the metrics, run:
--1 hour agg
CREATE VIEW metrics_hourly
WITH (timescaledb.continuous,
timescaledb.ignore_invalidation_older_than='2d',
timescaledb.refresh_lag = '-30m',
timescaledb.refresh_interval = '1h')
AS
SELECT time_bucket('1 hour', time) as bucket,
labels_id,
avg(value) as avg,
max(value) as max,
min(value) as min
FROM
metrics_values
GROUP BY bucket, labels_id;
Note how the refresh interval changes to match the aggregation interval set in the time_bucket part of the statement.
With these optimizations in place, we are ready to dive in and start using our data to answer common questions.
4. Visualizing Prometheus data in Grafana
- How many active connections are there to the database?
Prometheus metrics are stored in the metrics_values table, and the metric labels are stored in the metrics_labels table. By matching an id from metrics_labels against the labels_id column in metrics_values, you can see all the values for that particular metric.
First, let's find the id values of the metrics corresponding to postgresql_pg_stat_activity_conn_count:
SELECT *
FROM metrics_labels
WHERE metric_name LIKE 'postgresql_pg_stat_activity_conn_count';
The result will look like this:
id | metric_name | labels
-----+----------------------------------------+------------------
296 | postgresql_pg_stat_activity_conn_count | { labels here }
297 | postgresql_pg_stat_activity_conn_count | { labels here }
Our Prometheus data comes from two different databases, a default database and a test database. We can see that the ids of the metric we need are 296 (corresponding to our default database) and 297 (corresponding to our test database). We can now fetch all the values of the metric by running the following two SQL queries, one per database:
SELECT
time,
avg(value) AS "defaultdb connections"
FROM metrics_values
WHERE
labels_id = 296
GROUP BY 1
ORDER BY 1;
SELECT
time,
avg(value) AS "testdb connections"
FROM metrics_values
WHERE
labels_id = 297
GROUP BY 1
ORDER BY 1;
The result of the first query above should look like this:
time | defaultdb connections
----------------------------+-----------------------
2020-03-16 04:11:27.936+00 | 2
2020-03-16 04:11:37.937+00 | 2
2020-03-16 04:11:47.937+00 | 2
...
The contents of the metrics_labels table can differ depending on your Prometheus installation. It is therefore better not to query by labels_id, but to JOIN the two tables instead, as follows:
SELECT
time,
avg(value) AS "defaultdb connections"
FROM metrics_values
JOIN metrics_labels ON metrics_values.labels_id = metrics_labels.id
WHERE
metrics_labels.metric_name LIKE 'postgresql_pg_stat_activity_conn_count'
GROUP BY 1
ORDER BY 1;
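The same join can be reproduced on a tiny in-memory SQLite copy of the schema to see the shape of the result. The sample rows below are invented to match the output shown earlier:

```python
import sqlite3

# In-memory stand-in for the Prometheus database (labels_id 296 is the
# metric id found in the tutorial's dataset).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metrics_labels (id INTEGER PRIMARY KEY, metric_name TEXT);
    CREATE TABLE metrics_values (time TEXT, value REAL, labels_id INTEGER);
    INSERT INTO metrics_labels VALUES (296, 'postgresql_pg_stat_activity_conn_count');
    INSERT INTO metrics_values VALUES
        ('2020-03-16 04:11:27', 2, 296),
        ('2020-03-16 04:11:37', 2, 296);
""")

# The join resolves each sample to its metric name, so the query no longer
# depends on the numeric labels_id.
rows = conn.execute("""
    SELECT time, AVG(value) AS "defaultdb connections"
    FROM metrics_values
    JOIN metrics_labels ON metrics_values.labels_id = metrics_labels.id
    WHERE metrics_labels.metric_name LIKE 'postgresql_pg_stat_activity_conn_count'
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)  # [('2020-03-16 04:11:27', 2.0), ('2020-03-16 04:11:37', 2.0)]
```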
Visualizing this query in Grafana
First make sure you are logged in to Grafana and have successfully connected the database.
In the top-left corner, choose + Create Dashboard, then Add new panel to enter the Edit Panel view; the layout of this page varies between Grafana versions, but the content is essentially the same.
We will edit the query directly rather than using the Grafana query builder. In the view, click the "Edit SQL" button at the bottom.
Before writing the query, we also want to set the panel's data source to the PostgreSQL data source we connected earlier.
Enter the following SQL query:
SELECT
$__timeGroupAlias("time", 1m),
avg(value) AS "defaultdb connections"
FROM metrics_values
JOIN metrics_labels ON metrics_values.labels_id = metrics_labels.id
WHERE
$__timeFilter("time") AND
metrics_labels.metric_name LIKE 'postgresql_pg_stat_activity_conn_count'
GROUP BY 1
ORDER BY 1;
The visualization should look similar to the figure below.
Owing to the performance limits of the virtual machine, we cannot show visualizations for all five metrics listed at the start of this tutorial at once, so only the first one, "How many active connections are there to the database?", is shown. For the remaining metrics, follow the tutorial: create the corresponding VIEWs in the database with psql, then build the visualizations in Grafana.