timescaledb 集成 madlib
github 上有人提出了一个问题(2017 很早了),然后搜索timescaledb 的docs 文档,发现有
一片介绍的文章,所以尝试运行下
备注: 环境使用虚拟机安装(没有使用docker madlib 的原因,实际上可以尝试基于timescaledb 的镜像改造)
安装madlib
这个可以参考madlib 官方文档,或者https://www.cnblogs.com/rongfengliang/p/10298159.html
因为timescaledb 是一个pg 的标准扩展,可以复用以前的安装方式
- 添加pg 10 repo
yum install https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm
- 安装pg python 基本包
yum -y install postgresql10-plpython supervisor
- 安装madlib 依赖包
注意python 版本,我使用python 2.7 安装时候失败了,修改为了python34
yum update -y && yum install -y \
git \
gcc \
wget \
postgresql10-devel \
openssl \
m4 \
vim \
flex \
bison \
graphviz \
java \
epel-release \
python34-devel
- 安装pip 包
默认一般是包含的
yum install -y python34-pip
- pg_conf 配置(环境变量)
PATH="$PATH:/usr/pgsql-10/bin"
- 安装python 依赖(通过pip)
pip3 install awscli pygresql paramiko --upgrade
- 安装apache-madlib
下载rpm 包
wget https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm
安装
yum install -y apache-madlib-1.15.1-bin-Linux.rpm
timescaledb 扩展安装
- 下载timescaledb pg 扩展
wget https://timescalereleases.blob.core.windows.net/rpm/timescaledb-1.1.1-postgresql-10-0.x86_64.rpm
- 安装
yum install -y timescaledb-1.1.1-postgresql-10-0.x86_64.rpm
- 初始化数据库
/usr/pgsql-10/bin/postgresql-10-setup initdb
- 配置pg 扩展
/var/lib/pgsql/10/data/postgresql.conf 文件
+ shared_preload_libraries = 'timescaledb'
- 修改pg_hba.conf 添加访问支持
之后修改之后,需要重启服务,systemctl restart postgresql-10
/var/lib/pgsql/10/data/pg_hba.conf
修改如下:
# "local" is for Unix domain socket connections only
local all all trust
# IPv4 local connections:
host all all 127.0.0.1/32 trust
# IPv6 local connections:
host all all ::1/128 trust
- 启动数据库
systemctl enable postgresql-10
systemctl start postgresql-10
注册madlib 扩展
- 下载&&安装madlib 扩展
下载rpm 包
wget https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm
安装
yum install -y apache-madlib-1.15.1-bin-Linux.rpm
- 注册
/usr/local/madlib/bin/madpack -s madlib -p postgres -c postgres@localhost:5432/postgres install
效果
madpack.py: INFO : Detected PostgreSQL version 10.6.
madpack.py: INFO : *** Installing MADlib ***
madpack.py: INFO : MADlib tools version = 1.15.1 (/usr/local/madlib/Versions/1.15.1/bin/../madpack/madpack.py)
madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=postgres, schema=madlib)
madpack.py: INFO : Testing PL/Python environment
madpack.py: INFO : > Creating language PL/Python
madpack.py: INFO : > PL/Python environment OK (version: 2.7.5)
madpack.py: INFO : > Preparing objects for the following modules:
madpack.py: INFO : > - array_ops
madpack.py: INFO : > - bayes
madpack.py: INFO : > - crf
madpack.py: INFO : > - elastic_net
madpack.py: INFO : > - linalg
madpack.py: INFO : > - pmml
madpack.py: INFO : > - prob
madpack.py: INFO : > - sketch
madpack.py: INFO : > - svec
madpack.py: INFO : > - svm
madpack.py: INFO : > - tsa
madpack.py: INFO : > - stemmer
madpack.py: INFO : > - conjugate_gradient
madpack.py: INFO : > - knn
madpack.py: INFO : > - lda
madpac
- 检查
/usr/local/madlib/bin/madpack -s madlib -p postgres -c postgres@localhost:5432/postgres install-check
效果
TEST CASE RESULT|Module: bayes|bayes.ic.sql_in|PASS|Time: 121 milliseconds
TEST CASE RESULT|Module: crf|crf_train_small.ic.sql_in|PASS|Time: 112 milliseconds
TEST CASE RESULT|Module: crf|crf_test_small.ic.sql_in|PASS|Time: 133 milliseconds
TEST CASE RESULT|Module: elastic_net|elastic_net.ic.sql_in|PASS|Time: 133 milliseconds
TEST CASE RESULT|Module: linalg|linalg.ic.sql_in|PASS|Time: 44 milliseconds
TEST CASE RESULT|Module: linalg|svd.ic.sql_in|PASS|Time: 168 milliseconds
TEST CASE RESULT|Module: linalg|matrix_ops.ic.sql_in|PASS|Time: 238 milliseconds
TEST CASE RESULT|Module: prob|prob.ic.sql_in|PASS|Time: 21 milliseconds
TEST CASE RES
简单测试
内容来自官方文档
https://docs.timescale.com/v0.11/tutorials/tutorial-forecasting,https://docs.timescale.com/v1.1/tutorials/tutorial-hello-nyc
同时需要安装的组件也比较多,gis,timescale。。
- 预备安装
gis 扩展安装,后边需要,实际上这个是可选的
wget https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm // 这个yum 源
上边实际上已经执行了
yum install -y postgis25_10
- 加载扩展
都在postgres 数据库
psql -U postgres -d postgres -h localhost
\c postgres
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
CREATE EXTENSION postgis;
- 加载数据部分数据(因为gis 数据有依赖关系)
数据有点多,可能会比较慢
wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz
tar -xvzf nyc_data.tar.gz
psql -U postgres -d postgres -h localhost < nyc_data.sql
psql -U postgres -d postgres -h localhost -c "\COPY rides FROM nyc_data_rides.csv CSV"
- 初始化gis 扩展 && forecast.sql
数据有点多,可能会比较慢
ALTER TABLE rides ADD COLUMN pickup_geom geometry(POINT,2163);
ALTER TABLE rides ADD COLUMN dropoff_geom geometry(POINT,2163);
生成geo 数据:
UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);
UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);
wget http://assets.iobeam.com/sql/forecast.sql
// 此过程会创建rides_count rides_length rides_price 等hypertables
psql -U postgres -d postgres -h localhost -f forecast.sql
查询操作
- 查询rides_price
如下:
SELECT * FROM rides_price;
one_hour | trip_price
---------------------+------------------
2016-01-01 00:00:00 | 58.34
2016-01-01 01:00:00 | 58.34
2016-01-01 02:00:00 | 58.34
2016-01-01 03:00:00 | 58.34
2016-01-01 04:00:00 | 58.34
2016-01-01 05:00:00 | 59.59
2016-01-01 06:00:00 | 58.34
2016-01-01 07:00:00 | 60.3833333333333
2016-01-01 08:00:00 | 61.2575
2016-01-01 09:00:00 | 58.435
2016-01-01 10:00:00 | 63.952
2016-01-01 11:00:00 | 59.9576923076923
2016-01-01 12:00:00 | 60.4
- 创建训练以及测试的数据集
这个主要是测试从肯尼迪国际机场到时代广场的车程价格
-- Make the training dataset
SELECT * INTO rides_price_train FROM rides_price
WHERE one_hour <= '2016-01-21 23:59:59';
-- Make the testing dataset
SELECT * INTO rides_price_test FROM rides_price
WHERE one_hour >= '2016-01-22 00:00:00';
效果
SELECT * INTO rides_price_train FROM rides_price
postgres-# WHERE one_hour <= '2016-01-21 23:59:59';
SELECT 504
postgres=# SELECT * INTO rides_price_test FROM rides_price
postgres-# WHERE one_hour >= '2016-01-22 00:00:00';
SELECT 240
- 使用madlib 函数生成训练数据
DROP TABLE IF EXISTS rides_price_output;
DROP TABLE IF EXISTS rides_price_output_residual;
DROP TABLE IF EXISTS rides_price_output_summary;
DROP TABLE IF EXISTS rides_price_forecast_output;
SELECT madlib.arima_train('rides_price_train', -- input table
'rides_price_output', -- output table
'one_hour', -- timestamp column
'trip_price', -- time-series column
NULL, -- grouping columns
TRUE, -- include_mean
ARRAY[2,1,3] -- non-seasonal orders
);
SELECT madlib.arima_forecast('rides_price_output', -- model table
'rides_price_forecast_output', -- output table
240 -- steps_ahead (10 days)
);
- 查看生成的结果
SELECT * FROM rides_price_forecast_output;
数据
SELECT * FROM rides_price_forecast_output;
steps_ahead | forecast_value
-------------+----------------
1 | 62.317574661
2 | 62.7126520761
3 | 62.892038632
4 | 62.755044625
5 | 62.6064068106
6 | 62.6197088752
7 | 62.7032172965
8 | 62.729257786
9 | 62.6956015739
10 | 62.6685762986
11 | 62.6755999496
12 | 62.6926055721
13 | 62.695637064
14 | 62.6878845777
15 | 62.683
结论: 从肯尼迪国际机场到时代广场的车程价格在日常基础上保持不变
- 模型评估的
ALTER TABLE rides_price_test ADD COLUMN id SERIAL PRIMARY KEY;
ALTER TABLE rides_price_test ADD COLUMN forecast DOUBLE PRECISION;
UPDATE rides_price_test
SET forecast = rides_price_forecast_output.forecast_value
FROM rides_price_forecast_output
WHERE rides_price_test.id = rides_price_forecast_output.steps_ahead;
SELECT madlib.mean_abs_perc_error('rides_price_test', 'rides_price_mean_abs_perc_error', 'trip_price', 'forecast');
SELECT * FROM rides_price_mean_abs_perc_error;
结果
SELECT * FROM rides_price_mean_abs_perc_error;
mean_abs_perc_error
---------------------
0.0423789161947618
结论: 从机场到曼哈顿的旅行价格保持在62美元,并且与测试数据集相比表现良好
说明
有一些关于机器学习以及统计的专业术语不是很懂,但是从基本的运行上,说明,madlib 与timescaledb 的集成还是
很方便的,这样我们既能有时序数据库的方便,同时还包含了基于sql 的机器学习能力,当然我们同时也可以集成graphql
engine 进行方便的graphql api ,总的来说基于标准pg 扩展的模型,部署还挺方便的
参考资料
https://docs.timescale.com/v0.11/tutorials/tutorial-forecasting
https://docs.timescale.com/v0.11/getting-started
https://docs.timescale.com/v1.1/tutorials/tutorial-hello-nyc#tutorial-postgis