Getting Started with TimescaleDB
Use case: IoT analysis and monitoring
In this tutorial, our main tasks are:
- Learn how to get started with TimescaleDB
- Learn how to use TimescaleDB to analyze and monitor data from IoT sensors
Dataset: nyc_data.tar.gz
Background
In this tutorial, we use TimescaleDB to analyze and monitor data from New York City taxis, with the goal of improving efficiency and reducing greenhouse gas emissions. Our analysis is similar to the kind used in many fields to plan upgrades, set budgets, allocate resources, and so on.
- Goal 1: Set up — provision and connect to a TimescaleDB instance.
- Goal 2: Load data — load data from a CSV file using psql in a local terminal.
- Goal 3: Analyze data — analyze a time-series dataset with TimescaleDB and PostgreSQL.
0. Introduction to TimescaleDB
TimescaleDB is a time-series database built on top of PostgreSQL. It fully supports SQL and PostgreSQL queries, and is optimized for fast inserts and complex queries over time-series data, making it particularly well suited to big-data domains such as monitoring, IoT, finance, and logistics.
1. Set Up
1.1 Download and Load Data
Start by downloading the dataset; we will only use the data from January 2016.
The dataset contains the following two files:
- nyc_data.sql: a SQL file that creates the tables
- nyc_data_rides.csv: a CSV file containing the data
After extracting the dataset, place it in the VM. Here it lives in a dedicated data_tutorials folder under the P4 directory:
p4-basic@p4-basic:~/P4/data_tutorials/nyc_data
1.2 Get Connected to TimescaleDB
There are two ways to connect to the database:
- from the command line, using psql, the command-line tool that ships with PostgreSQL;
- from a graphical client.
Only the first is covered here.
Let's connect to TimescaleDB.
First, confirm that psql is installed:
psql --version
The psql command takes two forms:
psql postgres://username:password@host:port/dbname
psql -U username -h hostname -p port -d dbname
Parameters:
- username: the user to connect as; default postgres
- password: the password; default postgres
- host: the host name; default localhost
- port: the port; default 5432
- dbname: the database to connect to; default postgres
If the first form fails with:
psql: server does not support SSL, but SSL was required
append ?sslmode=allow to the connection string, i.e.
psql postgres://username:password@host:port/dbname?sslmode=allow
Reference: https://stackoverflow.com/questions/36350600/psql-server-does-not-support-ssl-but-ssl-was-required
The command we use to connect to TimescaleDB:
psql -x "postgres://postgres:p4@localhost:5432/postgres?sslmode=allow"
To verify the connection to TimescaleDB, run the \dx command, which lists all extensions installed in the PostgreSQL database. You should see output similar to:
List of installed extensions
| Name | Version | Schema | Description |
|-------------|---------|------------|----------------------------------------------|
| plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language |
| timescaledb | 1.6.0 | public | Enables scalable inserts and complex queries |
If an extension such as timescaledb is missing, it can be added manually later.
1.3 Create your table
From the psql command line, first create a nyc_data database, then switch to nyc_data and add the timescaledb extension.
-- remove the old databases
drop database postgis_reg;
drop database nyc_data;
CREATE DATABASE nyc_data;
\c nyc_data
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
1.4 Define your data schema
As mentioned above, the NYC TLC collects ride-specific data from every vehicle in its fleet, generating data from millions of rides every day.
They collect the following data about each ride:
- Pickup date and time (as a timestamp)
- Pickup location (latitude and longitude)
- Drop off date and time (as a timestamp)
- Drop off location (latitude and longitude)
- Trip distance (in miles)
- Fares (in USD)
- Passenger count
- Rate type (e.g, standard, airport, etc.)
- Payment type (Cash, credit card, etc.)
The tutorial uses the following three tables to store the data:
- A hypertable, rides: stores all of the per-ride data listed above.
- A regular Postgres table, payment_types: maps payment type codes to English descriptions.
- A regular Postgres table, rates: maps numeric rate codes to English descriptions.
The nyc_data.sql script defines the schema for the three tables, and automatically configures the rides, payment_types, and rates tables on the TimescaleDB instance.
The tables are created automatically by running the nyc_data.sql script. The command takes the form:
psql -x "postgres://postgres:p4@localhost:5432/nyc_data?sslmode=allow" < nyc_data.sql
Note: change to the directory containing nyc_data.sql first, or include the path to nyc_data.sql in the command.
Running the nyc_data.sql script produced the following errors:
psql -x "postgres://postgres:p4@localhost:5432/nyc_data?sslmode=allow" < nyc_data.sql
DROP EXTENSION
ERROR: extension "timescaledb" has already been loaded with another version
DETAIL: The loaded version is "1.7.1".
HINT: Start a new session and execute CREATE EXTENSION as the first command. Make sure to pass the "-X" flag to psql.
NOTICE: table "rides" does not exist, skipping
DROP TABLE
CREATE TABLE
ERROR: function create_hypertable(unknown, unknown, unknown, integer, create_default_indexes => boolean) does not exist
LINE 1: SELECT create_hypertable('rides', 'pickup_datetime', 'paymen...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE TABLE
INSERT 0 6
CREATE TABLE
INSERT 0 6
Fixes:
- The timescaledb extension existed originally, but the script dropped it without recreating it, so it needs to be recreated:
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
- The create_hypertable error occurred because the timescaledb extension was missing; re-adding the extension resolves it.
Alternatively, you can run each script manually from the psql command line.
- The hypertable rides, which will store the trip data:
CREATE TABLE "rides"(
vendor_id TEXT,
pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
passenger_count NUMERIC,
trip_distance NUMERIC,
pickup_longitude NUMERIC,
pickup_latitude NUMERIC,
rate_code INTEGER,
dropoff_longitude NUMERIC,
dropoff_latitude NUMERIC,
payment_type INTEGER,
fare_amount NUMERIC,
extra NUMERIC,
mta_tax NUMERIC,
tip_amount NUMERIC,
tolls_amount NUMERIC,
improvement_surcharge NUMERIC,
total_amount NUMERIC
);
SELECT create_hypertable('rides', 'pickup_datetime', 'payment_type', 2, create_default_indexes=>FALSE);
CREATE INDEX ON rides (vendor_id, pickup_datetime DESC);
CREATE INDEX ON rides (pickup_datetime DESC, vendor_id);
CREATE INDEX ON rides (rate_code, pickup_datetime DESC);
CREATE INDEX ON rides (passenger_count, pickup_datetime DESC);
- The regular Postgres table payment_types, pre-populated with the payment types a taxi can accept:
CREATE TABLE IF NOT EXISTS "payment_types"(
payment_type INTEGER,
description TEXT
);
INSERT INTO payment_types(payment_type, description) VALUES
(1, 'credit card'),
(2, 'cash'),
(3, 'no charge'),
(4, 'dispute'),
(5, 'unknown'),
(6, 'voided trip');
- The regular Postgres table rates, pre-populated with the rate types a taxi can charge:
CREATE TABLE IF NOT EXISTS "rates"(
rate_code INTEGER,
description TEXT
);
INSERT INTO rates(rate_code, description) VALUES
(1, 'standard rate'),
(2, 'JFK'),
(3, 'Newark'),
(4, 'Nassau or Westchester'),
(5, 'negotiated fare'),
(6, 'group ride');
Run the \dt command from the psql command line to confirm the setup:
List of relations
Schema | Name | Type | Owner
--------+---------------+-------+----------
public | payment_types | table | postgres
public | rates | table | postgres
public | rides | table | postgres
(3 rows)
2. Load Data
From our nyc_data data directory (here, p4-basic@p4-basic:~/P4/data_tutorials/nyc_data), open psql and connect to the nyc_data database we created:
p4-basic@p4-basic:~/P4/data_tutorials/nyc_data$ psql -x "postgres://postgres:p4@localhost:5432/postgres?sslmode=allow"
postgres=# \c nyc_data
nyc_data=#
Run the following command to copy the data from the CSV file into our table; syncing the data takes some time:
nyc_data=# \COPY rides FROM nyc_data_rides.csv CSV;
Verify the setup by running:
SELECT * FROM rides LIMIT 5;
You should see output like:
-[ RECORD 1 ]---------+--------------------
vendor_id | 1
pickup_datetime | 2016-01-01 00:00:01
dropoff_datetime | 2016-01-01 00:11:55
passenger_count | 1
trip_distance | 1.20
pickup_longitude | -73.979423522949219
pickup_latitude | 40.744613647460938
rate_code | 1
dropoff_longitude | -73.992034912109375
dropoff_latitude | 40.753944396972656
payment_type | 2
fare_amount | 9
extra | 0.5
mta_tax | 0.5
tip_amount | 0
tolls_amount | 0
improvement_surcharge | 0.3
total_amount | 10.3
-[ RECORD 2 ]---------+--------------------
vendor_id | 1
pickup_datetime | 2016-01-01 00:00:02
dropoff_datetime | 2016-01-01 00:11:14
passenger_count | 1
trip_distance | 6.00
pickup_longitude | -73.947151184082031
pickup_latitude | 40.791046142578125
rate_code | 1
dropoff_longitude | -73.920768737792969
dropoff_latitude | 40.865577697753906
payment_type | 2
fare_amount | 18
extra | 0.5
mta_tax | 0.5
tip_amount | 0
tolls_amount | 0
improvement_surcharge | 0.3
total_amount | 19.3
3. Analyze Data
- Question 1: How many rides took place on each day?
The first question to explore is simple: in January 2016, how many rides took place each day?
Since TimescaleDB supports full SQL, computing the number of rides takes just a simple SQL query that counts rides and groups/orders by trip date, as shown below:
-- What's the total number of rides that took place everyday for first 5 days
SELECT date_trunc('day', pickup_datetime) as day, COUNT(*) FROM rides GROUP BY day ORDER BY day;
Syntax: date_trunc(datepart, timestamp)
Truncates the timestamp expression or literal to the specified date part datepart.
Arguments:
- datepart: the date part to truncate the timestamp to; think of it as the precision
- timestamp: the timestamp to convert
In this example, pickup_datetime is truncated to midnight of its day (2016-01-01 00:00:00).
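As an illustration of the truncation, here is a minimal Python sketch (not part of the tutorial) that mimics date_trunc('day', ...):

```python
from datetime import datetime

def date_trunc_day(ts: datetime) -> datetime:
    """Mimic SQL date_trunc('day', ts): zero out everything below the day."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

print(date_trunc_day(datetime(2016, 1, 1, 8, 49, 30)))
# 2016-01-01 00:00:00
```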
With this information we know how many rides take place each day, and can determine which days of the week see the most rides. The results look like:
day | count
---------------------+--------
2016-01-01 00:00:00 | 345037
2016-01-02 00:00:00 | 312831
2016-01-03 00:00:00 | 302878
2016-01-04 00:00:00 | 316171
...
(32 rows)
- Question 2: What is the average fare amount for passengers?
Investigate the daily average fare for rides with only one passenger:
-- What is the daily average fare amount for rides with only one passenger for first 7 days?
SELECT date_trunc('day', pickup_datetime)
AS day, avg(fare_amount)
FROM rides
WHERE passenger_count = 1
AND pickup_datetime < '2016-01-08'
GROUP BY day ORDER BY day;
TIP:Queries like the ones above execute up to 20x faster on large datasets with TimescaleDB vs. a vanilla PostgreSQL database, thanks to Timescale’s automatic time and space partitioning.
The results should look like:
day | avg
---------------------+---------------------
2016-01-01 00:00:00 | 12.5464748850129787
2016-01-02 00:00:00 | 12.1129878886746750
2016-01-03 00:00:00 | 12.8262352076841150
...
(7 rows)
- Question 3: How many rides took place for each rate type?
Count how many rides of each rate type took place:
-- How many rides of each rate type took place in the month?
SELECT rate_code, COUNT(vendor_id) AS num_trips
FROM rides
WHERE pickup_datetime < '2016-02-01'
GROUP BY rate_code
ORDER BY rate_code;
After running the query above, the results look like:
rate_code | num_trips
-----------+-----------
1 | 10626315
2 | 225019
3 | 16822
4 | 4696
5 | 33688
6 | 102
99 | 216
(7 rows)
We can also take advantage of SQL joins and combine these results with the contents of the rates table, as in the following query:
-- How many rides of each rate type took place?
-- Join rides with rates to get more information on rate_code
SELECT rates.description, COUNT(vendor_id) AS num_trips FROM rides
JOIN rates ON rides.rate_code = rates.rate_code
WHERE pickup_datetime < '2016-02-01'
GROUP BY rates.description
ORDER BY rates.description;
TIP:This is a simple illustration of a powerful point: By allowing JOINs over hypertables and regular PostgreSQL tables, TimescaleDB allows you to combine your time-series data with your relational or business data to unearth powerful insights.
Your results should look like the following, combining the information in the rates table with the query you ran earlier:
description | num_trips
-----------------------+-----------
group ride | 102
JFK | 225019
Nassau or Westchester | 4696
negotiated fare | 33688
Newark | 16822
standard rate | 10626315
(6 rows)
Note that, compared with the previous query, this one returns only 6 rows: rate codes with no entry in the rates table (such as 99) are dropped by the join.
- Question 4:Analysis of rides to JFK(Kennedy International Airport) and EWR(Newark International Airport)
From your work calculating rides by rate type, the NYC TLC noticed that rides to John F Kennedy International Airport (JFK) and Newark International Airport (EWR) were the second and fourth most popular ride types, respectively. Given this popularity in airport rides and consequent carbon footprint, the city of New York thinks that airport public transportation could be an area of improvement - reducing traffic in the city and overall carbon footprint associated with airport trips.
Before starting any programs, they want you to look more closely at trips to JFK (rate code 2) and Newark (rate code 3). For each airport, we want to know the following for January:
- Number of trips to that airport
- Average trip duration (i.e. drop off time - pickup time)
- Average trip cost
- Average tip
- Minimum, Maximum and Average trip distance
- Average number of passengers
Run the following query:
-- For each airport: num trips, avg trip duration, avg cost, avg tip, avg distance, min distance, max distance, avg number of passengers
SELECT rates.description, COUNT(vendor_id) AS num_trips,
AVG(dropoff_datetime - pickup_datetime) AS avg_trip_duration, AVG(total_amount) AS avg_total,
AVG(tip_amount) AS avg_tip, MIN(trip_distance) AS min_distance, AVG (trip_distance) AS avg_distance, MAX(trip_distance) AS max_distance,
AVG(passenger_count) AS avg_passengers
FROM rides
JOIN rates ON rides.rate_code = rates.rate_code
WHERE rides.rate_code IN (2,3) AND pickup_datetime < '2016-02-01'
GROUP BY rates.description
ORDER BY rates.description;
The output is:
-[ RECORD 1 ]-----+--------------------
description | JFK
num_trips | 225019
avg_trip_duration | 00:45:46.822517
avg_total | 64.3278115181384683
avg_tip | 7.3334228220728027
min_distance | 0.00
avg_distance | 17.2602816651038357
max_distance | 221.00
avg_passengers | 1.7333869584346211
-[ RECORD 2 ]-----+--------------------
description | Newark
num_trips | 16822
avg_trip_duration | 00:35:16.157472
avg_total | 86.4633688027582927
avg_tip | 9.5461657353465700
min_distance | 0.00
avg_distance | 16.2706122934252764
max_distance | 177.23
avg_passengers | 1.7435501129473309
From this analysis of the data, we can determine:
- JFK sees roughly 13x as many rides as Newark. This likely causes heavy traffic to and from JFK, especially at peak times. The city has decided to explore improving the roads around these areas and adding public transportation to and from the airports (e.g. buses, subway, trains, etc.).
- Rides to both airports carry about the same number of passengers per trip (about 1.7 passengers per trip).
- Trip distances are roughly the same, 16-17 miles.
- JFK trips are about 30% cheaper, most likely because of the New Jersey tunnel and highway tolls on Newark trips.
- Newark trips are about 22% (10 minutes) shorter.
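These ratios can be checked directly from the query output; a quick Python sketch (figures copied, rounded, from the RECORDs above):

```python
# Figures copied (and rounded) from the airport query output above.
jfk = {"num_trips": 225019, "avg_total": 64.33, "avg_minutes": 45.8}
ewr = {"num_trips": 16822, "avg_total": 86.46, "avg_minutes": 35.3}

print(round(jfk["num_trips"] / ewr["num_trips"], 1))         # ~13.4x more trips to JFK
print(round(1 - jfk["avg_total"] / ewr["avg_total"], 2))     # JFK roughly a quarter cheaper
print(round(1 - ewr["avg_minutes"] / jfk["avg_minutes"], 2)) # Newark trips ~23% shorter
```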
This data is useful not just to city planners, but also to airport travelers and tourism organizations like the NYC Tourism Bureau. For example, a tourism organization could advise budget-conscious travelers who'd rather not spend $84 on a ride to Newark to take public transportation instead, such as the NJ Transit train from Penn Station ($15.25 for an adult ticket). Likewise, they could recommend the subway and AirTrain, at just $7.50, to travelers heading to JFK who want to avoid heavy traffic.
Moreover, we could advise people flying into New York City on which airport to choose. For example, from the data above we could recommend that travelers who expect to be in a rush and don't mind paying extra consider flying into Newark over JFK.
If you've made it this far, we have completed our tasks and now have a basic understanding of how to analyze time-series data with TimescaleDB!
4. Other Tasks
The tutorial also contains a Monitoring section. Since our project does not involve PostGIS, we did not work through all of the monitoring content; below are the problems encountered and thoughts from a first pass through it.
We use the taxi trip time-series data to monitor the current state of rides. A more realistic setup would involve building a data pipeline that streams sensor data directly from the cars into TimescaleDB. Here, the January 2016 data is used to illustrate the principle.
- Question 1: How many rides took place every 5 minutes on the first day of 2016?
-- Vanilla Postgres query for num rides every 5 minutes
SELECT
EXTRACT(hour from pickup_datetime) as hours,
trunc(EXTRACT(minute from pickup_datetime) / 5)*5 AS five_mins,
COUNT(*)
FROM rides
WHERE pickup_datetime < '2016-01-02 00:00'
GROUP BY hours, five_mins;
Let's walk through the statement above using a sample time of 08:49:00.
First, extract the hour of the pickup time:
EXTRACT(hour from pickup_datetime) as hours
So for 08:49:00, hours is 8.
Then five_mins computes the nearest lower multiple of 5 minutes for the given timestamp — essentially finding the 5-minute bucket the minute falls into:
trunc(EXTRACT(minute from pickup_datetime) / 5)*5 AS five_mins
So for 08:49 the result is trunc(49/5)*5 = trunc(9.8)*5 = 9*5 = 45, placing this time in the 45-minute bucket.
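The same bucket arithmetic, sketched in Python for illustration:

```python
def five_min_bucket(minute: int) -> int:
    """trunc(minute / 5) * 5: the nearest lower multiple of 5."""
    return (minute // 5) * 5

print(five_min_bucket(49))  # 45 — 08:49 falls in the 08:45 bucket
```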
TimescaleDB provides many custom SQL functions that make time-series analysis fast and simple.
For example, time_bucket is a more powerful version of the PostgreSQL date_trunc function. It allows truncation to arbitrary time intervals, rather than only the standard day, minute, and hour that date_trunc offers.
So with TimescaleDB, the complex query above becomes a much simpler SQL query, as shown below:
-- How many rides took place every 5 minutes for the first day of 2016?
-- using the TimescaleDB "time_bucket" function
SELECT time_bucket('5 minute', pickup_datetime) AS five_min, count(*)
FROM rides
WHERE pickup_datetime < '2016-01-02 00:00'
GROUP BY five_min
ORDER BY five_min;
Results:
five_min | count
---------------------+-------
2016-01-01 00:00:00 | 703
2016-01-01 00:05:00 | 1482
2016-01-01 00:10:00 | 1959
2016-01-01 00:15:00 | 2200
2016-01-01 00:20:00 | 2285
...
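To build intuition for what time_bucket computes, here is a rough Python emulation (a sketch only: it aligns buckets to the Unix epoch, which matches minute/hour-scale intervals but not TimescaleDB's special week handling):

```python
from datetime import datetime, timedelta

def time_bucket(interval: timedelta, ts: datetime) -> datetime:
    """Floor ts to the start of its interval-wide bucket (epoch-aligned)."""
    epoch = datetime(1970, 1, 1)
    width = int(interval.total_seconds())
    seconds = int((ts - epoch).total_seconds())
    return epoch + timedelta(seconds=(seconds // width) * width)

print(time_bucket(timedelta(minutes=5), datetime(2016, 1, 1, 0, 7, 12)))
# 2016-01-01 00:05:00
```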
- Question 2: How many rides on New Year’s morning originated from within 400m of Times Square, in 30 minute buckets?
TimescaleDB is compatible with all other PostgreSQL extensions; for geospatial data we can use PostGIS. This lets us slice the data by both time and location.
-- Geospatial queries - TimescaleDB + POSTGIS -- slice by time and location
-- Install the extension in the database
CREATE EXTENSION postgis;
This produced an error:
ERROR: could not open extension control file "/usr/local/pgsql/share/extension/postgis.control": No such file or directory
Running sudo find /usr -name postgis
still did not find postgis,
so we can conclude that the postgis extension is not installed in PostgreSQL.
Installing the postgis extension manually
Since this PostgreSQL was built from source, its paths differ from those of the apt-installed postgis in the two links above, so postgis cannot be added to PostgreSQL that way.
Therefore postgis also needs to be built and installed from source.
PostGIS 3.0 installation documentation:
http://www.postgis.net/docs/manual-3.0/postgis_installation.html#PGInstall
Required
- PostgreSQL 9.5 or higher. A complete installation of PostgreSQL (including server headers) is required. PostgreSQL is available from http://www.postgresql.org . For a full PostgreSQL / PostGIS support matrix and PostGIS/GEOS support matrix refer to http://trac.osgeo.org/postgis/wiki/UsersWikiPostgreSQLPostGIS
- GNU C compiler (gcc). Some other ANSI C compilers can be used to compile PostGIS, but we find far fewer problems when compiling with gcc.
- GNU Make (gmake or make). For many systems, GNU make is the default version of make. Check the version by invoking make -v. Other versions of make may not process the PostGIS Makefile properly.
- Proj4 reprojection library, version 4.6.0 or greater. Proj4 4.9 or above is needed to take advantage of improved geodetic. The Proj4 library is used to provide coordinate reprojection support within PostGIS. Proj4 is available for download from http://trac.osgeo.org/proj/ .
- GEOS geometry library, version 3.6 or greater, but GEOS 3.7+ is recommended to take full advantage of all the new functions and features. GEOS is available for download from http://trac.osgeo.org/geos/ .
- LibXML2, version 2.5.x or higher. LibXML2 is currently used in some imports functions (ST_GeomFromGML and ST_GeomFromKML). LibXML2 is available for download from http://xmlsoft.org/downloads.html.
- JSON-C, version 0.9 or higher. JSON-C is currently used to import GeoJSON via the function ST_GeomFromGeoJson. JSON-C is available for download from https://github.com/json-c/json-c/releases/.
- GDAL, version 1.8 or higher (1.9 or higher is strongly recommended since some things will not work well or behave differently with lower versions). This is required for raster support. http://trac.osgeo.org/gdal/wiki/DownloadSource.
- If compiling with PostgreSQL+JIT, LLVM version >=6 is required https://trac.osgeo.org/postgis/ticket/4125.
First install the dependencies on Ubuntu:
sudo apt-get install proj-bin libproj-dev libgeos-dev libxml2 libjson-c-dev libgdal-dev docbook
Then build and install postgis from source:
git clone https://github.com/postgis/postgis/
cd postgis
sudo make clean
git checkout stable-3.0
./autogen.sh
./configure
make
#export PGUSER=postgres #overwrite psql variables
#make check #to test before install
sudo make install
# to test extensions
#make check RUNTESTFLAGS=--extension
Verify the installed postgis version:
$ psql -U postgres
psql (11.8)
Type "help" for help.
postgres=# SELECT name, default_version,installed_version FROM pg_available_extensions WHERE name LIKE 'postgis%' or name LIKE 'address%';
name | default_version | installed_version
------------------------+-----------------+-------------------
postgis_topology | 3.0.2dev |
postgis_raster | 3.0.2dev |
postgis_tiger_geocoder | 3.0.2dev |
postgis | 3.0.2dev |
(4 rows)
Now for the postgis-related experiments: open the postgres client and connect to the nyc_data database we created:
$ psql -x "postgres://postgres:p4@localhost:5432/postgres?sslmode=allow"
postgres=# \c nyc_data
nyc_data=#
-- Geospatial queries - TimescaleDB + POSTGIS -- slice by time and location
-- Install the extension in the database
CREATE EXTENSION postgis;
Then, run the \dx command in psql to verify that PostGIS was installed properly. You should see the PostGIS extension in your extension list, as noted below:
List of installed extensions
Name | Version | Schema | Description
-------------+---------+------------+---------------------------------------------------------------------
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
postgis | 2.5.1 | public | PostGIS geometry, geography, and raster spatial types and functions
timescaledb | 1.6.0 | public | Enables scalable inserts and complex queries for time-series data
(3 rows)
Now, we need to alter our table to work with PostGIS. To start, we’ll add geometry columns for ride pick up and drop off locations:
-- Create geometry columns for each of our (lat,long) points
ALTER TABLE rides ADD COLUMN pickup_geom geometry(POINT,2163);
ALTER TABLE rides ADD COLUMN dropoff_geom geometry(POINT,2163);
Next we’ll need to convert the latitude and longitude points into geometry coordinates so that it plays well with PostGIS:
WARNING:This next query may take several minutes.
-- Generate the geometry points and write to table
UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);
UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);
Lastly, we need one more piece of info: Times Square is located at (lat, long) (40.7589, -73.9851).
Now, we have all the information to answer our original question: How many rides on New Year’s morning originated within 400m of Times Square, in 30 minute buckets?
-- How many taxis pick up rides within 400m of Times Square on New Years Day, grouped by 30 minute buckets.
-- Number of rides on New Years Day originating within 400m of Times Square, by 30 min buckets
-- Note: Times Square is at (lat, long) (40.7589,-73.9851)
SELECT time_bucket('30 minutes', pickup_datetime) AS thirty_min, COUNT(*) AS near_times_sq
FROM rides
WHERE ST_Distance(pickup_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)) < 400
AND pickup_datetime < '2016-01-01 14:00'
GROUP BY thirty_min ORDER BY thirty_min;
You should get the following results:
thirty_min | near_times_sq
---------------------+---------------
2016-01-01 00:00:00 | 74
2016-01-01 00:30:00 | 102
2016-01-01 01:00:00 | 120
2016-01-01 01:30:00 | 98
2016-01-01 02:00:00 | 112
2016-01-01 02:30:00 | 109
2016-01-01 03:00:00 | 163
2016-01-01 03:30:00 | 181
2016-01-01 04:00:00 | 214
2016-01-01 04:30:00 | 185
2016-01-01 05:00:00 | 158
2016-01-01 05:30:00 | 113
2016-01-01 06:00:00 | 102
2016-01-01 06:30:00 | 91
2016-01-01 07:00:00 | 88
2016-01-01 07:30:00 | 58
2016-01-01 08:00:00 | 72
2016-01-01 08:30:00 | 94
2016-01-01 09:00:00 | 115
2016-01-01 09:30:00 | 118
2016-01-01 10:00:00 | 135
2016-01-01 10:30:00 | 160
2016-01-01 11:00:00 | 212
2016-01-01 11:30:00 | 229
2016-01-01 12:00:00 | 244
2016-01-01 12:30:00 | 230
2016-01-01 13:00:00 | 235
2016-01-01 13:30:00 | 238
(28 rows)
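The ST_Distance predicate above works in projected coordinates (SRID 2163, with units in meters). As a sanity check outside the database, the same 400 m test can be approximated with a haversine great-circle distance in plain Python (an illustration with a made-up pickup point, not the tutorial's method):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000  # mean Earth radius in meters
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(a))

TIMES_SQ = (40.7589, -73.9851)
pickup = (40.7580, -73.9855)  # hypothetical pickup point near Times Square
print(haversine_m(*TIMES_SQ, *pickup) < 400)  # True: this pickup would be counted
```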
Also install the following Python dependencies on Ubuntu 16.04, to support the later time-series forecasting experiment:
sudo pip3.6 install pandas numpy statsmodels
sudo apt-get update
sudo apt-get install python3-pandas python3-pandas-lib python3-scipy python3-numpy python3-patsy
sudo apt-get install python-statsmodels-lib python-statsmodels
# sudo apt-get install python3-statsmodels-lib python3-statsmodels
The official Ubuntu 16.04 repositories do not provide a python3-statsmodels package (https://packages.ubuntu.com/search?keywords=python3-statsmodels). To install python3-statsmodels on Ubuntu 16.04, refer to http://neuro.debian.net/pkgs/python3-statsmodels.html.
wget -O- http://neuro.debian.net/lists/xenial.cn-bj1.libre | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list
sudo apt-key adv --recv-keys --keyserver hkp://pool.sks-keyservers.net:80 0xA5D32F012649A5A9
sudo apt-get update
sudo apt-get install python3-statsmodels-lib python3-statsmodels
Note that NeuroDebian is not vetted and carries some risk: "NeuroDebian provides a large collection of popular neuroscience research software for the Debian operating system as well as Ubuntu and other derivatives. Popular packages include AFNI, FSL, PyMVPA and many others. While we do strive to maintain a high level of quality, we make no guarantee that a given package works as expected, so use them at your own risk. If you do encounter problems or you just like to say thanks, simply send us an email."
So in the end I removed it:
sudo rm /etc/apt/sources.list.d/neurodebian.sources.list
Now verify in the python3.6 environment that the components all work:
$ python3.6
Python 3.6.3 (default, Jun 10 2020, 19:27:40)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
import psycopg2
import psycopg2.extras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.api import ExponentialSmoothing
Then try the python3.5 environment, and find:
$ python3
Python 3.5.2 (default, Apr 16 2020, 17:47:17)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
import psycopg2
import psycopg2.extras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# from statsmodels.tsa.api import ExponentialSmoothing ## import error
If python reports that the psycopg2 module is not installed, install it with pip3.
Installing psycopg2 directly may fail with "Failed building wheel for psycopg2". In that case, install the binary package instead:
pip3 install psycopg2-binary
5. Common psql commands
| Command | Action |
|---|---|
| \l | List available databases |
| \c dbname | Connect to a new database |
| \dt | List available tables |
| \d tablename | Describe the details of a given table |
| \dn | List all schemas in the current database |
| \df | List functions in the current database |
| \h | Get help on SQL command syntax |
| \? | List all psql slash commands |
| \set | List system variables |
| \timing | Show how long a query takes to execute |
| \x | Show expanded query results |
| \q | Quit psql |