CMU 15-445 数据库系统 Homework SQL查询
cmu15-445是一门关于数据库的课程,看到该课程的homeworks和project就觉得十分有挑战性。 本文是对该课程的Homework1, SQL语句的十道题,做完以后可以对SQL查询有个好的了解。
Sqlite安装配置省略
Homework的指导页面
课程提供的数据库表关系如下
下面十道题都将对该表进行操作。
三个表,
trip:包含id,来去车站,来去时间,自行车id
station:车站,车站所属的城市,
天气:与城市对应
Q1 查询名字唯一的城市的数量
Count the number of cities. The purpose of this query is to make sure that the formatting of your output matches exactly the formatting of our auto- grading script.
Details: Print the number of cities (eliminating duplicates).
Answer: Here's the correct SQL query and expected output:
x
sqlite> select count(distinct(city)) from station;
5
Q2 (分组聚集,排序, 计数 count)
需要查询每个城市的火车站的数量,并且先按照每个城市的火车站的数量排序,再按照城市名排序(都是升序)。此题并不难
Count the number of stations in each city.
Details: Print city name and number of stations. Sort by number of stations (increasing), and break ties by city name (increasing).
sqlite> select city, count(station_id) as cnt
...> from station
...> group by city
...> order by cnt asc, city asc;
Palo Alto|5
Mountain View|7
Redwood City|7
San Jose|16
San Francisco|35
注意group的聚合, group会将所有具有相同属性的元组分到一个组。 然后不能对没有出现在group by子句中的属性进行select(或者是聚集以后该属性的值是唯一的也可以)。 一般会使用count, avg等对一个分组内的几个元组的该的属性进行聚集操作
Q3 多关系查询
查询每个城市名,以及其所占trip总数的比例(当其作为trip的起点或终点时候,起点终点不重复计数)。按照如下步骤,将sql语句分解
Find the percentage of trips in each city. A trip belongs to a city as long as its start station or end station is in the city. For example, if a trip started from station A in city P and ended in station B in city Q, then the trip belongs to both city P and city Q. If P equals to Q, the trip is only counted once.
Details: Print city name and ratio between the number of trips that belong to that city against the total number of trips (a decimal between 0-1, round to four decimal places using ROUND()). Sort by ratio (decreasing), and break ties by city name (increasing).
首先,将trip和station两个表根据station id联系起来 where判断关联
sqlite> select ID, city, station_id, station_id, end_station_id
...> from trip, station
...> where station_id = start_station_id or station_id = end_station_id
...> limit 10;
4069|San Francisco|64|64|64
4073|San Francisco|66|66|69
4073|San Francisco|69|69|69
4074|San Francisco|66|66|69
4074|San Francisco|69|69|69
4075|San Francisco|66|66|69
4075|San Francisco|69|69|69
4076|San Francisco|66|66|69
4076|San Francisco|69|69|69
4078|Redwood City|22|22|22
这种查询称为多关系查询
select A1, A2... 属性
from r1,r2.. 关系
where P; 谓词
尽管句子必须以select form where 的次序写出,但查询所代表的运算首先是:from, 然后是where,最后是select
from句定义所有关系的笛卡尔积,然后由where使用谓词来限制笛卡尔积所创建的元组, select最后指定属性
然后需要根据city进行分组,分组的时候计算这个城市参与的trip的个数,使用count加distinct。
sqlite> select city, count(distinct(ID))
...> from trip, station
...> where station_id = start_station_id or station_id = end_station_id
...> group by city
...> limit 10;
Mountain View|18606
Palo Alto|7271
Redwood City|3494
San Francisco|603733
San Jose|37897
计算分母,旅行的总数量
select count(*) as cnt from trip;
经常使用count计算关系中元组的数量
使用as连接起来,将这几个操作结合起来
sqlite> select city_trip_cnt.city, (ROUND(city_trip_cnt.cnt*1.0 / all_trip_cnt.cnt, 4)) as ratio
...> from (
...> select city, count(distinct(ID)) as cnt
...> from trip, station
...> where station_id = start_station_id or station_id = end_station_id
...> group by city
...> ) as city_trip_cnt, (select count(*) as cnt from trip) as all_trip_cnt
...> order by ratio desc, city asc
...> ;
San Francisco|0.9011
San Jose|0.0566
Mountain View|0.0278
Palo Alto|0.0109
Redwood City|0.0052
as起了个别名, 不然会有重复的属性
这里,from后面的有两部分,用了一个逗号隔开了,每个部分的表起了一个别名,其实不起别名直接用了列也能找到,但是可能会出现不同表里的列冲突的问题。
desc降序, asc升序
Q4 with as句式 每组里面最大的属性a对应的属性b
找到每个城市最收欢迎的车站,输出访问次数。
For each city, find the most popular station in that city. "Popular" means that the station has the highest count of visits. As above, either starting a trip or finishing a trip at a station, the trip is counted as one "visit" to that station. The trip is only counted once if the start station and the end station are the same.
Details: For each station, print city name, most popular station name and its visit count. Sort by city name, ascending.
这里有个条件,出发城市和结束城市都要计算,而且如果是同一个城市只计算一次。这样直接使用where拼表即可
sqlite> select city, station_id, station_name, count(distinct(ID)) as cnt 其实这里对ID不必使用distinct
...> from trip, station
...> where station_id = start_station_id or station_id = end_station_id
...> group by station_id
...> limit 10;
San Jose|2|San Jose Diridon Caltrain Station|18782
San Jose|3|San Jose Civic Center|2989
San Jose|4|Santa Clara at Almaden|7368
San Jose|5|Adobe on Almaden|2348
San Jose|6|San Pedro Square|5883
San Jose|7|Paseo de San Antonio|4582
San Jose|8|San Salvador at 1st|3239
San Jose|9|Japantown|3880
San Jose|10|San Jose City Hall|3822
San Jose|11|MLK Library|4364
这里使用where连接后,根据笛卡尔积,交叉后筛选出来了符合条件的,然后聚集,之后使用count计数。 注意这里city不在group by子句中但是也可以select,因为city对station_id聚集后的分组内是唯一的。
因为要对上面这个表反复使用,使用with as句式 命名为visit
这时还有个问题,要找出同一个city下的的最大属性a元组对应的属性b,可以先按city分组后使用max聚集函数求出最大值,然后找出等于这个最大值的元组,select属性b
with visit(station_id, station_name, city, cnt) as (
select station_id, station_name, city, count(distinct(ID)) as cnt
from trip, station
where station_id = start_station_id or station_id = end_station_id
group by station_id
)
select visit.city, visit.station_name, visit.cnt
from visit
where visit.cnt = (
select max(max_visit.cnt)
from visit as max_visit
where max_visit.city = visit.city
)
order by city
;
Q5 日期,时间戳, union合并两个或者多个select语句的结果 round操作
Find the top 10 days that have the highest average bike utilization. For simplicity, we only consider trips that use bikes with id <= 100. The average bike utilization on date D is calculated as the sum of the durations of all the trips that happened on date D divided by the total number of bikes with id <= 100, which is a constant. If a trip overlaps with date D, but starts before date D or ends after date D, then only the interval that overlaps with date D (from 0:00 to 24:00) will be counted when calculating the average bike utilization of date D. And we only calculate the average bike utilization for the date that has been either a start or an end date of a trip. You can assume that no trip has negative time (i.e., for all trips, start time <= end time).
查找平均自行车利用率最高的前 10 天。为简单起见,我们只考虑使用 id <= 100 的自行车的行程。日期 D 的平均自行车利用率计算为日期 D 发生的所有行程的持续时间之和除以 id <= 的自行车总数= 100,这是一个常数。如果行程与日期 D 重叠,但在日期 D 之前开始或在日期 D 之后结束,则在计算日期 D 的平均自行车利用率时,仅计算与日期 D 重叠的间隔(从 0:00 到 24:00) . 而且我们只计算旅行开始或结束日期的平均自行车利用率。您可以假设没有行程有负时间(即,对于所有行程,开始时间 <= 结束时间)。
Details: For the dates with the top 10 average duration, print the date and the average bike duration on that date (in seconds, round to four decimal places using the ROUND() function). Sort by the average duration, decreasing. Please refer to the updated note before Q1 when calculating the duration of a trip.
详细信息:对于平均持续时间排名前 10 的日期,打印日期和该日期的平均自行车持续时间(以秒为单位,使用 ROUND() 函数四舍五入到小数点后四位)。按平均持续时间排序,递减。计算行程时长请参考Q1之前的更新说明。
Hint: All timestamps are stored as text after loaded from csv in sqlite. You can use datetime(timestamp string) to get the timestamp out of the string and date(timestamp string) to get the date out of the string. You may also find the funtion strftime() helpful in computing the duration between two timestamps.
提示:所有时间戳在从 sqlite 中的 csv 加载后都存储为文本。您可以使用 datetime(timestamp string) 从字符串中获取时间戳,使用 date(timestamp string) 从字符串中获取日期。您可能还会发现函数 strftime() 有助于计算两个时间戳之间的持续时间。
只计算旅行开始和结束这两天的使用率,中间过程不算使用。
with dates as (
select date(start_time) as tdate
from trip
union
select date(end_time) as tdate
from trip)
select tdate, round(sum(strftime('%s', min(datetime(end_time), datetime(tdate, '+1 day'))) - strftime('%s', max(datetime(start_time), datetime(tdate)))) * 1.0 / (select count(distinct(bike_id)) from trip where bike_id <= 100), 4) as avg_duration
from trip, dates
where bike_id <= 100 and datetime(start_time) < datetime(tdate, '+1 day') and datetime(end_time) > datetime(tdate)
group by tdate
order by avg_duration desc
limit 10;
SQL UNION 操作符合并两个或多个 SELECT 语句的结果。 得到要进行统计的日子。
round(数字, 位数)
通过该问题,了解了日期的函数究竟是怎么一回事。date:年月日,time:时分秒,datetime:年月日+时分秒。
另外,sum,strftime函数的意义,以及with语句中其实也不全是需要像上面那样那种方式对列去别名
以及union对相同数据的合并去重操作,distinct是对同一列去重。 union是两个表的列去重
扩展:SQLite的五个时间函数:
date(timestring, modifier, modifier, …)**:以 YYYY-MM-DD 格式返回日期
time(timestring, modifier, modifier, …)**:以 HH:MM:SS 格式返回时间
datetime(timestring, modifier, modifier, …)**:以 YYYY-MM-DD HH:MM:SS 格式返回日期时间
julianday(format, timestring, modifier, modifier, ..):返回从格林尼治时间的公元前 4714 年 11 月 24 日正午算起的天数
strftime(format, timestring, modifier, modifier, ..):根据第一个参数指定的格式字符串返回格式化的日期
讲道理其他四个函数都可以用 strftime() 函数来表示:
Q6 多重条件,对一个关系使用两次
One of the possible data-entry errors is to record a bike as being used in two different trips, at the same time. Thus, we want to spot pairs of overlapping intervals (start time, end time). To keep the output manageable, we ask you to do this check for bikes with id between 100 and 200 (both inclusive). Note: Assume that no trip has negative time, i.e., for all trips, start time <= end time.
一种可能的数据输入错误是记录一辆自行车同时用于两次不同的旅行。因此,我们想要发现成对的重叠间隔(开始时间、结束时间)。为了使输出易于管理,我们要求您使用 id between 100 and 200 (both inclusive) 对自行车进行此检查。注意:假设没有行程有负时间,即对于所有行程,开始时间 <= 结束时间。
Details: For each conflict (a pair of conflict trips), print the bike id, former trip id, former start time, former end time, latter trip id, latter start time, latter end time. Sort by bike id (increasing), break ties with former trip id (increasing) and then latter trip id (increasing).
详细信息:对于每个冲突(一对冲突行程),打印自行车 ID、前行程 ID、前开始时间、前结束时间、后行程 ID、后开始时间、后结束时间。按自行车 ID(增加)排序,与前一个行程 ID(增加)和后一个行程 ID(增加)断开联系。
Hint: (1) Report each conflict pair only once, so that former trip id < latter trip id. (2) We give you the (otherwise tricky) condition for conflicts: start1 < end2 AND end1 > start2
提示:(1)每个冲突对只报告一次,所以 former trip id < latter trip id 。 (2) 我们给你冲突的(否则棘手的)条件:start1 < end2 AND end1 > start2
基本都是按照题目的提示来写的,比较简单。
写成这种形式看起来比较舒服
select ftrip.bike_id, ftrip.id, ftrip.start_time, ftrip.end_time, ltrip.id, ltrip.start_time, ltrip.end_time
from trip as ftrip, trip as ltrip
where ftrip.bike_id between 100 and 200
and ftrip.bike_id == ltrip.bike_id
and ftrip.id < ltrip.id
and ftrip.start_time < ltrip.end_time
and ftrip.end_time > ltrip.start_time
order by ftrip.bike_id asc, ftrip.id asc, ltrip.id asc
;
Q7 having 子句对分组进行限制条件而不是对元组
Find all the bikes that have been to more than one city. A bike has been to a city as long as the start station or end station in one of its trips is in that city.
找出所有去过一个以上城市的自行车。只要其中一次旅行的起点站或终点站在那个城市,一辆自行车就已经去过那个城市。
Details: For each bike that has been to more than one city, print the bike id and the number of cities it has been to. Sort by the number of cities (decreasing), then bike id (increasing).
详细信息:对于去过多个城市的每辆自行车,打印自行车 ID 和去过的城市数量。按城市数量(递减)排序,然后按自行车编号(递增)排序。
先根据来去城市拼表,将一次旅行中去的城市中等于station的元组用where筛选出来,之后自行车id进行group后,一个分组内的不同因为来去车站而包含多个city,再对city使用distinct得到cnt,使用having语句。
select trip.bike_id, count(distinct(station.city)) as cnt
from trip, station
where trip.start_station_id == station.station_id
or trip.end_station_id == station.station_id
group by trip.bike_id
having cnt > 1
order by cnt desc, bike_id asc
这里having限制的是每个分组而不是元组。 having次序在聚集之后,select之前。 次序from where group having select
Q8
Find what is the average number of trips made per day on each type of weather day. The type of weather on a day is specified by weather.events, such as 'Rain', 'Fog' and so on. For simplicity, we consider all days that does not have a weather event (weather.events = '\N') as a single type of weather. Here a trip belongs to a date only if its start time is on that date. We use the weather at the starting position of that trip as its weather type as well. There are also 'Rain' and 'rain' in weather.events. For simplicity, we consider them as different types of weathers. When counting the total number of days for a weather, we consider a weather happened on a date as long as it happened in at least one region on that date.
求出在每种天气类型下每天的平均出行次数。一天的天气类型由 weather.events 指定,例如 'Rain'、'Fog' 等。为简单起见,我们将所有没有天气事件 ( weather.events = '\N' ) 的日子视为单一类型的天气。此处,只有开始时间在该日期的旅行才属于该日期。我们也使用该行程起始位置的天气作为其天气类型。 weather.events 中也有 'Rain' 和 'rain'。为简单起见,我们将它们视为不同类型的天气。在计算天气的总天数时,我们认为天气发生在某个日期,只要它在该日期至少发生在一个地区即可。
Details: Print the name of the weather and the average number of trips made per day on that type of weather (round to four decimal places using ROUND()). Sort by the average number of trips (decreasing), then weather name (increasing).
详细信息:打印天气名称和在该天气类型下每天的平均出行次数(使用 ROUND() 舍入到 four 小数位)。按平均出行次数(递减)排序,然后是天气名称(递增)。
统计每种天气出现的次数(只有开始时间在该日期的旅行才属于该日期):直接求所有天气,该天气共有多少天,以及该天气中有多少天是出行(start_time)天,求个比例即可。
with event_cnt (events, cnt) as (
select events, count(distinct(date)) as cnt
from weather
group by events
)
select event_cnt.events, round(1.0*count(distinct(t.id)) / event_cnt.cnt, 4) as avg_num
from trip as t, station as s, weather as w, event_cnt
where t.start_station_id = s.station_id
and date(t.start_time) = w.date
and s.zip_code == w.zip_code
and w.events == event_cnt.events
group by event_cnt.events
order by avg_num desc, event_cnt.events asc
Q9 三表拼接筛选,
A short trip is a trip whose duration is <= 60 seconds. Compute the average temperature that a short trip starts versus the average temperature that a non-short trip starts. We use weather.mean_temp on the date of the start time as the Temperature measurement.
短途旅行是持续时间为 <= 60 seconds 的旅行。计算短途旅行开始的平均温度与非短途旅行开始的平均温度。我们在开始时间的日期使用 weather.mean_temp 作为温度测量。
Details: Print the average temperature that a short trip starts and the average temperature that a non-short trip starts. (on the same row, and both round to four decimal places using ROUND()) Please refer to the updated note before Q1 when calculating the duration of a trip.
详细信息:打印短途旅行开始的平均温度和非短途旅行开始的平均温度。 (在同一行,并且都使用 ROUND() 舍入到 four 小数位) Please refer to the updated note before Q1 when calculating the duration of a trip.
这题有个坑点就是连接trip、station、weater表的时候要使用date 和 zip_code 和station这几个字段来连接。其他没什么难点了。
分别查询查询short_trip 和short trip,然后横向组合两个列
select short_trip.temp, long_trip.temp
from (
select round(1.0 * sum(mean_temp) / count(*), 4) as temp
from trip, station, weather
where strftime("%s", end_time) - strftime("%s", start_time) <= 60 短途
and start_station_id = station_id trip与station拼接
and station.zip_code = weather.zip_code station与天气拼接
and date(start_time) = date 天气与trip日期拼接
) as short_trip, (
select round(1.0 * sum(mean_temp) / count(*), 4) as temp
from trip, station, weather
where strftime("%s", end_time) - strftime("%s", start_time) > 60
and start_station_id = station_id
and station.zip_code = weather.zip_code
and date(start_time) = date
) as long_trip
;
这里有个疑惑是,为啥两个表的可以好横着拼接在一起?是因为彼此都只剩下一个标量了吗,应该因为对这两个表没有用where,所以直接二重循环遍历,这样正好拼到一起了。
Q10
For each zip code that has experienced 'Rain-Thunderstorm' weather, find the station that has the most number of trips in that zip code under the storm weather. For simplicity, we only consider the start time of a trip when deciding the station and the weather for that trip.
对于经历过“Rain-Thunderstorm”天气的每个邮政编码,找到在暴风雨天气下该邮政编码中出行次数最多的车站。为简单起见,我们在决定旅行的车站和天气时只考虑旅行的开始时间。
Details: Print the zip code that has experienced the 'Rain-Thunderstorm' weather, the name of the station that has the most number of trips under the strom weather in that zip code, and the total number of trips that station has under the storm weather. Sort by the zip code (increasing). You do not need to print the zip code that has experienced 'Rain-Thunderstorm' weather but no trip happens on any storm day in that zip code.
详细信息:打印经历过“Rain-Thunderstorm”天气的邮政编码、该邮政编码中在暴风雨天气下出行次数最多的车站名称,以及该车站在暴风雨天气下出行的总次数天气。按邮政编码排序(递增)。您不需要打印经历过“雷雨”天气的邮政编码,但在该邮政编码的任何风暴日都不会发生旅行。
由于要找最大的,同时要考虑并列的,因此where中也需要重复查询storm_count,因此抽出来一个storm_count。
先求出来每个下过雨的zip处接待过的trip次数
select s.zip_code, s.station_name, count(t.id) as cnt
from trip as t, station as s, weather as w
where t.start_station_id = s.station_id
and s.zip_code = w.zip_code
and w.events = "Rain-Thunderstorm"
group by station_id
limit 10;
这样先找出了每个出现过雨的车站的下雨天接待过的trip的个数
然后在在此基础上使用with语句选取最大的
with rain_staion (zip_code, station_name, cnt) as(
select s.zip_code, s.station_name, count(t.id) as cnt
from trip as t, station as s, weather as w
where t.start_station_id = s.station_id
and s.zip_code = w.zip_code
and w.events = "Rain-Thunderstorm"
group by s.station_id
) 雷雨天气的出行和邮编站点信息
select r.zip_code, r.station_name, cnt 这里就有邮编,经历过得站点,已经该站点经历过的雷雨天气总数
from rain_staion as r
where r.cnt = (
select max(r_max.cnt) 选出最大的那个
from rain_staion as r_max
group by r_max.zip_code 将元组按照邮编划分一下
)
order by r.zip_code asc
;