CMU15-445:Homework #1 - SQL
Homework #1 - SQL
本文是对CMU15-445课程第1个作业文档的一个粗略翻译和完成。仅供个人(M1kanN)学习使用。
1. Overview
第一个作业要我们构建一组SQL查询,用于分析给定的数据集。我们要先了解IMDB数据。这个作业的目标:
- 学习基本的和某些高级的SQL功能。
- 熟悉使用SQLite。
2. Specification
本次作业总共包含10个问题,满分100分。每个问题我们都需要构建一个SQL查询来从SQLite DBMS获取想要的数据。预计时间6~8小时。
Placeholder Folder
创建带有空SQL文件的占位符提交文件夹,你将为每个问题使用该文件。
$ mkdir placeholder
$ cd placeholder
$ touch \
q1_sample.sql \
q2_sci_fi.sql \
q3_oldest_people.sql \
q4_crew_appears_most.sql \
q5_decade_ratings.sql \
q6_cruiseing_altitude.sql \
q7_year_of_thieves.sql \
q8_kidman_colleagues.sql \
q9_9th_decile_ratings.sql \
q10_house_of_the_dragon.sql
$ cd ..
在填写完查询内容后,你可以通过运行以下命令来压缩文件夹。
$ zip -j submission.zip placeholder/*.sql
``-j`标志可以让你在压缩文件中不需要路径信息来压缩所有的SQL查询。
3. Instructions
Setting Up SQLite
安装SQLite。起码高于3.25版本!
Load the Database Dump
-
首先跟着这篇指导确定SQLite是否工作正常:Command Line Shell For SQLite
-
检查MD5校验和,确保正确下载文件
-
运行以下命令以从提供的database dump解压数据库。注意解压后大小约为836MB
$ gunzip imdb-cmudb2022.db.gz $ sqlite3 imdb-cmudb2022.db
-
我们为这项作业准备了原始数据集的一个随机样本。虽然这不是完成作业的必要条件,但按照这里的步骤,可以获得完整的数据集: imdb-sqlite
-
通过在sqlite3终端运行
.tables
命令来检查数据库的内容。你应该看到6个表. -
在SQLite中使用以下命令创建索引
熟悉这些表的模式(结构)(它们包含什么属性,什么是主键和外键)。在sqlite3终端为每个表运行.schema $TABLE_NAME
命令。每个表的输出结果应该像下面的例子一样。
-
PEOPLE:
sqlite> .schema people CREATE TABLE people ( person_id VARCHAR PRIMARY KEY, name VARCHAR, born INTEGER, died INTEGER ); CREATE INDEX ix_people_name ON people (name);
包含一个人的详细信息
例如:nm0000006|Ingrid Bergman|1915|1982
-
TITLES
sqlite> .schema titles CREATE TABLE titles ( title_id VARCHAR PRIMARY KEY, type VARCHAR, primary_title VARCHAR, original_title VARCHAR, is_adult INTEGER, premiered INTEGER, ended INTEGER, runtime_minutes INTEGER, genres VARCHAR ); CREATE INDEX ix_titles_type ON titles (type); CREATE INDEX ix_titles_primary_title ON titles (primary_title); CREATE INDEX ix_titles_original_title ON titles (original_title);
包含标题的细节。例如:
tt0088763|movie|Back to the Future|Back to the Future|0|1985||116|Adventure,Comedy,Sci-Fi
Titles may also be referred to as "works" in the assignment specification.
-
AKAS
sqlite> .schema akas CREATE TABLE akas ( title_id VARCHAR, -- REFERENCES titles (title_id), title VARCHAR, region VARCHAR, language VARCHAR, types VARCHAR, attributes VARCHAR, is_original_title INTEGER ); CREATE INDEX ix_akas_title_id ON akas (title_id); CREATE INDEX ix_akas_title ON akas (title);
本表包含配音电影的alternate titles。
例子:tt0015648|El acorazado Potemkin|XSA|es|imdbDisplay||0
For this assignment, we will not use the fields
region
,types
,attributes
oris_original_title
. -
CREW
sqlite> .schema crew CREATE TABLE crew ( title_id VARCHAR, -- REFERENCES titles (title_id), person_id VARCHAR, -- REFERENCES people (person_id), category VARCHAR, job VARCHAR, characters VARCHAR ); CREATE INDEX ix_crew_title_id ON crew (title_id); CREATE INDEX ix_crew_person_id ON crew (person_id);
包含标题中演员的详细信息。例子:
tt0000886|nm0609814|actor||["Hamlet"]
本次作业中,不会用到
job
或者cahracters
。当考虑一个人的角色的时候,参考category
。 -
RATINGS
sqlite> .schema ratings CREATE TABLE ratings ( title_id VARCHAR PRIMARY KEY, -- REFERENCES titles (title_id), rating FLOAT, votes INTEGER );
评分
-
EPISODES
这个表在样本数据集中包含了,不应该参考这个表。 -
Sanity Check
行数:sqlite> select count(*) from titles; 1375462
0 Hour Countdown 构造SQL查询 (作业部分)
作业官网有公布答案,请自行参阅
建议用菜鸟教程来熟悉SQLite3
-
Q1:sample
只是一个测试问题。确保格式正确。
-
Solutions:
用vim在
placeholder
的q1_sample.sql
文件,写上下列命令来测试:sqlite> SELECT DISTINCT(language) ...> FROM akas ...> ORDER BY language ...> LIMIT 10;
然后进入placeholder文件夹,在终端执行下列命令:
sqlite3 ../imdb-cmudb2022.db < q1_sample.sql
就可以在imdb数据库下,执行sql文件中写的代码了。
-
-
Q2:compound value of
-
Brief: Find the 10
Sci-Fi
works with the longest runtimes. -
Details: Print the title of the work, the premiere date, and the runtime. The column listing the runtime should be suffixed with the string " (mins)", for example, if the
runtime_mins
value is12
, you should output12 (mins)
. Note a work isSci-Fi
even if it is categorized in multiple genres, as long asSci-Fi
is one of the genres.
Your first row should look like this:Cicak-Man 2: Planet Hitam|2008|999 (mins)
-
Solution:
compound value of SELECT primary_title, premiered, runtime_minutes || ' (mins)' FROM titles WHERE genres LIKE '%Sci-Fi%' ORDER BY runtime_minutes DESC LIMIT 10;
标准答案中:
SELECT primary_title, premiered, CAST(runtime_minutes AS VARCHAR) || " (mins)" FROM titles WHERE genres LIKE '%Sci-Fi%' ORDER BY runtime_minutes DESC LIMIT 10;
CAST
的作用是将runtime_minutes
转为VARCHAR类型,当然我认为不转也是可以的。
-
-
Q3:
-
Brief:
Determine the oldest people in the dataset who were born in or after 1900. You should assume that a person without a known death year is still alive. -
Details:
Print thename
andage
of each person. People should be ordered by a compound value of their age and secondly their name in alphabetical order. Return the first 20 results.
Your output should have the format:NAME|AGE
-
Solution:
SELECT name, CASE WHEN died IS NOT NULL THEN died - born ELSE 2022 - born END AS age FROM people WHERE born >= 1900 ORDER BY age DESC, name LIMIT 20;
-
注意:
这里需要使用CASE字句,作为一个过滤条件.
CASE ... WHEN ..1.. THEN ... WHEN ..2.. THEN ... ELSE END
-
-
Q4:
-
Brief:
Find the people who appear most frequently as crew members. -
Details:
Print thenames
andnumber of appearances
of the 20 people with the most crew appearances ordered by their number of appearances in a descending fashion.
Your output should look like this:NAME|NUM_APPEARANCES
-
Solutions:
SELECT name, COUNT(*) as num_appearances FROM people INNER JOIN crew Solutions: ON people.person_id = crew.person_id GROUP BY name ORDER BY num_appearances DESC LIMIT 20;
-
注意:这里需要用到内部联结表!
... INNER JOIN ... ON ...
如果不用INNER JOIN,以及ON或者WHERE,会变成笛卡尔积。
-
-
Q5:
-
Brief:
Compute intersting statistics on the ratings of content on a per-decade basis. -
Details:
Get the average rating (rounded to two decimal places), top rating, min rating, and the number of releases in each decade. Exclude titles which have not been premiered (i.e. where premiered is
NULL
). Print the relevant decade in a fancier format by constructing a string that looks like this:1990s
. Order the decades first by their average rating in a descending fashion and secondly by the decade, ascending, to break ties.
Your output should have the format:DECADE|AVG_RATING|TOP_RATING|MIN_RATING|NUM_RELEASES
-
Solutions:
思路,对每一个年份取10余,并加上s。SELECT CAST(permiered/10 * 10 AS TEXT) || 's' AS decade, ROUND(AVG(rating), 2) AS avg_rating, MAX(rating) AS top_rating, MIN(rating) AS min_rating, COUNT(*) AS num_releases FROM titles INNER JOIN ratings ON titles.title_id = ratings.title_id WHERE premiered IS NOT NULL GROUP BY decade ORDER BY avg_rating DESC, decade ASC
记得用
GROUP BY
-
-
Q6:
-
Brief:
Determine the most popular works with a person who has "Cruise" in their name and is born in 1962. -
Details:
Get the works with the most votes that have a person in the crew with "Cruise" in their name who was born in 1962. Return both the name of the work and the number of votes and only list the top 10 results in order from most to least votes. Make sure your output is formatted as follows:Top Gun|408389
-
Solutions:
-
方法1:直接JOIN相关表。
SELECT primary_title, votes FROM titles, people, crew, ratings WHERE people.name LIKE '%Cruise%' AND people.born = 1962 AND people.person_id = crew.person_id AND titles.title_id = crew.title_id AND ratings.title_id = crew.title_id ORDER BY votes DESC LIMIT 10;
所需时间:Run Time: real 2.538 user 1.898501 sys 0.639496
-
方法二:(标准答案)
WITH cruise_movies AS ( SELECT crew.title_id AS title_id FROM crew INNER JOIN people ON crew.person_id = people.person_id WHERE people.name LIKE "%Cruise%" AND people.born = 1962 ) SELECT titles.primary_title as name, ratings.votes as votes FROM cruise_movies -- 这里就显然比第一种更快了 INNER JOIN ratings ON cruise_movies.title_id = ratings.title_id INNER JOIN titles ON cruise_movies.title_id = titles.title_id ORDER BY votes DESC LIMIT 10;
所需时间:Run Time: real 1.321 user 1.081924 sys 0.23936
- 第二种方法用到了
WITH AS
短语,也就是子查询部分。定义一个SQL片断后,该SQL片断可以被整个SQL语句所用到。有的时候,with as
是为了提高SQL语句的可读性,减少嵌套冗余。
注意,MySQL不支持此短语。(我说我咋没在MySQL必知必会中找到)
- 第二种方法用到了
-
为什么第二种更快?
我认为,是因为第一种方法需要将3个一起JOIN后,再用WHERE来过滤。而第二种则是先用WITH AS 子句,在两个表中过滤出Cruise和出生年份1962的数据出来,再在其他表中查询作品和投票数,显然,第二种更快。
-
-
-
Q7:
-
Brief:
List the number of works that premiered in the same year that "Army of Thieves" premiered. -
Details:
Print only the total number of works. The answer should include "Army of Thieves" itself. For this question, determine distinct works by their
title_id
, not their names. -
Solutions:
思路:先找"Army of Thieves"的首映式年份,然后用这个年份去过滤。SELECT COUNT(DISTINCT titles.title_id) FROM titles WHERE premiered IN ( SELECT premiered FROM titles WHERE primary_title = 'Army of Thieves' );
IN操作符常常用于小嵌套。大嵌套可以用WITH AS代替。
-
-
Q8:
-
Brief:
List the all the different actors and actresses who have starred in a work with Nicole Kidman (born in 1967). -
Details:
Print only the names of the actors and actresses in alphabetical order. The answer should include Nicole Kidman herself. Each name should only appear once in the output.
-
Note: As mentioned in the schema, when considering the role of an individual on the crew, refer to the field
category
. The roles "actor" and "actress" are different and should be accounted for as such. -
Solutions:
思路:通过Nicole找到她出演的所有作品,然后列出所有参与这些作品的演员,并去重。因为我们先要找作品,然后要找演员,所以可以有两个WITH AS子句。
WITH Nicole_Works AS ( SELECT DISTINCT(crew.title_id) FROM people INNER JOIN crew ON crew.person_id == people.person_id AND people.name == "Nicole Kidman" AND people.born == 1967 ), Nicole_Colleagues AS ( SELECT DISTINCT(crew.person_id) AS id FROM crew WHERE (crew.category == "actor" OR crew.category == "actress") AND crew.title_id IN Nicole_Works ) SELECT name FROM people JOIN Nicole_Colleagues ON Nicole_Colleagues.id = people.person_id ORDER BY name ASC;
-
-
Q9:
-
Brief:
For all people born in 1955, get their name and average rating on all movies they have been part of through their careers. Output the 9th decile of individuals as measured by their average career movie rating. -
Details:
Calculate average ratings for each individual born in 1955 across only the movies they have been part of. Compute the quantiles for each individual's average rating using NTILE(10).
Make sure your output is formatted as follows (round average rating to the nearest hundredth, results should be ordered by a compound value of their ratings descending and secondly their name in alphabetical order):Stanley Nelson|7.13
-
Note: You should take quantiles after processing the average career movie rating of individuals. In other words, find the individuals who have an average career movie rating in the 9th decile of all individuals.
-
Solutions:
思路:先找1955年生的人,然后根据人来分组,列出作品,算出平均值,最后用NTILE来分桶,然后给出第九份。WITH actor_movie_1955 AS ( SELECT people.person_id, -- 用来根据人分组 people.name, titles.title_id, -- 用来查ratings titles.primary_title FROM people INNER JOIN crew ON people.person_id = crew.person_id INNER JOIN titles ON titles.title_id = crew.title_id WHERE people.born = 1955 AND titles.type = "movie" ), -- 第一个with过滤出来了1995年出生的人出演的电影 actor_ratings AS ( SELECT name, ROUND(AVG(ratings.rating), 2) AS rating FROM ratings INNER JOIN actor_movie_1955 ON ratings.title_id = actor_movie_1955.title_id GROUP BY actor_movie_1955.person_id ), quartiles AS ( SELECT *, NTILE(10) OVER (ORDER BY rating ASC) AS RatingQuartile FROM actor_ratings ) SELECT name, rating FROM quartiles WHERE RatingQuartile = 9 ORDER BY rating DESC, name ASC;
NTILE()
函数:就是把所有行分为n等份。
-
-
Q10:
-
Brief:
Concatenate all the unique titles for the TV Series "House of the Dragon" as a string of comma-separated values in alphabetical order of the titles. -
Details:
Find all the unique dubbed titles for the new TV show "House of the Dragon" and order them alphabetically. Print a single string containing all these titles separated by commas.
-
Hint: You might find Recursive CTEs useful.
-
Note: Two titles are different even if they differ only in capitalization. Elements in the comma-separated value result should be separated with both a comma and a space, e.g. "foo, bar".
-
Solutions:
就是输出权利的游戏的所有其他名字(不同语言的),然后用逗号分割输出WITH p as ( SELECT titles.primary_title as name, akas.title as dubbed FROM titles INNER JOIN akas ON titles.title_id = akas.title_id WHERE titles.primary_title = "House of the Dragon" AND titles.type = "tvSeries" GROUP BY titles.primary_title, akas.title ORDER BY akas.title ), c as ( SELECT row_number() OVER (ORDER BY p.name ASC) AS seqnum,p.dubbed as dubbed FROM p ), flattended AS ( SELECT seqnum, dubbed FROM c WHERE seqnum = 1 UNION ALL SELECT c.seqnum, f.dubbed || ', ' || c.dubbed FROM c JOIN flattended f ON c.seqnum = f.seqnum + 1 ) SELECT dubbed FROM flattended ORDER BY seqnum DESC LIMIT 1;
-
5. Summary
Q1~Q9还是不难的,但是Q10还没理解到。以后再看。