CMU15-445:Homework #1 - SQL

Homework #1 - SQL


本文是对CMU15-445课程第1个作业文档的一个粗略翻译和完成。仅供个人(M1kanN)学习使用。


1. Overview

第一个作业要我们构建一组SQL查询,用于分析给定的数据集。我们要先了解IMDB数据。这个作业的目标:

  1. 学习基本的和某些高级的SQL功能。
  2. 熟悉使用SQLite。

2. Specification

本次作业总共包含10个问题,满分100分。每个问题我们都需要构建一个SQL查询来从SQLite DBMS获取想要的数据。预计时间6~8小时。

Placeholder Folder

创建带有空SQL文件的占位符提交文件夹,你将为每个问题使用该文件。

$ mkdir placeholder
$ cd placeholder
$ touch \
  q1_sample.sql \
  q2_sci_fi.sql \
  q3_oldest_people.sql \
  q4_crew_appears_most.sql \
  q5_decade_ratings.sql \
  q6_cruiseing_altitude.sql \
  q7_year_of_thieves.sql \
  q8_kidman_colleagues.sql \
  q9_9th_decile_ratings.sql \
  q10_house_of_the_dragon.sql
$ cd ..

在填写完查询内容后,你可以通过运行以下命令来压缩文件夹。

$ zip -j submission.zip placeholder/*.sql

``-j`标志可以让你在压缩文件中不需要路径信息来压缩所有的SQL查询。

3. Instructions

Setting Up SQLite

安装SQLite。起码高于3.25版本!

参考链接:SQLite - Installation

Load the Database Dump

  1. 首先跟着这篇指导确定SQLite是否工作正常:Command Line Shell For SQLite

  2. 下载database dump file

    检查MD5校验和,确保正确下载文件

  3. 运行以下命令以从提供的database dump解压数据库。注意解压后大小约为836MB

    $ gunzip imdb-cmudb2022.db.gz
    $ sqlite3 imdb-cmudb2022.db
    
  4. 我们为这项作业准备了原始数据集的一个随机样本。虽然这不是完成作业的必要条件,但按照这里的步骤,可以获得完整的数据集: imdb-sqlite

  5. 通过在sqlite3终端运行.tables命令来检查数据库的内容。你应该看到6个表.

  6. 在SQLite中使用以下命令创建索引

熟悉这些表的模式(结构)(它们包含什么属性,什么是主键和外键)。在sqlite3终端为每个表运行.schema $TABLE_NAME命令。每个表的输出结果应该像下面的例子一样。

  • PEOPLE:

    sqlite> .schema people
    CREATE TABLE people (
      person_id VARCHAR PRIMARY KEY,
      name VARCHAR,
      born INTEGER,
      died INTEGER
    );
    CREATE INDEX ix_people_name ON people (name);
    

    包含一个人的详细信息
    例如:

    nm0000006|Ingrid Bergman|1915|1982
    
  • TITLES

    sqlite> .schema titles
    CREATE TABLE titles (
      title_id VARCHAR PRIMARY KEY,
      type VARCHAR,
      primary_title VARCHAR,
      original_title VARCHAR,
      is_adult INTEGER,
      premiered INTEGER,
      ended INTEGER,
      runtime_minutes INTEGER,
      genres VARCHAR
    );
    CREATE INDEX ix_titles_type ON titles (type);
    CREATE INDEX ix_titles_primary_title ON titles (primary_title);
    CREATE INDEX ix_titles_original_title ON titles (original_title);
    

    包含标题的细节。例如:

    tt0088763|movie|Back to the Future|Back to the Future|0|1985||116|Adventure,Comedy,Sci-Fi
    

    Titles may also be referred to as "works" in the assignment specification.

  • AKAS

    sqlite> .schema akas
    CREATE TABLE akas (
      title_id VARCHAR, -- REFERENCES titles (title_id),
      title VARCHAR,
      region VARCHAR,
      language VARCHAR,
      types VARCHAR,
      attributes VARCHAR,
      is_original_title INTEGER
    );
    CREATE INDEX ix_akas_title_id ON akas (title_id);
    CREATE INDEX ix_akas_title ON akas (title);
    

    本表包含配音电影的alternate titles。
    例子:

    tt0015648|El acorazado Potemkin|XSA|es|imdbDisplay||0
    

    For this assignment, we will not use the fields region, types, attributes or is_original_title.

  • CREW

    sqlite> .schema crew
    CREATE TABLE crew (
      title_id VARCHAR, -- REFERENCES titles (title_id),
      person_id VARCHAR, -- REFERENCES people (person_id),
      category VARCHAR,
      job VARCHAR,
      characters VARCHAR
    );
    CREATE INDEX ix_crew_title_id ON crew (title_id);
    CREATE INDEX ix_crew_person_id ON crew (person_id);
    

    包含标题中演员的详细信息。例子:

    tt0000886|nm0609814|actor||["Hamlet"]
    

    本次作业中,不会用到job或者cahracters。当考虑一个人的角色的时候,参考category

  • RATINGS

    sqlite> .schema ratings
    CREATE TABLE ratings (
      title_id VARCHAR PRIMARY KEY, -- REFERENCES titles (title_id),
      rating FLOAT,
      votes INTEGER
    );
    

    评分

  • EPISODES
    这个表在样本数据集中包含了,不应该参考这个表。

  • Sanity Check
    行数:

    sqlite> select count(*) from titles;
    1375462
    

    schema

0 Hour Countdown 构造SQL查询 (作业部分)

作业官网有公布答案,请自行参阅
建议用菜鸟教程来熟悉SQLite3

  • Q1:sample

    只是一个测试问题。确保格式正确。

    • Solutions:

      用vim在placeholderq1_sample.sql文件,写上下列命令来测试:

      sqlite> SELECT DISTINCT(language)
         ...> FROM akas
         ...> ORDER BY language
         ...> LIMIT 10;
      

      然后进入placeholder文件夹,在终端执行下列命令:

      sqlite3 ../imdb-cmudb2022.db < q1_sample.sql
      

      就可以在imdb数据库下,执行sql文件中写的代码了。

  • Q2:compound value of

    • Brief: Find the 10 Sci-Fi works with the longest runtimes.

    • Details: Print the title of the work, the premiere date, and the runtime. The column listing the runtime should be suffixed with the string " (mins)", for example, if the runtime_mins value is 12, you should output 12 (mins). Note a work is Sci-Fi even if it is categorized in multiple genres, as long as Sci-Fi is one of the genres.
      Your first row should look like this: Cicak-Man 2: Planet Hitam|2008|999 (mins)

    • Solution:

      compound value of SELECT primary_title, premiered, runtime_minutes || ' (mins)'
      FROM titles WHERE genres LIKE '%Sci-Fi%'
      ORDER BY runtime_minutes DESC
      LIMIT 10;
      

      标准答案中:

      SELECT
      	primary_title, premiered, CAST(runtime_minutes AS VARCHAR) || " (mins)"
      FROM
      	titles
      WHERE genres LIKE '%Sci-Fi%'
      ORDER BY runtime_minutes DESC
      LIMIT 10;
      

      CAST的作用是将runtime_minutes转为VARCHAR类型,当然我认为不转也是可以的。

  • Q3:

    • Brief:
      Determine the oldest people in the dataset who were born in or after 1900. You should assume that a person without a known death year is still alive.

    • Details:
      Print the name and age of each person. People should be ordered by a compound value of their age and secondly their name in alphabetical order. Return the first 20 results.
      Your output should have the format: NAME|AGE

    • Solution:

      SELECT name, 
      	CASE
      		WHEN died IS NOT NULL
      			THEN died - born
      		ELSE
      			2022 - born
      	END AS age
      FROM people
      WHERE born >= 1900
      ORDER BY age DESC, name
      LIMIT 20;
      
    • 注意:
      这里需要使用CASE字句,作为一个过滤条件.
      CASE ... WHEN ..1.. THEN ... WHEN ..2.. THEN ... ELSE END

  • Q4:

    • Brief:
      Find the people who appear most frequently as crew members.

    • Details:
      Print the names and number of appearances of the 20 people with the most crew appearances ordered by their number of appearances in a descending fashion.
      Your output should look like this: NAME|NUM_APPEARANCES

    • Solutions:

      SELECT name, 
      		COUNT(*) as num_appearances
      FROM people INNER JOIN crew Solutions:
      	ON people.person_id = crew.person_id
      GROUP BY name
      ORDER BY num_appearances DESC
      LIMIT 20;
      
    • 注意:这里需要用到内部联结表!... INNER JOIN ... ON ...
      如果不用INNER JOIN,以及ON或者WHERE,会变成笛卡尔积。

  • Q5:

    • Brief:
      Compute intersting statistics on the ratings of content on a per-decade basis.

    • Details:

      Get the average rating (rounded to two decimal places), top rating, min rating, and the number of releases in each decade. Exclude titles which have not been premiered (i.e. where premiered is NULL). Print the relevant decade in a fancier format by constructing a string that looks like this: 1990s. Order the decades first by their average rating in a descending fashion and secondly by the decade, ascending, to break ties.
      Your output should have the format: DECADE|AVG_RATING|TOP_RATING|MIN_RATING|NUM_RELEASES

    • Solutions:
      思路,对每一个年份取10余,并加上s。

      SELECT
      	CAST(permiered/10 * 10 AS TEXT) || 's' AS decade, 
      	ROUND(AVG(rating), 2) AS avg_rating,
      	MAX(rating) AS top_rating,
      	MIN(rating) AS min_rating,
      	COUNT(*) AS num_releases
      FROM titles INNER JOIN ratings
      	ON titles.title_id = ratings.title_id
      WHERE premiered IS NOT NULL
      GROUP BY decade
      ORDER BY avg_rating DESC, decade ASC
      

      记得用GROUP BY

  • Q6:

    • Brief:
      Determine the most popular works with a person who has "Cruise" in their name and is born in 1962.

    • Details:
      Get the works with the most votes that have a person in the crew with "Cruise" in their name who was born in 1962. Return both the name of the work and the number of votes and only list the top 10 results in order from most to least votes. Make sure your output is formatted as follows: Top Gun|408389

    • Solutions:

      • 方法1:直接JOIN相关表。

        SELECT primary_title, votes
        FROM titles, people, crew, ratings
        WHERE people.name LIKE '%Cruise%'
        	AND people.born = 1962
        	AND	people.person_id = crew.person_id
        	AND titles.title_id = crew.title_id
        	AND ratings.title_id = crew.title_id
        	
        ORDER BY votes DESC
        LIMIT 10;
        

        所需时间:Run Time: real 2.538 user 1.898501 sys 0.639496

      • 方法二:(标准答案)

        WITH cruise_movies AS (
             SELECT
                  crew.title_id AS title_id
             FROM crew
             INNER JOIN
                  people ON crew.person_id = people.person_id
             WHERE people.name LIKE "%Cruise%" AND people.born = 1962
        )
        SELECT
             titles.primary_title as name,
             ratings.votes as votes
        FROM
             cruise_movies -- 这里就显然比第一种更快了
        INNER JOIN
             ratings ON cruise_movies.title_id = ratings.title_id
        INNER JOIN
             titles ON cruise_movies.title_id = titles.title_id
        ORDER BY votes DESC
        LIMIT 10;
        

        所需时间:Run Time: real 1.321 user 1.081924 sys 0.23936

        • 第二种方法用到了WITH AS 短语,也就是子查询部分。定义一个SQL片断后,该SQL片断可以被整个SQL语句所用到。有的时候,with as是为了提高SQL语句的可读性,减少嵌套冗余。
          注意,MySQL不支持此短语。(我说我咋没在MySQL必知必会中找到)
      • 为什么第二种更快?
        我认为,是因为第一种方法需要将3个一起JOIN后,再用WHERE来过滤。而第二种则是先用WITH AS 子句,在两个表中过滤出Cruise和出生年份1962的数据出来,再在其他表中查询作品和投票数,显然,第二种更快。

  • Q7:

    • Brief:
      List the number of works that premiered in the same year that "Army of Thieves" premiered.

    • Details:

      Print only the total number of works. The answer should include "Army of Thieves" itself. For this question, determine distinct works by their title_id, not their names.

    • Solutions:
      思路:先找"Army of Thieves"的首映式年份,然后用这个年份去过滤。

      SELECT COUNT(DISTINCT titles.title_id)
      FROM titles
      WHERE premiered IN (
      	 SELECT premiered FROM titles
          WHERE primary_title = 'Army of Thieves'
      );
      

      IN操作符常常用于小嵌套。大嵌套可以用WITH AS代替。

  • Q8:

    • Brief:
      List the all the different actors and actresses who have starred in a work with Nicole Kidman (born in 1967).

    • Details:

      Print only the names of the actors and actresses in alphabetical order. The answer should include Nicole Kidman herself. Each name should only appear once in the output.

    • Note: As mentioned in the schema, when considering the role of an individual on the crew, refer to the field category. The roles "actor" and "actress" are different and should be accounted for as such.

    • Solutions:

      思路:通过Nicole找到她出演的所有作品,然后列出所有参与这些作品的演员,并去重。因为我们先要找作品,然后要找演员,所以可以有两个WITH AS子句。

      WITH Nicole_Works AS (
      	SELECT DISTINCT(crew.title_id)
          FROM people INNER JOIN crew 
          	ON crew.person_id == people.person_id
          		AND people.name == "Nicole Kidman"
          		AND people.born == 1967
      ),
      Nicole_Colleagues AS (
      	SELECT DISTINCT(crew.person_id) AS id
          FROM crew
          WHERE
          	(crew.category == "actor" OR crew.category == "actress") AND crew.title_id IN Nicole_Works
      )
      SELECT name
      FROM people JOIN Nicole_Colleagues 
      	ON Nicole_Colleagues.id = people.person_id
      ORDER BY name ASC;
      
  • Q9:

    • Brief:
      For all people born in 1955, get their name and average rating on all movies they have been part of through their careers. Output the 9th decile of individuals as measured by their average career movie rating.

    • Details:

      Calculate average ratings for each individual born in 1955 across only the movies they have been part of. Compute the quantiles for each individual's average rating using NTILE(10).
      Make sure your output is formatted as follows (round average rating to the nearest hundredth, results should be ordered by a compound value of their ratings descending and secondly their name in alphabetical order): Stanley Nelson|7.13

    • Note: You should take quantiles after processing the average career movie rating of individuals. In other words, find the individuals who have an average career movie rating in the 9th decile of all individuals.

    • Solutions:
      思路:先找1955年生的人,然后根据人来分组,列出作品,算出平均值,最后用NTILE来分桶,然后给出第九份。

      WITH actor_movie_1955 AS (
      	SELECT
          	people.person_id,	-- 用来根据人分组
          	people.name,
          	titles.title_id,	-- 用来查ratings
          	titles.primary_title
          FROM
          	people
          INNER JOIN
          	crew ON people.person_id = crew.person_id
          INNER JOIN
          	titles ON titles.title_id = crew.title_id
          WHERE people.born = 1955 AND titles.type = "movie"
      ),	-- 第一个with过滤出来了1995年出生的人出演的电影
      actor_ratings AS (
      	SELECT
          	name,
          	ROUND(AVG(ratings.rating), 2) AS rating
          FROM ratings
          INNER JOIN actor_movie_1955
          	ON ratings.title_id = actor_movie_1955.title_id
          GROUP BY actor_movie_1955.person_id
      ),
      quartiles AS (
      	SELECT *, NTILE(10) OVER (ORDER BY rating ASC)
          	AS RatingQuartile FROM actor_ratings
      )
      SELECT name, rating
      FROM quartiles
      WHERE RatingQuartile = 9
      ORDER BY rating DESC, name ASC;
      
      • NTILE()函数:就是把所有行分为n等份。
  • Q10:

    • Brief:
      Concatenate all the unique titles for the TV Series "House of the Dragon" as a string of comma-separated values in alphabetical order of the titles.

    • Details:

      Find all the unique dubbed titles for the new TV show "House of the Dragon" and order them alphabetically. Print a single string containing all these titles separated by commas.

    • Hint: You might find Recursive CTEs useful.

    • Note: Two titles are different even if they differ only in capitalization. Elements in the comma-separated value result should be separated with both a comma and a space, e.g. "foo, bar".

    • Solutions:
      就是输出权利的游戏的所有其他名字(不同语言的),然后用逗号分割输出

      WITH p as (
      	SELECT titles.primary_title as name, akas.title as dubbed
          FROM titles
          INNER JOIN akas ON titles.title_id = akas.title_id
          WHERE titles.primary_title = "House of the Dragon" AND titles.type = "tvSeries"
          GROUP BY titles.primary_title, akas.title
          ORDER BY akas.title
      ),
      c as (
      	SELECT row_number() 
          	OVER (ORDER BY p.name ASC) AS seqnum,p.dubbed as dubbed
          FROM p
      ),
      flattended AS (
      	SELECT seqnum, dubbed
          FROM c
          WHERE seqnum = 1
          UNION ALL
          SELECT c.seqnum, f.dubbed || ', ' || c.dubbed
          FROM c JOIN flattended f 
          	ON c.seqnum = f.seqnum + 1
      )
      SELECT dubbed FROM flattended
      ORDER BY seqnum DESC LIMIT 1;
      

5. Summary

Q1~Q9还是不难的,但是Q10还没理解到。以后再看。

posted @ 2023-01-24 15:02  M1kanN  阅读(565)  评论(1编辑  收藏  举报