SQL Basic Notes

🥥 Table of Content

Basic SQL


Intermediate SQL


Advanced SQL

  • Query Optimization
  • Complex Function
  • Recursive CTE

Interview Hot Questions




🥑 Get Started!

Basic SQL

01 - Order of SQL execution

FROM(JOIN) Choose and join tables to get base data
WHERE Filters the base data (by rows)
GROUP BY Aggregates the base data
HAVING Filters the aggregated data
SELECT Returns the final data
ORDER BY Sorts the final data
LIMIT Limits the returned data to a row count

02 - JOIN

Resource 1: Visual JOINs
Resource 2: SQL JOINs Cheatsheet | Datacamp

JOIN的最终结果是生成临时表

<1> CROSS JOIN

CROSS JOIN:单纯将table a和table b的每一条记录连接

<2> 4 Types of JOIN, based on the CROSS JOIN

Step1: 将table a和table b的每一条记录连接
Step2: 根据ON条件进行筛选

  • INNER JOIN:根据ON clause返回intersection area,不具有保留左右表的功能。可以理解为左表 \(\cap\) 右表
  • LEFT JOIN:返回左表的所有信息、右表中与ON匹配的信息(不匹配的信息返回NULL)。可以理解为intersection \(\cup\) 左表
  • RIGHT JOIN:与上述相似
  • FULL JOIN:返回根据ON匹配后的左右表共有信息 + 保留左右表的所有。可以理解为左表 \(\cup\) 右表

<3> INNER JOIN vs. CROSS JOIN + WHERE

## Scenario 1
# CROSS JOIN + WHERE
SELECT *
FROM table1 t1, table2 t2
WHERE t1.price = t2.price

# INNER JOIN
SELECT *
FROM table1 t1 JOIN table2 t2 ON t1.price = t2.price

# LEFT JOIN
SELECT *
FROM table1 t1 LEFT JOIN table2 t2 ON t1.price = t2.price

Now, there are two tables, the goal for you is to find the product_id, price and its product_name.

If we use CROSS JOIN + WHERE, the process and result is like...

If we use INNER JOIN, the process and result is like...

If we compare with other JOIN, the process and result is like...

<4> ON vs. WHERE

  • ON是在生成临时表时使用的条件,会根据JOIN类型进行保留
  • WHERE是生成临时表后使用的条件,整行进行筛选,不考虑保留any other info from 左右表
## Scenario 2
# CROSS JOIN + WHERE
SELECT *
FROM table1 t1, table2 t2
WHERE t1.price = t2.price AND t2.product_name='A'

# INNER JOIN
SELECT *
FROM table1 t1 JOIN table2 t2 ON t1.price = t2.price AND t2.product_name='A'

# INNER JOIN
SELECT *
FROM table1 t1 JOIN table2 t2 ON t1.price = t2.price
WHERE t2.product_name='A'

# LEFT JOIN
SELECT *
FROM table1 t1 LEFT JOIN table2 t2 ON t1.price = t2.price AND t2.product_name='A'

# LEFT JOIN
SELECT *
FROM table1 t1 LEFT JOIN table2 t2 ON t1.price = t2.price
WHERE t2.product_name='A'

Now, there are two tables, the goal for you is to find the product_id and price which product name is A.

If we use CROSS JOIN + WHERE, the process and result is like...

If we use INNER JOIN and put the WHERE clause on the ON clause, the process and result is like...

If we use INNER JOIN + WHERE, the process and result is like...

If we use other JOIN and put the WHERE clause on the ON clause, the process and result is like...

If we use other JOIN + WHERE, the process and result is like...

<5> SELF JOIN

  1. 经理问题(谁归谁管,此时只有一个员工表)
    Employees Earning More Than Their Managers
  2. 时间错位(同列不同行的比较)-- 优化方法:LAG( ) / LEAD( )


03 - WHERE & HAVING

  • WHERE是筛选分组前的数据,HAVING是筛选分组后的数据
  • WHEREHAVING后必须是原始列名,不能是列的别名(AS后的别名)
  • WHEREHAVING后跟的是condition(condition是=,!=,>,>=,<,<=,LIKE,NOT LIKE,IS NULL,IS NOT NULLOR,AND的组合)
  • WHERE的condition不能包含Aggregate_function()
  • HAVING的condition只能包含Aggregate_function()GROUP BY后的字段


04 - LIKE & REGEXP

Resource: MySQL Regular Expression | 菜鸟教程

  • LIKE 'c%': 以c开头的
  • LIKE '%c': 以c结尾的
  • LIKE '%c%': 包含c的(不限位置)
  • LIKE '%%': 包含''的
SELECT *
FROM Customers
- WHERE Name LIKE 'F%'
- WHERE Name REGEXP '^F'


05 - GROUP BY

  • GROUP BY按column_name 或 (column_name_1, ... , column_name_N)对数据进行分组,假想压缩成一行数据
  • GROUP BY搭配SELECT Aggregate_fuction(column_name),根据Aggregate_fuction()输出一条记录
  • GROUP BY不搭配SELECT Aggregate_fuction(column_name),默认取分组结果的第一条记录
  • GROUP BY筛选的不是完整的一行记录,WHERE筛选的是完整的一行记录

06 - Five Aggregate Function(5)

<1>Common Aggregate Function

  • MIN(column_name): 可对文本、日期进行操作
  • MAX(column_name): 同上
  • SUM(column_name)
  • AVG(column_name)
  • COUNT(column_name), COUNT(*), COUNT(1)

<2>AVG(column_name) vs. SUM(column_name)/COUNT(column_name)

SELECT
	meal_category,
	AVG(no_of_views_lst_mth) AS average_views
FROM recipes
GROUP BY meal_category

Hence, for the real value, we need to ...

  • Change the method to SUM(column_name)/COUNT(*)
    or
  • Convert the NULL value to 0

But you need to double check the value if it equals to real value.



07 - DISTINCT

使用范围:SELECT clause

用法一:SELECT DISTINCT column_name -- 返回一列不同名称
用法二:SELECT COUNT(DISTINCT column_name) -- 统计该列不同的名称个数
用法三:SELECT COUNT(DISTINCT column_name1,column_name2) -- 统计该两列不同的组合名称个数




Intermediate SQL

01 - Window Function


<1> Basic Syntax & Key Point

<window_function> OVER(PARTITION BY <expr_clause> ORDER BY <order_clause> <frame_clause>)

Key Point: 窗口函数的核心思想是新增一列
Note 1:frame_clause不适用于窗口排名函数,适用于窗口聚合函数。
Note 2:PARTITION BY在窗口排名函数和窗口聚合函数中都可省略。


<2> ROW_NUMBER() & DENSE_RANK() & RANK()

ROW_NUMBER() OVER(PARTITION BY <partition_clause> ORDER BY <order_clause>)
DENSE_RANK() OVER(PARTITION BY <partition_clause> ORDER BY <order_clause>)
RANK() OVER(PARTITION BY <partition_clause> ORDER BY <order_clause>)

There are three methods for creating the ranking column in Window Function.

value ROW_NUMBER() DENSE_RANK() RANK()
5 1 1 1
4.9 2 2 2
4.9 3 2 2
4.8 4 3 4

If there is no ties, the three methods are same for rank.

  • ROW_NUMBER() : It's like index.
  • DENSE_RANK() : e.g. return employee names which has third high salary(4.8).
  • RANK()

💡 窗口排名函数中PARTITION BY的细节

  • 窗口排名函数,不可省略ORDER BY
  • 窗口排名函数不使用PARTITION BY进行不分组的总体排名,并打印排名。
  • 窗口排名函数使用PARTITION BY进行分组的组内排名,并打印排名。
  • 外部ORDER BY和窗口排名函数DENSE_RANK()中不使用PARTITION BYORDER BY输出的总体数据排序相同,后者多了一列排名。
  • 使用窗口函数后如需继续使用WHEREHAVING筛选行,必须外层嵌套,因为窗口函数的操作顺序在WHEREGROUP BY之后。

<3> Aggeragate Function in Window Function

💡 窗口聚合函数中ORDER BY的细节

  • ORDER BY,无<frame_clause>:窗口聚合范围 默认组内所有行
  • ORDER BY,无<frame_clause>:窗口聚合范围 默认从第一行到当前行
  • ORDER BY,有<frame_clause>:窗口聚合范围 <frame_clause>

i.窗口聚合范围:从第一行到当前行

“累计”:累计求和、累计计数、累计平均数、累计最小值、累计最大值
<aggeragate_function> OVER(ORDER BY <order_clause>)
<aggeragate_function> OVER(ORDER BY <order_clause> ROWS UNBOUNDED PRECEDING AND CURRENT ROW)

ii.窗口聚合范围:组内所有行

<aggeragate_function> OVER()
<aggeragate_function> OVER(ORDER BY <order_clause> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

iii.窗口聚合范围:frame_clause

<aggeragate_function> OVER(ORDER BY <order_clause> <frame_clause>)

<4> LAG() & LEAD()

The LAG Function and the LEAD Function in SQL | LearnSQL

LAG(column_name, offset) OVER(PARTITION BY <partition_clause> ORDER BY <order_clause>)
LEAD(column_name, offset) OVER(PARTITION BY <partition_clause> ORDER BY <order_clause>)
SELECT
	year,
	total_sale,
	LAG(total_sale) OVER(ORDER BY year) AS previous_year_total_sale,
	total_sale - LAG(total_sale) OVER(ORDER BY year) AS total_sale_difference
FROM sales

'''
time complexity: O(N)
space complexity: O(N)
'''

收益率 = \(\dfrac{current\_total\_sale - previous\_total\_sale}{previous\_total\_sale}\)

💡Compared with Self Join

SELECT
	t1.year,
	t1.total_sale - COALESCE(t2.total_sale,0) AS total_sale_diff
FROM sales t1 LEFT JOIN sales t2 ON t1.year=t2.year+INTERVAL 1 DAY
ORDER BY t1.year

'''
time complexity: O(NlogN)
space complexity: O(NlogN)
'''


02 - Handle NULL value

  • CASE WHEN column_name IS NULL THEN 0 ELSE column_name END
  • COALESCE(first_choice_value, second_choice_value),依次考虑是否为NULL,非NULL就返回value,都是NULL最终返回NULL

  • IFNULL(x,value)
# is null条件判断
SELECT
	meal_category,
	AVG(CASE WHEN no_of_views_lst_mth IS NULL
		THEN 0
		ELSE no_of_views_lst_mth END) AS average_views
FROM recipes
GROUP BY meal_category

# coalesce转换
SELECT
	meal_category,
	AVG(COALESCE(no_of_views_lst_mth, 0)) AS average_views
FROM recipes
GROUP BY meal_category

# ifnull转换
SELECT
	meal_category,
	AVG(IFNULL(no_of_views_lst_mth, 0)) AS average_views
FROM recipes
GROUP BY meal_category


03 - Handling DATETIME value

YEAR(column_name)
TIMESTAMPDIFF(unit,start,end)
DATEDIFF(column_name)

04 - Handling VARCHAR value

VARCHAR_LENGTH()
LENGTH()

SUM IF & COUNT IF

对条件进行计数:SUM(条件)等价于SUM(if(条件,1,0))
对条件进行计数:COUNT(IF(条件,1,null))

注:COUNT(条件)等价于COUNT(条件,true,null),条件满足返回true,否则返回null

DATEDIFF()

DATEDIFF(x1,x2) -- 返回x1-x2的时间差值,其中x1,x2为column_name或时间字符串

IFNULL()

IFNULL(x,value) -- 如果x为NULL,返回value

注:指标计算中,需要外部加IFNULL()。例如IFNULL(COUNT()/COUNT(),0)

ROUND()

ROUND(x,d):四舍五入保留x的d位小数
  1. 查询近30天活跃用户数

  2. 好友申请 I:总体通过率




Interview Hot Questions

Top N

Top N | Emma Ding

最大值问题的三种思路

Solution1: Subquery(WHERE is not required)

SELECT ...
FROM ...
(WHERE) ...
GROUP BY 关键字段
HAVING Aggregate_function(字段) >= ALL(SELECT Aggregate_function(字段)
                                      FROM ...
                                      (WHERE) ...
                                      GROUP BY 关键字段)
)

Solution2: Windows_function overall rank(do not use PARTITION BY)

SELECT ...
FROM (SELECT *,
          <窗口排名函数> OVER(ORDER BY 字段 DESC) AS ranking
      FROM ...

)cte
WHERE ranking = 1

Solution3: ORDER BY (only for the simple question)

ORDER BY 字段 DESC,LIMIT 1

最小值问题的三种思路

Solution1: Subquery(WHERE is not required)

SELECT ...
FROM ...
(WHERE) ...
GROUP BY 关键字段
HAVING Aggregate_function(字段) <= ALL(SELECT Aggregate_function(字段)
                                      FROM ...
                                      (WHERE) ...
                                      GROUP BY 关键字段)
)

Solution2: Windows function overall rank(do not use PARTITION BY)

SELECT ...
FROM (SELECT *,
          <窗口排名函数> OVER(ORDER BY 字段 ASC) AS ranking
      FROM ...
)cte
WHERE ranking = 1

Solution3: ORDER BY (only for the simple question)

ORDER BY 字段 ASC,LIMIT 1

Ratios

Ratios | Emma Ding

用户留存率Users Retention Rate
复购率


Consecutive

Notion

Types of Problems

  • 用户最长连续登录的天数
  • Consecutive N Days Login Users 连续登录N天的用户数
  • 连续N天交易额超过X的店铺数

Solution

  • 自连接解法
  • ROW_NUMBER()常规解法
  • LEAD()滑动窗口解法

Median

LeetCode 569. Median Employee Salary

# Solution 1: 正反序
WITH cte AS(
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY company ORDER BY salary,id ASC) AS rnk1,
        ROW_NUMBER() OVER(PARTITION BY company ORDER BY salary DESC,id DESC) AS rnk2,
        COUNT(id) OVER(PARTITION BY company) AS cnt
    FROM
        Employee
)

SELECT
    id,
    company,
    salary
FROM cte
WHERE
    rnk1>=cnt/2 AND rnk2>=cnt/2


# Solution 2: 讨论排序和n的关系
SELECT 
	Id,Company,Salary 
FROM 
(
    SELECT
    	a.*,
    	ROW_NUMBER() OVER(PARTITION BY Company ORDER BY Salary) AS rnk,
    	COUNT(Salary) OVER(PARTITION BY Company) AS cnt 
   	FROM Employee a
)t
WHERE rnk IN (cnt/2,cnt/2+1,(cnt+1)/2)
posted @ 2023-06-10 00:22  ForHHeart  阅读(12)  评论(0编辑  收藏  举报