Introduction
In this article, we are going to see how Index Selectivity works in relational database systems and why the database Optimizer might choose to avoid using an index if the number of matching records is large.
Index selectivity is inversely proportional to the number of index entries matched by a given value. So, a unique index has the highest selectivity because only a single entry can be matched by any given value.
On the other hand, if column values are skewed, then a column value matching a large number of table records is going to have a low selectivity.
Domain Model
Let’s consider we have the following task
table:
The status
column is task_status
, which is a custom PosgreSQL Enum type we created like this:
CREATE TYPE task_status AS ENUM ( 'TO_DO' , 'DONE' , 'FAILED' ) |
The status
column values are unevenly distributed across the task
records, so when counting the number of matching records for every task_status
value:
SELECT COUNT (*) as matching_records, status FROM task GROUP BY status ORDER BY 1 DESC |
We can see that the records are skewed since 95% of the table records have the status
value of DONE
while only 1% of the table records have the value of TO_DO
:
| matching_records | status | | ------------------|--------| | 95000 | DONE | | 4000 | FAILED | | 1000 | TO_DO | |
Index Selectivity
Now, let’s create a B+Tree index on the status
column:
CREATE INDEX IF NOT EXISTS idx_task_status ON task (status) |
When executing the following query:
EXPLAIN ANALYZE SELECT * FROM task WHERE status = 'TO_DO' |
PostgreSQL chooses the following execution plan:
Index Scan using idx_task_status on task (cost=0.29..294.70 rows =1013 width=22) (actual time =0.162..0.342 rows =1000 loops=1) Index Cond: (status = 'TO_DO' ::task_status) |
However, when running the same query against the status
value of DONE
, we get the following SQL execution plan:
Seq Scan on task (cost=0.00..1887.00 rows =95120 width=22) (actual time =0.017..11.450 rows =95000 loops=1) Filter: (status = 'DONE' ::task_status) Rows Removed by Filter: 5000 |
As I explained in this article, relational databases use a Cost-Based Optimizer to choose the most efficient Execution Plan for a given SQL query execution.
Because the status
value of DONE
matches 95% of the table records, PostgreSQL choses to use a sequential scan instead of scanning the idx_task_status
index.
When using the idx_task_status
index, the database needs to traverse the index, locate the row identifier, and then load the table record by the record identifier in order to select the rest of the columns required by the SQL query project.
In the first example, the number of matched records is 1000
, which is 1% of the entire table record. The cost of scanning the index and loading the rest of the columns from the table is lower than scanning the entire table, so PostgreSQL uses the idx_task_status
index for the TO_DO
predicate value.
In the second example, the predicate value of DONE
has a very low index selectivity since it matches 95,000 table records. In this case, scanning the task
table page by page is faster than doing 95,000 lookups from the idx_task_status
index to the task
table.
Partial or Filtered Indexes
When the indexed column values are skewed, it’s more efficient to use a Partial or Filtered Index, like the following one:
CREATE INDEX idx_task_status ON task (status) WHERE status <> 'DONE' |
Since only the TO_DO
and FAILED
status
values have high index selectivity, it’s better to build the index only for these values and skip the row identifiers having the status
value of DONE
. This will help us reduce the index size.
Now, because PostgreSQL provides only a pg_indexes_size
function that gives us the size of all indexes associated to a given table, we will need to calculate the idx_task_status
index size by subtracting the size of the Primary Key index.
As I explained in this article, PostgreSQL creates a default index on the Primary Key, which in our case looks as follows:
| tablename | indexname | indexdef | | -----------|-----------|---------------------------------------------------------------| | task | task_pkey | CREATE UNIQUE INDEX task_pkey ON public .task USING btree (id) | |
When calculating the size of all indexes on the task
tables
SELECT pg_size_pretty( pg_indexes_size(relid) ) as "Index Size" FROM pg_statio_user_tables WHERE pg_statio_user_tables.relname = 'task' |
We get the value of 2208 kB
for the task_pkey
index:
| Index | Size | | -------|------| | 2208 | kB | |
When creating the Partial Index that contains only the TO_DO
and FAILED
status
values, the overall task
table index size is:
| Index | Size | | -------|------| | 2264 | kB | |
So, the size of the Partial Index is just 56 kB
.
However, if we create the Full Index that includes all values of the status
column, the overall task
table index size is:
| Index | Size | | -------|------| | 2904 | kB | |
So, the Full Index has a size of 696 kB
and no advantage of the Partial Index since the index selectivity of the value of DONE
is too low to be considered by our query predicate.
Conclusion
When running an SQL query, an index can help us speed up the query execution if the selectivity is high. Otherwise, the database might choose to avoid using an index.
If you have columns with skewed data, using a Partial Index can help you save space, therefore allowing you to store more index and table records in the Buffer Pool.
Copied from: https://vladmihalcea.com/index-selectivity/
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
· SQL Server 2025 AI相关能力初探
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 单元测试从入门到精通
· 上周热点回顾(3.3-3.9)
· winform 绘制太阳,地球,月球 运作规律