postgresql创建统计信息优化

创建统计信息

创建统计信息

预估行数

pg中对于单列选择性的估算比较准确，而对于多列的情况则会出现不准确的情况，因为pg默认使用独立属性，直接以多个字段选择性相乘的方法计算多个字段条件的选择性

创建统计信息功能是在pg10版本引入的功能，我们在优化SQL的时候，一个最重要的点就是统计信息是否准确，统计信息不准会导致优化器预估的行数不准确，进而影响扫描方法和连接方式。

--创建表
postgres=# create table test2(n_id int,id1 int,id2 int);
CREATE TABLE
postgres=# insert into test2 select i,i/1000,i/10000 from generate_series(1,1000000) s(i);
INSERT 0 1000000
postgres=# analyze test2;
ANALYZE
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stats where tablename = 'test2' and attname = 'id1';
-[ RECORD 1 ]----------+-------------------------------------------------------------------------------------
schemaname             | public
tablename              | test2
attname                | id1
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 1000
most_common_vals       | {381,649,852,142,269,415,496,537,714,80,177,303,526,870,924}
most_common_freqs      | {0.0016,0.0016,0.0015666666,0.0015333333,0.0015333333...}
histogram_bounds       | {0,10,19,29,39,49,59,69,78,89,99,109,119,128,139,...}
correlation            | 1
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   | 

--查看执行计划
postgres=# explain (analyse,buffers) select * from test2 where id1 = 1;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on test2  (cost=0.00..17906.00 rows=992 width=12) (actual time=0.144..109.573 rows=1000 loops=1)
   Filter: (id1 = 1)
   Rows Removed by Filter: 999000
   Buffers: shared hit=5406
 Planning Time: 0.118 ms
 Execution Time: 109.697 ms
(6 rows)

预估的行数为992和实际返回1000，预估和实际非常接近

如果在id1和id2上都过滤数据时，会怎么样？

postgres=# explain (analyse,buffers) select * from test2 where id1 = 1 and id2= 0;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on test2  (cost=0.00..20406.00 rows=10 width=12) (actual time=0.153..138.057 rows=1000 loops=1)
   Filter: ((id1 = 1) AND (id2 = 0))
   Rows Removed by Filter: 999000
   Buffers: shared hit=5406
 Planning Time: 0.267 ms
 Execution Time: 138.184 ms
(6 rows)

预估为10但是实际为1000，相差100倍了，为什么会发生这种情况呢？

第一列的选择性大约是0.001（1/1000），第二列的选择性是0.01（1/100）。为了计算被这2个 "独立 "条件过滤的行数，planner将它们的选择性相乘。所以，我们得到

选择性 = 0. 001 * 0. 01 = 0. 00001

当这个乘以我们在表中的行数即1000000时，我们得到10。这就是planner估计的10的由来。但是，这几列不是独立的，我们怎么告诉planner呢？

函数依赖

pg10开始支持用户自定义统计信息，这样我们便可以针对这种多列的情况创建自定义多个字段的统计信息，目前支持多列相关性和多列唯一值两种统计。

创建函数依赖
回到我们之前的估算问题，问题是col2的值其实不过是col1 / 10。在数据库术语中，我们会说col2在功能上依赖于col1。这意味着col1的值足以决定col2的值，不存在两行col1的值相同而col2的值不同的情况。因此，col2上的第2个过滤器实际上并没有删除任何行！但是，planner捕捉到了足够的统计数据。但是，规划者捕捉到了足够的统计数据来知道这一点。

--创建统计信息
postgres=# create statistics s1(dependencies) on id1,id2 from test2;
CREATE STATISTICS
--analyze
postgres=# analyze test2;
ANALYZE
--查看执行计划
postgres=# explain (analyse,buffers) select * from test2 where id1 = 1 and id2 = 0;
                                                QUERY PLAN                                         
        
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on test2  (cost=0.00..20406.00 rows=997 width=12) (actual time=0.159..124.450 rows=1000 l oops=1)
   Filter: ((id1 = 1) AND (id2 = 0))
   Rows Removed by Filter: 999000
   Buffers: shared hit=5406
 Planning Time: 0.364 ms
 Execution Time: 124.592 ms
(6 rows)
--查看统计信息
postgres=# SELECT stxname,stxkeys,extdat.stxddependencies  FROM pg_statistic_ext ext join pg_statistic_ext_data extdat on ext.oid = extdat.stxoid; 
 stxname | stxkeys |   stxddependencies   
---------+---------+----------------------
 s1      | 2 3     | {"2 => 3": 1.000000}
(1 row)
--stxkeys中的2 3表示表的第二列和第三列
--从这一点来看，我们可以看到Postgres意识到col1完全决定了col2，因此有一个系数为1来捕捉这些信息。现在，所有对这两列进行过滤的查询都会有更好的估计。
postgres=# select statistics_name,attnames,dependencies from  pg_stats_ext;
 statistics_name | attnames  |     dependencies     
-----------------+-----------+----------------------
 s1              | {id1,id2} | {"2 => 3": 1.000000}
(1 row)

简单来说就是id1相同的两行数据，那么id2也一定相同，（比如前面插入的数据id2=0的有10000条数据，那么id1的值分别是（1-9）分别有1000条数据，任意两行id1相等的数据，id2一定也是一样的）

如果没有函数依赖性统计，规划器会认为两个WHERE条件是独立的，并且会将它们的选择性乘以一起，以致得到太小的行数估计。通过这样的统计，规划器认识到WHERE条件是多余的，并且不会低估行数。

当前只有在考虑简单等值条件（将列与常量值比较）时，函数依赖才适用。不会使用它们来改进比较两个列或者比较列和表达式的等值条件的估计，也不会用它们来改进范围子句、LIKE或者任何其他类型的条件。

ndistinct 统计

单列统计信息存储每一列中可区分值的数量。在组合多个列（例如GROUP BY a，b）时，如果规划器只有单列统计数据，则对可区分值数量的估计常常会错误，导致选择不好的计划

--对test2表进行group by id1,id2操作
postgres=# explain (analyse,buffers) select id1,id2,count(*) from test2  group by id1,id2;
                                                      QUERY PLAN                                   
                    
------------------------------------------------------------------------------------------------------------------------

 HashAggregate  (cost=22906.00..23906.00 rows=100000 width=16) (actual time=473.444..474.544 rows=1001 loops=1)
   Group Key: id1, id2
   Buffers: shared hit=5406
   ->  Seq Scan on test2  (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.022..178.253 rows=1000000 loops=1)
         Buffers: shared hit=5406
 Planning Time: 1.202 ms
 Execution Time: 479.178 ms
(7 rows)

planner预估的行数是100000，而实际行数只有1001

--创建ndistinct
postgres=# create statistics s2(ndistinct) on id1,id2 from test2;
CREATE STATISTICS

postgres=# analyze test2;
ANALYZE
--可以看到planner预估值更加准确了，预估1000，实际返回10001
postgres=# explain (analyse,buffers) select id1,id2,count(*) from test2  group by id1,id2;
                                                      QUERY PLAN                                    
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22906.00..22916.00 rows=1000 width=16) (actual time=442.839..443.160 rows=1001 loops=1)
   Group Key: id1, id2
   Buffers: shared hit=5406
   ->  Seq Scan on test2  (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.029..147.498 rows=1000000 loops=1)
         Buffers: shared hit=5406
 Planning Time: 0.364 ms
 Execution Time: 444.362 ms
(7 rows)
--统计信息记录了ndistinct值：为1000
postgres=# sELECT stxname,stxkeys,extdat.stxdndistinct  FROM pg_statistic_ext ext join pg_statistic_ext_data extdat on ext.oid = extdat.stxoid where stxname = 's2'; 
 stxname | stxkeys | stxdndistinct  
---------+---------+----------------
 s2      | 2 3     | {"2, 3": 1000}
(1 row)

常见的情况有月，季，年的列。省，市，区等需要联合group by的情况

建议只对实际用于分组的列组合以及分组数错误估计导致了糟糕计划的列组合创建ndistinct统计信息对象。否则，ANALYZE循环只会被浪费。

高频值-mvc

MCV（most common values）

创建表t2与两个完全相关的列(包含相同的数据)，并且在这些列上创建一个MCV列表:

--创建表
CREATE TABLE t2 (a   int,b   int);
--插入数据
INSERT INTO t2 SELECT mod(i,100), mod(i,100)FROM generate_series(1,1000000) s(i);
analyze t2;
--查看执行计划预估97，实际10000
postgres=#  EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 70);
                                              QUERY PLAN                                           
    
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..19425.00 rows=97 width=8) (actual time=0.038..123.438 rows=10000 loops=1)
   Filter: ((a = 70) AND (b = 70))
   Rows Removed by Filter: 990000
 Planning Time: 0.150 ms
 Execution Time: 124.647 ms
(5 rows)

--创建mcv
CREATE STATISTICS s3 (mcv) ON a, b FROM t2;
ANALYZE t2;

-- valid combination (found in MCV)，a=70 and b=70预估11267，实际10000，相差不大
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 70);
                                                QUERY PLAN                                           
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..19425.00 rows=11267 width=8) (actual time=0.069..181.738 rows=10000 loops=1)
   Filter: ((a = 70) AND (b = 70))
   Rows Removed by Filter: 990000
 Planning Time: 1.120 ms
 Execution Time: 182.452 ms
(5 rows)

-- invalid combination (not found in MCV)
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 80);
                                             QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..19425.00 rows=1 width=8) (actual time=125.878..125.879 rows=0 loops=1)
   Filter: ((a = 70) AND (b = 80))
   Rows Removed by Filter: 1000000
 Planning Time: 0.207 ms
 Execution Time: 125.945 ms
(5 rows)
----可以看到{70，70}在高频值中,{70,80}不在高频值中
postgres=# SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
postgres-# , pg_mcv_list_items(stxdmcv) m WHERE stxname = 's3';
 index | values  | nulls |      frequency       |     base_frequency     
-------+---------+-------+----------------------+------------------------
     0 | {70,70} | {f,f} | 0.011266666666666666 | 0.00012693777777777776
     1 | {78,78} | {f,f} |               0.0111 |             0.00012321
     2 | {32,32} | {f,f} | 0.011066666666666667 | 0.00012247111111111112
     3 | {13,13} | {f,f} | 0.011033333333333332 | 0.00012173444444444442
     4 | {82,82} | {f,f} |                0.011 | 0.00012099999999999999
....

--当 WHERE (a = 70) AND (b = 70)的时候rows=11267是如何计算的呢
rows=  1000000 *  0.011266666666666666
     = 11267

a和b的组合中实际频率（在样本中）约为1%。组合的基本频率（根据简单的每列频率计算）仅为0.01％，导致两个数量级的低估。

--计算WHERE (a = 1) AND (b = 2)的预估值
--为方便计算将a和b的statisics设置为10
alter table  t2 alter column  a SET STATISTICS 10;
alter table  t2 alter column  b SET STATISTICS 10;
analyze t2;
postgres=# SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
postgres-# WHERE tablename='t2' AND attname in('a','b');
 null_frac | n_distinct | most_common_vals | most_common_freqs 
-----------+------------+------------------+-------------------
         0 |        100 | {7,85}           | {0.015,0.015}
         0 |        100 | {7,85}           | {0.015,0.015}
(2 rows)
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
                                             QUERY PLAN                                            
  
--------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..19425.75 rows=98 width=8) (actual time=137.876..137.876 rows=0 loops=1)
   Filter: ((a = 1) AND (b = 2))
   Rows Removed by Filter: 1000000
 Planning Time: 0.452 ms
 Execution Time: 137.924 ms
(5 rows)
--a和b的选择率都是一样的
selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
postgres=# select (1-(0.014999999664723873+0.014999999664723873))/(100-2);
        ?column?        
------------------------
 0.00989795919051583933
(1 row)
--rows，预估返回98行
rows= reltuple*selectivity(a=1) * selectivity(b = 2)
postgres=# select 1000000*0.00989795919051583933*0.00989795919051583933;
                  ?column?                   
---------------------------------------------
 97.9695961371169693741399756143748489000000
(1 row)
 
 --其他a=70 and b=80没有在高频中，预估也是98
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a =70) AND (b =80);
                                             QUERY PLAN                                            
  
---------------------------------------------------------------------------------------------------

 Seq Scan on t2  (cost=0.00..19425.75 rows=98 width=8) (actual time=121.889..121.889 rows=0 loops=1)
   Filter: ((a = 70) AND (b = 80))
   Rows Removed by Filter: 1000000
 Planning Time: 0.160 ms
 Execution Time: 121.952 ms
(5 rows)

建议仅在实际在条件中一起使用的列的组合上创建MCV统计对象，对于这些组合，错误估计组数会导致糟糕的执行计划。否则，只会浪费ANALYZE和规划时间。

default_statistics_target

提升该限制可能会让规划器做出更准确的估计（特别是对那些有不规则数据分布的列），其代价是在pg_statistic中消耗了更多空间，并且需要略微多一些的时间来计算估计数值。

注意：该值的取值范围是0-1000，其中值越低采样比例就越低，分析结果的准确性也就越低，但是ANALYZE命令执行的速度却更快。如果将该值设置为-1，那么该字段的采样比率将恢复到系统当前默认的采样值，我们可以通过下面的命令获取当前系统的缺省采样值。
postgres=# show default_statistics_target;
     default_statistics_target
    ---------------------------
     100
    (1 row)
从上面的结果可以看出，该数据库的缺省采样值为100(10%)。

可以修改表字段也可以修改索引

ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]  
    action [, ... ]  
  
    ALTER [ COLUMN ] column_name SET STATISTICS integer

--创建test表插入100w数据
postgres=# CREATE TABLE test AS (SELECT random() x, random() y FROM generate_series(1,1000000));
SELECT 1000000
postgres=# ANALYZE test;
ANALYZE
--创建索引
postgres=# create index i_test_idx on test((x+y));
CREATE INDEX
postgres=# analyze test;
ANALYZE
--查看x+y的执行计划
postgres=# explain analyze select * from test where x+y <0.01;
                                                      QUERY PLAN                                   
                   
---------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test  (cost=7.68..673.21 rows=652 width=16) (actual time=0.036..0.283 rows=60 loops=1)
  Recheck Cond: ((x + y) < '0.01'::double precision)
  Heap Blocks: exact=60
  ->  Bitmap Index Scan on i_test_idx  (cost=0.00..7.51 rows=652 width=0) (actual time=0.017..0.017 rows=60 loops=1)
        Index Cond: ((x + y) < '0.01'::double precision)
Planning Time: 0.569 ms
Execution Time: 0.342 ms
(7 rows)

--修改statistics
postgres=# ALTER INDEX i_test_idx ALTER COLUMN expr SET STATISTICS 3000;
ALTER INDEX
postgres=# analyze test;
ANALYZE
--再次查看执行计划发现预估的值和实际的值比较接近了
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                      QUERY PLAN                                   
                    
---------------------------------------------------------------------------------------------------
Index Scan using i_test_idx on test  (cost=0.42..135.64 rows=121 width=16) (actual time=0.011..0.277 rows=60 loops=1)
  Index Cond: ((x + y) < '0.01'::double precision)
Planning Time: 0.515 ms
Execution Time: 0.342 ms
(4 rows)

--修改为10000
postgres=#  ALTER INDEX i_test_idx ALTER COLUMN expr SET STATISTICS 10000;
ALTER INDEX
postgres=# analyze test;
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                     QUERY PLAN                                    
                  
---------------------------------------------------------------------------------------------------
 Index Scan using i_test_idx on test  (cost=0.42..80.87 rows=71 width=16) (actual time=0.010..0.217
 rows=60 loops=1)
   Index Cond: ((x + y) < '0.01'::double precision)
 Planning Time: 0.784 ms
 Execution Time: 0.283 ms
(4 rows)

使用alter 修改statistics为3000，这个数字设置了直方图中使用了多少个桶以及存储了多少个最常见的值，

这样取样的粒度也就越细，可以查看pg_stats中histogram_bounds记录了更多的值。

查看修改的值

postgres=# select cla.relname,att.attname,att.attstattarget  from pg_attribute att join pg_class cla on att.attrelid=cla.oid where cla.relname = 'i_test_idx';
  relname   | attname | attstattarget 
------------+---------+---------------
 i_test_idx | expr    |          3000
(1 row)

--或者
postgres=# \d+ i_test_idx
                       Index "public.i_test_idx"
 Column |       Type       | Key? | Definition | Storage | Stats target 
--------+------------------+------+------------+---------+--------------
 expr   | double precision | yes  | (x + y)    | plain   | 3000
btree, for table "public.test"

代价异常

今天人问为什么index时代价显示比最终的代价小，有时候执行计划节点的某一个子节点的cost比总的cost大是正常情况

一般来说常见的有两种情况

1、在有limit 1的情况下，实际的行数只有1并且运行时间远低于开销估计所建议的时间。这并非预估错误

postgres=# explain analyze select * from test3 where n_id < 1000 limit 1;
                                                           QUERY PLAN                              
                             
---------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.022..0.022 rows=1 loops=1)
   ->  Index Only Scan using i_test3_id on test3  (cost=0.29..25.13 rows=939 width=4) (actual time=
0.019..0.019 rows=1 loops=1)
         Index Cond: (n_id < 1000)
         Heap Fetches: 1
 Planning Time: 0.297 ms
 Execution Time: 0.092 ms
(6 rows)

postgres=# explain analyze select * from test3 where n_id < 1000 ;
                                                         QUERY PLAN                                
                          
---------------------------------------------------------------------------------------------------
 Index Only Scan using i_test3_id on test3  (cost=0.29..25.13 rows=939 width=4) (actual time=0.020..19.543 rows=999 loops=1)
   Index Cond: (n_id < 1000)
   Heap Fetches: 999
 Planning Time: 0.718 ms
 Execution Time: 19.707 ms
(5 rows)

2、归并连接也有这样的情况，如果一个归并连接用尽了一个输入并且其中的最后一个键值小于另一个输入中的下一个键值，它将停止读取另一个输入。在这种情况下不过会有更多的匹配，因此不需要第二个输入的剩余部分。这会导致不读取另一个子节点的所有内容
Index Scan using i_aj_all_bh_ysaj预估的代价是47W，而最终总的预估为39W

GroupAggregate  (cost=391236.41..391243.61 rows=188 width=45) (actual time=5839.527..5861.034 rows=184 loops=1)
  Group Key: test_1.c_jbfy
  ->  Sort  (cost=391236.41..391238.18 rows=710 width=38) (actual time=5839.324..5847.166 rows=105340 loops=1)
        Sort Key: test_1.c_jbfy
        Sort Method: quicksort  Memory: 11302kB
        ->  Merge Join  (cost=342460.08..391202.78 rows=710 width=38) (actual time=4280.410..5688.354 rows=105340 loops=1)
              Merge Cond: ((test.c_bh_ysaj)::text = (test_1.c_bh)::text)
              ->  Index Scan using i_aj_all_bh_ysaj on test  (cost=0.43..470402.81 rows=127589 width=32) (actual time=0.012..1054.034 rows=162022 loops=1)
                    Index Cond: (c_bh_ysaj IS NOT NULL)
                    Filter: ((c_ah IS NOT NULL) AND (d_jarq >= to_date('20190101'::text, 'yyyymmdd'::text)) AND (d_jarq <= to_date('20200101'::text, 'yyyymmdd'::text))))
                    Rows Removed by Filter: 254334
              ->  Sort  (cost=342459.60..343014.47 rows=221949 width=38) (actual time=4279.757..4342.135 rows=357118 loops=1)
                    Sort Key: test_1.c_bh
                    Sort Method: quicksort  Memory: 40188kB
.......
--另外由于实现的限制，BitmapAnd和BitmapOr节点总时报告实际行数为0