《Troubleshooting SQL Server》读书笔记-性能故障诊断方法论
第一章 A Performance Troubleshooting Methodology。
确定具体的方法论用于SQL Server故障诊断是件很难的事情。因为根据问题和环境的不同,解决方法也不一样。一个准确的故障诊断,不仅是收集各种信息,而且要明白它们的含义。千万不要只见树木不见森林。
作者通过一个简单的诊断示例,演示基本的方法和借助的工具:
1. 首先查看DMV sys.dm_os_wait_stats来确定操作系统层面的主要的资源等待。
2. 假设发现主要等待类型为PAGEIOLATCH_SH。PAGEIOLATCH_SH发生的原因是磁盘子系统的速度不能满足DB对页的请求速度。
看起来是IO瓶颈。但是我们要考虑周全,也有可能是其它的问题导致的过度IO。而这是表现不是根本原因。
3. 然后通过sys.dm_io_virtual_file_stats得到各个数据库和数据文件的IO量和IO延时。同时查看性能计数器Physical Disk\Avg. Disk Reads/sec和Physical Disk\Avg. Disk Writes/sec。
这样就能把问题定位到具体数据库。
4. 针对上面确定的数据库,通过sys.dm_exec_query_stats查看是哪些查询造成了大量的IO.找出这些语句调整或者采取其它的调优方式。
5. 同时通过观察性能计数器Page Life Expectancy,Free List Stalls/sec和high Lazy Writes/sec,如果PLE值很低,而后两个计数器常不为0,则可认为是内存瓶颈。
在这个故障诊断的案例中有许多要关注的点,一不小心就会被带坑里去,采用了错误的调优方式。比如,从一始我们就认定是IO瓶颈,于是提高IO子系统的速度。虽然短时会改善性能,但是过段时间这个问题又会出现。
所以千万不要只见树木不见森林。
等待统计(Wait Statistics: the Basis for Troubleshooting):
在故障诊断的第一个检查的项一般会是等待统计。SQL Server有一个伪操作系统SQLOS,管理着内部的各种操作和资源。而SQLOS中的操作和资源的各种等待,可以通过sys.dm_os_wait_stats查询得到。
因为很多时候对问题现状的描述不是很清楚,比如“数据库好慢”,所以我们搞明白SQL Server在待什么和为什么等待。
下面的查询已经过滤掉非问题型等待:
SELECT TOP 10
wait_type ,
max_wait_time_ms wait_time_ms ,
signal_wait_time_ms ,
wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms ,
100.0 * wait_time_ms / SUM (wait_time_ms ) OVER ( )
AS percent_total_waits ,
100.0 * signal_wait_time_ms / SUM (signal_wait_time_ms) OVER ( )
AS percent_total_signal_waits ,
100.0 * ( wait_time_ms - signal_wait_time_ms )
/ SUM (wait_time_ms ) OVER ( ) AS percent_total_resource_waits
FROM sys .dm_os_wait_stats
WHERE wait_time_ms > 0 -- remove zero wait_time
AND wait_type NOT IN -- filter out additional irrelevant waits
( 'SLEEP_TASK' , 'BROKER_TASK_STOP', 'BROKER_TO_FLUSH' ,
'SQLTRACE_BUFFER_FLUSH','CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT',
'LAZYWRITER_SLEEP', 'SLEEP_SYSTEMTASK', 'SLEEP_BPOOL_FLUSH',
'BROKER_EVENTHANDLER', 'XE_DISPATCHER_WAIT' , 'FT_IFTSHC_MUTEX' ,
'CHECKPOINT_QUEUE', 'FT_IFTS_SCHEDULER_IDLE_WAIT' ,
'BROKER_TRANSMITTER' , 'FT_IFTSHC_MUTEX' , 'KSOURCE_WAKEUP',
'LAZYWRITER_SLEEP', 'LOGMGR_QUEUE', 'ONDEMAND_TASK_QUEUE',
'REQUEST_FOR_DEADLOCK_SEARCH' , 'XE_TIMER_EVENT', 'BAD_PAGE_PROCESS',
'DBMIRROR_EVENTS_QUEUE', 'BROKER_RECEIVE_WAITFOR' ,
'PREEMPTIVE_OS_GETPROCADDRESS', 'PREEMPTIVE_OS_AUTHENTICATIONOPS',
'WAITFOR', 'DISPATCHER_QUEUE_SEMAPHORE', 'XE_DISPATCHER_JOIN' ,
'RESOURCE_QUEUE' )
ORDER BY wait_time_ms DESC
一些常见的等待类型有:
CXPACKET:并行执行等待
SOS_SCHEDULER_YIELD:每个worker在占用scheduler一会儿然后退让(Yield),让其它的worker使用scheduler。此等待通常说明有CPU压力。
THREADPOOL:每个task需要绑定到一个worker执行。此等待说明worker用尽,有CPU压力,或者阻塞导致大量的task长时间占用woker。
LCK_*:锁等待。
PAGEIOLATCH_*, IO_COMPLETION, WRITELOG:IO瓶颈的表现。
PAGELATCH_*:页争用等待。此等待最出名代表就是tempdb的系统页争用问题。
LATCH_*:是一种轻量短暂同步对象,用于保护访问的内部对象。根据等待类型不同,代表着不同的问题。sys.dm_os_latch_stats
ASYNC_NETWORK_IO:通常是客户端处理数据性能不够快。
根据不同等待类型确定,来确定我们下一步诊断方向。如果PAGEIOLATCH_*等待过高,接下来就会检查虚拟文件状态,如果LCK_*等待过高,就会检查数据库中的阻塞状况,而不会浪费时间去检查IO配置。
在解决完一个问题后需要清空等待信息,然后等过段时间DMV中的数据积累到一定程度,再来确定问题是否还存在。如下语句清空:
DBCC SQLPERF ('sys.dm_os_wait_stats' , clear )
虚拟文件统计(Virtual File Statistics):
只看等待统计,很多问题都像是IO瓶颈。所以我们要检查虚拟文件统计。
SELECT DB_NAME(vfs .database_id) AS database_name ,
vfs.database_id ,
vfs.FILE_ID ,
io_stall_read_ms / NULLIF(num_of_reads , 0) AS avg_read_latency ,
io_stall_write_ms / NULLIF(num_of_writes, 0)
AS avg_write_latency ,
io_stall / NULLIF(num_of_reads + num_of_writes, 0)
AS avg_total_latency ,
num_of_bytes_read / NULLIF(num_of_reads , 0)
AS avg_bytes_per_read ,
num_of_bytes_written / NULLIF(num_of_writes, 0)
AS avg_bytes_per_write ,
vfs.io_stall ,
vfs.num_of_reads ,
vfs.num_of_bytes_read ,
vfs.io_stall_read_ms ,
vfs.num_of_writes ,
vfs.num_of_bytes_written ,
vfs.io_stall_write_ms ,
size_on_disk_bytes / 1024 / 1024. AS size_on_disk_mbytes ,
physical_name
FROM sys .dm_io_virtual_file_stats (NULL, NULL) AS vfs
JOIN sys .master_files AS mf ON vfs.database_id = mf .database_id
AND vfs.FILE_ID = mf .FILE_ID
ORDER BY avg_total_latency DESC
发现有高IO等待意味着有IO瓶颈,但是其根本原因可能是磁盘了系统有瓶颈,也可能是索引过多、索引缺失、内存瓶颈和低效的查询等。还要借助其它方式进一步确定根本原因。千万不要只见树木不见森林。
性能计数器(Performance Counters)
在诊断初期,会先收集针对SQL Server的性能计数器,随着深入再收集Windows的计数器。这些数据有助于明白当前的性能状况和界定问题的边界。
推荐的分析工具PAL。可以使用PerfMon收集,也可以通过 sys.dm_os_performance_counters查看。
使用DMV时,要注意计数器数据是增量的,所以要收集两个时间点上的数据副本进行对比。
DECLARE @CounterPrefix NVARCHAR (30)
SET @CounterPrefix = CASE WHEN @@SERVICENAME = 'MSSQLSERVER'
THEN 'SQLServer:'
ELSE 'MSSQL$' + @@SERVICENAME + ':'
END ;
-- Capture the first counter set
SELECT CAST(1 AS INT ) AS collection_instance ,
[OBJECT_NAME] ,
counter_name ,
instance_name ,
cntr_value ,
cntr_type ,
CURRENT_TIMESTAMP AS collection_time
INTO #perf_counters_init
FROM sys .dm_os_performance_counters
WHERE ( OBJECT_NAME = @CounterPrefix + 'Access Methods'
AND counter_name = 'Full Scans/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Access Methods'
AND counter_name = 'Index Searches/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Buffer Manager'
AND counter_name = 'Lazy Writes/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Buffer Manager'
AND counter_name = 'Page life expectancy'
)
OR ( OBJECT_NAME = @CounterPrefix + 'General Statistics'
AND counter_name = 'Processes Blocked'
)
OR ( OBJECT_NAME = @CounterPrefix + 'General Statistics'
AND counter_name = 'User Connections'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Locks'
AND counter_name = 'Lock Waits/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Locks'
AND counter_name = 'Lock Wait Time (ms)'
)
OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics'
AND counter_name = 'SQL Re-Compilations/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Memory Manager'
AND counter_name = 'Memory Grants Pending'
)
OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics'
AND counter_name = 'Batch Requests/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics'
AND counter_name = 'SQL Compilations/sec'
)
-- Wait on Second between data collection
WAITFOR DELAY '00:00:01'
-- Capture the second counter set
SELECT CAST(2 AS INT ) AS collection_instance ,
OBJECT_NAME ,
counter_name ,
instance_name ,
cntr_value ,
cntr_type ,
CURRENT_TIMESTAMP AS collection_time
INTO #perf_counters_second
FROM sys .dm_os_performance_counters
WHERE ( OBJECT_NAME = @CounterPrefix + 'Access Methods'
AND counter_name = 'Full Scans/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Access Methods'
AND counter_name = 'Index Searches/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Buffer Manager'
AND counter_name = 'Lazy Writes/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Buffer Manager'
AND counter_name = 'Page life expectancy'
)
OR ( OBJECT_NAME = @CounterPrefix + 'General Statistics'
AND counter_name = 'Processes Blocked'
)
OR ( OBJECT_NAME = @CounterPrefix + 'General Statistics'
AND counter_name = 'User Connections'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Locks'
AND counter_name = 'Lock Waits/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Locks'
AND counter_name = 'Lock Wait Time (ms)'
)
OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics'
AND counter_name = 'SQL Re-Compilations/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'Memory Manager'
AND counter_name = 'Memory Grants Pending'
)
OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics'
AND counter_name = 'Batch Requests/sec'
)
OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics'
AND counter_name = 'SQL Compilations/sec'
)
-- Calculate the cumulative counter values
SELECT i .OBJECT_NAME ,
i.counter_name ,
i.instance_name ,
CASE WHEN i.cntr_type = 272696576
THEN s.cntr_value - i.cntr_value
WHEN i.cntr_type = 65792 THEN s.cntr_value
END AS cntr_value
FROM #perf_counters_init AS i
JOIN #perf_counters_second AS s
ON i.collection_instance + 1 = s.collection_instance
AND i.OBJECT_NAME = s.OBJECT_NAME
AND i.counter_name = s.counter_name
AND i.instance_name = s.instance_name
ORDER BY OBJECT_NAME
-- Cleanup tables
DROP TABLE #perf_counters_init
DROP TABLE #perf_counters_second
收集的计数器包括:
• SQLServer:Access Methods\Full Scans/sec
• SQLServer:Access Methods\Index Searches/sec
• SQLServer:Buffer Manager\Lazy Writes/sec
• SQLServer:Buffer Manager\Page life expectancy
• SQLServer:Buffer Manager\Free list stalls/sec
• SQLServer:General Statistics\Processes Blocked
• SQLServer:General Statistics\User Connections
• SQLServer:Locks\Lock Waits/sec
• SQLServer:Locks\Lock Wait Time (ms)
• SQLServer:Memory Manager\Memory Grants Pending
• SQLServer:SQL Statistics\Batch Requests/sec
• SQLServer:SQL Statistics\SQL Compilations/sec
• SQLServer:SQL Statistics\SQL Re-Compilations/sec
Index Searches比Full Scans高个800~1000比较理想。对于大于4Gb RAM的机器,Page life expectancy=(RAM/4)*300.
计划高速缓存的使用(Plan Cache Usage)
在SQL Server2005&2008中可以使用sys.dm_exec_query_stats查询得到各种存于计划高速缓存中的执行信息。结合其它的DMV&DMF可以得到各种TOP(N)信息。如:top 10 Physical Reads
SELECT TOP 10
execution_count ,
statement_start_offset AS stmt_start_offset ,
sql_handle ,
plan_handle ,
total_logical_reads / execution_count AS avg_logical_reads ,
total_logical_writes / execution_count AS avg_logical_writes ,
total_physical_reads / execution_count AS avg_physical_reads ,
t.text
FROM sys .dm_exec_query_stats AS s
CROSS APPLY sys .dm_exec_sql_text(s.sql_handle ) AS t
ORDER BY avg_physical_reads DESC
总结
性能诊断最忌片面,千万不要只见树林不见森林。这不仅是一门手艺,更是一门艺术。