[转]Top 10 DTrace scripts for Mac OS X

org link: http://dtrace.org/blogs/brendan/2011/10/10/top-10-dtrace-scripts-for-mac-os-x/

 

Top 10 DTrace scripts for Mac OS X

Since version 10.5 “Leopard”, Mac OS X has had DTrace, a tool used for performance analysis and troubleshooting. It provides data for Apple’s Instruments tool, as well as a collection of command line tools that are implemented as DTrace scripts. I’m familiar with the latter as I wrote the originals for theDTraceToolkit, which Apple then customized and enhanced for Mac OS X where they are shipped by default (great!). I use them regularly to answer this question:

why is my MacBook slow?

I work in an office where everyone has MacBook Pros, and “why is my MacBook slow?” is a common question. Applications can become slow or unresponsive while waiting for CPU work, memory requests or disk I/O to complete.

For people who try to ignore the slowdown, the question can become:

why is my MacBook fan so loud?

Standard performance analysis tools like Activity Monitor and top(1) (and any third-party tools based on the same foundation) can’t tell you some key information about activity on your system, such as how much CPU consumption is caused by short-lived processes, or which processes are causing disk I/O. DTrace, however, can see (just about) everything.

In this post, I’ll cover the top ten Mac OS X DTrace scripts that I use for figuring out why laptops are slow or why applications are misbehaving. Most of these scripts are already installed, a few are from the new DTrace book.

How to use DTrace

If you’ve never run a DTrace script before or even used the command line, here’s a basic walkthrough:

Open Terminal

Run the “Terminal” application. You can find it here in Finder:

You can also type “terminal” in Spotlight (the magnifying glass in the top right corner of your Mac’s screen), which should find it. I usually drag it to my Dock so it’s easy to find later:

Customize Terminal

When you first run Terminal, it’ll probably look like this:

I find the default font small and hard to read. The size of the window in terms of characters is also small (80 columns by 24 rows), presumably to pay homage to original Unix terminals of that size. (Why Unix? Mac OS X is Unix under the hood: the Darwin kernel).

You can adjust the font to your liking in Terminal->Preferences->Text. I use Monaco 13pt, with “Antialias Text” on. Under the “Window” tab is the default size, 80×24, which you can also increase later by clicking and dragging the bottom right corner of the terminal. For example, here’s my screen as I write this blog post (in a terminal-based text editor).

Running top

While DTrace can see everything, there are some things already covered by easy-to-use (and easy-to-type) tools, like top(1). Now that you have Terminal running, type “top -o cpu” and hit enter, which will refresh the screen showing top running processes. Type “q” to exit the top program. The output will look something like this:

Brendan-2:~ brendan$ top -o cpu
[...]
PID    COMMAND      %CPU TIME     #TH  #WQ  #PORT #MREGS RPRVT  RSHRD  RSIZE
67254- firefox-bin  98.8 04:53:45 62/1 1    743   9935+  374M+  78M    637M+
86550  top          15.3 00:28.60 1/1  0    24    33     1748K  264K   2324K
1068   Terminal     13.2 17:38:19 5    1    358   1326   37M    67M    61M+
61501- Google Chrom 7.9  13:40.89 6    1    93-   556-   45M-   73M    77M-
[...]

I’ve truncated the header block to just include the columns. Look at the “%CPU” column to see which processes are making the CPUs busy, and the “RSIZE” column to see who is consuming main memory. The busiest (CPU) process will be at the top, since we sorted on cpu (“-o cpu”). The top was “firefox-bin” (Mozilla Firefox) at 98.8% CPU, which is in terms of a single processor (this has two). If firefox stayed that high you could look for the responsible tab and close it down, or restart Firefox.

After top, I turn to DTrace.

Running DTrace

DTrace requires admin privileges, so to use it you’ll usually need to type in a password to authenticate, provided you have administrator access (if you aren’t sure you do, click here to see how to check).

You can run DTrace by prefixing your DTrace commands with “sudo”, which will prompt for the password the first time around (but not for some time after that). When a DTrace command is running, you usually type Ctrl-C to end it. Here’s an example:

Brendan-2:~ brendan$ sudo iosnoop
Password: password
  UID   PID D    BLOCK   SIZE       COMM PATHNAME
  503 67261 W 384070496  73728  TweetDeck ??/Cookies/Cookies.plist_tmp_67261_0.dat
  503 67261 W 384070640   4096  TweetDeck ??/Local Store/td_26_brendangregg.db-journal
  503 67261 W 384070640   4096  TweetDeck ??/Local Store/td_26_brendangregg.db-journal
  503 67261 W 308056800   4096  TweetDeck ??/Local Store/td_26_brendangregg.db
  503 67261 W 308056856   4096  TweetDeck ??/Local Store/td_26_brendangregg.db
^C
brendan@macbook:~>

If sudo says “permission denied”, check your admin status.

and here’s what happens if you forgot the “sudo”:

Brendan-2:~ brendan$ iosnoop
dtrace: failed to initialize dtrace: DTrace requires additional privileges

Scripts

The scripts follow. A note on style: if the script ends with “.d”, it’s a basic DTrace script. If it doesn’t, then it’s a shell-wrapped script that provides command line options. Use “-h” to list them.

And if you’d like to learn DTrace, I’d recommend starting by reading the basic ones.

1. iosnoop

This “traces” disk I/O execution live. Each time a disk I/O completes, a line of output is printed to summarize it, including process name and filename details:

Brendan-2:~ brendan$ sudo iosnoop
  UID   PID D    BLOCK   SIZE       COMM PATHNAME
  503 54079 R 286522800   4096 Google Chrome ??/Cache/data_2
  503 54079 R 286522808   4096 Google Chrome ??/Cache/data_2
  503 34852 W 385001216  53248      Adium ??/Default/.dat8824.01b
  503 65002 W 308056800   4096  TweetDeck ??/Local Store/td_26_brendangregg.db
  503 65002 W 308056864   4096  TweetDeck ??/Local Store/td_26_brendangregg.db
  503 65002 W 385001320   4096  TweetDeck ??/Local Store/td_26_brendangregg.db-journal
  503 65002 W 385001320   4096  TweetDeck ??/Local Store/td_26_brendangregg.db-journal
  503 65002 W 385001320   4096  TweetDeck ??/Local Store/td_26_brendangregg.db-journal
  503 54079 W 385001384  12288 Google Chrome ??/Default/Cookies-journal
  503 54079 W 385001384   4096 Google Chrome ??/Default/Cookies-journal
  503 54079 W 134993856   4096 Google Chrome ??/Default/Cookies
  503 54079 W 134994056   4096 Google Chrome ??/Default/Cookies
  503 54079 W 134994176   4096 Google Chrome ??/Default/Cookies
  503 54079 W 134994224   8192 Google Chrome ??/Default/Cookies
[...]

This lets you instantly find out which applications are using the disk, and what files they are reading or writing to. Disk I/O is typically slow (for non-SSD disks), so an application calling frequent disk I/O (a dozen per second or more) may run slowly as it waits for the disk I/O to complete.

The output columns show: UID = user ID, PID = process ID (unique identifier for the process), D = direction (R = read, W = write), BLOCK = location on disk, SIZE = I/O size in bytes, COMM = process name, PATHNAME = trailing portion of file pathname.

In that output I caught Google Chrome reading from a cache file (“data_2″), and writing to cookie files (“Cookies-journal” and “Cookies”). TweetDeck also wrote to database files (“td_26_brendangregg.db” and “td_26_brendangregg.db-journal”). The “??” is where the path information ends for iosnoop (if you are a darwin programmer and want to take a swing at improving that, see the fi_pathname translator in /usr/lib/dtrace/io.d).

The “-h” option lists options (don’t need “sudo” for this):

Brendan-2:~ brendan$ iosnoop -h
USAGE: iosnoop [-a|-A|-DeghiNostv] [-d device] [-f filename]
               [-m mount_point] [-n name] [-p PID]
       iosnoop          # default output
                -a      # print all data (mostly)
                -A      # dump all data, space delimited
                -D      # print time delta, us (elapsed)
                -e      # print device name
                -g      # print command arguments
                -i      # print device instance
                -N      # print major and minor numbers
                -o      # print disk delta time, us
                -s      # print start time, us
                -t      # print completion time, us
                -v      # print completion time, string
                -d device       # instance name to snoop
                -f filename     # snoop this file only
                -m mount_point  # this FS only
                -n name         # this process name only
                -p PID          # this PID only
   eg,
        iosnoop -v      # human readable timestamps
        iosnoop -N      # print major and minor numbers
        iosnoop -m /    # snoop events on filesystem / only

For tricky performance issues I often use “-stoD” to get start and end timestamps for each I/O in microseconds, and a couple of different types of I/O time calculations.

In hindsight, I should have called it diskiosnoop, since “io” could refer to different locations in the kernel I/O stack.

2. hfsslower.d

This script answers an iosnoop FAQ: why dosen’t iosnoop see my application disk I/O?

The reason is that applications rarely request disk I/O directly, rather, they access a file system that does disk I/O on their behalf. To increase performance, the file system will usually try to cache as much file data as possible in main memory (DRAM). The application may (by some notion) think that it’s doing disk I/O, but it’s actually reading from very fast DRAM, thanks to the file system. Writes can also buffer in DRAM and write to disk later, which also speeds up application performance.

The hfsslower.d script measures I/O before it is processed by the HFS+ file system (Apple’s current default file system). iosnoop measures I/O after the file system, and only if it reaches disk:

 

 

A couple of points:

  • hfsslower.d will see a lot more I/O than iosnoop, as it includes file system cache hits.
  • hfsslower.d better reflects application performance, as it measures the same latency that the application directly suffered.

The hfsslower.d script is from the DTrace book, and can be found here. To run it, you’ll need to create a text file containing the script (or pull it from the DTrace book tarball), and make the file executable from Terminal by running “chmod 755 hfsslower.d”.

This script takes an argument which is the minimum number of milliseconds to show I/O for. Here’s tracing 1ms HFS+ I/O and slower:

Brendan-2:~ brendan$ sudo ./dtbook_scripts/Chap5/hfsslower.d 1
TIME                 PROCESS          D   KB     ms FILE
2011 Sep 27 19:00:15 Google Chrome    R    0     15 data_5
2011 Sep 27 19:00:15 Google Chrome    R    0     10 data_0
2011 Sep 27 19:00:15 Google Chrome    R    0     10 data_0
2011 Sep 27 19:00:15 Google Chrome    R    0     10 data_5
2011 Sep 27 19:00:21 TweetDeck        W    0      1 td_26_brendangregg.db-journal
2011 Sep 27 19:00:25 Adium            R    0     18 Smile.png
2011 Sep 27 19:01:08 firefox-bin      W    0     17 _CACHE_001_
2011 Sep 27 19:01:36 firefox-bin      W    0      9 _CACHE_001_
2011 Sep 27 19:01:36 firefox-bin      W    2     14 _CACHE_002_
2011 Sep 27 19:01:36 firefox-bin      W    0      7 _CACHE_001_
2011 Sep 27 19:01:37 firefox-bin      W    0      1 _CACHE_001_
2011 Sep 27 19:01:40 firefox-bin      W    0      6 _CACHE_001_
2011 Sep 27 19:01:46 firefox-bin      W    0     14 _CACHE_001_
2011 Sep 27 19:01:49 firefox-bin      R   15      1 _CACHE_003_
2011 Sep 27 19:01:49 firefox-bin      W   12      7 _CACHE_003_
2011 Sep 27 19:01:49 firefox-bin      W    0      4 _CACHE_001_
2011 Sep 27 19:01:51 firefox-bin      W    0      9 _CACHE_001_
2011 Sep 27 19:01:52 firefox-bin      R   21     27 Times.dfont/..namedfork/rsrc
2011 Sep 27 19:01:52 fontd            R    0     16 annex_aux
2011 Sep 27 19:01:52 firefox-bin      W    0      5 _CACHE_001_
[...]

Columns are: TIME = time of I/O completion, PROCESS = application name, D = direction (R = read, W = write), KB = I/O size in Kbytes, ms = I/O latency in milliseconds, FILE = filename.

If you use the argument “0″, it will trace everything. If I’m chasing down slow I/O, I’ll often use an argument of “10″ for I/O slower than 10 milliseconds.

At this point you may think: if you just care about slow I/O, then just use iosnoop. That works to a point, but there can be slow I/O caused by something other than disks (file system lock contention, for example). The other advantage of the hfsslower.d script is that the measured latency matches the application pain suffered, whereas at the disk level you can only assume a correlation.

3. execsnoop

This traces the execution of new processes. This is great at identifying short-lived processes that may be caused by misbehaving applications and can slow down your system. These short-lived processes are usually too quick to be picked up by standard monitoring tools like the Activity Monitor or top(1).

To demonstrate this tool, here’s what happens when you type “man ls”:

Brendan-2:~ brendan$ sudo execsnoop -v
STRTIME                UID    PID   PPID ARGS
2011 Sep 28 20:19:18     0  67234  66312 man
2011 Sep 28 20:19:18     0  67234  66312 man
2011 Sep 28 20:19:18     0  67235  67234 sh
2011 Sep 28 20:19:18     0  67235  67234 gzip
2011 Sep 28 20:19:19     0  67236  67234 sh
2011 Sep 28 20:19:18     0  67234  66312 man
2011 Sep 28 20:19:19     0  67236  67234 gzip
2011 Sep 28 20:19:19     0  67234  66312 man
2011 Sep 28 20:19:19     0  67244  67243 less
2011 Sep 28 20:19:19     0  67242  67239 sh
2011 Sep 28 20:19:19     0  67242  67239 gzip
2011 Sep 28 20:19:19     0  67240  67238 tbl
2011 Sep 28 20:19:19     0  67241  67238 groff
2011 Sep 28 20:19:19     0  67245  67241 troff
2011 Sep 28 20:19:19     0  67246  67241 grotty

This prints a line for each new process that is executed, in a rolling output similar to the previous tools. The fields are: STRTIME = (string) timestamp, UID = user ID, PID = process ID, PPID = parent process ID, ARGS = process name (should be process + arguments, but that doesn’t yet work on Mac OS X; if you want to debug, see pr_psargs in /usr/lib/dtrace/proc.d).

Here’s what happens when I turn AirPort (wifi) off, then on:

Brendan-2:~ brendan$ sudo execsnoop -v
STRTIME                UID    PID   PPID ARGS
2011 Sep 28 20:28:00     0  67204      1 airportd
2011 Sep 28 20:28:01   503  67205    140 fontworker
2011 Sep 28 20:28:01     0  67206      1 mDNSResponderHel
2011 Sep 28 20:28:06     0     49      1 autofsd
2011 Sep 28 20:28:10     0  67208     37 ManagedClient
2011 Sep 28 20:28:31   503  67209    140 fontworker
2011 Sep 28 20:28:43     0  67210      1 airportd
2011 Sep 28 20:28:49     0  67211      1 mDNSResponderHel
2011 Sep 28 20:28:49   503  67212    140 fontworker
2011 Sep 28 20:28:51     0  67214      1 kerberosautoconf
2011 Sep 28 20:28:51     0  67215      1 kerberosautoconf
2011 Sep 28 20:28:56     0     49      1 autofsd
2011 Sep 28 20:29:01     0  67219     37 ManagedClient
2011 Sep 28 20:29:04     0  67220      1 ocspd
2011 Sep 28 20:29:16   503  67221  54079 Google Chrome He
2011 Sep 28 20:29:19   503  67222    140 fontworker

Near the end I opened up a new tab in Mozilla Firefox and another in Google Chrome. You can see a fundamental difference: Chrome creates a new process for that tab, Firefox doesn’t.

execsnoop has various options, use -h to list them all. I used -v above, to print the time.

4. opensnoop

This traces file opens and prints various details, including the time and error code when using “-ve”. I usually use it to look for failed opens, which can be a sign of misconfigured applications. Discovering their config files and resource files can also be useful.

Brendan-2:~ brendan$ sudo opensnoop -ve
STRTIME                UID    PID COMM          FD ERR PATH
2011 Sep 30 01:44:15     0     11 DirectoryServic  -1   2 /var/db/dslocal/nodes/Default/hosts
2011 Sep 30 01:44:15     0     11 DirectoryServic  20   0 /etc/hosts
2011 Sep 30 01:44:15     0  85198 dtrace         5   0 /etc/localtime
2011 Sep 30 01:44:15   503    146 SystemUIServer  19   0 /Users/brendan/Library/Preferences/com.apple.menuextra.clock.plist.47C2X9o
2011 Sep 30 01:44:15   503    146 SystemUIServer  19   0 /System/Library/CoreServices/Menu Extras/TimeMachine.menu/Contents/Resources/TMRotatingArrow.pdf
2011 Sep 30 01:44:15   503    146 SystemUIServer  19   0 /System/Library/CoreServices/Menu Extras/Clock.menu/Contents/Resources/Clock.pdf
2011 Sep 30 01:44:15   503    146 SystemUIServer  19   0 /System/Library/CoreServices/Menu Extras/TimeMachine.menu/Contents/Resources/TMRotatingArrow.pdf
2011 Sep 30 01:44:17   503  67261 TweetDeck     17   0 /Users/brendan/Library/Cookies/Cookies.plist_tmp_67261_0.dat
2011 Sep 30 01:44:17   503  67261 TweetDeck     17   0 /Users/brendan/Library/Cookies/Cookies.plist
2011 Sep 30 01:44:18   503  54079 Google Chrome  72   0 /var/folders/bU/bU0WYlnwE6KKnMBrvI5QUU+++TQ/-Tmp-/.com.google.Chrome.sQdN3D
2011 Sep 30 01:44:19     0     36 mds            9   0 .
^C

While tracing, I changed the clock in the top bar from digital:

to analogue:

The files opened to performed this can be seen by the “SystemUIServer” process, beginning with a Preferences file and then some PDFs. Huh? PDFs? What are those PDFs?

That’s bizarre. I didn’t know that those tiny icons were implemented as PDFs!

5. dtruss

The previous two tools, opensnoop and execsnoop, operate by tracing specific system calls. A system call (or “syscall”) is what an application performs to request the operating system kernel to perform privileged work, including process creation, file operations, and other I/O (eg, disk or network I/O). Syscalls are a great target for analysis with DTrace, since examining them often provides a pretty good picture of what an application is trying to do. They can also provide useful details including byte counts, file and process names, error codes, and latency.

The dtruss tool traces all types of system calls, which is very useful for general debugging, especially since Mac OS X doesn’t come with a standard syscall tracer (like Linux’s “strace” or Solaris’s “truss”). One advantage of dtruss over those other tools is that dtruss can trace multiple processes at the same time, matching on the process name “-n”. For example, tracing Firefox via its process name “firefox-bin”:

Brendan-2:~ brendan$ sudo dtruss -n firefox-bin
	PID/THRD  SYSCALL(args) 		 = return
67254/0x720286:  write_nocancel(0x7, "8\0", 0x1)		 = 1 0
67254/0x720286:  lseek(0x24, 0x75500, 0x0)		 = 480512 0
67254/0x720286:  read_nocancel(0x24, "\0", 0x200)		 = 512 0
67254/0x720286:  write_nocancel(0x7, "8\0", 0x1)		 = 1 0
67254/0x720286:  lseek(0x24, 0x73B00, 0x0)		 = 473856 0
67254/0x720286:  read_nocancel(0x24, "\0", 0x200)		 = 512 0
67254/0x746a81:  select_nocancel(0x3D, 0xB04909B8, 0xB0490938)		 = 1 0
67254/0x746a81:  read_nocancel(0x6, "8\0", 0x400)		 = 1 0
67254/0x746a81:  recvfrom_nocancel(0x36, 0xB0490C3F, 0x1)		 = -1 Err#35
67254/0x746a81:  select_nocancel(0x3D, 0xB0490998, 0xB0490918)		 = 1 0
67254/0x746a81:  sendto_nocancel(0x36, 0x20A51008, 0x292)		 = 658 0
67254/0x720286:  write_nocancel(0x7, "8\0", 0x1)		 = 1 0
67254/0x720286:  lseek(0x24, 0x74000, 0x0)		 = 475136 0
[...]

The output will be many pages, as applications commonly make frequent system calls.

dtruss can also launch and trace a program. Here the humble “ls -l” command was traced:

Brendan-2:~ brendan$ sudo dtruss ls -l hfsslower.d
-rwxr-xr-x  1 brendan  staff  1152 Jan 28  2011 hfsslower.d
SYSCALL(args) 		 = return
ioctl(0x3, 0x80086804, 0x7FFF5FBFD710)		 = 0 0
close(0x3)		 = 0 0
stat64("/usr/lib/libstdc++.6.dylib\0", 0x7FFF5FBFCB20, 0x7FFF5FBFD160)		 = 0 0
mmap(0x0, 0xD000, 0x3, 0x1002, 0x1000000, 0x7FFF00000001)		 = 0x20000 0
[...100 lines truncated...]
open_nocancel("/etc/sysinfo.conf\0", 0x0, 0x1B6)		 = -1 Err#2
lstat64_extended(0x100100768, 0x7FFF5FBFE3F0, 0x100101580)		 = 0 0
listxattr(0x100100768, 0x0, 0x0)		 = 0 0
fstat64(0x1, 0x7FFF5FBFD960, 0x7FFF5FBFDA2C)		 = 0 0
ioctl(0x1, 0x4004667A, 0x7FFF5FBFD9AC)		 = 0 0
access("/etc/localtime\0", 0x4, 0x0)		 = 0 0
open_nocancel("/etc/localtime\0", 0x0, 0x0)		 = 4 0
fstat64(0x4, 0x7FFF5FBFB8B0, 0x0)		 = 0 0
read_nocancel(0x4, "TZif\0", 0x2A64)		 = 1017 0
close_nocancel(0x4)		 = 0 0
write_nocancel(0x1, "-rwxr-xr-x  1 brendan  staff  1152 Jan 28  2011 hfsslower.d\n\0", 0x3C)		 = 60 0
fchdir(0x3, 0x100100000, 0xFC080)		 = 0 0
fchdir(0x3, 0x0, 0x100800000)		 = 0 0
close_nocancel(0x3)		 = 0 0

100 lines were trimmed to keep that example short.

dtruss supports various options, including the printing of system call timing for use when analyzing performance.

Brendan-2:~ brendan$ dtruss -h
USAGE: dtruss [-acdefholLs] [-t syscall] { -p PID | -n name | command }

          -p PID          # examine this PID
          -n name         # examine this process name
          -t syscall      # examine this syscall only
          -a              # print all details
          -c              # print syscall counts
          -d              # print relative times (us)
          -e              # print elapsed times (us)
          -f              # follow children
          -l              # force printing pid/lwpid
          -o              # print on cpu times
          -s              # print stack backtraces
          -L              # don't print pid/lwpid
          -b bufsize      # dynamic variable buf size
   eg,
       dtruss df -h       # run and examine "df -h"
       dtruss -p 1871     # examine PID 1871
       dtruss -n tar      # examine all processes called "tar"
       dtruss -f test.sh  # run test.sh and follow children

I’ve used the “-e” option many times to figure out what’s slowing down an application. It shows the elapsed time for the system call in microseconds:

Brendan-2:~ brendan$ dtruss -e ls -l hfsslower.d 
-rwxr-xr-x  1 brendan  staff  1152 Jan 28  2011 hfsslower.d
 ELAPSD SYSCALL(args) 		 = return
    245 mmap(0x10000B000, 0x2000, 0x5, 0x12, 0x3, 0x7FFF00000001)		 = 0xB000 0
     11 mmap(0x10000D000, 0x1000, 0x3, 0x12, 0x3, 0x7FFF00000001)		 = 0xD000 0
     10 mmap(0x10000E000, 0x1F10, 0x1, 0x12, 0x3, 0x7FFF00000001)		 = 0xE000 0
     31 open("/dev/dtracehelper\0", 0x2, 0x7FFF5FC45370)		 = 3 0
     40 __sysctl(0x7FFF5FBFD5B0, 0x2, 0x7FFF5FBFD5DC)		 = 0 0

I then look down the “ELAPSD” column for the largest times.

A dtruss FAQ is: why is the output in the wrong order?

As with other DTrace “tracing” style tools (that print lines of output as they occur) the output can be shuffled slightly due to CPU buffering on multi-CPU systems. The “-d” option for relative times (since program start) is useful for cases where the output order is important. You can then post-sort by that column so that the dtruss output is in the correct order. (For example, by redirecting the output to a file, then using the sort(1) command on that column.)

Ryan Dahl (creator of node.js) had been using dtruss so frequently recently that he made some enhancements and posted a newer version on github here.

6. soconnect_mac.d

This script is from the DTrace book, and traces outbound TCP connections along with details:

Brendan-2:~ brendan$ sudo ./dtbook/Chap6/soconnect_mac.d
PID    PROCESS          FAM ADDRESS          PORT   LAT(us) RESULT
45343  firefox-bin      2   70.32.100.166    443        735 In progress
65002  TweetDeck        2   199.59.148.201   80          94 In progress
65002  TweetDeck        2   208.93.137.53    80          89 In progress
54079  Google Chrome    2   97.74.215.143    80          76 In progress
45343  firefox-bin      2   97.74.215.143    80         752 In progress
45343  firefox-bin      2   97.74.215.143    80          88 In progress
45343  firefox-bin      2   97.74.215.143    80          85 In progress
45343  firefox-bin      2   97.74.215.143    80          39 In progress
45343  firefox-bin      2   97.74.215.143    80          22 In progress
45343  firefox-bin      2   97.74.215.143    80          20 In progress
27     ntpd             2   17.151.16.20     123         24 Success
[...]

It’s a quick way to find out what applications are connecting to whom on the Internet. The script is here.

Columns are: PID = process ID, PROCESS = process name, FAM = protocol family (2 = IPv4), ADDRESS = IP address, PORT = TCP destination port, LAT(us) latency of the connect() system call in microseconds, RESULT = return of the connect().

Many connections these days are “non-blocking”, so the latency appears low and has the result “In progress”. The actual connection completes sometime later (the timing can be traced using some more DTrace, when desired).

There is a companion tool, soaccept_mac.d, which shows inbound TCP connections. If you ran both regularly, you may catch something performing networking that shouldn’t be (e.g., spyware, virus, …).

7. errinfo

This tool provides a summary of which system calls were failing, showing the process name, error code, and short description of the error:

Brendan-2:~ brendan$ sudo errinfo -c
Tracing... Hit Ctrl-C to end.
^C
            EXEC          SYSCALL  ERR  COUNT  DESC
       TweetDeck          lstat64    2      1  No such file or directory
       TweetDeck           select    9      1  Bad file descriptor
       TweetDeck           stat64    2      1  No such file or directory
           cupsd           unlink    2      1  No such file or directory
     firefox-bin connect_nocancel   36      1  Operation now in progress
         launchd            mkdir   17      1  File exists
         launchd    open_nocancel    2      1  No such file or directory
[...]
          Google           Chrome    0     35
          Google           Chrome    0     35
        mdworker      getattrlist    2     24  No such file or directory
        mdworker            mkdir   13     24  Permission denied
       TweetDeck           access    2     33  No such file or directory
       TweetDeck             read   35     35  Resource temporarily unavailable
   mDNSResponder          recvmsg   35     38  Resource temporarily unavailable
        mdworker           stat64    2     47  No such file or directory
       gpg-agent    read_nocancel   35     60  Resource temporarily unavailable
        thnuclnt            ioctl   25     60  Inappropriate ioctl for device
            ntpd __pthread_canceled   22     69  Invalid argument
            ntpd        sigreturn   -2     69
            ntpd       sigsuspend    4     69  Interrupted system call
          Google           Chrome    0      0
 activitymonitor        proc_info    3   1495  No such process

This can be another quick way to track down failing or misconfigured applications. Note that (usually) most system call errors are not a problem: it can be normal for system calls to fail, and the calling application handles that condition correctly. errinfo is particularly useful when the application didn’t handle the failure correctly, and the system calls are silently failing.

8. bitesize.d

This is a simple DTrace script that characterizes the disk I/O workload, showing a distribution of the size of the I/O in bytes along with the application name:

Brendan-2:~ brendan$ sudo bitesize.d
Tracing... Hit Ctrl-C to end.
^C

     PID  CMD
   31502  vim\0

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         4
            8192 |                                         0
           16384 |@@@@@@@@                                 1
           32768 |                                         0        

   54079  Google Chrome\0

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        18
            8192 |@@@@@@@                                  4
           16384 |                                         0        

   86143  bsdtar\0

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@                                      19
            8192 |@@@@                                     23
           16384 |@@                                       12
           32768 |@@@@@@@@@@@@@@@@@@@@@@@                  151
           65536 |@@@@                                     25
          131072 |@@                                       10
          262144 |@@                                       10
          524288 |@                                        8
         1048576 |                                         3
         2097152 |                                         0

The “value” column shows the minimum size of the I/O in bytes, and the “count” column shows the number of I/O in that range: between the minimum size and the next size shown.

To follow an example: the “bsdtar” command shown above performed 151 disk I/Os that were between 32768 bytes and 65535 bytes in size (32K to 64K), which was where most of its I/O fell. The text plot in the middle provides a visualization for the distribution of the events, which shows a spike that corresponds to this 32K to 64K range. Google Chrome did a couple of dozen I/O between 4 and 16 Kbytes.

Disk I/O is an expensive operation, so it’s usually preferable for it to be larger in size to improve the throughput to the device.

9. iotop

This presents the same data as iosnoop, but in a summarized way similar to top(1). It’s handy when disk I/O is so frequent that iosnoop is too verbose, and you want a high level summary of which process is rattling the disks. Here I’ve used “-CP” to not clear the screen and provide a rolling output (-C), and to show disk busy percentages (-P):

Brendan-2:~ brendan$ sudo iotop -CP 1
Tracing... Please wait.
2011 Sep 27 19:33:11,  load: 0.74,  disk_r:      0 KB,  disk_w:  57344 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D   %I/O
    0  66661  66312 cp               ??       14   2 W     98

2011 Sep 27 19:33:12,  load: 0.74,  disk_r:      4 KB,  disk_w:  55752 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D   %I/O
    0  66661  66312 cp               ??       14   2 W     93

2011 Sep 27 19:33:13,  load: 0.68,  disk_r:     60 KB,  disk_w:  54464 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D   %I/O
    0  66661  66312 cp               ??       14   2 R     11
    0  66661  66312 cp               ??       14   2 W     88

2011 Sep 27 19:33:14,  load: 0.68,  disk_r:   2560 KB,  disk_w:  32928 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D   %I/O
  503  65002    140 TweetDeck        ??       14   2 W      0
    0      1      0 launchd          ??       14   2 W      1
    0  66661  66312 cp               ??       14   2 R     11
    0  66661  66312 cp               ??       14   2 W     71

The “??” is for where some details aren’t yet available to DTrace, for the same reason as was described for the iosnoop script (if you are a programmer, you can see what needs to be fixed in the /usr/lib/dtrace/io.d file).

If you get “dynvardrops”, sorry: back when I wrote this (6 years ago), I never saw a disk workload heavy enough to cause these. A remedy can be to increase the dynvarsize tunable by adding the following line:

#pragma D option dynvarsize=16m

near this one:

#pragma D option quiet

in /usr/bin/iotop. You may want to make a backup of iotop first (iotop.orig) or copy it somewhere else to run it.

10. maclife.d

This script is from the DTrace book, and traces the creation and deletion of files:

Brendan-2:~ brendan$ sudo ./dtbook/Chap5/maclife.d
TIME(ms)        UID    PID PROCESS      CALL         DIR/FILE
3754594958      503  54079 Google Chrom VNOP_CREATE  Chrome/.com.google.Chrome.K7I9jy
3754597703      503  54079 Google Chrom VNOP_CREATE  -Tmp-/.com.google.Chrome.8MIKKL
3754597703      503  54079 Google Chrom VNOP_REMOVE  -Tmp-/.com.google.Chrome.8MIKKL
3754597703      503  54079 Google Chrom VNOP_CREATE  -Tmp-/.com.google.Chrome.TLlOje
3754597703      503  54079 Google Chrom VNOP_REMOVE  -Tmp-/.com.google.Chrome.TLlOje
3754598365      503  54079 Google Chrom VNOP_CREATE  -Tmp-/.com.google.Chrome.yIwTdE
3754598365      503  54079 Google Chrom VNOP_REMOVE  -Tmp-/.com.google.Chrome.yIwTdE
3754603801      503  65002 TweetDeck    VNOP_CREATE  Cookies/Cookies.plist_tmp_65002_0.dat
3754605028      503  65002 TweetDeck    VNOP_REMOVE  Local Store/td_26_brendangregg.db-journal
3754605026      503  65002 TweetDeck    VNOP_CREATE  Local Store/td_26_brendangregg.db-journal
3754607674      503  65002 TweetDeck    VNOP_CREATE  Local Store/td_26_brendangregg.db-journal
3754607676      503  65002 TweetDeck    VNOP_REMOVE  Local Store/td_26_brendangregg.db-journal
3754609536      503  34852 Adium        VNOP_CREATE  Default/.dat8824.9fa
3754711583      503  17726 thnuclnt     VNOP_CREATE  thnuclnt-17717/thnuclnt.conf-fta17726
3754711587      503  17726 thnuclnt     VNOP_REMOVE  -Tmp-/4e8220118e918
3754711617      503  17726 thnuclnt     VNOP_REMOVE  -Tmp-/4e8220118ead8
3754711620      503  17726 thnuclnt     VNOP_REMOVE  -Tmp-/4e8220119621c
3754711623      503  17726 thnuclnt     VNOP_REMOVE  -Tmp-/4e82201196d7e
3754711624      503  17726 thnuclnt     VNOP_REMOVE  thnuclnt-17717/thnuclnt.conf-fta17726
3754612740      503  31502 vim          VNOP_CREATE  macosx/4913
3754612740      503  31502 vim          VNOP_REMOVE  macosx/4913
3754612741      503  31502 vim          VNOP_CREATE  macosx/top10dtrace.html
3754612741      503  31502 vim          VNOP_REMOVE  macosx/top10dtrace.html~
[...]

Interesting! While tracing I saved the file I was editing in vim, which is seen in the last four lines. This tells me that vim is creating and removing temporary files as part of the save process.

Also note that it looks like TweetDeck created the file twice before removing it (two VNOP_CREATEs followed by a VNOP_REMOVE). This isn’t the correct order, which can be seen by examining the TIME(ms) column. A side-effect of DTrace’s negligible performance impact design is that output can be slightly shuffled due to the way it collects data from per-CPU buffers. I often include a TIME column like that one, not just for the usefulness of knowing time, but also as a means to post sort the output.

Other Scripts

There are other tools ready-to-go in Mac OS X. If you run the “man -k dtrace” command in a Terminal (or “apropos dtrace”), you’ll see the following output:

Brendan-2:~ brendan$ man -k dtrace
[...]
bitesize.d(1m)           - analyse disk I/O size by process. Uses DTrace
cpuwalk.d(1m)            - Measure which CPUs a process runs on. Uses DTrace
creatbyproc.d(1m)        - snoop creat()s by process name. Uses DTrace
dappprof(1m)             - profile user and lib function usage. Uses DTrace
dapptrace(1m)            - trace user and library function usage. Uses DTrace
diskhits(1m)             - disk access by file offset. Uses DTrace
dispqlen.d(1m)           - dispatcher queue length by CPU. Uses DTrace
dtrace(1)                - generic front-end to the DTrace facility
dtruss(1m)               - process syscall details. Uses DTrace
errinfo(1m)              - print errno for syscall fails. Uses DTrace
execsnoop(1m)            - snoop new process execution. Uses DTrace
fddist(1m)               - file descriptor usage distributions. Uses DTrace
filebyproc.d(1m)         - snoop opens by process name. Uses DTrace
hotspot.d(1m)            - print disk event by location. Uses DTrace
httpdstat.d(1m)          - realtime httpd statistics. Uses DTrace
iofile.d(1m)             - I/O wait time by file and process. Uses DTrace
iofileb.d(1m)            - I/O bytes by file and process. Uses DTrace
iopattern(1m)            - print disk I/O pattern. Uses DTrace
iopending(1m)            - plot number of pending disk events. Uses DTrace
iosnoop(1m)              - snoop I/O events as they occur. Uses DTrace
iotop(1m)                - display top disk I/O events by process. Uses DTrace
kill.d(1m)               - snoop process signals as they occur. Uses DTrace
lastwords(1m)            - print syscalls before exit. Uses DTrace
loads.d(1m)              - print load averages. Uses DTrace
newproc.d(1m)            - snoop new processes. Uses DTrace
opensnoop(1m)            - snoop file opens as they occur. Uses DTrace
pathopens.d(1m)          - full pathnames opened ok count. Uses DTrace
pidpersec.d(1m)          - print new PIDs per sec. Uses DTrace
plockstat(1)             - front-end to DTrace to print statistics about POSIX mutexes and read/write locks
priclass.d(1m)           - priority distribution by scheduling class. Uses DTrace
pridist.d(1m)            - process priority distribution. Uses DTrace
procsystime(1m)          - analyse system call times. Uses DTrace
runocc.d(1m)             - run queue occupancy by CPU. Uses DTrace
rwbypid.d(1m)            - read/write calls by PID. Uses DTrace
rwbytype.d(1m)           - read/write bytes by vnode type. Uses DTrace
rwsnoop(1m)              - snoop read/write events. Uses DTrace
sampleproc(1m)           - sample processes on the CPUs. Uses DTrace
seeksize.d(1m)           - print disk event seek report. Uses DTrace
setuids.d(1m)            - snoop setuid calls as they occur. Uses DTrace
sigdist.d(1m)            - signal distribution by process. Uses DTrace
syscallbypid.d(1m)       - syscalls by process ID. Uses DTrace
syscallbyproc.d(1m)      - syscalls by process name. Uses DTrace
syscallbysysc.d(1m)      - syscalls by syscall. Uses DTrace
topsyscall(1m)           - top syscalls by syscall name. Uses DTrace
topsysproc(1m)           - top syscalls by process name. Uses DTrace
weblatency.d(1m)         - website latency statistics. Uses DTrace

Plus more in the DTrace book.

And More…

DTrace isn’t just about running scripts; you can write your own custom scripts, run one-liners, and use higher-level tools that use DTrace behind the scenes, like Apple’s Instruments and Joyent’s Cloud Analytics. And it’s for more than just your MacBook: if you are using servers that have DTrace available, you can use it to diagnose their performance and issues too, including tracing kernel and application code. I regularly use it to see how a MySQL database interacts with a kernel file system, for example.

For more reading about DTrace, you can see my posts tagged dtrace, other blogs on dtrace.org, the original 410 page DTrace guide, 1100 page DTrace book, and my original DTrace page (which includes the DTraceToolkit). It’s a little old now (and I think prustat needs updating), but there’s also Matty’s Top Ten DTrace Scripts, which includes some of those above; and Greg Miller’s Exploring Leopard with DTrace, which includes Objective-C tracing.

For another post on Mac OS X performance, see my USE Method-based Mac OS X performance checklist.

posted @ 2014-05-14 11:11  Proteas  阅读(885)  评论(0编辑  收藏  举报