ProbeVue in Action: Monitoring the Queue Depth of Disks

Disk and storage systems support Tagged Command Queuing, i.e. connected servers can send multiple I/O requests to the disk or storage system without waiting for earlier requests to finish. On AIX, the number of I/O requests that can be sent to a disk before the server has to wait for earlier requests to complete is configured with the hdisk attribute queue_depth. For many hdisk types, the default queue_depth is 20. Most storage systems allow larger values.
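Before measuring anything, it helps to know the current setting. The following is a minimal sketch using the standard AIX commands lsattr and chdev; the helper names get_queue_depth and set_queue_depth are made up for this example, and hdisk0 and the value 32 are placeholders.

```shell
# Hypothetical helper names; lsattr and chdev are the standard AIX commands.

# Print the configured queue_depth of a disk, e.g. "20".
get_queue_depth() {
    lsattr -El "$1" -a queue_depth -F value
}

# Change queue_depth in the ODM only (-P); the disk picks the value up
# the next time it is configured, e.g. after a reboot.
set_queue_depth() {
    chdev -l "$1" -a queue_depth="$2" -P
}

# Usage on an AIX system (placeholders):
#   get_queue_depth hdisk0
#   set_queue_depth hdisk0 32
```

Whether a larger value is accepted depends on the disk type and the storage system, so the vendor's recommendation should be checked first.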

With the help of ProbeVue, the utilization of the disk queue can be monitored very easily.

Starting with AIX 7.1 TL4 and AIX 7.2 TL0, AIX supports the I/O Probe Manager, which makes it easy to trace events in the AIX I/O stack. When the disk driver starts an I/O, it does so via the iostart function in the kernel; the request is forwarded to the adapter driver and then passed to the storage system via the host bus adapter. The response is handled by the iodone function in the kernel. The I/O Probe Manager supports (among others) probe events at these locations:

@@io:disk:iostart:read:<filter>
@@io:disk:iostart:write:<filter>
@@io:disk:iodone:read:<filter>
@@io:disk:iodone:write:<filter>

The filter can be, for example, an hdisk name such as hdisk2; the probe points then fire only for I/Os on hdisk2. This makes it possible to perform an action whenever an I/O for an hdisk begins or ends, for example to measure how long an I/O operation takes or simply to count how many I/Os are executed. In our example, we are interested in the utilization of the disk queue, i.e. the number of I/Os that have been sent to the disk but are not yet completed. For the iostart and iodone probe events, the I/O Probe Manager provides a built-in variable __diskinfo with the following fields (https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.genprogc/probevue_man_io.htm):

name          char*     Name of the disk.
…
queue_depth   int       The queue depth of the disk (value from ODM)
cmds_out      int       Number of outstanding I/Os
…

The cmds_out field indicates how many I/Os have been sent to the disk but have not yet completed (the response has not yet arrived at the server).

The following section of code determines the minimum, maximum, and average number of entries in the disk queue:

@@io:disk:iostart:*:hdisk0     // Only I/Os for hdisk0 are considered
{
   queue = __diskinfo->cmds_out; // Store number of outstanding I/Os in variable queue
   ++numIO;                    // Number of I/Os (used for calculating the average)
   avg += queue;               // Add number of outstanding I/Os to variable avg
   if ( queue < min )
      min = queue;             // Check if minimum
   if ( queue > max )
      max = queue;             // Check if maximum
}

The calculated values are then printed once per second using the interval probe manager:

@@interval:*:clock:1000
{
   if ( numIO == 0 )
      numIO = 1;    // Prevent division by 0 when calculating the average
   if ( min > max )
      min = max;
   printf( "%5d  %5d  %5d\n" , min , avg/numIO , max );
   min = 100000;   // Reset variables for the next interval
   avg = 0;
   max = 0;
   numIO = 0;
}
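To make the bookkeeping of the two probes easier to follow, here is a minimal shell sketch that applies the same steps to a list of cmds_out samples for one interval. The function name queue_stats and the sample values are made up for this illustration. Shell arithmetic is integer-only, and the whole-number averages in the sample runs suggest the Vue script divides the same way, but that is an assumption.

```shell
# Hypothetical helper: reproduce the per-interval bookkeeping of the two
# probes for a list of cmds_out samples given as arguments.
queue_stats() {
    # Same start values as the reset at the end of the interval probe
    min=100000 avg=0 max=0 numIO=0

    # What the iostart probe does for every I/O
    for queue in "$@"; do
        numIO=$((numIO + 1))
        avg=$((avg + queue))
        if [ "$queue" -lt "$min" ]; then min=$queue; fi
        if [ "$queue" -gt "$max" ]; then max=$queue; fi
    done

    # What the interval probe does once per second
    if [ "$numIO" -eq 0 ]; then numIO=1; fi      # avoid division by 0
    if [ "$min" -gt "$max" ]; then min=$max; fi  # no I/O seen: report 0, not 100000
    printf "%5d  %5d  %5d\n" "$min" "$((avg / numIO))" "$max"
}

queue_stats 1 1 2 1 9    # made-up samples for one interval
```

Because the average is truncated, an interval with many samples of 1 and a few short bursts still reports an average of 1, which matches the pattern in the first sample run.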

The full script is available for download on our website: ioqueue.e.

Here is a sample run of the script for the disk hdisk13:

# ./ioqueue.e hdisk13
  min    avg    max
    1      1      2
    1      1      9
    1      1      2
    1      1      8
    1      1      2
    1      1      2
    1      1      8
    1      1     10
    1      1      2
    1      1      1
    1      1     10
    1      1      2
    1      1     11
...

The script expects an hdisk as its argument and then prints, once per second, the values determined for that hdisk.

In the example output, the number of entries in the disk queue never exceeds 11. Increasing the queue_depth attribute would therefore bring no performance benefit here.

Here’s another example:

# ./ioqueue.e hdisk21
  min    avg    max
    9     15     20
   11     17     20
   15     19     20
   13     19     20
   14     19     20
   17     18     20
   18     18     19
   16     19     20
   13     18     20
   18     19     19
   17     19     20
   18     19     20
   17     19     19
...

In this case, the maximum value of 20 (hdisk21 has a queue_depth of 20) is reached regularly. Increasing the queue_depth can improve throughput here.
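For longer runs, reading the max column by eye becomes tedious. The following sketch counts how many intervals reached the configured queue depth. The function name saturated_intervals is hypothetical, and the sketch assumes the three-column output format shown above.

```shell
# Hypothetical helper: read ioqueue.e output ("min avg max" per line) on
# stdin and report how many intervals reached the given queue depth.
saturated_intervals() {
    awk -v depth="$1" '
        $3 !~ /^[0-9]+$/ { next }      # skip the header, blank and "..." lines
        { total++; if ($3 + 0 >= depth + 0) full++ }
        END { printf "%d of %d intervals reached queue_depth %d\n", full, total, depth }
    '
}

# Usage (the file name is a placeholder for saved ioqueue.e output):
#   saturated_intervals 20 < ioqueue_hdisk21.out
```

If most intervals reach the limit, as in the hdisk21 run, the queue is a likely bottleneck; if only a few do, a larger queue_depth is unlikely to change much.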

The sample script can of course be extended in various ways, for example to also determine the throughput, the time I/Os spend in the wait queue, or even the position and size of each I/O on the disk. The example simply shows how easy it is to get information about I/Os using ProbeVue.

FC NPIV client throughput

When using NPIV, multiple client LPARs share a physical FC port of a virtual I/O server. For performance investigations, it is useful to be able to determine the throughput of each client LPAR easily and to compare the throughputs. Questions such as

  • how much throughput is achieved by a particular LPAR
  • which LPARs have the highest throughput and produce the most FC traffic
  • are there resource bottlenecks

can then be answered.

There are several ways to obtain this data. A particularly simple one is provided by the virtual I/O server with the padmin command ‘fcstat’. With the ‘-client’ option, the command shows NPIV client statistics:

(0)padmin@aixvio1:/home/padmin> fcstat -client
              hostname   dev                wwpn     inreqs    outreqs ctrlreqs          inbytes         outbytes  DMA_errs Elem_errs Comm_errs

               aixvio1  fcs0  0x100000XXXXXXXXXX 49467894179 50422150679 947794529 1861712755360927 1451335312750576         0         0         0
     C050760YYYYYYYYY
                                    0          0        0                0                0         0         0         0
     C050760ZZZZZZZZZ
                                    0          0        0                0                0         0         0         0
                 aix01  fcs0  0xC050760XXXXXXXXX   22685402  101956075 10065757     699512617896    1572578056704         0         0         0
                 aix02  fcs0  0xC050760XXXXXXXXX   28200473   82295158 12051365     387847746448     626772151808         0         0         0
                 aix03  fcs0  0xC050760XXXXXXXXX  376500672  255163053 21583628   22619424512608    3786990844928         0         0         0
                 aix04  fcs0  0xC050760XXXXXXXXX  116450405  504688524 14020031    4037786527400    9929289617408         0         0         0
          blbprodora22  fcs0  0xC050760XXXXXXXXX 1341092479  580673554 37458927   44288566807072   12166718497792         0         0         0
...
               aixvio1  fcs1  0x100000XXXXXXXXXX  391131484 1090556094 156294130   71031615240217   87642294572864         0         0         0
              aixtsm01  fcs2  0xC050760XXXXXXXXX  334020900  785597352 74659821   62072552942128   83284555980288         0         0         0
              aixtsm02  fcs0  0xC050760XXXXXXXXX    2943054   40921231 11617552     107317697968     289142333440         0         0         0

               aixvio1  fcs2  0x210000XXXXXXXXXX  403180246 5877180796   236998  105482699300998 1540608710446612         0         0         0
              aixtsm01  fcs6  0xC050760XXXXXXXXX  146492419  392365162    74250   38378099796342  102844775468007         0         0         0
              aixtsm02  fcs2  0xC050760XXXXXXXXX         19     192848       20             1090      50551063184         0         0         0

               aixvio1  fcs3  0x210000XXXXXXXXXX  405673338 7371951499   260575  105969796271246 1932388891128304         0         0         0
              aixtsm02  fcs3  0xC050760XXXXXXXXX          0          0        4                0                0         0         0         0
                 aix02  fcs7  0xC050760XXXXXXXXX      42624 2677470211    34211          2382280  701864613402184         0         0         0
...
Invalid initiator world wide name
Invalid initiator world wide name
(0)padmin@aixvio1:/home/padmin>

The lines with the WWPNs C050760YYYYYYYYY and C050760ZZZZZZZZZ belong to NPIV adapters of non-activated LPARs, which is why all their counters are zero. For each NPIV-enabled FC port of the virtual I/O server, the physical FC port is displayed together with its NPIV client LPARs. The block for fcs1 of aixvio1 serves as an example here. First, the physical port of the virtual I/O server is always shown, here aixvio1 and FC port fcs1. The following lines show the NPIV clients, each with the LPAR name and the associated virtual FC port of the LPAR, here aixtsm01 and aixtsm02. The virtual FC ports fcs2 (aixtsm01) and fcs0 (aixtsm02) are mapped to the physical FC port fcs1 of aixvio1. After a blank line comes the next physical FC port of the virtual I/O server.
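The grouping described above can be extracted with a few lines of awk. The following is a sketch only: the function name npiv_map is hypothetical, and it assumes that the physical ports can be recognized by the hostname of the virtual I/O server, which is passed as an argument.

```shell
# Hypothetical helper: list which client port is mapped to which physical
# port. Reads "fcstat -client" output on stdin; $1 is the VIOS hostname.
npiv_map() {
    awk -v vios="$1" '
        $3 !~ /^0x/ { next }               # header, blank and non-activated client lines
        $1 == vios  { phys = $2; next }    # physical port of the virtual I/O server
        phys != ""  { print $1, $2, "->", vios, phys }
    '
}

# Usage as root on the virtual I/O server (hostname as in the example):
#   /usr/ios/cli/ioscli fcstat -client 2>/dev/null | npiv_map aixvio1
```

For the fcs1 block of the example output, this prints one line each for aixtsm01 fcs2 and aixtsm02 fcs0, both pointing to aixvio1 fcs1.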

The WWPNs of the physical and virtual FC ports are listed in the wwpn column. In addition, the number of incoming and outgoing requests and the number of bytes transferred in each direction are listed. Errors are listed in the three remaining columns: DMA_errs is incremented if no DMA buffer is available for a request, Elem_errs if the queue of the FC adapter is full, and Comm_errs in the case of transmission errors. Regularly increasing DMA_errs or Elem_errs counters may indicate that some tuning attributes are set too low.

Because the output is long and consists of absolute counters, it is hard to read. With a small script, however, you can easily calculate delta values and scale the output to MB per second. We have done this with the following example script:

$ cat npivstat
#! /bin/ksh93
#
# Copyright (c) 2019 by PowerCampus 01 GmbH
# Author: Dr. Armin Schmidt
#

delta=5 # seconds

typeset -A dataInreqs
typeset -A dataOutreqs
typeset -A dataInbytes
typeset -A dataOutbytes
typeset -A dataDMA_errs
typeset -A dataElem_errs
typeset -A dataComm_errs

bc |& # start bc as coroutine
print -p "scale=2"

# get first sample

/usr/ios/cli/ioscli fcstat -client 2>/dev/null | \
while read hostname dev wwpn inreqs outreqs ctrlreqs inbytes outbytes DMA_errs Elem_errs Comm_errs rest
do
   case "$wwpn" in
   0x*)
      dataInreqs[${hostname}_${dev}]=$inreqs
      dataOutreqs[${hostname}_${dev}]=$outreqs
      dataInbytes[${hostname}_${dev}]=$inbytes
      dataOutbytes[${hostname}_${dev}]=$outbytes
      dataDMA_errs[${hostname}_${dev}]=$DMA_errs
      dataElem_errs[${hostname}_${dev}]=$Elem_errs
      dataComm_errs[${hostname}_${dev}]=$Comm_errs
      ;;
   esac
done
sleep $delta

while true
do
   /usr/ios/cli/ioscli fcstat -client 2>/dev/null | \
   while read hostname dev wwpn inreqs outreqs ctrlreqs inbytes outbytes DMA_errs Elem_errs Comm_errs rest
   do
      case "$wwpn" in
      0x*)
         prevInreqs=${dataInreqs[${hostname}_${dev}]}
         prevOutreqs=${dataOutreqs[${hostname}_${dev}]}
         prevInbytes=${dataInbytes[${hostname}_${dev}]}
         prevOutbytes=${dataOutbytes[${hostname}_${dev}]}
         prevDMA_errs=${dataDMA_errs[${hostname}_${dev}]}
         prevElem_errs=${dataElem_errs[${hostname}_${dev}]}
         prevComm_errs=${dataComm_errs[${hostname}_${dev}]}
         dataInreqs[${hostname}_${dev}]=$inreqs
         dataOutreqs[${hostname}_${dev}]=$outreqs
         dataInbytes[${hostname}_${dev}]=$inbytes
         dataOutbytes[${hostname}_${dev}]=$outbytes
         dataDMA_errs[${hostname}_${dev}]=$DMA_errs
         dataElem_errs[${hostname}_${dev}]=$Elem_errs
         dataComm_errs[${hostname}_${dev}]=$Comm_errs

         print -p "(${inreqs}-${prevInreqs})/$delta"
         read -p inreqs
         print -p "(${outreqs}-${prevOutreqs})/$delta"
         read -p outreqs
         print -p "(${inbytes}-${prevInbytes})/${delta}/1024/1024"
         read -p inbytes
         print -p "(${outbytes}-${prevOutbytes})/${delta}/1024/1024"
         read -p outbytes
         print -p "(${DMA_errs}-${prevDMA_errs})/$delta"
         read -p DMA_errs
         print -p "(${Elem_errs}-${prevElem_errs})/$delta"
         read -p Elem_errs
         print -p "(${Comm_errs}-${prevComm_errs})/$delta"
         read -p Comm_errs

         printf "%15s %5s %16s %6.2f %7.2f %7.2f %8.2f %8.2f %9.2f %9.2f\n" "$hostname" "$dev" "$wwpn" "$inreqs" "$outreqs" \
            "$inbytes" "$outbytes" "$DMA_errs" "$Elem_errs" "$Comm_errs"
         ;;
      "wwpn")
         printf "%15s %5s %16s %6s %7s %7s %8s %8s %9s %9s\n" "$hostname" "$dev" "$wwpn" "$inreqs" "$outreqs" \
            "$inbytes" "$outbytes" "$DMA_errs" "$Elem_errs" "$Comm_errs"
         ;;
      "")
         [ -n "$hostname" ] && continue
         printf "%15s %5s %16s %6s %7s %7s %8s %8s %9s %9s\n" "$hostname" "$dev" "$wwpn" "$inreqs" "$outreqs" \
            "$inbytes" "$outbytes" "$DMA_errs" "$Elem_errs" "$Comm_errs"
         ;;
      esac
   done
   print

   sleep $delta
done

$

The script ‘npivstat’ is available for download in our download area.
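The core of the script is the conversion of two absolute counters into a rate. As a minimal illustration of that arithmetic, the following sketch uses awk instead of the bc coroutine; the function name mb_per_sec and the counter values are made up. Note that awk works with floating-point numbers, which are exact only up to 2^53, and rounds the result, whereas bc with scale=2 truncates, so the last digit can differ from the script's output.

```shell
# Hypothetical helper: convert two absolute byte counters taken
# "delta" seconds apart into MB per second (1 MB = 1024 * 1024 bytes).
mb_per_sec() {
    awk -v prev="$1" -v cur="$2" -v delta="$3" \
        'BEGIN { printf "%.2f\n", (cur - prev) / delta / 1024 / 1024 }'
}

# Made-up counters: 1335623680 bytes in 5 seconds
mb_per_sec 0 1335623680 5    # prints 254.75
```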

Here is an excerpt from a run of the script (much shortened, only one of the physical ports is shown):

aixvio1 # ./npivstat
       hostname    dev              wwpn  inreqs  outreqs  inbytes  outbytes  DMA_errs  Elem_errs  Comm_errs
...                                                                                                          
        aixvio1   fcs2  0x210000XXXXXXXXXX    0.00  1019.00     0.00    254.75      0.00       0.00       0.00
       aixtsm01   fcs6  0xC0507605E5890074    0.00     0.00     0.00      0.00      0.00       0.00       0.00
       aixtsm02   fcs2  0xC0507609A6C70004    0.00     0.00     0.00      0.00      0.00       0.00       0.00
          aix05   fcs6  0xC0507609A6C7001C    0.00  1018.20     0.00    254.55      0.00       0.00       0.00
...                                                                                                          
        aixvio1   fcs2  0x210000XXXXXXXXXX    0.00  1020.20     0.00    255.05      0.00       0.00       0.00
       aixtsm01   fcs6  0xC050760XXXXXXXXX    0.00     0.00     0.00      0.00      0.00       0.00       0.00
       aixtsm02   fcs2  0xC050760XXXXXXXXX    0.00     0.00     0.00      0.00      0.00       0.00       0.00
          aix05   fcs6  0xC050760XXXXXXXXX    0.00  1019.80     0.00    254.95      0.00       0.00       0.00
...                                                                                                           
        aixvio1   fcs2  0x210000XXXXXXXXXX    0.00   984.80     0.00    246.20      0.00       0.00       0.00
       aixtsm01   fcs6  0xC050760XXXXXXXXX    0.00     0.00     0.00      0.00      0.00       0.00       0.00
       aixtsm02   fcs2  0xC050760XXXXXXXXX    0.00     0.00     0.00      0.00      0.00       0.00       0.00
          aix05   fcs6  0xC050760XXXXXXXXX    0.00   985.00     0.00    246.25      0.00       0.00       0.00
...
^Caixvio1 # 

In the example above, the NPIV client aix05 generates approximately 250 MB/s of data, while the other two NPIV clients, aixtsm01 and aixtsm02, produce no FC traffic during this time.
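To see at a glance which LPARs produce the most FC traffic, the lines of one interval can be ranked by throughput. The following sketch assumes the column layout of the npivstat output shown above (inbytes and outbytes in columns 6 and 7); the function name top_clients is hypothetical. It adds up all lines it is given, so it should be fed the lines of a single interval.

```shell
# Hypothetical helper: rank NPIV clients by total MB/s (inbytes + outbytes).
# Reads npivstat output for ONE interval on stdin; $1 is the VIOS hostname.
top_clients() {
    awk -v vios="$1" '
        $3 !~ /^0x/ || $1 == vios { next }   # skip header, "..." and the physical ports
        { mb[$1] += $6 + $7 }                # sum over all virtual ports of an LPAR
        END { for (h in mb) printf "%s %.2f\n", h, mb[h] }
    ' | sort -k2,2nr
}

# Usage (the file name is a placeholder for one saved interval):
#   top_clients aixvio1 < npivstat_interval.out
```

For the first interval of the excerpt above, aix05 comes out on top with 254.55 MB/s.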

The script must be started as root on a virtual I/O server. Of course you can customize the script to your own needs.