TCP connection aborts due to “max assembly queue depth”

Recently we had frequent program crashes from some Java client programs. The following was found in the Java stack trace:

[STACK] Caused by: java.io.IOException: Premature EOF
[STACK]           at sun.net.www.http.ChunkedInputStream.readAheadBlocking(Unknown Source)
[STACK]           at sun.net.www.http.ChunkedInputStream.readAhead(Unknown Source)
[STACK]           at sun.net.www.http.ChunkedInputStream.read(Unknown Source)
[STACK]           at java.io.FilterInputStream.read(Unknown Source)

The problem occurs in the ChunkedInputStream class in the readAheadBlocking method. In the source code of the method you will find:

558 /**
559 * If we hit EOF it means there's a problem as we should never
560 * attempt to read once the last chunk and trailers have been
561 * received.
562 */
563 if (nread < 0) {
564 error = true;
565 throw new IOException("Premature EOF");
566 }

The value nread becomes less than 0 when the end of the data stream is reached. This can happen if the other party closes the connection unexpectedly.

The server side in this case was an AIX system (AIX 7.1 TL5 SP3). A review of the TCP connections for dropdowns using netstat resulted in:

$ netstat -p tcp | grep drop
        361936 connections closed (including 41720 drops)
        74718 embryonic connections dropped
                0 connections dropped by rexmit timeout
                0 connections dropped due to persist timeout
                0 connections dropped by keepalive
        0 packets dropped due to memory allocation failure
        0 Connections dropped due to bad ACKs
        0 Connections dropped due to duplicate SYN packets
        1438 connections dropped due to max assembly queue depth
$

Thus, there were 1438 connection drops because of the maximum TCP assembly queue depth. The queue depth is configured via the new kernel parameter tcp_maxqueuelen, which was introduced as a fix for CVE-2018-6922 (see: The CVE-2018-6922 fix (FreeBSD vulnerability) and scp). The default value is 1000. With larger packet runtimes, the queue may overflow.

After increasing the kernel parameter tcp_maxqueuelen, no connection drops because of max assembly queue have occurred anymore.

 

LPAR tool with test license until 15th september 2019

In our download area, version 1.3.0.2 of our LPAR tool, including a valid test license (valid until 15th september 2019) is available for download. The license is contained directly in the binaries, so no license key must be entered. The included trial license allows use of the LPAR tool for up to 10 HMCs, 100 managed systems and 1000 LPARs.

ProbeVue in Action: Monitoring the Queue Depth of Disks

Disk and storage systems support Tagged Command Queuing, i.e. connected servers can send multiple I/O jobs to the disk or storage system without waiting for older I/O jobs to finish. The number of I/O requests you can send to a disk before you have to wait for older I/O requests to complete can be configured using the hdisk queue_depth attribute on AIX. For many hdisk types, the value 20 for the queue_depth is the default value. In general, most storage systems allow even greater values for the queue depth.

With the help of ProbeVue, the utilization of the disk queue can be monitored very easily.

Starting with AIX 7.1 TL4 or AIX 7.2 TL0, AIX supports the I/O Probe Manager. This makes it easy to trace events in AIX’s I/O stack. If an I/O is started by the disk driver, this is done via the iostart function in the kernel, the request is forwarded to the adapter driver and then passed to the storage system via the host bus adapter. Handling the response is done by the iodone function in the kernel. The I/O Probe Manager supports (among others) probe events at these locations:

@@io:disk:iostart:read:<filter>
@@io::disk:iostart:write:<filter>
@@io:disk:iodone:read:<filter>
@@io::disk:iodone:write:<filter>

As a filter, e.g. a hdisk name like hdisk2 can be specified. The probe points then only trigger events for the disk hdisk2. This allows to perform an action whenever an I/O for a hdisk begins or ends. This would allow to measure how long an I/O operation takes or just to count how many I/Os are executed. In our example, we were interested in the utilization of the disk queue, i.e. the number of I/Os sent to the disk which are not yet completed. The I/O Probe Manager has a built-in variable __diskinfo for the iostart and iodone I/O probe events with the following fields (https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix .genprogc / probevue_man_io.htm):

name          char*     Name of the disk.
…
queue_depth   int       The queue depth of the disk (value from ODM)
cmds_out      int       Number of outstanding I/Os
…

The cmds_out field indicates how many I/Os have already been sent to the disk for which the I/O has not yet been completed (response has not yet arrived at the server).

The following section of code determines the minimum, maximum, and average number of entries in the disk queue:

@@io:disk:iostart:*:hdisk0     // Only I/Os for hdisk0 are considered
{
   queue = __iopath->cmds_out; // Store number of outstanding I/Os in variable queue
   ++numIO;                    // Number of I/Os (used for calculating the average)
   avg += queue;               // Add number of outstanding I/Os to variable avg
   if ( queue < min )
      min = queue;             // Check if minimum
   if ( queue > max )
      max = queue;             // Check if maximum
}

The calculated values are then printed once per second using the interval probe manager:

@@interval:*:clock:1000
{
   if ( numIO == 0 )
      numIO = 1;    // Prevent division by 0 when calculating the average
   if ( min > max )
      min = max;
   printf( "%5d  %5d  %5d\n" , min , avg/numIO , max );
   min = 100000;   // Reset variables for the next interval
   avg = 0;
   max = 0;
   numIO = 0;
}

The full script is available for download on our website: ioqueue.e.

Here is a sample run of the script for the disk hdisk13:

# ./ioqueue.e hdisk13
  min    avg    max
    1      1      2
    1      1      9
    1      1      2
    1      1      8
    1      1      2
    1      1      2
    1      1      8
    1      1     10
    1      1      2
    1      1      1
    1      1     10
    1      1      2
    1      1     11
...

The script expects an hdisk as an argument, and then outputs once per second the values determined for the specified hdisk.

In the example output you can see that the maximum number of entries in the disk queue is 11. An increase of the attribute queue_depth therefore makes no sense from a performance perspective.

Here’s another example:

# ./ioqueue.e hdisk21
  min    avg    max
    9     15     20
   11     17     20
   15     19     20
   13     19     20
   14     19     20
   17     18     20
   18     18     19
   16     19     20
   13     18     20
   18     19     19
   17     19     20
   18     19     20
   17     19     19
...

In this case, the maximum value of 20 (the hdisk21 has a queue_depth of 20) is reached on a regular basis. Increasing the queue_depth can improve throughput in this case.

Of course, the sample script can be expanded in various ways; to determine the throughput in addition, or the waiting time of I/Os in the wait queue, or even the position and size of each I/O on the disk. This example just shows how easy it is to get information about I/Os using ProbeVue.

Numbers: FC World Wide Names (WWNs)

Most of us know WWNs as 64-bit WWNs, written as 16 hexadecimal digits. The knowledge that there are different formats of WWNs and that there are also 128-bit WWNs is not quite as well known. In this article, therefore, the different formats of WWNs are briefly presented.

The basic structure of 64-bit WWNs looks like this:

+---+----------------+
|NAA| NAME           |
+---+----------------+
4-bit 60-bit

The 4-bit NAA (Network Address Authority) field specifies the type of address and the format of the address.

There are a number of different possibilities for the 60-bit NAME field.

 

1. Format 1 Address (NAA = 0001)

+---+--------+------------------------+
|NAA|Reserved| 48-bit IEEE MAC Address|
+---+--------+------------------------+
4-bit 12-bit   48-bit

In the Reserved (12-bit) field, all bits must be set to 0!

Example:

1 000 00507605326d (To clarify the format, the fields are separated by spaces)

 

2. Format 2 Address (NAA = 0010)

+---+---------------+-----------------------+
|NAA|Vendor Assigned|48-bit IEEE MAC Address|
+---+---------------+-----------------------+
4-bit  12-bit         48-bit

The 12-bit “Vendor Assigned” field can be used arbitrarily by the manufacturer.

Example:

2 001 00507605326d (To clarify the format, the fields are separated by spaces)

 

3. Format 3 Address (NAA = 0011)

+---+-----------------+
|NAA|Vendor Assigned  |
+---+-----------------+
4-bit 60-bit

The field “Vendor Assigned” (60-bit) is assigned arbitrarily by the manufacturer. Thus, this type of address is not unique worldwide and therefore usually not found in practice.

Example:

3 0123456789abcde (To clarify the format, the fields are separated by spaces)

 

4. Format 4 Address (NAA = 0100)

+---+---------+--------------+
|NAA|Reserved | IPv4 Address |
+---+---------+--------------+
4-bit 28-bit     32-bit

The “IPv4 Address” (32-bit) field contains a 32-bit IPv4 address.

Example for IP 10.0.0.1:

4 0000000 0a000001 (To clarify the format, the fields are separated by spaces)

 

5. Format 5 Address (NAA = 0101)

+---+-------+-----------------+
|NAA| OUI   | Vendor Assigned |
+---+-------+-----------------+
4.bit 24-bit 36-bit

The OUI (24-bit) field contains the 24-bit IEEE-assigned ID (Organizational Unique ID).

The field “Vendor Assigned” (36-bit) can be assigned arbitrarily by the manufacturer.

Example:

5 005076 012345678 (To clarify the format, the fields are separated by spaces)

 

6. Format 6 Address (NAA = 0110)

Format 6 addresses are 128-bit addresses and are often used for LUNs on the SAN.

+---+-------+---------------+-------------------------+
|NAA|  OUI  |Vendor Assigned|Vendor Assigned Extension|
+---+-------+---------------+-------------------------+
4.bit 24-bit  36-bit          64-bit

The OUI (24-bit) field contains the 24-bit ID assigned by the IEEE.

The field “Vendor Assigned” (36-bit) can be arbitrarily assigned by the manufacturer.

The field “Vendor Assigned Extension” (64-bit) can also be assigned arbitrarily by the manufacturer.

Example:

6 005076 012345678 0123456789abcdef (To clarify the format, the fields are separated by spaces)

 

7. IEEE EUI-64 Address (NAA=11)

In the case of this address format, the NAA field is shortened to only 2 bits, where NAA is 11.

+---+-------------+---------------+
|NAA|OUI shortened|Vendor Assigned|
+---+-------------+---------------+
2-bit 22-bit       40-bit

The “OUI shortened” field (22-bit) is a 22-bit shortened version of the IEEE-assigned 24-bit ID.

(The two least significant bits of the first byte are omitted and the remaining 6 bits are shifted 2 bits to the right to make room for the two NAA bits.)

The field “Vendor Assigned” (40-bit) can be arbitrarily assigned by the manufacturer.

These types of addresses are often used in the area of virtualization, e.g. when it comes to NPIV (N_Port ID Virtualization).

Example:

c05076 0123456789 (To clarify the format, the fields are separated by spaces)

 

 

Full file system: df and du show different space usage

Full file systems occur in practice again and again, everyone knows this. Usually you search for large files or directories and check if older data can be deleted to make space again (but sometimes the file system will be simply enlarged without further investigation). In some cases, however, you can not find any larger files that could be deleted or you discover that file system space is seems to be gone, but you can not identify where this space is used. The command du then displays a smaller value for the file system space used than df. In the following, such an example is shown, as well as the hint how to identify where the filesystem-space is and how it can finally be recovered. AIX has a nice feature to offer that is not found in most other UNIX derivatives.

The file system /var/adm/log is 91% filled, currently 3.6 GB of the file system are in use:

# df -g  /var/adm/log
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/varadmloglv      4.00      0.39   91%      456     1% /var/adm/log
#

A quick check with the command du shows that apparently much less space is occupied:

# du –sm /var/adm/log
950.21   /var/adm/log
#

The command “disk usage” shows only 950 MB occupied space! This is 2.7 GB less than the value from the df command. But where is the missing space?

The difference comes from files that have been deleted but are still open by at least one process. The entry for such files is removed from the associated directory, which makes the file inaccessible. Therefore the command du does not take thes files into account and comes up with a smaller value. As long as a process still has the deleted file in use, however, the associated blocks are not released in the file system, so df correctly displays these as occupied.

So there is at least one file in the file system /var/adm/log which has been deleted but is still open by a process. The question is how to identify the process and the file.

AIX provides an easy way to identify processes that have opened deleted files, the fuser command supports the -d option to list processes that have deleted files open:

# fuser -d /var/adm/log
/var/adm/log:  9110638
#

If you also use the -V option, you will also see information about the deleted files, such as the inode number and file size:

# fuser -dV /var/adm/log
/var/adm/log:
inode=119    size=2882647606   fd=12     9110638
#

The output shows that here the file with the inode number 119 with a size of approximately 2.8 GB was deleted, but is still opened by the process with the PID 9110638 over the file descriptor 12.

Using ps you can quickly find out which process it is:

# ps -ef|grep 9110638
    root  9110638  1770180   0   Nov 20      - 28:28 /usr/sbin/syslogd
    root  8193550  8849130   0 09:13:35  pts/2  0:00 grep 9110638
#

In our case it is the syslogd process. Presumably a log file was rotated via mv without informing the syslogd (refresh -s syslogd). We fix this shortly and check the file system again:

# refresh -s syslogd
0513-095 The request for subsystem refresh was completed successfully.
#
# df -g /var/adm/log
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/varadmloglv      4.00      3.07   24%      455     1% /var/adm/log
#

The output shows that the file system blocks have now been released.

ProbeVue in Action: Identifying a crashing Process

Recently, our monitoring reported a full /var file system on one of our systems. We detected that core files in the directory /var/adm/core had filled the file system. It quickly turned out that all core files came from perl. However, based on the core files we could not determine which perl script had caused the crash of perl. A look at the timestamps of the core files unfortunately showed no pattern:

-bash-4.4$ ls -ltr /var/adm/core
total 2130240
drwxr-xr-x    2 root     system          256 Jan 29 10:20 lost+found/
-rw-------    1 root     system    100137039 Jun 26 04:51 core.22610328.26025105.Z
-rw-------    1 root     system     99054991 Jun 26 06:21 core.21102892.26042104.Z
-rw-------    1 root     system     99068916 Jun 26 08:06 core.18153840.26060607.Z
-rw-------    1 root     system    100132866 Jun 26 08:21 core.19005848.26062105.Z
-rw-------    1 root     system     97986020 Jun 26 16:36 core.15270246.26143608.Z
-rw-------    1 root     system     99208958 Jun 26 22:21 core.22675838.26202106.Z
-rw-------    1 root     system     97557063 Jun 27 01:06 core.5505292.26230604.Z
-rw-------    1 root     system     98962499 Jun 27 10:06 core.8257960.27080603.Z
-rw-------    1 root     system     99804173 Jun 27 14:51 core.18940202.27125107.Z
-rw-------    1 root     system     99633676 Jun 28 03:21 core.17563960.28012107.Z
-rw-------    1 root     system     99116032 Jun 28 19:06 core.8651210.28170608.Z
-bash-4.4$

Also, the entries in the error report provided no information about the crashing perl script and how it was started.

-bash-4.4$ sudo errpt -j A924A5FC –a
...
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     A924A5FC

Date/Time:       Wed May 29 15:21:25 CEST 2019
Sequence Number: 17548
Machine Id:      XXXXXXXXXXXX
Node Id:         XXXXXXXX
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSPROC        

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
         11
USER'S PROCESS ID:
              13369662
FILE SYSTEM SERIAL NUMBER
           1
INODE NUMBER
                 69639
CORE FILE NAME
/var/adm/core/core.13369662.29132106
PROGRAM NAME
perl
STACK EXECUTION DISABLED
           0
COME FROM ADDRESS REGISTER

PROCESSOR ID
  hw_fru_id: 1
  hw_cpu_id: 19

ADDITIONAL INFORMATION

Unable to generate symptom string.
Too many stack elements.
-bash-4.4$

The only information that could be found was that the processes were terminated with signal 11 (SIGSEGV), that is, due to an access to an invalid memory address.

The question arose: how to determine which perl script was crashing and how it is started.

This should be a task for ProbeVue.

The sysproc provider, which generates an event in case of an exit of a process, seemed to be the right probe to use. The special built-in variable __exitinfo provides more detailed information about the exit, such as exit status or the signal number that terminated the process. This can be used to write the following clause:

1: @@sysproc:exit:*
2: when ( __exitinfo->signo == 11 )
3: {
4:         printf( "%llu:  %s\n" , __pid , __pname );
5:         ptree(10);
6: }

The 6 lines are briefly explained here:

  1. The probe point: provider is sysproc, event is exit, * means any process.
  2. By using the above predicate, the subsequent action block is executed only when the process was terminated with signal 11 (SIGSEGV).
  3. Start of the action block.
  4. Output of the PID and the program name of the process.
  5. The function ptree outputs the father, grandfather, etc. (up to 10 levels) of the process.
  6. End of the action block.

Unfortunately, no program arguments can be listed here, which in our case would have given the name of the perl script. But at least we get informations from the function ptree about the way the program was called, which is sufficient in some cases to ultimately identify the program.

For identification we would like to have the information about the arguments with which perl was called. This information is provided by the syscall provider for the system call execve with which the program is started. The probe point is thus syscall: *: execve: entry, since the arguments are known when entering the function. The signature of execve for ProbeVue looks like this:

int execve( char* , struct arg_t* args , char* );

Here, the first argument (provided by ProbeVue as __arg1) is the program name. The second argument is a structure with the arguments in question (provided by __arg2). The third argument gives access to environment variables, which is not important in our case. The structure struct arg_t looks like this for 5 arguments:

struct arg_t
{
        union
        {
                char* arg[5];
                int num[5];
        } u;
};

This structure and the signature of execve must be declared in the ProbeVue script before they can be used.

When accessing the arguments, there is another small problem: if the action block for our sample is addressed, we are in kernel mode, but the arguments themselves are addresses in the user mode of the process. The data (in this case character strings) must be copied out of the user address space. This is done by the function get_userstring.

For every execve we get the PID, the program name, the command and up to 5 arguments. This is implemented in the following program:

#! /usr/bin/probevue

struct arg_t
{
        union
        {
                char* arg[5];
                int num[5];
        } u;
};

int execve( char* , struct arg_t* args , char* );

@@syscall:*:execve:entry
{
        __auto String command[128];
        __auto String argument[128];
        __auto struct arg_t argv;
        copy_userdata( __arg2 , argv );
        command = get_userstring( __arg1 , -1 );
        argument = get_userstring( argv.u.arg[0] , -1 );
        printf( "%llu: %s called execve(%s) with arguments: %s " , __pid , __pname , command , argument )
;
        if ( argv.u.num[1] != 0 )
        {
                argument = get_userstring( argv.u.arg[1] , -1 );
                printf( "%s " , argument );
                if ( argv.u.num[2] != 0 )
                {
                        argument = get_userstring( argv.u.arg[2] , -1 );
                        printf( "%s " , argument );
                        if ( argv.u.num[3] != 0 )
                        {
                                argument = get_userstring( argv.u.arg[3] , -1 );
                                printf( "%s " , argument );
                                if ( argv.u.num[4] != 0 )
                                {
                                        argument = get_userstring( argv.u.arg[4] , -1 );
                                        printf( "%s " , argument );
                                }
                        }
                }
        }
        printf( "\n" );
}

@@sysproc:exit:*
when ( __exitinfo->signo == 11 )
{
        printf( "%llu:  %s\n" , __pid , __pname );
        ptree(10);
}

We called the script capture_segv.e and made it executable.

In theory, after startup the program should output all starting programs with PID, name and up to 5 arguments. In addition, an output occurs when a process is terminated with signal 11 (SIGSEGV). The corresponding PID can then be matched further up in the output and thus the program with arguments can be identified.

Unfortunately, the following small problem arises in practice: if a program is terminated very quickly after the execve, before ProbeVue can copy the arguments with get_userstring, get_userstring accesses a no longer existing address and the ProbeVue script is aborted. We bypassed this by simply starting the ProbeVue script over and over in an infinite loop:

# while true; do ./capture_segv.e >>/tmp/wait_for_segv ; done

We then ran the ProbeVue script for a few hours, until the next crash of the perl script. The file /tmp/wait_for_segv contained about 23,000 lines! We have listed only the relevant lines here:

# cat /tmp/wait_for_segv
…
8651210: ksh called execve(xxxx_hacheck.pl) with arguments: xxxx_hacheck.pl -c
8651210: ksh called execve(/var/opt/OV/bin/instrumentation/xxxx_hacheck.pl) with arguments: xxxx_hacheck
.pl -c
20054518: ksh called execve(/bin/proc2mon.pl) with arguments: proc2mon.pl
…
8651210:  perl

     PID              CMD
       1              init
                        |
                        V
9634196              ovcd
                        |
                        V
9765232              opcacta
                        |
                        V
8651210              perl    <=======
…

One can see that perl was started via the program opcacta, which was started by ovcd. These processes belong to HP OpenView that is in use here. At the top of the output one can see that the perl script /var/opt/OV/bin/instrumentation/xxxx_hacheck.pl has been started. So we found the script that generates the many core files.

The script was written recently and obviously needs to be re-examined and reworked.

With the help of ProbeVue a short script and several hours of waiting was enough to find the cause of the problem! ProbeVue is not only useful for investigating problems; also in performance monitoring, ProbeVue proves to be extremely helpful.

Resources of not activated LPARs and Memory Affinity

When an LPAR is shut down, resources such as processors, memory, and I/O slots are not automatically released by the LPAR. The resources remain assigned to the LPAR and are then reused on the next activation (with the current configuration). In the first part of the article Resources of not activated LPARs we had already looked at this.

(Note: In the example output, we use version 1.4 of the LPAR tool, but in all cases we show the underlying commands on the HMC command line, so you can try everything without using the LPAR tool.)

The example LPAR lpar1 was shut down, but currently still occupies 100 GB of memory:

linux $ lpar status lpar1
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  39       aixlinux  Not Activated  standard  0     inactive  1      0.2         102400  Unknown
linux $

The following commands for the output above were executed on the corresponding HMC hmc01:

hmc01: lssyscfg -r lpar -m ms09 --filter lpar_names=lpar1
hmc01: lshwres -r mem -m ms09 --level lpar --filter lpar_names=lpar1
hmc01: lshwres -r proc -m ms09 --level lpar --filter lpar_names=lpar1

As the output shows, the LPAR lpar1 has still allocated its resources (processors, memory, I/O adapters).

In order to understand why deactivating an LPAR does not release the resources, you have to look at the “Memory Affinity Score”:

linux $ lpar lsmemopt lpar1
             LPAR_SCORE  
LPAR_NAME  CURR  PREDICTED
lpar1      100   0
linux $

HMC command line:

hmc01: lsmemopt -m ms09 -r lpar -o currscore –filter lpar_names=lpar1

The Memory Affinity Score describes how close processors and memory are, the closer the memory to the memory is, the better is the throughput to the memory. The command above indicates, with a value between 1 and 100, how big the affinity between processors and LPARs is. Our LPAR lpar1 currently has a value of 100, which means the best possible affinity of memory and processors. If the resources were freed when deactivating an LPAR, then the LPAR would lose this Memory Affinity Score. The next time you enable the LPAR, it then depends on the memory and processors available then how good the memory affinity will be then. We release the resources once:

linux $ lpar -d rmprocs lpar1 1
linux $

HMC command line:

hmc01: chhwres -m ms09 -r proc  -o r -p lpar1 --procs 1

No more score will be given, since the LPAR has no longer allocated any resources:

linux $ lpar lsmemopt lpar1
             LPAR_SCORE  
LPAR_NAME  CURR  PREDICTED
lpar1      none  none
linux $

HMC command line:

hmc01: lsmemopt -m ms09 -r lpar -o currscore –filter lpar_names=lpar1

Now we allocate resources again and look at the effect this has on memory affinity:

linux $ lpar applyprof lpar1 standard
linux $

HMC command line:

hmc01: chsyscfg -r lpar -m ms09 -o apply -p lpar1 -n standard

We again determine the Memory Affinity Score:

linux $ lpar lsmemopt lpar1
             LPAR_SCORE  
LPAR_NAME  CURR  PREDICTED
lpar1      53    0
linux $

HMC command line:

hmc01: lsmemopt -m ms09 -r lpar -o currscore –filter lpar_names=lpar1

The score is now only 53, the performance of the LPAR has become worse. Whether and how much this is noticeable, depends ultimately on the applications on the LPAR.

The fact that the resources are not released when deactivating an LPAR, thus guarantees the next time you activate (with the current configuration) the memory affinity remains the same and thus the performance should be the same.

If you release the resources of an LPAR (manually or automatically), then you have to realize that this has an effect on the LPAR if it is later activated again, because then the resources are reassigned and a worse (but possibly also a better) Memory Affinity Score can result.

Conversely, before activating a new LPAR you can also make sure that there is a good chance for a high memory affinity score for the new LPAR by releasing resources of inactive LPARs.

(Note: resource distribution can be changed and improved at runtime using the Dynamic Platform Optimizer DPO. DPO is supported as of POWER8.)

 

Resources of not activated LPARs

When an LPAR is shutdown, resources such as processors, memory, and I/O slots are not automatically released by the LPAR. The resources remain assigned to the LPAR and are reused on the next activation (with the current configuration).

The article will show how such resources are automatically released and, if desired, how to manually release resources of an inactive LPAR.

(Note: In the example output, we use version 1.4 of the LPAR tool, but in all cases we show the underlying commands on the HMC command line, so you can try everything without using the LPAR tool.)

The example LPAR lpar1 was shut down, but currently still occupies 100 GB of memory:

linux $ lpar status lpar1
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  39       aixlinux  Not Activated  standard  0     inactive  1      0.2         102400  Unknown
linux $

The following commands for the output above were executed on the corresponding HMC hmc01:

hmc01: lssyscfg -r lpar -m ms09 --filter lpar_names=lpar1
hmc01: lshwres -r mem -m ms09 --level lpar --filter lpar_names=lpar1
hmc01: lshwres -r proc -m ms09 --level lpar --filter lpar_names=lpar1

The resource_config attribute of an LPAR indicates whether the LPAR has currently allocated resources (resource_config=1) or not (resource_config=0):

linux $ lpar status -F resource_config lpar1
1
linux $

Or on the HMC command line:

hmc01: lssyscfg -r lpar -m ms09 --filter lpar_names=lpar1 –F resource_config

The resources allocated by an not activated LPAR can be released in 2 different ways:

  1. Automatic: The resources used are needed by another LPAR, e.g. because memory is expanded dynamically or an LPAR is activated that does not have sufficient resources. In this case, resources are automatically removed from a not activated LPAR. We will show this below with an example.
  2. Manual: The allocated resources are explicitly released by the administrator. This is also shown below in an example.

First we show an example in which resources are automatically taken away from an not activated LPAR.

The managed system ms09 currently has about 36 GB free memory:

linux $ ms lsmem ms09
NAME  INSTALLED  FIRMWARE  CONFIGURABLE  AVAIL  MEM_REGION_SIZE
ms09  786432     33792     786432        36352  256
linux $

HMC command line:

hmc01: lshwres -r mem -m ms09 --level sys

We start an LPAR (lpar2) which was configured with 100 GB of RAM. The managed system has only 36 GB of RAM and is therefore forced to take resources from inactive LPARs in order to provide the required 100 GB. We start lpar2 with the profile standard and look at the memory relations:

linux $ lpar activate -b sms -p standard lpar2
linux $

HMC command line:

hmc01: chsysstate -m ms09 -r lpar -o on -n lpar2 -b sms -f standard

Overview of the storage relations of lpar1 and lpar2:

linux $ lpar status lpar\*
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  4        aixlinux  Not Activated  standard  0     inactive  1      0.2         60160   Unknown
lpar2  8        aixlinux  Open Firmware  standard  0     inactive  1      0.2         102400  Unknown
linux $ ms lsmem ms09
NAME  INSTALLED  FIRMWARE  CONFIGURABLE  AVAIL  MEM_REGION_SIZE
ms09  786432     35584     786432        0      256
linux $

HMC command line:

hmc01: lssyscfg -r lpar -m ms09
hmc01: lshwres -r mem -m ms09 --level lpar
hmc01: lshwres -r proc -m ms09 --level lpar
hmc01: lshwres -r mem -m ms09 --level sys

The LPAR lpar2 has 100 GB RAM, the managed system has no more free memory and the memory allocated by LPAR lpar1 has been reduced to about 60 GB. Allocated resources from non-activated LPARs are automatically released, when needed and assigned to other LPARs.

But you can of course also release the resources manually. This is also shown briefly here. We are reducing the memory of LPAR lpar1 by 20 GB:

linux $ lpar -d rmmem lpar1 20480
linux $

HMC command line:

hmc01: chhwres -m ms09 -r mem  -o r -p lpar1 -q 20480

As stated, the allocated memory has been reduced by 20 GB:

linux $ lpar status lpar\*
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  4        aixlinux  Not Activated  standard  0     inactive  1      0.2         39680   Unknown
lpar2  8        aixlinux  Open Firmware  standard  0     inactive  1      0.2         102400  Unknown
linux $ ms lsmem ms09
NAME  INSTALLED  FIRMWARE  CONFIGURABLE  AVAIL  MEM_REGION_SIZE
ms09  786432     35584     786432        20480  256
linux $

HMC command line:

hmc01: lssyscfg -r lpar -m ms09
hmc01: lshwres -r mem -m ms09 --level lpar
hmc01: lshwres -r proc -m ms09 --level lpar
hmc01: lshwres -r mem -m ms09 --level sys

The 20 GB are immediately available to the managed system as free memory. If you remove the entire memory or all processors (or processor units), then all resources of an inactive LPAR are released:

linux $ lpar -d rmmem lpar1 39680
linux $

HMC command line:

hmc01: chhwres -m ms09 -r mem  -o r -p lpar1 -q 39680

Here are the resulting memory relations:

linux $ lpar status lpar\*
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  4        aixlinux  Not Activated  standard  0     inactive  0      0.0         0       Unknown
lpar2  8        aixlinux  Open Firmware  standard  0     inactive  1      0.2         102400  Unknown
linux $ ms lsmem ms09
NAME        INSTALLED  FIRMWARE  CONFIGURABLE  AVAIL  MEM_REGION_SIZE
ms09  786432     31232     786432        64512  256
linux $

HMC command line:

hmc01: lssyscfg -r lpar -m ms09
hmc01: lshwres -r mem -m ms09 --level lpar
hmc01: lshwres -r proc -m ms09 --level lpar
hmc01: lshwres -r mem -m ms09 --level sys

The LPAR lpar1 now has 0 processors, 0.0 processor units and 0 MB of memory! In addition, the resource_config attribute now has the value 0, which indicates that the LPAR no longer has any resources configured!

linux $ lpar status -F resource_config lpar1
0
linux $

HMC command line:

hmc01: lssyscfg -r lpar -m ms09 --filter lpar_names=lpar1 –F resource_config

Finally, the question arises as to why you should release resources manually if they are automatically released by the managed system when needed?

We will answer this question in a second article.

 

Accessing the Update Access Key Expiration Date from AIX

As part of the introduction of POWER8 systems, IBM has also introduced the “Update Access Key”, which is necessary to perform firmware updates of the managed system. By default, newly delivered systems have an update access key that usually expires after 3 years. Thereafter, the Update Access Key can has to be extended every 6 month, but only if a maintenance contract exists(https://www.ibm.com/servers/eserver/ess/index.wss).

Of course, it is easy to find out when the current Update Access Key runs through the HMC, GUI or CLI. But you can also display the expiration date via the lscfg command from AIX:

In the case of AIX 7.1, this looks like this:

$ lscfg -vpl sysplanar0 | grep -p "System Firmware"
      System Firmware:
...
        Microcode Image.............SV860_138 SV860_103 SV860_138
        Microcode Level.............FW860.42 FW860.30 FW860.42
        Microcode Build Date........20180101 20170628 20180101
        Microcode Entitlement Date..20190825
        Hardware Location Code......U8284.22A.XXXXXXX-Y1
      Physical Location: U8284.22A.XXXXXXX-Y1

In the case of AIX 7.2, the output is slightly different:

$ lscfg -vpl sysplanar0 |grep -p "System Firmware"
      System Firmware:
...
        Microcode Image.............SV860_138 SV860_103 SV860_138
        Microcode Level.............FW860.42 FW860.30 FW860.42
        Microcode Build Date........20180101 20170628 20180101
        Update Access Key Exp Date..20190825
        Hardware Location Code......U8284.22A.XXXXXXX-Y1
      Physical Location: U8284.22A.XXXXXXX-Y1

Relevant are the lines “Microcode Entitlement Date” respectively “Update Access Key Exp Date“.

Special features of NFSv4 Mounts

Many AIX administrators use NFSv4 in the same way as NFSv3 and NFSv2 before. When exporting and mounting version 4 is specified, otherwise everything else is done as before. While this works in most cases, it also prevents the use of some interesting features of NFSv4.

The first significant difference between NFSv4 and its predecessors already shows up when mounting. NFSv2 and NFSv3 use a separate MOUNT protocol. For example, to run the following NFS mount:

clientv3 # mount aixnim:/export/data /mnt
clientv3 #

an RPC request is sent to the rpc.mountd on the NFS server to get an NFS filehandle for the file system/directory /export/data. All NFS operations must always specify an NFS filehandle that uniquely identifies a file or directory on the NFS server. The very first filehandle is requested via the MOUNT protocol.

In the case of NFSv4 this looks quite different, no separate MOUNT protocol is needed. Instead, a so-called root filehandle is used, which is defined by the NFS standard and need not be learned through a separate protocol. At the following NFSv4 mount

clientv4 # mount –o vers=4 aixnim:/export/data /mnt
clientv4 #

the client starts an NFS LOOKUP specifying the well-known (NFS default) root filehandle and relative path “export/data“, then the NFS server returns the associated filehandle. This can easily be tracked with the help of tcpdump, which we did not do here for reasons of space.

A side effect that is surprising for many administrators (and maybe not always understood) is that you can not see the filesystems exported with NFSv4 with “showmount -e“. This is simply because there is no MOUNT protocol for NFSv4. So you can not easily find out on the NFS client which file systems the NFS server has exported for NFSv4.

clientv4 # showmount -e aixnim
no exported file systems for aixnim
clientv4 #

The command “showmount -e” does not show exported file systems, although we were able to mount successfully using NFSv4. We’ll come back to that later.

The second significant difference is that for NFSv4 the NFS server generates a pseudo-file system for each NFSv4 client. This file system starts with the nfsroot directory (by default this is /) and contains all underlying directories and files exported to the client. The pseudo-file system is also created if there is no exported file system for the client!

For demonstration purposes, we have un-exported all file systems on our NFS server aixnim:

aixnim # lsnfsexp
aixnim #

Although nothing has been exported yet for the NFS client, the per client-generated pseudo-file system can be mounted using NFSv4:

clientv4 # mount -o vers=4 aixnim:/ /mnt
clientv4 # ls -il /mnt
total 0
clientv4 #

The mounted file system is of course empty, since nothing has been exported yet. We unmount the mounted file system again (not shown here) and export the directory /var/adm on the NFS server aixnim:

aixnim # mknfsexp -d /var/adm -v 4 -r clientv4
aixnim # lsnfsexp
/var/adm -vers=4,root=clientv4
aixnim #

We now mount the pseudo-file system / again:

clientv4 # mount -o vers=4 aixnim:/ /mnt
clientv4 #

In order to illustrate the differences to NFSv2 and NFSv3 more easily, we briefly show the useful command nfs4cl for the NFSv4 client:

clientv4 # nfs4cl showfs

Server      Remote Path          fsid                 Local Path        
--------    ---------------      ---------------      ---------------   
aixnim    /                    0:42949672964        /mnt              
clientv4 #

The command shows the pseudo-file system / from aixnim, which is mounted under /mnt. We now look briefly with the command ls in the directory /mnt

clientv4 # ls -il /mnt
total 1
    2 dr-xr-xr-x    2 root     system            3 May 21 07:34 var
clientv4 #

In the pseudo-file system generated by the NFS server, only the path component /var is visible. This path component is a prefix of the exported directory /var/adm. Other directories such as /opt or /usr are not visible in the pseudo-file system, because they are not prefixes of an exported path. We take a look at /mnt/var:

clientv4 # ls -il /mnt/var   
total 8
   32 drwxrwxr-x   15 root     adm            4096 May  2 11:30 adm
clientv4 #

Also under directory var, only the directory adm is visible, since only /var/adm is a prefix of an exported path. The pseudo-file system is of course unchangeable at the places that were not exported, as the attempt to create a file under /mnt/var shows:

clientv4 # touch /mnt/var/file
touch: /mnt/var/file cannot create
clientv4 #

Starting with /mnt/var/adm everything looks like we know from NFSv2 and NFSv3, one has access to the exported data:

clientv4 # ls -il /mnt/var/adm
total 704
  110 drw-r-----    2 root     system          256 May 20 14:33 SRC
4165 drwxrwxr-x    2 adm      adm             256 Apr 17 08:07 acct
   70 drwx------    4 root     system          256 Apr 17 07:50 config
4133 drwx------    2 root     system          256 Apr 17 08:03 corrals
...
4   33 -rw-rw-r--    1 adm      adm          337608 May 20 09:30 wtmp
clientv4 #

Now let’s look again at the output of the command “nfs4cl showfs“:

Clientv4 # nfs4cl showfs

Server      Remote Path          fsid                 Local Path        
--------    ---------------      ---------------      ---------------   
aixnim   /var                 0:42949672966        /mnt/var          
aixnim   /                    0:42949672964        /mnt        
clientv4 #

For each physical file system on the server, a separate pseudo-file system is created. The respective pseudo-file system grants access to exported directories of the underlying physical file system and generates read-only directories for path prefixes of exported directories.

On the NFS server aixnim we export the directory /usr/share for the client:

aixnim # mknfsexp -d /usr/share -v 4 -r clientv4
aixnim # lsnfsexp
/var/adm -vers=4,root=clientv4
/usr/share -vers=4,root=clientv4
aixnim #

On the client this time we do not umount and re-mount, but simply use ls to access the mount point /mnt again:

clientv4 # ls -il /mnt
total 2
    2 dr-xr-xr-x    2 root     system            3 May 21 08:13 usr
    2 dr-xr-xr-x    2 root     system            3 May 21 07:34 var
clientv4 #

The path prefix usr of the currently exported directory /usr/share appears on the client without having explicitly performed a mount. A look at /mnt/usr shows that the directory share appears as expected:

clientv4 # ls -il /mnt/usr

total 0

16390 drwxr-xr-x    8 bin      bin             256 Apr 17 08:31 share

clientv4 #

And under /mnt/usr/share are, as expected, the exported data:

clientv4 # ls -il /mnt/usr/share
total 24
74212 drwxr-xr-x    2 bin      bin             256 Apr 17 08:24 X11
49162 drwxr-xr-x    2 bin      bin             256 Nov  3 2015  dict
16391 drwxr-xr-x   12 bin      bin            4096 Apr 17 08:22 lib
17762 lrwxrwxrwx    1 root     system           26 Apr 17 08:31 locale -> /opt/freeware/share/locale
20653 drwxr-xr-x    5 root     system          256 Apr 24 15:46 lpp
16911 drwxr-xr-x   11 bin      bin            4096 May 20 14:25 man
45096 drwxr-xr-x    2 bin      bin            4096 Apr 17 08:03 modems
clientv4 #

The command “nfs4cl showfs” now shows 3 file systems:

Clientv4 # nfs4cl showfs

Server      Remote Path          fsid                 Local Path        
--------    ---------------      ---------------      ---------------   
aixnim /usr                 0:42949672965        /mnt/usr          
aixnim /var                 0:42949672966        /mnt/var          
aixnim /                    0:42949672964        /mnt              
clientv4 #

The last example shows that in the case of NFSv4 new file systems do not necessarily have to be manually mounted on the client. If further file systems are exported on the NFS server for an NFSv4 client, the client can simply access the new file system. Prerequisite, however, is that the path for the new file system is part of the client-mounted pseudo-file system. Since we had the entire pseudo-file system starting from the nfsroot mounted, this is trivially the case. But if we had only mounted /var from the NFS server, then /usr/share would not be part of the pseudo-file system of /var and a separate mount would have been necessary.

That additional file systems can be accessed without explicit mount, as just shown, comes from a third difference of NFSv4 to its predecessors. For NFSv2 and NFSv3, all filehandles are persistent, that is, immutable. NFSv4 introduced volatile filehandles in addition to persistent filehandles. The file handles used by the pseudo-file system are volatile. That means the client must expect that such a filehandle can change. This is the case if another path is exported on the NFS server, the filehandles for the path prefixes in the pseudo-file system change then, the client recognizes this after a short time and responds accordingly.

Finally, a small problem should be noted: we had 1 mount done, but the command “nfs4cl showfs” showed that 3 filesystems were involved. But there is only one file system mounted, as df shows:

clientv4 # df -g
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4           1.00      0.86   15%     8367     5% /
/dev/hd2           6.12      2.90   53%    65563     9% /usr
/dev/hd9var        2.00      1.81   10%     1770     1% /var
/dev/hd3           2.00      1.89    6%      452     1% /tmp
/dev/hd1           1.00      0.88   12%      454     1% /home
/dev/hd11admin      0.50      0.50    1%       15     1% /admin
/proc                 -         -    -        -      - /proc
/dev/hd10opt       2.00      0.94   54%    22460    10% /opt
/dev/livedump      1.00      1.00    1%        4     1% /var/adm/ras/livedump
/dev/varadmloglv      2.00      1.85    8%      275     1% /var/adm/log
aixnim:/       0.38      0.20   47%    12644    22% /mnt
clientv4 #

The file system mounted under /mnt is 0.38 GB in size. On the NFS server, /usr/share and /var/adm were exported, a df shows the following sizes:

aixnim # df –g / /usr/share /var/adm
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4           0.38      0.20   47%    12644    22% /
/dev/hd2           2.06      0.23   89%    39236    41% /usr
/dev/hd9var        0.50      0.45   10%      614     1% /var
aixnim #

Obviously, the values of the file system / of the NFS server are used on the client! Under /usr and thus also /usr/share would be available free space of close to 2 GB, but this is not reflected in the df output on the client. Of course it is difficult to specify values on the client, because multiple file systems are involved on the NFS server. The df command simply displays the data of the physical file system underlying the mounted pseudo-file system. In our case, this is the root file system of the NFS server. Here again you can use the command nfs4cl, which has a subcommand for displaying file system information similar to df:

clientv4 # nfs4cl showstat

Filesystem       512-blocks        Free  %Used       Iused %Iused  Mounted on
aixnim:/usr     4325376      482752    89%       39236    41% /mnt/usr  
aixnim:/var     1048576      947064    10%         614     1% /mnt/var  
aixnim:/      786432      417944    47%       12644    22% /mnt     
clientv4 #

This is identical to the values displayed by df on the NFS server.

But even with the standard df of AIX you can get this information, as the following output shows:

clientv4 # df -g /mnt /mnt/usr /mnt/usr/share /mnt/var /mnt/var/adm
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
aixnim:/           0.38      0.20   47%    12644    22% /mnt
[NFSv4]            2.06      0.23   89%    39236    41% /mnt/usr
[NFSv4]            2.06      0.23   89%    39236    41% /mnt/usr/share
[NFSv4]            0.50      0.45   10%      614     1% /mnt/var
[NFSv4]            0.50      0.45   10%      614     1% /mnt/var/adm
clientv4 #

Of course there are a number of other differences of NFSv4 and previous versions, but they should not be discussed here. Maybe there will be another article on the subject.

The features presented above are especially useful for hierarchical mounts. With NFSv4, you only need to mount one filesystem and you need not worry about the order of mounts.

Understanding how the pseudo-file system works for the NFSv4 client helps determine the exported file systems on the client. Instead of using “showmount -e” as before with NFSv2 and NFSv3 (which does not get results with NFSv4), you can simply mount everything starting with / and then use cd and ls to find out what the NFS server has exported.