LPAR-Tool in Action: Examples

The LPAR tool can administer HMCs, managed systems, LPARs and Virtual I/O Servers from the command line. The current version of the LPAR tool (1.4.0.2 at the time of writing) can be downloaded from our download page https://powercampus.de/download. A trial license, valid until October 31, is included. This article shows some simple but useful applications of the LPAR tool.

A common question in larger environments (multiple HMCs, many managed systems) is: where is a particular LPAR? The LPAR tool answers this question easily with the command “lpar show”:

$ lpar show lpar02
NAME    ID  SERIAL     LPAR_ENV  MS    HMCS
lpar02  39  123456789  aixlinux  ms21  hmc01,hmc02
$

In addition to the name, the LPAR-ID and the serial number, the managed system, here ms21, and the associated HMCs, here hmc01 and hmc02, are also shown. You can also specify multiple LPARs and/or wildcards:

$ lpar show lpar02 lpar01
...
$ lpar show lpar*
...
$

If no argument is given, all LPARs are listed.

 

Another frequent question is the status of one or more LPARs. Again, this is easily answered, this time with the command “lpar status”:

$ lpar status lpar02
NAME    LPAR_ID  LPAR_ENV  STATE    PROFILE   SYNC  RMC     PROCS  PROC_UNITS  MEM   OS_VERSION
lpar02  39       aixlinux  Running  standard  0     active  1      0.7         7168  AIX 7.2 7200-03-02-1846
$

The LPAR lpar02 is Running, the profile used is standard, the RMC connection is active and the LPAR is running AIX 7.2 (TL3 SP2). The LPAR has 1 processor core, with 0.7 processing units and 7 GB RAM. The column SYNC indicates whether the current configuration is synchronized with the profile (attribute sync_curr_profile).

Of course, several LPARs or even all LPARs can be specified here.
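The OS_VERSION and MEM columns can also be decoded programmatically. The following is only a minimal Python sketch, not part of the LPAR tool: the function name parse_aix_level is hypothetical, and reading the last four digits as build year and week is our assumption about the AIX level string.

```python
import re


def parse_aix_level(os_version: str) -> dict:
    """Split a string such as 'AIX 7.2 7200-03-02-1846' into its parts."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})-(\d{2})(\d{2})", os_version)
    if match is None:
        raise ValueError(f"no AIX level found in {os_version!r}")
    base, tl, sp, year, week = match.groups()
    return {
        "release": f"{base[0]}.{base[1]}",   # 7200 -> 7.2
        "technology_level": int(tl),          # 03 -> TL3
        "service_pack": int(sp),              # 02 -> SP2
        "build_year": 2000 + int(year),       # assumption: yyww build stamp
        "build_week": int(week),
    }


level = parse_aix_level("AIX 7.2 7200-03-02-1846")
mem_gb = 7168 / 1024  # the MEM column is given in MB
print(level["release"], "TL", level["technology_level"], "SP", level["service_pack"], mem_gb, "GB")
```

For the example LPAR this yields release 7.2, TL 3, SP 2 and 7 GB of memory, matching the description above.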

If you want to see what the LPAR tool does in the background: for most commands you can specify the option “-v” for verbose-only. The HMC commands will then be listed, but no changes will be made to the HMC. Here are the HMC commands that are issued for the status output:

$ lpar status -v lpar02
hmc01: lssyscfg -r lpar -m ms21
hmc01: lshwres -r mem -m ms21 --level lpar
hmc01: lshwres -r proc -m ms21 --level lpar
$

 

Next, we show how to add RAM to a running LPAR. We start with the status of the LPAR:

$ lpar status lpar02
NAME    LPAR_ID  LPAR_ENV  STATE    PROFILE   SYNC  RMC     PROCS  PROC_UNITS  MEM   OS_VERSION
lpar02  39       aixlinux  Running  standard  0     active  1      0.7         7168  AIX 7.2 7200-03-02-1846
$

The LPAR is running and RMC is active, so a DLPAR operation should be possible. We will first check if the maximum memory size is already in use:

$ lpar lsmem lpar02
           MEMORY         MEMORY         HUGE_PAGES
LPAR_NAME  MODE  AME  MIN   CURR  MAX   MIN  CURR  MAX
lpar02     ded   0.0  2048  7168  8192  0    0     0
$

Currently the LPAR uses 7 GB, and a maximum of 8 GB is possible. Extending the memory by 1 GB (1024 MB) should therefore work. We add the memory with the command “lpar addmem”:

$ lpar addmem lpar02 1024
$

We verify the result by running the command “lpar lsmem” again:

$ lpar lsmem lpar02
           MEMORY         MEMORY         HUGE_PAGES 
LPAR_NAME  MODE  AME  MIN   CURR  MAX   MIN  CURR  MAX
lpar02     ded   0.0  2048  8192  8192  0    0     0
$

(By the way: if the current configuration is not synchronized with the current profile, attribute sync_curr_profile, then the LPAR tool also updates the profile!)
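The headroom check that we did by eye before calling “lpar addmem” can be sketched in a few lines of Python. This is only an illustration that assumes the column order of the “lpar lsmem” output shown above; the function name memory_headroom_mb is hypothetical.

```python
def memory_headroom_mb(lsmem_row: str) -> int:
    """Return how many MB can still be added before the maximum is reached."""
    fields = lsmem_row.split()
    # Assumed columns: LPAR_NAME MODE AME MIN CURR MAX (memory),
    # followed by MIN CURR MAX (huge pages).
    curr_mb, max_mb = int(fields[4]), int(fields[5])
    return max_mb - curr_mb


before = "lpar02     ded   0.0  2048  7168  8192  0    0     0"
after = "lpar02     ded   0.0  2048  8192  8192  0    0     0"
print(memory_headroom_mb(before))  # 1024
print(memory_headroom_mb(after))   # 0
```

Before the DLPAR operation exactly 1024 MB are left; afterwards the LPAR is at its maximum and a further “lpar addmem” would not be expected to succeed.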

 

Virtual adapters can be listed using “lpar lsvslot”:

$ lpar lsvslot lpar02
SLOT  REQ  ADAPTER_TYPE   STATE  DATA
0     Yes  serial/server  1      remote: (any)/any connect_status=unavailable hmc=1
1     Yes  serial/server  1      remote: (any)/any connect_status=unavailable hmc=1
2     No   eth            1      PVID=123 VLANS= ETHERNET0 XXXXXXXXXXXX
6     No   vnic           -      PVID=1234 VLANS=none XXXXXXXXXXXX failover sriov/ms21-vio1/1/3/0/2700c003/2.0/2.0/20/100.0/100.0,sriov/ms21-vio2/2/1/0/27004004/2.0/2.0/10/100.0/100.0
10    No   fc/client      1      remote: ms21-vio1(1)/47 c050760XXXXX0016,c050760XXXXX0017
20    No   fc/client      1      remote: ms21-vio2(2)/25 c050760XXXXX0018,c050760XXXXX0019
21    No   scsi/client    1      remote: ms21-vio2(2)/20
$

The example shows virtual serial, Ethernet, FC and SCSI adapters as well as a vNIC adapter in slot 6.

 

Finally, we’ll show how to start a console for an LPAR:

$ lpar console lpar02

Open in progress 

 Open Completed.

…

AIX Version 7

Copyright IBM Corporation, 1982, 2018.

Console login:

…

The console can be terminated with “~.”.

 

Of course, the LPAR tool can do much more.

To be continued.

 

LPAR tool 1.4.0.1 available (including a valid test license)!

In our download area, version 1.4.0.1 of our LPAR tool, including a valid test license (valid until 31 October 2019), is available for download. The license is contained directly in the binaries, so no license key needs to be entered. The included trial license allows use of the LPAR tool for up to 10 HMCs, 100 managed systems and 1000 LPARs.

TCP connection aborts due to “max assembly queue depth”

Recently we had frequent program crashes from some Java client programs. The following was found in the Java stack trace:

[STACK] Caused by: java.io.IOException: Premature EOF
[STACK]           at sun.net.www.http.ChunkedInputStream.readAheadBlocking(Unknown Source)
[STACK]           at sun.net.www.http.ChunkedInputStream.readAhead(Unknown Source)
[STACK]           at sun.net.www.http.ChunkedInputStream.read(Unknown Source)
[STACK]           at java.io.FilterInputStream.read(Unknown Source)

The problem occurs in the ChunkedInputStream class in the readAheadBlocking method. In the source code of the method you will find:

/**
 * If we hit EOF it means there's a problem as we should never
 * attempt to read once the last chunk and trailers have been
 * received.
 */
if (nread < 0) {
    error = true;
    throw new IOException("Premature EOF");
}

The value nread becomes less than 0 when the end of the data stream is reached. This can happen if the other party closes the connection unexpectedly.

The server side in this case was an AIX system (AIX 7.1 TL5 SP3). Checking the TCP statistics for dropped connections with netstat showed:

$ netstat -p tcp | grep drop
        361936 connections closed (including 41720 drops)
        74718 embryonic connections dropped
                0 connections dropped by rexmit timeout
                0 connections dropped due to persist timeout
                0 connections dropped by keepalive
        0 packets dropped due to memory allocation failure
        0 Connections dropped due to bad ACKs
        0 Connections dropped due to duplicate SYN packets
        1438 connections dropped due to max assembly queue depth
$

Thus, there were 1438 connection drops because the maximum TCP assembly queue depth was exceeded. The queue depth is configured via the new kernel parameter tcp_maxqueuelen, which was introduced as a fix for CVE-2018-6922 (see: The CVE-2018-6922 fix (FreeBSD vulnerability) and scp). The default value is 1000. With higher packet latencies, the queue may overflow.

After increasing the kernel parameter tcp_maxqueuelen, no further connection drops due to max assembly queue depth occurred.
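To watch this counter over time, the netstat output can be parsed automatically. The following Python sketch is only an illustration: the function name assembly_queue_drops is hypothetical, and the text it matches is taken from the netstat output shown above.

```python
import re


def assembly_queue_drops(netstat_output: str) -> int:
    """Return the 'max assembly queue depth' drop counter, or 0 if absent."""
    match = re.search(
        r"(\d+)\s+connections dropped due to max assembly queue depth",
        netstat_output,
    )
    return int(match.group(1)) if match else 0


SAMPLE = """\
        361936 connections closed (including 41720 drops)
        74718 embryonic connections dropped
        0 Connections dropped due to duplicate SYN packets
        1438 connections dropped due to max assembly queue depth
"""
print(assembly_queue_drops(SAMPLE))  # 1438
```

Comparing two values taken some time apart shows whether drops are still occurring after the kernel parameter has been changed.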

 

LPAR tool with test license until 15 September 2019

In our download area, version 1.3.0.2 of our LPAR tool, including a valid test license (valid until 15 September 2019), is available for download. The license is contained directly in the binaries, so no license key needs to be entered. The included trial license allows use of the LPAR tool for up to 10 HMCs, 100 managed systems and 1000 LPARs.

ProbeVue in Action: Monitoring the Queue Depth of Disks

Disk and storage systems support Tagged Command Queuing, i.e. connected servers can send multiple I/O requests to the disk or storage system without waiting for older I/O requests to finish. The number of I/O requests you can send to a disk before you have to wait for older I/O requests to complete can be configured on AIX using the hdisk attribute queue_depth. For many hdisk types, the default queue_depth is 20. In general, most storage systems allow even greater values for the queue depth.

With the help of ProbeVue, the utilization of the disk queue can be monitored very easily.

Starting with AIX 7.1 TL4 or AIX 7.2 TL0, AIX supports the I/O Probe Manager. This makes it easy to trace events in AIX’s I/O stack. When the disk driver starts an I/O, it does so via the iostart function in the kernel. The request is forwarded to the adapter driver and then passed to the storage system via the host bus adapter. The response is handled by the iodone function in the kernel. The I/O Probe Manager supports (among others) probe events at these locations:

@@io:disk:iostart:read:<filter>
@@io:disk:iostart:write:<filter>
@@io:disk:iodone:read:<filter>
@@io:disk:iodone:write:<filter>

As a filter, an hdisk name such as hdisk2 can be specified, for example. The probe points then only trigger events for the disk hdisk2. This makes it possible to perform an action whenever an I/O for an hdisk begins or ends, for example to measure how long an I/O operation takes or simply to count how many I/Os are executed. In our example, we were interested in the utilization of the disk queue, i.e. the number of I/Os sent to the disk that are not yet completed. The I/O Probe Manager has a built-in variable __diskinfo for the iostart and iodone I/O probe events with the following fields (https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.genprogc/probevue_man_io.htm):

name          char*     Name of the disk.
…
queue_depth   int       The queue depth of the disk (value from ODM)
cmds_out      int       Number of outstanding I/Os
…

The cmds_out field indicates how many I/Os have already been sent to the disk for which the I/O has not yet been completed (response has not yet arrived at the server).

The following section of code determines the minimum, maximum, and average number of entries in the disk queue:

@@io:disk:iostart:*:hdisk0       // Only I/Os for hdisk0 are considered
{
   queue = __diskinfo->cmds_out; // Store number of outstanding I/Os in variable queue
   ++numIO;                      // Number of I/Os (used for calculating the average)
   avg += queue;                 // Add number of outstanding I/Os to variable avg
   if ( queue < min )
      min = queue;               // Check if minimum
   if ( queue > max )
      max = queue;               // Check if maximum
}

The calculated values are then printed once per second using the interval probe manager:

@@interval:*:clock:1000
{
   if ( numIO == 0 )
      numIO = 1;    // Prevent division by 0 when calculating the average
   if ( min > max )
      min = max;
   printf( "%5d  %5d  %5d\n" , min , avg/numIO , max );
   min = 100000;   // Reset variables for the next interval
   avg = 0;
   max = 0;
   numIO = 0;
}
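The bookkeeping of the two clauses (a sentinel for the minimum, a guard against division by zero, a reset after each interval) can be illustrated outside of ProbeVue. The following Python sketch mirrors that logic only; the class name QueueStats is hypothetical and it is not a replacement for the Vue script.

```python
class QueueStats:
    """Mirror of the min/avg/max bookkeeping done per interval."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.low = 100000   # sentinel, larger than any real queue length
        self.high = 0
        self.total = 0
        self.count = 0

    def record(self, cmds_out: int):
        """Called once per started I/O with the number of outstanding I/Os."""
        self.count += 1
        self.total += cmds_out
        self.low = min(self.low, cmds_out)
        self.high = max(self.high, cmds_out)

    def flush(self):
        """Return (min, avg, max) for the interval and reset the counters."""
        count = self.count or 1            # prevent division by zero
        low = min(self.low, self.high)     # no I/O seen: report 0, not the sentinel
        result = (low, self.total // count, self.high)
        self.reset()
        return result


stats = QueueStats()
for outstanding in (1, 9, 2):
    stats.record(outstanding)
busy = stats.flush()
idle = stats.flush()
print(busy, idle)  # (1, 4, 9) (0, 0, 0)
```

As in the Vue script, the average is an integer division, so small averages are rounded down.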

The full script is available for download on our website: ioqueue.e.

Here is a sample run of the script for the disk hdisk13:

# ./ioqueue.e hdisk13
  min    avg    max
    1      1      2
    1      1      9
    1      1      2
    1      1      8
    1      1      2
    1      1      2
    1      1      8
    1      1     10
    1      1      2
    1      1      1
    1      1     10
    1      1      2
    1      1     11
...

The script expects an hdisk as an argument, and then outputs once per second the values determined for the specified hdisk.

In the example output you can see that the maximum number of entries in the disk queue is 11. Increasing the queue_depth attribute therefore makes no sense from a performance perspective.

Here’s another example:

# ./ioqueue.e hdisk21
  min    avg    max
    9     15     20
   11     17     20
   15     19     20
   13     19     20
   14     19     20
   17     18     20
   18     18     19
   16     19     20
   13     18     20
   18     19     19
   17     19     20
   18     19     20
   17     19     19
...

In this case, the maximum value of 20 (hdisk21 has a queue_depth of 20) is reached on a regular basis. Increasing the queue_depth can improve throughput in this case.
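The comparison of the max column with the queue_depth can also be automated. The following Python sketch is a hedged illustration that assumes the three-column output format of ioqueue.e shown above; the function names parse_samples and saturation are hypothetical.

```python
def parse_samples(output):
    """Extract (min, avg, max) tuples from ioqueue.e output, skipping the header."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and all(p.isdigit() for p in parts):
            rows.append(tuple(int(p) for p in parts))
    return rows


def saturation(samples, queue_depth):
    """Fraction of intervals in which the maximum reached queue_depth."""
    if not samples:
        return 0.0
    hits = sum(1 for _low, _avg, high in samples if high >= queue_depth)
    return hits / len(samples)


HDISK21 = """\
  min    avg    max
    9     15     20
   11     17     20
   18     18     19
   18     19     19
"""
samples = parse_samples(HDISK21)
print(saturation(samples, queue_depth=20))  # 0.5
```

A value close to 0 suggests that the queue never fills, while a high value suggests that a larger queue_depth is worth testing.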

Of course, the sample script can be extended in various ways, for example to also determine the throughput, the waiting time of I/Os in the wait queue, or even the position and size of each I/O on the disk. This example just shows how easy it is to get information about I/Os using ProbeVue.

Numbers: FC World Wide Names (WWNs)

Most of us know WWNs as 64-bit WWNs, written as 16 hexadecimal digits. It is less well known that there are different WWN formats and that there are also 128-bit WWNs. This article therefore briefly presents the different formats of WWNs.

The basic structure of 64-bit WWNs looks like this:

+---+----------------+
|NAA| NAME           |
+---+----------------+
4-bit 60-bit

The 4-bit NAA (Network Address Authority) field specifies the type of address and the format of the address.

There are a number of different possibilities for the 60-bit NAME field.

 

1. Format 1 Address (NAA = 0001)

+---+--------+------------------------+
|NAA|Reserved| 48-bit IEEE MAC Address|
+---+--------+------------------------+
4-bit 12-bit   48-bit

In the Reserved (12-bit) field, all bits must be set to 0!

Example:

1 000 00507605326d (To clarify the format, the fields are separated by spaces)

 

2. Format 2 Address (NAA = 0010)

+---+---------------+-----------------------+
|NAA|Vendor Assigned|48-bit IEEE MAC Address|
+---+---------------+-----------------------+
4-bit  12-bit         48-bit

The 12-bit “Vendor Assigned” field can be used arbitrarily by the manufacturer.

Example:

2 001 00507605326d (To clarify the format, the fields are separated by spaces)

 

3. Format 3 Address (NAA = 0011)

+---+-----------------+
|NAA|Vendor Assigned  |
+---+-----------------+
4-bit 60-bit

The field “Vendor Assigned” (60-bit) is assigned arbitrarily by the manufacturer. Thus, this type of address is not unique worldwide and therefore usually not found in practice.

Example:

3 0123456789abcde (To clarify the format, the fields are separated by spaces)

 

4. Format 4 Address (NAA = 0100)

+---+---------+--------------+
|NAA|Reserved | IPv4 Address |
+---+---------+--------------+
4-bit 28-bit     32-bit

The “IPv4 Address” (32-bit) field contains a 32-bit IPv4 address.

Example for IP 10.0.0.1:

4 0000000 0a000001 (To clarify the format, the fields are separated by spaces)

 

5. Format 5 Address (NAA = 0101)

+---+-------+-----------------+
|NAA| OUI   | Vendor Assigned |
+---+-------+-----------------+
4-bit 24-bit 36-bit

The OUI (24-bit) field contains the 24-bit IEEE-assigned ID (Organizationally Unique Identifier).

The field “Vendor Assigned” (36-bit) can be assigned arbitrarily by the manufacturer.

Example:

5 005076 012345678 (To clarify the format, the fields are separated by spaces)

 

6. Format 6 Address (NAA = 0110)

Format 6 addresses are 128-bit addresses and are often used for LUNs on the SAN.

+---+-------+---------------+-------------------------+
|NAA|  OUI  |Vendor Assigned|Vendor Assigned Extension|
+---+-------+---------------+-------------------------+
4-bit 24-bit  36-bit          64-bit

The OUI (24-bit) field contains the 24-bit ID assigned by the IEEE.

The field “Vendor Assigned” (36-bit) can be arbitrarily assigned by the manufacturer.

The field “Vendor Assigned Extension” (64-bit) can also be assigned arbitrarily by the manufacturer.

Example:

6 005076 012345678 0123456789abcdef (To clarify the format, the fields are separated by spaces)

 

7. IEEE EUI-64 Address (NAA=11)

In the case of this address format, the NAA field is shortened to only 2 bits, where NAA is 11.

+---+-------------+---------------+
|NAA|OUI shortened|Vendor Assigned|
+---+-------------+---------------+
2-bit 22-bit       40-bit

The “OUI shortened” field (22-bit) is a 22-bit shortened version of the IEEE-assigned 24-bit ID.

(The two least significant bits of the first byte are omitted and the remaining 6 bits are shifted 2 bits to the right to make room for the two NAA bits.)

The field “Vendor Assigned” (40-bit) can be arbitrarily assigned by the manufacturer.

These types of addresses are often used in the area of virtualization, e.g. when it comes to NPIV (N_Port ID Virtualization).

Example:

c05076 0123456789 (To clarify the format, the fields are separated by spaces)
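The field layouts listed above can be turned into a small decoder. The following Python sketch is only an illustration based on the layouts in this article; the names LAYOUTS, EUI64 and decode_wwn are hypothetical.

```python
# Field layouts from most to least significant bit: (field name, width in bits).
LAYOUTS = {
    0x1: ("Format 1", [("naa", 4), ("reserved", 12), ("mac", 48)]),
    0x2: ("Format 2", [("naa", 4), ("vendor_assigned", 12), ("mac", 48)]),
    0x3: ("Format 3", [("naa", 4), ("vendor_assigned", 60)]),
    0x4: ("Format 4", [("naa", 4), ("reserved", 28), ("ipv4", 32)]),
    0x5: ("Format 5", [("naa", 4), ("oui", 24), ("vendor_assigned", 36)]),
    0x6: ("Format 6", [("naa", 4), ("oui", 24), ("vendor_assigned", 36),
                       ("vendor_extension", 64)]),
}
EUI64 = ("IEEE EUI-64", [("naa", 2), ("oui_shortened", 22), ("vendor_assigned", 40)])


def decode_wwn(wwn: str):
    """Return (format name, dict of field values) for a 64- or 128-bit WWN."""
    digits = "".join(c for c in wwn if c not in " :-")
    value = int(digits, 16)
    total_bits = len(digits) * 4
    if total_bits not in (64, 128):
        raise ValueError("a WWN has 16 or 32 hexadecimal digits")
    first_nibble = value >> (total_bits - 4)
    if first_nibble >> 2 == 0b11:          # the two leading bits are 11
        name, layout = EUI64
    elif first_nibble in LAYOUTS:
        name, layout = LAYOUTS[first_nibble]
    else:
        raise ValueError(f"unknown NAA value {first_nibble:#x}")
    if sum(width for _, width in layout) != total_bits:
        raise ValueError(f"{name} does not match a {total_bits}-bit WWN")
    fields, shift = {}, total_bits
    for field, width in layout:
        shift -= width
        fields[field] = (value >> shift) & ((1 << width) - 1)
    return name, fields


name, fields = decode_wwn("c05076 0123456789")
print(name, hex(fields["oui_shortened"]), hex(fields["vendor_assigned"]))
```

For the NPIV-style example the decoder reports an IEEE EUI-64 address with the shortened OUI 0x5076. Spaces, colons and hyphens in the input are ignored, so the examples from this article can be passed in unchanged.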

 

 

Full file system: df and du show different space usage

Full file systems occur in practice again and again, as everyone knows. Usually you search for large files or directories and check whether older data can be deleted to free up space (sometimes the file system is simply enlarged without further investigation). In some cases, however, you cannot find any larger files that could be deleted, or you discover that file system space seems to be gone but you cannot identify where it is used. The command du then displays a smaller value for the used file system space than df. In the following, such an example is shown, together with a hint on how to identify where the file system space went and how it can finally be recovered. AIX has a nice feature to offer here that is not found in most other UNIX derivatives.

The file system /var/adm/log is 91% filled, currently 3.6 GB of the file system are in use:

# df -g  /var/adm/log
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/varadmloglv      4.00      0.39   91%      456     1% /var/adm/log
#

A quick check with the command du shows that apparently much less space is occupied:

# du -sm /var/adm/log
950.21   /var/adm/log
#

The du command (“disk usage”) shows only 950 MB of occupied space! This is 2.7 GB less than the value from the df command. But where is the missing space?

The difference comes from files that have been deleted but are still held open by at least one process. The entry for such a file is removed from the associated directory, which makes the file inaccessible. Therefore the command du does not take these files into account and comes up with a smaller value. As long as a process still has the deleted file in use, however, the associated blocks are not released in the file system, so df correctly displays them as occupied.
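The effect is easy to reproduce. The following Python sketch is only a demonstration on a POSIX file system (it uses a temporary file, not /var/adm/log); the function name demo_deleted_but_open is hypothetical.

```python
import os
import tempfile


def demo_deleted_but_open(size: int = 4096):
    """Delete a file while it is still open and inspect it via its descriptor."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * size)
        os.unlink(path)                 # remove the directory entry
        visible = os.path.exists(path)  # the name is gone, so du cannot see the file
        info = os.fstat(fd)             # the open descriptor still refers to the data
        return visible, info.st_nlink, info.st_size
    finally:
        os.close(fd)                    # only now can the blocks be released


result = demo_deleted_but_open()
print(result)
```

The file is no longer visible under its name and has a link count of 0, yet its full size is still reported through the open file descriptor. This is exactly the state in which df and du disagree.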

So there is at least one file in the file system /var/adm/log which has been deleted but is still open by a process. The question is how to identify the process and the file.

AIX provides an easy way to identify processes that have opened deleted files: the fuser command supports the -d option, which lists processes that have deleted files open:

# fuser -d /var/adm/log
/var/adm/log:  9110638
#

If you add the -V option, you will also see information about the deleted files, such as the inode number and file size:

# fuser -dV /var/adm/log
/var/adm/log:
inode=119    size=2882647606   fd=12     9110638
#

The output shows that the file with inode number 119 and a size of approximately 2.7 GB (2882647606 bytes, which matches the missing space) was deleted but is still held open by the process with PID 9110638 via file descriptor 12.

Using ps you can quickly find out which process it is:

# ps -ef|grep 9110638
    root  9110638  1770180   0   Nov 20      - 28:28 /usr/sbin/syslogd
    root  8193550  8849130   0 09:13:35  pts/2  0:00 grep 9110638
#

In our case it is the syslogd process. Presumably a log file was rotated via mv without informing syslogd (refresh -s syslogd). We quickly fix this and check the file system again:

# refresh -s syslogd
0513-095 The request for subsystem refresh was completed successfully.
#
# df -g /var/adm/log
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/varadmloglv      4.00      3.07   24%      455     1% /var/adm/log
#

The output shows that the file system blocks have now been released.
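The numbers from the df and fuser outputs can be cross-checked with a few lines of Python. This is only a worked check, under the assumption that df -g reports units of 2^30 bytes; the variable names are our own.

```python
GIB = 2**30  # assumption: one "GB block" of df -g is 2**30 bytes

deleted_file_bytes = 2882647606              # size reported by fuser -dV
free_before_gb, free_after_gb = 0.39, 3.07   # "Free" column of df -g

freed_gb = free_after_gb - free_before_gb    # space released by the refresh
file_gb = deleted_file_bytes / GIB           # size of the deleted file in the same unit

print(f"freed: {freed_gb:.2f} GB, deleted file: {file_gb:.2f} GB")
```

Both values come out at about 2.68 GB, so the space released by the refresh corresponds to the one deleted log file.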

ProbeVue in Action: Identifying a crashing Process

Recently, our monitoring reported a full /var file system on one of our systems. We detected that core files in the directory /var/adm/core had filled the file system. It quickly turned out that all core files came from perl. However, based on the core files we could not determine which perl script had caused the crash of perl. A look at the timestamps of the core files unfortunately showed no pattern:

-bash-4.4$ ls -ltr /var/adm/core
total 2130240
drwxr-xr-x    2 root     system          256 Jan 29 10:20 lost+found/
-rw-------    1 root     system    100137039 Jun 26 04:51 core.22610328.26025105.Z
-rw-------    1 root     system     99054991 Jun 26 06:21 core.21102892.26042104.Z
-rw-------    1 root     system     99068916 Jun 26 08:06 core.18153840.26060607.Z
-rw-------    1 root     system    100132866 Jun 26 08:21 core.19005848.26062105.Z
-rw-------    1 root     system     97986020 Jun 26 16:36 core.15270246.26143608.Z
-rw-------    1 root     system     99208958 Jun 26 22:21 core.22675838.26202106.Z
-rw-------    1 root     system     97557063 Jun 27 01:06 core.5505292.26230604.Z
-rw-------    1 root     system     98962499 Jun 27 10:06 core.8257960.27080603.Z
-rw-------    1 root     system     99804173 Jun 27 14:51 core.18940202.27125107.Z
-rw-------    1 root     system     99633676 Jun 28 03:21 core.17563960.28012107.Z
-rw-------    1 root     system     99116032 Jun 28 19:06 core.8651210.28170608.Z
-bash-4.4$

Also, the entries in the error report provided no information about the crashing perl script and how it was started.

-bash-4.4$ sudo errpt -j A924A5FC -a
...
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     A924A5FC

Date/Time:       Wed May 29 15:21:25 CEST 2019
Sequence Number: 17548
Machine Id:      XXXXXXXXXXXX
Node Id:         XXXXXXXX
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSPROC        

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
         11
USER'S PROCESS ID:
              13369662
FILE SYSTEM SERIAL NUMBER
           1
INODE NUMBER
                 69639
CORE FILE NAME
/var/adm/core/core.13369662.29132106
PROGRAM NAME
perl
STACK EXECUTION DISABLED
           0
COME FROM ADDRESS REGISTER

PROCESSOR ID
  hw_fru_id: 1
  hw_cpu_id: 19

ADDITIONAL INFORMATION

Unable to generate symptom string.
Too many stack elements.
-bash-4.4$

The only information that could be found was that the processes were terminated with signal 11 (SIGSEGV), that is, due to an access to an invalid memory address.

The question arose: how can we determine which perl script was crashing and how it was started?

This should be a task for ProbeVue.

The sysproc provider, which generates an event in case of an exit of a process, seemed to be the right probe to use. The special built-in variable __exitinfo provides more detailed information about the exit, such as exit status or the signal number that terminated the process. This can be used to write the following clause:

1: @@sysproc:exit:*
2: when ( __exitinfo->signo == 11 )
3: {
4:         printf( "%llu:  %s\n" , __pid , __pname );
5:         ptree(10);
6: }

The 6 lines are briefly explained here:

  1. The probe point: provider is sysproc, event is exit, * means any process.
  2. By using the above predicate, the subsequent action block is executed only when the process was terminated with signal 11 (SIGSEGV).
  3. Start of the action block.
  4. Output of the PID and the program name of the process.
  5. The function ptree outputs the parent, grandparent, etc. (up to 10 levels) of the process.
  6. End of the action block.

Unfortunately, no program arguments can be listed here, which in our case would have given the name of the perl script. But at least the function ptree tells us how the program was called, which is sufficient in some cases to ultimately identify the program.

For identification we would like to know the arguments with which perl was called. This information is provided by the syscall provider for the system call execve, with which the program is started. The probe point is thus syscall:*:execve:entry, since the arguments are known when entering the function. The signature of execve for ProbeVue looks like this:

int execve( char* , struct arg_t* args , char* );

Here, the first argument (provided by ProbeVue as __arg1) is the program name. The second argument is a structure with the arguments in question (provided by __arg2). The third argument gives access to environment variables, which is not important in our case. The structure struct arg_t looks like this for 5 arguments:

struct arg_t
{
        union
        {
                char* arg[5];
                int num[5];
        } u;
};

This structure and the signature of execve must be declared in the ProbeVue script before they can be used.

When accessing the arguments, there is another small problem: when the action block for our probe is executed, we are in kernel mode, but the arguments themselves are addresses in the user address space of the process. The data (in this case character strings) must be copied out of the user address space. This is done by the function get_userstring.

For every execve we get the PID, the program name, the command and up to 5 arguments. This is implemented in the following program:

#! /usr/bin/probevue

struct arg_t
{
        union
        {
                char* arg[5];
                int num[5];
        } u;
};

int execve( char* , struct arg_t* args , char* );

@@syscall:*:execve:entry
{
        __auto String command[128];
        __auto String argument[128];
        __auto struct arg_t argv;
        copy_userdata( __arg2 , argv );
        command = get_userstring( __arg1 , -1 );
        argument = get_userstring( argv.u.arg[0] , -1 );
        printf( "%llu: %s called execve(%s) with arguments: %s " , __pid , __pname , command , argument );
        if ( argv.u.num[1] != 0 )
        {
                argument = get_userstring( argv.u.arg[1] , -1 );
                printf( "%s " , argument );
                if ( argv.u.num[2] != 0 )
                {
                        argument = get_userstring( argv.u.arg[2] , -1 );
                        printf( "%s " , argument );
                        if ( argv.u.num[3] != 0 )
                        {
                                argument = get_userstring( argv.u.arg[3] , -1 );
                                printf( "%s " , argument );
                                if ( argv.u.num[4] != 0 )
                                {
                                        argument = get_userstring( argv.u.arg[4] , -1 );
                                        printf( "%s " , argument );
                                }
                        }
                }
        }
        printf( "\n" );
}

@@sysproc:exit:*
when ( __exitinfo->signo == 11 )
{
        printf( "%llu:  %s\n" , __pid , __pname );
        ptree(10);
}

We called the script capture_segv.e and made it executable.

In theory, after startup the program should output all starting programs with PID, name and up to 5 arguments. In addition, an output occurs when a process is terminated with signal 11 (SIGSEGV). The corresponding PID can then be matched further up in the output and thus the program with arguments can be identified.

Unfortunately, the following small problem arises in practice: if a program terminates very quickly after the execve, before ProbeVue can copy the arguments with get_userstring, get_userstring accesses an address that no longer exists and the ProbeVue script is aborted. We worked around this by simply restarting the ProbeVue script over and over in an infinite loop:

# while true; do ./capture_segv.e >>/tmp/wait_for_segv ; done

We then ran the ProbeVue script for a few hours, until the next crash of the perl script. The file /tmp/wait_for_segv contained about 23,000 lines! We have listed only the relevant lines here:

# cat /tmp/wait_for_segv
…
8651210: ksh called execve(xxxx_hacheck.pl) with arguments: xxxx_hacheck.pl -c
8651210: ksh called execve(/var/opt/OV/bin/instrumentation/xxxx_hacheck.pl) with arguments: xxxx_hacheck.pl -c
20054518: ksh called execve(/bin/proc2mon.pl) with arguments: proc2mon.pl
…
8651210:  perl

     PID              CMD
       1              init
                        |
                        V
9634196              ovcd
                        |
                        V
9765232              opcacta
                        |
                        V
8651210              perl    <=======
…

One can see that perl was started via the program opcacta, which was started by ovcd. These processes belong to HP OpenView that is in use here. At the top of the output one can see that the perl script /var/opt/OV/bin/instrumentation/xxxx_hacheck.pl has been started. So we found the script that generates the many core files.
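With about 23,000 lines of output, matching the PID of the crashed process to its execve records by hand is tedious. The following Python sketch shows one way to automate it. It is only an illustration that assumes the two output formats produced by capture_segv.e and that each record is on a single line; the names EXEC_RE, CRASH_RE and find_crashed_commands are hypothetical.

```python
import re

# "PID: caller called execve(command) with arguments: args"
EXEC_RE = re.compile(r"^(\d+): (\S+) called execve\((.*?)\) with arguments: (.*)$")
# "PID:  program" as printed by the sysproc exit clause
CRASH_RE = re.compile(r"^(\d+):\s+(\S+)\s*$")


def find_crashed_commands(lines):
    """Return (pid, program, [(command, arguments), ...]) for each crash record."""
    execs = {}
    crashes = []
    for line in lines:
        line = line.rstrip("\n")
        match = EXEC_RE.match(line)
        if match:
            pid, _caller, command, args = match.groups()
            execs.setdefault(int(pid), []).append((command, args.strip()))
            continue
        match = CRASH_RE.match(line)
        if match:
            pid = int(match.group(1))
            crashes.append((pid, match.group(2), execs.get(pid, [])))
    return crashes


log = [
    "8651210: ksh called execve(xxxx_hacheck.pl) with arguments: xxxx_hacheck.pl -c ",
    "8651210: ksh called execve(/var/opt/OV/bin/instrumentation/xxxx_hacheck.pl) with arguments: xxxx_hacheck.pl -c ",
    "20054518: ksh called execve(/bin/proc2mon.pl) with arguments: proc2mon.pl ",
    "8651210:  perl",
    "9634196              ovcd",
]
crashes = find_crashed_commands(log)
for pid, program, commands in crashes:
    print(pid, program, commands)
```

Because PIDs can be reused on a long-running system, the sketch may also list older execve records of an unrelated process with the same PID; the last entries before the crash record are the relevant ones.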

The script was written recently and obviously needs to be re-examined and reworked.

With the help of ProbeVue, a short script and several hours of waiting were enough to find the cause of the problem! ProbeVue is not only useful for investigating problems; it also proves to be extremely helpful in performance monitoring.

Resources of not activated LPARs and Memory Affinity

When an LPAR is shut down, resources such as processors, memory, and I/O slots are not automatically released by the LPAR. The resources remain assigned to the LPAR and are then reused on the next activation (with the current configuration). In the first part of the article Resources of not activated LPARs we had already looked at this.

(Note: In the example output, we use version 1.4 of the LPAR tool, but in all cases we show the underlying commands on the HMC command line, so you can try everything without using the LPAR tool.)

The example LPAR lpar1 was shut down, but currently still occupies 100 GB of memory:

linux $ lpar status lpar1
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  39       aixlinux  Not Activated  standard  0     inactive  1      0.2         102400  Unknown
linux $

The following commands for the output above were executed on the corresponding HMC hmc01:

hmc01: lssyscfg -r lpar -m ms09 --filter lpar_names=lpar1
hmc01: lshwres -r mem -m ms09 --level lpar --filter lpar_names=lpar1
hmc01: lshwres -r proc -m ms09 --level lpar --filter lpar_names=lpar1

As the output shows, the LPAR lpar1 still holds its resources (processors, memory, I/O adapters).

In order to understand why deactivating an LPAR does not release the resources, you have to look at the “Memory Affinity Score”:

linux $ lpar lsmemopt lpar1
             LPAR_SCORE  
LPAR_NAME  CURR  PREDICTED
lpar1      100   0
linux $

HMC command line:

hmc01: lsmemopt -m ms09 -r lpar -o currscore --filter lpar_names=lpar1

The Memory Affinity Score describes how close processors and memory are to each other: the closer the memory is to the processors, the better the throughput to the memory. The command above indicates, with a value between 1 and 100, how good the affinity between processors and memory is. Our LPAR lpar1 currently has a value of 100, which means the best possible affinity of memory and processors. If the resources were freed when an LPAR is deactivated, the LPAR would lose this Memory Affinity Score. On the next activation, the resulting memory affinity would depend on which memory and processors happen to be available at that time. We now release the resources:

linux $ lpar -d rmprocs lpar1 1
linux $

HMC command line:

hmc01: chhwres -m ms09 -r proc  -o r -p lpar1 --procs 1

No score is reported any more, since the LPAR no longer has any resources allocated:

linux $ lpar lsmemopt lpar1
             LPAR_SCORE  
LPAR_NAME  CURR  PREDICTED
lpar1      none  none
linux $

HMC command line:

hmc01: lsmemopt -m ms09 -r lpar -o currscore --filter lpar_names=lpar1

Now we allocate resources again and look at the effect this has on memory affinity:

linux $ lpar applyprof lpar1 standard
linux $

HMC command line:

hmc01: chsyscfg -r lpar -m ms09 -o apply -p lpar1 -n standard

We again determine the Memory Affinity Score:

linux $ lpar lsmemopt lpar1
             LPAR_SCORE  
LPAR_NAME  CURR  PREDICTED
lpar1      53    0
linux $

HMC command line:

hmc01: lsmemopt -m ms09 -r lpar -o currscore --filter lpar_names=lpar1

The score is now only 53; the performance of the LPAR has become worse. Whether and how much this is noticeable ultimately depends on the applications on the LPAR.

Because the resources are not released when an LPAR is deactivated, the memory affinity remains the same on the next activation (with the current configuration), and thus the performance should be the same as well.

If you release the resources of an LPAR (manually or automatically), you have to be aware that this affects the LPAR when it is activated again later: the resources are then reassigned, and a worse (but possibly also a better) Memory Affinity Score can result.

Conversely, before activating a new LPAR, you can improve the chance of a high Memory Affinity Score for the new LPAR by releasing resources of inactive LPARs.
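Releasing the resources of several inactive LPARs by hand is tedious, so the required HMC commands could be generated from an inventory. The following sketch only prints chhwres command lines and does not contact an HMC. It uses the memory form of the command (chhwres ... -r mem -o r ... -q) from the first part of the article and relies on the observation made there that removing the entire memory releases all resources of an inactive LPAR. The inventory list and the function name release_commands are hypothetical.

```python
def release_commands(ms, lpars):
    """Return chhwres command lines that remove all memory from inactive LPARs.

    Only LPARs in state "Not Activated" that still hold memory are considered.
    """
    cmds = []
    for lpar in lpars:
        if lpar["state"] == "Not Activated" and lpar["mem_mb"] > 0:
            cmds.append(
                f"chhwres -m {ms} -r mem -o r -p {lpar['name']} -q {lpar['mem_mb']}"
            )
    return cmds


# Hypothetical inventory; in practice the values would come from
# lssyscfg and lshwres on the HMC.
lpars = [
    {"name": "lpar1", "state": "Not Activated", "mem_mb": 39680},
    {"name": "lpar2", "state": "Open Firmware", "mem_mb": 102400},
]

for cmd in release_commands("ms09", lpars):
    print(cmd)  # prints: chhwres -m ms09 -r mem -o r -p lpar1 -q 39680
```

Printing the commands instead of executing them leaves room to review them first, since every release can change the Memory Affinity Score of the affected LPAR on its next activation.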

(Note: resource distribution can be changed and improved at runtime using the Dynamic Platform Optimizer DPO. DPO is supported as of POWER8.)

 

Resources of not activated LPARs

When an LPAR is shut down, resources such as processors, memory, and I/O slots are not automatically released by the LPAR. The resources remain assigned to the LPAR and are reused on the next activation (with the current configuration).

The article will show how such resources are automatically released and, if desired, how to manually release resources of an inactive LPAR.

(Note: In the example output, we use version 1.4 of the LPAR tool, but in all cases we show the underlying commands on the HMC command line, so you can try everything without using the LPAR tool.)

The example LPAR lpar1 was shut down, but currently still occupies 100 GB of memory:

linux $ lpar status lpar1
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  39       aixlinux  Not Activated  standard  0     inactive  1      0.2         102400  Unknown
linux $

The following commands for the output above were executed on the corresponding HMC hmc01:

hmc01: lssyscfg -r lpar -m ms09 --filter lpar_names=lpar1
hmc01: lshwres -r mem -m ms09 --level lpar --filter lpar_names=lpar1
hmc01: lshwres -r proc -m ms09 --level lpar --filter lpar_names=lpar1

The resource_config attribute of an LPAR indicates whether the LPAR has currently allocated resources (resource_config=1) or not (resource_config=0):

linux $ lpar status -F resource_config lpar1
1
linux $

Or on the HMC command line:

hmc01: lssyscfg -r lpar -m ms09 --filter lpar_names=lpar1 -F resource_config

The resources allocated by a not activated LPAR can be released in two different ways:
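In a script, the attribute can be turned into a simple yes/no check. The sketch below builds the lssyscfg call shown above and interprets its output. How the command reaches the HMC is left open: run is a hypothetical callable that executes an HMC command and returns its output as a string (for example a wrapper around an SSH session), and fake_run is only a stand-in for demonstration.

```python
def has_resources(run, ms, lpar):
    """Return True if the LPAR currently has resources allocated.

    `run` is a hypothetical callable: it takes an HMC command line and
    returns the command output as a string.
    """
    cmd = f"lssyscfg -r lpar -m {ms} --filter lpar_names={lpar} -F resource_config"
    out = run(cmd).strip()
    if out not in ("0", "1"):
        raise ValueError(f"unexpected resource_config value: {out!r}")
    return out == "1"


def fake_run(cmd):
    """Stand-in runner: returns canned output instead of contacting an HMC."""
    return "1\n"


print(has_resources(fake_run, "ms09", "lpar1"))  # prints: True
```

Passing the runner as a parameter keeps the check testable without access to an HMC.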

  1. Automatic: The resources used are needed by another LPAR, e.g. because memory is expanded dynamically or an LPAR is activated that does not have sufficient resources. In this case, resources are automatically removed from a not activated LPAR. We will show this below with an example.
  2. Manual: The allocated resources are explicitly released by the administrator. This is also shown below in an example.

First we show an example in which resources are automatically taken away from a not activated LPAR.

The managed system ms09 currently has about 36 GB free memory:

linux $ ms lsmem ms09
NAME  INSTALLED  FIRMWARE  CONFIGURABLE  AVAIL  MEM_REGION_SIZE
ms09  786432     33792     786432        36352  256
linux $

HMC command line:

hmc01: lshwres -r mem -m ms09 --level sys

We start an LPAR (lpar2) which was configured with 100 GB of RAM. The managed system has only 36 GB of free RAM and is therefore forced to take resources from inactive LPARs in order to provide the required 100 GB. We start lpar2 with the profile standard and then look at the memory allocation:

linux $ lpar activate -b sms -p standard lpar2
linux $

HMC command line:

hmc01: chsysstate -m ms09 -r lpar -o on -n lpar2 -b sms -f standard
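How much memory the managed system has to reclaim for this activation can be estimated from the figures above. The following sketch computes the shortfall in MB and GB. It is a lower bound only: it ignores any additional firmware memory needed for the new LPAR, and it says nothing about which inactive LPARs the memory is taken from. The function name shortfall_mb is illustrative.

```python
def shortfall_mb(requested_mb, avail_mb):
    """Memory in MB that is missing to satisfy a request (0 if enough is free)."""
    return max(0, requested_mb - avail_mb)


avail_mb = 36352       # AVAIL column of "ms lsmem ms09"
requested_mb = 102400  # 100 GB configured for lpar2

missing = shortfall_mb(requested_mb, avail_mb)
print(missing, missing / 1024)  # prints: 66048 64.5
```

So at least 64.5 GB have to come from LPARs that are not activated.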

Overview of the memory allocation of lpar1 and lpar2:

linux $ lpar status lpar\*
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  4        aixlinux  Not Activated  standard  0     inactive  1      0.2         60160   Unknown
lpar2  8        aixlinux  Open Firmware  standard  0     inactive  1      0.2         102400  Unknown
linux $ ms lsmem ms09
NAME  INSTALLED  FIRMWARE  CONFIGURABLE  AVAIL  MEM_REGION_SIZE
ms09  786432     35584     786432        0      256
linux $

HMC command line:

hmc01: lssyscfg -r lpar -m ms09
hmc01: lshwres -r mem -m ms09 --level lpar
hmc01: lshwres -r proc -m ms09 --level lpar
hmc01: lshwres -r mem -m ms09 --level sys

The LPAR lpar2 has 100 GB RAM, the managed system has no more free memory, and the memory allocated by LPAR lpar1 has been reduced to about 60 GB. Resources allocated by not activated LPARs are automatically released when needed and assigned to other LPARs.

But you can of course also release the resources manually. This is also shown briefly here. We reduce the memory of LPAR lpar1 by 20 GB:

linux $ lpar -d rmmem lpar1 20480
linux $

HMC command line:

hmc01: chhwres -m ms09 -r mem  -o r -p lpar1 -q 20480

As stated, the allocated memory has been reduced by 20 GB:

linux $ lpar status lpar\*
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  4        aixlinux  Not Activated  standard  0     inactive  1      0.2         39680   Unknown
lpar2  8        aixlinux  Open Firmware  standard  0     inactive  1      0.2         102400  Unknown
linux $ ms lsmem ms09
NAME  INSTALLED  FIRMWARE  CONFIGURABLE  AVAIL  MEM_REGION_SIZE
ms09  786432     35584     786432        20480  256
linux $

HMC command line:

hmc01: lssyscfg -r lpar -m ms09
hmc01: lshwres -r mem -m ms09 --level lpar
hmc01: lshwres -r proc -m ms09 --level lpar
hmc01: lshwres -r mem -m ms09 --level sys

The 20 GB are immediately available to the managed system as free memory. If you remove the entire memory or all processors (or processor units), then all resources of an inactive LPAR are released:

linux $ lpar -d rmmem lpar1 39680
linux $

HMC command line:

hmc01: chhwres -m ms09 -r mem  -o r -p lpar1 -q 39680

Here are the resulting memory relations:

linux $ lpar status lpar\*
NAME   LPAR_ID  LPAR_ENV  STATE          PROFILE   SYNC  RMC       PROCS  PROC_UNITS  MEM     OS_VERSION
lpar1  4        aixlinux  Not Activated  standard  0     inactive  0      0.0         0       Unknown
lpar2  8        aixlinux  Open Firmware  standard  0     inactive  1      0.2         102400  Unknown
linux $ ms lsmem ms09
NAME  INSTALLED  FIRMWARE  CONFIGURABLE  AVAIL  MEM_REGION_SIZE
ms09  786432     31232     786432        64512  256
linux $

HMC command line:

hmc01: lssyscfg -r lpar -m ms09
hmc01: lshwres -r mem -m ms09 --level lpar
hmc01: lshwres -r proc -m ms09 --level lpar
hmc01: lshwres -r mem -m ms09 --level sys
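The figures from the three outputs above can be cross-checked with a little arithmetic. The sketch below compares consecutive snapshots of the MEM value of lpar1 and the AVAIL and FIRMWARE values of ms09. The numbers are copied from the outputs; the step labels and the function name deltas are illustrative. That the drop in firmware memory is a consequence of lpar1 giving up all its resources is an interpretation suggested by the figures, not something the outputs state explicitly.

```python
# (step, lpar1 MEM, ms09 AVAIL, ms09 FIRMWARE), all values in MB
snapshots = [
    ("after activating lpar2", 60160, 0, 35584),
    ("after rmmem 20480", 39680, 20480, 35584),
    ("after rmmem 39680", 0, 64512, 31232),
]


def deltas(before, after):
    """Changes between two snapshots, each expressed as a positive gain."""
    _, mem_b, avail_b, fw_b = before
    _, mem_a, avail_a, fw_a = after
    return {
        "released": mem_b - mem_a,        # memory given up by lpar1
        "avail_gain": avail_a - avail_b,  # additional free memory on ms09
        "firmware_drop": fw_b - fw_a,     # memory no longer used by firmware
    }


for before, after in zip(snapshots, snapshots[1:]):
    d = deltas(before, after)
    balanced = d["avail_gain"] == d["released"] + d["firmware_drop"]
    print(after[0], d, balanced)
```

In the first step, the 20480 MB released appear one-to-one as free memory. In the second step, the free memory grows by 44032 MB: the 39680 MB released by lpar1 plus 4352 MB by which the FIRMWARE value drops.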

The LPAR lpar1 now has 0 processors, 0.0 processor units and 0 MB of memory! In addition, the resource_config attribute now has the value 0, which indicates that the LPAR no longer has any resources configured!

linux $ lpar status -F resource_config lpar1
0
linux $

HMC command line:

hmc01: lssyscfg -r lpar -m ms09 --filter lpar_names=lpar1 -F resource_config

Finally, the question arises as to why you should release resources manually if they are automatically released by the managed system when needed?

We will answer this question in a second article.