ProbeVue in Action: Identifying a crashing Process

Recently, our monitoring reported a full /var file system on one of our systems. We detected that core files in the directory /var/adm/core had filled the file system. It quickly turned out that all core files came from perl. However, based on the core files we could not determine which perl script had caused the crash of perl. A look at the timestamps of the core files unfortunately showed no pattern:

-bash-4.4$ ls -ltr /var/adm/core
total 2130240
drwxr-xr-x    2 root     system          256 Jan 29 10:20 lost+found/
-rw-------    1 root     system    100137039 Jun 26 04:51 core.22610328.26025105.Z
-rw-------    1 root     system     99054991 Jun 26 06:21 core.21102892.26042104.Z
-rw-------    1 root     system     99068916 Jun 26 08:06 core.18153840.26060607.Z
-rw-------    1 root     system    100132866 Jun 26 08:21 core.19005848.26062105.Z
-rw-------    1 root     system     97986020 Jun 26 16:36 core.15270246.26143608.Z
-rw-------    1 root     system     99208958 Jun 26 22:21 core.22675838.26202106.Z
-rw-------    1 root     system     97557063 Jun 27 01:06 core.5505292.26230604.Z
-rw-------    1 root     system     98962499 Jun 27 10:06 core.8257960.27080603.Z
-rw-------    1 root     system     99804173 Jun 27 14:51 core.18940202.27125107.Z
-rw-------    1 root     system     99633676 Jun 28 03:21 core.17563960.28012107.Z
-rw-------    1 root     system     99116032 Jun 28 19:06 core.8651210.28170608.Z
-bash-4.4$

Also, the entries in the error report provided no information about the crashing perl script and how it was started.

-bash-4.4$ sudo errpt -j A924A5FC –a
...
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     A924A5FC

Date/Time:       Wed May 29 15:21:25 CEST 2019
Sequence Number: 17548
Machine Id:      XXXXXXXXXXXX
Node Id:         XXXXXXXX
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSPROC        

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
         11
USER'S PROCESS ID:
              13369662
FILE SYSTEM SERIAL NUMBER
           1
INODE NUMBER
                 69639
CORE FILE NAME
/var/adm/core/core.13369662.29132106
PROGRAM NAME
perl
STACK EXECUTION DISABLED
           0
COME FROM ADDRESS REGISTER

PROCESSOR ID
  hw_fru_id: 1
  hw_cpu_id: 19

ADDITIONAL INFORMATION

Unable to generate symptom string.
Too many stack elements.
-bash-4.4$

The only information that could be found was that the processes were terminated with signal 11 (SIGSEGV), that is, due to an access to an invalid memory address.

The question arose: how to determine which perl script was crashing and how it is started.

This should be a task for ProbeVue.

The sysproc provider, which generates an event in case of an exit of a process, seemed to be the right probe to use. The special built-in variable __exitinfo provides more detailed information about the exit, such as exit status or the signal number that terminated the process. This can be used to write the following clause:

1: @@sysproc:exit:*
2: when ( __exitinfo->signo == 11 )
3: {
4:         printf( "%llu:  %s\n" , __pid , __pname );
5:         ptree(10);
6: }

The 6 lines are briefly explained here:

  1. The probe point: provider is sysproc, event is exit, * means any process.
  2. By using the above predicate, the subsequent action block is executed only when the process was terminated with signal 11 (SIGSEGV).
  3. Start of the action block.
  4. Output of the PID and the program name of the process.
  5. The function ptree outputs the father, grandfather, etc. (up to 10 levels) of the process.
  6. End of the action block.

Unfortunately, no program arguments can be listed here, which in our case would have given the name of the perl script. But at least we get informations from the function ptree about the way the program was called, which is sufficient in some cases to ultimately identify the program.

For identification we would like to have the information about the arguments with which perl was called. This information is provided by the syscall provider for the system call execve with which the program is started. The probe point is thus syscall: *: execve: entry, since the arguments are known when entering the function. The signature of execve for ProbeVue looks like this:

int execve( char* , struct arg_t* args , char* );

Here, the first argument (provided by ProbeVue as __arg1) is the program name. The second argument is a structure with the arguments in question (provided by __arg2). The third argument gives access to environment variables, which is not important in our case. The structure struct arg_t looks like this for 5 arguments:

struct arg_t
{
        union
        {
                char* arg[5];
                int num[5];
        } u;
};

This structure and the signature of execve must be declared in the ProbeVue script before they can be used.

When accessing the arguments, there is another small problem: if the action block for our sample is addressed, we are in kernel mode, but the arguments themselves are addresses in the user mode of the process. The data (in this case character strings) must be copied out of the user address space. This is done by the function get_userstring.

For every execve we get the PID, the program name, the command and up to 5 arguments. This is implemented in the following program:

#! /usr/bin/probevue

struct arg_t
{
        union
        {
                char* arg[5];
                int num[5];
        } u;
};

int execve( char* , struct arg_t* args , char* );

@@syscall:*:execve:entry
{
        __auto String command[128];
        __auto String argument[128];
        __auto struct arg_t argv;
        copy_userdata( __arg2 , argv );
        command = get_userstring( __arg1 , -1 );
        argument = get_userstring( argv.u.arg[0] , -1 );
        printf( "%llu: %s called execve(%s) with arguments: %s " , __pid , __pname , command , argument )
;
        if ( argv.u.num[1] != 0 )
        {
                argument = get_userstring( argv.u.arg[1] , -1 );
                printf( "%s " , argument );
                if ( argv.u.num[2] != 0 )
                {
                        argument = get_userstring( argv.u.arg[2] , -1 );
                        printf( "%s " , argument );
                        if ( argv.u.num[3] != 0 )
                        {
                                argument = get_userstring( argv.u.arg[3] , -1 );
                                printf( "%s " , argument );
                                if ( argv.u.num[4] != 0 )
                                {
                                        argument = get_userstring( argv.u.arg[4] , -1 );
                                        printf( "%s " , argument );
                                }
                        }
                }
        }
        printf( "\n" );
}

@@sysproc:exit:*
when ( __exitinfo->signo == 11 )
{
        printf( "%llu:  %s\n" , __pid , __pname );
        ptree(10);
}

We called the script capture_segv.e and made it executable.

In theory, after startup the program should output all starting programs with PID, name and up to 5 arguments. In addition, an output occurs when a process is terminated with signal 11 (SIGSEGV). The corresponding PID can then be matched further up in the output and thus the program with arguments can be identified.

Unfortunately, the following small problem arises in practice: if a program is terminated very quickly after the execve, before ProbeVue can copy the arguments with get_userstring, get_userstring accesses a no longer existing address and the ProbeVue script is aborted. We bypassed this by simply starting the ProbeVue script over and over in an infinite loop:

# while true; do ./capture_segv.e >>/tmp/wait_for_segv ; done

We then ran the ProbeVue script for a few hours, until the next crash of the perl script. The file /tmp/wait_for_segv contained about 23,000 lines! We have listed only the relevant lines here:

# cat /tmp/wait_for_segv
…
8651210: ksh called execve(xxxx_hacheck.pl) with arguments: xxxx_hacheck.pl -c
8651210: ksh called execve(/var/opt/OV/bin/instrumentation/xxxx_hacheck.pl) with arguments: xxxx_hacheck
.pl -c
20054518: ksh called execve(/bin/proc2mon.pl) with arguments: proc2mon.pl
…
8651210:  perl

     PID              CMD
       1              init
                        |
                        V
9634196              ovcd
                        |
                        V
9765232              opcacta
                        |
                        V
8651210              perl    <=======
…

One can see that perl was started via the program opcacta, which was started by ovcd. These processes belong to HP OpenView that is in use here. At the top of the output one can see that the perl script /var/opt/OV/bin/instrumentation/xxxx_hacheck.pl has been started. So we found the script that generates the many core files.

The script was written recently and obviously needs to be re-examined and reworked.

With the help of ProbeVue a short script and several hours of waiting was enough to find the cause of the problem! ProbeVue is not only useful for investigating problems; also in performance monitoring, ProbeVue proves to be extremely helpful.