Recently, our monitoring reported a full /var file system on one of our systems. We detected that core files in the directory /var/adm/core had filled the file system. It quickly turned out that all core files came from perl. However, based on the core files we could not determine which perl script had caused the crash of perl. A look at the timestamps of the core files unfortunately showed no pattern:
-bash-4.4$ ls -ltr /var/adm/core total 2130240 drwxr-xr-x 2 root system 256 Jan 29 10:20 lost+found/ -rw------- 1 root system 100137039 Jun 26 04:51 core.22610328.26025105.Z -rw------- 1 root system 99054991 Jun 26 06:21 core.21102892.26042104.Z -rw------- 1 root system 99068916 Jun 26 08:06 core.18153840.26060607.Z -rw------- 1 root system 100132866 Jun 26 08:21 core.19005848.26062105.Z -rw------- 1 root system 97986020 Jun 26 16:36 core.15270246.26143608.Z -rw------- 1 root system 99208958 Jun 26 22:21 core.22675838.26202106.Z -rw------- 1 root system 97557063 Jun 27 01:06 core.5505292.26230604.Z -rw------- 1 root system 98962499 Jun 27 10:06 core.8257960.27080603.Z -rw------- 1 root system 99804173 Jun 27 14:51 core.18940202.27125107.Z -rw------- 1 root system 99633676 Jun 28 03:21 core.17563960.28012107.Z -rw------- 1 root system 99116032 Jun 28 19:06 core.8651210.28170608.Z -bash-4.4$
Also, the entries in the error report provided no information about the crashing perl script and how it was started.
-bash-4.4$ sudo errpt -j A924A5FC –a ... --------------------------------------------------------------------------- LABEL: CORE_DUMP IDENTIFIER: A924A5FC Date/Time: Wed May 29 15:21:25 CEST 2019 Sequence Number: 17548 Machine Id: XXXXXXXXXXXX Node Id: XXXXXXXX Class: S Type: PERM WPAR: Global Resource Name: SYSPROC Description SOFTWARE PROGRAM ABNORMALLY TERMINATED Probable Causes SOFTWARE PROGRAM User Causes USER GENERATED SIGNAL Recommended Actions CORRECT THEN RETRY Failure Causes SOFTWARE PROGRAM Recommended Actions RERUN THE APPLICATION PROGRAM IF PROBLEM PERSISTS THEN DO THE FOLLOWING CONTACT APPROPRIATE SERVICE REPRESENTATIVE Detail Data SIGNAL NUMBER 11 USER'S PROCESS ID: 13369662 FILE SYSTEM SERIAL NUMBER 1 INODE NUMBER 69639 CORE FILE NAME /var/adm/core/core.13369662.29132106 PROGRAM NAME perl STACK EXECUTION DISABLED 0 COME FROM ADDRESS REGISTER PROCESSOR ID hw_fru_id: 1 hw_cpu_id: 19 ADDITIONAL INFORMATION Unable to generate symptom string. Too many stack elements. -bash-4.4$
The only information that could be found was that the processes were terminated with signal 11 (SIGSEGV), that is, due to an access to an invalid memory address.
The question arose: how to determine which perl script was crashing and how it is started.
This should be a task for ProbeVue.
The sysproc provider, which generates an event in case of an exit of a process, seemed to be the right probe to use. The special built-in variable __exitinfo provides more detailed information about the exit, such as exit status or the signal number that terminated the process. This can be used to write the following clause:
1: @@sysproc:exit:* 2: when ( __exitinfo->signo == 11 ) 3: { 4: printf( "%llu: %s\n" , __pid , __pname ); 5: ptree(10); 6: }
The 6 lines are briefly explained here:
- The probe point: provider is sysproc, event is exit, * means any process.
- By using the above predicate, the subsequent action block is executed only when the process was terminated with signal 11 (SIGSEGV).
- Start of the action block.
- Output of the PID and the program name of the process.
- The function ptree outputs the father, grandfather, etc. (up to 10 levels) of the process.
- End of the action block.
Unfortunately, no program arguments can be listed here, which in our case would have given the name of the perl script. But at least we get informations from the function ptree about the way the program was called, which is sufficient in some cases to ultimately identify the program.
For identification we would like to have the information about the arguments with which perl was called. This information is provided by the syscall provider for the system call execve with which the program is started. The probe point is thus syscall: *: execve: entry, since the arguments are known when entering the function. The signature of execve for ProbeVue looks like this:
int execve( char* , struct arg_t* args , char* );
Here, the first argument (provided by ProbeVue as __arg1) is the program name. The second argument is a structure with the arguments in question (provided by __arg2). The third argument gives access to environment variables, which is not important in our case. The structure struct arg_t looks like this for 5 arguments:
struct arg_t { union { char* arg[5]; int num[5]; } u; };
This structure and the signature of execve must be declared in the ProbeVue script before they can be used.
When accessing the arguments, there is another small problem: if the action block for our sample is addressed, we are in kernel mode, but the arguments themselves are addresses in the user mode of the process. The data (in this case character strings) must be copied out of the user address space. This is done by the function get_userstring.
For every execve we get the PID, the program name, the command and up to 5 arguments. This is implemented in the following program:
#! /usr/bin/probevue struct arg_t { union { char* arg[5]; int num[5]; } u; }; int execve( char* , struct arg_t* args , char* ); @@syscall:*:execve:entry { __auto String command[128]; __auto String argument[128]; __auto struct arg_t argv; copy_userdata( __arg2 , argv ); command = get_userstring( __arg1 , -1 ); argument = get_userstring( argv.u.arg[0] , -1 ); printf( "%llu: %s called execve(%s) with arguments: %s " , __pid , __pname , command , argument ) ; if ( argv.u.num[1] != 0 ) { argument = get_userstring( argv.u.arg[1] , -1 ); printf( "%s " , argument ); if ( argv.u.num[2] != 0 ) { argument = get_userstring( argv.u.arg[2] , -1 ); printf( "%s " , argument ); if ( argv.u.num[3] != 0 ) { argument = get_userstring( argv.u.arg[3] , -1 ); printf( "%s " , argument ); if ( argv.u.num[4] != 0 ) { argument = get_userstring( argv.u.arg[4] , -1 ); printf( "%s " , argument ); } } } } printf( "\n" ); } @@sysproc:exit:* when ( __exitinfo->signo == 11 ) { printf( "%llu: %s\n" , __pid , __pname ); ptree(10); }
We called the script capture_segv.e and made it executable.
In theory, after startup the program should output all starting programs with PID, name and up to 5 arguments. In addition, an output occurs when a process is terminated with signal 11 (SIGSEGV). The corresponding PID can then be matched further up in the output and thus the program with arguments can be identified.
Unfortunately, the following small problem arises in practice: if a program is terminated very quickly after the execve, before ProbeVue can copy the arguments with get_userstring, get_userstring accesses a no longer existing address and the ProbeVue script is aborted. We bypassed this by simply starting the ProbeVue script over and over in an infinite loop:
# while true; do ./capture_segv.e >>/tmp/wait_for_segv ; done
We then ran the ProbeVue script for a few hours, until the next crash of the perl script. The file /tmp/wait_for_segv contained about 23,000 lines! We have listed only the relevant lines here:
# cat /tmp/wait_for_segv … 8651210: ksh called execve(xxxx_hacheck.pl) with arguments: xxxx_hacheck.pl -c 8651210: ksh called execve(/var/opt/OV/bin/instrumentation/xxxx_hacheck.pl) with arguments: xxxx_hacheck .pl -c 20054518: ksh called execve(/bin/proc2mon.pl) with arguments: proc2mon.pl … 8651210: perl PID CMD 1 init | V 9634196 ovcd | V 9765232 opcacta | V 8651210 perl <======= …
One can see that perl was started via the program opcacta, which was started by ovcd. These processes belong to HP OpenView that is in use here. At the top of the output one can see that the perl script /var/opt/OV/bin/instrumentation/xxxx_hacheck.pl has been started. So we found the script that generates the many core files.
The script was written recently and obviously needs to be re-examined and reworked.
With the help of ProbeVue a short script and several hours of waiting was enough to find the cause of the problem! ProbeVue is not only useful for investigating problems; also in performance monitoring, ProbeVue proves to be extremely helpful.
More Articles on ProbeVue
ProbeVue: Practical Introduction
ProbeVue: Practical Introduction II