Cron jobs are not started anymore

Recently, cron jobs were no longer being started on one of our AIX systems. There was no entry in the error report, and syslog gave no indication of the problem either. The log of the cron daemon, however, contained a lot of messages like these:

# cat /var/adm/cron/log
...
! c queue max run limit reached Sat Feb 23 08:49:00 2019
! rescheduling a cron job Sat Feb 23 08:49:00 2019
...

On AIX, the number of simultaneously active cron jobs is limited to 100 by default. This limit had obviously been reached on our system. New jobs are then, by default, rescheduled to run 60 seconds later. Both values can be configured via the file /var/adm/cron/queuedefs. The value 100 is already quite high, and reaching it usually indicates a problem.
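
For illustration, the limits for the cron queue (queue c) could be raised with an entry along the following lines in /var/adm/cron/queuedefs (the values 200 and 90 are only an example):

c.200j2n90w

Here 200j is the maximum number of simultaneously running jobs, 2n the nice value and 90w the number of seconds to wait before a rescheduled job is retried.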

The PID of the cron daemon was quickly determined:

$ ps -ef|grep cron
    root  6684924        1   0   Sep 26      -  8:03 /usr/sbin/cron
$

The currently active cron jobs run as cron's child processes. With the option “-T” of the ps command, we can quickly list all children:

$ ps -T 6684924
      PID    TTY  TIME CMD
  6684924      -  8:03 cron
  3276876      -  0:00    |\--perl
  9961588      -  0:00    |    \--mount
 12714002      -  0:07    |        \--nfsmnthelp
  3604516      -  0:00    |\--perl
 20185130      -  0:00    |    \--mount
 10158264      -  0:35    |        \--nfsmnthelp
  4587542      -  0:00    |\--perl
...

It is immediately noticeable that the same pattern repeats again and again: cron started a perl program over and over, each instance tried to mount a file system via NFS, the mount did not succeed (no answer from the NFS server), and the perl script hung. Since the script was restarted again and again, at some point there were 100 active cron jobs, and from that moment on no further cron jobs were started. We briefly count the active perl processes:

$ ps -T 6684924 |grep perl |wc -l
     100
$

There are exactly 100 perl processes started by cron. We terminate some of the hanging perl processes:

# kill 3276876 3604516  4587542
#

A look at the end of the cron log file shows that the jobs have been terminated, and after a short while the first newly started cron job appears:

# tail -f /var/adm/cron/log
…
Cron Job with pid: 3276876 Failed
Cron Job with pid: 3604516 Failed
Cron Job with pid: 4587542 Failed
mqm       : CMD ( /appdata/mqm/admin/bin/checks/checkXmitMonitoring.sh >>/appdata/mqm/tracks/logs/scheduler/checkXmitMonitoring.fatal 2>&1 ) : PID ( 28442840 ) : Mon Feb 25 10:34:00 2019
…
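
The remaining hanging perl jobs can also be terminated in one go; a sketch (review the PID list before killing, the pattern perl matches our specific case):

# ps -T 6684924 | grep perl | awk '{ print $1 }' | xargs kill
#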

After cleaning up the remaining hanging processes this way, we restart the cron daemon for safety's sake by simply terminating it:

# kill 6684924
#

The cron daemon is automatically restarted thanks to an /etc/inittab entry:

# lsitab cron
cron:23456789:respawn:/usr/sbin/cron
#

After cron works again, the perl script that ultimately caused cron to pile up should be examined. For scripts started by cron it is generally advisable to check at startup whether a previous instance is still running and, if so, not to start another one; a sketch of such a guard follows below.
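
A minimal sketch of such a guard in ksh, using a lock file (the file name and the script around it are only an example, not the actual job):

#!/bin/ksh
LOCKFILE=/tmp/nfs_mount_job.pid    # example path

# If a previous instance is still running, do not start a second one.
if [ -f "$LOCKFILE" ] && kill -0 "$(cat "$LOCKFILE")" 2>/dev/null; then
    exit 0
fi
echo $$ >"$LOCKFILE"

# ... actual work of the cron job (e.g. the NFS mount) ...

rm -f "$LOCKFILE"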

Automatic creation of home directories

Under AIX there are several ways to create missing home directories automatically at login. This is especially useful if the user accounts are managed through LDAP or another naming service and are not created locally. A user newly created in LDAP initially has no home directory on the AIX LDAP client:

$ ssh new_user@aix01
...
Could not chdir to home directory /home/new_user: No such file or directory
$ pwd
/
$ exit
$

Probably the easiest way to create the home directory automatically at login is the attribute mkhomeatlogin in the file /etc/security/login.cfg. If the attribute is not set, the default is “false”:

# lssec -f /etc/security/login.cfg -s usw -a mkhomeatlogin
usw mkhomeatlogin=
# 

The attribute can be set to true with the chsec command:

# chsec -f /etc/security/login.cfg -s usw -a mkhomeatlogin=true
# lssec -f /etc/security/login.cfg -s usw -a mkhomeatlogin
usw mkhomeatlogin=true
#

We try the login again:

$ ssh new_user@aix01
...
$ pwd
/home/new_user
$

A new home directory has been created for the user.

“who -r” does not return run level

On one of our systems, the command “who -r” did not return run level information. No error message was shown:

$ who -r
$ echo $?
0
$

As a consequence, an install script terminated with an error, since it was not able to determine the run level.
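
The install script itself is not shown here, but a typical way to read the run level looks roughly like this (hypothetical sketch); if who -r prints nothing, the variable stays empty and the script aborts:

RUNLEVEL=$(who -r | awk '{ print $3 }')
if [ -z "$RUNLEVEL" ]; then
    echo "unable to determine run level" >&2
    exit 1
fi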

The information about the run level comes from the binary log file /etc/utmp. The run level is stored as the second record in this file. We assumed that /etc/utmp contained corrupt records.

The command /usr/sbin/acct/fwtmp (from the fileset bos.acct) can be used to convert binary utmp records to ASCII (and vice versa). The command expects the records to convert on standard input. In our case we got:

$ cat /etc/utmp | /usr/sbin/acct/fwtmp
                        system boot   2     0 0000 0000 1484666008                                  Tue Jan 17 16:13:28 CET 2017
root                                  0 804397248 0000 0000          0 \ufffd{\ufffd\ufffd                             Thu Jan  1 01:00:00 CET 1970
         naudio                       8 3473526 0000 0000 1484666008                                  Tue Jan 17 16:13:28 CET 2017
         naudio2                      8 3539068 0000 0000 1484666008                                  Tue Jan 17 16:13:28 CET 2017
...

The output above confirmed that the second record was corrupt, since it obviously did not contain the run level. A comparison with the entries from a working system showed what the correct records should look like:

                        system boot   2     0 0000 0000 1545044734                                  Mon Dec 17 12:05:34 2018
                        run-level 2   1     0 0062 0123 1545044734                                  Mon Dec 17 12:05:34 2018

First of all we made a copy of the corrupt /etc/utmp. Then we created an ASCII version using the above fwtmp command:

# cp /etc/utmp /etc/utmp.orig
# cat /etc/utmp | /usr/sbin/acct/fwtmp -X -L >/etc/utmp.ascii
#

The options -X and -L ensure that user and host names are not shortened.

Using an editor, we replaced the corrupt second entry with the corresponding run-level entry from the working system above. Then we corrected its timestamps by taking the values from the first entry. All in all, the corrected version was:

                        system boot   2     0 0000 0000 1484666008                                  Tue Jan 17 16:13:28 CET 2017
                        run-level 2   1     0 0062 0123 1484666008                                  Tue Jan 17 16:13:28 CET 2017
         naudio                       8 3473526 0000 0000 1484666008                                  Tue Jan 17 16:13:28 CET 2017
...

Now we converted the corrected ASCII version back to the binary format and stored that version under /etc/utmp:

# cat /etc/utmp.ascii | /usr/sbin/acct/fwtmp -ic > /etc/utmp
#

Finally the command “who -r” worked again:

$ who -r
   .        run-level 2 Jan 17 16:13       2    0    S
$

The problem was resolved.

/usr/sbin/rpm_share[440]: 36044986 Illegal instruction

The above error message showed up during the installation of an RPM package:

# rpm -U db4-4.7.25-2.aix5.1.ppc.rpm 
/usr/sbin/rpm_share[440]: 36044986 Illegal instruction
rpm_share: 0645-007 ATTENTION: get_rpm_inst_root_list() returned an unexpected result.
rpm_share: 0645-007 ATTENTION: update_inst_root() returned an unexpected result.

The rpm command itself no longer works; a rebuild of the RPM database is therefore not possible either:

# rpm --rebuilddb
/usr/sbin/rpm_share[470]: 22478966 Illegal instruction

Reinstalling the fileset rpm.rte fixes the problem:

# installp -acFXYd . rpm.rte
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+

...

Installation Summary
--------------------
Name                        Level           Part        Event       Result
-------------------------------------------------------------------------------
rpm.rte                     4.13.0.3        USR         APPLY       SUCCESS    
rpm.rte                     4.13.0.3        ROOT        APPLY       SUCCESS

Afterwards, the rpm command works again:

# rpm -qa
...
db4-4.7.25-2.ppc
...
AIX-rpm-7.1.5.15-7.ppc
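
If the RPM database itself was left in an inconsistent state, the rebuild that failed above should now be possible again as well:

# rpm --rebuilddb
#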