Recently, cron stopped starting jobs on one of our AIX systems. The error report contained no entry, and syslog gave no indication of the problem either. The log of the cron daemon, however, contained a lot of messages like these:
# cat /var/adm/cron/log
...
! c queue max run limit reached Sat Feb 23 08:49:00 2019
! rescheduling a cron job Sat Feb 23 08:49:00 2019
...
On AIX, the number of concurrently active cron jobs is limited to 100 by default. Evidently this limit had been reached on our system. When the limit is hit, new jobs are rescheduled and retried 60 seconds later by default. Both values can be configured in the file /var/adm/cron/queuedefs. A limit of 100 is already quite high, so reaching it indicates a problem.
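As a hedged illustration of the file format: each line in queuedefs describes one queue in the form queue.[njob]j[nice]n[nwait]w, and cron jobs use the queue named c (the same "c" that appears in the log message above). The values below are made up for illustration and are not taken from the affected system.

```
c.50j2n30w
```

Read this way, the entry would cap the c queue at 50 concurrent jobs, run them with a nice value of 2, and retry a rescheduled job after 30 seconds. It is worth checking the queuedefs documentation for your AIX level before changing the file.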
The PID of the cron daemon was quickly found:
$ ps -ef|grep cron
root 6684924 1 0 Sep 26 - 8:03 /usr/sbin/cron
$
The currently active cron jobs run as child processes of cron. With the “-T” option of the ps command, we can quickly list all children:
$ ps -T 6684924
     PID    TTY  TIME CMD
 6684924      -  8:03 cron
 3276876      -  0:00    |\--perl
 9961588      -  0:00    |    \--mount
12714002      -  0:07    |        \--nfsmnthelp
 3604516      -  0:00    |\--perl
20185130      -  0:00    |    \--mount
10158264      -  0:35    |        \--nfsmnthelp
 4587542      -  0:00    |\--perl
...
It is immediately noticeable that the same lines repeat again and again: cron started a perl program over and over, and each instance tried to mount a file system via NFS. The mount never completed (no answer from the NFS server), so each perl script hung. Because the script kept being started, at some point there were 100 active cron jobs, and from that moment on no more cron jobs were started. A quick count of the active perl processes:
$ ps -T 6684924 |grep perl |wc -l
100
$
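As a variation on the count above, the direct children of cron can be grouped by command name, which shows at a glance what is filling the queue. This is only a sketch: the function name count_children is made up here, and it assumes that ps accepts the POSIX-style "-o ppid= -o comm=" format options on your system.

```shell
#!/bin/sh
# count_children PID
# Print "<count> <command>" for every command that runs as a direct child of PID.
# Assumption: ps supports the POSIX -o format with empty headers (ppid=, comm=).
count_children() {
    ps -e -o ppid= -o comm= |
        awk -v parent="$1" '
            $1 == parent { n[$2]++ }                 # tally children by command name
            END { for (c in n) print n[c], c }'
}
```

Called as "count_children 6684924" in the situation described here, it would be expected to report the perl processes as a single line with their count, but only the direct children of cron are considered, not the mount and nfsmnthelp processes below them.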
There are exactly 100 perl processes started by cron. We terminate some of the hanging perl processes:
# kill 3276876 3604516 4587542
#
A look at the end of the cron log file shows that the jobs have been terminated, and after a short while the first newly started cron job appears:
# tail -f /var/adm/cron/log
…
Cron Job with pid: 3276876 Failed
Cron Job with pid: 3604516 Failed
Cron Job with pid: 4587542 Failed
mqm : CMD ( /appdata/mqm/admin/bin/checks/checkXmitMonitoring.sh >>/appdata/mqm/tracks/logs/scheduler/checkXmitMonitoring.fatal 2>&1 ) : PID ( 28442840 ) : Mon Feb 25 10:34:00 2019
…
We also terminate the other hanging processes and restart the cron daemon for safety’s sake by simply terminating it:
# kill 6684924
#
The cron daemon is automatically restarted thanks to an /etc/inittab entry:
# lsitab cron
cron:23456789:respawn:/usr/sbin/cron
#
Now that cron works again, the perl script that ultimately caused the problem should be examined. For scripts started by cron, it is advisable to have the script check whether a previous run is still active before it starts its work.
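One common way to implement such a check is a lock directory, since mkdir either creates the directory or fails, so only one instance can hold the lock at a time. The following is a minimal sketch and not the fix that was applied on the system described here; the lock path and the function names acquire_lock, release_lock and run_once are hypothetical.

```shell
#!/bin/sh
# Hypothetical lock location; any directory the job's user may write to will do.
LOCKDIR="${LOCKDIR:-/tmp/cronjob.lock}"

# Try to take the lock. mkdir fails if the directory already exists,
# so only one running instance can succeed.
acquire_lock() {
    if mkdir "$LOCKDIR" 2>/dev/null; then
        echo $$ > "$LOCKDIR/pid"    # record the owner for later inspection
        return 0
    fi
    return 1
}

# Give the lock back so the next scheduled run can start.
release_lock() {
    rm -rf "$LOCKDIR"
}

# run_once COMMAND [ARGS...]
# Run COMMAND only if no earlier run still holds the lock.
run_once() {
    if ! acquire_lock; then
        echo "previous run still active, skipping" >&2
        return 1
    fi
    rc=0
    "$@" || rc=$?                   # keep the job's exit status
    release_lock
    return $rc
}
```

Inside the cron script, the actual work would then be started as "run_once some_command". If a run hangs, as the NFS mount did here, later runs skip immediately instead of piling up under cron. The sketch does not clean up a stale lock left behind by a killed job; that would need an extra check, for example against the PID stored in the lock directory.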