Under Construction

Connection Monitoring

To detect failures and inaccessibility of managed systems, connection monitoring can be configured on the HMCs. If a monitored managed system is not accessible, a service event is generated. Connection monitoring can be enabled or disabled individually for each managed system. Timers configured on the HMC determine when a managed system is considered unavailable.

The “ms lsconnmon” command shows which managed systems are currently monitored by one or both HMCs:

$ ms lsconnmon
                     CONNMON
NAME  SERIAL_NUM  HMC1      HMC2
ms11  17G6G7S     Disabled  Disabled
ms12  07G6G8R     Disabled  Disabled
ms13  323B31V     Enabled   Enabled
ms14  323B31V     Enabled   Enabled

$

If a managed system has a connection to two HMCs, then the managed system can be monitored by both HMCs.

Monitoring of a managed system can be enabled with the “ms enableconnmon” command:

$ ms enableconnmon ms11
$

Monitoring is enabled on both HMCs (if present).
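When many managed systems are involved, the two commands above can be combined. The following is a minimal sketch, not part of the tool: enable_all_connmon is a hypothetical helper name, and the parsing assumes that “ms lsconnmon” prints exactly the four columns shown above (NAME, SERIAL_NUM, HMC1, HMC2). Systems connected to only one HMC may be displayed differently and would need an adjusted filter.

```shell
# enable_all_connmon: hypothetical helper (not part of the tool).
# Enables monitoring for every managed system that "ms lsconnmon"
# reports as Disabled on at least one HMC.
# Assumption: data rows have exactly four whitespace-separated fields.
enable_all_connmon() {
    ms lsconnmon |
        awk 'NF == 4 && $1 != "NAME" &&
             ($3 == "Disabled" || $4 == "Disabled") { print $1 }' |
        while read -r name; do
            # </dev/null keeps the command from consuming the list of names
            ms enableconnmon "$name" </dev/null ||
                echo "could not enable monitoring for $name" >&2
        done
}
```

Calling enable_all_connmon without arguments processes all listed systems. Systems that are already monitored by both HMCs are skipped.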

To avoid generating events during planned maintenance, monitoring can be disabled for the duration of the work with the “ms disableconnmon” command:

$ ms disableconnmon ms11
$

After the maintenance work has been completed, monitoring should be enabled again so that a failure of the managed system is detected.
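Forgetting to re-enable monitoring is easy, so the disable/enable pair can be wrapped around the maintenance step. The sketch below is only an illustration: with_connmon_paused is a hypothetical helper name, and the maintenance step is whatever command is passed to it. It does not handle interruption by signals.

```shell
# with_connmon_paused SYSTEM COMMAND [ARGS...]
# Hypothetical helper (not part of the tool): disables connection
# monitoring for SYSTEM, runs COMMAND, then enables monitoring again.
with_connmon_paused() {
    local system rc
    system=$1
    shift
    ms disableconnmon "$system" || return 1
    rc=0
    "$@" || rc=$?
    # Re-enable monitoring whether or not the maintenance command succeeded.
    ms enableconnmon "$system" || return 1
    # Pass on the exit status of the maintenance command.
    return "$rc"
}
```

For example, "with_connmon_paused ms11 ./firmware_update.sh" would run a (hypothetical) script firmware_update.sh while monitoring of ms11 is suspended.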

Three timer values on the HMC determine what is considered a failure or unavailability. The current values of the three timers can be displayed using the “hmc lsconnmon” command:

$ hmc lsconnmon
       OUTAGE        RECOVER    NEW
HMC    DISCONNECTED  CONNECTED  INCIDENT
hmc01  15            10         60
hmc02  30            2          20
$

The first timer is set with the outage_disconnected_minutes attribute. It specifies after how many minutes of unavailability a managed system is considered failed and a service event is generated. The second timer (recover_connected_minutes attribute) specifies how long a previously unavailable managed system must be accessible again in order to be considered recovered. This prevents brief periods of accessibility from being counted as a recovery. The third timer (new_incident_minutes attribute) specifies the time after which a renewed loss of connection is considered a new failure. If a managed system is accessible again only for a short period of time and then becomes unavailable again, this is not considered a new failure.

The timer values can be set with the “hmc chconnmon” command, using the attributes described above:

$ hmc chconnmon hmc02 outage_disconnected_minutes=15 \
  recover_connected_minutes=10 new_incident_minutes=60
$

Note: With some older HMC versions, all three timer values must always be specified, even if only some of them are changed!
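To change a single timer on such HMC versions without retyping the other two values, the current values can be read first and passed along unchanged. The following is a minimal sketch under stated assumptions: set_connmon_timer is a hypothetical helper name, and the parsing assumes that “hmc lsconnmon” prints the HMC name followed by the three timer values in the column order shown above.

```shell
# set_connmon_timer HMC ATTRIBUTE MINUTES
# Hypothetical helper (not part of the tool): changes one timer but always
# passes all three values to "hmc chconnmon".
# Assumption: "hmc lsconnmon" rows are: HMC, outage, recover, new incident.
set_connmon_timer() {
    local hmc_name attr minutes outage recover incident
    hmc_name=$1
    attr=$2
    minutes=$3
    # Word splitting is intended here: the three values become $1 $2 $3.
    set -- $(hmc lsconnmon | awk -v h="$hmc_name" '$1 == h { print $2, $3, $4 }')
    if [ $# -ne 3 ]; then
        echo "no timer values found for $hmc_name" >&2
        return 1
    fi
    outage=$1
    recover=$2
    incident=$3
    case $attr in
        outage_disconnected_minutes) outage=$minutes ;;
        recover_connected_minutes)   recover=$minutes ;;
        new_incident_minutes)        incident=$minutes ;;
        *) echo "unknown attribute: $attr" >&2; return 1 ;;
    esac
    hmc chconnmon "$hmc_name" \
        outage_disconnected_minutes="$outage" \
        recover_connected_minutes="$recover" \
        new_incident_minutes="$incident"
}
```

For example, "set_connmon_timer hmc02 recover_connected_minutes 10" would keep the other two timers of hmc02 at their current values.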