Exadata: 7 Useful Commands to check Port/Sensor Alarms

Hello all!

The other day I got an alarm with the message below:

Message=The aggregate sensor /SYS/CABLE_CONN_STAT has a fault.

Here are some useful commands I used to verify all ports/sensors in my Exadata cluster.

In summary, these commands:
1) Use Intelligent Platform Management Interface (IPMI) to read the Sensor Data Record (SDR) repository
2) Use Intelligent Platform Management Interface (IPMI) to view the ILOM SP System Event Log (SEL)
3) Display all host nodes with ibhosts
4) Use ibcheckstate to scan InfiniBand fabric and validate the port logical and physical state
5) Use ibcheckerrors to scan InfiniBand fabric and validate the connectivity as described in the topology file
6) Check sensor health from the switch
7) Check the overall health of the InfiniBand switch, on the Exadata switch itself

The commands are:


1) ipmitool sdr
2) ipmitool sel list | tail -60
3) ibhosts
4) ibcheckstate -v
5) ibcheckerrors
6) showunhealthy
7) env_test

Below is the complete list, with each command's execution and example output.

1) Use Intelligent Platform Management Interface (IPMI) to read the Sensor Data Record (SDR) repository:

[root@grepora03db02 ~]# ipmitool sdr
ACPI             | 0x01              | ok
HDD0/PRSNT       | 0x02              | ok
HDD0/STATE       | 0x04              | ok
HDD1/PRSNT       | 0x02              | ok
HDD1/STATE       | 0x04              | ok
HDD2/PRSNT       | 0x02              | ok
HDD2/STATE       | 0x04              | ok
HDD3/PRSNT       | 0x02              | ok
HDD3/STATE       | 0x04              | ok
HDD4/PRSNT       | 0x01              | ok
HDD4/STATE       | 0x00              | ok
HDD5/PRSNT       | 0x01              | ok
HDD5/STATE       | 0x00              | ok
HDD6/PRSNT       | 0x01              | ok
HDD6/STATE       | 0x00              | ok
HDD7/PRSNT       | 0x01              | ok
HDD7/STATE       | 0x00              | ok
NVME0/PRSNT      | 0x01              | ok
NVME1/PRSNT      | 0x01              | ok
NVME2/PRSNT      | 0x01              | ok
NVME3/PRSNT      | 0x01              | ok
HOST_ERR         | 0x01              | ok
INTSW            | 0x01              | ok
FM0/F0/TACH      | 4800 RPM          | ok
FM0/F1/TACH      | 4300 RPM          | ok
FM0/F2/TACH      | 8300 RPM          | ok
FM0/F3/TACH      | 6900 RPM          | ok
FM0/PRSNT        | 0x02              | ok
FM1/F0/TACH      | 8100 RPM          | ok
FM1/F1/TACH      | 7300 RPM          | ok
FM1/F2/TACH      | 3900 RPM          | ok
FM1/F3/TACH      | 3600 RPM          | ok
FM1/PRSNT        | 0x02              | ok
FM2/F0/TACH      | 4200 RPM          | ok
FM2/F1/TACH      | 3700 RPM          | ok
FM2/F2/TACH      | 6900 RPM          | ok
FM2/F3/TACH      | 6100 RPM          | ok
FM2/PRSNT        | 0x02              | ok
FM3/F0/TACH      | 6900 RPM          | ok
FM3/F1/TACH      | 5700 RPM          | ok
FM3/F2/TACH      | 4600 RPM          | ok
FM3/F3/TACH      | 4000 RPM          | ok
FM3/PRSNT        | 0x02              | ok
P0/D0/PRSNT      | 0x02              | ok
P0/D1/PRSNT      | 0x02              | ok
P0/D2/PRSNT      | 0x01              | ok
P0/D3/PRSNT      | 0x02              | ok
P0/D4/PRSNT      | 0x02              | ok
P0/D5/PRSNT      | 0x01              | ok
P0/D6/PRSNT      | 0x01              | ok
P0/D7/PRSNT      | 0x02              | ok
P0/D8/PRSNT      | 0x02              | ok
P0/D9/PRSNT      | 0x01              | ok
P0/D10/PRSNT     | 0x02              | ok
P0/D11/PRSNT     | 0x02              | ok
P0/PRSNT         | 0x01              | ok
P0/V_DIMM        | 1.23 Volts        | ok
P1/D0/PRSNT      | 0x02              | ok
P1/D1/PRSNT      | 0x02              | ok
P1/D2/PRSNT      | 0x01              | ok
P1/D3/PRSNT      | 0x02              | ok
P1/D4/PRSNT      | 0x02              | ok
P1/D5/PRSNT      | 0x01              | ok
P1/D6/PRSNT      | 0x01              | ok
P1/D7/PRSNT      | 0x02              | ok
P1/D8/PRSNT      | 0x02              | ok
P1/D9/PRSNT      | 0x01              | ok
P1/D10/PRSNT     | 0x02              | ok
P1/D11/PRSNT     | 0x02              | ok
P1/PRSNT         | 0x01              | ok
P1/V_DIMM        | 1.24 Volts        | ok
R1/PCIE1/PRSNT   | 0x01              | ok
MB/RISER1/PRSNT  | 0x02              | ok
R2/PCIE2/PRSNT   | 0x02              | ok
MB/RISER2/PRSNT  | 0x02              | ok
R3/PCIE3/PRSNT   | 0x02              | ok
R3/PCIE4/PRSNT   | 0x02              | ok
MB/RISER3/PRSNT  | 0x02              | ok
T_CORE_NET01     | 70 degrees C      | ok
T_CORE_NET23     | 71 degrees C      | ok
T_IN_PS          | 31 degrees C      | ok
T_IN_SLOT1       | 43 degrees C      | ok
T_IN_SLOT2       | 50 degrees C      | ok
T_IN_SLOT3       | 38 degrees C      | ok
T_OUT_SLOT1      | 49 degrees C      | ok
T_OUT_SLOT2      | 50 degrees C      | ok
T_OUT_SLOT3      | 48 degrees C      | ok
PS0/PRSNT        | 0x02              | ok
PS0/P_IN         | 180 Watts         | ok
PS0/P_OUT        | 170 Watts         | ok
PS0/STATE        | 0x01              | ok
PS0/T_OUT        | 36 degrees C      | ok
PS0/V_12V        | 12 Volts          | ok
PS0/V_12V_STBY   | 12 Volts          | ok
PS0/V_IN         | 206 Volts         | ok
PS1/PRSNT        | 0x02              | ok
PS1/P_IN         | 180 Watts         | ok
PS1/P_OUT        | 160 Watts         | ok
PS1/STATE        | 0x01              | ok
PS1/T_OUT        | 38 degrees C      | ok
PS1/V_12V        | 12 Volts          | ok
PS1/V_12V_STBY   | 12 Volts          | ok
PS1/V_IN         | 204 Volts         | ok
PWRBS            | no reading        | ns
T_AMB            | 16 degrees C      | ok
/SYS/VPS         | 360 Watts         | ok
VPS_CPUS         | 180 Watts         | ok
VPS_FANS         | 10 Watts          | ok
VPS_MEMORY       | 45 Watts          | ok
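
With this many sensors, it can help to hide the rows that are already fine. The sketch below assumes the three-column `name | reading | status` layout shown above; `sdr_not_ok` is a hypothetical helper name, not part of ipmitool.

```shell
# sdr_not_ok: read `ipmitool sdr` output on stdin and print only the rows
# whose third column (status) is something other than "ok".
# (Hypothetical helper; assumes the "name | reading | status" layout.)
sdr_not_ok() {
    awk -F'|' '
        NF >= 3 {
            status = $3
            gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim padding around the status
            if (status != "ok") print
        }
    '
}

# Usage on a database or storage node:
#   ipmitool sdr | sdr_not_ok
```

Against the output above, this would leave only the PWRBS row, whose status is `ns` because the sensor has no reading.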

 

2) Use Intelligent Platform Management Interface (IPMI) to view the ILOM SP System Event Log (SEL):

[root@grepora03db02 ~]# ipmitool sel list | tail -60
 20d | 09/14/2017 | 20:51:01 | System ACPI Power State #0x26 | S0/G0: working | Deasserted
 20e | 09/14/2017 | 20:51:01 | System ACPI Power State #0x26 | S5/G2: soft-off | Asserted
 20f | 09/14/2017 | 20:53:57 | System Boot Initiated | System Restart | Asserted
 210 | 09/14/2017 | 20:53:57 | System Firmware Progress | Management controller initialization | Asserted
 211 | 09/14/2017 | 20:53:58 | System Firmware Progress | SMBus initialization | Asserted
 212 | 09/14/2017 | 20:53:59 | System Firmware Progress | Primary CPU initialization | Asserted
 213 | 09/14/2017 | 20:54:00 | System Firmware Progress | Memory initialization | Asserted
 214 | 09/14/2017 | 20:54:00 | System Boot Initiated | Initiated by warm reset | Asserted
 215 | 09/14/2017 | 20:54:00 | System Firmware Progress | Management controller initialization | Asserted
 216 | 09/14/2017 | 20:54:01 | System Firmware Progress | SMBus initialization | Asserted
 217 | 09/14/2017 | 20:54:02 | System Firmware Progress | Primary CPU initialization | Asserted
 218 | 09/14/2017 | 20:54:02 | System Firmware Progress | Memory initialization | Asserted
 219 | 09/14/2017 | 20:54:04 | System ACPI Power State #0x26 | S0/G0: working | Asserted
 21a | 09/14/2017 | 20:54:04 | System ACPI Power State #0x26 | S5/G2: soft-off | Deasserted
 21b | 09/14/2017 | 20:54:43 | System Firmware Progress | Cache initialization | Asserted
 21c | 09/14/2017 | 20:54:46 | System Firmware Progress | Secondary CPU Initialization | Asserted
 21d | 09/14/2017 | 20:55:01 | System Firmware Progress | PCI resource configuration | Asserted
 21e | 09/14/2017 | 20:55:06 | System Firmware Progress | PCI resource configuration | Asserted
 21f | 09/14/2017 | 20:55:15 | System ACPI Power State #0x26 | S0/G0: working | Deasserted
 220 | 09/14/2017 | 20:55:15 | System ACPI Power State #0x26 | S5/G2: soft-off | Asserted
 221 | 09/14/2017 | 20:55:19 | System Boot Initiated | System Restart | Asserted
 222 | 09/14/2017 | 20:55:19 | System Firmware Progress | Management controller initialization | Asserted
 223 | 09/14/2017 | 20:55:20 | System Firmware Progress | SMBus initialization | Asserted
 224 | 09/14/2017 | 20:55:21 | System Firmware Progress | Primary CPU initialization | Asserted
 225 | 09/14/2017 | 20:55:21 | System Firmware Progress | Memory initialization | Asserted
 226 | 09/14/2017 | 20:55:22 | System Boot Initiated | Initiated by warm reset | Asserted
 227 | 09/14/2017 | 20:55:22 | System Firmware Progress | Management controller initialization | Asserted
 228 | 09/14/2017 | 20:55:22 | System Firmware Progress | SMBus initialization | Asserted
 229 | 09/14/2017 | 20:55:24 | System Firmware Progress | Primary CPU initialization | Asserted
 22a | 09/14/2017 | 20:55:24 | System Firmware Progress | Memory initialization | Asserted
 22b | 09/14/2017 | 20:55:30 | System ACPI Power State #0x26 | S0/G0: working | Asserted
 22c | 09/14/2017 | 20:55:30 | System ACPI Power State #0x26 | S5/G2: soft-off | Deasserted
 22d | 09/14/2017 | 20:56:04 | System Firmware Progress | Cache initialization | Asserted
 22e | 09/14/2017 | 20:56:08 | System Firmware Progress | Secondary CPU Initialization | Asserted
 22f | 09/14/2017 | 20:56:23 | System Firmware Progress | PCI resource configuration | Asserted
 230 | 09/14/2017 | 20:56:28 | System Firmware Progress | PCI resource configuration | Asserted
 231 | 09/14/2017 | 20:56:35 | System Firmware Progress | Video initialization | Asserted
 232 | 09/14/2017 | 20:56:35 | System Firmware Progress | Option ROM initialization | Asserted
 233 | 09/14/2017 | 20:56:38 | System Firmware Progress | Keyboard controller initialization | Asserted
 234 | 09/14/2017 | 20:56:41 | System Firmware Progress | Option ROM initialization | Asserted
 235 | 09/14/2017 | 20:57:20 | System Firmware Progress | Hard-disk initialization | Asserted
 236 | 09/14/2017 | 20:57:20 | System Firmware Progress | Option ROM initialization | Asserted
 237 | 09/14/2017 | 20:57:29 | System Firmware Progress | System boot initiated | Asserted
 238 | 09/14/2017 | 20:57:29 | System Firmware Progress | System boot initiated | Asserted
 239 | 09/14/2017 | 20:59:13 | System Firmware Progress | Management controller initialization | Asserted
 23a | 09/14/2017 | 20:59:13 | System Firmware Progress | SMBus initialization | Asserted
 23b | 09/14/2017 | 20:59:14 | System Firmware Progress | Primary CPU initialization | Asserted
 23c | 09/14/2017 | 20:59:14 | System Firmware Progress | Memory initialization | Asserted
 23d | 09/14/2017 | 20:59:55 | System Firmware Progress | Cache initialization | Asserted
 23e | 09/14/2017 | 20:59:58 | System Firmware Progress | Secondary CPU Initialization | Asserted
 23f | 09/14/2017 | 21:00:13 | System Firmware Progress | PCI resource configuration | Asserted
 240 | 09/14/2017 | 21:00:18 | System Firmware Progress | PCI resource configuration | Asserted
 241 | 09/14/2017 | 21:00:25 | System Firmware Progress | Video initialization | Asserted
 242 | 09/14/2017 | 21:00:25 | System Firmware Progress | Option ROM initialization | Asserted
 243 | 09/14/2017 | 21:00:28 | System Firmware Progress | Keyboard controller initialization | Asserted
 244 | 09/14/2017 | 21:00:32 | System Firmware Progress | Option ROM initialization | Asserted
 245 | 09/14/2017 | 21:01:10 | System Firmware Progress | Hard-disk initialization | Asserted
 246 | 09/14/2017 | 21:01:10 | System Firmware Progress | Option ROM initialization | Asserted
 247 | 09/14/2017 | 21:01:18 | System Firmware Progress | System boot initiated | Asserted
 248 | 09/14/2017 | 21:01:18 | System Firmware Progress | System boot initiated | Asserted
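
The SEL can be long and repetitive, so a per-type count is a quick way to see what dominates it. This is a minimal sketch that assumes the pipe-separated layout shown above, with the event type in the fourth column; `sel_event_counts` is a hypothetical helper name.

```shell
# sel_event_counts: read `ipmitool sel list` output on stdin and print how
# many entries each event type (fourth column) has, most frequent first.
# (Hypothetical helper; assumes the pipe-separated layout shown above.)
sel_event_counts() {
    awk -F'|' '
        NF >= 5 {
            sensor = $4
            gsub(/^[ \t]+|[ \t]+$/, "", sensor)   # trim padding around the field
            count[sensor]++
        }
        END { for (s in count) printf "%d\t%s\n", count[s], s }
    ' | sort -rn
}

# Usage:
#   ipmitool sel list | sel_event_counts
```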

 

3) Display all host nodes with ibhosts:

[root@grepora03db02 ~]# ibhosts
Ca	: 0x0021280001cf2dea ports 2 "grepora01nas01 PCIe 1"
Ca	: 0x0021280001cf7d6a ports 2 "grepora01db04 S 10.10.10.7 HCA-1"
Ca	: 0x0021280001a13cbc ports 2 "grepora01db02 S 10.10.10.2 HCA-1"
Ca	: 0x0021280001cf798e ports 2 "grepora01db03 S 10.10.10.6 HCA-1"
Ca	: 0x0021280001a0b038 ports 2 "grepora01cel07 C 10.10.10.11 HCA-1"
Ca	: 0x0021280001a135d0 ports 2 "grepora01db01 S 10.10.10.1 HCA-1"
Ca	: 0x0021280001cedbca ports 2 "grepora01cel05 C 10.10.10.9 HCA-1"
Ca	: 0x0021280001cf6006 ports 2 "grepora01cel06 C 10.10.10.10 HCA-1"
Ca	: 0x0021280001a151c0 ports 2 "grepora01cel03 C 10.10.10.5 HCA-1"
Ca	: 0x0021280001a16a12 ports 2 "grepora01cel04 C 10.10.10.8 HCA-1"
Ca	: 0x0021280001a15364 ports 2 "grepora01cel01 C 10.10.10.3 HCA-1"
Ca	: 0x0021280001a1590a ports 2 "grepora01cel02 C 10.10.10.4 HCA-1"
Ca	: 0x0010e00001333318 ports 2 "grepora01nas03 PCIe 6"
Ca	: 0x0010e00001333704 ports 2 "grepora01nas03 PCIe 5"
Ca	: 0x0010e00001330f3c ports 2 "grepora01nas04 PCIe 6"
Ca	: 0x0010e00001332f30 ports 2 "grepora01nas04 PCIe 5"
Ca	: 0x0010e0000128e3e4 ports 2 "grepora02db02 S 10.10.10.24 HCA-1"
Ca	: 0x0010e0000128e18c ports 2 "grepora02db01 S 10.10.10.28 HCA-1"
Ca	: 0x0010e00001289c6c ports 2 "grepora02cel03 C 10.10.10.27 HCA-1"
Ca	: 0x0010e0000128bea4 ports 2 "grepora02cel01 C 10.10.10.25 HCA-1"
Ca	: 0x0010e00001289d90 ports 2 "grepora02cel02 C 10.10.10.26 HCA-1"
Ca	: 0x0010e00001868f40 ports 2 "grepora03db04 S 10.10.10.35,10.10.10.36 HCA-1"
Ca	: 0x0010e00001859cd0 ports 2 "grepora03db03 S 10.10.10.33,10.10.10.34 HCA-1"
Ca	: 0x0010e000018640a0 ports 2 "grepora03db01 S 10.10.10.29,10.10.10.30 HCA-1"
Ca	: 0x0010e00001887928 ports 2 "grepora03cel07 C 10.10.10.49,10.10.10.50 HCA-1"
Ca	: 0x0010e0000185cd00 ports 2 "grepora03cel05 C 10.10.10.45,10.10.10.46 HCA-1"
Ca	: 0x0010e00001868e80 ports 2 "grepora03cel06 C 10.10.10.47,10.10.10.48 HCA-1"
Ca	: 0x0010e0000185e5a0 ports 2 "grepora03cel04 C 10.10.10.43,10.10.10.44 HCA-1"
Ca	: 0x0010e0000185e5f0 ports 2 "grepora03cel03 C 10.10.10.41,10.10.10.42 HCA-1"
Ca	: 0x0010e00001638be4 ports 2 "grepora03cel01 C 10.10.10.37,10.10.10.38 HCA-1"
Ca	: 0x0010e0000187c658 ports 2 "grepora03cel02 C 10.10.10.39,10.10.10.40 HCA-1"
Ca	: 0x0010e0000185c130 ports 2 "grepora03db02 S 10.10.10.31,10.10.10.32 HCA-1"
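
To compare the fabric against the hosts you expect to see, the host names can be pulled out of the node descriptions. This sketch assumes, as in the output above, that the first word inside the quotes is the host name; `ib_host_names` is a hypothetical helper name.

```shell
# ib_host_names: read `ibhosts` output on stdin and print the unique host
# names, taken as the first word of the quoted node description.
# (Hypothetical helper; assumes descriptions start with the host name.)
ib_host_names() {
    awk -F'"' 'NF >= 2 { split($2, words, " "); print words[1] }' | sort -u
}

# Usage:
#   ibhosts | ib_host_names
```

Hosts with more than one HCA, such as the NAS heads above, appear once in the result.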

 

4) Use ibcheckstate to scan InfiniBand fabric and validate the port logical and physical state:
It reports any port whose logical state is other than Active or whose physical state is other than LinkUp.

[root@grepora03db02 ~]# ibcheckstate -v

# Checking Switch: nodeguid 0x0021284692dea0a0
Node check lid 3:  OK 
Port check lid 3 port 1:  OK 
Port check lid 3 port 2:  OK 
Port check lid 3 port 3:  OK 
Port check lid 3 port 4:  OK 
Port check lid 3 port 5:  OK 
Port check lid 3 port 6:  OK 
Port check lid 3 port 7:  OK 
Port check lid 3 port 8:  OK 
Port check lid 3 port 9:  OK 
Port check lid 3 port 10:  OK 
Port check lid 3 port 12:  OK 
Port check lid 3 port 13:  OK 
Port check lid 3 port 14:  OK 
Port check lid 3 port 15:  OK 
Port check lid 3 port 16:  OK 
Port check lid 3 port 17:  OK 
Port check lid 3 port 18:  OK 
Port check lid 3 port 29:  OK 
Port check lid 3 port 31:  OK 
Port check lid 3 port 32:  OK 

# Checking Switch: nodeguid 0x0021284692d1a0a0
Node check lid 1:  OK 
Port check lid 1 port 1:  OK 
Port check lid 1 port 2:  OK 
Port check lid 1 port 3:  OK 
Port check lid 1 port 4:  OK 
Port check lid 1 port 5:  OK 
Port check lid 1 port 6:  OK 
Port check lid 1 port 7:  OK 
Port check lid 1 port 8:  OK 
Port check lid 1 port 9:  OK 
Port check lid 1 port 10:  OK 
Port check lid 1 port 12:  OK 
Port check lid 1 port 13:  OK 
Port check lid 1 port 14:  OK 
Port check lid 1 port 15:  OK 
Port check lid 1 port 16:  OK 
Port check lid 1 port 17:  OK 
Port check lid 1 port 18:  OK 
Port check lid 1 port 29:  OK 
Port check lid 1 port 31:  OK 
Port check lid 1 port 32:  OK 

# Checking Switch: nodeguid 0x0010e035c814a0a0
Node check lid 7:  OK 
Port check lid 7 port 1:  OK 
Port check lid 7 port 2:  OK 
Port check lid 7 port 4:  OK 
Port check lid 7 port 7:  OK 
Port check lid 7 port 10:  OK 
Port check lid 7 port 11:  OK 
Port check lid 7 port 13:  OK 
Port check lid 7 port 14:  OK 
Port check lid 7 port 15:  OK 
Port check lid 7 port 16:  OK 
Port check lid 7 port 17:  OK 
Port check lid 7 port 18:  OK 
Port check lid 7 port 27:  OK 
Port check lid 7 port 29:  OK 
Port check lid 7 port 30:  OK 
Port check lid 7 port 31:  OK 
Port check lid 7 port 33:  OK 
Port check lid 7 port 34:  OK 
Port check lid 7 port 35:  OK 
Port check lid 7 port 36:  OK 

# Checking Switch: nodeguid 0x0010e035bed4a0a0
Node check lid 27:  OK 
Port check lid 27 port 1:  OK 
Port check lid 27 port 2:  OK 
Port check lid 27 port 4:  OK 
Port check lid 27 port 7:  OK 
Port check lid 27 port 10:  OK 
Port check lid 27 port 11:  OK 
Port check lid 27 port 13:  OK 
Port check lid 27 port 14:  OK 
Port check lid 27 port 15:  OK 
Port check lid 27 port 16:  OK 
Port check lid 27 port 17:  OK 
Port check lid 27 port 18:  OK 
Port check lid 27 port 27:  OK 
Port check lid 27 port 29:  OK 
Port check lid 27 port 30:  OK 
Port check lid 27 port 31:  OK 
Port check lid 27 port 33:  OK 
Port check lid 27 port 34:  OK 
Port check lid 27 port 35:  OK 
Port check lid 27 port 36:  OK 

# Checking Switch: nodeguid 0x0010e0801a0aa0a0
Node check lid 62:  OK 
Port check lid 62 port 1:  OK 
Port check lid 62 port 2:  OK 
Port check lid 62 port 3:  OK 
Port check lid 62 port 4:  OK 
Port check lid 62 port 5:  OK 
Port check lid 62 port 6:  OK 
Port check lid 62 port 7:  OK 
Port check lid 62 port 8:  OK 
Port check lid 62 port 9:  OK 
Port check lid 62 port 10:  OK 
Port check lid 62 port 12:  OK 
Port check lid 62 port 13:  OK 
Port check lid 62 port 14:  OK 
Port check lid 62 port 15:  OK 
Port check lid 62 port 16:  OK 
Port check lid 62 port 17:  OK 
Port check lid 62 port 18:  OK 
Port check lid 62 port 31:  OK 
Port check lid 62 port 32:  OK 

# Checking Switch: nodeguid 0x00212846965fa0a0
Node check lid 80:  OK 
Port check lid 80 port 13:  OK 
Port check lid 80 port 14:  OK 
Port check lid 80 port 15:  OK 
Port check lid 80 port 16:  OK 
Port check lid 80 port 19:  OK 
Port check lid 80 port 20:  OK 
Port check lid 80 port 21:  OK 
Port check lid 80 port 22:  OK 
Port check lid 80 port 23:  OK 
Port check lid 80 port 24:  OK 
Port check lid 80 port 25:  OK 
Port check lid 80 port 26:  OK 
Port check lid 80 port 27:  OK 
Port check lid 80 port 28:  OK 
Port check lid 80 port 29:  OK 
Port check lid 80 port 30:  OK 
Port check lid 80 port 31:  OK 
Port check lid 80 port 32:  OK 
Port check lid 80 port 33:  OK 
Port check lid 80 port 34:  OK 

# Checking Switch: nodeguid 0x0010e0801ff2a0a0
Node check lid 54:  OK 
Port check lid 54 port 7:  OK 
Port check lid 54 port 8:  OK 
Port check lid 54 port 9:  OK 
Port check lid 54 port 10:  OK 
Port check lid 54 port 11:  OK 
Port check lid 54 port 12:  OK 
Port check lid 54 port 17:  OK 
Port check lid 54 port 18:  OK 
Port check lid 54 port 19:  OK 
Port check lid 54 port 20:  OK 
Port check lid 54 port 21:  OK 
Port check lid 54 port 22:  OK 
Port check lid 54 port 25:  OK 
Port check lid 54 port 26:  OK 
Port check lid 54 port 27:  OK 
Port check lid 54 port 28:  OK 
Port check lid 54 port 29:  OK 
Port check lid 54 port 30:  OK 
Port check lid 54 port 35:  OK 
Port check lid 54 port 36:  OK 

# Checking Switch: nodeguid 0x0010e0801a9da0a0
Node check lid 53:  OK 
Port check lid 53 port 1:  OK 
Port check lid 53 port 2:  OK 
Port check lid 53 port 3:  OK 
Port check lid 53 port 4:  OK 
Port check lid 53 port 5:  OK 
Port check lid 53 port 6:  OK 
Port check lid 53 port 7:  OK 
Port check lid 53 port 8:  OK 
Port check lid 53 port 9:  OK 
Port check lid 53 port 10:  OK 
Port check lid 53 port 12:  OK 
Port check lid 53 port 13:  OK 
Port check lid 53 port 14:  OK 
Port check lid 53 port 15:  OK 
Port check lid 53 port 16:  OK 
Port check lid 53 port 17:  OK 
Port check lid 53 port 18:  OK 
Port check lid 53 port 31:  OK 
Port check lid 53 port 32:  OK 

# Checking Ca: nodeguid 0x0021280001cf2dea
Node check lid 13:  OK 
Port check lid 13 port 1:  OK 
Port check lid 13 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001cf7d6a
Node check lid 25:  OK 
Port check lid 25 port 1:  OK 
Port check lid 25 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001a13cbc
Node check lid 5:  OK 
Port check lid 5 port 1:  OK 
Port check lid 5 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001cf798e
Node check lid 23:  OK 
Port check lid 23 port 1:  OK 
Port check lid 23 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001a0b038
Node check lid 21:  OK 
Port check lid 21 port 1:  OK 
Port check lid 21 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001a135d0
Node check lid 2:  OK 
Port check lid 2 port 1:  OK 
Port check lid 2 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001cedbca
Node check lid 17:  OK 
Port check lid 17 port 1:  OK 
Port check lid 17 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001cf6006
Node check lid 19:  OK 
Port check lid 19 port 1:  OK 
Port check lid 19 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001a151c0
Node check lid 11:  OK 
Port check lid 11 port 1:  OK 
Port check lid 11 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001a16a12
Node check lid 15:  OK 
Port check lid 15 port 1:  OK 
Port check lid 15 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001a15364
Node check lid 38:  OK 
Port check lid 38 port 1:  OK 
Port check lid 38 port 2:  OK 

# Checking Ca: nodeguid 0x0021280001a1590a
Node check lid 9:  OK 
Port check lid 9 port 1:  OK 
Port check lid 9 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001333704
Node check lid 42:  OK 
Port check lid 42 port 1:  OK 
Port check lid 42 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001333318
Node check lid 41:  OK 
Port check lid 41 port 1:  OK 
Port check lid 41 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001330f3c
Node check lid 40:  OK 
Port check lid 40 port 1:  OK 
Port check lid 40 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001332f30
Node check lid 39:  OK 
Port check lid 39 port 1:  OK 
Port check lid 39 port 2:  OK 

# Checking Ca: nodeguid 0x0010e0000128e3e4
Node check lid 36:  OK 
Port check lid 36 port 1:  OK 
Port check lid 36 port 2:  OK 

# Checking Ca: nodeguid 0x0010e0000128e18c
Node check lid 34:  OK 
Port check lid 34 port 1:  OK 
Port check lid 34 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001289c6c
Node check lid 32:  OK 
Port check lid 32 port 1:  OK 
Port check lid 32 port 2:  OK 

# Checking Ca: nodeguid 0x0010e0000128bea4
Node check lid 28:  OK 
Port check lid 28 port 1:  OK 
Port check lid 28 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001289d90
Node check lid 30:  OK 
Port check lid 30 port 1:  OK 
Port check lid 30 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001868f40
Node check lid 50:  OK 
Port check lid 50 port 1:  OK 
Port check lid 50 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001859cd0
Node check lid 57:  OK 
Port check lid 57 port 1:  OK 
Port check lid 57 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001887928
Node check lid 48:  OK 
Port check lid 48 port 1:  OK 
Port check lid 48 port 2:  OK 

# Checking Ca: nodeguid 0x0010e000018640a0
Node check lid 55:  OK 
Port check lid 55 port 1:  OK 
Port check lid 55 port 2:  OK 

# Checking Ca: nodeguid 0x0010e0000185cd00
Node check lid 47:  OK 
Port check lid 47 port 1:  OK 
Port check lid 47 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001868e80
Node check lid 52:  OK 
Port check lid 52 port 1:  OK 
Port check lid 52 port 2:  OK 

# Checking Ca: nodeguid 0x0010e0000185e5a0
Node check lid 56:  OK 
Port check lid 56 port 1:  OK 
Port check lid 56 port 2:  OK 

# Checking Ca: nodeguid 0x0010e0000185e5f0
Node check lid 59:  OK 
Port check lid 59 port 1:  OK 
Port check lid 59 port 2:  OK 

# Checking Ca: nodeguid 0x0010e00001638be4
Node check lid 58:  OK 
Port check lid 58 port 1:  OK 
Port check lid 58 port 2:  OK 

# Checking Ca: nodeguid 0x0010e0000187c658
Node check lid 51:  OK 
Port check lid 51 port 1:  OK 
Port check lid 51 port 2:  OK 

# Checking Ca: nodeguid 0x0010e0000185c130
Node check lid 49:  OK 
Port check lid 49 port 1:  OK 
Port check lid 49 port 2:  OK 

## Summary: 40 nodes checked, 0 bad nodes found
##          222 ports checked, 0 ports with bad state found
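
For a scheduled check, reading only the two summary lines may be enough. The sketch below assumes the `## ... checked, N ...` summary format shown above and treats any non-zero N as a problem; `ib_summary_clean` is a hypothetical helper name.

```shell
# ib_summary_clean: read ibcheckstate (or ibcheckerrors) output on stdin and
# succeed only if every "## ... checked, N ..." summary line reports N = 0.
# Returns non-zero if a count is non-zero or no summary line is found.
# (Hypothetical helper; assumes the summary format shown above.)
ib_summary_clean() {
    awk '
        /^##/ && /checked,/ {
            for (i = 1; i < NF; i++)
                if ($i == "checked,") {
                    seen++
                    if ($(i + 1) + 0 != 0) bad = 1   # number right after "checked,"
                }
        }
        END { exit ((seen == 0 || bad) ? 1 : 0) }
    '
}

# Usage:
#   ibcheckstate | ib_summary_clean || echo "InfiniBand fabric needs attention"
```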

 

5) Use ibcheckerrors to scan InfiniBand fabric and validate the connectivity as described in the topology file:
The ibcheckerrors command uses the topology file to scan the InfiniBand fabric, validates the connectivity described in that file, and reports ports whose error counters exceed their thresholds. In the output below, the lines for port 255 are the aggregate "port all" checks for each switch.

[root@grepora03db02 ~]# ibcheckerrors
#warn: counter PortRcvRemotePhysicalErrors = 100 	(threshold 100) lid 1 port 255
Error check on lid 1 (SUN DCS 36P QDR grepora01sw-ib2 10.20.10.24) port all:  FAILED 
#warn: counter PortRcvRemotePhysicalErrors = 100 	(threshold 100) lid 1 port 15
Error check on lid 1 (SUN DCS 36P QDR grepora01sw-ib2 10.20.10.24) port 15:  FAILED 
#warn: counter LinkErrorRecoveryCounter = 12 	(threshold 10) lid 7 port 255
Error check on lid 7 (SUN DCS 36P QDR grepora02sw-ibb0 10.20.10.50) port all:  FAILED 
#warn: counter LinkErrorRecoveryCounter = 12 	(threshold 10) lid 7 port 34
Error check on lid 7 (SUN DCS 36P QDR grepora02sw-ibb0 10.20.10.50) port 34:  FAILED 
#warn: counter SymbolErrorCounter = 156 	(threshold 10) lid 27 port 255
Error check on lid 27 (SUN DCS 36P QDR grepora02sw-iba0 10.20.10.49) port all:  FAILED 
#warn: counter SymbolErrorCounter = 156 	(threshold 10) lid 27 port 35
Error check on lid 27 (SUN DCS 36P QDR grepora02sw-iba0 10.20.10.49) port 35:  FAILED 
#warn: counter SymbolErrorCounter = 13896 	(threshold 10) lid 80 port 255
#warn: counter LinkErrorRecoveryCounter = 12 	(threshold 10) lid 80 port 255
#warn: counter PortRcvErrors = 145 	(threshold 10) lid 80 port 255
Error check on lid 80 (SUN DCS 36P QDR grepora01sw-ib1 10.20.10.23) port all:  FAILED 
#warn: counter SymbolErrorCounter = 118 	(threshold 10) lid 80 port 19
#warn: counter PortRcvErrors = 115 	(threshold 10) lid 80 port 19
Error check on lid 80 (SUN DCS 36P QDR grepora01sw-ib1 10.20.10.23) port 19:  FAILED 
#warn: counter SymbolErrorCounter = 13777 	(threshold 10) lid 80 port 22
#warn: counter LinkErrorRecoveryCounter = 12 	(threshold 10) lid 80 port 22
#warn: counter PortRcvErrors = 29 	(threshold 10) lid 80 port 22
Error check on lid 80 (SUN DCS 36P QDR grepora01sw-ib1 10.20.10.23) port 22:  FAILED 
#warn: counter PortRcvErrors = 40 	(threshold 10) lid 53 port 255
Error check on lid 53 (SUN DCS 36P QDR grepora03sw-iba01 10.20.10.76) port all:  FAILED 
#warn: counter PortRcvErrors = 38 	(threshold 10) lid 53 port 8
Error check on lid 53 (SUN DCS 36P QDR grepora03sw-iba01 10.20.10.76) port 8:  FAILED 

## Summary: 40 nodes checked, 0 bad nodes found
##          222 ports checked, 6 ports have errors beyond threshold
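
To get straight to the cables worth inspecting, the individual failing ports can be pulled out of this output. This sketch assumes the `Error check on lid N (description) port P:  FAILED` line format shown above and skips the aggregate "port all" lines; `ib_failed_ports` is a hypothetical helper name.

```shell
# ib_failed_ports: read `ibcheckerrors` output on stdin and print one line
# per individual port that failed, as "lid N port P: switch description".
# Aggregate "port all" lines do not match and are skipped.
# (Hypothetical helper; assumes the line format shown above.)
ib_failed_ports() {
    sed -n 's/^Error check on lid \([0-9]*\) (\(.*\)) port \([0-9]*\): *FAILED.*$/lid \1 port \3: \2/p'
}

# Usage:
#   ibcheckerrors | ib_failed_ports
```

Against the output above, this would list six ports, which matches the "6 ports have errors beyond threshold" summary line.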

 

6) Check sensor health from the switch:
* Run here from a leaf switch.

[root@grepora03sw-ibb01 ~]# showunhealthy
OK - No unhealthy sensors

 

7) Check the overall health of the InfiniBand switch, on the Exadata switch itself:

[root@grepora03sw-ibb01 ~]# env_test
Environment test started:
Starting Environment Daemon test:
Environment daemon running
Environment Daemon test returned OK
Starting Voltage test:
Voltage ECB OK
Measured 3.3V Main = 3.25 V
Measured 3.3V Standby = 3.35 V
Measured 12V = 12.03 V
Measured 5V = 4.99 V
Measured VBAT = 3.09 V
Measured 2.5V = 2.50 V
Measured 1.8V = 1.78 V
Measured I4 1.2V = 1.21 V
Voltage test returned OK
Starting PSU test:
PSU 0 present OK
PSU 1 present OK
PSU test returned OK
Starting Temperature test:
Back temperature 29
Front temperature 31
SP temperature 48
Switch temperature 49, maxtemperature 57
Temperature test returned OK
Starting FAN test:
Fan 0 not present
Fan 1 running at rpm 11445
Fan 2 running at rpm 11445
Fan 3 running at rpm 11445
Fan 4 not present
FAN test returned OK
Starting Connector test:
Connector test returned OK
Starting Onboard ibdevice test:
Switch OK
All Internal ibdevices OK
Onboard ibdevice test returned OK
Starting SSD test:
SSD test returned OK
Environment test PASSED
[root@grepora03sw-ibb01 ~]#
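
When env_test is captured from several switches, it can be handy to show only the sub-tests that did not come back OK. The post shows only a passing run, so the wording of a failing line is an assumption here: the sketch simply keeps any "test returned" line that does not end in OK. `env_test_not_ok` is a hypothetical helper name.

```shell
# env_test_not_ok: read `env_test` output on stdin and print the per-test
# verdict lines ("... test returned ...") that do not end in "OK".
# Prints nothing when every sub-test returned OK.
# (Hypothetical helper; the text of a failing verdict line is an assumption.)
env_test_not_ok() {
    grep ' test returned ' | grep -v ' test returned OK[[:space:]]*$'
}

# Usage, with output saved from the switch into a file:
#   env_test_not_ok < env_test.out
```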

Some references:
https://docs.oracle.com/cd/E19464-01/820-6850-11/IPMItool.html
https://docs.oracle.com/cd/E24707_01/html/E24528/z400000c1016683.html
http://docs.oracle.com/cd/E19654-01/820-7752-12/z400014c1567639.html
http://docs.oracle.com/cd/E19654-01/820-7751-12/z400014e1393674.html
Infiniband Switch in Exadata incorrectly reporting WARNING/FAILURE when Performing ‘showunhealthy’ (Doc ID 1578284.1)
