Hi all!
So, I got this alert about a memory component in my Exadata:
Message=A memory component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/P0/D3 Fault class : fault.memory.intel.dimm_ce
But this showed up right after maintenance on the server. Checking on ILOM:
-> show /SYS/MB/P0/D3
 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE
    Properties:
        type = DIMM
        ipmi_name = MB/P0/D3
        fru_name = 16384MB DDR4 SDRAM DIMM
        fru_manufacturer = Samsung
        fru_part_number = %
        fru_rev_level = 01
        fru_serial_number = %
        fault_state = OK
        clear_fault_action = (none)
Checking on CellCLI alert history:
CellCLI> list alerthistory detail
    name: 13_1
    alertDescription: "A memory component suspected of causing a fault"
    alertMessage: "A memory component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/P0/D3 Fault class : fault.memory.intel.dimm_ce Fault message : http://support.oracle.com/msg/SPX86A-8002-XM"
    alertSequenceID: 13
    alertShortName: Hardware
    alertType: Stateful
    beginTime: %
    endTime: %
    examinedBy:
    metricObjectName: /SYS/MB/P0/D3_FAULT
    notificationState: 1
    sequenceBeginTime: %
    severity: critical
    alertAction: "For additional information, please refer to http://support.oracle.com/msg/SPX86A-8002-XM Automatic Service Request has been notified with Unique Identifier: %. Diagnostic package is attached. It is also accessible at % It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at %"
Hm… Let’s read the MOS note: SPX86A-8002-XM – Memory Correctable ECC (Doc ID 1615285.1)
“Suggested Action for System Administrator
Replace the faulty memory DIMM at the earliest possible convenience.”
Hmm… But as I said, this showed up right after maintenance on the server. What if the two are related?
OK, one additional piece of information:
-> version
SP firmware 3.2.9.23
SP firmware build number: 116695
SP firmware date: Thu Mar 30 11:38:01 CST 2017
SP filesystem version: 0.2.10
At the current SP firmware level (3.2.9.23), the correctable error (CE) threshold for DIMM replacement is 240 CEs within a 72-hour period.
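To make the "240 CEs in 72 hours" idea concrete, here is a minimal Python sketch of a sliding-window count. It is only an illustration: how ILOM actually counts correctable errors internally is not covered here, the timestamps are made-up data, and treating the limit as "at least 240" rather than "more than 240" is an assumption.

```python
from collections import deque
from datetime import datetime, timedelta

# Values quoted above for SP firmware 3.2.9.23.
CE_THRESHOLD = 240
WINDOW = timedelta(hours=72)


def threshold_reached(ce_timestamps, threshold=CE_THRESHOLD, window=WINDOW):
    """Return True if any `window`-long span holds at least `threshold` events.

    Assumption: "at least" is used here; the real firmware rule may differ.
    """
    recent = deque()
    for ts in sorted(ce_timestamps):
        recent.append(ts)
        # Drop events that fall outside the window ending at `ts`.
        while ts - recent[0] >= window:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False


# Hypothetical data: one correctable error every 10 minutes (about 50 hours).
start = datetime(2017, 4, 1)
burst = [start + timedelta(minutes=10 * i) for i in range(300)]
# Hypothetical data: one correctable error per hour for 10 days.
trickle = [start + timedelta(hours=i) for i in range(240)]

print(threshold_reached(burst))    # True
print(threshold_reached(trickle))  # False
```

The point of the second case: the same total number of errors spread over a longer period never puts 240 of them inside a single 72-hour window, so a short burst during maintenance and a slow trickle are very different situations.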
So, the suggestion is:
– Clear the fault once the maintenance is complete, then check whether the threshold is reached again. If it is, we really do need to replace the DIMM.
How to do it? Easy:
ssh root@grepora01-ilom

-> show /SYS/MB/P0/D3
Expected:
[...]
        fault_state = Faulted
[...]

-> set /SYS/MB/P0/D3 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P0/D3 (y/n)? y

-> show /SYS/MB/P0/D3
[Expected]
 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE
    Properties:
        type = DIMM
        ipmi_name = MB/P0/D3
        fru_name = 16384MB DDR4 SDRAM DIMM
        fru_manufacturer = Samsung
        fru_part_number = %
        fru_rev_level = 01
        fru_serial_number = %
        fault_state = OK
        clear_fault_action = (none)
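If you want to check fault_state for several DIMMs without reading each output by eye, a small parser helps. This is just a sketch: `parse_ilom_properties` is a hypothetical helper of mine, not an Oracle tool, and it assumes you already saved the text of the `show` command and that each property is printed on its own line as `key = value`, as in the output above.

```python
def parse_ilom_properties(show_output):
    """Collect `key = value` pairs from the text of an ILOM `show` command.

    Assumption: one property per line, separated by " = ".
    """
    props = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # lines without " = " (targets, headers) are skipped
            props[key.strip()] = value.strip()
    return props


# Sample text based on the output in this post (redacted fields left out).
sample = """\
 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE
    Properties:
        type = DIMM
        ipmi_name = MB/P0/D3
        fru_name = 16384MB DDR4 SDRAM DIMM
        fru_manufacturer = Samsung
        fru_rev_level = 01
        fault_state = OK
        clear_fault_action = (none)
"""

props = parse_ilom_properties(sample)
print(props["fault_state"])  # OK
```

Anything other than `OK` in `fault_state` after the clear means the fault came back and the DIMM deserves a closer look.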
Hope it helps!
Cheers!