OEM after a Maintenance: A memory component is suspected of causing a fault with a 100% certainty. Component Name : % Fault class : fault.memory.intel.dimm_ce

Hi all!
So, I had this message from a memory component in my Exadata:

Message=A memory component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/P0/D3 Fault class : fault.memory.intel.dimm_ce

But this was right after a maintenance on server. Checking on ILOM:

-> show /SYS/MB/P0/D3

 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE

    Properties:
        type = DIMM
        ipmi_name = MB/P0/D3
        fru_name = 16384MB DDR4 SDRAM DIMM
        fru_manufacturer = Samsung
        fru_part_number = %
        fru_rev_level = 01
        fru_serial_number = %
        fault_state = OK
        clear_fault_action = (none)

Checking on CellCLI alert history:

CellCLI> list alerthistory detail

	 name:                   13_1
	 alertDescription:       "A memory component suspected of causing a fault"
	 alertMessage:           "A memory component is suspected of causing a fault with a 100% certainty.  Component Name : /SYS/MB/P0/D3  Fault class    : fault.memory.intel.dimm_ce  Fault message  : http://support.oracle.com/msg/SPX86A-8002-XM"
	 alertSequenceID:        13
	 alertShortName:         Hardware
	 alertType:              Stateful
	 beginTime:              %
	 endTime:                %
	 examinedBy:             
	 metricObjectName:       /SYS/MB/P0/D3_FAULT
	 notificationState:      1
	 sequenceBeginTime:      %
	 severity:               critical
	 alertAction:            "For additional information, please refer to http://support.oracle.com/msg/SPX86A-8002-XM Automatic Service Request has been notified with Unique Identifier: %.  Diagnostic package is attached. It is also accessible at % It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at %"

Hm… Let’s read the MOS: SPX86A-8002-XM – Memory Correctable ECC (Doc ID 1615285.1)

Suggested Action for System Administrator

Replace the faulty memory DIMM at the earliest possible convenience.”

Hmm… But as I said, this was right after a maintenance on server, what if this is related?
Ok, some additional piece of information:

-> version 
SP firmware 3.2.9.23 
SP firmware build number: 116695 
SP firmware date: Thu Mar 30 11:38:01 CST 2017 
SP filesystem version: 0.2.10

At the current firmware level of SP firmware 3.2.9.23 the memory correctable error threshold limit for DIMM replacement is 240 CEs in a 72 hrs period.

So, the suggestion is:
– Clear all the error messages after complete the maintenance and lets check if the threshold is reached again. If so, we may need to really replace it.

How to do it? Easy:

ssh root@grepora01-ilom
-> show /SYS/MB/P0/D3
Expected:
[...]
fault_state = Faulted
[..]
-> set /SYS/MB/P0/D3 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P0/D3 (y/n)? y
-> show /SYS/MB/P0/D3
[Expected]
 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE
Properties:
type = DIMM
ipmi_name = MB/P0/D3
fru_name = 16384MB DDR4 SDRAM DIMM
fru_manufacturer = Samsung
fru_part_number = %
fru_rev_level = 01
fru_serial_number = %
 fault_state = OK
clear_fault_action = (none)

Hope it helps!
Cheers!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.