Reduce Exadata Core Count

Ok, so I was preparing for a DC services migration with a client, which would involve resizing the CPU core count of the Exadatas to better serve those services. One of the steps required reducing the core count at one of the sites to stay aligned with the license terms.

Checking for the steps to accomplish that, I found references for changing CPU and core counts, but always describing an increase in allocation, as per 2.7 Increasing the Number of Active Cores on Database Servers. There was not much about reducing, as this seems to be unusual…

The planned change would also stay within the minimum core requirement, as per 2.1 Restrictions for Capacity-On-Demand on Oracle Exadata Database Machine.

Reviewing MOS, we found the note When Attempting to Change the Number of Cores, Errors With: DBM-10004 – Decreasing the Number of Active Cores is not Supported (Doc ID 2177634.1), which points to using the “FORCE” clause on the “ALTER DBSERVER pendingCoreCount=x” command.

And this worked. I just disabled iaasMode first, to play it safe. Have a look:

[root@grepora01~]# dbmcli
DBMCLI: Release  - Production on Mon Jan 05 01:10:12 EEST 2019
Copyright (c) 2007, 2014, Oracle.  All rights reserved.
DBMCLI> LIST DBSERVER attributes coreCount
	 36/44
DBMCLI> ALTER DBSERVER pendingCoreCount = 24 force
DBM-10022: At least 26 physical cores need to be active in order to support IaaS.
DBMCLI> ALTER DBSERVER iaasMode = "off"
DBServer grepora01 successfully altered
DBMCLI> ALTER DBSERVER pendingCoreCount = 24 force
DBServer grepora01 successfully altered. Please reboot the system to make the new pendingCoreCount effective.
DBMCLI> LIST DBSERVER attributes pendingCoreCount
24/44

–> Restart the server
After restarting, it should look like:

DBMCLI> LIST DBSERVER attributes coreCount
	 24/44
DBMCLI> LIST DBSERVER attributes pendingCoreCount

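By the way, note the DBM-10022 error above: iaasMode enforces a minimum number of active cores (26 on this machine), so it can only be switched back on once the core count allows it. A sketch of what to expect, reusing the commands and the DBM-10022 message from the session above:

DBMCLI> LIST DBSERVER attributes iaasMode
	 off
DBMCLI> ALTER DBSERVER iaasMode = "on"
DBM-10022: At least 26 physical cores need to be active in order to support IaaS.
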
Hope this helps!

Exadata Compute Node Not Starting After a Long Period…

Well,
After a graceful reboot following a long period of uptime, the compute node was simply not starting… What to do?
The best approach is:

1. Connect to the ILOM web console:

Go to: Host Management –> Power Control –> select Power Cycle in the drop-down list.

2. Connect to the ILOM via SSH and start the SP console:
You may do it from another node, of course.

[root@grepora02 ~]# ssh root@grepora01-ilom
Password:

Oracle(R) Integrated Lights Out Manager

Version 3.2.9.23 r116695

Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.

Warning: HTTPS certificate is set to factory default.

Hostname: grepora01-ilom

-> start /SP/console
Are you sure you want to start /SP/console (y/n)? y
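
Alternatively, the power cycle itself can also be issued from the ILOM CLI instead of the web console in step 1. A sketch (stop /SYS is the graceful option; a -force flag exists if it hangs):

-> stop /SYS
Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS

-> start /SYS
Are you sure you want to start /SYS (y/n)? y
Starting /SYS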

And if that doesn't solve it, as always, opening an SR and following up with Oracle is the best way to go…

Hope it helps!

OEM after a Maintenance: A memory component is suspected of causing a fault with a 100% certainty. Component Name : % Fault class : fault.memory.intel.dimm_ce

Hi all!
So, I had this message from a memory component in my Exadata:

Message=A memory component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/P0/D3 Fault class : fault.memory.intel.dimm_ce

But this was right after a maintenance on the server. Checking on the ILOM:

-> show /SYS/MB/P0/D3

 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE

    Properties:
        type = DIMM
        ipmi_name = MB/P0/D3
        fru_name = 16384MB DDR4 SDRAM DIMM
        fru_manufacturer = Samsung
        fru_part_number = %
        fru_rev_level = 01
        fru_serial_number = %
        fault_state = OK
        clear_fault_action = (none)

Checking the alert history on CellCLI:

CellCLI> list alerthistory detail

	 name:                   13_1
	 alertDescription:       "A memory component suspected of causing a fault"
	 alertMessage:           "A memory component is suspected of causing a fault with a 100% certainty.  Component Name : /SYS/MB/P0/D3  Fault class    : fault.memory.intel.dimm_ce  Fault message  : http://support.oracle.com/msg/SPX86A-8002-XM"
	 alertSequenceID:        13
	 alertShortName:         Hardware
	 alertType:              Stateful
	 beginTime:              %
	 endTime:                %
	 examinedBy:             
	 metricObjectName:       /SYS/MB/P0/D3_FAULT
	 notificationState:      1
	 sequenceBeginTime:      %
	 severity:               critical
	 alertAction:            "For additional information, please refer to http://support.oracle.com/msg/SPX86A-8002-XM Automatic Service Request has been notified with Unique Identifier: %.  Diagnostic package is attached. It is also accessible at % It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at %"

Hm… Let’s read the MOS note: SPX86A-8002-XM – Memory Correctable ECC (Doc ID 1615285.1):

Suggested Action for System Administrator

“Replace the faulty memory DIMM at the earliest possible convenience.”

Hmm… But as I said, this was right after a maintenance on the server; what if that's related?
Ok, some additional pieces of information:

-> version 
SP firmware 3.2.9.23 
SP firmware build number: 116695 
SP firmware date: Thu Mar 30 11:38:01 CST 2017 
SP filesystem version: 0.2.10

At the current firmware level (SP 3.2.9.23), the memory correctable error threshold for DIMM replacement is 240 CEs within a 72-hour period.

So, the suggestion is:
– Clear all the error messages after completing the maintenance and check whether the threshold is reached again. If it is, the DIMM may really need to be replaced.

How to do it? Easy:

ssh root@grepora01-ilom
-> show /SYS/MB/P0/D3
Expected:
[...]
fault_state = Faulted
[...]
-> set /SYS/MB/P0/D3 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P0/D3 (y/n)? y
-> show /SYS/MB/P0/D3
[Expected]
 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE

    Properties:
        type = DIMM
        ipmi_name = MB/P0/D3
        fru_name = 16384MB DDR4 SDRAM DIMM
        fru_manufacturer = Samsung
        fru_part_number = %
        fru_rev_level = 01
        fru_serial_number = %
        fault_state = OK
        clear_fault_action = (none)
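
After clearing, keep an eye on whether new correctable errors appear and the 240 CEs/72h threshold is reached again. A sketch for that check from the OS, assuming ipmitool is available on the node (SEL entries for CEs usually mention “correctable”):

[root@grepora01 ~]# ipmitool sel list | grep -i correctable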

Hope it helps!
Cheers!

OEM: The ILOM server is currently offline or unreachable on the network.

Hi all!
Just got an alarm from OEM with this message. How to check it?
– First thing is to be able to connect to the ILOM from a DB node.
– From there, we can test the IPv4 and/or IPv6 interfaces through ping, as shown below.

This is also documented in: Oracle Integrated Lights Out Manager (ILOM) 3.0 HTML Documentation Collection – Test IPv4 or IPv6 Network Configuration (CLI).

In my case, it was only a false alarm, as I was able to reach the ILOM and ping the DB nodes from it:

[root@greporasrv01db01 ~]# ssh greporasrv01-ilom.grepora.com
The authenticity of host 'greporasrv01-ilom.grepora.com (10.48.18.64)' can't be established.
RSA key fingerprint is 59:c5:9f:b1:60:59:15:16:94:c8:94:88:7b:4e:52:57.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'greporasrv01-ilom.grepora.com' (RSA) to the list of known hosts.
Password: 

Oracle(R) Integrated Lights Out Manager

Version 3.2.9.23 r116695

Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.

Warning: HTTPS certificate is set to factory default.

Hostname: greporasrv01-ilom

-> show /SP/network

 /SP/network
    Targets:
        interconnect
        ipv6
        test

    Properties:
        commitpending = (Cannot show property)
        dhcp_clientid = none
        dhcp_server_ip = none
        ipaddress = 10.50.12.64
        ipdiscovery = static
        ipgateway = 10.50.12.1
        ipnetmask = 255.255.255.0
        macaddress = 00:10:E0:95:73:E6
        managementport = MGMT
        outofbandmacaddress = 00:10:E0:95:73:E6
        pendingipaddress = 10.50.12.64
        pendingipdiscovery = static
        pendingipgateway = 10.50.12.1
        pendingipnetmask = 255.255.255.0
        pendingmanagementport = MGMT
        pendingvlan_id = (none)
        sidebandmacaddress = 00:10:E0:95:73:E7
        state = ipv4-only
        vlan_id = (none)

    Commands:
        cd
        set
        show

-> cd /SP/network/test
/SP/network/test

-> show

 /SP/network/test
    Targets:

    Properties:
        ping = (Cannot show property)
        ping6 = (Cannot show property)

    Commands:
        cd
        set
        show

-> set ping=10.50.12.51       -- DBNode1
Ping of 10.50.12.51 succeeded

-> set ping=10.50.12.52       -- DBNode2
Ping of 10.50.12.52 succeeded
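
If the ILOM were configured for IPv6 too (here state = ipv4-only), the same test target exposes a ping6 property, as shown in the listing above. A sketch, with a hypothetical IPv6 address:

-> set ping6=fe80::210:e0ff:fe95:73e6       -- hypothetical address
Ping of fe80::210:e0ff:fe95:73e6 succeeded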


Verifying topology of Exadata

Hey all!
Some time ago I needed to check the topology of a client’s Exadata due to a network issue, and I made a very useful note. Sharing it with you now. 😀

This and other cool commands can be found here: Network Diagnostics information for Oracle Database Machine Environments (Doc ID 1053498.1)

# /opt/oracle.SupportTools/ibdiagtools/verify-topology -t quarterrack

Newer versions don’t require the -t option. For a half rack, -t halfrack should be used instead.

Ok, but how do you know which rack type you have? You can get it from databasemachine.xml:

[root@greporaexa onecommand]# grep -i MACHINETYPES databasemachine.xml
<MACHINETYPES>X4-2 Eighth Rack HC 4TB</MACHINETYPES>

Hope it helps! 🙂

Exadata: 7 Useful Commands to check Port/Sensor Alarms

Hello all!

These days I had an alarm with the message below:

Message=The aggregate sensor /SYS/CABLE_CONN_STAT has a fault.

Here are some useful commands I used to verify all ports/sensors in my Exadata cluster.

In summary, these commands:
1) Use Intelligent Platform Management Interface (IPMI) to read the Sensor Data Record (SDR) repository
2) Use Intelligent Platform Management Interface (IPMI) to view the ILOM SP System Event Log (SEL)
3) Display all host nodes with ibhosts
4) Use ibcheckstate to scan InfiniBand fabric and validate the port logical and physical state
5) Use ibcheckerrors to scan InfiniBand fabric and validate the connectivity as described in the topology file
6) Check sensor health from the switch
7) Check the overall health of the InfiniBand switch, on the Exadata switch itself

The Commands are:

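(The exact invocations below are a sketch — assumed from standard IPMI tooling, the OFED InfiniBand diagnostics, and the Sun Datacenter IB switch CLI; items 6 and 7 run on the switch itself.)

1) ipmitool sdr
2) ipmitool sel list
3) ibhosts
4) ibcheckstate
5) ibcheckerrors
6) env_test
7) showunhealthy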

Exadata: ORA-07445: exception encountered: core dump [ocl_lock_get_waitobj_owner()+26] [11] [0x000000000] [] [] []

Hello all,

If you're seeing it, don't worry: the error is generated by unpublished bug 17891564, as described in MOS note ORA-7445 [ocl_lock_get_waitobj_owner] on an Exadata storage cell (Doc ID 1906366.1).

It affects Exadata storage cells with image versions between 11.2.1.2.0 and 11.2.3.3.0. The CELLSRV process crashes with this error, as per:

Cellsrv encountered a fatal signal 11
Errors in file /opt/oracle/cell11.2.3.3.0_LINUX.X64_131014.1/log/diag/asm/cell//trace/svtrc_11711_27.trc  (incident=257):
ORA-07445: exception encountered: core dump [ocl_lock_get_waitobj_owner()+26] [11] [0x000000000] [] [] []
Incident details in: /opt/oracle/cell11.2.3.3.0_LINUX.X64_131014.1/log/diag/asm/cell//incident/incdir_257/svtrc_11711_27_i257.trc

The CELLSRV process should restart automatically after this error.
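
To confirm that a cell is within the affected image range and that CELLSRV came back up, a quick check can help. A sketch (the hostname is hypothetical, the version string is inferred from the trace path above, and exact output may vary):

[root@greporacel01 ~]# imageinfo -ver
11.2.3.3.0.131014.1
[root@greporacel01 ~]# cellcli -e "list cell attributes cellsrvStatus"
	 running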


Infiniband Error: Cable is present on Port “X” but it is polling for peer port

Facing this error? Let me guess: ports 03, 05, 06, 08, 09 and 12 are alerting? You have a Quarter Rack? Have you recently updated the Exadata plugin to version 12.1.0.3 or higher?
Don’t panic!

This is probably related to Bug 15937297 : EM 12C HAS ERRORS CABLE IS PRESENT ON PORT ‘N’ BUT IT IS POLLING FOR PEER PORT. The full message might be like “Cable is present on Port 6 but it is polling for peer port. This could happen when the peer port is unplugged/disabled”.

In fact, the bug was closed as not a bug. 🙂
As part of the 12.1.0.3 Exadata plugin, the IB switch ports are now checked for non-terminated cables, so these ‘polling for peer port’ errors are the expected behavior. Since ‘polling for peer port’ checking is an enhanced feature of the 12.1.0.3 plugin, this explains why you most likely did not see these errors until you upgraded the OMS to 12.1.0.2 and then updated the plugins.

In Quarter Racks, ports 3, 5, 6, 8, 9 and 12 are usually cabled ahead of time, but not terminated. In some racks, port 32 may also be unterminated. Checking the incident in OEM, you might see something like this image:

[Screenshot: OEM incidents showing the “polling for peer port” alerts]
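
To check from the switch side which connectors have a cable present but no active link, the switch CLI can help. A sketch, assuming the listlinkup command available on the Sun Datacenter IB switches (hostname hypothetical):

[root@greporasw-ib2 ~]# listlinkup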


RS-7445 [Serv MS leaking memory] [It will be restarted] [] [] [] [] [] [] [] [] [] []

Hello!
Having this error in the cell alerthistory.log? Don’t panic!
Take a look at MOS: Exadata Storage Cell reports error RS-7445 [Serv MS Leaking Memory] (Doc ID 1954357.1). It’s related to Bug  – RS-7445 [SERV MS LEAKING MEMORY].

The issue is a memory leak in the Java executable, affecting systems running JDK 7u51 or later. It is relevant for all versions from release 11.2 to 12.1.

What happens is that the MS process consumes high memory (up to 2GB). Normally MS uses around 1GB, but because of the bug the allocated memory can grow up to 2GB. You can check it as per the example below:

[root@exaserver ~]# ps -feal|grep java
0 S root     16493 14737  0  80   0 - 15317 pipe_w 18:34 pts/0    00:00:00 grep java
0 S root     22310 27043  2  80   0 - 267080 futex_ 18:15 ?       00:00:27 /usr/java/default/bin/java -Xms256m -Xmx512m -XX:-UseLargePages -Djava.library.path=/opt/oracle/cell/cellsrv/lib -Ddisable.checkForUpdate=true -jar /opt/oracle/cell/oc4j/ms/j2ee/home/oc4j.jar -out /opt/oracle/cell/cellsrv/deploy/log/ms.lst -err /opt/oracle/cell/cellsrv/deploy/log/ms.err

Note that the SZ column from ps is reported in pages, so: 267080 * 4096 bytes ≈ 1043 MB (about 1GB). If your number is significantly higher than this, it indicates the presence of the bug.
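
If you want that math done for you, here is a one-liner sketch (assuming 4 KB pages; the match pattern is the oc4j.jar process from the ps output above):

[root@exaserver ~]# ps -eo sz,cmd | grep [o]c4j.jar | awk '{printf "%.0f MB\n", $1*4096/1024/1024}'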
