I want to thank you all for attending yesterday’s workshop promoted by GUORS.
It was a pleasure to have you there, and we can barely wait for the next one!
Today’s post is a very simple one, since sometimes a simple success-case post is appreciated. So, how do you generate a Sundiag report for Oracle?
First, what is a Sundiag?
It is an Oracle hardware healthcheck to be run on your environment. It doesn’t give us any diagnosis by itself, but its output is usually requested by Oracle on SRs related to network, disk I/O, or any other possible hardware/firmware issue.
So, how do you run it? As simple as this (the path is always the same):
[root@greporacel01 ~]# cd /opt/oracle.SupportTools/
[root@greporacel01 oracle.SupportTools]# /opt/oracle.SupportTools/sundiag.sh
Oracle Exadata Database Machine - Diagnostics Collection Tool
Gathering Linux information
Skipping ILOM collection. Use the ilom or snapshot options, or login to ILOM over the network and run Snapshot separately if necessary.
/tmp/sundiag_greporacel01_1108FMM0FF_2017_01_01_02_17
Gathering Cell information
Generating diagnostics tarball and removing temp directory
====================================================================
Done. The report files are bzip2 compressed in /tmp/sundiag_greporacel01_1108FMM0FF_2017_01_01_02_17.tar.bz2
====================================================================
[root@greporacel01 oracle.SupportTools]#
Now you just need to take the generated file and attach it to your SR. Simple, right?
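As the tool’s own output above mentions, sundiag.sh can also include ILOM data via the ilom or snapshot options; a quick sketch (check the sundiag MOS note for the options supported by your image version):

# Default run (skips ILOM collection):
/opt/oracle.SupportTools/sundiag.sh
# Also gather ILOM data through the local interface:
/opt/oracle.SupportTools/sundiag.sh ilom
# Or collect a full ILOM snapshot instead (slower, more complete):
/opt/oracle.SupportTools/sundiag.sh snapshot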
Hope it helps!
OK, so I was preparing for a datacenter services migration with a client, which would involve resizing the CPU count of the Exadatas to better serve those services. As part of this, one of the steps required reducing the CPU count on one of the sites to stay aligned with the license terms.
Checking for the steps to accomplish that, I found references for changing the CPU and core count, but always describing the case of increasing the allocation, as in 2.7 Increasing the Number of Active Cores on Database Servers. Not much about reducing, as this seems to be unusual…
Also worth noting: the planned change would still respect the minimum core count requirement described in 2.1 Restrictions for Capacity-On-Demand on Oracle Exadata Database Machine.
Reviewing MOS, we found When Attempting to Change the Number of Cores, Errors With: DBM-10004 – Decreasing the Number of Active Cores is not Supported (Doc ID 2177634.1), which points to using the FORCE clause on the “ALTER DBSERVER pendingCoreCount=x” command.
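Just for context, without FORCE the decrease is simply rejected with the DBM-10004 error from the note’s title; roughly like this (message text approximated):

DBMCLI> ALTER DBSERVER pendingCoreCount = 24
DBM-10004: Decreasing the number of active cores is not supported.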
And this worked. I just disabled iaasMode to play it safe. Have a look:
[root@grepora01 ~]# dbmcli
DBMCLI: Release - Production on Mon Jan 05 01:10:12 EEST 2019
Copyright (c) 2007, 2014, Oracle. All rights reserved.

DBMCLI> LIST DBSERVER attributes coreCount
         36/44

DBMCLI> ALTER DBSERVER pendingCoreCount = 24 force
DBM-10022: At least 26 physical cores need to be active in order to support IaaS.

DBMCLI> ALTER DBSERVER iaasMode = "off"
DBServer exadb01 successfully altered

DBMCLI> ALTER DBSERVER pendingCoreCount = 24 force
DBServer grepora01 successfully altered. Please reboot the system to make the new pendingCoreCount effective.

DBMCLI> LIST DBSERVER attributes pendingCoreCount
         24/44
–> Restart the server
After restarting, it should look like this:
DBMCLI> LIST DBSERVER attributes coreCount
         24/44

DBMCLI> LIST DBSERVER attributes pendingCoreCount
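One note if you use IaaS: as the DBM-10022 message above shows, IaaS mode requires a minimum number of active cores (26 in this case), so presumably iaasMode can only be turned back on if the new core count still meets that minimum:

DBMCLI> ALTER DBSERVER iaasMode = "on"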
Hope this helps!
After a long time waiting on a graceful reboot, the compute node was simply not starting… What to do?
The best approach is:
1. Connect to ILOM Console:
Go to Host Management –> Power Control and select Power Cycle in the drop-down list.
2. Connect to the ILOM server and start the SP console:
You may do it from another node, of course.
[root@grepora02 ~]# ssh root@grepora01-ilom
Password:

Oracle(R) Integrated Lights Out Manager
Version 184.108.40.206 r116695
Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
Warning: HTTPS certificate is set to factory default.
Hostname: grepora01-ilom

-> start /SP/console
Are you sure you want to start /SP/console (y/n)? y
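By the way, if you prefer the ILOM CLI over the web interface for step 1, the power cycle can also be done with the standard /SYS power targets; a minimal sketch:

-> stop /SYS
Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS

-> start /SYS
Are you sure you want to start /SYS (y/n)? y
Starting /SYS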
And if that doesn’t work, as always, creating an SR and following up with Oracle is the best way to go…
Hope it helps!
So, I had this message from a memory component in my Exadata:
Message=A memory component is suspected of causing a fault with a 100% certainty.
Component Name : /SYS/MB/P0/D3
Fault class : fault.memory.intel.dimm_ce
But this was right after maintenance on the server. Checking on the ILOM:
-> show /SYS/MB/P0/D3

 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE

    Properties:
        type = DIMM
        ipmi_name = MB/P0/D3
        fru_name = 16384MB DDR4 SDRAM DIMM
        fru_manufacturer = Samsung
        fru_part_number = %
        fru_rev_level = 01
        fru_serial_number = %
        fault_state = OK
        clear_fault_action = (none)
Checking the alert history on CellCLI:
CellCLI> list alerthistory detail
         name:                   13_1
         alertDescription:       "A memory component suspected of causing a fault"
         alertMessage:           "A memory component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/P0/D3 Fault class : fault.memory.intel.dimm_ce Fault message : http://support.oracle.com/msg/SPX86A-8002-XM"
         alertSequenceID:        13
         alertShortName:         Hardware
         alertType:              Stateful
         beginTime:              %
         endTime:                %
         examinedBy:
         metricObjectName:       /SYS/MB/P0/D3_FAULT
         notificationState:      1
         sequenceBeginTime:      %
         severity:               critical
         alertAction:            "For additional information, please refer to http://support.oracle.com/msg/SPX86A-8002-XM Automatic Service Request has been notified with Unique Identifier: %. Diagnostic package is attached. It is also accessible at % It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at %"
Hm… Let’s read the MOS note: SPX86A-8002-XM – Memory Correctable ECC (Doc ID 1615285.1):
“Suggested Action for System Administrator
Replace the faulty memory DIMM at the earliest possible convenience.”
Hmm… But as I said, this was right after maintenance on the server. What if it’s related?
OK, an additional piece of information:
-> version
SP firmware 220.127.116.11
SP firmware build number: 116695
SP firmware date: Thu Mar 30 11:38:01 CST 2017
SP filesystem version: 0.2.10
At the current firmware level (SP firmware 18.104.22.168), the memory correctable error threshold for DIMM replacement is 240 CEs in a 72-hour period.
So, the suggestion is:
– Clear all the error messages after completing the maintenance, and check whether the threshold is reached again. If so, we may really need to replace the DIMM.
How to do it? Easy:
ssh root@grepora01-ilom

-> show /SYS/MB/P0/D3
[Expected]
[...]
        fault_state = Faulted
[...]

-> set /SYS/MB/P0/D3 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P0/D3 (y/n)? y

-> show /SYS/MB/P0/D3
[Expected]
 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE

    Properties:
        type = DIMM
        ipmi_name = MB/P0/D3
        fru_name = 16384MB DDR4 SDRAM DIMM
        fru_manufacturer = Samsung
        fru_part_number = %
        fru_rev_level = 01
        fru_serial_number = %
        fault_state = OK
        clear_fault_action = (none)
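If you want to double-check the fault state from the ILOM before and after clearing it, the fault management shell can also be used; a sketch, assuming an ILOM version recent enough to ship it:

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> fmadm faulty
faultmgmtsp> exit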
Hope it helps!
Just got an alarm from OEM with this message. How to check it?
– The first thing is to be able to connect to the ILOM from the DB node.
– From there, we can test the IPv4 and/or IPv6 interfaces through ping, as shown below.
This is also documented in: Oracle Integrated Lights Out Manager (ILOM) 3.0 HTML Documentation Collection – Test IPv4 or IPv6 Network Configuration (CLI).
In my case, it was only a false alarm, as I was able to reach the other DB nodes from this ILOM:
[root@greporasrv01db01 ~]# ssh greporasrv01-ilom.grepora.com
The authenticity of host 'greporasrv01-ilom.grepora.com (10.48.18.64)' can't be established.
RSA key fingerprint is 59:c5:9f:b1:60:59:15:16:94:c8:94:88:7b:4e:52:57.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'greporasrv01-ilom.grepora.com' (RSA) to the list of known hosts.
Password:

Oracle(R) Integrated Lights Out Manager
Version 22.214.171.124 r116695
Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
Warning: HTTPS certificate is set to factory default.
Hostname: greporasrv01-ilom

-> show /SP/network

 /SP/network
    Targets:
        interconnect
        ipv6
        test

    Properties:
        commitpending = (Cannot show property)
        dhcp_clientid = none
        dhcp_server_ip = none
        ipaddress = 10.50.12.64
        ipdiscovery = static
        ipgateway = 10.50.12.1
        ipnetmask = 255.255.255.0
        macaddress = 00:10:E0:95:73:E6
        managementport = MGMT
        outofbandmacaddress = 00:10:E0:95:73:E6
        pendingipaddress = 10.50.12.64
        pendingipdiscovery = static
        pendingipgateway = 10.50.12.1
        pendingipnetmask = 255.255.255.0
        pendingmanagementport = MGMT
        pendingvlan_id = (none)
        sidebandmacaddress = 00:10:E0:95:73:E7
        state = ipv4-only
        vlan_id = (none)

    Commands:
        cd
        set
        show

-> cd /SP/network/test
/SP/network/test

-> show /SP/network/test

 /SP/network/test
    Targets:

    Properties:
        ping = (Cannot show property)
        ping6 = (Cannot show property)

    Commands:
        cd
        set
        show

-> set ping=10.50.12.51    -- DBNode1
Ping of 10.50.12.51 succeeded

-> set ping=10.50.12.52    -- DBNode2
Ping of 10.50.12.52 succeeded
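The same test target supports IPv6 through the ping6 property shown above; presumably the same pattern applies:

-> set ping6=<ipv6-address>
Ping of <ipv6-address> succeeded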
Some time ago I needed to check the topology of a client’s Exadata due to a network issue, and I made a very useful note. Sharing it with you now. 😀
This and other cool commands can be found here: Network Diagnostics information for Oracle Database Machine Environments (Doc ID 1053498.1)
# /opt/oracle.SupportTools/ibdiagtools/verify-topology -t quarterrack
Newer versions don’t require the -t option.
In the case of a half rack, -t halfrack should be used, and so on for the other rack types.
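Putting it together, a sketch of the invocations (the fullrack value is my assumption, following the same naming pattern):

# Quarter (and eighth) rack:
/opt/oracle.SupportTools/ibdiagtools/verify-topology -t quarterrack
# Half rack:
/opt/oracle.SupportTools/ibdiagtools/verify-topology -t halfrack
# Full rack:
/opt/oracle.SupportTools/ibdiagtools/verify-topology -t fullrack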
OK, but how do you know which rack type you have? You can get it from here:
[root@greporaexa onecommand]# grep -i MACHINETYPES databasemachine.xml
<MACHINETYPES>X4-2 Eighth Rack HC 4TB</MACHINETYPES>
Hope it helps! 🙂
These days I had an alarm with the message below:
Message=The aggregate sensor /SYS/CABLE_CONN_STAT has a fault.
Here are some useful commands I used to verify all the ports/sensors in my Exadata cluster.
In summary, these commands:
1) Use Intelligent Platform Management Interface (IPMI) to read the Sensor Data Record (SDR) repository
2) Use Intelligent Platform Management Interface (IPMI) to view the ILOM SP System Event Log (SEL)
3) Display all host nodes with ibhosts
4) Use ibcheckstate to scan InfiniBand fabric and validate the port logical and physical state
5) Use ibcheckerrors to scan InfiniBand fabric and validate the connectivity as described in the topology file
6) Check sensor health from the switch
7) Check the overall health of the InfiniBand switch, on the Exadata switch itself
The Commands are:
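Roughly one per item above; exact flags may vary by image and firmware version, and items 6 and 7 are run on the IB switch itself:

# 1) Read the Sensor Data Record (SDR) repository via IPMI:
ipmitool sdr
# 2) View the ILOM SP System Event Log (SEL):
ipmitool sel list
# 3) Display all host nodes on the InfiniBand fabric:
ibhosts
# 4) Validate the port logical and physical state across the fabric:
ibcheckstate
# 5) Validate connectivity and check error counters across the fabric:
ibcheckerrors
# 6) On the switch: check for unhealthy sensors:
showunhealthy
# 7) On the switch: check its overall health/environment:
env_test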
This is because the error is generated by unpublished bug 17891564, as described in MOS note ORA-7445 [ocl_lock_get_waitobj_owner] on an Exadata storage cell (Doc ID 1906366.1).
It affects Exadata storage cells with an image version between 126.96.36.199.0 and 188.8.131.52.0. The CELLSRV process crashes with this error, as per:
Cellsrv encountered a fatal signal 11
Errors in file /opt/oracle/cell184.108.40.206.0_LINUX.X64_131014.1/log/diag/asm/cell//trace/svtrc_11711_27.trc (incident=257):
ORA-07445: exception encountered: core dump [ocl_lock_get_waitobj_owner()+26] [0x000000000]
Incident details in: /opt/oracle/cell220.127.116.11.0_LINUX.X64_131014.1/log/diag/asm/cell//incident/incdir_257/svtrc_11711_27_i257.trc
The CELLSRV process should restart automatically after this error.
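To check whether your cells are within the affected range, the active cell image version can be verified with the standard imageinfo utility, or across all cells with dcli (the cell_group file here is the usual convention for the list of cells):

# On a single storage cell:
imageinfo -ver

# Across all cells, from a DB node:
dcli -g cell_group -l root "imageinfo -ver"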
Facing this error? Let me guess: ports 03, 05, 06, 08, 09, and 12 are alerting? You have a Quarter Rack? Have you recently updated the Exadata plugin to version 18.104.22.168 or higher?
This is probably related to Bug 15937297: EM 12C HAS ERRORS CABLE IS PRESENT ON PORT ‘N’ BUT IT IS POLLING FOR PEER PORT. The full message might look like: “Cable is present on Port 6 but it is polling for peer port. This could happen when the peer port is unplugged/disabled.”
In fact, the bug was closed as not a bug. 🙂
As part of the 22.214.171.124 Exadata plugin, the IB switch ports are now checked for non-terminated cables, so these “polling for peer port” errors are the expected behavior. Since the “polling for peer port” check is an enhancement of the 126.96.36.199 plugin, this explains why you most likely did not see these errors until you upgraded the OMS to 188.8.131.52 and then updated the plugins.
In Quarter Racks, ports 3, 5, 6, 8, 9, and 12 are usually cabled ahead of time but not terminated. In some racks, port 32 may also be unterminated. Checking the incident in OEM, you might see something like the image below.
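If you want to confirm from the switch side which ports really have live links, the switch’s own tooling can help; a sketch (listlinkup is available on the Exadata IB switches):

# On the IB switch, list the link status of each connector/port:
listlinkup
# The pre-cabled but unterminated ports (3, 5, 6, 8, 9, 12) should show no active link.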