Exadata DNS Change – Pitfalls to be avoided

Hi all, it’s been a while but here I am!

There were some changes in the infrastructure at the place I work and I was asked to do a DNS change on a somewhat old Exadata X5. I had never done one before, so the idea of this post is to help others who might face the issues I had.

The first thing I did was look up the documentation to see the steps. Yes, there are blogs about it, but the documentation gives at least a first overview of the situation.

Long story short: Exadata has lots of components and the new DNS should be changed on all of them.

Here is a summary of the steps.

Infiniband switches

Connect to the switches, switch to the ilom-admin user and change the DNS:

su - ilom-admin
show /SP/clients/dns
set /SP/clients/dns nameserver=192.168.16.1,192.168.16.2,192.168.16.3
show /SP/clients/dns

 

Database nodes

For my image I only needed to change /etc/resolv.conf. If you have a newer image you will need to use ipconf. That's why you need to go to the documentation; at least there we hope they will mention the pitfalls (keep reading and you will see that was not my case).
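
A quick way to confirm the edit on each database node is to compare the nameserver lines in the file against the list you expect. The sketch below is only an illustration: the function name check_resolv_conf is made up for this post, and the addresses in the usage comment are the same sample addresses used for the switches above.

```shell
# Hypothetical helper (not part of Exadata): verify that every expected
# nameserver appears in a resolv.conf-style file.
# Usage: check_resolv_conf FILE NS1 [NS2 ...]
check_resolv_conf() {
    file=$1
    shift
    rc=0
    for ns in "$@"; do
        # Look for a line whose first field is "nameserver" and second is the IP.
        if ! awk -v ns="$ns" '$1 == "nameserver" && $2 == ns { found = 1 } END { exit !found }' "$file"; then
            echo "missing nameserver $ns in $file" >&2
            rc=1
        fi
    done
    return $rc
}

# Example (sample addresses from this post):
# check_resolv_conf /etc/resolv.conf 192.168.16.1 192.168.16.2 192.168.16.3
```

The function returns non-zero if any expected server is missing, so it can be dropped into a loop over the nodes.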

I also changed the DNS on each database node's ILOM, running ipmitool from each node:

ipmitool sunoem cli 'show /SP/clients/dns'
ipmitool sunoem cli 'set /SP/clients/dns nameserver=192.168.16.1,192.168.16.2,192.168.16.3'
ipmitool sunoem cli 'show /SP/clients/dns'


Cell nodes – Here things start to get interesting

For the storage cells there are some points that need to be taken into consideration:

Increase the ASM disk_repair_time. The goal is to avoid a full rebalance, provided you finish the change within that timeframe. If you don't know this parameter: ASM will wait up to the interval specified by DISK_REPAIR_TIME for the disk(s) to come back online. If the disk(s) come back within this interval, a resync operation occurs, where only the extents that were modified while the disks were offline are written to the disks. If the disk(s) do not come back within this interval, ASM initiates a forced drop of the disk(s), which triggers a rebalance.
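
As an illustration, the attribute is set per disk group from the ASM instance. The sketch below only prints the ALTER DISKGROUP statements so they can be reviewed before being piped into sqlplus. The disk group names (DATAC1, RECOC1), the 8h value and the function name are assumptions for the example, not values from my system.

```shell
# Hypothetical helper: print one ALTER DISKGROUP statement per disk group.
# Usage: gen_repair_time_sql DURATION DISKGROUP [DISKGROUP ...]
gen_repair_time_sql() {
    duration=$1
    shift
    for dg in "$@"; do
        printf "ALTER DISKGROUP %s SET ATTRIBUTE 'disk_repair_time' = '%s';\n" "$dg" "$duration"
    done
}

# Review the output first, then run it as the grid owner, for example:
# gen_repair_time_sql 8h DATAC1 RECOC1 | sqlplus -s "/ as sysasm"
```

Remember to set the attribute back to its previous value once all cells are done.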

On each cell node we need to make sure all disks are OK, take all grid disks offline, stop all cell services and use ipconf to change the DNS configuration:

#Check that putting the grid disks offline will not cause a problem for Oracle ASM - asmdeactivationoutcome (the 3rd column) should say Yes for every grid disk
cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus,asmdeactivationoutcome

#Inactivate all grid disks on the cell - may take a while to complete
cellcli -e ALTER GRIDDISK ALL INACTIVE


#Confirm the grid disks are offline, it should show asmmodestatus=OFFLINE or asmmodestatus=UNUSED, and asmdeactivationoutcome=Yes for all grid disks
cellcli -e LIST GRIDDISK ATTRIBUTES name, asmmodestatus,asmdeactivationoutcome

#Confirm that the disks are offline
cellcli -e LIST GRIDDISK

#Shut down the cell services and ocrvottargetd service
cellcli -e ALTER CELL SHUTDOWN SERVICES ALL
service ocrvottargetd stop #on some images this service does not exist
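
The first check in the sequence above is easy to misread by eye on a cell with many grid disks. As a sketch, its output can be piped through a small filter that fails if any disk does not report Yes. The function name is hypothetical, and it assumes the three-column output (name, asmmodestatus, asmdeactivationoutcome) of the LIST GRIDDISK command shown above.

```shell
# Hypothetical filter: read "name asmmodestatus asmdeactivationoutcome"
# lines on stdin and fail if any grid disk is not safe to deactivate.
# It also fails when no grid disk lines were read at all.
safe_to_deactivate() {
    awk '
        NF >= 3 { seen = 1 }
        NF >= 3 && tolower($3) != "yes" { print "NOT SAFE: " $1 " (" $3 ")"; bad = 1 }
        END { exit (bad || !seen) }
    '
}

# Example, run on the cell before ALTER GRIDDISK ALL INACTIVE:
# cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus,asmdeactivationoutcome \
#     | safe_to_deactivate && echo "all grid disks can go offline"
```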

On older images, running ipconf is just a matter of calling it and following the prompts. On a newer image you will need to pass its parameters as shown in the documentation.

The documentation says that after this we can start the cell services back up, but I would recommend validating the DNS before doing that. Why, you might ask? Because mine did not work, and I could have ended up with a bigger issue: a cell node without DNS trying to start its services.

So, how to test? Use nslookup, dig and curl (curl against port 53 only shows whether the DNS server's TCP port is reachable):

nslookup dns_domain.com
curl -v 192.168.16.1:53
dig another_server_in_the_network
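
Since the resolver quietly falls back from one server to the next, it can help to query each new DNS server explicitly. The sketch below does that with dig. The function name and the host name in the usage comment are made up for the example; the addresses are the sample ones from this post.

```shell
# Hypothetical helper: ask each DNS server directly for the same name and
# report which ones answer. Usage: check_dns_servers NAME SERVER [SERVER ...]
check_dns_servers() {
    name=$1
    shift
    rc=0
    for server in "$@"; do
        # +short prints only the answer; +time/+tries keep a blocked server
        # from hanging the loop. Lines starting with ';' are dig diagnostics
        # (for example a timeout message), so they do not count as an answer.
        if answer=$(dig +short +time=2 +tries=1 "@$server" "$name" 2>/dev/null | grep -v '^;'); then
            echo "OK   $server -> $answer"
        else
            echo "FAIL $server returned no answer for $name"
            rc=1
        fi
    done
    return $rc
}

# Example:
# check_dns_servers another_server_in_the_network 192.168.16.1 192.168.16.2 192.168.16.3
```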

 

My tests did not work: I was able to ping the DNS servers but not to resolve any name. I had an SR open on MOS, but it did not help much either. Since this is a production system, I kept looking and checked whether the firewall was up on the Linux side, and to my surprise it was.

I tried to manually add rules to iptables but that did not work, and then I came across this note: Exadata: New DNS server is not accessible after changing using IPCONF (Doc ID 1581417.1)

And there it was, I needed to restart the cellwall service to recreate the iptables rules.

# Restart cellwall service
service cellwall restart
service cellwall status
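
With name resolution working, the cell can be brought back. The cellcli commands in the comments below are the usual counterpart of the shutdown steps above as I understand the Oracle documentation, so treat them as an assumption and check them against the docs for your image. The filter function itself is hypothetical: it only checks that every grid disk reports ONLINE (disks normally pass through SYNCING first).

```shell
# Hypothetical filter: read "name asmmodestatus" lines on stdin and succeed
# only when every grid disk reports ONLINE (UNUSED disks are accepted too).
# It also fails when no grid disk lines were read at all.
all_griddisks_online() {
    awk '
        NF >= 2 { seen = 1 }
        NF >= 2 && $2 != "ONLINE" && $2 != "UNUSED" { print "waiting: " $1 " is " $2; pending = 1 }
        END { exit (pending || !seen) }
    '
}

# Sketch of the bring-up, run on the cell once DNS resolves:
# cellcli -e ALTER CELL STARTUP SERVICES ALL
# cellcli -e ALTER GRIDDISK ALL ACTIVE
# until cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus | all_griddisks_online; do
#     sleep 60
# done
```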

One final point: check whether ASM started a rebalance. If it did, do not bring down another cell node until the rebalance has finished, otherwise you may run into deeper issues.
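
One way to check this is to count the rows in gv$asm_operation from the ASM instance. The sketch below assumes it runs as the grid owner with the ASM environment already set; the function name is made up for the example.

```shell
# Hypothetical helper: print the number of rows in gv$asm_operation.
# Assumes the grid owner's environment (ORACLE_HOME, ORACLE_SID) is set.
asm_operations_running() {
    sqlplus -s "/ as sysasm" <<'EOF' | tr -d '[:space:]'
set heading off feedback off pagesize 0
select count(*) from gv$asm_operation;
exit
EOF
}

# Usage: wait before touching the next cell.
# while [ "$(asm_operations_running)" != "0" ]; do sleep 60; done
```

If sqlplus fails, the output will not be "0" and the loop keeps waiting, so look at the raw output if it never ends.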

 

I hope it helps!

Elisson Almeida

General ILOM Faults Management

Hi all,
So just a quick reference: Some useful general commands for ILOM:

1. Check for Faults:

/home/boesing> ssh root@[ilom ip]    

Oracle(R) Integrated Lights Out Manager

Version 3.1.2.10 r74387

Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.

-> cd /SP/faultmgmt
/SP/faultmgmt

-> start shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> 
faultmgmtsp> fmadm 
Usage: fmadm <subcommand>
  where <subcommand> is one of the following:
    faulty [-asv] [-u <uuid>]   : display list of faulty resources
    faulty -f                   : display faulty on FRUs
    acquit <FRU>                : acquit faults on a FRU
    acquit <UUID>               : acquit faults associated with UUID
    acquit <FRU> <UUID>         : acquit faults specified by
                                  (FRU, UUID) combination
    replaced <FRU>              : replaced faults on a FRU
    repaired <FRU>              : repaired faults on a FRU
    repair <FRU>                : repair faults on a FRU
    rotate errlog               : rotate error log
    rotate fltlog               : rotate fault log

faultmgmtsp> 
faultmgmtsp> fmadm faulty
No faults found

– Another way to see current issues:

show /SP/logs/event/list
show faulty

2. Clearing Faults:
If there is a failure that can be ignored (for example, loss of AC power), it may be cleared:

set /SYS/PSU1 clear_fault_action=true

3. Checking Additional Logs:

start /SYS 
->y 
ls /SYS 
start /SP/console 
-> y
show /SP/logs/event/list 

Some MOS notes for reference:

– Diagnostic information for ILOM, ILO , LO100 issues (Doc ID 1062544.1)
– How to run an ILOM Snapshot on a Sun/Oracle X86 System (Doc ID 1448069.1)
– PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1)

Clear Exadata Component Messages After Maintenance

Hi all,

Quick one today: so you completed maintenance on a component (a memory module, as in the example below) but keep receiving failure messages?

Well, try clearing all the error messages after completing the maintenance and check whether the threshold is reached again. If so, we may really need to replace it.

How to do it? Easy:

ssh root@grepora01-ilom
-> show /SYS/MB/P0/D3
Expected:
[...]
fault_state = Faulted
[..]
-> set /SYS/MB/P0/D3 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P0/D3 (y/n)? y
-> show /SYS/MB/P0/D3
[Expected]
 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE
Properties:
type = DIMM
ipmi_name = MB/P0/D3
fru_name = 16384MB DDR4 SDRAM DIMM
fru_manufacturer = Samsung
fru_part_number = %
fru_rev_level = 01
fru_serial_number = %
 fault_state = OK
clear_fault_action = (none)

Exadata Compute Node Not Starting after a long Period…

Well,
After a graceful reboot that took a long time, the compute node was simply not starting… What to do?
The best is:

1. Connect to ILOM Console:

Go to: Host Management –> Power control –> select Power Cycle in drop down list.

2. Connect to the ILOM server and start the SP console:
You may do it from another node, of course.

[root@grepora02 ~]# ssh root@grepora01-ilom
Password:

Oracle(R) Integrated Lights Out Manager

Version 3.2.9.23 r116695

Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.

Warning: HTTPS certificate is set to factory default.

Hostname: grepora01-ilom

-> start /SP/console
Are you sure you want to start /SP/console (y/n)? y

And if that does not work, as always, creating an SR and following up with Oracle is the best way to go…

Hope it helps!

OEM: The ILOM server is currently offline or unreachable on the network.

Hi all!
Just got an alarm from OEM with this message. How to check it?
– The first thing is to be able to connect to the ILOM from a DBNode.
– From there we can test the IPv4 and/or IPv6 interfaces through ping, as shown below.

This is also documented in: Oracle Integrated Lights Out Manager (ILOM) 3.0 HTML Documentation Collection – Test IPv4 or IPv6 Network Configuration (CLI)

In my case, it was only a false alarm, as I was able to ping the DBNodes from this ILOM:

[root@greporasrv01db01 ~]# ssh greporasrv01-ilom.grepora.com
The authenticity of host 'greporasrv01-ilom.grepora.com (10.48.18.64)' can't be established.
RSA key fingerprint is 59:c5:9f:b1:60:59:15:16:94:c8:94:88:7b:4e:52:57.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'greporasrv01-ilom.grepora.com' (RSA) to the list of known hosts.
Password: 

Oracle(R) Integrated Lights Out Manager

Version 3.2.9.23 r116695

Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.

Warning: HTTPS certificate is set to factory default.

Hostname: greporasrv01-ilom

-> show /SP/network

 /SP/network
    Targets:
        interconnect
        ipv6
        test

    Properties:
        commitpending = (Cannot show property)
        dhcp_clientid = none
        dhcp_server_ip = none
        ipaddress = 10.50.12.64
        ipdiscovery = static
        ipgateway = 10.50.12.1
        ipnetmask = 255.255.255.0
        macaddress = 00:10:E0:95:73:E6
        managementport = MGMT
        outofbandmacaddress = 00:10:E0:95:73:E6
        pendingipaddress = 10.50.12.64
        pendingipdiscovery = static
        pendingipgateway = 10.50.12.1
        pendingipnetmask = 255.255.255.0
        pendingmanagementport = MGMT
        pendingvlan_id = (none)
        sidebandmacaddress = 00:10:E0:95:73:E7
        state = ipv4-only
        vlan_id = (none)

    Commands:
        cd
        set
        show

-> cd /SP/network/test
/SP/network/test

-> show

 /SP/network/test
    Targets:

    Properties:
        ping = (Cannot show property)
        ping6 = (Cannot show property)

    Commands:
        cd
        set
        show

-> set ping=10.50.12.51       -- DBNode1
Ping of 10.50.12.51 succeeded

-> set ping=10.50.12.52       -- DBNode2
Ping of 10.50.12.52 succeeded

 

Infiniband Error: Cable is present on Port “X” but it is polling for peer port

Facing this error? Let me guess: Ports 03, 05, 06, 08, 09 and 12 are alerting? You have a Quarter Rack? Have you recently upgraded the Exadata plugin to version 12.1.0.3 or higher?
Don’t panic!

This is probably related to Bug 15937297 : EM 12C HAS ERRORS CABLE IS PRESENT ON PORT ‘N’ BUT IT IS POLLING FOR PEER PORT. The full message might be like “Cable is present on Port 6 but it is polling for peer port. This could happen when the peer port is unplugged/disabled“.

In fact, the bug was closed as not a bug. 🙂
As part of the 12.1.0.3 Exadata plugin, the IB switch ports are now checked for non-terminated cables, so these ‘polling for peer port’ errors are the expected behavior. Since the ‘polling for peer port’ check is an enhancement of the 12.1.0.3 plugin, this explains why you most likely did not see these errors until you upgraded the OMS to 12.1.0.2 and then updated the plugins.

In Quarter Racks, ports 3, 5, 6, 8, 9 and 12 are usually cabled ahead of time but not terminated. In some racks port 32 may also be unterminated. Checking the incident in OEM, you might see something like this image:

[Screenshot: OEM incident]
