Exadata DNS Change – Pitfalls to be avoided

Hi all, it’s been a while but here I am!

There were some changes in the infrastructure at the place I work and I was asked to do a DNS change on a bit old Exadata X5. I had never done one before this, so the idea of this post is to help others who might face the issues I had.

The first thing I did was to look up the documentation about it and see the steps, yes there are blogs about it but the doc can help to get at least the first glance of the situation.

Long story short: Exadata has lots of components and the new DNS should be changed on all of them.

Here is a summary of the steps.

Infiniband switches

Connect to the switches and sudo to ilom-admin and change the DNS

su - ilom-admin
show /SP/clients/dns
set /SP/clients/dns nameserver=192.168.16.1,192.168.16.2,192.168.16.3
show /SP/clients/dns

 

Database nodes

For my image I only needed to change the /etc/resolv.conf, if you have a newer one you will need to user ipconf – That´s why you need to go to the documentation, at least there we hope that they will put some mentions on the pitfalls (well keep reading and you will see that was not my case)

Also changed the DNS on wach database node ilom, runing the ipmtool from the each node

ipmitool sunoem cli 'show /SP/clients/dns'
ipmitool sunoem cli 'set /SP/clients/dns nameserver=192.168.16.1,192.168.16.2,192.168.16.3'
ipmitool sunoem cli 'show /SP/clients/dns'


Cell nodes – Here things start to get interesting

For the storage cell there are some points that need to be taken under consideration:

Increase the ASM disk_repair_time – the goal here is to avoid a full rebalance if you do this within its timeframe, if you don’t know this parameter,  ASM will wait for up to the interval specified for DISK_REPAIR_TIME for the disk(s) to come online. If the disk(s) come back online within this interval, a resync operation will occur, where only the extents that were modified while the disks were offline are written to the disks once back online. If the disk(s) do not come back within this interval, ASM will initiate a forced drop of the disk(s), which will trigger a rebalance.

On each cell node we need to make sure all disks are OK, stop all cell disks, stop all cell services and user ipconfig to change the DNS configuration

#Check that putting the grid disks offline will not cause a problem for Oracle ASM - it should all say YES on the 3rd column 
cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus,asmdeactivationoutcome

#Inactivate all grid disks on the cell - may take a while to complete
cellcli -e ALTER GRIDDISK ALL INACTIVE


#Confirm the grid disks are offline, it should show asmmodestatus=OFFLINE or asmmodestatus=UNUSED, and asmdeactivationoutcome=Yes for all grid disks
cellcli -e LIST GRIDDISK ATTRIBUTES name, asmmodestatus,asmdeactivationoutcome

#Confirm that the disks are offline
cellcli -e LIST GRIDDISK

#Shut down the cell services and ocrvottargetd service
cellcli -e ALTER CELL SHUTDOWN SERVICES ALL
service ocrvottargetd stop #on some images this services does not exists

To execute the ipconf on the old way we only need to call it can follow the prompts, but if you have a newer image you will need to provide its parameters as is shown in the documentation.

The documentation says that after it we could start the cell services back up but I would recommend validating the DNS prior to doing that, why is that you might say because mine did not work and I could have a bigger issue with a cell node without DNS trying to start the services.

So, how to test, use nslookup, dig and curl

nslookup dns_domain.com
curl -v 192.168.16.1:53
dig another_server_in_the_network

 

My tests did not work, I was able to ping the DNS servers but not to resolve any name, I had an SR on MOS but did not help much either, looking up as this is a production system I tried to see if the firewall was up on the Linux site, and to my surprise it was.

I tried to manually add rules to iptables but it did not work and then I came across this note Exadata: New DNS server is not accessible after changing using IPCONF (Doc ID 1581417.1)

And there it was, I needed to restart the cellwall service to recreate the iptables rules.

# Restart cellwall service
service cellwall restart
service cellwall status

One final point, check if ASM started the rebalance or not, if it did, do not start to bring down another cell node until the rebalance is finish, otherwise you may run into deeper issues.

 

I hope it helps!

Elisson Almeida

ORA-07445 [kxsPurgeCursor()]

Hi all,
We started getting lots of ORA-07445 errors that are being reported in the Exadata production database from a client.
Checking on the traces we can see that someone is running the below SQL and that is triggering these errors:
declare
i number := 0;
begin
for c1 in (
select address, hash_value, sql_id from v$sql where
last_active_time < sysdate - 1/24
and executions < 2 ) loop
begin
dbms_shared_pool.purge(c1.address || ' ' || to_char(c1.hash_value), 'C');
i := i + 1;
exception when others then
dbms_output.put_line(c1.sql_id);
end;
end loop;
dbms_output.put_line('Cursors purged = ' || to_char(i));
end;
/
I’m not even going to discuss the PL above, it’s purging all new SQLs executed less than 2 times in the last 24hours. I’m not really agreeing with this…
What matters is: in the end, it was being caused because hash value of non-cursor is being passed to dbms_shared_pool.purge().  This is a match to Bug 29281112 – ORA-7445: [kxspurgecursor()+517] [sigsegv] (Doc ID 29281112.8)
There is no actual workaround besides stop passing a non-cursor hash to dbms_shared_pool.purge, however we have a fix for a better addressing of those cases on:

Quick Reference: Oracle Exadata default passwords

It’s not needed to say how important this is to have all the password reset, even though on new Exa hardware the SSH is usually disabled. How ever, oftenly we need to have access in some of the consoles knowing the password was never changed but unaware on what it should be.

For those cases (until you change it, ASAP), here is a quick reference:

Database Server:

  • root/welcome1
  • oracle/welcome1
  • grid/welcome1
  • grub/sos1Exadata

Exadata Storage Servers:

  • root/welcome1
  • celladmin/welcome1
  • cellmonitor/welcome1

InfiniBand switches:

  • root/welcome1
  • nm2user/changeme

Ethernet switches:

  • admin/welcome1

Power distribution units (PDUs):

  • admin/welcome1
  • root/welcome1

Database server ILOMs:

  • root/welcome1

Exadata Storage Server ILOMs:

  • root/welcome1

InfiniBand ILOMs:

  • ilom-admin/ilom-admin
  • ilom-operator/ilom-operator

Keyboard, video, mouse (KVM):

  • admin/welcome1

Change them!

I hope this was useful for you.

ODA Useful Commands to Manage VMs

1. To restart the vm:

oakcli stop vm [vm-name]
oakcli stop vm [vm-name] -force
oakcli start vm [vm-name]

2. To show All vm status

oakcli show vm

Example:

root@server oak]# oakcli show vm

NAME NODENUM MEMORY VCPU STATE REPOSITORY
vm1 1 8192 4 ONLINE fileshare

3. Listing Configured Options for a VM Template

oakcli show vmtemplate ol6linux_64

4. Adding a Network to the VM Template

oakcli modify vmtemplate ol6linux_64 -addnetwork net1

5. Configure CPU, Memory on the Template

oakcli configure vmtemplate ol6linux_64 -vcpu 4 -maxvcpu 8 -cpucap 10 -memory 3000M -maxmemory 6G -os OTHER_LINUX

6. Create a VM by Cloning from Template

oakcli clone vm ol6test -vmtemplate ol6linux_64 -repo repo1 -node oda2

7. Override VM Template Values

oakcli configure vm ol6test -vcpu 6 -memory 4G

8. Configure High Availability and Failover Values

oakcli configure vm ol6test -prefnode oda2 -failover oda1

9. Start a VM

oakcli start vm ol6test

10. Access VM Console for a VM:

oakcli show vmconsole [vm-name]

Hope it helps you!

General ILOM Faults Management

Hi all,
So just a quick reference: Some useful general commands for ILOM:

1. Check for Faults:

/home/boesing> ssh root@[ilom ip]    

Oracle(R) Integrated Lights Out Manager

Version 3.1.2.10 r74387

Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.

-> cd /SP/faultmgmt
/SP/faultmgmt

-> start shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> 
faultmgmtsp> fmadm 
Usage: fmadm 
  where  is one of the following:
    faulty [-asv] [-u ]   : display list of faulty resources
    faulty -f                   : display faulty on FRUs
    acquit                 : acquit faults on a FRU
    acquit                : acquit faults associated with UUID
    acquit           : acquit faults specified by
                                  (FRU, UUID) combination
    replaced               : replaced faults on a FRU
    repaired               : repaired faults on a FRU
    repair                 : repair faults on a FRU
    rotate errlog               : rotate error log
    rotate fltlog               : rotate fault log

faultmgmtsp> 
faultmgmtsp> fmadm faulty
No faults found

– Another way to see current issues:

show /SP/logs/event/list show faulty

2. Clearing Faults:
In case there is a failure that can be ignored (for example, lost of AC power), it may be cleared:

set /SYS/PSU1 clear_fault_action=true

3. Checking Additional Logs:

start /SYS 
->y 
ls /SYS 
start /SP/console 
-> y
show /SP/logs/event/list 

Some MOS notes for reference:

– Diagnostic information for ILOM, ILO , LO100 issues (Doc ID 1062544.1)
– How to run an ILOM Snapshot on a Sun/Oracle X86 System (Doc ID 1448069.1)
– PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1)

Clear Exadata Component Messages After Maintenance

Hi all,

Quick one today: So you completed a maintenance in a component (a memory component, as per example below) but keep receiving messages of failure?

Well, try clearing all the error messages after complete the maintenance and lets check if the threshold is reached again. If so, we may need to really replace it.

How to do it? Easy:

ssh root@grepora01-ilom
-> show /SYS/MB/P0/D3
Expected:
[...]
fault_state = Faulted
[..]
-> set /SYS/MB/P0/D3 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P0/D3 (y/n)? y
-> show /SYS/MB/P0/D3
[Expected]
 /SYS/MB/P0/D3
    Targets:
        PRSNT
        SERVICE
Properties:
type = DIMM
ipmi_name = MB/P0/D3
fru_name = 16384MB DDR4 SDRAM DIMM
fru_manufacturer = Samsung
fru_part_number = %
fru_rev_level = 01
fru_serial_number = %
 fault_state = OK
clear_fault_action = (none)

Workshop: Best Practices for Oracle DB on Exadata: RUN, It’s Free!

Hey, you in Porto Alegre/Brazil!

Don’t miss this opportunity to have an official Oracle Workshop about Exadata Best Practices for Oracle Database!

This is being being promoted by GUORS in partnership with Oracle Brasil.

Draft - Convite (1)

When? June 26th from 2PM to 6PM BRT.
Where?  TecnoPUC-RS, room 206 – building 99A
Instructor: Valter Rodrigues da Oracle Brasil.

LIMITED POSITIONS: 30!


Requirements
:

1. Like GUORS Page on Facebook AND LinkedIn
2. Like Oracle University on LinkedIn
3. Like GUOB on LinkedIn
4. Bring your notebook.
5. Basic knowledge on RAC

Subscribe ASAP!

Exadata: Generate a Sundiag

Hello all!

Today’s post is a very simple one, once seems sometimes a simple and success case post is appreciated. So, how to generate a Sundiag Report for Oracle?

First, what is a Sundiag?
It consists in an Oracle hardware healthcheck list to be run on your environment. It doesn’t give us any report, but is usually requested by Oracle on SRs related to network, disk I/O or any other possible hardware/firmware related issue.

So, how to do it? Simple like this (the path is always the same):

[root@greporacel01 ~]# cd /opt/oracle.SupportTools/
[root@greporacel01 oracle.SupportTools]# /opt/oracle.SupportTools/sundiag.sh

Oracle Exadata Database Machine - Diagnostics Collection Tool

Gathering Linux information

Skipping ILOM collection. Use the ilom or snapshot options, or login to ILOM
over the network and run Snapshot separately if necessary.

/tmp/sundiag_greporacel01_1108FMM0FF_2017_01_01_02_17
Gathering Cell information

Generating diagnostics tarball and removing temp directory

====================================================================
Done. The report files are bzip2 compressed in /tmp/sundiag_greporacel01_1108FMM0FF_2017_01_01_02_17.tar.bz2
====================================================================
[root@greporacel01 oracle.SupportTools]#

Now you just need to pick this generated file and add to your SR. Simple right?

Hope it helps!

Reduce Exadata Core Count

Ok, so I was preparing for a DC services migration with a client and this would involve resizing the CPU count of Exadatas for better attending those services. This way, one of the steps will require reduce CPU counts in one of the sites to be aligned with the license terms.

Checking for the steps to accomplish that, I found references to change CPU and core count, but always described in the case of increasing allocation. As per 2.7 Increasing the Number of Active Cores on Database Servers. But not so much about reducing, as this seems to be unusual…

Also considering that the planned change would be within the minimum number requirement: 2.1 Restrictions for Capacity-On-Demand on Oracle Exadata Database Machine.

Reviewing on MOS, we found the When Attempting to Change the Number of Cores, Errors With: DBM-10004 – Decreasing the Number of Active Cores is not Supported ( Doc ID 2177634.1 ), pointing to use the clause “FORCE” on “ALTER DBSERVER pendingCoreCount =x” command.

And this worked. I just disabled the iaasMode to play safe. Have a look:

[root@grepora01~]# dbmcli
DBMCLI: Release  - Production on Mon Jan 05 01:10:12 EEST 2019
Copyright (c) 2007, 2014, Oracle.  All rights reserved.
DBMCLI> LIST DBSERVER attributes coreCount
	 36/44
DBMCLI> ALTER DBSERVER pendingCoreCount = 24 force
DBM-10022: At least 26 physical cores need to be active in order to support IaaS.
DBMCLI> ALTER DBSERVER iaasMode = "off"
DBServer exadb01 successfully altered
DBMCLI> ALTER DBSERVER pendingCoreCount = 24 force
DBServer grepora01 successfully altered. Please reboot the system to make the new pendingCoreCount effective.
DBMCLI> LIST DBSERVER attributes pendingCoreCount
24/44

–> Restart the server
After restarting, it should look like:

DBMCLI> LIST DBSERVER attributes coreCount
	 24/44
DBMCLI> LIST DBSERVER attributes pendingCoreCount

Hope this helps!