ORA-19665: size % in file header does not match actual file size of %

That’s an unexpected message to get, right?
I got it together with an ORA-07445: exception encountered: core dump [kcflfi()+1016] [SIGFPE] [Integer divide by zero] [0x10047EF18] [] []

What’s next?
After some checks, I found the following (the related trace gave me the file name, which matches file_id 106):

SQL> select file_name, bytes from dba_data_files where file_id=106;

FILE_NAME                                         BYTES 
------------------------------------------------ --------------
+DATA/MYDB/DATAFILE/DATAFILE_XX.558.1015447173   14529069056

SQL> select name, bytes from v$datafile where file#= 106;

NAME                                             BYTES 
------------------------------------------------ --------------
+DATA/MYDB/DATAFILE/DATAFILE_XX.558.1015447173   14529067281

This means the data dictionary and the control file report different sizes for the same datafile.
Looking at MOS, it seems to be a match to ORA-07445: Exception Encountered (Doc ID 1958870.1).

How to resolve it?

SQL> alter database datafile 106 offline for drop;
RMAN> restore datafile 106;
RMAN> recover datafile 106;
SQL> alter database datafile 106 online;

This resolved my case, and both views now report the same size.

BE CAREFUL:

  • Make sure you have a backup before dropping the datafile.
  • Make sure you can take the datafile offline, or proceed during non-business hours.
  • Follow change procedures for Production, of course. Things may get wild.

And what if I don’t have a backup?
1. You may want to take it. It may not work, though, considering the original mismatch.
2. Export/Import logically:
– Export the data from the related tablespace (Data Pump or Legacy Export; check for limitations and datatypes) (see the sketch below).
– Drop the tablespace and recreate it.
– Import the data back.
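For reference, here is a minimal Data Pump sketch of that logical path. The directory object, dump file names, tablespace name (USERS_TBS), sizes, and credentials are hypothetical placeholders, so adjust them to your environment:

-- 1) Export the tablespace contents
expdp system DIRECTORY=DATA_PUMP_DIR DUMPFILE=users_tbs.dmp LOGFILE=users_tbs_exp.log TABLESPACES=USERS_TBS

-- 2) Recreate the tablespace (size and location are examples)
SQL> drop tablespace USERS_TBS including contents and datafiles;
SQL> create tablespace USERS_TBS datafile '+DATA' size 10G autoextend on;

-- 3) Import the data back
impdp system DIRECTORY=DATA_PUMP_DIR DUMPFILE=users_tbs.dmp LOGFILE=users_tbs_imp.log TABLESPACES=USERS_TBS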

As usual, test it in a non-production environment to validate your plan and commands.

I hope it helps!

Relying on Guaranteed Restore Points? Be careful!

Hi all,

Are you relying on Guaranteed Restore Points (GRP) as a fallback plan for your migration or upgrade strategy? Be careful!
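For context, the typical GRP fallback setup looks more or less like this minimal sketch (the restore point name BEFORE_UPGRADE is just an example):

SQL> create restore point BEFORE_UPGRADE guarantee flashback database;
SQL> select name, guarantee_flashback_database, scn, time from v$restore_point;

-- And the rollback path, if the upgrade needs to be reverted:
SQL> shutdown immediate
SQL> startup mount
SQL> flashback database to restore point BEFORE_UPGRADE;
SQL> alter database open resetlogs;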

While performing a non-Prod upgrade with the AutoUpgrade tool, after completing the upgrade, I wanted to roll it back and go through the process again. This is what happened:

SQL> startup
ORA-29702: error occurred in Cluster Group Service operation

When looking into it, I found this blog post from Mike that I had missed last year: https://mikedietrichde.com/2020/11/13/ora-29702-and-your-instance-does-not-startup-in-the-cluster-anymore/

This means my database is not starting anymore! Oh man, glad that I’m in the testing phase!

This is caused by Bug 31561819 – Incompatible maxmembers at CRSD Level Causing Database Instance Not Able to Start.

As per Mike’s post, “you don’t need to even restore or flashback a database to hit this error. A simple instance in NOMOUNT state leads to the same error. Without even any datafile.”

The bug is fixed in:

  • 19.9.0.0.201020 (Oct 2020) OCW RU
  • 18.12.0.0.201020 (Oct 2020) OCW RU
  • 12.2.0.1.201020 (Oct 2020) OCW RU

That said, you should include this patch BEFORE starting any move! Do it right away if you are on these versions!

Also, be aware of the latest change regarding restore point propagation in 19c, as per MOS note Automatic Propagate Restore Points from Primary to Standby site in 19C (Doc ID 2463082.1).

In my case, the usage is exactly for a 12.1 -> 19c upgrade. So, the fix is not even available (no Extended Support in place). Therefore, we had to think about alternative fallback plans, like a physical standby. But this is a topic for another post.

So for YOU:

  • Apply this patch if you can!
  • If not, be very careful with the fallback plans and, as usual: Test, Test and Test!

See you next post!

Exadata DNS Change – Pitfalls to be avoided

Hi all, it’s been a while but here I am!

There were some changes in the infrastructure at the place I work, and I was asked to do a DNS change on a somewhat old Exadata X5. I had never done one before, so the idea of this post is to help others who might face the issues I had.

The first thing I did was look up the documentation and review the steps. Yes, there are blogs about it, but the docs help you get at least a first glance of the situation.

Long story short: Exadata has lots of components and the new DNS should be changed on all of them.

Here is a summary of the steps.

Infiniband switches

Connect to the switches, su to ilom-admin, and change the DNS:

su - ilom-admin
show /SP/clients/dns
set /SP/clients/dns nameserver=192.168.16.1,192.168.16.2,192.168.16.3
show /SP/clients/dns

 

Database nodes

For my image, I only needed to change /etc/resolv.conf; if you have a newer one, you will need to use ipconf. That's why you need to go to the documentation first: at least there we hope they will mention the pitfalls (well, keep reading and you will see that was not my case).
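On those older images, the change is simply the nameserver entries in /etc/resolv.conf on each database node. A minimal sketch (the search domain is a placeholder, and the IPs follow the examples used in this post):

search mydomain.com
nameserver 192.168.16.1
nameserver 192.168.16.2
nameserver 192.168.16.3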

I also changed the DNS on each database node's ILOM, running ipmitool from each node:

ipmitool sunoem cli 'show /SP/clients/dns'
ipmitool sunoem cli 'set /SP/clients/dns nameserver=192.168.16.1,192.168.16.2,192.168.16.3'
ipmitool sunoem cli 'show /SP/clients/dns'


Cell nodes – Here things start to get interesting

For the storage cell there are some points that need to be taken under consideration:

Increase the ASM disk_repair_time – the goal here is to avoid a full rebalance if you finish the work within its timeframe. If you don't know this parameter: ASM will wait for up to the interval specified by DISK_REPAIR_TIME for the disk(s) to come back online. If they come back within this interval, a resync operation occurs, where only the extents modified while the disks were offline are written back. If they do not come back within this interval, ASM initiates a forced drop of the disk(s), which triggers a rebalance. You can check and adjust it as shown below.
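Here is a minimal sketch for checking and raising it, assuming a diskgroup called DATA and a window of 8.5 hours (adjust both to your environment):

-- Check the current value per diskgroup
SQL> select dg.name, a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number = a.group_number and a.name = 'disk_repair_time';

-- Raise it for the maintenance window (example value)
SQL> alter diskgroup DATA set attribute 'disk_repair_time' = '8.5h';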

On each cell node, we need to make sure all disks are OK, inactivate all grid disks, stop all cell services, and use ipconf to change the DNS configuration:

#Check that putting the grid disks offline will not cause a problem for Oracle ASM - it should all say YES on the 3rd column 
cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus,asmdeactivationoutcome

#Inactivate all grid disks on the cell - may take a while to complete
cellcli -e ALTER GRIDDISK ALL INACTIVE


#Confirm the grid disks are offline, it should show asmmodestatus=OFFLINE or asmmodestatus=UNUSED, and asmdeactivationoutcome=Yes for all grid disks
cellcli -e LIST GRIDDISK ATTRIBUTES name, asmmodestatus,asmdeactivationoutcome

#Confirm that the disks are offline
cellcli -e LIST GRIDDISK

#Shut down the cell services and ocrvottargetd service
cellcli -e ALTER CELL SHUTDOWN SERVICES ALL
service ocrvottargetd stop # on some images this service does not exist

To execute ipconf the old way, we only need to call it and follow the prompts; if you have a newer image, you will need to provide its parameters, as shown in the documentation.

The documentation says that after this we could start the cell services back up, but I would recommend validating the DNS before doing that. Why? Because mine did not work, and I could have had a bigger issue with a cell node without DNS trying to start its services.

So, how to test? Use nslookup, dig, and curl:

nslookup dns_domain.com
curl -v 192.168.16.1:53
dig another_server_in_the_network

 

My tests did not work: I was able to ping the DNS servers but not to resolve any name. I had an SR open on MOS, but it did not help much either. Since this is a production system, I checked whether the firewall was up on the Linux side, and to my surprise it was.

I tried to manually add rules to iptables, but it did not work. Then I came across this note: Exadata: New DNS server is not accessible after changing using IPCONF (Doc ID 1581417.1).

And there it was, I needed to restart the cellwall service to recreate the iptables rules.

# Restart cellwall service
service cellwall restart
service cellwall status

One final point: check whether ASM started a rebalance or not. If it did, do not bring down another cell node until the rebalance is finished, otherwise you may run into deeper issues. A quick way to check is shown below.
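For reference, a minimal sketch of bringing the cell back and checking for a rebalance (the cellcli commands mirror the inactivation steps above; the last query runs on the ASM instance and returns no rows when no rebalance/resync is running):

#Start the cell services back and reactivate the grid disks
cellcli -e ALTER CELL STARTUP SERVICES ALL
cellcli -e ALTER GRIDDISK ALL ACTIVE
cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus,asmdeactivationoutcome

#On the ASM instance, check for running operations
SQL> select * from gv$asm_operation;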

 

I hope it helps!

Elisson Almeida

Failed Logon Delay Causing Performance Issues

The other day, when I got to the office, I was called to check a database that was running slow. They had implemented a new process there and wanted to make sure it was not impacted.

When I checked, I saw this issue in OEM:

 

User SYS causing a strange wait event: Failed Logon Delay

Someone had created a process running as the user SYS, but it was not fully configured, and part of the process was trying to connect with the wrong password.

While they were looking through the configuration files and servers to see where the issue was coming from, I started my own investigation to speed things up.

First, I had to enable auditing, as it was disabled for unsuccessful login attempts:

SQL> audit session whenever not successful;

Audit succeeded.

 

Then I was able to see where the failed connections came from. I just needed to look for return code 1017 in sys.aud$, as ORA-01017 is "invalid username/password; logon denied":

col ntimestamp# for a30 heading "Timestamp"
col userid for a6 heading "Username"
col userhost for a15 heading "Machine"
col spare1 for a10 heading "OS User"
col comment$text for a80 heading "Details"

select ntimestamp#, userid, userhost, spare1, comment$text,returncode from sys.aud$ where returncode=1017 or returncode=28000;
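If you prefer a dictionary view over sys.aud$, the same check can be done against DBA_AUDIT_SESSION (a quick sketch with the same return codes):

select os_username, username, userhost, terminal, timestamp, returncode from dba_audit_session where returncode in (1017, 28000) order by timestamp;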

 

Oldie but goldie =)

Hope it helps,

Elisson Almeida

MySQL won’t start [ERROR] Found option without preceding group in config file

Hey folks,

Have you ever received a call for a MySQL on Windows that stopped working after someone did something to its my.cnf (my.ini on Windows)? Then you try to start the service from cmd and get the following error:

mysqld: [ERROR] Found option without preceding group in config file C:\ProgramData\MySQL\MySQL Server 8.0\my.ini at line 1.
mysqld: [ERROR] Fatal error in defaults handling. Program aborted!

Well, for some reason, the editor that was used (no idea which one it was) threw some stray bytes at the beginning of the file. To solve that (on Windows, at least), open the file in Notepad++ and go to Format > Convert to ANSI. Save the file and start the service again.
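Just for reference, that error means mysqld found an option line before any [group] header, usually because of stray bytes or a missing header at the top of the file. A healthy my.ini starts with a group header, roughly like this minimal sketch (paths and values are just examples):

[mysqld]
# options must come after a group header
datadir=C:/ProgramData/MySQL/MySQL Server 8.0/Data
port=3306

[client]
port=3306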

What was the weirdest thing that happened to you on a Windows Server?

MySQL Error1075 – Incorrect table definition; What’s happening?

Hey Folks,

A few months ago, I found an issue where, for some reason, someone ignored the warnings and tried to restore a backup from a different version of MySQL (or even MariaDB, IDK). As a result, half the database was running without Primary Keys. So when a system tried to update its schema, we were getting errors (like the one below) thrown in our faces.

1075 - Incorrect table definition; there can be only one auto column and it must be defined as a key

Ok, first things first: you would like to run an ALTER TABLE TABLE_NAME_HERE ADD PRIMARY KEY (ID); and see if it works. The error was being thrown because the auto-increment column didn't have any key defined on it… If you have problems with duplicated records on it, you can try the following script to solve the issue.

First, get the max id from the table (helper queries are shown right after the UPDATE), and then run the following for each duplicated id:

UPDATE TABLE_NAME_HERE JOIN (SELECT @sequence := MAX_ID_HERE ) r SET id=@sequence:=@sequence+1 where id= DUPLICATED_ID_HERE;
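A couple of helper queries for that step, finding the duplicated ids and the current max (table and column names follow the placeholders above), might look like this:

-- Which ids are duplicated
SELECT id, COUNT(*) FROM TABLE_NAME_HERE GROUP BY id HAVING COUNT(*) > 1;

-- Current max id, to use as the starting value for @sequence
SELECT MAX(id) FROM TABLE_NAME_HERE;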

WARNING

Be aware that if the rows that were duplicated were referenced as FKs in another table, you will get some headaches (well, you already have problems…)!!

19c+ DBMS_STATS.GATHER_TABLE_STATS Memory Leaking on PGA

Hi all,

So, here is a quick post, though one that wasn't so easy to map.

It happens that right after a 19c upgrade from 18c, we started facing memory leak messages related to the PGA area on a regular basis. After a while following the v$process_memory allocations, it was possible to map it to an ODI routine running DBMS_STATS.GATHER_TABLE_STATS. The PGA exceeded the limit just a few minutes after it started.
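For reference, the kind of query used to follow that growth, a sketch joining v$process_memory with the session info and ordering by the biggest consumers:

SQL> select s.sid, s.username, s.program, pm.category, round(pm.allocated/1024/1024,1) alloc_mb, round(pm.max_allocated/1024/1024,1) max_alloc_mb from v$process_memory pm, v$process p, v$session s where pm.pid = p.pid and p.addr = s.paddr order by pm.allocated desc;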

Nice, huh? So what is happening?

Long story short: Bug 30846782 : 19C+ FAST/EXCESSIVE PGA GROWTH WHEN USING DBMS_STATS.GATHER_TABLE_STATS.

According to Oracle Docs: No Workaround!

But here is the golden info I want to share after an SR:

EXECUTE IMMEDIATE 'alter session set "_fix_control"=''20424684:OFF''';

You are welcome!

Cheers!

DBCA “Recovery manager failed to restore datafiles”

Hi all,

If that’s the 5th Blog Post with the same title one open, don’t close, READ THIS ONE!

This one is different.

It’s actually another case about Oracle throwing generic errors for DBCA where 99% of times it’s the same error, so all blogs are different but the same in essence, and none resolve your problem. So, let’s go by parts:

The Error from Client:

(Screenshot: DBCA error "Recovery manager failed to restore datafiles")

 

Generic Case (if this is the first blog you open about the subject):

  • Make sure the file $ORACLE_HOME/bin/oracle has its privileges set to 6751 in both (ASM and DB) homes. It should look like this:
[oracle@PROD01 bin]$ cd /u01/app/oracle/product/19c/db/bin
[oracle@PROD01 bin]$
[oracle@PROD01 bin]$ ls -ltr oracle*
-rwsr-s--x. 1 oracle asmadmin 441253104 Aug 27 22:29 oracle
  • If you are not sure, set it accordingly:
cd $ORACLE_HOME/bin && chmod 6751 oracle
  • Not yet? Check the disks assigned to ASM privileges and groups:
kfod status=TRUE asm_diskstring='/dev/asm*' disk=ALL
  • After all this, still not working? Go for the atypical case below:

 

Atypical Case (Exception)

After some struggle and no success, I started validating everything I could. A few words before the silver bullet here:

Noticed that I assumed you have ASM? How come? Well, most likely the error happens at the point where DBCA runs an RMAN restore to create the database base files and metadata, and it fails when the write happens on ASM, since this is the most failure-prone part of the process.

By looking deeper on the installation logs I could see:

[Thread-527] [ 2020-08-27 23:50:04.942 PDT ] [RMANUtil$RMANUtilErrorListener.handleError:1386] ERROR=channel ORA_DISK_1: restoring datafile 00001 to +DATA
[Thread-527] [ 2020-08-27 23:50:04.942 PDT ] [RMANUtil$RMANUtilErrorListener.handleError:1386] ERROR=channel ORA_DISK_1: reading from backup piece /ora01/app/oracle/product/19c/db/assistants/dbca/templates/Seed_Database.dfb
[Thread-527] [ 2020-08-27 23:50:04.942 PDT ] [RMANUtil$RMANUtilErrorListener.handleError:1386] ERROR=channel ORA_DISK_1: ORA-19870: error while restoring backup piece /ora01/app/oracle/product/19c/db/assistants/dbca/templates/Seed_Database.dfb
[Thread-527] [ 2020-08-27 23:50:04.942 PDT ] [RMANUtil$RMANUtilErrorListener.handleError:1386] ERROR=ORA-19504: failed to create file "+DATA"
[Thread-527] [ 2020-08-27 23:50:04.942 PDT ] [RMANUtil$RMANUtilErrorListener.handleError:1386] ERROR=ORA-17502: ksfdcre:4 Failed to create file +DATA
[Thread-527] [ 2020-08-27 23:50:04.942 PDT ] [RMANUtil$RMANUtilErrorListener.handleError:1386] ERROR=ORA-15001: diskgroup "DATA" does not exist or is not mounted
[Thread-527] [ 2020-08-27 23:50:04.942 PDT ] [RMANUtil$RMANUtilErrorListener.handleError:1386] ERROR=ORA-01017: invalid username/password; logon denied

Bingo, so it’s a password issue?

Well, I’m creating the database and this actually matches with all the chmod 6751 thing…

What then?

Well, after a while going crazy validating passwd files and so on, I realized something about the oracle user:

[oracle@PROD01 bin]$ id -a
uid=500(oracle) gid=501(oinstall) groups=501(oinstall),10(wheel),203(dba),503(asmadmin),504(asmoper),525(madhoc) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[oracle@PROD01 bin]$ grep oracle /etc/group
wheel:x:10:oracle
asmadmin:x:503:oracle
asmoper:x:504:oracle
madhoc:x:525:oracle

Can you see that the oracle user is part of the oinstall group but doesn't really appear for it in /etc/group? The same goes for the dba group.

Well, let’s force it?

[oracle@PROD01 bin]$ sudo su -
Last login: Fri Aug 28 14:13:21 PDT 2020 on pts/3
[root@DMSDB1PA ~]# usermod -g oinstall -G oinstall,dba,asmadmin,asmoper,madhoc oracle
[root@PROD01 ~]# id oracle
uid=500(oracle) gid=501(oinstall) groups=501(oinstall),10(wheel),203(dba),503(asmadmin),504(asmoper),525(madhoc)
[root@PROD01 ~]# grep oracle /etc/group
wheel:x:10:oracle
dba:x:203:oracle
asmadmin:x:503:oracle
asmoper:x:504:oracle
oinstall:x:501:oracle
madhoc:x:525:oracle
[root@PROD01 ~]#

Well done!

Now try running DBCA again. That was a very tricky issue to find.

You know what else? At the point I was writing this, I decided to have another look and ended up finding this MOS note: "ORA-17502 /ORA-01017: invalid username/password; logon denied" While Creating 19c Database (Doc ID 2545858.1). There is even a bug for it: BUG:29821687 – ORA-17502 /ORA-01017: INVALID USERNAME/PASSWORD; LOGON DENIED WHILE CREATING 19C DATABASE

You have the workaround already though. Go champs!

Hope it helps you, cheers!

19c: Could not execute DBMS_LOGMNR.START_LOGMNR: ORA-44609: CONTINOUS_MINE is desupported for use with DBMS_LOGMNR.START_LOGMNR.

Hi all,
This is to show you that we should never trust documentation 100%, and how the yearly release cadence can put additional pressure on the documentation and cause errors…

So, I started supporting a new tool for data mining. There were no version restrictions as per their documentation, so I was more than happy about creating a PDB on my brand new 19c CDB, proudly using 19c for this new tool!

What happened?

Could not execute DBMS_LOGMNR.START_LOGMNR: ORA-44609: CONTINOUS_MINE is desupported for use with DBMS_LOGMNR.START_LOGMNR.

Doing my validation against the database:

SQL> execute dbms_logmnr.start_logmnr( options => dbms_logmnr.dict_from_online_catalog + SYS.DBMS_LOGMNR.CONTINUOUS_MINE);
BEGIN dbms_logmnr.start_logmnr( options => dbms_logmnr.dict_from_online_catalog + SYS.DBMS_LOGMNR.CONTINUOUS_MINE); END;

*
ERROR at line 1:
ORA-44609: CONTINOUS_MINE is desupported for use with DBMS_LOGMNR.START_LOGMNR.

Why is this happening?

This happens because the CONTINUOUS_MINE feature has been deprecated since 12.2 and is desupported/unavailable from 19c onwards.

It seems this log mining tool's devs don't check what changes in new database versions before confirming compatibility, right?

And they don’t do it or a while, as this as announced as deprecated on 12.2…

Which makes me wonder how confident they are about the tool running at their other clients… LOL

Anyway, are you facing the same? Here is some reference documentation to answer your boss with:

From: 19.1 Announcement – ORA-44609: CONTINOUS_MINE is Desupported For Use With DBMS_LOGMNR.START_LOGMNR (Doc ID 2456684.1):

  • CONTINUOUS_MINE was deprecated in Oracle Database 12c release 2 (12.2) and starting with 19.1 is desupported. There is no replacement functionality.
  • The continuous_mine option for the dbms_logmnr.start_logmnr package is desupported in Oracle Database 19c (19.1), and is no longer available.

The real reason behind it: nothing populates V$LOGMNR_LOGS automatically anymore, so the ORA-01291 occurs.

Here is a quick test case for you, from the 19c Oracle Database Utilities guide, "Mining without specifying the list of redo log files" (19c – 22. Using LogMiner to Analyze Redo Log Files).

I just put it all together to show you how it does [not] work:

  • Setting up it all:
ALTER DATABASE add SUPPLEMENTAL LOG DATA;
SELECT SUPPLEMENTAL_LOG_DATA_MIN FROM V$DATABASE;
EXECUTE DBMS_LOGMNR_D.BUILD (OPTIONS=>DBMS_LOGMNR_D.STORE_IN_REDO_LOGS);
@?/rdbms/admin/dbmslm.sql
alter system switch logfile;
alter system switch logfile;
alter system switch logfile;
create user BLABLA identified by BLABLA default tablespace users quota unlimited on users profile default;
grant connect, resource to BLABLA;
  • Doing some stuff to generate logs:
connect BLABLA/BLABLA
alter session set nls_date_format='dd-mon-yyyy hh24:mi:ss';
set time on
CREATE TABLE TEST_NULLS (COLUMNA1 NUMBER(3,0));
ALTER TABLE TEST_NULLS ADD (COLUMNA2 NUMBER(3) DEFAULT 0 NOT NULL);
insert into TEST_NULLS(columna1) values (4);
commit;
select * from TEST_NULLS;
update TEST_NULLS set columna2=221 where columna1=4;
commit;
select * from TEST_NULLS;
  • Gathering info or mining:
connect / as sysdba;
set echo on
set serveroutput on
alter session set nls_date_format='dd-mon-yyyy hh24:mi:ss';
set linesize 254
set pagesize 3000
column name format a40;
SELECT FILENAME name FROM V$LOGMNR_LOGS;
SELECT NAME, FIRST_TIME FROM V$ARCHIVED_LOG;
SELECT NAME, FIRST_TIME FROM V$ARCHIVED_LOG WHERE SEQUENCE# = (SELECT MAX(SEQUENCE#) FROM V$ARCHIVED_LOG WHERE DICTIONARY_BEGIN = 'YES');
  • Enter the FIRST_TIME returned by the previous query as the start time:
EXEC DBMS_LOGMNR.START_LOGMNR(STARTTIME =>'&1',ENDTIME => SYSDATE,OPTIONS => DBMS_LOGMNR.DICT_FROM_REDO_LOGS + DBMS_LOGMNR.COMMITTED_DATA_ONLY + DBMS_LOGMNR.PRINT_PRETTY_SQL);
EXEC DBMS_LOGMNR.START_LOGMNR(STARTTIME =>'&1',ENDTIME => SYSDATE,OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG + DBMS_LOGMNR.COMMITTED_DATA_ONLY + DBMS_LOGMNR.PRINT_PRETTY_SQL);
select FIRST_CHANGE#,NEXT_CHANGE# from V$archived_log;
SELECT CHECKPOINT_CHANGE#, CURRENT_SCN FROM V$DATABASE;
  • Use the CHECKPOINT_CHANGE# as STARTSCN and the CURRENT_SCN as ENDSCN:
EXEC DBMS_LOGMNR.START_LOGMNR(STARTSCN =>&1,ENDSCN => &2,OPTIONS => DBMS_LOGMNR.DICT_FROM_REDO_LOGS + DBMS_LOGMNR.COMMITTED_DATA_ONLY + DBMS_LOGMNR.PRINT_PRETTY_SQL);
EXEC DBMS_LOGMNR.START_LOGMNR(STARTSCN =>&1,ENDSCN => &2,OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG + DBMS_LOGMNR.COMMITTED_DATA_ONLY + DBMS_LOGMNR.PRINT_PRETTY_SQL);
EXEC DBMS_LOGMNR.START_LOGMNR(STARTTIME =>SYSDATE, ENDTIME => SYSDATE +5/24,OPTIONS => DBMS_LOGMNR.DICT_FROM_REDO_LOGS + DBMS_LOGMNR.COMMITTED_DATA_ONLY + DBMS_LOGMNR.PRINT_PRETTY_SQL);
EXEC DBMS_LOGMNR.START_LOGMNR(STARTTIME =>SYSDATE ,ENDTIME => SYSDATE +5/24 ,OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG + DBMS_LOGMNR.COMMITTED_DATA_ONLY + DBMS_LOGMNR.PRINT_PRETTY_SQL);

Ref: 19c – 22. Using LogMiner to Analyze Redo Log Files

  • For all of these, you can expect:
ERROR at line 1:
ORA-01291: missing log file
ORA-06512: at "SYS.DBMS_LOGMNR", line 72
ORA-06512: at line 1

Hmmmm… So, the 19c Documentation is not working? Precisely.

As per (Doc ID 2613490.1), this will be fixed in 20.1 documentation.

  • Sections 22.13.2 Examples of Mining without specifying the list of redo log files explicitly and child example topics will be removed.
  • Section 22.4.2. Automatic Redo log Files options will be changed to Specifying Redo Log Files for Data Mining.
  • Section 22.7.4 The next sentence will be removed too.

In summary, any reference to automatic mining in the documentation will be removed, as this feature is not supported in 19.1 and higher.

Ok, but this doesn’t solve my problem. What should I do with the client tool?

So, CONTINUOUS_MINE was the only method of starting LogMiner without first adding logfiles using DBMS_LOGMNR.ADD_LOGFILE.

What can you do to work around it?

  1. Add a logfile manually with DBMS_LOGMNR.ADD_LOGFILE for each logfile.
  2. Remove SYS.DBMS_LOGMNR.CONTINUOUS_MINE from the code.
    1. For this, you'll need to specify the logfiles explicitly, so I guess some additional coding will be needed on your side… A minimal sketch of this approach is right below.

I hope it helps!
Matheus.

Exadata: Cell Server Crashing on ORA-00600: [LinuxBlockIO::reap]

Hi all,

So I started facing this in a client environment. Here is the alert message:

Target name=db01cel08.xxx.com
Message=ORA-00600: internal error code, arguments: [LinuxBlockIO::reap], [0x60000D502388], [], [], [], [], [], [], [], [], [], []
Event reported time=Dec 19, 2019 2:14:16 AM EDT

When checking on the cellserver I see this message:

[root@db01 ~]# ssh db01cel08
Last login: Thu Dec 19 04:45:13 2019 from db01.xxx.com
[root@db01cel08 ~]# cellcli
CellCLI: Release 12.1.2.3.5 - Production on Fri Dec 19 17:13:31 EDT 2019

Copyright (c) 2007, 2016, Oracle. All rights reserved.

CellCLI> LIST ALERTHISTORY detail

[...]

name: 10
alertDescription: "ORA-07445: exception encountered: core dump [__intel_new_memset()+62] [11] [0x000000000] [] [] []"
alertMessage: "ORA-07445: exception encountered: core dump [__intel_new_memset()+62] [11] [0x000000000] [] [] []"
alertSequenceID: 10
alertShortName: ADR
alertType: Stateless
beginTime: 2019-12-19T02:00:04-04:00
endTime:
examinedBy:
notificationState: 1
sequenceBeginTime: 2019-12-19T02:00:04-04:00
severity: critical
alertAction: "Errors in file /opt/oracle/cell/log/diag/asm/cell/SYS_112331_170406/trace/cellofltrc_19796_53.trc (incident=25). Diagnostic package is attached. It is also accessible at https://db01cel08.xxx.com/diagpack/download?name=db01cel08_2019_12_19T02_00_04_10.tar.bz2 It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at https://db01cel08.xxx.com/diagpack"

name: 11
alertDescription: "ORA-00600: internal error code, arguments: [LinuxBlockIO::reap], [0x60000D502388], [], [], [], [], [], [], [], [], [], []"
alertMessage: "ORA-00600: internal error code, arguments: [LinuxBlockIO::reap], [0x60000D502388], [], [], [], [], [], [], [], [], [], []"
alertSequenceID: 11
alertShortName: ADR
alertType: Stateless
beginTime: 2019-12-19T02:00:04-04:00
endTime:
examinedBy:
notificationState: 1
sequenceBeginTime: 2019-12-19T02:00:04-04:00
severity: critical
alertAction: "Errors in file /opt/oracle/cell/log/diag/asm/cell/db01cel08/trace/svtrc_9174_12.trc (incident=25). Diagnostic package is attached. It is also accessible at https://db01cel08.xxxx.com/diagpack/download?name=jdb01cel08_2019_12_19T02_00_04_11.tar.bz2 It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at https://db01cel08.xxx.com/diagpack"

name: 12_1
alertDescription: "A SQL PLAN quarantine has been added"
alertMessage: "A SQL PLAN quarantine has been added. As a result, Smart Scan is disabled for SQL statements with the quarantined SQL plan. Quarantine id : 21 Quarantine type : SQL PLAN Quarantine reason : Crash Quarantine Plan : SYSTEM Quarantine Mode : FULL_Quarantine DB Unique Name : XPTODB Incident id : 25 SQLID : 8j0az9sgxs5yh SQL Plan details : {SQL_PLAN_HASH_VALUE=281152830, PLAN_LINE_ID=9} In addition, the following disk region has been quarantined, and Smart Scan will be disabled for this region: Disk Region : {Grid Disk Name=Unknown, offset=186750337024, size=1M} "
alertSequenceID: 12
alertShortName: Software
alertType: Stateful
beginTime: 2019-12-19T02:00:12-04:00
examinedBy:
metricObjectName: QUARANTINE/21
notificationState: 1
sequenceBeginTime: 2019-12-19T02:00:12-04:00
severity: critical
alertAction: "A SQL statement caused the Cell Server (CELLSRV) service on the cell to crash. A SQL PLAN quarantine has been created to prevent the same SQL statement from causing the same cell to crash. When possible, disable offload for the SQL statement or apply the RDBMS patch that fixes the crash, then remove the quarantine with the following CellCLI command: CellCLI> drop quarantine 21 All quarantines are automatically removed when a cell is patched or upgraded. For information about how to disable offload for the SQL statement, refer to the section about 'SQL Processing Offload' in Oracle Exadata Storage Server User's Guide. Diagnostic package is attached. It is also accessible at https://db01cel08.xxx.com/diagpack/download?name=db01cel08_2019_12_19T02_00_12_12_1.tar.bz2 It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at https://db01cel08.xxx.com/diagpack"

CellCLI>

After some research, we could match the situation to Bug 13245134 – Query may fail with errors ORA-27618, ORA-27603, ORA-27626 or ORA-00600[linuxblockio::reap_1] or ora-600 [cacheput::process_1]

It’s also described as per: Exadata/SuperCluster: 11.2 databases missing fix for the bug 13245134 may lead to cell service crash with ora-600 [linuxblockio::reap_1]/ora-600 [cacheput::process_1] or ORA-27626: Exadata error: 242/Smart scan issues on the RDBMS side

In order to resolve the crashes quickly, I applied the patch online with:
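(The exact command didn't make it into this post, but an online OPatch apply for this fix looks roughly like the sketch below. The patch staging directory, instance name, SYS password, and node name are hypothetical placeholders:)

$ cd /tmp/patches/13245134
$ $ORACLE_HOME/OPatch/opatch apply online -connectString xptodb1:sys:<sys_password>:db01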

After applying, all got solved:

[oracle@db01 ~]$ /oracle/xptodb/product/11.2.0.4/OPatch/opatch lsinventory -OH $ORACLE_HOME | grep 13245134
Patch (online) 13245134: applied on Thu Dec 19 23:34:50 EST 2019
13245134
[oracle@db01 ~]$

Hope it helps!