Exadata: Cell Server Crashing on ORA-00600: [LinuxBlockIO::reap]

Hi all,

So I started facing this in a client environment. Here is the alert message:

Target name=db01cel08.xxx.com
Message=ORA-00600: internal error code, arguments: [LinuxBlockIO::reap], [0x60000D502388], [], [], [], [], [], [], [], [], [], []
Event reported time=Dec 19, 2019 2:14:16 AM EDT

When checking on the cellserver I see this message:

[root@db01 ~]# ssh db01cel08
Last login: Thu Dec 19 04:45:13 2019 from db01.xxx.com
[root@db01cel08 ~]# cellcli
CellCLI: Release 12.1.2.3.5 - Production on Fri Dec 19 17:13:31 EDT 2019

Copyright (c) 2007, 2016, Oracle. All rights reserved.

CellCLI> LIST ALERTHISTORY detail

[...]

name: 10
alertDescription: "ORA-07445: exception encountered: core dump [__intel_new_memset()+62] [11] [0x000000000] [] [] []"
alertMessage: "ORA-07445: exception encountered: core dump [__intel_new_memset()+62] [11] [0x000000000] [] [] []"
alertSequenceID: 10
alertShortName: ADR
alertType: Stateless
beginTime: 2019-12-19T02:00:04-04:00
endTime:
examinedBy:
notificationState: 1
sequenceBeginTime: 2019-12-19T02:00:04-04:00
severity: critical
alertAction: "Errors in file /opt/oracle/cell/log/diag/asm/cell/SYS_112331_170406/trace/cellofltrc_19796_53.trc (incident=25). Diagnostic package is attached. It is also accessible at https://db01cel08.xxx.com/diagpack/download?name=db01cel08_2019_12_19T02_00_04_10.tar.bz2 It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at https://db01cel08.xxx.com/diagpack"

name: 11
alertDescription: "ORA-00600: internal error code, arguments: [LinuxBlockIO::reap], [0x60000D502388], [], [], [], [], [], [], [], [], [], []"
alertMessage: "ORA-00600: internal error code, arguments: [LinuxBlockIO::reap], [0x60000D502388], [], [], [], [], [], [], [], [], [], []"
alertSequenceID: 11
alertShortName: ADR
alertType: Stateless
beginTime: 2019-12-19T02:00:04-04:00
endTime:
examinedBy:
notificationState: 1
sequenceBeginTime: 2019-12-19T02:00:04-04:00
severity: critical
alertAction: "Errors in file /opt/oracle/cell/log/diag/asm/cell/db01cel08/trace/svtrc_9174_12.trc (incident=25). Diagnostic package is attached. It is also accessible at https://db01cel08.xxxx.com/diagpack/download?name=jdb01cel08_2019_12_19T02_00_04_11.tar.bz2 It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at https://db01cel08.xxx.com/diagpack"

name: 12_1
alertDescription: "A SQL PLAN quarantine has been added"
alertMessage: "A SQL PLAN quarantine has been added. As a result, Smart Scan is disabled for SQL statements with the quarantined SQL plan. Quarantine id : 21 Quarantine type : SQL PLAN Quarantine reason : Crash Quarantine Plan : SYSTEM Quarantine Mode : FULL_Quarantine DB Unique Name : XPTODB Incident id : 25 SQLID : 8j0az9sgxs5yh SQL Plan details : {SQL_PLAN_HASH_VALUE=281152830, PLAN_LINE_ID=9} In addition, the following disk region has been quarantined, and Smart Scan will be disabled for this region: Disk Region : {Grid Disk Name=Unknown, offset=186750337024, size=1M} "
alertSequenceID: 12
alertShortName: Software
alertType: Stateful
beginTime: 2019-12-19T02:00:12-04:00
examinedBy:
metricObjectName: QUARANTINE/21
notificationState: 1
sequenceBeginTime: 2019-12-19T02:00:12-04:00
severity: critical
alertAction: "A SQL statement caused the Cell Server (CELLSRV) service on the cell to crash. A SQL PLAN quarantine has been created to prevent the same SQL statement from causing the same cell to crash. When possible, disable offload for the SQL statement or apply the RDBMS patch that fixes the crash, then remove the quarantine with the following CellCLI command: CellCLI> drop quarantine 21 All quarantines are automatically removed when a cell is patched or upgraded. For information about how to disable offload for the SQL statement, refer to the section about 'SQL Processing Offload' in Oracle Exadata Storage Server User's Guide. Diagnostic package is attached. It is also accessible at https://db01cel08.xxx.com/diagpack/download?name=db01cel08_2019_12_19T02_00_12_12_1.tar.bz2 It will be retained on the storage server for 7 days. If the diagnostic package has expired, then it can be re-created at https://db01cel08.xxx.com/diagpack"

CellCLI>

After some research, we could match the situation to Bug 13245134 – Query may fail with errors ORA-27618, ORA-27603, ORA-27626 or ORA-00600[linuxblockio::reap_1] or ora-600 [cacheput::process_1]

It’s also described as per: Exadata/SuperCluster: 11.2 databases missing fix for the bug 13245134 may lead to cell service crash with ora-600 [linuxblockio::reap_1]/ora-600 [cacheput::process_1] or ORA-27626: Exadata error: 242/Smart scan issues on the RDBMS side

In order to resolve the crashes quickly, I applied the patch online with:

After applying, all got solved:

[oracle@db01 ~]$ /oracle/xptodb/product/11.2.0.4/OPatch/opatch lsinventory -OH $ORACLE_HOME | grep 13245134
Patch (online) 13245134: applied on Thu Dec 19 23:34:50 EST 2019
13245134
[oracle@db01 ~]$

Hope it helps!

Automatic Killing Inactive Sessions with Resource Manager

Hello All!

So, your are manually (or via script) killing idle sessions on your database?

As consequence, your users are getting error “ORA-00028: your session has been killed.” and getting angry on you?

What about doing it automatically in a much more graceful way and be seen as a nicer DBA? You can do it using Resource Manager.

Are you already using Resource Manager on your database? Yes – Great!

No – Shame on you. Read this and put RM in place ASAP. This is a great tool for the database to manage the database resources, plus no additional licensing is needed. So go for it!

Ok, but what is the trick?
It’s the limit max_idle_time. You can use it either on existing groups or subgroups of your plan (or subplan) or switch to “KILL” groups with this you can even use the same criteria you’d use for any script to perform this action. For this you might user the parameter new_switch_group and have a created a different group only for those kills.

Have this option it’s nicer, if you want to avoid killing sessions on database, by the way, as you can always switch a session for the killing group manually, not demanding it to fill the requirements to automatic switching.

How? Quick example:

Creating new Plan Groups with MAX_IDLE_TIME:

begin
dbms_resource_manager.create_plan_directive( plan => '&RM_PLAN', group_or_subplan =>
'LONG_RUN', comment => 'Limit idle time to 5 minutes', max_idle_time => 300);
dbms_resource_manager.create_plan_directive( plan => '&RM_PLAN', group_or_subplan =>
'SHORT_RUN', comment => 'Limit idle time to 1 minute', max_idle_time => 60);
end;
/

Cool!

And what would be the error for the user that get’s the session killed?

ORA-3113 End of file on communication channel

Much nicer: Now you are a nice DBA and don’t kill sessions anymore.
You automated it!

Some additional recommendations:

  • Use this solution for Databases above 11.2.0.4 or 12.1.0.2, due some known bugs:
    • Bug 9523768 – IDLE SESSIONS AREN’T ACTUALLY KILLED IMMEDIATELY (affecting 11.2.0.1 to 11.2.0.3)
    • Bug 13837378 – ALTER SYSTEM KILL SESSION IMMEDIATE DOES NOT KILL QC SESSIONS (11.2.0.1 to 12.1.0.1)
    • Bug 8891495 – NON-IDLE SESSIONS ARE KILLED (11.2.0.1 only)
  • Some MOS references on it:
    • How To Automatic Kill Inactive Sessions using Resource Manager (Doc ID 1935739.1)
    • Using Resource Manager to Detect and Kill Idle Sessions (Doc ID 1557657.1)

Hope it helps!

Advise: Setting underScore/Hidden Parameters

Please be aware no hidden parameters (parameters starting with “_”) should not be set unless recommended by Oracle explicitly or in case of a perfect match to the case.

This is explained on MOS Best Practices and Guidelines for Setting underScore/Hidden Parameter to Workaround and/or Resolve Customers Problems (Doc ID 2524119.1)

So, in case of any diagnostics you have to lead to setting it, SUGGESTED:

  • Get the explicit approval/agreement from Oracle.
  • Include evidence on the root cause for this setting. A comment should work, something like:
alter system set "_replace_virtual_columns"=false Comment= 'Workaround Doc ID 1512347.1' SID= '*' scope= both;

Cheers!

ORA-20001: Statistics Advisor: Invalid task name for the current user

Hi all,
So, I start seeing on my alert log:

2019-12-26T00:42:03.543268-05:00
Errors in file /u01/app/oracle/diag/rdbms/cdb1/CDB1/trace/CDB1_j000_89793.trc:
ORA-12012: error on auto execute of job "SYS"."ORA$AT_OS_OPT_SY_36807"
ORA-20001: Statistics Advisor: Invalid task name for the current user
ORA-06512: at "SYS.DBMS_STATS", line 47207
ORA-06512: at "SYS.DBMS_STATS_ADVISOR", line 882
ORA-06512: at "SYS.DBMS_STATS_INTERNAL", line 20059
ORA-06512: at "SYS.DBMS_STATS_INTERNAL", line 22201
ORA-06512: at "SYS.DBMS_STATS", line 47197

After some research I found Bug 22879263 – BETA 12.2 – ORA-12012 , ORA-20001 AND ORA-06512 FROM ORA$AT_OS_OPT_SY_88 JOB.

This was suppose to be fixed on 12.2.0.1, however I’m still getting. Anyway, I executed the below and it got fixed:

EXEC dbms_stats.init_package();

Ref: Recurring ORA-12012, ORA-20001, ORA-06512 In Container Database (Doc ID 2420581.1)

Hope it helps!

ODA Useful Commands to Manage VMs

1. To restart the vm:

oakcli stop vm [vm-name]
oakcli stop vm [vm-name] -force
oakcli start vm [vm-name]

2. To show All vm status

oakcli show vm

Example:

root@server oak]# oakcli show vm

NAME NODENUM MEMORY VCPU STATE REPOSITORY
vm1 1 8192 4 ONLINE fileshare

3. Listing Configured Options for a VM Template

oakcli show vmtemplate ol6linux_64

4. Adding a Network to the VM Template

oakcli modify vmtemplate ol6linux_64 -addnetwork net1

5. Configure CPU, Memory on the Template

oakcli configure vmtemplate ol6linux_64 -vcpu 4 -maxvcpu 8 -cpucap 10 -memory 3000M -maxmemory 6G -os OTHER_LINUX

6. Create a VM by Cloning from Template

oakcli clone vm ol6test -vmtemplate ol6linux_64 -repo repo1 -node oda2

7. Override VM Template Values

oakcli configure vm ol6test -vcpu 6 -memory 4G

8. Configure High Availability and Failover Values

oakcli configure vm ol6test -prefnode oda2 -failover oda1

9. Start a VM

oakcli start vm ol6test

10. Access VM Console for a VM:

oakcli show vmconsole [vm-name]

Hope it helps you!

Webinar: How to Get Access to Exadata for Testing (Portuguese)

É amanhã! Não vai perder esse Webinar Free com o Oracle ACE Franky Weber Faust!

Descubra como ter acesso ao seu próprio banco de dados disponível a qualquer momento num Exadata para poder testar smart scan, hybrid columnar compression, storage indexes, entre outros recursos do Exadata. Vamos ver alguns casos práticos de smart scan, storage indexes e HCC. Testes ao vivo, tudo pode acontecer…

Quando: Terça-feira – 03/03/2020 19:30 BRT (AMANHÃ!)
Onde: https://gobrunch.com/events/39067
Realização: GUOB
Palestrante: Franky Weber Faust

Evento ao vivo – Vagas limitadas!

WhatsApp Image 2020-02-28 at 06.54.38

RMOUG Training Days 2020 Review!

Hello all!
I’m here to say my thank you all for attending and participating with me on the RMOUG. It was great to meet old friends and make some new.

There were all sorts of great topics in place, here is my top ten list (in no particular order):

  • OCI: Everything You Need To Know To Get Started – Simon Pane
  • An Autonomous Singularity Approaches: Force Multipliers For Overwhelmed DBAs – Jim Czuprynski
  • Bringing DevOps To Your Database Environment – Michael Haynes
  • Straight Talk On Oracle Licensing & Licensing Trends – Michael Corey and Donald Sullivan
  • SQL Tuning: Undocumented Secrets For Getting Accurate Cardinalities – Kaley Crum
  • Actions For An Impactful Presentation – Dan Norris (not a technical one, but very useful)
  • What’s Your Super-Power? Mine Is ML With Oracle Autonomous DB – Jim Czuprynski
  • Oracle Database Single Sign-On Made Easy-Oracle18c/19c Centrally Managed Users (CMU) – Simon Pane
  • SQL Data Caching For Performance & DB Offload – Douglas Hood
  • Tune Your SQL Without Hints Or Indexes – Kaley Crum
  • Using Kubernetes For DB Scalability & Availability – Douglas Hood
  • Working Effectively With JSON In The Oracle Database – Sean Stacey

Ok, I had to put 12!

By the way, for all my conference latest slides, please check: https://grepora.com/material/

I hope you have all enjoyed it as much as I did!

And if you are an RMOUG board member: Thank you very much for the great welcoming and congratulations for such a great organization and quality event!

Cheers!

General ILOM Faults Management

Hi all,
So just a quick reference: Some useful general commands for ILOM:

1. Check for Faults:

/home/boesing> ssh root@[ilom ip]    

Oracle(R) Integrated Lights Out Manager

Version 3.1.2.10 r74387

Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.

-> cd /SP/faultmgmt
/SP/faultmgmt

-> start shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> 
faultmgmtsp> fmadm 
Usage: fmadm 
  where  is one of the following:
    faulty [-asv] [-u ]   : display list of faulty resources
    faulty -f                   : display faulty on FRUs
    acquit                 : acquit faults on a FRU
    acquit                : acquit faults associated with UUID
    acquit           : acquit faults specified by
                                  (FRU, UUID) combination
    replaced               : replaced faults on a FRU
    repaired               : repaired faults on a FRU
    repair                 : repair faults on a FRU
    rotate errlog               : rotate error log
    rotate fltlog               : rotate fault log

faultmgmtsp> 
faultmgmtsp> fmadm faulty
No faults found

– Another way to see current issues:

show /SP/logs/event/list show faulty

2. Clearing Faults:
In case there is a failure that can be ignored (for example, lost of AC power), it may be cleared:

set /SYS/PSU1 clear_fault_action=true

3. Checking Additional Logs:

start /SYS 
->y 
ls /SYS 
start /SP/console 
-> y
show /SP/logs/event/list 

Some MOS notes for reference:

– Diagnostic information for ILOM, ILO , LO100 issues (Doc ID 1062544.1)
– How to run an ILOM Snapshot on a Sun/Oracle X86 System (Doc ID 1448069.1)
– PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1)

Diagsnap Causing Node Eviction on 12c

Hi all,

So, you know how it is when we are having node evictions over a RAC. Lot’s of sessions getting killed, some effects for the clients and applications, also boss bossing in our shoulders to have a quick resolution over this. Plus it’s never an easy thing to drill down and understand.

Troubleshooting node reboots/evictions within Grid Infrastructure (GI) often is difficult due to the lack of Network and OS level resource information. To help circumvent this situation the diagsnap feature has been developed and integrated with Grid Infrastructure. Diagsnap is triggered to collect Network and OS level resource information when a given node is about to get evicted or when Grid Infrastructure is about to crash.

The diagsnap feature is enabled automatically starting from 12.1.0.2 Oct2017 PSU and 12.2.0.1 Oct2017 RU.
For more information about the diagsnap feature, refer to the Document 2345654.1 “What is diagsnap resource in 12c GI and above?”

However, after a lot of research and SR logs sending and interaction with Oracle to investigate a node eviction, we ended up finding a match to Bug 25785073 – OCSSD hangs while DIAGSNAP takes pstack causing a node reboot(Doc ID 25785073.8).

So Diagsnap is not helping, it is the cause of the issues.

After some research, seems this is not the only bug related to it. See some more:

  • Bug 27182006 – Auxiliary commands generated by DIAGSNAP spin CPU(Doc ID 27182006.8)
  • Bug 24692439 – Auxiliary commands generated by DIAGSNAP consumes high CPU(Doc ID 24692439.8)
  • Bug 23101338 – Disable diagsnap after 12.1.0.2.160419 GI PSU patch was installed(Doc ID 23101338.8)
  • Bug 28462215 – The Process diagsnap.pl is Restarted Every 2 Minutes(Doc ID 28462215.8)

Well,

At least for the 12Cs, I’m disabling it in all my environments:

– Check if diagsnap is disabled or not (DIAGSNAP=Disable if disabled, nothing if enabled)

$ egrep '^DIAGSNAP|^PSTACK' /u01/app/12.2.0.1/grid/crf/admin/crf$(hostname -s).ora

– Disable diagsnap (this will disable diagsnap on all nodes)

$ /u01/app/12.2.0.1/grid/bin/oclumon manage -disable diagsnap

– Check that diagsnap is disabled (DIAGSNAP=Disable if disabled, nothing if enabled) — to be done on each node

$ egrep '^DIAGSNAP|^PSTACK' /u01/app/12.2.0.1/grid/crf/admin/crf$(hostname -s).ora

Hope it helps!
Cheers!

RMOUG Training Days 2020!

Hi all,

I’m happy to let you know I’ll be this year at the RMOUG Training Days 2020!

The event is happening this year from Feb 18th to 20th. Here is the event home, for additional information.

word-image-1 (1)

I see there agenda is full of great names, exponents of Oracle Community. Check here for the complete agenda.

Some other additional important links:

I can barely wait to be among some long-distance friends in this great place:

uk1mhqcwwphb1aocwjp9

See you there!