SYSTEMSTATE DUMP

Hi guys!
I already posted about Hang Analyze (part1, part2) and ASHDUMP. Now, in the same ‘package’, let me show you about SYSTEMSTATE DUMP.

Systemstate is basically made by the process state for all process in instance (or instances) at the time the systemstate is called.
Through a systemstate it’s possible to identify enqeues, rowcache locks, mutexes, library cache pins and locks, latch free situations, and other kind of chains.

It’s a good tool to add in a SR, but it’s quite hard to habituate on reading/interpreting the file. To undertand exactly how to read a systemstate I’d recommend you the best: Read the manual!
The doc Reading and Understanding Systemstate Dumps (Doc ID 423153.1) has a very good explanation with examples, I’m not able to to it better. 🙂

What I can do is share about the SYSTEMSTATE levels. I had some difficult to find it…
But before I show how you can do it, it’s important to mention that Oracle does not recommend you to use ‘numeric events’ without a SR (MOS), according to Note: 75713.1.

More“SYSTEMSTATE DUMP”

ASHDUMP for Instance Crash/Hang ‘Post Mortem’ Analysis

Hi guys!
In the last weeks I talked about ASHDUMP in the post HANGANALYZE Part 1. Let’s think about it now…

Imagine the situation: The database is hanging, you cannot find what is going on and decided to restart the database OR your leader/boss yelled to you do it so, OR you know the database is going do get down, anyway…
Everyone has passed by this kind of situation at least once. After restart everything become OK and the ‘problem’ was solved. But now you are being asked about RCA (what caused this situation?). The database was hanging, so no snap was closed and you lost the ASH info…

For this cases I think is very useful to take 1 minute before database get down to generate an ASHDUMP. It’s very simple:

More“ASHDUMP for Instance Crash/Hang ‘Post Mortem’ Analysis”

HANGANALYZE Part 2

Hi!
See the first part of this post here: HANGANALIZE Part 1.

This post is just complement with a little kludge I liked…
First, let’s remmember that the hanganalyze is used when you are if some hanging in your environment, of course.

But what if you are having difficult to access the database, even with ‘/ as sysdba’?

You can create a ‘preliminary connection’ without create a session, like this:

sqlplus -prelim / as sysdba

This ‘feature’ is available since Oracle 10g, and it basically skips a session creation part (which could block) when logging on as SYSDBA.

More“HANGANALYZE Part 2”

HANGANALYZE Part 1

Hi all!
I realized I have some posts about database hangs but have no posts about hanganalyze, system state or ashdump usage. So let’s fix it. 🙂
To organize the ideas I’m going to split the subject on three posts. This first will be about hanganalyse.

See the second part of this post here: HANGANALIZE Part 2.

Ok, so let me refer the most clear Oracle words I could found:
“Hanganalyze tries to work out who is waiting for who by building wait chains, and then depending on the level will request various processes to dump their errorstack.”

This is very similar to what we can do manually through v$wait_chains. But is quicker and ‘oficial’, so let’s use! 😀

But before I show how you can do it, it’s important to mention that Oracle does not recommend you to use ‘numeric events’ without a SR (MOS), according to Note: 75713.1.

So, how to do it? Basically 2 ways:

1) ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level LL'; OR EVENT="60 trace name HANGANALYZE level 5"
2) ORADEBUG hanganalyze LL

More“HANGANALYZE Part 1”

Adding datafile hang on “enq: TT – contention”

Yesterday a colegue asked me about “enq: TT – contention” event on his session that is adding a a datafile in a tablespace wich run out of space in a 11.1.0.7 Database.
I’ve faced this situation another time and decided to document it.

Oracle refer Bug 8332021 : CANNOT ADD A DF WHEN SESSIONS ARE REPORTING ORA-1653 ON 11.1.0.7 for this situation.

The pointing solutions are:
– “Apply Patch 8332021”
– “Alternatively, you can upgrade to 11.2.0.2 or higher as the patch is included in the 11.2.0.2 patch set.”

The not documented workaround (just for you, by Matheus :D) is:
– Cancel session adding datafile.
– Extend any datafile to resume sessions waiting (in resumable state).
– Readd datafile.
This extend action relieves the blocks and will allow you to add the datafile.

Hugs!

Matheus.

Lock by DBLink – How to locate the remote session?

And if you identify a lock or other unwanted operation by a DBLink session, how to identify the original session in remote database (origin dabatase)?
The one million answer is simple: by process of v$session. By the way, looks like is easier than find the local process (spid)… Take a look in my example (scripts in the end of post):

dest> @sid
Sid:10035
Inst:1
SEQ# EVENT MODULE STATUS SID SERIAL# INST_ID
----- --------- ---------- ---------- ---------- ---------- ----------
29912 SQL*Net message from client oracle@origin2(TNS V1-V3) INACTIVE 10035 35 1
dest> @spid
SPID SID PID PROCESS_FOR_DB_LINK MACHINE LOGON_TIME
------ ---------- ---------- ----------- ----------- -----------
16188960 10035 882 17302472 origin2 24/08/2015 07:43:40

Now I know the sid 10035 refers to local process 16188960 and the process on origin database is 17302472. What I do what I want if this process:

root@origin2:/oracle/diag/rdbms/origin/origin2/trace>ps -ef |grep 17302472
grid 17302472 1 97 07:42:42 - 5:58 oracleorigin2 (LOCAL=NO)
root 24445782 36700580 0 08:05:45 pts/3 0:00 grep 17302472

What include to locae the session in the database by spid, see the sql, and etecetera:

origin> @spid2
Enter value for process: 17302472
SID SERIAL# USERNAME OSUSER PROGRAM STATUS
------- ---------- ----------- ----------- --------------- ----------
7951 41323 USER_XPTO scheduler_user sqlplus@scheduler_app.domain.net (TNS V1-V3) ACTIVE
database2> @sid
Sid:7951
Inst: 2
SQL_ID SEQ# EVENT MODULE STATUS SID SERIAL# INST_ID
---------- ----- --------- ------- --------- ----- ------ ----------
1w1wz2mdunya1 56778 db file sequential read REMOTE_LOAD ACTIVE 7951 41323 2

That’s OK?
Simple isn’t?

The used Scripts (except the “sid”, that is a simple SQL on gv$session):

Get SPID and PROCESS FOR DBLINK from a SID:

# spid:
col machine format a30
col process format 999999
select p.spid,b.sid, p.pid, b.process as process_for_db_link, machine, logon_time
from v$session b, v$process p
where b.paddr=p.addr
and sid=&sid
/

Get SID from SPID:

#spid2:
SELECT s.sid, s.serial#, s.username,
s.osuser, s.program, s.status,
FROM v$session s, v$process p
WHERE s.paddr = p.addr
AND p.spid IN (&process);
/

See ya!
Matheus.

Application Hangs: resmgr:become active

Application APP hangs with resmgr:become active. There is a resource plan defined who has a specific group to this Application. What is wrong and how to fix?

Here I presume you what is a resource manager and a resource plan. And, of course, for what purpose they exists. You must to know that this event is related to high active sessions in the group of resource plan too.

Before everything else, please understand if this is an acceptable behavior of the application. Then, in which resource group the sessions in this event are. The are other application in this same group with an unacceptable behavior? Yes? So, fix it.
No? Consider tho adjust the resource plan, switch the application to a new group, or, like in my case, remap the Application APP to the right group… ¬¬

To make it clear: In my case, the mapping is missing, so the schema MYAPP (Application APP) fit to OTHER_GROUP, where we use to set minimal limits:

SID SERIAL# INST_ID USERNAME RESOURCE_CONSUMER_GROUP EVENT
----- ---------- ---------- ------------------------------ -----------
492 29459 2 MYAPP OTHER_GROUPS resmgr:become active
1102 19145 2 MYAPP OTHER_GROUPS resmgr:become active
955 33161 2 MYAPP OTHER_GROUPS resmgr:become active
1084 33839 2 MYAPP OTHER_GROUPS db file sequential read
MYDB> show parameters resource_manager_plan
NAME TYPE VALUE
--------------------- ------ --------------
resource_manager_plan string MYDB_PLAN
MYDB> select group_or_subplan, active_sess_pool_p1, cpu_p1, cpu_p2, cpu_p3, cpu_p4 from DBA_RSRC_PLAN_DIRECTIVES where plan = 'MYDB_PLAN'
Enter value for plano: MYDB_PLAN
GROUP_OR_SUBPLAN ACTIVE_SESS_POOL_P1 CPU_P1 CPU_P2 CPU_P3 CPU_P4
------------------------------ ------------------- ---------- ---------- ---------- ----------
BATCH_GROUP 60 0 10 0 0
SYS_GROUP 80 0 0 0
APP_PLAN 20 0 30 0 0
OTHER_GROUPS 20 0 20 0 0
GGATE_GROUP 0 10 0 0
PAYTRUE_GROUP 40 0 30 0 0
DBA_GROUP 20 0 0 0

You can configure the mapping by user like that:

BEGIN
DBMS_RESOURCE_MANAGER.clear_pending_area;
DBMS_RESOURCE_MANAGER.create_pending_area;
DBMS_RESOURCE_MANAGER.set_consumer_group_mapping (
attribute => DBMS_RESOURCE_MANAGER.oracle_user,
-- DBMS_RESOURCE_MANAGER.service_name (or a lot of possibilities. Google it!)
value => 'MYAPP',
consumer_group => 'APP_PLAN');
DBMS_RESOURCE_MANAGER.validate_pending_area;
DBMS_RESOURCE_MANAGER.submit_pending_area;
END;
/

To switch the connected sessions, it can be done like:

SELECT 'EXEC DBMS_RESOURCE_MANAGER.SWITCH_CONSUMER_GROUP_FOR_SESS ('''||SID||''','''||SERIAL#||''',''APP_PLAN'');' FROM V$SESSION where username='MYAPP'
and RESOURCE_CONSUMER_GROUP='OTHER_GROUPS';

Remember that creating a resource plan without making the mappings is a bit pointless… 😛

Matheus.

Application Looping Until Lock a Row with NOWAIT Clause

Yesterday I treated an interesting situation:
A BATCH stayed on “SQL*Net message from client” event but the last_call_et was always on 1 or 0. Seems OK, with some client contention to send the commands to the DBMS, right? Nope.

It was caused by a loop in the application code “waiting” for a row lock but without “DBMS waiting events” (something like “select * from table for update nowait”). Take a look in how it was identified below.

First the session with no SQL_ID, no wait events and last_Call_et=0 of a “BATH_PROCESS” user:

proddb2> @sid
Sid:9796
Inst:
LAST_CALL_ET SQL_ID   EVENT STATUS SID SERIAL# INST_ID USERNAME
------------ ------- ------------- ---------- ------------------------
0 SQL*Net message from client INACTIVE 9796 45117 2 BATCH_PROCESS
proddb2> @trace
Enter value for sid: 9796
Enter value for serial: 45117
PL/SQL procedure successfully completed.

As you see, with no idea about what is happening, I started a trace. The trace was stuck with this:

*** 2015-06-15 14:03:25.755
WAIT #4574470448: nam='SQL*Net message from client' ela=993072 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833326636999
CLOSE #4574470448:c=10,e=15,dep=0,type=3,tim=12833326637228
PARSE #4574470448:c=25,e=41,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1139820409,tim=12833326637286
BINDS #4574470448:
Bind#0
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=24 off=0
kxsbbbfp=110a8d0d8 bln=22 avl=05 flg=05
value=5022011
WAIT #4574470448: nam='gc cr block 2-way' ela= 709 p1=442 p2=5944 p3=8483 obj#=0 tim=12833326638533
WAIT #4574470448: nam='gc cr block 2-way' ela= 541 p1=3 p2=2088264 p3=4367 obj#=0 tim=12833326639352
WAIT #4574470448: nam='gc cr block 2-way' ela= 651 p1=442 p2=5944 p3=8483 obj#=0 tim=12833326641673
WAIT #4574470448: nam='enq: TX - row lock contention' ela= 1093 name|mode=1415053318 usn<obj#=23141074 tim=12833326643029
EXEC #4574470448:c=1776,e=5836,p=0,cr=117,cu=1,mis=0,r=0,dep=0,og=1,plh=1139820409,tim=12833326643150
ERROR #4574470448:err=54 tim=12833326643172
WAIT #4574470448: nam='SQL*Net break/reset to client' ela= 9 driver id=1413697536 break?=1 p3=0 obj#=23141074 tim=12833326643373
WAIT #4574470448: nam='SQL*Net break/reset to client' ela= 503 driver id=1413697536 break?=0 p3=0 obj#=23141074 tim=12833326643891
WAIT #4574470448: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833326643915

AHÁ!
Did you see the “err=54” there? Yes. You know this error:

ORA-00054: Resource busy and acquire with NOWAIT specified

It’s caused by a SELECT FOR UPDATE NOWAIT in the code.
But, this select is in a loop, so the session don’t go ahead until have it.
(Obviously it could be coded with some treatment/better logic for this loop and errors, buuuut…)

What can we do now?
The easy way is to discover the holding session and kill it.
And sometimes the easy way is the best way. 😉

For that, we use the “obj#” and “value”, also bolded in the trace.
As I know the application, I know that the used field in all “where clauses” is the “RECNO” column. But if you don’t, it’s needed to discover. With this information in mind:

proddb2>select * from dba_objects where object_id='23141074';
OWNER OBJECT_NAME
------------------------------ ----------------
OWNER_EXAMPLE TABLE_XPTO
proddb2> select * from OWNER_EXAMPLE.TABLE_XPTO WHERE recno=5022011;
COL_KEY FSAMED0 FSAMED1 FSMNEG1 FSMNEG2 FSMNEG3 COL_DATE RECNO
------- ---------- ---------- ---------- ---------- ---------- -----
1002974 0 0 -516.8 0 0 15/06/2015 00:00:00 5022011

Ok, I know the row that is holded by the other session.
Let’s discover which session is causing a lock by myself (but in my case, without “NOWAIT” clause, to have time to find the holder):

proddb5>select * from OWNER_EXAMPLE.TABLE_XPTO WHERE recno=5022011 for update;

In another sqlplus session:

proddb2> @me
INST_ID SID SERIAL# USERNAME EVENT BLOCKING_SE BLOCKING_SESSION BLOCKING_INSTANCE
------- ---------- ---------- --------------- ----------------------
5 14174 479 MATHEUS_BOESING enq:TX - row lock contention VALID 11006 1
2 4233 12879 MATHEUS_BOESING PX Deq: Execution Msg NOT IN WAIT
1 15410 7697 MATHEUS_BOESING PX Deq: Execution Msg NOT IN WAIT

AHÁ again!
The SID 11006. Let’s see who is there:

proddb2> @sid
Sid:11006
Inst:
SQL_ID SEQ# EVENT STATUS SID SERIAL# INST_ID USERNAME
-------------------- ---------- --------------------------------------
9jzm6vn5j06js 24919 enq: TX - row lock contention ACTIVE 11006 44627 1 DBLINK_OTHER_BATCH_SCHEMA

Ok, it’s another session of a different batch process in a remote database holding this row. As it’s less relevant, lets kill! Muahaha!
Then, you’ll see, my session get the lock and is in the middle of a transaction:

proddb1> @kill
***
sid : 11006
serial : 44627
***
System altered.
***
proddb1> @me
INST_ID SID SERIAL# USERNAME EVENT BLOCKING_SE BLOCKING_SESSION BLOCKING_INSTANCE
------- ---------- ---------- --------------- --------------------
5 14174 479 MATHEUS_BOESING transaction UNKNOWN
2 4332 56037 MATHEUS_BOESING PX Deq: Execution Msg NOT IN WAIT
1 12058 9 MATHEUS_BOESING class slave wait NO HOLDER

To release the “row locked” to my principal process, lets suicide (kill my own session, this case, that is holding the row lock right now).

proddb5> @kill
***
sid : 14174
serial : 479
***
System altered.
***

After kill all the holding sessions, my BATCH_PROCESS just gone! 😀
Take a look on the trace (running ok):

WAIT #4576933904: nam='SQL*Net message to client' ela= 3 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833981531019
FETCH #4576933904:c=45,e=71,p=0,cr=3,cu=0,mis=0,r=5,dep=0,og=1,plh=419358542,tim=12833981531062
WAIT #4576933904: nam='SQL*Net message from client' ela= 562 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833981531654
WAIT #4576933904: nam='SQL*Net message to client' ela= 3 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833981531788
FETCH #4576933904:c=55,e=86,p=0,cr=2,cu=0,mis=0,r=5,dep=0,og=1,plh=419358542,tim=12833981531826
WAIT #4576933904: nam='SQL*Net message from client' ela= 715 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833981532576
WAIT #4576933904: nam='SQL*Net message to client' ela= 4 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833981532721
FETCH #4576933904:c=61,e=96,p=0,cr=2,cu=0,mis=0,r=5,dep=0,og=1,plh=419358542,tim=12833981532758
WAIT #4576933904: nam='SQL*Net message from client' ela= 600 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833981533617
WAIT #4576933904: nam='SQL*Net message to client' ela= 3 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833981534163
FETCH #4576933904:c=52,e=82,p=0,cr=2,cu=0,mis=0,r=5,dep=0,og=1,plh=419358542,tim=12833981534203
WAIT #4576933904: nam='SQL*Net message from client' ela= 517 driver id=1413697536 #bytes=1 p3=0 obj#=23141074 tim=12833981534752

Now, with the problem solved, lets disable the trace and continue the other daily tasks… 🙂

proddb2> @untrace
Enter value for sid: 9796
Enter value for serial: 45117
PL/SQL procedure successfully completed.

I hope it was useful!
If helped you, make a comment! 😀

See ya!
Matheus.

High CPU usage by LMS and Node Evictions: Solved by Setting “_high_priority_processes”

Another thing that may help you in environments with highly interdependent applications:

Our env has high interconnect network block changing, and, as a consequence, high CPU usage by Global Cache Services (GCS)/Lock Manager Server Process (LMS).

This way, for each little latency in the interconnect interface, we were having a node eviction and all the impacts to the legacy application you can imagine (without gridlink or any solution to make the relocation ‘transparent’, as is usual to legacy application) and, of course, the business impact.

Oracle obviously suggested that we reduce the block concurrency over the cluster nodes grouping the application by affinity. But, it’s just no applicable to our env… 🙁

When nothing seemed to help, the workaround came from here: Top 5 Database and/or Instance Performance Issues in RAC Environment (Doc ID 1373500.1).

Here is our change:

boesing@proddb> alter system set "_high_priority_processes"='LMS*|LGWR|VKTM' scope=spfile sid='*';
System altered.

No magic, but the problem stopped to happen. After that, we’re having some warnings about clock synchronization over the cluster nodes on CRS alerts. Like this:

CRS-2409:The clock on host proddb1 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization. Service is running in observer mode.

I believe it happens because VKTM lost priority. But it’s OK: The node evictions has stopped! 😀

Matheus.

VPD: “row cache objects” latch contention

The other day, we found high occurrence of latch events in our principal/core environment (11.2.0.3.0). The origins are all “different businesses channels” that access objects through the use of VPD. The latch events was bit by bit dominating the environment during the last months and turn on an “attention alarm” to us.

Then we found the the note: Bug 12772404 – Significant “row cache objects” latch contention when using VPD – superseded (Doc ID 12772404.8).

The situation is right the same:

“When VPD is used, intense row cache objects latch contention (dc_users) may caused by an internal Exempt Access Policy privilege check. Rediscovery Information: 
VPD is in use 
Significant “latch: row cache objects” waits occur
The waits are for the latch covering dc_users”

Take a look on the DC_USERS latches:
dc_users

And about the workaround:
“There is no direct workaround available.
The following guidelines may help to alleviate the problem :
– Dropping the database roles from our user:
The Number of Roles granted to user can increase the row cache
look-ups proportionally. When database is required to check whether
a system privilege is granted to User, it checks if that privilege
is granted to any of the User’s roles. Hence, it’s not helpful
to do something like “set role A, B, C, D, F …” to recreate its
environment for every execution.
– Changing the policy function might be helpful in some cases
eg: To use CONTEXT dependent policies instead of DYNAMIC policies”

Take a look in one of the examples of:

boesing@mydb4> /
P1RAW EVENT USERNAME SQL_ID SQL_CHILD_NUMBER LAST_CALL_ET SID SEQ# WAIT_TIME SECOND
--------- ---------------- ------ ------ ---------- ---------- ------
0700011807B50D08 latch: row cache objects CHANNELAPP 4nwvpx8xt3h3m 22 0 1276 59113 0
0700011807B50D08 latch: row cache objects CHANNELAPP fp3mft3usb74w 0 21719 16636 0
0700011807B50D08 latch: row cache objects CHANNELAPP 58pund2p09hgg 0 6774 11061 0
0700011807B50D08 latch: row cache objects OTHER_CHANNELAPP 54a2wfa60rgu1 1 0 8046 12386 0
0700011807B50D08 latch: row cache objects CHANNELAPP 1gwr69wduk9v4 42 0 9454 53927 0
0700011807B50D08 latch: row cache objects OTHER_CHANNELAPP 9pqrqqfzukrq4 68 0 9732 19311 0
0700011807B50D08 latch: row cache objects CHANNELAPP d1bnq8wb0nhrf 0 1 11425 56830 -1
0700011807B50D08 latch: row cache objects CHANNELAPP 32aqdd8cbmc4b 0 11711 39182 0
0700011807B50D08 latch: row cache objects IB_RUN adgnrpwazbfmz 0 12133 3372 0
0700011807B50D08 latch: row cache objects IB_RUN cqmgxvb78q9hy 0 17913 6345 0
0700011807B50D08 latch: row cache objects CHANNELAPP byzm159jbjxaa 0 6 19606 52624 0
0700011807B50D08 latch: row cache objects OTHER_CHANNELAPP 2kbjztd9yzqfm 61 0 20732 28687 0
0700011807B50D08 latch: row cache objects CHANNELAPP 6dvagdabts9nx 19 7 21011 504 0
0700011807B50D08 latch: row cache objects CHANNELAPP 9pqrqqfzukrq4 78 0 21439 19030 0
0700011807B50D08 latch: row cache objects CHANNELAPP gq1avu79h2np3 85 0 3815 33831 -1
boesing@mydb4>SELECT child# FROM v$latch_children WHERE addr= '0700011807B50D08';
CHILD#
----------
8
boesing@mydb4> select s.kqrstcln latch#, s.kqrstcid cache#, kqrsttxt name from x$kqrst s where s.kqrstcln=8;
LATCH# CACHE# NAME
---------- ---------- --------------------------------
8 10 dc_users
8 7 dc_users
8 7 dc_users
8 7 dc_users

The problem was definitively solved by applying the 11.2.0.4.2 PSU. No problems after that.
Good luck, if it’s your situation.

Hugs!
Matheus.