Sonntag, 25. Juni 2017

Hints, up and down

Last week was the AOUG conference 2017, where I attended Lothar Flatz' presentation Any Hint, anywhere. There Lothar mentioned that it's not required to place hints in the first / topmost SELECT. Even though the sentence itself was clear to me, the consequences and possibilities were not at that time.
During the presentation and some discussion with Lothar afterwards, some possibilities were shown.



Preparation:


drop table T1;
create table T1 as 
select rownum as rn, 'A' as const from dual connect by rownum < 1000000;

create unique index T1I1 on T1 (rn);
create index T1I2 on T1 (const);

update T1 set const='B' where rn=42;
create view V1 as select * from T1;


SQLs with direct hints:


select /*+ qb_name(test1) */ const from V1
where rn = 42;
with this plan:

Plan hash value: 4265153043
 
-----------------------------------------------------
| Id  | Operation                   | Name | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT            |      |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1   |      1 |
|*  2 |   INDEX UNIQUE SCAN         | T1I1 |      1 |
-----------------------------------------------------

and with another query
select /*+ qb_name(test2) */ avg(rn) average
from V1 
where const='A';
Plan hash value: 3724264953
 
--------------------------------------------
| Id  | Operation          | Name | E-Rows |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |        |
|   1 |  SORT AGGREGATE    |      |      1 |
|*  2 |   TABLE ACCESS FULL| T1   |    999K|
--------------------------------------------
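The query block names used in the hints below (like SEL$63695B56) are not guessed - they can be read from the execution plan itself. A minimal sketch of how I would display them, using the +alias / +outline format options of dbms_xplan:

-- run the statement, then display the plan of the last executed cursor
select /*+ qb_name(test1) */ const from V1
where rn = 42;

select *
from table(dbms_xplan.display_cursor(null, null, 'TYPICAL +ALIAS +OUTLINE'));
-- the "Query Block Name / Object Alias" section shows the final query block
-- names (e.g. SEL$63695B56) and object aliases (e.g. T1@SEL$1);
-- the "Outline Data" section shows them already used in ready-made hints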

With these query block names the hints work like a charm:
select /*+ qb_name(test1) FULL(@"SEL$63695B56" "T1") */ const from V1
where rn = 42;

Plan hash value: 3617692013
 
-------------------------------------------
| Id  | Operation         | Name | E-Rows |
-------------------------------------------
|   0 | SELECT STATEMENT  |      |        |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |
-------------------------------------------

and
select /*+ qb_name(test2) INDEX(@"SEL$0E4D0AA2" "T1"@"SEL$1" ("T1"."CONST")) */ avg(rn) average
from V1 
where const='A';
Plan hash value: 2287954029
 
--------------------------------------------------------------
| Id  | Operation                            | Name | E-Rows |
--------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |        |
|   1 |  SORT AGGREGATE                      |      |      1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1   |    999K|
|*  3 |    INDEX RANGE SCAN                  | T1I2 |    999K|
--------------------------------------------------------------


Hints hidden in the view:


When I "hide" the hints inside the view definition:
create or replace view V1 as 
select /*+ FULL(@"SEL$63695B56" "T1") INDEX(@"SEL$0E4D0AA2" "T1"@"SEL$1" ("T1"."CONST")) */ * from T1;

they still work:

select /*+ qb_name(test1) */ const from V1
where rn = 42;

Plan hash value: 3617692013
 
-------------------------------------------
| Id  | Operation         | Name | E-Rows |
-------------------------------------------
|   0 | SELECT STATEMENT  |      |        |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |
-------------------------------------------

and

select /*+ qb_name(test2) */ avg(rn) average
from V1 
where const='A';

Plan hash value: 2287954029
 
--------------------------------------------------------------
| Id  | Operation                            | Name | E-Rows |
--------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |        |
|   1 |  SORT AGGREGATE                      |      |      1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1   |    999K|
|*  3 |    INDEX RANGE SCAN                  | T1I2 |    999K|
--------------------------------------------------------------


When you can edit only parts of the query:

Sometimes an application allows changes only to parts of the query, like additional filters in the WHERE clause:
(let's reset the view:)
create or replace view V1 as select * from T1;

As a hint can only be placed right after a SELECT keyword - and we cannot touch the statement's own SELECT - we need to create an "artificial" SELECT:
select /*+ qb_name(test1) */ const from V1
where rn = 42
  and 1 = (select /*+ FULL(@"SEL$63695B56" "T1") */1 from dual);

so it works as well:
Plan hash value: 317588836
 
--------------------------------------------
| Id  | Operation          | Name | E-Rows |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |        |
|*  1 |  FILTER            |      |        |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |
|   3 |   FAST DUAL        |      |      1 |
--------------------------------------------


Conclusion:

There are many ways to place a hint, so as long as you can manipulate any part of the query, there is a chance to inject a hint.

Samstag, 24. Juni 2017

Method R Profiler 7.0 available

I seldom do product announcements on my blog, so there must be a good reason why I'm breaking this rule now.

Method-R recently released a new version of its powerful SQL trace profiler.
Of course there are many improvements in the profiler engine itself, but as the product was already very stable, these will only help in rare edge cases; most people would not benefit dramatically.

But there is a new feature which brings a lot of added value to the tool and shortens the time to repair/fix/enlightenment for many readers of trace files:
they added an explanatory text to many sections.
Here is an example (from their sample file - slightly edited to better fit the size of this blog):

Your experience duration was dominated by “cell single block physical read” and “CPU: FETCH dbcalls” calls. These are all high-productivity calls; their dominance means that the Oracle Database is working hard for your application. Now you should determine whether all of that work is actually necessary. Consult the Profile by Cursor and Profile by Statement sections for the next step of your diagnosis.
   Subroutine                        Duration          Calls   Duration per call (seconds)
                                     seconds  R                mean    min     skew                 max
 1 cell single block physical read   4.644    69.9%    6,899   0.0007  0.0005  ▁▁▁█▁▁▁▁▁▁▁          0.028
 2 CPU: FETCH dbcalls                2.381    35.8%    7,947   0.0003  0.0000  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁█▁▁  2.071

Unfortunately I did not manage to copy all the nice formatting into this blog, so I created a screenshot to show the original picture.

As you can see, there is a new text with a more detailed explanation of the Subroutine and where to go next for deeper analysis. I see this as a big advantage for everyone who does not work with SQL trace files every day (well, even for those it's helpful).

There is a very nice demo version available, so if you have a trace file to analyze and want to check whether these explanations add value for you, give it a try.

Besides the new Profiler, Method-R has redone their web page as well (and I expect more news to follow). It's worth investing some minutes to have a glimpse.

Dienstag, 11. April 2017

search for the missing ORA-04068

Oracle Active DataGuard is great software for offloading read-only activities from the primary nodes to nodes which are not utilized anyhow (at least during non-disaster times).
A very good example for doing so is regular extraction of data for processing in a data warehouse. That's also the case which led to this post.

The setting is just slightly more complex: both primary and standby sites are RACs - of course, when you plan to switchover/failover, they should be somewhat similar. (And in case of a disaster, the plan is to fail over and disable all not-so-important activities; the warehouse extract would be one of those.)

Also the offloaded activities are slightly more complex than a simple query; in this case PL/SQL code is included. According to the documentation, that's not an issue at all - it's still read only. But in this DB (11.2.0.4) it was an issue: sometimes the result was not as expected.

Now I first need to define the expectation: in this special case the ERROR is the expected result, and no error is the wrong result. Whenever a package (with global variables) is changed, all sessions which initialized the package state before the change, but call it afterwards, must get

04068, 00000, "existing state of packages%s%s%s has been discarded"
// *Cause:  One of errors 4060 - 4067 when attempt to execute a stored
//          procedure.
// *Action: Try again after proper re-initialization of any application's
//          state.

Then the application can catch this error, re-initialize the package state and continue.
That's how it should be.
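Just to make that expectation concrete, a minimal sketch of such a catch-and-retry handler (my own illustration using the procedure from the testcase below, not the application's real code):

DECLARE
  -- ORA-04068 has no predefined exception name, so bind it manually
  state_discarded EXCEPTION;
  PRAGMA EXCEPTION_INIT(state_discarded, -4068);
BEGIN
  SR_TEST1.proc1;
EXCEPTION
  WHEN state_discarded THEN
    -- the package state was discarded; the next call re-initializes the
    -- global variables, so a single retry is enough
    SR_TEST1.proc1;
END;
/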

But we sometimes had strange results in the test environment. After some investigation we found the cause, and I simplified it to this testcase:

Testcase: 
(results are only shown when of any interest) 
INST1_ ... Instance1
INST2_ ... Instance2
PRI ... Primary DB 
ADG ... Active DataGuard 
It's important that the sessions are not closed during the test!

prepare user:
SYS@INST1_PRI:
=============
-- create the user 
create user SR_TEST1 identified by "xxx" default tablespace users temporary tablespace temp; 
grant connect, create session, create procedure, create table to SR_TEST1; 
alter user SR_TEST1 quota unlimited on users; 

SESSION1@INST1_PRI:
==================
-- connect & create objects 
connect SR_TEST1/"xxx" 
set serveroutput on 

create table tab1 (id number, ver varchar2(30)); 
insert into tab1 values (1, '_'); 
commit; 

create or replace PACKAGE PACK AS 

first_load date := to_date('3333-01-01 01:01:01', 'YYYY-MM-DD HH24:MI:SS'); 
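-- package state: first_load starts far in the future and is set to sysdate
-- on the first call of proc in each session, so the output shows when the
-- package state was initialized in that session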

PROCEDURE proc; 

END PACK; 
/ 

CREATE OR REPLACE 
PACKAGE BODY PACK AS 

PROCEDURE proc AS 
BEGIN 
if first_load > sysdate then 
first_load := sysdate; 
end if; 
DBMS_OUTPUT.PUT_LINE('1: loaded first on ' || to_char(first_load,'YYYY-MM-DD HH24:MI:SS') ); 
NULL; 
END proc; 

END PACK; 
/ 
update tab1 set ver='1' where id=1; 
commit; 

create or replace procedure proc1 
is 
begin 
SR_TEST1.pack.proc; 
end; 
/ 

-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 
select * from tab1; 

> 1: loaded first on 2017-03-15 09:52:00 
> 
> PL/SQL procedure successfully completed. 
> 
select * from tab1; 
> 
> ID VER 
>---------- ------------------------------ 
> 1 1 

SESSION2@INST2_PRI:
==================
-- connect & create objects 
connect SR_TEST1/"xxx" 
set serveroutput on 

-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 
select * from tab1; 

> 1: loaded first on 2017-03-15 09:53:00 
> 
> PL/SQL procedure successfully completed. 
> 
select * from tab1; 
> 
> ID VER 
>---------- ------------------------------ 
> 1 1 

SESSION3@INST1_ADG:
==================
-- connect & create objects 
connect SR_TEST1/"xxx" 
set serveroutput on 
-- here the redo apply is running
select sid from v$session where program like '%MRP0%';

>       SID
>----------
>         2


-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 

> 1: loaded first on 2017-03-15 09:53:21 
> 
> PL/SQL procedure successfully completed. 
> 
> select * from tab1; 
> 
> ID VER 
> ---------- ------------------------------ 
> 1 1 

SESSION4@INST1_ADG:
==================
-- connect & create objects 
connect SR_TEST1/"xxx" 
set serveroutput on 
-- NO redo apply is running here
select inst_id, sid from gv$session where program like '%MRP0%';

> no rows selected

-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 

> 1: loaded first on 2017-03-15 09:54:00 
> 
> PL/SQL procedure successfully completed. 
> 
> select * from tab1; 
> 
> ID VER 
> ---------- ------------------------------ 
> 1 1 

SESSION1@INST1_PRI:
==================
-- change package body 

CREATE OR REPLACE 
PACKAGE BODY PACK AS 

PROCEDURE proc AS 
BEGIN 
if first_load > sysdate then 
first_load := sysdate; 
end if; 
DBMS_OUTPUT.PUT_LINE('2: loaded first on ' || to_char(first_load,'YYYY-MM-DD HH24:MI:SS') ); 
NULL; 
END proc; 

END PACK; 
/ 
update tab1 set ver='2' where id=1; 
commit; 


-- test 
set serveroutput on 
exec SR_TEST1.proc1 
select * from tab1; 

> 2: loaded first on 2017-03-15 09:55:51 
> 
> PL/SQL procedure successfully completed. 
> 
> 
> ID VER 
> ---------- ------------------------------ 
> 1 2 

SESSION2@INST2_PRI:
==================
-- test 
set serveroutput on 
exec SR_TEST1.proc1 
SR_TEST1@EBSSID051 > BEGIN SR_TEST1.proc1; END; 

* 
ERROR at line 1: 
ORA-04068: existing state of packages has been discarded 
ORA-04061: existing state of package body "SR_TEST1.PACK" has been invalidated 
ORA-04065: not executed, altered or dropped package body "SR_TEST1.PACK" 
ORA-06508: PL/SQL: could not find program unit being called: "SR_TEST1.PACK" 
ORA-06512: at "SR_TEST1.PROC1", line 4 
ORA-06512: at line 1 


> select * from tab1; 

ID VER 
---------- ------------------------------ 
1 2 

-- THIS Is the expected result 

SESSION3@INST1_ADG:
==================
-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 

* 
ERROR at line 1: 
ORA-04068: existing state of packages has been discarded 
ORA-04061: existing state of package body "SR_TEST1.PACK" has been invalidated 
ORA-04065: not executed, altered or dropped package body "SR_TEST1.PACK" 
ORA-06508: PL/SQL: could not find program unit being called: "SR_TEST1.PACK" 
ORA-06512: at "SR_TEST1.PROC1", line 4 
ORA-06512: at line 1 

select * from tab1; 

ID VER 
---------- ------------------------------ 
1 2 

SESSION4@INST1_ADG:
==================
-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 

1: loaded first on 2017-03-15 09:54:00 

PL/SQL procedure successfully completed. 

select * from tab1; 

ID VER 
---------- ------------------------------ 
1 2 
-- HERE you see the update on the table is applied on ADG already, but the procedure output is still prefixed with 1: - not 2: as it should be.

SESSION4@INST2_ADG:
==================
-- to check: the package is there! 
set pages 99 
select text from all_source where name = 'PACK' 
and type = 'PACKAGE BODY' 
order by line; 

TEXT 
------------------------
PACKAGE BODY PACK AS 

PROCEDURE proc AS 
BEGIN 
if first_load > sysdate then 
first_load := sysdate; 
end if; 
DBMS_OUTPUT.PUT_LINE('2: loaded first on ' || to_char(first_load,'YYYY-MM-DD HH24:MI:SS') ); 

NULL; 
END proc; 

END PACK; 

12 rows selected. 

Even though the testcase is clear and simple, it was not that easy to identify the root cause in the more complex real-life environment. Special thanks to Andy Sayer, who helped me to sort and refine my ideas over Twitter.

With this testcase I was able to open an SR at Oracle, and after some Support ping-pong I got a useful piece of information:
It is known as Bug 18357555: ADD SUPPORT FOR KGLIV_BRNULPALL TO KGLCLUSTERMESSAGE() / KQLMBIVG()

I'm only aware of one patch, 25808607 (for PSU 11.2.0.4.160419 - together with some other patches), which is available right now. But you can ask for your own if you run Active DataGuard with RAC and have packages with global variables.

Mittwoch, 5. April 2017

Can I flush the shared pool of all RAC Instances in an Active DataGuard?

That's a good question.

And the answer is simple: Yes, but NO!

Well, ok; that's not very helpful. But when is anything ever simple in modern IT?

So I'll show why Yes, it's possible:

First, as SYS, create a dummy user without special permissions and a function to flush the shared pool:

create user berx identified by xxx;
grant connect to berx;
grant select on v_$instance to berx;
    
create or replace function sys.flush_shared_pool_bx return varchar2
as
begin
        execute immediate 'alter system flush shared_pool';
  return 'Flush_Shared_Pool executed on ' || sysdate || ' by ' || sys_context('USERENV', 'SESSION_USER');

end;
/
grant execute on sys.flush_shared_pool_bx  to public;

SELECT * FROM TABLE(gv$(CURSOR(
  SELECT INSTANCE_NUMBER, sys.flush_shared_pool_bx flush_result
  FROM v$instance)));  

Unfortunately I cannot use the gv$ function directly as user berx.

So a view does the trick:
create view v_flush_shared_pool_bx
as 
SELECT * FROM TABLE(gv$(CURSOR(
  SELECT INSTANCE_NUMBER, sys.flush_shared_pool_bx flush_result
  FROM v$instance))); 

grant select on v_flush_shared_pool_bx to public;

create or replace
function f_gv_flush
return DBMS_DEBUG_VC2COLL
PIPELINED 
IS
  res varchar2(120);
   cursor c1 is
     SELECT INSTANCE_NUMBER || ' - ' || FLUSH_RESULT as in_result
 FROM TABLE(gv$(CURSOR(
  SELECT INSTANCE_NUMBER, sys.flush_shared_pool_bx flush_result
  FROM v$instance)));
begin
  FOR res_row in c1
  LOOP
      pipe row (to_char(res_row.in_result)) ;
  END LOOP;
  return;
end;
/

and now it's so simple to flush on all instances:

connect berx/

select * from sys.v_flush_shared_pool_bx;
  
INSTANCE_NUMBER FLUSH_RESULT
--------------- --------------------------------------------------
              1 Flush_Shared_Pool executed on 10-APR-17 by BERX
              3 Flush_Shared_Pool executed on 10-APR-17 by BERX
              2 Flush_Shared_Pool executed on 10-APR-17 by BERX


or

select * from table(sys.f_gv_flush);

COLUMN_VALUE
-----------------------------------------------------
1 - Flush_Shared_Pool executed on 10-APR-17 by BERX
2 - Flush_Shared_Pool executed on 10-APR-17 by BERX
3 - Flush_Shared_Pool executed on 10-APR-17 by BERX

So obviously the answer to the question whether I can flush the shared pool in all RAC instances - even in Active DataGuard - is: Yes.

But at the same time the answer is NO, as I'm using the gv$ function - and this is not documented, so it's not wise to use it.

If it were not about Active DataGuard, I'd use scheduler jobs to run the flush shared_pool on all instances, but as the ADG is read-only, I can't use them there.
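For reference, a minimal sketch of that scheduler-based alternative on a read-write RAC database (the job names and the loop are my own illustration; run it as a user with the CREATE JOB and ALTER SYSTEM privileges):

begin
  for i in (select inst_id from gv$instance) loop
    -- one one-off job per instance, pinned to that instance
    dbms_scheduler.create_job(
      job_name   => 'FLUSH_SP_INST' || i.inst_id,
      job_type   => 'PLSQL_BLOCK',
      job_action => q'[begin execute immediate 'alter system flush shared_pool'; end;]',
      enabled    => false);
    dbms_scheduler.set_attribute('FLUSH_SP_INST' || i.inst_id, 'instance_id', i.inst_id);
    -- enabling runs the job once; by default it is auto-dropped afterwards
    dbms_scheduler.enable('FLUSH_SP_INST' || i.inst_id);
  end loop;
end;
/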

If you have another idea how to do this (only with Oracle means, no cron job or similar), please let me know!

Mittwoch, 8. März 2017

force connections to use SERVICE_NAME

During the setup of a project I made a small observation: the Oracle Database is running on a 4-node RAC, but we (DBAs) wanted to provide only 2 instances to the application team.

So we created a dedicated service with Preferred instances: inst1,inst2 and Available instances: inst3,inst4

But after some minutes I saw sessions running on all 4 instances. Those sessions used the service SYS$USERS - so they had managed to "guess" the SID and created a connection string with the SID instead of the SERVICE_NAME.

We contacted them and asked them to use the connection string we provided (with SERVICE_NAME), which was no problem at all.

But the question remains - how to avoid connections with SID?

One idea was to create an AFTER LOGON ON DATABASE trigger and check there for specifically allowed users, hosts, or whatever. This seems fine, but there are situations where SYS$USERS is used by default even by sessions which should not use it. One example are scheduler jobs; this can be solved by changing the SERVICE of the DEFAULT_JOB_CLASS to the desired service:
DBMS_SCHEDULER.SET_ATTRIBUTE ('DEFAULT_JOB_CLASS','SERVICE','PREFERRED_SERVICE');

This requires additional actions and care - and good documentation to avoid future issues.
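Just to illustrate the trigger idea mentioned above, a minimal sketch (the user whitelist and the error text are placeholders, not a tested implementation):

create or replace trigger enforce_service_trg
after logon on database
begin
  -- connections made with a SID (as well as bequeath and job sessions)
  -- show up with the service SYS$USERS; block them unless whitelisted
  if sys_context('USERENV', 'SERVICE_NAME') = 'SYS$USERS'
     and sys_context('USERENV', 'SESSION_USER') not in ('SYS', 'SYSTEM') then
    raise_application_error(-20001,
      'please connect using the provided SERVICE_NAME');
  end if;
end;
/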

My initial question, however, was how to stop connections using the SID through the listener without interfering with generic bequeath connections. So I focused on the listener:

Even though the SID is registered automatically when an instance registers with the listener, it can be specified manually as well.
So I tried to abuse this feature a little bit.

My config:
SID_LIST_LISTENER=
  (SID_LIST=
    (SID_DESC=
      (ORACLE_HOME=/tmp/berx1)
      (SID_NAME=ORCLSID))
)

but the listener searches for an oracle binary in this fake ORACLE_HOME
TNS-01201: Listener cannot find executable /tmp/berx1/bin/oracle for SID EDWP123
so I created a binary there:
ln -s /bin/false  /tmp/berx1/bin/oracle
and now the listener started fine.

A connection attempt using the SID fails now with
ERROR:
ORA-12537: TNS:connection closed

and the listener.log shows the errors
08-MAR-2017 11:14:20 * (CONNECT_DATA=(SID=ORCLSID)(CID=(PROGRAM=sqlplus)(HOST=myhost)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=1.2.3.4)(PORT=50714)) * establish * ORCLSID * 12518
TNS-12518: TNS:listener could not hand off client connection
 TNS-12547: TNS:lost contact
  TNS-12560: TNS:protocol adapter error
   TNS-00517: Lost contact
    Linux Error: 32: Broken pipe

That's more or less a solution to my question.
Of course I do not recommend implementing it, at least not without proper testing.
Especially in a DataGuard setup a SID connection is required for some activities, so in such a configuration a dedicated listener for DataGuard activities is needed.

Samstag, 24. Dezember 2016

interconnect fragmentation kills the cluster

On a particular Oracle 2-node cluster (12.1) we faced random instance failures. Service Requests at Oracle were opened with limited results, as the failures were quite random and we could not link them to any trigger.
As it looked somehow like a communication problem between the 2 nodes, the network team checked the switches involved - without any outcome.
Even though crashing instances were already a problem, it got worse one day when one node rebooted (according to the cluster's alert.log and cssd.log due to network heartbeat issues) and then the cluster stack did not start anymore.

2016-12-12 03:35:34.203 [CLSECHO(54825)]CRS-10001: 12-Dec-16 03:35 AFD-9204: AFD device driver installed or loaded status: 'false' 
2016-12-12 09:17:25.698 [OSYSMOND(1247)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 1247 
2016-12-12 09:17:25.699 [CSSDAGENT(1248)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 1248 
2016-12-12 09:17:25.854 [OCSSD(1264)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 1264 
2016-12-12 09:17:26.899 [OCSSD(1264)]CRS-1713: CSSD daemon is started in hub mode 
2016-12-12 09:17:32.220 [OCSSD(1264)]CRS-1707: Lease acquisition for node yyy2 number 2 completed 
2016-12-12 09:17:33.280 [OCSSD(1264)]CRS-1605: CSSD voting file is online: ORCL:ASM_OCR_VOTE_1; details in /xxx/app/grid/diag/crs/yyy1/crs/trace/ocssd.trc. 
2016-12-12 09:17:33.289 [OCSSD(1264)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1. 
2016-12-12 09:27:25.925 [CSSDAGENT(1248)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:22951} in /xxx/app/grid/diag/crs/yyy2/crs/trace/ohasd_cssdagent_root.trc. 
2016-12-12 09:27:25.925 [OCSSD(1264)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /xxx/app/grid/diag/crs/yyy2/crs/trace/ocssd.trc 
2016-12-12 09:27:25.926 [OCSSD(1264)]CRS-1603: CSSD on node yyy2 shutdown by user. 
Mon Dec 12 09:27:30 2016 
Errors in file /xxx/app/grid/diag/crs/yyy2/crs/trace/ocssd.trc (incident=857): 
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] [] 
Incident details in: /xxx/app/grid/diag/crs/yyy/crs/incident/incdir_857/ocssd_i857.trc 

CSS trace is filled with messages reporting no connectivity with node1:
2016-12-12 09:27:20.375584 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:20.585624 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:20.633907 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:20.633912 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 
2016-12-12 09:27:20.656587 : CSSD:3124418304: clssnmvDHBValidateNCopy: node 1, yyy1, has a disk HB, but no network HB, DHB has rcfg 371663236, wrtcnt, 11120596, LATS 232008644, lastSeqNo 11120595, uniqueness 1476197219, timestamp 1481534839/2789302712 
2016-12-12 09:27:20.868210 : CSSD:3119687424: clssnmSendingThread: Connection pending for node yyy1, number 1, flags 0x00000002 
2016-12-12 09:27:21.375702 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:21.585813 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:21.634038 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:21.634046 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 
2016-12-12 09:27:21.657538 : CSSD:3124418304: clssnmvDHBValidateNCopy: node 1, yyy1, has a disk HB, but no network HB, DHB has rcfg 371663236, wrtcnt, 11120597, LATS 232009644, lastSeqNo 11120596, uniqueness 1476197219, timestamp 1481534840/2789303712 
2016-12-12 09:27:21.868336 : CSSD:3119687424: clssnmSendingThread: Connection pending for node yyy1, number 1, flags 0x00000002 
2016-12-12 09:27:22.375830 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:22.586063 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:22.634195 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:22.634203 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 

After even more investigation on the network, another SR was filed.

Due to the previous SRs, oswatcher was already installed, and there we found the important information in the netstat section:

zzz ***Fri Dec 9 14:54:54 GMT 2016 
Ip: 
13943376329 total packets received 
129843 with invalid addresses 
0 forwarded 
0 incoming packets discarded 
11934989273 incoming packets delivered 
11631767391 requests sent out 
2 outgoing packets dropped 
148375 fragments dropped after timeout 
2498052793 reassemblies required 
494739589 packets reassembled ok 
353229 packet reassembles failed 
411073325 fragments received ok 
2109526776 fragments created 

and after 2 minutes:
zzz ***Fri Dec 9 14:56:55 GMT 2016 
Ip: 
13943469180 total packets received 
129849 with invalid addresses 
0 forwarded 
0 incoming packets discarded 
11935067348 incoming packets delivered 
11631828206 requests sent out 
2 outgoing packets dropped 
148375 fragments dropped after timeout 
2498069258 reassemblies required 
494741345 packets reassembled ok 
359542 packet reassembles failed 
411073565 fragments received ok 
2109528513 fragments created 

The important part is the increase in packet reassembles failed: 6313 more failures (359542 - 353229) within two minutes, compared to 16465 additional reassemblies required (2498069258 - 2498052793).

This led to some notes which describe both of our symptoms (instance and cluster stack failures):

RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1)

and

The CRSD is Intermediate State and Not Joining to the Cluster (Doc ID 2168576.1)



Fragmentation happens when the sender wants to send more data than fits into a single packet; the receiver then has to reassemble the fragments. In this cluster the MTU size is 1500 - and in our example we had 16465 datagrams which needed to be reassembled within two minutes, but 6313 of them failed. There are some Linux kernel parameters which affect the buffer the kernel uses to reassemble fragmented datagrams.

The solution for our system was to increase 2 parameters:

net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640 

These can be changed in the running system in
/proc/sys/net/ipv4/ipfrag_low_thresh
/proc/sys/net/ipv4/ipfrag_high_thresh
and for persistent changes in sysctl.conf

Unfortunately these parameters were not mentioned in any of the prerequisite scripts I found.

With all this knowledge we identified an important difference to other clusters: this one is the only one with an MTU of 1500 - so many more fragmented packets needed care here.

After the issue itself was solved, I wondered if it could also be found on a vanilla 12.1 crs installation
(vanilla in comparison to our setup, where oswatcher was installed due to the earlier SRs).
Yes, our beloved -MGMTDB holds the information already! It's in the documentation as well (Troubleshooting Oracle Clusterware), and in the output of oclumon dumpnodeview I can see:
IPReasFail - Number of failures detected by the IPv4 reassembly algorithm
Node: yyy1 Clock: '16-11-26 06.55.27 Etc/GMT' SerialNo:443 
NICS: 
bond0 netrr: 159.021 netwr: 181.688 neteff: 340.709 nicerrors: 0 pktsin: 412 pktsout: 358 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 412 innonunicast: 0 type: PUBLIC 
lo netrr: 37.722 netwr: 37.722 neteff: 75.443 nicerrors: 0 pktsin: 95 pktsout: 95 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 95 innonunicast: 0 type: PUBLIC 
bond1 netrr: 2350.313 netwr: 42989.510 neteff: 45339.823 nicerrors: 0 pktsin: 1927 pktsout: 31345 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 1927 innonunicast: 0 type: PRIVATE 
PROTOCOL ERRORS: 
IPHdrErr: 0 IPAddrErr: 102203 IPUnkProto: 0 IPReasFail: 59886 IPFragFail: 0 TCPFailedConn: 12598 TCPEstRst: 335559 TCPRetraSeg: 67276584 UDPUnkPort: 40134 UDPRcvErr: 0 

Unfortunately the format is kind of clumsy - I will need to dig into its tables for a better output, especially for quick but powerful reports during problems.


During my research I discovered it's not an Oracle-only problem; others are affected as well (and provide a great description).





Sonntag, 20. November 2016

cost vs. gets

Last week I hit an interesting performance issue:
A table (T) has 2 indexes (of interest): one (IX_1) is only on column S, the other (IX_2) is on (C, S).
The optimizer calculates the plan with IX_1 as more expensive than the plan with IX_2 - as the latter should be more selective.
But the gets for the plan with IX_1 were fewer than those of the plan the optimizer preferred.

Here is the information about the statement, the indexes and the plans.
As it's a real-life example (just the object names are obfuscated), please do not wonder about the Partition operators. This statement only worked on the same subset of partitions - based on record_timestamp.


SELECT <columns>
FROM T WHERE S = :BIND0
         AND C = :BIND1
   AND record_timestamp > :BIND2
--       AND some more filters 

INDEX:

IX_1 (S)
IX_2 (C, S)


Plan1:
Plan hash value: 1749230273
------------------------------------------------------------------------------------------------------
| Id  | Operation                                     | Name            | Rows  | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                              |                 |   218 |   352K|   517   (1)|
|   1 |  SORT ORDER BY                                |                 |   218 |   352K|   517   (1)|
|*  2 |   FILTER                                      |                 |       |       |            |
|   3 |    PARTITION RANGE ITERATOR                   |                 |   218 |   352K|   516   (0)|
|   4 |     PARTITION LIST SINGLE                     |                 |   218 |   352K|   516   (0)|
|*  5 |      TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T               |   218 |   352K|   516   (0)|
|*  6 |       INDEX RANGE SCAN                        | IX_1            |   473 |       |    63   (0)|
------------------------------------------------------------------------------------------------------
consistent gets 27647

Plan2:
Plan hash value: 4278735161
------------------------------------------------------------------------------------------------
| Id  | Operation                                     | Name      | Rows  | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                              |           |   218 |   352K|   505   (1)|
|   1 |  SORT ORDER BY                                |           |   218 |   352K|   505   (1)|
|*  2 |   FILTER                                      |           |       |       |            |
|   3 |    PARTITION RANGE ITERATOR                   |           |   218 |   352K|   504   (0)|
|   4 |     PARTITION LIST SINGLE                     |           |   218 |   352K|   504   (0)|
|*  5 |      TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T         |   218 |   352K|   504   (0)|
|*  6 |       INDEX RANGE SCAN                        | IX_2      |   461 |       |    63   (0)|
------------------------------------------------------------------------------------------------
consistent gets 33807


How could the optimizer be wrong, even though all table and column statistics are correct?

It's simply the relation between columns C and S: there can be many S for a given C, but each S always has one dedicated C. Both indexes therefore provide the same ROWIDs for any given (C, S) - but as IX_1 is smaller than IX_2, fewer gets are required.
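Just for illustration, forcing the smaller index would only need a simple index hint (same obfuscated names as above):

SELECT /*+ INDEX(T IX_1) */ <columns>
FROM T WHERE S = :BIND0
         AND C = :BIND1
   AND record_timestamp > :BIND2
--       AND some more filters 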

But before we simply suggest to hint (or outline, profile, patch, ...) the statement to use IX_1, here is some more information about the ongoing discussion:
As the application loops through all Cs, and for every C through all its S, it might be more efficient to have all S per C (as in IX_2) in the buffer cache for the time a C is of interest - and afterwards not care about those blocks at all. IX_1 would hit the same blocks quite often, as the S are in no way grouped/ordered by C. I'm not sure if we can ever build a reasonable testcase, as it's hard to bring the environment into a state free of disturbing effects in order to measure this.

This shows how easily the optimizer can be misguided, and that even with reasonable knowledge about the data and the application, a decision for the "best" solution can be hard to impossible.