代码改变世界

Oracle :Insert ways.

2011-07-19 11:44  Tracy.  阅读(272)  评论(0编辑  收藏  举报

You Asked

I'am working on enhancing the performance of some long running 
queries in our application. The existing code uses the NOT EXISTS 
clause extensively and I wondered if it would be a good idea to 
use OUTER JOINS instead, wherever possible. For example, all else 
being equal, which of the following would execute faster ?

Insert into Title(......)
select .....
from rollup R
where not exists ( select 'Found' from title T 
                   where R.source_id = T.Title_ID);

Insert into Title(.......)
select ......
from title T, rollup R
where R.source_id = T.Title_id(+)
and T.Title_id is null;

Note that Rollup has about 5 million rows and Title about 
1 million rows and both are indexed on the join fields.
In trials, I did notice that the latter was about 30% faster 
but does that have a theoritical basis ?

Would appreciate an early reply.

Thanks,

Sunder 


 

and we said...

There are at least 3 ways to do this above (minus might be a 4'th depending on your 
circumstances).

The way with NOT exists works well for small OUTER tables (or outer queries that have 
other predicates the result in "smallish" result sets before we have to do the correlated 
subquery).  A NOT exists works something like this:

  for x in ( select * from rollup ) loop
     if ( not exists ( that query ) ) then
            OUTPUT
     end if;
  end

it in effect runs the subquery over and over and over....

The outer join, method number 2, allows us to do more large scale, bulk type operations.  
Anything we do in bulk, we can probably do faster.  So, by not running that subquery over 
and over, we gained some efficiency.  It'll depend on your indexes and such as well how 
well this goes.

The bulk operations we do -- thats your "theroretical basis".

A 3'rd alternative is a NOT IN query.  As long as you understand that NOT IN and NOT 
EXISTS are semantically different, this might run even faster yet.  NOT IN will not work 
correctly if the subquery returns a NULL (the result set will be empty if it does).  So, 
we have to add a filter "where title_id is NOT NULL" to the subquery to make this be the 
same.

Here is a very very small test showing the three methods.  Existence (or lack of) indexes 
may effect this test and so on.  You might consider the NOT IN with the HASH_AJ hint 
(although using the CBO should have this happen automatically -- if the NOT IN runs 
really really slow, its not using the HASH_AJ -- you'd have to hint it in that case).

this test is not a benchmark, it is just illustrative of the three methods and contains 
some comments on their plans.


DOC>create table t as select * from all_objects;
DOC>create index t_idx on t(object_id);

DOC>create table t2 as select * from all_objects where mod(object_id,50) = 0;
DOC>create index t2_idx on t2(object_id);

DOC>*/

ops$tkyte@DEV8I.WORLD> set autotrace on explain
ops$tkyte@DEV8I.WORLD> select count(object_name)
  2    from t
  3   where NOT EXISTS( select null
  4                       from t2
  5                      where t2.object_id = t.object_id )
  6  /

COUNT(OBJECT_NAME)
------------------
             22251

Elapsed: 00:00:00.85

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   SORT (AGGREGATE)
   2    1     FILTER
   3    2       TABLE ACCESS (FULL) OF 'T'
   4    2       INDEX (RANGE SCAN) OF 'T2_IDX' (NON-UNIQUE)

Here, it full scans T and does that subquery for each row -- thats the index probe we 
see in there, the subquery running

ops$tkyte@DEV8I.WORLD> 
ops$tkyte@DEV8I.WORLD> select /*+ HASH_AJ */ count(object_name)
  2    from t
  3   where object_id NOT IN ( select object_id
  4                              from t2 
  5                             where object_id IS NOT NULL )
  6  /

COUNT(OBJECT_NAME)
------------------
             22251

Elapsed: 00:00:00.67

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 
   1    0   SORT (AGGREGATE)
   2    1     FILTER
   3    2       INDEX (FAST FULL SCAN) OF 'T_IDX' (NON-UNIQUE) 
   4    2       INDEX (RANGE SCAN) OF 'T2_IDX' (NON-UNIQUE) 

There is the "anti join" method with may or may not run a little better in your 
environment.  Its worth trying though

ops$tkyte@DEV8I.WORLD> select count(t.object_name)
  2    from t, t2
  3   where t.object_id = t2.object_id(+)
  4     and t2.object_id is NULL
  5  /

COUNT(T.OBJECT_NAME)
--------------------
               22251

Elapsed: 00:00:00.75

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   SORT (AGGREGATE)
   2    1     FILTER
   3    2       NESTED LOOPS (OUTER)
   4    3         TABLE ACCESS (FULL) OF 'T'
   5    3         INDEX (RANGE SCAN) OF 'T2_IDX' (NON-UNIQUE)

And then the outer join method -- in this PARTICULAR example it fell between the not 
exists and the NOT IN.  Your results may vary ;)

ops$tkyte@DEV8I.WORLD> set autotrace off