EXISTS/NOT EXISTS一定比IN/NOT IN的效率高吗?
经常别人说EXISTS比IN快!NOT EXISTS比NOT IN快!然而事实真的如此么?
??? 我们先讨论IN和EXISTS。
??? select * from t1 where x in ( select y from t2 )
??? 事实上可以理解为:
??? select *
????? from t1, ( select distinct y from t2 ) t2
???? where t1.x = t2.y;
??? ——如果你有一定的SQL优化经验,从这句很自然的可以想到t2绝对不能是个大表,因为需要对t2进行全表的“唯一排序”,如果t2很大这个排序的性能是不可忍受的。但是t1可以很大,为什么呢?最通俗的理解就是因为t1.x=t2.y可以走索引。但这并不是一个很好的解释。试想,如果t1.x和t2.y都有索引,我们知道索引是种有序的结构,因此t1和t2之间最佳的方案是走merge join。另外,如果t2.y上有索引,对t2的排序性能也有很大提高。
??? select * from t1 where exists ( select null from t2 where y = x )
??? 可以理解为:
??? for x in ( select * from t1 )
??? loop
?????? if ( exists ( select null from t2 where y = x.x )
?????? then
????????? OUTPUT THE RECORD!
?????? end if
??? end loop
??? ——这个更容易理解,t1永远是个表扫描!因此t1绝对不能是个大表,而t2可以很大,因为y=x.x可以走t2.y的索引。
??? 综合以上对IN/EXISTS的讨论,我们可以得出一个基本通用的结论:IN适合于外表大而内表小的情况;EXISTS适合于外表小而内表大的情况。
??? 如果你对上述说法表示怀疑,请看以下测试:
********************************************************************************
SQL> create table big as select * from all_objects;
表已创建。
SQL> insert /*+ append */ into big select * from big;
已创建26872行。
SQL> commit;
提交完成。
SQL> insert /*+ append */ into big select * from big;
已创建53744行。
SQL> commit;
提交完成。
SQL> insert /*+ append */ into big select * from big;
已创建107488行。
SQL> commit;
提交完成。
SQL> create index big_idx on big(object_id);
索引已创建。
SQL> create table small as select * from all_objects where rownum < 100;
表已创建。
SQL> create index small_idx on small(object_id);
索引已创建。
********************************************************************************
运行SQL并设置EVENT=10046,用TKPROF格式化TRACE文件,结果如下。
大表在外,小表在内的测试:
********************************************************************************
select count(subobject_name)
from big
where object_id in ( select object_id from small )
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.00?????? 0.01????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 0.00?????? 0.14???????? 29??????? 900????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 0.00?????? 0.15???????? 29??????? 900????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)??? 792??? TABLE ACCESS (BY INDEX ROWID) OF 'BIG'??? 892???? NESTED LOOPS???? 99????? VIEW OF 'VW_NSO_1'???? 99?????? SORT (UNIQUE)???? 99??????? TABLE ACCESS (FULL) OF 'SMALL'??? 792????? INDEX (RANGE SCAN) OF 'BIG_IDX' (NON-UNIQUE)
select count(subobject_name)
from big
where exists ( select null from small where small.object_id = big.object_id )
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 1.90?????? 2.72?????? 2917???? 216125????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 1.90?????? 2.72?????? 2917???? 216125????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)??? 792??? FILTER 214976???? TABLE ACCESS (FULL) OF 'BIG'??? 225???? INDEX (RANGE SCAN) OF 'SMALL_IDX' (NON-UNIQUE)********************************************************************************用IN的性能数据:cpu=0.00??? elapsed=0.15??? query=900??? current=0??? disk=29用EXISTS的性能数据:cpu=1.90??? elapsed=2.72??? query=216125??? current=0??? disk=2917——在CPU的消耗和LIO、PIO上的对比十分明显,IN的效率高得多!
大表在内,小表在外的测试:
********************************************************************************
select count(subobject_name)
from small
where object_id in ( select object_id from big )
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 0.41?????? 1.71?????? 2917?????? 2982????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 0.41?????? 1.72?????? 2917?????? 2982????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)???? 99??? TABLE ACCESS (BY INDEX ROWID) OF 'SMALL' 26972???? NESTED LOOPS 26872????? VIEW OF 'VW_NSO_1' 26872?????? SORT (UNIQUE) 214976??????? TABLE ACCESS (FULL) OF 'BIG'???? 99????? INDEX (RANGE SCAN) OF 'SMALL_IDX' (NON-UNIQUE)
select count(subobject_name)
from small
where exists ( select null from big where small.object_id = big.object_id )
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 0.00?????? 0.00????????? 0??????? 202????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 0.00?????? 0.00????????? 0??????? 202????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)???? 99??? FILTER???? 99???? TABLE ACCESS (FULL) OF 'SMALL'???? 99???? INDEX (RANGE SCAN) OF 'BIG_IDX' (NON-UNIQUE)********************************************************************************用IN的性能数据:cpu=0.41??? elapsed=1.72??? query=2982??? current=26??? disk=2917用EXISTS的性能数据:cpu=0.00??? elapsed=0.00??? query=202??? current=0??? disk=0——在CPU的消耗和PIO、LIO上的对比十分明显,EXISTS效率高得多!
??? 有些遗憾的是我这个测试是在RBO下进行的,RBO是个死板的只根据优先级来确定执行计划的优化器,RBO不会评估实际的执行计划对系统造成的影响。在RBO中NESTED LOOP的优先级要远远大于MERGE JOIN,只要能走NESTED LOOP RBO就绝不会走MERGE JOIN。如果你用的是CBO,并且对表、索引做过统计分析,上面IN的测试一定会选择走MERGE JOIN。我们用HINTS在RBO下强制走MERGE JOIN对比一下这个SQL分别走MJ和NL的性能:
********************************************************************************
select count(subobject_name)
from small
where object_id in ( select/*+ use_merge(small big) */ object_id from big )
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.01?????? 0.17????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 0.09?????? 0.27??????? 187??????? 473????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 0.10?????? 0.44??????? 187??????? 473????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)???? 99??? MERGE JOIN 26872???? SORT (UNIQUE) 214976????? INDEX (FULL SCAN) OF 'BIG_IDX' (NON-UNIQUE)???? 99???? SORT (JOIN)???? 99????? TABLE ACCESS (FULL) OF 'SMALL'********************************************************************************可以看到:NESTED LOOP:cpu=0.41??? elapsed=1.72??? query=2982??? current=26??? disk=2917MERGE JOIN:cpu=0.10??? elapsed=0.44??? query=437??? current=2??? disk=187——这也证实了我上面的说法。很多人不敢让自己的SQL走merge join,其实对于两个已经具有排序结构的表merge join是最佳选择。
??? 下面我们讨论NOT IN和NOT EXISTS,我把它们放在一起讨论实属被逼无奈,因为很多人喜欢拿它们比较。其实NOT IN/NOT EXISTS与IN/EXISTS不一样,IN/EXISTS是完全可以作为等价替换结构的,而NOT IN/NOT EXISTS则不同,它们并不是等价替换结构!只有当子查询中不可能返回空值时,NOT IN/NOT EXISTS才可以等价替换。
??? 为什么?请看:
********************************************************************************
SQL> conn scott/tiger@tdb;
已连接。
SQL> select count(*)
2?? from emp
3 where mgr is null;
COUNT(*)
----------------
????????? 1
SQL> select count(*)
2?? from emp
3 where empno not in ( select mgr from emp );
COUNT(*)
----------------
????????? 0
SQL> select count(*)
2?? from emp t1
3 where not exists ( select null
4???????????????? from emp t2
5??????????????? where t2.mgr = t1.empno );
COUNT(*)
----------------
????????? 7
********************************************************************************
??? 如果子查询中返回的结果集含有空值NOT IN永远是0,因为NULL代表“未知”,任何值和NULL比较永远是false。
??? 现在我们基于假设——子查询中不返回空值,来比较NOT IN和NOT EXISTS。
??? 在RBO中如果不使用HINTS来改变NOT IN的执行计划,几乎任何时候NOT IN都比NOT EXISTS慢得多,在CBO中如果具有准确的统计信息NOT IN的效率和NOT EXISTS的一样,甚至会比NOT EXISTS快得多。
??? 调整NOT IN性能的基本原则是:如果想让NOT IN跑得快就必须走合适的连接。
??? select * from t1 where x not in ( select y from t2 )
??? 以这个句子为例(y无空值)
??? 这个句子可以等价替换为:
??? a) select * from t1 where not exists ( select null from t2 where t2.y=t1.x)
??? 或
??? b) select t1.* from t1, t2 where t1.x=t2.y(+) and t2.y is null
测试如下:
********************************************************************************
SQL> create table t1 as select * from all_objects where rownum <= 5000;
表已创建。
SQL> create table t2 as select * from all_objects where rownum <= 4950;
表已创建。
SQL> create index t2_idx on t2(object_id);
索引已创建。
********************************************************************************
RBO下的测试:
********************************************************************************
select count(*)
from t1 rbo
where object_id not in ( select object_id from t2 )
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 6.13????? 19.12???? 127487???? 197502????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 6.13????? 19.13???? 127487???? 197502????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)???? 50??? FILTER?? 5000???? TABLE ACCESS (FULL) OF 'T1'?? 4950???? TABLE ACCESS (FULL) OF 'T2'
select count(*)
from t1 rbo
where not exists ( select null from t2 where t2.object_id = rbo.object_id)
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.01?????? 0.00????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 0.01?????? 0.12???????? 83????? 10075????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 0.02?????? 0.12???????? 83????? 10075????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)???? 50??? FILTER?? 5000???? TABLE ACCESS (FULL) OF 'T1'?? 4950???? INDEX (RANGE SCAN) OF 'T2_IDX' (NON-UNIQUE)
select count(*)
from t1, t2 rbo
where t1.object_id = rbo.object_id(+) and rbo.object_id is null
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 0.00?????? 0.05???????? 72?????? 5087????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 0.00?????? 0.06???????? 72?????? 5087????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)???? 50??? FILTER?? 5000???? NESTED LOOPS (OUTER)?? 5000????? TABLE ACCESS (FULL) OF 'T1'?? 4950????? INDEX (RANGE SCAN) OF 'T2_IDX' (NON-UNIQUE)********************************************************************************RBO自己选择的执行计划,性能数据:NOT IN:cpu=6.13?? elapsed=19.13?? query=197502?? current=0?? disk=127487NOT EXISTS:cpu=0.02?? elapsed=0.12?? query=10075?? current=0?? disk=83OUTER JOIN:cpu=0.00?? elapsed=0.06?? query=5087??? current=0?? disk=72——NOT EXISTS的效率比NOT IN好很多,但与OUTER JOIN相比NOT EXISTS的效率略低。
RBO,用HINTS改变NOT IN的执行计划:
********************************************************************************
select count(*)
from t1 rbo
where object_id not in ( select/*+ hash_aj(rbo t2) */ object_id from t2 )
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.04?????? 0.45????????? 0????????? 3????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 0.01?????? 0.09???????? 48??????? 191????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 0.05?????? 0.55???????? 48??????? 194????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 0?? SORT (AGGREGATE)????? 0??? HASH JOIN (ANTI)????? 0???? TABLE ACCESS (FULL) OF 'T1'????? 0???? INDEX (FAST FULL SCAN) OF 'T2_IDX' (NON-UNIQUE)********************************************************************************在只有t2.object_id有索引的情况下,hash join-anti性能数据如下:HJ-ANTI:cpu=0.05?? elapsed=0.55?? query=194?? current=0?? disk=48——性能好了很多!
在t1.object_id上建立索引,使用merge join-anti:
********************************************************************************
select count(*)
from t1 rbo
where object_id not in ( select /*+ merge_aj(rbo t2) */ object_id from t2 )
call???? count?????? cpu??? elapsed?????? disk????? query??? current??????? rows------- ------ -------- ---------- ---------- ---------- ---------- ----------Parse??????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Execute????? 1????? 0.00?????? 0.00????????? 0????????? 0????????? 0?????????? 0Fetch??????? 2????? 0.00?????? 0.00????????? 0???????? 28????????? 0?????????? 1------- ------ -------- ---------- ---------- ---------- ---------- ----------total??????? 4????? 0.00?????? 0.00????????? 0???????? 28????????? 0?????????? 1
Rows???? Execution Plan------- ---------------------------------------------------????? 0 SELECT STATEMENT?? GOAL: CHOOSE????? 1?? SORT (AGGREGATE)???? 50??? MERGE JOIN (ANTI)?? 5000???? INDEX (FULL SCAN) OF 'T1_IDX' (NON-UNIQUE)?? 4950???? SORT (UNIQUE)?? 4950????? INDEX (FAST FULL SCAN) OF 'T2_IDX' (NON-UNIQUE)********************************************************************************在t1.object_id上建立索引,merge join-anti的性能数据如下:MJ-ANTI:cpu=0.00?? elapsed=0.00?? query=28?? current=0?? disk=0——这个NOT IN语句在t1.object_id、t2.object_id都有索引的情况下,merge join-anti的效率高于上面的任何SQL。
??? 综上,只要NOT IN走合适的连接,其效率很高甚至高于NOT EXISTS和OUTER JOIN。