NOT IN、JOIN、IS NULL、NOT EXISTS效率对比
生活随笔
收集整理的這篇文章主要介紹了
NOT IN、JOIN、IS NULL、NOT EXISTS效率对比
小編覺得挺不錯的,現在分享給大家,幫大家做個參考.
語句一:select count(*) from A where A.a not in (select a from B)?
語句二:select count(*) from A left join B on A.a = B.a where B.a is null?
語句三:select count(*) from A where not exists (select a from B where A.a = B.a)?
知道以上三條語句的實際效果是相同的已經很久了,但是一直沒有深究其間的效率對比。一直感覺上語句二是最快的。?
今天工作上因為要對一個數千萬行數據的庫進行數據清除,需要刪掉兩千多萬行數據。大量的用到了以上三條語句所要實現的功能。本來用的是語句一,但是結果是執行速度1個小時32分,日志文件占用21GB。時間上雖然可以接受,但是對硬盤空間的占用確是個問題。因此將所有的語句一都換成語句二。本以為會更快。沒想到執行40多分鐘后,第一批50000行都沒有刪掉,反而讓SQL SERVER崩潰掉了,結果令人詫異。試了試單獨執行這條語句,查詢近一千萬行的表,語句一用了4秒,語句二卻用了18秒,差距很大。語句三的效率與語句一接近。?
第二種寫法是大忌,應該盡量避免。第一種和第三種寫法本質上幾乎一樣。?
假設buffer pool足夠大,寫法二相對于寫法一來說存在以下幾點不足:?
(1)left join本身更耗資源(需要更多資源來處理產生的中間結果集)?
(2)left join的中間結果集的規模不會比表A小?
(3)寫法二還需要對left join產生的中間結果做is null的條件篩選,而寫法一則在兩個集合join的同時完成了篩選,這部分開銷是額外的?
這三點綜合起來,在處理海量數據時就會產生比較明顯的區別(主要是內存和CPU上的開銷)。我懷疑樓主在測試時buffer pool可能已經處于飽和狀態,這樣的話,寫法二的那些額外開銷不得不借助磁盤上的虛擬內存,在SQL Server做換頁時,由于涉及到較慢的I/O操作因此這種差距會更加明顯。?
關于日志文件過大,這也是正常的,因為刪除的記錄多嘛。可以根據數據庫的用途考慮將恢復模型設為simple,或者在刪除結束后將日志truncate掉并把文件shrink下來。?
因為以前曾經作過一個對這個庫進行無條件刪除的腳本,就是要刪除數據量較大的表中的所有數據,但是因為客戶要求,不能使用truncate table,怕破壞已有的庫結構。所以只能用delete刪,當時也遇到了日志文件過大的問題,當時采用的方法是分批刪除,在SQL2K中用set rowcount @chunk,在SQL2K5中用delete top @chunk。這樣的操作不僅使刪除時間大大減少,而且讓日志量大大減少,只增長了1G左右。?
但是這次清除數據的工作需要加上條件,就是delete A from A where ....后面有條件的。再次使用分批刪除的方法,卻已經沒效果了。?
不知您知不知道這是為什么。?
mysql not in 和 left join 效率問題記錄?
首先說明該條sql的功能是查詢集合a不在集合b的數據。?
not in的寫法?
復制代碼代碼如下:
select add_tb.RUID?
from (select distinct RUID?
from UserMsg?
where SubjectID =12?
and CreateTime>'2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00'?
) add_tb?
where add_tb.RUID?
not in (select distinct RUID?
from UserMsg?
where SubjectID =12?
and CreateTime<'2009-8-14 15:30:00'?
)?
返回444行記錄用時 0.07sec?
explain 結果?
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--?
----------------------------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows |?
Extra |?
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--?
----------------------------+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 452 |?
Using where |?
| 3 | DEPENDENT SUBQUERY | UserMsg | index_subquery | RUID,SubjectID,CreateTime | RUID | 96 | func | 2 |?
Using index; Using where |?
| 2 | DERIVED | UserMsg | range | SubjectID,CreateTime | CreateTime | 9 | NULL | 1857 |?
Using where; Using temporary |?
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--?
----------------------------+?
分析:該條查詢速度快原因為id=2的sql查詢出來的結果比較少,所以id=1sql所以運行速度比較快,id=2的使用了臨時表,不知道這個時候是否使用索引??
其中一種left join?
復制代碼代碼如下:
select a.ruid,b.ruid?
from(select distinct RUID?
from UserMsg?
where SubjectID =12?
and CreateTime >= '2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00'?
) a left join (?
select distinct RUID?
from UserMsg?
where SubjectID =12 and CreateTime< '2009-8-14 15:30:00'?
) b on a.ruid = b.ruid?
where b.ruid is null?
返回444行記錄用時 0.39sec?
explain 結果?
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------?
-------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------?
-------+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 452 |?
|?
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 1112 | Using where; Not exists?
|?
| 3 | DERIVED | UserMsg | ref | SubjectID,CreateTime | SubjectID | 5 | | 6667 | Using where; Using?
temporary |?
| 2 | DERIVED | UserMsg | range | SubjectID,CreateTime | CreateTime | 9 | NULL | 1838 | Using where; Using?
temporary |?
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------?
-------+?
分析:使用了兩個臨時表,并且兩個臨時表做了笛卡爾積,導致不能使用索引并且數據量很大?
另外一種left join?
復制代碼代碼如下:
select distinct a.RUID?
from UserMsg a?
left join UserMsg b?
on a.ruid = b.ruid?
and b.subjectID =12 and b.createTime < '2009-8-14 15:30:00'?
where a.subjectID =12?
and a.createTime >= '2009-8-14 15:30:00'?
and a.createtime <='2009-8-17 16:00:00'?
and b.ruid is null;?
返回444行記錄用時 0.07sec?
explain 結果?
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------?
--------------------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------?
--------------------+?
| 1 | SIMPLE | a | range | SubjectID,CreateTime | CreateTime | 9 | NULL | 1839 | Using where;?
Using temporary |?
| 1 | SIMPLE | b | ref | RUID,SubjectID,CreateTime | RUID | 96 | dream.a.RUID | 2 | Using where;?
Not exists; Distinct |?
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------?
--------------------+?
分析:兩次查詢都是用上了索引,并且查詢時同時進行的,所以查詢效率應該很高?
使用not exists的sql?
復制代碼代碼如下:
select distinct a.ruid?
from UserMsg a?
where a.subjectID =12?
and a.createTime >= '2009-8-14 15:30:00'?
and a.createTime <='2009-8-17 16:00:00'?
and not exists (?
select distinct RUID?
from UserMsg?
where subjectID =12 and createTime < '2009-8-14 15:30:00'?
and ruid=a.ruid?
)?
返回444行記錄用時 0.08sec?
explain 結果?
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------?
------------------------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------?
------------------------+?
| 1 | PRIMARY | a | range | SubjectID,CreateTime | CreateTime | 9 | NULL | 1839 | Using?
where; Using temporary |?
| 2 | DEPENDENT SUBQUERY | UserMsg | ref | RUID,SubjectID,CreateTime | RUID | 96 | dream.a.RUID | 2 | Using?
where |?
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------?
------------------------+?
分析:同上基本上是一樣的,只是分解了2個查詢順序執行,查詢效率低于第3個?
為了驗證數據查詢效率,將上述查詢中的subjectID =12的限制條件去掉,結果統計查詢時間如下?
0.20s?
21.31s?
0.25s?
0.43s?
laserhe幫忙分析問題總結?
復制代碼代碼如下:
select a.ruid,b.ruid?
from( select distinct RUID?
from UserMsg?
where CreateTime >= '2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00'?
) a left join UserMsg b?
on a.ruid = b.ruid?
and b.createTime < '2009-8-14 15:30:00'?
where b.ruid is null;?
執行時間0.13s?
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------?
----+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------?
----+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1248 |?
|?
| 1 | PRIMARY | b | ref | RUID,CreateTime | RUID | 96 | a.RUID | 2 | Using where; Not exists?
|?
| 2 | DERIVED | UserMsg | range | CreateTime | CreateTime | 9 | NULL | 3553 | Using where; Using?
temporary |?
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------?
----+?
執行效率類似與not in的效率?
數據庫優化的基本原則:讓笛卡爾積發生在盡可能小的集合之間,mysql在join的時候可以直接通過索引來掃描,而嵌入到子查詢里頭,查詢規?
劃器就不曉得用合適的索引了。?
一個SQL在數據庫里是這么優化的:首先SQL會分析成一堆分析樹,一個樹狀數據結構,然后在這個數據結構里,查詢規劃器會查找有沒有合適?
的索引,然后根據具體情況做一個排列組合,然后計算這個排列組合中的每一種的開銷(類似explain的輸出的計算機可讀版本),然后比較里?
面開銷最小的,選取并執行之。那么:?
explain select a.ruid,b.ruid from(select distinct RUID from UserMsg where CreateTime >= '2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00' ) a left join UserMsg b on a.ruid = b.ruid and b.createTime < '2009-8-14 15:30:00'?
where b.ruid is null;?
和?
explain select add_tb.RUID?
-> from (select distinct RUID?
-> from UserMsg?
-> where CreateTime>'2009-8-14 15:30:00'?
-> and CreateTime<='2009-8-17 16:00:00'?
-> ) add_tb?
-> where add_tb.RUID?
-> not in (select distinct RUID?
-> from UserMsg?
-> where CreateTime<'2009-8-14 15:30:00'?
-> );?
explain?
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------?
------------------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------?
------------------+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1248 | Using where?
|?
| 3 | DEPENDENT SUBQUERY | UserMsg | index_subquery | RUID,CreateTime | RUID | 96 | func | 2 | Using index;?
Using where |?
| 2 | DERIVED | UserMsg | range | CreateTime | CreateTime | 9 | NULL | 3509 | Using where;?
Using temporary |?
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------?
------------------+?
開銷是完全一樣的,開銷可以從 rows 那個字段得出(基本上是rows那個字段各個行的數值的乘積,也就是笛卡爾積)?
但是呢:下面這個:?
explain select a.ruid,b.ruid from(select distinct RUID from UserMsg where CreateTime >= '2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00' ) a left join ( select distinct RUID from UserMsg where createTime < '2009-8-14?
15:30:00' ) b on a.ruid = b.ruid where b.ruid is null;?
執行時間21.31s?
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------?
-+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------?
-+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1248 |?
|?
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 30308 | Using where; Not exists?
|?
| 3 | DERIVED | UserMsg | ALL | CreateTime | NULL | NULL | NULL | 69366 | Using where; Using temporary?
|?
| 2 | DERIVED | UserMsg | range | CreateTime | CreateTime | 9 | NULL | 3510 | Using where; Using temporary?
|?
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------?
-+?
我就有些不明白?
為何是四行?
并且中間兩行巨大無比?
按理說?
查詢規劃器應該能把這個查詢優化得跟前面的兩個一樣的?
(至少在我熟悉的pgsql數據庫里我有信心是一樣的)?
但mysql里頭不是?
所以我感覺查詢規劃器里頭可能還是糙了點?
我前面說過優化的基本原則就是,讓笛卡爾積發生在盡可能小的集合之間?
那么上面最后一種寫法至少沒有違反這個原則?
雖然b 表因為符合條件的非常多,基本上不會用索引?
但是并不應該妨礙查詢優化器看到外面的join on條件,從而和前面兩個SQL一樣,選取主鍵進行join?
不過我前面說過查詢規劃器的作用?
理論上來講?
遍歷一遍所有可能,計算一下開銷?
是合理的?
我感覺這里最后一種寫法沒有遍歷完整所有可能?
可能的原因是子查詢的實現還是比較簡單??
子查詢對數據庫的確是個挑戰?
因為基本都是遞歸的東西?
所以在這個環節有點毛病并不奇怪?
其實你仔細想想,最后一種寫法無非是我們第一種寫法的一個變種,關鍵在表b的where 條件放在哪里?
放在里面,就不會用索引去join?
放在外面就會?
這個本身就是排列組合的一個可能
語句二:select count(*) from A left join B on A.a = B.a where B.a is null?
語句三:select count(*) from A where not exists (select a from B where A.a = B.a)?
知道以上三條語句的實際效果是相同的已經很久了,但是一直沒有深究其間的效率對比。一直感覺上語句二是最快的。?
今天工作上因為要對一個數千萬行數據的庫進行數據清除,需要刪掉兩千多萬行數據。大量的用到了以上三條語句所要實現的功能。本來用的是語句一,但是結果是執行速度1個小時32分,日志文件占用21GB。時間上雖然可以接受,但是對硬盤空間的占用確是個問題。因此將所有的語句一都換成語句二。本以為會更快。沒想到執行40多分鐘后,第一批50000行都沒有刪掉,反而讓SQL SERVER崩潰掉了,結果令人詫異。試了試單獨執行這條語句,查詢近一千萬行的表,語句一用了4秒,語句二卻用了18秒,差距很大。語句三的效率與語句一接近。?
第二種寫法是大忌,應該盡量避免。第一種和第三種寫法本質上幾乎一樣。?
假設buffer pool足夠大,寫法二相對于寫法一來說存在以下幾點不足:?
(1)left join本身更耗資源(需要更多資源來處理產生的中間結果集)?
(2)left join的中間結果集的規模不會比表A小?
(3)寫法二還需要對left join產生的中間結果做is null的條件篩選,而寫法一則在兩個集合join的同時完成了篩選,這部分開銷是額外的?
這三點綜合起來,在處理海量數據時就會產生比較明顯的區別(主要是內存和CPU上的開銷)。我懷疑樓主在測試時buffer pool可能已經處于飽和狀態,這樣的話,寫法二的那些額外開銷不得不借助磁盤上的虛擬內存,在SQL Server做換頁時,由于涉及到較慢的I/O操作因此這種差距會更加明顯。?
關于日志文件過大,這也是正常的,因為刪除的記錄多嘛。可以根據數據庫的用途考慮將恢復模型設為simple,或者在刪除結束后將日志truncate掉并把文件shrink下來。?
因為以前曾經作過一個對這個庫進行無條件刪除的腳本,就是要刪除數據量較大的表中的所有數據,但是因為客戶要求,不能使用truncate table,怕破壞已有的庫結構。所以只能用delete刪,當時也遇到了日志文件過大的問題,當時采用的方法是分批刪除,在SQL2K中用set rowcount @chunk,在SQL2K5中用delete top @chunk。這樣的操作不僅使刪除時間大大減少,而且讓日志量大大減少,只增長了1G左右。?
但是這次清除數據的工作需要加上條件,就是delete A from A where ....后面有條件的。再次使用分批刪除的方法,卻已經沒效果了。?
不知您知不知道這是為什么。?
mysql not in 和 left join 效率問題記錄?
首先說明該條sql的功能是查詢集合a不在集合b的數據。?
not in的寫法?
復制代碼代碼如下:
select add_tb.RUID?
from (select distinct RUID?
from UserMsg?
where SubjectID =12?
and CreateTime>'2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00'?
) add_tb?
where add_tb.RUID?
not in (select distinct RUID?
from UserMsg?
where SubjectID =12?
and CreateTime<'2009-8-14 15:30:00'?
)?
返回444行記錄用時 0.07sec?
explain 結果?
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--?
----------------------------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows |?
Extra |?
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--?
----------------------------+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 452 |?
Using where |?
| 3 | DEPENDENT SUBQUERY | UserMsg | index_subquery | RUID,SubjectID,CreateTime | RUID | 96 | func | 2 |?
Using index; Using where |?
| 2 | DERIVED | UserMsg | range | SubjectID,CreateTime | CreateTime | 9 | NULL | 1857 |?
Using where; Using temporary |?
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--?
----------------------------+?
分析:該條查詢速度快原因為id=2的sql查詢出來的結果比較少,所以id=1sql所以運行速度比較快,id=2的使用了臨時表,不知道這個時候是否使用索引??
其中一種left join?
復制代碼代碼如下:
select a.ruid,b.ruid?
from(select distinct RUID?
from UserMsg?
where SubjectID =12?
and CreateTime >= '2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00'?
) a left join (?
select distinct RUID?
from UserMsg?
where SubjectID =12 and CreateTime< '2009-8-14 15:30:00'?
) b on a.ruid = b.ruid?
where b.ruid is null?
返回444行記錄用時 0.39sec?
explain 結果?
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------?
-------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------?
-------+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 452 |?
|?
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 1112 | Using where; Not exists?
|?
| 3 | DERIVED | UserMsg | ref | SubjectID,CreateTime | SubjectID | 5 | | 6667 | Using where; Using?
temporary |?
| 2 | DERIVED | UserMsg | range | SubjectID,CreateTime | CreateTime | 9 | NULL | 1838 | Using where; Using?
temporary |?
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------?
-------+?
分析:使用了兩個臨時表,并且兩個臨時表做了笛卡爾積,導致不能使用索引并且數據量很大?
另外一種left join?
復制代碼代碼如下:
select distinct a.RUID?
from UserMsg a?
left join UserMsg b?
on a.ruid = b.ruid?
and b.subjectID =12 and b.createTime < '2009-8-14 15:30:00'?
where a.subjectID =12?
and a.createTime >= '2009-8-14 15:30:00'?
and a.createtime <='2009-8-17 16:00:00'?
and b.ruid is null;?
返回444行記錄用時 0.07sec?
explain 結果?
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------?
--------------------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------?
--------------------+?
| 1 | SIMPLE | a | range | SubjectID,CreateTime | CreateTime | 9 | NULL | 1839 | Using where;?
Using temporary |?
| 1 | SIMPLE | b | ref | RUID,SubjectID,CreateTime | RUID | 96 | dream.a.RUID | 2 | Using where;?
Not exists; Distinct |?
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------?
--------------------+?
分析:兩次查詢都是用上了索引,并且查詢時同時進行的,所以查詢效率應該很高?
使用not exists的sql?
復制代碼代碼如下:
select distinct a.ruid?
from UserMsg a?
where a.subjectID =12?
and a.createTime >= '2009-8-14 15:30:00'?
and a.createTime <='2009-8-17 16:00:00'?
and not exists (?
select distinct RUID?
from UserMsg?
where subjectID =12 and createTime < '2009-8-14 15:30:00'?
and ruid=a.ruid?
)?
返回444行記錄用時 0.08sec?
explain 結果?
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------?
------------------------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------?
------------------------+?
| 1 | PRIMARY | a | range | SubjectID,CreateTime | CreateTime | 9 | NULL | 1839 | Using?
where; Using temporary |?
| 2 | DEPENDENT SUBQUERY | UserMsg | ref | RUID,SubjectID,CreateTime | RUID | 96 | dream.a.RUID | 2 | Using?
where |?
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------?
------------------------+?
分析:同上基本上是一樣的,只是分解了2個查詢順序執行,查詢效率低于第3個?
為了驗證數據查詢效率,將上述查詢中的subjectID =12的限制條件去掉,結果統計查詢時間如下?
0.20s?
21.31s?
0.25s?
0.43s?
laserhe幫忙分析問題總結?
復制代碼代碼如下:
select a.ruid,b.ruid?
from( select distinct RUID?
from UserMsg?
where CreateTime >= '2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00'?
) a left join UserMsg b?
on a.ruid = b.ruid?
and b.createTime < '2009-8-14 15:30:00'?
where b.ruid is null;?
執行時間0.13s?
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------?
----+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------?
----+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1248 |?
|?
| 1 | PRIMARY | b | ref | RUID,CreateTime | RUID | 96 | a.RUID | 2 | Using where; Not exists?
|?
| 2 | DERIVED | UserMsg | range | CreateTime | CreateTime | 9 | NULL | 3553 | Using where; Using?
temporary |?
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------?
----+?
執行效率類似與not in的效率?
數據庫優化的基本原則:讓笛卡爾積發生在盡可能小的集合之間,mysql在join的時候可以直接通過索引來掃描,而嵌入到子查詢里頭,查詢規?
劃器就不曉得用合適的索引了。?
一個SQL在數據庫里是這么優化的:首先SQL會分析成一堆分析樹,一個樹狀數據結構,然后在這個數據結構里,查詢規劃器會查找有沒有合適?
的索引,然后根據具體情況做一個排列組合,然后計算這個排列組合中的每一種的開銷(類似explain的輸出的計算機可讀版本),然后比較里?
面開銷最小的,選取并執行之。那么:?
explain select a.ruid,b.ruid from(select distinct RUID from UserMsg where CreateTime >= '2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00' ) a left join UserMsg b on a.ruid = b.ruid and b.createTime < '2009-8-14 15:30:00'?
where b.ruid is null;?
和?
explain select add_tb.RUID?
-> from (select distinct RUID?
-> from UserMsg?
-> where CreateTime>'2009-8-14 15:30:00'?
-> and CreateTime<='2009-8-17 16:00:00'?
-> ) add_tb?
-> where add_tb.RUID?
-> not in (select distinct RUID?
-> from UserMsg?
-> where CreateTime<'2009-8-14 15:30:00'?
-> );?
explain?
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------?
------------------+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------?
------------------+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1248 | Using where?
|?
| 3 | DEPENDENT SUBQUERY | UserMsg | index_subquery | RUID,CreateTime | RUID | 96 | func | 2 | Using index;?
Using where |?
| 2 | DERIVED | UserMsg | range | CreateTime | CreateTime | 9 | NULL | 3509 | Using where;?
Using temporary |?
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------?
------------------+?
開銷是完全一樣的,開銷可以從 rows 那個字段得出(基本上是rows那個字段各個行的數值的乘積,也就是笛卡爾積)?
但是呢:下面這個:?
explain select a.ruid,b.ruid from(select distinct RUID from UserMsg where CreateTime >= '2009-8-14 15:30:00'?
and CreateTime<='2009-8-17 16:00:00' ) a left join ( select distinct RUID from UserMsg where createTime < '2009-8-14?
15:30:00' ) b on a.ruid = b.ruid where b.ruid is null;?
執行時間21.31s?
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------?
-+?
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra?
|?
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------?
-+?
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1248 |?
|?
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 30308 | Using where; Not exists?
|?
| 3 | DERIVED | UserMsg | ALL | CreateTime | NULL | NULL | NULL | 69366 | Using where; Using temporary?
|?
| 2 | DERIVED | UserMsg | range | CreateTime | CreateTime | 9 | NULL | 3510 | Using where; Using temporary?
|?
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------?
-+?
我就有些不明白?
為何是四行?
并且中間兩行巨大無比?
按理說?
查詢規劃器應該能把這個查詢優化得跟前面的兩個一樣的?
(至少在我熟悉的pgsql數據庫里我有信心是一樣的)?
但mysql里頭不是?
所以我感覺查詢規劃器里頭可能還是糙了點?
我前面說過優化的基本原則就是,讓笛卡爾積發生在盡可能小的集合之間?
那么上面最后一種寫法至少沒有違反這個原則?
雖然b 表因為符合條件的非常多,基本上不會用索引?
但是并不應該妨礙查詢優化器看到外面的join on條件,從而和前面兩個SQL一樣,選取主鍵進行join?
不過我前面說過查詢規劃器的作用?
理論上來講?
遍歷一遍所有可能,計算一下開銷?
是合理的?
我感覺這里最后一種寫法沒有遍歷完整所有可能?
可能的原因是子查詢的實現還是比較簡單??
子查詢對數據庫的確是個挑戰?
因為基本都是遞歸的東西?
所以在這個環節有點毛病并不奇怪?
其實你仔細想想,最后一種寫法無非是我們第一種寫法的一個變種,關鍵在表b的where 條件放在哪里?
放在里面,就不會用索引去join?
放在外面就會?
這個本身就是排列組合的一個可能
總結
以上是生活随笔為你收集整理的NOT IN、JOIN、IS NULL、NOT EXISTS效率对比的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 浅谈 MySQL 的存储引擎(表类型)
- 下一篇: 安装12G内存读出内存条为3.45G的处