悬崖边上的舞者,记7.2生产数据库灾难事件
懸崖邊上的舞者,記7.2生產(chǎn)數(shù)據(jù)庫災(zāi)難事件
作者:張子良
版權(quán)所有,轉(zhuǎn)載請注明出處
引子:出事了
7月2日是一個難得的大晴天,一段時間以來桂林一直在下雨,一直下,害的我減肥的計劃一再的泡湯,因為下雨每天早上的跑步變成了觀雨。和往常一樣,準(zhǔn)時來到單位,卻不準(zhǔn)時的得到一個消息:昨晚前置生產(chǎn)數(shù)據(jù)庫出問題了,連接不上。本能的反應(yīng),日志滿了,或者服務(wù)器交換空間不足。急忙趕到運維間,經(jīng)詢問夜間值班人員,發(fā)現(xiàn)后續(xù)的情況是:logsegment日志空間確實是滿了,發(fā)現(xiàn)問題后,運維人員采用dump日志的方式進行處理,卻發(fā)現(xiàn)dump根本不被執(zhí)行,無奈最后只得采用增加log設(shè)備空間的方式,進行了緊急處理。現(xiàn)在的問題是到底是什么原因引起的log空間的陡然間增加呢?
一、日志分析:發(fā)現(xiàn)了問題,卻無法定位?
取了數(shù)據(jù)庫運行日志,展開分析,發(fā)現(xiàn)了以下問題:
00:06:00000:00912:2013/07/02 03:31:53.98 server? Space available in the log segment has fallen critically low in database 'XXXDB'.? All future modifications to this database will be suspended until the log is successfully dumped and space becomes available.
00:01:00000:00046:2013/07/02 03:31:54.08 server? Error: 1105, Severity: 17, State: 4
00:01:00000:00046:2013/07/02 03:31:54.09 server? Can't allocate space for object 'syslogs' in database 'XXXDB' because 'logsegment' segment is full/has no free extents. If you ran out of space in syslogs, dump the transaction log. Otherwise, use ALTER DATABASE to increase the size of the segment.
00:01:00000:00046:2013/07/02 03:31:54.09 server? DUMP TRANSACTION for database 'XXXDB' could not truncate the log. Either extend the log using ALTER DATABASE ... LOG ON command or eliminate the oldest active transaction in database 'frtdb' shown in syslogshold table.
00:01:00000:00046:2013/07/02 03:31:54.09 server? DUMP TRANSACTION for database 'XXXDB' could not truncate the log. Either extend the log using ALTER DATABASE ... LOG ON command or eliminate the oldest active transaction in database 'XXXDB' shown in syslogshold table.
00:01:00000:00046:2013/07/02 03:31:54.09 server? background task message: LOG DUMP:'logsegment' for 'XXXDB' dumped!
問題很明顯:2013/07/02 03:31:53就已經(jīng)出現(xiàn)了空間不足的情況,2013/07/02 03:31:54.08出現(xiàn)了log空間已滿的情況。2013/07/02 03:31:54.09執(zhí)行了Dump操作,但是失敗了,無奈那么為什么會失敗呢?又是什么操作導(dǎo)致了日志空間的劇烈增長呢?繼續(xù)跟蹤日志。
二、發(fā)現(xiàn)死鎖:死鎖會導(dǎo)致日志空間滿嗎
繼續(xù)向上追溯發(fā)現(xiàn)在此之前曾經(jīng)發(fā)現(xiàn)出現(xiàn)過死鎖的情況,具體日志如下:
00:02:00000:00796:2013/07/01 07:13:24.33 server? Deadlock Id 1 detected
Deadlock Id 1 detected. 1 deadlock chain(s) involved.
Deadlock Id 1: Process (Familyid 0, Spid 1117, Suid 4) was executing a UPDATE command in the procedure 'insprep0' at line 1 at nested level 1.
Deadlock Id 1: Process 1117 was involved in application ''.
Deadlock Id 1: Process 1117 was involved on host name ''.
Deadlock Id 1: Process 1117 was involved in transaction '$user_transaction'.
SQL Text: ?
Executing procedure: *00111700009969_3e59fe
Subordinate SQL Text:XXXXX
Deadlock Id 1: Process (Familyid 0, Spid 796, Suid 4) was executing a FETCH CURSOR command at line -1.
Deadlock Id 1: Process 796 was involved in application ''.
Deadlock Id 1: Process 796 was involved on host name ''.
SQL Text:? XXXXXX
Deadlock Id 1: Process (Familyid 0, Spid 796) was waiting for a 'shared page' lock on page 60607 of table 't_merch_det' in database 'frtdb' but process (Familyid 0, Spid 1117) already held a 'exclusive page' lock on it.
Deadlock Id 1: Process (Familyid 0, Spid 1117) was waiting for a 'exclusive page' lock on page 5866141 of table 't_merch_det' in database 'frtdb' but process (Familyid 0, Spid 796) already held a 'shared page' lock on it.
Deadlock Id 1: Process (Familyid 0, Spid 796) was chosen as the victim.
Victim process host = `', user = `sysadmin' program name = `' host processes = `13697416' .
End of deadlock information.
00:00:00000:01262:2013/07/01 14:49:42.27 server? Deadlock Id 2 detected
Deadlock Id 2 detected. 1 deadlock chain(s) involved.
Deadlock Id 2: Process (Familyid 0, Spid 813, Suid 4) was executing a FETCH CURSOR command at line -1.
Deadlock Id 2: Process 813 was involved in application ''.
Deadlock Id 2: Process 813 was involved on host name ''.
Deadlock Id 2: Process 813 was involved in transaction '$user_transaction'.
SQL Text: XXXXXX
Deadlock Id 2: Process (Familyid 0, Spid 1262, Suid 4) was executing a UPDATE command in the procedure 'insprep0' at line 1 at nested level 1.
Deadlock Id 2: Process 1262 was involved in application ''.
Deadlock Id 2: Process 1262 was involved on host name ''.
Deadlock Id 2: Process 1262 was involved in transaction '$user_transaction'.
SQL Text: ?
Executing procedure: *00126200040493_8f416a
Subordinate SQL Text:? XXXXXXX
Deadlock Id 2: Process (Familyid 0, Spid 1262) was waiting for a 'exclusive row' lock on row 21 page 10073 of the 'XXXXXX' table in database 'frtdb' but process (Familyid 0, Spid 813) already held a 'exclusive row' lock on it.
Deadlock Id 2: Process (Familyid 0, Spid 813) was waiting for a 'shared row' lock on row 11 page 18705629 of the 'XXXXXX' table in database 'frtdb' but process (Familyid 0, Spid 1262) already held a 'exclusive row' lock on it.
Deadlock Id 2: Process (Familyid 0, Spid 1262) was chosen as the victim.
Victim process host = `', user = `sysadmin' program name = `' host processes = `13500790' .
End of deadlock information.
根據(jù)分析發(fā)現(xiàn),死鎖的時間發(fā)生在2013/07/01 14:49:42.27,而發(fā)生死鎖的進程也被終止,如果是由此原因?qū)е碌膯栴}的話,爆發(fā)時間對不上。根據(jù)日志清理的定時計劃,再次階段出現(xiàn)的問題,日志應(yīng)該已經(jīng)被清理,所以可以排除掉這種情況。
三、數(shù)據(jù)庫變更:重啟過,都變更了什么?
繼續(xù)向上追溯,發(fā)現(xiàn)6.30日數(shù)據(jù)庫曾經(jīng)被重啟過,發(fā)現(xiàn)數(shù)據(jù)庫參數(shù)也曾經(jīng)發(fā)生過變化:
00:00:00000:00013:2013/06/30 01:10:56.60 server? Completed filling free space info for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:56.61 server? Started cleaning up the default data cache for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:57.04 server? Completed cleaning up the default data cache for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:57.04 server? Checking external objects.
00:00:00000:00013:2013/06/30 01:10:57.46 server? The transaction log in the database 'XXXDB' will use I/O size of 8 Kb.
00:00:00000:00013:2013/06/30 01:10:57.46 server? Database 'XXXDB' is now online.
00:00:00000:00001:2013/06/30 01:10:57.47 server? Recovery has restored the value of 'local async prefetch limit' for '64K' pool in 'default data cache' from '80' to 'DEFAULT'.
00:00:00000:00001:2013/06/30 01:10:57.47 server? Recovery has restored the value of 'local async prefetch limit' for '8K' pool in 'default data cache' from '80' to 'DEFAULT'.
00:00:00000:00001:2013/06/30 01:10:57.57 server? Recovery has restored the original size for '64K' pool and '8K' pool in 'default data cache'.
00:00:00000:00001:2013/06/30 01:10:57.58 server? Recovery complete.
00:00:00000:00001:2013/06/30 01:10:57.59 server? ASE's default unicode sort order is 'binary'.
00:00:00000:00001:2013/06/30 01:10:57.60 server? ASE's default sort order is:
00:00:00000:00001:2013/06/30 01:10:57.60 server ??? ?'nocase_cp936' (ID = 52)
00:00:00000:00001:2013/06/30 01:10:57.60 server? on top of default character set:
00:00:00000:00001:2013/06/30 01:10:57.60 server ??? ?'cp936' (ID = 171).
那么是否是數(shù)據(jù)庫參數(shù)變更導(dǎo)致的問題,這個地方是我重點懷疑的地方,但是始終沒有找到數(shù)據(jù)支撐,變更這些參數(shù),是否就是導(dǎo)致問題的根源呢?我需要更多的數(shù)據(jù)的支撐。
四、性能監(jiān)控:陡然增加的日志空間占用,究竟發(fā)生了什么?
? 好在部署了數(shù)據(jù)庫性能監(jiān)控監(jiān)控軟件,通過分析性能監(jiān)控圖表,發(fā)現(xiàn)7.2凌晨2點至3點半,數(shù)據(jù)庫log空間出現(xiàn)了激增,空間占用率呈直線60度上升,直至100%。那么可以肯定的是,在2點的時候,一定是有什么大事務(wù)的數(shù)據(jù)庫操作?那么究竟是什么操作呢?
五、柳暗花明:日志,還是日志
既然數(shù)據(jù)庫本身的日志無法滿足分析的需要,那么前置平臺日志,肯定會有相關(guān)的記錄,跟蹤前置日志,終于發(fā)現(xiàn),7.2日凌晨2點的-3點半期間確實有一個大的數(shù)據(jù)庫事務(wù)發(fā)生,那就是從當(dāng)前交易流水表復(fù)制交易記錄到歷史交易記錄表,記錄數(shù)1100w條,單表需要數(shù)據(jù)庫存儲空間5G,執(zhí)行完成插入操作后,需要進行后續(xù)的刪除操作,而恰在此時時出現(xiàn)了日志空間不足的情況。新的疑問就出現(xiàn)了,為何這個問題之前沒有發(fā)生呢?
六、歷史問題:為何以前沒有發(fā)生問題呢
通過跟蹤6月30日以前的日志發(fā)現(xiàn),執(zhí)行當(dāng)前表到歷史表的數(shù)據(jù)拷貝的操作,都是失敗的。直接返回-1,失敗的原因是由于當(dāng)前數(shù)據(jù)庫可用空間不足或者其他原因,正在跟蹤分析中。。。
?
?
總結(jié)
以上是生活随笔為你收集整理的悬崖边上的舞者,记7.2生产数据库灾难事件的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: Java学习笔记50:JSONObjec
- 下一篇: Windows2003 SQL2005解