當(dāng)前位置：首頁 > 运维知识 > 数据库 >内容正文

数据库

mysql relay_技术分享 | MySQL：产生大量小 relay log 的故障一例

發(fā)布時間：2025/3/15 数据库 33 豆豆

生活随笔收集整理的這篇文章主要介紹了 mysql relay_技术分享 | MySQL：产生大量小 relay log 的故障一例小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

作者：高鵬

文章末尾有他著作的《深入理解 MySQL 主從原理 32 講》，深入透徹理解 MySQL 主從，GTID 相關(guān)技術(shù)知識。

源碼版本：5.7.22

水平有限，如有誤請諒解

本文建議橫屏觀看，效果更佳

一、案例來源和現(xiàn)象

這個案例是朋友 @peaceful 遇到的線上問題，最終線索也是他自己找到的。現(xiàn)象如下：

1、出現(xiàn)了大量很小的 relay log 如下，堆積量大約 2600 個：

...

-rw-r----- 1 mysql dba 12827 Oct 11 12:28 mysql-relay-bin.036615

-rw-r----- 1 mysql dba 4908 Oct 11 12:28 mysql-relay-bin.036616

-rw-r----- 1 mysql dba 1188 Oct 11 12:28 mysql-relay-bin.036617

-rw-r----- 1 mysql dba 5823 Oct 11 12:29 mysql-relay-bin.036618

-rw-r----- 1 mysql dba 507 Oct 11 12:29 mysql-relay-bin.036619

-rw-r----- 1 mysql dba 1188 Oct 11 12:29 mysql-relay-bin.036620

-rw-r----- 1 mysql dba 3203 Oct 11 12:29 mysql-relay-bin.036621

-rw-r----- 1 mysql dba 37916 Oct 11 12:30 mysql-relay-bin.036622

-rw-r----- 1 mysql dba 507 Oct 11 12:30 mysql-relay-bin.036623

-rw-r----- 1 mysql dba 1188 Oct 11 12:31 mysql-relay-bin.036624

-rw-r----- 1 mysql dba 4909 Oct 11 12:31 mysql-relay-bin.036625

-rw-r----- 1 mysql dba 1188 Oct 11 12:31 mysql-relay-bin.036626

-rw-r----- 1 mysql dba 507 Oct 11 12:31 mysql-relay-bin.036627

-rw-r----- 1 mysql dba 507 Oct 11 12:32 mysql-relay-bin.036628

-rw-r----- 1 mysql dba 1188 Oct 11 12:32 mysql-relay-bin.036629

-rw-r----- 1 mysql dba 454 Oct 11 12:32 mysql-relay-bin.036630

-rw-r----- 1 mysql dba 6223 Oct 11 12:32 mysql-relay-bin.index

2、主庫錯誤日志有如下錯誤

2019-10-11T12:31:26.517309+08:00 61303425 [Note] While initializing dump thread for slave with UUID , found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303421).

2019-10-11T12:31:26.517489+08:00 61303425 [Note] Start binlog_dump to master_thread_id(61303425) slave_server(19304313), pos(, 4)

2019-10-11T12:31:44.203747+08:00 61303449 [Note] While initializing dump thread for slave with UUID , found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303425).

2019-10-11T12:31:44.203896+08:00 61303449 [Note] Start binlog_dump to master_thread_id(61303449) slave_server(19304313), pos(, 4)

二、slave_net_timeout 參數(shù)分析

實(shí)際上第一眼看這個案例我也覺得很奇怪，因?yàn)楹苌儆腥藭ピO(shè)置 slave_net_timeout 參數(shù)，同樣我們也沒有設(shè)置過，因此關(guān)注較少。但是 @peaceful 自己找到了可能出現(xiàn)問題的設(shè)置就是當(dāng)前從庫 slave_net_timeout 參數(shù)設(shè)置為 10。我就順著這個線索往下分析，我們先來看看 slave_net_timeout 參數(shù)的功能。

當(dāng)前看來從庫的 slave_net_timeout 有如下兩個功能：

1、設(shè)置 IO 線程在空閑情況下(沒有 Event 接收的情況下)的連接超時時間。

這個參數(shù) 5.7.7 過后是 60 秒，以前是 3600 秒，修改后需要重啟主從才會生效。

2、如果 change master 沒有指定 MASTER_HEARTBEAT_PERIOD 的情況下會設(shè)置為 slave_net_timeout/2

一般我們配置主從都沒有去指定這個心跳周期，因此就是 slave_net_timeout/2，它控制的是如果在主庫沒有 Event 產(chǎn)生的情況下，多久發(fā)送一個心跳 Event 給從庫的 IO 線程，用于保持連接。但是一旦我們配置了主從(change master)這個值就定下來了，不會隨著 slave_net_timeout 參數(shù)的更改而更改，我們可以在 slave_master_info 表中找到相應(yīng)的設(shè)置如下：

mysql> select Heartbeat from slave_master_info \G

*************************** 1. row ***************************

Heartbeat: 30

1 row in set (0.01 sec)

如果我們要更改這個值只能重新 change master 才行。

三、原因總結(jié)

如果滿足下面三個條件，將會出現(xiàn)案例中的故障：

主從中的 MASTER_HEARTBEAT_PERIOD 的值大于從庫 slave_net_timeout

主庫當(dāng)前壓力很小持續(xù) slave_net_timeout 設(shè)置時間沒有產(chǎn)生新的 Event

之前主從有一定的延遲

那么這種情況下在主庫心跳 Event 發(fā)送給從庫的IO線程之前，IO 線程已經(jīng)斷開了。斷開后 IO 線程會進(jìn)行重連，每次重連將會生成新的 relay log，但是這些 relay log 由于延遲問題不能清理就出現(xiàn)了案例中的情況。

下面是官方文檔中關(guān)于這部分說明：

If you are logging master connection information to tables, MASTER_HEARTBEAT_PERIOD can be seen

as the value of the Heartbeat column of the mysql.slave_master_info table.

Setting interval to 0 disables heartbeats altogether. The default value for interval is equal to the

value of slave_net_timeout divided by 2.

Setting @@global.slave_net_timeout to a value less than that of the current heartbeat interval

results in a warning being issued. The effect of issuing RESET SLAVE on the heartbeat interval is to

reset it to the default value.

四、案例模擬

有了理論基礎(chǔ)就很好了模擬了，延遲這一點(diǎn)我模擬的時候關(guān)閉了從庫的 SQL 線程來模擬 relay log 積壓的情況，因?yàn)檫@個案例和 SQL 線程沒有太多的關(guān)系。

提前配置好主從，查看當(dāng)前的心跳周期和 slave_net_timeout 參數(shù)如下：

mysql> show variables like '%slave_net_timeout%';

+-------------------+-------+

| Variable_name | Value |

+-------------------+-------+

| slave_net_timeout | 60 |

+-------------------+-------+

1 row in set (0.01 sec)

mysql> select Heartbeat from slave_master_info \G

*************************** 1. row ***************************

Heartbeat: 30

1 row in set (0.00 sec)

1、停止從庫的 SQL 線程

stop slave sql_thread;

2、設(shè)置 slave_net_timeout 為 10

mysql> set global slave_net_timeout=10;

Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show warnings;

+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

| Level | Code | Message |

+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

| Warning | 1704 | The requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds. A sensible value for the period should be less than the timeout. |

+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

可以看到這里實(shí)際上已經(jīng)有一個警告了。

3、重啟 IO 線程

這樣才會讓 slave_net_timeout 參數(shù)生效

mysql> stop slave ;

Query OK, 0 rows affected (0.01 sec)

mysql> start slave io_thread;

Query OK, 0 rows affected (0.01 sec)

4、觀察現(xiàn)象

大概每 10 秒會生成一個 relay log 文件如下：

-rw-r----- 1 mysql mysql 500 2019-09-27 23:48:32.655001361 +0800 relay.000142

-rw-r----- 1 mysql mysql 500 2019-09-27 23:48:42.943001355 +0800 relay.000143

-rw-r----- 1 mysql mysql 500 2019-09-27 23:48:53.293001363 +0800 relay.000144

-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:03.502000598 +0800 relay.000145

-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:13.799001357 +0800 relay.000146

-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:24.055001354 +0800 relay.000147

-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:34.280001827 +0800 relay.000148

-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:44.496001365 +0800 relay.000149

-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:54.789001353 +0800 relay.000150

-rw-r----- 1 mysql mysql 500 2019-09-27 23:50:05.485001371 +0800 relay.000151

-rw-r----- 1 mysql mysql 500 2019-09-27 23:50:15.910001430 +0800 relay.000152

大概每 10 秒主庫的日志會輸出如下日志：

2019-10-08T02:27:24.996827+08:00 217 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(216).

2019-10-08T02:27:24.998297+08:00 217 [Note] Start binlog_dump to master_thread_id(217) slave_server(953340), pos(, 4)

2019-10-08T02:27:35.265961+08:00 218 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(217).

2019-10-08T02:27:35.266653+08:00 218 [Note] Start binlog_dump to master_thread_id(218) slave_server(953340), pos(, 4)

2019-10-08T02:27:45.588074+08:00 219 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(218).

2019-10-08T02:27:45.589814+08:00 219 [Note] Start binlog_dump to master_thread_id(219) slave_server(953340), pos(, 4)

2019-10-08T02:27:55.848558+08:00 220 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(219).

2019-10-08T02:27:55.849442+08:00 220 [Note] Start binlog_dump to master_thread_id(220) slave_server(953340), pos(, 4)

這個日志就和案例中的一模一樣了。

解決問題

知道原因后解決也就很簡單了我們只需設(shè)置 slave_net_timeout 參數(shù)為 MASTER_HEARTBEAT_PERIOD 的 2 倍就可以了，設(shè)置后重啟主從即可。

五、實(shí)現(xiàn)方式

這里我們將通過簡單的源碼調(diào)用分析來看看到底 slave_net_timeout 參數(shù)和 MASTER_HEARTBEAT_PERIOD 對主從的影響。

1、從庫使用參數(shù) slave_net_timeout

從庫IO線程啟動時候會通過參數(shù) slave_net_timeout 設(shè)置超時：

->connect_to_master

-> mysql_options

case MYSQL_OPT_CONNECT_TIMEOUT: //MYSQL_OPT_CONNECT_TIMEOUT

mysql->options.connect_timeout= *(uint*) arg;

break;

而在建立和主庫的連接時候會使用這個值

connect_to_master

->mysql_real_connect

-> get_vio_connect_timeout

timeout_sec= mysql->options.connect_timeout;

因此我們也看到了 slave_net_timeout 參數(shù)只有在 IO 線程重啟的時候才會生效

2、從庫設(shè)置 MASTER_HEARTBEAT_PERIOD 值

在每次使用從庫 change master 時候會設(shè)置這個值如下，默認(rèn)為 slave_net_timeout/2：

->change_master

->change_receive_options

mi->heartbeat_period= min(SLAVE_MAX_HEARTBEAT_PERIOD,

(slave_net_timeout/2.0f));

因此我們看到只有 change master 才會重新設(shè)置這個值，重啟主從是不會重新設(shè)置的。

3、使用 MASTER_HEARTBEAT_PERIOD 值

每次 IO 線程啟動時候會將這個值傳遞給主庫的 DUMP 線程，方式應(yīng)該是通過構(gòu)建語句 ‘SET @master_heartbeat_period’ 來完成的。如下：

->handle_slave_io

->get_master_version_and_clock

if (mi->heartbeat_period != 0.0)

{

char llbuf[22];

const char query_format[]= "SET @master_heartbeat_period= %s";

char query[sizeof(query_format) - 2 + sizeof(llbuf)];

主庫啟動 DUMP 線程的時候會通過搜索的方式找到這個值如下

->Binlog_sender::init

->Binlog_sender::init_heartbeat_period

user_var_entry *entry=

(user_var_entry*) my_hash_search(&m_thd->user_vars, (uchar*) name.str,

name.length);

m_heartbeat_period= entry ? entry->val_int(&null_value) : 0;

4、DUMP 線程使用 MASTER_HEARTBEAT_PERIOD 發(fā)送心跳 Event

這里主要是通過一個超時等待來完成，如下：

->Binlog_sender::wait_new_events

->Binlog_sender::wait_with_heartbeat

set_timespec_nsec(&ts, m_heartbeat_period); //心跳超時

ret= mysql_bin_log.wait_for_update_bin_log(m_thd, &ts);//等待

if (ret != ETIMEDOUT && ret != ETIME) //如果是正常收到則收到信號，說明有新的Event到來，否則如果是超時則發(fā)送心跳Event

break; //正常返回0 是超時返回ETIMEDOUT 繼續(xù)循環(huán)

if (send_heartbeat_event(log_pos)) //發(fā)送心跳Event

return 1;

5、重連會殺掉可能的存在的 DUMP 線程

根據(jù) UUID 進(jìn)行比對如下：

->kill_zombie_dump_threads

Find_zombie_dump_thread find_zombie_dump_thread(slave_uuid);

THD *tmp= Global_THD_manager::get_instance()->

find_thd(&find_zombie_dump_thread);

if (tmp)

{

Here we do not call kill_one_thread() as

it will be slow because it will iterate through the list

again. We just to do kill the thread ourselves.

if (log_warnings > 1)

{

if (slave_uuid.length())

{

sql_print_information("While initializing dump thread for slave with "

"UUID , found a zombie dump thread with the "

"same UUID. Master is killing the zombie dump "

"thread(%u).", slave_uuid.c_ptr(),

tmp->thread_id());

}//這里就是本案例中的日志了

.....

這里我們看到了案例中的日志。

6、關(guān)于 DUMP 線程流程圖

最后給出一張來自我《 MySQL 主從原理 32 講》第 17 節(jié)中 DUMP 線程的流程圖如下：

總結(jié)

以上是生活随笔為你收集整理的mysql relay_技术分享 | MySQL：产生大量小 relay log 的故障一例的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： reactor多线程模型_Netty运用
下一篇： linux cmake编译源码,linu