1.背景概述
在一次主從復制架構中,由于主節點binlog損壞,導致從節點無法正常同步數據,只能重做從節點;因此使用MySQL 8.0.17開始提供的clone技術進行恢復,恢復后的2天都發生了主從報錯數據沖突。
通過解析binlog發現,同一時刻主從節點都在執行同一條語句,因此詢問業務是否在主從節點都執行了定時任務,業務回復定時任務只在主節點執行。
最后排查發現,克隆后的從節點的定時任務也會是開啟的狀態,因此同一時刻,主從節點同時執行定時任務,導致主從報錯,最終將從節點的定時任務關閉后解決此問題。
2.問題復現
本次測試基于 GreatSQL 8.0.32-24
?
?
greatsql>?SELECT?VERSION(); +-----------+ |?VERSION()?| +-----------+ |?8.0.32-24?| +-----------+ 1?row?in?set?(0.00?sec)
?
?
1.搭建一套主從架構
略
2.創建event
?
?
greatsql>?create?database?test; greatsql>?use?test; greatsql>?CREATE?TABLE?`test`?( ?`id`?int(11)?unsigned?NOT?NULL?AUTO_INCREMENT?COMMENT?'ID', ?`now`?datetime?DEFAULT?NULL?COMMENT?'時間', ?PRIMARY?KEY?(`id`) ); greatsql>?CREATE?EVENT?event_test? ON?SCHEDULE?EVERY?1?MINUTE ON?COMPLETION?PRESERVE? ENABLE? COMMENT?'每隔1分鐘向test表插入記錄' DO?INSERT?INTO?test?VALUES(NULL,?now());
?
?
3.查看event狀態
主節點,默認情況下event狀態為 ENABLED
?
?
greatsql>?show?events; +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+ |?Db??|?Name????|?Definer?|?Time?zone?|?Type????|?Execute?at?|?Interval?value?|?Interval?field?|?Starts????????|?Ends?|?Status??|?Originator?|?character_set_client?|?collation_connection?|?Database?Collation?| +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+ |?test?|?event_test?|?root@%??|?SYSTEM???|?RECURRING?|?NULL????|?1????????|?MINUTE?????|?2023-10-12?17:11:14?|?NULL?|?ENABLED?|??????1?|?utf8mb4????????|?utf8mb4_unicode_ci??|?utf8mb4_unicode_ci?| +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+ 1?row?in?set?(0.00?sec)
?
?
從節點,默認情況下event狀態為 SLAVESIDE_DISABLED
?
?
greatsql>?show?events; +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+ |?Db??|?Name????|?Definer?|?Time?zone?|?Type????|?Execute?at?|?Interval?value?|?Interval?field?|?Starts????????|?Ends?|?Status???????|?Originator?|?character_set_client?|?collation_connection?|?Database?Collation?| +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+ |?test?|?event_test?|?root@%??|?SYSTEM???|?RECURRING?|?NULL????|?1????????|?MINUTE?????|?2023-10-12?17:11:14?|?NULL?|?SLAVESIDE_DISABLED?|??????1?|?utf8mb4????????|?utf8mb4_unicode_ci??|?utf8mb4_unicode_ci?| +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+ 1?row?in?set?(0.00?sec)
?
?
4.查看數據
?
?
greatsql>?select?*?from?test.test; +----+---------------------+ |?id?|?now?????????| +----+---------------------+ |??1?|?2023-08-08?1639?| |??2?|?2023-08-08?1639?| |??3?|?2023-08-08 1639?| +----+---------------------+ 3?rows?in?set?(0.00?sec)
?
?
5.從節點進行克隆
# 安裝克隆插件,主從節點都需要
?
?
greatsql>?install?plugin?clone?soname?'mysql_clone.so';
?
?
# 從節點進行clone
?
?
greatsql>?set?global?clone_valid_donor_list='172.17.137.162:6001'; greatsql>?clone?instance?from?root@'172.17.137.162':6001?identified?by?'greatsql';
?
?
6.重新建立主從復制
?
?
greatsql>?change?master?to?master_user='root',master_password='greatsql',master_host='172.17.137.162',master_port=6001,master_auto_position=1; Query?OK,?0?rows?affected,?7?warnings?(0.04?sec) greatsql>?start?slave; Query?OK,?0?rows?affected,?1?warning?(0.04?sec)
?
?
7.查看主從狀態
?
?
greatsql>?show?slave?statusG ***************************?1.?row?*************************** ???????????????Slave_IO_State:?Waiting?for?source?to?send?event ??????????????????Master_Host:?172.17.137.162 ??????????????????Master_User:?root ??????????????????Master_Port:?6001 ????????????????Connect_Retry:?60 ??????????????Master_Log_File:?binlog.000001 ??????????Read_Master_Log_Pos:?2959 ???????????????Relay_Log_File:?relaylog.000002 ????????????????Relay_Log_Pos:?395 ????????Relay_Master_Log_File:?binlog.000001 ?????????????Slave_IO_Running:?Yes ????????????Slave_SQL_Running:?No ??????????????Replicate_Do_DB:? ??????????Replicate_Ignore_DB:? ???????????Replicate_Do_Table:? ???????Replicate_Ignore_Table:? ??????Replicate_Wild_Do_Table:? ??Replicate_Wild_Ignore_Table:? ???????????????????Last_Errno:?1062 ???????????????????Last_Error:?Coordinator?stopped?because?there?were?error(s)?in?the?worker(s).?The?most?recent?failure?being:?Worker?1?failed?executing?transaction?'e8bf88f9-2acd-11ee-a98a-00163e605c74:8'?at?master?log?binlog.000001,?end_log_pos?2606.?See?error?log?and/or?performance_schema.replication_applier_status_by_worker?table?for?more?details?about?this?failure?or?others,?if?any. ?????????????????Skip_Counter:?0 ??????????Exec_Master_Log_Pos:?2307 ??????????????Relay_Log_Space:?1242 ??????????????Until_Condition:?None ???????????????Until_Log_File:? greatsql>?select?*?from?performance_schema.replication_applier_status_by_worker?limit?1G ***************************?1.?row?*************************** ???????????????????????????????????????????CHANNEL_NAME:? ??????????????????????????????????????????????WORKER_ID:?1 ??????????????????????????????????????????????THREAD_ID:?NULL ??????????????????????????????????????????SERVICE_STATE:?OFF ??????????????????????????????????????LAST_ERROR_NUMBER:?1062 ?????????????????????????????????????LAST_ERROR_MESSAGE:?Worker?1?failed?executing?transaction?'e8bf88f9-2acd-11ee-a98a-00163e605c74:8'?at?master?log?binlog.000001,?end_log_pos?2606;?Could?not?execute?Write_rows?event?on?table?test.test;?Duplicate?entry?'5'?for?key?'test.PRIMARY',?Error_code:?1062;?handler?error?HA_ERR_FOUND_DUPP_KEY;?the?event's?master?log?FIRST,?end_log_pos?2606 ???????????????????????????????????LAST_ERROR_TIMESTAMP:?2023-08-08?1639.033240 ???????????????????????????????LAST_APPLIED_TRANSACTION:? ?????LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP:?0000-00-00?0000.000000 ????LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP:?0000-00-00?0000.000000 ?????????LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP:?0000-00-00?0000.000000 ???????????LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP:?0000-00-00?0000.000000 ???????????????????????????????????APPLYING_TRANSACTION:?e8bf88f9-2acd-11ee-a98a-00163e605c74:8 ?????????APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP:?2023-08-08?1645.795753 ????????APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP:?2023-08-08?1645.795753 ?????????????APPLYING_TRANSACTION_START_APPLY_TIMESTAMP:?2023-08-08?1639.032510 ?????????????????LAST_APPLIED_TRANSACTION_RETRIES_COUNT:?0 ???LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER:?0 ??LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:? LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP:?0000-00-00?0000.000000 ?????????????????????APPLYING_TRANSACTION_RETRIES_COUNT:?0 ???????APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER:?0 ??????APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:? ????APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP:?0000-00-00?0000.000000 1?row?in?set?(0.00?sec)
?
?
可以看到從節點報錯發生了主鍵沖突。
8.查看從節點定時任務狀態
當前從節點定時任務狀態為 ENABLED
?
?
greatsql>?show?events; +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+ |?Db??|?Name????|?Definer?|?Time?zone?|?Type????|?Execute?at?|?Interval?value?|?Interval?field?|?Starts????????|?Ends?|?Status??|?Originator?|?character_set_client?|?collation_connection?|?Database?Collation?| +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+ |?test?|?event_test?|?root@%??|?SYSTEM???|?RECURRING?|?NULL????|?1????????|?MINUTE?????|?2023-08-08?15:58:45?|?NULL?|?ENABLED?|??????1?|?utf8mb4????????|?utf8mb4_unicode_ci??|?utf8mb4_unicode_ci?| +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+---------------------+----------------------+--------------------+ 1?row?in?set?(0.00?sec)
?
?
可以看到由于從節點的定時任務也執行了,從節點寫入數據,導致主鍵沖突。
9.故障解決
?
?
greatsql>?alter?event?event_test?DISABLE; Query?OK,?0?rows?affected?(0.01?sec)
?
?
關閉從節點的定時任務event,然后跳過主鍵沖突的報錯,最后重新啟動主從復制。
3.總結
1.如果主庫有定時任務,通過clone的方式搭建從庫,在從庫恢復之后需要關閉定時任務,避免主從同時執行定時任務導致主從故障。
2.克隆時,如果捐贈節點有主從復制信息,則克隆后的接收節點也會克隆此復制信息,并在克隆完成自動重啟實例后,自動啟動復制;避免此問題可以在接收節點的配置文件中增加 skip-slave-start,避免節點重啟后自動啟動復制。
編輯:黃飛
?
評論