Preface
The previous post covered the mechanism of ctdb eventscript monitoring and, using 50.samba as an example, walked through the monitor event. This post looks at the monitor event of the 01.reclock script.
Is CTDB a high-availability solution? The answer is yes and no. CTDB by itself is not an HA solution, but combined with a distributed (clustered) filesystem it can provide HA. The distributed filesystem has to provide CTDB with a recovery lock, which is what prevents split-brain.
Seen from that angle, the health of the distributed filesystem matters a great deal: if it misbehaves and the recovery lock file becomes inaccessible, CTDB itself gets into trouble, because the very first step of a recovery is to take that lock. Monitoring the state of the recovery lock is therefore essential.
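As a quick check, the recovery lock path a node is configured with can be queried from the ctdb CLI; this is the same call the 01.reclock script makes further down (-Y asks for machine-readable output):
# Show the recovery lock file configured on this node
ctdb getreclock
# Machine-readable form, as used by /etc/ctdb/events.d/01.reclock
ctdb -Y getreclock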
What happens when the distributed filesystem becomes inaccessible
Given how important the recovery lock is to CTDB, what happens if the distributed filesystem fails and the lock can no longer be reached? The log below shows a node in exactly that situation:
2017/05/12 17:57:09.830888 [recoverd:269627]: server/ctdb_recoverd.c:2837 check_reclock child process hung/timedout CFS slow to grant locks?
2017/05/12 17:57:09.831102 [269550]: High RECLOCK latency 15.015342s for operation recd reclock
2017/05/12 17:57:10.174137 [269550]: 01.reclock: ERROR: more than 3 consecutive failures for 01.reclock, marking cluster unhealthy
2017/05/12 17:57:24.851245 [recoverd:269627]: server/ctdb_recoverd.c:2837 check_reclock child process hung/timedout CFS slow to grant locks?
2017/05/12 17:57:24.851527 [269550]: High RECLOCK latency 15.012397s for operation recd reclock
2017/05/12 17:57:25.372032 [269550]: 01.reclock: ERROR: more than 3 consecutive failures for 01.reclock, marking cluster unhealthy
recovery daemon
In the recovery daemon's main loop, check_recovery_lock() is called:
if (ctdb->tunable.verify_recovery_lock != 0) {
    /* we should have the reclock - check its not stale */
    ret = check_recovery_lock(ctdb);
    if (ret != 0) {
        DEBUG(DEBUG_ERR,("Failed check_recovery_lock. Force a recovery\n"));
        ctdb_set_culprit(rec, ctdb->pnn);
        do_recovery(rec, mem_ctx, pnn, nodemap, vnnmap);
        return;
    }
}
In fact, this code has since been removed from ctdb entirely:
commit 9924218dc6a7f1cb184db9953d839ab5d9124ca2
Author: Martin Schwenke <martin@meltin.net>
Date: Tue Dec 9 14:45:08 2014 +1100
recoverd: Remove check_recovery_lock()
This has not done anything useful since commit
b9d8bb23af8abefb2d967e9b4e9d6e60c4a3b520. Instead, just check that
the lock is held.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
(Imported from commit be19a17faf6da97365c425c5b423e9b74f9c9e0c)
So we will not spend much ink on this function here.
01.reclock monitor
Since every eventscript's monitor event runs every 15 seconds, let us look at the monitor case of 01.reclock:
monitor)
    RECLOCKFILE=$(ctdb -Y getreclock)
    ctdb_counter_incr
    (ctdb_check_counter_limit 200 >/dev/null 2>&1) || {
        echo "Reclock file $RECLOCKFILE\" can not be accessed. Shutting down."
        df
        sleep 1
        ctdb shutdown
    }
    [ -z "$RECLOCKFILE" ] && {
        # we are not using a reclock file
        ctdb_counter_init
        exit 0
    }
    # try stat the reclock file as a background process
    # so that we dont block in case the cluster filesystem is unavailable
    (
        stat $RECLOCKFILE && {
            # we could stat the file, reset the counter
            ctdb_counter_init
        }
    ) >/dev/null 2>/dev/null &
    ctdb_check_counter_limit 3 quiet
    ;;
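As an aside, the per-script outcome of the most recent monitor run can be inspected from the command line. ctdb scriptstatus lists each eventscript together with its status and any error text it printed (output omitted here; the exact layout varies between ctdb versions):
# Result of the last monitor event for every eventscript, 01.reclock included
ctdb scriptstatus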
ctdb_counter_init, ctdb_counter_incr and ctdb_check_counter_limit are a small family of helper functions implemented in /etc/ctdb/functions:
_ctdb_counter_common () {
    _service_name="${1:-${service_name}}"
    _counter_file="$ctdb_fail_dir/$_service_name"
    mkdir -p "${_counter_file%/*}" # dirname
}
ctdb_counter_init () {
    _ctdb_counter_common "$1"
    >"$_counter_file"
}
ctdb_counter_incr () {
    _ctdb_counter_common "$1"
    # unary counting!
    echo -n 1 >> "$_counter_file"
}
ctdb_check_counter_limit () {
    _ctdb_counter_common
    _limit="${1:-${service_fail_limit}}"
    _quiet="$2"
    # unary counting!
    _size=$(stat -c "%s" "$_counter_file" 2>/dev/null || echo 0)
    if [ $_size -ge $_limit ] ; then
        echo "ERROR: more than $_limit consecutive failures for $service_name, marking cluster unhealthy"
        exit 1
    elif [ $_size -gt 0 -a -z "$_quiet" ] ; then
        echo "WARNING: less than $_limit consecutive failures ($_size) for $service_name, not unhealthy yet"
    fi
}
The implementation is quite simple: under /var/lib/ctdb/failcount, each monitored service gets a file of its own:
root@ctl-1:/var/lib/ctdb/failcount# ll
total 12
drwxr-xr-x 2 root root 4096 May 12 17:09 ./
drwxrwxrwx 6 root root 4096 May 12 17:09 ../
-rw-r--r-- 1 root root 4 May 12 17:52 01.reclock
-rw-r--r-- 1 root root 0 May 12 17:09 samba
-rw-r--r-- 1 root root 0 May 12 17:09 winbind
root@ctl-1:/var/lib/ctdb/failcount# cat 01.reclock
1111root@ctl-1:/var/lib/ctdb/failcount#
If a check keeps failing, one more '1' is appended to the file on every run, so a simple stat of the file tells you how many times in a row that monitor check has failed. If the check recovers after N failures, the file is truncated, and a length of 0 signals that everything is fine again. Note the 01.reclock entry in the listing above: its length is 4, meaning four consecutive checks have failed (in practice, four runs in which stat could not return).
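The unary counter is easy to try out by hand; a minimal sketch, using a scratch file under /tmp instead of the real /var/lib/ctdb/failcount entries:
f=/tmp/failcount-demo
: > "$f"                 # ctdb_counter_init: truncate, i.e. zero failures
echo -n 1 >> "$f"        # ctdb_counter_incr: one byte appended per failure
echo -n 1 >> "$f"
echo -n 1 >> "$f"
stat -c "%s" "$f"        # prints 3: three consecutive failures so far
: > "$f"                 # a successful check resets the count to zero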
When the distributed filesystem is broken, the stat usually just hangs and never returns in time:
root 40060 0.0 0.0 4540 244 ? S 18:28 0:00 /bin/sh /etc/ctdb/events.d/01.reclock monitor
root 40062 0.0 0.0 8596 1572 ? D 18:28 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 42964 0.0 0.0 4540 240 ? S 18:28 0:00 /bin/sh /etc/ctdb/events.d/01.reclock monitor
root 42966 0.0 0.0 8596 1616 ? D 18:28 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 46030 0.0 0.0 4540 244 ? S 18:29 0:00 /bin/sh /etc/ctdb/events.d/01.reclock monitor
root 46032 0.0 0.0 8596 1452 ? D 18:29 0:00 stat /var/share/ezfs/ctdb/recovery.lock
The stat in the script is deliberately run in a backgrounded subshell:
(
    stat $RECLOCKFILE && {
        # we could stat the file, reset the counter
        ctdb_counter_init
    }
) >/dev/null 2>/dev/null &
If the stat succeeds, the failure counter is naturally reset to zero. If it hangs, the ctdb_counter_incr at the top of the monitor case keeps growing the counter, and once 3 consecutive failures have accumulated, every subsequent ctdb_check_counter_limit 3 quiet prints:
2017/05/12 17:22:58.614358 [269550]: 01.reclock: ERROR: more than 3 consecutive failures for 01.reclock, marking cluster unhealthy
Since the monitor event fires every 15 seconds, this message repeats every 15 seconds for as long as the distributed filesystem stays down.
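The exit 1 from ctdb_check_counter_limit is what actually marks the node unhealthy; from any cluster member this shows up as the UNHEALTHY flag in the node list (command shown without output, since the exact layout depends on the ctdb version):
# Node list with flags; the affected node is reported as UNHEALTHY
ctdb status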
What if the distributed filesystem stays broken for a long time?
(ctdb_check_counter_limit 200 >/dev/null 2>&1) || {
    echo "Reclock file $RECLOCKFILE\" can not be accessed. Shutting down."
    df
    sleep 1
    ctdb shutdown
}
If 200 consecutive checks have failed or hung (200 × 15 s = 3000 s, roughly 50 minutes), the script runs df and then shuts ctdb down:
2017/05/12 21:19:48.339397 [269550]: 01.reclock: Reclock file /var/share/ezfs/ctdb/recovery.lock" can not be accessed. Shutting down.
2017/05/12 21:19:48.342088 [269550]: 01.reclock: Filesystem 1K-blocks Used Available Use% Mounted on
2017/05/12 21:19:48.342122 [269550]: 01.reclock: /dev/vda2 14867844 1405372 12684184 10% /
2017/05/12 21:19:48.342143 [269550]: 01.reclock: udev 4079172 4 4079168 1% /dev
2017/05/12 21:19:48.342159 [269550]: 01.reclock: tmpfs 1635336 516 1634820 1% /run
2017/05/12 21:19:48.342177 [269550]: 01.reclock: none 5120 0 5120 0% /run/lock
2017/05/12 21:19:48.342195 [269550]: 01.reclock: none 4088332 12 4088320 1% /run/shm
2017/05/12 21:19:48.342214 [269550]: 01.reclock: /dev/sda2 12253344 102084 12134876 1% /data/osd.4
2017/05/12 21:19:48.342231 [269550]: 01.reclock: /dev/sdb2 12253344 73288 12163672 1% /data/osd.5
2017/05/12 21:19:48.342249 [269550]: 01.reclock: 192.168.137.137,192.168.137.138,192.168.137.139:/ 36757504 315392 36392960 1% /var/share/ezfs
2017/05/12 21:19:48.342269 [269550]: 01.reclock: ceph-fuse 73519104 733184 72785920 1% /mnt/ezfs_fuse
2017/05/12 21:19:49.344706 [269550]: Shutting down recovery daemon
2017/05/12 21:19:49.344746 [269550]: Monitoring has been stopped
On the surface that is the whole story, but there is another catch: df itself can hang.
2017/05/12 18:49:01.000787 [121066]: 01.reclock: Reclock file /var/share/ezfs/ctdb/recovery.lock" can not be accessed. Shutting down.
2017/05/12 18:49:30.977273 [121066]: Event script timed out : 01.reclock monitor count : 0 pid : 278735
2017/05/12 18:49:30.977601 [121066]: Ignoring hung script for call 8
2017/05/12 18:49:30.977652 [121066]: server/eventscript.c:584 Sending SIGTERM to child pid:278735
2017/05/12 18:49:45.999869 [121066]: 01.reclock: Reclock file /var/share/ezfs/ctdb/recovery.lock" can not be accessed. Shutting down.
2017/05/12 18:50:15.980508 [121066]: Event script timed out : 01.reclock monitor count : 0 pid : 287311
2017/05/12 18:50:15.980763 [121066]: Ignoring hung script for call 8
2017/05/12 18:50:15.980804 [121066]: server/eventscript.c:584 Sending SIGTERM to child pid:287311
So ctdb shutdown never gets a chance to run. Note that the eventscript itself hangs, because df also touches the distributed filesystem's local mount point.
Tracing df with strace shows that the statfs() on the broken distributed filesystem never returns, which is what hangs the command:
1494600228.960570 statfs("/var/share/ezfs",
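A trace like the one above is easy to reproduce while the filesystem is wedged; a small sketch (-ttt matches the epoch-style timestamps shown, and the mount point is simply this environment's /var/share/ezfs):
# Trace only the statfs() calls made by df; when the cluster filesystem is
# unresponsive, the trace stops at its mount point and never returns.
strace -ttt -e trace=statfs df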
Fortunately, eventscripts have a timeout mechanism:
ctdb listvars |grep -i timeout
EventScriptTimeout = 30
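The tunable can be read and, if needed, raised per node; a minimal sketch using getvar/setvar (the change is runtime-only and does not survive a ctdbd restart):
# Read the current eventscript timeout on this node
ctdb getvar EventScriptTimeout
# Raise it at runtime, e.g. to 60 seconds
ctdb setvar EventScriptTimeout 60
When the timeout fires, ctdbd runs ctdb_event_script_timeout(), which produced the "Event script timed out" lines seen earlier: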
static void ctdb_event_script_timeout(struct event_context *ev, struct timed_event *te,
                                      struct timeval t, void *p)
{
    struct ctdb_event_script_state *state = talloc_get_type(p, struct ctdb_event_script_state);
    struct ctdb_context *ctdb = state->ctdb;
    struct ctdb_script_wire *current = get_current_script(state);

    DEBUG(DEBUG_ERR,("Event script timed out : %s %s %s count : %u pid : %d\n",
        current->name, ctdb_eventscript_call_names[state->call], state->options,
        ctdb->event_script_timeouts, state->child));

    /* ignore timeouts for these events */
    switch (state->call) {
    case CTDB_EVENT_START_RECOVERY:
    case CTDB_EVENT_RECOVERED:
    case CTDB_EVENT_TAKE_IP:
    case CTDB_EVENT_RELEASE_IP:
    case CTDB_EVENT_STOPPED:
    case CTDB_EVENT_MONITOR:
    case CTDB_EVENT_STATUS:
        state->scripts->scripts[state->current].status = 0;
        DEBUG(DEBUG_ERR,("Ignoring hung script for %s call %d\n", state->options, state->call));
        break;
    default:
        state->scripts->scripts[state->current].status = -ETIME;
        debug_timeout(state);
    }

    talloc_free(state);
}
ctdb_event_script_callback_v() registers a function to run when state is freed, namely event_script_destructor():
talloc_set_destructor(state, event_script_destructor);
Let us look at what event_script_destructor() does:
/*
  destroy an event script: kill it if ->child != 0.
*/
static int event_script_destructor(struct ctdb_event_script_state *state)
{
    int status;
    struct event_script_callback *callback;

    if (state->child) {
        DEBUG(DEBUG_ERR,(__location__ " Sending SIGTERM to child pid:%d\n", state->child));
        if (kill(state->child, SIGTERM) != 0) {
            DEBUG(DEBUG_ERR,("Failed to kill child process for eventscript, errno %s(%d)\n", strerror(errno), errno));
        }
    }
It sends SIGTERM to the eventscript child that has not finished. So when the timeout handler above calls talloc_free(state), this destructor runs and SIGTERM is delivered to the hung 01.reclock monitor process.
Summary
So what happens to ctdb when the distributed filesystem fails?
First, on every node, recmaster or not, the stat of the recovery lock hangs and the failure counter keeps climbing (that is, '1's keep being appended to /var/lib/ctdb/failcount/01.reclock).
After three consecutive failures, log.ctdb starts to show the following message, once every 15 seconds:
2017/05/12 21:51:20.329111 [98501]: 01.reclock: ERROR: more than 3 consecutive failures for 01.reclock, marking cluster unhealthy
2017/05/12 21:51:35.506703 [98501]: 01.reclock: ERROR: more than 3 consecutive failures for 01.reclock, marking cluster unhealthy
2017/05/12 21:51:50.688467 [98501]: 01.reclock: ERROR: more than 3 consecutive failures for 01.reclock, marking cluster unhealthy
If after 15 * 200 = 3000 seconds the distributed filesystem has still not recovered and the recovery lock remains inaccessible, the following starts to appear:
2017/05/12 22:41:15.079926 [98501]: 01.reclock: Reclock file /var/share/ezfs/ctdb/recovery.lock" can not be accessed. Shutting down.
2017/05/12 22:41:45.058071 [98501]: Event script timed out : 01.reclock monitor count : 0 pid : 656512
2017/05/12 22:41:45.058172 [98501]: Ignoring hung script for call 8
2017/05/12 22:41:45.058191 [98501]: server/eventscript.c:584 Sending SIGTERM to child pid:656512
2017/05/12 22:42:00.088310 [98501]: 01.reclock: Reclock file /var/share/ezfs/ctdb/recovery.lock" can not be accessed. Shutting down.
2017/05/12 22:42:30.059273 [98501]: Event script timed out : 01.reclock monitor count : 0 pid : 664585
2017/05/12 22:42:30.059381 [98501]: Ignoring hung script for call 8
2017/05/12 22:42:30.059398 [98501]: server/eventscript.c:584 Sending SIGTERM to child pid:664585
After 01.reclock monitor prints "Shutting down", it tries to run ctdb shutdown. Unluckily, df is executed right before ctdb shutdown, and since the distributed filesystem also has a local mount point, df hangs and never returns. 30 seconds later the eventscript times out and is sent SIGTERM by the ctdb daemon.
Note that throughout all of this the stat processes against the recovery lock keep piling up, one more every 15 seconds. They sit in uninterruptible sleep (state D) waiting for the filesystem, ctdb shutdown never gets to run, and so for as long as the cluster stays broken their number grows without bound (a quick way to count them is shown after the listing below):
root 118695 0.0 0.0 10168 2028 ? D 21:50 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 119746 0.0 0.0 10168 1840 ? D 21:51 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 120328 0.0 0.0 10168 1912 ? D 21:51 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 121374 0.0 0.0 10168 1864 ? D 21:51 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 122218 0.0 0.0 10168 2008 ? D 21:51 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 648084 0.0 0.0 10168 1868 ? D 22:40 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 650885 0.0 0.0 10168 1972 ? D 22:40 0:00 stat /var/share/ezfs/ctdb/recovery.lock
root 653780 0.0 0.0 10168 1848 ? D 22:40 0:00 stat /var/share/ezfs/ctdb/recovery.lock
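For reference, a one-liner to count these leaked children on a node (the path is just the recovery lock from the listing above). Because they are stuck in state D, nothing short of the filesystem answering again, or a reboot, will make them go away:
# Count the stat children piling up against the recovery lock file
ps axo pid,stat,args | grep '[s]tat /var/share/ezfs/ctdb/recovery.lock' | wc -l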
Eventually the administrator may bring the distributed filesystem back. What happens then depends on whether a hung df, and with it an unfinished shutdown, is still around.
- If there is no hung df, the outstanding stat calls return, /var/lib/ctdb/failcount/01.reclock gets truncated, and the next run of 01.reclock monitor no longer tries to execute ctdb shutdown.
- If, unluckily, a hung df does exist, i.e. a 01.reclock monitor instance that has not yet been killed by the timeout, then once the distributed filesystem recovers, that df returns normally, ctdb shutdown finally gets its chance to run, and ctdb on that node shuts down, producing a log like this:
2017/05/12 22:48:49.038076 [170453]: 01.reclock: Reclock file /var/share/ezfs/ctdb/recovery.lock" can not be accessed. Shutting down.
2017/05/12 22:48:53.814372 [170453]: 01.reclock: Filesystem 1K-blocks Used Available Use% Mounted on
2017/05/12 22:48:53.814559 [170453]: 01.reclock: /dev/vda2 14867844 4439208 9650348 32% /
2017/05/12 22:48:53.814609 [170453]: 01.reclock: udev 4079172 4 4079168 1% /dev
2017/05/12 22:48:53.814667 [170453]: 01.reclock: tmpfs 1635336 600 1634736 1% /run
2017/05/12 22:48:53.814724 [170453]: 01.reclock: none 5120 0 5120 0% /run/lock
2017/05/12 22:48:53.814763 [170453]: 01.reclock: none 4088332 12 4088320 1% /run/shm
2017/05/12 22:48:53.814802 [170453]: 01.reclock: /dev/sda2 12253344 109364 12127596 1% /data/osd.2
2017/05/12 22:48:53.814843 [170453]: 01.reclock: /dev/sdb2 12253344 96492 12140468 1% /data/osd.3
2017/05/12 22:48:53.814883 [170453]: 01.reclock: 192.168.137.137,192.168.137.138,192.168.137.139:/ 36757504 323584 36384768 1% /var/share/ezfs
2017/05/12 22:48:53.814960 [170453]: 01.reclock: ceph-fuse 73519104 749568 72769536 2% /mnt/ezfs_fuse
2017/05/12 22:48:54.818095 [170453]: Shutting down recovery daemon
2017/05/12 22:48:54.818183 [170453]: Monitoring has been stopped
2017/05/12 22:48:54.818243 [170453]: server/eventscript.c:584 Sending SIGTERM to child pid:867808
2017/05/12 22:48:54.919587 [170453]: server/ctdb_daemon.c:803 Unable to setup call send
2017/05/12 22:48:54.919752 [170453]: server/ctdb_daemon.c:803 Unable to setup call send