Preface
The previous article described how CTDB invokes its eventscripts; this one focuses on the monitor event of the 50.samba script. During normal operation, CTDB periodically (every 15 seconds by default) checks whether everything is healthy, and it does so by running each eventscript's monitor event:
2017/05/08 10:45:33.986913 [519718]: server/eventscript.c:762 Starting eventscript monitor
2017/05/08 10:45:33.987442 [eventscript-00.ctdb-monitor:134206]: Executing event script /etc/ctdb/events.d/00.ctdb monitor
2017/05/08 10:45:33.994023 [eventscript-01.reclock-monitor:134220]: Executing event script /etc/ctdb/events.d/01.reclock monitor
2017/05/08 10:45:34.022542 [eventscript-10.interface-monitor:134244]: Executing event script /etc/ctdb/events.d/10.interface monitor
2017/05/08 10:45:34.067580 [eventscript-11.natgw-monitor:134295]: Executing event script /etc/ctdb/events.d/11.natgw monitor
2017/05/08 10:45:34.075430 [eventscript-11.routing-monitor:134306]: Executing event script /etc/ctdb/events.d/11.routing monitor
2017/05/08 10:45:34.087329 [eventscript-13.per_ip_routing-monitor:134312]: Executing event script /etc/ctdb/events.d/13.per_ip_routing monitor
2017/05/08 10:45:34.096299 [eventscript-20.multipathd-monitor:134331]: Executing event script /etc/ctdb/events.d/20.multipathd monitor
2017/05/08 10:45:34.096684 [eventscript-31.clamd-monitor:134332]: Executing event script /etc/ctdb/events.d/31.clamd monitor
2017/05/08 10:45:34.097064 [eventscript-40.vsftpd-monitor:134333]: Executing event script /etc/ctdb/events.d/40.vsftpd monitor
2017/05/08 10:45:34.103765 [eventscript-40.fs_use-monitor:134349]: Executing event script /etc/ctdb/events.d/40.fs_use monitor
2017/05/08 10:45:34.104252 [eventscript-41.httpd-monitor:134350]: Executing event script /etc/ctdb/events.d/41.httpd monitor
2017/05/08 10:45:34.115300 [eventscript-50.samba-monitor:134361]: Executing event script /etc/ctdb/events.d/50.samba monitor
2017/05/08 10:45:34.160322 [eventscript-60.ganesha-monitor:134423]: Executing event script /etc/ctdb/events.d/60.ganesha monitor
2017/05/08 10:45:34.160648 [eventscript-60.nfs-monitor:134424]: Executing event script /etc/ctdb/events.d/60.nfs monitor
2017/05/08 10:45:34.170403 [eventscript-62.cnfs-monitor:134434]: Executing event script /etc/ctdb/events.d/62.cnfs monitor
2017/05/08 10:45:34.176847 [eventscript-70.iscsi-monitor:134440]: Executing event script /etc/ctdb/events.d/70.iscsi monitor
2017/05/08 10:45:34.181991 [eventscript-91.lvs-monitor:134444]: Executing event script /etc/ctdb/events.d/91.lvs monitor
2017/05/08 10:45:34.188665 [519718]: server/eventscript.c:485 Eventscript monitor finished with state 0
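The 15-second interval mentioned above corresponds to the MonitorInterval tunable. If you want to confirm or change it on your own cluster, something along these lines should work (the value 30 below is only an example):
ctdb getvar MonitorInterval
# change it at runtime...
ctdb setvar MonitorInterval 30
# ...or persist it in /etc/default/ctdb using the same CTDB_SET_* convention
# as the configuration shown later in this article:
# CTDB_SET_MonitorInterval=30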
The results of the most recent round can be queried with ctdb scriptstatus:
root@controller01:/var/log/ctdb# ctdb scriptstatus
17 scripts were executed last monitor cycle
00.ctdb Status:OK Duration:0.014 Mon May 8 11:51:26 2017
01.reclock Status:OK Duration:0.068 Mon May 8 11:51:26 2017
10.interface Status:OK Duration:0.106 Mon May 8 11:51:26 2017
11.natgw Status:OK Duration:0.024 Mon May 8 11:51:26 2017
11.routing Status:OK Duration:0.007 Mon May 8 11:51:26 2017
13.per_ip_routing Status:OK Duration:0.015 Mon May 8 11:51:26 2017
20.multipathd Status:DISABLED
31.clamd Status:DISABLED
40.vsftpd Status:OK Duration:0.013 Mon May 8 11:51:26 2017
40.fs_use Status:DISABLED
41.httpd Status:OK Duration:0.015 Mon May 8 11:51:26 2017
50.samba Status:OK Duration:0.101 Mon May 8 11:51:26 2017
60.ganesha Status:DISABLED
60.nfs Status:OK Duration:0.011 Mon May 8 11:51:26 2017
62.cnfs Status:OK Duration:0.009 Mon May 8 11:51:26 2017
70.iscsi Status:OK Duration:0.023 Mon May 8 11:51:26 2017
91.lvs Status:OK Duration:0.034 Mon May 8 11:51:26 2017
This article takes 50.samba as an example to show how CTDB checks whether Samba is running properly.
50.samba monitor: preparation
The script begins like this:
. $CTDB_BASE/functions
detect_init_style
case $CTDB_INIT_STYLE in
suse)
CTDB_SERVICE_SMB=${CTDB_SERVICE_SMB:-smb}
CTDB_SERVICE_NMB=${CTDB_SERVICE_NMB:-nmb}
CTDB_SERVICE_WINBIND=${CTDB_SERVICE_WINBIND:-winbind}
;;
debian)
CTDB_SERVICE_SMB=${CTDB_SERVICE_SMB:-samba}
CTDB_SERVICE_NMB=${CTDB_SERVICE_NMB:-""}
CTDB_SERVICE_WINBIND=${CTDB_SERVICE_WINBIND:-winbind}
;;
*)
# should not happen, but for now use redhat style as default:
CTDB_SERVICE_SMB=${CTDB_SERVICE_SMB:-smb}
CTDB_SERVICE_NMB=${CTDB_SERVICE_NMB:-""}
CTDB_SERVICE_WINBIND=${CTDB_SERVICE_WINBIND:-winbind}
;;
esac
service_name="samba"
loadconfig
ctdb_setup_service_state_dir
The first thing the script does is source functions, where most of the shared shell helpers live. Let's look at what the preparation stage actually does:
. $CTDB_BASE/functions
To run 50.samba monitor by hand, you first need to set the CTDB_BASE environment variable; on our Ubuntu nodes that means:
CTDB_BASE=/etc/ctdb bash -x /etc/ctdb/events.d/50.samba monitor
Stepping through it with bash -x shows what sourcing functions sets up:
+ . /etc/ctdb/functions
++ PATH=/bin:/usr/bin:/usr/sbin:/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
++ '[' -z '' ']'
++ '[' -d /var/lib/ctdb ']'
++ export CTDB_VARDIR=/var/lib/ctdb
++ CTDB_VARDIR=/var/lib/ctdb
++ '[' -z '' ']'
++ export CTDB_ETCDIR=/etc
++ CTDB_ETCDIR=/etc
++ ctdb_status_dir=/var/lib/ctdb/status
++ ctdb_fail_dir=/var/lib/ctdb/failcount
++ ctdb_managed_dir=/var/lib/ctdb/managed_history
++ tickledir=/var/lib/ctdb/state/tickles
++ mkdir -p /var/lib/ctdb/state/tickles
++ '[' -n '' -a -x '' ']'
++ '[' -x /etc/ctdb/rc.local ']'
++ '[' -d /etc/ctdb/rc.local.d ']'
++ ctdb_set_current_debuglevel
++ '[' -z '' ']'
++ _f=/var/lib/ctdb/eventscript_debuglevel
++ '[' '' = create -o '!' -r /var/lib/ctdb/eventscript_debuglevel ']'
++ . /var/lib/ctdb/eventscript_debuglevel
+++ export CTDB_CURRENT_DEBUGLEVEL=0
+++ CTDB_CURRENT_DEBUGLEVEL=0
++ script_name=50.samba
++ service_name=50.samba
++ service_fail_limit=1
++ event_name=monitor
Essentially it just exports a handful of directories:
CTDB_VARDIR=/var/lib/ctdb
CTDB_ETCDIR=/etc
ctdb_status_dir=/var/lib/ctdb/status
ctdb_fail_dir=/var/lib/ctdb/failcount
ctdb_managed_dir=/var/lib/ctdb/managed_history
tickledir=/var/lib/ctdb/state/tickles
These are all CTDB working directories: a status directory, a fail-count directory, and so on:
root@controller02:/var/lib/ctdb/state# ll
total 36
drwxr-xr-x 9 root root 4096 May 8 10:44 ./
drwxrwxrwx 6 root root 4096 May 8 10:53 ../
drwxr-xr-x 2 root root 4096 May 4 08:03 ctdb/
drwxr-xr-x 3 root root 4096 May 4 08:03 gpfs/
drwxr-xr-x 3 root root 4096 May 8 10:44 interface_modify/
drwxr-xr-x 2 root root 4096 May 4 08:03 nfs/
drwxr-xr-x 2 root root 4096 May 4 08:03 per_ip_routing/
drwxr-xr-x 2 root root 4096 May 8 13:51 samba/
drwxr-xr-x 2 root root 4096 May 4 08:03 tickles/
root@node1:/var/lib/ctdb/failcount# ll
total 8
drwxr-xr-x 2 root root 4096 Apr 11 21:45 ./
drwxrwxrwx 6 root root 4096 Apr 12 21:36 ../
-rw-r--r-- 1 root root 0 May 8 13:55 01.reclock
-rw-r--r-- 1 root root 0 Apr 11 21:45 samba
-rw-r--r-- 1 root root 0 Apr 11 21:45 winbind
root@node1:/var/lib/ctdb/managed_history# ll
total 8
drwxr-xr-x 2 root root 4096 Apr 11 21:45 ./
drwxrwxrwx 6 root root 4096 Apr 12 21:36 ../
-rw-r--r-- 1 root root 0 Apr 11 21:45 samba
-rw-r--r-- 1 root root 0 Apr 11 21:45 winbind
We'll skip over the meaning of the individual files in these directories for now and come back to them in the code walkthrough.
The second part of what functions does is determine the service name and the event name.
++ . /var/lib/ctdb/eventscript_debuglevel
+++ export CTDB_CURRENT_DEBUGLEVEL=0
+++ CTDB_CURRENT_DEBUGLEVEL=0
++ script_name=50.samba
++ service_name=50.samba
++ service_fail_limit=1
++ event_name=monitor
That is: the script being executed is 50.samba, the service_name is 50.samba, and the event is monitor. The corresponding code is:
ctdb_set_current_debuglevel
script_name="${0##*/}" # basename
service_name="$script_name" # default is just the script name
service_fail_limit=1
event_name="$1"
The interesting line is script_name="${0##*/}", which is just a built-in way of writing basename ${0}. This idiom is very common in shell scripts, and its meaning is:
${VAR##pattern} is the value of variable $VAR with the longest string that matches the pattern removed from the beginning.
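A quick illustration of the expansion (the path below is just an example):
# strip everything up to and including the last "/" (longest match of "*/")
path=/etc/ctdb/events.d/50.samba
echo "${path##*/}"      # -> 50.samba
# a single "#" removes the shortest match instead
echo "${path#*/}"       # -> etc/ctdb/events.d/50.samba
# the external-command equivalent
basename "$path"        # -> 50.samba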
loadconfig
The second part of the preparation is what we'll simply call loadconfig.
detect_init_style
case $CTDB_INIT_STYLE in
suse)
CTDB_SERVICE_SMB=${CTDB_SERVICE_SMB:-smb}
CTDB_SERVICE_NMB=${CTDB_SERVICE_NMB:-nmb}
CTDB_SERVICE_WINBIND=${CTDB_SERVICE_WINBIND:-winbind}
;;
debian)
CTDB_SERVICE_SMB=${CTDB_SERVICE_SMB:-samba}
CTDB_SERVICE_NMB=${CTDB_SERVICE_NMB:-""}
CTDB_SERVICE_WINBIND=${CTDB_SERVICE_WINBIND:-winbind}
;;
*)
# should not happen, but for now use redhat style as default:
CTDB_SERVICE_SMB=${CTDB_SERVICE_SMB:-smb}
CTDB_SERVICE_NMB=${CTDB_SERVICE_NMB:-""}
CTDB_SERVICE_WINBIND=${CTDB_SERVICE_WINBIND:-winbind}
;;
esac
service_name="samba"
loadconfig
ctdb_setup_service_state_dir
Here loadconfig is a function defined in functions:
_loadconfig() {
if [ -z "$1" ] ; then
foo="${service_config:-${service_name}}"
if [ -n "$foo" ] ; then
loadconfig "$foo"
fi
elif [ "$1" != "ctdb" ] ; then
loadconfig "ctdb"
fi
if [ -f $CTDB_ETCDIR/sysconfig/$1 ]; then
. $CTDB_ETCDIR/sysconfig/$1
elif [ -f $CTDB_ETCDIR/default/$1 ]; then
. $CTDB_ETCDIR/default/$1
elif [ -f $CTDB_BASE/sysconfig/$1 ]; then
. $CTDB_BASE/sysconfig/$1
fi
}
loadconfig () {
_loadconfig "$@"
}
The function is straightforward: if the argument is not ctdb, it first does loadconfig ctdb and then sources the configuration file for the specific service. On Ubuntu this amounts to executing:
. /etc/default/ctdb
. /etc/default/samba
The first is the default configuration for ctdb, the second the default configuration for samba. The latter is just a single line:
RUN_MODE="daemons"
The ctdb configuration looks like this:
CTDB_RECOVERY_LOCK=/var/share/ezfs/ctdb/gwgrp/vs-1000/recovery.lock
CTDB_PUBLIC_INTERFACE=eth0
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=yes
CTDB_SERVICE_SMB=smbd
CTDB_SERVICE_NMB=nmbd
CTDB_DEBUGLEVEL=INFO
CTDB_SERVICE_AUTOSTARTSTOP=yes
CTDB_SET_RecoveryBanPeriod=60
CTDB_SET_RecoveryDropAllIPs=65
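Back to _loadconfig for a moment: the recursion can be confusing on first reading. The following simplified, self-contained sketch (not the real functions code, with echo standing in for the actual sourcing) shows the order in which things get loaded when 50.samba calls loadconfig with no argument:
# simplified sketch of the _loadconfig recursion
service_name="samba"
CTDB_ETCDIR="/etc"
sketch_loadconfig() {
    if [ -z "$1" ] ; then
        # no argument: recurse with the service name and stop
        sketch_loadconfig "${service_config:-${service_name}}"
        return
    elif [ "$1" != "ctdb" ] ; then
        # any service other than ctdb pulls in the ctdb defaults first
        sketch_loadconfig "ctdb"
    fi
    echo "would source $CTDB_ETCDIR/default/$1"
}
sketch_loadconfig
# Output:
#   would source /etc/default/ctdb
#   would source /etc/default/samba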
With that, the preparation is complete. Now for the main body of the monitor event:
50.samba monitor
monitor)
# Create a dummy file to track when we need to do periodic cleanup
# of samba databases
periodic_cleanup_file="$service_state_dir/periodic_cleanup"
[ -f "$periodic_cleanup_file" ] || {
touch "$periodic_cleanup_file"
}
[ `find "$periodic_cleanup_file" -mmin +$SAMBA_CLEANUP_PERIOD | wc -l` -eq 1 ] && {
# Cleanup the databases
periodic_cleanup
touch "$periodic_cleanup_file"
}
is_ctdb_managed_service "samba" && {
[ "$CTDB_SAMBA_SKIP_SHARE_CHECK" = "yes" ] || {
testparm_background_update
testparm_cat | egrep '^WARNING|^ERROR|^Unknown' && {
testparm_foreground_update
testparm_cat | egrep '^WARNING|^ERROR|^Unknown' && {
echo "ERROR: testparm shows smb.conf is not clean"
exit 1
}
}
list_samba_shares |
ctdb_check_directories_probe || {
testparm_foreground_update
list_samba_shares |
ctdb_check_directories
} || exit $?
}
Task 1: periodic_cleanup
The first task of the samba monitor is periodic cleanup: by design, once more than 10 minutes have passed since the last cleanup, it runs
periodic_cleanup
touch "$periodic_cleanup_file"
The mechanism is a marker file named periodic_cleanup under /var/lib/ctdb/state/samba/. Each time periodic_cleanup finishes, the file is touched. On every run of 50.samba monitor, the script checks whether the file's modification time is more than 10 minutes old; if so, it runs the next periodic_cleanup, otherwise it skips cleanup for this round.
[ `find "$periodic_cleanup_file" -mmin +$SAMBA_CLEANUP_PERIOD | wc -l` -eq 1 ]
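The find -mmin trick is easy to try on its own. Here is a small, standalone illustration (the file path is arbitrary, and SAMBA_CLEANUP_PERIOD=10 is assumed, matching the 10-minute default described above):
SAMBA_CLEANUP_PERIOD=10
periodic_cleanup_file=/tmp/periodic_cleanup_demo
[ -f "$periodic_cleanup_file" ] || touch "$periodic_cleanup_file"
# "find -mmin +N" prints the file only when its mtime is more than N minutes
# old, so "wc -l" is 1 exactly when the last cleanup was over 10 minutes ago
if [ `find "$periodic_cleanup_file" -mmin +$SAMBA_CLEANUP_PERIOD | wc -l` -eq 1 ] ; then
    echo "more than $SAMBA_CLEANUP_PERIOD minutes since the last cleanup, run it again"
    touch "$periodic_cleanup_file"
else
    echo "cleaned up recently, skip this round"
fi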
So, after all that, what does periodic_cleanup actually do?
# periodic cleanup function
periodic_cleanup() {
# running smbstatus scrubs any dead entries from the connections
# and sessionid database
# echo "Running periodic cleanup of samba databases"
smbstatus -np > /dev/null 2>&1 &
}
Honestly, it is not obvious at first why running smbstatus -np counts as cleanup, since smbstatus is normally used to display SMB connections. The comment in the code gives the answer: traversing the connection and session databases with smbstatus scrubs any dead entries from them as a side effect (the output itself is simply discarded).
Task 2: check the smb configuration file and ports
# Is smb managed by CTDB? For this setup, unquestionably yes.
is_ctdb_managed_service "samba" && {
# Should the samba share check be skipped? In our configuration it is not.
[ "$CTDB_SAMBA_SKIP_SHARE_CHECK" = "yes" ] || {
testparm_background_update
testparm_cat | egrep '^WARNING|^ERROR|^Unknown' && {
testparm_foreground_update
testparm_cat | egrep '^WARNING|^ERROR|^Unknown' && {
echo "ERROR: testparm shows smb.conf is not clean"
exit 1
}
}
list_samba_shares |
ctdb_check_directories_probe || {
testparm_foreground_update
list_samba_shares |
ctdb_check_directories
} || exit $?
}
smb_ports="$CTDB_SAMBA_CHECK_PORTS"
[ -z "$smb_ports" ] && {
smb_ports=`testparm_cat --parameter-name="smb ports"`
}
ctdb_check_tcp_ports $smb_ports || exit $?
}
Checking the configuration file
testparm_background_update runs testparm in the background to check /etc/samba/smb.conf and writes the cleaned-up, effective configuration to /var/lib/ctdb/state/samba/smb.conf.cache.
If the background process cannot finish the testparm check within 10 seconds, it is killed and the update is considered failed; in practice this rarely fails, even when individual parameters are wrong.
Once the cache is updated, testparm_cat is used to look for lines matching '^WARNING|^ERROR|^Unknown'. If any are found, testparm is rerun in the foreground to regenerate smb.conf.cache and the check is repeated; if the patterns still match, the script exits with an error, marking this monitor event as failed. In practice this is quite rare: I once deliberately broke smb.conf, changing readonly = yes to readonly = fuck, and 50.samba monitor still reported OK; the only effect was that readonly = yes no longer appeared in smb.conf.cache.
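A rough approximation of what these helpers boil down to (a hedged sketch, not the real implementation in functions, which manages the background process and the 10-second kill itself):
# regenerate the canonicalised config cache, giving testparm at most 10 seconds
timeout 10 testparm -s /etc/samba/smb.conf > /var/lib/ctdb/state/samba/smb.conf.cache \
    || echo "testparm failed or did not finish within 10 seconds"
# testparm_cat then works off the cache instead of re-parsing smb.conf every time
testparm -s /var/lib/ctdb/state/samba/smb.conf.cache 2>/dev/null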
Checking that the shared folders exist
list_samba_shares |
ctdb_check_directories_probe || {
testparm_foreground_update
list_samba_shares |
ctdb_check_directories
} || exit $?
First, list_samba_shares lists all of the exported shared folders:
list_samba_shares ()
{
testparm_cat |
sed -n -e 's@^[[:space:]]*path[[:space:]]*=[[:space:]]@@p' |
sed -e 's/"//g'
}
Exported samba shares always appear in the configuration in the form path = /XXX/YYY, so list_samba_shares simply looks for the path lines to obtain the list of directories. ctdb_check_directories_probe then uses -d to check that each directory exists; if any directory is missing, it returns 1 to signal failure.
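For example, given a hypothetical share section like the one in the comment below, the sed pipeline reduces it to just the path value. The directory check itself is done by the two helpers that follow.
# hypothetical share section in smb.conf:
#   [data]
#       path = "/export/data"
#       read only = no
echo '        path = "/export/data"' |
    sed -n -e 's@^[[:space:]]*path[[:space:]]*=[[:space:]]@@p' |
    sed -e 's/"//g'
# -> /export/data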
ctdb_check_directories_probe() {
while IFS="" read d ; do
case "$d" in
*%*)
# paths containing "%" use samba substitution variables (e.g. %U) and cannot
# be checked statically, so they are skipped
continue
;;
*)
[ -d "${d}/." ] || return 1
esac
done
}
ctdb_check_directories() {
n="${1:-${service_name}}"
ctdb_check_directories_probe || {
echo "ERROR: $n directory \"$d\" not available"
exit 1
}
}
Checking the ports used by samba
smb_ports="$CTDB_SAMBA_CHECK_PORTS"
[ -z "$smb_ports" ] && {
smb_ports=`testparm_cat --parameter-name="smb ports"`
}
ctdb_check_tcp_ports $smb_ports || exit $?
This is equivalent to running:
testparm -s /var/lib/ctdb/state/samba/smb.conf.cache '--parameter-name=smb ports' 2>/dev/null
445 139
The two ports occupied by samba are 445 and 139:
root@controller03:/var/lib/ctdb/state/samba# netstat -antp |grep smb
tcp 0 0 0.0.0.0:445 0.0.0.0:* LISTEN 211760/smbd
tcp 0 0 0.0.0.0:139 0.0.0.0:* LISTEN 211760/smbd
The port check itself is performed by the ctdb_check_tcp_ports function, implemented in functions.
ctdb_check_tcp_ports()
{
if [ -z "$1" ] ; then
echo "INTERNAL ERROR: ctdb_check_tcp_ports - no ports specified"
exit 1
fi
# Set default value for CTDB_TCP_PORT_CHECKERS if unset.
# If any of these defaults are unsupported then this variable can
# be overridden in /etc/sysconfig/ctdb or via a file in
# /etc/ctdb/rc.local.d/.
: ${CTDB_TCP_PORT_CHECKERS:=ctdb nmap netstat}
for _c in $CTDB_TCP_PORT_CHECKERS ; do
ctdb_check_tcp_ports_$_c "$@"
case "$?" in
0)
_ctdb_check_tcp_common
rm -f "$_ctdb_service_started_file"
return 0
;;
1)
_ctdb_check_tcp_common
if [ ! -f "$_ctdb_service_started_file" ] ; then
echo "ERROR: $service_name tcp port $_p is not responding"
debug <<EOF
$ctdb_check_tcp_ports_debug
EOF
else
echo "INFO: $service_name tcp port $_p is not responding"
fi
return 1
;;
127)
debug <<EOF
ctdb_check_ports - checker $_c not implemented
output from checker was:
$ctdb_check_tcp_ports_debug
EOF
;;
*)
esac
done
echo "INTERNAL ERROR: ctdb_check_ports - no working checkers in CTDB_TCP_PORT_CHECKERS=\"$CTDB_TCP_PORT_CHECKERS\"
return 127
}
functions implements three checker functions for deciding whether a port is already in use:
ctdb_check_tcp_ports_ctdb
ctdb_check_tcp_ports_nmap
ctdb_check_tcp_ports_netstat
These checkers are tried in turn to determine whether the ports are occupied. If a port is free, that is, the checker can happily bind to it, then nothing is listening there and the samba process must be unhealthy. For samba, the first checker, ctdb_check_tcp_ports_ctdb, is the one that ends up being used:
ctdb_check_tcp_ports_ctdb ()
{
for _p ; do # process each function argument (port)
_cmd="ctdb checktcpport $_p"
_out=$($_cmd 2>&1)
_ret=$?
case "$_ret" in
0)
ctdb_check_tcp_ports_debug="\"$_cmd\" was able to bind to port"
return 1
;;
98)
# Couldn't bind, something already listening, next port...
continue
;;
*)
ctdb_check_tcp_ports_debug="$_cmd (exited with $_ret) with output:
$_out"
# assume not implemented
return 127
esac
done
return 0
}
This runs the ctdb checktcpport 445 command. Note that a return code of 98 means the bind failed because something is already listening on the port, which is exactly what we want from a healthy service, so the loop continues to the next port and eventually returns 0. If ctdb checktcpport 445 returns 0, the bind succeeded because nothing is using port 445; that actually means the samba process is gone, so ctdb_check_tcp_ports_ctdb returns 1.
The remaining case is that the ctdb checktcpport command itself is not available, in which case the checker returns 127 and ctdb_check_tcp_ports moves on to try ctdb_check_tcp_ports_nmap and ctdb_check_tcp_ports_netstat; I won't go through those here.
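The return-code convention is easy to verify by hand on a node where smbd is running:
ctdb checktcpport 445
echo $?
# 98 while smbd is listening (the bind fails, which is the healthy case);
# 0 would mean ctdbd managed to bind, i.e. nothing is listening on 445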
If we stop samba with /etc/init.d/smbd stop:
50.samba Status:ERROR Duration:0.136 Mon May 8 17:53:00 2017
OUTPUT:ERROR: samba tcp port 445 is not responding
Task 3: checking winbind
check_ctdb_manages_winbind && {
ctdb_check_command "winbind" "wbinfo -p"
}
The third task of monitor is to check whether winbind is healthy. Put simply, it runs the wbinfo -p command and checks that it returns successfully. According to the man page, wbinfo -p means:
-p|--ping
Check whether winbindd(8) is still alive. Prints out either 'succeeded' or 'failed'.
In other words, it pings the winbindd process to see whether it is still alive:
root@controller01:~# wbinfo -p
Ping to winbindd succeeded
root@controller01:~# echo $?
0
If we stop winbind with the /etc/init.d/winbind stop command, we see the following output:
50.samba Status:ERROR Duration:0.061 Mon May 8 17:57:08 2017
OUTPUT:ERROR: winbind - wbinfo -p returned error
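For completeness, ctdb_check_command is another small helper in functions. Judging from the error format above, it behaves roughly like the hedged sketch below (sketch_ctdb_check_command is my own name; this is not the verbatim implementation):
# run the given command; if it fails, report an error in the format seen in
# "ctdb scriptstatus" and make the whole monitor event fail
sketch_ctdb_check_command() {
    _name="$1"
    _cmd="$2"
    _out=`$_cmd 2>&1` || {
        echo "ERROR: $_name - $_cmd returned error"
        echo "$_out"
        exit 1
    }
}
sketch_ctdb_check_command "winbind" "wbinfo -p"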
Wrap-up
With that, we have walked through 50.samba monitor from start to finish, and along the way become familiar with a number of the helper functions in functions.