Preface
A previous article, CTDB 之 重启网络虚IP消失以后, described what happens when a virtual IP disappears for some reason: how CTDB detects the loss, how it initiates a takeover of the virtual IP, and how the physical node originally responsible for it takes it back.
This article looks at the other case: when the physical network interface used by CTDB goes down, how does CTDB detect it, and, if the node holds virtual IPs, how are they taken over by the other nodes?
Test Method
Note that the public interface our CTDB uses is bond0:
/usr/sbin/ctdbd --reclock=/var/share/ezfs/ctdb/recovery.lock --public-addresses=/etc/ctdb/public_addresses --public-interface=bond0 -d ERR
In my setup, bond0 contains only a single slave NIC:
root@BEAN-1:/var/log/ctdb# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 7
Permanent HW addr: 00:0c:29:9b:e2:be
Slave queue ID: 0
Now we bring down CTDB's public interface with ifconfig eth0 down and watch how CTDB reacts.
The command line used is:
onnode all ctdb setdebug INFO ; sleep 2 ; date ; ifconfig eth0 down ; for i in {1..10} ; do date && ctdb status && ctdb ip&&sleep 1 ; done ; onnode all ctdb setdebug NOTICE
Note that after bringing the NIC down, we run ctdb status and ctdb ip once per second. The goal is to measure how long it takes the CTDB cluster to detect the event and mark the node with the downed interface as UNHEALTHY, and how long the virtual IP takeover takes.
The output is as follows:
>> NODE: 10.11.12.3 <<
>> NODE: 10.11.12.2 <<
>> NODE: 10.11.12.1 <<
Thu Jun 22 14:27:07 CST 2017
Thu Jun 22 14:27:07 CST 2017
Number of nodes:3
pnn:0 10.11.12.3 OK
pnn:1 10.11.12.2 OK
pnn:2 10.11.12.1 OK (THIS NODE)
Generation:1562912436
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
Public IPs on node 2
172.16.146.193 2
172.16.146.192 1
172.16.146.191 0
...
Thu Jun 22 14:27:16 CST 2017
Number of nodes:3
pnn:0 10.11.12.3 OK
pnn:1 10.11.12.2 OK
pnn:2 10.11.12.1 OK (THIS NODE)
Generation:1562912436
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
Public IPs on node 2
172.16.146.193 2
172.16.146.192 1
172.16.146.191 0
Thu Jun 22 14:27:17 CST 2017
Number of nodes:3
pnn:0 10.11.12.3 OK
pnn:1 10.11.12.2 OK
pnn:2 10.11.12.1 UNHEALTHY (THIS NODE)
Generation:1562912436
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
Public IPs on node 2
172.16.146.193 2
172.16.146.192 1
172.16.146.191 0
Thu Jun 22 14:27:25 CST 2017
Number of nodes:3
pnn:0 10.11.12.3 OK
pnn:1 10.11.12.2 OK
pnn:2 10.11.12.1 UNHEALTHY (THIS NODE)
Generation:1562912436
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
Public IPs on node 2
172.16.146.193 0
172.16.146.192 1
172.16.146.191 0
...
Thu Jun 22 14:27:31 CST 2017
Number of nodes:3
pnn:0 10.11.12.3 OK
pnn:1 10.11.12.2 OK
pnn:2 10.11.12.1 UNHEALTHY (THIS NODE)
Generation:1562912436
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
Public IPs on node 2
172.16.146.193 0
172.16.146.192 1
172.16.146.191 0
>> NODE: 10.11.12.3 <<
>> NODE: 10.11.12.2 <<
>> NODE: 10.11.12.1 <<
Note that in the output above, eth0 was brought down at Thu Jun 22 14:27:07 CST 2017, but it was not until Thu Jun 22 14:27:17 CST 2017, ten seconds later, that the CTDB cluster marked the node UNHEALTHY. At Thu Jun 22 14:27:16 CST 2017 the node was still reported OK; the downed NIC had not yet been detected.
How long is this delay in general, and what mechanism does CTDB rely on to detect that the public interface is down? These are the questions addressed below.
Detection Mechanism
Opening CTDB's log, we can see the following messages:
2017/06/22 14:27:17.465327 [261821]: server/eventscript.c:762 Starting eventscript monitor
2017/06/22 14:27:17.502480 [261821]: 10.interface: ERROR: No active slaves for bond device bond0
2017/06/22 14:27:17.503291 [261821]: iface[bond0] has changed it's link status up => down
2017/06/22 14:27:17.503791 [261821]: server/eventscript.c:485 Eventscript monitor finished with state 1
2017/06/22 14:27:17.503815 [261821]: monitor event failed - disabling node
2017/06/22 14:27:17.503825 [261821]: Node became UNHEALTHY. Ask recovery master 0 to perform ip reallocation
Detecting that the public interface is down relies on the monitor mechanism of the eventscripts. In the article CTDB 中 eventscript功能的集成, we described how the ctdb_check_health function, during normal operation, periodically runs the monitor event of every script under events.d; among these scripts, 10.interface is responsible for checking the health of the network interfaces.
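To see from the command line which eventscript failed during the most recent monitor run, the scriptstatus command is usually enough. A hedged example; the exact syntax and output differ between CTDB versions, and newer releases expose the same information as ctdb event status:
# show the per-script result of the most recent monitor event
ctdb scriptstatus monitor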
Note that during normal operation this eventscript mechanism checks on a 15-second cycle:
MonitorInterval = 15
Therefore, on average the interface has been down for 7.5 seconds before the failure is detected, and in the worst case detection takes a full 15 seconds.
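If a 15-second worst case is too slow for your deployment, the tunable can be inspected and lowered at runtime. A minimal sketch; note that setvar is not persistent across a ctdbd restart, so a permanent change usually goes into the CTDB configuration via the CTDB_SET_MonitorInterval mechanism:
# check the current monitor interval on this node (default 15 seconds)
ctdb getvar MonitorInterval
# lower it to 5 seconds on every node (runtime only, not persistent)
onnode all ctdb setvar MonitorInterval 5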
Next, let us analyze the monitor event of 10.interface.
The monitor behavior of 10.interface
monitor_interfaces()
{
get_all_interfaces
delete_unexpected_ips
fail=false
up_interfaces_found=false
for iface in $all_interfaces ;do
ip addr show $iface 2>/dev/null >/dev/null || {
echo "WARNING: Interface $iface does not exist but it is used by public addresses."
continue
}
# These interfaces are sometimes bond devices
# When we use VLANs for bond interfaces, there will only
# be an entry in /proc for the underlying real interface
# Note: the block below handles bond-related configurations
realiface=`echo $iface |sed -e 's/\..*$//'`
bi=$(get_proc "net/bonding/$realiface" 2>/dev/null) && {
echo "$bi" | grep -q 'Currently Active Slave: None' && {
echo "ERROR: No active slaves for bond device $realiface"
mark_down $iface
continue
}
echo "$bi" | grep -q '^MII Status: up' || {
echo "ERROR: public network interface $realiface is down"
mark_down $iface
continue
}
echo "$bi" | grep -q '^Bonding Mode: IEEE 802.3ad Dynamic link aggregation' && {
# This works around a bug in the driver where the
# overall bond status can be up but none of the actual
# physical interfaces have a link.
echo "$bi" | grep 'MII Status:' | tail -n +2 | grep -q '^MII Status: up' || {
echo "ERROR: No active slaves for 802.ad bond device $realiface"
mark_down $iface
continue
}
}
mark_up $iface
continue
}
# If the interface is not a bond device, fall through to the logic below
case $iface in
lo*)
# loopback is always working
mark_up $iface
;;
ib*)
# we dont know how to test ib links
mark_up $iface
;;
*)
# Use ethtool's "Link detected" field to decide whether the NIC is up
[ -z "$iface" ] || {
[ "$(basename $(readlink /sys/class/net/$iface/device/driver) 2>/dev/null)" = virtio_net ] ||
ethtool $iface | grep -q 'Link detected: yes' || {
# On some systems, this is not successful when a
# cable is plugged but the interface has not been
# brought up previously. Bring the interface up and
# try again...
ip link set $iface up
ethtool $iface | grep -q 'Link detected: yes' || {
echo "ERROR: No link on the public network interface $iface"
mark_down $iface
continue
}
}
mark_up $iface
}
;;
esac
done
# For a bond made of several NICs, consider the cases one by one:
# * All NICs are OK: $fail stays false and we simply return 0.
# * Some NIC is down: $fail is true, so we do not return here but fall through to the lines below.
# * If $up_interfaces_found is true, at least one interface in the bond is still usable;
#   when CTDB_PARTIALLY_ONLINE_INTERFACES is "yes" we still return 0,
#   otherwise we return 1, which marks this node UNHEALTHY.
$fail || return 0
$up_interfaces_found && \
[ "$CTDB_PARTIALLY_ONLINE_INTERFACES" = "yes" ] && \
return 0
return 1
}
In the bond case, the bond may contain several NICs, some of which may be up and usable while others are not. Should monitor_interfaces then return 0 or 1? That depends on the configuration variable CTDB_PARTIALLY_ONLINE_INTERFACES.
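This variable is an eventscript configuration option rather than a ctdb tunable. A hedged example of enabling it; the configuration file is typically /etc/sysconfig/ctdb on Red Hat-style systems or /etc/default/ctdb on Debian-style systems, depending on how CTDB was packaged:
# keep the node online as long as at least one slave of the bond is still up
CTDB_PARTIALLY_ONLINE_INTERFACES=yes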
This logic is not hard to follow. Stepping through it on our setup, which as usual has a bond, gives the following trace:
+ ctdb_check_args monitor
+ case "$1" in
+ case "$1" in
+ monitor_interfaces
+ get_all_interfaces
++ sed -e 's/^[^\t ]*[\t ]*//' -e 's/,/ /g' -e 's/[\t ]*$//' /etc/ctdb/public_addresses
+ all_interfaces=
+ '[' bond0 ']'
+ all_interfaces='bond0 '
+ '[' '' ']'
++ ctdb -Y ip -v
++ sed -e 1d -e 's/:[^:]*:$//' -e 's/^.*://' -e 's/,/ /g'
+ ctdb_ifaces='bond0
bond0
bond0'
++ echo bond0 bond0 bond0 bond0
++ sort -u
++ tr ' ' '\n'
+ all_interfaces=bond0
+ delete_unexpected_ips
+ '[' '' = yes ']'
+ return
+ fail=false
+ up_interfaces_found=false
+ for iface in '$all_interfaces'
+ ip addr show bond0
++ echo bond0
++ sed -e 's/\..*$//'
+ realiface=bond0
++ get_proc net/bonding/bond0
+ bi='Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:29:77:bd:53
Slave queue ID: 0'
+ echo 'Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:29:77:bd:53
Slave queue ID: 0'
+ grep -q 'Currently Active Slave: None'
+ echo 'Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:29:77:bd:53
Slave queue ID: 0'
+ grep -q '^MII Status: up'
+ echo 'Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:29:77:bd:53
Slave queue ID: 0'
+ grep -q '^Bonding Mode: IEEE 802.3ad Dynamic link aggregation'
+ mark_up bond0
+ up_interfaces_found=true
+ ctdb setifacelink bond0 up
+ continue
+ false
+ return 0
+ exit 0
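For reference, a trace like the one above can usually be reproduced by invoking the eventscript by hand with shell tracing enabled. This is only a sketch, assuming the scripts live under /etc/ctdb/events.d and that CTDB_BASE must point at /etc/ctdb so the script can find its helper functions:
# run only the monitor event of 10.interface, with execution tracing
CTDB_BASE=/etc/ctdb bash -x /etc/ctdb/events.d/10.interface monitor
echo $?    # 0 = healthy, non-zero = the node would be marked UNHEALTHY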
After the eventscripts detect a failure
Note that the periodic execution of the eventscripts monitor event registers a callback:
ret = ctdb_event_script_callback(ctdb,
ctdb->monitor->monitor_context, ctdb_health_callback,
ctdb, false,
CTDB_EVENT_MONITOR, "%s", "");
After the scripts finish, ctdb_health_callback is invoked:
static void ctdb_health_callback(struct ctdb_context *ctdb, int status, void *p)
{
struct ctdb_node *node = ctdb->nodes[ctdb->pnn];
TDB_DATA data;
struct ctdb_node_flag_change c;
uint32_t next_interval;
int ret;
TDB_DATA rddata;
struct takeover_run_reply rd;
const char *state_str = NULL;
c.pnn = ctdb->pnn;
c.old_flags = node->flags;
rd.pnn = ctdb->pnn;
rd.srvid = CTDB_SRVID_TAKEOVER_RUN_RESPONSE;
rddata.dptr = (uint8_t *)&rd;
rddata.dsize = sizeof(rd);
if (status == -ECANCELED) {
DEBUG(DEBUG_ERR,("Monitoring event was cancelled\n"));
goto after_change_status;
}
if (status == -ETIME) {
ctdb->event_script_timeouts++;
if (ctdb->event_script_timeouts >= ctdb->tunable.script_timeout_count) {
DEBUG(DEBUG_ERR, ("Maximum timeout count %u reached for eventscript. Making node unhealthy\n", ctdb->tunable.script_timeout_count));
} else {
/* We pretend this is OK. */
goto after_change_status;
}
}
/* The node was healthy before, but this run returned non-zero: set NODE_FLAGS_UNHEALTHY on the node. */
if (status != 0 && !(node->flags & NODE_FLAGS_UNHEALTHY)) {
DEBUG(DEBUG_NOTICE,("monitor event failed - disabling node\n"));
node->flags |= NODE_FLAGS_UNHEALTHY;
ctdb->monitor->next_interval = 5;
ctdb_run_notification_script(ctdb, "unhealthy");
} else if (status == 0 && (node->flags & NODE_FLAGS_UNHEALTHY)) {
/* The node was unhealthy before and this run succeeded: clear NODE_FLAGS_UNHEALTHY. */
DEBUG(DEBUG_NOTICE,("monitor event OK - node re-enabled\n"));
node->flags &= ~NODE_FLAGS_UNHEALTHY;
ctdb->monitor->next_interval = 5;
ctdb_run_notification_script(ctdb, "healthy");
}
after_change_status:
next_interval = ctdb->monitor->next_interval;
ctdb->monitor->next_interval *= 2;
if (ctdb->monitor->next_interval > ctdb->tunable.monitor_interval) {
ctdb->monitor->next_interval = ctdb->tunable.monitor_interval;
}
/* Schedule the next ctdb_check_health run. */
event_add_timed(ctdb->ev, ctdb->monitor->monitor_context,
timeval_current_ofs(next_interval, 0),
ctdb_check_health, ctdb);
/* If the old and new flags are identical, nothing has changed, so return immediately. */
if (c.old_flags == node->flags) {
return;
}
c.new_flags = node->flags;
data.dptr = (uint8_t *)&c;
data.dsize = sizeof(c);
/* ask the recovery daemon to push these changes out to all nodes */
ctdb_daemon_send_message(ctdb, ctdb->pnn,
CTDB_SRVID_PUSH_NODE_FLAGS, data);
if (c.new_flags & NODE_FLAGS_UNHEALTHY) {
state_str = "UNHEALTHY";
} else {
state_str = "HEALTHY";
}
/* ask the recmaster to reallocate all addresses */
DEBUG(DEBUG_ERR,("Node became %s. Ask recovery master %u to perform ip reallocation\n",
state_str, ctdb->recovery_master));
/* Note: when the node's state changes, the recovery master must be told to start a takeover run and re-assign the virtual IPs. */
ret = ctdb_daemon_send_message(ctdb, ctdb->recovery_master, CTDB_SRVID_TAKEOVER_RUN, rddata);
if (ret != 0) {
DEBUG(DEBUG_ERR,(__location__ " Failed to send ip takeover run request message to %u\n", ctdb->recovery_master));
}
}
From the discussion above it is clear that whether a node goes from HEALTHY to UNHEALTHY or from UNHEALTHY back to HEALTHY, the change is promptly pushed to the other nodes, and a message is sent to the recovery master asking it to start a takeover run.
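Before jumping to the recovery master it helps to know which node that is; a hedged example of locating it and confirming that the flag change has propagated to every node:
# print the pnn of the current recovery master
ctdb recmaster
# verify that every node now sees pnn:2 as UNHEALTHY
onnode all ctdb status | grep UNHEALTHY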
Switching to the recovery master node, we can see the following log:
2017/06/22 14:27:19.630180 [recoverd:98081]: recovery master forced ip reallocation
2017/06/22 14:27:19.636677 [recoverd:98081]: 172.16.146.193 -> 0 [+14641]
2017/06/22 14:27:19.732740 [97940]: server/ctdb_takeover.c:152 public address '172.16.146.193' now assigned to iface 'bond0' refs[-1]
2017/06/22 14:27:19.732781 [97940]: Takeover of IP 172.16.146.193/24 on interface bond0
2017/06/22 14:27:19.732794 [97940]: Monitoring event was cancelled
2017/06/22 14:27:19.732806 [97940]: server/eventscript.c:584 Sending SIGTERM to child pid:358288
2017/06/22 14:27:19.732841 [97940]: server/eventscript.c:762 Starting eventscript takeip bond0 172.16.146.193 24
2017/06/22 14:27:19.733316 [97940]: server/ctdb_takeover.c:163 public address '172.16.146.191' now unassigned (old iface 'bond0' refs[-1])
2017/06/22 14:27:19.733344 [97940]: server/ctdb_takeover.c:152 public address '172.16.146.191' now assigned to iface 'bond0' refs[-1]
The recovery master recalculated the IP layout and assigned 172.16.146.193, the address held by the node whose public interface went down, to the node with pnn 0, which happens to be the recovery master itself. Later in the log we can indeed see that node performing Takeover of IP 172.16.146.193.
We will not go into the takeover logic here; a separate article on takeover is planned.
In summary, once a NIC breaks and the public interface stops working, the latency of the virtual IP migration is dominated by how long it takes to notice that the interface is down, and that in turn is governed by MonitorInterval, 15 seconds by default. In the test above, detection added about 2 seconds for flag propagation and IP reallocation (down at 14:27:07, detected at 14:27:17, takeover at 14:27:19), so the average end-to-end delay is roughly 7 to 9 seconds and the worst case roughly 15 to 17 seconds.