The lease mechanism in ceph-mon

| Category: ceph-internal | Tags: ceph-internal

Preface

ceph-mon is responsible for many things:

  • startup
  • data store
  • data sync
  • data check
  • scrub
  • leader elect
  • timecheck
  • lease
  • paxos
  • paxos service
  • consistency

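This article covers the lease mechanism.

ceph-osd daemons run a heartbeat mechanism among themselves:

osd_heartbeat_interval   (default: 6 seconds)
osd_heartbeat_grace (default: 20 seconds)

In other words, OSD peers form a network in which they monitor one another. Each OSD sends a heartbeat to its peers every 6 seconds, and if it receives no heartbeat from a peer OSD within osd_heartbeat_grace, it calls send_failure to report that the peer has failed.

This mechanism ensures that ceph-mon finds out when an OSD exits abnormally or becomes unreachable over the network.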
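The grace check described above can be sketched in a few lines. This is a minimal illustration, not the OSD code: `peer_looks_failed` is a hypothetical helper, and whether the real comparison is strict at the exact boundary is an assumption.

```cpp
#include <chrono>

using Seconds = std::chrono::seconds;

// Hypothetical helper (not part of Ceph): returns true when the peer has been
// silent for longer than the grace period, i.e. when the observing OSD would
// report the peer as failed. The 20 second default mirrors osd_heartbeat_grace.
bool peer_looks_failed(Seconds last_heartbeat_rx, Seconds now,
                       Seconds grace = Seconds(20)) {
  return now - last_heartbeat_rx > grace;
}
```

With the default grace, a peer last heard from at t = 100 is still considered alive at t = 120 and is reported at t = 121.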

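When a cluster runs several ceph-mon daemons, one of them is the leader and the others are peons. A ceph-mon process can also die unexpectedly or lose network connectivity, so there must be a mechanism that detects this promptly. That mechanism is the lease.

Internally, the monitors use a lease protocol. It guarantees that the replicated data is readable and writeable for a bounded period of time (writes must go through the leader), and it also serves to detect a failed monitor and trigger a new election.

The leader periodically sends lease messages to extend each peon's lease. If a peon goes down, the leader no longer receives its lease_ack message, and once the timeout expires a new election is held.

By the same token, the leader itself may crash, so the peons must be able to watch the leader too. If the leader goes down, the peons stop receiving lease renewals from it, and after the timeout they also call an election.

Several parameters are involved:

  • How often a lease message is sent: mon_lease_renew_interval, default 3 seconds
  • How long each renewal extends the lease: mon_lease, default 5 seconds
  • How long to wait before timing out and calling a new election: mon_lease_ack_timeout, default 10 seconds

mon_lease_ack_timeout applies to both the monitor leader and the peons. On the leader, if the lease acks from all peons have not been collected within mon_lease_ack_timeout, the round is considered timed out and bootstrap is called to start a new election. On a peon, if no new lease message arrives within mon_lease_ack_timeout, that is also a timeout, and the peon starts a new election as well.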
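To make the interplay of the three settings concrete, the following sketch computes, for a lease sent at time `now`, when the lease expires, when the leader sends the next renewal, and when the leader's ack timeout would fire. `LeaseConfig`, `LeaseRound` and `plan_round` are hypothetical names used only for illustration; they are not Ceph types. The sketch also assumes the ack timer is armed in this same round.

```cpp
// Illustrative only: the three lease settings with the defaults quoted above,
// all expressed in seconds.
struct LeaseConfig {
  double mon_lease = 5.0;
  double mon_lease_renew_interval = 3.0;
  double mon_lease_ack_timeout = 10.0;
};

// The three points in time that one lease round produces.
struct LeaseRound {
  double lease_expire;  // the lease is valid until this time
  double next_renew;    // the leader sends the next OP_LEASE at this time
  double ack_deadline;  // the leader calls an election if acks are still missing
};

// Hypothetical helper: derive the timeline of a round started at `now`.
LeaseRound plan_round(double now, const LeaseConfig& cfg = LeaseConfig()) {
  LeaseRound r;
  r.lease_expire = now + cfg.mon_lease;
  r.next_renew = now + cfg.mon_lease_renew_interval;
  r.ack_deadline = now + cfg.mon_lease_ack_timeout;
  return r;
}
```

With the defaults and now = 100, the lease expires at 105, the next renewal goes out at 103 (2 seconds before expiry), and the ack deadline is 110.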

Side A: the leader

Let us first look at the lease operations from the leader's point of view. The starting point of the lease logic is the extend_lease function:

void Paxos::extend_lease()
{
  assert(mon->is_leader());
  //assert(is_active());

  /* new lease_expire = current time + mon_lease (5 seconds by default) */
  lease_expire = ceph_clock_now(g_ceph_context);
  lease_expire += g_conf->mon_lease;
  
  /* clear acked_lease, the set of acks received so far, and add the
   * current mon leader to it */
  acked_lease.clear();
  acked_lease.insert(mon->rank);

  dout(7) << "extend_lease now+" << g_conf->mon_lease 
          << " (" << lease_expire << ")" << dendl;

  // bcast
  /* send an OP_LEASE message carrying lease_expire to every peon */
  for (set<int>::const_iterator p = mon->get_quorum().begin();
      p != mon->get_quorum().end(); ++p) {
    if (*p == mon->rank) continue;
    MMonPaxos *lease = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_LEASE,
                                     ceph_clock_now(g_ceph_context));
    lease->last_committed = last_committed;
    lease->lease_timestamp = lease_expire;
    lease->first_committed = first_committed;
    mon->messenger->send_message(lease, mon->monmap->get_inst(*p));
  }

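  /* Register the ack timeout event. If the acks have not all been collected
   * within the allowed time (10 seconds by default), the timeout handler
   * defined in C_LeaseAckTimeout runs.
   * Normally this timer event is cancelled when the last ack arrives and never fires;
   * Paxos::lease_ack_timeout only runs when something has gone wrong. */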
  if (!lease_ack_timeout_event) {
    lease_ack_timeout_event = new C_LeaseAckTimeout(this);
    mon->timer.add_event_after(g_conf->mon_lease_ack_timeout, 
                               lease_ack_timeout_event);
  }

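  /* extend_lease has to keep running round after round, so register C_LeaseRenew,
   * the timer event for the next extend_lease call. It fires 3 seconds from now
   * (mon_lease_renew_interval) and, in normal operation, always starts the next round. */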
  lease_renew_event = new C_LeaseRenew(this);
  utime_t at = lease_expire;
  at -= g_conf->mon_lease;
  at += g_conf->mon_lease_renew_interval;
  mon->timer.add_event_at(at, lease_renew_event);
}

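After sending the messages, the mon leader waits for the lease ack messages from the peons. When one arrives, the monitor leader dispatches it to handle_lease_ack: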

void Paxos::dispatch(PaxosServiceMessage *m)
{
    switch (m->get_type()) {  
    case MSG_MON_PAXOS:
    {                                                
      MMonPaxos *pm = (MMonPaxos*)m;
      // NOTE: these ops are defined in messages/MMonPaxos.h
      switch (pm->op) {
      ...
        case MMonPaxos::OP_LEASE_ACK:
          handle_lease_ack(pm);
          break;
      }
     ...
    }
    ...
}

void Paxos::handle_lease_ack(MMonPaxos *ack)
{
  int from = ack->get_source().num();

  if (!lease_ack_timeout_event) {
    dout(10) << "handle_lease_ack from " << ack->get_source() 
             << " -- stray (probably since revoked)" << dendl;
  }
  else if (acked_lease.count(from) == 0) {
    acked_lease.insert(from);
    
    if (acked_lease == mon->get_quorum()) {
      // the last peon's ack has arrived as well, so nothing timed out: cancel lease_ack_timeout_event
      dout(10) << "handle_lease_ack from " << ack->get_source() 
               << " -- got everyone" << dendl;
      mon->timer.cancel_event(lease_ack_timeout_event);
      lease_ack_timeout_event = 0;
    } else {
      /* not the last peon's ack: log it and do nothing else */
      dout(10) << "handle_lease_ack from " << ack->get_source() 
               << " -- still need "
               << mon->get_quorum().size() - acked_lease.size()
               << " more" << dendl;
    }
  } else {
    /* Peons that have already acked are recorded in the acked_lease set. If this
     * peon's ack has already been received, the message is a duplicate: ignore it. */
    dout(10) << "handle_lease_ack from " << ack->get_source()
             << " dup (lagging!), ignoring" << dendl;
  }
  warn_on_future_time(ack->sent_timestamp, ack->get_source());
  
  ack->put();
}

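On the monitor leader, extend_lease fires once every mon_lease_renew_interval seconds (3 by default). In this function the leader sends a lease message to every peon and then arms the C_LeaseAckTimeout timer event. If all the lease acks are collected within mon_lease_ack_timeout, all is well and the C_LeaseAckTimeout event is cancelled.

What if mon_lease_ack_timeout passes and the lease acks are still incomplete? Then lease_ack_timeout runs and calls bootstrap to start an election.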

class C_LeaseAckTimeout : public Context {
    Paxos *paxos;
  public:
    C_LeaseAckTimeout(Paxos *p) : paxos(p) {}
    void finish(int r) { 
      if (r == -ECANCELED)
        return;
      paxos->lease_ack_timeout();
    }                                                                                                                                                  
};
  
void Paxos::lease_ack_timeout()                                                    
{   
  dout(1) << "lease_ack_timeout -- calling new election" << dendl;
  assert(mon->is_leader());
  assert(is_active());
  logger->inc(l_paxos_lease_ack_timeout);
  lease_ack_timeout_event = 0;
  /* bootstrap starts a new monitor leader election */
  mon->bootstrap();
} 
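The leader-side bookkeeping walked through above can be condensed into a small model. This is a simplified sketch, not the real Paxos class: `LeaderLeaseTracker` and its methods are hypothetical names, time is a plain number of seconds instead of a timer, and a quorum of more than one monitor is assumed. One detail it tries to reproduce is that extend_lease only arms the timeout when none is pending, so a round that never completes keeps its original deadline even though later rounds clear the ack set.

```cpp
#include <set>
#include <utility>

// Simplified model of the leader's lease-ack tracking (illustration only).
class LeaderLeaseTracker {
 public:
  LeaderLeaseTracker(std::set<int> quorum, int my_rank, double ack_timeout = 10.0)
      : quorum_(std::move(quorum)), my_rank_(my_rank), ack_timeout_(ack_timeout) {}

  // Models extend_lease(): reset the ack set to just the leader and arm the
  // timeout only if no timeout is already pending.
  void extend_lease(double now) {
    acked_.clear();
    acked_.insert(my_rank_);
    if (!timeout_armed_) {
      timeout_armed_ = true;
      deadline_ = now + ack_timeout_;
    }
  }

  // Models handle_lease_ack(): returns true on the ack that completes the set.
  bool handle_ack(int from) {
    if (!timeout_armed_) return false;  // stray ack, nothing is pending
    acked_.insert(from);                // a duplicate ack changes nothing
    if (acked_ == quorum_) {
      timeout_armed_ = false;           // everyone answered: cancel the timeout
      return true;
    }
    return false;
  }

  // Models the C_LeaseAckTimeout event firing.
  bool should_call_election(double now) const {
    return timeout_armed_ && now >= deadline_;
  }

 private:
  std::set<int> quorum_;
  std::set<int> acked_;
  int my_rank_;
  double ack_timeout_;
  double deadline_ = 0.0;
  bool timeout_armed_ = false;
};
```

For a quorum {0, 1, 2} led by rank 0, a round started at t = 100 in which only rank 1 answers still has its deadline at t = 110 after a second round starts at t = 103, because the pending timeout is not re-armed.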

Side B: the peon

For a peon, the story begins when it receives an OP_LEASE message:

void Paxos::handle_lease(MMonPaxos *lease)                                                   
{
  // sanity
  if (!mon->is_peon() ||
      last_committed != lease->last_committed) {
    dout(10) << "handle_lease i'm not a peon, or they're not the leader,"
             << " or the last_committed doesn't match, dropping" << dendl;
    lease->put();
    return;
  }
  warn_on_future_time(lease->sent_timestamp, lease->get_source());

  /* extend the lease to the time specified by the mon leader */
  if (lease_expire < lease->lease_timestamp) {
    lease_expire = lease->lease_timestamp;
    utime_t now = ceph_clock_now(g_ceph_context);
    /* if the clocks of the peon and the monitor leader differ too much, lease_expire is already earlier than now: log a warning */
    if (lease_expire < now) {
      utime_t diff = now - lease_expire;
      derr << "lease_expire from " << lease->get_source_inst() << " is " << diff << " seconds in the past; mons are probably laggy (or possibly clocks are too skewed)" << dendl; 
    }
  }

  state = STATE_ACTIVE;

  /* send an OP_LEASE_ACK message to the mon leader */
  dout(10) << "handle_lease on " << lease->last_committed
           << " now " << lease_expire << dendl;
  MMonPaxos *ack = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_LEASE_ACK,
                                 ceph_clock_now(g_ceph_context));
  ack->last_committed = last_committed;
  ack->first_committed = first_committed;
  ack->lease_timestamp = ceph_clock_now(g_ceph_context);
  lease->get_connection()->send_message(ack);

  // (re)set timeout event.
  reset_lease_timeout();

  // kick waiters
  finish_contexts(g_ceph_context, waiting_for_active);
  if (is_readable())
    finish_contexts(g_ceph_context, waiting_for_readable);

  lease->put();
}  

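As mentioned earlier, the mon leader and the peons watch each other. The peon's supervision of the monitor leader lives in reset_lease_timeout. Starting from the moment an OP_LEASE message is received, it registers a timer event with a timeout of mon_lease_ack_timeout. If that timer fires, no OP_LEASE message has arrived during the past mon_lease_ack_timeout seconds, and it is fairly safe to conclude that something is wrong with the mon leader.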

void Paxos::reset_lease_timeout()
{
  dout(20) << "reset_lease_timeout - setting timeout event" << dendl;
  /* Cancel the current timer event first.
   * In practice this event is almost always cancelled, because a healthy peon
   * keeps receiving OP_LEASE messages every 3 seconds.
   */
  if (lease_timeout_event)
    mon->timer.cancel_event(lease_timeout_event);
  lease_timeout_event = new C_LeaseTimeout(this);                                            
  mon->timer.add_event_after(g_conf->mon_lease_ack_timeout, lease_timeout_event);
}

Through the C_LeaseTimeout timer event, the peon watches the monitor leader as well. If the leader fails to send OP_LEASE messages to extend the lease, the peon starts an election as follows:

  class C_LeaseTimeout : public Context {
    Paxos *paxos;
  public:
    C_LeaseTimeout(Paxos *p) : paxos(p) {}
    void finish(int r) {
      if (r == -ECANCELED)
        return;
      paxos->lease_timeout();
    }
  };
  
void Paxos::lease_timeout()
{
  dout(1) << "lease_timeout -- calling new election" << dendl;
  /* only a peon ever calls this function */
  assert(mon->is_peon());
  logger->inc(l_paxos_lease_timeout);
  lease_timeout_event = 0;
  /* call bootstrap to start an election */
  mon->bootstrap();
}
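The peon side amounts to a watchdog that is restarted by every OP_LEASE message. The sketch below models handle_lease together with reset_lease_timeout under simplifying assumptions: `PeonLeaseWatchdog` is a hypothetical class, time is a plain number of seconds, and the sanity checks and the ack reply are left out.

```cpp
// Simplified model of the peon's lease watchdog (illustration only).
class PeonLeaseWatchdog {
 public:
  explicit PeonLeaseWatchdog(double ack_timeout = 10.0)
      : ack_timeout_(ack_timeout) {}

  // Models handle_lease() + reset_lease_timeout(): the lease only ever moves
  // forward, and every received OP_LEASE restarts the watchdog from `now`.
  void on_lease(double now, double lease_timestamp) {
    if (lease_expire_ < lease_timestamp)
      lease_expire_ = lease_timestamp;
    deadline_ = now + ack_timeout_;
    armed_ = true;
  }

  // Models the C_LeaseTimeout event firing.
  bool should_call_election(double now) const {
    return armed_ && now >= deadline_;
  }

  double lease_expire() const { return lease_expire_; }

 private:
  double ack_timeout_;
  double lease_expire_ = 0.0;
  double deadline_ = 0.0;
  bool armed_ = false;
};
```

If the last OP_LEASE arrives at t = 103 and the leader then goes silent, this model reports an election as due from t = 113 onward.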

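Note that each renewal only pushes lease_expire to mon_lease (5 seconds) past the time the leader sent it, while the election timeout is mon_lease_ack_timeout (10 seconds). If renewals stop arriving, there is therefore a window in which the lease has already expired but the timeout has not yet triggered a new election. During this window the lease is invalid: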

bool Paxos::is_lease_valid()
{
  return ((mon->get_quorum().size() == 1)
      || (ceph_clock_now(g_ceph_context) < lease_expire));
}   

Note that during this window the monitor can be neither read nor written:

bool Paxos::is_readable(version_t v)
{
  bool ret;
  if (v > last_committed)
    ret = false;
  else
    ret =
      (mon->is_peon() || mon->is_leader()) &&
      (is_active() || is_updating() || is_writing()) &&
      last_committed > 0 &&           // must have a value
      (mon->get_quorum().size() == 1 ||  // alone, or
       is_lease_valid()); // have lease                                                                                                                
  dout(5) << __func__ << " = " << (int)ret
          << " - now=" << ceph_clock_now(g_ceph_context)
          << " lease_expire=" << lease_expire
          << " has v" << v << " lc " << last_committed
          << dendl;
  return ret;
}
bool Paxos::is_writeable()
{
  return
    mon->is_leader() &&
    is_active() &&
    is_lease_valid();
}  
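The window between lease expiry and re-election can be made explicit with a small classifier. This is a sketch under stated assumptions: `LeaseState` and `peon_lease_state` are hypothetical names, time is a plain number of seconds, and the single-monitor shortcut in is_lease_valid is ignored.

```cpp
// The three phases a peon passes through once renewals stop arriving.
enum class LeaseState {
  Valid,           // now < lease_expire: reads are allowed
  ExpiredWaiting,  // lease expired, but the timeout has not fired yet
  ElectionDue      // mon_lease_ack_timeout elapsed since the last OP_LEASE
};

// Hypothetical helper. last_lease_rx is when the peon last received OP_LEASE;
// lease_expire is the expiry time carried in that message.
LeaseState peon_lease_state(double now, double last_lease_rx,
                            double lease_expire, double ack_timeout = 10.0) {
  if (now >= last_lease_rx + ack_timeout)
    return LeaseState::ElectionDue;
  if (now < lease_expire)  // same comparison as is_lease_valid()
    return LeaseState::Valid;
  return LeaseState::ExpiredWaiting;
}
```

With a lease received at t = 100 that expires at t = 105, the peon is in the Valid state before 105, in ExpiredWaiting from 105 until just before 110, and in ElectionDue from 110 onward. With the default settings the unreadable window is therefore about 5 seconds long.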
