2017-08-19 | 分类 ceph-internal | 标签 ceph-internal

前言

ceph-mon负责的功能有很多：

startup
data store
data sync
data check
scrub
leader elect
timecheck
lease
paxos
paxos service
consistency

我们今天先挑一个软一点地柿子捏一下，简单介绍下timecheck。

分布式系统正常运转依赖系统时间，ceph通过这个timecheck机制来检查每个monitor的时间是否一致，如果误差过大（clock skew），会发出警告信息。

我们知道，集群中多个节点可能都存在ceph-mon，当时扮演的角色不同，有一个节点是monitor leader，其他的节点上的monitor 为peon, 在timecheck机制中，两者扮演的角色不同，如下图所示：

注意，monitor leader是整个战术的发起点，他会主动向所有的peon发送OP_PING请求，所有的peon monitor会恢复OP_PONG，在OP_PONG消息中，会带上自己这边的时间戳。当monitor leader收到回应后，会计算出monitor leader和各个peon中间的时间偏移（估算，无法做到绝对精确），记录到ceph-mon的数据结构中。

当所有的peon都回应过OP_PONG之后，monitor leader收到所有的回应之后，会在timecheck_finish_round 函数中通过调用timecheck_report ，给所有的peon发送OP_REPORT消息，在消息体中，会把monitor leader算出来的时钟偏移和往来延迟记入其中，这样peon收到OP_REPROT消息之后，就能得到，该节点与monitor leader之间的往来延迟和时钟偏移。

粗略的过程就是如上，下面要展开细节，详细的描述这个过程。

原点

不介绍ceph-mon的PAXOS以及election，似乎很难介绍好其他功能，但是们还是暂时放下Paxos和election，我们起点从有一个节点赢得ceph-mon monitor leader的选举开始：

如同封建时代，新皇登基总要大赦天下，提拔一群新的大臣到重要岗位，某个节点的ceph-mon赢得monitor leader 选举之后，也会做一些重新洗牌的动作。其中timecheck的重新初始化也在其中。

void Monitor::win_election(epoch_t epoch, set<int>& active, uint64_t features,
                           const MonCommand *cmdset, int cmdsize, 
                           const set<int> *classic_monitors)
{
    if (monmap->size() > 1 &&
      monmap->get_epoch() > 0)
      timecheck_start();
}
void Monitor::timecheck_start()
{
  dout(10) << __func__ << dendl;
  timecheck_cleanup();
  timecheck_start_round();
}
void Monitor::timecheck_cleanup()
{
  timecheck_round = 0;
  timecheck_acks = 0;
  timecheck_round_start = utime_t();

  if (timecheck_event) {
    timer.cancel_event(timecheck_event);
    timecheck_event = NULL;
  }
  timecheck_waiting.clear();
  timecheck_skews.clear();
  timecheck_latencies.clear();
}

我们可以看到，新当选的monitor leader通过win_election—>timecheck_start—->timecheck_cleanup，完成了对timecheck相关数据结构的重新洗牌。

竞争leader的失败者，也需要重新洗牌，完成对timecheck相关数据结构的初始化。

void Monitor::lose_election(epoch_t epoch, set<int> &q, int l, uint64_t features) 
{
  state = STATE_PEON;
  ...
  logger->inc(l_mon_election_win);
  finish_election();                                                  
}
void Monitor::finish_election()
{
  apply_quorum_to_compatset_features();
  timecheck_finish();
  ...
}
void Monitor::timecheck_finish()
{
  dout(10) << __func__ << dendl;
  timecheck_cleanup();
}
void Monitor::timecheck_cleanup()                                                
{
  timecheck_round = 0;
  timecheck_acks = 0;
  timecheck_round_start = utime_t();

  if (timecheck_event) {
    timer.cancel_event(timecheck_event);
    timecheck_event = NULL;
  }
  timecheck_waiting.clear();
  timecheck_skews.clear();
  timecheck_latencies.clear();
}

通过上面的讨论可以看到，竞争leader的失败者，也重新初始化了timecheck相关的数据结构。

timecheck的流程

现在我们可以开始讨论下相关的数据结构到底记录什么信息了。

  map<entity_inst_t, utime_t> timecheck_waiting;
  map<entity_inst_t, double> timecheck_skews;
  map<entity_inst_t, double> timecheck_latencies;
  // odd value means we are mid-round; even value means the round has
  // finished.
  version_t timecheck_round; 
  
  unsigned int timecheck_acks;
  utime_t timecheck_round_start;

首先的话timecheck_round是一个version_t类型，即uint64_t类型的变量。因为timecheck是一轮一轮的做的，因此需要一个轮数的概念。当timecheck_round 是奇数还是偶数，有不同的含义，后面会详细分析。

timecheck_round_start是一个时间值，记录的是本轮timecheck发起的时间。记录下这个时间之后，就要开始给各个PEON monitor发送OP_PING消息了。这个时间非常有用。因为有些时候，可能并不顺利，很可能过了很久，也收不到某个PEON回应的OP_PONG消息，比如发送的时候，该PEON网络还是通的，但是PEON收到消息之后，网路不通了，monitor leader可能无法集齐所有PEON monitor的回应，这种情况下，timecheck需要有cancel的机制，不能因为单个节点的故障，导致大家timecheck都无法进行。

void Monitor::timecheck_start_round()
{
  dout(10) << __func__ << " curr " << timecheck_round << dendl;
  assert(is_leader());
  
  if (monmap->size() == 1) {
    assert(0 == "We are alone; this shouldn't have been scheduled!");
    return;
  }
  
  if (timecheck_round % 2) {
    dout(10) << __func__ << " there's a timecheck going on" << dendl;
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    double max = g_conf->mon_timecheck_interval*3;
    if (curr_time - timecheck_round_start < max) {
      dout(10) << __func__ << " keep current round going" << dendl;
      goto out;
    } else {
      dout(10) << __func__
               << " finish current timecheck and start new" << dendl;
      timecheck_cancel_round();
    }
  }
  
  assert(timecheck_round % 2 == 0);
  timecheck_acks = 0;
  timecheck_round ++;
  timecheck_round_start = ceph_clock_now(g_ceph_context);
  dout(10) << __func__ << " new " << timecheck_round << dendl;

  timecheck();
out:
  dout(10) << __func__ << " setting up next event" << dendl;
  timecheck_event = new C_TimeCheck(this);
  timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);
} 

前面讲过，timecheck_round是奇数还是偶数，含义是不同的

奇数：timecheck已经发起，但是尚未结束
偶数：timecheck已经完成，正在等待下一轮timecheck的发起。

wait a minute，我们提到了等待下一轮，那么到底多久是一轮呢？我们看定时器：

out:
  dout(10) << __func__ << " setting up next event" << dendl;
  timecheck_event = new C_TimeCheck(this);
  timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);  

OPTION(mon_timecheck_interval, OPT_FLOAT, 300.0) 

这个浮点数300秒，定义了timecheck的周期，每五分钟，发起一轮timecheck。注意C_TimeCheck：

  struct C_TimeCheck : public Context {
    Monitor *mon;
    C_TimeCheck(Monitor *m) : mon(m) { }
    void finish(int r) {
      mon->timecheck_start_round();                                            
    }
  }; 

定时器到了，会执行下一轮的timecheck_start_round函数。

注意哈，当ceph-mon成为monitor leader之后，在win_election函数中调用timecheck_start函数，在该函数中会第一次调用timecheck_start_round，后续的timecheck发起，就靠定时任务了。每过300秒，就会发起下一轮的timecheck。

void Monitor::timecheck_start()                                               
{
  dout(10) << __func__ << dendl;
  timecheck_cleanup();
  timecheck_start_round();
}

timecheck_start_round作为timecheck的发起者，就非常重要了。

timecheck_start_round函数

   /*如果是只有一个cephmon，压根就不需要发起timecheck，
    *事实上win_election中也判定了，是否是一个mon*/
  if (monmap->size() == 1) {
    assert(0 == "We are alone; this shouldn't have been scheduled!");
    return;
  }

理想很丰满，显示很骨感，实际情况是很复杂的，比如又有某种原因，上一轮的timecheck迟迟不能结案，现实中又不能不理，因此，下面这段逻辑处理的是timecheck因为某些原因无法结束的情形。如果定时器timeout了，即等待了300秒，结果发现上一轮的timecheck居然还没完工，那么是放弃还是继续等待？取决于等待的时间，如果等待了3倍的mon_timecheck_interval时间，即15分钟以上，还没等到timecheck结束，那么就不等路，直接cancel本轮timecheck，但是如果低于3倍时间，就goto out设置定时器，再等一轮。

 /*timecheck_round为奇数的时候，表示有一轮timecheck 正在进行中*/ 
 if (timecheck_round % 2) {
    dout(10) << __func__ << " there's a timecheck going on" << dendl;
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    double max = g_conf->  if (timecheck_round % 2) {
    dout(10) << __func__ << " there's a timecheck going on" << dendl;
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    double max = g_conf->mon_timecheck_interval*3;
    /*如果等待时间低于3倍的mon_timecheck_interval，那么再等300秒
     * goto out是为了设置新的定时器的*/
    if (curr_time - timecheck_round_start < max) {
      dout(10) << __func__ << " keep current round going" << dendl;
      goto out;
    } else {
      dout(10) << __func__
               << " finish current timecheck and start new" << dendl;
      timecheck_cancel_round();
    }
  }

正常情况下，300秒的时间，timecheck肯定是完成了，但是也有异常，比如发送OP_PING的时候，PEON好好的，但是某一个PEON就是不给会消息，这种情况下，没有搜集起所有的相应，本轮timecheck就不能结束。上面的逻辑就是处理这个的。

这一部分逻辑是异常部分，正常情况下不会走到。正常部分下，走下面这个逻辑：

 /*assert判定，并无当前正在进行的timecheck*/
  assert(timecheck_round % 2 == 0);
  /*新的一轮check，自然一个回应也没收到*/
  timecheck_acks = 0;
  /*timecheck_round自加，变成奇数，表示正在进行timecheck*/
  timecheck_round ++;
  /*记录本轮timecheck的起始时间，到timecheck_round_start变量*/
  timecheck_round_start = ceph_clock_now(g_ceph_context);
  dout(10) << __func__ << " new " << timecheck_round << dendl;
  
  /*真正发起timecheck*/
  timecheck();
out:
  dout(10) << __func__ << " setting up next event" << dendl;
  timecheck_event = new C_TimeCheck(this);
  timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);

timecheck函数

void Monitor::timecheck()
{
  dout(10) << __func__ << dendl;
  assert(is_leader());
  if (monmap->size() == 1) {
    assert(0 == "We are alone; we shouldn't have gotten here!");
    return;
  }
  assert(timecheck_round % 2 != 0);

  timecheck_acks = 1; // we ack ourselves

  dout(10) << __func__ << " start timecheck epoch " << get_epoch()
           << " round " << timecheck_round << dendl;

  // we are at the eye of the storm; the point of reference
  timecheck_skews[messenger->get_myinst()] = 0.0;
  timecheck_latencies[messenger->get_myinst()] = 0.0;

  for (set<int>::iterator it = quorum.begin(); it != quorum.end(); ++it) {
    if (monmap->get_name(*it) == name)
      continue;
      
    entity_inst_t inst = monmap->get_inst(*it);
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    timecheck_waiting[inst] = curr_time;
    MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_PING);
    m->epoch = get_epoch();
    m->round = timecheck_round;
    dout(10) << __func__ << " send " << *m << " to " << inst << dendl;
    messenger->send_message(m, inst);
  }
}

首先是下面的逻辑，用来处理monitor leader自身到自身的时间偏移，毫无疑问，自己和自己肯定是没有任何偏移的，也不需要假惺惺地发消息测试：

  timecheck_acks = 1; // we ack ourselves

  dout(10) << __func__ << " start timecheck epoch " << get_epoch()
           << " round " << timecheck_round << dendl;

  // we are at the eye of the storm; the point of reference
  timecheck_skews[messenger->get_myinst()] = 0.0;
  timecheck_latencies[messenger->get_myinst()] = 0.0;

接下来是发给其他ceph-mon的消息：

  for (set<int>::iterator it = quorum.begin(); it != quorum.end(); ++it) {
    /*如果ceph-mon是leader自己，就不用发消息了*/
    if (monmap->get_name(*it) == name)
      continue;
      
    entity_inst_t inst = monmap->get_inst(*it);
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    /*记录下发送OP_PING的时间点，到timecheck_waiting[inst],后面会有用
     *后面要计算latency，这时候，发送的时间和收到OP_PONG响应的时间，就能估算延迟了*/
    timecheck_waiting[inst] = curr_time;
    MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_PING);
    m->epoch = get_epoch();
    m->round = timecheck_round;
    dout(10) << __func__ << " send " << *m << " to " << inst << dendl;
    messenger->send_message(m, inst);
  }

handle_timecheck 函数

对于Monitor::dispatch 函数我就不提了，他是整个Monitor的消息集散中心，其中我们timecheck相关的消息类型，都是这种MSG_TIMECHECK。

    case MSG_TIMECHECK:                                           
      handle_timecheck(static_cast<MTimeCheck *>(m));
      break;

我们细细来看handle_timecheck函数：

void Monitor::handle_timecheck(MTimeCheck *m)
{
  dout(10) << __func__ << " " << *m << dendl;
  /*monitor leader只会、应该收到 OP_PONG的消息*/
  if (is_leader()) {
    if (m->op != MTimeCheck::OP_PONG) {
      dout(1) << __func__ << " drop unexpected msg (not pong)" << dendl;
    } else {
      handle_timecheck_leader(m);
    }
  } else if (is_peon()) {
    /*非Leader，则只应该收到OP_PING和OP_REPORT两种消息*/
    if (m->op != MTimeCheck::OP_PING && m->op != MTimeCheck::OP_REPORT) {
      dout(1) << __func__ << " drop unexpected msg (not ping or report)" << dendl;
    } else {
      handle_timecheck_peon(m);
    }
  } else {
    dout(1) << __func__ << " drop unexpected msg" << dendl;
  }
  m->put();
}

很明显，peon只会收到OP_PING和OP_REPORT两种消息，先收到OP_PING。

void Monitor::handle_timecheck_peon(MTimeCheck *m)
{
  ...
  if (m->epoch != get_epoch()) {
    dout(1) << __func__ << " got wrong epoch "
            << "(ours " << get_epoch() 
            << " theirs: " << m->epoch << ") -- discarding" << dendl;
    return;
  }

  /*如果收到消息的round，小于自己的timecheck_round，表示迷路已久的OP_PING终于到了，
   *因为时过境迁，这种过时的消息已经没有回复的必要了。*/
  if (m->round < timecheck_round) {
    dout(1) << __func__ << " got old round " << m->round
            << " current " << timecheck_round
            << " (epoch " << get_epoch() << ") -- discarding" << dendl;
    return;
  }

  /*peon修改自己的timecheck_round，向monitor leader看起*/
  timecheck_round = m->round;

  assert((timecheck_round % 2) != 0);
  MTimeCheck *reply = new MTimeCheck(MTimeCheck::OP_PONG);
  utime_t curr_time = ceph_clock_now(g_ceph_context);
  /*把当前节点的时间写入消息体，回给monitor leader*/
  reply->timestamp = curr_time;
  reply->epoch = m->epoch;
  reply->round = m->round;
  dout(10) << __func__ << " send " << *m
           << " to " << m->get_source_inst() << dendl;
  m->get_connection()->send_message(reply);
}

OK，接下来看下，monitor leader收到 OP_PONG之后，如何处理：

void Monitor::handle_timecheck_leader(MTimeCheck *m)
{
  dout(10) << __func__ << " " << *m << dendl;
  /* handles PONG's */                                                                               /*monitor leader只会OP_PONG类型的消息*/                                          
  assert(m->op == MTimeCheck::OP_PONG);

  entity_inst_t other = m->get_source_inst();
  if (m->epoch < get_epoch()) {
    dout(1) << __func__ << " got old timecheck epoch " << m->epoch
            << " from " << other
            << " curr " << get_epoch()
            << " -- severely lagged? discard" << dendl;
    return;
  }
  assert(m->epoch == get_epoch());

  if (m->round < timecheck_round) {
    dout(1) << __func__ << " got old round " << m->round
            << " from " << other
            << " curr " << timecheck_round << " -- discard" << dendl;
    return;
  }

  utime_t curr_time = ceph_clock_now(g_ceph_context);

  /*timecheck_waiting中记录了消息的发送时间
   *取出来发送时间之后，该记录可以清除掉了，而该发送时间用来计算延迟latency*/
  assert(timecheck_waiting.count(other) > 0);
  utime_t timecheck_sent = timecheck_waiting[other];
  timecheck_waiting.erase(other);
  
  /*这是一种特殊情况，即收到消息的时间，比发送的时间还要早，
   *这意味着monitor leader 调整了时间，如果发生这种情况，本轮timecheck没有必要进行了，cancel掉*/
  if (curr_time < timecheck_sent) {
    // our clock was readjusted -- drop everything until it all makes sense.
    dout(1) << __func__ << " our clock was readjusted --"
            << " bump round and drop current check"
            << dendl;
    timecheck_cancel_round();
    return;
  }

  /* 更新monitor leader 到对应peon的latency 
   * 计算简单粗暴，即收到回应消息的时间减掉发送时间
   * 注意如果有历史值的话，要将历史值和当前值加权。
   * 最终的latency结果，保存在timecheck_latencies中*/
  double latency = (double)(curr_time - timecheck_sent);
  if (timecheck_latencies.count(other) == 0)
    timecheck_latencies[other] = latency;
  else {
    double avg_latency = ((timecheck_latencies[other]*0.8)+(latency*0.2));
    timecheck_latencies[other] = avg_latency;
  }
  

截止到此处，逻辑比较清晰，latency用发送OP_PING的时间和收到OP_PONG回应的时间来计算。然后将latency信息保存在timecheck_latencies 数据结构。

接下来到了最核心的地方，即如何估算两个节点的时间差。ceph给出了一段很长的注释：

/*
   * update skews
   *
   * some nasty thing goes on if we were to do 'a - b' between two utime_t,
   * and 'a' happens to be lower than 'b'; so we use double instead.
   *
   * latency is always expected to be >= 0.
   *
   * delta, the difference between theirs timestamp and ours, may either be
   * lower or higher than 0; will hardly ever be 0.
   *
   * The absolute skew is the absolute delta minus the latency, which is
   * taken as a whole instead of an rtt given that there is some queueing
   * and dispatch times involved and it's hard to assess how long exactly
   * it took for the message to travel to the other side and be handled. So
   * we call it a bounded skew, the worst case scenario.
   *
   * Now, to math!
   *
   * Given that the latency is always positive, we can establish that the
   * bounded skew will be:
   *
   *  1. positive if the absolute delta is higher than the latency and
   *     delta is positive
   *  2. negative if the absolute delta is higher than the latency and
   *     delta is negative.
   *  3. zero if the absolute delta is lower than the latency.
   *
   * On 3. we make a judgement call and treat the skew as non-existent.
   * This is because that, if the absolute delta is lower than the
   * latency, then the apparently existing skew is nothing more than a
   * side-effect of the high latency at work.
   *
   * This may not be entirely true though, as a severely skewed clock
   * may be masked by an even higher latency, but with high latencies
   * we probably have worse issues to deal with than just skewed clocks.
   */

这段注释解释了如何计算两个节点之间的时间偏移（clock skew）。PEON节点的时间戳是a，monitor leader收到OP_PONG之后当前的时间戳是b，那么时间偏移粗略来看是 a-b，但是还是要考虑延迟。

a-b的值要和latency比较一下，如果说(a-b)的绝对值小于latency，说明a和b之间的这点时间偏移太小了，比网络延迟还要小，这种情况下，就不必计较a和b之间的时间偏移。这就是注释当中的第三条。

  double delta = ((double) m->timestamp) - ((double) curr_time);
  double abs_delta = (delta > 0 ? delta : -delta);
  double skew_bound = abs_delta - latency;
  /*时间偏移的值小于网络延迟，那么就认为skew_bound =0，没有偏移
   *否则，就认定偏移的值为skew_bound，不过还是要根据delta的正负，确定是领先monitor leader，还是落后*/
  if (skew_bound < 0)
    skew_bound = 0;
  else if (delta < 0)
    skew_bound = -skew_bound;

  ostringstream ss;
  health_status_t status = timecheck_status(ss, skew_bound, latency);
  if (status == HEALTH_ERR)
    clog->error() << other << " " << ss.str() << "\n";
  else if (status == HEALTH_WARN)
    clog->warn() << other << " " << ss.str() << "\n";

  dout(10) << __func__ << " from " << other << " ts " << m->timestamp
           << " delta " << delta << " skew_bound " << skew_bound
           << " latency " << latency << dendl;

  if (timecheck_skews.count(other) == 0) {
    timecheck_skews[other] = skew_bound;
  } else {
    timecheck_skews[other] = (timecheck_skews[other]*0.8)+(skew_bound*0.2);
  }
  /*收到PEON回应的个数自加*/
  timecheck_acks++;
  /*如果所有的PEON都回应了，那么执行timecheck_finish_round*/
  if (timecheck_acks == quorum.size()) {
    dout(10) << __func__ << " got pongs from everybody ("
             << timecheck_acks << " total)" << dendl;
    assert(timecheck_skews.size() == timecheck_acks);
    assert(timecheck_waiting.empty());
    // everyone has acked, so bump the round to finish it.
    timecheck_finish_round();
  }

计算规则就是注释中的三点，不多说。逻辑非常简单，不多说了。如果所有的PEON的回应都收到了，那么执行timecheck_finish_round函数。

  /*这个timecheck_finish_round函数是公用的，无论成功还是cancel掉本轮，都会调用
   *区别就在标志位success，如果为true，表示成功处理本轮timecheck，所有的PEON的OP_PONG都收到
   *如果success = false，表示本轮失败，由于某种原因，取消掉了本轮timecheck*/
void Monitor::timecheck_finish_round(bool success)
{
  dout(10) << __func__ << " curr " << timecheck_round << dendl;
  assert(timecheck_round % 2);
  timecheck_round ++;
  timecheck_round_start = utime_t();

  /*如果成功，则发送OP_REPORT消息到各个PEON，通知他们更新最新计算出来的clock skew*/
  if (success) {
    assert(timecheck_waiting.empty());
    assert(timecheck_acks == quorum.size());
    timecheck_report();
    return;
  }

  /*如果是取消本轮timecheck的话，将还未收到消息的PEON从timecheck_waiting中去掉，并打印*/
  dout(10) << __func__ << " " << timecheck_waiting.size()
           << " peers still waiting:";
  for (map<entity_inst_t,utime_t>::iterator p = timecheck_waiting.begin();
      p != timecheck_waiting.end(); ++p) {
    *_dout << " " << p->first.name;
  }
  *_dout << dendl;
  timecheck_waiting.clear()                                               
  dout(10) << __func__ << " finished to " << timecheck_round << dendl;
}

注意，如果所有的PEON的回应都收到，才会，通过timecheck_report 发送OP_REPORT消息到各个PEON。为什么要发送这个消息呢。其实就是把最新的计算结果告诉PEON，通知它，所有PEON与monitor leader的时间偏移和延迟。

void Monitor::timecheck_report()
{
  dout(10) << __func__ << dendl;
  assert(is_leader());
  assert((timecheck_round % 2) == 0);
  if (monmap->size() == 1) {
    assert(0 == "We are alone; we shouldn't have gotten here!");
    return;
  }
  
  assert(timecheck_latencies.size() == timecheck_skews.size());
  bool do_output = true; // only output report once
  for (set<int>::iterator q = quorum.begin(); q != quorum.end(); ++q) {
    /*如果是monitor leader ，不用自己发给你自己*/
    if (monmap->get_name(*q) == name)
      continue;
      
    MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_REPORT);
    m->epoch = get_epoch();
    m->round = timecheck_round;

    for (map<entity_inst_t, double>::iterator it = timecheck_skews.begin(); it != timecheck_skews.end(); ++it) {
      double skew = it->second;
      double latency = timecheck_latencies[it->first];
      
      /*消息体里，带着skew和latency的信息，把最新的结果告诉对端的PEON*/
      m->skews[it->first] = skew;
      m->latencies[it->first] = latency;
      
      if (do_output) {
        dout(25) << __func__ << " " << it->first
                 << " latency " << latency
                 << " skew " << skew << dendl;
      }
    }
    do_output = false;
    entity_inst_t inst = monmap->get_inst(*q);
    dout(10) << __func__ << " send report to " << inst << dendl;
    messenger->send_message(m, inst);
  }
}

对端的PEON收到OP_REPORT信息之后，把这个信息记录下来：

void Monitor::handle_timecheck_peon(MTimeCheck *m)
{
  ...
  timecheck_round = m->round;

  if (m->op == MTimeCheck::OP_REPORT) {
    assert((timecheck_round % 2) == 0);
    /*记录下来monitor leader发过来的最新的latency和skew信息*/
    timecheck_latencies.swap(m->latencies);                                            
    timecheck_skews.swap(m->skews);
    return;
  }
  ...
}

如果clock skew，如何处理

讲了这么多，还是没说，如果发生了这种情况，如何处理。

首先是如果节点间的时间偏移确实很大，ceph health detail中会有警告信息出现，那么问题是多大的偏移才叫比较大呢？


health_status_t Monitor::timecheck_status(ostringstream &ss,
                                          const double skew_bound,
                                          const double latency)
{
  health_status_t status = HEALTH_OK;
  double abs_skew = (skew_bound > 0 ? skew_bound : -skew_bound);
  assert(latency >= 0);

  if (abs_skew > g_conf->mon_clock_drift_allowed) {
    status = HEALTH_WARN;
    ss << "clock skew " << abs_skew << "s"
       << " > max " << g_conf->mon_clock_drift_allowed << "s";
  }
  
  return status;
}

此处有个配置项，mon_clock_drift_allowed

OPTION(mon_clock_drift_allowed, OPT_FLOAT, .050)

即，允许节点之间的偏移为50毫秒。

如果超过，ceph health detail 会有如下的打印：

ceph health detail
HEALTH_WARN clock skew detected on mon.1, mon.2
mon.1 addr 192.168.0.6:6789/0 clock skew 8.37274s > max 0.05s (latency 0.004945s)
mon.2 addr 192.168.0.7:6789/0 clock skew 8.52479s > max 0.05s (latency 0.005965s)

这部分逻辑在

void Monitor::get_health(string& status, bufferlist *detailbl, Formatter *f)
{
  ...
   if (f) {
    f->open_object_section("timechecks");
    f->dump_unsigned("epoch", get_epoch());
    f->dump_int("round", timecheck_round);
    f->dump_stream("round_status")
      << ((timecheck_round%2) ? "on-going" : "finished");
   }

  if (!timecheck_skews.empty()) {
    list<string> warns;
    if (f)
      f->open_array_section("mons");
    for (map<entity_inst_t,double>::iterator i = timecheck_skews.begin();
         i != timecheck_skews.end(); ++i) {
      entity_inst_t inst = i->first;
      double skew = i->second;
      double latency = timecheck_latencies[inst];
      string name = monmap->get_name(inst.addr);

      ostringstream tcss;
      health_status_t tcstatus = timecheck_status(tcss, skew, latency);
      if (tcstatus != HEALTH_OK) {
        if (overall > tcstatus)
          overall = tcstatus;
        warns.push_back(name);
        
        ostringstream tmp_ss;
        tmp_ss << "mon." << name
               << " addr " << inst.addr << " " << tcss.str()
               << " (latency " << latency << "s)";
        detail.push_back(make_pair(tcstatus, tmp_ss.str()));
      }

      if (f) {
        f->open_object_section("mon");
        f->dump_string("name", name.c_str());
        f->dump_float("skew", skew);
        f->dump_float("latency", latency);
        f->dump_stream("health") << tcstatus;
        if (tcstatus != HEALTH_OK)
          f->dump_stream("details") << tcss.str();
        f->close_section();
      }
    }
    ...
}

发生这种事情，应该如何处理，很多文章都有提到了，基本就是强制ntpdate一次，让时间强制校准：

停掉所有节点的ntpd服务，如果有的话
```
/etc/init.d/ntpd stop
```
同步时间
```
ntpdate  {ntpserver}
```

注意，如果无法连出外网的情况下，可以选择某一台机器作为NTP Server，大家强制向它看齐。