Preface
ceph-mon is responsible for many things:
- startup
- data store
- data sync
- data check
- scrub
- leader elect
- timecheck
- lease
- paxos
- paxos service
- consistency
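Today we start with the easiest of these and give a brief introduction to timecheck.
A distributed system depends on system time to run correctly. Ceph uses the timecheck mechanism to check whether the monitors' clocks agree, and raises a warning when the difference (clock skew) is too large.
A cluster can run ceph-mon on several nodes, but they play different roles: one node is the monitor leader and the monitors on the other nodes are peons. The two roles behave differently in timecheck, as shown in the figure below:
Note that the monitor leader drives the whole exchange. It sends an OP_PING request to every peon, and every peon monitor replies with an OP_PONG that carries its own timestamp. When the monitor leader receives a reply, it computes the clock offset between itself and that peon (an estimate; it cannot be exact) and records it in ceph-mon's data structures.
Once every peon has replied with OP_PONG, the monitor leader calls timecheck_report from timecheck_finish_round and sends an OP_REPORT message to every peon. The message body carries the clock skews and round-trip latencies the leader has computed, so a peon that receives OP_REPORT learns the latency and clock skew relative to the monitor leader.
That is the rough outline. The rest of this post goes through the details.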
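Starting point
It is hard to explain the other ceph-mon features well without first covering Paxos and elections, but we will set both aside for now and start at the moment a node wins the monitor leader election:
Much like a new emperor who declares an amnesty and appoints new ministers to key posts, a ceph-mon that wins the monitor leader election reshuffles some state. Reinitializing timecheck is part of that.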
void Monitor::win_election(epoch_t epoch, set<int>& active, uint64_t features,
const MonCommand *cmdset, int cmdsize,
const set<int> *classic_monitors)
{
if (monmap->size() > 1 &&
monmap->get_epoch() > 0)
timecheck_start();
}
void Monitor::timecheck_start()
{
dout(10) << __func__ << dendl;
timecheck_cleanup();
timecheck_start_round();
}
void Monitor::timecheck_cleanup()
{
timecheck_round = 0;
timecheck_acks = 0;
timecheck_round_start = utime_t();
if (timecheck_event) {
timer.cancel_event(timecheck_event);
timecheck_event = NULL;
}
timecheck_waiting.clear();
timecheck_skews.clear();
timecheck_latencies.clear();
}
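As we can see, the newly elected monitor leader resets the timecheck data structures through win_election -> timecheck_start -> timecheck_cleanup.
The monitors that lose the election also need to reshuffle and reinitialize their timecheck data structures.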
void Monitor::lose_election(epoch_t epoch, set<int> &q, int l, uint64_t features)
{
state = STATE_PEON;
...
logger->inc(l_mon_election_lose);
finish_election();
}
void Monitor::finish_election()
{
apply_quorum_to_compatset_features();
timecheck_finish();
...
}
void Monitor::timecheck_finish()
{
dout(10) << __func__ << dendl;
timecheck_cleanup();
}
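As shown above, the losers of the election also reinitialize the timecheck data structures, through lose_election -> finish_election -> timecheck_finish -> timecheck_cleanup.
The timecheck flow
Now we can look at what these data structures actually record.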
map<entity_inst_t, utime_t> timecheck_waiting;
map<entity_inst_t, double> timecheck_skews;
map<entity_inst_t, double> timecheck_latencies;
// odd value means we are mid-round; even value means the round has
// finished.
version_t timecheck_round;
unsigned int timecheck_acks;
utime_t timecheck_round_start;
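First, timecheck_round is a version_t, that is, a uint64_t. Timechecks run round by round, so a round counter is needed. Whether timecheck_round is odd or even carries meaning, which is analyzed below.
timecheck_round_start is a timestamp recording when the current round started. Right after recording it, the leader starts sending OP_PING to the peon monitors. This timestamp matters because things do not always go smoothly: the leader may wait a long time and never receive OP_PONG from some peon. For example, the peon's network may be up when the message is sent but go down after the peon receives it, so the monitor leader cannot collect replies from every peon. Timecheck therefore needs a cancel mechanism, so that a single failed node does not block timechecks for everyone.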
void Monitor::timecheck_start_round()
{
dout(10) << __func__ << " curr " << timecheck_round << dendl;
assert(is_leader());
if (monmap->size() == 1) {
assert(0 == "We are alone; this shouldn't have been scheduled!");
return;
}
if (timecheck_round % 2) {
dout(10) << __func__ << " there's a timecheck going on" << dendl;
utime_t curr_time = ceph_clock_now(g_ceph_context);
double max = g_conf->mon_timecheck_interval*3;
if (curr_time - timecheck_round_start < max) {
dout(10) << __func__ << " keep current round going" << dendl;
goto out;
} else {
dout(10) << __func__
<< " finish current timecheck and start new" << dendl;
timecheck_cancel_round();
}
}
assert(timecheck_round % 2 == 0);
timecheck_acks = 0;
timecheck_round ++;
timecheck_round_start = ceph_clock_now(g_ceph_context);
dout(10) << __func__ << " new " << timecheck_round << dendl;
timecheck();
out:
dout(10) << __func__ << " setting up next event" << dendl;
timecheck_event = new C_TimeCheck(this);
timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);
}
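As mentioned, an odd or even timecheck_round means different things:
- Odd: a timecheck round has started but has not finished yet.
- Even: the timecheck round has finished, and the leader is waiting to start the next one.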
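The parity convention can be tried out in isolation. The following is a minimal sketch, not Ceph code: RoundCounter and its members are hypothetical names, and it only models the two increments that appear in timecheck_start_round and timecheck_finish_round.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the leader's round counter; not a Ceph type.
struct RoundCounter {
  std::uint64_t round = 0;  // even: idle, odd: a round is in flight

  bool in_progress() const { return round % 2 != 0; }

  // Models the "timecheck_round ++" in timecheck_start_round().
  void start() {
    assert(!in_progress());
    ++round;
  }

  // Models the "timecheck_round ++" in timecheck_finish_round(),
  // which runs for completed and cancelled rounds alike.
  void finish() {
    assert(in_progress());
    ++round;
  }
};
```

Because both starting and finishing bump the same counter, a single integer encodes the round number and the in-progress flag, and a stale message can be recognized simply by comparing its round with the current one.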
Wait a minute: we mentioned waiting for the next round, so how long is a round? Look at the timer:
out:
dout(10) << __func__ << " setting up next event" << dendl;
timecheck_event = new C_TimeCheck(this);
timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);
OPTION(mon_timecheck_interval, OPT_FLOAT, 300.0)
This float, 300 seconds, defines the timecheck period: a new round is started every five minutes. Note C_TimeCheck:
struct C_TimeCheck : public Context {
Monitor *mon;
C_TimeCheck(Monitor *m) : mon(m) { }
void finish(int r) {
mon->timecheck_start_round();
}
};
When the timer fires, it runs timecheck_start_round for the next round.
Note that after a ceph-mon becomes monitor leader, win_election calls timecheck_start, which calls timecheck_start_round for the first time. Later rounds are started by the timer, once every 300 seconds.
void Monitor::timecheck_start()
{
dout(10) << __func__ << dendl;
timecheck_cleanup();
timecheck_start_round();
}
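As the function that starts every round, timecheck_start_round is important.
The timecheck_start_round function
/* With a single ceph-mon there is no need to start a timecheck at all.
 * win_election already checks whether there is only one mon. */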
if (monmap->size() == 1) {
assert(0 == "We are alone; this shouldn't have been scheduled!");
return;
}
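Ideals are rosy but reality is harsh, and real systems are messy. For some reason the previous round may fail to close, and that cannot be ignored, so the logic below handles a round that cannot finish. When the timer fires after 300 seconds and the previous round is still unfinished, should the leader give up or keep waiting? It depends on how long the round has been running. If it has run for 3x mon_timecheck_interval or more, that is, 15 minutes, the leader stops waiting and cancels the round. If it has run for less than that, the leader does goto out, re-arms the timer, and waits one more interval.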
/* An odd timecheck_round means a timecheck round is in progress */
if (timecheck_round % 2) {
dout(10) << __func__ << " there's a timecheck going on" << dendl;
utime_t curr_time = ceph_clock_now(g_ceph_context);
double max = g_conf->mon_timecheck_interval*3;
/* If the round has been running for less than 3x mon_timecheck_interval,
 * wait another interval; goto out is there to re-arm the timer */
if (curr_time - timecheck_round_start < max) {
dout(10) << __func__ << " keep current round going" << dendl;
goto out;
} else {
dout(10) << __func__
<< " finish current timecheck and start new" << dendl;
timecheck_cancel_round();
}
}
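Normally 300 seconds is more than enough for a timecheck to finish, but exceptions happen. For example, every peon looked fine when OP_PING was sent, yet one peon never replies. Without all the responses the round cannot finish, and the logic above handles that case.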
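The decision the timer callback makes can be written as a small pure function. This is only a sketch under assumptions: RoundAction and on_timer are hypothetical names, and times are plain doubles in seconds rather than utime_t.

```cpp
#include <cassert>

// Hypothetical outcome of one timer tick on the leader; not a Ceph type.
enum class RoundAction { StartNew, KeepWaiting, CancelAndRestart };

// Models the branch at the top of timecheck_start_round().
// All times are in seconds.
RoundAction on_timer(bool round_in_progress, double now,
                     double round_start, double interval) {
  assert(interval > 0.0);
  if (!round_in_progress)
    return RoundAction::StartNew;       // even round: nothing pending
  const double max_wait = interval * 3; // same factor as the Ceph code
  if (now - round_start < max_wait)
    return RoundAction::KeepWaiting;    // "goto out": only re-arm the timer
  return RoundAction::CancelAndRestart; // cancel, then start a new round
}
```

With the default interval of 300 seconds and a round that started at time 0, ticks at 300 and 600 seconds keep waiting, and the tick at 900 seconds cancels the stuck round, which matches the 15 minute figure above.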
The stalled-round branch is the exceptional path and is not normally taken. In the normal case the following logic runs:
/* assert that no timecheck round is currently in progress */
assert(timecheck_round % 2 == 0);
/* a new round starts with no replies received */
timecheck_acks = 0;
/* incrementing timecheck_round makes it odd: a timecheck is in progress */
timecheck_round ++;
/* record the start time of this round in timecheck_round_start */
timecheck_round_start = ceph_clock_now(g_ceph_context);
dout(10) << __func__ << " new " << timecheck_round << dendl;
/* actually start the timecheck */
timecheck();
out:
dout(10) << __func__ << " setting up next event" << dendl;
timecheck_event = new C_TimeCheck(this);
timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);
The timecheck function
void Monitor::timecheck()
{
dout(10) << __func__ << dendl;
assert(is_leader());
if (monmap->size() == 1) {
assert(0 == "We are alone; we shouldn't have gotten here!");
return;
}
assert(timecheck_round % 2 != 0);
timecheck_acks = 1; // we ack ourselves
dout(10) << __func__ << " start timecheck epoch " << get_epoch()
<< " round " << timecheck_round << dendl;
// we are at the eye of the storm; the point of reference
timecheck_skews[messenger->get_myinst()] = 0.0;
timecheck_latencies[messenger->get_myinst()] = 0.0;
for (set<int>::iterator it = quorum.begin(); it != quorum.end(); ++it) {
if (monmap->get_name(*it) == name)
continue;
entity_inst_t inst = monmap->get_inst(*it);
utime_t curr_time = ceph_clock_now(g_ceph_context);
timecheck_waiting[inst] = curr_time;
MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_PING);
m->epoch = get_epoch();
m->round = timecheck_round;
dout(10) << __func__ << " send " << *m << " to " << inst << dendl;
messenger->send_message(m, inst);
}
}
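First comes the logic below, which handles the offset from the monitor leader to itself. The leader obviously has no offset from itself, so there is no point in pretending to send itself a test message: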
timecheck_acks = 1; // we ack ourselves
dout(10) << __func__ << " start timecheck epoch " << get_epoch()
<< " round " << timecheck_round << dendl;
// we are at the eye of the storm; the point of reference
timecheck_skews[messenger->get_myinst()] = 0.0;
timecheck_latencies[messenger->get_myinst()] = 0.0;
Next come the messages sent to the other ceph-mon instances:
for (set<int>::iterator it = quorum.begin(); it != quorum.end(); ++it) {
/* skip the leader itself: no message is needed */
if (monmap->get_name(*it) == name)
continue;
entity_inst_t inst = monmap->get_inst(*it);
utime_t curr_time = ceph_clock_now(g_ceph_context);
/* Record the time OP_PING is sent in timecheck_waiting[inst].
 * Together with the time the OP_PONG reply arrives, it is used later to estimate latency */
timecheck_waiting[inst] = curr_time;
MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_PING);
m->epoch = get_epoch();
m->round = timecheck_round;
dout(10) << __func__ << " send " << *m << " to " << inst << dendl;
messenger->send_message(m, inst);
}
The handle_timecheck function
I will not go into Monitor::dispatch, which is the message hub of the whole Monitor. All timecheck-related messages have the type MSG_TIMECHECK.
case MSG_TIMECHECK:
handle_timecheck(static_cast<MTimeCheck *>(m));
break;
Let us look closely at handle_timecheck:
void Monitor::handle_timecheck(MTimeCheck *m)
{
dout(10) << __func__ << " " << *m << dendl;
/* the monitor leader should only ever receive OP_PONG messages */
if (is_leader()) {
if (m->op != MTimeCheck::OP_PONG) {
dout(1) << __func__ << " drop unexpected msg (not pong)" << dendl;
} else {
handle_timecheck_leader(m);
}
} else if (is_peon()) {
/* a non-leader should only receive OP_PING and OP_REPORT messages */
if (m->op != MTimeCheck::OP_PING && m->op != MTimeCheck::OP_REPORT) {
dout(1) << __func__ << " drop unexpected msg (not ping or report)" << dendl;
} else {
handle_timecheck_peon(m);
}
} else {
dout(1) << __func__ << " drop unexpected msg" << dendl;
}
m->put();
}
Clearly a peon only receives OP_PING and OP_REPORT, and OP_PING arrives first.
void Monitor::handle_timecheck_peon(MTimeCheck *m)
{
...
if (m->epoch != get_epoch()) {
dout(1) << __func__ << " got wrong epoch "
<< "(ours " << get_epoch()
<< " theirs: " << m->epoch << ") -- discarding" << dendl;
return;
}
/* If the round in the message is lower than our own timecheck_round, a long-lost OP_PING has finally arrived.
 * It is out of date by now, so there is no point in replying. */
if (m->round < timecheck_round) {
dout(1) << __func__ << " got old round " << m->round
<< " current " << timecheck_round
<< " (epoch " << get_epoch() << ") -- discarding" << dendl;
return;
}
/* the peon updates its own timecheck_round to follow the monitor leader */
timecheck_round = m->round;
assert((timecheck_round % 2) != 0);
MTimeCheck *reply = new MTimeCheck(MTimeCheck::OP_PONG);
utime_t curr_time = ceph_clock_now(g_ceph_context);
/* put this node's current time into the message and send it back to the monitor leader */
reply->timestamp = curr_time;
reply->epoch = m->epoch;
reply->round = m->round;
dout(10) << __func__ << " send " << *m
<< " to " << m->get_source_inst() << dendl;
m->get_connection()->send_message(reply);
}
OK, next let us see how the monitor leader handles an OP_PONG:
void Monitor::handle_timecheck_leader(MTimeCheck *m)
{
dout(10) << __func__ << " " << *m << dendl;
/* handles PONG's */ /* the monitor leader only handles OP_PONG messages */
assert(m->op == MTimeCheck::OP_PONG);
entity_inst_t other = m->get_source_inst();
if (m->epoch < get_epoch()) {
dout(1) << __func__ << " got old timecheck epoch " << m->epoch
<< " from " << other
<< " curr " << get_epoch()
<< " -- severely lagged? discard" << dendl;
return;
}
assert(m->epoch == get_epoch());
if (m->round < timecheck_round) {
dout(1) << __func__ << " got old round " << m->round
<< " from " << other
<< " curr " << timecheck_round << " -- discard" << dendl;
return;
}
utime_t curr_time = ceph_clock_now(g_ceph_context);
/* timecheck_waiting holds the time the message was sent.
 * Once the send time is read, the entry can be erased; the send time is used to compute latency */
assert(timecheck_waiting.count(other) > 0);
utime_t timecheck_sent = timecheck_waiting[other];
timecheck_waiting.erase(other);
/* A special case: the reply was received at a time earlier than the send time.
 * This means the monitor leader's clock was adjusted; the current round is pointless, so cancel it */
if (curr_time < timecheck_sent) {
// our clock was readjusted -- drop everything until it all makes sense.
dout(1) << __func__ << " our clock was readjusted --"
<< " bump round and drop current check"
<< dendl;
timecheck_cancel_round();
return;
}
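/* Update the latency from the monitor leader to this peon.
 * The calculation is crude: the time the reply was received minus the send time.
 * If a previous value exists, the old and new values are weighted (0.8 and 0.2).
 * The resulting latency is stored in timecheck_latencies */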
double latency = (double)(curr_time - timecheck_sent);
if (timecheck_latencies.count(other) == 0)
timecheck_latencies[other] = latency;
else {
double avg_latency = ((timecheck_latencies[other]*0.8)+(latency*0.2));
timecheck_latencies[other] = avg_latency;
}
Up to here the logic is clear: latency is computed from the time OP_PING was sent and the time the OP_PONG reply was received, and the result is stored in timecheck_latencies.
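The weighting of old and new latency samples is an exponentially weighted moving average. Here is a minimal sketch of that update, assuming peers are keyed by a plain string instead of entity_inst_t; update_latency is a hypothetical helper, not a Ceph function.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: folds one PING/PONG round trip into the stored
// latency for `peer`, using the 0.8 / 0.2 weights from the Ceph code.
// Times are seconds. The Ceph code cancels the round when
// received < sent; this sketch simply asserts it does not happen.
double update_latency(std::map<std::string, double>& latencies,
                      const std::string& peer,
                      double sent, double received) {
  assert(received >= sent);
  const double sample = received - sent;
  auto it = latencies.find(peer);
  if (it == latencies.end()) {
    latencies[peer] = sample;  // first sample is stored as-is
    return sample;
  }
  it->second = it->second * 0.8 + sample * 0.2;
  return it->second;
}
```

For example, a first round trip of 0.5 s stores 0.5, and a second round trip of 1.0 s moves the stored value to about 0.6 (0.5 * 0.8 + 1.0 * 0.2), so one slow reply only shifts the estimate by a fifth of the difference.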
Next comes the core part: how to estimate the time difference between two nodes. Ceph provides a long comment:
/*
* update skews
*
* some nasty thing goes on if we were to do 'a - b' between two utime_t,
* and 'a' happens to be lower than 'b'; so we use double instead.
*
* latency is always expected to be >= 0.
*
* delta, the difference between theirs timestamp and ours, may either be
* lower or higher than 0; will hardly ever be 0.
*
* The absolute skew is the absolute delta minus the latency, which is
* taken as a whole instead of an rtt given that there is some queueing
* and dispatch times involved and it's hard to assess how long exactly
* it took for the message to travel to the other side and be handled. So
* we call it a bounded skew, the worst case scenario.
*
* Now, to math!
*
* Given that the latency is always positive, we can establish that the
* bounded skew will be:
*
* 1. positive if the absolute delta is higher than the latency and
* delta is positive
* 2. negative if the absolute delta is higher than the latency and
* delta is negative.
* 3. zero if the absolute delta is lower than the latency.
*
* On 3. we make a judgement call and treat the skew as non-existent.
* This is because that, if the absolute delta is lower than the
* latency, then the apparently existing skew is nothing more than a
* side-effect of the high latency at work.
*
* This may not be entirely true though, as a severely skewed clock
* may be masked by an even higher latency, but with high latencies
* we probably have worse issues to deal with than just skewed clocks.
*/
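This comment explains how the clock skew between two nodes is computed. Let a be the peon's timestamp and b the monitor leader's current time when it receives OP_PONG. Roughly, the offset is a-b, but latency still has to be taken into account.
The value a-b is compared with the latency. If the absolute value of (a-b) is smaller than the latency, the apparent offset between a and b is smaller than the network delay and is not worth worrying about. This is case 3 in the comment.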
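To check the three cases with concrete numbers, the rule can be wrapped in a standalone function. This is a sketch, not the Ceph implementation: bounded_skew is a hypothetical name and timestamps are plain doubles in seconds.

```cpp
#include <cassert>

// Hypothetical helper applying the three cases from the Ceph comment.
// peer_timestamp: time reported in OP_PONG; leader_now: leader's clock
// when the reply arrives; latency: measured round trip (>= 0).
double bounded_skew(double peer_timestamp, double leader_now,
                    double latency) {
  assert(latency >= 0.0);
  const double delta = peer_timestamp - leader_now;
  const double abs_delta = delta > 0 ? delta : -delta;
  double skew_bound = abs_delta - latency;
  if (skew_bound < 0)
    skew_bound = 0;              // case 3: hidden inside the latency
  else if (delta < 0)
    skew_bound = -skew_bound;    // case 2: the peer is behind the leader
  return skew_bound;             // case 1 otherwise: the peer is ahead
}
```

With a latency of 0.5 s, a peer that reports 110.0 when the leader reads 100.0 gets a skew of 9.5, a peer that reports 90.0 gets -9.5, and with a latency of 0.05 s a peer that is only 0.01 s off gets 0.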
double delta = ((double) m->timestamp) - ((double) curr_time);
double abs_delta = (delta > 0 ? delta : -delta);
double skew_bound = abs_delta - latency;
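/* If the offset is smaller than the network latency, treat skew_bound as 0: no skew.
 * Otherwise the skew is skew_bound, and the sign of delta tells whether the peon is ahead of or behind the monitor leader */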
if (skew_bound < 0)
skew_bound = 0;
else if (delta < 0)
skew_bound = -skew_bound;
ostringstream ss;
health_status_t status = timecheck_status(ss, skew_bound, latency);
if (status == HEALTH_ERR)
clog->error() << other << " " << ss.str() << "\n";
else if (status == HEALTH_WARN)
clog->warn() << other << " " << ss.str() << "\n";
dout(10) << __func__ << " from " << other << " ts " << m->timestamp
<< " delta " << delta << " skew_bound " << skew_bound
<< " latency " << latency << dendl;
if (timecheck_skews.count(other) == 0) {
timecheck_skews[other] = skew_bound;
} else {
timecheck_skews[other] = (timecheck_skews[other]*0.8)+(skew_bound*0.2);
}
/* one more reply received */
timecheck_acks++;
/* if every peon has replied, run timecheck_finish_round */
if (timecheck_acks == quorum.size()) {
dout(10) << __func__ << " got pongs from everybody ("
<< timecheck_acks << " total)" << dendl;
assert(timecheck_skews.size() == timecheck_acks);
assert(timecheck_waiting.empty());
// everyone has acked, so bump the round to finish it.
timecheck_finish_round();
}
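The calculation follows the three cases in the comment, and the logic is simple. Once replies from all peons have arrived, timecheck_finish_round is executed.
/* timecheck_finish_round is shared: it is called both when a round succeeds and when it is cancelled.
 * The difference is the success flag: true means the round completed and OP_PONG arrived from every peon;
 * false means the round failed and was cancelled for some reason */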
void Monitor::timecheck_finish_round(bool success)
{
dout(10) << __func__ << " curr " << timecheck_round << dendl;
assert(timecheck_round % 2);
timecheck_round ++;
timecheck_round_start = utime_t();
/* on success, send OP_REPORT to every peon so they pick up the newly computed clock skews */
if (success) {
assert(timecheck_waiting.empty());
assert(timecheck_acks == quorum.size());
timecheck_report();
return;
}
/* when the round is cancelled, log the peons that have not replied yet and drop them from timecheck_waiting */
dout(10) << __func__ << " " << timecheck_waiting.size()
<< " peers still waiting:";
for (map<entity_inst_t,utime_t>::iterator p = timecheck_waiting.begin();
p != timecheck_waiting.end(); ++p) {
*_dout << " " << p->first.name;
}
*_dout << dendl;
timecheck_waiting.clear();
dout(10) << __func__ << " finished to " << timecheck_round << dendl;
}
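Note that timecheck_report sends OP_REPORT to the peons only when replies from all peons have been received. Why send this message? It passes the latest results to each peon: the clock skew and latency of every peon relative to the monitor leader.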
void Monitor::timecheck_report()
{
dout(10) << __func__ << dendl;
assert(is_leader());
assert((timecheck_round % 2) == 0);
if (monmap->size() == 1) {
assert(0 == "We are alone; we shouldn't have gotten here!");
return;
}
assert(timecheck_latencies.size() == timecheck_skews.size());
bool do_output = true; // only output report once
for (set<int>::iterator q = quorum.begin(); q != quorum.end(); ++q) {
/* the monitor leader does not send a report to itself */
if (monmap->get_name(*q) == name)
continue;
MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_REPORT);
m->epoch = get_epoch();
m->round = timecheck_round;
for (map<entity_inst_t, double>::iterator it = timecheck_skews.begin(); it != timecheck_skews.end(); ++it) {
double skew = it->second;
double latency = timecheck_latencies[it->first];
/* the message body carries skew and latency, passing the latest results to the peon */
m->skews[it->first] = skew;
m->latencies[it->first] = latency;
if (do_output) {
dout(25) << __func__ << " " << it->first
<< " latency " << latency
<< " skew " << skew << dendl;
}
}
do_output = false;
entity_inst_t inst = monmap->get_inst(*q);
dout(10) << __func__ << " send report to " << inst << dendl;
messenger->send_message(m, inst);
}
}
When a peon receives OP_REPORT, it records the information:
void Monitor::handle_timecheck_peon(MTimeCheck *m)
{
...
timecheck_round = m->round;
if (m->op == MTimeCheck::OP_REPORT) {
assert((timecheck_round % 2) == 0);
/* store the latest latency and skew information sent by the monitor leader */
timecheck_latencies.swap(m->latencies);
timecheck_skews.swap(m->skews);
return;
}
...
}
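What to do about clock skew
After all this, we still have not said what to do when clock skew actually occurs.
First, if the offset between nodes really is large, ceph health detail shows a warning. So how large is too large?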
health_status_t Monitor::timecheck_status(ostringstream &ss,
const double skew_bound,
const double latency)
{
health_status_t status = HEALTH_OK;
double abs_skew = (skew_bound > 0 ? skew_bound : -skew_bound);
assert(latency >= 0);
if (abs_skew > g_conf->mon_clock_drift_allowed) {
status = HEALTH_WARN;
ss << "clock skew " << abs_skew << "s"
<< " > max " << g_conf->mon_clock_drift_allowed << "s";
}
return status;
}
There is a configuration option here, mon_clock_drift_allowed:
OPTION(mon_clock_drift_allowed, OPT_FLOAT, .050)
That is, the allowed offset between nodes is 50 milliseconds.
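The threshold test itself is a one-line comparison on the absolute skew. A minimal sketch, assuming the default of 0.050 s quoted above; skew_exceeds_allowed and kDefaultDriftAllowed are hypothetical names, not Ceph identifiers.

```cpp
#include <cassert>

// Default of mon_clock_drift_allowed quoted in this post, in seconds.
constexpr double kDefaultDriftAllowed = 0.050;

// Hypothetical helper: true when a skew would be flagged as HEALTH_WARN
// under the comparison used in timecheck_status().
bool skew_exceeds_allowed(double skew_bound,
                          double allowed = kDefaultDriftAllowed) {
  assert(allowed >= 0.0);
  const double abs_skew = skew_bound > 0 ? skew_bound : -skew_bound;
  return abs_skew > allowed;  // strictly greater, as in the Ceph code
}
```

The comparison is strict and ignores the sign, so a peon that is 0.2 s behind the leader is flagged just like one that is 0.2 s ahead, while a skew of exactly the allowed value is not.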
If it is exceeded, ceph health detail prints something like this:
ceph health detail
HEALTH_WARN clock skew detected on mon.1, mon.2
mon.1 addr 192.168.0.6:6789/0 clock skew 8.37274s > max 0.05s (latency 0.004945s)
mon.2 addr 192.168.0.7:6789/0 clock skew 8.52479s > max 0.05s (latency 0.005965s)
This logic lives in:
void Monitor::get_health(string& status, bufferlist *detailbl, Formatter *f)
{
...
if (f) {
f->open_object_section("timechecks");
f->dump_unsigned("epoch", get_epoch());
f->dump_int("round", timecheck_round);
f->dump_stream("round_status")
<< ((timecheck_round%2) ? "on-going" : "finished");
}
if (!timecheck_skews.empty()) {
list<string> warns;
if (f)
f->open_array_section("mons");
for (map<entity_inst_t,double>::iterator i = timecheck_skews.begin();
i != timecheck_skews.end(); ++i) {
entity_inst_t inst = i->first;
double skew = i->second;
double latency = timecheck_latencies[inst];
string name = monmap->get_name(inst.addr);
ostringstream tcss;
health_status_t tcstatus = timecheck_status(tcss, skew, latency);
if (tcstatus != HEALTH_OK) {
if (overall > tcstatus)
overall = tcstatus;
warns.push_back(name);
ostringstream tmp_ss;
tmp_ss << "mon." << name
<< " addr " << inst.addr << " " << tcss.str()
<< " (latency " << latency << "s)";
detail.push_back(make_pair(tcstatus, tmp_ss.str()));
}
if (f) {
f->open_object_section("mon");
f->dump_string("name", name.c_str());
f->dump_float("skew", skew);
f->dump_float("latency", latency);
f->dump_stream("health") << tcstatus;
if (tcstatus != HEALTH_OK)
f->dump_stream("details") << tcss.str();
f->close_section();
}
}
...
}
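Many articles already cover what to do in this situation. It mostly comes down to running ntpdate once to force the clocks back in sync:
- Stop the ntpd service on every node, if it is running:
/etc/init.d/ntpd stop
- Synchronize the time:
ntpdate {ntpserver}
Note that if the nodes cannot reach the internet, you can pick one machine as the NTP server and have all the others sync to it.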