Bean Li 2018-12-16T07:06:24+00:00 beanli.coder@gmail.com RadosGW object upload and the multisite-related logic 2018-12-16T13:20:40+00:00 Bean Li http://bean-li.github.io/multisite-put-obj Preface

This article walks through MultiSite's internal data structures and workflows, to deepen understanding of RadosGW internals. There is already plenty of material online about how to set up MultiSite, so that part is not repeated here.

The zonegroup created for this article is named xxxx, with two zones:

  • master
  • secondary

The zonegroup information is as follows:

{
    "id": "9908295f-d8f5-4ac3-acd7-c955a177bd09",
    "name": "xxxx",
    "api_name": "",
    "is_master": "true",
    "endpoints": [
        "http:\/\/s3.246.com\/"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8aa27332-01da-486a-994c-1ce527fa2fd7",
    "zones": [
        {
            "id": "484742ba-f8b7-4681-8411-af96ac778150",
            "name": "secondary",
            "endpoints": [
                "http:\/\/s3.243.com\/"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false"
        },
        {
            "id": "8aa27332-01da-486a-994c-1ce527fa2fd7",
            "name": "master",
            "endpoints": [
                "http:\/\/s3.246.com\/"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false"
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "0c4b59a1-e1e7-4367-9b65-af238a2f145b"
}

Related pools

Data pool and index pool

The first pool to look at is the data pool, i.e. where the object data uploaded by users is ultimately stored:

root@NODE-246:/var/log/ceph# radosgw-admin zone get 
{
    "id": "8aa27332-01da-486a-994c-1ce527fa2fd7",
    "name": "master",
    "domain_root": "default.rgw.data.root",
    "control_pool": "default.rgw.control",
    "gc_pool": "default.rgw.gc",
    "log_pool": "default.rgw.log",
    "intent_log_pool": "default.rgw.intent-log",
    "usage_log_pool": "default.rgw.usage",
    "user_keys_pool": "default.rgw.users.keys",
    "user_email_pool": "default.rgw.users.email",
    "user_swift_pool": "default.rgw.users.swift",
    "user_uid_pool": "default.rgw.users.uid",
    "system_key": {
        "access_key": "B9494C9XE7L7N50E9K2V",
        "secret_key": "O8e3IYV0gxHOwy61Og5ep4f7vQWPPFPhqRXjJrYT"
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "default.rgw.buckets.index",
                "data_pool": "default.rgw.buckets.data",
                "data_extra_pool": "default.rgw.buckets.non-ec",
                "index_type": 0
            }
        }
    ],
    "metadata_heap": "",
    "realm_id": "0c4b59a1-e1e7-4367-9b65-af238a2f145b"
}

From the output above, the default-placement of the master zone uses:

Role              Pool name
data pool         default.rgw.buckets.data
index pool        default.rgw.buckets.index
data extra pool   default.rgw.buckets.non-ec

The test version is Jewel, which does not yet support dynamic index resharding. We set index max shards = 8, i.e. each bucket has 8 index shards:

rgw_override_bucket_index_max_shards = 8 

The following command shows the bucket information of the current cluster:

root@NODE-246:/var/log/ceph# radosgw-admin bucket stats
[
    {
        "bucket": "segtest2",
        "pool": "default.rgw.buckets.data",
        "index_pool": "default.rgw.buckets.index",
        "id": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769",
        "marker": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769",
        "owner": "segs3account",
        ...
    },
    {
        "bucket": "segtest1",
        "pool": "default.rgw.buckets.data",
        "index_pool": "default.rgw.buckets.index",
        "id": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768",
        "marker": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768",
        "owner": "segs3account",
        ...
    }
]

From the output above, there are two buckets, with the following bucket ids:

bucket name   bucket id
segtest1      8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768
segtest2      8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769

Since each bucket has 8 index shards, there are 16 index objects in total:

root@NODE-246:/var/log/ceph# rados -p default.rgw.buckets.index ls
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.7
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.0
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.2
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.5
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.6
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.3
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.4
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.3
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.0
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.1
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.6
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.5
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.2
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.1
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.4
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.7

default.rgw.log pool

The log pool records various kinds of logs. For the MultiSite use case, we can find objects named like this in the default.rgw.log pool:

root@NODE-246:~# rados -p default.rgw.log ls  |grep data_log 
data_log.0
data_log.11
data_log.12
data_log.8
data_log.14
data_log.13
data_log.10
data_log.9
data_log.7

In general there are at most rgw_data_log_num_shards objects with this naming style. In our case:

OPTION(rgw_data_log_num_shards, OPT_INT, 128) 

The following code in rgw_bucket.h shows how these object names are generated:

    num_shards = cct->_conf->rgw_data_log_num_shards;
    oids = new string[num_shards];
    string prefix = cct->_conf->rgw_data_log_obj_prefix;
    if (prefix.empty()) {
      prefix = "data_log";
    }   
    for (int i = 0; i < num_shards; i++) {
      char buf[16];
      snprintf(buf, sizeof(buf), "%s.%d", prefix.c_str(), i); 
      oids[i] = buf;
    }   
    renew_thread = new ChangesRenewThread(cct, this);
    renew_thread->create("rgw_dt_lg_renew");

Generally the object itself is empty; the useful information is all recorded in its omap:

root@NODE-246:~# rados -p default.rgw.log stat data_log.61
default.rgw.log/data_log.61 mtime 2018-12-10 14:39:38.000000, size 0
root@NODE-246:~# rados -p default.rgw.log listomapkeys data_log.61
1_1544421980.298394_2914.1
1_1544422002.458109_2939.1
...
1_1544423969.748641_4486.1
1_1544423978.090683_4495.1
1_1544424000.286801_4507.1

Writing an object

Overview

At a macro level, uploading an object to a bucket means writing to several places (assuming both bi log and data log are enabled):

  • default.rgw.buckets.data: the actual object data is written to this pool, generally as a new <bucket_marker>_<object_name> object
  • default.rgw.buckets.index: once the data write completes, an entry for the object is added to the omap of its bucket index shard
  • a bi log entry is added to the omap of the bucket index shard object
  • a data log entry is added to the omap of a data_log object in the default.rgw.log pool

bi log

After the object upload completes, inspecting the bucket index shard shows the following:

root@node247:/var/log/ceph# rados -p default.rgw.buckets.index listomapkeys .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7 
oem.tar.bz2
0_00000000001.1.2
0_00000000002.2.3

Here oem.tar.bz2 is the object we uploaded, so we can skip it; besides it there are two additional keys, 0_00000000001.1.2 and 0_00000000002.2.3.

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 31 2e 31  |.0_00000000001.1|
00000010  2e 32                                             |.2|
00000012

value (133 bytes) :
00000000  03 01 7f 00 00 00 0f 00  00 00 30 30 30 30 30 30  |..........000000|
00000010  30 30 30 30 31 2e 31 2e  32 0b 00 00 00 6f 65 6d  |00001.1.2....oem|
00000020  2e 74 61 72 2e 62 7a 32  00 00 00 00 00 00 00 00  |.tar.bz2........|
00000030  01 01 0a 00 00 00 88 ff  ff ff ff ff ff ff ff 00  |................|
00000040  30 00 00 00 31 39 63 62  66 32 35 30 2d 62 62 33  |0...19cbf250-bb3|
00000050  65 2d 34 62 38 63 2d 62  35 62 66 2d 31 61 34 30  |e-4b8c-b5bf-1a40|
00000060  64 61 36 36 31 30 66 65  2e 31 35 30 38 33 2e 36  |da6610fe.15083.6|
00000070  34 32 31 30 00 00 01 00  00 00 00 00 00 00 00 00  |4210............|
00000080  00 00 00 00 00                                    |.....|
00000085

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012

value (125 bytes) :
00000000  03 01 77 00 00 00 0f 00  00 00 30 30 30 30 30 30  |..w.......000000|
00000010  30 30 30 30 32 2e 32 2e  33 0b 00 00 00 6f 65 6d  |00002.2.3....oem|
00000020  2e 74 61 72 2e 62 7a 32  e7 b2 14 5c 20 8e a4 04  |.tar.bz2...\ ...|
00000030  01 01 02 00 00 00 03 01  30 00 00 00 31 39 63 62  |........0...19cb|
00000040  66 32 35 30 2d 62 62 33  65 2d 34 62 38 63 2d 62  |f250-bb3e-4b8c-b|
00000050  35 62 66 2d 31 61 34 30  64 61 36 36 31 30 66 65  |5bf-1a40da6610fe|
00000060  2e 31 35 30 38 33 2e 36  34 32 31 30 00 01 02 00  |.15083.64210....|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00           |.............|
0000007d

Why, after the PUT, are there two extra key-value pairs in the omap of .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7, and what are they for?

Let's turn on debug-objclass on all OSDs to find out:

ceph tell osd.\* injectargs --debug-objclass 20

The logs then show the following:

Log of ceph-client.radosgw.0:
--------------------------------------
2018-12-15 15:53:11.079498 7f45723c7700 10 moving default.rgw.data.root+.bucket.meta.bucket_0:19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1 to cache LRU end
2018-12-15 15:53:11.079530 7f45723c7700 20  bucket index object: .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7
2018-12-15 15:53:11.083307 7f45723c7700 20 RGWDataChangesLog::add_entry() bucket.name=bucket_0 shard_id=7 now=2018-12-15 15:53:11.0.083306s cur_expiration=1970-01-01 08:00:00.000000s
2018-12-15 15:53:11.083351 7f45723c7700 20 RGWDataChangesLog::add_entry() sending update with now=2018-12-15 15:53:11.0.083306s cur_expiration=2018-12-15 15:53:41.0.083306s
2018-12-15 15:53:11.085002 7f45723c7700  2 req 64210:0.012000:s3:PUT /bucket_0/oem.tar.bz2:put_obj:completing
2018-12-15 15:53:11.085140 7f45723c7700  2 req 64210:0.012139:s3:PUT /bucket_0/oem.tar.bz2:put_obj:op status=0
2018-12-15 15:53:11.085148 7f45723c7700  2 req 64210:0.012147:s3:PUT /bucket_0/oem.tar.bz2:put_obj:http status=200
2018-12-15 15:53:11.085159 7f45723c7700  1 ====== req done req=0x7f45723c1750 op status=0 http_status=200 ======

ceph-osd.0.log 
----------------
2018-12-15 15:53:11.080017 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:689: rgw_bucket_prepare_op(): request: op=0 name=oem.tar.bz2 instance= tag=19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210

2018-12-15 15:53:11.083526 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:830: rgw_bucket_complete_op(): request: op=0 name=oem.tar.bz2 instance= ver=3:1 tag=19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210

2018-12-15 15:53:11.083592 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:753: read_index_entry(): existing entry: ver=-1:0 name=oem.tar.bz2 instance= locator=

2018-12-15 15:53:11.083639 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:949: rgw_bucket_complete_op(): remove_objs.size()=0

2018-12-15 15:53:12.142564 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.142584 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=0

2018-12-15 15:53:12.170787 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.170799 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=0

2018-12-15 15:53:12.194152 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.194167 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=1

2018-12-15 15:53:12.256510 7faaa6fb3700 10 <cls> cls/rgw/cls_rgw.cc:2591: bi_log_iterate_range
2018-12-15 15:53:12.256523 7faaa6fb3700  0 <cls> cls/rgw/cls_rgw.cc:2621: bi_log_iterate_entries start_key=<80>0_00000000002.2.3 end_key=<80>1000_

From the logs above, the key steps are:

  • rgw_bucket_prepare_op
  • rgw_bucket_complete_op
  • RGWDataChangesLog::add_entry()

At the end of void RGWPutObj::execute(), processor->complete is called:

  op_ret = processor->complete(etag, &mtime, real_time(), attrs,
                               (delete_at ? *delete_at : real_time()), if_match, if_nomatch,
                               (user_data.empty() ? nullptr : &user_data));  

complete calls do_complete, so let's look at do_complete directly:

int RGWPutObjProcessor_Atomic::do_complete(string& etag, real_time *mtime, real_time set_mtime,
                                           map<string, bufferlist>& attrs, real_time delete_at,
                                           const char *if_match,
                                           const char *if_nomatch, const string *user_data) {
  // wait for all asynchronous writes of this rgw object to complete
  int r = complete_writing_data();                                              
  if (r < 0)
    return r;
  // mark this object as an Atomic object
  obj_ctx.set_atomic(head_obj);
  // write the rgw object's attrs into the head object's xattrs
  RGWRados::Object op_target(store, bucket_info, obj_ctx, head_obj);
  /* some object types shouldn't be versioned, e.g., multipart parts */
  op_target.set_versioning_disabled(!versioned_object);

  RGWRados::Object::Write obj_op(&op_target);

  obj_op.meta.data = &first_chunk;
  obj_op.meta.manifest = &manifest;
  obj_op.meta.ptag = &unique_tag; /* use req_id as operation tag */
  obj_op.meta.if_match = if_match;
  obj_op.meta.if_nomatch = if_nomatch;
  obj_op.meta.mtime = mtime;
  obj_op.meta.set_mtime = set_mtime;
  obj_op.meta.owner = bucket_info.owner;
  obj_op.meta.flags = PUT_OBJ_CREATE;
  obj_op.meta.olh_epoch = olh_epoch;
  obj_op.meta.delete_at = delete_at;
  obj_op.meta.user_data = user_data;

  /* write_meta is a composite operation, and the focus of the analysis below */
  r = obj_op.write_meta(obj_len, attrs);
  if (r < 0) {
    return r;
  }
  canceled = obj_op.meta.canceled;
  return 0;                                                     
}

To find out what 0_00000000001.1.2 and 0_00000000002.2.3 in the bucket index shard's omap really are, we need to step into write_meta:

int RGWRados::Object::Write::write_meta(uint64_t size,
                  map<string, bufferlist>& attrs)
{
  int r = 0;
  RGWRados *store = target->get_store();
  if ((r = this->_write_meta(store, size, attrs, true)) == -ENOTSUP) {
    ldout(store->ctx(), 0) << "WARNING: " << __func__
      << "(): got ENOSUP, retry w/o store pg ver" << dendl;
    r = this->_write_meta(store, size, attrs, false);      
  }
  return r;
}


int RGWRados::Object::Write::_write_meta(RGWRados *store, uint64_t size,
                  map<string, bufferlist>& attrs, bool store_pg_ver)
{
  ...
  r = index_op.prepare(CLS_RGW_OP_ADD);
  if (r < 0)
    return r;

  r = ref.ioctx.operate(ref.oid, &op); 
  if (r < 0) { /* we can expect to get -ECANCELED if object was replaced under,
                or -ENOENT if was removed, or -EEXIST if it did not exist
                before and now it does */
    goto done_cancel;
  }

  epoch = ref.ioctx.get_last_version();
  poolid = ref.ioctx.get_id();

  r = target->complete_atomic_modification();
  if (r < 0) {
    ldout(store->ctx(), 0) << "ERROR: complete_atomic_modification returned r=" << r << dendl;
  }
  r = index_op.complete(poolid, epoch, size, 
                        meta.set_mtime, etag, content_type, &acl_bl,
                        meta.category, meta.remove_objs, meta.user_data);

  ...    
}

RGWRados::Bucket::UpdateIndex::prepare

During index_op.prepare, the key-value pair 0_00000000001.1.2 is written into the bucket index shard.

int RGWRados::Bucket::UpdateIndex::prepare(RGWModifyOp op)
{
  if (blind) {
    return 0;
  }
  RGWRados *store = target->get_store();
  BucketShard *bs;
  int ret = get_bucket_shard(&bs);
  if (ret < 0) {
    ldout(store->ctx(), 5) << "failed to get BucketShard object: ret=" << ret << dendl;
    return ret;
  }
  if (obj_state && obj_state->write_tag.length()) {
    optag = string(obj_state->write_tag.c_str(), obj_state->write_tag.length());
  } else {
    if (optag.empty()) {
      append_rand_alpha(store->ctx(), optag, optag, 32);
    }
  }
  ret = store->cls_obj_prepare_op(*bs, op, optag, obj, bilog_flags);
  return ret;
}
int RGWRados::cls_obj_prepare_op(BucketShard& bs, RGWModifyOp op, string& tag, 
                                 rgw_obj& obj, uint16_t bilog_flags)
{
  ObjectWriteOperation o;
  cls_rgw_obj_key key(obj.get_index_key_name(), obj.get_instance());
  cls_rgw_bucket_prepare_op(o, op, tag, key, obj.get_loc(), get_zone().log_data, bilog_flags);
  int flags = librados::OPERATION_FULL_TRY;
  int r = bs.index_ctx.operate(bs.bucket_obj, &o, flags);
  return r;
}
void cls_rgw_bucket_prepare_op(ObjectWriteOperation& o, RGWModifyOp op, string& tag,
                               const cls_rgw_obj_key& key, const string& locator, bool log_op,
                               uint16_t bilog_flags)
{
  struct rgw_cls_obj_prepare_op call;
  call.op = op; 
  call.tag = tag;
  call.key = key;
  call.locator = locator;
  call.log_op = log_op;
  call.bilog_flags = bilog_flags;
  bufferlist in; 
  ::encode(call, in);
  o.exec("rgw", "bucket_prepare_op", in);
} 

cls/rgw/cls_rgw.cc
----------------------
void __cls_init()
{
    ...
   cls_register_cxx_method(h_class, "bucket_prepare_op", CLS_METHOD_RD | CLS_METHOD_WR, rgw_bucket_prepare_op, &h_rgw_bucket_prepare_op); 
    ...
}

int rgw_bucket_prepare_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  ...
  CLS_LOG(1, "rgw_bucket_prepare_op(): request: op=%d name=%s instance=%s tag=%s\n",
          op.op, op.key.name.c_str(), op.key.instance.c_str(), op.tag.c_str());

...
      // fill in proper state
  struct rgw_bucket_pending_info info;
  info.timestamp = real_clock::now();
  info.state = CLS_RGW_STATE_PENDING_MODIFY;
  info.op = op.op;
  entry.pending_map.insert(pair<string, rgw_bucket_pending_info>(op.tag, info));

  struct rgw_bucket_dir_header header;
  rc = read_bucket_header(hctx, &header);
  if (rc < 0) {
    CLS_LOG(1, "ERROR: rgw_bucket_complete_op(): failed to read header\n");
    return rc;
  }

  if (op.log_op) {
    // generates 0_00000000001.1.2
    rc = log_index_operation(hctx, op.key, op.op, op.tag, entry.meta.mtime,
                             entry.ver, info.state, header.ver, header.max_marker, op.bilog_flags, NULL, NULL);
    if (rc < 0)
      return rc;
  }

  // write out new key to disk
  bufferlist info_bl;
  ::encode(entry, info_bl);
  rc = cls_cxx_map_set_val(hctx, idx, &info_bl);
  if (rc < 0)
    return rc; 
  return write_bucket_header(hctx, &header);
}

Note the log_index_operation function above; our first key, 0_00000000001.1.2, is generated by it.

static void bi_log_prefix(string& key)
{
  key = BI_PREFIX_CHAR;
  key.append(bucket_index_prefixes[BI_BUCKET_LOG_INDEX]);
}

static void bi_log_index_key(cls_method_context_t hctx, string& key, string& id, uint64_t index_ver)                                                   
{
  bi_log_prefix(key);
  get_index_ver_key(hctx, index_ver, &id);
  key.append(id);
}
#define BI_PREFIX_CHAR 0x80    
#define BI_BUCKET_OBJS_INDEX          0
#define BI_BUCKET_LOG_INDEX           1
#define BI_BUCKET_OBJ_INSTANCE_INDEX  2
#define BI_BUCKET_OLH_DATA_INDEX      3
#define BI_BUCKET_LAST_INDEX          4
static string bucket_index_prefixes[] = { "", /* special handling for the objs list index */
                                          "0_",     /* bucket log index */
                                          "1000_",  /* obj instance index */
                                          "1001_",  /* olh data index */
                                          /* this must be the last index */
                                          "9999_",};

As we can see, every bi log key starts with the character 0x80, followed by '0_':

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012

We can view the corresponding bilog entry with radosgw-admin bilog list:

    {
        "op_id": "7#00000000001.1.2",
        "op_tag": "19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210",
        "op": "write",
        "object": "oem.tar.bz2",
        "instance": "",
        "state": "pending",
        "index_ver": 1,
        "timestamp": "0.000000",
        "ver": {
            "pool": -1,
            "epoch": 0
        },
        "bilog_flags": 0,
        "versioned": false,
        "owner": "",
        "owner_display_name": ""
    },

RGWRados::Bucket::UpdateIndex::complete

Having covered the prepare phase of UpdateIndex, let's look at the complete phase:

int RGWRados::Bucket::UpdateIndex::complete(int64_t poolid, uint64_t epoch, uint64_t size, 
                                    ceph::real_time& ut, string& etag, string& content_type,
                                    bufferlist *acl_bl, RGWObjCategory category,
                                    list<rgw_obj_key> *remove_objs, const string *user_data)

At the end of this function:

  ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category, remove_objs, bilog_flags);
  int r = store->data_log->add_entry(bs->bucket, bs->shard_id);
  if (r < 0) {
    lderr(store->ctx()) << "ERROR: failed writing data log" << dendl;
  }
  return ret;

Of these, store->cls_obj_complete_add follows this call chain:

int RGWRados::cls_obj_complete_add(BucketShard& bs, string& tag,
                                   int64_t pool, uint64_t epoch,
                                   RGWObjEnt& ent, RGWObjCategory category,
                                   list<rgw_obj_key> *remove_objs, uint16_t bilog_flags)
{
  return cls_obj_complete_op(bs, CLS_RGW_OP_ADD, tag, pool, epoch, ent, category, remove_objs, bilog_flags);
}

int RGWRados::cls_obj_complete_op(BucketShard& bs, RGWModifyOp op, string& tag,
                                  int64_t pool, uint64_t epoch,
                                  RGWObjEnt& ent, RGWObjCategory category,
                                  list<rgw_obj_key> *remove_objs, uint16_t bilog_flags)
{
      ...
      cls_rgw_bucket_complete_op(o, op, tag, ver, key, dir_meta, pro,  
                             get_zone().log_data, bilog_flags);
      ...
}
void cls_rgw_bucket_complete_op(ObjectWriteOperation& o, RGWModifyOp op, string& tag,
                                rgw_bucket_entry_ver& ver,
                                const cls_rgw_obj_key& key,
                                rgw_bucket_dir_entry_meta& dir_meta,
                                list<cls_rgw_obj_key> *remove_objs, bool log_op,
                                uint16_t bilog_flags)
{

  bufferlist in;
  struct rgw_cls_obj_complete_op call;
  call.op = op;
  call.tag = tag;
  call.key = key;
  call.ver = ver;
  call.meta = dir_meta;
  call.log_op = log_op;
  call.bilog_flags = bilog_flags;
  if (remove_objs)
    call.remove_objs = *remove_objs;
  ::encode(call, in);
  o.exec("rgw", "bucket_complete_op", in);
}

cls/rgw/cls_rgw.cc
-------------------
int rgw_bucket_complete_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
   ...
    case CLS_RGW_OP_ADD:
    {
      struct rgw_bucket_dir_entry_meta& meta = op.meta;
      struct rgw_bucket_category_stats& stats = header.stats[meta.category];
      entry.meta = meta;
      entry.key = op.key;
      entry.exists = true;
      entry.tag = op.tag;
      stats.num_entries++;
      stats.total_size += meta.accounted_size;
      stats.total_size_rounded += cls_rgw_get_rounded_size(meta.accounted_size);
      bufferlist new_key_bl;
      ::encode(entry, new_key_bl);
      int ret = cls_cxx_map_set_val(hctx, idx, &new_key_bl);
      if (ret < 0)
        return ret;
    }
    break;
  }

  if (op.log_op) {
    rc = log_index_operation(hctx, op.key, op.op, op.tag, entry.meta.mtime, entry.ver,
                             CLS_RGW_STATE_COMPLETE, header.ver, header.max_marker, op.bilog_flags, NULL, NULL);
    if (rc < 0)
      return rc;                                               
 }
}

Here log_index_operation is called again; this call generates the second bi log entry:

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012

Likewise, we can view this bilog entry with radosgw-admin:

   radosgw-admin bilog list  --bucket bucket_0
   {
        "op_id": "7#00000000002.2.3",
        "op_tag": "19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210",
        "op": "write",
        "object": "oem.tar.bz2",
        "instance": "",
        "state": "complete",
        "index_ver": 2,
        "timestamp": "2018-12-15 07:53:11.077893152Z",
        "ver": {
            "pool": 3,
            "epoch": 1
        },
        "bilog_flags": 0,
        "versioned": false,
        "owner": "",
        "owner_display_name": ""
    },

That covers both bi log entries written during an object upload. Note how the numeric part of the key is generated:

static void bi_log_index_key(cls_method_context_t hctx, string& key, string& id, uint64_t index_ver)
{
  bi_log_prefix(key);
  get_index_ver_key(hctx, index_ver, &id);
  key.append(id);
}
static void get_index_ver_key(cls_method_context_t hctx, uint64_t index_ver, string *key)
{
  char buf[48];
  snprintf(buf, sizeof(buf), "%011llu.%llu.%d", (unsigned long long)index_ver,
           (unsigned long long)cls_current_version(hctx),
           cls_current_subop_num(hctx));                                               
  *key = buf;
} 
uint64_t cls_current_version(cls_method_context_t hctx)  
{ 
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;

  return ctx->pg->info.last_user_version;
}
int cls_current_subop_num(cls_method_context_t hctx)
{ 
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;

  return ctx->current_osd_subop_num;
}

Ceph guarantees that the trailing sequence part of the key is monotonically increasing. This monotonicity matters for multisite incremental sync.

data_log

As shown above, UpdateIndex::complete contains the following:

  ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category, remove_objs, bilog_flags);
  int r = store->data_log->add_entry(bs->bucket, bs->shard_id);
  if (r < 0) {
    lderr(store->ctx()) << "ERROR: failed writing data log" << dendl;
  }
  return ret;

Here store->data_log->add_entry is what appends an entry to the corresponding data_log object in the default.rgw.log pool.

Whenever an object operation is performed on a bucket, a new log entry whose key starts with "1_" is written to the omap, marking the bucket as modified. During incremental sync, these entries are used to determine which buckets have changed, and each changed bucket is then synced individually.

Mapping buckets to data_log.X

In our Jewel deployment each bucket has 8 shards, while the number of data_log.X objects in default.rgw.log is rgw_data_log_num_shards, i.e. 128. RGW maps the former to the latter as follows:

int RGWDataChangesLog::choose_oid(const rgw_bucket_shard& bs) {
    const string& name = bs.bucket.name;
    int shard_shift = (bs.shard_id > 0 ? bs.shard_id : 0);
    uint32_t r = (ceph_str_hash_linux(name.c_str(), name.size()) + shard_shift) % num_shards; 
    return (int)r;
}

Suppose we have N buckets, each with 8 shards; choose_oid then maps these 8*N bucket shards onto the 128 data_log.X objects.

After uploading an object, a new key-value pair appears in the omap of one of the data_log.X objects in default.rgw.log:

root@NODE-246:/var/log# rados -p default.rgw.log ls |grep data_log  |xargs -I {} rados -p default.rgw.log listomapvals {} 
1_1544942616.469385_1491.1
value (185 bytes) :
00000000  02 01 b3 00 00 00 00 00  00 00 37 00 00 00 62 75  |..........7...bu|
00000010  63 6b 65 74 5f 30 3a 31  39 63 62 66 32 35 30 2d  |cket_0:19cbf250-|
00000020  62 62 33 65 2d 34 62 38  63 2d 62 35 62 66 2d 31  |bb3e-4b8c-b5bf-1|
00000030  61 34 30 64 61 36 36 31  30 66 65 2e 31 35 30 38  |a40da6610fe.1508|
00000040  33 2e 31 3a 37 18 f4 15  5c 44 40 fa 1b 4a 00 00  |3.1:7...\D@..J..|
00000050  00 01 01 44 00 00 00 01  37 00 00 00 62 75 63 6b  |...D....7...buck|
00000060  65 74 5f 30 3a 31 39 63  62 66 32 35 30 2d 62 62  |et_0:19cbf250-bb|
00000070  33 65 2d 34 62 38 63 2d  62 35 62 66 2d 31 61 34  |3e-4b8c-b5bf-1a4|
00000080  30 64 61 36 36 31 30 66  65 2e 31 35 30 38 33 2e  |0da6610fe.15083.|
00000090  31 3a 37 18 f4 15 5c 44  40 fa 1b 1a 00 00 00 31  |1:7...\D@......1|
000000a0  5f 31 35 34 34 39 34 32  36 31 36 2e 34 36 39 33  |_1544942616.4693|
000000b0  38 35 5f 31 34 39 31 2e  31                       |85_1491.1|
000000b9

What is the naming convention for these keys?

cls/log/cls_log.cc
-----------------------
static string log_index_prefix = "1_"; 
static void get_index(cls_method_context_t hctx, utime_t& ts, string& index)
{
  get_index_time_prefix(ts, index);   
  string unique_id;
  cls_cxx_subop_version(hctx, &unique_id);
  index.append(unique_id);
}
static void get_index_time_prefix(utime_t& ts, string& index)
{
  char buf[32];
  snprintf(buf, sizeof(buf), "%010ld.%06ld_", (long)ts.sec(), (long)ts.usec());
  index = log_index_prefix + buf;
}
uint64_t cls_current_version(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;

  return ctx->pg->info.last_user_version;
}
int cls_current_subop_num(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->current_osd_subop_num;
}
void cls_cxx_subop_version(cls_method_context_t hctx, string *s) 
{
  if (!s)
    return;
  char buf[32];
  uint64_t ver = cls_current_version(hctx);
  int subop_num = cls_current_subop_num(hctx);
  snprintf(buf, sizeof(buf), "%lld.%d", (long long)ver, subop_num);
  *s = buf;
}

The key 1_1544942616.469385_1491.1 follows the same pattern, and Ceph likewise guarantees that it increases monotonically. This property is important during multisite sync.

iSCSI command 2018-11-28T23:12:40+00:00 Bean Li http://bean-li.github.io/iSCSI-Command Preface

I keep forgetting the common iSCSI client commands, so I'm recording them here.

Common commands

View current sessions

Before mounting any target, it generally looks like this:

root@node-242:~# iscsiadm -m session 
iscsiadm: No active sessions.

After mounting:

root@node-242:~# iscsiadm -m session 
tcp: [2] 10.16.172.247:3260,1 iqn.2018-11.com:BEAN

root@node-242:~# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-871
Target: iqn.2018-11.com:BEAN
	Current Portal: 10.16.172.247:3260,1
	Persistent Portal: 10.16.172.247:3260,1
		**********
		Interface:
		**********
		Iface Name: default
		Iface Transport: tcp
		Iface Initiatorname: iqn.1993-08.org.debian:01:c9c12dd76e
		Iface IPaddress: 10.16.172.242
		Iface HWaddress: (null)
		Iface Netdev: (null)
		SID: 2
		iSCSI Connection State: LOGGED IN
		iSCSI Session State: LOGGED_IN
		Internal iscsid Session State: NO CHANGE
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 262144
		MaxXmitDataSegmentLength: 1048576
		FirstBurstLength: 262144
		MaxBurstLength: 1048576
		ImmediateData: Yes
		InitialR2T: No
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 25	State: running
		scsi25 Channel 00 Id 0 Lun: 0
			Attached scsi disk sde		State: running

Discover targets by IP

iscsiadm -m discovery -t st -p 10.16.172.246

The output looks like this:

root@node-242:~# iscsiadm -m discovery -t st -p 10.16.172.246
10.16.172.246:3260,1 iqn.2018-11.com:BEAN
10.16.172.247:3260,1 iqn.2018-11.com:BEAN
10.16.172.248:3260,1 iqn.2018-11.com:BEAN

Log in to a specific target

iscsiadm -m node -T [target_name] -p [ip:3260] -l

For example:

iscsiadm -m node -T iqn.2018-11.com:BEAN -p 10.16.172.246:3260 -l

The output looks like this:

root@node-242:~# iscsiadm -m node -T iqn.2018-11.com:BEAN -p 10.16.172.246:3260 -l
Logging in to [iface: default, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260]
Login to [iface: default, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260]: successful

After logging in, check with iscsiadm -m session; the result typically looks like this:

root@node-242:~# iscsiadm -m session 
tcp: [3] 10.16.172.246:3260,1 iqn.2018-11.com:BEAN

Log out of a specific target

iscsiadm -m node -T [target_name] -p [ip:3260] -u

For example:

root@node-242:~# iscsiadm -m node -T iqn.2018-11.com:BEAN -p 10.16.172.246:3260 -u
Logging out of session [sid: 3, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260]
Logout of [sid: 3, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260]: successful

After logging out, verify with iscsiadm -m session:

root@node-242:~# iscsiadm -m session 
iscsiadm: No active sessions.

Session and device information

Generally, logging in to a target adds a new block device. Before login, lsblk shows:

root@node2:~# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0    30G  0 disk 
├─sda1   8:1    0     7M  0 part 
├─sda2   8:2    0  22.2G  0 part /
├─sda3   8:3    0   7.5G  0 part [SWAP]
└─sda4   8:4    0   261M  0 part 
sdb      8:16   0   100G  0 disk 
├─sdb1   8:17   0     8G  0 part 
└─sdb2   8:18   0    92G  0 part /data/osd.2
sdc      8:32   0     2T  0 disk 
sr0     11:0    1  1024M  0 rom 

After logging in to the target:

root@node2:~# iscsiadm -m node -T iqn.2018-11.com:BEAN -p 10.16.172.246:3260 -l
Logging in to [iface: default, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260] (multiple)
Login to [iface: default, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260] successful.
root@node2:~# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0    30G  0 disk 
├─sda1   8:1    0     7M  0 part 
├─sda2   8:2    0  22.2G  0 part /
├─sda3   8:3    0   7.5G  0 part [SWAP]
└─sda4   8:4    0   261M  0 part 
sdb      8:16   0   100G  0 disk 
├─sdb1   8:17   0     8G  0 part 
└─sdb2   8:18   0    92G  0 part /data/osd.2
sdc      8:32   0     2T  0 disk 
sr0     11:0    1  1024M  0 rom  

A new device, sdc, has appeared.

How can we confirm which iSCSI target sdc corresponds to? Use:

iscsiadm -m session -P 3

For example, in the output below (taken from another session), the disk sde is the iSCSI disk coming from the target iqn.2018-11.com:BEAN at 10.16.172.247:3260:

root@node-242:~# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-871
Target: iqn.2018-11.com:BEAN
	Current Portal: 10.16.172.247:3260,1
	Persistent Portal: 10.16.172.247:3260,1
		**********
		Interface:
		**********
		Iface Name: default
		Iface Transport: tcp
		Iface Initiatorname: iqn.1993-08.org.debian:01:c9c12dd76e
		Iface IPaddress: 10.16.172.242
		Iface HWaddress: (null)
		Iface Netdev: (null)
		SID: 2
		iSCSI Connection State: LOGGED IN
		iSCSI Session State: LOGGED_IN
		Internal iscsid Session State: NO CHANGE
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 262144
		MaxXmitDataSegmentLength: 1048576
		FirstBurstLength: 262144
		MaxBurstLength: 1048576
		ImmediateData: Yes
		InitialR2T: No
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 25	State: running
		scsi25 Channel 00 Id 0 Lun: 0
			Attached scsi disk sde		State: running
How s3 data store in ceph 2018-06-01T17:20:40+00:00 Bean Li http://bean-li.github.io/how-s3-data-store-in-ceph Preface

This post tackles the object-storage part of the "Where is my data" question, focusing on S3 object storage.

where is my s3 data

The short answer: user S3 data lives in the .rgw.buckets pool. But the objects in that pool look like this:

default.11383165.1_kern.log
....
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_53
default.11383165.1_821
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_260
default.11383165.1_5
default.11383165.1_572
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_618
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_153
default.11383165.1_217
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_537
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_357
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_565
default.11383165.1_441
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_223

How do these names map to buckets, and to the user objects inside a bucket?

Whole-object upload

A whole-object (non-multipart) upload falls into one of two cases, distinguished by:

    "rgw_max_chunk_size": "524288"

This value is the size of a single IO that RadosGW sends down to the RADOS cluster; it also determines the size of the head object (head_obj) when an application object is split into multiple RADOS objects.

  • the object is smaller than the chunk size, i.e. under 512 KB
  • the object is larger than the chunk size, i.e. over 512 KB

Note that for an object larger than rgw_max_chunk_size, the remainder is cut into multiple RADOS objects according to:

"rgw_obj_stripe_size": "4194304"

In other words, an object under 512 KB is stored as a single RADOS object, while a larger one is split into several: the first, called the head object, is rgw_max_chunk_size in size, and the rest is striped into RADOS objects of rgw_obj_stripe_size each.
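The split can be sketched in a few lines of Python (a simplified model assuming the default values above; rados_object_sizes is an illustrative helper, not an RGW function):

```python
RGW_MAX_CHUNK_SIZE = 512 * 1024        # head object size (rgw_max_chunk_size)
RGW_OBJ_STRIPE_SIZE = 4 * 1024 * 1024  # tail stripe size (rgw_obj_stripe_size)

def rados_object_sizes(obj_size):
    """Sizes of the RADOS objects an uploaded object is split into."""
    if obj_size <= RGW_MAX_CHUNK_SIZE:
        return [obj_size]                  # one head object, nothing else
    sizes = [RGW_MAX_CHUNK_SIZE]           # head_obj
    remaining = obj_size - RGW_MAX_CHUNK_SIZE
    while remaining > 0:
        sizes.append(min(remaining, RGW_OBJ_STRIPE_SIZE))
        remaining -= sizes[-1]
    return sizes

# a 10 MiB object: 512 KiB head + two full 4 MiB stripes + a 1.5 MiB tail
print(rados_object_sizes(10 * 1024 * 1024))
# [524288, 4194304, 4194304, 1572864]
```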

Objects smaller than rgw_max_chunk_size

This case is simple: the bucket_id and the object name are joined with an underscore to form the name of the underlying object in the pool.
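As a sketch (the format is inferred from the listings in this post; head_obj_name is a hypothetical helper):

```python
def head_obj_name(bucket_id, obj_name):
    """RADOS name of the head object: "<bucket_id>_<object name>"."""
    return bucket_id + "_" + obj_name

print(head_obj_name("default.11383165.2", "syslog"))
# default.11383165.2_syslog
```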

root@44:~# s3cmd pub /var/log/syslog s3://bean_book/syslog 
ERROR: Invalid command: u'pub'
root@44:~# s3cmd put /var/log/syslog s3://bean_book/syslog 
/var/log/syslog -> s3://bean_book/syslog  [1 of 1]
 60600 of 60600   100% in    0s     8.56 MB/s  done


root@44:~# rados -p .rgw.buckets ls |grep syslog 
default.11383165.2_syslog
root@44:/# rados -p .rgw.buckets stat default.11383165.2_syslog
.rgw.buckets/default.11383165.2_syslog mtime 2018-05-27 14:51:14.000000, size 60600

Objects larger than rgw_max_chunk_size

An object larger than rgw_max_chunk_size is stored as multiple underlying RADOS objects.

root@44:~# s3cmd put VirtualStor\ Scaler-v6.3-319~201805240311~cda7fd7.iso s3://bean_book/scaler.iso 

Once the upload finishes, we find the following objects in .rgw.buckets:

default.11383165.2_scaler.iso
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_208
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_221
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_76
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_293

The uploaded object is decomposed into:

  • a head object (head_obj) of exactly rgw_max_chunk_size (512 KB)
  • several intermediate objects of exactly the stripe size
  • one tail object of at most the stripe size

The head_obj is named as above, so I won't repeat the diagram; the intermediate and tail objects are named as follows:

There is a subtlety here: the object names contain a random string. With only one large (>4M) object in the bucket, say the single 2 GB+ file I uploaded, every shadow object in the bucket obviously belongs to scaler.iso.

But if the bucket holds many such 2 GB+ objects, how do we tell them apart?

root@44:/var/log/ceph# rados -p .rgw.buckets ls |grep shadow |grep "_1$"
default.11383165.2__shadow_.3vU63olQg1ovOpVdWQxJsx2o28N3TFl_1
default.11383165.2__shadow_.iDlJATXiRQBiT9xxSX5qS_Rb8iFdHam_1
default.11383165.2__shadow_.ipsp4zhQCPa1ckNNQZaJeLRSq3miyhR_1
default.11383165.2__shadow_.JKq4eXO5IJ6BMANVmLluwcUVHH7wzW9_1
default.11383165.2__shadow_.C7e7w4gQLapZ_KK3c2_2pKcz-yIobaN_1
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_1
default.11383165.2__shadow_.OvUkm8069EUeyXHneWhd4JOiVPev3gI_1
default.11383165.2__shadow_.zNsCV2xYKlym7uLDkR7cV0SF3edH0t3_1

In other words, the head_obj can be tied to the user object, but how are the intermediate and tail objects tied back to the head_obj?

The head_obj is special: it maintains the object's metadata and its manifest:

root@44:~# rados -p .rgw.buckets listxattr default.11383165.2_scaler.iso 
user.rgw.acl
user.rgw.content_type
user.rgw.etag
user.rgw.idtag
user.rgw.manifest
user.rgw.x-amz-date

Among these, the structure that matters most for locating data is the manifest:

rados -p .rgw.buckets getxattr  default.11383165.2_scaler.iso  user.rgw.manifest  > /root/scaler.iso.manifest

root@44:~# ceph-dencoder type RGWObjManifest import /root/scaler.iso.manifest  decode dump_json
{
    "objs": [],
    "obj_size": 2842374144,     <--- size of the object file
    "explicit_objs": "false",
    "head_obj": {
        "bucket": {
            "name": "bean_book",
            "pool": ".rgw.buckets",
            "data_extra_pool": ".rgw.buckets.extra",
            "index_pool": ".rgw.buckets.index",
            "marker": "default.11383165.2",
            "bucket_id": "default.11383165.2"
        },
        "key": "",
        "ns": "",
        "object": "scaler.iso",         <--- object name
        "instance": ""
    },
    "head_size": 524288,
    "max_head_size": 524288,
    "prefix": ".mGwYpWb3FXieaaaDNdaPzfs546ysNnT_",      <--- random prefix of the intermediate and tail objects
    "tail_bucket": {
        "name": "bean_book",
        "pool": ".rgw.buckets",
        "data_extra_pool": ".rgw.buckets.extra",
        "index_pool": ".rgw.buckets.index",
        "marker": "default.11383165.2",
        "bucket_id": "default.11383165.2"
    },
    "rules": [
        {
            "key": 0,
            "val": {
                "start_part_num": 0,
                "start_ofs": 524288,
                "part_size": 0,
                "stripe_max_size": 4194304,
                "override_prefix": ""
            }
        }
    ]
}

With the head size, stripe size, and prefix, it is easy to construct the names of the intermediate and tail objects, and thus to read any part of the user object.
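As a sketch, the tail names can be rebuilt from the manifest fields alone (naming inferred from the listings above; tail_obj_names is a hypothetical helper):

```python
def tail_obj_names(bucket_id, prefix, obj_size,
                   head_size=524288, stripe_size=4194304):
    """Names of the intermediate and tail objects; they cover the byte
    range [head_size, obj_size) in stripe_size pieces."""
    names = []
    covered, idx = head_size, 1
    while covered < obj_size:
        names.append("%s__shadow_%s%d" % (bucket_id, prefix, idx))
        covered += stripe_size
        idx += 1
    return names

names = tail_obj_names("default.11383165.2",
                       ".mGwYpWb3FXieaaaDNdaPzfs546ysNnT_",
                       2842374144)
print(names[0])    # default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_1
print(len(names))  # 678 stripes for scaler.iso
```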

With data location settled, let's look at the other metadata:

root@44:~# rados -p .rgw.buckets getxattr  default.11383165.2_scaler.iso  user.rgw.etag -
9df9be75a165539894ef584cd27cc39f

root@44:~# md5sum VirtualStor\ Scaler-v6.3-319~201805240311~cda7fd7.iso 
9df9be75a165539894ef584cd27cc39f  VirtualStor Scaler-v6.3-319~201805240311~cda7fd7.iso

For a non-multipart object, the ETag is simply the MD5, recorded in the head_obj's extended attributes.

The object's ACL is also recorded in the head_obj's extended attributes:

root@44:~# rados -p .rgw.buckets getxattr  default.11383165.2_scaler.iso  user.rgw.acl > scaler.iso.acl
root@44:~# ceph-dencoder type RGWAccessControlPolicy import scaler.iso.acl  decode dump_json
{
    "acl": {
        "acl_user_map": [
            {
                "user": "bean_li",
                "acl": 15
            }
        ],
        "acl_group_map": [],
        "grant_map": [
            {
                "id": "bean_li",
                "grant": {
                    "type": {
                        "type": 0
                    },
                    "id": "bean_li",
                    "email": "",
                    "permission": {
                        "flags": 15
                    },
                    "name": "bean_li",
                    "group": 0
                }
            }
        ]
    },
    "owner": {
        "id": "bean_li",
        "display_name": "bean_li"
    }
}

Besides these default extended attributes, user-specified metadata is stored here as well.

Multipart upload

How is the data of a multipart-uploaded object stored?

root@44:~# cp VirtualStor\ Scaler-v6.3-319~201805240311~cda7fd7.iso  /var/share/ezfs/shareroot/NAS/scaler_iso
root@44:~# 
root@44:~# s3cmd mb s3://iso
Bucket 's3://iso/' created

Upload it using multipart, with 10 MB parts:

The upload produces RADOS objects named in this style:

default.14434697.1_scaler_iso
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.187_2
default.14434697.1__multipart_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.129_2
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.134_2
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.22_1
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.83_2
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.136_2

The head_obj keeps the old naming style; the difference from a whole-object upload is that its size is 0:

root@45:/var/log/radosgw# rados -p .rgw.buckets stat default.14434697.1_scaler_iso 
.rgw.buckets/default.14434697.1_scaler_iso mtime 2018-05-27 18:48:32.000000, size 0

Note the 2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH in the names above. The part most likely to puzzle you is the 2~, which is the upload_id prefix:

#define MULTIPART_UPLOAD_ID_PREFIX_LEGACY "2/"
#define MULTIPART_UPLOAD_ID_PREFIX "2~" // must contain a unique char that may not come up in gen_rand_alpha() 

The naming rule is as follows:

Note that the multipart objects in RADOS are plain rgw_obj_stripe_size objects, i.e. 4M:

root@45:/var/log/radosgw# rados -p .rgw.buckets stat default.14434697.1__multipart_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31 
.rgw.buckets/default.14434697.1__multipart_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31 mtime 2018-05-27 18:48:10.000000, size 4194304

But the application-level part size is chosen by the user; our RRS, for example, uses 10 MB parts:

obsync.py
-----------------
MULTIPART_THRESH = 10485760

            mpu = self.bucket.initiate_multipart_upload(obj.name, metadata=meta_to_dict(obj.meta))
            try: 
                remaining = obj.size
                part_num = 0
                part_size = MULTIPART_THRESH

                while remaining > 0: 
                    offset = part_num * part_size
                    length = min(remaining, part_size)
                    ioctx = src.get_obj_ioctx(obj, offset, length)
                    mpu.upload_part_from_file(ioctx, part_num + 1) 
                    remaining -= length
                    part_num += 1 

                mpu.complete_upload()
            except Exception as e:
                mpu.cancel_upload()
                raise e

Clearly a single multipart object cannot hold 10 MB, so a part generally comes with corresponding shadow objects:

root@45:/var/log/radosgw# rados -p .rgw.buckets ls |grep "2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31"
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_1
default.14434697.1__multipart_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_2

root@45:/var/log/radosgw# rados -p .rgw.buckets stat default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_1 
.rgw.buckets/default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_1 mtime 2018-05-27 18:48:10.000000, size 4194304
root@45:/var/log/radosgw# rados -p .rgw.buckets stat default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_2
.rgw.buckets/default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_2 mtime 2018-05-27 18:48:10.000000, size 2097152

No surprise: the two shadows are 4 MB and 2 MB, which together with the 4 MB multipart object add up to the 10 MB part.
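The objects backing one part can be sketched as follows (inferred from the listings above, assuming default 4 MiB stripes; part_rados_objects is a hypothetical helper):

```python
def part_rados_objects(bucket_id, key, upload_id, part_num,
                       part_size, stripe_size=4194304):
    """(name, size) of the RADOS objects backing one multipart part:
    the first stripe is a __multipart_ object, later stripes are
    __shadow_ ... _N objects."""
    base = "%s.%s.%d" % (key, upload_id, part_num)
    objs = [("%s__multipart_%s" % (bucket_id, base),
             min(part_size, stripe_size))]
    remaining, shadow = part_size - objs[0][1], 1
    while remaining > 0:
        size = min(remaining, stripe_size)
        objs.append(("%s__shadow_%s_%d" % (bucket_id, base, shadow), size))
        remaining, shadow = remaining - size, shadow + 1
    return objs

objs = part_rados_objects("default.14434697.1", "scaler_iso",
                          "2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH",
                          31, 10 * 1024 * 1024)
print([size for _, size in objs])  # [4194304, 4194304, 2097152]
```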

The same question arises on read: the names of the data objects depend on how the object was uploaded, so how do we tell a multipart upload from a whole-object upload? Again, through the manifest:

root@45:~# rados -p .rgw.buckets getxattr default.14434697.1_scaler_iso user.rgw.manifest > scaler_iso_multipart.manifest 

root@45:~# ceph-dencoder type RGWObjManifest import /root/scaler_iso_multipart.manifest decode dump_json
{
    "objs": [],
    "obj_size": 2842374144,
    "explicit_objs": "false",
    "head_obj": {
        "bucket": {
            "name": "iso",
            "pool": ".rgw.buckets",
            "data_extra_pool": ".rgw.buckets.extra",
            "index_pool": ".rgw.buckets.index",
            "marker": "default.14434697.1",
            "bucket_id": "default.14434697.1"
        },
        "key": "",
        "ns": "",
        "object": "scaler_iso",
        "instance": ""
    },
    "head_size": 0,
    "max_head_size": 0,
    "prefix": "scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH",
    "tail_bucket": {
        "name": "iso",
        "pool": ".rgw.buckets",
        "data_extra_pool": ".rgw.buckets.extra",
        "index_pool": ".rgw.buckets.index",
        "marker": "default.14434697.1",
        "bucket_id": "default.14434697.1"
    },
    "rules": [
        {
            "key": 0,
            "val": {
                "start_part_num": 1,
                "start_ofs": 0,
                "part_size": 10485760,
                "stripe_max_size": 4194304,
                "override_prefix": ""
            }
        },
        {
            "key": 2841640960,
            "val": {
                "start_part_num": 272,
                "start_ofs": 2841640960,
                "part_size": 733184,
                "stripe_max_size": 4194304,
                "override_prefix": ""
            }
        }
    ]
}
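The two rules are enough to map any byte offset to its part: the first covers the 271 full 10 MiB parts, the second the short final part. A sketch over the decoded rule values (locate is a hypothetical helper):

```python
def locate(ofs, rules):
    """Part number and offset-within-part for a byte offset, given the
    manifest rules (start_ofs, start_part_num, part_size)."""
    rule = max((r for r in rules if r["start_ofs"] <= ofs),
               key=lambda r: r["start_ofs"])
    rel = ofs - rule["start_ofs"]
    return (rule["start_part_num"] + rel // rule["part_size"],
            rel % rule["part_size"])

rules = [
    {"start_part_num": 1,   "start_ofs": 0,          "part_size": 10485760},
    {"start_part_num": 272, "start_ofs": 2841640960, "part_size": 733184},
]
print(locate(0, rules))           # (1, 0): start of the first part
print(locate(2841640960, rules))  # (272, 0): start of the short last part
```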
flashcache source code analysis 2017-10-20T17:20:40+00:00 Bean Li http://bean-li.github.io/flashcache-source-code-1 Preface

Starting from flashcache creation, this post walks through flashcache's on-SSD layout and in-memory data structures; in short, how its data is organized.

        sprintf(dmsetup_cmd, "echo 0 %lu flashcache %s %s %s %d 2 %lu %lu %d %lu %d %lu"
                " | dmsetup create %s",
                disk_devsize, disk_devname, ssd_devname, cachedev, cache_mode, block_size, 
                cache_size, associativity, disk_associativity, write_cache_only, md_block_size,
                cachedev);

Counting the parameters after the word flashcache:

| dmc member | dmsetup create parameter | default | meaning |
|------------|--------------------------|---------|---------|
| disk_dev | disk_devname | | name of the slow block device |
| cache_dev | ssd_devname | | name of the SSD device |
| dm_vdevname | cachedev | | the name given to the flashcache device |
| cache_mode | cache_mode | | three legal values: write_back, write_through, write_around |
| persistence (not a dmc member) | 2 | 2 | flashcache_ctr actually serves both flashcache_create and flashcache_load |
| block_size | block_size | 8 | 8 sectors, i.e. 4K |
| size | cache_size | | device sectors / block_size; note this is a count of blocks, not sectors |
| assoc | associativity | 512 | legal values: powers of 2 strictly between 256 and 8192 |
| disk_assoc | disk_associativity | | |
| write_only_cache | write_cache_only | 0 | write_back has a sub-mode, write_only |
| md_block_size | | 8 | |
| num_sets | | dmc->size >> dmc->assoc_shift | |


The parameters that affect flashcache's layout are:

  • block_size: 8 by default, i.e. 8 sectors (4 KB) form one block
  • size: the number of blocks

Note the following code (from flashcache_writeback_create):

       
      // Up to this point, dmc->size is the number of sectors on the SSD;
      // only after dmc->size /= dmc->block_size below does it become a block count.

        dmc->md_blocks = INDEX_TO_MD_BLOCK(dmc, dmc->size / dmc->block_size) + 1 + 1; 
        /* subtract the sectors needed for md blocks: what remains is the
         * maximum number of sectors usable for data */
        dmc->size -= dmc->md_blocks * MD_SECTORS_PER_BLOCK(dmc);  
        /* number of blocks (4K each by default) usable for cache data */
        dmc->size /= dmc->block_size;
        /* blocks are grouped into sets (assoc, 512 blocks per set by default),
         * so round the block count down to a multiple of 512 */
        dmc->size = (dmc->size / dmc->assoc) * dmc->assoc;           
        
        /* with the exact block count known, recompute the md blocks needed */
        dmc->md_blocks = INDEX_TO_MD_BLOCK(dmc, dmc->size) + 1 + 1;                                                                                    
        DMINFO("flashcache_writeback_create: md_blocks = %d, md_sectors = %d\n", 
               dmc->md_blocks, dmc->md_blocks * MD_SECTORS_PER_BLOCK(dmc));
        dev_size = to_sector(dmc->cache_dev->bdev->bd_inode->i_size);
        cache_size = dmc->md_blocks * MD_SECTORS_PER_BLOCK(dmc) + (dmc->size * dmc->block_size);
        if (cache_size > dev_size) {
                DMERR("Requested cache size exceeds the cache device's capacity" \
                      "(%lu>%lu)",
                      cache_size, dev_size);
                vfree((void *)header);
                return 1;
        }

After this code runs, we have a basic picture of how flashcache is organized: 8 sectors form a block, and 512 blocks form a set, so one set is 2M. After carving the metadata out of the SSD's total space, the rest is organized like this:

Note this is only the cache-block portion; flashcache also has metadata blocks and a superblock. Like a filesystem, flashcache has a superblock describing its organization:

        header = (struct flash_superblock *)vmalloc(MD_BLOCK_BYTES(dmc));
        if (!header) {
                DMERR("flashcache_writeback_create: Unable to allocate sector");
                return 1;                                                                                                                              
        }
struct flash_superblock {
        sector_t size;          /* Cache size */
        u_int32_t block_size;   /* Cache block size */
        u_int32_t assoc;        /* Cache associativity */
        u_int32_t cache_sb_state;       /* Clean shutdown ? */
        char cache_devname[DEV_PATHLEN]; /* Contains dm_vdev name as of v2 modifications */
        sector_t cache_devsize;
        char disk_devname[DEV_PATHLEN]; /* underlying block device name (use UUID paths!) */
        sector_t disk_devsize;
        u_int32_t cache_version;
        u_int32_t md_block_size;                                                                                                                       
        u_int32_t disk_assoc;
        u_int32_t write_only_cache;
};

Although flashcache's superblock needs little space, flashcache reserves a whole metadata block for it, i.e. 4 KB by default, leaving room for future extensions.

The flash_superblock structure lives in the first 4K of the SSD device. After a reboot, flashcache_load reads the device and verifies the contents of the leading sectors, i.e. the superblock:

        ssd_devname = argv[optind++];
        cache_fd = open(ssd_devname, O_RDONLY);
        if (cache_fd < 0) {
                fprintf(stderr, "Failed to open %s\n", ssd_devname);
                exit(1);
        }   
        lseek(cache_fd, 0, SEEK_SET);
        if (read(cache_fd, buf, 512) < 0) {
                fprintf(stderr, "Cannot read Flashcache superblock %s\n", ssd_devname);
                exit(1);                    
        }   
        if (!(sb->cache_sb_state == CACHE_MD_STATE_DIRTY ||
              sb->cache_sb_state == CACHE_MD_STATE_CLEAN ||
              sb->cache_sb_state == CACHE_MD_STATE_FASTCLEAN ||
              sb->cache_sb_state == CACHE_MD_STATE_UNSTABLE)) {
                fprintf(stderr, "%s: Invalid Flashcache superblock %s\n", pname, ssd_devname);
                exit(1);
        }   

When creating a flashcache, flashcache_create likewise reads the SSD's first sector to check whether a flashcache has already been created on it.

Each cache block in the grid above needs a structure describing its state, such as whether its contents are valid or DIRTY:

#ifdef FLASHCACHE_DO_CHECKSUMS
struct flash_cacheblock {                                                                                                                              
        sector_t        dbn;    /* Sector number of the cached block */
        u_int64_t       checksum;
        u_int32_t       cache_state; /* INVALID | VALID | DIRTY */
} __attribute__ ((aligned(32)));
#else   
struct flash_cacheblock {
        sector_t        dbn;    /* Sector number of the cached block */
        u_int32_t       cache_state; /* INVALID | VALID | DIRTY */      
} __attribute__ ((aligned(16)));
#endif

In our case flash_cacheblock is 16 bytes, so every cache block carries 16 bytes of metadata describing it. Each metadata block is 4 KB by default, so one metadata block can hold the metadata of 256 cache blocks.
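The arithmetic, for reference (assuming the 16-byte, checksum-free layout and 4 KB metadata blocks):

```python
MD_BLOCK_BYTES = 4096         # one metadata block: 8 sectors * 512 bytes
FLASH_CACHEBLOCK_BYTES = 16   # sizeof(struct flash_cacheblock), no checksum

entries_per_md_block = MD_BLOCK_BYTES // FLASH_CACHEBLOCK_BYTES
print(entries_per_md_block)   # 256 cache blocks described per metadata block
```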

Putting this together, the complete SSD layout looks like this:

The layout above is on the block device. In addition, while flashcache runs it consumes memory: in-memory structures manage the cache blocks, as follows:

 order = dmc->size * sizeof(struct cacheblock); 
 struct cacheblock {
        u_int16_t       cache_state;
        int16_t         nr_queued;      /* jobs in pending queue */                                                                                    
        u_int16_t       lru_prev, lru_next;
        u_int8_t        use_cnt;
        u_int8_t        lru_state;
        sector_t        dbn;    /* Sector number of the cached block */
        u_int16_t       hash_prev, hash_next;
#ifdef FLASHCACHE_DO_CHECKSUMS
        u_int64_t       checksum;
#endif
} __attribute__((packed));

For now, ignoring checksums, 18 bytes of memory describe one cache block (4 KB by default).

        order = dmc->size * sizeof(struct cacheblock);
        DMINFO("Allocate %luKB (%luB per) mem for %lu-entry cache" \
               "(capacity:%luMB, associativity:%u, block size:%u " \
               "sectors(%uKB))",
               order >> 10, sizeof(struct cacheblock), dmc->size,
               cache_size >> (20-SECTOR_SHIFT), dmc->assoc, dmc->block_size,
               dmc->block_size >> (10-SECTOR_SHIFT));
        dmc->cache = (struct cacheblock *)vmalloc(order);
        if (!dmc->cache) {
                vfree((void *)header);
                DMERR("flashcache_writeback_create: Unable to allocate cache md");
                return 1;
        }
        memset(dmc->cache, 0, order);
        /* Initialize the cache structs */
        for (i = 0; i < dmc->size ; i++) {
                dmc->cache[i].dbn = 0;
#ifdef FLASHCACHE_DO_CHECKSUMS
                dmc->cache[i].checksum = 0;
#endif
                dmc->cache[i].cache_state = INVALID;
                dmc->cache[i].lru_state = 0;
                dmc->cache[i].nr_queued = 0;
        }                         

With this 18-byte in-memory descriptor per cache block, we can estimate the memory consumed when a 400 GB SSD serves as the flashcache cache device:

400G/4KB*18 = 1.8GB
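Checking that figure (18 bytes of struct cacheblock per 4 KB cache block):

```python
ssd_bytes = 400 * 1024**3   # 400 GiB SSD used as the cache device
cache_block = 4 * 1024      # default block size: 8 sectors = 4 KiB
per_block_md = 18           # sizeof(struct cacheblock), packed, no checksum

mem_bytes = ssd_bytes // cache_block * per_block_md
print(mem_bytes / 1024**3)  # ~1.76 GiB, roughly the 1.8 GB quoted above
```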

cache_set

dmc->assoc defaults to 512, meaning 512 blocks form a set, i.e. 512 * 4K = 2 MB:

init:
        /* compute the total number of flashcache sets */
        dmc->num_sets = dmc->size >> dmc->assoc_shift;
        order = dmc->num_sets * sizeof(struct cache_set);
        dmc->cache_sets = (struct cache_set *)vmalloc(order);                                                                                          
        if (!dmc->cache_sets) {
                ti->error = "Unable to allocate memory";
                r = -ENOMEM;
                vfree((void *)dmc->cache);
                goto bad3;
        }                                    
        memset(dmc->cache_sets, 0, order);
        for (i = 0 ; i < dmc->num_sets ; i++) {
                dmc->cache_sets[i].set_fifo_next = i * dmc->assoc;
                dmc->cache_sets[i].set_clean_next = i * dmc->assoc;
                dmc->cache_sets[i].fallow_tstamp = jiffies;
                dmc->cache_sets[i].fallow_next_cleaning = jiffies;
                dmc->cache_sets[i].hotlist_lru_tail = FLASHCACHE_NULL;
                dmc->cache_sets[i].hotlist_lru_head = FLASHCACHE_NULL;
                dmc->cache_sets[i].warmlist_lru_tail = FLASHCACHE_NULL;
                dmc->cache_sets[i].warmlist_lru_head = FLASHCACHE_NULL;
                spin_lock_init(&dmc->cache_sets[i].set_spin_lock);
        }

Each set is described by its own structure:

struct cache_set {
        spinlock_t              set_spin_lock;
        u_int32_t               set_fifo_next;
        u_int32_t               set_clean_next;
        u_int16_t               clean_inprog;
        u_int16_t               nr_dirty;
        u_int16_t               dirty_fallow;
        unsigned long           fallow_tstamp;
        unsigned long           fallow_next_cleaning;
        /*  
         * 2 LRU queues/cache set.
         * 1) A block is faulted into the MRU end of the warm list from disk.
         * 2) When the # of accesses hits a threshold, it is promoted to the
         * (MRU) end of the hot list. To keep the lists in equilibrium, the
         * LRU block from the host list moves to the MRU end of the warm list.
         * 3) Within each list, an access will move the block to the MRU end.
         * 4) Reclaims happen from the LRU end of the warm list. After reclaim
         * we move a block from the LRU end of the hot list to the MRU end of
         * the warm list.
         */
        u_int16_t               hotlist_lru_head, hotlist_lru_tail;
        u_int16_t               warmlist_lru_head, warmlist_lru_tail;
        u_int16_t               lru_hot_blocks, lru_warm_blocks;
#define NUM_BLOCK_HASH_BUCKETS          512
        u_int16_t               hash_buckets[NUM_BLOCK_HASH_BUCKETS];
        u_int16_t               invalid_head;                                                                                                          
};

Note that within a set, a cache block sits on one of three lists depending on its state:

  • INVALID
    • the invalid list, headed by invalid_head
  • VALID
    • hot:
      • the hot list, from hotlist_lru_head to hotlist_lru_tail
    • warm
      • the warm list, from warmlist_lru_head to warmlist_lru_tail

A cacheblock is on exactly one of these lists: never on both hot and warm, let alone on both invalid and warm.

On a 64-bit system a pointer takes 8 bytes, so an ordinary linked list would spend 16 bytes per node on prev/next, which is wasteful. flashcache instead uses u_int16_t: each cacheblock records its predecessor and successor as 2-byte values. These values are indices within the same set, and since a set has only 512 blocks by default, a 2-byte short is plenty.

When a cacheblock holds no data it sits on the invalid list; nothing on that list carries useful data. Naturally, a freshly created flashcache contains no useful data and is not yet associated with the SATA disk's data, so every cacheblock starts on the invalid list. flashcache_ctr contains the following:

        for (i = 0 ; i < dmc->size ; i++) {
                dmc->cache[i].hash_prev = FLASHCACHE_NULL;
                dmc->cache[i].hash_next = FLASHCACHE_NULL;
                /* Note: flashcache_ctr is not only called when creating a
                 * flashcache; it also runs as flashcache_load after a reboot
                 * of a cache that has been in use. So each cacheblock's
                 * cache_state must be checked to place it on the right list. */

                /* if VALID is set, insert into the flashcache hash for lookup */
                if (dmc->cache[i].cache_state & VALID) {
                        flashcache_hash_insert(dmc, i);
                        atomic_inc(&dmc->cached_blocks);
                }    
                /* if dirty, bump the dirty counters */
                if (dmc->cache[i].cache_state & DIRTY) {
                        dmc->cache_sets[i / dmc->assoc].nr_dirty++;
                        atomic_inc(&dmc->nr_dirty);
                }    
                /* freshly created caches, or blocks with no valid data, go on
                 * the invalid list; note there is one such list per cache set,
                 * not a single global list */
                if (dmc->cache[i].cache_state & INVALID)
                        flashcache_invalid_insert(dmc, i);

Now for the hot and warm lists. flashcache's replacement policy is LRU, maintained over these two lists; as the names suggest, blocks on the hot list are hotter and should be evicted last. Each list has a head and a tail; the closer a cacheblock is to the tail, the hotter it is and the less likely it is to be evicted.

Data moves as it is accessed: a frequently accessed block may be promoted from warm to hot, and the coldest block on the hot list (the one nearest the head) may be demoted back to warm.

Beyond that, consider an incoming IO. The first question is whether the requested address dbn already has a cacheblock in flashcache in the VALID state; if so, all is well. If not, the second step is to look for an unused cacheblock, i.e. one on the INVALID list. If, unluckily, there is no INVALID block because every block is in use (VALID), a victim must be chosen: the reclaim policy.

Next we use flashcache_read as an example to walk through how a cacheblock is found.

寻找cacheblock

Read requests are handled by flashcache_read. Note that reads and writes which are never going to enter a cacheblock are filtered out before flashcache_read is reached:

        uncacheable = (unlikely(dmc->bypass_cache) ||
                       (to_sector(bio->bi_size) != dmc->block_size) ||
                       /* 
                        * If the op is a READ, we serve it out of cache whenever possible, 
                        * regardless of cacheablity 
                        */
                       (bio_data_dir(bio) == WRITE && 
                        ((dmc->cache_mode == FLASHCACHE_WRITE_AROUND) ||
                         flashcache_uncacheable(dmc, bio))));
        spin_unlock_irqrestore(&dmc->ioctl_lock, flags);
        if (uncacheable) {
                flashcache_setlocks_multiget(dmc, bio);
                queued = flashcache_inval_blocks(dmc, bio);
                flashcache_setlocks_multidrop(dmc, bio);
                if (queued) {
                        if (unlikely(queued < 0))                    
                                flashcache_bio_endio(bio, -EIO, dmc, NULL);
                } else {
                        /* Start uncached IO */
                        /* bypass flashcache and access the slow device directly */
                        flashcache_start_uncached_io(dmc, bio);
                }
        } else {
                /* if this IO can go through flashcache, dispatch to
                 * flashcache_read or flashcache_write by direction */
                if (bio_data_dir(bio) == READ)
                        flashcache_read(dmc, bio);
                else
                        flashcache_write(dmc, bio);
        }
        return DM_MAPIO_SUBMITTED;

The focus of what remains is the cacheblock lookup and replacement policy; which IOs go through flashcache and which go straight to the slow device is not our concern here. We continue with flashcache_read as the example of finding a cacheblock.

The code below shows the lookup; the main work happens in flashcache_lookup.

        flashcache_setlocks_multiget(dmc, bio);
        res = flashcache_lookup(dmc, bio, &index);
        /* Cache Read Hit case */
        if (res > 0) {
                cacheblk = &dmc->cache[index];
                if ((cacheblk->cache_state & VALID) && 
                    (cacheblk->dbn == bio->bi_sector)) {
                        flashcache_read_hit(dmc, bio, index);
                        return;
                }
        }
        /*
         * In all cases except for a cache hit (and VALID), test for potential 
         * invalidations that we need to do.
         */
        queued = flashcache_inval_blocks(dmc, bio);
        if (queued) {
                if (unlikely(queued < 0))
                        flashcache_bio_endio(bio, -EIO, dmc, NULL);
                if ((res > 0) && 
                    (dmc->cache[index].cache_state == INVALID))
                        /* 
                         * If happened to pick up an INVALID block, put it back on the 
                         * per cache-set invalid list
                         */
                        flashcache_invalid_insert(dmc, index);                                                                                         
                flashcache_setlocks_multidrop(dmc, bio);
                return;
        }

Because data keeps flowing, the state of every cache block in every one of the N cache sets (M blocks each) keeps flowing too: a block that was invalid a moment ago may soon be on the warm list, and with more accesses migrate to the hot list. Understanding flashcache_lookup, i.e. how a cacheblock is found when a request targets sector_t dbn = bio->bi_sector, is the key step in understanding this state flow.

static int
flashcache_lookup(struct cache_c *dmc, struct bio *bio, int *index)
{
        sector_t dbn = bio->bi_sector;
#if DMC_DEBUG                                                                                                                                          
        int io_size = to_sector(bio->bi_size);
#endif
        unsigned long set_number = hash_block(dmc, dbn);
        int invalid, oldest_clean = -1;
        int start_index;

        start_index = dmc->assoc * set_number;
        DPRINTK("Cache lookup : dbn %llu(%lu), set = %d",
                dbn, io_size, set_number);
        find_valid_dbn(dmc, dbn, start_index, index);
        if (*index >= 0) {
                DPRINTK("Cache lookup HIT: Block %llu(%lu): VALID index %d",
                             dbn, io_size, *index);
                /* We found the exact range of blocks we are looking for */
                return VALID;
        }
        invalid = find_invalid_dbn(dmc, set_number);
        if (invalid == -1) {
                /* We didn't find an invalid entry, search for oldest valid entry */
                find_reclaim_dbn(dmc, start_index, &oldest_clean);
        }
        /* 
         * Cache miss :
         * We can't choose an entry marked INPROG, but choose the oldest                                                                               
         * INVALID or the oldest VALID entry.
         */
        *index = start_index + dmc->assoc;
        if (invalid != -1) {
                DPRINTK("Cache lookup MISS (INVALID): dbn %llu(%lu), set = %d, index = %d, start_index = %d", dbn, io_size, set_number, invalid, start_index);
                *index = invalid;
        } else if (oldest_clean != -1) {
                DPRINTK("Cache lookup MISS (VALID): dbn %llu(%lu), set = %d, index = %d, start_index = %d",
                             dbn, io_size, set_number, oldest_clean, start_index);
                *index = oldest_clean;
        } else {
                DPRINTK_LITE("Cache read lookup MISS (NOROOM): dbn %llu(%lu), set = %d",
                        dbn, io_size, set_number);
        }
        if (*index < (start_index + dmc->assoc))
                return INVALID;
        else {
                dmc->flashcache_stats.noroom++;
                return -1;
        }
}

This is the cacheblock lookup algorithm. Step one is finding the right set: by default each set has 512 cache blocks, so we first locate the cache set, then pick the right cache block within it. In plain terms, two steps:

  • find the right cache set
  • find the right cache block within that set

Step one is simple: hash the bio's sector number and map it to a cache set:

unsigned long   
hash_block(struct cache_c *dmc, sector_t dbn)
{
        unsigned long set_number, value;
        int num_cache_sets = dmc->size >> dmc->assoc_shift;

        /*
         * Starting in Flashcache SSD Version 3 :
         * We map a sequential cluster of disk_assoc blocks onto a given set.
         * But each disk_assoc cluster can be randomly placed in any set.
         * But if we are running on an older on-ssd cache, we preserve old
         * behavior.
         */
        if (dmc->on_ssd_version < 3 || dmc->disk_assoc == 0) {
                value = (unsigned long)
                        (dbn >> (dmc->block_shift + dmc->assoc_shift));
        } else {
                /* we take this branch */
                value = (unsigned long) (dbn >> dmc->disk_assoc_shift);
                /* Then place it in a random set */
                value = jhash_1word(value, 0xbeef);
        }
        set_number = value % num_cache_sets;
        DPRINTK("Hash: %llu(%lu)->%lu", dbn, value, set_number);                                                                                       
        return set_number;
}

We take the else branch. It uses a parameter that is not obvious on a first read of flashcache: disk_assoc_shift, derived from the disk_associativity option that can be specified when the flashcache device is created:

root@XMT-S02:~# dmsetup table
osd4: 0 70316455903 flashcache conf:
	ssd dev (/dev/disk/by-partlabel/osd4-ssd), disk dev (/dev/disk/by-partlabel/osd4-data) cache mode(WRITE_BACK)
	capacity(446572M), associativity(512), data block size(4K) metadata block size(4096b)
	disk assoc(256K)
	skip sequential thresh(32K)
	total blocks(114322432), cached blocks(96119380), cache percent(84)
	dirty blocks(41155646), dirty percent(35)
	nr_queued(0)

As shown above, disk assoc defaults to 256K. This option matters precisely in the choice of the cache set: without it, the dbn would be hashed directly and mapped to a set, so two adjacent dbns might well land in different cache sets, and merging I/O destined for the same cache set would be pointless, since adjacent dbns would rarely share a set.

The disk assoc parameter changes this. Before hashing, the code first executes:

value = (unsigned long) (dbn >> dmc->disk_assoc_shift);

This guarantees that all sectors within the same 256KB cluster yield the same value and therefore hash to the same cache set, so adjacent requests can later be merged, improving performance.

Apart from this subtlety, the rest is straightforward: compute the hash value, then take it modulo the number of cache sets to decide which set the block falls into.

With the first step done, the second step is to find a suitable cache block within the chosen cache set.

The core of that algorithm consists of three functions:

  • find_valid_dbn
  • find_invalid_dbn
  • find_reclaim_dbn

find_valid_dbn

static void
find_valid_dbn(struct cache_c *dmc, sector_t dbn, 
               int start_index, int *index)
{
        *index = flashcache_hash_lookup(dmc, start_index / dmc->assoc, dbn);
        if (*index == -1)
                return;
        if (dmc->sysctl_reclaim_policy == FLASHCACHE_LRU &&
            ((dmc->cache[*index].cache_state & BLOCK_IO_INPROG) == 0))
                flashcache_lru_accessed(dmc, *index);
        /* 
         * If the block was DIRTY and earmarked for cleaning because it was old, make 
         * the block young again.
         */
        flashcache_clear_fallow(dmc, *index);
}

int
flashcache_hash_lookup(struct cache_c *dmc,
                       int set,
                       sector_t dbn)                                                  
{
        struct cache_set *cache_set = &dmc->cache_sets[set];
        int index;
        struct cacheblock *cacheblk;
        u_int16_t set_ix;
#if 0
        int start_index, end_index, i;
#endif
        
        set_ix = *flashcache_get_hash_bucket(dmc, cache_set, dbn);
        while (set_ix != FLASHCACHE_NULL) {
                index = set * dmc->assoc + set_ix;
                cacheblk = &dmc->cache[index];
                /* Only VALID blocks on the hash queue */
                VERIFY(cacheblk->cache_state & VALID);
                VERIFY((cacheblk->cache_state & INVALID) == 0);
                if (dbn == cacheblk->dbn)
                        return index;
                set_ix = cacheblk->hash_next;
        }
        return -1;
}  

static inline u_int16_t *
flashcache_get_hash_bucket(struct cache_c *dmc, struct cache_set *cache_set, sector_t dbn)  
{
        unsigned int hash = jhash_1word(dbn, 0xfeed);
     
        return &cache_set->hash_buckets[hash % NUM_BLOCK_HASH_BUCKETS];
}

We have found the cache set, which by default holds 512 cache blocks. Does any of them contain the sector we need?

The naive approach is to compare every cache block, checking whether its dbn matches and its state is VALID. That is far too slow; the right approach is hashing.

When a cache block holds valid data, its dbn is hashed and the block is linked into the appropriate bucket within the cache set. This per-set hash speeds up the lookup of whether a given dbn is present in some cache block of the set.

For a read, the best case is that the requested data happens to live on flashcache's SSD device already: a read hit. On a hit the cache block has effectively received a useful access, so when space gets tight its chance of being evicted should drop; that is, its heat is raised.

        if (dmc->sysctl_reclaim_policy == FLASHCACHE_LRU &&
            ((dmc->cache[*index].cache_state & BLOCK_IO_INPROG) == 0))
                flashcache_lru_accessed(dmc, *index);

flashcache_lru_accessed is what runs when a cache block has just been accessed; a comment in the code describes the algorithm concisely:

/* 
 * Block is accessed.
 * 
 * Algorithm :
   if (block is in the warm list) {
       block_lru_refcnt++;
       if (block_lru_refcnt >= THRESHOLD) {
          clear refcnt
          Swap this block for the block at LRU end of hot list
       } else     
          move it to MRU end of the warm list
   }
   if (block is in the hot list)
       move it to MRU end of the hot list
 */

  • If the block is currently on the warm list
    • its reference count is incremented
      • if the count reaches the threshold (sysctl_lru_promote_thresh, typically 2), the block is swapped with the block at the LRU end (leftmost) of the hot list
      • if the count is still below the threshold, the block moves to the MRU end (rightmost) of the warm list
  • If the block is currently on the hot list
    • it moves to the MRU end (rightmost) of the hot list

Both lists, hot and warm, have their LRU (Least Recently Used) end on the left and their MRU (Most Recently Used) end on the right. When cache blocks must be evicted, candidates are considered in this order:

Warm List LRU -------->Warm List MRU--------->Hot List LRU -------------> Hot List MRU

The code itself is simple linked-list manipulation and is omitted here.

That completes the first case of finding a cache block within the set, and it is the luckiest one: the sector to be read is already in flashcache's SSD portion with VALID data, we obtain the cache block's index, and this access bumps the block's heat to the appropriate position.

But we may not be so lucky: the SSD may hold nothing for the requested dbn. In that case a cache block must be chosen to receive the data about to be read from the slow device's sectors, and the first choice is a block that is not yet in use, i.e. one in the INVALID state.

Why?

If instead we picked a VALID cache block, its contents would be replaced by the new dbn's data, and the old contents would be evicted from the SSD; a request for that old dbn arriving right afterwards would then miss. Worse, if the block is dirty, flashcache may have to wait for the dirty data to be flushed before the block can be reused.

So once a hit is impossible, an INVALID cache block is the best choice:

find_invalid_dbn

static int
find_invalid_dbn(struct cache_c *dmc, int set)                                                 
{
        int index = flashcache_invalid_get(dmc, set);

        if (index != -1) {
                if (dmc->sysctl_reclaim_policy == FLASHCACHE_LRU)
                        flashcache_lru_accessed(dmc, index);
                VERIFY((dmc->cache[index].cache_state & FALLOW_DOCLEAN) == 0);
        }    
        return index;
}

Finding an INVALID cache block is easy: within a cache set, all invalid blocks sit on the list headed by invalid_head, so we simply take the block at its head.

int
flashcache_invalid_get(struct cache_c *dmc, int set)
{
        struct cache_set *cache_set;
        int index;
        struct cacheblock *cacheblk;

        cache_set = &dmc->cache_sets[set];
        index = cache_set->invalid_head;
        if (index == FLASHCACHE_NULL)
                return -1;
        index += (set * dmc->assoc);
        cacheblk = &dmc->cache[index];
        VERIFY(cacheblk->cache_state == INVALID);
        flashcache_invalid_remove(dmc, index);                                                                                                      
        return index;
}

As before, flashcache_lru_accessed is called here too, because the cache block migrates from INVALID onto the MRU end of the warm list.

This case is still acceptable, since an idle cache block could be found. As flashcache fills up, though, even that becomes unlikely: every cache block in the set may already be in use, leaving no idle block in that cache set.

find_reclaim_dbn

In that case a victim must be picked among the in-use cache blocks; a cache block has to be reclaimed. Because the SSD device is smaller than the disk device, not all data can live on the SSD, so every cache needs a replacement policy, and an efficient one yields a bigger performance win.

When choosing the victim, the long-running maintenance of the hot list and warm list finally pays off: they provide the basis for the choice.

static void 
find_reclaim_dbn(struct cache_c *dmc, int start_index, int *index)
{
        if (dmc->sysctl_reclaim_policy == FLASHCACHE_FIFO)
                flashcache_reclaim_fifo_get_old_block(dmc, start_index, index);
        else /* flashcache_reclaim_policy == FLASHCACHE_LRU */
                flashcache_reclaim_lru_get_old_block(dmc, start_index, index);                                                                      
}

Flashcache currently supports two policies, FIFO and LRU. We discuss LRU here, which evicts the least recently used cache block.

The code comments explain:

/* 
 * Get least recently used LRU block
 * 
 * Algorithm :
 *      Always pick block from the LRU end of the warm list. 
 *      And move it to the MRU end of the warm list.
 *      If we don't find a suitable block in the "warm" list,
 *      pick the block from the hot list, demote it to the warm
 *      list and move a block from the warm list to the hot list.
 */

Always pick from the LRU end of the warm list, then move the chosen block to its MRU end. If no suitable block is found in the warm list, take the block at the LRU end of the hot list and demote it: the hot-list LRU block and the warm-list MRU block swap places.

These are simple linked-list operations, so the code is not shown here.

]]>
ceph-mon之Paxos算法(2) 2017-10-04T17:20:40+00:00 Bean Li http://bean-li.github.io/ceph-paxos-2 前言

The previous article walked through the normal flow of a proposal being accepted. Even with that flow covered, a few questions keep nagging.

What exactly is accepted_pn?

In the monitor leader's begin function:

 t->put(get_name(), last_committed+1, new_value);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", last_committed + 1);
  t->put(get_name(), "pending_pn", accepted_pn);

In the Peon's handle_begin function:

  t->put(get_name(), v, begin->values[v]);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", v);
  t->put(get_name(), "pending_pn", accepted_pn);

Encoding the proposal here makes sense: the commit phase must decode this bufferlist and apply it as a transaction. But what are the last two puts for? pending_v and pending_pn are never mentioned again afterwards, and it is not obvious what recording them achieves.

These writes exist for recovery. On the normal path they are never used, but when a failure occurs, Paxos's recovery logic needs exactly this information.

基本概念

  • PN: Proposal Number

After being elected, a Leader runs Phase 1 once to establish its PN; for as long as it remains Leader, all Phase 2 rounds share that PN. A large number of Phase 1 rounds is thereby skipped, which is how this Paxos variant reduces network overhead.

A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm
                                                                                  -- <<Paxos Made Simple>>
  • Version

A version can be understood as the Paxos Instance ID. Each application-level proposal is encoded into a binary byte stream that serves as the value, while the version (Instance ID) is the key mapped to that value.

The state that must be persisted:

Name               Meaning                                                         Notes
last_pn            PN generated the last time this node was elected leader         used by get_new_proposal_number(); the next election continues from it
accepted_pn        the PN I have accepted, possibly proposed by another leader     a peon uses it to reject smaller PNs
first_committed    the first committed version recorded on this node               earlier versions (logs) no longer exist on this node
last_committed     the last committed version recorded on this node                at most one later, uncommitted version may exist
uncommitted_v      the uncommitted version, if any; must equal last_committed+1    ceph allows only one uncommitted version
uncommitted_pn     the PN of the uncommitted version                               recorded in one transaction with uncommitted_v and uncommitted_value
uncommitted_value  the content of the uncommitted version                          recorded in one transaction with uncommitted_v and uncommitted_pn

Note that the three values prefixed with "uncommitted" may not exist at all; after a clean shutdown, for example, everything has been committed.

With these concepts introduced, we can start considering failures. Chronologically this article really ought to come first: the cluster's monitors must first reach a consistent state before the orderly steps of the previous article can proceed.

In terms of how one learns the material, though, the previous article covers the Paxos fast path, executed countless times a day, while bringing the ceph mons back to a consistent state is the exception path, taken only when something fails. Hence the chosen order: the normal flow first, then the failures and how the cluster recovers from them into consistency.

Note that after winning the election the Leader calls collect. The name looks odd at first but is apt: assorted failures may have occurred, a new leader has now been chosen, and it collects each member's state so that all members can be brought into agreement.

Without a clear picture of which failures can occur, reading collect, handle_collect and handle_last like a ledger makes it hard to see why the code is written this way, or why these few steps suffice to reach consistency.

So below we start from the failures themselves: which ones can occur, and how each is recovered from.

Recovery

After the mon leader is elected it enters STATE_RECOVERING and calls collect to gather the peons' state, so that the members can exchange what they have and reach agreement.

void Paxos::leader_init()
{
  cancel_events();
  new_value.clear();

  finish_contexts(g_ceph_context, proposals, -EAGAIN);

  logger->inc(l_paxos_start_leader);

  if (mon->get_quorum().size() == 1) {
    state = STATE_ACTIVE;
    return;
  }

  /* enter the recovering state */
  state = STATE_RECOVERING;
  lease_expire = utime_t();
  dout(10) << "leader_init -- starting paxos recovery" << dendl;
  
  /* call collect, entering phase 1 */
  collect(0);
}

Note that collect generates a new PN (Proposal Number). This number must be globally unique and monotonically increasing. With so many nodes in the cluster, and a mon leader that can change, how are these two properties guaranteed?

version_t Paxos::get_new_proposal_number(version_t gt)
{
  if (last_pn < gt) 
    last_pn = gt;
  
  // update. make it unique among all monitors.
  /* the core of the algorithm is the next four lines */
  last_pn /= 100;
  last_pn++;
  last_pn *= 100;
  last_pn += (version_t)mon->rank;

  // write
  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  t->put(get_name(), "last_pn", last_pn);

  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  t->dump(&f);
  f.flush(*_dout);
  *_dout << dendl;

  logger->inc(l_paxos_new_pn);
  utime_t start = ceph_clock_now(NULL);

  get_store()->apply_transaction(t);

  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_new_pn_latency, end - start);

  dout(10) << "get_new_proposal_number = " << last_pn << dendl;
  return last_pn;
}

Take the previous value, divide by 100, add 1, multiply by 100, then add this mon's rank. For example, with ranks 0, 1 and 2 and an initial PN of 100: whenever an election fires and monitor 0 is present, monitor 0 always wins, so the next PN = (100/100+1)*100+0 = 200. If another election happens while monitor 0 is down, monitor 1 wins and the new PN = (200/100+1)*100+1 = 301. If monitor 0 then comes back up and wins, the new PN = (301/100+1)*100+0 = 400.

Note that this value is updated only once, in collect after a leader election. Once agreement is reached there may be many subsequent proposals, but the PN does not change.

Step  Leader           Peon                  Notes
1     collect() =>                           the Leader sends the PN plus auxiliary information to each peon in the quorum, asking each to report its state
2                      <= handle_collect()   the Peon accepts or rejects the PN, possibly sharing already-committed data along the way
3     handle_last()                          success requires every peon in the quorum to accept the Leader's PN; based on the peons' state and its own, the Leader either re-proposes the uncommitted value or shares the data a member is missing, so that all members converge

The discussion below is split according to whether the mon leader or a Peon goes down.

Peon down

If a Peon goes down, the Leader detects it.

First, through the lease mechanism:

void Paxos::lease_ack_timeout()
{
  dout(1) << "lease_ack_timeout -- calling new election" << dendl;
  assert(mon->is_leader());
  assert(is_active());
  logger->inc(l_paxos_lease_ack_timeout);
  lease_ack_timeout_event = 0;
  mon->bootstrap();
}

Second, if OP_BEGIN has been sent and a down peon cannot reply with OP_ACCEPT, this fires:

void Paxos::accept_timeout()
{
  dout(1) << "accept timeout, calling fresh election" << dendl;
  accept_timeout_event = 0;
  assert(mon->is_leader());
  assert(is_updating() || is_updating_previous() || is_writing() ||
	 is_writing_previous());
  logger->inc(l_paxos_accept_timeout);
  mon->bootstrap();
}

In either case, bootstrap triggers a re-election. After the election the original Leader is still Leader, and it then calls collect.

We discuss this in two phases: while the Peon is down, and after it comes back up.

Peon Down

Note that collect generates a new PN:

  accepted_pn = get_new_proposal_number(MAX(accepted_pn, oldpn));
  accepted_pn_from = last_committed;

A Peon going down means the Leader has remained intact, and the re-election does not change the leader node. It follows that no Peon's data is newer than the Leader's:

  • last_committed(leader) >= last_committed(peon)
  • accepted_pn(leader) > accepted_pn(peon)

The second inequality holds because collect regenerates the PN, so the leader's accepted_pn is larger than every Peon's accepted_pn.

Timeout events run in the timer thread, which takes the monitor lock to do its work, so the leader's Paxos flow can only be interrupted at the following points:

  1. the Leader is in the active state, with no proposal in flight
  2. the leader is in the updating state: begin has executed and it is waiting for accepts; the leader holds uncommitted data and may already have received some accept messages
  3. the leader is in the writing state: all accept messages have arrived, i.e. commit_start has begun and the transaction is queued for execution
  4. the leader is in the writing state, the write has completed and the transaction has taken effect, but the callback (commit_finish) has not yet run (it runs only after acquiring the monitor lock)

Cases 3 and 4 can arise because the Leader commits asynchronously:

  get_store()->queue_transaction(t, new C_Committed(this));
  
struct C_Committed : public Context {
  Paxos *paxos;
  explicit C_Committed(Paxos *p) : paxos(p) {}
  void finish(int r) {
    assert(r >= 0);
    Mutex::Locker l(paxos->mon->lock);
    paxos->commit_finish();
  }
};

Once commit_finish starts executing, it holds the monitor lock (paxos->mon->lock). The leader cannot be interrupted in the refresh state, because once commit_finish runs it carries the refresh through to completion and returns to the active state before the timer thread can acquire the lock and run.

Case 1 needs no handling: no new proposal is in flight. In case 2 there is uncommitted data and the Leader restarts a propose. How is that done?

Note: the annotated code below considers only case 2 of the Peon-down scenario, i.e. the Leader has issued begin, is waiting for OP_ACCEPT, and may have received some OP_ACCEPTs.

void Paxos::collect(version_t oldpn)
{
  // we're recoverying, it seems!
  state = STATE_RECOVERING;
  assert(mon->is_leader());

  /* uncommitted_v, uncommitted_pn and uncommitted_value form a triple;
   * collect also gathers the peons' data, so initialize them here */
  uncommitted_v = 0;
  uncommitted_pn = 0;
  uncommitted_value.clear();
  peer_first_committed.clear();
  peer_last_committed.clear();

  /* Note: in case 2 the Leader itself holds uncommitted data, so this block
   * recovers the pending proposal: the previous round's PN into uncommitted_pn,
   * the proposal's Instance ID into uncommitted_v,
   * and the proposal's value into uncommitted_value */
  if (get_store()->exists(get_name(), last_committed+1)) {
    version_t v = get_store()->get(get_name(), "pending_v");
    version_t pn = get_store()->get(get_name(), "pending_pn");
    if (v && pn && v == last_committed + 1) {
      uncommitted_pn = pn;
    } else {
      dout(10) << "WARNING: no pending_pn on disk, using previous accepted_pn " << accepted_pn
	       << " and crossing our fingers" << dendl;
      uncommitted_pn = accepted_pn;
    }
    uncommitted_v = last_committed+1;

    get_store()->get(get_name(), last_committed+1, uncommitted_value);
    assert(uncommitted_value.length());
    dout(10) << "learned uncommitted " << (last_committed+1)
	     << " pn " << uncommitted_pn
	     << " (" << uncommitted_value.length() << " bytes) from myself" 
	     << dendl;

    logger->inc(l_paxos_collect_uncommitted);
  }

  /* regenerate a new PN; it is guaranteed to exceed any PN accepted so far */
  accepted_pn = get_new_proposal_number(MAX(accepted_pn, oldpn));
  accepted_pn_from = last_committed;
  num_last = 1;
  dout(10) << "collect with pn " << accepted_pn << dendl;

  // send collect
  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
    if (*p == mon->rank) continue;
    
    /* send OP_COLLECT to the other nodes to gather their state and restore cluster consistency */
    
    MMonPaxos *collect = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_COLLECT,
				       ceph_clock_now(g_ceph_context));
    collect->last_committed = last_committed;
    collect->first_committed = first_committed;
    collect->pn = accepted_pn;
    mon->messenger->send_message(collect, mon->monmap->get_inst(*p));
  }

  // set timeout event
  collect_timeout_event = new C_MonContext(mon, [this](int r) {
	if (r == -ECANCELED)
	  return;
	collect_timeout();
    });
  mon->timer.add_event_after(g_conf->mon_accept_timeout_factor *
			     g_conf->mon_lease,
			     collect_timeout_event);
}


Note that in this scenario every Peon's accepted_pn is necessarily smaller than the newly generated PN carried in the OP_COLLECT message body. Now look at how the Peons react:

void Paxos::handle_collect(MonOpRequestRef op)
{
  op->mark_paxos_event("handle_collect");

  MMonPaxos *collect = static_cast<MMonPaxos*>(op->get_req());
  dout(10) << "handle_collect " << *collect << dendl;

  assert(mon->is_peon()); // mon epoch filter should catch strays

  // we're recoverying, it seems!
  state = STATE_RECOVERING;

  /* this cannot happen in the scenario we restricted ourselves to */
  if (collect->first_committed > last_committed+1) {
    dout(2) << __func__
            << " leader's lowest version is too high for our last committed"
            << " (theirs: " << collect->first_committed
            << "; ours: " << last_committed << ") -- bootstrap!" << dendl;
    op->mark_paxos_event("need to bootstrap");
    mon->bootstrap();
    return;
  }

  /* reply with OP_LAST, carrying our own last_committed and first_committed */
  MMonPaxos *last = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_LAST,
				  ceph_clock_now(g_ceph_context));
  last->last_committed = last_committed;
  last->first_committed = first_committed;
  
  version_t previous_pn = accepted_pn;

  /* collect->pn was freshly generated after the election by the old leader, so it must exceed the Peon's accepted_pn */
  if (collect->pn > accepted_pn) {
    // ok, accept it
    accepted_pn = collect->pn;
    accepted_pn_from = collect->pn_from;
    dout(10) << "accepting pn " << accepted_pn << " from " 
	     << accepted_pn_from << dendl;
  
    MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
    t->put(get_name(), "accepted_pn", accepted_pn);

    dout(30) << __func__ << " transaction dump:\n";
    JSONFormatter f(true);
    t->dump(&f);
    f.flush(*_dout);
    *_dout << dendl;

    logger->inc(l_paxos_collect);
    logger->inc(l_paxos_collect_keys, t->get_keys());
    logger->inc(l_paxos_collect_bytes, t->get_bytes());
    utime_t start = ceph_clock_now(NULL);

    get_store()->apply_transaction(t);

    utime_t end = ceph_clock_now(NULL);
    logger->tinc(l_paxos_collect_latency, end - start);
  } else {
    // don't accept!
    dout(10) << "NOT accepting pn " << collect->pn << " from " << collect->pn_from
	     << ", we already accepted " << accepted_pn
	     << " from " << accepted_pn_from << dendl;
  }
  last->pn = accepted_pn;
  last->pn_from = accepted_pn_from;

  // share whatever committed values we have
  if (collect->last_committed < last_committed)
    share_state(last, collect->first_committed, collect->last_committed);

  // do we have an accepted but uncommitted value?
  //  (it'll be at last_committed+1)
  bufferlist bl;
  
  /* if this Peon already replied with OP_ACCEPT earlier, this branch is taken */
  if (collect->last_committed <= last_committed &&
      get_store()->exists(get_name(), last_committed+1)) {
    get_store()->get(get_name(), last_committed+1, bl);
    assert(bl.length() > 0);
    dout(10) << " sharing our accepted but uncommitted value for " 
	     << last_committed+1 << " (" << bl.length() << " bytes)" << dendl;
    last->values[last_committed+1] = bl;

    version_t v = get_store()->get(get_name(), "pending_v");
    version_t pn = get_store()->get(get_name(), "pending_pn");
    if (v && pn && v == last_committed + 1) {
      last->uncommitted_pn = pn;
    } else {
      // previously we didn't record which pn a value was accepted
      // under!  use the pn value we just had...  :(
      dout(10) << "WARNING: no pending_pn on disk, using previous accepted_pn " << previous_pn
	       << " and crossing our fingers" << dendl;
      last->uncommitted_pn = previous_pn;
    }

    logger->inc(l_paxos_collect_uncommitted);
  }

  // send reply
  collect->get_connection()->send_message(last);
}

Take a 196/197/198 cluster as an example. 196 is, unsurprisingly, the monitor leader. If we now stop the mon on 197, we observe:


Node 196:
-------
2017-10-04 21:15:26.559490 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442) begin for 1737443 25958 bytes
2017-10-04 21:15:26.559516 7f36cefe9700 30 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442) begin transaction dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "paxos",
          "key": "1737443",
          "length": 25958},
        { "op_num": 1,
          "type": "PUT",
          "prefix": "paxos",
          "key": "pending_v",
          "length": 8},
        { "op_num": 2,
          "type": "PUT",
          "prefix": "paxos",
          "key": "pending_pn",
          "length": 8}],
  "num_keys": 3,
  "num_bytes": 26015}
bl dump:
bl dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "logm",
          "key": "full_432632",
          "length": 15884},
        { "op_num": 1,
          "type": "PUT",
          "prefix": "logm",
          "key": "full_latest",
          "length": 8},
        { "op_num": 2,
          "type": "PUT",
          "prefix": "logm",
          "key": "432633",
          "length": 9882},
        { "op_num": 3,
          "type": "PUT",
          "prefix": "logm",
          "key": "last_committed",
          "length": 8}],
  "num_keys": 4,
  "num_bytes": 25840}
2017-10-04 21:15:26.580022 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442)  sending begin to mon.1
2017-10-04 21:15:26.580110 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442)  sending begin to mon.2




2017-10-04 21:15:26.594622 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442) handle_accept paxos(accept lc 1737442 fc 0 pn 1100 opn 0) v3
2017-10-04 21:15:26.594631 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442)  now 0,2 have accepted


2017-10-04 21:15:40.996887 7f36cefe9700 10 mon.oquew@0(electing) e3 win_election epoch 26 quorum 0,2 features 211106232532991
2017-10-04 21:15:40.996955 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) leader_init -- starting paxos recovery
2017-10-04 21:15:40.997144 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) learned uncommitted 1737443 pn 1100 (25958 bytes) from myself
2017-10-04 21:15:40.997172 7f36cefe9700 30 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) get_new_proposal_number transaction dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "paxos",
          "key": "last_pn",
          "length": 8}],
  "num_keys": 1,
  "num_bytes": 20}
2017-10-04 21:15:41.000424 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) get_new_proposal_number = 1200
2017-10-04 21:15:41.000456 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) collect with pn 1200






Node 198
---------

2017-10-04 21:15:41.042089 7f7c043e3700 10 mon.yvmjl@2(peon).paxos(paxos recovering c 1736921..1737442) handle_collect paxos(collect lc 1737442 fc 1736921 pn 1200 opn 0) v3
2017-10-04 21:15:41.042094 7f7c043e3700 10 mon.yvmjl@2(peon).paxos(paxos recovering c 1736921..1737442) accepting pn 1200 from 0
2017-10-04 21:15:41.042101 7f7c043e3700 30 mon.yvmjl@2(peon).paxos(paxos recovering c 1736921..1737442) handle_collect transaction dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "paxos",
          "key": "accepted_pn",
          "length": 8}],
  "num_keys": 1,
  "num_bytes": 24}
2017-10-04 21:15:41.046361 7f7c043e3700 10 mon.yvmjl@2(peon).paxos(paxos recovering c 1736921..1737442)  sharing our accepted but uncommitted value for 1737443 (25958 bytes)


Note: proposal 1737443 had been issued and had received two OP_ACCEPTs, from ranks 0 and 2; 0 is the monitor leader itself, and 2 is the OP_ACCEPT sent by 198. Rank 1 is 197's monitor, which is down, so its OP_ACCEPT never arrives. When 196 is re-elected Leader, it sends OP_COLLECT to 198, and 198 accepts the new PN 1200 (previously 1100); but in its OP_LAST reply it tells the monitor leader that it once received proposal 1737443, which it has accepted but not yet committed.

So what does the monitor leader do when it receives that message?

  if (last->pn > accepted_pn) {
    // no, try again.
    dout(10) << " they had a higher pn than us, picking a new one." << dendl;

    // cancel timeout event
    mon->timer.cancel_event(collect_timeout_event);
    collect_timeout_event = 0;

    collect(last->pn);
  } else if (last->pn == accepted_pn) {
  
    /* in the scenario we constructed, this branch is taken */
    // yes, they accepted our pn.  great.
    num_last++;
    dout(10) << " they accepted our pn, we now have " 
	     << num_last << " peons" << dendl;

    
    /* record the received uncommitted triple */
    if (last->uncommitted_pn) {
      if (last->uncommitted_pn >= uncommitted_pn &&
	       last->last_committed >= last_committed &&
	       last->last_committed + 1 >= uncommitted_v) {
	         uncommitted_v = last->last_committed+1;
	         uncommitted_pn = last->uncommitted_pn;
	         uncommitted_value = last->values[uncommitted_v];
	         dout(10) << "we learned an uncommitted value for " << uncommitted_v
	                  << " pn " << uncommitted_pn
	                  << " " << uncommitted_value.length() << " bytes"
	                  << dendl;
      } else {
        dout(10) << "ignoring uncommitted value for " << (last->last_committed+1)
                 << " pn " << last->uncommitted_pn
                 << " " << last->values[last->last_committed+1].length() << " bytes"
                 << dendl;
      }
    }
    
    /* once replies from all Peons have been collected */
    if (num_last == mon->get_quorum().size()) {
      // cancel timeout event
      mon->timer.cancel_event(collect_timeout_event);
      collect_timeout_event = 0;
      peer_first_committed.clear();
      peer_last_committed.clear();

      // almost...

      /* if uncommitted_v equals last_committed+1 */
      if (uncommitted_v == last_committed+1 &&
          uncommitted_value.length()) {
          dout(10) << "that's everyone.  begin on old learned value" << dendl;
          
          /* Note the next two lines: in our case 2 the leader re-issues the unfinished
           * proposal via begin, in state STATE_UPDATING_PREVIOUS, i.e. finishing the previous round */
          state = STATE_UPDATING_PREVIOUS;
          begin(uncommitted_value);
      } else {
      // active!
      dout(10) << "that's everyone.  active!" << dendl;
      extend_lease();
      
      need_refresh = false;
      if (do_refresh()) {
        finish_round();
      }
     }
   }
 } else {
    // no, this is an old message, discard
    dout(10) << "old pn, ignoring" << dendl;
  }

Note: whether or not some Peon had already replied with OP_ACCEPT, the unfinished proposal is re-issued through the begin function.

  • If no OP_ACCEPT was received at all, the Monitor Leader itself recorded the uncommitted triple and need not learn the proposal from any Peon
  • If some OP_ACCEPT was received, that Peon naturally reports the uncommitted triple to the monitor leader in its OP_LAST message

Either way, the monitor leader executes begin from handle_last to finish the previous round's unfinished proposal.

2017-10-04 21:15:41.038753 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) handle_last paxos(last lc 1737442 fc 1736921 pn 1200 opn 1100) v3
2017-10-04 21:15:41.038759 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) store_state nothing to commit
2017-10-04 21:15:41.038824 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442)  they accepted our pn, we now have 2 peons
2017-10-04 21:15:41.038835 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) we learned an uncommitted value for 1737443 pn 1100 25958 bytes
2017-10-04 21:15:41.038843 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) that's everyone.  begin on old learned value
2017-10-04 21:15:41.038848 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos updating-previous c 1736921..1737442) begin for 1737443 25958 bytes
2017-10-04 21:15:41.038868 7f36ce7e8700 30 mon.oquew@0(leader).paxos(paxos updating-previous c 1736921..1737442) begin transaction dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "paxos",
          "key": "1737443",
          "length": 25958},
        { "op_num": 1,
          "type": "PUT",
          "prefix": "paxos",
          "key": "pending_v",
          "length": 8},
        { "op_num": 2,
          "type": "PUT",
          "prefix": "paxos",
          "key": "pending_pn",
          "length": 8}],
  "num_keys": 3,
  "num_bytes": 26015}
bl dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "logm",
          "key": "full_432632",
          "length": 15884},
        { "op_num": 1,
          "type": "PUT",
          "prefix": "logm",
          "key": "full_latest",
          "length": 8},
        { "op_num": 2,
          "type": "PUT",
          "prefix": "logm",
          "key": "432633",
          "length": 9882},
        { "op_num": 3,
          "type": "PUT",
          "prefix": "logm",
          "key": "last_committed",
          "length": 8}],
  "num_keys": 4,
  "num_bytes": 25840}

2017-10-04 21:15:41.057345 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos updating-previous c 1736921..1737442)  sending begin to mon.2

That long detour covers case 2 of the Peon-down scenario. Now consider cases 3 and 4.

3. the leader is in the writing state: all accept messages have arrived, commit_start has begun, and the transaction is queued for execution
4. the leader is in the writing state, the write has completed and the transaction has taken effect, but the callback (commit_finish) has not yet run (it runs only after acquiring the monitor lock)

Note that in cases 3 and 4, the data already in the writing state is committed before the re-election starts:

void Monitor::wait_for_paxos_write()
{
  if (paxos->is_writing() || paxos->is_writing_previous()) {
    dout(10) << __func__ << " flushing pending write" << dendl;
    lock.Unlock();
    store->flush();
    lock.Lock();
    dout(10) << __func__ << " flushed pending write" << dendl;
  }
}

void Monitor::bootstrap()
{
  dout(10) << "bootstrap" << dendl;
  wait_for_paxos_write();
  ...
  
}

void Monitor::start_election()
{
  dout(10) << "start_election" << dendl;
  wait_for_paxos_write();
  ...
}

In cases 3 and 4, Paxos is in the writing or writing_previous state, so store->flush is executed before the election, ensuring that data already in the writing state finishes committing; only then does the election begin.

Whether or not each Peon has committed, the Leader itself has completed the commit; in the handle_last phase:

 for (map<int,version_t>::iterator p = peer_last_committed.begin();
       p != peer_last_committed.end();
       ++p) {
    if (p->second + 1 < first_committed && first_committed > 1) {
      dout(5) << __func__
	      << " peon " << p->first
	      << " last_committed (" << p->second
	      << ") is too low for our first_committed (" << first_committed
	      << ") -- bootstrap!" << dendl;
      op->mark_paxos_event("need to bootstrap");
      mon->bootstrap();
      return;
    }
    
    /* in cases 3 and 4, the mon leader shares the parts a Peon is missing so the Peon can commit them */
    if (p->second < last_committed) {
      // share committed values
      dout(10) << " sending commit to mon." << p->first << dendl;
      MMonPaxos *commit = new MMonPaxos(mon->get_epoch(),
					MMonPaxos::OP_COMMIT,
					ceph_clock_now(g_ceph_context));
      share_state(commit, peer_first_committed[p->first], p->second);
      mon->messenger->send_message(commit, mon->monmap->get_inst(p->first));
    }
  }

Here is the log for cases 3 and 4. 197 is the down Peon; 198 is a healthy Peon that had not yet committed. The Leader notices that 198 is missing commit 1743405 and, via share_state, packs the missing part into a message sent to 198, i.e. mon.2.



2017-10-04 22:05:44.680463 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405) handle_last paxos(last lc 1743404 fc 1742694 pn 1300 opn 0) v3
2017-10-04 22:05:44.680481 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405) store_state nothing to commit

/*197*/
2017-10-04 22:05:44.680556 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405)  sending commit to mon.2
2017-10-04 22:05:44.680568 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405) share_state peer has fc 1742694 lc 1743404
2017-10-04 22:05:44.680639 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405)  sharing 1743405 (133 bytes)
2017-10-04 22:05:44.680730 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405)  they accepted our pn, we now have 2 peons

Peon up

The four cases above describe what can happen after a peon goes down. What happens when the downed peon comes back up?

Because the peon was down for a long time, much of its state is stale, so it goes through a sync phase at startup. This synchronization is not done through collect → handle_collect → handle_last; instead, at startup the peon calls sync_start to initiate a data sync and enters the STATE_SYNCHRONIZING state. We will not expand on that here.

After the data sync finishes, sync_finish is called, which bootstraps again and triggers a new election — which, of course, the original leader wins.

Leader Down

The leader can die at any point in any Paxos function. In the ensuing election, the peon with the smallest rank becomes the new leader. As before, let's consider how the cluster converges back to consistency after the leader goes down and after it comes back up.

Down

After the lease times out, the peons call a new election. A peon may have been interrupted in the active or the updating state, and different peons may be in different states — some active, some updating:

  • If the leader died in the active state, no special handling is needed.
  • If the leader died in the updating state: when no peon has accepted yet, nothing special is needed; when some peon has accepted, the new leader either accepted the value itself or learns it from another peon, and re-proposes it.
  • If the leader died in the writing state, every peon has already accepted, so the new leader re-proposes the accepted value (the dead leader may or may not have finished its own write).
  • If the leader died in the refresh state, the dead leader had already written successfully. If some peon received the commit message, the new leader learns the new commit in the collect phase; if no peon received it, the value is re-proposed.

In case 2, if some peons have already accepted, each such peon sends its uncommitted triple to the new leader in handle_collect — or the new leader itself may have accepted and can recover the uncommitted triple from its own store — and begin is then called to re-propose.

    /* Record the uncommitted triple we received */
    if (last->uncommitted_pn) {
      if (last->uncommitted_pn >= uncommitted_pn &&
	       last->last_committed >= last_committed &&
	       last->last_committed + 1 >= uncommitted_v) {
	         uncommitted_v = last->last_committed+1;
	         uncommitted_pn = last->uncommitted_pn;
	         uncommitted_value = last->values[uncommitted_v];
	         dout(10) << "we learned an uncommitted value for " << uncommitted_v
	                  << " pn " << uncommitted_pn
	                  << " " << uncommitted_value.length() << " bytes"
	                  << dendl;
      } else {
        dout(10) << "ignoring uncommitted value for " << (last->last_committed+1)
                 << " pn " << last->uncommitted_pn
                 << " " << last->values[last->last_committed+1].length() << " bytes"
                 << dendl;
      }
    }
    
    /* Once we have collected replies from all peons */
    if (num_last == mon->get_quorum().size()) {
      // cancel timeout event
      mon->timer.cancel_event(collect_timeout_event);
      collect_timeout_event = 0;
      peer_first_committed.clear();
      peer_last_committed.clear();

      // almost...

      /* If we find uncommitted_v equals last_committed+1 */
      if (uncommitted_v == last_committed+1 &&
          uncommitted_value.length()) {
          dout(10) << "that's everyone.  begin on old learned value" << dendl;
          
          /* Note the next two lines: in scenario 2 above, the leader re-issues (begin)
           * the unfinished proposal to make sure it completes, but in state
           * STATE_UPDATING_PREVIOUS, i.e. finishing the previous round */
          state = STATE_UPDATING_PREVIOUS;
          begin(uncommitted_value);
      }
      
      ....
      

In case 3, just as in case 2, the value is re-proposed via the following code:

      if (uncommitted_v == last_committed+1 &&
          uncommitted_value.length()) {
          dout(10) << "that's everyone.  begin on old learned value" << dendl;
          state = STATE_UPDATING_PREVIOUS;
          begin(uncommitted_value);
      }
      

Case 4 is slightly more involved, because we cannot be sure whether any peon committed. If no peon did, it is handled like cases 2 and 3: re-propose. But if a commit did happen, the new leader learns about it from some peon via collect, and shares the parts the other peons are missing via share_state.

Up

After the old leader comes back up, it may already sync some data during the probing phase. Once it is elected leader again, the collect phase synchronizes the few differing versions; likewise, if a peon holds uncommitted data, it is sent to the new leader, which re-proposes it.

The only subtlety is uncommitted data that existed when the leader went down. From the cases above, if any peon accepted it, the data gets re-proposed. After the old leader comes back up, its stale pending data is discarded based on pending_v, since its version is lower. And if the old leader had committed, the peons must also commit eventually, so no inconsistency can arise.

Since the previous scenario was already analyzed in detail at the code level, we will not walk through the leader-down case exhaustively.

Epilogue

Note that this article draws heavily on the first reference; I largely followed its roadmap. I have no intention of plagiarizing — the author writes at a very high level and often leaves things unsaid, which beginners may struggle with. This article expands on those points, connecting them to the code and to log output, to help beginners understand.

Reference 2 is also an excellent article, but without analyzing the possible failures, one tends to read Phase 1 as a mere walkthrough — knowing what the code does without knowing why.

References

  1. Ceph Monitor Paxos
  2. Annotated Ceph Paxos source code - Phase 1
]]>
The Paxos Algorithm in ceph-mon 2017-09-24T17:20:40+00:00 Bean Li http://bean-li.github.io/ceph-paxos Preface

Paxos is arguably the most famous algorithm in distributed systems. As the old jianghu saying goes, "he who has not met Chen Jinnan can hardly call himself a hero" — Paxos holds at least that stature in the distributed world.

I plan three articles in this Paxos series. I will not start by presenting the theory of Paxos, which would be far too dry. When we learned mathematics as children we started from 1+1, then moved to variables, linear equations in one and two unknowns, and finally to determinants, matrices, and linear algebra. Logically, why not learn linear algebra directly? Because it is not intuitive, and it is not how humans come to understand things.

First, why does ceph-mon need Paxos at all? A simple example: if two clients both need to write to the same file on CephFS, they both need the OSDMap, since the OSDMap plus the file name determines which OSDs to write to. Crucially, client A and client B must see the same OSDMap; otherwise the data becomes inconsistent.

So for distributed storage, consensus is a hard requirement. And for distributed consensus, the field is practically synonymous with Paxos. As the saying goes:

There is only one consensus protocol in the world, and that is Paxos.

All other protocols are either simplifications of Paxos or are wrong.

This first article describes what a normal, successful proposal looks like.

Paxos Rules

Roles

  • Proposer: puts forward proposals.
  • Proposal: a motion not yet approved, put forward by a proposer. A proposal is a pair of a number and a value; the number matters because it makes proposals distinguishable.
  • Acceptor: receives proposals; think of it as an independent judge with the power to accept or reject what it receives, according to certain rules.
  • Chosen: a proposal is chosen (approved) once more than half of the acceptors have accepted it.
  • Learner: an observer that needs to know which proposals were chosen. A learner only ever sees chosen proposals.

Algorithm

We will not derive Paxos here, nor prove its correctness — only state how it works:

  1. P1: an acceptor must accept the first proposal it receives.

    P1a: an acceptor accepts a proposal numbered n if and only if it has not responded to a prepare request with a number greater than n.

  2. P2: if a proposal with value v is chosen, then every higher-numbered proposal that is chosen must also have value v.

  P2c: if a proposal numbered n has value v, then there is a majority such that either none of them has accepted any proposal numbered less than n, or v is the value of the highest-numbered proposal below n among all the proposals they have accepted.

The Paxos Implementation in Ceph

This article covers only the normal flow; failure recovery is the subject of the next one. While studying the material below, keep two things in mind:

  • how the code implements the Paxos algorithm of the previous section;
  • what preparation the normal-path code does that looks useless, but is actually used for recovery when a failure occurs.

When Is a Proposal Initiated?

Paxos is always triggered by initiating a proposal. In Ceph, proposals are initiated in roughly three places:

  • ConfigKeyService, when modifying or deleting key/value pairs.

    Ceph provides a distributed key-value service that treats ceph-mon as a black-box k/v store. Users can store k/v pairs with commands such as:

    ceph config-key put key value 
    ceph config-key get key
    ceph config-key del key
    

    The relevant interfaces are ConfigKeyService::store_put and store_delete:

      void ConfigKeyService::store_put(string key, bufferlist &bl, Context *cb)
      {
        bufferlist proposal_bl;
        MonitorDBStore::TransactionRef t = paxos->get_pending_transaction();
        t->put(STORE_PREFIX, key, bl);
        if (cb)
          paxos->queue_pending_finisher(cb);
        paxos->trigger_propose();
      }
    	
      void ConfigKeyService::store_delete(string key, Context *cb)
      {
        bufferlist proposal_bl;
        MonitorDBStore::TransactionRef t = paxos->get_pending_transaction();
        t->erase(STORE_PREFIX, key);
        if (cb)
          paxos->queue_pending_finisher(cb);
        paxos->trigger_propose();
      }
    
  • When Paxos and PaxosService trim data to save storage space; see Paxos::trim and PaxosService::maybe_trim.

    Note that PaxosService wraps Paxos with interfaces for building services on top of it; early versions had six major PaxosServices, as shown in the figure below.

    To save storage space, these services also call maybe_trim to delete data that is too old and stale:

     	void Monitor::tick()
      {
        // ok go.
        dout(11) << "tick" << dendl;
    	  
        for (vector<PaxosService*>::iterator p = paxos_service.begin(); p != paxos_service.end(); ++p) {
          (*p)->tick();
          (*p)->maybe_trim();
        }
       ...  
    }
    

    Accordingly, every PaxosService must define its own maybe_trim function.

  • When any of the PaxosService services needs to update values; see PaxosService::propose_pending.

The occasions that require a proposal are mainly those above. Before a proposal is made, the operations are packaged into a transaction and stored in the Paxos member pending_proposal:

  /**
   * Pending proposal transaction
   *
   * This is the transaction that is under construction and pending
   * proposal.  We will add operations to it until we decide it is
   * time to start a paxos round.
   */
  MonitorDBStore::TransactionRef pending_proposal;
  
  /**
   * Finishers for pending transaction
   *
   * These are waiting for updates in the pending proposal/transaction
   * to be committed.
   */
  list<Context*> pending_finishers;

  /**
   * Finishers for committing transaction
   *
   * When the pending_proposal is submitted, pending_finishers move to
   * this list.  When it commits, these finishers are notified.
   */
  list<Context*> committing_finishers;

The transaction pending_proposal is encoded into a bufferlist, which becomes the value of this round of consensus; it is stored in the paxos keyspace with the version number as the key and the bufferlist's binary data as the value. At commit time the binary data is decoded back into a transaction and its operations are executed, so the agreed value is reflected in each service and the relevant maps are updated.

In other words, the transaction's content, encoded as a bufferlist, is the value, and the version number is the key of the Paxos proposal.

Note that much of the logic registers callbacks to run after the full proposal cycle completes. These are first queued on the pending_finishers list; once the Paxos wheels start turning, they are moved to the committing_finishers list.

bool Paxos::trigger_propose()
{
  if (is_active()) {
    dout(10) << __func__ << " active, proposing now" << dendl;
    propose_pending();
    return true;
  } else {
    dout(10) << __func__ << " not active, will propose later" << dendl;
    return false;
  }
}

void Paxos::propose_pending()
{
  assert(is_active());
  assert(pending_proposal);

  cancel_events();

  bufferlist bl;
  pending_proposal->encode(bl);

  dout(10) << __func__ << " " << (last_committed + 1)
	   << " " << bl.length() << " bytes" << dendl;
  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  pending_proposal->dump(&f);
  f.flush(*_dout);
  *_dout << dendl;

  /* pending_proposal can be reset now */
  pending_proposal.reset();

  /* processing has started, so move pending_finishers into committing_finishers */
  committing_finishers.swap(pending_finishers);
  
  /* note: switch state to STATE_UPDATING before calling begin */
  state = STATE_UPDATING;
  begin(bl);
}

With these basics covered, we can look at the overall flow of a Paxos round. The starting point of the whole flow is void Paxos::begin(bufferlist& v). Note that only the mon leader calls this function; a peon never calls begin to put forward a proposal.

Of course, the Paxos algorithm itself does not require a single proposer, but Ceph's implementation only lets the mon leader propose, which simplifies the code.

The Normal Paxos Workflow

The overall flow is shown in the figure below:

begin

void Paxos::begin(bufferlist& v)
{
  dout(10) << "begin for " << last_committed+1 << " " 
	   << v.length() << " bytes"
	   << dendl;

  /* only the mon leader may call begin and put forward a proposal */
  assert(mon->is_leader());
  assert(is_updating() || is_updating_previous());

  // we must already have a majority for this to work.
  assert(mon->get_quorum().size() == 1 ||
	 num_last > (unsigned)mon->monmap->size()/2);
  
  // and no value, yet.
  assert(new_value.length() == 0);

  /* the proposal was just initiated; no acceptor has accepted it yet */
  accepted.clear();
  /* insert the mon leader itself into the accepted set: it will not reject its own proposal */
  accepted.insert(mon->rank);
  
  /* set new_value to v, the bufferlist encoded from the transaction */
  new_value = v;

  /* the very first commit; only happens the first time a proposal is ever made */
  if (last_committed == 0) {
    MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
    // initial base case; set first_committed too
    t->put(get_name(), "first_committed", 1);
    decode_append_transaction(t, new_value);

    bufferlist tx_bl;
    t->encode(tx_bl);

    new_value = tx_bl;
  }

  // store the proposed value in the store. IF it is accepted, we will then
  // have to decode it into a transaction and apply it.
  
  /* note: the three put operations below are a key part of begin; first the transaction's encoded bufferlist is stored, keyed by last_committed+1 */
  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  t->put(get_name(), last_committed+1, new_value);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", last_committed + 1);
  t->put(get_name(), "pending_pn", accepted_pn);

  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  t->dump(&f);
  f.flush(*_dout);
  MonitorDBStore::TransactionRef debug_tx(new MonitorDBStore::Transaction);
  bufferlist::iterator new_value_it = new_value.begin();
  debug_tx->decode(new_value_it);
  debug_tx->dump(&f);
  *_dout << "\nbl dump:\n";
  f.flush(*_dout);
  *_dout << dendl;

  logger->inc(l_paxos_begin);
  logger->inc(l_paxos_begin_keys, t->get_keys());
  logger->inc(l_paxos_begin_bytes, t->get_bytes());
  utime_t start = ceph_clock_now(NULL);

  get_store()->apply_transaction(t);

  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_begin_latency, end - start);

  assert(g_conf->paxos_kill_at != 3);

  if (mon->get_quorum().size() == 1) {
    // we're alone, take it easy
    commit_start();
    return;
  }

  // ask others to accept it too!
  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
    if (*p == mon->rank) continue;
    
    dout(10) << " sending begin to mon." << *p << dendl;
    MMonPaxos *begin = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_BEGIN,
				     ceph_clock_now(g_ceph_context));
    begin->values[last_committed+1] = new_value;
    begin->last_committed = last_committed;
    begin->pn = accepted_pn;
    
    mon->messenger->send_message(begin, mon->monmap->get_inst(*p));
  }

  /* register the accept timeout */
  accept_timeout_event = new C_MonContext(mon, [this](int r) {
      if (r == -ECANCELED)
	return;
      accept_timeout();
    });
  mon->timer.add_event_after(g_conf->mon_accept_timeout_factor *
			     g_conf->mon_lease,
			     accept_timeout_event);
}

The key part of begin is the following code:

  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  t->put(get_name(), last_committed+1, new_value);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", last_committed + 1);
  t->put(get_name(), "pending_pn", accepted_pn);
  
  ...
  
  utime_t start = ceph_clock_now(NULL);

  get_store()->apply_transaction(t);

  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_begin_latency, end - start);

First, the bufferlist encoding the transaction to execute is saved — not actually executed, merely recorded — under the key last_committed+1. Once more than half of the acceptors approve the proposal, the transaction to execute can be fetched from leveldb or rocksdb by last_committed+1.

We will use the following values to walk through the whole flow.

first_committed = 1
last_committed = 10
accepted_pn = 100

This proposal adds the following entries to the mon leader's MonitorDBStore:

# entries added by this proposal
v11=new_value; # 11 is last_committed+1; the real key has a prefix, abbreviated here as v; new_value is the encoded bufferlist of the final transaction
pending_v=11
pending_pn=100

Note that after get_store()->apply_transaction(t) runs, the three values above have been written into the mon leader's DB.

The next step is to send the OP_BEGIN message to the peons, asking the acceptors to review the proposal.

  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
       
    /* the leader need not send to itself */
    if (*p == mon->rank) continue;
    
    dout(10) << " sending begin to mon." << *p << dendl;
    MMonPaxos *begin = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_BEGIN,
				     ceph_clock_now(g_ceph_context));
				     
	 /* send new_value keyed by last_committed+1 to the peon */
    begin->values[last_committed+1] = new_value;
   
    /* these two values help the peon decide whether to accept the proposal */
    begin->last_committed = last_committed;
    begin->pn = accepted_pn;
    
    mon->messenger->send_message(begin, mon->monmap->get_inst(*p));
  }

begin has a special case: when the whole cluster has only one mon, it can skip collecting acceptance from the other acceptors and go straight to the commit phase:

  /* we are the only one; no need to ask for opinions */
  if (mon->get_quorum().size() == 1) {
    // we're alone, take it easy
    commit_start();
    return;
  }

handle_begin

When a peon receives the OP_BEGIN message, it starts processing.

A peon only handles proposals whose pn >= accepted_pn; otherwise it rejects the proposal:

  // can we accept this?
  if (begin->pn < accepted_pn) {
    dout(10) << " we accepted a higher pn " << accepted_pn << ", ignoring" << dendl;
    op->mark_paxos_event("have higher pn, ignore");
    return;
  }
  
  assert(begin->pn == accepted_pn);
  assert(begin->last_committed == last_committed);
  
  assert(g_conf->paxos_kill_at != 4);

  logger->inc(l_paxos_begin);

  /* switch state to STATE_UPDATING */
  state = STATE_UPDATING;
  lease_expire = utime_t();  // cancel lease

For the peon:

first_committed = 1
last_committed =10 
accepted_pn = 100

v11=new_value
pending_v=11
pending_pn=100

When the peon decides to accept the proposal, it temporarily saves new_value into its DB (leveldb or rocksdb), doing exactly what the mon leader did:

  // yes.
  version_t v = last_committed+1;
  dout(10) << "accepting value for " << v << " pn " << accepted_pn << dendl;
  // store the accepted value onto our store. We will have to decode it and
  // apply its transaction once we receive permission to commit.
  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  t->put(get_name(), v, begin->values[v]);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", v);
  t->put(get_name(), "pending_pn", accepted_pn);
  
  ....
  
  logger->inc(l_paxos_begin_bytes, t->get_bytes());
  utime_t start = ceph_clock_now(NULL);

  get_store()->apply_transaction(t);

  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_begin_latency, end - start);

It can then tell the mon leader it accepts, by sending an OP_ACCEPT message:

  // reply
  MMonPaxos *accept = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_ACCEPT,
				    ceph_clock_now(g_ceph_context));
  accept->pn = accepted_pn;
  accept->last_committed = last_committed;
  begin->get_connection()->send_message(accept);

handle_accept

Ever since sending OP_BEGIN to all the peons, the mon leader has been eagerly waiting for replies.

// leader
void Paxos::handle_accept(MonOpRequestRef op)
{
  op->mark_paxos_event("handle_accept");
  MMonPaxos *accept = static_cast<MMonPaxos*>(op->get_req());
  dout(10) << "handle_accept " << *accept << dendl;
  int from = accept->get_source().num();

  if (accept->pn != accepted_pn) {
    // we accepted a higher pn, from some other leader
    dout(10) << " we accepted a higher pn " << accepted_pn << ", ignoring" << dendl;
    op->mark_paxos_event("have higher pn, ignore");
    return;
  }
  if (last_committed > 0 &&
      accept->last_committed < last_committed-1) {
    dout(10) << " this is from an old round, ignoring" << dendl;
    op->mark_paxos_event("old round, ignore");
    return;
  }
  assert(accept->last_committed == last_committed ||   // not committed
	 accept->last_committed == last_committed-1);  // committed

  assert(is_updating() || is_updating_previous());
  assert(accepted.count(from) == 0);
  accepted.insert(from);
  dout(10) << " now " << accepted << " have accepted" << dendl;

  assert(g_conf->paxos_kill_at != 6);

  // only commit (and expose committed state) when we get *all* quorum
  // members to accept.  otherwise, they may still be sharing the now
  // stale state.
  // FIXME: we can improve this with an additional lease revocation message
  // that doesn't block for the persist.
  

  if (accepted == mon->get_quorum()) {
    // yay, commit!
    dout(10) << " got majority, committing, done with update" << dendl;
    op->mark_paxos_event("commit_start");
    commit_start();
  }
}

Some checks come first, such as whether accept->pn equals accepted_pn. If they pass, the peon is inserted into accepted, recording that the message arrived and that this peon has agreed to the proposal.

Note that, unlike textbook Paxos, the mon leader moves to the next phase only after receiving OP_ACCEPT from all peons, not just a majority.

  /* only after OP_ACCEPTs from all peons do we enter the commit phase */
  if (accepted == mon->get_quorum()) {
    // yay, commit!
    dout(10) << " got majority, committing, done with update" << dendl;
    op->mark_paxos_event("commit_start");
    commit_start();
  }

To guard against failing to collect all OP_ACCEPT messages in time, the leader registered a timeout event in begin:

  // set timeout event
  accept_timeout_event = new C_MonContext(mon, [this](int r) {
      if (r == -ECANCELED)
	return;
      accept_timeout();
    });
  mon->timer.add_event_after(g_conf->mon_accept_timeout_factor *
			     g_conf->mon_lease,
			     accept_timeout_event);
			     
OPTION(mon_lease, OPT_FLOAT, 5)       // lease interval
OPTION(mon_accept_timeout_factor, OPT_FLOAT, 2.0)    // on leader, if paxos update isn't accepted

That is, if the mon leader does not receive all OP_ACCEPTs within 10 seconds, it calls accept_timeout, which in turn calls mon->bootstrap.

void Paxos::accept_timeout()
{
  dout(1) << "accept timeout, calling fresh election" << dendl;
  accept_timeout_event = 0;
  assert(mon->is_leader());
  assert(is_updating() || is_updating_previous() || is_writing() ||
	 is_writing_previous());
  logger->inc(l_paxos_accept_timeout);
  mon->bootstrap();
}

commit_start

When the mon leader calls commit_start, we are in the second phase. Much like two-phase commit, the proposal has been approved by all peons, so the real transaction can be boldly submitted and the proposal take effect.

void Paxos::commit_start()
{
  dout(10) << __func__ << " " << (last_committed+1) << dendl;

  assert(g_conf->paxos_kill_at != 7);

  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);

  // commit locally
  /* increment last_committed */
  t->put(get_name(), "last_committed", last_committed + 1);

  // decode the value and apply its transaction to the store.
  // this value can now be read from last_committed.
  
  /* the transaction's encoded bufferlist was stored in new_value; decode it and append it to the transaction */
  decode_append_transaction(t, new_value);

  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  t->dump(&f);
  f.flush(*_dout);
  *_dout << dendl;

  logger->inc(l_paxos_commit);
  logger->inc(l_paxos_commit_keys, t->get_keys());
  logger->inc(l_paxos_commit_bytes, t->get_bytes());
  commit_start_stamp = ceph_clock_now(NULL);

  /* apply the transaction; note this call is asynchronous */
  get_store()->queue_transaction(t, new C_Committed(this));

  if (is_updating_previous())
    state = STATE_WRITING_PREVIOUS;
  else if (is_updating())
    state = STATE_WRITING;
  else
    assert(0);

  if (mon->get_quorum().size() > 1) {
    // cancel timeout event
    mon->timer.cancel_event(accept_timeout_event);
    accept_timeout_event = 0;
  }
}

The transaction here is handled asynchronously, via MonitorDBStore's queue_transaction function. When the transaction completes, the registered callback is invoked.

  void queue_transaction(MonitorDBStore::TransactionRef t,
			 Context *oncommit) {
    io_work.queue(new C_DoTransaction(this, t, oncommit));
  }

Note that once the transaction is queued, the state switches from UPDATING to STATE_WRITING.

The callback is defined as:

struct C_Committed : public Context {
  Paxos *paxos;
  explicit C_Committed(Paxos *p) : paxos(p) {}
  void finish(int r) {
    assert(r >= 0);
    Mutex::Locker l(paxos->mon->lock);
    paxos->commit_finish();
  }
};

Note that when the transaction completes, commit_finish is called.

The commit_finish function

This function does three main things:

  • increment the in-memory last_committed;
  • send the commit message to the peons;
  • set the state to refresh and refresh the PaxosService services.
void Paxos::commit_finish()
{
  dout(20) << __func__ << " " << (last_committed+1) << dendl;
  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_commit_latency, end - commit_start_stamp);

  assert(g_conf->paxos_kill_at != 8);

  // cancel lease - it was for the old value.
  //  (this would only happen if message layer lost the 'begin', but
  //   leader still got a majority and committed with out us.)
  lease_expire = utime_t();  // cancel lease

  /* last_committed can now be incremented */
  last_committed++;
  last_commit_time = ceph_clock_now(NULL);

  // refresh first_committed; this txn may have trimmed.
  first_committed = get_store()->get(get_name(), "first_committed");

  _sanity_check_store();

  /* send OP_COMMIT to all peons */
  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
    if (*p == mon->rank) continue;

    dout(10) << " sending commit to mon." << *p << dendl;
    MMonPaxos *commit = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_COMMIT,
				      ceph_clock_now(g_ceph_context));
    commit->values[last_committed] = new_value;
    commit->pn = accepted_pn;
    commit->last_committed = last_committed;

    mon->messenger->send_message(commit, mon->monmap->get_inst(*p));
  }

  assert(g_conf->paxos_kill_at != 9);

  // get ready for a new round.
  new_value.clear();

  // WRITING -> REFRESH
  // among other things, this lets do_refresh() -> mon->bootstrap() know
  // it doesn't need to flush the store queue
  assert(is_writing() || is_writing_previous());
  state = STATE_REFRESH;

  if (do_refresh()) {
    commit_proposal();
    if (mon->get_quorum().size() > 1) {
      extend_lease();
    }

    finish_contexts(g_ceph_context, waiting_for_commit);

    assert(g_conf->paxos_kill_at != 10);

    finish_round();
  }
}

Note that after the refresh completes, before returning to the active state, the lease protocol starts: a lease message is sent to the peons, which helps the peons become active as well.

handle_commit

  • update last_committed (+1) both in memory and in the backing store;
  • decode new_value into a transaction and execute it through the backing store — a synchronous write here, unlike on the leader;
  • refresh the PaxosService services.
void Paxos::handle_commit(MonOpRequestRef op)
{
  op->mark_paxos_event("handle_commit");
  MMonPaxos *commit = static_cast<MMonPaxos*>(op->get_req());
  dout(10) << "handle_commit on " << commit->last_committed << dendl;

  logger->inc(l_paxos_commit);

  if (!mon->is_peon()) {
    dout(10) << "not a peon, dropping" << dendl;
    assert(0);
    return;
  }

  op->mark_paxos_event("store_state");
  
  /* store_state is the heart of this function; it applies the transaction synchronously */
  store_state(commit);

  if (do_refresh()) {
    finish_contexts(g_ceph_context, waiting_for_commit);
  }
}

handle_lease

When the peon receives the lease-extension message OP_LEASE, it calls handle_lease, and its state transitions from updating to active.

References

  1. Ceph Monitor Paxos
]]>
Using atop to Find What Caused a Brief CPU Spike 2017-09-09T14:43:40+00:00 Bean Li http://bean-li.github.io/CPU-sharp-pulse Preface

atop is an extremely useful tool — I have lost count of how many times it has helped me pin down a problem. This article shows how, when CPU load suddenly spikes for a very short time, to figure out which process may have misbehaved.

Of course there are many possibilities: a sudden burst of new processes, or a single process suddenly burning lots of CPU. So how do we see which process is the suspect, and how much CPU each process consumed in a given time window? This is where atop comes in.

Recently I noticed CPU usage climbing around 22:23 at night, then quickly returning to normal.

Why focus on the 22:23 window? Because a small incident happened there. At 22:21 the 1-minute load average (avg1) was only 5.03; by 22:23, avg1 had shot up to 39.80 — and the spike soon vanished.

What exactly happened around 22:23?

Using atop to Find the Biggest CPU Consumer

Note that atop runs a daemon that records all kinds of system information, so when something goes wrong in some window, atop lets you go back over that period and look for clues about the cause.

With atop -P PRC -b {begin_time} -e {end_time} -r atop.log you can dump the CPU-related information of every process in the given window.

For example, the following command inspects what was recorded yesterday between 22:21 and 22:24:

 atop -P PRC -b 22:21 -e 22:24 -r atop.log.1

The output looks like this:

PRC scalars08 1504880492 2017/09/08 22:21:32 120 1 (init) S 100 16 4 0 120 0 0 13 0 1 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 2 (kthreadd) S 100 0 2 0 120 0 0 12 0 2 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 3 (ksoftirqd/0) S 100 0 9 0 120 0 0 0 0 3 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 5 (kworker/0:0H) S 100 0 0 -20 100 0 0 0 0 5 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 8 (rcu_sched) S 100 0 43 0 120 0 0 15 0 8 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 9 (rcu_bh) S 100 0 0 0 120 0 0 0 0 9 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 10 (migration/0) S 100 0 2 0 0 99 1 0 0 10 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 11 (watchdog/0) S 100 0 0 0 0 99 1 0 0 11 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 12 (watchdog/1) S 100 0 0 0 0 99 1 1 0 12 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 13 (migration/1) S 100 0 2 0 0 99 1 1 0 13 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 14 (ksoftirqd/1) S 100 0 8 0 120 0 0 1 0 14 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 16 (kworker/1:0H) S 100 0 0 -20 100 0 0 1 0 16 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 17 (watchdog/2) S 100 0 0 0 0 99 1 2 0 17 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 18 (migration/2) S 100 0 2 0 0 99 1 2 0 18 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 19 (ksoftirqd/2) S 100 0 8 0 120 0 0 2 0 19 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 21 (kworker/2:0H) S 100 0 0 -20 100 0 0 2 0 21 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 22 (watchdog/3) S 100 0 0 0 0 99 1 3 0 22 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 23 (migration/3) S 100 0 2 0 0 99 1 3 0 23 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 24 (ksoftirqd/3) S 100 0 8 0 120 0 0 3 0 24 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 26 (kworker/3:0H) S 100 0 0 -20 100 0 0 3 0 26 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 27 (watchdog/4) S 100 0 0 0 0 99 1 4 0 27 y
PRC scalars08 1504880492 2017/09/08 22:21:32 120 28 (migration/4) S 100 0 2 0 0 99 1 4 0 28 y

How do we read this pile of numbers? See the atop manual:

    This  line  contains  the  total  cpu time consumed in system mode (`sys') and in user mode (`user'), the
    total number of processes present at this moment (`#proc'), the total number of threads present  at  this
    moment  in  state `running' (`#trun'), `sleeping interruptible' (`#tslpi') and `sleeping uninterruptible'
    (`#tslpu'), the number of zombie processes (`#zombie'), the number of clone system calls (`clones'),  and
    the number of processes that ended during the interval (`#exit', which shows `?' if process accounting is
    not used).
    If the screen-width does not allow all of these counters, only a relevant subset is shown.

Using the first line as an example, the fields are:

  1. PRC: this is atop's PRC output
  2. scalars08: hostname
  3. 1504880492: timestamp
  4. 2017/09/08: date
  5. 22:21:32: time
  6. 120: atop sampling interval, i.e. one sample every 120 seconds
  7. 1: process ID
  8. init: process name
  9. S: process state
  10. 100: ticks per second, i.e. 100 ticks/s
  11. 16: CPU ticks consumed in user mode
  12. 4: CPU ticks consumed in kernel mode
…and so on for the remaining columns.

From these numbers we can easily find the process that consumed the most CPU in the window — very likely the culprit behind the spike:

atop -P PRC -b 22:21 -e 22:24 -r atop.log.1 | awk '{print $8,$7,$19,$3,$4,$5, $11,$12, $11+$12}'   |sort -nk 9

The tail of the output looks like this:

(ceph-mon) 9493 9493 1504880612 2017/09/08 22:23:32 152 77 229
(ceph-osd) 12181 12181 1504880492 2017/09/08 22:21:32 105 129 234
(ceph-osd) 11929 11929 1504880492 2017/09/08 22:21:32 109 126 235
(ceph-osd) 12181 12181 1504880612 2017/09/08 22:23:32 104 134 238
(ceph-mon) 9493 9493 1504880492 2017/09/08 22:21:32 161 79 240
(ceph-osd) 12297 12297 1504880492 2017/09/08 22:21:32 102 154 256
(ceph-osd) 11483 11483 1504880492 2017/09/08 22:21:32 100 157 257
(ceph-osd) 12063 12063 1504880492 2017/09/08 22:21:32 123 136 259
(ceph-osd) 11698 11698 1504880492 2017/09/08 22:21:32 110 160 270
(ceph-osd) 12297 12297 1504880612 2017/09/08 22:23:32 118 160 278
(ceph-osd) 11199 11199 1504880492 2017/09/08 22:21:32 112 174 286
(ceph-osd) 12063 12063 1504880612 2017/09/08 22:23:32 132 155 287
(ceph-osd) 11199 11199 1504880612 2017/09/08 22:23:32 136 157 293
(ceph-osd) 12413 12413 1504880612 2017/09/08 22:23:32 150 149 299
(ceph-osd) 11698 11698 1504880612 2017/09/08 22:23:32 143 159 302
(ceph-osd) 11483 11483 1504880612 2017/09/08 22:23:32 155 148 303
(smbd) 345320 345320 1504880612 2017/09/08 22:23:32 175 179 354
(gmond) 39518 39518 1504880492 2017/09/08 22:21:32 281 126 407
(gmond) 39518 39518 1504880612 2017/09/08 22:23:32 320 123 443
(ezmonitord) 7909 7909 1504880492 2017/09/08 22:21:32 298 153 451
(ezmonitord) 7909 7909 1504880612 2017/09/08 22:23:32 304 150 454
(eziscsid.py) 34772 34772 1504880492 2017/09/08 22:21:32 223 281 504
(eziscsid.py) 34772 34772 1504880612 2017/09/08 22:23:32 236 307 543
(eziscsid.py) 34772 34772 1504880492 2017/09/08 22:21:32 271 308 579
(eziscsid.py) 34772 34772 1504880612 2017/09/08 22:23:32 286 334 620
(gmond) 43505 43505 1504880612 2017/09/08 22:23:32 16252 8171 24423
(python) 8020 8020 1504880612 2017/09/08 22:23:32 43420 16328 59748
(ezfs-agent) 34567 34567 1504880612 2017/09/08 22:23:32 91905 95602 187507

The last line above makes it clear the spike was most likely caused by ezfs-agent. This is just one sample, but the spikes in other windows look similar, as shown below:

(ceph-osd) 10154 10154 1504838821 2017/09/08 10:47:01 554 923 1477
(ceph-osd) 10774 10774 1504838821 2017/09/08 10:47:01 617 860 1477
(ceph-osd) 10654 10654 1504838821 2017/09/08 10:47:01 614 884 1498
(ceph-osd) 9917 9917 1504838821 2017/09/08 10:47:01 602 1019 1621
(ceph-osd) 10274 10274 1504838821 2017/09/08 10:47:01 633 1020 1653
(ceph-osd) 10512 10512 1504838821 2017/09/08 10:47:01 668 993 1661
(ceph-osd) 10392 10392 1504838821 2017/09/08 10:47:01 586 1092 1678
(ceph-osd) 10036 10036 1504838821 2017/09/08 10:47:01 753 1052 1805
(gmond) 55762 55762 1504838821 2017/09/08 10:47:01 1539 631 2170
(ezmonitord) 7880 7880 1504838821 2017/09/08 10:47:01 1497 820 2317
(eziscsid.py) 53596 53596 1504838821 2017/09/08 10:47:01 1088 1353 2441
(eziscsid.py) 53596 53596 1504838821 2017/09/08 10:47:01 1344 1469 2813
(ceph-mon) 8185 8178 1504838821 2017/09/08 10:47:01 3060 290 3350
(smbd) 1002788 1002788 1504838821 2017/09/08 10:47:01 2041 3082 5123
(ceph-mon) 8178 8178 1504838821 2017/09/08 10:47:01 4609 1433 6042
(ezfs-agent) 53295 53295 1504838821 2017/09/08 10:47:01 65233 66407 131640

In the second example, the top consumer used about 130,000 ticks, while the runner-up used only around 6,000 — roughly a 20x gap. We can conclude that ezfs-agent went briefly berserk in that window and then recovered; its momentary frenzy caused the spike.

]]>
The Leader Election Mechanism of ceph-mon 2017-09-02T17:20:40+00:00 Bean Li http://bean-li.github.io/ceph-mon-election Preface

While the ceph monitors run, a leader must be elected: exactly one of the monitors becomes the leader, and the rest become peons.

All subsequent update operations are carried out through Paxos proposals issued by the leader; if a peon receives an update request, it forwards the request to the leader, which executes it on the peon's behalf.

Note that the monitor leader election is not itself a Paxos algorithm. Ceph takes a shortcut: it uses each node's rank in the monmap to create artificial inequality — the node with the smallest rank wins — achieving a simple, fast election.

Initiation

The whole election process can be studied starting from start_election. So when is start_election called?

  • A node calls the bootstrap function at startup, then probes the other monitors for their information (possibly syncing data), and initiates an election once that completes.
  • A node receives an election message MMonElection; if the node is already in the quorum, or its own rank is smaller, it also re-initiates an election.
  • A node receives a quorum enter/exit command.

The third case is a testing tool and can be ignored. Many situations cause a node to call bootstrap — for instance a ceph-mon restarting, or the two scenarios mentioned in the previous article on leases.

When a mon election happens, ceph.log contains lines like the following:

2017-08-27 17:33:43.144946 mon.1 172.1.1.197:6789/0 152076 : cluster [INF] mon.hymwq calling new monitor election
2017-08-27 17:33:43.151244 mon.0 172.1.1.196:6789/0 282354 : cluster [INF] mon.nhgfb calling new monitor election

Note that when the cluster has many nodes, we may see anywhere from one to several such lines at the same moment. Why? That requires a deeper look at the mon election process.

选举过程

整个选举过程,在Elector类中实现。此类之中实现了一个election_epoch:

root@scaler02:~# ceph quorum_status |json_pp
{
   "quorum_leader_name" : "skmif",
   "monmap" : {
      "mons" : [
         {
            "name" : "skmif",
            "addr" : "10.10.1.1:6789/0",
            "rank" : 0
         },
         {
            "name" : "vqdtz",
            "addr" : "10.10.1.2:6789/0",
            "rank" : 1
         },
         {
            "name" : "lzhsg",
            "addr" : "10.10.1.3:6789/0",
            "rank" : 2
         }
      ],
      "created" : "2017-08-29 09:27:11.587301",
      "epoch" : 3,
      "modified" : "2017-08-29 09:28:05.098478",
      "fsid" : "6e74645a-9894-4d7b-9e94-9f4b9596d59f"
   },
   "quorum_names" : [
      "skmif",
      "vqdtz"
   ],
   "quorum" : [
      0,
      1
   ],
   "election_epoch" : 24
}

当这个election_epoch为偶数的时候,表示处于稳定状态,为奇数的时候,表示还在选举过程中,mon leader的宝座还在竞争,鹿死谁手尚未可知。

下面梳理下流程。

void Monitor::start_election()                                                   
{
  /*这条日志我们一般看不到*/
  dout(10) << "start_election" << dendl;
  wait_for_paxos_write();
  _reset();
  state = STATE_ELECTING;

  logger->inc(l_mon_num_elections);
  logger->inc(l_mon_election_call);

  cancel_probe_timeout();

  clog->info() << "mon." << name << " calling new monitor election\n";
  elector.call_election();
}

当我们从ceph.log中看到如下的打印的时候,表示选举已经开始,某一时间段内第一条打印,是发起选举的mon。它率先觉察到异常,调用了bootstrap,最终走到了start_election。

2017-08-27 17:33:43.144946 mon.1 172.1.1.197:6789/0 152076 : cluster [INF] mon.hymwq calling new monitor election

如果我们打印更多debug信息的时候我们可能看到如下的流程:

因为10秒内没有收到延长租约的消息,最终触发了election,由PEON发起,调用了bootstrap
2017-09-02 15:17:13.687831 7fe3d3c5a700  1 mon.vqdtz@1(peon).paxos(paxos updating c 1051189..1051804) lease_timeout -- calling new election
2017-09-02 15:17:13.687849 7fe3d3c5a700 10 mon.vqdtz@1(peon) e3 bootstrap
2017-09-02 15:17:13.687856 7fe3d3c5a700 10 mon.vqdtz@1(peon) e3 sync_reset_requester
2017-09-02 15:17:13.687859 7fe3d3c5a700 10 mon.vqdtz@1(peon) e3 unregister_cluster_logger
2017-09-02 15:17:13.687865 7fe3d3c5a700 10 mon.vqdtz@1(peon) e3 cancel_probe_timeout (none scheduled)
2017-09-02 15:17:13.687869 7fe3d3c5a700 10 mon.vqdtz@1(probing) e3 _reset
2017-09-02 15:17:13.687871 7fe3d3c5a700 10 mon.vqdtz@1(probing) e3 cancel_probe_timeout (none scheduled)
2017-09-02 15:17:13.687873 7fe3d3c5a700 10 mon.vqdtz@1(probing) e3 timecheck_finish
2017-09-02 15:17:13.687882 7fe3d3c5a700 10 mon.vqdtz@1(probing) e3 scrub_reset
....
此处的cancel_probe_timeout即函数中cancel_probe_timeout()语句
2017-09-02 15:17:13.688846 7fe3d3459700 10 mon.vqdtz@1(electing) e3 cancel_probe_timeout (none scheduled)
2017-09-02 15:17:13.688875 7fe3d3459700  5 mon.vqdtz@1(electing).elector(42) start -- can i be leader?
2017-09-02 15:17:13.688915 7fe3d3459700  1 mon.vqdtz@1(electing).elector(42) init, last seen epoch 42
2017-09-02 15:17:13.688919 7fe3d3459700 10 mon.vqdtz@1(electing).elector(42) bump_epoch 42 to 43
2017-09-02 15:17:13.690523 7fe3d3459700 10 mon.vqdtz@1(electing) e3 join_election
2017-09-02 15:17:13.690540 7fe3d3459700 10 mon.vqdtz@1(electing) e3 _reset
2017-09-02 15:17:13.690543 7fe3d3459700 10 mon.vqdtz@1(electing) e3 cancel_probe_timeout (none scheduled)
2017-09-02 15:17:13.690545 7fe3d3459700 10 mon.vqdtz@1(electing) e3 timecheck_finish
2017-09-02 15:17:13.690549 7fe3d3459700 10 mon.vqdtz@1(electing) e3 scrub_reset


从 elector.call_election()开始,就开始调用elector类定义的方法,开始选举。

  void call_election() {                                                     
    start();
  }

void Elector::start()
{
  if (!participating) {
    dout(0) << "not starting new election -- not participating" << dendl;
    return;
  }
  dout(5) << "start -- can i be leader?" << dendl;

  acked_me.clear();
  classic_mons.clear();
  init();
  
  /*从稳定态进入选举态,需要将版本号从偶数往上抬,抬成奇数*/
  if (epoch % 2 == 0) 
    bump_epoch(epoch+1);  // odd == election cycle
  start_stamp = ceph_clock_now(g_ceph_context);
  electing_me = true;
  acked_me[mon->rank] = CEPH_FEATURES_ALL;
  leader_acked = -1;

  /*向每一个成员广播消息,提议开始重新选举*/
  for (unsigned i=0; i<mon->monmap->size(); ++i) {
    if ((int)i == mon->rank) continue;
    Message *m = new MMonElection(MMonElection::OP_PROPOSE, epoch, mon->monmap);
    mon->messenger->send_message(m, mon->monmap->get_inst(i));
  }              
  reset_timer();
}

当其他成员收到OP_PROPOSE的消息时,就知道了,需要开始新一轮的选举了。其他的成员收到消息之后,反应可以分成三种:

  • 赞成
  • 给其他所有成员发消息,选我
  • 不理

因为到底谁当选leader,取决于rank的大小,rank小者胜,因此收到消息之后,会比对rank,来决定做什么事情。

如果选举发起方的rank比自身的rank大

天子宁有种乎,兵强马壮者为之耳!如果从未收到过更强者(rank更小者)发来的选举请求,调用start_election,给所有成员发消息,让他们选自己为mon leader。这里面有一种场景:连续两个弱者先后要求当leader。处理第一个请求的时候,如果已经调用了start_election,要求大家选自己,那么处理第二个请求的时候,就没有必要重新再发一次选自己当leader的请求了。

  if (mon->rank < from) {
    // i would win over them.
    if (leader_acked >= 0) { // we already acked someone
      /*自己曾经认过怂,消息来源的mon还不如自己,直接不理,
       *来源的mon太弱,是不可能被自己承认的*/
      assert(leader_acked < from);  // and they still win, of course
      dout(5) << "no, we already acked " << leader_acked << dendl;
    } else {
      /*注意,electing_me记录了自己是否发出过选我为leader的请求
       *如果先后收到两个弱小者发来的选举请求,处理第一个的时候,本节点已经发出了选自己当leader的请求,
       *当第二个弱者消息到来的时候,没必要再发送选自己当leader的请求*/
      if (!electing_me) {
        mon->start_election();
      }
    }
  } else {
    // they would win over me
    if (leader_acked < 0 ||      // haven't acked anyone yet, or
        leader_acked > from ||   // they would win over who you did ack, or
        leader_acked == from) {  // this is the guy we're already deferring to
      defer(from);
    } else {
      // ignore them!
      dout(5) << "no, we already acked " << leader_acked << dendl;
    }
  }

这里面还有另外一种情况,即更强者曾经来过,自己曾经认过怂,承认过别人更强,那么这种情况下,采用的是不理的策略,自己都认了怂,这个消息的来源mon还不如自己,肯定是不能承认其leader的地位。

如果选举发起方的rank比自身的rank小

如果消息的来源mon,rank比自己小,要强于自己,发出选举倡议,让大家选自己是不可能了,只剩两种可能,要么承认它,要么不理它。

    if (leader_acked < 0 ||      /*从未承认过别人,从未认过怂*/ 
        leader_acked > from ||   /*虽然承认过别人,认过怂,无奈这次来的更强大,所以还是得认怂,承认它*/
        leader_acked == from) {  
      defer(from);  /*defer函数的作用是认可对方可以当leader*/
    }else {
      /*曾经认可过更强者,不可能向不够强的mon发送认可,不理*/
      dout(5) << "no, we already acked " << leader_acked << dendl;
    }

从上面可以看出,一个节点根据时序的不同,可能调用defer多次,承认多个mon当leader的请求。

如果曾经认可过更强的mon,当处于中间水平的mon到来的时候,自己是不可能再向该mon发送认可的回应。

通过上面的讨论可以看出,只有rank最小者,即最强者才有可能搜集到最多的承认。

void Elector::defer(int who)
{
  dout(5) << "defer to " << who << dendl;

  if (electing_me) {
    /*注意,认怂就要清零,哪怕曾竖起过大旗,要求别人选自己*/
    acked_me.clear();
    classic_mons.clear();
    electing_me = false;
  }

  // ack them
  leader_acked = who;
  ack_stamp = ceph_clock_now(g_ceph_context);
  /*发送OP_ACK承认对方可以当leader*/
  MMonElection *m = new MMonElection(MMonElection::OP_ACK, epoch, mon->monmap);
  m->sharing_bl = mon->get_supported_commands_bl();
  mon->messenger->send_message(m, mon->monmap->get_inst(who));
  
  // set a timer
  reset_timer(1.0);  // give the leader some extra time to declare victory
}

处理OP_ACK 消息

收到这个OP_ACK消息,表示对方认可自己当leader的请求,我们来看下如何处理:

void Elector::handle_ack(MMonElection *m)
{                                                                                                                                                      
  dout(5) << "handle_ack from " << m->get_source() << dendl;
  int from = m->get_source().num();

  assert(m->epoch % 2 == 1); // election
  if (m->epoch > epoch) {
    dout(5) << "woah, that's a newer epoch, i must have rebooted.  bumping and re-starting!" << dendl;
    bump_epoch(m->epoch);
    start();
    m->put();
    return;
  }
  assert(m->epoch == epoch);
  uint64_t required_features = mon->get_required_features();
  if ((required_features ^ m->get_connection()->get_features()) &
      required_features) {
    dout(5) << " ignoring ack from mon" << from
            << " without required features" << dendl;
    return;
  }
  
  if (electing_me) {
    // thanks
    /*搜集到一枚承认,记录在acked_me*/
    acked_me[from] = m->get_connection()->get_features();
    if (!m->sharing_bl.length())
      classic_mons.insert(from);
    dout(5) << " so far i have " << acked_me << dendl;
    
    /*如果所有成员都承认了自己的leader地位,那么宣布获胜,调用victory*/
    if (acked_me.size() == mon->monmap->size()) {
      // if yes, shortcut to election finish
      victory();
    }
  } else {
    // ignore, i'm deferring already.
    assert(leader_acked >= 0);
  }
  
  m->put();
}  

注意,如果所有的人都承认自己的leader地位,那么可以宣布获胜。但是有些情况下,无法等到所有的回应,比如某个ceph-mon进程已经不在了,是不可能得到其承认的。为了防止出现这种情况,start函数在通知其他节点选自己之后设置了定时器:

void Elector::start() 
{
    ...
    reset_timer();
}
void Elector::reset_timer(double plus)                                                    
{
  // set the timer
  cancel_timer();
  expire_event = new C_ElectionExpire(this);
  mon->timer.add_event_after(g_conf->mon_lease + plus,
                             expire_event);
}
void Elector::expire()
{
  dout(5) << "election timer expired" << dendl;
  
  /*注意,超过半数,就能宣布获胜
   *注意,如果认过怂,electing_me就变成false了,你就不可能宣布胜利了*/
  if (electing_me &&
      acked_me.size() > (unsigned)(mon->monmap->size() / 2)) {
    // i win
    victory();
  } else {
    // whoever i deferred to didn't declare victory quickly enough.
    if (mon->has_ever_joined)
      start();
    else                                            
      mon->bootstrap();
  }
}

如果自己获胜的话,会给其他节点发送OP_VICTORY消息,告诉别的节点,自己已经当选leader了。

void Elector::victory()
{

  /*选举过程完成,需要抬election_epoch,变成偶数*/
  assert(epoch % 2 == 1);  // election
  bump_epoch(epoch+1);     // is over!  
  // tell everyone!
  for (set<int>::iterator p = quorum.begin();
       p != quorum.end();
       ++p) {
    if (*p == mon->rank) continue;
    MMonElection *m = new MMonElection(MMonElection::OP_VICTORY, epoch, mon->monmap);
    m->quorum = quorum;
    m->quorum_features = features;
    m->sharing_bl = *cmds_bl;
    mon->messenger->send_message(m, mon->monmap->get_inst(*p));
  }    
  // tell monitor                                                  
  mon->win_election(epoch, quorum, features, cmds, cmdsize, &copy_classic_mons);
}

而竞选的失败者,在handle_victory函数中处理OP_VICTORY消息:

void Elector::handle_victory(MMonElection *m)
{
  ...
   bump_epoch(m->epoch) ;
   // they win
   mon->lose_election(epoch, m->quorum, from, m->quorum_features);
   ...
}

注意,胜利者通过win_election重新初始化,成为Leader;而失败者,通过lose_election重新初始化,变成PEON。

]]>
ceph-mon的lease机制 2017-08-23T17:57:40+00:00 Bean Li http://bean-li.github.io/ceph-mon-lease 前言

ceph-mon负责很多的功能:

  • startup
  • data store
  • data sync
  • data check
  • scrub
  • leader elect
  • timecheck
  • lease
  • paxos
  • paxos service
  • consistency

本文介绍lease机制,即租约机制。

ceph-osd之间,会有心跳机制:

osd_heartbeat_interval   (默认是6)
osd_heartbeat_grace (默认是20)

即OSD Peer之间,其实形成了彼此监控的网络,每6秒向Peer发送心跳信息,如果超过osd_heartbeat_grace时间没收到Peer OSD的心跳信息,则send_failure,状告该OSD已经fail。

这种机制的存在确保了当OSD 异常退出或者网络不通的时候,ceph-mon能够发现。

当集群中存在多个ceph-mon的时候,有leader,有peon。ceph-mon进程也可能因为某种原因异常死亡或者网络不通,也必须有机制保障及时发现,这个机制就是lease。

monitor内部采用lease协议,保证副本数据在一定时间范围内可读写(写需要是leader节点),同时也用来发现monitor的异常,然后重新选举。

leader节点会定期发送lease消息,延长各个peon的时间,但是如果某个peon 节点挂掉,leader节点就无法收到lease_ack消息,超时之后,就会重新选举。

同样道理,leader节点也可能会异常宕机,peon节点也要能监督leader节点。如果leader down掉,peon节点就收不到来自leader的lease更新消息,超时之后,也会选举。

这里面有几个参数,比如

  • 多久发送一次lease消息:mon_lease_renew_interval 默认3秒
  • 每次延长租约多长时间:mon_lease 默认是5秒
  • 超时重新选举的timeout时间是多久:mon_lease_ack_timeout 默认是10秒

其中mon_lease_ack_timeout对monitor leader节点和peon节点都是有效。对于monitor leader来说,如果在mon_lease_ack_timeout 的时间内,没有搜集到所有peon的lease ack,就判定超时,调用bootstrap重新选举。在另一个方面,如果peon节点在mon_lease_ack_timeout 时间内,没有收到新的lease 信息,就判定超时,也会发起重新选举。

A面:leader

我们首先站在leader节点的角度,看下lease相关的操作。lease这个功能的发起点是extend_lease函数:

void Paxos::extend_lease()
{
  assert(mon->is_leader());
  //assert(is_active());

  /*当前时间+5秒,作为新的lease_expire*/
  lease_expire = ceph_clock_now(g_ceph_context);
  lease_expire += g_conf->mon_lease;
  
  /*已经收到的ack的集合 acked_lease清空,将当前mon leader
   *加入其中*/
  acked_lease.clear();
  acked_lease.insert(mon->rank);

  dout(7) << "extend_lease now+" << g_conf->mon_lease 
          << " (" << lease_expire << ")" << dendl;

  // bcast
  /*向所有的peon发送OP_LEASE消息,消息体中带上lease_expire */
  for (set<int>::const_iterator p = mon->get_quorum().begin();
      p != mon->get_quorum().end(); ++p) {
    if (*p == mon->rank) continue;
    MMonPaxos *lease = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_LEASE,
                                     ceph_clock_now(g_ceph_context));
    lease->last_committed = last_committed;
    lease->lease_timestamp = lease_expire;
    lease->first_committed = first_committed;
    mon->messenger->send_message(lease, mon->monmap->get_inst(*p));
  }

  /*注册ack timeout event,如果在规定时间(默认10秒)内,并未搜集齐ack,
   *那么就执行C_LeaseAckTimeout中定义的超时函数
   *正常情况下,该定时事件总是在收到最后一个ack后,被cancel掉,无法获得执行
   *只有异常发生,才会执行Paxos::lease_ack_timeout*/
  if (!lease_ack_timeout_event) {
    lease_ack_timeout_event = new C_LeaseAckTimeout(this);
    mon->timer.add_event_after(g_conf->mon_lease_ack_timeout, 
                               lease_ack_timeout_event);
  }

  /*因为extend_lease 要一轮一轮的跑下去,因此,注册下一次调用extend_lease的定时事件
   *C_LeaseRenew,触发时间是3秒后,正常情况下总是触发,发起下一轮*/
  lease_renew_event = new C_LeaseRenew(this);
  utime_t at = lease_expire;
  at -= g_conf->mon_lease;
  at += g_conf->mon_lease_renew_interval;
  mon->timer.add_event_at(at, lease_renew_event);
}

发送消息之后,mon leader就开始等待peon返回的lease ack消息。收到消息后,monitor leader

void Paxos::dispatch(PaxosServiceMessage *m)
{
    switch (m->get_type()) {  
    case MSG_MON_PAXOS:
    {                                                
      MMonPaxos *pm = (MMonPaxos*)m;
      // NOTE: these ops are defined in messages/MMonPaxos.h
      switch (pm->op) {
      ...
        case MMonPaxos::OP_LEASE_ACK:
          handle_lease_ack(pm);
          break;
      }
     ...
    }
    ...
}

void Paxos::handle_lease_ack(MMonPaxos *ack)
{
  int from = ack->get_source().num();

  if (!lease_ack_timeout_event) {
    dout(10) << "handle_lease_ack from " << ack->get_source() 
             << " -- stray (probably since revoked)" << dendl;
  }
  else if (acked_lease.count(from) == 0) {
    acked_lease.insert(from);
    
    if (acked_lease == mon->get_quorum()) {
      // 最后一个peon的消息也收到了,那么没有超时,就取消掉lease_ack_timeout_event
      dout(10) << "handle_lease_ack from " << ack->get_source() 
               << " -- got everyone" << dendl;
      mon->timer.cancel_event(lease_ack_timeout_event);
      lease_ack_timeout_event = 0;
    } else {
      /*并非最后一个peon的消息,除了打印,并不做特殊的处理*/
      dout(10) << "handle_lease_ack from " << ack->get_source() 
               << " -- still need "
               << mon->get_quorum().size() - acked_lease.size()
               << " more" << dendl;
    }
  } else {
    /*已经acked的peon,会记录再acked_lease集合中,如果已经收到对应ack消息了,
     *那么就是重复的消息了,ignore掉*/
    dout(10) << "handle_lease_ack from " << ack->get_source()
             << " dup (lagging!), ignoring" << dendl;
  }
  warn_on_future_time(ack->sent_timestamp, ack->get_source());
  
  ack->put();
}

对于monitor leader来说,每mon_lease_renew_interval秒(默认3秒)触发一次extend_lease。在该函数中,monitor leader会向所有的peon发送lease消息,然后设置定时事件C_LeaseAckTimeout;如果在mon_lease_ack_timeout时间内搜集齐所有的lease ack消息,就既往不咎,取消掉C_LeaseAckTimeout定时事件。

如果超过mon_lease_ack_timeout,也没搜集齐所有的lease ack怎么办?通过lease_ack_timeout函数,调用bootstrap函数,发起选举。

class C_LeaseAckTimeout : public Context {
    Paxos *paxos;
  public:
    C_LeaseAckTimeout(Paxos *p) : paxos(p) {}
    void finish(int r) { 
      if (r == -ECANCELED)
        return;
      paxos->lease_ack_timeout();
    }                                                                                                                                                  
};
  
void Paxos::lease_ack_timeout()                                                    
{   
  dout(1) << "lease_ack_timeout -- calling new election" << dendl;
  assert(mon->is_leader());
  assert(is_active());
  logger->inc(l_paxos_lease_ack_timeout);
  lease_ack_timeout_event = 0;
  /*bootstrap 发起monitor leader的选举*/
  mon->bootstrap();
} 

B面 peon

对于peon节点而言,收到OP_LEASE消息,是讨论的起点:

void Paxos::handle_lease(MMonPaxos *lease)                                                   
{
  // sanity
  if (!mon->is_peon() ||
      last_committed != lease->last_committed) {
    dout(10) << "handle_lease i'm not a peon, or they're not the leader,"
             << " or the last_committed doesn't match, dropping" << dendl;
    lease->put();
    return;
  }
  warn_on_future_time(lease->sent_timestamp, lease->get_source());

  /*延长lease 到mon leader指定的时间*/
  if (lease_expire < lease->lease_timestamp) {
    lease_expire = lease->lease_timestamp;
    utime_t now = ceph_clock_now(g_ceph_context);
    /*如果peon和monitor leader的时间差太大,lease_expire小于now,那么警告*/
    if (lease_expire < now) {
      utime_t diff = now - lease_expire;
      derr << "lease_expire from " << lease->get_source_inst() << " is " << diff << " seconds in the past; mons are probably laggy (or possibly clocks are too skewed)" << dendl; 
    }
  }

  state = STATE_ACTIVE;

  /*发送OP_LEASE_ACK消息到mon leader*/
  dout(10) << "handle_lease on " << lease->last_committed
           << " now " << lease_expire << dendl;
  MMonPaxos *ack = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_LEASE_ACK,
                                 ceph_clock_now(g_ceph_context));
  ack->last_committed = last_committed;
  ack->first_committed = first_committed;
  ack->lease_timestamp = ceph_clock_now(g_ceph_context);
  lease->get_connection()->send_message(ack);

  // (re)set timeout event.
  reset_lease_timeout();

  // kick waiters
  finish_contexts(g_ceph_context, waiting_for_active);
  if (is_readable())
    finish_contexts(g_ceph_context, waiting_for_readable);

  lease->put();
}  

前面讲过,mon leader和peon是互相监督的,peon对monitor leader的监督,体现在reset_lease_timeout函数。它会以收到OP_LEASE消息的时间为起点,注册一个超时时间为mon_lease_ack_timeout的定时事件。如果该定时器超时了,表示在过去的mon_lease_ack_timeout时间内,没有收到任何的OP_LEASE消息,基本可以确定mon leader出问题了。

void Paxos::reset_lease_timeout()
{
  dout(20) << "reset_lease_timeout - setting timeout event" << dendl;
  /*先取消掉当前的定时事件
   *事实上,该定时事件几乎总是被cancel掉,因为正常情况下,peon会每隔3秒,源源不断地收到OP_LEASE消息
   */
  if (lease_timeout_event)
    mon->timer.cancel_event(lease_timeout_event);
  lease_timeout_event = new C_LeaseTimeout(this);                                            
  mon->timer.add_event_after(g_conf->mon_lease_ack_timeout, lease_timeout_event);
}

通过这个C_LeaseTimeout定时事件,peon也在监督monitor leader,如果monitor leader迟迟不发送OP_LEASE消息,延长租约,那么peon会通过如下方法,发起选举:

  class C_LeaseTimeout : public Context {
    Paxos *paxos;
  public:
    C_LeaseTimeout(Paxos *p) : paxos(p) {}
    void finish(int r) {
      if (r == -ECANCELED)
        return;
      paxos->lease_timeout();
    }
  };
  
void Paxos::lease_timeout()
{
  dout(1) << "lease_timeout -- calling new election" << dendl;
  /*只有peon节点才会调用该函数*/
  assert(mon->is_peon());
  logger->inc(l_paxos_lease_timeout);
  lease_timeout_event = 0;
  /*调用bootstrap发起选举*/
  mon->bootstrap();
}

注意,每次续约把lease_expire延长到当前时间之后mon_lease(5秒),每隔3秒续约一次,但是超时时间是10秒,那么就会有一段时间,租约已经过期,但是还没超时触发重新选举。这段时间内租约是无效的:

bool Paxos::is_lease_valid()
{
  return ((mon->get_quorum().size() == 1)
      || (ceph_clock_now(g_ceph_context) < lease_expire));
}   

注意这段时间内,是不可读写的:

bool Paxos::is_readable(version_t v)
{
  bool ret;
  if (v > last_committed)
    ret = false;
  else
    ret =
      (mon->is_peon() || mon->is_leader()) &&
      (is_active() || is_updating() || is_writing()) &&
      last_committed > 0 &&           // must have a value
      (mon->get_quorum().size() == 1 ||  // alone, or
       is_lease_valid()); // have lease                                                                                                                
  dout(5) << __func__ << " = " << (int)ret
          << " - now=" << ceph_clock_now(g_ceph_context)
          << " lease_expire=" << lease_expire
          << " has v" << v << " lc " << last_committed
          << dendl;
  return ret;
}
bool Paxos::is_writeable()
{
  return
    mon->is_leader() &&
    is_active() &&
    is_lease_valid();
}  
]]>
ceph-mon的timecheck机制 2017-08-19T14:57:40+00:00 Bean Li http://bean-li.github.io/ceph-mon-timecheck 前言

ceph-mon负责的功能有很多:

  • startup
  • data store
  • data sync
  • data check
  • scrub
  • leader elect
  • timecheck
  • lease
  • paxos
  • paxos service
  • consistency

我们今天先挑一个软一点的柿子捏一下,简单介绍下timecheck。

分布式系统正常运转依赖系统时间,ceph通过这个timecheck机制来检查每个monitor的时间是否一致,如果误差过大(clock skew),会发出警告信息。

我们知道,集群中多个节点可能都存在ceph-mon,但是扮演的角色不同:有一个节点上的monitor是leader,其他节点上的monitor为peon。在timecheck机制中,两者扮演的角色也不同,如下图所示:

注意,monitor leader是整个战术的发起点,它会主动向所有的peon发送OP_PING请求,所有的peon monitor会回复OP_PONG,在OP_PONG消息中,会带上自己这边的时间戳。当monitor leader收到回应后,会计算出monitor leader和各个peon之间的时间偏移(估算,无法做到绝对精确),记录到ceph-mon的数据结构中。

当所有的peon都回应过OP_PONG之后,monitor leader会在timecheck_finish_round函数中调用timecheck_report,给所有的peon发送OP_REPORT消息。消息体中带有monitor leader算出来的时钟偏移和往返延迟,这样peon收到OP_REPORT消息之后,就能得到该节点与monitor leader之间的往返延迟和时钟偏移。

粗略的过程就是如上,下面要展开细节,详细的描述这个过程。

原点

不介绍ceph-mon的PAXOS以及election,似乎很难介绍好其他功能,但是我们还是暂时放下Paxos和election,把起点放在某个节点赢得monitor leader选举的那一刻:

如同封建时代,新皇登基总要大赦天下,提拔一群新的大臣到重要岗位,某个节点的ceph-mon赢得monitor leader 选举之后,也会做一些重新洗牌的动作。其中timecheck的重新初始化也在其中。

void Monitor::win_election(epoch_t epoch, set<int>& active, uint64_t features,
                           const MonCommand *cmdset, int cmdsize, 
                           const set<int> *classic_monitors)
{
    if (monmap->size() > 1 &&
      monmap->get_epoch() > 0)
      timecheck_start();
}
void Monitor::timecheck_start()
{
  dout(10) << __func__ << dendl;
  timecheck_cleanup();
  timecheck_start_round();
}
void Monitor::timecheck_cleanup()
{
  timecheck_round = 0;
  timecheck_acks = 0;
  timecheck_round_start = utime_t();

  if (timecheck_event) {
    timer.cancel_event(timecheck_event);
    timecheck_event = NULL;
  }
  timecheck_waiting.clear();
  timecheck_skews.clear();
  timecheck_latencies.clear();
}

我们可以看到,新当选的monitor leader通过win_election—>timecheck_start—->timecheck_cleanup,完成了对timecheck相关数据结构的重新洗牌。

竞争leader的失败者,也需要重新洗牌,完成对timecheck相关数据结构的初始化。

void Monitor::lose_election(epoch_t epoch, set<int> &q, int l, uint64_t features) 
{
  state = STATE_PEON;
  ...
  logger->inc(l_mon_election_win);
  finish_election();                                                  
}
void Monitor::finish_election()
{
  apply_quorum_to_compatset_features();
  timecheck_finish();
  ...
}
void Monitor::timecheck_finish()
{
  dout(10) << __func__ << dendl;
  timecheck_cleanup();
}
void Monitor::timecheck_cleanup()                                                
{
  timecheck_round = 0;
  timecheck_acks = 0;
  timecheck_round_start = utime_t();

  if (timecheck_event) {
    timer.cancel_event(timecheck_event);
    timecheck_event = NULL;
  }
  timecheck_waiting.clear();
  timecheck_skews.clear();
  timecheck_latencies.clear();
}

通过上面的讨论可以看到,竞争leader的失败者,也重新初始化了timecheck相关的数据结构。

timecheck的流程

现在我们可以开始讨论下相关的数据结构到底记录什么信息了。

  map<entity_inst_t, utime_t> timecheck_waiting;
  map<entity_inst_t, double> timecheck_skews;
  map<entity_inst_t, double> timecheck_latencies;
  // odd value means we are mid-round; even value means the round has
  // finished.
  version_t timecheck_round; 
  
  unsigned int timecheck_acks;
  utime_t timecheck_round_start;

首先的话timecheck_round是一个version_t类型,即uint64_t类型的变量。因为timecheck是一轮一轮的做的,因此需要一个轮数的概念。当timecheck_round 是奇数还是偶数,有不同的含义,后面会详细分析。

timecheck_round_start是一个时间值,记录的是本轮timecheck发起的时间。记录下这个时间之后,就要开始给各个PEON monitor发送OP_PING消息了。这个时间非常有用:有些时候可能并不顺利,很可能过了很久,也收不到某个PEON回应的OP_PONG消息(比如发送的时候该PEON网络还是通的,但PEON收到消息之后网络不通了),monitor leader就无法集齐所有PEON monitor的回应。这种情况下,timecheck需要有cancel的机制,不能因为单个节点的故障,导致大家的timecheck都无法进行。

void Monitor::timecheck_start_round()
{
  dout(10) << __func__ << " curr " << timecheck_round << dendl;
  assert(is_leader());
  
  if (monmap->size() == 1) {
    assert(0 == "We are alone; this shouldn't have been scheduled!");
    return;
  }
  
  if (timecheck_round % 2) {
    dout(10) << __func__ << " there's a timecheck going on" << dendl;
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    double max = g_conf->mon_timecheck_interval*3;
    if (curr_time - timecheck_round_start < max) {
      dout(10) << __func__ << " keep current round going" << dendl;
      goto out;
    } else {
      dout(10) << __func__
               << " finish current timecheck and start new" << dendl;
      timecheck_cancel_round();
    }
  }
  
  assert(timecheck_round % 2 == 0);
  timecheck_acks = 0;
  timecheck_round ++;
  timecheck_round_start = ceph_clock_now(g_ceph_context);
  dout(10) << __func__ << " new " << timecheck_round << dendl;

  timecheck();
out:
  dout(10) << __func__ << " setting up next event" << dendl;
  timecheck_event = new C_TimeCheck(this);
  timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);
} 

前面讲过,timecheck_round是奇数还是偶数,含义是不同的

  • 奇数:timecheck已经发起,但是尚未结束
  • 偶数:timecheck已经完成,正在等待下一轮timecheck的发起。

wait a minute, 我们提到了等待下一轮,那么到底多久是一轮呢?我们看定时器:

out:
  dout(10) << __func__ << " setting up next event" << dendl;
  timecheck_event = new C_TimeCheck(this);
  timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);  
OPTION(mon_timecheck_interval, OPT_FLOAT, 300.0) 

这个浮点数300秒,定义了timecheck的周期,每五分钟,发起一轮timecheck。注意C_TimeCheck:

  struct C_TimeCheck : public Context {
    Monitor *mon;
    C_TimeCheck(Monitor *m) : mon(m) { }
    void finish(int r) {
      mon->timecheck_start_round();                                            
    }
  }; 

定时器到了,会执行下一轮的timecheck_start_round函数。

注意哈,当ceph-mon成为monitor leader之后,在win_election函数中调用timecheck_start函数,在该函数中会第一次调用timecheck_start_round,后续的timecheck发起,就靠定时任务了。每过300秒,就会发起下一轮的timecheck。

void Monitor::timecheck_start()                                               
{
  dout(10) << __func__ << dendl;
  timecheck_cleanup();
  timecheck_start_round();
}

timecheck_start_round作为timecheck的发起者,就非常重要了。

timecheck_start_round函数

   /*如果是只有一个cephmon,压根就不需要发起timecheck,
    *事实上win_election中也判定了,是否是一个mon*/
  if (monmap->size() == 1) {
    assert(0 == "We are alone; this shouldn't have been scheduled!");
    return;
  }

理想很丰满,现实很骨感,实际情况是很复杂的:可能由于某种原因,上一轮的timecheck迟迟不能结案,现实中又不能不理。因此,下面这段逻辑处理的是timecheck因为某些原因无法结束的情形。如果定时器timeout了,即等待了300秒,结果发现上一轮的timecheck居然还没完工,那么是放弃还是继续等待?取决于已等待的时间:如果等待超过3倍的mon_timecheck_interval,即15分钟以上,还没等到timecheck结束,那么就不等了,直接cancel本轮timecheck;但如果低于3倍时间,就goto out设置定时器,再等一轮。

 /*timecheck_round为奇数的时候,表示有一轮timecheck 正在进行中*/ 
 if (timecheck_round % 2) {
    dout(10) << __func__ << " there's a timecheck going on" << dendl;
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    double max = g_conf->mon_timecheck_interval*3;
    /*如果等待时间低于3倍的mon_timecheck_interval,那么再等300秒
     * goto out是为了设置新的定时器的*/
    if (curr_time - timecheck_round_start < max) {
      dout(10) << __func__ << " keep current round going" << dendl;
      goto out;
    } else {
      dout(10) << __func__
               << " finish current timecheck and start new" << dendl;
      timecheck_cancel_round();
    }
  }

正常情况下,300秒的时间,timecheck肯定是完成了,但是也有异常:比如发送OP_PING的时候PEON还好好的,但某一个PEON就是不回消息,这种情况下没有搜集齐所有的响应,本轮timecheck就不能结束。上面的逻辑就是处理这个的。

这一部分逻辑是异常处理,正常情况下不会走到。正常情况下,走下面这个逻辑:

 /*assert判定,并无当前正在进行的timecheck*/
  assert(timecheck_round % 2 == 0);
  /*新的一轮check,自然一个回应也没收到*/
  timecheck_acks = 0;
  /*timecheck_round自加,变成奇数,表示正在进行timecheck*/
  timecheck_round ++;
  /*记录本轮timecheck的起始时间,到timecheck_round_start变量*/
  timecheck_round_start = ceph_clock_now(g_ceph_context);
  dout(10) << __func__ << " new " << timecheck_round << dendl;
  
  /*真正发起timecheck*/
  timecheck();
out:
  dout(10) << __func__ << " setting up next event" << dendl;
  timecheck_event = new C_TimeCheck(this);
  timer.add_event_after(g_conf->mon_timecheck_interval, timecheck_event);

timecheck函数

void Monitor::timecheck()
{
  dout(10) << __func__ << dendl;
  assert(is_leader());
  if (monmap->size() == 1) {
    assert(0 == "We are alone; we shouldn't have gotten here!");
    return;
  }
  assert(timecheck_round % 2 != 0);

  timecheck_acks = 1; // we ack ourselves

  dout(10) << __func__ << " start timecheck epoch " << get_epoch()
           << " round " << timecheck_round << dendl;

  // we are at the eye of the storm; the point of reference
  timecheck_skews[messenger->get_myinst()] = 0.0;
  timecheck_latencies[messenger->get_myinst()] = 0.0;

  for (set<int>::iterator it = quorum.begin(); it != quorum.end(); ++it) {
    if (monmap->get_name(*it) == name)
      continue;
      
    entity_inst_t inst = monmap->get_inst(*it);
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    timecheck_waiting[inst] = curr_time;
    MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_PING);
    m->epoch = get_epoch();
    m->round = timecheck_round;
    dout(10) << __func__ << " send " << *m << " to " << inst << dendl;
    messenger->send_message(m, inst);
  }
}

首先是下面的逻辑,用来处理monitor leader自身到自身的时间偏移,毫无疑问,自己和自己肯定是没有任何偏移的,也不需要假惺惺地发消息测试:

  timecheck_acks = 1; // we ack ourselves

  dout(10) << __func__ << " start timecheck epoch " << get_epoch()
           << " round " << timecheck_round << dendl;

  // we are at the eye of the storm; the point of reference
  timecheck_skews[messenger->get_myinst()] = 0.0;
  timecheck_latencies[messenger->get_myinst()] = 0.0;

接下来是发给其他ceph-mon的消息:

  for (set<int>::iterator it = quorum.begin(); it != quorum.end(); ++it) {
    /*如果ceph-mon是leader自己,就不用发消息了*/
    if (monmap->get_name(*it) == name)
      continue;
      
    entity_inst_t inst = monmap->get_inst(*it);
    utime_t curr_time = ceph_clock_now(g_ceph_context);
    /*记录下发送OP_PING的时间点,到timecheck_waiting[inst],后面会有用
     *后面要计算latency,这时候,发送的时间和收到OP_PONG响应的时间,就能估算延迟了*/
    timecheck_waiting[inst] = curr_time;
    MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_PING);
    m->epoch = get_epoch();
    m->round = timecheck_round;
    dout(10) << __func__ << " send " << *m << " to " << inst << dendl;
    messenger->send_message(m, inst);
  }

handle_timecheck 函数

对于Monitor::dispatch函数我就不提了,它是整个Monitor的消息集散中心。与timecheck相关的消息类型,都是MSG_TIMECHECK:

    case MSG_TIMECHECK:                                           
      handle_timecheck(static_cast<MTimeCheck *>(m));
      break;

我们细细来看handle_timecheck函数:

void Monitor::handle_timecheck(MTimeCheck *m)
{
  dout(10) << __func__ << " " << *m << dendl;
  /*monitor leader只会、应该收到 OP_PONG的消息*/
  if (is_leader()) {
    if (m->op != MTimeCheck::OP_PONG) {
      dout(1) << __func__ << " drop unexpected msg (not pong)" << dendl;
    } else {
      handle_timecheck_leader(m);
    }
  } else if (is_peon()) {
    /*非Leader,则只应该收到OP_PING和OP_REPORT两种消息*/
    if (m->op != MTimeCheck::OP_PING && m->op != MTimeCheck::OP_REPORT) {
      dout(1) << __func__ << " drop unexpected msg (not ping or report)" << dendl;
    } else {
      handle_timecheck_peon(m);
    }
  } else {
    dout(1) << __func__ << " drop unexpected msg" << dendl;
  }
  m->put();
}

很明显,peon只会收到OP_PING和OP_REPORT两种消息,先收到OP_PING。

void Monitor::handle_timecheck_peon(MTimeCheck *m)
{
  ...
  if (m->epoch != get_epoch()) {
    dout(1) << __func__ << " got wrong epoch "
            << "(ours " << get_epoch() 
            << " theirs: " << m->epoch << ") -- discarding" << dendl;
    return;
  }

  /*如果收到消息的round,小于自己的timecheck_round,表示迷路已久的OP_PING终于到了,
   *因为时过境迁,这种过时的消息已经没有回复的必要了。*/
  if (m->round < timecheck_round) {
    dout(1) << __func__ << " got old round " << m->round
            << " current " << timecheck_round
            << " (epoch " << get_epoch() << ") -- discarding" << dendl;
    return;
  }

  /*peon修改自己的timecheck_round,向monitor leader看齐*/
  timecheck_round = m->round;

  assert((timecheck_round % 2) != 0);
  MTimeCheck *reply = new MTimeCheck(MTimeCheck::OP_PONG);
  utime_t curr_time = ceph_clock_now(g_ceph_context);
  /*把当前节点的时间写入消息体,回给monitor leader*/
  reply->timestamp = curr_time;
  reply->epoch = m->epoch;
  reply->round = m->round;
  dout(10) << __func__ << " send " << *m
           << " to " << m->get_source_inst() << dendl;
  m->get_connection()->send_message(reply);
}

Now let's see how the monitor leader handles the OP_PONG it receives:

void Monitor::handle_timecheck_leader(MTimeCheck *m)
{
  dout(10) << __func__ << " " << *m << dendl;
  /* handles PONG's -- the monitor leader should only receive OP_PONG */
  assert(m->op == MTimeCheck::OP_PONG);

  entity_inst_t other = m->get_source_inst();
  if (m->epoch < get_epoch()) {
    dout(1) << __func__ << " got old timecheck epoch " << m->epoch
            << " from " << other
            << " curr " << get_epoch()
            << " -- severely lagged? discard" << dendl;
    return;
  }
  assert(m->epoch == get_epoch());

  if (m->round < timecheck_round) {
    dout(1) << __func__ << " got old round " << m->round
            << " from " << other
            << " curr " << timecheck_round << " -- discard" << dendl;
    return;
  }

  utime_t curr_time = ceph_clock_now(g_ceph_context);

  /* timecheck_waiting records when each OP_PING was sent.
   * Once the send time has been fetched the entry can be erased;
   * the send time is then used to compute the latency. */
  assert(timecheck_waiting.count(other) > 0);
  utime_t timecheck_sent = timecheck_waiting[other];
  timecheck_waiting.erase(other);
  
  /* Special case: the receive time is earlier than the send time, which
   * means the monitor leader's clock was adjusted backwards. If that
   * happens, this round of timecheck is pointless, so cancel it. */
  if (curr_time < timecheck_sent) {
    // our clock was readjusted -- drop everything until it all makes sense.
    dout(1) << __func__ << " our clock was readjusted --"
            << " bump round and drop current check"
            << dendl;
    timecheck_cancel_round();
    return;
  }

  /* Update the monitor leader's latency to this peon.
   * The calculation is blunt: receive time minus send time.
   * Note that if a previous value exists, it is blended with the new sample.
   * The final latency is stored in timecheck_latencies. */
  double latency = (double)(curr_time - timecheck_sent);
  if (timecheck_latencies.count(other) == 0)
    timecheck_latencies[other] = latency;
  else {
    double avg_latency = ((timecheck_latencies[other]*0.8)+(latency*0.2));
    timecheck_latencies[other] = avg_latency;
  }
  

Up to this point the logic is clear: the latency is the time between sending the OP_PING and receiving the OP_PONG reply, and the result is stored in the timecheck_latencies map.

Now we reach the core: estimating the clock offset between two nodes. Ceph explains it in a long comment:

/*
   * update skews
   *
   * some nasty thing goes on if we were to do 'a - b' between two utime_t,
   * and 'a' happens to be lower than 'b'; so we use double instead.
   *
   * latency is always expected to be >= 0.
   *
   * delta, the difference between theirs timestamp and ours, may either be
   * lower or higher than 0; will hardly ever be 0.
   *
   * The absolute skew is the absolute delta minus the latency, which is
   * taken as a whole instead of an rtt given that there is some queueing
   * and dispatch times involved and it's hard to assess how long exactly
   * it took for the message to travel to the other side and be handled. So
   * we call it a bounded skew, the worst case scenario.
   *
   * Now, to math!
   *
   * Given that the latency is always positive, we can establish that the
   * bounded skew will be:
   *
   *  1. positive if the absolute delta is higher than the latency and
   *     delta is positive
   *  2. negative if the absolute delta is higher than the latency and
   *     delta is negative.
   *  3. zero if the absolute delta is lower than the latency.
   *
   * On 3. we make a judgement call and treat the skew as non-existent.
   * This is because that, if the absolute delta is lower than the
   * latency, then the apparently existing skew is nothing more than a
   * side-effect of the high latency at work.
   *
   * This may not be entirely true though, as a severely skewed clock
   * may be masked by an even higher latency, but with high latencies
   * we probably have worse issues to deal with than just skewed clocks.
   */

This comment explains how the clock skew between two nodes is computed. Let a be the peon's timestamp and b the monitor leader's current timestamp when the OP_PONG arrives. Roughly, the skew is a - b, but latency still has to be taken into account.

The value a - b is compared against the latency: if |a - b| is smaller than the latency, the apparent offset between a and b is smaller than the network delay, so it is not worth counting as skew. That is case 3 in the comment.

  double delta = ((double) m->timestamp) - ((double) curr_time);
  double abs_delta = (delta > 0 ? delta : -delta);
  double skew_bound = abs_delta - latency;
  /* If the offset is smaller than the network latency, treat skew_bound as 0: no skew.
   * Otherwise the skew is skew_bound, signed by delta to show whether the peon
   * is ahead of or behind the monitor leader. */
  if (skew_bound < 0)
    skew_bound = 0;
  else if (delta < 0)
    skew_bound = -skew_bound;

  ostringstream ss;
  health_status_t status = timecheck_status(ss, skew_bound, latency);
  if (status == HEALTH_ERR)
    clog->error() << other << " " << ss.str() << "\n";
  else if (status == HEALTH_WARN)
    clog->warn() << other << " " << ss.str() << "\n";

  dout(10) << __func__ << " from " << other << " ts " << m->timestamp
           << " delta " << delta << " skew_bound " << skew_bound
           << " latency " << latency << dendl;

  if (timecheck_skews.count(other) == 0) {
    timecheck_skews[other] = skew_bound;
  } else {
    timecheck_skews[other] = (timecheck_skews[other]*0.8)+(skew_bound*0.2);
  }
  /* one more peon has acked */
  timecheck_acks++;
  /* if every peon has replied, run timecheck_finish_round */
  if (timecheck_acks == quorum.size()) {
    dout(10) << __func__ << " got pongs from everybody ("
             << timecheck_acks << " total)" << dendl;
    assert(timecheck_skews.size() == timecheck_acks);
    assert(timecheck_waiting.empty());
    // everyone has acked, so bump the round to finish it.
    timecheck_finish_round();
  }

The calculation follows exactly the three cases in the comment. Once replies from all the peons have been received, timecheck_finish_round is called.

  /* timecheck_finish_round is shared: it runs whether the round succeeded
   * or was cancelled. The success flag tells them apart: true means the
   * round completed and every peon's OP_PONG arrived; false means the
   * round was cancelled for some reason. */
void Monitor::timecheck_finish_round(bool success)
{
  dout(10) << __func__ << " curr " << timecheck_round << dendl;
  assert(timecheck_round % 2);
  timecheck_round ++;
  timecheck_round_start = utime_t();

  /* on success, send OP_REPORT to every peon so they can record the freshly computed clock skews */
  if (success) {
    assert(timecheck_waiting.empty());
    assert(timecheck_acks == quorum.size());
    timecheck_report();
    return;
  }

  /* on cancellation, log the peons still being waited on, then drop them from timecheck_waiting */
  dout(10) << __func__ << " " << timecheck_waiting.size()
           << " peers still waiting:";
  for (map<entity_inst_t,utime_t>::iterator p = timecheck_waiting.begin();
      p != timecheck_waiting.end(); ++p) {
    *_dout << " " << p->first.name;
  }
  *_dout << dendl;
  timecheck_waiting.clear();
  dout(10) << __func__ << " finished to " << timecheck_round << dendl;
}

Note that only when replies from all the peons have been received does the leader call timecheck_report to send OP_REPORT to each peon. Why send it? To distribute the latest results: every peon learns each peon's clock offset and latency relative to the monitor leader.

void Monitor::timecheck_report()
{
  dout(10) << __func__ << dendl;
  assert(is_leader());
  assert((timecheck_round % 2) == 0);
  if (monmap->size() == 1) {
    assert(0 == "We are alone; we shouldn't have gotten here!");
    return;
  }
  
  assert(timecheck_latencies.size() == timecheck_skews.size());
  bool do_output = true; // only output report once
  for (set<int>::iterator q = quorum.begin(); q != quorum.end(); ++q) {
    /* skip the monitor leader itself: no need to send a report to ourselves */
    if (monmap->get_name(*q) == name)
      continue;
      
    MTimeCheck *m = new MTimeCheck(MTimeCheck::OP_REPORT);
    m->epoch = get_epoch();
    m->round = timecheck_round;

    for (map<entity_inst_t, double>::iterator it = timecheck_skews.begin(); it != timecheck_skews.end(); ++it) {
      double skew = it->second;
      double latency = timecheck_latencies[it->first];
      
      /* the message body carries the skew and latency values, giving the peer peon the latest results */
      m->skews[it->first] = skew;
      m->latencies[it->first] = latency;
      
      if (do_output) {
        dout(25) << __func__ << " " << it->first
                 << " latency " << latency
                 << " skew " << skew << dendl;
      }
    }
    do_output = false;
    entity_inst_t inst = monmap->get_inst(*q);
    dout(10) << __func__ << " send report to " << inst << dendl;
    messenger->send_message(m, inst);
  }
}

When a peon receives the OP_REPORT, it records the information:

void Monitor::handle_timecheck_peon(MTimeCheck *m)
{
  ...
  timecheck_round = m->round;

  if (m->op == MTimeCheck::OP_REPORT) {
    assert((timecheck_round % 2) == 0);
    /* record the latest latency and skew values sent by the monitor leader */
    timecheck_latencies.swap(m->latencies);
    timecheck_skews.swap(m->skews);
    return;
  }
  ...
}

How is a clock skew handled?

After all of this, we still have not said what is done when a skew actually occurs.

First, if the clock offset between nodes is genuinely large, a warning appears in ceph health detail. The question is: how large does the offset have to be to count?


health_status_t Monitor::timecheck_status(ostringstream &ss,
                                          const double skew_bound,
                                          const double latency)
{
  health_status_t status = HEALTH_OK;
  double abs_skew = (skew_bound > 0 ? skew_bound : -skew_bound);
  assert(latency >= 0);

  if (abs_skew > g_conf->mon_clock_drift_allowed) {
    status = HEALTH_WARN;
    ss << "clock skew " << abs_skew << "s"
       << " > max " << g_conf->mon_clock_drift_allowed << "s";
  }
  
  return status;
}

This is governed by a config option, mon_clock_drift_allowed:

OPTION(mon_clock_drift_allowed, OPT_FLOAT, .050)

That is, nodes are allowed to drift up to 50 milliseconds apart.

If the limit is exceeded, ceph health detail prints something like:

ceph health detail
HEALTH_WARN clock skew detected on mon.1, mon.2
mon.1 addr 192.168.0.6:6789/0 clock skew 8.37274s > max 0.05s (latency 0.004945s)
mon.2 addr 192.168.0.7:6789/0 clock skew 8.52479s > max 0.05s (latency 0.005965s)

This logic lives in:

void Monitor::get_health(string& status, bufferlist *detailbl, Formatter *f)
{
  ...
   if (f) {
    f->open_object_section("timechecks");
    f->dump_unsigned("epoch", get_epoch());
    f->dump_int("round", timecheck_round);
    f->dump_stream("round_status")
      << ((timecheck_round%2) ? "on-going" : "finished");
   }

  if (!timecheck_skews.empty()) {
    list<string> warns;
    if (f)
      f->open_array_section("mons");
    for (map<entity_inst_t,double>::iterator i = timecheck_skews.begin();
         i != timecheck_skews.end(); ++i) {
      entity_inst_t inst = i->first;
      double skew = i->second;
      double latency = timecheck_latencies[inst];
      string name = monmap->get_name(inst.addr);

      ostringstream tcss;
      health_status_t tcstatus = timecheck_status(tcss, skew, latency);
      if (tcstatus != HEALTH_OK) {
        if (overall > tcstatus)
          overall = tcstatus;
        warns.push_back(name);
        
        ostringstream tmp_ss;
        tmp_ss << "mon." << name
               << " addr " << inst.addr << " " << tcss.str()
               << " (latency " << latency << "s)";
        detail.push_back(make_pair(tcstatus, tmp_ss.str()));
      }

      if (f) {
        f->open_object_section("mon");
        f->dump_string("name", name.c_str());
        f->dump_float("skew", skew);
        f->dump_float("latency", latency);
        f->dump_stream("health") << tcstatus;
        if (tcstatus != HEALTH_OK)
          f->dump_stream("details") << tcss.str();
        f->close_section();
      }
    }
    ...
}

When this happens, as many articles have noted, the usual fix is to force a one-shot ntpdate to bring the clocks back in line:

  • Stop the ntpd service on every node, if it is running

    /etc/init.d/ntpd stop
    
  • Synchronize the time

    ntpdate  {ntpserver}
    

Note: if the nodes cannot reach the public Internet, pick one machine to act as the NTP server and force all the others to sync against it.
