Using perf to Analyze High ceph CPU Usage 2019-05-06T12:14:40+00:00 Bean Li http://bean-li.github.io/perf-check-ceph-osd-cpu-high

Preface

This morning QA reported that in our own cluster environment, the 124~128 cluster, ceph-osd on node 128 was under heavy CPU load, with CPU usage between 200% and 400%.

I took a rough look with strace but could not see anything obvious, so the only option left was perf.

Investigation

perf top

First, take a look with perf top:

The big CPU consumer is ceph-osd, and in user space the operator<< calls are the main culprit. operator<< looks like an overloaded operator and should be related to log printing.

perf record

Find the ceph-osd process ID, 4966, and collect samples with the following command:

perf record -e cpu-clock -g -p 4966
  • -g tells perf record to additionally record the function call graph
  • -e cpu-clock sets cpu-clock as the event that perf record samples
  • -p specifies the pid of the process to record

The process we are observing is ceph-osd. Let it run for about 10 seconds, then interrupt perf record with Ctrl+C:

root@converger-128:~# perf record -e cpu-clock -g -p 4966
^C[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 5.058 MB perf.data (17667 samples) ]
root@converger-128:~# 

A perf.data file is produced in the current directory, /root.

perf report

Use the following command to view the dumped perf.data:

perf report -i perf.data

The output is as follows:

Expanding operator<<:

We can see that most of the operator<< calls come from the gen_prefix function.

Analysis

This part of the code lives in:

osd/ReplicatedBackend.cc
------------------------------
#define dout_subsys ceph_subsys_osd
#define DOUT_PREFIX_ARGS this
#undef dout_prefix
#define dout_prefix _prefix(_dout, this)
static ostream& _prefix(std::ostream *_dout, ReplicatedBackend *pgb) {
  return *_dout << pgb->get_parent()->gen_dbg_prefix();          
}

The cause: someone had previously been debugging ceph-osd.4 on the 128 cluster and had left debug osd = 0/20 in ceph.conf. Although nothing is written to the on-disk log at this setting, level-20 debug logs still have to be kept so they can be dumped when an OSD crashes, so a large volume of OSD debug log entries is staged in an in-memory ring buffer. As a result gen_prefix is called constantly and burns far too much CPU.

After changing the ceph-osd debug_osd level at runtime, and editing ceph.conf so the change persists, ceph-osd CPU usage went back to normal.

root@converger-128:/etc/ceph# ceph daemon osd.5 config set debug_osd 0
{
    "success": ""
}
root@converger-128:/etc/ceph# ceph daemon osd.4 config set debug_osd 0
{
    "success": ""
}
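
For the permanent part of the fix, a minimal ceph.conf sketch (assuming the setting lives in the [osd] section, as is typical; adjust to your own layout):

[osd]
    # keep both the on-disk log level and the in-memory ring-buffer level at 0
    debug osd = 0/0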

A Brief Introduction to NSQ 2019-03-10T13:12:40+00:00 Bean Li http://bean-li.github.io/nsq-1

Preface

NSQ is a distributed message queue written in Go. There are already plenty of message middlewares of this kind, such as RabbitMQ, Alibaba's RocketMQ, and Kafka. NSQ is a comparatively clean one: its feature set is not as large and complete as Kafka's, but it is lightweight, simple, and easy to pick up, and in most cases both its performance and its features are good enough.

As for what message middleware is for, it comes down to decoupling and buffering. Anyone who has run into such problems at work will understand immediately; for anyone who has not, more words will not help much.

Design Principles

The latest stable release of NSQ can be downloaded from:

https://nsq.io/deployment/installing.html

For Linux, download the following build and extract it:

nsq-1.1.0.linux-amd64.go1.10.3.tar.gz

The archive contains:

nsq-1.1.0.linux-amd64.go1.10.3/
nsq-1.1.0.linux-amd64.go1.10.3/bin/
nsq-1.1.0.linux-amd64.go1.10.3/bin/nsq_to_file
nsq-1.1.0.linux-amd64.go1.10.3/bin/nsqlookupd
nsq-1.1.0.linux-amd64.go1.10.3/bin/nsq_tail
nsq-1.1.0.linux-amd64.go1.10.3/bin/nsqadmin
nsq-1.1.0.linux-amd64.go1.10.3/bin/nsq_to_http
nsq-1.1.0.linux-amd64.go1.10.3/bin/nsq_stat
nsq-1.1.0.linux-amd64.go1.10.3/bin/nsqd
nsq-1.1.0.linux-amd64.go1.10.3/bin/to_nsq
nsq-1.1.0.linux-amd64.go1.10.3/bin/nsq_to_nsq

To better understand NSQ's design, and message middleware in general, let's work up step by step through what a message middleware has to do and how NSQ does it. The following draws heavily on the Zhihu article MQ(1)—— 从队列到消息中间件 by Liu Shu. This is not meant as plagiarism; the original explanation is simply excellent, and I strongly recommend that anyone interested in message middleware read that series. The glory belongs to those who came before; I am only a student here.

NSQ 1.0: I need a message queue

The producer-consumer model is nothing new: producers keep churning out tasks, while consumers do not care where the messages come from and simply keep processing them. Since producers only produce tasks and consumers only process them, a question arises: where are the tasks stored? This is where the message queue comes in.

When there is a task to process, the producer process pushes a message into the message queue, and the consumer promptly learns that the queue holds a message and takes the task out to process it. There are two issues hidden in that statement:

  • Tasks come in many types. Not every producer produces every type of task, and likewise not every consumer cares about every type of message; hence the concept of a topic.

  • How does a consumer know that the queue holds messages? This is the Push vs. Pull distinction among message queues. Different middlewares choose differently: Kafka uses Pull, while NSQ, the subject of this post, uses Push.

With this queue in place, two problems are solved: decoupling and buffering. If a consumer fails to process a message, it can reply to the message queue with requeue, and NSQ will put the message back into the queue for a retry.
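
As a sketch of how a consumer chooses between the two outcomes with the pynsq client used later in this post (the handler and the process() helper below are illustrative, not from the original):

import nsq

def handler(message):
    # pynsq: finish() acknowledges the message, requeue() asks nsqd to redeliver it
    message.enable_async()
    try:
        process(message.body)   # process() is a placeholder for real work
    except Exception:
        message.requeue()       # processing failed: put it back for a retry
        return
    message.finish()

def process(body):
    print body

reader = nsq.Reader(message_handler=handler,
                    nsqd_tcp_addresses=['127.0.0.1:4150'],
                    topic='x_topic', channel='work_group_a')
nsq.run()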

NSQ 2.0 Channel

From the producer's point of view, messages go into queues for different topics; from the consumer's point of view, different consumers may care about different topics. More concretely:

  • For a cluster, when message A arrives, exactly which consumer is responsible for handling it?
  • Can one message be delivered to several consumers at the same time, so that all of them process it?

This is the concept of the consumer group. In Kafka it is called a consumer group; in NSQ it is called a channel.

Consider a cluster again. Some tasks are cluster-level tasks: not every node needs to handle them; it is enough for the cluster to pick one representative to do the work. This is a very common scenario.

Let's use the official GIF to explain this scenario:

For the topic clicks, three consumers subscribe to the topic and all belong to the metrics channel, in other words the metrics consumer group. After NSQ receives a message it copies the message to every channel. The metrics consumer group has three consumer instances, so which consumer does the message go to? Or is it sent to all three at once? The answer is that within the metrics channel the message is delivered to only one consumer; which one is decided by load-balancing logic, i.e. this time it goes to consumer A, next time to a different consumer.

And for a cluster, what if every consumer (on different hosts) has to act on a given message?

⚠️ NSQ uses a Push strategy: nsqd pushes each message to every channel that subscribes to the topic, so as long as each consumer subscribes through its own channel, every one of them receives a copy.

NSQ 3.0 nsqlookup

As described above, after nsqd receives a message from a producer it copies the message and pushes it to every channel subscribed to the topic. The question is: how does nsq know which consumers have subscribed to messages of a given topic?

The simplest approach is to hard-code it in a config file: the IP is xx.xx.xx.xx, the port is yyyy, the consumer subscribes to topic-xx with channel zzz. The biggest problem with that is inflexibility. What we need is something called service discovery, and that is exactly what nsqlookupd provides.

nsqlookupd provides an etcd-like key-value store that records which nsqd instances carry each topic. It also exposes a /lookup API for querying, in real time, which nsqd instances hold messages for a given topic.

 curl "http://127.0.0.1:4161/lookup?topic=x_topic" 

The output is as follows:

{
   "producers" : [
      {
         "version" : "1.1.0",
         "tcp_port" : 4150,
         "broadcast_address" : "manu-Inspiron-5748",
         "hostname" : "manu-Inspiron-5748",
         "http_port" : 4151,
         "remote_address" : "127.0.0.1:50662"
      }
   ],
   "channels" : []
}

If I start a consumer subscribed to x_topic with channel work_group_a, the output becomes:

manu-Inspiron-5748 ~ » curl "http://127.0.0.1:4161/lookup?topic=x_topic" 2>/dev/null |json_pp
{
   "producers" : [
      {
         "tcp_port" : 4150,
         "version" : "1.1.0",
         "remote_address" : "127.0.0.1:50662",
         "hostname" : "manu-Inspiron-5748",
         "broadcast_address" : "manu-Inspiron-5748",
         "http_port" : 4151
      }
   ],
   "channels" : [
      "work_group_a"
   ]
}

A consumer can therefore query nsqlookupd for the list of producers and, from each entry's broadcast_address and tcp_port, build the address to connect to. The consumer then establishes a connection to each of these nsqd instances, and when a message arrives, nsqd pushes it to the consumers connected to it.
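
A minimal sketch of that discovery step (assuming the local nsqlookupd on port 4161 and the x_topic used above):

import json
import urllib2

# ask nsqlookupd which nsqd instances carry the topic
resp = urllib2.urlopen("http://127.0.0.1:4161/lookup?topic=x_topic")
data = json.loads(resp.read())

# build the nsqd TCP addresses a consumer would connect to
addresses = ["%s:%d" % (p["broadcast_address"], p["tcp_port"])
             for p in data["producers"]]
print addresses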

Summary

To sum up, NSQ has three main components:

  • nsqd: the daemon that receives, queues, and delivers messages to clients
  • nsqlookupd: the daemon that manages topology information and provides an eventually consistent service-discovery service
  • nsqadmin: an optional web UI (it does not have to be started) for viewing cluster statistics in real time and performing administrative tasks

Hands-on

Let's give NSQ a quick try.

Just drop the executables from the previous section into /usr/bin/.

1 Start nsqlookupd

nsqlookupd

You will see:

manu-Inspiron-5748 Python/nsq » nsqlookupd                                                                                                               130 ↵
[nsqlookupd] 2019/03/10 22:05:55.189652 INFO: nsqlookupd v1.1.0 (built w/go1.10.3)
[nsqlookupd] 2019/03/10 22:05:55.190135 INFO: HTTP: listening on [::]:4161
[nsqlookupd] 2019/03/10 22:05:55.190155 INFO: TCP: listening on [::]:4160

The version is 1.1.0, and it listens on two ports, 4161 (HTTP) and 4160 (TCP).

2 Run an nsqd instance:

nsqd --lookupd-tcp-address=127.0.0.1:4160
 

3 Start nsqadmin

nsqadmin --lookupd-http-address=127.0.0.1:4161

With nsqadmin running, we can open the web UI:

Next we create a topic programmatically and write the producer and consumer programs:

import nsq
import tornado.ioloop
import time
import random
import json

def pub_message():
    message = {}
    message['number'] = random.randint(1,1000)
    writer.pub('x_topic', json.dumps(message), finish_pub)

def finish_pub(conn, data):
    print data

writer = nsq.Writer(['127.0.0.1:4150'])
tornado.ioloop.PeriodicCallback(pub_message, 1000).start()
nsq.run()

We create a topic named x_topic, and the producer drops one message into it every second; the message body looks like:

{'number': 409}

The number is generated at random. The consumer receives each message and checks whether the number is prime:

import nsq
from tornado import gen
from functools import partial
import ujson as json

def is_prime(n):
    n = int(n)
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2  # 2 is the only even prime
    k = 3
    while k*k <= n:
        if n % k == 0:
            return False
        k += 2
    return True

@gen.coroutine
def write_message(topic, data, writer):
    response = yield gen.Task(writer.pub, topic, data)
    if isinstance(response, nsq.Error):
        print "Error with Message: {}:{}".format(data, response)
    else:
        print "Published Message: ", data

def calculate_prime(message, writer):
    message.enable_async()
    data = json.loads(message.body)

    prime = is_prime(data["number"])
    data["prime"] = prime

    if prime:
        topic = "primes"
    else:
        topic = "non_primes"

    output_message = json.dumps(data)
    write_message(topic, output_message,writer)
    message.finish()

if __name__ == "__main__":
    writer = nsq.Writer(['127.0.0.1:4150',])
    handler = partial(calculate_prime, writer=writer)
    reader  = nsq.Reader(
              message_handler = handler,
              nsqd_tcp_addresses = ['127.0.0.1:4150'],
              topic = 'x_topic',
              channel = 'work_group_a')

    nsq.run()

Run the two programs:

manu-Inspiron-5748 Python/nsq » python nsq_producer.py & ; python nsq_consumer.py
[1] 14833
OK
Published Message:  {"prime":false,"number":669}
OK
Published Message:  {"prime":false,"number":275}
OK
Published Message:  {"prime":false,"number":214}
OK
Published Message:  {"prime":false,"number":518}
OK
Published Message:  {"prime":true,"number":739}
OK
Published Message:  {"prime":false,"number":184}
OK
Published Message:  {"prime":true,"number":521}
OK
Getting Component Temperatures via ipmitool 2019-02-26T13:12:40+00:00 Bean Li http://bean-li.github.io/通过ipmitool获取各元件的温度信息

Preface

ipmitool can read the temperature of every component. How do we interpret those readings: is each component's temperature OK, and is any component running hot or cold enough to warrant an alert?

How to read component temperatures

The following command reports the temperature and status of every sensor:

root@node244:~# ipmitool sensor list 
CPU1 Temp        | 29.000     | degrees C  | ok    | 0.000     | 0.000     | 0.000     | 85.000    | 90.000    | 90.000    
CPU2 Temp        | 33.000     | degrees C  | nr    | 10.000    | 10.000    | 10.000    | 30.000    | 30.000    | 30.000    
PCH Temp         | 32.000     | degrees C  | ok    | 0.000     | 5.000     | 16.000    | 90.000    | 95.000    | 100.000   
System Temp      | 30.000     | degrees C  | ok    | -10.000   | -5.000    | 0.000     | 80.000    | 85.000    | 90.000    
Peripheral Temp  | 34.000     | degrees C  | ok    | -10.000   | -5.000    | 0.000     | 80.000    | 85.000    | 90.000    
Vcpu1VRM Temp    | 28.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 95.000    | 100.000   | 105.000   
Vcpu2VRM Temp    | 34.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 95.000    | 100.000   | 105.000   
VmemABVRM Temp   | 29.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 95.000    | 100.000   | 105.000   
VmemCDVRM Temp   | 28.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 95.000    | 100.000   | 105.000   
VmemEFVRM Temp   | 31.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 95.000    | 100.000   | 105.000   
VmemGHVRM Temp   | 30.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 95.000    | 100.000   | 105.000   
P1-DIMMA1 Temp   | 27.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 80.000    | 85.000    | 90.000    
P1-DIMMA2 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P1-DIMMB1 Temp   | 27.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 80.000    | 85.000    | 90.000    
P1-DIMMB2 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P1-DIMMC1 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P1-DIMMC2 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P1-DIMMD1 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P1-DIMMD2 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P2-DIMME1 Temp   | 29.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 80.000    | 85.000    | 90.000    
P2-DIMME2 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P2-DIMMF1 Temp   | 30.000     | degrees C  | ok    | -5.000    | 0.000     | 5.000     | 80.000    | 85.000    | 90.000    
P2-DIMMF2 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P2-DIMMG1 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P2-DIMMG2 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P2-DIMMH1 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
P2-DIMMH2 Temp   | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN1             | 4400.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000 
FAN2             | 4300.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000 
FAN3             | 4400.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000 
FAN4             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN5             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN6             | na         |            | na    | na        | na        | na        | na        | na        | na        
FANA             | 4400.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000 
FANB             | na         |            | na    | na        | na        | na        | na        | na        | na        
12V              | 12.315     | Volts      | ok    | 10.173    | 10.299    | 10.740    | 12.945    | 13.260    | 13.386    
5VCC             | 5.000      | Volts      | ok    | 4.246     | 4.298     | 4.480     | 5.390     | 5.546     | 5.598     
3.3VCC           | 3.316      | Volts      | ok    | 2.789     | 2.823     | 2.959     | 3.554     | 3.656     | 3.690     
VBAT             | 3.104      | Volts      | ok    | 2.376     | 2.480     | 2.584     | 3.494     | 3.598     | 3.676     
Vcpu1            | 1.800      | Volts      | ok    | 1.242     | 1.260     | 1.395     | 1.899     | 2.088     | 2.106     
Vcpu2            | 1.809      | Volts      | ok    | 1.242     | 1.260     | 1.395     | 1.899     | 2.088     | 2.106     
VDIMMAB          | 1.200      | Volts      | ok    | 0.948     | 0.975     | 1.047     | 1.344     | 1.425     | 1.443     
VDIMMCD          | 1.209      | Volts      | ok    | 0.948     | 0.975     | 1.047     | 1.344     | 1.425     | 1.443     
VDIMMEF          | 1.209      | Volts      | ok    | 0.948     | 0.975     | 1.047     | 1.344     | 1.425     | 1.443     
VDIMMGH          | 1.209      | Volts      | ok    | 0.948     | 0.975     | 1.047     | 1.344     | 1.425     | 1.443     
5VSB             | 4.974      | Volts      | ok    | 4.246     | 4.298     | 4.480     | 5.390     | 5.546     | 5.598     
3.3VSB           | 3.316      | Volts      | ok    | 2.789     | 2.823     | 2.959     | 3.554     | 3.656     | 3.690     
1.5V PCH         | 1.509      | Volts      | ok    | 1.320     | 1.347     | 1.401     | 1.644     | 1.671     | 1.698     
1.2V BMC         | 1.209      | Volts      | ok    | 1.020     | 1.047     | 1.092     | 1.344     | 1.371     | 1.398     
1.05V PCH        | 1.050      | Volts      | ok    | 0.870     | 0.897     | 0.942     | 1.194     | 1.221     | 1.248     
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        
PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        
PS2 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        
AOC_SAS Temp     | 60.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 100.000   | 105.000   | 110.000   
HDD Temp         | 29.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 50.000    | 55.000    | 60.000    
HDD Status       | 0x1        | discrete   | 0x01ff| na        | na        | na        | na        | na        | na    

Generally, the rows whose third column contains degrees are the temperature readings we care about.

  • Column 1: the sensor name, e.g. CPU1 Temp
  • Column 2: the component's current temperature; note that it can be na, i.e. unreadable
  • Column 4: the temperature status; ok means the temperature is normal, while a value such as nr means non-recoverable

In general there are five common temperature states:

  • ok: temperature is normal
  • nc: non-critical; the temperature is somewhat high (or low) but not yet serious
  • cr: critical; the temperature is far too high or too low, which is serious
  • nr: non-recoverable; the temperature is so high or so low that it can cause irreversible damage
  • na: status unknown, relatively rare

Note the progression ok -> nc -> cr -> nr, from normal to increasingly severe temperature problems.
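
Building on the column layout and states above, here is an illustrative script (not from the original post) that shells out to ipmitool sensor list and flags any temperature sensor whose state is neither ok nor na:

import subprocess

# run the same command shown above and split its table into columns
out = subprocess.check_output(["ipmitool", "sensor", "list"])
for line in out.splitlines():
    cols = [c.strip() for c in line.split("|")]
    if len(cols) < 4 or "degrees" not in cols[2]:
        continue                     # keep only the temperature rows
    name, value, status = cols[0], cols[1], cols[3]
    if status not in ("ok", "na"):
        print "ALERT: %s is %s degrees C, state %s" % (name, value, status)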

How to trigger a temperature alert

The previous section introduced the nc, cr, and nr states, all of which mean the temperature is too high or too low. So:

  • at what temperature does the state become nc?
  • at what temperature does it become cr?
  • at what temperature does it become nr?

Clearly each component has temperature thresholds at which the state changes, and we can inspect them as follows:

root@node244:~# ipmitool sensor get "CPU1 Temp"
Locating sensor record...
Sensor ID              : CPU1 Temp (0x1)
 Entity ID             : 3.1 (Processor)
 Sensor Type (Threshold)  : Temperature (0x01)
 Sensor Reading        : 29 (+/- 0) degrees C
 Status                : ok
 Nominal Reading       : 40.000
 Normal Minimum        : -4.000
 Normal Maximum        : 89.000
 Upper non-recoverable : 90.000
 Upper critical        : 90.000
 Upper non-critical    : 85.000
 Lower non-recoverable : 0.000
 Lower critical        : 0.000
 Lower non-critical    : 0.000
 Positive Hysteresis   : 2.000
 Negative Hysteresis   : 2.000
 Minimum sensor range  : Unspecified
 Maximum sensor range  : Unspecified
 Event Message Control : Per-threshold
 Readable Thresholds   : lnr lcr lnc unc ucr unr 
 Settable Thresholds   : lnr lcr lnc unc ucr unr 
 Threshold Read Mask   : lnr lcr lnc unc ucr unr 
 Assertion Events      : 
 Assertions Enabled    : ucr+ 
 Deassertions Enabled  : ucr+ 


From the output above:

  • Upper non-critical: 85 °C
  • Upper critical: 90 °C
  • Upper non-recoverable: 90 °C
  • Lower non-critical: 0 °C
  • Lower critical: 0 °C
  • Lower non-recoverable: 0 °C

With the thresholds known, working out the state is straightforward:

  • [0, 85): state ok
  • [85, 90): state nc
  • [90, ∞): state nr (since the cr and nr thresholds are both 90, the state is reported as nr)

The low-temperature side works the same way.
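
A sketch of that threshold-to-state mapping as a hypothetical helper (upper thresholds only, mirroring the intervals above):

def temp_state(reading, unc, ucr, unr):
    """Map a temperature reading to ok/nc/cr/nr using the upper thresholds."""
    if reading >= unr:
        return "nr"
    if reading >= ucr:
        return "cr"
    if reading >= unc:
        return "nc"
    return "ok"

# for CPU1 Temp above (unc=85, ucr=90, unr=90): 29 degrees is "ok", 90 degrees is "nr"
print temp_state(29, 85, 90, 90), temp_state(90, 85, 90, 90)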

So how do we force a temperature alert, i.e. drive the state into nc, cr, or nr?

ipmitool provides a way to set the thresholds for each state:

ipmitool -I open sensor thresh 'CPU2 Temp' upper 20 30 90

The command above sets the upper alert thresholds (non-critical, critical, non-recoverable) of the CPU2 Temp sensor to 20, 30, and 90.

Because the CPU temperature is around 33 °C, the following command pushes the state to nc:

root@node244:~# ipmitool -I open sensor thresh 'CPU2 Temp' upper 20 40 90
Locating sensor record 'CPU2 Temp'...
Setting sensor "CPU2 Temp" Upper Non-Critical threshold to 20.000
Setting sensor "CPU2 Temp" Upper Critical threshold to 40.000
Setting sensor "CPU2 Temp" Upper Non-Recoverable threshold to 90.000

root@node244:~# ipmitool sensor list
CPU2 Temp        | 33.000     | degrees C  | nc    | 10.000    | 10.000    | 10.000    | 20.000    | 40.000    | 90.000

33 °C exceeds 20 °C but not 40 °C, so the state is nc, i.e. non-critical.

By the same token, if we set the thresholds to 20, 30, 90, the state becomes cr, i.e. critical:

root@node244:~# ipmitool -I open sensor thresh 'CPU2 Temp' upper 20 30 90
Locating sensor record 'CPU2 Temp'...
Setting sensor "CPU2 Temp" Upper Non-Critical threshold to 20.000
Setting sensor "CPU2 Temp" Upper Critical threshold to 30.000
Setting sensor "CPU2 Temp" Upper Non-Recoverable threshold to 90.000
root@node244:~# ipmitool sensor list 
CPU1 Temp        | 30.000     | degrees C  | ok    | 0.000     | 0.000     | 0.000     | 85.000    | 90.000    | 90.000    
CPU2 Temp        | 33.000     | degrees C  | cr    | 10.000    | 10.000    | 10.000    | 20.000    | 30.000    | 90.000    

Likewise the state can be driven to nr simply by setting the thresholds to 20, 30, 30; no need to belabor the point.

Checking Power Supply Module Status 2019-02-26T13:12:40+00:00 Bean Li http://bean-li.github.io/检查电源模块状态

Preface

We know IPMI is powerful, so how do we use ipmitool to get the real-time status of the power supplies? Modern servers almost always have two power supply modules for redundancy. How do we check their status: are all the modules in place, and is each of them receiving power?

Method 1

The following command reports the power supply status:

ipmitool sdr type "power supply"

Under normal conditions the status looks like this:

PS1 Status       | C4h | ok  | 10.1 | Presence detected
PS2 Status       | C5h | ok  | 10.2 | Presence detected

If we unplug the AC cord of one of them, the status becomes:

PS1 Status       | C4h | ok  | 10.1 | Presence detected
PS2 Status       | C5h | ok  | 10.2 | Presence detected, Failure detected, Power Supply AC lost

If we pull one power supply unit (PSU) out of the server entirely, the status becomes:

PS1 Status       | C4h | ok  | 10.1 | Presence detected
PS2 Status       | C5h | ok  | 10.2 | 

Beyond the cases above, we can run

ipmitool sensor get "PS1 Status"

to see the other possible values:

Sensor ID              : PS1 Status (0xc4)
 Entity ID             : 10.1 (Power Supply)
 Sensor Type (Discrete): Power Supply (0x08)
 Sensor Reading        : 1h                    <------------- this is the 0x01 mentioned in Method 2, i.e. the normal state
 Event Message Control : Per-threshold
 States Asserted       : Power Supply
                         [Presence detected]
 Assertion Events      : Power Supply
                         [Presence detected]
 Deassertion Events    : Power Supply
                         [Failure detected]
 Assertions Enabled    : Power Supply
                         [Failure detected]
                         [Power Supply AC lost]
                         [AC lost or out-of-range]
                         [AC out-of-range, but present]
                         [Config Error]
 Deassertions Enabled  : Power Supply
                         [Failure detected]
                         [Power Supply AC lost]
                         [AC lost or out-of-range]
                         [AC out-of-range, but present]
                         [Config Error]
 OEM                   : 0

Method 2

The power supply information can also be obtained from:

ipmitool sensor list

The output is as follows:

PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        
PS2 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na   

The value in the second column is the interesting part:

  • 0x01: status ok, the most common state
  • 0x00: power supply unit not present; when a PSU has been pulled out of the server the state is usually 0x00
  • 0x03: power supply off or failed; I have not seen this one myself, and I guess it shows up when a PSU has actually failed
  • 0x0b: input out of range (e.g. no AC input); also very common, this is what you get when the AC cord is unplugged

This method is nicer, and personally I recommend it.
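
As an illustration (the mapping below simply restates the list above), a small script that pulls the PS status rows out of ipmitool sensor list and decodes the second column:

import subprocess

# meanings of the second column, as listed above
PSU_STATES = {
    "0x0": "not present",
    "0x1": "ok",
    "0x3": "off or failed",
    "0xb": "input out of range (no AC input)",
}

out = subprocess.check_output(["ipmitool", "sensor", "list"])
for line in out.splitlines():
    cols = [c.strip() for c in line.split("|")]
    if len(cols) > 1 and cols[0].startswith("PS") and "Status" in cols[0]:
        print cols[0], "->", PSU_STATES.get(cols[1].lower(), "unknown (%s)" % cols[1])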

References

  • https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-sg8039en_us&docLocale=en_US
Radosgw Object Upload and Multisite-Related Logic 2018-12-16T13:20:40+00:00 Bean Li http://bean-li.github.io/multisite-put-obj

Preface

This post walks through MultiSite's internal data structures and flows to deepen the understanding of RadosGW internals. How to set up MultiSite is well documented online, so that part is not repeated here.

The zonegroup created for this post is xxxx, with two zones:

  • master
  • secondary

The zonegroup information is as follows:

{
    "id": "9908295f-d8f5-4ac3-acd7-c955a177bd09",
    "name": "xxxx",
    "api_name": "",
    "is_master": "true",
    "endpoints": [
        "http:\/\/s3.246.com\/"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8aa27332-01da-486a-994c-1ce527fa2fd7",
    "zones": [
        {
            "id": "484742ba-f8b7-4681-8411-af96ac778150",
            "name": "secondary",
            "endpoints": [
                "http:\/\/s3.243.com\/"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false"
        },
        {
            "id": "8aa27332-01da-486a-994c-1ce527fa2fd7",
            "name": "master",
            "endpoints": [
                "http:\/\/s3.246.com\/"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false"
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "0c4b59a1-e1e7-4367-9b65-af238a2f145b"
}

The pools involved

Data pool and index pool

First and foremost is the data pool, i.e. where the object data uploaded by users ultimately lives:

root@NODE-246:/var/log/ceph# radosgw-admin zone get 
{
    "id": "8aa27332-01da-486a-994c-1ce527fa2fd7",
    "name": "master",
    "domain_root": "default.rgw.data.root",
    "control_pool": "default.rgw.control",
    "gc_pool": "default.rgw.gc",
    "log_pool": "default.rgw.log",
    "intent_log_pool": "default.rgw.intent-log",
    "usage_log_pool": "default.rgw.usage",
    "user_keys_pool": "default.rgw.users.keys",
    "user_email_pool": "default.rgw.users.email",
    "user_swift_pool": "default.rgw.users.swift",
    "user_uid_pool": "default.rgw.users.uid",
    "system_key": {
        "access_key": "B9494C9XE7L7N50E9K2V",
        "secret_key": "O8e3IYV0gxHOwy61Og5ep4f7vQWPPFPhqRXjJrYT"
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "default.rgw.buckets.index",
                "data_pool": "default.rgw.buckets.data",
                "data_extra_pool": "default.rgw.buckets.non-ec",
                "index_type": 0
            }
        }
    ],
    "metadata_heap": "",
    "realm_id": "0c4b59a1-e1e7-4367-9b65-af238a2f145b"
}

From the above, in the master zone's default-placement:

Role              Pool name
data pool         default.rgw.buckets.data
index pool        default.rgw.buckets.index
data extra pool   default.rgw.buckets.non-ec

The version under test is Jewel, which does not yet support dynamic index resharding; we set index max shards = 8, i.e. each bucket has 8 index shards.

rgw_override_bucket_index_max_shards = 8 

The following command shows the bucket information of the current cluster:

root@NODE-246:/var/log/ceph# radosgw-admin bucket stats
[
    {
        "bucket": "segtest2",
        "pool": "default.rgw.buckets.data",
        "index_pool": "default.rgw.buckets.index",
        "id": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769",
        "marker": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769",
        "owner": "segs3account",
        ...
    },
    {
        "bucket": "segtest1",
        "pool": "default.rgw.buckets.data",
        "index_pool": "default.rgw.buckets.index",
        "id": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768",
        "marker": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768",
        "owner": "segs3account",
        ...
    }
}

As shown above, there are two buckets, with the following bucket ids:

bucket name   bucket id
segtest1      8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768
segtest2      8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769

Each bucket has 8 index shards, so there are 16 index objects in total:

root@NODE-246:/var/log/ceph# rados -p default.rgw.buckets.index ls
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.7
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.0
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.2
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.5
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.6
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.3
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.4
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.3
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.0
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.1
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.6
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.5
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.2
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.1
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.4
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.7

default.rgw.log pool

The log pool records various kinds of logs. For the MultiSite use case, we can find objects with this kind of name in the default.rgw.log pool:

root@NODE-246:~# rados -p default.rgw.log ls  |grep data_log 
data_log.0
data_log.11
data_log.12
data_log.8
data_log.14
data_log.13
data_log.10
data_log.9
data_log.7

In general there are at most rgw_data_log_num_shards objects with this naming style; in our setup:

OPTION(rgw_data_log_num_shards, OPT_INT, 128) 

The following code can be seen in rgw_bucket.h:

    num_shards = cct->_conf->rgw_data_log_num_shards;
    oids = new string[num_shards];
    string prefix = cct->_conf->rgw_data_log_obj_prefix;
    if (prefix.empty()) {
      prefix = "data_log";
    }   
    for (int i = 0; i < num_shards; i++) {
      char buf[16];
      snprintf(buf, sizeof(buf), "%s.%d", prefix.c_str(), i); 
      oids[i] = buf;
    }   
    renew_thread = new ChangesRenewThread(cct, this);
    renew_thread->create("rgw_dt_lg_renew")

Generally the object itself is empty; the useful information is recorded in its omap:

root@NODE-246:~# rados -p default.rgw.log stat data_log.61
default.rgw.log/data_log.61 mtime 2018-12-10 14:39:38.000000, size 0
root@NODE-246:~# rados -p default.rgw.log listomapkeys data_log.61
1_1544421980.298394_2914.1
1_1544422002.458109_2939.1
...
1_1544423969.748641_4486.1
1_1544423978.090683_4495.1
1_1544424000.286801_4507.1

Writing an object

Overview

At a high level, uploading an object into a bucket writes to several places, assuming both the bi log and the data log are enabled:

  • default.rgw.buckets.data: the actual data is written to this pool, generally as one or more newly added objects
  • default.rgw.buckets.index: once the data write completes, an entry for the object is added to the omap of the bucket index shard it maps to
  • a bi log entry is added to the omap of the bucket index object
  • a data log entry is added to the omap of a data_log object in the default.rgw.log pool

bi log

After the object upload completes, looking at the bucket index shard shows the following:

root@node247:/var/log/ceph# rados -p default.rgw.buckets.index listomapkeys .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7 
oem.tar.bz2
0_00000000001.1.2
0_00000000002.2.3

The oem.tar.bz2 entry is the object we uploaded, so we can skip it; besides it there are two more entries, 0_00000000001.1.2 and 0_00000000002.2.3.

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 31 2e 31  |.0_00000000001.1|
00000010  2e 32                                             |.2|
00000012

value (133 bytes) :
00000000  03 01 7f 00 00 00 0f 00  00 00 30 30 30 30 30 30  |..........000000|
00000010  30 30 30 30 31 2e 31 2e  32 0b 00 00 00 6f 65 6d  |00001.1.2....oem|
00000020  2e 74 61 72 2e 62 7a 32  00 00 00 00 00 00 00 00  |.tar.bz2........|
00000030  01 01 0a 00 00 00 88 ff  ff ff ff ff ff ff ff 00  |................|
00000040  30 00 00 00 31 39 63 62  66 32 35 30 2d 62 62 33  |0...19cbf250-bb3|
00000050  65 2d 34 62 38 63 2d 62  35 62 66 2d 31 61 34 30  |e-4b8c-b5bf-1a40|
00000060  64 61 36 36 31 30 66 65  2e 31 35 30 38 33 2e 36  |da6610fe.15083.6|
00000070  34 32 31 30 00 00 01 00  00 00 00 00 00 00 00 00  |4210............|
00000080  00 00 00 00 00                                    |.....|
00000085

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012

value (125 bytes) :
00000000  03 01 77 00 00 00 0f 00  00 00 30 30 30 30 30 30  |..w.......000000|
00000010  30 30 30 30 32 2e 32 2e  33 0b 00 00 00 6f 65 6d  |00002.2.3....oem|
00000020  2e 74 61 72 2e 62 7a 32  e7 b2 14 5c 20 8e a4 04  |.tar.bz2...\ ...|
00000030  01 01 02 00 00 00 03 01  30 00 00 00 31 39 63 62  |........0...19cb|
00000040  66 32 35 30 2d 62 62 33  65 2d 34 62 38 63 2d 62  |f250-bb3e-4b8c-b|
00000050  35 62 66 2d 31 61 34 30  64 61 36 36 31 30 66 65  |5bf-1a40da6610fe|
00000060  2e 31 35 30 38 33 2e 36  34 32 31 30 00 01 02 00  |.15083.64210....|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00           |.............|
0000007d

Why, after the PUT of the object, are there two extra key-value pairs in the omap of .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7, and what are they for?

Turn on debug-objclass on all OSDs and take a look:

ceph tell osd.\* injectargs --debug-objclass 20

The logs then show the following:

ceph-client.radosgw.0 log:
--------------------------------------
2018-12-15 15:53:11.079498 7f45723c7700 10 moving default.rgw.data.root+.bucket.meta.bucket_0:19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1 to cache LRU end
2018-12-15 15:53:11.079530 7f45723c7700 20  bucket index object: .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7
2018-12-15 15:53:11.083307 7f45723c7700 20 RGWDataChangesLog::add_entry() bucket.name=bucket_0 shard_id=7 now=2018-12-15 15:53:11.0.083306s cur_expiration=1970-01-01 08:00:00.000000s
2018-12-15 15:53:11.083351 7f45723c7700 20 RGWDataChangesLog::add_entry() sending update with now=2018-12-15 15:53:11.0.083306s cur_expiration=2018-12-15 15:53:41.0.083306s
2018-12-15 15:53:11.085002 7f45723c7700  2 req 64210:0.012000:s3:PUT /bucket_0/oem.tar.bz2:put_obj:completing
2018-12-15 15:53:11.085140 7f45723c7700  2 req 64210:0.012139:s3:PUT /bucket_0/oem.tar.bz2:put_obj:op status=0
2018-12-15 15:53:11.085148 7f45723c7700  2 req 64210:0.012147:s3:PUT /bucket_0/oem.tar.bz2:put_obj:http status=200
2018-12-15 15:53:11.085159 7f45723c7700  1 ====== req done req=0x7f45723c1750 op status=0 http_status=200 ======

ceph-osd.0.log 
----------------
2018-12-15 15:53:11.080017 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:689: rgw_bucket_prepare_op(): request: op=0 name=oem.tar.bz2 instance= tag=19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210

2018-12-15 15:53:11.083526 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:830: rgw_bucket_complete_op(): request: op=0 name=oem.tar.bz2 instance= ver=3:1 tag=19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210

2018-12-15 15:53:11.083592 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:753: read_index_entry(): existing entry: ver=-1:0 name=oem.tar.bz2 instance= locator=

2018-12-15 15:53:11.083639 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:949: rgw_bucket_complete_op(): remove_objs.size()=0

2018-12-15 15:53:12.142564 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.142584 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=0

2018-12-15 15:53:12.170787 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.170799 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=0

2018-12-15 15:53:12.194152 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.194167 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=1

2018-12-15 15:53:12.256510 7faaa6fb3700 10 <cls> cls/rgw/cls_rgw.cc:2591: bi_log_iterate_range
2018-12-15 15:53:12.256523 7faaa6fb3700  0 <cls> cls/rgw/cls_rgw.cc:2621: bi_log_iterate_entries start_key=<80>0_00000000002.2.3 end_key=<80>1000_

From the logs above we can see:

  • rgw_bucket_prepare_op
  • rgw_bucket_complete_op
  • RGWDataChangesLog::add_entry()

At the end of void RGWPutObj::execute(), processor->complete is called:

  op_ret = processor->complete(etag, &mtime, real_time(), attrs,
                               (delete_at ? *delete_at : real_time()), if_match, if_nomatch,
                               (user_data.empty() ? nullptr : &user_data));  

complete in turn calls do_complete, so let's look at do_complete directly:

int RGWPutObjProcessor_Atomic::do_complete(string& etag, real_time *mtime, real_time set_mtime,
                                           map<string, bufferlist>& attrs, real_time delete_at,
                                           const char *if_match,
                                           const char *if_nomatch, const string *user_data) {
  // wait for all asynchronous writes of this rgw object to complete
  int r = complete_writing_data();                                              
  if (r < 0)
    return r;
  // mark this object as an atomic-type object
  obj_ctx.set_atomic(head_obj);
  // write the rgw object's attrs into the head object's xattrs
  RGWRados::Object op_target(store, bucket_info, obj_ctx, head_obj);
  /* some object types shouldn't be versioned, e.g., multipart parts */
  op_target.set_versioning_disabled(!versioned_object);

  RGWRados::Object::Write obj_op(&op_target);

  obj_op.meta.data = &first_chunk;
  obj_op.meta.manifest = &manifest;
  obj_op.meta.ptag = &unique_tag; /* use req_id as operation tag */
  obj_op.meta.if_match = if_match;
  obj_op.meta.if_nomatch = if_nomatch;
  obj_op.meta.mtime = mtime;
  obj_op.meta.set_mtime = set_mtime;
  obj_op.meta.owner = bucket_info.owner;
  obj_op.meta.flags = PUT_OBJ_CREATE;
  obj_op.meta.olh_epoch = olh_epoch;
  obj_op.meta.delete_at = delete_at;
  obj_op.meta.user_data = user_data;

  /* write_meta is a composite operation and the focus of the analysis below */
  r = obj_op.write_meta(obj_len, attrs);
  if (r < 0) {
    return r;
  }
  canceled = obj_op.meta.canceled;
  return 0;                                                     
}

To figure out what 0_00000000001.1.2 and 0_00000000002.2.3 in the bucket index shard's omap really are, we need to step into write_meta:

int RGWRados::Object::Write::write_meta(uint64_t size,
                  map<string, bufferlist>& attrs)
{
  int r = 0;
  RGWRados *store = target->get_store();
  if ((r = this->_write_meta(store, size, attrs, true)) == -ENOTSUP) {
    ldout(store->ctx(), 0) << "WARNING: " << __func__
      << "(): got ENOSUP, retry w/o store pg ver" << dendl;
    r = this->_write_meta(store, size, attrs, false);      
  }
  return r;
}


int RGWRados::Object::Write::_write_meta(RGWRados *store, uint64_t size,
                  map<string, bufferlist>& attrs, bool store_pg_ver)
{
  ...
  r = index_op.prepare(CLS_RGW_OP_ADD);
  if (r < 0)
    return r;

  r = ref.ioctx.operate(ref.oid, &op); 
  if (r < 0) { /* we can expect to get -ECANCELED if object was replaced under,
                or -ENOENT if was removed, or -EEXIST if it did not exist
                before and now it does */
    goto done_cancel;
  }

  epoch = ref.ioctx.get_last_version();
  poolid = ref.ioctx.get_id();

  r = target->complete_atomic_modification();
  if (r < 0) {
    ldout(store->ctx(), 0) << "ERROR: complete_atomic_modification returned r=" << r << dendl;
  }
  r = index_op.complete(poolid, epoch, size, 
                        meta.set_mtime, etag, content_type, &acl_bl,
                        meta.category, meta.remove_objs, meta.user_data);

  ...    
}

RGWRados::Bucket::UpdateIndex::prepare

During index_op.prepare, the 0_00000000001.1.2 key-value pair is written into the bucket index shard.

int RGWRados::Bucket::UpdateIndex::prepare(RGWModifyOp op)
{
  if (blind) {
    return 0;
  }
  RGWRados *store = target->get_store();
  BucketShard *bs;
  int ret = get_bucket_shard(&bs);
  if (ret < 0) {
    ldout(store->ctx(), 5) << "failed to get BucketShard object: ret=" << ret << dendl;
    return ret;
  }
  if (obj_state && obj_state->write_tag.length()) {
    optag = string(obj_state->write_tag.c_str(), obj_state->write_tag.length());
  } else {
    if (optag.empty()) {
      append_rand_alpha(store->ctx(), optag, optag, 32);
    }
  }
  ret = store->cls_obj_prepare_op(*bs, op, optag, obj, bilog_flags);
  return ret;
}
int RGWRados::cls_obj_prepare_op(BucketShard& bs, RGWModifyOp op, string& tag, 
                                 rgw_obj& obj, uint16_t bilog_flags)
{
  ObjectWriteOperation o;
  cls_rgw_obj_key key(obj.get_index_key_name(), obj.get_instance());
  cls_rgw_bucket_prepare_op(o, op, tag, key, obj.get_loc(), get_zone().log_data, bilog_flags);
  int flags = librados::OPERATION_FULL_TRY;
  int r = bs.index_ctx.operate(bs.bucket_obj, &o, flags);
  return r;
}
void cls_rgw_bucket_prepare_op(ObjectWriteOperation& o, RGWModifyOp op, string& tag,
                               const cls_rgw_obj_key& key, const string& locator, bool log_op,
                               uint16_t bilog_flags)
{
  struct rgw_cls_obj_prepare_op call;
  call.op = op; 
  call.tag = tag;
  call.key = key;
  call.locator = locator;
  call.log_op = log_op;
  call.bilog_flags = bilog_flags;
  bufferlist in; 
  ::encode(call, in);
  o.exec("rgw", "bucket_prepare_op", in);
} 

cls/rgw/cls_rgw.cc
----------------------
void __cls_init()
{
    ...
   cls_register_cxx_method(h_class, "bucket_prepare_op", CLS_METHOD_RD | CLS_METHOD_WR, rgw_bucket_prepare_op, &h_rgw_bucket_prepare_op); 
    ...
}

int rgw_bucket_prepare_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  ...
  CLS_LOG(1, "rgw_bucket_prepare_op(): request: op=%d name=%s instance=%s tag=%s\n",
          op.op, op.key.name.c_str(), op.key.instance.c_str(), op.tag.c_str());

...
      // fill in proper state
  struct rgw_bucket_pending_info info;
  info.timestamp = real_clock::now();
  info.state = CLS_RGW_STATE_PENDING_MODIFY;
  info.op = op.op;
  entry.pending_map.insert(pair<string, rgw_bucket_pending_info>(op.tag, info));

  struct rgw_bucket_dir_header header;
  rc = read_bucket_header(hctx, &header);
  if (rc < 0) {
    CLS_LOG(1, "ERROR: rgw_bucket_complete_op(): failed to read header\n");
    return rc;
  }

  if (op.log_op) {
    // this produces 0_00000000001.1.2
    rc = log_index_operation(hctx, op.key, op.op, op.tag, entry.meta.mtime,
                             entry.ver, info.state, header.ver, header.max_marker, op.bilog_flags, NULL, NULL);
    if (rc < 0)
      return rc;
  }

  // write out new key to disk
  bufferlist info_bl;
  ::encode(entry, info_bl);
  rc = cls_cxx_map_set_val(hctx, idx, &info_bl);
  if (rc < 0)
    return rc; 
  return write_bucket_header(hctx, &header);
}

Note the log_index_operation function above; our first entry, 0_00000000001.1.2, is produced by it.

static void bi_log_prefix(string& key)
{
  key = BI_PREFIX_CHAR;
  key.append(bucket_index_prefixes[BI_BUCKET_LOG_INDEX]);
}

static void bi_log_index_key(cls_method_context_t hctx, string& key, string& id, uint64_t index_ver)                                                   
{
  bi_log_prefix(key);
  get_index_ver_key(hctx, index_ver, &id);
  key.append(id);
}
#define BI_PREFIX_CHAR 0x80    
#define BI_BUCKET_OBJS_INDEX          0
#define BI_BUCKET_LOG_INDEX           1
#define BI_BUCKET_OBJ_INSTANCE_INDEX  2
#define BI_BUCKET_OLH_DATA_INDEX      3
#define BI_BUCKET_LAST_INDEX          4
static string bucket_index_prefixes[] = { "", /* special handling for the objs list index */
                                          "0_",     /* bucket log index */
                                          "1000_",  /* obj instance index */
                                          "1001_",  /* olh data index */
                                          /* this must be the last index */
                                          "9999_",};

We can see that these bi log keys all start with the character 0x80, followed by '0_':

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012
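
To make the key layout concrete, here is an illustrative Python helper (not Ceph code) that reproduces it from the pieces above: BI_PREFIX_CHAR (0x80), the "0_" log prefix, and get_index_ver_key's "<index_ver>.<version>.<subop>":

def bi_log_key(index_ver, ver, subop):
    # 0x80 + bucket_index_prefixes[BI_BUCKET_LOG_INDEX] ("0_") + "%011llu.%llu.%d"
    return "\x80" + "0_" + "%011d.%d.%d" % (index_ver, ver, subop)

# reproduces the 18-byte key dumped above: "\x800_00000000002.2.3"
print repr(bi_log_key(2, 2, 3))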

We can view the corresponding bilog entries with radosgw-admin bilog list:

    {
        "op_id": "7#00000000001.1.2",
        "op_tag": "19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210",
        "op": "write",
        "object": "oem.tar.bz2",
        "instance": "",
        "state": "pending",
        "index_ver": 1,
        "timestamp": "0.000000",
        "ver": {
            "pool": -1,
            "epoch": 0
        },
        "bilog_flags": 0,
        "versioned": false,
        "owner": "",
        "owner_display_name": ""
    },

RGWRados::Bucket::UpdateIndex::complete

Having covered the prepare phase of UpdateIndex, it is time for the complete phase.

int RGWRados::Bucket::UpdateIndex::complete(int64_t poolid, uint64_t epoch, uint64_t size, 
                                    ceph::real_time& ut, string& etag, string& content_type,                                           bufferlist *acl_bl, RGWObjCategory category,
                                    list<rgw_obj_key> *remove_objs, const string *user_data)

At the end of this function:

  ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category, remove_objs, bilog_flags);
  int r = store->data_log->add_entry(bs->bucket, bs->shard_id);
  if (r < 0) {
    lderr(store->ctx()) << "ERROR: failed writing data log" << dendl;
  }
  return ret;

The store->cls_obj_complete_add function:

ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category, remove_objs, bilog_flags); 
int RGWRados::cls_obj_complete_add(BucketShard& bs, string& tag,
                                   int64_t pool, uint64_t epoch,
                                   RGWObjEnt& ent, RGWObjCategory category,
                                   list<rgw_obj_key> *remove_objs, uint16_t bilog_flags)
{
  return cls_obj_complete_op(bs, CLS_RGW_OP_ADD, tag, pool, epoch, ent, category, remove_objs, bilog_flags);
}

int RGWRados::cls_obj_complete_op(BucketShard& bs, RGWModifyOp op, string& tag,
                                  int64_t pool, uint64_t epoch,
                                  RGWObjEnt& ent, RGWObjCategory category,
                                  list<rgw_obj_key> *remove_objs, uint16_t bilog_flags)
{
      ...
      cls_rgw_bucket_complete_op(o, op, tag, ver, key, dir_meta, pro,  
                             get_zone().log_data, bilog_flags);
      ...
}
void cls_rgw_bucket_complete_op(ObjectWriteOperation& o, RGWModifyOp op, string& tag,
                                rgw_bucket_entry_ver& ver,
                                const cls_rgw_obj_key& key,
                                rgw_bucket_dir_entry_meta& dir_meta,
                                list<cls_rgw_obj_key> *remove_objs, bool log_op,
                                uint16_t bilog_flags)
{

  bufferlist in;
  struct rgw_cls_obj_complete_op call;
  call.op = op;
  call.tag = tag;
  call.key = key;
  call.ver = ver;
  call.meta = dir_meta;
  call.log_op = log_op;
  call.bilog_flags = bilog_flags;
  if (remove_objs)
    call.remove_objs = *remove_objs;
  ::encode(call, in);
  o.exec("rgw", "bucket_complete_op", in);
}

cls/rgw/cls_rgw.cc
-------------------
int rgw_bucket_complete_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
   ...
    case CLS_RGW_OP_ADD:
    {
      struct rgw_bucket_dir_entry_meta& meta = op.meta;
      struct rgw_bucket_category_stats& stats = header.stats[meta.category];
      entry.meta = meta;
      entry.key = op.key;
      entry.exists = true;
      entry.tag = op.tag;
      stats.num_entries++;
      stats.total_size += meta.accounted_size;
      stats.total_size_rounded += cls_rgw_get_rounded_size(meta.accounted_size);
      bufferlist new_key_bl;
      ::encode(entry, new_key_bl);
      int ret = cls_cxx_map_set_val(hctx, idx, &new_key_bl);
      if (ret < 0)
        return ret;
    }
    break;
  }

  if (op.log_op) {
    rc = log_index_operation(hctx, op.key, op.op, op.tag, entry.meta.mtime, entry.ver,
                             CLS_RGW_STATE_COMPLETE, header.ver, header.max_marker, op.bilog_flags, NULL, NULL);
    if (rc < 0)
      return rc;                                               
 }
}

Here we see log_index_operation again; this call produces the second bi log entry:

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012

Again, we can view the bilog with the radosgw-admin command:

   radosgw-admin bilog list  --bucket bucket_0
   {
        "op_id": "7#00000000002.2.3",
        "op_tag": "19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210",
        "op": "write",
        "object": "oem.tar.bz2",
        "instance": "",
        "state": "complete",
        "index_ver": 2,
        "timestamp": "2018-12-15 07:53:11.077893152Z",
        "ver": {
            "pool": 3,
            "epoch": 1
        },
        "bilog_flags": 0,
        "versioned": false,
        "owner": "",
        "owner_display_name": ""
    },

That covers the two bi log entries written when an object is uploaded. It is worth noting how the numbers in the key are generated:

static void bi_log_index_key(cls_method_context_t hctx, string& key, string& id, uint64_t index_ver)
{
  bi_log_prefix(key);
  get_index_ver_key(hctx, index_ver, &id);
  key.append(id);
}
static void get_index_ver_key(cls_method_context_t hctx, uint64_t index_ver, string *key)
{
  char buf[48];
  snprintf(buf, sizeof(buf), "%011llu.%llu.%d", (unsigned long long)index_ver,
           (unsigned long long)cls_current_version(hctx),
           cls_current_subop_num(hctx));                                               
  *key = buf;
} 
uint64_t cls_current_version(cls_method_context_t hctx)  
{ 
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;

  return ctx->pg->info.last_user_version;
}
int cls_current_subop_num(cls_method_context_t hctx)
{ 
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;

  return ctx->current_osd_subop_num;
}

Ceph guarantees that the trailing sequence part is monotonically increasing. This monotonicity matters for multisite incremental sync.

data_log

In UpdateIndex::complete there is the following:

  ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category, remove_objs, bilog_flags);
  int r = store->data_log->add_entry(bs->bucket, bs->shard_id);
  if (r < 0) {
    lderr(store->ctx()) << "ERROR: failed writing data log" << dendl;
  }
  return ret;

store->data_log->add_entry is the part that appends a log entry to the corresponding data_log object in default.rgw.log.

Whenever an object operation is performed on a bucket, a new log entry whose key starts with "1_" is created in the omap, indicating that the bucket has been modified. During incremental sync these entries are used to work out which buckets have changed, and each changed bucket is then synced.

Mapping buckets to data_log.X

In our Jewel setup the number of bucket shards is 8, while the number of data_log.X objects in default.rgw.log is rgw_data_log_num_shards, i.e. 128. RGW defines the mapping between the two:

int RGWDataChangesLog::choose_oid(const rgw_bucket_shard& bs) {
    const string& name = bs.bucket.name;
    int shard_shift = (bs.shard_id > 0 ? bs.shard_id : 0);
    uint32_t r = (ceph_str_hash_linux(name.c_str(), name.size()) + shard_shift) % num_shards; 
    return (int)r;
}

Suppose we have N buckets, each with 8 shards; choose_oid then maps these 8*N bucket shards onto the 128 data_log.X objects.
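
An illustrative Python rendering of that mapping (the real code hashes the name with ceph_str_hash_linux; the stand-in hash below is only for demonstration, so the exact shard numbers will differ from what RGW computes):

NUM_SHARDS = 128      # rgw_data_log_num_shards

def stand_in_hash(name):
    h = 0
    for ch in name:
        h = (h * 131 + ord(ch)) & 0xffffffff
    return h

def choose_oid(bucket_name, shard_id):
    shard_shift = shard_id if shard_id > 0 else 0
    return (stand_in_hash(bucket_name) + shard_shift) % NUM_SHARDS

# bucket_0's 8 index shards land on 8 of the 128 data_log.X objects
print [choose_oid("bucket_0", s) for s in range(8)]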

After uploading an object, we find a new key-value pair in the omap of one of the data_log.X objects in default.rgw.log:

root@NODE-246:/var/log# rados -p default.rgw.log ls |grep data_log  |xargs -I {} rados -p default.rgw.log listomapvals {} 
1_1544942616.469385_1491.1
value (185 bytes) :
00000000  02 01 b3 00 00 00 00 00  00 00 37 00 00 00 62 75  |..........7...bu|
00000010  63 6b 65 74 5f 30 3a 31  39 63 62 66 32 35 30 2d  |cket_0:19cbf250-|
00000020  62 62 33 65 2d 34 62 38  63 2d 62 35 62 66 2d 31  |bb3e-4b8c-b5bf-1|
00000030  61 34 30 64 61 36 36 31  30 66 65 2e 31 35 30 38  |a40da6610fe.1508|
00000040  33 2e 31 3a 37 18 f4 15  5c 44 40 fa 1b 4a 00 00  |3.1:7...\D@..J..|
00000050  00 01 01 44 00 00 00 01  37 00 00 00 62 75 63 6b  |...D....7...buck|
00000060  65 74 5f 30 3a 31 39 63  62 66 32 35 30 2d 62 62  |et_0:19cbf250-bb|
00000070  33 65 2d 34 62 38 63 2d  62 35 62 66 2d 31 61 34  |3e-4b8c-b5bf-1a4|
00000080  30 64 61 36 36 31 30 66  65 2e 31 35 30 38 33 2e  |0da6610fe.15083.|
00000090  31 3a 37 18 f4 15 5c 44  40 fa 1b 1a 00 00 00 31  |1:7...\D@......1|
000000a0  5f 31 35 34 34 39 34 32  36 31 36 2e 34 36 39 33  |_1544942616.4693|
000000b0  38 35 5f 31 34 39 31 2e  31                       |85_1491.1|
000000b9

What is the naming convention for this key?

cls/log/cls_log.cc
-----------------------
static string log_index_prefix = "1_"; 
static void get_index(cls_method_context_t hctx, utime_t& ts, string& index)
{
  get_index_time_prefix(ts, index);   
  string unique_id;
  cls_cxx_subop_version(hctx, &unique_id);
  index.append(unique_id);
}
static void get_index_time_prefix(utime_t& ts, string& index)
{
  char buf[32];
  snprintf(buf, sizeof(buf), "%010ld.%06ld_", (long)ts.sec(), (long)ts.usec());
  index = log_index_prefix + buf;
}
uint64_t cls_current_version(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;

  return ctx->pg->info.last_user_version;
}
int cls_current_subop_num(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->current_osd_subop_num;
}
void cls_cxx_subop_version(cls_method_context_t hctx, string *s) 
{
  if (!s)
    return;
  char buf[32];
  uint64_t ver = cls_current_version(hctx);
  int subop_num = cls_current_subop_num(hctx);
  snprintf(buf, sizeof(buf), "%lld.%d", (long long)ver, subop_num);
  *s = buf;
}

The key 1_1544942616.469385_1491.1 follows the same pattern, and Ceph likewise guarantees that it is monotonically increasing. This property is important when multisite syncs.

iSCSI Commands 2018-11-28T23:12:40+00:00 Bean Li http://bean-li.github.io/iSCSI-Command

Preface

I keep forgetting the common iSCSI client commands, so I am writing them down here.

Common commands

View current sessions

Before anything is logged in, it typically looks like this:

root@node-242:~# iscsiadm -m session 
iscsiadm: No active sessions.

After logging in:

root@node-242:~# iscsiadm -m session 
tcp: [2] 10.16.172.247:3260,1 iqn.2018-11.com:BEAN

root@node-242:~# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-871
Target: iqn.2018-11.com:BEAN
	Current Portal: 10.16.172.247:3260,1
	Persistent Portal: 10.16.172.247:3260,1
		**********
		Interface:
		**********
		Iface Name: default
		Iface Transport: tcp
		Iface Initiatorname: iqn.1993-08.org.debian:01:c9c12dd76e
		Iface IPaddress: 10.16.172.242
		Iface HWaddress: (null)
		Iface Netdev: (null)
		SID: 2
		iSCSI Connection State: LOGGED IN
		iSCSI Session State: LOGGED_IN
		Internal iscsid Session State: NO CHANGE
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 262144
		MaxXmitDataSegmentLength: 1048576
		FirstBurstLength: 262144
		MaxBurstLength: 1048576
		ImmediateData: Yes
		InitialR2T: No
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 25	State: running
		scsi25 Channel 00 Id 0 Lun: 0
			Attached scsi disk sde		State: running

Discover targets by IP

iscsiadm -m discovery -t st -p 10.16.172.246

The output is as follows:

root@node-242:~# iscsiadm -m discovery -t st -p 10.16.172.246
10.16.172.246:3260,1 iqn.2018-11.com:BEAN
10.16.172.247:3260,1 iqn.2018-11.com:BEAN
10.16.172.248:3260,1 iqn.2018-11.com:BEAN

Log in to a specific target

iscsiadm -m node -T [target_name] -p [ip:3260] -l
 # for example:
iscsiadm -m node -T iqn.2018-11.com:BEAN -p 10.16.172.246:3260 -l

The output is as follows:

root@node-242:~# iscsiadm -m node -T iqn.2018-11.com:BEAN -p 10.16.172.246:3260 -l
Logging in to [iface: default, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260]
Login to [iface: default, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260]: successful

After logging in, check with iscsiadm -m session; the result typically looks like this:

root@node-242:~# iscsiadm -m session 
tcp: [3] 10.16.172.246:3260,1 iqn.2018-11.com:BEAN

Log out of a specific target

iscsiadm -m node -T [target_name] -p [ip:3260] -u

For example:

root@node-242:~# iscsiadm -m node -T iqn.2018-11.com:BEAN -p 10.16.172.246:3260 -u
Logging out of session [sid: 3, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260]
Logout of [sid: 3, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260]: successful

After logging out, verify with iscsiadm -m session:

root@node-242:~# iscsiadm -m session 
iscsiadm: No active sessions.

Device information

Generally, logging in to a target adds a new block device. Before login, lsblk shows:

root@node2:~# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0    30G  0 disk 
├─sda1   8:1    0     7M  0 part 
├─sda2   8:2    0  22.2G  0 part /
├─sda3   8:3    0   7.5G  0 part [SWAP]
└─sda4   8:4    0   261M  0 part 
sdb      8:16   0   100G  0 disk 
├─sdb1   8:17   0     8G  0 part 
└─sdb2   8:18   0    92G  0 part /data/osd.2
sdc      8:32   0     2T  0 disk 
sr0     11:0    1  1024M  0 rom 

After logging in to the target:

root@node2:~# iscsiadm -m node -T iqn.2018-11.com:BEAN -p 10.16.172.246:3260 -l
Logging in to [iface: default, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260] (multiple)
Login to [iface: default, target: iqn.2018-11.com:BEAN, portal: 10.16.172.246,3260] successful.
root@node2:~# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0    30G  0 disk 
├─sda1   8:1    0     7M  0 part 
├─sda2   8:2    0  22.2G  0 part /
├─sda3   8:3    0   7.5G  0 part [SWAP]
└─sda4   8:4    0   261M  0 part 
sdb      8:16   0   100G  0 disk 
├─sdb1   8:17   0     8G  0 part 
└─sdb2   8:18   0    92G  0 part /data/osd.2
sdc      8:32   0     2T  0 disk 
sr0     11:0    1  1024M  0 rom  

We can see that a new device, sdc, has appeared.

How do we determine which iSCSI target sdc comes from?

iscsiadm -m session -P 3

For example, in the earlier output the disk sde is the iSCSI disk, coming from Target: iqn.2018-11.com:BEAN at 10.16.172.247:3260:

root@node-242:~# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-871
Target: iqn.2018-11.com:BEAN
	Current Portal: 10.16.172.247:3260,1
	Persistent Portal: 10.16.172.247:3260,1
		**********
		Interface:
		**********
		Iface Name: default
		Iface Transport: tcp
		Iface Initiatorname: iqn.1993-08.org.debian:01:c9c12dd76e
		Iface IPaddress: 10.16.172.242
		Iface HWaddress: (null)
		Iface Netdev: (null)
		SID: 2
		iSCSI Connection State: LOGGED IN
		iSCSI Session State: LOGGED_IN
		Internal iscsid Session State: NO CHANGE
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 262144
		MaxXmitDataSegmentLength: 1048576
		FirstBurstLength: 262144
		MaxBurstLength: 1048576
		ImmediateData: Yes
		InitialR2T: No
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 25	State: running
		scsi25 Channel 00 Id 0 Lun: 0
			Attached scsi disk sde		State: running
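
An illustrative helper (not part of the original post) that shells out to the same iscsiadm -m session -P 3 command and prints such a disk-to-target mapping:

import subprocess

out = subprocess.check_output(["iscsiadm", "-m", "session", "-P", "3"])
target = portal = None
for raw in out.splitlines():
    line = raw.strip()
    if line.startswith("Target:"):
        target = line.split("Target:", 1)[1].strip()
    elif line.startswith("Current Portal:"):
        portal = line.split("Current Portal:", 1)[1].strip()
    elif line.startswith("Attached scsi disk"):
        disk = line.split()[3]      # e.g. "sde"
        print "%s -> %s (%s)" % (disk, target, portal)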
How S3 Data Is Stored in Ceph 2018-06-01T17:20:40+00:00 Bean Li http://bean-li.github.io/how-s3-data-store-in-ceph

Preface

This post tackles the object-storage part of "Where is my data", focusing on S3 object storage.

where is my s3 data

The short answer is that a user's S3 data lives in the .rgw.buckets pool. But the data in that pool looks like this:

default.11383165.1_kern.log
....
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_53
default.11383165.1_821
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_260
default.11383165.1_5
default.11383165.1_572
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_618
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_153
default.11383165.1_217
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_537
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_357
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_565
default.11383165.1_441
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_223

How does this map to buckets, and to the user's object files inside a bucket?

Whole-object upload

A whole-object (non-multipart) upload falls into one of two cases, and the dividing line is:

    "rgw_max_chunk_size": "524288"

This value is the size of a single IO that RadosGW sends down to the RADOS cluster, and it also determines the size of the head object (head_obj) when an application object is split into multiple RADOS objects.

  • the object is smaller than the chunk size, i.e. smaller than 512 KB
  • the object is larger than the chunk size, i.e. larger than 512 KB

Note that for object files larger than rgw_max_chunk_size, the remaining part is cut into multiple RADOS objects according to the following parameter:

"rgw_obj_stripe_size": "4194304"

In other words, an object file smaller than 512 KB is a single object in the underlying RADOS, while an object file larger than 512 KB is stored as multiple objects: the first one, called the head object, is rgw_max_chunk_size bytes, and the rest is cut into objects of rgw_obj_stripe_size bytes and stored in RADOS.
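
A quick back-of-the-envelope sketch (illustrative only) of how many RADOS objects a whole-object upload produces, following directly from these two settings:

import math

RGW_MAX_CHUNK_SIZE = 512 * 1024        # head object size
RGW_OBJ_STRIPE_SIZE = 4 * 1024 * 1024  # stripe size for the tail

def rados_object_count(obj_size):
    """Head object plus however many stripe objects the rest needs."""
    if obj_size <= RGW_MAX_CHUNK_SIZE:
        return 1
    tail = obj_size - RGW_MAX_CHUNK_SIZE
    return 1 + int(math.ceil(tail / float(RGW_OBJ_STRIPE_SIZE)))

# the 2842374144-byte scaler.iso below works out to 1 head + 678 stripe objects = 679
print rados_object_count(2842374144)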

Object files smaller than rgw_max_chunk_size

This case is simple: the bucket_id and the object file's name are joined with an underscore to form the name of the underlying object in the pool.

root@44:~# s3cmd pub /var/log/syslog s3://bean_book/syslog 
ERROR: Invalid command: u'pub'
root@44:~# s3cmd put /var/log/syslog s3://bean_book/syslog 
/var/log/syslog -> s3://bean_book/syslog  [1 of 1]
 60600 of 60600   100% in    0s     8.56 MB/s  done


root@44:~# rados -p .rgw.buckets ls |grep syslog 
default.11383165.2_syslog
root@44:/# rados -p .rgw.buckets stat default.11383165.2_syslog
.rgw.buckets/default.11383165.2_syslog mtime 2018-05-27 14:51:14.000000, size 60600

Object files larger than rgw_max_chunk_size

Object files larger than rgw_max_chunk_size are stored as multiple underlying RADOS objects.

root@44:~# s3cmd put VirtualStor\ Scaler-v6.3-319~201805240311~cda7fd7.iso s3://bean_book/scaler.iso 

After the upload completes, we can find the following objects in the .rgw.buckets pool:

default.11383165.2_scaler.iso
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_208
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_221
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_76
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_293

When a user uploads an object file, it is broken up into:

  • a head object, head_obj, of size rgw_max_chunk_size (512 KB)
  • several intermediate objects of exactly the stripe size
  • one tail object of at most the stripe size

对于head_obj的命名组成,和上面一样,我就不重复绘图了,对于中间的对象和最后的尾对象,命名组成如下:

这里面有个问题:对象名字中带有随机字符。当然,当bucket里只有一个大对象文件的时候,比如我只上传了一个2G+的大文件,bucket内所有带shadow的底层对象都属于这个scaler.iso对象,不会有歧义。

但是我们想想,如果bucket中有很多这种2G+的大对象文件,我们如何区分呢?

root@44:/var/log/ceph# rados -p .rgw.buckets ls |grep shadow |grep "_1$"
default.11383165.2__shadow_.3vU63olQg1ovOpVdWQxJsx2o28N3TFl_1
default.11383165.2__shadow_.iDlJATXiRQBiT9xxSX5qS_Rb8iFdHam_1
default.11383165.2__shadow_.ipsp4zhQCPa1ckNNQZaJeLRSq3miyhR_1
default.11383165.2__shadow_.JKq4eXO5IJ6BMANVmLluwcUVHH7wzW9_1
default.11383165.2__shadow_.C7e7w4gQLapZ_KK3c2_2pKcz-yIobaN_1
default.11383165.2__shadow_.mGwYpWb3FXieaaaDNdaPzfs546ysNnT_1
default.11383165.2__shadow_.OvUkm8069EUeyXHneWhd4JOiVPev3gI_1
default.11383165.2__shadow_.zNsCV2xYKlym7uLDkR7cV0SF3edH0t3_1

换句话说,head_obj可以和对象文件关联起来,但是这些中间对象和尾对象,如何和head_obj关联起来呢?

head_obj不一般,它需要维护对象文件元数据信息和manifest信息:

root@44:~# rados -p .rgw.buckets listxattr default.11383165.2_scaler.iso 
user.rgw.acl
user.rgw.content_type
user.rgw.etag
user.rgw.idtag
user.rgw.manifest
user.rgw.x-amz-date

其中对于寻找数据比较重要的数据结构为:

rados -p .rgw.buckets getxattr  default.11383165.2_scaler.iso  user.rgw.manifest  > /root/scaler.iso.manifest

root@44:~# ceph-dencoder type RGWObjManifest import /root/scaler.iso.manifest  decode dump_json
{
    "objs": [],
    "obj_size": 2842374144,     <-----------------对象文件大小
    "explicit_objs": "false",
    "head_obj": {
        "bucket": {
            "name": "bean_book",
            "pool": ".rgw.buckets",
            "data_extra_pool": ".rgw.buckets.extra",
            "index_pool": ".rgw.buckets.index",
            "marker": "default.11383165.2",
            "bucket_id": "default.11383165.2"
        },
        "key": "",
        "ns": "",
        "object": "scaler.iso",         <---------------------对象名
        "instance": ""
    },
    "head_size": 524288,
    "max_head_size": 524288,
    "prefix": ".mGwYpWb3FXieaaaDNdaPzfs546ysNnT_",      <------------------中间对象和尾对象的随机前缀
    "tail_bucket": {
        "name": "bean_book",
        "pool": ".rgw.buckets",
        "data_extra_pool": ".rgw.buckets.extra",
        "index_pool": ".rgw.buckets.index",
        "marker": "default.11383165.2",
        "bucket_id": "default.11383165.2"
    },
    "rules": [
        {
            "key": 0,
            "val": {
                "start_part_num": 0,
                "start_ofs": 524288,
                "part_size": 0,
                "stripe_max_size": 4194304,
                "override_prefix": ""
            }
        }
    ]
}

有了head size、stripe size这些信息,再加上前缀,就可以很轻松地组成中间对象和尾对象的名字,进而读取对象文件的不同部分了。
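
根据上面manifest中的字段和pool里对象名的规律,可以写一个小脚本枚举出这些RADOS对象的名字(仅为示意,命名格式是从前面rados ls的输出反推出来的,不是RGW代码的精确实现):

def rados_objects_for_plain_object(marker, obj_name, obj_size, prefix,
                                   head_size=512*1024, stripe_size=4*1024*1024):
    """示意:根据manifest字段,枚举一个整体上传对象对应的RADOS对象名"""
    names = [marker + "_" + obj_name]          # head_obj
    remaining, n = obj_size - head_size, 1
    while remaining > 0:                       # 中间对象和尾对象
        names.append(marker + "__shadow_" + prefix + str(n))
        remaining -= stripe_size
        n += 1
    return names

# 以上面 scaler.iso 的 manifest 为例:
names = rados_objects_for_plain_object(
    "default.11383165.2", "scaler.iso", 2842374144,
    ".mGwYpWb3FXieaaaDNdaPzfs546ysNnT_")
print(len(names))         # 按这个简化模型,1 个 head_obj + 678 个 shadow 对象
print(names[0], names[1])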

寻找数据的问题解决之后,我们可以关注下其他的元数据信息:

root@44:~# rados -p .rgw.buckets getxattr  default.11383165.2_scaler.iso  user.rgw.etag -
9df9be75a165539894ef584cd27cc39f

root@44:~# md5sum VirtualStor\ Scaler-v6.3-319~201805240311~cda7fd7.iso 
9df9be75a165539894ef584cd27cc39f  VirtualStor Scaler-v6.3-319~201805240311~cda7fd7.iso

对于非分片上传的对象文件而言,etag就是MD5,记录在对象文件head_obj的扩展属性中。

对象文件的ACL信息,也记录在head_obj的扩展属性中:

root@44:~# rados -p .rgw.buckets getxattr  default.11383165.2_scaler.iso  user.rgw.acl > scaler.iso.acl
root@44:~# ceph-dencoder type RGWAccessControlPolicy import scaler.iso.acl  decode dump_json
{
    "acl": {
        "acl_user_map": [
            {
                "user": "bean_li",
                "acl": 15
            }
        ],
        "acl_group_map": [],
        "grant_map": [
            {
                "id": "bean_li",
                "grant": {
                    "type": {
                        "type": 0
                    },
                    "id": "bean_li",
                    "email": "",
                    "permission": {
                        "flags": 15
                    },
                    "name": "bean_li",
                    "group": 0
                }
            }
        ]
    },
    "owner": {
        "id": "bean_li",
        "display_name": "bean_li"
    }
}

除了这些默认的扩展属性,用户指定的metadata也是存放在此处。

分片上传 multipart upload

分片上传的对象,数据如何存放?

root@44:~# cp VirtualStor\ Scaler-v6.3-319~201805240311~cda7fd7.iso  /var/share/ezfs/shareroot/NAS/scaler_iso
root@44:~# 
root@44:~# s3cmd mb s3://iso
Bucket 's3://iso/' created

使用分片上传,每10M一个分片:

上传上去的对象有如下的命名风格:

default.14434697.1_scaler_iso
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.187_2
default.14434697.1__multipart_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.129_2
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.134_2
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.22_1
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.83_2
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.136_2

head_obj不必多说,还是老的命名风格,和整体上传的区别是,size为0:

root@45:/var/log/radosgw# rados -p .rgw.buckets stat default.14434697.1_scaler_iso 
.rgw.buckets/default.14434697.1_scaler_iso mtime 2018-05-27 18:48:32.000000, size 0

注意上面名字中的2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH,最容易让你困扰的是2~,这个2~是upload_id的前缀。

#define MULTIPART_UPLOAD_ID_PREFIX_LEGACY "2/"
#define MULTIPART_UPLOAD_ID_PREFIX "2~" // must contain a unique char that may not come up in gen_rand_alpha() 

命名规则如下:

需要注意的是,RADOS中multipart对象的大小就是普通的rgw_obj_stripe_size,即4M:

root@45:/var/log/radosgw# rados -p .rgw.buckets stat default.14434697.1__multipart_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31 
.rgw.buckets/default.14434697.1__multipart_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31 mtime 2018-05-27 18:48:10.000000, size 4194304

但是注意,应用层的分片大小是用户可以指定的,比如我们的RRS,就是10M一个分片:

obsync.py
-----------------
MULTIPART_THRESH = 10485760

            mpu = self.bucket.initiate_multipart_upload(obj.name, metadata=meta_to_dict(obj.meta))
            try: 
                remaining = obj.size
                part_num = 0
                part_size = MULTIPART_THRESH

                while remaining > 0: 
                    offset = part_num * part_size
                    length = min(remaining, part_size)
                    ioctx = src.get_obj_ioctx(obj, offset, length)
                    mpu.upload_part_from_file(ioctx, part_num + 1) 
                    remaining -= length
                    part_num += 1 

                mpu.complete_upload()
            except Exception as e:
                mpu.cancel_upload()
                raise e

很明显,单个multipart对象不足以存放10M的分片,因此,一般每个分片还有对应的shadow对象:

root@45:/var/log/radosgw# rados -p .rgw.buckets ls |grep "2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31"
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_1
default.14434697.1__multipart_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31
default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_2

root@45:/var/log/radosgw# rados -p .rgw.buckets stat default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_1 
.rgw.buckets/default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_1 mtime 2018-05-27 18:48:10.000000, size 4194304
root@45:/var/log/radosgw# rados -p .rgw.buckets stat default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_2
.rgw.buckets/default.14434697.1__shadow_scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH.31_2 mtime 2018-05-27 18:48:10.000000, size 2097152

毫无意外,两个shadow对象一个4M,另一个2M,加上4M的multipart对象,恰好是10M的分片大小。

同样的问题是,数据对象的名字是由上传时候的方式决定的,那么用户读对象的时候,如何区分它是分片上传的对象还是整体上传的对象呢?

root@45:~# rados -p .rgw.buckets getxattr default.14434697.1_scaler_iso user.rgw.manifest > scaler_iso_multipart.manifest 

root@45:~# ceph-dencoder type RGWObjManifest import /root/scaler_iso_multipart.manifest decode dump_json
{
    "objs": [],
    "obj_size": 2842374144,
    "explicit_objs": "false",
    "head_obj": {
        "bucket": {
            "name": "iso",
            "pool": ".rgw.buckets",
            "data_extra_pool": ".rgw.buckets.extra",
            "index_pool": ".rgw.buckets.index",
            "marker": "default.14434697.1",
            "bucket_id": "default.14434697.1"
        },
        "key": "",
        "ns": "",
        "object": "scaler_iso",
        "instance": ""
    },
    "head_size": 0,
    "max_head_size": 0,
    "prefix": "scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH",
    "tail_bucket": {
        "name": "iso",
        "pool": ".rgw.buckets",
        "data_extra_pool": ".rgw.buckets.extra",
        "index_pool": ".rgw.buckets.index",
        "marker": "default.14434697.1",
        "bucket_id": "default.14434697.1"
    },
    "rules": [
        {
            "key": 0,
            "val": {
                "start_part_num": 1,
                "start_ofs": 0,
                "part_size": 10485760,
                "stripe_max_size": 4194304,
                "override_prefix": ""
            }
        },
        {
            "key": 2841640960,
            "val": {
                "start_part_num": 272,
                "start_ofs": 2841640960,
                "part_size": 733184,
                "stripe_max_size": 4194304,
                "override_prefix": ""
            }
        }
    ]
}
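
结合这份manifest和前面rados ls的输出,可以大致反推出分片上传时某一个分片对应的RADOS对象名。下面是一个示意脚本,按part_size=10M、stripe=4M推算(命名格式同样是从输出反推的,并非RGW代码的精确实现):

def rados_objects_for_part(marker, prefix, part_num,
                           part_size=10*1024*1024, stripe_size=4*1024*1024):
    """示意:根据manifest的prefix推算某个分片对应的RADOS对象名"""
    names = ["%s__multipart_%s.%d" % (marker, prefix, part_num)]  # 分片的第一个4M放在multipart对象里
    remaining, n = part_size - stripe_size, 1
    while remaining > 0:                                          # 剩余部分依次放入shadow对象
        names.append("%s__shadow_%s.%d_%d" % (marker, prefix, part_num, n))
        remaining -= stripe_size
        n += 1
    return names

for name in rados_objects_for_part(
        "default.14434697.1",
        "scaler_iso.2~PIT5zFUnzqgjA_EjTb1SfugCOtHZKDH", 31):
    print(name)   # 正好对应前面看到的 .31、.31_1、.31_2 三个对象

至于读对象的时候如何区分两种上传方式,看manifest中的head_size即可:整体上传的head_size是524288,而分片上传的head_size为0,数据全部在multipart对象和shadow对象中。
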
]]>
flashcache 源码解析 2017-10-20T17:20:40+00:00 Bean Li http://bean-li.github.io/flashcache-source-code-1 前言

从flashcache的创建开始,介绍flashcache在SSD上的layout和内存数据结构,简单地说就是数据组织形式。

        sprintf(dmsetup_cmd, "echo 0 %lu flashcache %s %s %s %d 2 %lu %lu %d %lu %d %lu"
                " | dmsetup create %s",
                disk_devsize, disk_devname, ssd_devname, cachedev, cache_mode, block_size, 
                cache_size, associativity, disk_associativity, write_cache_only, md_block_size,
                cachedev);

从flashcache之后的参数算起:

| dmc的成员 | dmsetup create中的参数 | 默认值 | 含义 |
|-----------|------------------------|--------|------|
| disk_dev | disk_devname | | 慢速块设备的名字 |
| cache_dev | ssd_devname | | SSD设备的名字 |
| dm_vdevname | cachedev | | flashcache起的名字 |
| cache_mode | cache_mode | | 三种合法值:write_back、write_through和write_around |
| persistence(非dmc的成员变量) | 2 | 2 | 实际上flashcache_ctr函数既为flashcache_create服务,也为flashcache_load服务 |
| block_size | block_size | 8 | 8个扇区即4K |
| size | cache_size | 设备扇区总数/block_size | 注意这个值的含义是block的个数,即总扇区数除以block_size |
| assoc | associativity | 512 | 合法值为(256,8192)之间的2的整数幂,不包含256和8192 |
| disk_assoc | disk_associativity | | |
| write_only_cache | write_cache_only | 0 | write_back模式有一个子模式,即write_only |
| md_block_size | md_block_size | 8 | |
| num_sets | | | dmc->size >> dmc->assoc_shift,即cache set的个数 |

影响flashcache布局的几个参数有:

  • block_size: 默认情况下值为8,即8个扇区组成一个block,即block的大小为4KB
  • size : block的个数

注意,在下面这段代码中:

       
      //截止到此处,dmc->size是SSD设备的扇区个数,
      //后面调用dmc->size /= (dmc->block_size)执行之后,才变成block的个数。

        dmc->md_blocks = INDEX_TO_MD_BLOCK(dmc, dmc->size / dmc->block_size) + 1 + 1; 
        /*总扇区数减去md_block需要的扇区数,得到最多可以用于存放数据的扇区数*/
        dmc->size -= dmc->md_blocks * MD_SECTORS_PER_BLOCK(dmc);  
        /*可以用来存放cache数据的block个数,默认情况下即4K的个数*/
        dmc->size /= dmc->block_size;
    /*注意,block是要组成set的,因此有assoc的概念,默认512个block组成一个set
     *因此block的个数需要向下对齐512的倍数*/
        dmc->size = (dmc->size / dmc->assoc) * dmc->assoc;           
        
        /*有了准确的block的个数,需要的meta data block重新计算*/
        dmc->md_blocks = INDEX_TO_MD_BLOCK(dmc, dmc->size) + 1 + 1;                                                                                    
        DMINFO("flashcache_writeback_create: md_blocks = %d, md_sectors = %d\n", 
               dmc->md_blocks, dmc->md_blocks * MD_SECTORS_PER_BLOCK(dmc));
        dev_size = to_sector(dmc->cache_dev->bdev->bd_inode->i_size);
        cache_size = dmc->md_blocks * MD_SECTORS_PER_BLOCK(dmc) + (dmc->size * dmc->block_size);
        if (cache_size > dev_size) {
                DMERR("Requested cache size exceeds the cache device's capacity" \
                      "(%lu>%lu)",
                      cache_size, dev_size);
                vfree((void *)header);
                return 1;
        }

这段代码执行过后,我们基本就能对flashcache的组织形式有一定的了解了。首先是8个扇区组成一个block,然后是512个block组成一个set,这样的话,一个set的大小为2MB。将SSD整体空间扣除meta需要的部分之后,组织成这样的结构:

注意,这只是cache block的部分,对于flashcache来说,还有metadata block和superblock。和文件系统一样,flashcache也有superblock,它记录了flashcache的组织形式:

        header = (struct flash_superblock *)vmalloc(MD_BLOCK_BYTES(dmc));
        if (!header) {
                DMERR("flashcache_writeback_create: Unable to allocate sector");
                return 1;                                                                                                                              
        }
struct flash_superblock {
        sector_t size;          /* Cache size */
        u_int32_t block_size;   /* Cache block size */
        u_int32_t assoc;        /* Cache associativity */
        u_int32_t cache_sb_state;       /* Clean shutdown ? */
        char cache_devname[DEV_PATHLEN]; /* Contains dm_vdev name as of v2 modifications */
        sector_t cache_devsize;
        char disk_devname[DEV_PATHLEN]; /* underlying block device name (use UUID paths!) */
        sector_t disk_devsize;
        u_int32_t cache_version;
        u_int32_t md_block_size;                                                                                                                       
        u_int32_t disk_assoc;
        u_int32_t write_only_cache;
};

尽管flashcache的superblock本身需要的空间比较小,但是flashcache给它预留了一个meta data block的大小(默认情况下即4KB),为将来可能的扩展留出空间。

flash_superblock这个数据结构存放在SSD设备的第一个4K中,当机器重启之后,flashcache_load会读取该设备,校验头部扇区中存放的内容,即superblock的内容:

        ssd_devname = argv[optind++];
        cache_fd = open(ssd_devname, O_RDONLY);
        if (cache_fd < 0) {
                fprintf(stderr, "Failed to open %s\n", ssd_devname);
                exit(1);
        }   
        lseek(cache_fd, 0, SEEK_SET);
        if (read(cache_fd, buf, 512) < 0) {
                fprintf(stderr, "Cannot read Flashcache superblock %s\n", ssd_devname);
                exit(1);                    
        }   
        if (!(sb->cache_sb_state == CACHE_MD_STATE_DIRTY ||
              sb->cache_sb_state == CACHE_MD_STATE_CLEAN ||
              sb->cache_sb_state == CACHE_MD_STATE_FASTCLEAN ||
              sb->cache_sb_state == CACHE_MD_STATE_UNSTABLE)) {
                fprintf(stderr, "%s: Invalid Flashcache superblock %s\n", pname, ssd_devname);
                exit(1);
        }   

创建flashcache的时候,flashcache_create也会读取SSD设备的第一个扇区,来检查SSD上是不是已经创建过flashcache。

对于上面网格图中的任何一个cache block,都需要数据结构来描述其状态,比如值是否有效,是否DIRTY等,其数据结构如下:

#ifdef FLASHCACHE_DO_CHECKSUMS
struct flash_cacheblock {                                                                                                                              
        sector_t        dbn;    /* Sector number of the cached block */
        u_int64_t       checksum;
        u_int32_t       cache_state; /* INVALID | VALID | DIRTY */
} __attribute__ ((aligned(32)));
#else   
struct flash_cacheblock {
        sector_t        dbn;    /* Sector number of the cached block */
        u_int32_t       cache_state; /* INVALID | VALID | DIRTY */      
} __attribute__ ((aligned(16)));
#endif

对于我们而言,flash_cacheblock的大小为16字节,因此,每个cache block都会有16字节的元数据。这16字节描述了一个cache block。每个meta data block 默认有4KB,即每个meta data block可以存放256 个cache block的元数据信息。
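
按照这个比例,可以粗略估算给定cache block个数需要多少个meta data block(示意,忽略代码中为superblock等额外加的block):

def md_blocks_needed(num_cache_blocks, md_block_bytes=4096, per_block_meta=16):
    """示意:每个4KB的md block可以描述256个cache block"""
    per_md = md_block_bytes // per_block_meta           # 256
    return (num_cache_blocks + per_md - 1) // per_md    # 向上取整

print(md_blocks_needed(114322432))   # 446572,对应后文dmsetup table输出里的total blocks(114322432)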

综合上述讨论,一个完整的SSD layout如下所示:

上面的布局,主要是块设备上的布局,除此外,flashcache正常运行期间,需要消耗内存,内存中有数据结构管理这些cache block,如下所示:

 order = dmc->size * sizeof(struct cacheblock); 
 struct cacheblock {
        u_int16_t       cache_state;
        int16_t         nr_queued;      /* jobs in pending queue */                                                                                    
        u_int16_t       lru_prev, lru_next;
        u_int8_t        use_cnt;
        u_int8_t        lru_state;
        sector_t        dbn;    /* Sector number of the cached block */
        u_int16_t       hash_prev, hash_next;
#ifdef FLASHCACHE_DO_CHECKSUMS
        u_int64_t       checksum;
#endif
} __attribute__((packed));

目前来讲,不考虑checksum,内存中18 Byte 描述一个cache block(默认4KB)。

        order = dmc->size * sizeof(struct cacheblock);
        DMINFO("Allocate %luKB (%luB per) mem for %lu-entry cache" \
               "(capacity:%luMB, associativity:%u, block size:%u " \
               "sectors(%uKB))",
               order >> 10, sizeof(struct cacheblock), dmc->size,
               cache_size >> (20-SECTOR_SHIFT), dmc->assoc, dmc->block_size,
               dmc->block_size >> (10-SECTOR_SHIFT));
        dmc->cache = (struct cacheblock *)vmalloc(order);
        if (!dmc->cache) {
                vfree((void *)header);
                DMERR("flashcache_writeback_create: Unable to allocate cache md");
                return 1;
        }
        memset(dmc->cache, 0, order);
        /* Initialize the cache structs */
        for (i = 0; i < dmc->size ; i++) {
                dmc->cache[i].dbn = 0;
#ifdef FLASHCACHE_DO_CHECKSUMS
                dmc->cache[i].checksum = 0;
#endif
                dmc->cache[i].cache_state = INVALID;
                dmc->cache[i].lru_state = 0;
                dmc->cache[i].nr_queued = 0;
        }                         

通过这个18 Byte的内存描述一个flashcache 的cache block,我们可以估算,一个400G 的SSD作为flashcache的SSD部分,消耗的内存约为:

400G/4KB*18 = 1.8GB
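
如果想对其他容量的SSD做同样的估算,可以套用下面的小函数(示意,按正文的每个cache block 18字节、block大小4KB计算):

def flashcache_mem_overhead_bytes(ssd_bytes, block_bytes=4096, per_block_meta=18):
    """示意:估算内存中cacheblock数组占用的空间"""
    return ssd_bytes // block_bytes * per_block_meta

print(flashcache_mem_overhead_bytes(400 * 1024**3) / 1024**3)   # 约1.76,即正文所说的约1.8GB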

cache_set

dmc的assoc 默认是512,表示512个block组成一个set,即512*4K= 2MB:

init:
        /*计算整个flashcache set的个数*/
        dmc->num_sets = dmc->size >> dmc->assoc_shift;
        order = dmc->num_sets * sizeof(struct cache_set);
        dmc->cache_sets = (struct cache_set *)vmalloc(order);                                                                                          
        if (!dmc->cache_sets) {
                ti->error = "Unable to allocate memory";
                r = -ENOMEM;
                vfree((void *)dmc->cache);
                goto bad3;
        }                                    
        memset(dmc->cache_sets, 0, order);
        for (i = 0 ; i < dmc->num_sets ; i++) {
                dmc->cache_sets[i].set_fifo_next = i * dmc->assoc;
                dmc->cache_sets[i].set_clean_next = i * dmc->assoc;
                dmc->cache_sets[i].fallow_tstamp = jiffies;
                dmc->cache_sets[i].fallow_next_cleaning = jiffies;
                dmc->cache_sets[i].hotlist_lru_tail = FLASHCACHE_NULL;
                dmc->cache_sets[i].hotlist_lru_head = FLASHCACHE_NULL;
                dmc->cache_sets[i].warmlist_lru_tail = FLASHCACHE_NULL;
                dmc->cache_sets[i].warmlist_lru_head = FLASHCACHE_NULL;
                spin_lock_init(&dmc->cache_sets[i].set_spin_lock);
        }

对于每个set有单独的数据结构描述:

struct cache_set {
        spinlock_t              set_spin_lock;
        u_int32_t               set_fifo_next;
        u_int32_t               set_clean_next;
        u_int16_t               clean_inprog;
        u_int16_t               nr_dirty;
        u_int16_t               dirty_fallow;
        unsigned long           fallow_tstamp;
        unsigned long           fallow_next_cleaning;
        /*  
         * 2 LRU queues/cache set.
         * 1) A block is faulted into the MRU end of the warm list from disk.
         * 2) When the # of accesses hits a threshold, it is promoted to the
         * (MRU) end of the hot list. To keep the lists in equilibrium, the
         * LRU block from the host list moves to the MRU end of the warm list.
         * 3) Within each list, an access will move the block to the MRU end.
         * 4) Reclaims happen from the LRU end of the warm list. After reclaim
         * we move a block from the LRU end of the hot list to the MRU end of
         * the warm list.
         */
        u_int16_t               hotlist_lru_head, hotlist_lru_tail;
        u_int16_t               warmlist_lru_head, warmlist_lru_tail;
        u_int16_t               lru_hot_blocks, lru_warm_blocks;
#define NUM_BLOCK_HASH_BUCKETS          512
        u_int16_t               hash_buckets[NUM_BLOCK_HASH_BUCKETS];
        u_int16_t               invalid_head;                                                                                                          
};

注意,对于同一个set的cache block而言,根据状态,位于三个不同的链表之中:

  • INVALID
    • invalid_head为头部的invalid 链表
  • VALID
    • hot:
      • hotlist_lru_head为头部,hotlist_lru_tail为尾部的hot链表
    • warm
      • warmlist_lru_head为头部,warmlist_lru_tail为尾部的warm链表

注意,一个cacheblock只会位于其中的一条链表之中,不会同时属于hot和warm,更不会同时属于invalid和warm。

在64位系统上,指针的长度是8 Byte,如果用普通的链表,prev和next就要消耗16 Byte的空间,这样是比较浪费的。flashcache使用的是u_int16_t类型:每一个cacheblock通过一个2字节的short值,记录前一个cacheblock和后一个cacheblock。注意该值是同一个set内的index值,因为默认每个set只有512个cache block,所以2 Byte的short足够记录下。
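
换句话说,链表里存的不是指针,而是cache block在所属set内的下标,需要时再换算成dmc->cache[]数组的全局下标。下面两行换算关系(示意)与后文flashcache_hash_lookup中 index = set * dmc->assoc + set_ix 的写法是一致的:

ASSOC = 512   # 默认每个cache set有512个cache block

def to_global_index(set_number, set_ix):
    """示意:set内的2字节下标 -> dmc->cache[]中的全局下标"""
    return set_number * ASSOC + set_ix

def to_set_local(index):
    """示意:全局下标 -> (所属set, set内下标)"""
    return index // ASSOC, index % ASSOC

assert to_global_index(*to_set_local(3 * ASSOC + 17)) == 3 * ASSOC + 17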

注意,当cacheblock中没有任何数据的时候,它位于invalid链表中,即这条链表上的cacheblock都不含有效数据。毫无疑问,新建的flashcache里面并没有任何有用的数据,也不和SATA DISK的数据相关联,因此所有cacheblock都会位于invalid链表。在flashcache_ctr之中有如下的语句:

        for (i = 0 ; i < dmc->size ; i++) {
                dmc->cache[i].hash_prev = FLASHCACHE_NULL;
                dmc->cache[i].hash_next = FLASHCACHE_NULL;
                /*注意,flashcache_ctr并非只有创建flashcache一种情况,
                 *还有flashcache使用了一段时间之后,重启机器后的flashcache_load
                 *因此,需要判断对应的cacheblock的cache_state状态值,来初始化到合适的链表*/
                 
                /*如果cache_state状态中VALID置位,则插入的flashcache_hash,方便查找*/
                if (dmc->cache[i].cache_state & VALID) {
                        flashcache_hash_insert(dmc, i);
                        atomic_inc(&dmc->cached_blocks);
                }    
                /*如果dirty,则dirty统计增加*/
                if (dmc->cache[i].cache_state & DIRTY) {
                        dmc->cache_sets[i / dmc->assoc].nr_dirty++;
                        atomic_inc(&dmc->nr_dirty);
                }    
                /*如果是新创建,或者该cacheblock并无有效数据,则插入Invalid链表
                 *对应新创建的flashcahce,所有的cacheblock都在invalid链表,
                 *注意,并不是1条链表,而是每个cacheset都有1条链表*/
                if (dmc->cache[i].cache_state & INVALID)
                        flashcache_invalid_insert(dmc, i);

下面来介绍hotlist和warmlist。flashcache采用的缓存置换算法是LRU算法,它维护着2条链表:hot和warm。顾名思义,hot链表的数据更热,更不应该被置换出去。每条链表有head和tail,越靠近尾部的cacheblock越热,越不应该被置换出去。

数据是被访问的,因此,频繁访问的数据,可能会从warm提升(promote)到hot;而hot链表中最冷的数据(即靠近head的数据),也可能会被降级(demote)到warm中。

除此以外,当新的io进来时,第一反应肯定是看请求的IO对应的地址dbn是否恰好在flashcache中并且状态为VALID,如果找到皆大欢喜;如果找不到,第二反应是寻找一个无人使用的cacheblock,即位于INVALID链表的cacheblock;如果很不幸,没有INVALID的cache block,所有的block都已经投入使用(VALID),这时候就必须要寻找牺牲品了,即reclaim策略。

接下来我们以flashcache_read为例,详细介绍寻找cacheblock的方法。

寻找cacheblock

对于读请求,由函数flashcache_read负责处理,注意,对于那些注定不会进入cacheblock的读写,在进入flashcache_read之前都已经过滤掉了:

        uncacheable = (unlikely(dmc->bypass_cache) ||
                       (to_sector(bio->bi_size) != dmc->block_size) ||
                       /* 
                        * If the op is a READ, we serve it out of cache whenever possible, 
                        * regardless of cacheablity 
                        */
                       (bio_data_dir(bio) == WRITE && 
                        ((dmc->cache_mode == FLASHCACHE_WRITE_AROUND) ||
                         flashcache_uncacheable(dmc, bio))));
        spin_unlock_irqrestore(&dmc->ioctl_lock, flags);
        if (uncacheable) {
                flashcache_setlocks_multiget(dmc, bio);
                queued = flashcache_inval_blocks(dmc, bio);
                flashcache_setlocks_multidrop(dmc, bio);
                if (queued) {
                        if (unlikely(queued < 0))                    
                                flashcache_bio_endio(bio, -EIO, dmc, NULL);
                } else {
                        /* Start uncached IO */
                        /*绕过flashcache,直接访问慢速设备*/
                        flashcache_start_uncached_io(dmc, bio);
                }
        } else {
                /*如果io类型可以走flashcache,那么根据类型分别调用
                 *flashcache_read和flashcache_write*/
                if (bio_data_dir(bio) == READ)
                        flashcache_read(dmc, bio);
                else
                        flashcache_write(dmc, bio);
        }
        return DM_MAPIO_SUBMITTED;

剩下内容的重点是cacheblock的查找和置换的策略,至于什么io走flashcache,什么io直接访问慢速设备,并不是我们关心的内容。我们继续以flashcache_read为例,介绍寻找cacheblock的过程。

下面代码是查找cacheblock的方法,主要的寻找过程位于flashcache_lookup函数。

        flashcache_setlocks_multiget(dmc, bio);
        res = flashcache_lookup(dmc, bio, &index);
        /* Cache Read Hit case */
        if (res > 0) {
                cacheblk = &dmc->cache[index];
                if ((cacheblk->cache_state & VALID) && 
                    (cacheblk->dbn == bio->bi_sector)) {
                        flashcache_read_hit(dmc, bio, index);
                        return;
                }
        }
        /*
         * In all cases except for a cache hit (and VALID), test for potential 
         * invalidations that we need to do.
         */
        queued = flashcache_inval_blocks(dmc, bio);
        if (queued) {
                if (unlikely(queued < 0))
                        flashcache_bio_endio(bio, -EIO, dmc, NULL);
                if ((res > 0) && 
                    (dmc->cache[index].cache_state == INVALID))
                        /* 
                         * If happened to pick up an INVALID block, put it back on the 
                         * per cache-set invalid list
                         */
                        flashcache_invalid_insert(dmc, index);                                                                                         
                flashcache_setlocks_multidrop(dmc, bio);
                return;
        }

数据是流动的,因此整个flashcache的N个cacheset、每个cacheset的M个cache block,其状态也都是流动的:刚才可能还是invalid,很快就可能进入warmlist,再有数据访问,可能又迁移到了hotlist。因此,理解flashcache_lookup,知道当用户的某一个请求要访问 sector_t dbn = bio->bi_sector 这个扇区的时候如何查找cacheblock,是理解状态流动非常关键的一步。

static int
flashcache_lookup(struct cache_c *dmc, struct bio *bio, int *index)
{
        sector_t dbn = bio->bi_sector;
#if DMC_DEBUG                                                                                                                                          
        int io_size = to_sector(bio->bi_size);
#endif
        unsigned long set_number = hash_block(dmc, dbn);
        int invalid, oldest_clean = -1;
        int start_index;

        start_index = dmc->assoc * set_number;
        DPRINTK("Cache lookup : dbn %llu(%lu), set = %d",
                dbn, io_size, set_number);
        find_valid_dbn(dmc, dbn, start_index, index);
        if (*index >= 0) {
                DPRINTK("Cache lookup HIT: Block %llu(%lu): VALID index %d",
                             dbn, io_size, *index);
                /* We found the exact range of blocks we are looking for */
                return VALID;
        }
        invalid = find_invalid_dbn(dmc, set_number);
        if (invalid == -1) {
                /* We didn't find an invalid entry, search for oldest valid entry */
                find_reclaim_dbn(dmc, start_index, &oldest_clean);
        }
        /* 
         * Cache miss :
         * We can't choose an entry marked INPROG, but choose the oldest                                                                               
         * INVALID or the oldest VALID entry.
         */
        *index = start_index + dmc->assoc;
        if (invalid != -1) {
                DPRINTK("Cache lookup MISS (INVALID): dbn %llu(%lu), set = %d, index = %d, start_index = %d", dbn, io_size, set_number, invalid, start_index);
                *index = invalid;
        } else if (oldest_clean != -1) {
                DPRINTK("Cache lookup MISS (VALID): dbn %llu(%lu), set = %d, index = %d, start_index = %d",
                             dbn, io_size, set_number, oldest_clean, start_index);
                *index = oldest_clean;
        } else {
                DPRINTK_LITE("Cache read lookup MISS (NOROOM): dbn %llu(%lu), set = %d",
                        dbn, io_size, set_number);
        }
        if (*index < (start_index + dmc->assoc))
                return INVALID;
        else {
                dmc->flashcache_stats.noroom++;
                return -1;
        }
}

注意,这就是寻找cacheblock的算法了。第一步是要寻找合适的set,因为flashcache默认情况下每个set有512个cache block,首先要定位到某个cache set,然后在该cache set中确定合适的cache block。通俗点说,就是分两步走:

  • 找到合适的cache set
  • 从该cache set中找到合适的cache block

第一步比较简单,根据bio的扇区号,计算hash,然后映射到对应的cache set:

unsigned long   
hash_block(struct cache_c *dmc, sector_t dbn)
{
        unsigned long set_number, value;
        int num_cache_sets = dmc->size >> dmc->assoc_shift;

        /*
         * Starting in Flashcache SSD Version 3 :
         * We map a sequential cluster of disk_assoc blocks onto a given set.
         * But each disk_assoc cluster can be randomly placed in any set.
         * But if we are running on an older on-ssd cache, we preserve old
         * behavior.
         */
        if (dmc->on_ssd_version < 3 || dmc->disk_assoc == 0) {
                value = (unsigned long)
                        (dbn >> (dmc->block_shift + dmc->assoc_shift));
        } else {
                /*我们走本分支*/
                value = (unsigned long) (dbn >> dmc->disk_assoc_shift);
                /* Then place it in a random set */
                value = jhash_1word(value, 0xbeef);
        }
        set_number = value % num_cache_sets;
        DPRINTK("Hash: %llu(%lu)->%lu", dbn, value, set_number);                                                                                       
        return set_number;
}

我们走else分支,这里面有一个参数,初看flashcache不容易理解,即disk_assoc_shift,这个参数在创建flashcache的时候可以指定disk_associativity :

root@XMT-S02:~# dmsetup table
osd4: 0 70316455903 flashcache conf:
	ssd dev (/dev/disk/by-partlabel/osd4-ssd), disk dev (/dev/disk/by-partlabel/osd4-data) cache mode(WRITE_BACK)
	capacity(446572M), associativity(512), data block size(4K) metadata block size(4096b)
	disk assoc(256K)
	skip sequential thresh(32K)
	total blocks(114322432), cached blocks(96119380), cache percent(84)
	dirty blocks(41155646), dirty percent(35)
	nr_queued(0)

我们看到,默认情况下,disk assoc的值是256K。事实上,这个控制选项正是在寻找合适的cache set这一步发挥作用:如果没有这个选项,直接拿dbn进行hash,然后map到cache set,相邻的两个dbn可能压根就不会位于同一个cache set,那么将来对同一个cache set的io进行merge也就没啥必要了,因为相邻的dbn落在同一个set的可能性并不大。

有了这个disk assoc参数就不同了,它hash之前,首先执行:

value = (unsigned long) (dbn >> dmc->disk_assoc_shift);

它确保的是,在同一个256KB块内的扇区,最终会得到同一个value,然后hash会map到同一个cache set,将来就有可能将相邻的请求merge,从而提高性能。

除了此处不太好理解以外,其他基本就是算出hash值,然后对cache set的个数求余,来决定落在哪个cache set中。
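
用一小段示意代码把这两步(先按disk assoc对齐,再hash后取余)串起来,可以直观地看到相邻扇区会落到同一个cache set(jhash_1word此处用Python内置的hash代替,cache set个数取前面dmsetup输出推算出来的示意值):

DISK_ASSOC_SHIFT = 9       # 256KB / 每扇区512B = 512个扇区,即右移9位
NUM_CACHE_SETS = 223286    # 示意值:114322432个block / 512

def pick_cache_set(dbn):
    """示意:flashcache选择cache set的两步"""
    value = dbn >> DISK_ASSOC_SHIFT       # 同一个256KB块内的扇区得到同一个value
    return hash(value) % NUM_CACHE_SETS   # 真实实现用的是jhash_1word(value, 0xbeef)

assert pick_cache_set(1000) == pick_cache_set(1001)   # 相邻扇区落在同一个cache set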

第一步已经解决了,接下来是第二步:如何在cache set中找到合适的cache block。

其算法核心可以分成三步:

  • find_valid_dbn
  • find_invalid_dbn
  • find_reclaim_dbn

find_valid_dbn

static void
find_valid_dbn(struct cache_c *dmc, sector_t dbn, 
               int start_index, int *index)
{
        *index = flashcache_hash_lookup(dmc, start_index / dmc->assoc, dbn);
        if (*index == -1)
                return;
        if (dmc->sysctl_reclaim_policy == FLASHCACHE_LRU &&
            ((dmc->cache[*index].cache_state & BLOCK_IO_INPROG) == 0))
                flashcache_lru_accessed(dmc, *index);
        /* 
         * If the block was DIRTY and earmarked for cleaning because it was old, make 
         * the block young again.
         */
        flashcache_clear_fallow(dmc, *index);
}

int
flashcache_hash_lookup(struct cache_c *dmc,
                       int set,
                       sector_t dbn)                                                  
{
        struct cache_set *cache_set = &dmc->cache_sets[set];
        int index;
        struct cacheblock *cacheblk;
        u_int16_t set_ix;
#if 0
        int start_index, end_index, i;
#endif
        
        set_ix = *flashcache_get_hash_bucket(dmc, cache_set, dbn);
        while (set_ix != FLASHCACHE_NULL) {
                index = set * dmc->assoc + set_ix;
                cacheblk = &dmc->cache[index];
                /* Only VALID blocks on the hash queue */
                VERIFY(cacheblk->cache_state & VALID);
                VERIFY((cacheblk->cache_state & INVALID) == 0);
                if (dbn == cacheblk->dbn)
                        return index;
                set_ix = cacheblk->hash_next;
        }
        return -1;
}  

static inline u_int16_t *
flashcache_get_hash_bucket(struct cache_c *dmc, struct cache_set *cache_set, sector_t dbn)  
{
        unsigned int hash = jhash_1word(dbn, 0xfeed);
     
        return &cache_set->hash_buckets[hash % NUM_BLOCK_HASH_BUCKETS];
}

我们已经找到了cache set,默认情况下cache set中有512个cache block,这些cache block中是否有我们需要的扇区呢?

最容易想到的是,逐个cache block比对,看下dbn号是否一致,状态是否是VALID。但是这种方法太蠢,效率太低。正确的方法是hash。

如果cache block中存在有效数据,它会根据对应的扇区号dbn来计算hash,放入cache set中合适的bucket中。这种hash的做法,加速了在cache set内部查找某个dbn是否存在于某个cacheblock的过程。

对于读来讲最完美的情况是,请求要求的数据块,恰巧位于flashcache的SSD 设备中,这种情况称为读命中。如果命中的话,因为该cacheblock的数据,相当于获得一次有效的访问,那么当空间吃紧的时候,应该降低该block被替换出去的概率,即提升其热度。

        if (dmc->sysctl_reclaim_policy == FLASHCACHE_LRU &&
            ((dmc->cache[*index].cache_state & BLOCK_IO_INPROG) == 0))
                flashcache_lru_accessed(dmc, *index);

这个flashcache_lru_accessed函数,即某个cacheblock最近被访问时,需要执行的操作,代码中有一段注释,言简意赅地介绍了这部分的算法:

/* 
 * Block is accessed.
 * 
 * Algorithm :
   if (block is in the warm list) {
       block_lru_refcnt++;
       if (block_lru_refcnt >= THRESHOLD) {
          clear refcnt
          Swap this block for the block at LRU end of hot list
       } else     
          move it to MRU end of the warm list
   }
   if (block is in the hot list)
       move it to MRU end of the hot list
 */

  • 如果block目前在warm list
    • 引用计数++
      • 如果引用计数大于等于门限值(sysctl_lru_promote_thresh),一般是2,则从warm list 移入 hot list的LRU端(最左端)
      • 如果引用计数低于门限值,则从移入 warm list的MRU端,即最右端
  • 如果block 目前在hot list
    • 将该block移入hot list的MRU端,即最右端。
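
把上面的提升/降级规则写成一小段可运行的示意代码(用两个Python list模拟链表,下标0是LRU端、末尾是MRU端,门限值按正文说的2计):

LRU_PROMOTE_THRESH = 2   # 对应sysctl_lru_promote_thresh(示意)

def lru_accessed(block, hot, warm, refcnt):
    """示意:某个cache block被访问后,在hot/warm两条链表间的流动"""
    if block in warm:
        refcnt[block] = refcnt.get(block, 0) + 1
        if refcnt[block] >= LRU_PROMOTE_THRESH and hot:
            refcnt[block] = 0
            warm.remove(block)
            demoted = hot.pop(0)     # hot链表LRU端最冷的块被降级
            warm.append(demoted)     # 放到warm的MRU端
            hot.insert(0, block)     # 本块与之互换,进入hot链表
        else:
            warm.remove(block)
            warm.append(block)       # 移到warm的MRU端
    elif block in hot:
        hot.remove(block)
        hot.append(block)            # 移到hot的MRU端

hot, warm, refcnt = ["h1", "h2"], ["w1", "w2"], {}
lru_accessed("w1", hot, warm, refcnt)   # 第一次访问:w1移到warm的MRU端
lru_accessed("w1", hot, warm, refcnt)   # 第二次访问:w1与hot LRU端的h1互换
print(hot, warm)                        # ['w1', 'h2'] ['w2', 'h1']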

两个链表hot list和warm list,其最左端都是LRU端(Least Recently Used),其最右端是MRU端(Most Recently Used)。一旦需要置换,将某些cacheblock中的内容踢出去,选择的顺序如下:

Warm List LRU --------> Warm List MRU ---------> Hot List LRU -------------> Hot List MRU

代码部分就不列了,简单的链表操作。

从cache set中寻找cache block的第一步就完成了。这种情况是最幸运的一种:要读取的内容所在的扇区,恰好在flashcache的SSD部分中,数据有效(VALID),可以拿到cache block的index,并且因为本次访问,将该cache block的热度提升到合适的位置。

但是也许并没有这么幸运,SSD中没有dbn对应扇区的内容,这种情况下,需要选择一个cache block来盛放即将从慢速设备的扇区中读取上来的内容。这种情况下,第一选择是找一个尚未投入使用的cache block,即INVALID状态的cache block。

Why?

如果不这么做,选择一个VALID状态的cache block,该cache block的内容就会被新的dbn的内容替换,那么该cache block中老的内容,就被逐出SSD了,如果紧接着发来一个访问cache block 中老的dbn扇区的内容的请求,就会造成miss。更恶劣的情况是该cache block的内容是dirty,flashcache 可能不得不先等待dirty内容flush下去之后,方能使用该cache block。

所以当命中已无可能的时候,选择INVALID状态的cache block是上策:

find_invalid_dbn

static int
find_invalid_dbn(struct cache_c *dmc, int set)                                                 
{
        int index = flashcache_invalid_get(dmc, set);

        if (index != -1) {
                if (dmc->sysctl_reclaim_policy == FLASHCACHE_LRU)
                        flashcache_lru_accessed(dmc, index);
                VERIFY((dmc->cache[index].cache_state & FALLOW_DOCLEAN) == 0);
        }    
        return index;
}

寻找INVALID状态的cache block比较简单,因为对于一个cache set而言,所有INVALID状态的cache block都位于以invalid_head为头部的链表,只需要摘下头部的cache block就可以了:

int
flashcache_invalid_get(struct cache_c *dmc, int set)
{
        struct cache_set *cache_set;
        int index;
        struct cacheblock *cacheblk;

        cache_set = &dmc->cache_sets[set];
        index = cache_set->invalid_head;
        if (index == FLASHCACHE_NULL)
                return -1;
        index += (set * dmc->assoc);
        cacheblk = &dmc->cache[index];
        VERIFY(cacheblk->cache_state == INVALID);
        flashcache_invalid_remove(dmc, index);                                                                                                      
        return index;
}

同样的道理,该cache block被选中之后,会从INVALID链表迁移到warm list的MRU端。

其实这种情况还不错,因为还能找到闲置的cache block。但随着flashcache的使用,很可能这种情况也不可得:该cache set下所有的cache block都已投入战局,在该cache set中已经找不到一块闲置的cache block了。

find_reclaim_dbn

这种情况下,就需要从已投入使用的cacheblock中寻找一个牺牲品了,也就是cache block要回收了。由于SSD Dev和Disk Dev大小的关系,不可能所有数据都存入SSD,所以所有的缓存系统都需要缓存替换算法,高效的缓存替换算法,能够获得更大的性能提升。

当需要选择牺牲品的时候,我们长期维护hot list和warm list的操作就有了价值,这些信息给了我们选择牺牲品的依据。

static void 
find_reclaim_dbn(struct cache_c *dmc, int start_index, int *index)
{
        if (dmc->sysctl_reclaim_policy == FLASHCACHE_FIFO)
                flashcache_reclaim_fifo_get_old_block(dmc, start_index, index);
        else /* flashcache_reclaim_policy == FLASHCACHE_LRU */
                flashcache_reclaim_lru_get_old_block(dmc, start_index, index);                                                                      
}

Flashcache目前支持两种策略,FIFO和LRU。我们此处讨论LRU,这种算法就是将最近最不常使用的cache block替换出去。

代码给了注释:

/* 
 * Get least recently used LRU block
 * 
 * Algorithm :
 *      Always pick block from the LRU end of the warm list. 
 *      And move it to the MRU end of the warm list.
 *      If we don't find a suitable block in the "warm" list,
 *      pick the block from the hot list, demote it to the warm
 *      list and move a block from the warm list to the hot list.
 */

总是从warm list的LRU端找,然后把它移到MRU端。如果在warm list找不到合适的,那么从hot list的LRU端找,如果找到,执行demote操作,即将hot list的LRU和warm list的MRU互换位置。

都是一些简单的链表操作,就不在此处贴代码了。

]]>
ceph-mon之Paxos算法(2) 2017-10-04T17:20:40+00:00 Bean Li http://bean-li.github.io/ceph-paxos-2 前言

上一篇文章介绍了一次提案通过的正常流程,尽管流程已经介绍完毕了,但是,总有一些困扰萦绕心头。

accepted_pn到底是什么鬼?

在monitor leader的begin 函数中:

 t->put(get_name(), last_committed+1, new_value);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", last_committed + 1);
  t->put(get_name(), "pending_pn", accepted_pn);

在Peon的handle_begin函数中:

  t->put(get_name(), v, begin->values[v]);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", v);
  t->put(get_name(), "pending_pn", accepted_pn);

将提案encode后存入这一句是有意义的,因为commit阶段要解码这段bufferlist,并提交事务,这好理解;可是后两句,pending_v和pending_pn到底是干嘛滴?后面一直也没下文,也不知道设置pending_v和pending_pn到底有啥用途。

这一步逻辑,其实是用于恢复的。正常情况下,自然不会用到,但是如果有异常发生,Paxos的恢复逻辑需要用到上述的信息。

基本概念

  • PN Proposal Number

Leader当选之后,会执行一次Phase 1过程来确定PN,在其为Leader的过程中,所有的Phase 2共用一个PN。所以省略了大量的Phase 1过程。这也是Paxos能够减小网络开销的原因。

A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm
                                                                                  -- 《Paxos Made Simple》
  • Version

version可以理解为Paxos中的Instance ID。应用层的每一个提案,可以encode成二进制的字节流,作为value,而version或者Instance ID作为键值和该value对应。

需要持久化的数据结构有:

| 名称 | 含义 | 其他 |
|------|------|------|
| last_pn | 上次当选leader后生成的PN | get_new_proposal_number()使用,下次当选后,接着生成 |
| accepted_pn | 我接受过的PN,可能是别的leader提议的PN | peon根据这个值拒绝较小的PN |
| first_committed | 本节点记录的第一个被commit的版本 | 更早的版本(日志),本节点没有了 |
| last_committed | 本节点记录的最后一次被commit的版本 | 往后的版本,未被commit,可能有一个 |
| uncommitted_v | 本节点记录的未commit的版本,如果有,只能等于last_committed+1 | ceph只允许有一个未commit的版本 |
| uncommitted_pn | 未commit的版本对应的PN | 与uncommitted_v、uncommitted_value在一个事务中记录 |
| uncommitted_value | 未commit的版本的内容 | 与uncommitted_v、uncommitted_pn在一个事务中记录 |

注意,上述三个”uncommitted”开头的值,可能压根就不存在,比如正常关机,全部都commit了。

介绍完这些基本概念,我们需要开始考虑异常了。事实上,从时间顺序上讲,这一篇才应该是第一篇,因为整个集群的mon要首先达到一个一致的状态,然后才能有条不紊地执行上一篇文章介绍的步骤。

但是,从认知规律上讲,上一篇讲的内容,是Paxos主干路径,每天进行无数次,而ceph mon恢复到一致的状态,才是异常路径,只有发生异常的时候,才会走到。因此,我们选择了先介绍正常,然后介绍异常,以及从异常中恢复到一致的状态。

注意哈,Leader选举成功之后,会调用collect,这个名字看起来怪怪的,其实是有意义的:可能发生了杂七杂八的异常,现在新的老大也已经选出来了,搜集一下各自的信息,然后让所有成员的状态达成一致。

如果不能理清楚可能会发生哪些异常,单纯流水账一样地阅读collect、handle_collect、handle_last,可能无法体会代码为什么要这么写,为什么集群经过这么几个步骤就能达成一致。

所以,下面我们要从异常出发,看看可能产生哪几种异常,以及如何恢复。

Recovery

当mon leader选举出来之后,会进入到STATE_RECOVERING状态,并调用collect函数,搜集peon的信息,以期互通有无,达成一致。

void Paxos::leader_init()
{
  cancel_events();
  new_value.clear();

  finish_contexts(g_ceph_context, proposals, -EAGAIN);

  logger->inc(l_paxos_start_leader);

  if (mon->get_quorum().size() == 1) {
    state = STATE_ACTIVE;
    return;
  }

  /*进入 recovering状态*/
  state = STATE_RECOVERING;
  lease_expire = utime_t();
  dout(10) << "leader_init -- starting paxos recovery" << dendl;
  
  /*调用collect函数,进入phase 1*/
  collect(0);
}

注意在collect函数中,会生成一个新的PN(Proposal Number)。注意哈,这个编号有要求,要全局唯一,并且单调递增。那么集群这么多节点,mon leader也可能会变动,如何确保PN的这两个特点呢?

version_t Paxos::get_new_proposal_number(version_t gt)
{
  if (last_pn < gt) 
    last_pn = gt;
  
  // update. make it unique among all monitors.
  /*核心的算法在下面四句*/
  last_pn /= 100;
  last_pn++;
  last_pn *= 100;
  last_pn += (version_t)mon->rank;

  // write
  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  t->put(get_name(), "last_pn", last_pn);

  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  t->dump(&f);
  f.flush(*_dout);
  *_dout << dendl;

  logger->inc(l_paxos_new_pn);
  utime_t start = ceph_clock_now(NULL);

  get_store()->apply_transaction(t);

  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_new_pn_latency, end - start);

  dout(10) << "get_new_proposal_number = " << last_pn << dendl;
  return last_pn;
}

做法是:将上次的值除以100去掉上一任leader的rank部分,加1后再乘以100,最后加上该mon的rank值。比如,如果rank值为0,1,2,最开始PN为100,每次触发选举,如果monitor 0存在的话,总是它获胜,那么下一个产生的PN = (100/100+1)*100+0 = 200。如果当前PN为200,再次发生了monitor的选举,但是这一次monitor 0并不在(发生了异常),那么monitor 1就会获胜,新产生的PN为(200/100+1)*100+1 = 301;如果此后monitor 0又成功启动并再次当选,那么新的PN为(301/100+1)*100+0 = 400。
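
把这段运算抄成一个等价的小函数,可以直接验证上面的例子(仅为示意):

def next_pn(last_pn, rank):
    """示意:等价于get_new_proposal_number的核心运算"""
    last_pn //= 100    # 去掉上一任leader的rank部分
    last_pn += 1
    last_pn *= 100
    return last_pn + rank

assert next_pn(100, 0) == 200   # monitor 0 连任
assert next_pn(200, 1) == 301   # monitor 0 不在,monitor 1 当选
assert next_pn(301, 0) == 400   # monitor 0 恢复后再次当选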

注意这个值,只会在leader选举完成后,collect的时候更新一次,当达成一致之后,后面可能有很多的提案,但是这个PN并不会发生变化。

| 步骤 | Leader | Peon | 说明 |
|------|--------|------|------|
| 1 | collect() => | | Leader给quorum中各个peon发送PN以及其他附带信息,告诉peon,请将各自的信息汇报上来 |
| 2 | | <= handle_collect() | Peon同意或者拒绝该PN,中间可能分享已经commit的数据 |
| 3 | handle_last() | | Quorum中peon全部同意leader的PN,才算成功。这个函数会根据peon的信息以及自身的信息,要么重新propose uncommitted的提案,要么将某成员缺失的信息share出去,确保各个成员达成一致 |

下面的内容,根据mon leader down还是Peon down,分开讨论

Peon down

Peon down的话,Leader会检测到。

首先有租约机制:

void Paxos::lease_ack_timeout()
{
  dout(1) << "lease_ack_timeout -- calling new election" << dendl;
  assert(mon->is_leader());
  assert(is_active());
  logger->inc(l_paxos_lease_ack_timeout);
  lease_ack_timeout_event = 0;
  mon->bootstrap();
}

其次,如果Leader已经发送了OP_BEGIN,而peon因为down无法回复OP_ACCEPT消息,则会触发:

void Paxos::accept_timeout()
{
  dout(1) << "accept timeout, calling fresh election" << dendl;
  accept_timeout_event = 0;
  assert(mon->is_leader());
  assert(is_updating() || is_updating_previous() || is_writing() ||
	 is_writing_previous());
  logger->inc(l_paxos_accept_timeout);
  mon->bootstrap();
}

无论是哪一种情况,都会因bootstrap重新选举,选举结束后,原来的Leader仍然是Leader,这时候会调用collect函数。

这里我们分成 Peon Down 和Up两个阶段来讨论

Peon Down

注意,在collect函数中,会生成新的PN :

  accepted_pn = get_new_proposal_number(MAX(accepted_pn, oldpn));
  accepted_pn_from = last_committed;

Peon Down 就意味着 Leader一直都完好如初,而重新选举之后,Leader节点不会发生变化。这意味着所有Peon的数据并不会比Leader更新。

  • last_committed(leader) >= last_committed(peon)
  • accepted_pn(leader) > accepted_pn(peon)

第二条之所以成立,是因为在collect函数中,Leader重新生成了新的PN,因此,leader的accepted_pn要大于所有的Peon的accepted_pn。

timeout事件是在time线程内完成,time线程干活的时候会获取monitor lock,那么可以推断,leader的paxos流程可能被中断的情况包括以下几个点:

  1. Leader处于active状态,未开始任何提案
  2. leader为updating状态,即begin函数已经执行,等待accept中,此时leader有uncommitted数据,并且可能已经有部分accept消息
  3. leader为writing状态,说明已经接收到所有accept消息,即commit_start已经开始执行,事务已经排队等待执行
  4. leader为writing状态,写操作已经执行完成,即事务已经生效,只是回调函数(commit_finish)还没有被执行(回调函数没被执行是因为需要获取monitor lock的锁)

3和4会发生是因为Leader的commit采取了异步的方式:

  get_store()->queue_transaction(t, new C_Committed(this));
  
struct C_Committed : public Context {
  Paxos *paxos;
  explicit C_Committed(Paxos *p) : paxos(p) {}
  void finish(int r) {
    assert(r >= 0);
    Mutex::Locker l(paxos->mon->lock);
    paxos->commit_finish();
  }
};

一旦commit_finish开始执行,就意味着已经持有monitor lock(paxos->mon->lock)。leader不会被中断在refresh状态,因为一旦commit_finish函数开始执行,会将refresh状态执行完成,重新回到active状态,time线程才可能获取到锁执行。

第1种情况不需要处理,并没有什么新的提案在进行中,无需理会。第2种情况下,存在uncommitted数据,Leader会重新开始一个propose的过程。如何做到?

注意哈,下面的注释部分,仅仅考虑Peon Down情况下的第二种情况,即Leader已经发起begin,正在等待OP_ACCEPT消息,可能收到了部分OP_ACCEPT的情况。

void Paxos::collect(version_t oldpn)
{
  // we're recoverying, it seems!
  state = STATE_RECOVERING;
  assert(mon->is_leader());

  /*uncommitted_v uncommitted_pn以及uncommitted_value是个三元组
   *collect也会搜集其他Peon的数据,因此此处为初始化*/
  uncommitted_v = 0;
  uncommitted_pn = 0;
  uncommitted_value.clear();
  peer_first_committed.clear();
  peer_last_committed.clear();

  /*注意哈,考虑第二种情况,Leader自己也有uncommitted数据,因此,这段代码可以得到尚未commit的提案
   * 包括上一轮的PN存放到uncommitted_pn,
   * 上一轮的提案的Instance ID存放到 uncommitted_v,
   * 以及上一轮提案的值存放入uncommitted_value*/
  if (get_store()->exists(get_name(), last_committed+1)) {
    version_t v = get_store()->get(get_name(), "pending_v");
    version_t pn = get_store()->get(get_name(), "pending_pn");
    if (v && pn && v == last_committed + 1) {
      uncommitted_pn = pn;
    } else {
      dout(10) << "WARNING: no pending_pn on disk, using previous accepted_pn " << accepted_pn
	       << " and crossing our fingers" << dendl;
      uncommitted_pn = accepted_pn;
    }
    uncommitted_v = last_committed+1;

    get_store()->get(get_name(), last_committed+1, uncommitted_value);
    assert(uncommitted_value.length());
    dout(10) << "learned uncommitted " << (last_committed+1)
	     << " pn " << uncommitted_pn
	     << " (" << uncommitted_value.length() << " bytes) from myself" 
	     << dendl;

    logger->inc(l_paxos_collect_uncommitted);
  }

  /*重新生成新的PN,这个PN一定大于所有Peon已经接受过的accepted_pn*/
  accepted_pn = get_new_proposal_number(MAX(accepted_pn, oldpn));
  accepted_pn_from = last_committed;
  num_last = 1;
  dout(10) << "collect with pn " << accepted_pn << dendl;

  // send collect
  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
    if (*p == mon->rank) continue;
    
    /*向其他节点发送OP_COLLECT,搜集信息,来使集群恢复到一致的状态*/
    
    MMonPaxos *collect = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_COLLECT,
				       ceph_clock_now(g_ceph_context));
    collect->last_committed = last_committed;
    collect->first_committed = first_committed;
    collect->pn = accepted_pn;
    mon->messenger->send_message(collect, mon->monmap->get_inst(*p));
  }

  // set timeout event
  collect_timeout_event = new C_MonContext(mon, [this](int r) {
	if (r == -ECANCELED)
	  return;
	collect_timeout();
    });
  mon->timer.add_event_after(g_conf->mon_accept_timeout_factor *
			     g_conf->mon_lease,
			     collect_timeout_event);
}


注意,在这种情况下,其他Peon节点的accepted_pn一定会小于新产生的PN,即OP_COLLECT消息体中的PN。我们来看其他Peon节点的反应:

void Paxos::handle_collect(MonOpRequestRef op)
{
  op->mark_paxos_event("handle_collect");

  MMonPaxos *collect = static_cast<MMonPaxos*>(op->get_req());
  dout(10) << "handle_collect " << *collect << dendl;

  assert(mon->is_peon()); // mon epoch filter should catch strays

  // we're recoverying, it seems!
  state = STATE_RECOVERING;

  /*这不会发生,对于我们限定的这种场景*/
  if (collect->first_committed > last_committed+1) {
    dout(2) << __func__
            << " leader's lowest version is too high for our last committed"
            << " (theirs: " << collect->first_committed
            << "; ours: " << last_committed << ") -- bootstrap!" << dendl;
    op->mark_paxos_event("need to bootstrap");
    mon->bootstrap();
    return;
  }

  /*回复OP_LAST消息,将自己的last_committed和first_committed放入消息体内*/
  MMonPaxos *last = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_LAST,
				  ceph_clock_now(g_ceph_context));
  last->last_committed = last_committed;
  last->first_committed = first_committed;
  
  version_t previous_pn = accepted_pn;

  /*注意,collect->pn是选举之后,原来的leader新产生出来的,因此一定会比PEON的accepted_n大*/
  if (collect->pn > accepted_pn) {
    // ok, accept it
    accepted_pn = collect->pn;
    accepted_pn_from = collect->pn_from;
    dout(10) << "accepting pn " << accepted_pn << " from " 
	     << accepted_pn_from << dendl;
  
    MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
    t->put(get_name(), "accepted_pn", accepted_pn);

    dout(30) << __func__ << " transaction dump:\n";
    JSONFormatter f(true);
    t->dump(&f);
    f.flush(*_dout);
    *_dout << dendl;

    logger->inc(l_paxos_collect);
    logger->inc(l_paxos_collect_keys, t->get_keys());
    logger->inc(l_paxos_collect_bytes, t->get_bytes());
    utime_t start = ceph_clock_now(NULL);

    get_store()->apply_transaction(t);

    utime_t end = ceph_clock_now(NULL);
    logger->tinc(l_paxos_collect_latency, end - start);
  } else {
    // don't accept!
    dout(10) << "NOT accepting pn " << collect->pn << " from " << collect->pn_from
	     << ", we already accepted " << accepted_pn
	     << " from " << accepted_pn_from << dendl;
  }
  last->pn = accepted_pn;
  last->pn_from = accepted_pn_from;

  // share whatever committed values we have
  if (collect->last_committed < last_committed)
    share_state(last, collect->first_committed, collect->last_committed);

  // do we have an accepted but uncommitted value?
  //  (it'll be at last_committed+1)
  bufferlist bl;
  
  /*注意,如果已经有Peon回复过OP_ACCEPT消息,那么此处就会走到*/
  if (collect->last_committed <= last_committed &&
      get_store()->exists(get_name(), last_committed+1)) {
    get_store()->get(get_name(), last_committed+1, bl);
    assert(bl.length() > 0);
    dout(10) << " sharing our accepted but uncommitted value for " 
	     << last_committed+1 << " (" << bl.length() << " bytes)" << dendl;
    last->values[last_committed+1] = bl;

    version_t v = get_store()->get(get_name(), "pending_v");
    version_t pn = get_store()->get(get_name(), "pending_pn");
    if (v && pn && v == last_committed + 1) {
      last->uncommitted_pn = pn;
    } else {
      // previously we didn't record which pn a value was accepted
      // under!  use the pn value we just had...  :(
      dout(10) << "WARNING: no pending_pn on disk, using previous accepted_pn " << previous_pn
	       << " and crossing our fingers" << dendl;
      last->uncommitted_pn = previous_pn;
    }

    logger->inc(l_paxos_collect_uncommitted);
  }

  // send reply
  collect->get_connection()->send_message(last);
}

我们以196 197 198集群为例,毫无疑问,196是monitor leader,在这种情况下,把197的mon 关闭,我们会看到:


196节点:
-------
2017-10-04 21:15:26.559490 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442) begin for 1737443 25958 bytes
2017-10-04 21:15:26.559516 7f36cefe9700 30 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442) begin transaction dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "paxos",
          "key": "1737443",
          "length": 25958},
        { "op_num": 1,
          "type": "PUT",
          "prefix": "paxos",
          "key": "pending_v",
          "length": 8},
        { "op_num": 2,
          "type": "PUT",
          "prefix": "paxos",
          "key": "pending_pn",
          "length": 8}],
  "num_keys": 3,
  "num_bytes": 26015}
bl dump:
bl dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "logm",
          "key": "full_432632",
          "length": 15884},
        { "op_num": 1,
          "type": "PUT",
          "prefix": "logm",
          "key": "full_latest",
          "length": 8},
        { "op_num": 2,
          "type": "PUT",
          "prefix": "logm",
          "key": "432633",
          "length": 9882},
        { "op_num": 3,
          "type": "PUT",
          "prefix": "logm",
          "key": "last_committed",
          "length": 8}],
  "num_keys": 4,
  "num_bytes": 25840}
2017-10-04 21:15:26.580022 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442)  sending begin to mon.1
2017-10-04 21:15:26.580110 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442)  sending begin to mon.2




2017-10-04 21:15:26.594622 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442) handle_accept paxos(accept lc 1737442 fc 0 pn 1100 opn 0) v3
2017-10-04 21:15:26.594631 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos updating c 1736921..1737442)  now 0,2 have accepted


2017-10-04 21:15:40.996887 7f36cefe9700 10 mon.oquew@0(electing) e3 win_election epoch 26 quorum 0,2 features 211106232532991
2017-10-04 21:15:40.996955 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) leader_init -- starting paxos recovery
2017-10-04 21:15:40.997144 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) learned uncommitted 1737443 pn 1100 (25958 bytes) from myself
2017-10-04 21:15:40.997172 7f36cefe9700 30 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) get_new_proposal_number transaction dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "paxos",
          "key": "last_pn",
          "length": 8}],
  "num_keys": 1,
  "num_bytes": 20}
2017-10-04 21:15:41.000424 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) get_new_proposal_number = 1200
2017-10-04 21:15:41.000456 7f36cefe9700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) collect with pn 1200






198节点
---------

2017-10-04 21:15:41.042089 7f7c043e3700 10 mon.yvmjl@2(peon).paxos(paxos recovering c 1736921..1737442) handle_collect paxos(collect lc 1737442 fc 1736921 pn 1200 opn 0) v3
2017-10-04 21:15:41.042094 7f7c043e3700 10 mon.yvmjl@2(peon).paxos(paxos recovering c 1736921..1737442) accepting pn 1200 from 0
2017-10-04 21:15:41.042101 7f7c043e3700 30 mon.yvmjl@2(peon).paxos(paxos recovering c 1736921..1737442) handle_collect transaction dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "paxos",
          "key": "accepted_pn",
          "length": 8}],
  "num_keys": 1,
  "num_bytes": 24}
2017-10-04 21:15:41.046361 7f7c043e3700 10 mon.yvmjl@2(peon).paxos(paxos recovering c 1736921..1737442)  sharing our accepted but uncommitted value for 1737443 (25958 bytes)


注意,1737443号议案已经发起,并且收到了两个OP_ACCEPT:0和2,其中0是monitor leader本身,2是198发过来的OP_ACCEPT;1对应的是197的monitor,因为down所以迟迟收不到它的OP_ACCEPT。当196重新当选Leader之后,会发送OP_COLLECT消息到198,而198会接受新的PN 1200(之前是1100),但是它会在OP_LAST消息体中告诉monitor leader:它曾经收到一份1737443号议案,该议案它已经accept,但是尚未commit。

那么monitor leader收到消息之后会怎样呢?

  if (last->pn > accepted_pn) {
    // no, try again.
    dout(10) << " they had a higher pn than us, picking a new one." << dendl;

    // cancel timeout event
    mon->timer.cancel_event(collect_timeout_event);
    collect_timeout_event = 0;

    collect(last->pn);
  } else if (last->pn == accepted_pn) {
  
    /*对于我们构造的这种场景,会走这个分支*/
    // yes, they accepted our pn.  great.
    num_last++;
    dout(10) << " they accepted our pn, we now have " 
	     << num_last << " peons" << dendl;

    
    /*记录下收到的uncommitted三元组*/
    if (last->uncommitted_pn) {
      if (last->uncommitted_pn >= uncommitted_pn &&
	       last->last_committed >= last_committed &&
	       last->last_committed + 1 >= uncommitted_v) {
	         uncommitted_v = last->last_committed+1;
	         uncommitted_pn = last->uncommitted_pn;
	         uncommitted_value = last->values[uncommitted_v];
	         dout(10) << "we learned an uncommitted value for " << uncommitted_v
	                  << " pn " << uncommitted_pn
	                  << " " << uncommitted_value.length() << " bytes"
	                  << dendl;
      } else {
        dout(10) << "ignoring uncommitted value for " << (last->last_committed+1)
                 << " pn " << last->uncommitted_pn
                 << " " << last->values[last->last_committed+1].length() << " bytes"
                 << dendl;
      }
    }
    
    /*如果已经搜集齐了所有的Peon的消息*/
    if (num_last == mon->get_quorum().size()) {
      // cancel timeout event
      mon->timer.cancel_event(collect_timeout_event);
      collect_timeout_event = 0;
      peer_first_committed.clear();
      peer_last_committed.clear();

      // almost...

      /*如果发现uncommitted等于last_committed+1*/
      if (uncommitted_v == last_committed+1 &&
          uncommitted_value.length()) {
          dout(10) << "that's everyone.  begin on old learned value" << dendl;
          
          /*注意后面两句,对于我们说的场景2,leader会把未完成的提案,再次begin,即重新发起一次,确保完成,
           *不过状态是STATE_UPDATING_PREVIOUS,即完成上一轮的情况*/
          state = STATE_UPDATING_PREVIOUS;
          begin(uncommitted_value);
      } else {
      // active!
      dout(10) << "that's everyone.  active!" << dendl;
      extend_lease();
      
      need_refresh = false;
      if (do_refresh()) {
        finish_round();
      }
     }
   }
 } else {
    // no, this is an old message, discard
    dout(10) << "old pn, ignoring" << dendl;
  }

注意哈,无论是否存在某个Peon已经回复了OP_ACCEPT,这个未完成的提案都会通过begin函数,再次发起。

  • 如果一个OP_ACCEPT都没有收到,那么Monitor Leader自己已经记录了uncommitted三元组,不需要通过Peon来学习到这个提案
  • 如果收到了某个OP_ACCEPT信息,那么该Peon在OP_LAST消息体中自然会告诉monitor leader uncommitted 三元组

无论哪种方法,monitor leader在 handle_last函数中都会执行 begin函数,完成上一轮未完成的提案。

2017-10-04 21:15:41.038753 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) handle_last paxos(last lc 1737442 fc 1736921 pn 1200 opn 1100) v3
2017-10-04 21:15:41.038759 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) store_state nothing to commit
2017-10-04 21:15:41.038824 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442)  they accepted our pn, we now have 2 peons
2017-10-04 21:15:41.038835 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) we learned an uncommitted value for 1737443 pn 1100 25958 bytes
2017-10-04 21:15:41.038843 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1736921..1737442) that's everyone.  begin on old learned value
2017-10-04 21:15:41.038848 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos updating-previous c 1736921..1737442) begin for 1737443 25958 bytes
2017-10-04 21:15:41.038868 7f36ce7e8700 30 mon.oquew@0(leader).paxos(paxos updating-previous c 1736921..1737442) begin transaction dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "paxos",
          "key": "1737443",
          "length": 25958},
        { "op_num": 1,
          "type": "PUT",
          "prefix": "paxos",
          "key": "pending_v",
          "length": 8},
        { "op_num": 2,
          "type": "PUT",
          "prefix": "paxos",
          "key": "pending_pn",
          "length": 8}],
  "num_keys": 3,
  "num_bytes": 26015}
bl dump:
{ "ops": [
        { "op_num": 0,
          "type": "PUT",
          "prefix": "logm",
          "key": "full_432632",
          "length": 15884},
        { "op_num": 1,
          "type": "PUT",
          "prefix": "logm",
          "key": "full_latest",
          "length": 8},
        { "op_num": 2,
          "type": "PUT",
          "prefix": "logm",
          "key": "432633",
          "length": 9882},
        { "op_num": 3,
          "type": "PUT",
          "prefix": "logm",
          "key": "last_committed",
          "length": 8}],
  "num_keys": 4,
  "num_bytes": 25840}

2017-10-04 21:15:41.057345 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos updating-previous c 1736921..1737442)  sending begin to mon.2

花了很长的篇幅,终于介绍完了当Peon down的时候的第二种情形。下面我们来考虑第三和第四种情况。

3. leader为writing状态,说明已经接收到所有accept消息,即commit_start已经开始执行,事务已经排队等待执行
  4. leader为writing状态,写操作已经执行完成,即事务已经生效,只是回调函数(commit_finish)还没有被执行(回调函数没被执行,是因为它需要先获得monitor的锁)

注意,第3和4种情况会等待已经在writing状态的数据commit完成后,才会重新选举:

void Monitor::wait_for_paxos_write()
{
  if (paxos->is_writing() || paxos->is_writing_previous()) {
    dout(10) << __func__ << " flushing pending write" << dendl;
    lock.Unlock();
    store->flush();
    lock.Lock();
    dout(10) << __func__ << " flushed pending write" << dendl;
  }
}

void Monitor::bootstrap()
{
  dout(10) << "bootstrap" << dendl;
  wait_for_paxos_write();
  ...
  
}

void Monitor::start_election()
{
  dout(10) << "start_election" << dendl;
  wait_for_paxos_write();
  ...
}

对于第三种和第四种情况,Paxos处于writing或者writing_previous状态,这时会先执行store->flush,确保已经处于writing状态的数据commit完成,然后才开始选举。

对于其他的Peon,无论其是否完成commit,Leader自己都已经完成了commit;在handle_last阶段:

 for (map<int,version_t>::iterator p = peer_last_committed.begin();
       p != peer_last_committed.end();
       ++p) {
    if (p->second + 1 < first_committed && first_committed > 1) {
      dout(5) << __func__
	      << " peon " << p->first
	      << " last_committed (" << p->second
	      << ") is too low for our first_committed (" << first_committed
	      << ") -- bootstrap!" << dendl;
      op->mark_paxos_event("need to bootstrap");
      mon->bootstrap();
      return;
    }
    
    /*对于第三第四种情况,mon leader可以将peon缺失的部分share给Peon,让Peon commit这些缺失的部分*/
    if (p->second < last_committed) {
      // share committed values
      dout(10) << " sending commit to mon." << p->first << dendl;
      MMonPaxos *commit = new MMonPaxos(mon->get_epoch(),
					MMonPaxos::OP_COMMIT,
					ceph_clock_now(g_ceph_context));
      share_state(commit, peer_first_committed[p->first], p->second);
      mon->messenger->send_message(commit, mon->monmap->get_inst(p->first));
    }
  }
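
share_state的源码本文没有贴出,这里按照上面的描述给出一段简化的示意代码(CommitMsg、store等名称为笔者假设,并非Ceph源码),说明leader如何把peon缺失的那几个版本塞进OP_COMMIT消息:

// 简化示意:leader把(peer_last_committed, last_committed]区间内的已提交值
// 打包进OP_COMMIT消息发给落后的peon(类型与变量名为笔者假设,并非Ceph源码)
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct CommitMsg {                          // 模拟OP_COMMIT消息体
  std::map<uint64_t, std::string> values;   // 版本号 -> 事务的二进制内容
  uint64_t last_committed = 0;
};

int main() {
  // 模拟leader的存储:版本号 -> 已提交的事务内容
  std::map<uint64_t, std::string> store = {
    {1743403, "tx-a"}, {1743404, "tx-b"}, {1743405, "tx-c"},
  };
  uint64_t leader_last_committed = 1743405;
  uint64_t peer_last_committed   = 1743404;   // mon.2还差一个版本

  CommitMsg commit;
  // 与share_state的思路一致:把peon缺失的版本逐个塞入消息
  for (uint64_t v = peer_last_committed + 1; v <= leader_last_committed; ++v)
    commit.values[v] = store[v];
  commit.last_committed = leader_last_committed;

  for (const auto& [v, bl] : commit.values)
    std::cout << "sharing " << v << " (" << bl.size() << " bytes)" << std::endl;
  return 0;
}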

我们来看下第三、第四种情况下的log。其中197是down掉的Peon,198是正常的Peon,但是没来得及commit。这时候,Leader会发现198缺失1743405这个commit,于是通过share_state函数,将缺失部分塞入消息体,发给198(即mon.2)。



2017-10-04 22:05:44.680463 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405) handle_last paxos(last lc 1743404 fc 1742694 pn 1300 opn 0) v3
2017-10-04 22:05:44.680481 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405) store_state nothing to commit

/*向mon.2(即198)补发缺失的commit*/
2017-10-04 22:05:44.680556 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405)  sending commit to mon.2
2017-10-04 22:05:44.680568 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405) share_state peer has fc 1742694 lc 1743404
2017-10-04 22:05:44.680639 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405)  sharing 1743405 (133 bytes)
2017-10-04 22:05:44.680730 7f36ce7e8700 10 mon.oquew@0(leader).paxos(paxos recovering c 1742694..1743405)  they accepted our pn, we now have 2 peons

Peon up

上面四种情况,讲述的是Peon down之后的4种可能性。当Down的Peon重新Up会发生什么事情呢?

因为Peon down了很长时间,它的很多信息都落后,因此启动的时候,会有sync的过程。这个过程并不是通过 collect -> handle_collect -> handle_last 完成的信息同步,而是在Peon启动的时候,调用sync_start函数,发起数据同步,进入STATE_SYNCHRONIZING状态。这部分内容不打算在此处展开。

数据sync完毕之后,调用sync_finish函数,在该函数中会再次bootstrap,会触发选举,当然,还是原来的leader会获胜。
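
作为补充,这里给出一个极简的流程示意(MonState等状态名和变量均为笔者假设的简化版本,真实的sync过程要复杂得多),帮助理解down掉的Peon重新up之后大致经历的阶段:

// 简化示意:down了很久的peon重新启动后的大致流程
// probing -> synchronizing(分块拉取落后的数据)-> bootstrap/重新选举
// (状态与变量名为笔者假设,仅用于说明流程,并非Ceph源码)
#include <cstdint>
#include <iostream>

enum class MonState { PROBING, SYNCHRONIZING, ELECTING, PEON };

int main() {
  uint64_t my_last_committed     = 1700000;   // down了很久,本地数据落后很多
  uint64_t quorum_last_committed = 1743405;   // 集群(quorum)中最新的版本
  MonState state = MonState::PROBING;

  // probing阶段发现自己落后太多,于是进入synchronizing状态,拉取数据
  if (quorum_last_committed > my_last_committed) {
    state = MonState::SYNCHRONIZING;
    std::cout << "sync_start: behind by "
              << (quorum_last_committed - my_last_committed)
              << " versions" << std::endl;
    my_last_committed = quorum_last_committed;  // 模拟sync完成
  }

  // sync完成后再次bootstrap,触发选举;通常原来的leader仍会胜出,本节点成为peon
  state = MonState::ELECTING;
  std::cout << "sync_finish -> bootstrap -> election" << std::endl;
  state = MonState::PEON;

  if (state == MonState::PEON)
    std::cout << "rejoin quorum as peon, last_committed="
              << my_last_committed << std::endl;
  return 0;
}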

Leader Down

Leader 可能会死在Paxos任意函数的任何地方,这时候,新的选举中,会从Peon中选择rank最小的Peon当新的Leader。和之前一样,我们来考虑,Leader down 和Leader Up这两件事情发生之后,集群如何恢复到一致。

Down

peon在lease超时后会重新选举。leader down的时候,peon可能中断在active或updating状态,而且各个peon的状态并不一定相同,可能一些在active,一些在updating:

  • leader down在active状态,不需要特殊处理
  • leader down在updating状态,如果没有peon已经accept,不需要特殊处理,如果有peon已经accept,新的leader要么自己已经accept,要么会从其他peon学习到,会重新propose
  • leader down在writing状态,说明所有peon已经accept,新的leader会重新propose已经accept的值(此时down的leader可能已经写成功,也可能没有写成功)
  • leader down在refresh状态,down的leader已经写成功,如果有peon已经收到commit消息,新的commit会被新的leader在collect阶段学习到,如果没有peon收到commit消息,会重新propose

对于情况2,如果有些peon已经accept,那么在handle_collect函数中,该peon就会将这些uncommitted三元组发给新的Leader;或者新的Leader自己就曾经accept,从自身也能获得uncommitted三元组,这时候就会调用begin重新propose。

    /*记录下收到的uncommitted三元组*/
    if (last->uncommitted_pn) {
      if (last->uncommitted_pn >= uncommitted_pn &&
	       last->last_committed >= last_committed &&
	       last->last_committed + 1 >= uncommitted_v) {
	         uncommitted_v = last->last_committed+1;
	         uncommitted_pn = last->uncommitted_pn;
	         uncommitted_value = last->values[uncommitted_v];
	         dout(10) << "we learned an uncommitted value for " << uncommitted_v
	                  << " pn " << uncommitted_pn
	                  << " " << uncommitted_value.length() << " bytes"
	                  << dendl;
      } else {
        dout(10) << "ignoring uncommitted value for " << (last->last_committed+1)
                 << " pn " << last->uncommitted_pn
                 << " " << last->values[last->last_committed+1].length() << " bytes"
                 << dendl;
      }
    }
    
    /*如果已经搜集齐了所有的Peon的消息*/
    if (num_last == mon->get_quorum().size()) {
      // cancel timeout event
      mon->timer.cancel_event(collect_timeout_event);
      collect_timeout_event = 0;
      peer_first_committed.clear();
      peer_last_committed.clear();

      // almost...

      /*如果发现uncommitted等于last_committed+1*/
      if (uncommitted_v == last_committed+1 &&
          uncommitted_value.length()) {
          dout(10) << "that's everyone.  begin on old learned value" << dendl;
          
          /*注意后面两句,对于我们说的场景2,leader会把未完成的提案,再次begin,即重新发起一次,确保完成,
           *不过状态是STATE_UPDATING_PREVIOUS,即完成上一轮的情况*/
          state = STATE_UPDATING_PREVIOUS;
          begin(uncommitted_value);
      }
      
      ....
      

对于情况3,和情况2一样,会通过如下代码,重新propose:

      if (uncommitted_v == last_committed+1 &&
          uncommitted_value.length()) {
          dout(10) << "that's everyone.  begin on old learned value" << dendl;
          state = STATE_UPDATING_PREVIOUS;
          begin(uncommitted_value);
      }
      

情况4稍稍复杂一点,因为不确定是否有peon执行过commit。如果没有peon执行过commit,和情况2、3一样,重新propose;但是如果曾经commit过,新的leader会在collect阶段学习到来自某个peon的commit,同时将其他peon缺失的部分通过share_state分享给它们。

Up

leader重新up后,可能在probing阶段就会做一次sync,此时数据可能会同步一部分;再一次被选举成leader之后,collect阶段会同步差异的几个版本数据,同时,如果peon有uncommitted的数据,也会同步给leader,由新的leader重新propose。

唯一需要注意的是leader down时存在的uncommitted数据:由上面的分析可知,如果有peon已经accept,这份数据会被新的leader重新propose;旧leader重新up后,根据pending_v判断,本地暂存的这份pending数据版本较低,会被直接抛弃。如果旧leader已经commit过,peon也一定会commit,所以不会导致数据不一致。
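
关于“根据pending_v,版本较低的pending数据会被抛弃”这一点,可以用下面这段简化示意代码来理解(变量名为笔者假设,判断规则是根据前文begin/handle_begin写入pending_v、pending_pn的逻辑概括出来的,并非逐行对应Ceph源码):

// 简化示意:重新up的旧leader检查本地暂存的pending值:
// 只有pending_v == last_committed + 1时,这份数据才算uncommitted值;
// 否则说明该版本已经被别人commit(或已过期),直接抛弃即可。
// (变量名为笔者假设)
#include <cstdint>
#include <iostream>

int main() {
  // 旧leader down之前暂存的pending数据
  uint64_t pending_v  = 1737443;
  uint64_t pending_pn = 1100;

  // 重新up并完成sync之后,本地的last_committed已经追上集群
  uint64_t last_committed = 1737443;   // 1737443已经被新leader commit过了

  if (pending_v == last_committed + 1) {
    std::cout << "treat v" << pending_v << " (pn " << pending_pn
              << ") as uncommitted, hand it to the new leader" << std::endl;
  } else {
    // pending_v <= last_committed:版本太低,抛弃即可,不会造成不一致
    std::cout << "discard stale pending v" << pending_v << std::endl;
  }
  return 0;
}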

因为上一种情况,已经详细地分析了代码了,对于Leader down 的这种情况,我们就不全面展开了。

尾声

注意,本文大量参考了第一篇参考文献,我基本是按图索骥。我无意抄袭前辈的文章,只是前辈水平太高,很多东西高屋建瓴、语焉不详,对于初学者而言,可能不易领会其含义。本文做了一些展开,将某些内容与代码以及日志输出对应起来,帮助初学者更好地理解。

另外,参考文献2也是非常不错的文章,但是如果不结合可能发生的异常来分析,读Phase 1的代码往往会知其然而不知其所以然,把代码读成流水账。

参考文献

  1. Ceph Monitor Paxos
  2. Ceph的Paxos源码注释 - Phase 1
]]>
ceph-mon之Paxos算法 2017-09-24T17:20:40+00:00 Bean Li http://bean-li.github.io/ceph-paxos 前言

Paxos算法应该算是分布式系统中最赫赫有名的算法了,就如同江湖上那句 “为人不识陈近南,纵称英雄也枉然”,Paxos在分布式中的地位,只会比陈近南在江湖上的地位更高。

按照我的打算,这个PAXOS系列应该有3篇文章。我并不打算一上来就介绍Paxos的原理,因为势必太枯燥:我们小时候学习数学也是从1+1开始,再引申到变量,介绍一元一次方程、二元一次方程,最后引申到行列式、矩阵、线性代数。从逻辑上讲,为什么不直接学习线性代数呢?因为不直观,而且不符合人类认知事物的规律。

首先要介绍下,为什么ceph-mon需要Paxos算法。举个简单的例子,如果两个client都需要写入同一个cephfs上的文件,那么它们都需要OSDMap,因为必须根据OSDMap和文件名来决定要写到哪些OSD上。注意client A和client B看到的OSDMap必须是一致的,否则两个client可能把数据写到不同的OSD上,造成数据不一致。

因此我们看出来了,对于分布式存储来讲,一致性( consensus )是一个强需求。而对于分布式consensus来讲,几乎就等同于Paxos。

世界上只有一种一致性协议,就是Paxos

其他协议要么是paxos的简化,要么是错误的

本文是第一篇,用来介绍正常的一次Proposal应该是怎么样的。

Paxos 规则

角色

  • Proposer 提案者,它可以提出议案
  • Proposal 未被批准的决议称为提案,由Proposer提出,一个提案由一个编号和value形成的对组成,编号非常重要,保证提案的可区分性。
  • Acceptor 提案的受理者,可以简单理解为独立法官,有权决定接受收到的提案还是拒绝提案。当然接受还是拒绝是有一定的规则的。
  • Choose 提案被批准,被选定。当有半数以上Acceptor接受该提案时,就认为该提案被选定了
  • Learner 旁观者,即需要知道最终哪个提案被选定的那些人。Learner只能获取到被批准的提案。

算法

这里并不打算推导Paxos算法,或者证明算法的正确性,只介绍怎么做:

  1. P1: 一个acceptor必须通过(accept)它收到的第一个提案。

    P1a:当且仅当acceptor没有回应过编号大于n的prepare请求时,acceptor接受(accept)编号为n的提案。
    
  2. P2: 如果具有value值v的提案被选定(chosen)了,那么所有比它编号更高的被选定的提案的value值也必须是v。

    P2c:如果一个编号为n的提案具有value v,那么存在一个多数派,要么他们中所有人都没有接受(accept)编号小于n的任何提案,要么他们已经接受(accept)的所有编号小于n的提案中编号最大的那个提案具有value v。
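
为了更直观地理解P1a,下面给出一段与Ceph实现无关的、极简的Acceptor示意代码(类名与字段均为笔者假设),只演示“没有响应过编号更大的prepare请求,才接受编号为n的提案”这条规则:

// 极简Paxos Acceptor示意(与Ceph实现无关,类名与字段为笔者假设):
// prepare(n):只承诺编号不小于已承诺编号的请求;
// accept(n, v):只接受编号不小于已承诺编号的提案,即P1a。
#include <cstdint>
#include <iostream>
#include <string>

class Acceptor {
  uint64_t promised_pn = 0;     // 已经回应过的最大prepare编号
  uint64_t accepted_pn = 0;     // 已接受提案的编号
  std::string accepted_value;   // 已接受提案的值

public:
  bool prepare(uint64_t n) {
    if (n < promised_pn) return false;   // 回应过更大编号的prepare,拒绝
    promised_pn = n;
    return true;
  }

  bool accept(uint64_t n, const std::string& v) {
    if (n < promised_pn) return false;   // P1a:有更大编号的承诺在先,拒绝
    promised_pn = accepted_pn = n;
    accepted_value = v;
    return true;
  }
};

int main() {
  Acceptor a;
  std::cout << std::boolalpha;
  std::cout << a.prepare(100) << std::endl;               // true:承诺编号100
  std::cout << a.accept(100, "osdmap v11") << std::endl;  // true:接受(100, v)
  std::cout << a.prepare(90) << std::endl;                // false:已承诺过100
  std::cout << a.accept(200, "osdmap v12") << std::endl;  // true:更大编号,接受
  return 0;
}

后文ceph代码里handle_begin中“begin->pn < accepted_pn就忽略”的判断,本质上就是这条规则。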

ceph中的 Paxos 实现

截止到本文,只会以正常流程为主,并不会介绍异常恢复过程,那是下一篇的主题。我们学习下面内容的时候,要注意两点

  • 代码如何实现的Paxos的算法,和上一节的内容对应
  • 正常情况下的代码,做了那些准备工作,看似无用,其实用于异常发生时的恢复

何时需要发起提案Proposal

Paxos的Trigger点,也就是需要发起提案的时机。ceph中需要发起提案的地方,大抵有以下三种:

  • ConfigKeyService在修改或删除key/value对的时候。

    ceph提供了分布式的key-value服务,这个服务将ceph-mon当成存储k/v的黑盒子。用户可以使用如下命令操作k/v:

    ceph config-key put key value 
    ceph config-key get key
    ceph config-key del key
    

    ceph相关的函数接口在ConfigKeyService::store_put和store_delete

      void ConfigKeyService::store_put(string key, bufferlist &bl, Context *cb)
      {
        bufferlist proposal_bl;
        MonitorDBStore::TransactionRef t = paxos->get_pending_transaction();
        t->put(STORE_PREFIX, key, bl);
        if (cb)
          paxos->queue_pending_finisher(cb);
        paxos->trigger_propose();
      }
    	
      void ConfigKeyService::store_delete(string key, Context *cb)
      {
        bufferlist proposal_bl;
        MonitorDBStore::TransactionRef t = paxos->get_pending_transaction();
        t->erase(STORE_PREFIX, key);
        if (cb)
          paxos->queue_pending_finisher(cb);
        paxos->trigger_propose();
      }
    
  • Paxos以及PaxosService对数据做trim的时候,trim的目的是为了节省存储空间,参见Paxos::trim和PaxosService::maybe_trim

    注意,PaxosService是在Paxos基础上,封装了一些接口,用来构建基于Paxos的服务,早期的版本有六大PaxosService,如下图所示。

    这些PaxosService,为了节省存储空间,也会通过调用maybe_trim来删除一些太老太旧的数据:

     	void Monitor::tick()
      {
        // ok go.
        dout(11) << "tick" << dendl;
    	  
        for (vector<PaxosService*>::iterator p = paxos_service.begin(); p != paxos_service.end(); ++p) {
          (*p)->tick();
          (*p)->maybe_trim();
        }
       ...  
    }
    

    因此,每个PaxosService都要定义自己的maybe_trim函数。

  • PaxosService的各种服务,需要更新值的时候,参见PaxosService::propose_pending

需要发起proposal的场合,主要是上面提到的这几种。在决定做proposal之前,都会将操作封装成事务,存放在Paxos类的成员变量pending_proposal中。

  /**
   * Pending proposal transaction
   *
   * This is the transaction that is under construction and pending
   * proposal.  We will add operations to it until we decide it is
   * time to start a paxos round.
   */
  MonitorDBStore::TransactionRef pending_proposal;
  
  /**
   * Finishers for pending transaction
   *
   * These are waiting for updates in the pending proposal/transaction
   * to be committed.
   */
  list<Context*> pending_finishers;

  /**
   * Finishers for committing transaction
   *
   * When the pending_proposal is submitted, pending_finishers move to
   * this list.  When it commits, these finishers are notified.
   */
  list<Context*> committing_finishers;

事务操作pending_proposal会被编码到bufferlist中,作为此次决议的值,存放在paxos相关的k/v中:key为版本号,value为bufferlist二进制数据。commit的时候需要将bufferlist中的二进制数据还原成transaction,然后执行其中的操作,即让决议的值反映在各个服务中,更新相关map。

也就是说,事务操作的内容会被编码成bufferlist,这个二进制数据流作为value,而key为版本号,二者共同构成paxos的提案。
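
下面用一段很短的示意代码来说明“事务 -> 二进制 -> 以版本号为key暂存 -> commit时再decode执行”这条主线(Transaction、encode/decode等类型和函数都是笔者为演示而假设的朴素实现,并非Ceph中真正的编码格式):

// 简化示意:把一组操作(事务)编码成二进制串,以版本号为key暂存;
// commit时再取出、解码并逐条执行。(类型与编码格式均为笔者假设)
#include <cstdint>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// 一个极简“事务”:若干条 put(key -> value)
using Transaction = std::map<std::string, std::string>;

std::string encode(const Transaction& t) {        // 模拟 t->encode(bl)
  std::ostringstream os;
  for (const auto& [k, v] : t) os << k << '=' << v << '\n';
  return os.str();
}

Transaction decode(const std::string& bl) {       // 模拟commit时的还原
  Transaction t;
  std::istringstream is(bl);
  std::string line;
  while (std::getline(is, line)) {
    auto pos = line.find('=');
    t[line.substr(0, pos)] = line.substr(pos + 1);
  }
  return t;
}

int main() {
  std::map<uint64_t, std::string> paxos_store;    // 版本号 -> 编码后的事务
  std::map<std::string, std::string> db;          // 最终生效的各服务数据

  Transaction pending = {{"logm/full_latest", "432632"},
                         {"logm/last_committed", "432633"}};
  uint64_t last_committed = 10;

  // propose:只把编码结果以last_committed+1为key暂存,并不执行
  paxos_store[last_committed + 1] = encode(pending);

  // commit:根据版本号取回、decode,并让事务真正生效
  for (const auto& [k, v] : decode(paxos_store[last_committed + 1])) db[k] = v;
  std::cout << "applied " << db.size() << " keys for version "
            << last_committed + 1 << std::endl;
  return 0;
}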

注意,很多逻辑在Paxos提案全过程完成之后,还需要执行一些回调函数,这些回调会暂时放入pending_finishers列表;一旦Paxos的滚滚车轮启动(提案真正发起),它们就会被移入committing_finishers列表。

bool Paxos::trigger_propose()
{
  if (is_active()) {
    dout(10) << __func__ << " active, proposing now" << dendl;
    propose_pending();
    return true;
  } else {
    dout(10) << __func__ << " not active, will propose later" << dendl;
    return false;
  }
}

void Paxos::propose_pending()
{
  assert(is_active());
  assert(pending_proposal);

  cancel_events();

  bufferlist bl;
  pending_proposal->encode(bl);

  dout(10) << __func__ << " " << (last_committed + 1)
	   << " " << bl.length() << " bytes" << dendl;
  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  pending_proposal->dump(&f);
  f.flush(*_dout);
  *_dout << dendl;

  /*pending_proposal 就可以reset了*/
  pending_proposal.reset();

  /*已经开始处理,因此,将pending_finishers的内容移入committing_finishers*/
  committing_finishers.swap(pending_finishers);
  
  /*注意,调用begin之前,先将状态改成STATE_UPDATING*/
  state = STATE_UPDATING;
  begin(bl);
}

介绍了这些基本知识之后,可以看下Paxos决议的整体流程了。整个流程的起点是void Paxos::begin(bufferlist& v)。注意,这个函数只能由mon leader发起,Peon不会调用begin函数提出议案。

当然了,Paxos算法并未规定,只能有一个Proposer,但是ceph的实现通过只允许mon leader发起提案,简化了代码处理的流程。

Paxos 正常工作流程

整体的流程如下图所示:

begin

void Paxos::begin(bufferlist& v)
{
  dout(10) << "begin for " << last_committed+1 << " " 
	   << v.length() << " bytes"
	   << dendl;

  /*只有mon leader才能调用begin,提出提案*/
  assert(mon->is_leader());
  assert(is_updating() || is_updating_previous());

  // we must already have a majority for this to work.
  assert(mon->get_quorum().size() == 1 ||
	 num_last > (unsigned)mon->monmap->size()/2);
  
  // and no value, yet.
  assert(new_value.length() == 0);

  /*刚刚发起提案,目前还没有收到任何Acceptor的接受提案的信息*/
  accepted.clear();
  /*在接受提案的Acceptor中插入mon leader自己,因为自己的提案,自己不会拒绝*/
  accepted.insert(mon->rank);
  
  /*将 new_value 赋值为v,即将事务encode得到的bufferlist*/
  new_value = v;

  /*第一个commit,只有第一次提出提案的时候才会遇到*/
  if (last_committed == 0) {
    MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
    // initial base case; set first_committed too
    t->put(get_name(), "first_committed", 1);
    decode_append_transaction(t, new_value);

    bufferlist tx_bl;
    t->encode(tx_bl);

    new_value = tx_bl;
  }

  // store the proposed value in the store. IF it is accepted, we will then
  // have to decode it into a transaction and apply it.
  
  /*注意接下来的三个put操作是begin的一个关键地方:先将事务encode过的bufferlist以last_committed+1为key暂存到DB中,并记录pending_v和pending_pn*/
  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  t->put(get_name(), last_committed+1, new_value);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", last_committed + 1);
  t->put(get_name(), "pending_pn", accepted_pn);

  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  t->dump(&f);
  f.flush(*_dout);
  MonitorDBStore::TransactionRef debug_tx(new MonitorDBStore::Transaction);
  bufferlist::iterator new_value_it = new_value.begin();
  debug_tx->decode(new_value_it);
  debug_tx->dump(&f);
  *_dout << "\nbl dump:\n";
  f.flush(*_dout);
  *_dout << dendl;

  logger->inc(l_paxos_begin);
  logger->inc(l_paxos_begin_keys, t->get_keys());
  logger->inc(l_paxos_begin_bytes, t->get_bytes());
  utime_t start = ceph_clock_now(NULL);

  get_store()->apply_transaction(t);

  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_begin_latency, end - start);

  assert(g_conf->paxos_kill_at != 3);

  if (mon->get_quorum().size() == 1) {
    // we're alone, take it easy
    commit_start();
    return;
  }

  // ask others to accept it too!
  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
    if (*p == mon->rank) continue;
    
    dout(10) << " sending begin to mon." << *p << dendl;
    MMonPaxos *begin = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_BEGIN,
				     ceph_clock_now(g_ceph_context));
    begin->values[last_committed+1] = new_value;
    begin->last_committed = last_committed;
    begin->pn = accepted_pn;
    
    mon->messenger->send_message(begin, mon->monmap->get_inst(*p));
  }

  /*注册超时*/
  accept_timeout_event = new C_MonContext(mon, [this](int r) {
      if (r == -ECANCELED)
	return;
      accept_timeout();
    });
  mon->timer.add_event_after(g_conf->mon_accept_timeout_factor *
			     g_conf->mon_lease,
			     accept_timeout_event);
}

注意,下面的代码是begin函数的关键:

  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  t->put(get_name(), last_committed+1, new_value);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", last_committed + 1);
  t->put(get_name(), "pending_pn", accepted_pn);
  
  ...
  
  utime_t start = ceph_clock_now(NULL);

  get_store()->apply_transaction(t);

  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_begin_latency, end - start);

首先,将要执行的transaction encode成的bufferlist保存下来,并不真正执行,仅仅是记录下来,而这条信息以last_committed+1作为key。一旦超过半数的Acceptor通过提案,就可以从leveldb或者rocksdb中根据last_committed+1取出要执行的事务。

我们以如下值为例,介绍整个流程。

first_committed = 1
last_committed = 10
accepted_pn = 100

此次提案会新增如下信息到mon leader的 MonitorDBStore

# 此次提议增加的数据
v11=new_value; # 11是last_committed+1的值,这里key会有前缀,简单以v代替,new_value是最终事务编码过的bufferlist
pending_v=11
pending_pn=100

注意 get_store()->apply_transaction(t)执行之后,上述三个值就写入了mon leader的DB中了。

接下来的事情是向Peon发送OP_BEGIN消息,请Acceptor审核提案。

  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
       
    /*leader不必向自己发送*/
    if (*p == mon->rank) continue;
    
    dout(10) << " sending begin to mon." << *p << dendl;
    MMonPaxos *begin = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_BEGIN,
				     ceph_clock_now(g_ceph_context));
				     
	 /*将new_value和last_committed+1作为k/v对,发送给Peon*/
    begin->values[last_committed+1] = new_value;
   
    /*这两个值将来辅助Peon做决策,决定是否接受该提案*/
    begin->last_committed = last_committed;
    begin->pn = accepted_pn;
    
    mon->messenger->send_message(begin, mon->monmap->get_inst(*p));
  }

begin函数有特例,即整个集群只有一个mon,那么就可以跳过搜集其他Acceptor接受与否的过程,直接进入commit阶段:

  /*只有自己存在,就没有必要征求意见了*/
  if (mon->get_quorum().size() == 1) {
    // we're alone, take it easy
    commit_start();
    return;
  }

handle_begin

Peon收到OP_BEGIN消息之后,开始处理。

Peon只会处理pn>= accepted_pn的提案,否则就会拒绝该提案:

  // can we accept this?
  if (begin->pn < accepted_pn) {
    dout(10) << " we accepted a higher pn " << accepted_pn << ", ignoring" << dendl;
    op->mark_paxos_event("have higher pn, ignore");
    return;
  }
  
  assert(begin->pn == accepted_pn);
  assert(begin->last_committed == last_committed);
  
  assert(g_conf->paxos_kill_at != 4);

  logger->inc(l_paxos_begin);

  /*将状态改成STATE_UPDATING*/
  state = STATE_UPDATING;
  lease_expire = utime_t();  // cancel lease

对于Peon来讲:

first_committed = 1
last_committed =10 
accepted_pn = 100

v11=new_value
pending_v=11
pending_pn=100

当Peon决定接受提案的时候,将会把new_value暂时保存到DB(leveldb或rocksdb)中,做的事情和mon leader是一致的:

  // yes.
  version_t v = last_committed+1;
  dout(10) << "accepting value for " << v << " pn " << accepted_pn << dendl;
  // store the accepted value onto our store. We will have to decode it and
  // apply its transaction once we receive permission to commit.
  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  t->put(get_name(), v, begin->values[v]);

  // note which pn this pending value is for.
  t->put(get_name(), "pending_v", v);
  t->put(get_name(), "pending_pn", accepted_pn);
  
  ....
  
  logger->inc(l_paxos_begin_bytes, t->get_bytes());
  utime_t start = ceph_clock_now(NULL);

  get_store()->apply_transaction(t);

  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_begin_latency, end - start);

接下来,就可以将接受提案的消息发送给mon leader,即发送OP_ACCEPT消息给mon leader。

  // reply
  MMonPaxos *accept = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_ACCEPT,
				    ceph_clock_now(g_ceph_context));
  accept->pn = accepted_pn;
  accept->last_committed = last_committed;
  begin->get_connection()->send_message(accept);

handle_accept

mon leader自从向所有的peon发送了OP_BEGIN消息之后,就望穿秋水地等待回应。

// leader
void Paxos::handle_accept(MonOpRequestRef op)
{
  op->mark_paxos_event("handle_accept");
  MMonPaxos *accept = static_cast<MMonPaxos*>(op->get_req());
  dout(10) << "handle_accept " << *accept << dendl;
  int from = accept->get_source().num();

  if (accept->pn != accepted_pn) {
    // we accepted a higher pn, from some other leader
    dout(10) << " we accepted a higher pn " << accepted_pn << ", ignoring" << dendl;
    op->mark_paxos_event("have higher pn, ignore");
    return;
  }
  if (last_committed > 0 &&
      accept->last_committed < last_committed-1) {
    dout(10) << " this is from an old round, ignoring" << dendl;
    op->mark_paxos_event("old round, ignore");
    return;
  }
  assert(accept->last_committed == last_committed ||   // not committed
	 accept->last_committed == last_committed-1);  // committed

  assert(is_updating() || is_updating_previous());
  assert(accepted.count(from) == 0);
  accepted.insert(from);
  dout(10) << " now " << accepted << " have accepted" << dendl;

  assert(g_conf->paxos_kill_at != 6);

  // only commit (and expose committed state) when we get *all* quorum
  // members to accept.  otherwise, they may still be sharing the now
  // stale state.
  // FIXME: we can improve this with an additional lease revocation message
  // that doesn't block for the persist.
  

  if (accepted == mon->get_quorum()) {
    // yay, commit!
    dout(10) << " got majority, committing, done with update" << dendl;
    op->mark_paxos_event("commit_start");
    commit_start();
  }
}

首先会做一些检查,比如accept->pn和accepted_pn是否相等之类的。如果通过检查,会将对应的peon放入accepted中,表示已经收到了来自该peon的消息,该peon已经同意该提案。

注意,和一般的Paxos不同的是,mon leader要收到所有的peon的OP_ACCEPT之后,才会进入下一阶段,而不是半数以上。

  /*要收到所有的peon的OP_ACCEPT,才会进入到commit阶段*/
  if (accepted == mon->get_quorum()) {
    // yay, commit!
    dout(10) << " got majority, committing, done with update" << dendl;
    op->mark_paxos_event("commit_start");
    commit_start();
  }

leader在begin函数中,为了防止无法及时收集齐所有的OP_ACCEPT消息,注册了超时事件:

  // set timeout event
  accept_timeout_event = new C_MonContext(mon, [this](int r) {
      if (r == -ECANCELED)
	return;
      accept_timeout();
    });
  mon->timer.add_event_after(g_conf->mon_accept_timeout_factor *
			     g_conf->mon_lease,
			     accept_timeout_event);
			     
OPTION(mon_lease, OPT_FLOAT, 5)       // lease interval
OPTION(mon_accept_timeout_factor, OPT_FLOAT, 2.0)    // on leader, if paxos update isn't accepted

也就是说,在mon_accept_timeout_factor(2.0) × mon_lease(5秒) = 10秒之内,如果不能收到所有的OP_ACCEPT,mon leader就会调用accept_timeout函数,进而调用mon->bootstrap,重新选举。

void Paxos::accept_timeout()
{
  dout(1) << "accept timeout, calling fresh election" << dendl;
  accept_timeout_event = 0;
  assert(mon->is_leader());
  assert(is_updating() || is_updating_previous() || is_writing() ||
	 is_writing_previous());
  logger->inc(l_paxos_accept_timeout);
  mon->bootstrap();
}

commit_start

当mon leader调用commit_start的时候,表示走到了第二阶段。和二阶段提交有点类似,该提案已经得到了全部peon的同意,因此可以大刀阔斧地将真正的事务提交,让提案生效。

void Paxos::commit_start()
{
  dout(10) << __func__ << " " << (last_committed+1) << dendl;

  assert(g_conf->paxos_kill_at != 7);

  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);

  // commit locally
  /*last_committed的值 自加*/
  t->put(get_name(), "last_committed", last_committed + 1);

  // decode the value and apply its transaction to the store.
  // this value can now be read from last_committed.
  
  /*事务编码之后的bufferlist之前存储到了new_value这个成员,将事务decode,并追加到transaction中*/
  decode_append_transaction(t, new_value);

  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  t->dump(&f);
  f.flush(*_dout);
  *_dout << dendl;

  logger->inc(l_paxos_commit);
  logger->inc(l_paxos_commit_keys, t->get_keys());
  logger->inc(l_paxos_commit_bytes, t->get_bytes());
  commit_start_stamp = ceph_clock_now(NULL);

  /*让事务生效,注意,此处是异步调用*/
  get_store()->queue_transaction(t, new C_Committed(this));

  if (is_updating_previous())
    state = STATE_WRITING_PREVIOUS;
  else if (is_updating())
    state = STATE_WRITING;
  else
    assert(0);

  if (mon->get_quorum().size() > 1) {
    // cancel timeout event
    mon->timer.cancel_event(accept_timeout_event);
    accept_timeout_event = 0;
  }
}

此处事务的处理是异步的,调用了MonitorDBStore的queue_transaction函数。当事务完成之后,会调用相关的回调函数。

  void queue_transaction(MonitorDBStore::TransactionRef t,
			 Context *oncommit) {
    io_work.queue(new C_DoTransaction(this, t, oncommit));
  }

注意,当将事务放入队列之后,状态从UPDATING切换成了 STATE_WRITING。

回调函数定义在:

struct C_Committed : public Context {
  Paxos *paxos;
  explicit C_Committed(Paxos *p) : paxos(p) {}
  void finish(int r) {
    assert(r >= 0);
    Mutex::Locker l(paxos->mon->lock);
    paxos->commit_finish();
  }
};

注意,事务完成之后,会调用commit_finish函数。

commit_finish函数

这个函数主要做三件事:

  • 将内存中last_committed值+1
  • 向peon发送commit消息
  • 设置状态为refresh,刷新PaxosService服务

void Paxos::commit_finish()
{
  dout(20) << __func__ << " " << (last_committed+1) << dendl;
  utime_t end = ceph_clock_now(NULL);
  logger->tinc(l_paxos_commit_latency, end - commit_start_stamp);

  assert(g_conf->paxos_kill_at != 8);

  // cancel lease - it was for the old value.
  //  (this would only happen if message layer lost the 'begin', but
  //   leader still got a majority and committed with out us.)
  lease_expire = utime_t();  // cancel lease

  /*last_committed可以自加了*/
  last_committed++;
  last_commit_time = ceph_clock_now(NULL);

  // refresh first_committed; this txn may have trimmed.
  first_committed = get_store()->get(get_name(), "first_committed");

  _sanity_check_store();

  /*给所有的peon发送OP_COMMIT消息*/
  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
    if (*p == mon->rank) continue;

    dout(10) << " sending commit to mon." << *p << dendl;
    MMonPaxos *commit = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_COMMIT,
				      ceph_clock_now(g_ceph_context));
    commit->values[last_committed] = new_value;
    commit->pn = accepted_pn;
    commit->last_committed = last_committed;

    mon->messenger->send_message(commit, mon->monmap->get_inst(*p));
  }

  assert(g_conf->paxos_kill_at != 9);

  // get ready for a new round.
  new_value.clear();

  // WRITING -> REFRESH
  // among other things, this lets do_refresh() -> mon->bootstrap() know
  // it doesn't need to flush the store queue
  assert(is_writing() || is_writing_previous());
  state = STATE_REFRESH;

  if (do_refresh()) {
    commit_proposal();
    if (mon->get_quorum().size() > 1) {
      extend_lease();
    }

    finish_contexts(g_ceph_context, waiting_for_commit);

    assert(g_conf->paxos_kill_at != 10);

    finish_round();
  }
}

需要注意的是,refresh完成后,在变回状态active之前,会开始lease协议,即发送lease消息给peon,这会帮助peon也变为active。
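
extend_lease的源码本文没有贴出,这里给出一段简化示意(Peon结构体、字段名均为笔者假设):leader把“租约过期时间 = 当前时间 + mon_lease”发给各个peon,peon在租约有效期内保持active:

// 简化示意:leader延长租约(lease),peon在租约有效期内保持active
// (Peon结构体与字段名为笔者假设,真实实现通过OP_LEASE消息携带过期时间)
#include <chrono>
#include <iostream>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Peon {
  int rank = 0;
  Clock::time_point lease_expire{};   // 租约过期时间
  bool active() const { return Clock::now() < lease_expire; }
};

int main() {
  const auto mon_lease = std::chrono::seconds(5);   // 对应配置项mon_lease = 5
  std::vector<Peon> peons = {{1}, {2}};

  // leader端:计算新的过期时间,并“发送”给每个peon
  auto lease_expire = Clock::now() + mon_lease;
  for (auto& p : peons) {
    p.lease_expire = lease_expire;                  // 模拟peon收到OP_LEASE
    std::cout << "mon." << p.rank << " active=" << std::boolalpha
              << p.active() << std::endl;
  }
  return 0;
}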

handle_commit

  • 更新内存中和后端存储中last_committed值,即+1
  • 将new_value中的值解码成事务,然后调用后端存储接口执行请求,这里采用同步写,和leader节点不一样
  • 刷新PaxosService服务

void Paxos::handle_commit(MonOpRequestRef op)
{
  op->mark_paxos_event("handle_commit");
  MMonPaxos *commit = static_cast<MMonPaxos*>(op->get_req());
  dout(10) << "handle_commit on " << commit->last_committed << dendl;

  logger->inc(l_paxos_commit);

  if (!mon->is_peon()) {
    dout(10) << "not a peon, dropping" << dendl;
    assert(0);
    return;
  }

  op->mark_paxos_event("store_state");
  
  /*store_state是该函数的核心,同步地处理事务*/
  store_state(commit);

  if (do_refresh()) {
    finish_contexts(g_ceph_context, waiting_for_commit);
  }
}

handle_lease

peon收到延长租约的消息OP_LEASE之后,会调用handle_lease,peon的状态从updating转变成active。
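
下面是与前面leader端租约示意相对应的peon端简化示意(状态名与字段为笔者假设,并非Ceph源码):peon收到租约后更新本地的过期时间,状态切回active,并回复ack:

// 简化示意:peon处理OP_LEASE(状态名与字段为笔者假设,并非Ceph源码):
// 收到leader发来的租约过期时间后,更新本地lease_expire,
// 状态从updating切回active,并回复一个lease ack。
#include <chrono>
#include <iostream>

using Clock = std::chrono::steady_clock;
enum class PaxosState { UPDATING, ACTIVE };

int main() {
  PaxosState state = PaxosState::UPDATING;       // commit之后、收到lease之前
  Clock::time_point lease_expire{};

  // 模拟收到OP_LEASE:消息中携带leader计算好的过期时间(now + mon_lease)
  auto msg_lease_expire = Clock::now() + std::chrono::seconds(5);

  lease_expire = msg_lease_expire;               // 接受新的租约
  state = PaxosState::ACTIVE;                    // updating -> active

  bool active = (state == PaxosState::ACTIVE) && (Clock::now() < lease_expire);
  std::cout << std::boolalpha << "peon active=" << active
            << ", reply lease ack to leader" << std::endl;
  return 0;
}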

参考文献

  1. Ceph Monitor Paxos
]]>