前言
软件开发中,没有银弹,我们有时需要多种语言互相调用支援,发挥各自的优势。本文介绍Rust的FFI 外部函数接口。
本文重点介绍C或者C++调用Rust library的方法。
Rust在实现std::slice::sort_unstable
的时候,用了一种新的快速排序变种PDQSort,相对其它语言里面普遍用到的IntroSort有较大的性能提升,我们偷个懒,不自己实现C的pdqsort,调用Rust的pdqsort,作为本次练习的任务。
Rust 侧
第一步:
cargo new pdqsort --lib
执行完毕后,我们得到了:
ROG-Manjaro Rust/pdqsort ‹master*› » tree
.
├── Cargo.lock
├── Cargo.toml
└── src
└── lib.rs
2 directories, 3 files
我们修改src/lib.rs
#[no_mangle]
pub unsafe extern fn rust_u32sort(elements: *mut u32, size: u64) {
let elements = std::slice::from_raw_parts_mut(elements, size as usize);
elements.sort_unstable();
}
同时修改Cargo.toml :
[lib]
crate-type = ["cdylib"]
crate-type = ["cdylib"]
会创建一个动态链接的库。 可查看 Cargo 文档的动态或静态库,了解更多信息.
cdylib
是在 RFC 1510 中引入,并改善了现有的dylib
文件,减小其大小,和导出更少符号。
然后我们可以通过调用:
ROG-Manjaro Rust/pdqsort ‹master*› » cargo build --release
Compiling pdqsort v0.1.0 (/home/manu/CODE/Rust/pdqsort)
Finished release [optimized] target(s) in 1.49s
ROG-Manjaro Rust/pdqsort ‹master*› » tree
.
├── Cargo.lock
├── Cargo.toml
├── src
│ └── lib.rs
└── target
├── CACHEDIR.TAG
└── release
├── build
├── deps
│ ├── libpdqsort.so
│ └── pdqsort.d
├── examples
├── incremental
├── libpdqsort.d
└── libpdqsort.so
8 directories, 8 files
可以看出我们已经可以编译出了libpdqsort.so。
如果想用静态链接:
[lib]
crate-type = ["staticlib"]
会编译出来如下内容:
ROG-Manjaro Rust/pdqsort ‹master*› » tree
.
├── Cargo.lock
├── Cargo.toml
├── src
│ └── lib.rs
└── target
├── CACHEDIR.TAG
└── release
├── build
├── deps
│ ├── libpdqsort-4b11be617319d092.a
│ └── pdqsort-4b11be617319d092.d
├── examples
├── incremental
├── libpdqsort.a
└── libpdqsort.d
C++ 侧
C++侧,我们写测试代码,即对100万个unsigned int数进行排序。
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>
extern "C" int rust_u32sort(uint32_t* elements, uint64_t size);
std::vector<uint32_t> generate_test_vec(size_t n) {
auto test_vec = std::vector<uint32_t> {};
while (test_vec.size() < n) {
test_vec.push_back((uint32_t) std::rand());
}
return test_vec;
}
int main() {
for (int round = 0; round < 3; round++) {
std::printf("round %d:\n", round);
auto test_vec1 = generate_test_vec(10000000);
auto test_vec2 = test_vec1;
auto start_time1 = std::clock();
std::sort(test_vec1.begin(), test_vec1.end());
std::printf("std::sort time: %.3lf sec\n", (std::clock() - start_time1) * 1.0 / CLOCKS_PER_SEC);
auto start_time2 = std::clock();
rust_u32sort(test_vec2.data(), test_vec2.size());
std::printf("rust_u32sort time: %.3lf sec\n", (std::clock() - start_time2) * 1.0 / CLOCKS_PER_SEC);
assert(test_vec2 == test_vec1);
}
}
编译:
动态链接:
c++ -std=c++20 -O3 test.cpp -L ../../Rust/pdqsort/target/release/ -lpdqsort -o sort
静态链接:
c++ -std=c++20 -O3 test.cpp -L ../../Rust/pdqsort/target/release/ -lpdqsort -o sort
执行:
动态链接:
-----------
LD_LIBRARY_PATH=./target/release ../../C++/pdqsort/sort
round 0:
std::sort time: 1.746 sec
rust_u32sort time: 0.797 sec
round 1:
std::sort time: 1.786 sec
rust_u32sort time: 0.841 sec
round 2:
std::sort time: 1.829 sec
rust_u32sort time: 0.834 sec
静态链接:
---------
ROG-Manjaro C++/pdqsort » ./sort
round 0:
std::sort time: 1.787 sec
rust_u32sort time: 0.721 sec
round 1:
std::sort time: 1.848 sec
rust_u32sort time: 0.776 sec
round 2:
std::sort time: 1.831 sec
rust_u32sort time: 0.745 sec
Summary
结果上可以看到Rust的排序比C++ std::sort快了一倍多,结果比较出乎意料。目前在读PDQSort的代码,分析性能比传统快排更优的原因是使用了一种新的快排变种BlockQuickSort,这个变种算法可以较大改善传统快排中的分支预言失败的情况,具体实现还没有读完,后续再补充上。
附上BlockQuickSort的论文,2016年出的,BlockQuicksort: Avoiding Branch Mispredictions in Quicksort.
我们将上述测试放到物理机上跑:
[david@david-latitude3510 pdqsort]$ LD_LIBRARY_PATH=./target/release perf stat ../../C++/pdqsort/sort
round 0:
std::sort time: 0.782 sec
round 1:
std::sort time: 0.785 sec
round 2:
std::sort time: 0.779 sec
Performance counter stats for '../../C++/pdqsort/sort':
3,029.93 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
11,694 page-faults:u # 3.859 K/sec
9,780,539,711 cycles:u # 3.228 GHz (74.95%)
7,320,156,912 instructions:u # 0.75 insn per cycle (75.02%)
1,664,764,519 branches:u # 549.440 M/sec (75.05%)
305,855,559 branch-misses:u # 18.37% of all branches (74.97%)
3.030752533 seconds time elapsed
2.800187000 seconds user
0.133345000 seconds sys
[david@david-latitude3510 pdqsort]$ LD_LIBRARY_PATH=./target/release perf stat ../../C++/pdqsort/sort
round 0:
rust_u32sort time: 0.342 sec
round 1:
rust_u32sort time: 0.341 sec
round 2:
rust_u32sort time: 0.343 sec
Performance counter stats for '../../C++/pdqsort/sort':
1,619.17 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
11,702 page-faults:u # 7.227 K/sec
5,300,322,850 cycles:u # 3.273 GHz (74.99%)
9,081,461,937 instructions:u # 1.71 insn per cycle (74.99%)
1,185,526,644 branches:u # 732.182 M/sec (74.99%)
69,121,012 branch-misses:u # 5.83% of all branches (75.02%)
1.619959094 seconds time elapsed
1.566298000 seconds user
0.049854000 seconds sys
我们可以看到branch-misses,对于改进后的快排,只有5.83%的分支预测失败,而传统的快排,18.37%的branch-misses。