Recently I have been optimizing a program I wrote a while ago. Its I/O part is being changed from a single-threaded Reactor model to a multi-threaded Proactor model: previously an asynchronous I/O event woke a thread which then did the read/write, whereas now one thread performs the asynchronous I/O and hands the data to another thread for the logic processing. That raises the question of exchanging data between threads, and since it is I/O data, the thing being exchanged is fairly large: a big block of memory (a buffer). This really shouldn't be a big deal; the design is well established, and you simply add a lock or use a lock-free ring buffer. But as I wrote the code I became obsessed with one question: which is faster, taking a lock or using a lock-free ring buffer? I know that when writing business logic one shouldn't agonize over this kind of detail; the difference won't be large, and from the application's point of view either choice is invisible. Still, I wouldn't feel at ease without writing a small test program.
First, some context on my program: it is the classic pattern of one thread producing data and another consuming it. The producing and consuming logic are unrelated; a lock is only needed at the instant the data is handed over, so actual contention is rare. What I really care about, then, is how the options behave when there is no contention. Implementing a lock-free ring buffer takes roughly two or three atomic variables; a minimal sketch of what I mean is below.
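To make the comparison concrete, here is a minimal sketch of the kind of single-producer / single-consumer ring buffer I have in mind, built on two atomic indices. The class and member names are illustrative only; this is not my actual buffer, and the fixed-size array is an assumption made for brevity.
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscRing final
{
public:
    // called from the producer thread only
    bool push(const T& v)
    {
        const std::size_t w = _write.load(std::memory_order_relaxed);
        const std::size_t next = (w + 1) % N;
        if (next == _read.load(std::memory_order_acquire)) return false; // full
        _buf[w] = v;
        _write.store(next, std::memory_order_release); // publish the new element
        return true;
    }
    // called from the consumer thread only
    bool pop(T& out)
    {
        const std::size_t r = _read.load(std::memory_order_relaxed);
        if (r == _write.load(std::memory_order_acquire)) return false; // empty
        out = _buf[r];
        _read.store((r + 1) % N, std::memory_order_release); // free the slot
        return true;
    }
private:
    T _buf[N];
    std::atomic<std::size_t> _read{0};
    std::atomic<std::size_t> _write{0};
};
Every push or pop touches both atomic indices (a relaxed load, an acquire load and a release store), which is where the "two to three atomics per hand-over" estimate comes from. With that in mind, here is the benchmark program (Test.cpp):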
#include <iostream>
#include <atomic>
#include <chrono>
#include <mutex>
#ifndef _MSC_VER
#include <pthread.h>
#endif
/// A spin lock implemented with std::atomic_flag
class SpinLock final
{
public:
SpinLock()
{
_flag.clear();
// C++20 This macro is no longer needed and deprecated,
// since default constructor of std::atomic_flag initializes it to clear state.
// _flag = ATOMIC_FLAG_INIT;
}
~SpinLock() = default;
SpinLock(const SpinLock&) = delete;
SpinLock(SpinLock&&) = delete;
void lock() noexcept
{
// https://en.cppreference.com/w/cpp/atomic/atomic_flag_test_and_set
// Example A spinlock mutex can be implemented in userspace using an atomic_flag
//
while (_flag.test_and_set(std::memory_order_acquire));
}
bool try_lock() noexcept
{
return !_flag.test_and_set(std::memory_order_acquire);
}
void unlock() noexcept
{
_flag.clear(std::memory_order_release);
}
private:
std::atomic_flag _flag;
};
#ifndef _MSC_VER
/// https://rigtorp.se/spinlock/
struct spinlock {
std::atomic<bool> lock_ = { 0 };
void lock() noexcept {
for (;;) {
// Optimistically assume the lock is free on the first try
if (!lock_.exchange(true, std::memory_order_acquire)) {
return;
}
// Wait for lock to be released without generating cache misses
while (lock_.load(std::memory_order_relaxed)) {
// Issue X86 PAUSE or ARM YIELD instruction to reduce contention between
// hyper-threads
__builtin_ia32_pause();
}
}
}
bool try_lock() noexcept {
// First do a relaxed load to check if lock is free in order to prevent
// unnecessary cache misses if someone does while(!try_lock())
return !lock_.load(std::memory_order_relaxed) &&
!lock_.exchange(true, std::memory_order_acquire);
}
void unlock() noexcept {
lock_.store(false, std::memory_order_release);
}
};
#endif
int main()
{
const int ts = 10000000;
int ii1 = 0;
int ii2 = 0;
int ii3 = 0;
std::atomic<int> i1(0);
std::atomic<int> i2(0);
std::atomic<int> i3(0);
std::chrono::steady_clock::time_point beg;
std::chrono::steady_clock::time_point end;
beg = std::chrono::steady_clock::now();
for (int i = 0; i < ts; i++)
{
ii1 += i / 2;
ii2 += 1;
ii3 += ii1;
}
end = std::chrono::steady_clock::now();
std::cout << "run int " << ts << " time cost (ms) = "
<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;
beg = std::chrono::steady_clock::now();
for (int i = 0; i < ts; i++)
{
i1 += i / 2;
i2 += 1;
i3 += i1;
}
end = std::chrono::steady_clock::now();
std::cout << "run atomic " << ts << " time cost (ms) = "
<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;
ii1 = 0;
ii2 = 0;
ii3 = 0;
std::mutex m;
beg = std::chrono::steady_clock::now();
for (int i = 0; i < ts; i++)
{
m.lock();
ii1 += i / 2;
ii2 += 1;
ii3 += ii1;
m.unlock();
}
end = std::chrono::steady_clock::now();
std::cout << "run mutex " << ts << " time cost (ms) = "
<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;
ii1 = 0;
ii2 = 0;
ii3 = 0;
SpinLock l;
beg = std::chrono::steady_clock::now();
for (int i = 0; i < ts; i++)
{
l.lock();
ii1 += i / 2;
ii2 += 1;
ii3 += ii1;
l.unlock();
}
end = std::chrono::steady_clock::now();
std::cout << "run SpinLock " << ts << " time cost (ms) = "
<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;
#ifndef _MSC_VER
ii1 = 0;
ii2 = 0;
ii3 = 0;
spinlock ll;
beg = std::chrono::steady_clock::now();
for (int i = 0; i < ts; i++)
{
ll.lock();
ii1 += i / 2;
ii2 += 1;
ii3 += ii1;
ll.unlock();
}
end = std::chrono::steady_clock::now();
std::cout << "run spinlock " << ts << " time cost (ms) = "
<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;
// using pthread_spin_lock
// https://docs.oracle.com/cd/E26502_01/html/E35303/ggecq.html
ii1 = 0;
ii2 = 0;
ii3 = 0;
pthread_spinlock_t lll;
int pshared = PTHREAD_PROCESS_PRIVATE; // pshared must be initialized; passing an uninitialized value is undefined behaviour
int ret;
/* initialize a spin lock */
ret = pthread_spin_init(&lll, pshared);
beg = std::chrono::steady_clock::now();
for (int i = 0; i < ts; i++)
{
pthread_spin_lock(&lll);
ii1 += i / 2;
ii2 += 1;
ii3 += ii1;
pthread_spin_unlock(&lll);
}
end = std::chrono::steady_clock::now();
std::cout << "run pthread_spinlock_t " << ts << " time cost (ms) = "
<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;
#endif
return 0;
}
Compiler settings: on Windows, Visual Studio 2022 with default settings; on Linux, g++ --std=c++11 Test.cpp. Results:
- Physical machine, CentOS 7, CPU i5-4460
atomic 267ms
mutex 162ms
- VirtualBox VM, Debian 10, laptop CPU
run 10000000 time cost (ms) = 326
run 10000000 time cost (ms) = 587
run 10000000 time cost (ms) = 729
run 10000000 time cost (ms) = 834
- Physical machine, Win10, CPU AMD 5700G
run int 10000000 time cost (ms) = 9
run atomic 10000000 time cost (ms) = 187
run mutex 10000000 time cost (ms) = 233
run SpinLock 10000000 time cost (ms) = 152
- VirtualBox VM, Debian 10, CPU AMD 5700G
run int 10000000 time cost (ms) = 9
run atomic 10000000 time cost (ms) = 92
run mutex 10000000 time cost (ms) = 200
run SpinLock 10000000 time cost (ms) = 194
run spinlock 10000000 time cost (ms) = 234
run pthread_spinlock_t 10000000 time cost (ms) = 119
A few notes. On the CentOS 7 physical machine I had originally only written the atomic vs mutex comparison; I added the other cases later, but that machine is temporarily unavailable, so it has only two numbers. On Windows, the spinlock struct uses a builtin (__builtin_ia32_pause) that is not available under MSVC, so one result is missing there. While testing on different machines I often edited the code by hand and never went back to rerun on the earlier machines, so the output labels differ slightly, but the logic is the same.
With no contention at all, the numbers are quite interesting:
- The atomic type is much slower than a plain int.
- On Linux the ordering is consistent: int > atomic > SpinLock > mutex (fastest to slowest), and roughly two atomic operations cost about as much as one mutex lock/unlock.
- On Windows the ordering is int > SpinLock > atomic > mutex.
- The Linux implementations of mutex and atomic are much faster than the Windows ones.
Since my program runs mostly on Linux, I won't analyze the Windows results. On Linux, two atomics cost about as much as one mutex, so a lock-free ring buffer built on two or three atomics is not obviously cheaper than simply taking a lock. The pthread_spinlock_t test stands out as very efficient, though that may partly be down to compiler flags (pthread links against a library built with different options), and this test program is too simple to compile with optimization (with -O2 the locks are simply optimized away). In any case pthread_spinlock_t is not part of the C++ standard, so I'm unlikely to use it; the remaining options differ very little.
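If I did want to build the benchmark with -O2, one way to keep the loops from being thrown away is to make their results observable. This is only a rough sketch of what I would try, an assumption on my part rather than something I verified on every compiler:
// Keep the loop body alive under -O2: sink the accumulators through a volatile,
// and (on GCC/Clang) tell the compiler the values "escape" with an empty asm.
volatile int sink = 0;
for (int i = 0; i < ts; i++)
{
    ii1 += i / 2;
    ii2 += 1;
    ii3 += ii1;
#ifndef _MSC_VER
    asm volatile("" : : "r"(ii1), "r"(ii2), "r"(ii3)); // pretend the values are used each iteration
#endif
}
sink = ii1 + ii2 + ii3; // force the final results to be observable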
The numbers above are for the completely uncontended case. I did not test the contended case, because under contention atomic, mutex and SpinLock behave differently, and which one to use is a question of business logic, not of raw speed:
- atomic only guarantees the consistency of reads and writes of the variable itself; it cannot guarantee the consistency of a larger piece of logic, so it cannot serve as a lock.
- mutex exists to guarantee the consistency of a piece of logic (if only a single variable is involved, an atomic removes the need for a lock). Under contention a mutex goes into the kernel and yields the CPU, so it suits critical sections that hold the lock for a relatively long time.
- SpinLock also guarantees the consistency of a piece of logic, but it never yields the CPU, so it suits critical sections that hold the lock only briefly.
Take the code below: it performs a push, so atomic is not an option, and since a push clearly takes very little time, a SpinLock is the better fit.
std::vector<int> v;
SpinLock l;
l.lock();
v.push_back(1);
l.unlock();
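Since the SpinLock class above provides lock(), try_lock() and unlock(), it satisfies the standard Lockable requirements, so in real code I would pair it with std::lock_guard rather than calling lock()/unlock() by hand. A small sketch (the function name is just for illustration):
#include <mutex>   // std::lock_guard
#include <vector>

std::vector<int> v;
SpinLock l;

void push_one(int x)
{
    std::lock_guard<SpinLock> guard(l); // releases the lock on scope exit, even if push_back throws
    v.push_back(x);
}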
So when contention does occur, the choice has to be made from the actual business logic; simulating it with a trivial for loop would not mean much. A mutex will certainly be the slowest, but it yields the CPU, and in a real program that matters a great deal.