Recently I have been optimizing a program I wrote some time ago. Its IO part is being changed from a single-threaded Reactor model to a multi-threaded Proactor model: before, an asynchronous IO event woke a thread up and that thread did the IO read/write; now one thread performs the asynchronous IO reads/writes and hands the data to another thread for the logic processing. That introduces the problem of exchanging data between threads, and since it is IO data, the data to exchange is fairly large, i.e. a whole block of memory (a buffer). This is not really a big deal; it is a very mature design: either add a lock, or use a lock-free ring buffer. But while writing it I became obsessed with the question: which is faster, locking or a lock-free ring buffer? I know that when writing business logic one should not agonize over this kind of technical detail; the difference will not be large, and from the application's point of view either choice feels the same. Still, I would not feel at ease without writing a program to measure it.

First, about my program: it is the typical setup where one thread produces data and another consumes it, but the producing and consuming logic have nothing to do with each other; the lock is only needed for the instant the data is handed over, so real contention is unlikely. What I care about most is therefore how these primitives behave when there is no contention at all. A lock-free ring buffer needs roughly two or three atomic variables, whereas the locked version needs just a single lock. A sketch of the ring-buffer idea comes first, then the benchmark code.
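To make the "two or three atomics" figure concrete, here is a rough sketch of the kind of lock-free single-producer/single-consumer ring buffer I have in mind. It is only an illustration, not part of the benchmark; the fixed capacity, element type and class name are placeholder choices:

#include <atomic>
#include <cstddef>

/// Single-producer single-consumer ring buffer: one thread only calls push(),
/// the other only calls pop(); two atomic indices are enough, no lock needed.
template <typename T, size_t N>
class SpscRing
{
public:
	bool push(const T& item)  // called by the producer thread only
	{
		const size_t w = _write.load(std::memory_order_relaxed);
		const size_t next = (w + 1) % N;
		if (next == _read.load(std::memory_order_acquire))
			return false;  // full
		_buf[w] = item;
		_write.store(next, std::memory_order_release);  // publish the new element
		return true;
	}

	bool pop(T& item)  // called by the consumer thread only
	{
		const size_t r = _read.load(std::memory_order_relaxed);
		if (r == _write.load(std::memory_order_acquire))
			return false;  // empty
		item = _buf[r];
		_read.store((r + 1) % N, std::memory_order_release);  // free the slot
		return true;
	}

private:
	T _buf[N];
	std::atomic<size_t> _read{ 0 };
	std::atomic<size_t> _write{ 0 };
};

And here is the benchmark itself: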

#include <iostream>
#include <atomic>
#include <chrono>
#include <mutex>

#ifndef _MSC_VER
#include <pthread.h>
#endif

/// Spin lock implemented with std::atomic_flag
class SpinLock final
{
public:
	SpinLock()
	{
		_flag.clear();

		// Since C++20 the ATOMIC_FLAG_INIT macro is no longer needed (and is deprecated),
		// because the default constructor of std::atomic_flag initializes it to the clear state.
		// _flag = ATOMIC_FLAG_INIT;
	}
	~SpinLock() = default;
	SpinLock(const SpinLock&) = delete;
	SpinLock(SpinLock&&) = delete;

	void lock() noexcept
	{
		// https://en.cppreference.com/w/cpp/atomic/atomic_flag_test_and_set
		// Example: a spinlock mutex can be implemented in userspace using an atomic_flag.
		while (_flag.test_and_set(std::memory_order_acquire));
	}

	bool try_lock() noexcept
	{
		return !_flag.test_and_set(std::memory_order_acquire);
	}

	void unlock() noexcept
	{
		_flag.clear(std::memory_order_release);
	}
private:
	std::atomic_flag _flag;
};

#ifndef _MSC_VER
/// https://rigtorp.se/spinlock/
struct spinlock {
	std::atomic<bool> lock_ = { 0 };

	void lock() noexcept {
		for (;;) {
			// Optimistically assume the lock is free on the first try
			if (!lock_.exchange(true, std::memory_order_acquire)) {
				return;
			}
			// Wait for lock to be released without generating cache misses
			while (lock_.load(std::memory_order_relaxed)) {
				// Issue X86 PAUSE or ARM YIELD instruction to reduce contention between
				// hyper-threads
				__builtin_ia32_pause();
			}
		}
	}

	bool try_lock() noexcept {
		// First do a relaxed load to check if lock is free in order to prevent
		// unnecessary cache misses if someone does while(!try_lock())
		return !lock_.load(std::memory_order_relaxed) &&
			!lock_.exchange(true, std::memory_order_acquire);
	}

	void unlock() noexcept {
		lock_.store(false, std::memory_order_release);
	}
};
#endif

int main()
{
	const int ts = 10000000;

	int ii1 = 0;
	int ii2 = 0;
	int ii3 = 0;

	std::atomic<int> i1(0);
	std::atomic<int> i2(0);
	std::atomic<int> i3(0);

	std::chrono::steady_clock::time_point beg;
	std::chrono::steady_clock::time_point end;

	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run int      " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		i1 += i / 2;
		i2 += 1;
		i3 += i1;
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run atomic   " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

	ii1 = 0;
	ii2 = 0;
	ii3 = 0;
	std::mutex m;
	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		m.lock();
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
		m.unlock();
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run mutex    " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

	ii1 = 0;
	ii2 = 0;
	ii3 = 0;
	SpinLock l;
	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		l.lock();
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
		l.unlock();
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run SpinLock " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

#ifndef _MSC_VER
	ii1 = 0;
	ii2 = 0;
	ii3 = 0;
	spinlock ll;
	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		ll.lock();
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
		ll.unlock();
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run spinlock " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

	// using pthread_spin_lock
	// https://docs.oracle.com/cd/E26502_01/html/E35303/ggecq.html
	ii1 = 0;
	ii2 = 0;
	ii3 = 0;
	pthread_spinlock_t lll;

	/* initialize a process-private spin lock */
	int ret = pthread_spin_init(&lll, PTHREAD_PROCESS_PRIVATE);
	if (ret != 0)
		return ret;
	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		pthread_spin_lock(&lll);
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
		pthread_spin_unlock(&lll);
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run pthread_spinlock_t " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;
#endif

	return 0;
}

Build settings: on Windows, Visual Studio 2022 with default settings; on Linux, g++ --std=c++11 Test.cpp. Results:

  1. Physical machine, CentOS 7, CPU i5-4460
atomic 267ms
mutex 162ms
  2. VirtualBox VM, Debian 10, laptop CPU
run 10000000 time cost (ms) = 326
run 10000000 time cost (ms) = 587
run 10000000 time cost (ms) = 729
run 10000000 time cost (ms) = 834
  3. Physical machine, Win10, CPU AMD 5700G
run int      10000000 time cost (ms) = 9
run atomic   10000000 time cost (ms) = 187
run mutex    10000000 time cost (ms) = 233
run SpinLock 10000000 time cost (ms) = 152
  4. VirtualBox VM, Debian 10, CPU 5700G
run int      10000000 time cost (ms) = 9
run atomic   10000000 time cost (ms) = 92
run mutex    10000000 time cost (ms) = 200
run SpinLock 10000000 time cost (ms) = 194
run spinlock 10000000 time cost (ms) = 234
run pthread_spinlock_t 10000000 time cost (ms) = 119

Note that on the CentOS 7 physical machine I had initially only written the atomic vs mutex comparison. I added the other comparisons later, but that machine is temporarily unavailable, so it only has those two numbers. On Windows, the spinlock variant uses __builtin_ia32_pause, a GCC builtin that MSVC does not provide, so that data point is missing there. When testing on the different machines I often edited the code by hand and did not go back to rerun everything on the earlier machines, so the output formats differ slightly, but the logic is the same.

With no contention at all, the numbers are quite interesting:

  1. atomic types are much slower than plain int
  2. The Linux results are consistent: int > atomic > SpinLock > mutex, and two atomic operations cost roughly one mutex
  3. On Windows: int > SpinLock > atomic > mutex
  4. The mutex and atomic implementations on Linux are much faster than on Windows

Since my program mostly runs on Linux, I will not analyze the Windows results. On Linux, two atomic operations costing roughly one mutex matches expectations: with no contention, a mutex lock() and unlock() are essentially one compare-and-set each, i.e. the equivalent of touching two atomics. The SpinLock built on std::atomic_flag performs almost the same as the mutex, sometimes a little faster, sometimes a little slower, but never by much. I worried my implementation was inefficient, so I also tried one found online (rigtorp's), which additionally issues a PAUSE instruction through a compiler builtin; it turned out to be even slower. Then I tested pthread_spinlock_t, which was clearly faster, though that may come down to how it is built (pthread is a separate library compiled with different flags). This test program is too simple to build with optimization: with -O2 the locks simply get optimized away. In any case pthread_spinlock_t is not in the C++ standard, so I am unlikely to use it, and the remaining options differ very little.
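If one did want to build this benchmark with -O2, a common trick is to hide the accumulators behind an empty inline-asm barrier so the compiler must keep the loop. This is only a sketch assuming GCC or Clang; the helper name do_not_optimize is mine, not something used in the test above:

// Force the compiler to assume `value` may be read and written behind its back,
// so the computation feeding it cannot be elided under -O2.
template <typename T>
inline void do_not_optimize(T& value)
{
	asm volatile("" : "+m"(value) : : "memory");
}

// usage inside a timed loop:
//   l.lock();
//   ii1 += i / 2;
//   l.unlock();
//   do_not_optimize(ii1);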

The above only covers the completely uncontended case. I did not test the contended case, because under contention atomic, mutex and SpinLock behave differently, and the question becomes which one fits the business logic, not which one is faster:

  1. atomic only guarantees the consistency of reads and writes of the variable itself; it cannot guarantee the consistency of a larger piece of logic, so it cannot serve as a lock
  2. mutex is for guaranteeing logical consistency (if only a single variable is involved, atomic alone is enough and no lock is needed). Under contention a mutex drops into the kernel and yields the CPU, so it suits critical sections that hold the lock for a relatively long time
  3. SpinLock also guarantees logical consistency, but it never yields the CPU, so it suits critical sections that hold the lock only briefly.
    The code below, for example, performs a push, so atomic is not an option; the push obviously takes only a very short time, so a SpinLock is the better fit.
std::vector<int> v;
lock();
v.push_back(1);
unlock();
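Because the SpinLock class above exposes lock(), try_lock() and unlock(), it meets the Lockable requirements and can be used with std::lock_guard so the unlock cannot be forgotten. A minimal sketch; the names v and push_value are only illustrative:

#include <mutex>   // std::lock_guard
#include <vector>

std::vector<int> v;
SpinLock v_lock;

void push_value(int x)
{
	std::lock_guard<SpinLock> guard(v_lock);  // spins until acquired, releases on scope exit
	v.push_back(x);
}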

So for the contended case, the choice has to be judged from the actual business logic; simulating it with a simple for loop is fairly meaningless. mutex will certainly be the slowest, but it yields the CPU, and in a real program that matters a great deal.

posted on 2022-01-08 18:09  coding my life