让 GPT-4 来修复 Golang “数据竞争”问题(续) - 每天5分钟玩转 GPT 编程系列(7)

func TestGoPoolWithMutex(t *testing.T) {
	pool := NewGoPool(100, WithLock(new(sync.Mutex)), WithTimeout(10*time.Microsecond))
	defer pool.Release()
	for i := 0; i < 1000; i++ {
		pool.AddTask(func() (interface{}, error) {
			time.Sleep(10 * time.Millisecond)
			return nil, nil
		})
	}
	pool.Wait()
}

我在本地能够稳定复现这个问题。那么接下来，就是简化场景，用一个单独的 UT 来测试 WithTimeout(10*time.Microsecond) 功能。

3.2 让 GPT-4 写一个单元测试

我对 DevChat 说：

DevChat 的完整回答是：

要添加一个名为 TestGoPoolWithTimeout() 的测试函数，我们需要考虑以下几点：

我们需要创建一个带有超时选项的 GoPool。
我们需要添加一个任务，该任务的执行时间超过了设置的超时时间，以便测试超时功能是否正常工作。
我们需要检查任务是否因为超时而失败。

以下是添加的 TestGoPoolWithTimeout() 函数：

func TestGoPoolWithTimeout(t *testing.T) {
    var taskError = errors.New("task error")
    var taskRun = false

    pool := NewGoPool(100, WithTimeout(100*time.Millisecond), WithErrorCallback(func(err error) {
        if err != taskError {
            t.Errorf("Expected error %v, but got %v", taskError, err)
        }
        taskRun = true
    }))
    defer pool.Release()

    pool.AddTask(func() (interface{}, error) {
        time.Sleep(200 * time.Millisecond)
        return nil, taskError
    })

    pool.Wait()

    if !taskRun {
        t.Errorf("Expected task to run and timeout, but it did not run")
    }
}

这个测试函数首先创建了一个带有 100 毫秒超时的 GoPool。然后，它添加了一个任务，该任务需要 200 毫秒才能完成，这超过了设置的超时时间。如果任务因为超时而失败，它将调用错误回调函数，并将 taskRun 设置为 true。最后，测试函数检查 taskRun 是否为 true，如果不是，说明任务没有运行，这是一个错误。

这段代码“看起来精致，但是跑不通”。能不能发现错误所在决定了这段代码最终能不能用起来。你发现问题了吗？（答案在下文揭晓）

我又对 DevChat 说：

这个测试用例没有跑过。我注意到你在 WithErrorCallback 中判断 err != taskError 则报错，然而超时的时候 task 返回的错误不是 taskError，而是 "task timed out"，所以 != 是否应该改成 ==？

DevChat 给出的回答：

你是对的，我犯了一个错误。在这个测试用例中，我们应该期望的错误是 "task timed out"，而不是 taskError。因此，我们应该在 WithErrorCallback 中检查错误是否等于 "task timed out"。以下是修正后的测试函数：

func TestGoPoolWithTimeout(t *testing.T) {
    var taskRun = false

    pool := NewGoPool(100, WithTimeout(100*time.Millisecond), WithErrorCallback(func(err error) {
        if err.Error() != "task timed out" {
            t.Errorf("Expected error 'task timed out', but got %v", err)
        }
        taskRun = true
    }))
    defer pool.Release()

    pool.AddTask(func() (interface{}, error) {
        time.Sleep(200 * time.Millisecond)
        return nil, nil
    })

    pool.Wait()

    if !taskRun {
        t.Errorf("Expected task to run and timeout, but it did not run")
    }
}

在这个修正后的版本中，我们在 WithErrorCallback 中检查错误是否等于 "task timed out"。如果不是，我们就报告一个错误。我们还修改了添加的任务，使其在执行时不返回错误，因为我们期望的错误是由超时引起的，而不是任务本身引起的。

这时候 TestGoPoolWithTimeout() 其实能跑通了。不过当我执行 go test 命令加上了 -race 之后，有趣：

WARNING: DATA RACE

GPT-4 写的 bug 让 GPT-4 自己去修！

于是我语(一)重(股)心(脑)长(儿)地把错误信息发给了 DevChat：

这个测试用例能够跑通，但是如果执行 race detector 则会报错：

go test -race ./...
==================
WARNING: DATA RACE
Write at 0x00c00001a14f by goroutine 641:
  github.com/devchat-ai/gopool.TestGoPoolWithTimeout.func1()
      /Users/danielhu/go/mycode/gopool/gopool_test.go:160 +0xf4
  github.com/devchat-ai/gopool.(*worker).handleResult()
      /Users/danielhu/go/mycode/gopool/worker.go:92 +0x94
  github.com/devchat-ai/gopool.(*worker).start.func1()
      /Users/danielhu/go/mycode/gopool/worker.go:27 +0xa4

Previous read at 0x00c00001a14f by goroutine 541:
  github.com/devchat-ai/gopool.TestGoPoolWithTimeout()
      /Users/danielhu/go/mycode/gopool/gopool_test.go:171 +0x1a8
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x188
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x40

Goroutine 641 (running) created at:
  github.com/devchat-ai/gopool.(*worker).start()
      /Users/danielhu/go/mycode/gopool/worker.go:23 +0xdc
  github.com/devchat-ai/gopool.NewGoPool()
      /Users/danielhu/go/mycode/gopool/gopool.go:75 +0x490
  github.com/devchat-ai/gopool.TestGoPoolWithTimeout()
      /Users/danielhu/go/mycode/gopool/gopool_test.go:156 +0x138
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x188
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x40

Goroutine 541 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1629 +0x5e4
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:2036 +0x80
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x188
  testing.runTests()
      /usr/local/go/src/testing/testing.go:2034 +0x700
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:1906 +0x950
  main.main()
      _testmain.go:63 +0x300
==================
--- FAIL: TestGoPoolWithTimeout (0.11s)
    gopool_test.go:172: Expected task to run and timeout, but it did not run
    testing.go:1446: race detected during execution of test
FAIL
FAIL    github.com/devchat-ai/gopool    1.267s
FAIL

这次 GPT-4 没有说对不起，可恶，这态度，差评：

DevChat 这次给出的完整回答：

这个数据竞争问题是由于在多个 goroutine 中同时读写 taskRun 变量导致的。在 Go 中，如果你有多个 goroutine 同时读写一个变量，你需要使用某种形式的同步来确保数据的一致性。

在这个测试用例中，我们可以使用 sync/atomic 包提供的原子操作来解决这个问题。以下是修正后的测试函数：

import "sync/atomic"

func TestGoPoolWithTimeout(t *testing.T) {
    var taskRun int32

    pool := NewGoPool(100, WithTimeout(100*time.Millisecond), WithErrorCallback(func(err error) {
        if err.Error() != "task timed out" {
            t.Errorf("Expected error 'task timed out', but got %v", err)
        }
        atomic.StoreInt32(&taskRun, 1)
    }))
    defer pool.Release()

    pool.AddTask(func() (interface{}, error) {
        time.Sleep(200 * time.Millisecond)
        return nil, nil
    })

    pool.Wait()

    if atomic.LoadInt32(&taskRun) == 0 {
        t.Errorf("Expected task to run and timeout, but it did not run")
    }
}

在这个修正后的版本中，我们使用 sync/atomic 包的 StoreInt32 和 LoadInt32 函数来安全地更新和读取 taskRun 变量。这样，即使在多个 goroutine 中同时操作这个变量，也不会出现数据竞争的问题。

行，至少 DATA RACE 问题解决了。

3.3 修复 Wait() 中的逻辑漏洞

前面解决了 DATA RACE 问题，不过 TestGoPoolWithTimeout() 运行结果是：

$ go test -race ./...
--- FAIL: TestGoPoolWithTimeout (0.11s)
    gopool_test.go:172: Expected task to run and timeout, but it did not run
FAIL
FAIL    github.com/devchat-ai/gopool    1.203s
FAIL

开始烧脑起来了，继续把这个日志抛给 GPT-4 吧。

这次 DevChat 的回答是：

看起来测试用例没有通过，原因是期望的任务没有运行。这可能是因为 pool.Wait() 在任务超时之前就返回了。pool.Wait() 的实现只是简单地等待任务队列为空，但这并不能保证所有的任务都已经完成执行，特别是在任务超时的情况下。

为了解决这个问题，我们需要在 pool.Wait() 中添加一些逻辑来确保所有的任务都已经完成执行。一种可能的解决方案是使用一个 sync.WaitGroup 来跟踪正在执行的任务数量。当一个任务开始执行时，我们调用 WaitGroup.Add(1)，当任务完成执行时，我们调用 WaitGroup.Done()。然后，我们可以在 pool.Wait() 中调用 WaitGroup.Wait() 来等待所有的任务完成执行。

这是一个涉及到多个函数的修改，所以我建议你根据你的代码库的具体情况来进行这个修改。如果你需要更具体的帮助，可以提供更多的代码上下文，我会很乐意帮助你。

GPT-4 说问题在 pool.Wait() 中，不过它给出的方案还是让我犹豫。当前 Wait() 方法如下：

// Wait waits for all tasks to be dispatched.
 func (p *goPool) Wait() {
 	for len(p.taskQueue) > 0 {
 		time.Sleep(100 * time.Millisecond)
 	}
 }

这里的逻辑是等待 taskQueue 为空，也就是所有的任务都被分发出去，有对应的 workers 去执行。GPT-4 分析的没错，这里确实少了“等待所有任务执行完成”的逻辑。

我对 DevChat 说：

在执行 task 的时候会有 popWorker() 动作，而 task 完成则有 pushWorker() 动作，因此如果 len(goPool.workerStack) == len(goPool.workers) 是不是也能说明所有的 tasks 都完成了？

DevChat 回答：

你的理解是正确的。在这个特定的 goPool 实现中，每当一个任务开始执行时，都会从 workerStack 中 pop 一个 worker，而当任务完成时，会将 worker push 回 workerStack。因此，如果 len(goPool.workerStack) == len(goPool.workers)，那么这确实意味着所有的任务都已经完成了。

因此，你可以修改 Wait() 函数，使其等待直到所有的 worker 都返回到 workerStack。以下是修改后的 Wait() 函数：

// Wait waits for all tasks to be dispatched and completed.
func (p *goPool) Wait() {
    for len(p.taskQueue) > 0 || len(p.workerStack) != len(p.workers) {
        time.Sleep(100 * time.Millisecond)
    }
}

在这个修改后的版本中，Wait() 函数会等待直到任务队列为空并且所有的 worker 都返回到 workerStack。这样，当 Wait() 函数返回时，我们就可以确保所有的任务都已经完成了。

于是我这样改之后，继续运行了下 Race Detector，结果如下：

$ go test -race -run TestGoPoolWithTimeout
==================
WARNING: DATA RACE
Read at 0x00c000146298 by goroutine 6:
  github.com/devchat-ai/gopool.(*goPool).Wait()
      /Users/danielhu/go/mycode/gopool/gopool.go:89 +0x7c
  github.com/devchat-ai/gopool.TestGoPoolWithTimeout()
      /Users/danielhu/go/mycode/gopool/gopool_test.go:169 +0x1a0
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x188
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x40

Previous write at 0x00c000146298 by goroutine 118:
  github.com/devchat-ai/gopool.(*goPool).popWorker()
      /Users/danielhu/go/mycode/gopool/gopool.go:113 +0xec
  github.com/devchat-ai/gopool.(*goPool).dispatch()
      /Users/danielhu/go/mycode/gopool/gopool.go:164 +0x10c
  github.com/devchat-ai/gopool.NewGoPool.func2()
      /Users/danielhu/go/mycode/gopool/gopool.go:78 +0x34

Goroutine 6 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1629 +0x5e4
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:2036 +0x80
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x188
  testing.runTests()
      /usr/local/go/src/testing/testing.go:2034 +0x700
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:1906 +0x950
  main.main()
      _testmain.go:63 +0x300

Goroutine 118 (running) created at:
  github.com/devchat-ai/gopool.NewGoPool()
      /Users/danielhu/go/mycode/gopool/gopool.go:78 +0x660
  github.com/devchat-ai/gopool.TestGoPoolWithTimeout()
      /Users/danielhu/go/mycode/gopool/gopool_test.go:156 +0x138
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x188
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x40
==================
--- FAIL: TestGoPoolWithTimeout (0.21s)
    testing.go:1446: race detected during execution of test
FAIL
exit status 1
FAIL    github.com/devchat-ai/gopool    0.519s

惨啊，今天是和 DATA RACE 杠上了？？？

我把这段日志直接贴给了 GPT-4，然后 GPT-4 感慨了下：看起来我们又遇到了数据竞争问题。

DevChat 此处的完整回答是：

看起来我们又遇到了数据竞争问题。这次的问题是在 Wait() 函数和 popWorker() 函数中同时读写 workerStack。为了解决这个问题，我们需要在访问 workerStack 时使用互斥锁来保护数据。

以下是修改后的 Wait() 函数和 popWorker() 函数：

import "sync"

type goPool struct {
    // other fields...
    workerStackMutex sync.Mutex
}

// Wait waits for all tasks to be dispatched and completed.
func (p *goPool) Wait() {
    for {
        p.workerStackMutex.Lock()
        workerStackLen := len(p.workerStack)
        p.workerStackMutex.Unlock()

        if len(p.taskQueue) == 0 && workerStackLen == len(p.workers) {
            break
        }

        time.Sleep(100 * time.Millisecond)
    }
}

// popWorker pops a worker from the worker stack.
func (p *goPool) popWorker() *worker {
    p.workerStackMutex.Lock()
    defer p.workerStackMutex.Unlock()

    if len(p.workerStack) == 0 {
        return nil
    }

    w := p.workerStack[len(p.workerStack)-1]
    p.workerStack = p.workerStack[:len(p.workerStack)-1]
    return w
}

在这个修改后的版本中，我们添加了一个 workerStackMutex 字段到 goPool 结构体中，用于在访问 workerStack 时提供互斥保护。在 Wait() 和 popWorker() 函数中，我们都使用了这个互斥锁来保护 workerStack 的读写操作，从而避免了数据竞争的问题。

看起来来这个修改能跑通，不过似乎不够优雅。这时候就又一次体现“GPT 也不能帮你完成你认知之外的工作”这个道理了。这段代码整体不错，但是有瑕疵，能看懂，能微调，GPT 就能帮你提效；反之，和 GPT 聊天就是浪费口水。

我又对 DevChat 说：

但是 goPool 本来就有一个 lock 字段，能不能直接用？

这次 GPT-4 和我说“对不起”了。我很欣慰，她又开始变得懂礼貌了。

DevChat 回答：

对不起，我没有注意到 goPool 结构体中已经有一个 lock 字段。你完全可以使用这个已有的互斥锁来保护 workerStack 的访问，而不需要添加一个新的互斥锁。

以下是使用 lock 字段来保护 workerStack 访问的修改后的 Wait() 函数和 popWorker() 函数：

// Wait waits for all tasks to be dispatched and completed.
func (p *goPool) Wait() {
    for {
        p.lock.Lock()
        workerStackLen := len(p.workerStack)
        p.lock.Unlock()

        if len(p.taskQueue) == 0 && workerStackLen == len(p.workers) {
            break
        }

        time.Sleep(100 * time.Millisecond)
    }
}

// popWorker pops a worker from the worker stack.
func (p *goPool) popWorker() *worker {
    p.lock.Lock()
    defer p.lock.Unlock()

    if len(p.workerStack) == 0 {
        return nil
    }

    w := p.workerStack[len(p.workerStack)-1]
    p.workerStack = p.workerStack[:len(p.workerStack)-1]
    return w
}

在这个修改后的版本中，我们使用 lock 字段来保护 workerStack 的读写操作，从而避免了数据竞争的问题。

这个回答有点“后遗症”，也就是前面提到了 popWorker() 方法的修改，所以这里它也给出了修改内容，其实这个方法不需要动，我的原版代码是：

func (p *goPool) popWorker() int {
	p.lock.Lock()
	workerIndex := p.workerStack[len(p.workerStack)-1]
	p.workerStack = p.workerStack[:len(p.workerStack)-1]
	p.lock.Unlock()
	return workerIndex
}

len(p.workerStack) == 0 这个判断是多余的，因为调用 popWorker() 的 dispatch() 方法长这样：

// dispatch dispatches tasks to workers.
func (p *goPool) dispatch() {
	for t := range p.taskQueue {
		p.cond.L.Lock()
		for len(p.workerStack) == 0 {
			p.cond.Wait()
		}
		p.cond.L.Unlock()
		workerIndex := p.popWorker()
		p.workers[workerIndex].taskQueue <- t
	}
}

这里通过 len(p.workerStack) == 0 判断保证了调用 popWorker() 被调用时，len(p.workerStack) != 0。

所以最后我采纳了 GPT-4 给出的这个 Wait() 方法：

func (p *goPool) Wait() {
    for {
        p.lock.Lock()
        workerStackLen := len(p.workerStack)
        p.lock.Unlock()

        if len(p.taskQueue) == 0 && workerStackLen == len(p.workers) {
            break
        }

        time.Sleep(100 * time.Millisecond)
    }
}

最后我提了这个 PR：Improve task execution and testing in goPool