用 70 行 Go 代码击败 C 语言

　　Chris Penner 最近发布的一篇文章 Beating C with 80 Lines of Haskell[1] 引发了 Internet 领域内广泛的论战，进而引发了一场用不同语言实现 wc 的圣战：

　　Ada[2]C[3]Common Lisp[4]Dyalog APL[5]Futhark[6]Haskell[7]Rust[8]

　　今天我们用 Go 语言来实现 wc 的功能。作为有着杰出的并发基因的语言，实现与 C 语言相当的性能（原文为 comparable performance[9]）应该是小菜一碟。

　　虽然 wc 也被设计为从 stdin 读取信息，处理 non-ASCII 文本编码，从命令行解析 flags（manpage[10]），但我们并不去这样做。我们要做的是，像前面提到的那篇文章一样，让我们的实现尽可能简单。

　　本文涉及的源码可以在这里[11] 找到。

　　$ /usr/bin/time -f "%es %MKB" wc test.txt

　　我们使用与原文相同版本的 `wc`[12] ，用 gcc 9.2.1 编译，编译优化选项为 -O3。在我们的实现中，使用 Go 1.13.4（我确实也试过用 gccgo，但结果不是很理想）。我们用以下配置来运行所有的基准：

　　Intel Core i5-6200U @ 2.30 GHz (2 physical cores, 4 threads)4+4 GB RAM @ 2133 MHz240 GB M.2 SSDFedora 31

　　公平起见，所有实现都使用一个 16 KB 的 buffer 来读取输入。输入是两个 us-ascii 编码的文本文件，大小分别是 100 MB 和 1 GB。

　　因为我们只需要输入文件路径，所以解析参数很容易：

　　if len(os.Args) < 2 {

　　panic("no file path specified")

　　}

　　filePath :=os.Args[1]

　　file, err :=os.Open(filePath)

　　if err !=nil {

　　panic(err)

　　}

　　defer file.Close()

　　我们会逐字节遍历文本，跟踪状态。幸运的是，在本文案例中，我们只需要引入两个状态：

　　前一个字节是 whitespace前一个字节不是 whitespace

　　当从一个 whitespace 字符跳到一个 non-whitespace 字符时，单词计数加 1。这种方法可以直接从字节流读取信息，保持低内存消耗。

　　const bufferSize=16 * 1024

　　reader :=bufio.NewReaderSize(file, bufferSize)

　　lineCount :=0

　　wordCount :=0

　　byteCount :=0

　　prevByteIsSpace :=true

　　for {

　　b, err :=reader.ReadByte()

　　if err !=nil {

　　if err==io.EOF {

　　break

　　} else {

　　panic(err)

　　}

　　byteCount++

　　switch b {

　　case '

　　lineCount++

　　prevByteIsSpace=true

　　case ' ', ' ', '\r', '\v', '\f':

　　prevByteIsSpace=true

　　default:

　　if prevByteIsSpace {

　　wordCount++

　　prevByteIsSpace=false

　　}

　　为了展示结果，我们用原生的 println() 函数 — 在我的试验中，导入 fmt 包会导致运行时空间增加约 400 KB。

　　println(lineCount, wordCount, byteCount, file.Name())

　　运行结果：

　　input sizeelapsed timemax memorywc100 MB0.58 s2052 KBwc-naive100 MB0.77 s1416 KBwc1 GB5.56 s2036 KBwc-naive1 GB7.69 s1416 KB

　　好消息是我们的第一次尝试在性能方面非常接近 C 语言。事实上，在内存使用方面，我们做得比 C 语言更好。

　　虽然对 I/O 读取进行缓冲显著提升了性能，但是调用 ReadByte() 和在循环中检查 error 造成了一大笔不必要的开销。我们可以通过手动缓冲读请求来规避上述情况，而不再依赖 bufio.Reader。

　　为了实现手动缓冲，我们把输入分割成多个可以单独处理的缓冲块。幸运的是，我们只要知道前一个缓冲块（我们之前看到过）的最后一个字符是否是 whitespace 就可以处理当前的块。

　　我们写几个实用函数：

　　type Chunk struct {

　　PrevCharIsSpace bool

　　Buffer []byte

　　}

　　type Count struct {

　　LineCount int

　　WordCount int

　　}

　　func GetCount(chunk Chunk) Count {

　　count :=Count{}

　　prevCharIsSpace :=chunk.PrevCharIsSpace

　　for _, b :=range chunk.Buffer {

　　switch b {

　　case '

　　count.LineCount++

　　prevCharIsSpace=true

　　case ' ', ' ', '\r', '\v', '\f':

　　prevCharIsSpace=true

　　default:

　　if prevCharIsSpace {

　　prevCharIsSpace=false

　　count.WordCount++

　　}

　　return count

　　}

　　func IsSpace(b byte) bool {

　　return b==' ' || b==' ' || b=='

　　' || b=='\r' || b=='\v' || b=='\f'

　　}

　　现在，我们可以把输入分割成多个块，然后传入 GetCount 函数。

　　totalCount :=Count{}

　　lastCharIsSpace :=true

　　const bufferSize=16 * 1024

　　buffer :=make([]byte, bufferSize)

　　for {

　　bytes, err :=file.Read(buffer)

　　if err !=nil {

　　if err==io.EOF {

　　break

　　} else {

　　panic(err)

　　}

　　count :=GetCount(Chunk{lastCharIsSpace, buffer[:bytes]})

　　lastCharIsSpace=IsSpace(buffer[bytes-1])

　　totalCount.LineCount +=count.LineCount

　　totalCount.WordCount +=count.WordCount

　　}

　　为了计数字节，我们可以进行一次系统调用来查询文件的大小：

　　fileStat, err :=file.Stat()

　　if err !=nil {

　　panic(err)

　　}

　　byteCount :=fileStat.Size()

　　现在做完该做的了，来看一下表现如何：

　　input sizeelapsed timemax memorywc100 MB0.58 s2052 KBwc-chunks100 MB0.34 s1404 KBwc1 GB5.56 s2036 KBwc-chunks1 GB3.31 s1416 KB

　　看起来我们在两个统计维度上都超过了 wc，而且我们还没有开始并行化我们的程序。`tokei`[13] 统计结果显示这个程序一共只有 70 行代码！

　　诚然，并行化实现 wc 是大材小用了，但我们还是来看一下到底能达到什么程度。原文是并行地从输入文件中读的，尽管它缩短了运行时间，但作者同时也承认，并行读仅在特定几种存储的情况下对性能有提升，其他的情况下可能降低性能。

　　在我们的实现中，我们希望我们的代码在所有设备上运行都能有很好的性能，所以我们不像原文作者那样做。我们会创建两个 channel：chunks 和 counts 。每个 worker 都会从 chunks 读取和处理数据直到 channel 被 close，之后把结果写进 counts。

　　func ChunkCounter(chunks

　　totalCount :=Count{}

　　for {

　　chunk, ok :=<-chunks

　　if !ok {

　　break

　　}

　　count :=GetCount(chunk)

　　totalCount.LineCount +=count.LineCount

　　totalCount.WordCount +=count.WordCount

　　}

　　counts

　　}

　　我们在每个 CPU core 上起一个 worker：

　　numWorkers :=runtime.NumCPU()

　　chunks :=make(chan Chunk)

　　counts :=make(chan Count)

　　for i :=0; i < numWorkers; i++ {

　　Go ChunkCounter(chunks, counts)

　　}

　　现在，我们在循环中从硬盘读取数据和给每个 worker 分配 job：

　　const bufferSize=16 * 1024

　　lastCharIsSpace :=true

　　for {

　　buffer :=make([]byte, bufferSize)

　　bytes, err :=file.Read(buffer)

　　if err !=nil {

　　if err==io.EOF {

　　break

　　} else {

　　panic(err)

　　}

　　chunks

　　lastCharIsSpace=IsSpace(buffer[bytes-1])

　　}

　　close(chunks)

　　这些完成后，我们可以很简单地把所有 worker 的计数相加。

　　totalCount :=Count{}

　　for i :=0; i < numWorkers; i++ {

　　count :=<-counts

　　totalCount.LineCount +=count.LineCount

　　totalCount.WordCount +=count.WordCount

　　}

　　close(counts)

　　我们运行起来然后看一下与之前结果的对比：

　　input sizeelapsed timemax memorywc100 MB0.58 s2052 KBwc-channel100 MB0.27 s6644 KBwc1 GB5.56 s2036 KBwc-channel1 GB2.22 s6752 KB

　　我们实现的 wc 在速度方面有很大提升，但在内存使用方面与之前相比有些倒退。请特别留意我们的输入循环在每一次执行中是怎么样申请内存的！channel 是对共享内存的高度抽象，但在实际使用时，不使用 channel 可能会大幅提升性能。

　　在这部分，我们让每个 worker 都读取文件，并使用 sync.Mutex 来确保不会同时读取。我们可以创建一个 struct 来为我们处理这种情况：

　　type FileReader struct {

　　File *os.File

　　LastCharIsSpace bool

　　mutex sync.Mutex

　　}

　　func (fileReader *FileReader) ReadChunk(buffer []byte) (Chunk, error) {

　　fileReader.mutex.Lock()

　　defer fileReader.mutex.Unlock()

　　bytes, err :=fileReader.File.Read(buffer)

　　if err !=nil {

　　return Chunk{}, err

　　}

　　chunk :=Chunk{fileReader.LastCharIsSpace, buffer[:bytes]}

　　fileReader.LastCharIsSpace=IsSpace(buffer[bytes-1])

　　return chunk, nil

　　}

　　为了能直接读取文件，我们重写 worker 函数：

　　func FileReaderCounter(fileReader *FileReader, counts chan Count) {

　　const bufferSize=16 * 1024

　　buffer :=make([]byte, bufferSize)

　　totalCount :=Count{}

　　for {

　　chunk, err :=fileReader.ReadChunk(buffer)

　　if err !=nil {

　　if err==io.EOF {

　　break

　　} else {

　　panic(err)

　　}

　　count :=GetCount(chunk)

　　totalCount.LineCount +=count.LineCount

　　totalCount.WordCount +=count.WordCount

　　}

　　counts

　　}

　　像之前一样，我们还是在每个 CPU core 起一个 worker：

　　fileReader :=&FileReader{

　　File: file,

　　LastCharIsSpace: true,

　　}

　　counts :=make(chan Count)

　　for i :=0; i < numWorkers; i++ {

　　Go FileReaderCounter(fileReader, counts)

　　}

　　totalCount :=Count{}

　　for i :=0; i < numWorkers; i++ {

　　count :=<-counts

　　totalCount.LineCount +=count.LineCount

　　totalCount.WordCount +=count.WordCount

　　}

　　close(counts)

　　来看看表现如何：

　　intput sizeelapsed timemax memorywc100 MB0.58 s2052 KBwc-mutex100 MB0.12 s1580 KBwc1 GB5.56 s2036 KBwc-mutex1 GB1.21 s1576 KB

　　我们的并行化实现用更小的内存消耗比 wc 的运行速度快了 4.5 倍！这意义非凡，尤其是当你意识到 Go 是一种有垃圾回收机制的语言时。

　　尽管本文结论并不意味着 Go > C，但我希望它能证明 Go 作为一种系统编程语言可以是 C 语言的可替代项。

posted @ 2022-02-09 19:58 ebuybay 阅读(64) 评论(0) 编辑收藏举报

刷新页面返回顶部

登录后才能查看或发表评论，立即登录或者逛逛博客园首页

相关博文：

· 解密为何 Go 能从众多语言中脱颖而出

· Golang：Concurrency, Goroutines and GOMAXPROC

· 另一种语言

· 简单对比一下 C 与 Go 两种语言

· 【Go】0、golang简介

阅读排行：
· 全程不用写代码，我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· MongoDB 8.0这个新功能碉堡了，比商业数据库还牛
· 白话解读 Dapr 1.15：你的「微服务管家」又秀新绝活了

公告

昵称： ebuybay
园龄： 3年5个月
粉丝： 4
关注： 0

+加关注

2025年3月

日

一

二

三

四

五

六

ebuybay

用 70 行 Go 代码击败 C 语言

公告

搜索

常用链接

随笔档案

阅读排行榜

评论排行榜

推荐排行榜

最新评论