PouchContainer Goroutine Leak 检测实践
2018-05-18 12:14 玲小喵 阅读(325) 评论(0) 编辑 收藏 举报0. 引言
PouchContainer 是阿里巴巴集团开源的一款容器运行时产品,它具备强隔离和可移植性等特点,可用来帮助企业快速实现存量业务容器化,以及提高企业内部物理资源的利用率。
1. Goroutine Leak
func main() {
waitCh := make(chan struct{})
go func() {
fmt.Println("Hi, Pouch. I'm new gopher!")
waitCh <- struct{}{}
}()
<-waitCh
}
正常情况下,一只土拨鼠完成任务之后,它将会回笼,然后等待你的下一次召唤。但是也有可能出现这只土拨鼠很长时间没有回笼的情况。
func main() {
// /exec?cmd=xx&args=yy runs the shell command in the host
http.HandleFunc("/exec", func(w http.ResponseWriter, r *http.Request) {
defer func() { log.Printf("finish %v\n", r.URL) }()
out, err := genCmd(r).CombinedOutput()
if err != nil {
w.WriteHeader(500)
w.Write([]byte(err.Error()))
return
}
w.Write(out)
})
log.Fatal(http.ListenAndServe(":8080", nil))
}
func genCmd(r *http.Request) (cmd *exec.Cmd) {
var args []string
if got := r.FormValue("args"); got != "" {
args = strings.Split(got, " ")
}
if c := r.FormValue("cmd"); len(args) == 0 {
cmd = exec.Command(c)
} else {
cmd = exec.Command(c, args...)
}
return
}
func main() {
logGoNum()
// without sender and blocking....
var ch chan int
go func(ch chan int) {
<-ch
}(ch)
for range time.Tick(2 * time.Second) {
logGoNum()
}
}
func logGoNum() {
log.Printf("goroutine number: %d\n", runtime.NumGoroutine())
}
造成 goroutine leak 有很多种不同的场景,本文接下来会通过描述 Pouch Logs API 场景,介绍如何对 goroutine leak 进行检测并给出相应的解决方案。
2. Pouch Logs API 实践
2.1 具体场景
为了更好地说明问题,本文将 Pouch Logs HTTP Handler 的代码进行简化:
func logsContainer(ctx context.Context, w http.ResponseWriter, r *http.Request) {
...
writeLogStream(ctx, w, msgCh)
return
}
func writeLogStream(ctx context.Context, w http.ResponseWriter, msgCh <-chan Message) {
for {
select {
case <-ctx.Done():
return
case msg, ok := <-msgCh:
if !ok {
return
}
w.Write(msg.Byte())
}
}
}
2.2 如何检测 goroutine leak?
# step 1: create background job
pouch run -d busybox sh -c "while true; do sleep 1; done"
# step 2: follow the log and stop it after 3 seconds
curl -m 3 {ip}:{port}/v1.24/containers/{container_id}/logs?stdout=1&follow=1
# step 3: after 3 seconds, dump the stack info
curl -s "{ip}:{port}/debug/pprof/goroutine?debug=2" | grep -A 10 logsContainer
github.com/alibaba/pouch/apis/server.(*Server).logsContainer(0xc420330b80, 0x251b3e0, 0xc420d93240, 0x251a1e0, 0xc420432c40, 0xc4203f7a00, 0x3, 0x3)
/tmp/pouchbuild/src/github.com/alibaba/pouch/apis/server/container_bridge.go:339 +0x347
github.com/alibaba/pouch/apis/server.(*Server).(github.com/alibaba/pouch/apis/server.logsContainer)-fm(0x251b3e0, 0xc420d93240, 0x251a1e0, 0xc420432c40, 0xc4203f7a00, 0x3, 0x3)
/tmp/pouchbuild/src/github.com/alibaba/pouch/apis/server/router.go:53 +0x5c
github.com/alibaba/pouch/apis/server.withCancelHandler.func1(0x251b3e0, 0xc420d93240, 0x251a1e0, 0xc420432c40, 0xc4203f7a00, 0xc4203f7a00, 0xc42091dad0)
/tmp/pouchbuild/src/github.com/alibaba/pouch/apis/server/router.go:114 +0x57
github.com/alibaba/pouch/apis/server.filter.func1(0x251a1e0, 0xc420432c40, 0xc4203f7a00)
/tmp/pouchbuild/src/github.com/alibaba/pouch/apis/server/router.go:181 +0x327
net/http.HandlerFunc.ServeHTTP(0xc420a84090, 0x251a1e0, 0xc420432c40, 0xc4203f7a00)
/usr/local/go/src/net/http/server.go:1918 +0x44
github.com/alibaba/pouch/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc4209fad20, 0x251a1e0, 0xc420432c40, 0xc4203f7a00)
/tmp/pouchbuild/src/github.com/alibaba/pouch/vendor/github.com/gorilla/mux/mux.go:133 +0xed
net/http.serverHandler.ServeHTTP(0xc420a18d00, 0x251a1e0, 0xc420432c40, 0xc4203f7800)
2.3 怎么解决?
golang 提供的包 net/http
有监控链接断开的功能:
// HTTP Handler Interceptors
func withCancelHandler(h handler) handler {
return func(ctx context.Context, rw http.ResponseWriter, req *http.Request) error {
// https://golang.org/pkg/net/http/#CloseNotifier
if notifier, ok := rw.(http.CloseNotifier); ok {
var cancel context.CancelFunc
ctx, cancel = context.WithCancel(ctx)
waitCh := make(chan struct{})
defer close(waitCh)
closeNotify := notifier.CloseNotify()
go func() {
select {
case <-closeNotify:
cancel()
case <-waitCh:
}
}()
}
return h(ctx, rw, req)
}
}
CloseNotify 并不适用于 Hijack 链接的场景,因为 Hijack 之后,有关于链接的所有处理都交给了实际的 Handler,HTTP Server 已经放弃了数据的管理权。
那么这样的检测可以做成自动化吗?下面会结合常用的分析工具来进行说明。
3. 常用的分析工具
3.1 net/http/pprof
goroutine 93 [chan receive]:
github.com/alibaba/pouch/daemon/mgr.NewContainerMonitor.func1(0xc4202ce618)
/tmp/pouchbuild/src/github.com/alibaba/pouch/daemon/mgr/container_monitor.go:62 +0x45
created by github.com/alibaba/pouch/daemon/mgr.NewContainerMonitor
/tmp/pouchbuild/src/github.com/alibaba/pouch/daemon/mgr/container_monitor.go:60 +0x8d
goroutine 94 [chan receive]:
github.com/alibaba/pouch/daemon/mgr.(*ContainerManager).execProcessGC(0xc42037e090)
/tmp/pouchbuild/src/github.com/alibaba/pouch/daemon/mgr/container.go:2177 +0x1a5
created by github.com/alibaba/pouch/daemon/mgr.NewContainerManager
/tmp/pouchbuild/src/github.com/alibaba/pouch/daemon/mgr/container.go:179 +0x50b
goroutine stack 通常第一行包含着 Goroutine ID,接下来的几行是具体的调用栈信息。有了调用栈信息,我们就可以通过 关键字匹配 的方式来检索是否存在泄漏的情况了。
总的来说,debug
接口的方式适用于 集成测试 ,因为测试用例和目标服务不在同一个进程里,需要 dump 目标进程的 goroutine stack 来获取泄漏信息。
3.2 runtime.NumGoroutine
当测试用例和目标函数/服务在同一个进程里时,可以通过 goroutine 的数目变化来判断是否存在泄漏问题。
func TestXXX(t *testing.T) {
orgNum := runtime.NumGoroutine()
defer func() {
if got := runtime.NumGoroutine(); orgNum != got {
t.Fatalf("xxx", orgNum, got)
}
}()
...
}