[bug]记一次同步数据问题排查
记一次线上事故排查
具体场景简介:
目前有一个数据同步的场景(Mongodb -> ES)
架构模型如下:
遇到的问题:
任务消费到某一时刻,就会hang住,无法继续消费,channel队列满了以后,导致整个同步任务全面停止
可以很明显的看到已经hang住了,并且最后一条日志明确的显示是阻塞在发送任务上
已经预先开启了pprof,我们看一下当前goroutine阻塞的内容
goroutine profile: total 56
10 @ 0x103d0b2 0x104d4b2 0x1e3ef6c 0x1e443d5 0x1fef5a5 0x1feeaa5 0x106ed81
# 0x1e3ef6b github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add+0x12b /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229
# 0x1e443d4 aqgs/sync_assistant/internal/toes.ToES+0x394 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:30
# 0x1fef5a4 aqgs/sync_assistant/internal/task/risk.StorageSoftware+0x604 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:141
# 0x1feeaa4 aqgs/sync_assistant/internal/task/risk.PoolStorageSoftware+0x1c4 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:106
8 @ 0x103d0b2 0x103d14a 0x106aafe 0x107ac78 0x1c1ada5 0x1c1a7ac 0x106ed81
# 0x106aafd sync.runtime_notifyListWait+0x13d /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sema.go:513
# 0x107ac77 sync.(*Cond).Wait+0x77 /usr/local/Cellar/go/1.17.5/libexec/src/sync/cond.go:56
# 0x1c1ada4 go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).createConnections.func2+0x1a4 /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/pool.go:616
# 0x1c1a7ab go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).createConnections+0x22b /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/pool.go:643
4 @ 0x103d0b2 0x104d4b2 0x11a4007 0x106ed81
# 0x11a4006 database/sql.(*DB).connectionOpener+0xc6 /usr/local/Cellar/go/1.17.5/libexec/src/database/sql/sql.go:1196
4 @ 0x103d0b2 0x104d4b2 0x1c1b8eb 0x106ed81
# 0x1c1b8ea go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).maintain+0x54a /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/pool.go:750
4 @ 0x103d0b2 0x104d4b2 0x1c1e3d5 0x106ed81
# 0x1c1e3d4 go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*rttMonitor).start+0x2b4 /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/rtt_monitor.go:95
4 @ 0x103d0b2 0x104d4b2 0x1c22b66 0x1c22a2c 0x106ed81
# 0x1c22b65 go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*Server).update.func3+0x105 /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/server.go:493
# 0x1c22a2b go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*Server).update+0x78b /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/server.go:556
4 @ 0x103d0b2 0x104d4b2 0x200da5e 0x106ed81
# 0x200da5d github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher.func1+0xfd /Users/songzhibin/go/pkg/mod/github.com/go-sql-driver/mysql@v1.6.0/connection.go:614
3 @ 0x103d0b2 0x104d4b2 0x1e3ef6c 0x1e44c7d 0x1e42aba 0x1e4064a 0x106ed81
# 0x1e3ef6b github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add+0x12b /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229
# 0x1e44c7c aqgs/sync_assistant/internal/toes.ToES.func1+0x27c /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:41
# 0x1e42ab9 github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush+0x1639 /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590
# 0x1e40649 github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run.func1+0x7a9 /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:386
2 @ 0x103d0b2 0x103d14a 0x104e185 0x106a805 0x108019c 0x1debab2 0x106ed81
# 0x106a804 sync.runtime_Semacquire+0x24 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sema.go:56
# 0x108019b sync.(*WaitGroup).Wait+0xbb /usr/local/Cellar/go/1.17.5/libexec/src/sync/waitgroup.go:130
# 0x1debab1 github.com/spf13/viper.(*Viper).WatchConfig.func1+0x611 /Users/songzhibin/go/pkg/mod/github.com/spf13/viper@v1.10.1/viper.go:452
2 @ 0x103d0b2 0x104d4b2 0x1454007 0x106ed81
# 0x1454006 github.com/go-redis/redis/v8/internal/pool.(*ConnPool).reaper+0x166 /Users/songzhibin/go/pkg/mod/github.com/go-redis/redis/v8@v8.11.4/internal/pool/pool.go:485
2 @ 0x103d0b2 0x104d4b2 0x1debc4e 0x106ed81
# 0x1debc4d github.com/spf13/viper.(*Viper).WatchConfig.func1.1+0x14d /Users/songzhibin/go/pkg/mod/github.com/spf13/viper@v1.10.1/viper.go:413
2 @ 0x106b79b 0x1d2987a 0x1d29405 0x1d2d965 0x1d2bbcd 0x106ed81
# 0x106b79a syscall.syscall6+0x3a /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sys_darwin.go:44
# 0x1d29879 golang.org/x/sys/unix.kevent+0xb9 /Users/songzhibin/go/pkg/mod/golang.org/x/sys@v0.0.0-20211210111614-af8b64212486/unix/zsyscall_darwin_amd64.go:276
# 0x1d29404 golang.org/x/sys/unix.Kevent+0x144 /Users/songzhibin/go/pkg/mod/golang.org/x/sys@v0.0.0-20211210111614-af8b64212486/unix/syscall_bsd.go:417
# 0x1d2d964 github.com/fsnotify/fsnotify.read+0xa4 /Users/songzhibin/go/pkg/mod/github.com/fsnotify/fsnotify@v1.5.1/kqueue.go:512
# 0x1d2bbcc github.com/fsnotify/fsnotify.(*Watcher).readEvents+0x10c /Users/songzhibin/go/pkg/mod/github.com/fsnotify/fsnotify@v1.5.1/kqueue.go:275
1 @ 0x1036b3e 0x106af68 0x1e3e75d 0x106ed81
# 0x106af67 os/signal.signal_recv+0x27 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sigqueue.go:166
# 0x1e3e75c os/signal.loop+0x1c /usr/local/Cellar/go/1.17.5/libexec/src/os/signal/signal_unix.go:24
1 @ 0x103d0b2 0x1006e7c 0x1006bfd 0x1fee4b3 0x20adcf8 0x1e54f67 0x1e55d93 0x1e5536f 0x20ad31f 0x20ae037 0x103cc93 0x106ed81
# 0x1fee4b2 aqgs/sync_assistant/internal/task/risk.Software+0xcf2 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:80
# 0x20adcf7 aqgs/sync_assistant/cmd.glob..func4+0x277 /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/risk.go:53
# 0x1e54f66 github.com/spf13/cobra.(*Command).execute+0xb66 /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:860
# 0x1e55d92 github.com/spf13/cobra.(*Command).ExecuteC+0x8b2 /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:974
# 0x1e5536e github.com/spf13/cobra.(*Command).Execute+0x2e /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:902
# 0x20ad31e aqgs/sync_assistant/cmd.Execute+0x3e /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/root.go:55
# 0x20ae036 main.main+0x16 /Users/songzhibin/go/src/aqgs/sync_assistant/main.go:21
# 0x103cc92 runtime.main+0x1f2 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/proc.go:255
1 @ 0x103d0b2 0x100796d 0x10075f8 0x1e3eb56 0x106ed81
# 0x1e3eb55 aqgs/sync_assistant/internal/signal.Register.func1+0x35 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/signal/signal.go:14
1 @ 0x103d0b2 0x1035aae 0x1068fa5 0x1118fa8 0x1119057 0x111db37 0x1374cd2 0x139edf5 0x139d0c7 0x15b6e70 0x15b6825 0x15b8416 0x20ae009 0x106ed81
# 0x1068fa4 internal/poll.runtime_pollWait+0x44 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/netpoll.go:234
# 0x1118fa7 internal/poll.(*pollDesc).wait+0x87 /usr/local/Cellar/go/1.17.5/libexec/src/internal/poll/fd_poll_runtime.go:84
# 0x1119056 internal/poll.(*pollDesc).waitRead+0x36 /usr/local/Cellar/go/1.17.5/libexec/src/internal/poll/fd_poll_runtime.go:89
# 0x111db36 internal/poll.(*FD).Accept+0x4b6 /usr/local/Cellar/go/1.17.5/libexec/src/internal/poll/fd_unix.go:402
# 0x1374cd1 net.(*netFD).accept+0x71 /usr/local/Cellar/go/1.17.5/libexec/src/net/fd_unix.go:173
# 0x139edf4 net.(*TCPListener).accept+0x54 /usr/local/Cellar/go/1.17.5/libexec/src/net/tcpsock_posix.go:140
# 0x139d0c6 net.(*TCPListener).Accept+0x66 /usr/local/Cellar/go/1.17.5/libexec/src/net/tcpsock.go:262
# 0x15b6e6f net/http.(*Server).Serve+0x52f /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:3002
# 0x15b6824 net/http.(*Server).ListenAndServe+0x164 /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:2931
# 0x15b8415 net/http.ListenAndServe+0xf5 /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:3185
# 0x20ae008 aqgs/sync_assistant/cmd.Execute.func1+0x28 /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/root.go:53
1 @ 0x103d0b2 0x103d14a 0x104e185 0x106a925 0x107d691 0x107d3f1 0x1e3fcea 0x106ed81
# 0x106a924 sync.runtime_SemacquireMutex+0x24 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sema.go:71
# 0x107d690 sync.(*Mutex).lockSlow+0x270 /usr/local/Cellar/go/1.17.5/libexec/src/sync/mutex.go:138
# 0x107d3f0 sync.(*Mutex).Lock+0x50 /usr/local/Cellar/go/1.17.5/libexec/src/sync/mutex.go:81
# 0x1e3fce9 github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).init.func1+0x269 /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:319
1 @ 0x1068c05 0x209e6f9 0x209e350 0x20995e7 0x20aa46d 0x20ab2c9 0x15b1783 0x15b4155 0x15b655a 0x15b0618 0x106ed81
# 0x1068c04 runtime/pprof.runtime_goroutineProfileWithLabels+0x24 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/mprof.go:746
# 0x209e6f8 runtime/pprof.writeRuntimeProfile+0x178 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/pprof/pprof.go:724
# 0x209e34f runtime/pprof.writeGoroutine+0x8f /usr/local/Cellar/go/1.17.5/libexec/src/runtime/pprof/pprof.go:684
# 0x20995e6 runtime/pprof.(*Profile).WriteTo+0xa6 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/pprof/pprof.go:331
# 0x20aa46c net/http/pprof.handler.ServeHTTP+0x42c /usr/local/Cellar/go/1.17.5/libexec/src/net/http/pprof/pprof.go:253
# 0x20ab2c8 net/http/pprof.Index+0xe8 /usr/local/Cellar/go/1.17.5/libexec/src/net/http/pprof/pprof.go:371
# 0x15b1782 net/http.HandlerFunc.ServeHTTP+0x42 /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:2047
# 0x15b4154 net/http.(*ServeMux).ServeHTTP+0x134 /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:2425
# 0x15b6559 net/http.serverHandler.ServeHTTP+0x479 /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:2879
# 0x15b0617 net/http.(*conn).serve+0x18b7 /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:1930
1 @ 0x106ed81
我这个程序开了10个groutine去同步,我们可以看到10个worker都hang在了这里
goroutine profile: total 56
10 @ 0x103d0b2 0x104d4b2 0x1e3ef6c 0x1e443d5 0x1fef5a5 0x1feeaa5 0x106ed81
# 0x1e3ef6b github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add+0x12b /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229
# 0x1e443d4 aqgs/sync_assistant/internal/toes.ToES+0x394 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:30
# 0x1fef5a4 aqgs/sync_assistant/internal/task/risk.StorageSoftware+0x604 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:141
# 0x1feeaa4 aqgs/sync_assistant/internal/task/risk.PoolStorageSoftware+0x1c4 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:106
主程卡在了发送消息到channel中
1 @ 0x103d0b2 0x1006e7c 0x1006bfd 0x1fee4b3 0x20adcf8 0x1e54f67 0x1e55d93 0x1e5536f 0x20ad31f 0x20ae037 0x103cc93 0x106ed81
# 0x1fee4b2 aqgs/sync_assistant/internal/task/risk.Software+0xcf2 /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:80
# 0x20adcf7 aqgs/sync_assistant/cmd.glob..func4+0x277 /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/risk.go:53
# 0x1e54f66 github.com/spf13/cobra.(*Command).execute+0xb66 /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:860
# 0x1e55d92 github.com/spf13/cobra.(*Command).ExecuteC+0x8b2 /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:974
# 0x1e5536e github.com/spf13/cobra.(*Command).Execute+0x2e /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:902
# 0x20ad31e aqgs/sync_assistant/cmd.Execute+0x3e /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/root.go:55
# 0x20ae036 main.main+0x16 /Users/songzhibin/go/src/aqgs/sync_assistant/main.go:21
# 0x103cc92 runtime.main+0x1f2 /usr/local/Cellar/go/1.17.5/libexec/src/runtime/proc.go:255
点进去源码可以看到,阻塞到了select这里
定位到sdk代码
我们切回到 pprof中full groutine stack dump
中详细看一下
goroutine 175 [select, 20 minutes]:
github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add(0xc00034a000, {0x2431250, 0xc00003e0d0}, {{0x0, 0x0}, {0x2353093, 0x5}, {0xc002dceea0, 0x2e}, {0x0, ...}, ...})
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229 +0x12c
aqgs/sync_assistant/internal/toes.ToES.func1({_, _}, {{0x0, 0x0}, {0x2355caa, 0x6}, {0xc002dceea0, 0x2e}, {0x0, 0x0}, ...}, ...)
/Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:41 +0x27d
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush(0xc00069e7e0, {0x2431250, 0xc00003e0d0})
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590 +0x163a
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run.func1()
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:386 +0x7aa
created by github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:351 +0x68
goroutine 176 [select, 20 minutes]:
github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add(0xc00034a000, {0x2431250, 0xc00003e0d0}, {{0x0, 0x0}, {0x2353093, 0x5}, {0xc000c1a450, 0x2e}, {0x0, ...}, ...})
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229 +0x12c
aqgs/sync_assistant/internal/toes.ToES.func1({_, _}, {{0x0, 0x0}, {0x2355caa, 0x6}, {0xc000c1a450, 0x2e}, {0x0, 0x0}, ...}, ...)
/Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:41 +0x27d
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush(0xc00069e840, {0x2431250, 0xc00003e0d0})
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590 +0x163a
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run.func1()
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:386 +0x7aa
created by github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:351 +0x68
goroutine 177 [select, 20 minutes]:
github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add(0xc00034a000, {0x2431250, 0xc00003e0d0}, {{0x0, 0x0}, {0x2353093, 0x5}, {0xc0034cc360, 0x2e}, {0x0, ...}, ...})
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229 +0x12c
aqgs/sync_assistant/internal/toes.ToES.func1({_, _}, {{0x0, 0x0}, {0x2355caa, 0x6}, {0xc0034cc360, 0x2e}, {0x0, 0x0}, ...}, ...)
/Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:41 +0x27d
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush(0xc00069e8a0, {0x2431250, 0xc00003e0d0})
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590 +0x163a
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run.func1()
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:386 +0x7aa
created by github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:351 +0x68
我挑出重要的栈信息
goroutine 176 [select, 20 minutes]:
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush(0xc00069e840, {0x2431250, 0xc00003e0d0})
/Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590 +0x163a
再结合我的代码
阿偶,问题找到了,在buffer满的时候刷新调用onfailure,flush时执行失败后调用onfailure,我的onfailure又调用了一个add方法导致消费者死锁!!!
解决方法当然也很简单,onfailure的任务直接go出去,或者推到另外一个channel中异步处理,齐活。。。
Songzhibin
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· winform 绘制太阳,地球,月球 运作规律
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 上周热点回顾(3.3-3.9)
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人