Throughput problems -- Part 3: concurrent I/O model
Time to start optimizing the application layer!!
The problems visible so far:
- select takes far too much time!!
- 13% of read system calls return errors -- that alone is a problem.
- read is called too often. Could we enlarge the receive buffer to cut the number of reads, and also use zero-copy TCP ("A reworked TCP zero-copy receive API")?
- write is called even more often than read; an aggregation (scatter-gather) interface should help.
- socket/bind/ioctl were each used 3178 times, so client + server fds come to 3178 * 2 = 6356. Even if setsockopt is used to set TCP_NODELAY, SO_REUSEADDR, SO_RCVBUF/SO_SNDBUF, SO_KEEPALIVE, SO_REUSEPORT, TCP_DEFER_ACCEPT, SO_LINGER and so on, that still cannot add up to 278k calls.
- epoll_ctl is quite frequent -- currently 52k calls, far more than 3178.
- recvfrom has a high failure rate, around 50%.
- shutdown is still being called after close, so roughly 50% of shutdown calls fail.
- Is the futex really necessary?
- restart_syscall costs too much time.
- close and similar syscalls take too long -- close alone reaches 20 µs.
Current model:
There is clearly a contended-resource bottleneck.
Test with the following model instead:
Results:
```
perf stat -p 9880 sleep 10

 Performance counter stats for process id '9880':

    30372.082735  task-clock-msecs   #     3.037 CPUs
          340714  context-switches   #     0.011 M/sec
             288  CPU-migrations     #     0.000 M/sec
             150  page-faults        #     0.000 M/sec
     65299950534  cycles             #  2149.999 M/sec  (scaled from 66.24%)
     12797366330  instructions       #     0.196 IPC    (scaled from 83.37%)
      3284418549  branches           #   108.139 M/sec  (scaled from 83.17%)
        15383662  branch-misses      #     0.468 %      (scaled from 83.69%)
       195263517  cache-references   #     6.429 M/sec  (scaled from 83.42%)
        29353131  cache-misses       #     0.966 M/sec  (scaled from 83.49%)

    10.000647442  seconds time elapsed

perf stat -p 9880 sleep 10

 Performance counter stats for process id '9880':

    30754.416664  task-clock-msecs   #     3.075 CPUs
          341624  context-switches   #     0.011 M/sec
             358  CPU-migrations     #     0.000 M/sec
             121  page-faults        #     0.000 M/sec
     66197865785  cycles             #  2152.467 M/sec  (scaled from 66.21%)
     12878834358  instructions       #     0.195 IPC    (scaled from 83.57%)
      3319059775  branches           #   107.921 M/sec  (scaled from 83.42%)
        15481326  branch-misses      #     0.466 %      (scaled from 83.72%)
       194900683  cache-references   #     6.337 M/sec  (scaled from 83.63%)
        28394751  cache-misses       #     0.923 M/sec  (scaled from 83.03%)

    10.001158953  seconds time elapsed
```
This at least solves the thundering-herd problem and the multi-thread contention on the listen fd.
Compare with the earlier stat results:
```
perf stat -p 9884 sleep 10

 Performance counter stats for process id '9884':

   200815.127256  task-clock-msecs   #    20.075 CPUs
         2456764  context-switches   #     0.012 M/sec
            1294  CPU-migrations     #     0.000 M/sec
            3583  page-faults        #     0.000 M/sec
    430791607582  cycles             #  2145.215 M/sec  (scaled from 66.57%)
     63233677155  instructions       #     0.147 IPC    (scaled from 83.19%)
     18174748495  branches           #    90.505 M/sec  (scaled from 83.19%)
        70154714  branch-misses      #     0.386 %      (scaled from 83.45%)
       806455643  cache-references   #     4.016 M/sec  (scaled from 83.36%)
       164527072  cache-misses       #     0.819 M/sec  (scaled from 83.44%)

    10.003181505  seconds time elapsed

perf stat -p 9884 sleep 10

 Performance counter stats for process id '9884':

   203748.965387  task-clock-msecs   #    20.373 CPUs
         2274598  context-switches   #     0.011 M/sec
            1768  CPU-migrations     #     0.000 M/sec
            3570  page-faults        #     0.000 M/sec
    438541182863  cycles             #  2152.360 M/sec  (scaled from 66.80%)
     63421130555  instructions       #     0.145 IPC    (scaled from 83.34%)
     18428625598  branches           #    90.448 M/sec  (scaled from 83.23%)
        69229085  branch-misses      #     0.376 %      (scaled from 83.41%)
       770039314  cache-references   #     3.779 M/sec  (scaled from 83.14%)
       158705951  cache-misses       #     0.779 M/sec  (scaled from 83.48%)

    10.000827457  seconds time elapsed
```
Analyzing the perf stat numbers: this change is a definite improvement (task-clock drops from about 20 CPUs to about 3, context switches from 2.4M to 340k); meanwhile CPS is roughly 60k.
For the network architecture, the following considerations currently apply.
Note, though, that the TCP stack still has spinlocks: when multiple threads accept concurrently, opening a new fd takes a spinlock on the file struct, which also becomes a problem.
The listener thread accepts all client connections.
For each new client connection the listener obtains a new fd, then hands it through a dispatcher to the matching worker thread (hash-based).
1. There is only one listener thread doing accept, so under a burst of high concurrency it easily becomes a bottleneck.
2. A single thread uses I/O multiplexing to handle the reads/writes, packet parsing, and subsequent business logic of many connection fds; this causes serious queueing. For example, if the in-process handling after one connection's packet is received and parsed takes too long, requests on the other connections block and queue up behind it.
The classic example of this scheme is the memcached cache. It suits cache and proxy-middleware scenarios where in-process handling is fast; for higher-performance scenarios, multiple threads will inevitably hit the VFS locks, since some VFS locks are per-process rather than per-thread.
Now compare single listen vs. one listen socket per thread under multithreading, both measured at 50k CPS:
One setup has a CPU at 100%, the other at 30% -- a striking difference!!
Do the right thing, and let the future take care of itself.
-- the fatty who is 180 in both height and weight