ClickHouse关于参数receive_timeout探究

 

ClickHouse关于参数receive_timeout探究

 

仿伪链接:https://www.cnblogs.com/PiscesCanon/p/16898032.html

这几天研究给这个参数搞懵逼了,完全不明意义。

一个查询可能出现好几种报错,人麻了。。

辣鸡CK只想说。

 

先看官网解释:https://clickhouse.com/docs/zh/operations/settings/settings/#connect-timeout-receive-timeout-send-timeout

connect_timeout,receive_timeout,send_timeout

用于与客户端通信的套接字上的超时以秒为单位。

默认值:10,300,300。

 

测试表信息:default.zkm_local

dev-app76 :) SELECT
                 host AS `主机名`,
                 sum(rows) AS `总行数`,
                 formatReadableSize(sum(data_uncompressed_bytes)) AS `原始大小`,
                 formatReadableSize(sum(data_compressed_bytes)) AS `压缩大小`,
                 round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 0) AS `压缩率`
             FROM `gv$parts`
             WHERE database='default'
             AND table='zkm_local'
             GROUP BY host
             ORDER BY host ASC;

SELECT
    host AS `主机名`,
    sum(rows) AS `总行数`,
    formatReadableSize(sum(data_uncompressed_bytes)) AS `原始大小`,
    formatReadableSize(sum(data_compressed_bytes)) AS `压缩大小`,
    round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 0) AS `压缩率`
FROM `gv$parts`
WHERE (database = 'default') AND (table = 'zkm_local')
GROUP BY host
ORDER BY host ASC

Query id: 729ba8e9-4b90-44be-9f74-1f2b487774bd

┌─主机名───-─┬──总行数─-┬─原始大小───-┬─压缩大小───-┬─压缩率-─┐
│ dev-app76 │ 2595814846.24 MiB │ 234.81 MiB │     28 │
│ dev-app77 │ 2595814846.24 MiB │ 234.81 MiB │     28 │
│ dev-app78 │ 2597249846.73 MiB │ 234.50 MiB │     28 │
│ dev-app79 │ 2597249846.73 MiB │ 234.50 MiB │     28 │
└───────────┴─────────┴────────────┴────────────┴────────┘

4 rows in set. Elapsed: 0.193 sec. 

 

receive_timeout=0的情况,远程连接同时设置该参:

[root@dev-app80 daily]# clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=0
ClickHouse client version 22.3.2.1.
Connecting to 192.168.1.172:9000 as user default.
Connected to ClickHouse server version 22.3.2 revision 54455.

dev-app77 :) select * from zkm_local;

SELECT *
FROM zkm_local

Query id: 1e66f2b7-fddc-4947-a21f-1ed699972915

Timeout exceeded while receiving data from server. Waited for 0 seconds, timeout is 0 seconds.
Cancelling query.
Ok.
Query was cancelled.

0 rows in set. Elapsed: 0.137 sec. Processed 158.84 thousand rows, 81.89 MB (1.16 million rows/s., 596.64 MB/s.)

 

对于查询SQL快的可能成功,也可能失败,如:

dev-app77 :) select * from system.settings where name='receive_timeout';

SELECT *
FROM system.settings
WHERE name = 'receive_timeout'

Query id: e937b13c-37d6-43c0-ba66-45e07ebd1b30

┌─name────────────┬─value─┬─changed─┬─description─┬─min──┬─max──┬─readonly─┬─type────┐
│ receive_timeout │ 01 │             │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ Seconds │
└─────────────────┴───────┴─────────┴─────────────┴──────┴──────┴──────────┴─────────┘

1 rows in set. Elapsed: 0.005 sec. 

 

测试receive_timeout=1.

测试语句:

select t.* from zkm_local t;

 

结果1:

能够正常跑完

 

结果2:

远程连接第二个节点,同时设置receive_timeout=1,执行SQL,输出部分结果后报错如下:

↘ Progress: 117.88 thousand rows, 60.67 MB (18.12 thousand rows/s., 9.33 MB/s.) ████▌                                                                       (1.2 CPU, 152.95 MB RAM) 4%Exception on client:
Code: 33. DB::Exception: Cannot read all data. Bytes read: 171757. Bytes expected: 604103.: while receiving packet from 192.168.1.172:9000. (CANNOT_READ_ALL_DATA)

Connecting to 192.168.1.172:9000 as user default.
Connected to ClickHouse server version 22.3.2 revision 54455.

 

同时,在输出结果(client端还未抛出报错)的某个时刻,错误日志报错信息:

2022.11.16 17:30:49.354948 [ 6507 ] {6b5a5fde-a832-4c9a-af93-dceda3c12127} <Error> executeQuery: Code: 209. DB::NetException: Timeout exceeded while writing to socket (192.168.1.123:46930, 1000 ms). (SOCKET_TIMEOUT) (version 22.3.2.1) (from 192.168.1.123:46930) (in query: select t.* from zkm_local t;), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa4dde1a in /usr/bin/clickhouse
1. DB::WriteBufferFromPocoSocket::nextImpl() @ 0x14187daf in /usr/bin/clickhouse
2. DB::TCPHandler::sendData(DB::Block const&) @ 0x1544a014 in /usr/bin/clickhouse
3. DB::TCPHandler::processOrdinaryQueryWithProcessors() @ 0x15448626 in /usr/bin/clickhouse
4. DB::TCPHandler::runImpl() @ 0x154417a7 in /usr/bin/clickhouse
5. DB::TCPHandler::run() @ 0x15450e79 in /usr/bin/clickhouse
6. Poco::Net::TCPServerConnection::start() @ 0x164b264f in /usr/bin/clickhouse
7. Poco::Net::TCPServerDispatcher::run() @ 0x164b4aa1 in /usr/bin/clickhouse
8. Poco::PooledThread::run() @ 0x16671e49 in /usr/bin/clickhouse
9. Poco::ThreadImpl::runnableEntry(void*) @ 0x1666f1a0 in /usr/bin/clickhouse
10. start_thread @ 0x7ea5 in /usr/lib64/libpthread-2.17.so
11. __clone @ 0xfe96d in /usr/lib64/libc-2.17.so

2022.11.16 17:30:49.362383 [ 6507 ] {6b5a5fde-a832-4c9a-af93-dceda3c12127} <Error> TCPHandler: Code: 209. DB::Exception: Timeout exceeded while writing to socket (192.168.1.123:46930, 1000 ms). (SOCKET_TIMEOUT), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa4dde1a in /usr/bin/clickhouse
1. DB::WriteBufferFromPocoSocket::nextImpl() @ 0x14187daf in /usr/bin/clickhouse
2. DB::TCPHandler::sendData(DB::Block const&) @ 0x1544a014 in /usr/bin/clickhouse
3. DB::TCPHandler::processOrdinaryQueryWithProcessors() @ 0x15448626 in /usr/bin/clickhouse
4. DB::TCPHandler::runImpl() @ 0x154417a7 in /usr/bin/clickhouse
5. DB::TCPHandler::run() @ 0x15450e79 in /usr/bin/clickhouse
6. Poco::Net::TCPServerConnection::start() @ 0x164b264f in /usr/bin/clickhouse
7. Poco::Net::TCPServerDispatcher::run() @ 0x164b4aa1 in /usr/bin/clickhouse
8. Poco::PooledThread::run() @ 0x16671e49 in /usr/bin/clickhouse
9. Poco::ThreadImpl::runnableEntry(void*) @ 0x1666f1a0 in /usr/bin/clickhouse
10. start_thread @ 0x7ea5 in /usr/lib64/libpthread-2.17.so
11. __clone @ 0xfe96d in /usr/lib64/libc-2.17.so
View Code

 

同时,服务器上netstat -anputwco | grep 192.168.1.123的结果

tcp        0 341728 192.168.1.172:9000      192.168.1.123:46942     ESTABLISHED 1435/clickhouse-ser  on (0.04/0/0)
tcp        0 361629 192.168.1.172:9000      192.168.1.123:46942     ESTABLISHED 1435/clickhouse-ser  probe (0.14/0/0)
tcp        0 361629 192.168.1.172:9000      192.168.1.123:46942     ESTABLISHED 1435/clickhouse-ser  probe (0.06/0/1)
tcp        0 361629 192.168.1.172:9000      192.168.1.123:46942     ESTABLISHED 1435/clickhouse-ser  probe (0.91/0/0)
tcp        0 361629 192.168.1.172:9000      192.168.1.123:46942     ESTABLISHED 1435/clickhouse-ser  probe (3.63/0/0)
tcp        0 361629 192.168.1.172:9000      192.168.1.123:46942     ESTABLISHED 1435/clickhouse-ser  probe (2.62/0/0)
tcp        0 361629 192.168.1.172:9000      192.168.1.123:46942     ESTABLISHED 1435/clickhouse-ser  probe (1.61/0/0)
tcp        0      0 192.168.1.172:9000      192.168.1.123:46942     FIN_WAIT2   -                    timewait (59.37/0/0)
tcp        0      0 192.168.1.172:9000      192.168.1.123:46942     FIN_WAIT2   -                    timewait (58.36/0/0)
tcp        0      0 192.168.1.172:9000      192.168.1.123:46942     FIN_WAIT2   -                    timewait (57.35/0/0)
tcp        0      0 192.168.1.172:9000      192.168.1.123:46942     FIN_WAIT2   -                    timewait (56.34/0/0)

 

同时,使用dstat的观察结果:

17-11 09:37:12|  44k 1373k|  2   3  95   0   0   0|1195M  104M  924M 5759M
17-11 09:37:13|  18k  324k|  1   0  99   0   0   0|1201M  104M  924M 5753M
17-11 09:37:14|1300B 3997B|  1   0  98   0   0   0|1189M  104M  926M 5763M
17-11 09:37:15|  16k   10k|  0   0  99   0   0   0|1182M  104M  925M 5771M
17-11 09:37:16|  10k 3814B|  1   0  99   0   0   0|1182M  104M  925M 5771M
17-11 09:37:17|  38k   26k|  1   0  99   0   0   0|1185M  104M  925M 5768M
17-11 09:37:18|  17k   10k|  0   0  99   0   0   0|1184M  104M  925M 5770M
17-11 09:37:19|7390B  420k|  2   1  98   0   0   0|1198M  104M  924M 5756M
17-11 09:37:20|7923B 7543B|  1   1  99   0   0   0|1186M  104M  923M 5769M
17-11 09:37:21|3287B 2255B|  1   0  98   0   0   0|1190M  104M  925M 5763M
17-11 09:37:22|  31k   25k|  0   0  99   0   0   0|1173M  104M  925M 5780M
17-11 09:37:23|  17k   11k|  0   0  99   0   0   0|1163M  104M  925M 5790M

 

 

结果3:

结果4:

结果5:

........

其他结果一样也是报错,只是报错信息略有区别,有的没保留到。

太乱了就不全部贴出来。

 

换一种方式。

将输出内容重定向到文件中,则基本不会报错(偶尔出现):

clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=1 --send_logs_level='debug' --query="select * from default.zkm_local" > zkm_local.csv
[root@dev-app80 daily]# clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=1 --send_logs_level='debug' --query="select * from default.zkm_local" > zkm_local.csv
[dev-app77] 2022.11.17 09:28:11.124199 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> executeQuery: (from 192.168.1.123:47040) select * from default.zkm_local
[dev-app77] 2022.11.17 09:28:11.126679 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Key condition: unknown
[dev-app77] 2022.11.17 09:28:11.126926 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): MinMax index condition: unknown
[dev-app77] 2022.11.17 09:28:11.128069 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Selected 14/14 parts by partition key, 14 parts by primary key, 327/327 marks by primary key, 327 marks to read from 14 ranges
[dev-app77] 2022.11.17 09:28:11.130267 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Reading approx. 2595814 rows with 8 streams
[dev-app77] 2022.11.17 09:28:23.481457 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Information> executeQuery: Read 2595814 rows, 1.25 GiB in 12.357118953 sec., 210066 rows/sec., 103.74 MiB/sec.
[dev-app77] 2022.11.17 09:28:23.494248 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> MemoryTracker: Peak memory usage (for query): 169.81 MiB.
[root@dev-app80 daily]# 
[root@dev-app80 daily]# 
[root@dev-app80 daily]# clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=1 --send_logs_level='debug' --query="select * from default.zkm_local" > zkm_local.csv
[dev-app77] 2022.11.17 09:29:05.007509 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> executeQuery: (from 192.168.1.123:47042) select * from default.zkm_local
[dev-app77] 2022.11.17 09:29:05.009623 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Key condition: unknown
[dev-app77] 2022.11.17 09:29:05.009747 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): MinMax index condition: unknown
[dev-app77] 2022.11.17 09:29:05.010641 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Selected 14/14 parts by partition key, 14 parts by primary key, 327/327 marks by primary key, 327 marks to read from 14 ranges
[dev-app77] 2022.11.17 09:29:05.011670 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Reading approx. 2595814 rows with 8 streams
[dev-app77] 2022.11.17 09:29:17.401748 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Information> executeQuery: Read 2595814 rows, 1.25 GiB in 12.394144753 sec., 209438 rows/sec., 103.43 MiB/sec.
[dev-app77] 2022.11.17 09:29:17.405031 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> MemoryTracker: Peak memory usage (for query): 169.69 MiB.
View Code

同时,服务器上netstat -anputwco | grep 192.168.1.123的结果:

tcp        0 2028857 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.19/0/0)
tcp        0 2884482 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.20/0/0)
tcp        0 2599228 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.22/0/0)
tcp        0 2597150 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.03/0/0)
tcp        0 3081369 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.21/0/0)
tcp        0 2537205 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.21/0/0)
tcp        0 2734426 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.19/0/0)
tcp        0 2711009 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.19/0/0)
tcp        0 3360854 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.22/0/0)
tcp        0 2708708 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.21/0/0)
tcp        0 3148247 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  on (0.20/0/0)
tcp        0 3248253 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  probe (0.22/0/0)
tcp        0      0 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  off (0.00/0/0)
tcp        0      0 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  off (0.00/0/0)
tcp        0      0 192.168.1.172:9000      192.168.1.123:47042     ESTABLISHED 1435/clickhouse-ser  off (0.00/0/0)

同时,使用dstat的观察结果:

[root@dev-app77 ~]# dstat -tncm --nocolor
----system---- -net/total- ----total-cpu-usage---- ------memory-usage-----
17-11 09:29:05|  71k   29M|  9   4  87   0   0   0|1267M  103M  913M 5699M
17-11 09:29:06|  72k   49M| 12   1  87   0   0   0|1298M  103M  913M 5667M
17-11 09:29:07|  10k   17M|  4   0  96   0   0   0|1304M  103M  913M 5661M
17-11 09:29:08|  16k   18M|  5   1  92   3   0   0|1298M  103M  913M 5668M
17-11 09:29:09|  10k   15M|  5   1  94   0   0   0|1304M  103M  912M 5663M
17-11 09:29:10|  24k   13M|  4   1  94   0   0   0|1307M  103M  913M 5658M
17-11 09:29:11|  26k   16M|  6   2  92   0   0   0|1339M  103M  915M 5625M
17-11 09:29:12|4661B   17M|  4   1  95   0   0   0|1327M  103M  915M 5637M
17-11 09:29:13|  18k   15M|  4   1  91   4   0   0|1335M  103M  915M 5629M
----system---- -net/total- ----total-cpu-usage---- ------memory-usage-----
     time     | recv  send|usr sys idl wai hiq siq| used  buff  cach  free
17-11 09:29:14|  14k   16M|  4   1  95   0   0   0|1322M  103M  915M 5641M
17-11 09:29:15|5627B   10M|  3   1  97   0   0   0|1305M  103M  915M 5658M
17-11 09:29:16|  31k 8571k|  3   1  96   0   0   0|1293M  103M  913M 5673M
17-11 09:29:17|6984B 9263k|  3   1  96   0   0   0|1260M  103M  914M 5704M
17-11 09:29:18|  13k  222k|  1   0  97   2   0   0|1257M  103M  915M 5707M
17-11 09:29:19|  16k 8609B|  2   1  93   4   0   0|1242M  103M  916M 5720M
17-11 09:29:20|9057B 3024B|  1   0  99   0   0   0|1240M  103M  916M 5722M

 

一开始以为receive_time控制客户端接受数据时间,

比如receive_time=1时看,执行一条sql返回数据超过1s立刻抛出错误。

(PS:控制SQL执行时间是由max_execution_time控制,默认0表示无限制)

从实验结果看,有时候返回数据(屏幕滚动多达几秒十几秒)超过1s然后最终不报错的都有。

而且若是重定向到文件,大部分时候是成功的。

根据报错试验期间服务器日志的某几次报错信息(类似信息,但有区别,可和前边的报错对比):

210. DB::NetException: Connection reset by peer, while writing to socket (127.0.0.1:33484). (NETWORK_ERROR)

 

Code: 209. DB::NetException: Timeout exceeded while writing to socket (127.0.0.1:33410, 1000 ms). (SOCKET_TIMEOUT)

 

关于Connection reset by peer,猜测原因是客户端主动断开重连,当然也可能是网络导致,不是很确定。只能确定连接没有正常维持而被reset了。

关于socket,读这两篇了解一下:学C语言会用到socket网络编程,那它是什么意思  socket的概念和原理

while writing to socket,可以知道是当对socket做写操作,即服务器对客户端发送数据的时候报错。

Timeout exceeded while writing to socket,即超过设定的1s超时报错,

结合实验过程,这个超时也就是当服务器对socket发送数据的时候,socket无响应或者来不及响应的时间超过1s则报错。

socket无响应或者来不及响应可能原因是网络拥堵导致的,也可能是客户端接受数据过慢导致的。

接受数据过慢是什么说法,

比如从上边看,通过clickhouse-client交互式执行SQL有时候在某段时间内几乎次次报错,同时dstat观察网络流量都是几十K并远远没有达到峰值,说明是clickhouse-client来不及接受数据导致的,估计是刷屏过慢导致的。

刷屏问题在Oracle中也有,sqlplus数据刷屏的时间是计算在statistics_level=all的时间内的,即使sql可能0.1s执行完成但是由于数据在sqlplus刷屏时间过长最终倒是sql执行时间是10s都有可能。

clickhouse-client由于数据刷屏慢来不及响应服务器对socket的写操作,所以报错了。

对比将数据用CSV格式重定向到文件无刷屏过程,几乎就次次成功无报错,并且dstat的网络流量从10几M到40几M都有,网络流量上去了说明socket的接收能力没有被限制,所以没有报错。

 

使用tc限制网络流量,再次观察重定向方式是否报错。

限制 ens160 网卡的带宽为 500kbit:

tc qdisc add dev ens160 root tbf rate 500kbit latency 50ms burst 15kb

 

再试执行重定向操作,

[root@dev-app80 daily]# clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=1 --query="select * from default.zkm_local" > zkm_local.csv
Code: 33. DB::Exception: Cannot read all data. Bytes read: 152964. Bytes expected: 585803.: while receiving packet from 192.168.1.172:9000: (in query: select * from default.zkm_local). (CANNOT_READ_ALL_DATA)

 

可以看出,网络流量限制之后,出现client接收数据超时也报错了。

 

删除网卡 ens160 网卡限制。

tc qdisc del dev ens160 root

 

 

关于receive_timeout到这里。

太乱了,搞了有4天......

 

还有个send_timeout......

 

posted @ 2022-11-16 23:56  PiscesCanon  阅读(2110)  评论(0编辑  收藏  举报