[nginx] Analysis of the async_mode_nginx 100% CPU deadlock
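Unfortunately, I was only able to narrow the problem down to a fairly small scope and work out the root cause. I did not find the boundary conditions needed to reproduce it, nor a solution.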
Hi all, I have much the same problem with the latest software versions:

async_nginx: 0.4.5
openssl: 1.1.1k
qatengine: 0.6.4
qatdriver: 1.7.l.4.13.0.9

Reproduction setup, config values in nginx.conf:

default_algorithms CIPHERS
qat_poll_mode heuristic

I have debugged async_nginx and found an infinite loop, which I think is the cause.

1. In ngx_http_do_read_client_request_body(), nginx enters the for(;;) loop [line:288] and never breaks out, because recv() [line:343] always returns NGX_AGAIN and c->read->ready is always 1. Going deeper into recv(), NGX_AGAIN is returned by ngx_ssl_handle_recv() [line:2546] because the async job is paused.

2. When the async context is swapped, another infinite loop occurs in qat_chained_ciphers_do_cipher() [line:1554], because the read() in qat_pause_job() [line:279] always returns EAGAIN.

3. As far as I know, qat_crypto_callbackFn() is called by qat_engine_poll(). I think the callback qat_crypto_callbackFn() never gets any CPU time to run, so the paused async job is never woken up. I then checked the polling logic in async_nginx and found point 4, described below.

4. In ngx_ssl_engine_qat_heuristic_poll(), none of the six num_* variables ever increases, so qat_engine_poll() never gets a chance to execute.

When I change the engine config in nginx.conf, the issue disappears, so I can work around it. The config is as follows:

qat_heuristic_poll_asym_threshold = 0
qat_heuristic_poll_sym_threshold = 0

This looks like a logical deadlock: nginx wants QAT to update the counters, but updating the counters requires nginx to release some CPU time. Or perhaps the following code does not account for SSL connections that have been idle for a long time?

if (*num_asym_requests_in_flight + *num_kdf_requests_in_flight + *num_cipher_requests_in_flight + *num_asym_mb_items_in_queue + *num_kdf_mb_items_in_queue + *num_sym_mb_items_in_queue >= (int) *ngx_ssl_active) {

Does anyone have any idea about this?
For details, see: https://github.com/intel/QAT_Engine/issues/181
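The circular dependency described in points 3 and 4 can be illustrated with a small toy model in C. This is only a sketch under stated assumptions, not the actual nginx or QAT_Engine code. The names poll_state, heuristic_should_poll and iterations_until_resume are hypothetical, the single counter stands in for the sum of the six num_* variables, and the threshold comparison is assumed to use >= so that a threshold of 0 always triggers a poll.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; these are NOT the real nginx/QAT_Engine symbols. */
struct poll_state {
    int reported_in_flight;  /* stands in for the sum of the six num_* counters */
    int active_connections;  /* stands in for *ngx_ssl_active */
    int threshold;           /* stands in for the asym/sym heuristic thresholds */
};

/* Returns true when the modeled heuristic would call the engine poll. */
static bool heuristic_should_poll(const struct poll_state *s)
{
    /* Same shape as the condition quoted in the issue. */
    if (s->reported_in_flight >= s->active_connections) {
        return true;
    }
    /* Assumption: the threshold check uses >=, so threshold 0 always polls. */
    return s->reported_in_flight >= s->threshold;
}

/*
 * Models one paused async job that can only be resumed by a poll.
 * Returns the event-loop iteration at which the job is resumed,
 * or -1 if it is still paused after max_iterations.
 */
static int iterations_until_resume(struct poll_state s, int max_iterations)
{
    assert(max_iterations >= 0);
    for (int i = 1; i <= max_iterations; i++) {
        if (heuristic_should_poll(&s)) {
            return i;  /* poll runs, callback fires, paused job wakes up */
        }
        /* Otherwise the read keeps reporting "try again" and the loop spins. */
    }
    return -1;
}
```

In this model, a state with reported_in_flight = 0, active_connections = 1 and any positive threshold (8, say, chosen arbitrarily) is never resumed, which corresponds to the spinning loop in the issue. Setting threshold to 0 resumes the job on the first iteration, consistent with the workaround reported above.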