Fixing the "No decoder surfaces left" and CUDA_ERROR_OUT_OF_MEMORY errors

Background

The GPU decoder outputs frames in NV12, and converting NV12 to BGR24 takes roughly 4x as long as converting YUV420 to BGR24, so scale_npp is used to convert the pixel format to YUV420 on the GPU before the frames are downloaded.

An fps filter is also needed to set the frame rate.

The same thing is done with the FFmpeg API; the equivalent command line looks like this:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i ~/video/test.mp4 -vf "fps=15,scale_npp=format=yuv420p,hwdownload,format=yuv420p" -f null /dev/null
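
For reference, a minimal sketch of the corresponding API-side hardware decoder setup, assuming a standard CUDA hwaccel initialization (the helper name open_cuda_decoder and the overall structure are illustrative, not taken from this project's code):

#include <libavcodec/avcodec.h>
#include <libavutil/hwcontext.h>

/* Pick AV_PIX_FMT_CUDA from the formats offered by the decoder. */
static enum AVPixelFormat get_hw_format(AVCodecContext *ctx,
                                        const enum AVPixelFormat *pix_fmts)
{
    for (const enum AVPixelFormat *p = pix_fmts; *p != AV_PIX_FMT_NONE; p++) {
        if (*p == AV_PIX_FMT_CUDA)
            return *p;
    }
    return AV_PIX_FMT_NONE;
}

/* Illustrative helper: attach a CUDA hw device to an already-opened decoder
 * context so that decoded frames stay on the GPU. */
static int open_cuda_decoder(AVCodecContext *dec_ctx)
{
    AVBufferRef *hw_device_ctx = NULL;
    int ret = av_hwdevice_ctx_create(&hw_device_ctx, AV_HWDEVICE_TYPE_CUDA,
                                     NULL, NULL, 0);
    if (ret < 0)
        return ret;

    dec_ctx->hw_device_ctx = av_buffer_ref(hw_device_ctx);
    dec_ctx->get_format    = get_hw_format;
    av_buffer_unref(&hw_device_ctx);
    return 0;
}

The filter string from the -vf option above is then parsed with avfilter_graph_parse_ptr when the filter graph is built (the init_filters function mentioned later).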

Error symptoms

When the error occurs, the log below is printed first. Apparently some decoder surface pool is exhausted, which makes sending packets fail; the code then keeps re-initializing the decoder internally, and GPU memory is eventually used up.
2021-06-09 12:14:42,473 FATAL 140468490848000 xxxx.cpp ffmpeg_log_callback No decoder surfaces left

Errors logged after running for a while (checking GPU memory usage with nvidia-smi at the same time shows the memory is completely full):

2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback decoder->cvdl->cuvidCreateDecoder(&decoder->decoder, params) failed
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback

2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback Failed setup for format cuda: hwaccel initialisation returned error.

2021-06-09 12:51:30,353 NOTICE 140464455923456 xxxx.cpp get_hw_format Failed to get HW surface format.
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback decode_slice_header error

 

Cause

Testing showed that fps=12.5 only works when placed after scale_npp; placing it first triggers the GPU memory problem. The guess at the time: both decoding and the npp scaling run in GPU memory, and when the framerate filter is inserted before npp, the frames it drops never actually release their GPU memory.

fps, as a filter, needs to be inserted in a filtergraph. It offers five rounding modes that affect which source frames are dropped or duplicated in order to achieve the target framerate.
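
For example, a rounding mode can be selected explicitly in the filter description (an illustrative filter string, not from this project; per the FFmpeg documentation, near is the default):

fps=fps=12.5:round=near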

Update 2021-06-23

The analysis above turned out to be wrong. Even with the fps filter placed after the npp scale, a 100-stream concurrency test still showed a memory leak, which eventually caused an OOM.

The real cause was incorrect use of av_buffersink_get_frame: it has to be called in a loop until it returns EAGAIN or an error. Before the fps filter was added, one av_buffersrc_add_frame_flags call corresponded to roughly one av_buffersink_get_frame call, so the problem never surfaced.

After the fps filter was added, the missing loop meant frames held inside the filtergraph were never drained and their resources were never released, so av_buffer_pool_get eventually failed and the decoder reported No decoder surfaces left.

For reference, the relevant part of ffmpeg/doc/examples/filtering_video.c (initialization code omitted):

/* read all packets */
while (1) {
    if ((ret = av_read_frame(fmt_ctx, &packet)) < 0)
        break;
 
    if (packet.stream_index == video_stream_index) {
        ret = avcodec_send_packet(dec_ctx, &packet);
        if (ret < 0) {
            av_log(NULL, AV_LOG_ERROR, "Error while sending a packet to the decoder\n");
            break;
        }
 
        while (ret >= 0) {
            ret = avcodec_receive_frame(dec_ctx, frame);
            if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) {
                break;
            } else if (ret < 0) {
                av_log(NULL, AV_LOG_ERROR, "Error while receiving a frame from the decoder\n");
                goto end;
            }
 
            frame->pts = frame->best_effort_timestamp;
 
            /* push the decoded frame into the filtergraph */
            if (av_buffersrc_add_frame_flags(buffersrc_ctx, frame, AV_BUFFERSRC_FLAG_KEEP_REF) < 0) {
                av_log(NULL, AV_LOG_ERROR, "Error while feeding the filtergraph\n");
                break;
            }
 
            /* pull filtered frames from the filtergraph */
            while (1) {
                ret = av_buffersink_get_frame(buffersink_ctx, filt_frame);
                if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
                    break;
                if (ret < 0)
                    goto end;
                display_frame(filt_frame, buffersink_ctx->inputs[0]->time_base);
                av_frame_unref(filt_frame);
            }
            av_frame_unref(frame);
        }
    }
    av_packet_unref(&packet);
}

 

Solution

First attempt (wrong)

Change the filters_descr passed to avfilter_graph_parse_ptr in init_filters from

fps=12.5,scale_npp=format=yuv420p,hwdownload,format=yuv420p

to

scale_npp=format=yuv420p,hwdownload,format=yuv420p,fps=12.5

Note: with the fps filter moved to the end, every decoded frame goes through scale_npp and hwdownload before any are dropped, so this ordering may cost some efficiency.

Second fix

Following the example code, call avcodec_receive_frame and av_buffersink_get_frame in loops driven by their return values, so that all internally buffered frames are drained.
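
A minimal sketch of the drain pattern, assuming a filter graph already set up with buffersrc_ctx/buffersink_ctx; consume_frame is a hypothetical stand-in for the real downstream processing:

#include <libavfilter/buffersink.h>
#include <libavfilter/buffersrc.h>
#include <libavutil/frame.h>

void consume_frame(AVFrame *frame);  /* hypothetical downstream sink */

/* Feed one decoded frame and drain everything the graph has ready.
 * av_buffersink_get_frame() must be called until it reports EAGAIN
 * (nothing more right now) or AVERROR_EOF. */
static int filter_one_frame(AVFilterContext *buffersrc_ctx,
                            AVFilterContext *buffersink_ctx,
                            AVFrame *frame, AVFrame *filt_frame)
{
    int ret = av_buffersrc_add_frame_flags(buffersrc_ctx, frame,
                                           AV_BUFFERSRC_FLAG_KEEP_REF);
    if (ret < 0)
        return ret;

    for (;;) {
        ret = av_buffersink_get_frame(buffersink_ctx, filt_frame);
        if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
            return 0;               /* fully drained: not an error */
        if (ret < 0)
            return ret;

        consume_frame(filt_frame);
        av_frame_unref(filt_frame); /* releases the underlying surface */
    }
}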

Troubleshooting steps

Reproducing the problem

After repeated testing: start three processes and use postman to push 25 concurrent RTMP streams to each process; the problem reproduces after 3-5 minutes.

Narrowing down the cause

1. Collect and review the error messages in the logs. The first anomaly is No decoder surfaces left, which should never appear during normal operation.

2. Add debug logging.

3. Temporarily replace the ffmpeg filter code with a direct av_hwframe_transfer_data call that copies the decoded frames back to system memory (see the sketch after this list); the problem does not appear.

4. Switch back to the ffmpeg filter for pixel-format conversion; the problem reproduces.

5. For the ffmpeg filter path, remove the fps filter from filters_descr; the test runs normally, so the error is related to the fps filter.

6. Try a different way of reducing the frame rate, and also try moving fps= after scale_npp in filters_descr; both test fine. Combined with the earlier results, it appears that when the fps filter is inserted before scale_npp, the frames dropped to lower the frame rate do not release their GPU memory correctly.
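
A minimal sketch of the step-3 bypass, assuming a decoded hw_frame in AV_PIX_FMT_CUDA (download_frame is an illustrative helper, not code from this project):

#include <libavutil/frame.h>
#include <libavutil/hwcontext.h>

/* Copy a GPU (CUDA) frame back to system memory, bypassing the filter graph.
 * Leaving the destination format unset lets the transfer pick a default
 * software format (typically NV12 for CUDA frames). */
static AVFrame *download_frame(const AVFrame *hw_frame)
{
    AVFrame *sw_frame = av_frame_alloc();
    if (!sw_frame)
        return NULL;

    if (av_hwframe_transfer_data(sw_frame, hw_frame, 0) < 0) {
        av_frame_free(&sw_frame);
        return NULL;
    }

    /* Copy pts and other metadata explicitly to be safe. */
    av_frame_copy_props(sw_frame, hw_frame);
    return sw_frame;
}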

TODO: try to fix the GPU memory leak when fps= is placed before scale_npp. That requires digging into the FFmpeg fps filter source.

Other notes:

With one concurrent stream, the decoding process uses 205 MB of GPU memory.
With 75 concurrent streams, each of the three GPUs uses 5128 MB (roughly 25 streams per GPU × 205 MB ≈ 5125 MB, which is consistent).

Second round of analysis

Because the first fix (moving the fps filter to the end) still left a memory problem, and the root cause had not been found, the investigation went deeper.

Logging was added to libavutil/buffer.c, libavcodec/nvdec.c, libavcodec/nvdec_h264.c and other FFmpeg source files.

Repeated tests traced the failure to nvdec_decoder_frame_alloc, where the check if (pool->nb_allocated >= pool->dpb_size)  return NULL; triggers.
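
For context, a paraphrased sketch of that allocator (based on libavcodec/nvdec.c of the FFmpeg version in use; see the actual source for the exact code). Each call hands out the next surface index, and once nb_allocated reaches dpb_size the pool refuses to grow, so av_buffer_pool_get fails:

#include <libavutil/buffer.h>

/* Paraphrased from nvdec.h/nvdec.c: each pool entry is just an index into
 * a fixed set of decoder surfaces. */
typedef struct NVDECFramePool {
    unsigned int dpb_size;
    unsigned int nb_allocated;
} NVDECFramePool;

static AVBufferRef *nvdec_decoder_frame_alloc(void *opaque, int size)
{
    NVDECFramePool *pool = opaque;
    AVBufferRef *ref;

    if (pool->nb_allocated >= pool->dpb_size)
        return NULL;                      /* "No decoder surfaces left" */

    ref = av_buffer_alloc(size);
    if (!ref)
        return NULL;

    *(unsigned int *)ref->data = pool->nb_allocated++;
    return ref;
}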

Why would nb_allocated grow past dpb_size?

The logs show that nvdec_decoder_frame_alloc is called too many times; after the error, a new NVDECFramePool *pool is allocated, but each time a new pool address is printed, nb_allocated quickly exceeds dpb_size again. By contrast, a decoding thread that runs normally only allocates 3 times, with nb_allocated ending at 3. (In the 75-stream test, some threads did decode normally.)

What causes this difference?

Comparing against ffmpeg/doc/examples/filtering_video.c and other demo code showed that avcodec_receive_frame and av_buffersink_get_frame were not being used in the expected pattern, and the memory problem only appeared once the fps filter was added. Changing the get-frame calls to run inside while loops resolved the memory problem in testing.

 

[ffmpeg]$ git status libav*
On branch master
Changes not staged for commit:
modified: libavcodec/decode.c
modified: libavcodec/h264_slice.c
modified: libavcodec/h264dec.c
modified: libavcodec/nvdec.c
modified: libavcodec/nvdec_h264.c
modified: libavutil/buffer.c
modified: libavutil/mem.c

Functions involved:

static int decode_simple_internal(AVCodecContext *avctx, AVFrame *frame)

static AVBufferRef *nvdec_decoder_frame_alloc(void *opaque, int size)   (important)

int ff_nvdec_decode_init(AVCodecContext *avctx)    (important)

         pool->dpb_size = frames_ctx->initial_pool_size;   // dpb_size starts out as 10

        ctx->decoder_pool = av_buffer_pool_init2(sizeof(int), pool, nvdec_decoder_frame_alloc, av_free);  // sets up the decoder pool; nvdec_decoder_frame_alloc is registered as its allocator

ff_nvdec_start_frame

nvdec_h264_start_frame

av_buffer_create

AVBufferRef *av_buffer_pool_get(AVBufferPool *pool)

The fps issue

With the fps=12.5 framerate filter in the graph, the PTS of the filtered output frames increments by 1 per frame (the fps filter rescales its output to a time base of 1/framerate), whereas the decoded frames' PTS had been spaced 40 ms apart.

In a test without fps=xxx, the output PTS of the npp scale pixel-format conversion is also spaced 40 ms apart.

References

AVBufferPool is an API for a lock-free thread-safe pool of AVBuffers.

Frequently allocating and freeing large buffers may be slow. AVBufferPool is meant to solve this in cases when the caller needs a set of buffers of the same size (the most obvious use case being buffers for raw video or audio frames).

At the beginning, the user must call av_buffer_pool_init() to create the buffer pool. Then whenever a buffer is needed, call av_buffer_pool_get() to get a reference to a new buffer, similar to av_buffer_alloc(). This new reference works in all aspects the same way as the one created by av_buffer_alloc(). However, when the last reference to this buffer is unreferenced, it is returned to the pool instead of being freed and will be reused for subsequent av_buffer_pool_get() calls.

When the caller is done with the pool and no longer needs to allocate any new buffers, av_buffer_pool_uninit() must be called to mark the pool as freeable. Once all the buffers are released, it will automatically be freed.

Allocating and releasing buffers with this API is thread-safe as long as either the default alloc callback is used, or the user-supplied one is thread-safe.
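
A minimal, self-contained sketch of that API (a generic illustration using the default allocator, unrelated to the nvdec pool above):

#include <libavutil/buffer.h>
#include <stdio.h>

int main(void)
{
    /* Pool of fixed-size buffers; NULL means the default allocator is used. */
    AVBufferPool *pool = av_buffer_pool_init(4096, NULL);
    if (!pool)
        return 1;

    AVBufferRef *buf = av_buffer_pool_get(pool);   /* take a buffer */
    if (buf) {
        printf("got %d bytes\n", (int)buf->size);
        av_buffer_unref(&buf);                     /* returns it to the pool */
    }

    av_buffer_pool_uninit(&pool);                  /* freed once all refs are gone */
    return 0;
}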

 

How do I reduce frames with blending in ffmpeg

Changing the frame rate

Framerate vs r vs Filter fps

Using ffmpeg to change framerate

using -hwaccel nvdec produces 'No decoder surfaces left' with interlaced input and 3 or more b-frames

 
