FFmpeg audio processing

A USB microphone captures data in pcm_s16le format by default, i.e. PCM signed 16-bit little-endian. The raw PCM data can be saved with the following command:

ffmpeg -y  -f alsa -thread_queue_size 2048 -ar 22050 -ac 1 -i hw:1,0 -f s16le -c:a copy -t 10 raw.pcm
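As a quick sanity check on the capture, the expected size of raw.pcm can be estimated from the parameters in the command: sample rate x channels x bytes per sample x duration. The sketch below is plain C with no FFmpeg dependency; the helper name pcm_capture_bytes is hypothetical, and the real file may differ slightly because the capture device delivers data in whole periods.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes produced by an uncompressed PCM capture.
 * bytes_per_sample is 2 for s16le. (Hypothetical helper, not an FFmpeg API.) */
static int64_t pcm_capture_bytes(int sample_rate, int channels,
                                 int bytes_per_sample, int seconds)
{
    assert(sample_rate > 0 && channels > 0 && bytes_per_sample > 0 && seconds >= 0);
    return (int64_t)sample_rate * channels * bytes_per_sample * seconds;
}
```

For the command above (-ar 22050 -ac 1 -t 10, s16le), pcm_capture_bytes(22050, 1, 2, 10) gives 441000 bytes.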

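Although the sound card plays audio sample by sample, in practice we usually feed one audio frame at a time into the sound card buffer and update the playback time after each frame, i.e. the audio clock is updated once per frame duration. This is in fact what ffplay does.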

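Key audio parameters:

Sample rate (rate): 8 kHz, 11.025 kHz, 16 kHz, 22.05 kHz, 37.8 kHz, 44.1 kHz, 48 kHz, etc.
Samples per frame (sample): the number of samples per channel in one frame; 1024 for AAC, 1152 for MP3.
Sample format (format): the numeric format of a single sample, e.g. U8, S16, FLT.
Channel count: 1 for mono, 2 for stereo.
Sample layout: interleaved (packed) or planar (non-interleaved).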

1. Audio frames

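In FFmpeg, one AVPacket may contain several audio frames (AVFrame), and one audio frame contains many samples. The sample rate determines how many samples are played per second, and therefore how long one frame lasts: duration = nb_samples / sample_rate. An AAC frame has 1024 samples per channel, while an MP3 frame has 1152.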
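The relation between frame size, sample rate and playback time can be sketched in a few lines of plain C. The helper name frame_duration_us is hypothetical and the result is rounded down to whole microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of one audio frame in microseconds, rounded down.
 * nb_samples is the per-channel sample count of the frame. */
static int64_t frame_duration_us(int nb_samples, int sample_rate)
{
    assert(nb_samples >= 0 && sample_rate > 0);
    return (int64_t)nb_samples * 1000000 / sample_rate;
}
```

At 44.1 kHz a 1024-sample AAC frame lasts about 23.2 ms and a 1152-sample MP3 frame about 26.1 ms, which is the interval at which a player that advances its audio clock once per frame would update it.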

// Audio-related members (excerpt)
typedef struct AVFrame{
#define AV_NUM_DATA_POINTERS 8
    
    uint8_t *data[AV_NUM_DATA_POINTERS];

    /**
     * For video, size in bytes of each picture line.
     * For audio, size in bytes of each plane.
     *
     * For audio, only linesize[0] may be set. For planar audio, each channel
     * plane must be the same size.
     *
     * For video the linesizes should be multiples of the CPUs alignment
     * preference, this is 16 or 32 for modern desktop CPUs.
     * Some code requires such alignment other code can be slower without
     * correct alignment, for yet other it makes no difference.
     *
     * @note The linesize may be larger than the size of usable data -- there
     * may be extra padding present for performance reasons.
     */
    int linesize[AV_NUM_DATA_POINTERS];

    /**
     * pointers to the data planes/channels.
     *
     * For video, this should simply point to data[].
     *
     * For planar audio, each channel has a separate data pointer, and
     * linesize[0] contains the size of each channel buffer.
     * For packed audio, there is just one data pointer, and linesize[0]
     * contains the total size of the buffer for all channels.
     *
     * Note: Both data and extended_data should always be set in a valid frame,
     * but for planar audio with more channels that can fit in data,
     * extended_data must be used in order to access all channels.
     */
    uint8_t **extended_data;
    
    /**
     * number of audio samples (per channel) described by this frame
     */
    int nb_samples;

    /**
     * format of the frame, -1 if unknown or unset
     * Values correspond to enum AVPixelFormat for video frames,
     * enum AVSampleFormat for audio)
     */
    int format;
    
}AVFrame;

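1.1 Audio data storage

Audio may have several channels. For planar sample formats, the decoded data of each channel is stored behind its own pointer, e.g. data[0] for the left channel and data[1] for the right channel. Because every channel plane has the same size, linesize[0] alone gives the size in bytes of each plane.

The member extended_data normally points to data. As shown above, data is an array of 8 pointers, so for planar audio with more than 8 channels the extra planes can only be reached through extended_data.

Packed formats:

// Interleaved (packed) audio formats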
AV_SAMPLE_FMT_U8, ///< unsigned 8 bits
AV_SAMPLE_FMT_S16, ///< signed 16 bits
AV_SAMPLE_FMT_S32, ///< signed 32 bits
AV_SAMPLE_FMT_FLT, ///< float
AV_SAMPLE_FMT_DBL, ///< double

Packed data is stored only in data[0] of AVFrame (uint8_t *data[0]), with samples interleaved as LRLRLR...

Planar formats:

AV_SAMPLE_FMT_U8P, ///< unsigned 8 bits, planar
AV_SAMPLE_FMT_S16P, ///< signed 16 bits, planar
AV_SAMPLE_FMT_S32P, ///< signed 32 bits, planar
AV_SAMPLE_FMT_FLTP, ///< float, planar // the only format FFmpeg's native aac encoder accepts
AV_SAMPLE_FMT_DBLP, ///< double, planar

plane 0: LLLLLLLLLLLLLLLLLLLLLLLLLL…

plane 1: RRRRRRRRRRRRRRRRRRRR…

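plane 0 corresponds to uint8_t *data[0]; plane 1 corresponds to uint8_t *data[1].

Planar or packed, the total amount of sample data is the same. With planar storage of two channels, left and right are kept apart in data[0] and data[1], each plane being linesize[0] bytes. With packed storage the channels are not separated; they are stored as LRLR... in data[0] only, and linesize[0] is the size of the whole buffer.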
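The difference between the two layouts is easiest to see by converting one into the other. The following is a minimal sketch for stereo S16 only, written in plain C; the function name s16_packed_to_planar is hypothetical, and in a real program libswresample would normally do this conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Split packed stereo S16 samples (LRLR...) into two planes (LLL..., RRR...).
 * nb_samples is the number of samples per channel. */
static void s16_packed_to_planar(const int16_t *packed, int nb_samples,
                                 int16_t *left, int16_t *right)
{
    assert(packed && left && right && nb_samples >= 0);
    for (int i = 0; i < nb_samples; i++) {
        left[i]  = packed[2 * i];      /* even positions: left channel  */
        right[i] = packed[2 * i + 1];  /* odd positions:  right channel */
    }
}
```

The packed input would live in data[0] of a packed frame, while left and right play the role of data[0] and data[1] of a planar frame; the total byte count is unchanged.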

The function

int av_samples_get_buffer_size(int *linesize, 
    int nb_channels, int nb_samples,
    enum AVSampleFormat sample_fmt, int align);

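computes the storage size from the audio parameters. The return value is the total number of bytes in one frame of audio data. The output parameter linesize is the total byte count for packed formats and the byte count of a single channel for planar formats.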
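To make the packed versus planar behaviour of linesize concrete, here is a simplified stand-in written in plain C. It only models the align = 1 case (no padding), and it takes bytes_per_sample and is_planar directly instead of an AVSampleFormat; the name buffer_size_noalign and those two parameters are assumptions made for illustration, not FFmpeg API.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of av_samples_get_buffer_size() for align = 1.
 * Returns the total size in bytes; *linesize (if not NULL) receives the
 * whole-buffer size for packed data or the per-channel size for planar data. */
static int buffer_size_noalign(int *linesize, int nb_channels, int nb_samples,
                               int bytes_per_sample, int is_planar)
{
    assert(nb_channels > 0 && nb_samples > 0 && bytes_per_sample > 0);
    int line = is_planar ? nb_samples * bytes_per_sample
                         : nb_samples * bytes_per_sample * nb_channels;
    if (linesize != NULL)
        *linesize = line;
    return is_planar ? line * nb_channels : line;
}
```

For 1024 stereo samples, planar float (4 bytes) gives a total of 8192 bytes with linesize 4096, while packed S16 (2 bytes) gives a total of 4096 bytes with linesize 4096.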

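1.2 Audio frame size

Two fields control it:

int frame_size in AVCodecContext

// Audio only, samples per packet.
// For FFmpeg audio codecs: the encoder accepts exactly this many samples per channel in each frame, unless it supports variable frame sizes

int nb_samples in AVFrame

// number of audio samples (per channel) described by this frame
// For FFmpeg audio frames: the number of samples in the frame

Usually set AVFrame.nb_samples = AVCodecContext.frame_size;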

Also pay attention to AV_CODEC_CAP_VARIABLE_FRAME_SIZE:

// For audio: If AV_CODEC_CAP_VARIABLE_FRAME_SIZE is set, then each frame can have any number of samples. If it is not set, frame->nb_samples must be equal to avctx->frame_size for all frames except the last.

// The flag lives in the read-only field AVCodecContext.codec->capabilities. When it is set, the encoder supports variable-size audio frames and a frame sent to it may hold any number of samples. When it is not set, every frame must have frame->nb_samples equal to avctx->frame_size, except the last one, which may be shorter.

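Use in encoding and decoding

Audio frame size when decoding:
AVFrame.nb_samples of the decoded frame.

Audio frame size when encoding: when the encoder has AV_CODEC_CAP_VARIABLE_FRAME_SIZE set, the frame size is variable and AVCodecContext.frame_size may be 0; otherwise AVFrame.nb_samples must equal AVCodecContext.frame_size (the last frame may be smaller).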

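Use of a FIFO:
Two conditions decide whether an audio FIFO is needed. The first is "((stream.o_codec_ctx->codec->capabilities & AV_CODEC_CAP_VARIABLE_FRAME_SIZE) == 0)" and the second is "(stream.i_codec_ctx->frame_size != stream.o_codec_ctx->frame_size)". If the encoder does not support variable-size audio frames (first condition true) and the size of the source audio frames differs from the encoder frame size (second condition true), an audio frame FIFO must be introduced so that every frame taken out of the FIFO has exactly the encoder frame size. Frames output by the audio FIFO carry no timestamp information, so timestamps have to be regenerated.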
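The regrouping and timestamp regeneration described above can be sketched without FFmpeg. The code below is a deliberately small mono S16 stand-in for libavutil's AVAudioFifo (av_audio_fifo_write(), av_audio_fifo_read()); the type SampleFifo, its functions and the fixed capacity are all hypothetical names chosen for this illustration. Timestamps are counted in samples, so each frame read out starts where the previous one ended.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FIFO_CAPACITY 8192  /* capacity in samples; illustrative value */

typedef struct SampleFifo {
    int16_t buf[FIFO_CAPACITY];
    int     size;      /* samples currently stored */
    int64_t next_pts;  /* pts, in samples, of the next frame read out */
} SampleFifo;

static void fifo_init(SampleFifo *f)
{
    f->size = 0;
    f->next_pts = 0;
}

/* Append nb samples; returns 0 on success, -1 if they do not fit. */
static int fifo_write(SampleFifo *f, const int16_t *src, int nb)
{
    assert(f && src && nb >= 0);
    if (nb > FIFO_CAPACITY - f->size)
        return -1;
    memcpy(f->buf + f->size, src, (size_t)nb * sizeof(*src));
    f->size += nb;
    return 0;
}

/* Read exactly frame_size samples if that many are buffered. On success the
 * regenerated timestamp is stored in *pts and frame_size is returned;
 * otherwise nothing is consumed and 0 is returned. */
static int fifo_read_frame(SampleFifo *f, int16_t *dst, int frame_size, int64_t *pts)
{
    assert(f && dst && pts && frame_size > 0);
    if (f->size < frame_size)
        return 0;
    memcpy(dst, f->buf, (size_t)frame_size * sizeof(*dst));
    memmove(f->buf, f->buf + frame_size,
            (size_t)(f->size - frame_size) * sizeof(*dst));
    f->size -= frame_size;
    *pts = f->next_pts;
    f->next_pts += frame_size;
    return frame_size;
}
```

Feeding 1152-sample decoded frames and reading 1024-sample frames models transcoding MP3 to an encoder with frame_size 1024: the first write yields one full frame with pts 0 and leaves 128 samples behind, and after the second write the next frame comes out with pts 1024.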

// Legacy encode loop: avcodec_encode_audio2() is deprecated; newer FFmpeg uses avcodec_send_frame()/avcodec_receive_packet()
pAudioFrame = av_frame_alloc();
pAudioFrame->nb_samples = pAudioEncodeCtx->frame_size;
pAudioFrame->format = pAudioEncodeCtx->sample_fmt;

// Compute the size of the frame's data block from channels, nb_samples and sample_fmt
int size = av_samples_get_buffer_size(NULL, pAudioEncodeCtx->channels, pAudioEncodeCtx->frame_size, pAudioEncodeCtx->sample_fmt, 1);

uint8_t *frame_buf = (uint8_t *)av_malloc(size);

// Set the frame's data pointers and linesize from channels, sample_fmt and the data block
avcodec_fill_audio_frame(pAudioFrame, pAudioEncodeCtx->channels, pAudioEncodeCtx->sample_fmt, (const uint8_t *)frame_buf, size, 1);

while (1) {
    // Read one frame of raw PCM from the input file
    int readSize = fread(frame_buf, 1, size, fInputPCM);

    if (readSize <= 0) {
        break;
    }

    pAudioFrame->data[0] = frame_buf;  // sample data
    int got_frame = 0;

    int ret = avcodec_encode_audio2(pAudioEncodeCtx, &AudioPacket, pAudioFrame, &got_frame);
    if (ret < 0) {
        break;  // encoding error
    }
    // When got_frame is non-zero, AudioPacket holds encoded data: write it out, then release it
}

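For the commonly used AAC and MP3:

AAC: nb_samples and frame_size = 1024

Data per stereo frame: 1024 x 2 x av_get_bytes_per_sample(fltp) = 8192 bytes.

MP3: nb_samples and frame_size = 1152

Data per stereo frame: 1152 x 2 x av_get_bytes_per_sample(s32p) = 9216 bytes.

Note: some MP3 streams use the s16p or fltp sample format.

Note: AAC sometimes has 2048 samples per frame.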

/*
A HE-AAC v1 or v2 audio frame contains 2048 PCM samples per channel (there is
also one mode with 1920 samples per channel but this is only for special purposes
such as DAB+ digital radio).
These bits/frame figures are average figures where each AAC frame generally has a different
size in bytes. To calculate the same for AAC-LC just use 1024 instead of 2048 PCM samples per
frame and channel.
For AAC-LD/ELD it is either 480 or 512 PCM samples per frame and channel.
*/
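When the AAC profile is LC, frame_size and nb_samples are 1024; for HE they are 2048.

// Set pInputFrame->nb_samples before this call and make sure buf_size_in covers nb_samples * channels * bytes per sample. Taking AAC as an example, the profile in AVCodecContext may be LC, HE, etc.:
// nb_samples is 1024 for LC and 2048 for HE. A wrong value corrupts the audio data; nb_samples matches frame_size in AVCodecContext.
// The last argument of avcodec_fill_audio_frame() is the buffer alignment (0 = default), not a byte count.
ret = avcodec_fill_audio_frame(pInputFrame, Channel_in, SampleFormat_in, buf_in, buf_size_in, 1);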

1.3 Audio formats

// libavutil/channel_layout.h
#define AV_CH_FRONT_LEFT             0x00000001
#define AV_CH_FRONT_RIGHT            0x00000002
#define AV_CH_FRONT_CENTER           0x00000004
#define AV_CH_LOW_FREQUENCY          0x00000008
#define AV_CH_BACK_LEFT              0x00000010
#define AV_CH_BACK_RIGHT             0x00000020
#define AV_CH_FRONT_LEFT_OF_CENTER   0x00000040
#define AV_CH_FRONT_RIGHT_OF_CENTER  0x00000080
#define AV_CH_BACK_CENTER            0x00000100
#define AV_CH_SIDE_LEFT              0x00000200
#define AV_CH_SIDE_RIGHT             0x00000400
#define AV_CH_TOP_CENTER             0x00000800
#define AV_CH_TOP_FRONT_LEFT         0x00001000
#define AV_CH_TOP_FRONT_CENTER       0x00002000
#define AV_CH_TOP_FRONT_RIGHT        0x00004000
#define AV_CH_TOP_BACK_LEFT          0x00008000
#define AV_CH_TOP_BACK_CENTER        0x00010000
#define AV_CH_TOP_BACK_RIGHT         0x00020000
#define AV_CH_STEREO_LEFT            0x20000000  ///< Stereo downmix.
#define AV_CH_STEREO_RIGHT           0x40000000  ///< See AV_CH_STEREO_LEFT.
#define AV_CH_WIDE_LEFT              0x0000000080000000ULL
#define AV_CH_WIDE_RIGHT             0x0000000100000000ULL
#define AV_CH_SURROUND_DIRECT_LEFT   0x0000000200000000ULL
#define AV_CH_SURROUND_DIRECT_RIGHT  0x0000000400000000ULL
#define AV_CH_LOW_FREQUENCY_2        0x0000000800000000ULL

/** Channel mask value used for AVCodecContext.request_channel_layout
    to indicate that the user requests the channel order of the decoder output
    to be the native codec channel order. */
#define AV_CH_LAYOUT_NATIVE          0x8000000000000000ULL

/**
 * @}
 * @defgroup channel_mask_c Audio channel layouts
 * @{
 * */
#define AV_CH_LAYOUT_MONO              (AV_CH_FRONT_CENTER)
#define AV_CH_LAYOUT_STEREO            (AV_CH_FRONT_LEFT|AV_CH_FRONT_RIGHT)
#define AV_CH_LAYOUT_2POINT1           (AV_CH_LAYOUT_STEREO|AV_CH_LOW_FREQUENCY)
#define AV_CH_LAYOUT_2_1               (AV_CH_LAYOUT_STEREO|AV_CH_BACK_CENTER)
#define AV_CH_LAYOUT_SURROUND          (AV_CH_LAYOUT_STEREO|AV_CH_FRONT_CENTER)

// libavutil/channel_layout.h
/**
 * Return default channel layout for a given number of channels.
 */
int64_t av_get_default_channel_layout(int nb_channels);
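Because each AV_CH_* constant above is a single bit, the number of channels in a layout is just the number of set bits in its mask. The sketch below shows that idea in plain C; the function name layout_channel_count is hypothetical (the legacy channel-layout API provides av_get_channel_layout_nb_channels() for the same purpose).

```c
#include <assert.h>
#include <stdint.h>

/* Count the channels in a channel-layout bitmask: one set bit per channel. */
static int layout_channel_count(uint64_t layout)
{
    int n = 0;
    while (layout) {
        layout &= layout - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With the values listed above, AV_CH_LAYOUT_STEREO is 0x3 and counts as 2 channels, AV_CH_LAYOUT_2POINT1 is 0xB and counts as 3, and AV_CH_LAYOUT_MONO is 0x4 and counts as 1.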

// libavutil/samplefmt.h
enum AVSampleFormat {
    AV_SAMPLE_FMT_NONE = -1,
    AV_SAMPLE_FMT_U8,          ///< unsigned 8 bits
    AV_SAMPLE_FMT_S16,         ///< signed 16 bits
    AV_SAMPLE_FMT_S32,         ///< signed 32 bits
    AV_SAMPLE_FMT_FLT,         ///< float
    AV_SAMPLE_FMT_DBL,         ///< double

    AV_SAMPLE_FMT_U8P,         ///< unsigned 8 bits, planar
    AV_SAMPLE_FMT_S16P,        ///< signed 16 bits, planar
    AV_SAMPLE_FMT_S32P,        ///< signed 32 bits, planar
    AV_SAMPLE_FMT_FLTP,        ///< float, planar // the only format FFmpeg's native aac encoder accepts
    AV_SAMPLE_FMT_DBLP,        ///< double, planar
    AV_SAMPLE_FMT_S64,         ///< signed 64 bits
    AV_SAMPLE_FMT_S64P,        ///< signed 64 bits, planar

    AV_SAMPLE_FMT_NB           ///< Number of sample formats. DO NOT USE if linking dynamically
};

const char *av_get_sample_fmt_name(enum AVSampleFormat sample_fmt);
char *av_get_sample_fmt_string(char *buf, int buf_size, enum AVSampleFormat sample_fmt);
int av_get_bytes_per_sample(enum AVSampleFormat sample_fmt);
int av_samples_copy(uint8_t **dst, uint8_t * const *src, int dst_offset,
    int src_offset, int nb_samples, int nb_channels,
    enum AVSampleFormat sample_fmt);

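av_samples_get_buffer_size() returns the size of one frame of audio data. Its return value is the total number of bytes in the frame, while the output parameter linesize is the total byte count for packed formats and the byte count of a single channel for planar formats.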

// libavutil/samplefmt.h
/**
 * Get the required buffer size for the given audio parameters.
 *
 * @param[out] linesize calculated linesize, may be NULL
 * @param nb_channels   the number of channels
 * @param nb_samples    the number of samples in a single channel
 * @param sample_fmt    the sample format
 * @param align         buffer size alignment (0 = default, 1 = no alignment)
 * @return              required buffer size, or negative error code on failure
 */
int av_samples_get_buffer_size(int *linesize, int nb_channels, int nb_samples,
                               enum AVSampleFormat sample_fmt, int align)
{
    int line_size;
    int sample_size = av_get_bytes_per_sample(sample_fmt);
    int planar      = av_sample_fmt_is_planar(sample_fmt);

    /* validate parameter ranges */
    if (!sample_size || nb_samples <= 0 || nb_channels <= 0)
        return AVERROR(EINVAL);

    /* auto-select alignment if not specified */
    if (!align) {
        if (nb_samples > INT_MAX - 31)
            return AVERROR(EINVAL);
        align = 1;
        nb_samples = FFALIGN(nb_samples, 32);
    }

    /* check for integer overflow */
    if (nb_channels > INT_MAX / align ||
        (int64_t)nb_channels * nb_samples > (INT_MAX - (align * nb_channels)) / sample_size)
        return AVERROR(EINVAL);

    line_size = planar ? FFALIGN(nb_samples * sample_size,               align) :
                         FFALIGN(nb_samples * sample_size * nb_channels, align);
    if (linesize)
        *linesize = line_size;

    return planar ? line_size * nb_channels : line_size;
}  

/**
 * Allocate a samples buffer for nb_samples samples, and fill data pointers and
 * linesize accordingly.
 * The allocated samples buffer can be freed by using av_freep(&audio_data[0])
 * Allocated data will be initialized to silence.
 *
 * @see enum AVSampleFormat
 * The documentation for AVSampleFormat describes the data layout.
 *
 * @param[out] audio_data  array to be filled with the pointer for each channel
 * @param[out] linesize    aligned size for audio buffer(s), may be NULL
 * @param nb_channels      number of audio channels
 * @param nb_samples       number of samples per channel
 * @param align            buffer size alignment (0 = default, 1 = no alignment)
 * @return                 >=0 on success or a negative error code on failure
 * @todo return the size of the allocated buffer in case of success at the next bump
 * @see av_samples_fill_arrays()
 * @see av_samples_alloc_array_and_samples()
 */
int av_samples_alloc(uint8_t **audio_data, int *linesize, int nb_channels,
                     int nb_samples, enum AVSampleFormat sample_fmt, int align);

/**
 * Allocate a data pointers array, samples buffer for nb_samples
 * samples, and fill data pointers and linesize accordingly.
 *
 * This is the same as av_samples_alloc(), but also allocates the data
 * pointers array.
 *
 * @see av_samples_alloc()
 */
int av_samples_alloc_array_and_samples(uint8_t ***audio_data, int *linesize, int nb_channels,
    int nb_samples, enum AVSampleFormat sample_fmt, int align);

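2. Audio encoding and decoding

Codec IDs are defined in libavcodec/avcodec.h and their descriptors in libavcodec/codec_desc.c; the two are linked through the codec ID.

3. Audio resampling

The resampler converts between audio sample formats, while the FIFO buffer stores audio samples until they are encoded.

Audio resampling, sample format conversion and rematrixing (channel mixing) require the libswresample library.

Interaction with libswresample goes through SwrContext, allocated with swr_alloc() or swr_alloc_set_opts(); its parameters must be set through AVOptions. Call swr_init() to initialize the SwrContext, then convert audio by calling swr_convert() repeatedly. At the end of conversion the resampling buffer can be flushed by calling swr_convert() with NULL in and 0 in_count. Finally, release the context with swr_free().

The delay between input and output can be obtained with swr_get_delay().

In practical tests of audio resampling, the sound was noticeably distorted when the sample rate changed by a large factor.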

// the following code will setup conversion from planar float sample format to interleaved signed 16-bit integer, 
// downsampling from 48kHz to 44.1kHz and downmixing from 5.1 channels to stereo (using the default mixing matrix).
SwrContext *swr = swr_alloc();

av_opt_set_channel_layout(swr, "in_channel_layout", AV_CH_LAYOUT_5POINT1, 0);
av_opt_set_channel_layout(swr, "out_channel_layout", AV_CH_LAYOUT_STEREO, 0);
av_opt_set_int(swr, "in_sample_rate", 48000, 0);
av_opt_set_int(swr, "out_sample_rate", 44100, 0);
av_opt_set_sample_fmt(swr, "in_sample_fmt", AV_SAMPLE_FMT_FLTP, 0);
av_opt_set_sample_fmt(swr, "out_sample_fmt", AV_SAMPLE_FMT_S16, 0);
swr_init(swr);

uint8_t **input;
int in_samples;

while (get_input(&input, &in_samples)) {
    uint8_t *output;
    int out_samples = av_rescale_rnd(swr_get_delay(swr, 48000) +in_samples, 44100, 48000, AV_ROUND_UP);
    av_samples_alloc(&output, NULL, 2, out_samples, AV_SAMPLE_FMT_S16, 0);
    out_samples = swr_convert(swr, &output, out_samples, input, in_samples);
    handle_output(output, out_samples);
    av_freep(&output);
}
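The out_samples computation in the loop above is a rounded-up rescale of the buffered delay plus the new input from the input rate to the output rate. A minimal plain-C model of that arithmetic follows; the helper name max_out_samples is hypothetical, and it assumes the delay is already expressed in input samples, as swr_get_delay(swr, 48000) returns it in the example.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on output samples per channel:
 * ceil((delay + in_samples) * out_rate / in_rate). */
static int64_t max_out_samples(int64_t delay, int64_t in_samples,
                               int in_rate, int out_rate)
{
    assert(delay >= 0 && in_samples >= 0 && in_rate > 0 && out_rate > 0);
    return ((delay + in_samples) * out_rate + in_rate - 1) / in_rate;
}
```

With no buffered delay, 1024 input samples at 48 kHz need room for 941 output samples at 44.1 kHz (1024 x 44100 / 48000 = 940.8, rounded up). The value is only an allocation bound; the count actually produced is whatever swr_convert() returns.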

3.1 Resampling functions

struct SwrContext *swr_alloc(void);
struct SwrContext *swr_alloc_set_opts(struct SwrContext *s,
    int64_t out_ch_layout, enum AVSampleFormat out_sample_fmt, int out_sample_rate,
    int64_t in_ch_layout, enum AVSampleFormat in_sample_fmt, int in_sample_rate,
    int log_offset, void *log_ctx);
int swr_init(struct SwrContext *s);

void swr_free(struct SwrContext **s);

/** Convert audio.
 *
 * in and in_count can be set to 0 to flush the last few samples out at the
 * end.
 *
 * If more input is provided than output space, then the input will be buffered.
 * You can avoid this buffering by using swr_get_out_samples() to retrieve an
 * upper bound on the required number of output samples for the given number of
 * input samples. Conversion will run directly without copying whenever possible.
 *
 * @param s         allocated Swr context, with parameters set
 * @param out       output buffers, only the first one need be set in case of packed audio
 * @param out_count amount of space available for output in samples per channel
 * @param in        input buffers, only the first one need to be set in case of packed audio
 * @param in_count  number of input samples available in one channel
 *
 * @return number of samples output per channel, negative value on error
 */
int swr_convert(struct SwrContext *s, 
    uint8_t **out, int out_count,
    const uint8_t **in , int in_count);

Note: in_count and out_count are both sample counts for a single channel.

/**
 * Gets the delay the next input sample will experience relative to the next output sample.
 *
 * Swresample can buffer data if more input has been provided than available
 * output space, also converting between sample rates needs a delay.
 * This function returns the sum of all such delays.
 * The exact delay is not necessarily an integer value in either input or
 * output sample rate. Especially when downsampling by a large value, the
 * output sample rate may be a poor choice to represent the delay, similarly
 * for upsampling and the input sample rate.
 *
 * @param s     swr context
 * @param base  timebase in which the returned delay will be:
 *              @li if it's set to 1 the returned delay is in seconds
 *              @li if it's set to 1000 the returned delay is in milliseconds
 *              @li if it's set to the input sample rate then the returned
 *                  delay is in input samples
 *              @li if it's set to the output sample rate then the returned
 *                  delay is in output samples
 *              @li if it's the least common multiple of in_sample_rate and
 *                  out_sample_rate then an exact rounding-free delay will be
 *                  returned
 * @returns     the delay in 1 / @c base units.
 */
int64_t swr_get_delay(struct SwrContext *s, int64_t base);

4. Compilation

valgrind reports possibly lost memory inside the ALSA library; this is commonly seen:

==16296== LEAK SUMMARY:
==16296==    definitely lost: 0 bytes in 0 blocks
==16296==    indirectly lost: 0 bytes in 0 blocks
==16296==      possibly lost: 22,748 bytes in 1,216 blocks
==16296==    still reachable: 164 bytes in 6 blocks

Reference: alsa - mem leak? - stackoverflow


posted @ 2017-06-22 00:16 yuxi_o
