Some casual notes on optimizing the H.263 implementation source code

Posted on 2010-11-28 20:19 by 晓彻

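We are about to start trimming and optimizing H.263 TMN 3.2. In this morning's remote meeting Chief Ma talked at length, mostly about optimization, and it became clear that algorithm-level optimization of the software really is necessary.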

Let's start with the profile I ran earlier; call counts, time spent and similar figures are clear at a glance.

1.1 Integer-pixel search using the SAD (sum of absolute differences) block-matching criterion

SAD replaces the squaring operation of MSE with an absolute value, which clearly reduces the amount of computation and so speeds things up.

Tests show that SAD needs about one third less computation than MSE, while the resulting picture quality is comparable.

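I suggest using hardware features to speed up the block-matching computation; Intel's MMX technology provides such a feature. Criteria like SAD repeat the same computation over short data (8-bit pixels). MMX increases the number of data items a single instruction operates on (SIMD), so several groups of data can be processed in one instruction, in parallel, which speeds up the computation.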
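To make the SIMD point concrete, here is a minimal sketch that is not taken from TMN. It assumes an x86 compiler with SSE2 (the successor to MMX, used here instead of MMX itself) and relies on the `_mm_sad_epu8` intrinsic, which sums the absolute differences of sixteen 8-bit pixels in one instruction. On other targets it falls back to a plain loop. The function name `sad_row16` is hypothetical.

```c
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: SAD of one row of 16 pixels. */
int sad_row16 (const unsigned char *a, const unsigned char *b)
{
#if defined(__SSE2__)
  __m128i va = _mm_loadu_si128 ((const __m128i *) a);
  __m128i vb = _mm_loadu_si128 ((const __m128i *) b);
  /* One instruction yields two partial sums, one per group of 8 bytes. */
  __m128i s = _mm_sad_epu8 (va, vb);
  return _mm_cvtsi128_si32 (s) + _mm_extract_epi16 (s, 4);
#else
  /* Portable fallback: one pixel per iteration. */
  int i, sad = 0;
  for (i = 0; i < 16; i++)
    sad += abs (a[i] - b[i]);
  return sad;
#endif
}
```

A 16x16 macroblock SAD would call this once per row, so sixteen vector operations replace 256 scalar subtract/abs/add steps.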

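In the TMN 3.2 implementation this function is called more often than any other, millions of times in my tests, so a hardware implementation is recommended.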

/********************************************************************
* Name: SAD_Macroblock
* Description: fast way to find the SAD of one vector
* Input: pointers to search_area and current block,
*        Min_F1/F2/FR
* Returns: sad_f1/f2
* Side effects:
********************************************************************/
int SAD_Macroblock (unsigned char *ii, unsigned char *act_block,
                    int h_length, int Min_FRAME)
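For reference, a 16x16 SAD with this kind of signature can be sketched as follows. This is a minimal illustration rather than the TMN source. I assume that `h_length` is the row stride of the search area, that the current block is stored as 16 contiguous rows of 16 pixels, and that the last argument is the best SAD found so far, used to abandon a candidate early. The name `sad_macroblock_sketch` is hypothetical.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical sketch: SAD of a 16x16 block with early termination. */
int sad_macroblock_sketch (const unsigned char *ii, const unsigned char *act_block,
                           int h_length, int min_so_far)
{
  int row, col, sad = 0;

  for (row = 0; row < 16; row++)
  {
    for (col = 0; col < 16; col++)
      sad += abs (ii[col] - act_block[col]);
    /* Early exit: this candidate can no longer beat the best match. */
    if (sad > min_so_far)
      return INT_MAX;
    ii += h_length;      /* next row of the search area */
    act_block += 16;     /* next row of the current block */
  }
  return sad;
}
```

The early exit matters because most candidate vectors in a search window are poor matches and can be rejected after a few rows.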

1.2 DCT coefficient matrix initialization

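This is only an initialization function, but it does floating-point work and was called on the order of a hundred thousand times in my tests, so its overhead is significant.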

/* initialize DCT coefficient matrix */
void init_idctref ()
{
  int freq, time;
  double scale;

  for (freq = 0; freq < 8; freq++)
  {
    scale = (freq == 0) ? sqrt (0.125) : 0.5;
    for (time = 0; time < 8; time++)
      c[freq][time] = scale * cos ((PI / 8.0) * freq * (time + 0.5));
  }
}

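But a closer look at the code shows that this initialization is unreasonably inefficient. sqrt (0.125) sits inside the loop, and every call makes 64 cos calls plus floating-point multiplications by freq. (PI / 8.0 is a constant expression that the compiler can fold, and a shift does not apply to a floating-point value, so that part is not the real cost.) With more than a hundred thousand calls in the tests, the resource consumption is naturally large. Since the table never changes, a better optimization is to initialize it by filling a static array directly with precomputed values: the classic trade of space for time.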

1.3 Half-pixel precision search

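This is used in half-pixel-precision motion estimation. It is computationally demanding, so a hardware implementation is recommended.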

/********************************************************************
* Name: FindHalfPel
* Description: Find the optimal half pel prediction
* Input: position, vector, array with current data
*        pointer to previous interpolated luminance,
* Returns:
********************************************************************/
void FindHalfPel (int x, int y, MotionVector * fr, unsigned char *prev,
                  int *curr, int bs, int comp)

1.4 Quantizer

1) Quantization principle

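Quantization means representing a range of values by a single value within a specified range. For example, converting a real number to the nearest integer is a form of quantization. The quantized range can be represented exactly by an integer code, and the decoder can use that code to recover the quantized value. The difference between the actual value and the quantized value is called quantization noise. In some situations the human visual system is insensitive to quantization noise, so the noise can be quite large, and quantization can therefore improve coding efficiency.

Quantization plays an important role in coding the whole video sequence, because the coefficient matrix produced by the DCT is quantized first and the quantized matrix is then coded. The fewer non-zero coefficients remain after quantization, the better the compression, and this is the main reason the coding scheme performs well. In other words, after the DCT the large-magnitude coefficients of a macroblock are concentrated in the low-frequency region while the high-frequency coefficients are relatively small. After quantization many high-frequency coefficients become zero, which lowers the transmitted bit rate and achieves the compression.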
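The effect described above can be sketched with a simple uniform quantizer. The code below is an illustration under assumptions, not the TMN `Quant_blk` source: it uses a step of 2*QP with a dead zone of QP/2, similar in form to what is commonly described for H.263 inter blocks, and it ignores clipping and the special handling of the intra DC coefficient. The names `quant_sketch` and `count_zeros` are hypothetical.

```c
#include <stdlib.h>

/* Quantize n coefficients with step 2*QP and a dead zone of QP/2. */
void quant_sketch (const int *coeff, int *qcoeff, int n, int QP)
{
  int i;
  for (i = 0; i < n; i++)
  {
    int level = (abs (coeff[i]) - QP / 2) / (2 * QP);
    if (level < 0)
      level = 0;           /* small magnitudes fall into the dead zone */
    qcoeff[i] = (coeff[i] < 0) ? -level : level;
  }
}

/* Count how many quantized coefficients became zero. */
int count_zeros (const int *qcoeff, int n)
{
  int i, zeros = 0;
  for (i = 0; i < n; i++)
    if (qcoeff[i] == 0)
      zeros++;
  return zeros;
}
```

With QP = 10, the coefficients {400, -100, 30, 12, -7, 3, 0, 1} quantize to {19, -4, 1, 0, 0, 0, 0, 0}: the large low-frequency values survive and the small ones vanish, which is what makes the following entropy coding cheap.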

2) The TMN quantizer implementation: the Quant_blk function

/********************************************************************
* Name: Quant_blk
* Description: quantizer
* Input: pointers to coeff and qcoeff
* Returns:
* Side effects:
********************************************************************/
void Quant_blk (int *coeff, int *qcoeff, int QP, int Mode, int block)

3) The quantizer is a core component of the encoder; a hardware implementation is recommended.

1.5 DCT (discrete cosine transform)

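Below is the function TMN uses to implement the DCT. This part is a self-contained module, is used very frequently and has a noticeable computational cost, so a hardware implementation is recommended.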

/********************************************************************
* Name: Dct
* Description: Does dct on an 8x8 block
* Input: 64 pixels in a 1D array
* Returns: 64 coefficients in a 1D array
* Date: 930128 Author: Robert.Danielsen@nta.no
********************************************************************/
int Dct (int *block, int *coeff)

1.6 LoadImage

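Loading image data allocates a large amount of memory and copies the image into it, so the profile shows heavy resource consumption. The cost comes mainly from the large number of memory allocations, and this part is not suited to a hardware implementation.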

/**********************************************************************
 *
 *    Name:        LoadArea
 *    Description:    fills array with a square of image-data
 *
 *    Input:           pointer to image and position, x and y size
 *    Returns:       pointer to area
 *    Side effects:  memory allocated to array
 *
 *    Date: 940203    Author: PGB
 *                      Mod: KOL
 *
 ***********************************************************************/

unsigned char *LoadArea (unsigned char *im, int x, int y,
                          int x_size, int y_size, int lx)
{
  unsigned char *res = (unsigned char *) malloc (sizeof (char) * x_size * y_size);
  unsigned char *in;
  unsigned char *out;
  int i = x_size;
  int j = y_size;

  if (res == NULL)          /* allocation failed; the caller must check */
    return NULL;

  in = im + (y * lx) + x;   /* top-left corner of the area; lx is the image row length */
  out = res;

  while (j--)
  {
    while (i--)
      *out++ = *in++;
    i = x_size;
    in += lx - x_size;      /* skip to the start of the next row */
  }
  return res;
}

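The time in this code goes mainly to memory allocation. My optimization idea, then, is the memory-pool approach: allocate one large block up front and let the program cut its own slices from it, breaking the whole into parts. How large that block should be still needs to be weighed; the profile from the actual application can settle that.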
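A minimal sketch of that idea follows, assuming a simple bump allocator that is reset at a convenient point such as the end of each frame. None of this is TMN code: `Pool`, `pool_init`, `pool_alloc`, `pool_reset`, `pool_destroy` and `load_area_pooled` are hypothetical names, and the pool size is left to the caller.

```c
#include <stdlib.h>
#include <string.h>

typedef struct
{
  unsigned char *base;   /* one large block obtained up front */
  size_t size;           /* total bytes in the block */
  size_t used;           /* bytes already handed out */
} Pool;

/* Returns 1 on success, 0 if the large block could not be allocated. */
int pool_init (Pool *p, size_t size)
{
  p->base = (unsigned char *) malloc (size);
  p->size = (p->base != NULL) ? size : 0;
  p->used = 0;
  return p->base != NULL;
}

/* Cut a slice from the big block; returns NULL when the pool is exhausted.
   No alignment is applied, which is fine for unsigned char buffers. */
void *pool_alloc (Pool *p, size_t n)
{
  void *res;
  if (n > p->size - p->used)
    return NULL;
  res = p->base + p->used;
  p->used += n;
  return res;
}

/* Make the whole block available again without calling free. */
void pool_reset (Pool *p)
{
  p->used = 0;
}

void pool_destroy (Pool *p)
{
  free (p->base);
  p->base = NULL;
  p->size = 0;
  p->used = 0;
}

/* Same copy as LoadArea, but the destination comes from the pool. */
unsigned char *load_area_pooled (Pool *p, const unsigned char *im, int x, int y,
                                 int x_size, int y_size, int lx)
{
  unsigned char *res = (unsigned char *) pool_alloc (p, (size_t) x_size * (size_t) y_size);
  int j;

  if (res == NULL)
    return NULL;
  for (j = 0; j < y_size; j++)
    memcpy (res + (size_t) j * x_size, im + (size_t) (y + j) * lx + x, (size_t) x_size);
  return res;
}
```

Callers would stop calling free on individual areas and instead call `pool_reset` once all areas from a pass are no longer needed, so the per-area cost drops to a pointer bump plus the copy.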

1.7 Motion estimation

I have not looked at this closely yet; the code needs further reading.

/********************************************************************
* Name: MotionEstimation
* Description: Estimate all motion vectors for one MB
* Input: pointers to current and reference (previous)
*        image, pointers to current slice and
*        current MB
* Side effects: motion vector information in MB changed
********************************************************************/
void MotionEstimation (unsigned char *curr, unsigned char *reference, int x_curr,
                       int y_curr, int xoff, int yoff, int seek_dist,
                       MotionVector *MV[7][MBR + 1][MBC + 2], int *SAD_0,
                       int estimation_type, int backward_pred,
                       int pmv0, int pmv1)

1.8 SNR computation

This appears to use file reads and writes, which adds system overhead; it is not suited to a hardware implementation.

/********************************************************************
* Name: SNRcomp
* Description: Compares two image files using SNR
*              No conversion to 422
* Input:
* Returns:
* Side effects:
********************************************************************/
void ComputeSNR (PictImage * im1, PictImage * im2, Results * res, int pict_type, int write)

1.9 Chrominance fill

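This fills in the chrominance prediction for P-frames. Because it is called so frequently, I think a hardware implementation is worth considering.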

/********************************************************************
* Name: DoPredChrom_P
* Description: Does the chrominance prediction for P-frames
* Input: motion vectors for each field,
*        current position in image,
*        pointers to current and previous image,
*        pointer to pred_error array,
*        (int) field: 1 if field coding
* Side effects: fills chrom-array in pred_error structure
* Date: 930211 Author: Karl.Lillevold@nta.no
*********************************************************************/
void DoPredChrom_P (int x_curr, int y_curr, int dx, int dy,
                    PictImage * curr, PictImage * prev,
                    MB_Structure * prediction,
                    MB_Structure * pred_error, int rtype)
