[快速阅读八] Matlab中bwlookup的实现及其在计算二值图像的欧拉数、面积及其他morph变形中的应用。

　　以前看过matlab的bwlookup函数，但是总感觉有点神秘，一直没有去仔细分析，最近在分析计算二值图像的欧拉数时，发现自己写的代码和matlab的总是对不少，于是又去翻了下matlab的源代码，看到了matlab里实现欧拉数的代码非常简单，如下所示：

if n==4
    lut = 4*[0 0.25 0.25 0 0.25 0  .5 -0.25 0.25  0.5  0 -0.25 0 ...
             -0.25 -0.25 0] + 2;
else
    lut = 4*[0 0.25 0.25 0 0.25 0 -.5 -0.25 0.25 -0.5  0 -0.25 0 ...
             -0.25 -0.25 0] + 2;
end
% Need to zero-pad the input
b = padarray(a,[1 1],'both');

weights = bwlookup(b,lut);
if coder.isColumnMajor
    e = (sum(weights(:),'double') - 2*numel(b)) / 4;
else
    e = (sum(sum(weights,2,'double'),1,'double') - 2*numel(b)) / 4;
end

　　在内部就是调用了bwlookup函数，那没办法了，就仔细看了M的帮助文档，发现原来这个函数真的非常简单，我们贴下M的帮助文档：

　　A = bwlookup(BW,lut)

　　The bwlookup function performs these steps to determine the value of each pixel in the processed image A:

Locate the pixel neighborhood in input image BW based on the coordinates of the target pixel in A. The function zero-pads border pixels of image BW when the neighborhood extends past the edge of BW.
Calculate an index, idx, based on the binary pixel pattern of the neighborhood.
Set the target pixel in A as the value of the lookup table at the index idx, in other words, the value of lut(idx).

2-by-2 Neighborhood Lookup

For 2-by-2 neighborhoods, there are four pixels in each neighborhood. Each binary pixel has two possible states, therefore the total number of permutations is 24 and the length of the lookup table lut is 16.

To find the value of the target output pixel at (row, column) coordinate (r,c), bwlookup uses the 2-by-2 pixel neighborhood in the input binary image BW whose top left pixel is at coordinate (r,c). The index idx into the lookup table is the weighted sum of the four pixels in the neighborhood, plus 1.

Pixel weights start with a weight of 1 for the top left pixel in the neighborhood, and increase as powers of two along rows then along columns, with a final weight of 8 for the bottom right pixel.

3-by-3 Neighborhood Lookup

For 3-by-3 neighborhoods, there are nine pixels in each neighborhood. Each binary pixel has two possible states, therefore the total number of permutations is 29 and the length of the lookup table lut is 256.

To find the value of the target output pixel at (row, column) coordinate (r,c), bwlookup uses the 3-by-3 pixel neighborhood in the input binary image BW centered on coordinate (r,c). The index idx into the lookup table is the weighted sum of the nine pixels in the neighborhood, plus 1.

Pixel weights start with a weight of 1 for the top left pixel in the neighborhood, and increase as powers of two along rows then along columns, with a final weight of 256 for the bottom right pixel.

　　简单的说，这个查找表只有2种可能，一个是表的元素是16个，一种是由512个元素，其中16个元素时，查表的依据是以当前点为起点，向右向下各扩展一个像素范围，得到一个2*2领域4个像素，4个像素，每个像素也就只有2种可能的取值，这样就有16种可能的组合，对应16个元素。

　　当表的元素是512个时，查表的依据是以当前点为中心点，向左向右，向上向下各扩展一个像素，得到一个3*3的领域9个像素，同样每个像素也只有2种可能取值，所以共有1<<9即512种取值。

　　因为涉及到领域，所有在原图边缘的地方会有越界的图像，这个时候的原则是填充0，而不是复制最近的像素。

　　这样，当表中有不同的内容时，就可以实现不同的结果。

　　我们以16个元素的表为例，假定 lut[16] ={ 6 3 16 11 7 14 8 5 15 1 2 4 13 9 10 12},

　　对于下面的一个4*4的二值数据：

   0   0   1   1
   0   0   1   1
   1   1   0   0
   1   1   0   0

　　如果当前的点位于第二行第一列，则其2*2的领域范围的像素信息为：

   0   0
   1   1
　　按照位置信息即对应的值计算索引为:  1 + 0 * 1 + 1 * 2 + 0 * 4 + 1 * 8 = 11， 对用的查表结果为 lut[11] = 2。
　　使用类似的计算方法可以得到其他位置经过查表后的结果为：

  　　6    13    12    11
     2     8    14     3
    12    11     6     6
    14     3     6     6

　注意，以上所有的下标起点都是1(matlab内部的约定），而且行列顺序和我们常用的C语言的二维数组也是掉了边的。 
  除了刚刚说的欧拉数的计算（其实没有搞懂欧拉数里的那个查找表是如何设计的，且8领域的欧拉数也是用的2*2的表，而不是3*3的），我们以matlab的bwarea函数为例，说明这个函数的价值。

  matlab中关于bwarea函数的定义为：

　　bwarea 通过对图像中每个像素的面积求和来估计图像中所有 on 像素的面积。单个像素的面积是通过观察其 2×2 邻域来确定的。有六种不同的情形，每种情形表示一个不同面积：

具有零个 on 像素的情形（面积 = 0）
具有一个 on 像素的情形（面积 = 1/4）
具有两个相邻 on 像素的情形（面积 = 1/2）
具有两个对角 on 像素的情形（面积 = 3/4）
具有三个 on 像素的情形（面积 = 7/8）
具有所有四个 on 像素的情形（面积 = 1）

　　用图形化的表达方式上述六种情况对应如下（为了显示，使用绿色表示黑色值，红色表示白色值）：

索引值　 0　　　 1　　　 2　　　　3　　 4　　　 5　　　 6　　　 7　　　 8　　 9　　 10　　 11　　 12　　 13　　 14　　 15　　

面积值　　　　 0　　　1/4　　 1/4　　　1/2　　 1/4　　 1/2　　 3/4　　　 7/8　　　1/4　　 3/4　　 1/2　　 7/8　　 1/2　　 7/8　　 7/8　　 1

把面积值放大八倍得到面积值依次为： 0 2 2 4 2 4 6 7 2 6 4 7 4 7 7 8，我们翻看matlab的bwarea函数，可以得到其代码如下：

lut = [0     2     2     4     2     4     6     7     2     6     4     7 ...
       4     7     7     8];

% Need to zero-pad the input.
bb = false(size(b)+2);
bb(2:end-1,2:end-1) = b;

weights = bwlookup(bb,lut);
a = sum(weights(:),[],'double')/8;

　　查找表和这里的一摸一样。

　　使用查找表的优点是把领域所有的组合提前计算出结果来，而不用到程序里进行一些列复杂的判断，这样在很多情况下是可以提高速度的，比如刚刚这个面积计算的东西，如果在循环内部做，则要做N种判断，那如果是涉及到3*3的领域，那就更为复杂了。

　　朴素的来说，有了bwlookup的原理，要把这个算法移植到C++中，就是一个比较简单的过程了，假如是处理16元素的表，也就是涉及到了4个像素，假定他们的值分别为P0\P1\P2\P3，则最基本的C代码如下所示：

int Index = 0;
if (P0 == 255)    Index += 1;
if (P1 == 255)    Index += 2;
if (P2 == 255)    Index += 4;
if (P3 == 255)    Index += 8;
LinePD[X] = Lut[Index];

　　处理512个元素的过程也是类似的，只不过索引值多加了几个（注意，这里用255表示白色，而不是1）。

　　如果考虑指令集优化，一个自然的翻译过程如下所示：

    __m128i Index = _mm_setzero_si128();
    Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(1)), _mm_cmpeq_epi8(P0, _mm_set1_epi8(255)));
    Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(2)), _mm_cmpeq_epi8(P1, _mm_set1_epi8(255)));
    Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(4)), _mm_cmpeq_epi8(P2, _mm_set1_epi8(255)));
    Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(8)), _mm_cmpeq_epi8(P3, _mm_set1_epi8(255)));
    _mm_storeu_si128((__m128i*)(LinePD + X), _mm_shuffle_epi8(Lut128, Index));

　　一次性处理16个字节，最为关键的一点是，这个查表可以方便的直接使用_mm_shuffle_epi8非常高效的实现，这个在我前期的文章里多次提到（16个字节表的查找，我们假定lut的类型为字节类型，这个在实际的应用中也足够了），

　　这个代码的性能提升相对于C++来说是非常可观的，大概又4倍多的速度提升。

　　仔细上述代码其实还有提高的地方，我们使用_mm_blendv_epi8来进行结果的混合，而混合的标志是值是否等于255，这里用_mm_cmpeq_epi8做判断，而_mm_cmpeq_epi8的结果就是如果P等于255，则返回255，否则返回0，那这个不就是P本省吗，所以根本就不用做这个判断，这样代码就变为：

    __m128i Index = _mm_setzero_si128();
    Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(1)), P0, _mm_set1_epi8(255));
    Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(2)), P1, _mm_set1_epi8(255));
    Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(4)), P2, _mm_set1_epi8(255));
    Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(8)), P3, _mm_set1_epi8(255));
    _mm_storeu_si128((__m128i*)(LinePD + X), _mm_shuffle_epi8(Lut128, Index));

　　另外，再观察这个C语言的特殊性，我们可以把上述C代码修改为：

    int Index = 0;
    Index += (1 & P0);
    Index += (2 & P1);
    Index += (4 & P2);
    Index += (8 & P3);
    LinePD[X] = Lut[Index];

　　即直接使用位运算替换掉那个判断，这样对应的SSE指令也可以进行修改为如下代码：

    __m128i Index = _mm_setzero_si128();
    Index = _mm_add_epi8(Index, _mm_and_si128(_mm_set1_epi8(1), P0));
    Index = _mm_add_epi8(Index, _mm_and_si128(_mm_set1_epi8(2), P1));
    Index = _mm_add_epi8(Index, _mm_and_si128(_mm_set1_epi8(4), P2));
    Index = _mm_add_epi8(Index, _mm_and_si128(_mm_set1_epi8(8), P3));
    //    可以直接接用shuffle实现查找表
    _mm_storeu_si128((__m128i*)(LinePD + X), _mm_shuffle_epi8(Lut128, Index));

　　相比于原始的SSE代码，这个改动也约有30%的性能提升的。

　　对于512个元素的表，情况会有所不同，因为索引值大于了255，所以已经无法用字节类型来保存累加后得到的索引了，至少的用ushort类型，但是呢，前8个位置的索引相加还是不会超出字节类型的，所以前8个位置还是可以一次性处理16个像素，到最后一个位置时，单独转换为16位，然后再接用2次16数据的处理指令，就可以一次性得到16个字节对应的查找表索引。

　　注意，512个元素的查表如果是纯C++代码，最后一个的 & 操作，一定要注意数据范围，不能写成 Index += (256 & P9);, 而是 Index += (256 & (65280 + P9));

　　但是这个时候，由于表的大小时512，所以已经无法使用SSE去优化这个查表功能了，只能把索引值单个的提取出来，然后用普通的C语言去执行代码。如果CPU能支持AVX2，那么AVX2倒是又有关指令能直接查表，不过也要做很多的改造工程。

　　我们测试，对于512个元素的表，优化后的SSE指令处理一副 3000*2000的二值图，大概耗时在2ms不到，速度还是相当的快的。

　　搜索matlab的代码，除了bweuler及bwarea内使用了bwlookup，另外就是bwmorph里也大量的使用了bwlookup，而且都是使用的512个元素的表，也就是说使用3*3的领域，我们看bwmorph的帮助文档，有很多相关内容：