Constantly recognizing one's own ignorance is a sign of human wisdom

The following tips and tricks put some of the techniques described above into practice.

  • Initializing Data:
     -  Set a register to zero:

 

mov eax, 0

Faster:

xor eax, eax
pxor mm0, mm0
pxor xmm0, xmm0


        - Set all bits of MM0 (or XMM1) to 1s:

C declaration: unsigned temp[4] = {0xFFFFFFFF, 0xFFFFFFFF,
0xFFFFFFFF, 0xFFFFFFFF};

asm { movq mm0, temp
movdqu xmm1, temp}

Faster:

pcmpeqd mm0, mm0
pcmpeqd xmm1, xmm1

  • Creating Constants:
    - Set mm7 to 0xFF00FF00FF00FF00:

pcmpeqd mm7, mm7 // 0xFF FF FF FF FF FF FF FF
psllq mm7, 8 // 0xFF FF FF FF FF FF FF 00
pshufw mm7, mm7, 0x0 // 0xFF 00 FF 00 FF 00 FF 00

 

              


Each instruction takes two clock cycles to complete, and each one depends on the result of the previous one, so the whole operation finishes in six clock cycles.

Faster:

pxor mm7, mm7 // 0x0000000000000000
pcmpeqd mm0, mm0 // 0xFFFFFFFFFFFFFFFF
punpcklbw mm7, mm0 // 0xFF00FF00FF00FF00

 

             

Here pxor and pcmpeqd are handled by the MMX-ALU execution unit, and punpcklbw by the MMX-SHIFT execution unit. Each instruction still takes two clock cycles to complete, but pxor and pcmpeqd do not depend on each other, so the MMX-ALU waits only one cycle before starting pcmpeqd instead of waiting for pxor to complete. pcmpeqd therefore finishes at cycle three, punpcklbw adds two more cycles, and the whole operation takes five clock cycles instead of six.

    - Set mm7 to 0x00FF00FF00FF00FF:

pxor mm0, mm0 // 0x0000000000000000
pcmpeqd mm7, mm7 // 0xFFFFFFFFFFFFFFFF
punpcklbw mm7, mm0 // 0x00FF00FF00FF00FF

 

               

Note: The same technique can be used with XMM registers with some minor modifications, since we can only work on half of the XMM register at a time.
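The byte interleaving behind both masks can be checked with a small portable C model. This is only a sketch for reasoning about the result, not MMX code: the function names punpcklbw_model, make_ff00_mask, and make_00ff_mask are made up for this illustration, and a uint64_t stands in for an MMX register.

```c
#include <stdint.h>

/* Portable model of "punpcklbw dst, src" on a 64-bit MMX register:
   the low four bytes of dst and src are interleaved, with each dst byte
   landing in the even byte position and each src byte in the odd one. */
uint64_t punpcklbw_model(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = (dst >> (8 * i)) & 0xFF;  /* byte i of dst */
        uint64_t s = (src >> (8 * i)) & 0xFF;  /* byte i of src */
        result |= d << (16 * i);               /* byte position 2i     */
        result |= s << (16 * i + 8);           /* byte position 2i + 1 */
    }
    return result;
}

/* Models: pxor mm7, mm7 / pcmpeqd mm0, mm0 / punpcklbw mm7, mm0 */
uint64_t make_ff00_mask(void)
{
    return punpcklbw_model(0, UINT64_MAX);
}

/* Models: pxor mm0, mm0 / pcmpeqd mm7, mm7 / punpcklbw mm7, mm0 */
uint64_t make_00ff_mask(void)
{
    return punpcklbw_model(UINT64_MAX, 0);
}
```

Swapping which register holds zeros and which holds ones is the only difference between the two masks, because the destination operand always supplies the lower byte of each word.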

  • Loading Data:

        

movq mm1, mm2

Faster:

pshufw mm1, mm2, 0xE4

 

               

Note: The trick lies in the immediate 0xE4 (binary 11 10 01 00): it selects words 3, 2, 1, and 0 for positions 3, 2, 1, and 0, which means the order does not change.

This is a useful way to copy the contents of one register to another. The instruction movq takes six clock cycles to complete, compared with only two for the pshufw instruction. Do not substitute movq with pshufw automatically, however; make sure that the appropriate execution unit is not busy at that time. The movq and pshufw instructions use the FP_MOV and MMX_SHFT execution units, respectively.

  • Swapping Data:
    - Swapping the hi and lo portions of a register:

              

pshufw mm0, mm0, 0x4E
pshufd xmm0, xmm0, 0x4E

 

              

Note: If you swap the two hex digits of the immediate, from 0x4E to 0xE4, the operation becomes a copy instead of a swap.

               
          - Creating patterns:

        

Load register mm0 with 0xAADDAADDAADDAADD:

mov eax, 0xAADD
movd mm0, eax
pshufw mm0, mm0, 0x0

 

              

Note: The number 0x0 will copy the first word “AADD” to all subsequent words of mm0.

You can use the same technique with XMM registers by working on one half at a time: build the pattern in the lower half, shift left to move it to the upper half, and issue the command again to take care of the lower half.
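The three pshufw immediates used so far (0xE4 for copy, 0x4E for swap, 0x0 for broadcast) all follow from how the instruction decodes its selector byte. The sketch below is a portable C model of that decoding, offered as an aid for checking an immediate by hand; pshufw_model is a name invented for this illustration, and a uint64_t stands in for an MMX register.

```c
#include <stdint.h>

/* Portable model of "pshufw dst, src, imm8":
   result word i is the source word picked by bits [2i+1 : 2i] of imm8. */
uint64_t pshufw_model(uint64_t src, uint8_t imm8)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 0x3;         /* 2-bit selector  */
        uint64_t word = (src >> (16 * sel)) & 0xFFFF;   /* chosen src word */
        result |= word << (16 * i);                     /* place in word i */
    }
    return result;
}
```

With a source of 0x3333222211110000, the selector 0xE4 returns the value unchanged and 0x4E returns 0x1111000033332222. With a source of 0x000000000000AADD, the selector 0x0 picks word 0 for every position and returns 0xAADDAADDAADDAADD.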

  • Using lea Instructions:

        

mov edx, ecx
sal edx, 3 // edx = ecx * 8

Faster:

lea edx, [ecx + ecx] // edx = ecx * 2
add edx, edx // edx = ecx * 4
add edx, edx // edx = ecx * 8

 

             

Note: A lea instruction followed by two add instructions is fast, but do not go beyond three adds; past that point the accumulated cost of the extra instructions defeats the benefit of the lea instruction.
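Both sequences compute the same value, which can be confirmed in plain C. This is a minimal sketch of the arithmetic only: the function names times8_shift and times8_lea_add are invented for this illustration, and a compiler is free to emit either instruction sequence (or something else) for either function.

```c
#include <stdint.h>

/* Models: mov edx, ecx / sal edx, 3 */
uint32_t times8_shift(uint32_t x)
{
    return x << 3;
}

/* Models: lea edx, [ecx + ecx] / add edx, edx / add edx, edx */
uint32_t times8_lea_add(uint32_t x)
{
    uint32_t r = x + x;  /* lea: 2x */
    r += r;              /* add: 4x */
    r += r;              /* add: 8x */
    return r;
}
```

Unsigned 32-bit arithmetic wraps modulo 2^32 in both versions, so the results agree even when the product overflows.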

posted on 2011-06-16 09:47  loleng