The following tips and tricks put some of the techniques described above into practice.
- Initializing Data:
- Set a register to zero:
    mov eax, 0
  Faster:
    xor eax, eax
- Set all bits of MM0 to 1s:
  C declaration: unsigned temp[4] = {0xFFFFFFFF, 0xFFFFFFFF,
  Faster:
    pcmpeqd mm0, mm0
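To see why comparing a register with itself produces all 1s, here is a minimal sketch in portable C. It is a model of the lane-wise behavior of pcmpeqd on a 64-bit value, not real MMX code; the function names `pcmpeqd_model` and `all_ones` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of pcmpeqd on a 64-bit MMX-style value (assumption: this
   only imitates the instruction's result, it does not use MMX hardware).
   Each 32-bit lane becomes all 1s if the two lanes are equal, else all 0s. */
static uint64_t pcmpeqd_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 2; ++lane) {
        uint32_t x = (uint32_t)(a >> (32 * lane));
        uint32_t y = (uint32_t)(b >> (32 * lane));
        if (x == y)
            result |= (uint64_t)0xFFFFFFFFu << (32 * lane);
    }
    return result;
}

/* Comparing a value with itself: every lane is equal, so the result is
   all 1s regardless of what the register held before. */
static uint64_t all_ones(uint64_t stale)
{
    uint64_t r = pcmpeqd_model(stale, stale);
    assert(r == UINT64_MAX);
    return r;
}
```

Because the result does not depend on the previous register contents, no memory load of a constant is needed.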
- Creating Constants:
- Set mm7 to 0xFF00FF00FF00FF00:
    pcmpeqd mm7, mm7 // 0xFFFFFFFFFFFFFFFF

    pxor mm7, mm7 // 0x0
  pxor and pcmpeqd execute on the MMX-ALU execution unit, and punpcklbw executes on the MMX-SHIFT execution unit. Each instruction takes two clock cycles to complete, but the MMX-ALU can accept pcmpeqd one cycle after pxor is issued instead of waiting for pxor to complete. The whole operation therefore takes five clock cycles instead of six.
- Set mm7 to 0x00FF00FF00FF00FF:
    pxor mm0, mm0 // 0x0
  Note: The same technique can be used with XMM registers with some minor modifications, because only half of an XMM register can be processed at a time.
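The byte interleaving that builds both constants can be checked with a small portable C model of punpcklbw. This is a sketch under assumptions: the instruction listings above are incomplete, so the operand order used here (the zeroed register as destination for 0xFF00FF00FF00FF00, the all-1s register as destination for 0x00FF00FF00FF00FF) is inferred, and the names `punpcklbw_model`, `make_ff00`, and `make_00ff` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of punpcklbw dest, src on 64-bit values (not real MMX):
   the low four bytes of dest and src are interleaved, with each dest byte
   in the even position and each src byte in the odd position. */
static uint64_t punpcklbw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; ++i) {
        uint64_t d = (dest >> (8 * i)) & 0xFF;
        uint64_t s = (src >> (8 * i)) & 0xFF;
        result |= d << (16 * i);      /* even byte comes from dest */
        result |= s << (16 * i + 8);  /* odd byte comes from src   */
    }
    return result;
}

/* Assumed sequence: dest zeroed with pxor, src set to all 1s with pcmpeqd. */
static uint64_t make_ff00(void)
{
    uint64_t r = punpcklbw_model(0, UINT64_MAX);
    assert(r == 0xFF00FF00FF00FF00ULL);
    return r;
}

/* Assumed sequence: dest set to all 1s with pcmpeqd, src zeroed with pxor. */
static uint64_t make_00ff(void)
{
    uint64_t r = punpcklbw_model(UINT64_MAX, 0);
    assert(r == 0x00FF00FF00FF00FFULL);
    return r;
}
```

Swapping which register is zero and which is all 1s is the only difference between the two constants.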
- Loading Data:
    movq mm1, mm2
  Faster:
    pshufw mm1, mm2, 0xE4
  Note: The trick lies in the immediate 0xE4 (binary 11 10 01 00): it selects source words 3, 2, 1, and 0 for destination words 3, 2, 1, and 0, so the word order is unchanged.
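The way pshufw reads its immediate can be sketched with a portable C model. It is not real MMX code, and the names `pshufw_model` and `copy_via_pshufw` are hypothetical; the model only reproduces the word selection, with each 2-bit field of the immediate choosing the source word for one destination word.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of pshufw dest, src, imm on a 64-bit value (not real MMX).
   Bits [2i+1:2i] of imm select which 16-bit source word lands in
   destination word i. */
static uint64_t pshufw_model(uint64_t src, uint8_t imm)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned sel = (imm >> (2 * i)) & 3u;
        uint64_t word = (src >> (16 * sel)) & 0xFFFF;
        result |= word << (16 * i);
    }
    return result;
}

/* 0xE4 = 11 10 01 00 in binary: destination word i takes source word i,
   so the shuffle degenerates into a plain copy. */
static uint64_t copy_via_pshufw(uint64_t src)
{
    uint64_t r = pshufw_model(src, 0xE4);
    assert(r == src);
    return r;
}
```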
- Swapping Data:
- Swapping the high and low halves of a register:
    pshufw mm0, mm0, 0x4E
  Note: If you swap the two hex digits of the immediate, changing 0x4E to 0xE4, the operation becomes a copy instead of a swap.
- Creating patterns:
  Load register mm0 with 0xAADDAADDAADDAADD:
    mov eax, 0xAADD
  Note: The pshufw immediate 0x0 copies the lowest word, 0xAADD, into all four words of mm0.
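The immediates used in the last three tips (0xE4, 0x4E, and 0x0) all follow one encoding, which the following C sketch makes explicit. The helper names `pshufw_imm` and `broadcast_word` are hypothetical. The broadcast helper models only the end result, under the assumption that the truncated listing above continues by moving eax into mm0 and shuffling with an immediate of 0x0.

```c
#include <assert.h>
#include <stdint.h>

/* Build a pshufw immediate from four word selectors, listed from the
   highest destination word (w3) down to the lowest (w0). Each selector
   is a source word index in the range 0..3. */
static uint8_t pshufw_imm(unsigned w3, unsigned w2, unsigned w1, unsigned w0)
{
    assert(w3 < 4 && w2 < 4 && w1 < 4 && w0 < 4);
    return (uint8_t)((w3 << 6) | (w2 << 4) | (w1 << 2) | w0);
}

/* Model of the broadcast result: with every selector equal to 0, all four
   destination words receive the lowest source word. */
static uint64_t broadcast_word(uint16_t w)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; ++i)
        result |= (uint64_t)w << (16 * i);
    return result;
}
```

With this helper, `pshufw_imm(3, 2, 1, 0)` gives the copy immediate 0xE4, `pshufw_imm(1, 0, 3, 2)` gives the swap immediate 0x4E, and `pshufw_imm(0, 0, 0, 0)` gives the broadcast immediate 0x0.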
- Using lea Instructions:
    mov edx, ecx
  Faster:
    lea edx, [ecx + ecx]
  Note: A lea combined with one or two additional add instructions is still fast, but do not go beyond three adds; the extra cycles defeat the benefit of the lea instruction.
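The arithmetic that a single lea can express is base + index * scale, where scale is 1, 2, 4, or 8. The C sketch below models that form; the names `lea_base_index`, `times2`, `times3`, `times5`, and `times9` are hypothetical. Whether a compiler actually emits lea for these functions depends on the compiler and its settings, so treat the mapping in the comments as an assumption to verify in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the address arithmetic in lea r, [base + index*scale].
   The hardware encoding only allows a scale of 1, 2, 4, or 8. */
static uint32_t lea_base_index(uint32_t base, uint32_t index, unsigned scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale;
}

/* Multiplications that fit a single lea with the same register used
   as both base and index. */
static uint32_t times2(uint32_t x) { return lea_base_index(x, x, 1); } /* [ecx + ecx]   */
static uint32_t times3(uint32_t x) { return lea_base_index(x, x, 2); } /* [ecx + ecx*2] */
static uint32_t times5(uint32_t x) { return lea_base_index(x, x, 4); } /* [ecx + ecx*4] */
static uint32_t times9(uint32_t x) { return lea_base_index(x, x, 8); } /* [ecx + ecx*8] */
```

Other small multipliers need an extra add or a second lea on top of one of these forms, which is where the limit on additional adds in the note comes into play.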