optimizing cpp (1)
operating system
64 bit systems have several advantages over 32 bit systems:
The number of registers is doubled. This makes it possible to store intermediate data and local variables in registers rather than in memory.
Function parameters are transferred in registers rather than on the stack. This makes function calls more efficient.
The size of the integer registers is extended to 64 bits. This is only an advantage in applications that can take advantage of 64-bit integers.
The allocation and deallocation of big memory blocks is more efficient.
The SSE2 instruction set is supported on all 64-bit CPUs and operating systems.
The 64 bit instruction set supports self-relative addressing of data. This makes position-independent code more efficient.
64 bit systems have the following disadvantages compared to 32 bit systems:
Pointers, references, and stack entries use 64 bits rather than 32 bits. This makes data caching slightly less efficient.
Access to static or global arrays requires a few extra instructions for address calculation in 64 bit mode if the image base is not guaranteed to be less than 2^31. This extra cost is seen in 64 bit Windows and Mac programs but rarely in Linux.
Address calculation is more complicated in a large memory model where the combined size of code and data can exceed 2 Gbytes. This large memory model is hardly ever used, though.
Some instructions are one byte longer in 64 bit mode than in 32 bit mode.
In general, you can expect 64-bit programs to run a little faster than 32-bit programs if there are many function calls, if there are many allocations of large memory blocks, or if the program can take advantage of 64-bit integer calculations. It is necessary to use 64-bit systems if the program uses more than 2 gigabytes of data.
dynamic library vs static library
The advantages of using static linking rather than dynamic linking are:
Static linking includes only the part of the library that is actually needed by the application, while dynamic linking loads the entire library (or at least a large part of it) into memory even when just a single function is needed.
All the code is included in a single executable file when static linking is used. Dynamic linking makes it necessary to load several files when the program is started.
It takes longer to call a function in a dynamic link library than in a static link library because it needs an extra jump through a pointer in an import table.
The memory space becomes more fragmented when the code is distributed between multiple DLLs. The DLLs are always loaded at round memory addresses divisible by the memory page size (4096). This will make all DLLs contend for the same cache lines. This makes code caching and data caching less efficient.
DLLs are less efficient in some systems because of the needs of position-independent code, see below.
Installing a second application that uses a newer version of the same DLL can change the behavior of the first application if dynamic linking is used, but not if static linking is used.
The advantages of dynamic linking are:
Multiple applications running simultaneously can share the same DLL without the need to load more than one instance of the DLL into memory. This is useful on servers that run many processes simultaneously.
A DLL can be updated to a new version without the need to update the program that calls it.
A DLL can be called from programming languages that do not support static linking.
A DLL can be useful for making plug-ins that add functionality to an existing program.
Weighing the above advantages of each method, it is clear that static linking is preferable in critical applications except for parts of the code that are rarely used. Many function libraries are available in both static and dynamic versions. It is recommended to use the static version if a performance gain can be expected.
Different kinds of variable storage
The stack is the most efficient place to store data because the same range of memory addresses is reused again and again. If there are no big arrays, then it is almost certain that this part of the memory is mirrored in the level-1 data cache, where it is accessed quite fast.
Do not make variables global if you can avoid it.
Most compilers will recognize when two constants are identical so that only one copy needs to be stored. All identical constants in the entire program will be joined together in order to minimize the amount of cache space used for constants.
Integer constants are usually included as part of the instruction code. You can assume that there are no caching problems for integer constants.
Most compilers can make thread-local storage of static and global variables by using the keyword __thread or __declspec(thread). Such variables have one instance for each thread. Thread-local storage is inefficient because it is accessed through a pointer stored in a thread environment block. Thread-local storage should be avoided, if possible.
Dynamic memory allocation is done with the operators new and delete or with the functions malloc and free. These operators and functions consume a significant amount of time.
Integer variables and operators
Integer operations are fast in most cases, regardless of the size. However, it is inefficient to use an integer size that is larger than the largest available register size. In other words, it is inefficient to use 32-bit integers in 16-bit systems or 64-bit integers in 32-bit systems, especially if the code involves multiplication or division.
Signed versus unsigned integers
In most cases, there is no difference in speed between using signed and unsigned integers. But there are a few cases where it matters:
Division by a constant: Unsigned is faster than signed when you divide an integer by a constant. This also applies to the modulo operator %.
Conversion to floating point is faster with signed than with unsigned integers.
Integer operators
Addition, subtraction, comparison, bit operations and shift operations take only one clock cycle on most microprocessors. Integer multiplication takes 11 clock cycles on Pentium 4 processors, and 3 - 4 clock cycles on most other microprocessors. Integer division takes 40 - 80 clock cycles, depending on the microprocessor.
The pre-increment operator ++i and the post-increment operator i++ are as fast as additions. When used simply to increment an integer variable, it makes no difference whether you use pre-increment or post-increment.
x = array[i++] is more efficient than x = array[++i] because in the latter case, the calculation of the address of the array element has to wait for the new value of i which will delay the availability of x for approximately two clock cycles.
In the case a = ++b; the compiler will recognize that the values of a and b are the same after this statement so that it can use the same register for both, while the expression a = b++; will make the values of a and b different so that they cannot use the same register.
Floating point variables and operators
In most cases, double precision calculations take no more time than single precision.
Floating point addition takes 3 - 6 clock cycles, depending on the microprocessor. Multiplication takes 4 - 8 clock cycles. Division takes 32 - 45 clock cycles. Floating point comparisons are inefficient when the floating point stack registers are used. Conversions of float or double to integer takes a long time when the floating point stack registers are used.
Do not mix single and double precision when the XMM registers are used.
Avoid conversions between integers and floating point variables, if possible.
Applications that generate floating point underflow in XMM registers can benefit from setting the flush-to-zero mode rather than generating denormal numbers in case of underflow:
// Example 7.4. Set flush-to-zero mode (SSE):
#include <xmmintrin.h>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
It is strongly recommended to set the flush-to-zero mode unless you have special reasons to use denormal numbers. You may, in addition, set the denormals-are-zero mode if SSE2 is available:
// Example 7.5. Set flush-to-zero and denormals-are-zero mode (SSE2):
#include <xmmintrin.h>
_mm_setcsr(_mm_getcsr() | 0x8040);
Boolean
It may be advantageous to put the operand that is most often true last in an && expression, or first in an || expression.
If one operand is more predictable than the other, then put the most predictable operand first.
If one operand is faster to calculate than the other then put the operand that is calculated the fastest first.
Boolean variables are stored as 8-bit integers with the value 0 for false and 1 for true.
Pointers and references
Pointers and references are equally efficient because they are in fact doing the same thing.
There are disadvantages of using pointers and references. Most importantly, it requires an extra register to hold the value of the pointer or reference. Registers are a scarce resource, especially in 32-bit mode. If there are not enough registers then the pointer has to be loaded from memory each time it is used and this will make the program slower. Another disadvantage is that the value of the pointer is needed a few clock cycles before the time the variable pointed to can be accessed. The object pointed to can be accessed approximately two clock cycles after the value of the pointer has been calculated.
Function pointers
Calling a function through a function pointer typically takes a few clock cycles more than calling the function directly if the target address can be predicted.
Member pointers
In simple cases, a data member pointer simply stores the offset of a data member relative to the beginning of the object, and a member function pointer is simply the address of the member function.
Smart pointers
There is no extra cost to accessing an object through a smart pointer. Accessing an object by *p or p->member is equally fast whether p is a simple pointer or a smart pointer. But there is an extra cost whenever a smart pointer is created, deleted, copied or transferred from one function to another. These costs are higher for shared_ptr than for auto_ptr.
Type conversions
Signed / unsigned conversion
Conversions between signed and unsigned integers simply make the compiler interpret the bits of the integer in a different way. There is no checking for overflow, and the code takes no extra time.
Integer size conversion
An integer is converted to a longer size by extending the sign-bit if the integer is signed, or by extending with zero-bits if unsigned. This typically takes one clock cycle if the source is an arithmetic expression.
Converting an integer to a smaller size is done simply by ignoring the higher bits. There is no check for overflow. This conversion takes no extra time.
Floating point precision conversion
Conversions between float, double and long double take no extra time when the floating point register stack is used. It takes between 2 and 15 clock cycles (depending on the processor) when the XMM registers are used.
Integer to float conversion
Conversion of a signed integer to a float or double takes 4 - 16 clock cycles, depending on the processor and the type of registers used. Conversion of an unsigned integer takes longer time.
Float to integer conversion
Conversion of a floating point number to an integer takes a very long time unless the SSE2 or later instruction set is enabled. Typically, the conversion takes 50 - 100 clock cycles. The reason is that the C/C++ standard specifies truncation so the floating point rounding mode has to be changed to truncation and back again.