Rule of Programming: C

My summary after reading this.

Before doing large memory writes, use _mm*_stream_s[i,s,d]* (non-temporal memory stores). When data is produced and not (immediately) consumed again, the fact that an ordinary store first reads the full cacheline and then modifies the cached data is detrimental to performance: it pushes data out of the caches that might be needed again, in favor of data that will not be used soon. This is especially true for large data structures, like matrices, which are filled and then used later. Before the last element of the matrix is filled, the sheer size has already evicted the first elements, so caching the writes is ineffective.
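A minimal sketch of filling a large buffer with non-temporal stores, assuming SSE2, a 16-byte-aligned destination and a length that is a multiple of 4 ints (the function name is illustrative):

#include <emmintrin.h>   /* SSE2: _mm_set1_epi32, _mm_stream_si128 */
#include <xmmintrin.h>   /* _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fill dst with value using non-temporal stores, so the written lines
 * do not pollute the caches. */
static void nt_fill(int32_t *dst, size_t n, int32_t value)
{
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < n; i += 4)            /* 4 ints = 16 bytes per store */
        _mm_stream_si128((__m128i *)(dst + i), v);
    _mm_sfence();   /* NT stores are weakly ordered: fence before others read the data */
}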

NTA (non-temporal access) is the streaming option for reading memory while circumventing the caches: _mm*_stream_load_si* (MOVNTDQA, SSE4.1). It loads the target bytes into a small set of streaming-load buffers, each sized like a cacheline, and subsequent accesses to the same buffer are fast. These buffers do not affect the normal caches, so a streaming load never causes a cacheline to be evicted; however, loading a new buffer may evict another buffer previously filled by a streaming load.

Note: non-temporal reads/writes do not make the access itself faster; they prevent useful cachelines from being evicted. So NT reads/writes alone will not speed up memory access. Normally, a write to memory (to a line not already cached) first reads the cacheline into the cache and then modifies it there; the dirty cacheline is written back to memory later, when it is evicted because the cache runs short. The CPU time goes into loading the line into the cache and writing to the cache. With NT writes, the CPU instead locates the memory address and sends the data directly to RAM. Locating the memory address costs about as much as a cache load, but writing to memory is far more expensive than writing to the cache. Also note that writes to subsequent locations in the same cacheline can be combined into one transfer (write-combining), and that with NT writes, fences have to be issued manually.

Why can NT writes be much slower? Here is another point of view: NT writes to memory operate on whole cachelines, so writing to a location inside a cacheline without using the write-combining logic to fill the full line forces the CPU to fetch the original cacheline again and merge the new data into it, making the whole operation much slower. (Now I think this point of view is correct.)

Thus when doing streaming loads, load memory on a buffer-sized (cacheline) basis, e.g. by dispatching a sequence of stream_load instructions that cover the whole line.
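A sketch of reading one 64-byte cacheline with streaming loads, assuming SSE4.1 and a 64-byte-aligned source (the function name is illustrative):

#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */

/* Load one full cacheline (4 x 16 bytes) with streaming loads. */
static void nt_load_line(const __m128i *src, __m128i out[4])
{
    out[0] = _mm_stream_load_si128((__m128i *)(src + 0));
    out[1] = _mm_stream_load_si128((__m128i *)(src + 1));
    out[2] = _mm_stream_load_si128((__m128i *)(src + 2));
    out[3] = _mm_stream_load_si128((__m128i *)(src + 3));
}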

NT memory accesses are nicely optimized for sequential data access; caches can help cover up some, but not all, of the cost of random accesses to memory.

---------------

Improve branch prediction: with gcc, -fprofile-generate / -fprofile-use (PGO) works better than __builtin_expect. Use gcov to convert the profile data into readable text statistics.
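For the __builtin_expect route, the usual likely/unlikely macros look like this (a sketch; PGO is generally preferable because it measures real branch frequencies instead of guesses):

/* Hint the compiler about the expected truth value of a condition. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int fd)
{
    if (unlikely(fd < 0))   /* error path: expected to be rare */
        return -1;
    /* ... hot path, laid out as the fall-through case by the compiler ... */
    return 0;
}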

mmap: the MAP_POPULATE flag pre-faults all mapped file data into memory. It is coarse: if the file is too large, the page cache gets overwritten several times, causing slowdowns.
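A sketch of mapping a whole file read-only with MAP_POPULATE (a Linux-specific flag; the function name is illustrative):

#define _GNU_SOURCE           /* MAP_POPULATE */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Map `path` read-only and ask the kernel to pre-fault all pages. */
void *map_whole_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len = (size_t)st.st_size;
    void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd);                /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}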

Larger page sizes lead to more unused (wasted) memory.

Use cpuid to check the TLB entry counts/sizes.

Use oprofile/perf to test caching/paging.

A hyperthread can be used to do software prefetching for the other (sibling) thread.

Cache (T:S:O) tag:set:offset calculation for an 8-way associative 32 KiB L1d cache. The cacheline size (normally 64 B) can be checked with:

getconf LEVEL1_DCACHE_LINESIZE

Thus L1d contains 32 KiB / 64 B = 512 lines; since it is 8-way, it has 512/8 = 64 sets, so bits(S) = log2 64 = 6. The cacheline is 64 B, so bits(O) = log2 64 = 6, and bits(T) = 48 - 6 - 6 = 36.
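As a worked example, the offset, set index and tag can be extracted from an address like this (assuming the 32 KiB / 8-way / 64 B-line L1d above and 48-bit addresses; the sample address is arbitrary):

#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6   /* 64 B line -> 6 offset bits */
#define SET_BITS  6   /* 64 sets   -> 6 set bits    */

int main(void)
{
    uintptr_t addr   = (uintptr_t)0x7ffd12345678ull;
    uintptr_t offset =  addr & ((1u << LINE_BITS) - 1);
    uintptr_t set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    uintptr_t tag    =  addr >> (LINE_BITS + SET_BITS);
    printf("offset=%lu set=%lu tag=0x%lx\n",
           (unsigned long)offset, (unsigned long)set, (unsigned long)tag);
    return 0;
}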

Write-back vs. write-through: the latter is simpler than the former. Write-through writes to memory whenever the cache is written to; write-back writes to memory only when the line is evicted from the cache. With multiple cores, a cacheline can be sent from one core to another: when one core requests a cacheline that exists in another core's cache, that core sends the line to the requester and then invalidates its own copy.

Cache info can be checked under /sys/devices/system/cpu/cpu0/cache/

Per-process memory placement across NUMA nodes can be seen in /proc/PID/numa_maps

Inline functions, advantages and drawbacks: inlining can make the code oversized, and if the function is called repeatedly within a short span, not inlining it may be better, because the single copy can stay in L1i and does not need to be decoded again. However, when a function is called only once (from another page), inlining is usually the better suggestion: it avoids the extra fetch, and the larger combined code gives the compiler more to optimize.

Also note that optimization levels like -O2 or -O3 may inline functions even when they are not explicitly declared inline. In those cases, noinline should be used to prevent it.
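GCC attributes to override the optimizer's inlining decisions in either direction (a sketch; the function names are illustrative):

/* Never inline: keeps a single copy that can stay resident in L1i. */
__attribute__((noinline))
static int hot_helper(int x) { return x * 2 + 1; }

/* Force inlining even when the optimizer would not do it on its own. */
static inline __attribute__((always_inline)) int tiny(int x) { return x + 1; }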

Having instructions start at the beginning of the cacheline being fetched maximizes the amount of useful code retrieved by a single prefetch. However, if each function is very small, the padding (nops) inserted for alignment also wastes space and can slow things down at run time. Alignment is worthwhile (the corresponding GCC flags/attribute are sketched after the list below):

1. at the beginning of functions;
2. at the beginning of basic blocks which are reached only through jumps;
3. to some extent, at the beginning of loops.
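A sketch of how GCC exposes these three cases; the 64/16 values are illustrative and machine-dependent:

/* Per-function alignment of the first instruction to a cacheline: */
__attribute__((aligned(64)))
void hot_entry(void) { /* ... */ }

/* Or globally at compile time:
 *   -falign-functions=64   case 1: start of functions
 *   -falign-jumps=16        case 2: blocks reached only through jumps
 *   -falign-loops=16        case 3: beginnings of loops
 */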

2 kinds of prefetching: 

Hardware: when a sequence of two or more cache misses occurs, the hardware prefetcher starts to work. This is somewhat error-prone, and it only applies to roughly linear accesses within the same page. If hardware prefetching misbehaves there is not much a programmer can do, except perhaps change the code/data layout.

Software prefetching:

The gap (prefetch distance) between prefetches has to be selected carefully, by comparing the loop's running time per iteration against the prefetch latency of the data set (in CPU cycles). One can try several different gaps while watching the prefetch-related performance counters with a profiling tool (oprofile?).
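A sketch of explicit software prefetching with GCC's builtin; PREFETCH_AHEAD is an assumed distance that has to be tuned per machine (e.g. by watching the counters mentioned above):

#include <stddef.h>

#define PREFETCH_AHEAD 16   /* elements ahead of the current index; tune this */

long sum_with_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* args: address, 0 = read, 3 = keep in all cache levels */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        s += a[i];
    }
    return s;
}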

Helper threads can be used to prefetch data (on an SMT sibling or a dumb core); set CPU affinity in that case.
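Pinning the helper thread next to the worker can be done with pthread affinity; a sketch, assuming (machine-dependently) that CPU 0 and CPU 1 are SMT siblings:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin `thr` to logical CPU `cpu`, e.g. worker on 0, prefetch helper on 1. */
static int pin_thread(pthread_t thr, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(thr, sizeof(set), &set);
}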

Some devices have, besides DMA, a feature called DCA (Direct Cache Access), e.g. on Ethernet NICs: when packets arrive they are written into RAM, and a DCA hint is sent along (to the Northbridge / memory controller) so the processor snoops it and pulls the data into a lower-level cache (L3). This can mostly be ignored by the programmer.

Concurrency: if several threads use different variables that share a cacheline, those variables should be split onto different cachelines so that RFO (request for ownership) traffic does not get in our way. However, this enlarges the application's data size, so putting all data written and used by one thread into the same section seems the better choice. The attribute usage below places bar and xyzzy next to each other. (This technique only works for global variables.) Thread-only variables can be declared with __thread, which gives each thread its own copy of the global, allocated not in the shared data segment but in each thread's thread-local storage area.

int foo = 1;
int bar __attribute__((section(".data.ro"))) = 2;
int baz = 3;
int xyzzy __attribute__((section(".data.ro"))) = 4;
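To keep per-thread hot data off shared cachelines, either use __thread or pad/align each thread's slot to a full cacheline; a sketch (the 64-byte line size and the names are assumptions):

/* Thread-local counter: each thread gets its own copy in TLS, so no RFO traffic. */
static __thread long tls_counter;

/* Shared array of per-thread slots: align each slot to a cacheline so two
 * threads never write to the same line (avoids false sharing). */
struct slot {
    long counter;
} __attribute__((aligned(64)));

static struct slot per_thread[64];   /* one slot per thread, illustrative size */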

minor page faults: e.g. when a process starts, the page table is initialized but not filled; the physical memory may already hold the code/data, but it has not been mapped into the page table yet, so the first access faults and the kernel only has to fill in the mapping. This is called a minor page fault. (Can be ignored.)

major page faults: when trying to load data/code from a virtual address, the page table says it does not currently reside in memory, and it has to be loaded from disk/swap by the OS. This is called a major page fault. (Take care of these.)
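Minor/major fault counts for the current process can be read with getrusage; a sketch:

#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("minor faults: %ld, major faults: %ld\n",
               ru.ru_minflt, ru.ru_majflt);
    return 0;
}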

Valgrind (cachegrind) can simulate caches of different sizes and levels for performance testing.

------ false sharing of cachelines is a likely problem when running parallel threads --- try checking for it with perf c2c

 
