What a C programmer should know about memory

Feb 20, 2015

 
*Source: Weapons by T4LLBERG, on Flickr (CC-BY-SA)*

In 2007, Ulrich Drepper wrote “What every programmer should know about memory”. Yes, it’s a wee bit long-winded, but it’s worth its salt. Many years and many “every programmer should know about” articles later, the concept of virtual memory is still elusive to many, as if it were a kind of magic. Awww, I couldn’t resist the reference. Even the validity of the original article was questioned many years later. What gives?

> “North bridge? What is this crap? That ain’t street-fighting.”

I’ll try to convey the practical side of things (i.e. what you can do), from “getting your fundamentals on a lock” to more fun stuff. Think of it as glue between the original article and the things you use every day. The examples are going to be C99 on Linux, but a lot of topics are universal. EDIT: I don’t have much knowledge about Windows, but I’d be thrilled to link an article which explains it. I tried my best to mention which functions are platform-specific, but again, I’m only human. If you find a discrepancy, please let me know.

Without further ado, grab a cup of coffee and let’s get to it.

Understanding virtual memory - the plot thickens

Unless you’re dealing with some embedded systems or kernel-space code, you’re going to be working in protected mode. This is awesome, since your program is guaranteed to have its own [virtual] address space. The word “virtual” is important here. This means, among other things, that you’re not bound by the available memory, but also not entitled to any. In order to use this space, you have to ask the OS to back it with something real; this is called mapping. The backing can be either physical memory (not necessarily RAM) or persistent storage. The former is also called an “anonymous mapping”. But hold your horses.

The virtual memory allocator (VMA) may give you memory it doesn’t have, all in a vain hope that you’re not going to use it. Just like banks today. This is called overcommitting, and while it has legitimate applications (sparse arrays), it also means that a memory allocation is not going to simply say “NO”.

```c
char *block = malloc(1024 * sizeof(char));
if (block == NULL) {
    return -ENOMEM; /* Sad :( */
}
```

The `NULL` return value checking is a good practice, but it’s not as powerful as it once was. With the overcommit, the OS may give your memory allocator a valid pointer to memory, but if you actually access it — dang*. The dang in this case is platform-specific, but generally an OOM killer killing your process.

* — This is an oversimplification, as timbatron noted, and it’s further explained in the “Demand paging explained” section. But I’d like to go through the well-known stuff first before we delve into specifics.

Detour - a process memory layout

The layout of a process memory is well covered in the Anatomy of a Program in Memory by Gustavo Duarte, so I’m going to quote and refer to the original article; I hope it’s fair use. I have only a few minor quibbles; for one, it covers only the x86-32 memory layout, but fortunately not much has changed for x86-64. Except that a process can use much more space — a whopping 48 bits on Linux.


 
*Source: Linux address space layout by Gustavo Duarte*

It also shows the memory mapping segment (MMS) growing down, but that may not always be the case. The MMS usually starts (x86/mm/mmap.c:113 and arch/mm/mmap.c:1953) at a randomized address just below the lowest address of the stack. Usually, because it may start above the stack and grow upwards iff the stack limit is large (or unlimited), or the compatibility layout is enabled. How is this important? It’s not, but it helps to give you an idea about the free address ranges.

Looking at the diagram, you can see three possible variable placements: the process data segment (static storage or heap allocation), the memory mapping segment, and the stack. Let’s start with that one.

Understanding stack allocation

Utility belt: `alloca()`, `sigaltstack()`

The stack is kind of easy to digest — everybody knows how to make a variable on the stack, right? Here are two:

```c
int stairway = 2;
int heaven[] = { 6, 5, 4 };
```

The validity of the variables is limited by scope. In C, that means this: `{}`. So each time a closing curly bracket comes, a variable dies. And then there’s `alloca()`, which allocates memory dynamically in the current stack frame. A stack frame is not (entirely) the same thing as a memory frame (aka physical page); it’s simply a group of data that gets pushed onto the stack (function, parameters, variables…). Since we’re on the top of the stack, we can use the remaining memory up to the stack size limit.

This is how variable-length arrays (VLA), and also `alloca()`, work — with one difference: VLA validity is limited by the scope, while alloca’d memory persists until the current function returns (or unwinds, if you’re feeling sophisticated). This is no language lawyering, but a real issue if you’re using alloca inside a loop, as you don’t have any means to free the memory.

```c
void laugh(void) {
    for (unsigned i = 0; i < megatron; ++i) {
        char *res = alloca(2);
        memcpy(res, "ha", 2);
        char vla[2] = {'h', 'a'};
    } /* vla dies, res lives */
} /* all allocas die */
```

Neither VLA nor alloca plays nice with large allocations, because you have almost no control over the available stack memory, and allocating past the stack limit leads to the jolly stack overflow. There are two ways around it, but neither is practical.

The first idea is to use `sigaltstack()` to catch and handle `SIGSEGV`. However, this just lets you catch the stack overflow.

The other way is to compile with split stacks. It’s called that because it really splits the monolithic stack into a linked list of smaller stacks, called stacklets. As far as I know, GCC and clang support it with the `-fsplit-stack` option. In theory this also improves memory consumption and reduces the cost of creating threads — because the stack can start really small and grow on demand. In reality, expect compatibility issues, as it needs a split-stack aware linker (i.e. gold) to play nice with split-stack unaware libraries, and performance issues (the “hot split” problem in Go is nicely explained by Agis Anastasopoulos).

Understanding heap allocation

Utility belt: `sbrk()`, `malloc()`, `free()`, `malloc_trim()`

Heap allocation can be as simple as moving the program break and claiming the memory between its old and new position. Up to this point, a heap allocation is as fast as stack allocation (sans the paging, presuming the stack is already locked in memory). But there’s a cat, I mean catch, dammit.

```c
char *block = sbrk(1024 * sizeof(char));
```

This means ⑴ we can’t reclaim unused memory blocks, ⑵ it is not thread-safe, since the heap is shared between threads, and ⑶ the interface is hardly portable — libraries must not touch the break.

> `man 3 sbrk` — Various systems use various types for the argument of sbrk(). Common are int, ssize_t, ptrdiff_t, intptr_t.

For these reasons, libc implements a centralized interface for memory allocation. The implementation varies, but it provides thread-safe memory allocation of any size … at a cost. The cost is latency, as there is now locking involved, data structures keeping information about used/free blocks, and extra memory overhead. The heap is not used exclusively either, as the memory mapping segment is often utilised for large blocks as well.

> `man 3 malloc` — Normally, `malloc()` allocates memory from the heap, … when allocating blocks of memory larger than MMAP_THRESHOLD, the glibc `malloc()` implementation allocates the memory as a private anonymous mapping.

As the heap is always contiguous from `start_brk` to `brk`, you can’t exactly punch holes through it and reduce the data segment size. Imagine the following scenario:

```c
char *truck = malloc(1024 * 1024 * sizeof(char));
char *bike = malloc(sizeof(char));
free(truck);
```

The heap allocator moves the `brk` to make space for `truck`, and the same for `bike`. But after the `truck` is freed, the `brk` can’t be moved down, as it’s the `bike` that occupies the highest address. The result is that your process can reuse the former `truck` memory, but it can’t be returned to the system until the `bike` is freed. Had the `truck` been mmaped, it wouldn’t reside in the heap segment and couldn’t affect the program break. Still, this trick doesn’t prevent the holes created by small allocations (in other words, it doesn’t prevent fragmentation).

Note that `free()` doesn’t always try to shrink the data segment, as that is a potentially expensive operation. This is a problem for long-running programs, such as daemons. A GNU extension called `malloc_trim()` exists for releasing memory from the top of the heap, but it can be painfully slow. It hurts real bad with a lot of small objects, so it should be used sparingly.

When to bother with a custom allocator

There are a few practical use cases where a GP allocator falls short — for example, the allocation of a large number of small, fixed-size chunks. This might not look like a typical pattern, but it is very frequent. For example, lookup data structures like trees and tries typically require nodes to build a hierarchy. In this case, not only is fragmentation a problem, but so is data locality. A cache-efficient data structure packs the keys together (preferably on the same page), instead of mixing them with data. With the default allocator, there is no guarantee about the locality of blocks from subsequent allocations. Even worse is the space overhead for allocating small units. Here comes the solution!

  
*Source: Slab by wadem, on Flickr (CC-BY-SA)*

Slab allocator

Utility belt: `posix_memalign()`

The principle of slab allocation was described by Bonwick for a kernel object cache, but it applies to user space as well. Oh-kay, we’re not interested in pinning slabs to CPUs, but back to the gist — you ask the allocator for a slab of memory, let’s say a whole page, and you cut it into many fixed-size pieces. Presuming each piece can hold at least a pointer or an integer, you can link them into a list, where the list head points to the first free element.

```c
/* Super-simple slab. */
struct slab {
    void **head;
};

/* Create a page-aligned slab. */
struct slab *slab = NULL;
posix_memalign((void **)&slab, page_size, page_size);
slab->head = (void **)((char *)slab + sizeof(struct slab));

/* Create a NULL-terminated slab freelist. */
char *item = (char *)slab->head;
for (unsigned i = 0; i < item_count - 1; ++i) {
    *((void **)item) = item + item_size;
    item += item_size;
}
*((void **)item) = NULL;
```

Allocation is then as simple as popping the list head. Freeing is just pushing a new list head. There is also a neat trick: if the slab is aligned to the `page_size` boundary, you can get the slab pointer as cheaply as rounding the element address down to `page_size`.

```c
/* Free an element. */
struct slab *slab = (struct slab *)((uintptr_t)ptr & PAGESIZE_BITS);
*((void **)ptr) = (void *)slab->head;
slab->head = (void **)ptr;

/* Allocate an element. */
if ((item = slab->head)) {
    slab->head = (void **)*item;
} else {
    /* No elements left. */
}
```

Great, but what about binning, variable-size storage, cache aliasing and caffeine, …? Peek at my old implementation for Knot DNS to get the idea, or use a library that implements it. For example, *gasp*, the glib implementation has tidy documentation and calls it “memory slices”.

Memory pools

Utility belt: `obstack_init()`, `obstack_alloc()`, `obstack_free()`, `obstack_grow()`, `obstack_finish()`

As with the slab, you’re going to outsmart the GP allocator by asking it for whole chunks of memory only. Then you just slice the cake until it runs out, and then ask for a new one. And another one. When you’re done with the cakes, you call it a day and free everything in one go.

Does it sound obvious and stupid simple? Because it is, but that’s what makes it great for specific use cases. You don’t have to worry about synchronisation, nor about freeing. There are no use-after-free bugs, data locality is much more predictable, and there is almost zero overhead for small fragments.

The pattern is surprisingly suitable for many tasks, ranging from short-lived repetitive ones (i.e. “network request processing”) to long-lived immutable data (i.e. “frozen set”). You don’t have to free everything either: if you can make an educated guess about how much memory is needed on average, you can just free the excess and reuse the rest. This reduces the memory allocation problem to simple pointer arithmetic.

And you’re in luck here, as GNU libc provides, *whoa*, an actual API for this. It’s called obstacks, as in “a stack of objects”. The HTML documentation formatting is a bit underwhelming, but minor quibbles aside, it allows you to do both pool allocation and full or partial unwinding.

```c
/* Define block allocator. */
#define obstack_chunk_alloc malloc
#define obstack_chunk_free free

/* Initialize the obstack and allocate a bunch of animals. */
struct obstack animal_stack;
obstack_init(&animal_stack);
char *bob = obstack_alloc(&animal_stack, sizeof(animal));
char *fred = obstack_alloc(&animal_stack, sizeof(animal));
char *roger = obstack_alloc(&animal_stack, sizeof(animal));

/* Free everything after fred (i.e. fred and roger). */
obstack_free(&animal_stack, fred);

/* Free everything. */
obstack_free(&animal_stack, NULL);
```

There is one more trick to it: you can grow the object on the top of the stack. Think buffering input, variable-length arrays, or just a way to combat the `realloc()`-`strcpy()` pattern.

```c
/* This is wrong, I better cancel it. */
obstack_grow(&animal_stack, "long", 4);
obstack_grow(&animal_stack, "fred", 5);
obstack_free(&animal_stack, obstack_finish(&animal_stack));

/* This time for real. */
obstack_grow(&animal_stack, "long", 4);
obstack_grow(&animal_stack, "bob", 4);
char *result = obstack_finish(&animal_stack);
printf("%s\n", result); /* "longbob" */
```

Demand paging explained

Utility belt: `mlock()`, `madvise()`

One of the reasons why the GP memory allocator doesn’t immediately return memory to the system is that it’s costly. The system has to do two things: ⑴ establish the mapping of a virtual page to a real page, and ⑵ give you a blanked real page. The real page is called a frame — now you know the difference. Each frame must be sanitized, because you don’t want the operating system to leak your secrets to another process, do you? But here’s the trick — remember the overcommit? The virtual memory allocator honours only the first part of the deal, and then plays some “now you see me and now you don’t” shit — instead of pointing you to a real page, it points to a special zero page.

Each time you try to access the special page, a page fault occurs, which means that the kernel pauses process execution and fetches a real page, then updates the page tables and resumes like nothing happened. That’s about the best explanation I could muster in one sentence; here’s a more detailed one. This is also called “demand paging” or “lazy loading”.

Spock said that “one man cannot summon the future”, but here you can pull the strings.

The memory manager is no oracle, and it makes very conservative predictions about how you’re going to access memory — but you may know better. You can lock a contiguous memory block in physical memory, avoiding further page faulting:

```c
char *block = malloc(1024 * sizeof(char));
mlock(block, 1024 * sizeof(char));
```

*psst*, you can also give advice about your memory usage pattern:

```c
/* madvise() wants a page-aligned address, so align the block. */
char *block = NULL;
posix_memalign((void **)&block, page_size, 1024 * sizeof(char));
madvise(block, 1024 * sizeof(char), MADV_SEQUENTIAL);
```

The interpretation of the actual advice is platform-specific, and the system may even choose to ignore it altogether, but most platforms play nice. Not all advice is well-supported, and some flags even change semantics (`MADV_FREE` drops dirty private memory), but the `MADV_SEQUENTIAL`, `MADV_WILLNEED`, and `MADV_DONTNEED` holy trinity is what you’re going to use most.

Fun with flags: memory mapping

Utility belt: `sysconf()`, `mmap()`, `shmat()`, `mincore()`

There are several things that the memory allocator just can’t do — memory maps to the rescue! To pick one: the fact that you can’t choose the allocated address range. For that we’re willing to sacrifice some comfort — we’re going to be working with whole pages from now on. Just to make things clear, a page is usually a 4K block, but you shouldn’t rely on that; use `sysconf()` to discover the actual size.

```c
long page_size = sysconf(_SC_PAGESIZE); /* Slice and dice. */
```

Side note — even if the platform advertises a uniform page size, it may not use one in the background. For example, Linux has a concept of transparent huge pages (THP) to reduce the cost of address translation and page faulting for contiguous blocks. This is, however, questionable, as huge contiguous blocks become scarce when physical memory gets fragmented. The cost of faulting a huge page also increases with the page size, so it’s not very efficient for “small random I/O” workloads. This is unfortunately transparent to you, but there is a Linux-specific `mmap` option `MAP_HUGETLB` that allows you to use huge pages explicitly, so you should be aware of the costs.

Fixed memory mappings

Say you want a fixed mapping for a poor man’s IPC, for example — how do you choose an address? On x86-32 it’s a risky proposal, but on 64-bit, an address around 2/3rds of `TASK_SIZE` (the highest usable address of the user-space process) is a safe bet. You can get away without a fixed mapping, but then forget about pointers in your shared memory.

```c
#define TASK_SIZE 0x800000000000
#define SHARED_BLOCK (void *)(2 * TASK_SIZE / 3)

void *shared_cats = shmat(shm_key, SHARED_BLOCK, 0);
if (shared_cats == (void *)-1) {
    perror("shmat"); /* Sad :( */
}
```

Okay, I get it, this is hardly a portable example, but you get the gist. Mapping a fixed address range is usually considered unsafe at the very least, as it doesn’t check whether something is already mapped there or not. There is a `mincore()` function to tell you whether a page is mapped, but you’re out of luck in multithreaded environments.
