


Translation: Understanding Linux Network Internals, 2.1. The Socket Buffer: sk_buff Structure


2.1. The Socket Buffer: sk_buff Structure

This is probably the most important data structure in the Linux networking code, representing the headers for data that has been received or is about to be transmitted. Defined in the <include/linux/skbuff.h> include file, it consists of a tremendous heap of variables that try to be all things to all people.


The structure has changed many times in the history of the kernel, both to add new options and to reorganize existing fields into a cleaner layout. Its fields can be classified roughly into the following categories:

  • Layout

  • General

  • Feature-specific

  • Management functions

This structure is used by several different network layers (MAC or another link protocol on the L2 layer, IP on L3, TCP or UDP on L4), and various fields of the structure change as it is passed from one layer to another. L4 appends a header before passing it to L3, which in turn puts on its own header before passing it to L2. Appending headers is more efficient than copying the data from one layer to another. Since adding space to the beginning of a buffer (which means changing the variable that points to it) is a complicated operation, the kernel provides the skb_reserve function (described later in this chapter) to carry it out. Thus, one of the first things done by each protocol, as the buffer passes down through layers, is to call skb_reserve to reserve space for the protocol's header.[*] In the later section "Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull," we will see an example of how the kernel makes sure enough space is reserved at the head of the buffer to allow each layer to add its own header while the buffer traverses the layers.

[*] skb_reserve is also used by device drivers to align the IP header of ingress frames. See Chapter 10.

When the buffer passes up through the network layers, each header from the old layer is no longer of interest. The L2 header, for instance, is used only by the device drivers that handle the L2 protocol, so it is of no interest to L3. Instead of removing the L2 header from the buffer, the pointer to the beginning of the payload is moved ahead to the beginning of the L3 header, which requires fewer CPU cycles.

The rest of this section explains a basic principle about conditional (optional) fields, and then covers each of the categories just listed.

2.1.1. Networking Options and Kernel Structures

As you can see from glancing at TCP/IP specifications or configuring a kernel, network code provides an enormous number of options that are useful but not always required, such as a Firewall, Multicasting, and other features. Most of these options require additional fields in kernel data structures. Therefore, sk_buff is peppered with C preprocessor #ifdef directives. For example, near the bottom of the sk_buff definition you can find:

struct sk_buff {
    ... ... ...
#ifdef CONFIG_NET_SCHED
    __u32    tc_index;
#ifdef CONFIG_NET_CLS_ACT
    __u32    tc_verd;
    __u32    tc_classid;
#endif
#endif
};

This shows that the field tc_index is part of the data structure only if the CONFIG_NET_SCHED symbol is defined at compile time, which means that the right option (in this example, "Device Drivers -> Networking support -> Networking options -> QoS and/or fair queueing") has been enabled with some version of make config by an administrator or by an automated installation utility.

The previous example actually shows two nested options: the fields used by CONFIG_NET_CLS_ACT (packet classifier) are considered for inclusion only if support for "QoS and/or fair queueing" is present.

Notice, by the way, that the QoS option cannot be compiled as a module. The reason is that most of the consequences of enabling the option will not be reversible after the kernel is compiled. In general, any option that causes a change in a kernel data structure (such as adding the tc_index field to the sk_buff structure) renders the option unfit to be compiled as a module.
(Translator's note: in other words, this feature can only be built together with the kernel, using the kernel's own build options; you cannot build the kernel first and then configure and build QoS as a module afterward. For background on the kernel and modules, see Understanding the Linux Kernel.)
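The effect of conditional fields on the structure layout can be seen in a small userspace sketch. The names DEMO_NET_SCHED, DEMO_NET_CLS_ACT, and demo_buff are hypothetical stand-ins, not kernel symbols; the point is only that the size of the structure depends on which symbols were defined when the code was compiled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for CONFIG_NET_SCHED and CONFIG_NET_CLS_ACT.
 * Uncomment the second define to see the layout change. */
#define DEMO_NET_SCHED 1
/* #define DEMO_NET_CLS_ACT 1 */

struct demo_buff {
    uint32_t priority;          /* always present */
#ifdef DEMO_NET_SCHED
    uint32_t tc_index;          /* present only with DEMO_NET_SCHED */
#ifdef DEMO_NET_CLS_ACT
    uint32_t tc_verd;           /* present only with both symbols */
    uint32_t tc_classid;
#endif
#endif
};

/* Size of the structure as seen by code compiled with these settings. */
static size_t demo_buff_size(void)
{
    /* Every field is a 32-bit word, so the size is a whole number of words. */
    assert(sizeof(struct demo_buff) % sizeof(uint32_t) == 0);
    return sizeof(struct demo_buff);
}
```

With only DEMO_NET_SCHED defined the structure holds two words; defining DEMO_NET_CLS_ACT as well would make it four. Code compiled separately with a different setting would disagree about the size and offsets, which is the reason such an option cannot be left to a module.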

You'll often want to find out which compile option from make config or its variants is associated with a given #ifdef symbol, to understand when a block of code is included in the kernel. The fastest way to make the association, in the 2.6 kernels, is to look for the symbol in the Kconfig files that are spread all over the source tree (one per directory). In 2.4 kernels, you can consult the file Documentation/Configure.help.

2.1.2. Layout Fields

A few of the sk_buff's fields exist just to facilitate searching and to organize the data structure itself. The kernel maintains all sk_buff structures in a doubly linked list. But the organization of this list is somewhat more complicated than that of a traditional doubly linked list.

Like any doubly linked list, this one is tied together by next and prev fields in each sk_buff structure, the next field pointing forward and the prev field pointing backward. But this list has another requirement: each sk_buff structure must be able to find the head of the whole list quickly. To implement this requirement, an extra structure of type sk_buff_head is inserted at the beginning of the list, as a kind of dummy element. The sk_buff_head structure is:
(Translator's note: the dummy element serves only to record the head of the list.)

struct sk_buff_head {
    /* These two members must be first. */
    struct sk_buff    * next;
    struct sk_buff    * prev;

    __u32        qlen;
    spinlock_t    lock;
};

qlen represents the number of elements in the list. lock is used to prevent simultaneous accesses to the list and is described in the section "List management functions," later in this chapter.

The first two elements of both sk_buff and sk_buff_head are the same: the next and prev pointers. This allows the two structures to coexist in the same list, even though sk_buff_head is positively skimpy in comparison to sk_buff. In addition, the same functions can be used to manipulate both sk_buff and sk_buff_head.

To add to the complexity, every sk_buff structure contains a pointer to the single sk_buff_head structure. This pointer has the field name list. See Figure 2-1 for help finding your way around these data structures.
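A minimal userspace sketch of such a list follows. It is not the kernel code: the types demo_links, demo_head, and demo_buf and all function names are hypothetical, locking is omitted, and the shared next/prev pair is wrapped in a small struct so that the pointer conversions stay within standard C (the kernel instead relies on the two structures starting with the same two members).

```c
#include <assert.h>
#include <stddef.h>

/* Link fields shared by the list head and by every element. */
struct demo_links {
    struct demo_links *next;
    struct demo_links *prev;
};

struct demo_head {
    struct demo_links links;    /* must be first */
    unsigned int qlen;          /* number of elements in the list */
};

struct demo_buf {
    struct demo_links links;    /* must be first */
    struct demo_head *list;     /* back pointer to the list head */
    int id;
};

/* An empty list is a head whose links point back at itself. */
static void demo_head_init(struct demo_head *h)
{
    h->links.next = h->links.prev = &h->links;
    h->qlen = 0;
}

/* Append buf at the tail, that is, just before the dummy head element. */
static void demo_queue_tail(struct demo_head *h, struct demo_buf *buf)
{
    struct demo_links *last = h->links.prev;

    buf->list = h;
    buf->links.next = &h->links;
    buf->links.prev = last;
    last->next = &buf->links;
    h->links.prev = &buf->links;
    h->qlen++;
}

/* Remove and return the first element, or NULL if the list is empty. */
static struct demo_buf *demo_dequeue(struct demo_head *h)
{
    struct demo_links *first = h->links.next;
    struct demo_buf *buf;

    if (first == &h->links)
        return NULL;
    assert(h->qlen > 0);
    h->links.next = first->next;
    first->next->prev = &h->links;
    h->qlen--;

    /* links is the first member, so this conversion is well defined. */
    buf = (struct demo_buf *)first;
    buf->links.next = buf->links.prev = NULL;
    buf->list = NULL;
    return buf;
}
```

Because every element stores a pointer to the head, finding the head of the list from any element is a single dereference rather than a walk along the list.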

Figure 2-1. List of sk_buff elements

Other interesting fields of sk_buff follow:

struct sock *sk

This is a pointer to a sock data structure of the socket that owns this buffer. This pointer is needed when data is either locally generated or being received by a local process, because the data and socket-related information is used by L4 (TCP or UDP) and by the user application. When a buffer is merely being forwarded (that is, neither the source nor the destination is on the local machine), this pointer is NULL.

unsigned int len

This is the size of the block of data in the buffer. This length includes both the data in the main buffer (i.e., the one pointed to by head) and the data in the fragments.[*] Its value changes as the buffer moves from one network layer to the next, because headers are discarded while moving up in the stack and are added while moving down the stack. len accounts for protocol headers as well, as shown in Figure 2-8 in the section "Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull."

[*] See Chapter 21 for a discussion of fragmented buffers.

unsigned int data_len

Unlike len, data_len accounts only for the size of the data in the fragments.

unsigned int mac_len

This is the size of the MAC header.

atomic_t users

This is the reference count, or the number of entities using this sk_buff buffer. The main use of this parameter is to avoid freeing the sk_buff structure when someone is still using it. For this reason, each user of the buffer should increment and decrement this field when necessary. This counter covers only the users of the sk_buff data structure; the buffer containing the actual data is covered by a similar field (dataref) that will be introduced later in the chapter, in the section "The skb_shared_info structure and the skb_shinfo function."

users is sometimes incremented and decremented directly with the atomic_inc and atomic_dec functions, but most of the time it is manipulated with skb_get and kfree_skb.

unsigned int truesize

This field represents the total size of the buffer, including the sk_buff structure itself. It is initially set by the function alloc_skb to len+sizeof(sk_buff) when the buffer is allocated for a requested data space of len bytes.

struct sk_buff *alloc_skb(unsigned int size, int gfp_mask)
{
     ... ... ...
     skb->truesize = size + sizeof(struct sk_buff);
     ... ... ...
}

The field gets updated whenever skb->len is increased.

unsigned char *head

unsigned char *end

unsigned char *data

unsigned char *tail

These represent the boundaries of the buffer and the data within it. When each layer prepares the buffer for its activities, it may allocate more space for a header or for more data. head and end point to the beginning and end of the space allocated to the buffer, and data and tail point to the beginning and end of the actual data. See Figure 2-2. The layer can then fill in the gap between head and data with a protocol header, or the gap between tail and end with new data. You will see in the later section "Allocating memory: alloc_skb and dev_alloc_skb" that the buffer on the right side of Figure 2-2 includes an additional header at the bottom.

Figure 2-2. head/end versus data/tail pointers
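The relationship between the four pointers can be made concrete with a small userspace sketch. The structure demo_skb and its helper functions are hypothetical and model only the linear part of a buffer (no fragments); they simply compute the two gaps and the length of the data in use.

```c
#include <assert.h>
#include <stddef.h>

struct demo_skb {
    unsigned char *head;   /* start of the allocated area */
    unsigned char *data;   /* start of the data in use */
    unsigned char *tail;   /* end of the data in use */
    unsigned char *end;    /* end of the allocated area */
};

/* A fresh buffer is empty: data and tail both sit at head. */
static void demo_skb_init(struct demo_skb *skb, unsigned char *area, size_t size)
{
    skb->head = skb->data = skb->tail = area;
    skb->end = area + size;
}

/* Gap between head and data: room left for protocol headers. */
static size_t demo_headroom(const struct demo_skb *skb)
{
    assert(skb->head <= skb->data);
    return (size_t)(skb->data - skb->head);
}

/* Gap between tail and end: room left for more data. */
static size_t demo_tailroom(const struct demo_skb *skb)
{
    assert(skb->tail <= skb->end);
    return (size_t)(skb->end - skb->tail);
}

/* Bytes currently in use, between data and tail. */
static size_t demo_len(const struct demo_skb *skb)
{
    assert(skb->data <= skb->tail);
    return (size_t)(skb->tail - skb->data);
}
```

For a 64-byte area in which data has been moved 16 bytes past head and tail sits 10 bytes after data, the helpers report a headroom of 16, a length of 10, and a tailroom of 38.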

void (*destructor)(...)

This function pointer can be initialized to a routine that performs some activity when the buffer is removed. When the buffer does not belong to a socket, the destructor is usually not initialized. When the buffer belongs to a socket, it is usually set to sock_rfree or sock_wfree (by the skb_set_owner_r and skb_set_owner_w initialization functions, respectively). The two sock_xxx routines are used to update the amount of memory held by the socket in its queues.

2.1.3. General Fields

This section covers the majority of sk_buff fields, which are not associated with specific kernel features:

struct timeval stamp

This is usually meaningful only for a received packet. It is a timestamp that represents when a packet was received or (occasionally) when one is scheduled for transmission. It is set by the function netif_rx with net_timestamp, which is called by the device driver after the reception of each packet and is described in Chapter 21.

struct net_device *dev

This field, whose type (net_device) will be described in more detail later in the chapter, describes a network device. The role of the device represented by dev depends on whether the packet stored in the buffer is about to be transmitted or has just been received.

When a packet is received, the device driver updates this field with the pointer to the data structure representing the receiving interface, as illustrated by the following piece of code from vortex_rx, the function called by the driver of the 3c59x Ethernet card series when receiving a frame (in drivers/net/3c59x.c):

static int vortex_rx(struct net_device *dev)
{
           ... ... ...
        skb->dev = dev;
           ... ... ...
        skb->protocol = eth_type_trans(skb, dev);
        netif_rx(skb); /* Pass the packet to the higher layer */
           ... ... ...
}

When a packet is to be transmitted, this parameter represents the device through which it will be sent out. The code that sets the value is more complicated than the code for receiving a packet, so I will postpone a discussion until Chapter 21 and Chapter 35.

Some network features allow a few devices to be grouped together to represent a single virtual interface (that is, one that is not directly associated with a hardware device), served by a virtual device driver. When the device driver is invoked, the dev parameter points to the virtual device's net_device data structure. The driver chooses a specific device from its group and changes the dev parameter to point to the net_device data structure of that device. Under these circumstances, therefore, the pointer to the transmitting device may be changed during packet processing.

struct net_device *input_dev

This is the device the packet has been received from. It is a NULL pointer when the packet has been generated locally. For Ethernet devices, it is initialized in eth_type_trans (see Chapters 10 and 13). It is used mainly by Traffic Control.

struct net_device *real_dev

This field is meaningful only for virtual devices, and represents the real device the virtual one is associated with. The Bonding and VLAN interfaces use it, for example, to remember where the real device ingress traffic is received from.

union {...} h

union {...} nh

union {...} mac

These are pointers to the protocol headers of the TCP/IP stack: h for L4, nh for L3, and mac for L2. Each field points to a union of various structures, one structure for each protocol understood by the kernel at that layer. For instance, h is a union that includes a field for the header of each L4 protocol understood by the kernel. One member of each union is called raw and is used for initialization; all later accesses are through the protocol-specific members.

When receiving a data packet, the function responsible for processing the layer n header receives a buffer from layer n-1 with skb->data pointing to the beginning of the layer n header. The function that handles layer n initializes the proper pointer for this layer (for instance, skb->nh for L3 handlers) to preserve the skb->data field, because the contents of this pointer will be lost during the processing at the next layer, when skb->data is initialized to a different offset within the buffer. The function then completes the layer n processing and, before passing the packet to the layer n+1 handler, updates skb->data to make it point to the end of the layer n header, which is the beginning of the layer n+1 header (see Figure 2-3).

Sending a packet reverses this process, with the added complexity of adding a new header at each layer.

Figure 2-3. Header's pointer initializations while moving from layer two to layer three
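The receive-side handoff described above can be sketched in userspace as follows. Everything here is hypothetical (demo_rx_skb, the three handler functions, and the flat mac/nh/h pointers that stand in for the unions); the sketch assumes an Ethernet frame carrying IPv4 and shows only how each layer saves its header pointer and then advances data.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ETH_HLEN 14   /* length of an Ethernet header */

struct demo_rx_skb {
    unsigned char *data;   /* current start of the data */
    size_t len;            /* bytes from data to the end of the frame */
    unsigned char *mac;    /* L2 header, saved by the L2 handler */
    unsigned char *nh;     /* L3 header, saved by the L3 handler */
    unsigned char *h;      /* L4 header, saved by the L4 handler */
};

/* Move data forward past n header bytes; nothing is copied or erased. */
static void demo_pull(struct demo_rx_skb *skb, size_t n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
}

/* L2: remember where the link header is, then step over it. */
static void demo_l2_rx(struct demo_rx_skb *skb)
{
    skb->mac = skb->data;
    demo_pull(skb, DEMO_ETH_HLEN);
}

/* L3: remember where the IPv4 header is, then step over it. */
static void demo_l3_rx(struct demo_rx_skb *skb)
{
    size_t ihl;

    skb->nh = skb->data;
    ihl = (size_t)(skb->nh[0] & 0x0f) * 4;   /* IHL counts 32-bit words */
    demo_pull(skb, ihl);
}

/* L4: data now points at the transport header. */
static void demo_l4_rx(struct demo_rx_skb *skb)
{
    skb->h = skb->data;
}
```

After the three handlers have run on a 64-byte frame with a 20-byte IP header, mac, nh, and h point at offsets 0, 14, and 34, while the frame contents themselves are untouched.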

struct dst_entry dst

This is used by the routing subsystem. Because the data structure is quite complex and requires knowledge of how other subsystems work, I'll postpone a description of it until Part VII.

char cb[40]

This is a "control buffer," or storage for private information, maintained by each layer for internal use. It is statically allocated within the sk_buff structure (currently with a size of 40 bytes) and is large enough to hold whatever private data is needed by each layer. In the code for each layer, access is done through macros to make the code more readable. TCP, for example, uses that space to store a tcp_skb_cb data structure, which is defined in include/net/tcp.h:

struct tcp_skb_cb {
    ... ... ...
    __u32        seq;        /* Starting sequence number */
    __u32        end_seq;    /* SEQ + FIN + SYN + datalen*/
    __u32        when;       /* used to compute rtt's    */
    __u8         flags;      /* TCP header flags.        */
    ... ... ...
};

And this is the macro used by the TCP code to access the structure. The macro consists simply of a pointer cast:

#define TCP_SKB_CB(__skb)    ((struct tcp_skb_cb *)&((__skb)->cb[0]))

Here is an example where the TCP subsystem fills in the structure upon receipt of a segment:

int tcp_v4_rcv(struct sk_buff *skb)
{
        ... ... ...
        th = skb->h.th;
        TCP_SKB_CB(skb)->seq = ntohl(th->seq);
        TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
                                    skb->len - th->doff * 4);
        TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
        TCP_SKB_CB(skb)->when = 0;
        TCP_SKB_CB(skb)->flags = skb->nh.iph->tos;
        TCP_SKB_CB(skb)->sacked = 0;
        ... ... ...
}

To see how the parameters in the cb buffer are retrieved, take a look at the function tcp_transmit_skb in net/ipv4/tcp_output.c. That function is used by TCP to push a data segment down to the IP layer for transmission.
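The idea of a fixed scratch area that one layer interprets as its own structure can be tried out in userspace. The sketch below is an assumption-laden model, not the TCP code: demo_skb, demo_tcp_cb, and the functions are hypothetical, and the private data is moved in and out of cb with memcpy rather than through the kernel's pointer cast, to stay within standard C. It also works through the end_seq arithmetic shown above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct demo_skb {
    unsigned int len;   /* segment length, TCP header included */
    char cb[40];        /* per-layer private scratch space */
};

/* Hypothetical private data that one layer keeps in cb. */
struct demo_tcp_cb {
    uint32_t seq;       /* starting sequence number */
    uint32_t end_seq;   /* seq + SYN + FIN + payload length */
};

/* The private structure must fit in the scratch space. */
_Static_assert(sizeof(struct demo_tcp_cb) <= sizeof(((struct demo_skb *)0)->cb),
               "private data must fit in cb");

static void demo_cb_store(struct demo_skb *skb, const struct demo_tcp_cb *cb)
{
    memcpy(skb->cb, cb, sizeof(*cb));
}

static struct demo_tcp_cb demo_cb_load(const struct demo_skb *skb)
{
    struct demo_tcp_cb cb;

    memcpy(&cb, skb->cb, sizeof(cb));
    return cb;
}

/* seq is in host byte order; doff is the TCP header length in 32-bit words. */
static void demo_tcp_rcv(struct demo_skb *skb, uint32_t seq,
                         unsigned int syn, unsigned int fin, unsigned int doff)
{
    struct demo_tcp_cb cb;

    assert(doff * 4 <= skb->len);
    cb.seq = seq;
    cb.end_seq = seq + syn + fin + skb->len - doff * 4;
    demo_cb_store(skb, &cb);
}
```

For a 120-byte segment with a 20-byte header (doff of 5), sequence number 1000, and only FIN set, end_seq comes out as 1000 + 0 + 1 + 120 - 20 = 1101.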

In Chapter 22, you will also see how IPv4 uses cb to store information about IP fragmentation.

unsigned int csum

unsigned char ip_summed

These represent the checksum and associated status flag. Their use is described in Chapter 19.

unsigned char cloned

A boolean flag that, when set, indicates that this structure is a clone of another sk_buff buffer. See the later section "Cloning and copying buffers."

unsigned char pkt_type

This field classifies the type of frame based on its L2 destination address. The possible values are listed in include/linux/if_packet.h. For Ethernet devices, this parameter is initialized by the function eth_type_trans, which is described in Chapter 13.

The main values it can be assigned are:

PACKET_HOST

The destination address of the received frame is that of the receiving interface; in other words, the packet has reached its destination.

PACKET_MULTICAST

The destination address of the received frame is one of the multicast addresses to which the interface is registered.

PACKET_BROADCAST

The destination address of the received frame is the broadcast address of the receiving interface.

PACKET_OTHERHOST

The destination address of the received frame does not belong to the ones associated with the interface (unicast, multicast, and broadcast); thus, the frame will have to be forwarded if forwarding is enabled, and dropped otherwise.

PACKET_OUTGOING

The packet is being sent out; among the users of this flag are the Decnet protocol and the function that gives each network tap a copy of the outgoing packet (see dev_queue_xmit_nit in Chapter 11).

PACKET_LOOPBACK

The packet is being sent out to the loopback device. Thanks to this flag, when dealing with the loopback device, the kernel can skip some operations needed for real devices.

PACKET_FASTROUTE

The packet is being routed using the Fastroute feature. Fastroute support is not available anymore in 2.6 kernels.

Chapter 13 details how those values are set based on the L2 destination address value.
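As a rough illustration of classifying a received frame by its L2 destination address, here is a userspace sketch for Ethernet. The enum and the function demo_classify are hypothetical, and the logic is a simplification: it treats any group address as multicast without checking whether the interface is registered for it, and it ignores promiscuous mode.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for the four receive-side classifications. */
enum demo_pkt_type {
    DEMO_HOST,        /* addressed to this interface */
    DEMO_BROADCAST,   /* addressed to everyone */
    DEMO_MULTICAST,   /* addressed to a group */
    DEMO_OTHERHOST    /* addressed to some other station */
};

static enum demo_pkt_type demo_classify(const unsigned char dest[6],
                                        const unsigned char dev_addr[6])
{
    static const unsigned char bcast[6] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    };

    assert(dest != NULL && dev_addr != NULL);

    /* The low bit of the first octet marks a group address. */
    if (dest[0] & 1) {
        if (memcmp(dest, bcast, 6) == 0)
            return DEMO_BROADCAST;
        return DEMO_MULTICAST;
    }
    if (memcmp(dest, dev_addr, 6) == 0)
        return DEMO_HOST;
    return DEMO_OTHERHOST;
}
```

The group-bit test comes first because the broadcast address is itself a group address; only unicast destinations need to be compared against the interface's own address.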

__u32 priority

This indicates the Quality of Service (QoS) class of a packet being transmitted or forwarded. If the packet is generated locally, the socket layer defines the priority value. If instead the packet is being forwarded, the function rt_tos2priority (called from the ip_forward function) defines the value of the field according to the value of the Type of Service (ToS) field in the IP header itself. The value of this parameter has nothing to do with the DiffServ Code Point (DSCP) described in Chapter 18. I will discuss its role in the section "ip_forward Function" in Chapter 20.

unsigned short protocol

This is the protocol used at the next-higher layer from the perspective of the device driver at L2. Typical protocols listed here are IP, IPv6, and ARP; a complete list is available in include/linux/if_ether.h. Since each protocol has its own function handler for the processing of incoming packets, this field is used by the driver to inform the layer above it what handler to use. Each driver calls netif_rx to invoke the handler for the upper network layer, so the protocol field must be initialized before that function is invoked. See Chapters 10 and 13 for more detail.

unsigned short security

This is the security level of the packet. This field was originally introduced for use with IPsec but is no longer used.

2.1.4. Feature-Specific Fields

The Linux kernel is modular, allowing you to select what to include and what to leave out. Thus, some fields are included in the sk_buff data structure only if the kernel is compiled with support for particular features such as firewalling (Netfilter) or QoS:

unsigned long nfmark

__u32 nfcache

__u32 nfctinfo

struct nf_conntrack *nfct

unsigned int nfdebug

struct nf_bridge_info *nf_bridge

These parameters are used by Netfilter (the firewall code), and more specifically by the kernel option "Device Drivers -> Networking support -> Networking options -> Network packet filtering" and its two suboptions, "Network packet filtering debugging" and "Bridged IP/ARP packets filtering."

union {...} private

This union is used by the High Performance Parallel Interface (HIPPI). The associated kernel option is "Device Drivers -> Networking support -> Network device support -> HIPPI driver support."

__u32 tc_index

__u32 tc_verd

__u32 tc_classid

These parameters are used by Traffic Control, and more specifically by the kernel option "Device Drivers -> Networking support -> Networking options -> QoS and/or fair queueing" and its suboption, "Packet classifier API."

struct sec_path *sp

This is used by the IPsec protocol suite to keep track of transformations.

2.1.5. Management Functions

Lots of functions, usually very short and simple, are offered by the kernel to manipulate sk_buff elements or lists of elements. With the help of Figure 2-4, I'll describe the most important ones. First we will see the functions used to allocate and free buffers, and then the ones used to manipulate the pointers (i.e., skb->data) to reserve space at the head or at the tail of a frame.

If you take a look at the files include/linux/skbuff.h and net/core/skbuff.c, you will notice that almost all of the functions exist in two versions, with names like do_something and __do_something. Usually, the first one is a wrapper that adds extra sanity checks or locking mechanisms around a call to the second one. The internal __do_something form is generally not called directly (unless specific conditions are met; lock requirements, to name one). Exceptions to that rule are usually poorly coded functions that will be fixed eventually.

Figure 2-4. Before and after: (a) skb_put, (b) skb_push, (c) skb_pull, and (d) skb_reserve

Allocating memory: alloc_skb and dev_alloc_skb

alloc_skb is the main function for the allocation of buffers and is defined in net/core/skbuff.c. We have already seen that the data buffer and the header (the sk_buff data structure) are two different entities, which means that creating a single buffer involves two allocations of memory (one for the buffer and one for the sk_buff structure).
(Translator's note: the data field of the sk_buff structure points into the separately allocated data buffer.)

alloc_skb takes an sk_buff data structure from a cache by calling the function kmem_cache_alloc, and gets a data buffer by calling kmalloc, which also uses cached memory if it is available. The code (slightly simplified) is:

    skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask & ~__GFP_DMA);
    ... ... ...
    size = SKB_DATA_ALIGN(size);
    data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);

Before calling kmalloc, the size parameter is tuned with the macro SKB_DATA_ALIGN to force alignment. Before returning, the function initializes a few parameters in the structure, producing the final result shown in Figure 2-5.

At the bottom of the memory block on the right side of Figure 2-5 you can see the padding area introduced to force the alignment. The skb_shared_info block is mainly used to handle IP fragments and is described later in this chapter. The fields shown on the left side of the figure were explained earlier.
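Rounding a size up to an alignment boundary is commonly done with a mask, and the padding area mentioned above is the difference between the rounded and the requested size. The sketch below assumes a hypothetical boundary of 32 bytes (DEMO_CACHE_BYTES) and a hypothetical function name; the real boundary depends on the architecture.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CACHE_BYTES 32u   /* hypothetical boundary; must be a power of two */

/* Round size up to the next multiple of DEMO_CACHE_BYTES. */
static size_t demo_data_align(size_t size)
{
    /* The mask trick below is only valid for powers of two. */
    assert((DEMO_CACHE_BYTES & (DEMO_CACHE_BYTES - 1)) == 0);
    return (size + (DEMO_CACHE_BYTES - 1)) & ~(size_t)(DEMO_CACHE_BYTES - 1);
}
```

With a 32-byte boundary, a request for 33 bytes is rounded up to 64, leaving 31 bytes of padding, while a request for exactly 32 bytes is left unchanged.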

Figure 2-5. alloc_skb function

dev_alloc_skb is the buffer allocation function meant for use by device drivers and expected to be executed in interrupt mode. It is simply a wrapper around alloc_skb that adds 16 bytes to the requested size for optimization reasons and asks for an atomic operation (GFP_ATOMIC) since it will be called from within an interrupt handler routine:

static inline struct sk_buff *dev_alloc_skb(unsigned int length)
{
    return __dev_alloc_skb(length, GFP_ATOMIC);
}

static inline
struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask)
{
    struct sk_buff *skb = alloc_skb(length + 16, gfp_mask);
    if (likely(skb))
            skb_reserve(skb, 16);
    return skb;
}

This definition of __dev_alloc_skb is the default one used when there is no architecture-specific definition.

Freeing memory: kfree_skb and dev_kfree_skb

These two functions release a buffer, which results in its return to the buffer pool (cache). kfree_skb is both called directly and invoked through the dev_kfree_skb wrapper. The latter is defined for use by device drivers, to have a name that parallels dev_alloc_skb but consists of a simple macro that does nothing but call kfree_skb. This basic function releases a buffer only when the skb->users counter is 1 (when no users of the buffer are left). Otherwise, the function simply decrements that counter. So if a buffer had three users, only the third call to dev_kfree_skb or kfree_skb would free memory.
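The "free only on the last reference" behavior can be modeled in userspace with an atomic counter. This is a sketch under stated assumptions, not the kernel implementation: demo_skb, demo_alloc, demo_get, and demo_free are hypothetical names, and memory comes from malloc rather than from a cache.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_skb {
    atomic_int users;   /* number of entities holding a reference */
};

/* A new buffer starts with a single user: the caller. */
static struct demo_skb *demo_alloc(void)
{
    struct demo_skb *skb = malloc(sizeof(*skb));

    if (skb)
        atomic_init(&skb->users, 1);
    return skb;
}

/* Take an additional reference. */
static struct demo_skb *demo_get(struct demo_skb *skb)
{
    atomic_fetch_add(&skb->users, 1);
    return skb;
}

/* Drop one reference; returns true if this call released the memory. */
static bool demo_free(struct demo_skb *skb)
{
    int before = atomic_fetch_sub(&skb->users, 1);

    assert(before >= 1);
    if (before != 1)
        return false;   /* someone else still uses the buffer */
    free(skb);
    return true;
}
```

With one allocation followed by two demo_get calls the counter is 3, so the first two demo_free calls only decrement it and the third one actually releases the memory, matching the three-user case described above.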

The flowchart in Figure 2-6 shows all the steps involved in freeing a buffer. As you will see in Chapter 33, an sk_buff structure can hold a reference on a dst_entry data structure. When the sk_buff structure is freed, therefore, dst_release also has to be called to decrement the reference count on the associated dst_entry data structure.

When the destructor function pointer has been initialized, it is called here (see the section "Layout Fields" earlier in this chapter).

We have seen in Figure 2-5 what a simple scenario looks like: an sk_buff data structure is associated to another memory block where the actual data is stored. However, the skb_shared_info data structure at the bottom of that data block, as shown in Figure 2-5, can hold pointers to other memory fragments. See Chapter 21 for some examples. kfree_skb releases the memory held by those fragments as well, when they are present. Finally, the sk_buff data structure is returned to the skbuff_head_cache cache.
Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull

skb_reserve reserves some space (headroom) at the head of the buffer and is commonly used to allow the insertion of a header or to force data to be aligned on some boundary. The function shifts the data and tail pointers (discussed earlier in the section "Layout Fields") that mark the beginning and the end of the payload, respectively. Figure 2-4(d) shows the result of calling skb_reserve(skb,n). This function is usually called soon after the allocation of the buffer, when data and tail are still the same.

If you look at the receive function of one of the Ethernet drivers (for instance, vortex_rx in drivers/net/3c59x.c) you will see that they all use the following command before storing any data in the buffer they have just allocated:

skb_reserve(skb, 2);    /* Align IP on 16 byte boundaries */

Figure 2-6. kfree_skb function

Because they know that they are about to copy an Ethernet frame that has a header 14 octets long into the buffer, the argument of 2 shifts the head of the buffer 2 bytes. This keeps the IP header, which follows immediately after the Ethernet header, aligned on a 16-byte boundary from the beginning of the buffer, as shown in Figure 2-7.
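The arithmetic behind the 2-byte reservation is short enough to check directly. In the sketch below the function names are hypothetical; the only facts used are the 14-byte Ethernet header and the 16-byte boundary mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ETH_HLEN 14   /* length of an Ethernet header */

/* Offset of the L3 header from the start of the buffer once `reserve`
 * bytes of headroom have been opened and the frame has been copied in. */
static size_t demo_ip_offset(size_t reserve)
{
    return reserve + DEMO_ETH_HLEN;
}

/* Nonzero if the L3 header lands on a multiple of `boundary`. */
static int demo_ip_aligned(size_t reserve, size_t boundary)
{
    assert(boundary != 0);
    return demo_ip_offset(reserve) % boundary == 0;
}
```

With no reservation the IP header would start at offset 14, which is not even a multiple of 4; reserving 2 bytes moves it to offset 16.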

Figure 2-7. (a) before skb_reserve, (b) after skb_reserve, and (c) after copying the frame on the buffer

Figure 2-8 shows an example of using skb_reserve in the opposite direction, during data transmission.

Figure 2-8. Buffer that is filled in while traversing the stack from the TCP layer down to the link layer

  1. When TCP is asked to transmit some data, it allocates a buffer following certain criteria (TCP Maximum Segment Size (mss), support for scatter gather I/O, etc.).

  2. TCP reserves (with skb_reserve) enough space at the head of the buffer to hold all the headers of all layers (TCP, IP, link layer). The parameter MAX_TCP_HEADER is the sum of all headers of all levels and is calculated taking into account the worst-case scenarios: because the TCP layer does not know what type of interface will be used for the transmission, it reserves the biggest possible header for each layer. It even accounts for the possibility of multiple IP headers (because you can have multiple IP headers when the kernel is compiled with support for IP over IP).
    (Translator's note: the outgoing interface is not necessarily Ethernet.)

  3. The TCP payload is copied into the buffer. Note that Figure 2-8 is just an example. The TCP payload could be organized differently; for example, it could be stored as fragments. In Chapter 21, we will see what a fragmented buffer (also commonly called a paged buffer) looks like.

  4. The TCP layer adds its header.

  5. The TCP layer hands the buffer to the IP layer, which adds its header as well.

  6. The IP layer hands the IP packet to the neighboring layer, which adds the link layer header.

Note that while the buffer travels down the network stack, each protocol moves skb->data down, copies in its header, and updates skb->len. All of this is accomplished with the functions we saw in Figure 2-4.
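The steps above can be sketched in user space with a toy buffer. This is only a model under stated assumptions: struct toy_skb, the toy_* functions, and the header sizes (14, 20, 20) are made up for illustration, and the real kernel reserves MAX_TCP_HEADER, which normally leaves some headroom unused.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the four layout pointers of sk_buff (user space only). */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    size_t len;                     /* bytes between data and tail */
    unsigned char storage[256];
};

void toy_init(struct toy_skb *skb)
{
    skb->head = skb->data = skb->tail = skb->storage;
    skb->end = skb->storage + sizeof(skb->storage);
    skb->len = 0;
}

/* Like skb_reserve: open headroom by advancing data and tail together. */
void toy_reserve(struct toy_skb *skb, size_t n)
{
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put: extend the data area at the tail; the caller fills it in. */
unsigned char *toy_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Like skb_push: extend the data area toward head; the caller fills it in. */
unsigned char *toy_push(struct toy_skb *skb, size_t n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Steps 2 to 6 of the list, with made-up header sizes. */
size_t toy_transmit(struct toy_skb *skb, const char *payload)
{
    enum { LINK_HDR = 14, IP_HDR = 20, TCP_HDR = 20 };  /* illustrative */
    size_t n = strlen(payload);

    toy_init(skb);
    toy_reserve(skb, LINK_HDR + IP_HDR + TCP_HDR);   /* step 2 */
    memcpy(toy_put(skb, n), payload, n);             /* step 3 */
    memset(toy_push(skb, TCP_HDR), 'T', TCP_HDR);    /* step 4 */
    memset(toy_push(skb, IP_HDR), 'I', IP_HDR);      /* step 5 */
    memset(toy_push(skb, LINK_HDR), 'L', LINK_HDR);  /* step 6 */
    return skb->len;
}
```

The payload is copied exactly once; each layer only moves data toward head and writes its own header into the space reserved in step 2.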

Note that the skb_reserve function does not really move anything into or within the data buffer; it simply updates two pointers, skb->data and skb->tail, as depicted in Figure 2-4(d).

static inline void skb_reserve(struct sk_buff *skb, unsigned int len)
{
        skb->data += len;
        skb->tail += len;
}

skb_push adds one block of data to the beginning of the buffer, and skb_put adds one to the end. Like skb_reserve, these functions don't really add any data to the buffer; they simply move the pointer to the start (skb->data) or the end (skb->tail) of the data. The new data is supposed to be copied explicitly by other functions. skb_pull removes a block of data from the front of the buffer by moving skb->data forward. Figure 2-4 shows how these functions work.
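On the receive side, the same pointer-only idea lets a layer step over a header it has finished with. The sketch below is a user-space model: struct toy_buf, toy_pull, and toy_strip_link_header are hypothetical names, and the 14-byte link header is the Ethernet case mentioned earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Toy view of the data area: where it starts and how long it is. */
struct toy_buf {
    unsigned char *data;
    size_t len;
};

/*
 * Like skb_pull: strip n bytes from the front by advancing data.
 * No bytes are erased; they simply fall outside the data area.
 * Returns the new start of the data, or NULL if n exceeds the length.
 */
unsigned char *toy_pull(struct toy_buf *buf, size_t n)
{
    if (n > buf->len)
        return NULL;
    buf->data += n;
    buf->len -= n;
    return buf->data;
}

/* Receive-side use: step over a 14-byte link header to reach the IP header. */
unsigned char *toy_strip_link_header(struct toy_buf *buf)
{
    assert(buf->data != NULL);
    return toy_pull(buf, 14);
}
```

Because nothing is copied or cleared, the stripped header is still in memory in front of the new data pointer.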

The skb_shared_info structure and the skb_shinfo function

As shown in Figure 2-5, there is a structure called skb_shared_info at the end of the data buffer that keeps additional information about the data block. The data structure immediately follows the end pointer that marks the end of the data. This is the definition of the data structure:

struct skb_shared_info {
    atomic_t        dataref;
    unsigned int    nr_frags;
    unsigned short  tso_size;
    unsigned short  tso_segs;
    struct sk_buff  *frag_list;
    skb_frag_t      frags[MAX_SKB_FRAGS];
};

dataref represents the number of "users" of the data block and is described in the next section, "Cloning and copying buffers." nr_frags, frag_list, and frags are used to handle IP fragments and are described in Chapter 21. The skb_is_nonlinear routine can be used to check whether the buffer is fragmented, and skb_linearize[1] can be used to collapse the fragments into a single flat buffer. Collapsing the fragments involves copying, which introduces a performance penalty.

[1] See the section "dev_queue_xmit Function" in Chapter 11 for an example of its use.

Some network interface cards (NICs) can handle in hardware some of the tasks that have traditionally been done by the CPU. The most common example is the computation of the L3 and L4 checksums. Some NICs can even maintain the L4 protocol's state machines. For the sake of the code shown here, we are interested in TCP segmentation offload, where the NIC implements a subset of the TCP layer. tso_size and tso_segs are used by this feature.

Note that there is no field inside the sk_buff structure pointing at the skb_shared_info data structure. To access that structure, functions need to use the skb_shinfo macro, which simply returns the end pointer:

#define skb_shinfo(SKB)    ((struct skb_shared_info *)((SKB)->end))
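A user-space sketch can make the layout concrete: one allocation holds the data area followed by the shared-info trailer, and the end pointer is the only handle on that trailer. Everything here (struct toy_shared_info, struct toy_skb, toy_shinfo, toy_alloc, toy_get_data) is a hypothetical stand-in, and dataref is a plain int rather than the kernel's atomic_t.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy counterparts of skb_shared_info and sk_buff (user space only). */
struct toy_shared_info {
    int dataref;            /* plain int here; the kernel uses atomic_t */
    unsigned int nr_frags;
};

struct toy_skb {
    unsigned char *head;    /* start of the allocated block */
    unsigned char *end;     /* end of the data area, start of the trailer */
};

/* Same idea as skb_shinfo: the shared info lives at the end pointer. */
#define toy_shinfo(SKB) ((struct toy_shared_info *)((SKB)->end))

/* Allocate `size` bytes of data area followed by the shared info. */
int toy_alloc(struct toy_skb *skb, size_t size)
{
    size = (size + 15) & ~(size_t)15;   /* keep the trailer aligned */
    skb->head = malloc(size + sizeof(struct toy_shared_info));
    if (!skb->head)
        return -1;
    skb->end = skb->head + size;
    toy_shinfo(skb)->dataref = 1;
    toy_shinfo(skb)->nr_frags = 0;
    return 0;
}

/* Take one more reference on the data block; returns the new count. */
int toy_get_data(struct toy_skb *skb)
{
    assert(skb->end != NULL);
    return ++toy_shinfo(skb)->dataref;
}
```

Since the trailer is found purely by pointer arithmetic, the descriptor needs no dedicated field for it.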

The following statement, for instance, shows how the macro is used to increment a field of the private block:

skb_shinfo(skb)->dataref++;

Cloning and copying buffers

When the same buffer needs to be processed independently by different consumers, and they may need to change the content of the sk_buff descriptor (the h and nh pointers to the protocol headers), the kernel does not need to make a complete copy of both the sk_buff structure and the associated data buffers. Instead, to be more efficient, the kernel can clone the original, which consists of making a copy of the sk_buff structure only and playing with the reference counts to avoid releasing the shared data block prematurely. Buffer cloning is done with the skb_clone function.

An example of a situation using cloning is when an ingress packet needs to be delivered to multiple recipients, such as the protocol handler and one or more network taps (see Chapter 21).

The sk_buff clone is not linked to any list and has no reference to the socket owner. The field skb->cloned is set to 1 in both the clone and the original buffer. skb->users is set to 1 in the clone so that the first attempt to remove it succeeds, and the number of references (dataref) to the buffer containing the data is incremented (since there is now one more sk_buff data structure pointing to it). Figure 2-9 shows an example of a cloned buffer.
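The bookkeeping just described can be modeled in a few lines of user-space C. This is a sketch under simplifying assumptions: struct toy_data, struct toy_skb, and the toy_* functions are hypothetical, the counters are plain ints, and toy_free ignores the users count that the kernel checks before releasing a descriptor.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_data {               /* shared data block */
    int dataref;                /* number of descriptors pointing here */
    unsigned char bytes[64];
};

struct toy_skb {                /* descriptor, one per consumer */
    int users;
    int cloned;
    struct toy_data *data;
};

struct toy_skb *toy_alloc(void)
{
    struct toy_skb *skb = malloc(sizeof(*skb));
    struct toy_data *d = malloc(sizeof(*d));

    if (!skb || !d) {
        free(skb);
        free(d);
        return NULL;
    }
    memset(d->bytes, 0, sizeof(d->bytes));
    d->dataref = 1;
    skb->users = 1;
    skb->cloned = 0;
    skb->data = d;
    return skb;
}

/* Copy the descriptor only; the data block is shared and ref-counted. */
struct toy_skb *toy_clone(struct toy_skb *skb)
{
    struct toy_skb *n;

    assert(skb != NULL && skb->data != NULL);
    n = malloc(sizeof(*n));
    if (!n)
        return NULL;
    *n = *skb;
    n->users = 1;
    n->cloned = 1;
    skb->cloned = 1;
    skb->data->dataref++;
    return n;
}

/* Free a descriptor; returns 1 if the data block was released too. */
int toy_free(struct toy_skb *skb)
{
    int last = (--skb->data->dataref == 0);

    if (last)
        free(skb->data);
    free(skb);
    return last;
}
```

Freeing either descriptor first is safe: the data block goes away only when the last reference is dropped.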

Figure 2-9. skb_clone function

The skb_cloned routine can be used to check the cloned status of an skb buffer.

Figure 2-9 shows an example of a fragmented buffer, that is to say, a buffer that has some data stored in data fragments linked with the frags array. We will see how fragmented buffers are used in Chapter 21; for now, let's not bother with those details.

The skb_share_check routine can be used to check the reference count skb->users and clone the buffer skb when the users field says the buffer is shared.

When a buffer is cloned, the contents of the data block cannot be modified. This means that code can access the data without any need for locking. When, however, a function needs to modify not only the contents of the sk_buff structure but the data too, it needs to copy the data block as well. In this case, the programmer has two options. If only the data in the area between skb->head and skb->end needs to be modified, pskb_copy can be used to copy just that area. If the content of the fragment data blocks may need to be modified too, skb_copy must be used. The result of both pskb_copy and skb_copy is shown in Figure 2-10. You will see in Chapter 21 that the skb_shared_info data structure can include a list of sk_buff structures too (linked to a field called frag_list). That list is handled by pskb_copy and skb_copy in the same way as the frags array (this detail has been omitted from Figure 2-10 to keep the latter more readable).
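The difference between the two copy routines can be illustrated with a toy model that has one linear area and one fragment. All names here are hypothetical, and the model is deliberately simplified: in the kernel, skb_copy produces a single linear buffer, whereas this sketch just duplicates the fragment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { LINEAR_LEN = 8, FRAG_LEN = 8 };

struct toy_frag {
    int refs;                           /* descriptors sharing this fragment */
    unsigned char bytes[FRAG_LEN];
};

struct toy_skb {
    unsigned char linear[LINEAR_LEN];   /* stands for the head..end area */
    struct toy_frag *frag;              /* stands for one frags[] entry */
};

/* Like pskb_copy: private copy of the linear area, fragment still shared. */
struct toy_skb *toy_pcopy(const struct toy_skb *skb)
{
    struct toy_skb *n;

    assert(skb->frag != NULL);
    n = malloc(sizeof(*n));
    if (!n)
        return NULL;
    memcpy(n->linear, skb->linear, LINEAR_LEN);
    n->frag = skb->frag;
    n->frag->refs++;
    return n;
}

/* Like skb_copy: private copy of the linear area and the fragment data. */
struct toy_skb *toy_copy(const struct toy_skb *skb)
{
    struct toy_skb *n;

    assert(skb->frag != NULL);
    n = malloc(sizeof(*n));
    if (!n)
        return NULL;
    n->frag = malloc(sizeof(*n->frag));
    if (!n->frag) {
        free(n);
        return NULL;
    }
    memcpy(n->linear, skb->linear, LINEAR_LEN);
    memcpy(n->frag->bytes, skb->frag->bytes, FRAG_LEN);
    n->frag->refs = 1;
    return n;
}

/* Free a descriptor made by toy_pcopy or toy_copy. */
void toy_free(struct toy_skb *skb)
{
    if (--skb->frag->refs == 0)
        free(skb->frag);
    free(skb);
}
```

After toy_pcopy only the linear area may be changed safely; after toy_copy the fragment data is private as well, at the cost of copying more bytes.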

Figure 2-10. (a) pskb_copy function and (b) skb_copy function

You may not be able to appreciate all of the details in Figures 2-9 and 2-10 at this point. Later in the book, especially once you have gone through Part V, everything will make more sense.

While discussing the various topics of this book, I will sometimes emphasize that a given function needs to clone or copy a buffer. When deciding to make a clone or copy of a buffer, programmers of each subsystem cannot anticipate whether other kernel components (or other users of their subsystems) will need the original information in that buffer. The kernel is very modular and changes in a very dynamic and unpredictable way, so each subsystem is ignorant of what other subsystems may do with a buffer. Therefore, the programmers of each subsystem just keep track of any modifications they make to the buffer, and take care to make a copy before modifying anything in case some other part of the kernel needs the original information.

List management functions

These functions manipulate the lists of sk_buff elements, also called queues. For a complete list of functions, see <include/linux/skbuff.h> and <net/core/skbuff.c>. Some of the most commonly used functions are:


skb_queue_head_init

Initializes an sk_buff_head with an empty queue of elements.

skb_queue_head, skb_queue_tail

Adds one buffer to the head or to the tail of a queue, respectively.

skb_dequeue, skb_dequeue_tail

Dequeues an element from the head or from the tail, respectively. The first function should probably have been called skb_dequeue_head to be consistent with the names of the other queueing functions.


skb_queue_purge

Empties a queue.


skb_queue_walk

Runs a loop on each element of a queue in turn.

All functions of this class must be executed atomically; that is, they must grab the spin lock provided by the sk_buff_head structure for the queue. Otherwise, they could be interrupted by asynchronous events that enqueue or dequeue elements from the queues, such as functions invoked by expired timers, which would lead to race conditions.

Thus, each function is implemented as follows:

static inline function_name ( parameter_list )
{
        unsigned long flags;

        spin_lock_irqsave(...);
        __function_name ( parameter_list )
        spin_unlock_irqrestore(...);
}

The function consists of a wrapper that grabs the lock, does its work by invoking a function whose name begins with two underscores, and releases the lock.
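The wrapper pattern can be tried out in user space with a toy queue. This is a sketch, not kernel code: the toy_* names are hypothetical, a C11 atomic_flag stands in for the spin lock, interrupts are not disabled as spin_lock_irqsave would do, and the unlocked workers carry a _nolock suffix because identifiers that begin with two underscores are reserved in ordinary C programs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_node {
    struct toy_node *next;
    int value;
};

struct toy_queue_head {
    struct toy_node *first, *last;
    unsigned int qlen;
    atomic_flag lock;       /* stand-in for the lock in sk_buff_head */
};

void toy_queue_head_init(struct toy_queue_head *q)
{
    q->first = q->last = NULL;
    q->qlen = 0;
    atomic_flag_clear(&q->lock);
}

static void toy_lock(struct toy_queue_head *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_queue_head *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Unlocked worker: the caller must already hold the lock. */
void toy_queue_tail_nolock(struct toy_queue_head *q, struct toy_node *n)
{
    assert(n != NULL);
    n->next = NULL;
    if (q->last)
        q->last->next = n;
    else
        q->first = n;
    q->last = n;
    q->qlen++;
}

/* Unlocked worker: remove and return the first node, or NULL if empty. */
struct toy_node *toy_dequeue_nolock(struct toy_queue_head *q)
{
    struct toy_node *n = q->first;

    if (n) {
        q->first = n->next;
        if (!q->first)
            q->last = NULL;
        n->next = NULL;
        q->qlen--;
    }
    return n;
}

/* Wrapper: grab the lock, call the worker, release the lock. */
void toy_queue_tail(struct toy_queue_head *q, struct toy_node *n)
{
    toy_lock(q);
    toy_queue_tail_nolock(q, n);
    toy_unlock(q);
}

/* Wrapper: grab the lock, call the worker, release the lock. */
struct toy_node *toy_dequeue(struct toy_queue_head *q)
{
    struct toy_node *n;

    toy_lock(q);
    n = toy_dequeue_nolock(q);
    toy_unlock(q);
    return n;
}
```

Keeping the unlocked workers separate lets code that already holds the lock batch several operations without paying for the lock each time.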

Posted on 2008-11-22 00:00 by Wu.Country@侠缘