Linux network device drivers
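Unlike character and block devices, a Linux network device has no corresponding device file. Applications operate on network devices through sockets.
A network device driver sits at the data link layer: upward it exchanges packets with protocols such as IP and ARP, and downward it drives the physical-layer chip (the NIC) directly. The layer-3 protocols send data through dev_queue_xmit() and receive data through netif_rx(). The driver transmits on the physical NIC through its hard_start_xmit() routine (the ndo_start_xmit member of struct net_device_ops, listed below) and receives packets by interrupt. The core of a network driver is struct net_device, which abstracts the wide variety of NICs behind a single interface.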
I. Important data structures
1. struct net_device, include/linux/netdevice.h
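Its important members are:
>>Global information
char name[IFNAMSIZ]; name of the network device (for example eth0), unique across the system
>>Hardware information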
unsigned long mem_end; /* shared mem end */
unsigned long mem_start; /* shared mem start */
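Start and end addresses of the shared memory.
unsigned long base_addr; /* device I/O address */ I/O base address of the network device
unsigned int irq; /* device IRQ number */ interrupt number used by the device
unsigned char if_port; /* Selectable AUI, TP,..*/
Selects which port a multi-port device uses; the field is meaningful only for such devices. For example, a device that supports both IF_PORT_10BASE2 (coaxial cable) and IF_PORT_10BASET (twisted pair) can use this field.
unsigned char dma; /* DMA channel */
>>Interface information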
unsigned int flags; /* interface flags (a la BSD) */
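Interface flags, prefixed with IFF_ (Interface Flag). Some are managed by the kernel; the others are set when the interface is initialized to describe its capabilities and characteristics. They include IFF_UP (set by the kernel when the device is active and can transfer packets), IFF_AUTOMEDIA (the device can switch between several media types), IFF_BROADCAST (broadcast allowed), IFF_DEBUG (debug mode, usable to control the verbosity of printk calls), IFF_LOOPBACK (loopback), IFF_MULTICAST (multicast allowed), IFF_NOARP (the interface cannot perform ARP) and IFF_POINTOPOINT (the interface is attached to a point-to-point link).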
unsigned int mtu; /* interface MTU value */
unsigned short type; /* interface hardware type */
unsigned short hard_header_len; /* hardware hdr length */
Length of the device's hardware header. The initialization function for Ethernet devices sets this member to ETH_HLEN, i.e. 14.
/* Interface address info used in eth_type_trans() */
unsigned char *dev_addr; /* hw address, (before bcast because most packets are unicast) */
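The device's hardware (MAC) address. If the driver provides an interface for setting the MAC address, the address that is set is stored in this member.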
>>Device operation functions
/* Management operations */
const struct net_device_ops *netdev_ops;
const struct ethtool_ops *ethtool_ops;
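The member functions of ethtool_ops correspond to the command options of the user-space ethtool utility. ethtool manages NICs and their drivers, letting Linux network developers configure, inspect and debug the NIC hardware, the driver and the network protocol stack.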
/* Hardware header description */
const struct header_ops *header_ops;
Hardware-header operations, mainly creating the hardware header and parsing the hardware header out of a given sk_buff.
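To make the 14-byte Ethernet hardware header concrete, here is a user-space sketch of creating one and parsing the protocol field back out. It only mirrors the layout (6-byte destination, 6-byte source, 2-byte big-endian EtherType); the toy_ names are hypothetical and this is not the kernel's header_ops implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6   /* bytes in a MAC address */
#define TOY_ETH_HLEN 14  /* 6 + 6 + 2, the same value as ETH_HLEN */

/* Fill buf with destination MAC, source MAC and EtherType.
 * Returns the number of header bytes written. */
static size_t toy_eth_header(uint8_t *buf, const uint8_t *dst,
                             const uint8_t *src, uint16_t proto)
{
    assert(buf != NULL && dst != NULL && src != NULL);
    memcpy(buf, dst, TOY_ETH_ALEN);
    memcpy(buf + TOY_ETH_ALEN, src, TOY_ETH_ALEN);
    buf[12] = (uint8_t)(proto >> 8);    /* network byte order: high byte first */
    buf[13] = (uint8_t)(proto & 0xff);
    return TOY_ETH_HLEN;
}

/* Read the EtherType back out of a header built as above. */
static uint16_t toy_eth_proto(const uint8_t *hdr)
{
    return (uint16_t)((hdr[12] << 8) | hdr[13]);
}
```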
>>Auxiliary members
/*
* trans_start here is expensive for high speed devices on SMP,
* please use netdev_queue->trans_start instead.
*/
unsigned long trans_start; /* Time (in jiffies) of last Tx */
unsigned long last_rx; /* Time of last Rx
* This should not be set in
* drivers, unless really needed,
* because network stack (bonding)
* use it if/when necessary, to
* avoid dirtying this cache line.
*/
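trans_start records the time at which the most recent packet transmission started, and last_rx records the time at which the most recent packet was received; both are expressed in jiffies. Older drivers maintained both members themselves, but as the comments above say, netdev_queue->trans_start is preferred on high-speed SMP devices, and a driver should not set last_rx unless it really needs to.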
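Timestamps in jiffies are unsigned counters that wrap around, so "has the transmit timeout expired?" has to be asked with a wrap-safe comparison. The sketch below models the idea behind the kernel's time_after() macro in user space; the toy_ function names and the timeout parameter are assumptions for illustration.

```c
#include <assert.h>
#include <limits.h>

/* True if a is later than b, even if the counter wrapped in between.
 * Valid as long as the two values are less than LONG_MAX ticks apart. */
static int toy_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* True if more than timeout ticks have passed since trans_start. */
static int toy_tx_timed_out(unsigned long now, unsigned long trans_start,
                            unsigned long timeout)
{
    assert(timeout <= (unsigned long)LONG_MAX);
    return toy_time_after(now, trans_start + timeout);
}
```

A naive `now > trans_start + timeout` gives the wrong answer once the sum wraps past ULONG_MAX, which is why the subtraction-and-sign-test form is used.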
2. struct net_device_ops, include/linux/netdevice.h
int (*ndo_open)(struct net_device *dev);
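Opens the network interface and acquires the resources it needs, such as I/O addresses, the IRQ and DMA channels. Its counterpart ndo_stop() stops the interface.
netdev_tx_t (*ndo_start_xmit) (struct sk_buff *skb, struct net_device *dev);
Starts transmission of a packet.
void (*ndo_tx_timeout) (struct net_device *dev);
Called when a packet transmission times out. It should take steps such as restarting the transmission or restarting the hardware to bring the device back to a normal state.
struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
Returns the device's status information; struct net_device_stats holds detailed traffic statistics for the network device.
int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
Device-specific I/O control.
int (*ndo_set_config)(struct net_device *dev, struct ifmap *map);
Configures the interface; it can also be used to change the device's I/O address and interrupt number.
int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
Sets the device's MAC address.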
3. more
II. sk_buff operations
struct sk_buff is defined in include/linux/skbuff.h.
/**
 * struct sk_buff - socket buffer
 * @next: Next buffer in list
 * @prev: Previous buffer in list
 * @tstamp: Time we arrived
 * @sk: Socket we are owned by
 * @dev: Device we arrived on/are leaving by
 * @cb: Control buffer. Free for use by every layer. Put private vars here
 * @_skb_refdst: destination entry (with norefcount bit)
 * @sp: the security path, used for xfrm
 * @len: Length of actual data
 * @data_len: Data length
 * @mac_len: Length of link layer header
 * @hdr_len: writable header length of cloned skb
 * @csum: Checksum (must include start/offset pair)
 * @csum_start: Offset from skb->head where checksumming should start
 * @csum_offset: Offset from csum_start where checksum should be stored
 * @priority: Packet queueing priority
 * @local_df: allow local fragmentation
 * @cloned: Head may be cloned (check refcnt to be sure)
 * @ip_summed: Driver fed us an IP checksum
 * @nohdr: Payload reference only, must not modify header
 * @nfctinfo: Relationship of this skb to the connection
 * @pkt_type: Packet class
 * @fclone: skbuff clone status
 * @ipvs_property: skbuff is owned by ipvs
 * @peeked: this packet has been seen already, so stats have been
 *	done for it, don't do them again
 * @nf_trace: netfilter packet trace flag
 * @protocol: Packet protocol from driver
 * @destructor: Destruct function
 * @nfct: Associated connection, if any
 * @nfct_reasm: netfilter conntrack re-assembly pointer
 * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 * @skb_iif: ifindex of device we arrived on
 * @tc_index: Traffic control index
 * @tc_verd: traffic control verdict
 * @rxhash: the packet hash computed on receive
 * @queue_mapping: Queue mapping for multiqueue devices
 * @ndisc_nodetype: router type (from link layer)
 * @ooo_okay: allow the mapping of a socket to a queue to be changed
 * @l4_rxhash: indicate rxhash is a canonical 4-tuple hash over transport
 *	ports.
 * @dma_cookie: a cookie to one of several possible DMA operations
 *	done by skb DMA functions
 * @secmark: security marking
 * @mark: Generic packet mark
 * @dropcount: total number of sk_receive_queue overflows
 * @vlan_tci: vlan tag control information
 * @transport_header: Transport layer header
 * @network_header: Network layer header
 * @mac_header: Link layer header
 * @tail: Tail pointer
 * @end: End pointer
 * @head: Head of buffer
 * @data: Data head pointer
 * @truesize: Buffer size
 * @users: User count - see {datagram,tcp}.c
 */
struct sk_buff {
	/* These two members must be first. */
	struct sk_buff *next;
	struct sk_buff *prev;
	……
	/* These elements must be at the end, see alloc_skb() for details. */
	sk_buff_data_t tail;
	sk_buff_data_t end;
	unsigned char *head, *data;
	unsigned int truesize;
	atomic_t users;
};
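head and end point to the start and end of the allocated buffer, while data and tail point to the start and end of the actual data. Each layer can fill in its protocol header between head and data, or append new protocol data between tail and end.
>>Allocation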
static inline struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
extern struct sk_buff *dev_alloc_skb(unsigned int length);
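alloc_skb() rounds size up to a multiple of L1_CACHE_BYTES (32 on ARM); the priority argument is the memory-allocation priority, given as GFP flags. dev_alloc_skb() allocates the skb with GFP_ATOMIC, because it is usually called from the driver's receive interrupt handler.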
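The rounding mentioned above is the usual power-of-two align-up trick. A minimal user-space sketch, assuming a 32-byte cache line as in the text; TOY_CACHE_BYTES and toy_align_up() are hypothetical names, not kernel symbols:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_BYTES 32u  /* assumed cache line size, the ARM value quoted in the text */

/* Round size up to the next multiple of align (align must be a power of two). */
static size_t toy_align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

For example, a request for a 1514-byte Ethernet frame is rounded up to 1536 bytes under this assumption.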
>>Freeing
void kfree_skb(struct sk_buff *skb);
void consume_skb(struct sk_buff *skb);
#define dev_kfree_skb(a) consume_skb(a)
void dev_kfree_skb_irq(struct sk_buff *skb);
void dev_kfree_skb_any(struct sk_buff *skb);
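Inside the kernel proper, kfree_skb() is used, whereas network device drivers should preferably free socket buffers with dev_kfree_skb(), dev_kfree_skb_irq() or dev_kfree_skb_any(). dev_kfree_skb() is for non-interrupt context, dev_kfree_skb_irq() is for interrupt context, and dev_kfree_skb_any() can be used in either.
>>Modification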
unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
unsigned char *tmp = skb_tail_pointer(skb);
SKB_LINEAR_ASSERT(skb);
skb->tail += len;
skb->len += len;
return tmp;
}
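skb_put() advances skb->tail by len and increases skb->len by len, making room at the end of the data area; it returns the old tail, where the new data should be written.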
unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len){
skb->data -= len;
skb->len += len;
return skb->data;
}
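skb_push() moves skb->data back toward head by len and increases skb->len by len, making room for a header in front of the existing data.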
extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
skb->len -= len;
BUG_ON(skb->len < skb->data_len);
return skb->data += len;
}
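skb_pull() removes len bytes from the start of the data: skb->data advances by len and skb->len decreases by len.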
static inline void skb_reserve(struct sk_buff *skb, int len)
{
skb->data += len;
skb->tail += len;
}
For an empty buffer, skb_reserve() above adjusts the headroom: it advances skb->data and skb->tail together by len.
>>Example
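skb = alloc_skb(len + headspace, GFP_KERNEL); /* GFP_KERNEL may sleep: process context only */
if (!skb)
return -ENOMEM;
skb_reserve(skb, headspace); /* leave headroom for protocol headers */
/* copy_from_user() is the current name of the old memcpy_fromfs() */
if (copy_from_user(skb_put(skb, len), data, len)) {
kfree_skb(skb);
return -EFAULT;
}
pass_to_m_protocol(skb); /* placeholder for handing the skb to the protocol */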
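The code above first allocates a fresh sk_buff, then calls skb_reserve() to set aside headroom, calls skb_put() to make room for the data, copies the data in, and finally hands the sk_buff to the protocol stack.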
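The pointer movements of skb_reserve(), skb_put(), skb_push() and skb_pull() are easier to follow in a runnable model. The following is a user-space toy, not kernel code: struct toy_skb and every toy_ function are hypothetical stand-ins that keep only the four pointers and the length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;                 /* bytes between data and tail */
};

static struct toy_skb *toy_alloc(unsigned int size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size ? size : 1);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;   /* empty buffer: no data yet */
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

static void toy_free(struct toy_skb *skb)
{
    if (skb) {
        free(skb->head);
        free(skb);
    }
}

static size_t toy_headroom(const struct toy_skb *skb) { return (size_t)(skb->data - skb->head); }
static size_t toy_tailroom(const struct toy_skb *skb) { return (size_t)(skb->end - skb->tail); }

/* Empty buffer only: shift data and tail together to create headroom. */
static void toy_reserve(struct toy_skb *skb, unsigned int len)
{
    assert(skb->len == 0 && toy_tailroom(skb) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Extend the data area at the end; returns where the new bytes go. */
static unsigned char *toy_put(struct toy_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    assert(toy_tailroom(skb) >= len);
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Extend the data area at the front, e.g. to prepend a header. */
static unsigned char *toy_push(struct toy_skb *skb, unsigned int len)
{
    assert(toy_headroom(skb) >= len);
    skb->data -= len;
    skb->len += len;
    return skb->data;
}

/* Drop len bytes from the front, e.g. to strip a header on receive. */
static unsigned char *toy_pull(struct toy_skb *skb, unsigned int len)
{
    assert(skb->len >= len);
    skb->len -= len;
    skb->data += len;
    return skb->data;
}
```

Reserving 16 bytes in a 64-byte buffer, putting 5 bytes of payload and pushing a 14-byte header leaves 2 bytes of headroom and 43 bytes of tailroom, which matches the head/data/tail/end picture described earlier.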
III. Important operation functions
1. NAPI
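Network device drivers normally receive packets by interrupt, while poll_controller() uses pure polling. A third receive mode is NAPI (New API), whose receive cycle is "receive interrupt arrives -> disable receive interrupts -> poll until all pending packets have been received -> re-enable receive interrupts -> receive interrupt arrives ...". The kernel provides the following functions: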
/**
* netif_napi_add - initialize a napi context
* @dev: network device
* @napi: napi context
* @poll: polling function
* @weight: default weight
*
* netif_napi_add() must be used to initialize a napi context prior to calling
* *any* of the other napi related functions.
*/
void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
/**
* netif_napi_del - remove a napi context
* @napi: napi context
*
* netif_napi_del() removes a napi context from the network device napi list
*/
void netif_napi_del(struct napi_struct *napi);
static inline void napi_enable(struct napi_struct *n);
static inline void napi_disable(struct napi_struct *n);
static inline int napi_schedule_prep(struct napi_struct *n);
Checks whether the NAPI instance can be scheduled.
static inline void napi_schedule(struct napi_struct *n);
Schedules the polling instance to run.
extern void napi_complete(struct napi_struct *n);
Called when NAPI processing is complete.
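The NAPI receive cycle described above can be simulated without any hardware. The sketch below is a user-space model under stated assumptions: struct toy_nic and the toy_ functions are hypothetical, the interrupt handler stands in for the napi_schedule() step, and leaving polling mode stands in for napi_complete() plus re-enabling the interrupt. It also assumes the usual convention that a poll round which uses its whole budget stays scheduled.

```c
#include <assert.h>

struct toy_nic {
    int pending;      /* packets waiting in the receive ring */
    int irq_enabled;  /* 1 while the receive interrupt is unmasked */
    int scheduled;    /* 1 while the poll routine is queued to run */
    int delivered;    /* packets handed to the stack so far */
};

/* Receive interrupt: mask further interrupts and schedule polling. */
static void toy_rx_interrupt(struct toy_nic *nic)
{
    if (!nic->irq_enabled)
        return;               /* already in polling mode */
    nic->irq_enabled = 0;
    nic->scheduled = 1;
}

/* Poll routine: handle at most budget packets and return the work done.
 * If the ring was drained before the budget ran out, leave polling mode. */
static int toy_poll(struct toy_nic *nic, int budget)
{
    int done = 0;

    assert(budget > 0);
    while (done < budget && nic->pending > 0) {
        nic->pending--;
        nic->delivered++;
        done++;
    }
    if (done < budget) {
        nic->scheduled = 0;   /* polling complete */
        nic->irq_enabled = 1; /* back to interrupt-driven receive */
    }
    return done;
}

/* Run poll rounds until polling stops; returns the number of rounds. */
static int toy_run(struct toy_nic *nic, int budget)
{
    int rounds = 0;

    while (nic->scheduled) {
        toy_poll(nic, budget);
        rounds++;
    }
    return rounds;
}
```

With 10 pending packets and a budget of 4, the model polls three times (4, 4 and 2 packets) for a single interrupt, which is the saving NAPI aims for under load.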
2. Upper-layer functions
net/core/dev.c
int dev_queue_xmit(struct sk_buff *skb);
int netif_rx(struct sk_buff *skb);
3. more
IV. Network drivers
1. Registration and unregistration
extern int register_netdev(struct net_device *dev);
extern void unregister_netdev(struct net_device *dev);
A net_device does not have to be created and filled in member by member by hand; the following function and macros can allocate and pre-fill one:
extern struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, void (*setup)(struct net_device *),
unsigned int txqs, unsigned int rxqs);
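The first argument is the size of the device's private data, the second is the device name, the third is a pointer to the net_device setup() function, and the fourth and fifth are the numbers of transmit and receive sub-queues to allocate. setup() itself takes a net_device pointer and presets the values of the net_device members.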
#define alloc_netdev(sizeof_priv, name, setup) \
alloc_netdev_mqs(sizeof_priv, name, setup, 1, 1)
#define alloc_netdev_mq(sizeof_priv, name, setup, count) \
alloc_netdev_mqs(sizeof_priv, name, setup, count, count)
void free_netdev(struct net_device *dev); frees the net_device
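The sizeof_priv and setup() arguments are easier to picture with a model of the allocation: one block holds the device structure followed by the driver's private area, and setup() presets the members. This is a user-space sketch only; struct toy_netdev, the toy_ functions and the 32-byte alignment are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_ALIGN 32u   /* assumed alignment of the private area */

struct toy_netdev {
    char name[16];
    unsigned int mtu;
};

static size_t toy_round_up(size_t n)
{
    return (n + TOY_ALIGN - 1) & ~((size_t)TOY_ALIGN - 1);
}

/* Allocate the device and sizeof_priv bytes of private data in one block,
 * then let setup() preset the members. */
static struct toy_netdev *toy_alloc_netdev(size_t sizeof_priv, const char *name,
                                           void (*setup)(struct toy_netdev *))
{
    struct toy_netdev *dev;

    assert(name != NULL && setup != NULL);
    dev = calloc(1, toy_round_up(sizeof(*dev)) + sizeof_priv);
    if (!dev)
        return NULL;
    strncpy(dev->name, name, sizeof(dev->name) - 1); /* calloc keeps it terminated */
    setup(dev);
    return dev;
}

/* The private area starts right after the (aligned) device structure. */
static void *toy_netdev_priv(struct toy_netdev *dev)
{
    return (char *)dev + toy_round_up(sizeof(*dev));
}

/* Example setup() callback, presetting an Ethernet-like MTU. */
static void toy_ether_setup(struct toy_netdev *dev)
{
    dev->mtu = 1500;
}
```

Because the private area lives in the same allocation, a single free (free_netdev() in the real API) releases both the device and its private data.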
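2. Network device initialization
3. Opening and releasing a network device
4. Packet transmit path
5. Packet receive path
6. Link status
7. Parameter setting and statistics
V. more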