Link-Layer Output

The network layer sends packets down to the device output layer through dev_queue_xmit.

dev_queue_xmit must be called with interrupts enabled; only with interrupts enabled can the bottom half be activated.

1. Before calling this function, the caller must have set the device and the priority, and built the buffer.

2. If the send fails, a negative error number is returned. But even a non-negative return does not guarantee delivery: the packet may still be dropped by network congestion.

3. The function can also return errors from the queue disciplines, such as NET_XMIT_DROP, which is a positive value. So errors can be positive too, which confirms point 2; the upper protocol layers that call this function need to be careful in their error handling.

4. Whatever the function returns, the skb's final fate is to be consumed, i.e. freed... So the caller must not try to resend it; unless the protocol stack itself retransmits, the skb is already gone (see the refcount sketch after the quoted kernel comment below).

5. Interrupts must be enabled when calling this function:

 *      When calling this method, interrupts MUST be enabled.  This is because
 *      the BH enable code must have IRQs enabled so that it will not deadlock.
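Point 4 is worth a concrete illustration. The kernel comment above hints that you can bump the refcount before sending to keep a handle for retry. Below is a minimal sketch of that trick; my_try_xmit() is a hypothetical helper written for this post, not kernel code:

static int my_try_xmit(struct sk_buff *skb)
{
    int rc;

    skb_get(skb);                /* hold an extra reference for a possible retry */
    rc = dev_queue_xmit(skb);    /* always consumes one reference, success or not */
    if (rc < 0 || rc == NET_XMIT_DROP) {
        kfree_skb(skb);          /* we still own one reference: free it (or retry, carefully) */
        return rc;
    }
    consume_skb(skb);            /* queued or sent as far as we know: drop our hold */
    return 0;
}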

1. The flow with a congestion-control (Qdisc) policy: more complex, but the most common case today.

2. The flow with no enqueue hook: simpler, the packet goes straight to the driver; used by loopback devices, tunnels and the like.

First check whether an enqueue rule is present. If there is one, call __dev_xmit_skb and enter the congestion-control flow; if there is none and the txq is on, call dev_hard_start_xmit to send straight to the driver. So let's start by analyzing the Qdisc-policy flow into __dev_xmit_skb.

 

int dev_queue_xmit(struct sk_buff *skb)
{
    return __dev_queue_xmit(skb, NULL);
}
/**
 *  __dev_queue_xmit - transmit a buffer
 *  @skb: buffer to transmit
 *  @accel_priv: private data used for L2 forwarding offload
 *
 *  Queue a buffer for transmission to a network device. The caller must
 *  have set the device and priority and built the buffer before calling
 *  this function. The function can be called from an interrupt.
 *
 *  A negative errno code is returned on a failure. A success does not
 *  guarantee the frame will be transmitted as it may be dropped due
 *  to congestion or traffic shaping.
 *
 * -----------------------------------------------------------------------------------
 *      I notice this method can also return errors from the queue disciplines,
 *      including NET_XMIT_DROP, which is a positive value.  So, errors can also
 *      be positive.
 *
 *      Regardless of the return value, the skb is consumed, so it is currently
 *      difficult to retry a send to this method.  (You can bump the ref count
 *      before sending to hold a reference for retry if you are careful.)
 *
 *      When calling this method, interrupts MUST be enabled.  This is because
 *      the BH enable code must have IRQs enabled so that it will not deadlock.
 *          --BLG
 */
 /*
  * The unified transmit interface that the network-interface core layer
  * exposes to the protocol layers. IP, ARP and the various other
  * lower-level protocols all hand their outgoing data to the core layer
  * through this function.
  *
  * update:
  *   If traffic control is supported, the outgoing packet is queued on
  * the device output queue according to the queuing rules, and the
  * network-device TX softirq is activated at the appropriate moment to
  * pull packets off the queue and push them out of the device. Without
  * traffic control, the packet is sent out of the device directly.
  *   If the submission fails, the corresponding error code is returned;
  * a successful return still does not guarantee the packet is sent,
  * since the traffic-control machinery may drop it under congestion.
  *   dev_queue_xmit() must be called with interrupts enabled, because
  * only then can the bottom half be activated.
  *
  * The skb arriving here can be one of three kinds: a GSO skb (a
  * FRAGLIST-type scatter-gather packet, or an SG-type scatter-gather
  * packet), or a non-GSO skb that was already fragmented in
  * ip_finish_output.
  */
static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
{
    struct net_device *dev = skb->dev;
    struct netdev_queue *txq;
    struct Qdisc *q;
    int rc = -ENOMEM;

    skb_reset_mac_header(skb);

    if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
        __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);

    /* Disable soft irqs for various locks below. Also
     * stops preemption for RCU.
     */ // disable soft IRQs: __rcu_read_lock_bh() -> local_bh_disable()
    rcu_read_lock_bh();

    skb_update_prio(skb); // set skb->priority

    qdisc_pkt_len_init(skb); // account for the total size after GSO segmentation
#ifdef CONFIG_NET_CLS_ACT
    skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
# ifdef CONFIG_NET_EGRESS
    if (static_key_false(&egress_needed)) {
        skb = sch_handle_egress(skb, &rc, dev);
        if (!skb)
            goto out;
    }
# endif
#endif
    /* If device/qdisc don't need skb->dst, release it right now while
     * its hot in this cpu cache.
     */ // drop or hold the dst reference accordingly
    if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
        skb_dst_drop(skb);
    else
        skb_dst_force(skb);

#ifdef CONFIG_NET_SWITCHDEV
    /* Don't forward if offload device already forwarded */
    if (skb->offload_fwd_mark &&
        skb->offload_fwd_mark == dev->offload_fwd_mark) {
        consume_skb(skb);
        rc = NET_XMIT_SUCCESS;
        goto out;
    }
#endif
    /* Pick this netdevice's txq and the txq's Qdisc. The Qdisc handles
     * congestion: normally the packet is handed straight to the driver,
     * but when the queue is busy the Qdisc's congestion handling takes over.
     * Select a TX queue: if the device provides a select_queue callback,
     * use it; otherwise the kernel picks one. This is only the kernel side
     * of multiqueue TX; actually using multiple queues requires NIC
     * support. A plain device has just one queue; the queue count is set
     * when the net_device is allocated with alloc_etherdev.
     */
    /* Get the queuing discipline attached to the device; after e.g.
     * "tc qdisc add dev eth0 ..." this finds the corresponding Qdisc. */
    txq = netdev_pick_tx(dev, skb, accel_priv);
    q = rcu_dereference_bh(txq->qdisc);
    /* If the Qdisc has an enqueue hook, call __dev_xmit_skb and enter the
     * congestion-controlled flow. Note that taking this flow does not
     * necessarily mean the packet is enqueued: the Qdisc's enqueue/dequeue
     * operations are only used when the queue is busy.
     */
    trace_net_dev_queue(skb);
    if (q->enqueue) { /* TC is enabled on this device, so queue the packet;
                       * see qdisc_graft in tc_modify_qdisc */
         /*
          * Insert the packet into the queue according to the queuing
          * discipline, then run traffic control and schedule the queue
          * to transmit the packet; return when done.
          */
        rc = __dev_xmit_skb(skb, q, dev, txq);
        goto out;
    }
    /* the path below handles devices without a TX queue */
    /* The device has no queue. Common case for software devices:
       loopback, all the sorts of tunnels...

       Really, it is unlikely that netif_tx_lock protection is necessary
       here.  (f.e. loopback and IP tunnels are clean ignoring statistics
       counters.)
       However, it is possible, that they rely on protection
       made by us here.

       Check this and shot the lock. It is not prone from deadlocks.
       Either shot noqueue qdisc, it is even simpler 8)
     */ /* The device has no Qdisc here, i.e. no enqueue/dequeue rules, so
      * congestion control is not possible; common for loopback and tunnel
      * interfaces. Check whether the device is UP. */

    if (dev->flags & IFF_UP) {
        int cpu = smp_processor_id();  /* ok because BHs are off */

        if (txq->xmit_lock_owner != cpu) {

            if (__this_cpu_read(xmit_recursion) > RECURSION_LIMIT)
                goto recursion_alert;

            skb = validate_xmit_skb(skb, dev);
            if (!skb)
                goto out;
/*
HARD_TX_LOCK and HARD_TX_UNLOCK form a pair, and dev_queue_xmit must not
be re-entered between them. If the CPU that is already transmitting on
this device calls dev_queue_xmit again to output a packet, the code has a
bug. Otherwise take the lock first to keep other CPUs out, and once the
interface is up, call dev_hard_start_xmit to output the packet to the
network device.
*/
            HARD_TX_LOCK(dev, txq, cpu);

            if (!netif_xmit_stopped(txq)) {
                __this_cpu_inc(xmit_recursion);
                skb = dev_hard_start_xmit(skb, dev, txq, &rc);
                __this_cpu_dec(xmit_recursion);
                if (dev_xmit_complete(rc)) {
                    HARD_TX_UNLOCK(dev, txq);
                    goto out;
                }
            }
            HARD_TX_UNLOCK(dev, txq);
            net_crit_ratelimited( "Virtual device %s asks to queue packet!\n" ,
                         dev->name);
        }  else {
            /* Recursion is detected! It is possible,
             * unfortunately
             */
recursion_alert:
            net_crit_ratelimited( "Dead loop on virtual device %s, fix it urgently!\n" ,
                         dev->name);
        }
    }
// the network device is down: return the corresponding error code
    rc = -ENETDOWN;
    rcu_read_unlock_bh();

    atomic_long_inc(&dev->tx_dropped);
    kfree_skb_list(skb);
    return rc;
out:
    rcu_read_unlock_bh();
    return rc;
}
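An aside on netdev_pick_tx above: the pool of TX queues it picks from is fixed when the net_device is allocated. Here is a hedged sketch of the driver side, where MY_NUM_TXQ and struct my_priv are made-up names for illustration only:

#define MY_NUM_TXQ 8

struct my_priv {
    int dummy; /* placeholder private state */
};

static struct net_device *my_alloc_netdev(void)
{
    struct net_device *dev;

    /* allocate an Ethernet device with MY_NUM_TXQ TX queues */
    dev = alloc_etherdev_mq(sizeof(struct my_priv), MY_NUM_TXQ);
    if (!dev)
        return NULL;

    /* optionally expose fewer queues than were allocated */
    if (netif_set_real_num_tx_queues(dev, MY_NUM_TXQ)) {
        free_netdev(dev);
        return NULL;
    }
    return dev;
}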

 

 

(1) If the flow-control object (Qdisc) is empty, try to send the packet directly.

(2) If the flow-control object is not empty, enqueue the packet into it and run the flow-control object.

static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                 struct net_device *dev,
                 struct netdev_queue *txq)
{
    spinlock_t *root_lock = qdisc_lock(q);
    bool contended;
    int rc;

    qdisc_calculate_pkt_len(skb, q);
    /*
     * Heuristic to force contended enqueues to serialize on a
     * separate lock before trying to get qdisc main lock.
     * This permits __QDISC___STATE_RUNNING owner to get the lock more
     * often and dequeue packets faster.
     */
    contended = qdisc_is_running(q); // is the qdisc already running?
    if (unlikely(contended))
        spin_lock(&q->busylock); // it is: serialize on busylock first

    spin_lock(root_lock);
    /* Check the Qdisc state __QDISC_STATE_DEACTIVATED: if it is inactive,
     * drop the packet and return NET_XMIT_DROP. An interface with a Qdisc
     * policy normally only gets this flag when it is being closed. */
    if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
        kfree_skb(skb);
        rc = NET_XMIT_DROP;
    }  else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
           qdisc_run_begin(q)) {  // the qdisc's skb queue is empty and the qdisc rules may be bypassed (pfifo_fast sets this flag): try to send directly
        /*
         * This is a work-conserving queue; there are no old skbs
         * waiting to be sent out; and the qdisc is not running -
         * xmit the skb directly.
         */ /*
                 * Direct xmit requires:
                 * 1. TCQ_F_CAN_BYPASS is set (the default), meaning the
                 *    Qdisc rules may be bypassed;
                 * 2. q's len is 0, i.e. not a single packet sits in the Qdisc;
                 * 3. the Qdisc was not running to begin with, and we have
                 *    just marked it running.
                 * When all three hold, call sch_direct_xmit. */

        qdisc_bstats_update(q, skb);
       
        if (sch_direct_xmit(skb, q, dev, txq, root_lock,  true )) { // send this skb directly
            if (unlikely(contended)) {
                spin_unlock(&q->busylock);
                contended =  false ;
            }
            __qdisc_run(q); // the direct send finished but the qdisc still holds packets: dequeue and send them
        }  else
            qdisc_run_end(q); // nothing left in the qdisc after the direct send: stop running

        rc = NET_XMIT_SUCCESS;
    }  else {
        rc = q->enqueue(skb, q) & NET_XMIT_MASK; // enqueue into the qdisc; with the default pfifo_fast_ops this is pfifo_fast_enqueue
        if (qdisc_run_begin(q)) { // if the qdisc is not running, mark it running and drain it ourselves; if it was already running, whoever runs it will send our packet
            if (unlikely(contended)) {
                spin_unlock(&q->busylock);
                contended =  false ;
            }
            __qdisc_run(q);  // dequeue and transmit
        }
        }
    }
    spin_unlock(root_lock);
    if (unlikely(contended))
        spin_unlock(&q->busylock);
    return rc;
}

/*
 * Transmit possibly several skbs, and handle the return status as
 * required. Holding the __QDISC___STATE_RUNNING bit guarantees that
 * only one CPU can execute this function.
 *
 * Returns to the caller:
 *              0  - queue is empty or throttled.
 *              >0 - queue is not empty.
 sch_direct_xmit may transmit several skbs, since it is called both on the
 no-queue path and on the queued path; on the queued path it can certainly
 push out multiple packets (analyzed later in this post), handling the
 return status as required. The caller must hold the __QDISC___STATE_RUNNING
 bit, so only one CPU can execute this function; a BUSY condition can show
 up here!
 */

int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
            struct net_device *dev, struct netdev_queue *txq,
            spinlock_t *root_lock, bool validate)
{
    int ret = NETDEV_TX_BUSY;

    /* And release qdisc */
    spin_unlock(root_lock);

    /* Note that we validate skb (GSO, checksum, ...) outside of locks */
    if (validate) // validate the packet: GSO segmentation, checksum computation
        skb = validate_xmit_skb_list(skb, dev);

    if (likely(skb)) {
        HARD_TX_LOCK(dev, txq, smp_processor_id());
        /* If the txq has been stopped, i.e. QUEUE_STATE_ANY_XOFF_OR_FROZEN
         * is set, just leave ret = NETDEV_TX_BUSY; if the txq is running
         * normally, call dev_hard_start_xmit directly to send the packet. */
        if (!netif_xmit_frozen_or_stopped(txq))
            skb = dev_hard_start_xmit(skb, dev, txq, &ret); // hand the packet to the driver

        HARD_TX_UNLOCK(dev, txq);
    }  else {
        spin_lock(root_lock);
        return qdisc_qlen(q);
    }
    spin_lock(root_lock);
/* handle the return value: dev_xmit_complete(ret) is true when ret < NET_XMIT_MASK, false otherwise */
    if (dev_xmit_complete(ret)) {
        /* Driver sent out skb successfully or skb was consumed */
        /* Note that the driver may also have returned a negative value here,
         * which likewise means this skb was dropped. */
        ret = qdisc_qlen(q); // packet sent; if the queue still holds packets, keep trying to send them
    }  else {
        /* Driver returned NETDEV_TX_BUSY - requeue skb */
        if (unlikely(ret != NETDEV_TX_BUSY))
            net_warn_ratelimited( "BUG %s code %d qlen %d\n" ,
                         dev->name, ret, q->q.qlen);
         /* on Tx Busy, requeue the skb */
        ret = dev_requeue_skb(skb, q); // the send failed: requeue and send again later
    }
 /* if ret != 0 but the txq is stopped or frozen, no more packets can be sent now: return 0 */
    if (ret && netif_xmit_frozen_or_stopped(txq))
        ret = 0;

    return ret;
}

 

 

struct sk_buff *dev_hard_start_xmit(struct sk_buff *first, struct net_device *dev,
                    struct netdev_queue *txq, int *ret)
{
    struct sk_buff *skb = first;
    int rc = NETDEV_TX_OK;

    while (skb) { /* walk the skb list, one unit at a time */
        struct sk_buff *next = skb->next;

        skb->next = NULL; /* detach this packet and hand it to the driver's Tx routine; dequeued packets are sent from here too, hence the next pointer! */
        rc = xmit_one(skb, dev, txq, next != NULL);
        if (unlikely(!dev_xmit_complete(rc))) { /* the send failed: restore skb->next and bail out */
            skb->next = next;
            goto out;
        }

        skb = next; /* on success move on to next; usually next is NULL and we return, otherwise keep sending! */

        if (netif_xmit_stopped(txq) && skb) { /* the txq was stopped while packets remain to be sent: report Tx Busy! */
            rc = NETDEV_TX_BUSY;
            break ;
        }
    }

out:
    *ret = rc;
    return skb;
}


 

 

/*
xmit_one is comparatively simple. The code below lists the three functions
xmit_one, netdev_start_xmit and __netdev_start_xmit, whose purpose is to
deliver the packet to the driver's tx routine. On the way to the driver,
the packet also passes through the capture (tap) hooks.
*/
static int xmit_one( struct sk_buff *skb,  struct net_device *dev,
            struct netdev_queue *txq,  bool more)
{
    unsigned  int len;
    int rc;

    if (!list_empty(&ptype_all) || !list_empty(&dev->ptype_all))
        dev_queue_xmit_nit(skb, dev); // deliver a copy to the taps, handled by each matching packet_type's func

    len = skb->len;
    trace_net_dev_start_xmit(skb, dev);
    rc = netdev_start_xmit(skb, dev, txq, more); // transmit the packet
    trace_net_dev_xmit(skb, rc, dev, len);

    return rc;
}

static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev,
                        struct netdev_queue *txq, bool more)
{
    const struct net_device_ops *ops = dev->netdev_ops;
    int rc;
 
    rc = __netdev_start_xmit(ops, skb, dev, more); // transmit the packet
    if (rc == NETDEV_TX_OK)
        txq_trans_update(txq);
 
    return rc;
}

static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
                          struct sk_buff *skb, struct net_device *dev,
                          bool more)
{
    skb->xmit_more = more ? 1 : 0;
    return ops->ndo_start_xmit(skb, dev);     // invoke the device driver's transmit routine
}

Depending on the situation, __dev_queue_xmit calls __dev_xmit_skb or sch_direct_xmit; either way it ends up in dev_hard_start_xmit, which finally calls xmit_one to send one or more packets.
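Since every path ends at ops->ndo_start_xmit, it helps to see the driver side of that contract. Here is a skeletal sketch for a made-up device (my_hw_tx_ring_full and my_hw_post_tx are hypothetical helpers): the driver must return NETDEV_TX_OK once it owns the skb and will eventually free it, or NETDEV_TX_BUSY without freeing it, in which case sch_direct_xmit above requeues the packet:

static netdev_tx_t my_ndo_start_xmit(struct sk_buff *skb,
                                     struct net_device *dev)
{
    if (my_hw_tx_ring_full(dev)) {   /* hypothetical: no room in the TX ring */
        netif_stop_queue(dev);       /* the txq is now "stopped" for the core */
        return NETDEV_TX_BUSY;       /* caller requeues; we must NOT free skb */
    }

    my_hw_post_tx(dev, skb);         /* hypothetical: post the buffer for DMA */
    /* the skb is freed later in the TX-completion interrupt; see the
     * dev_kfree_skb_irq discussion at the end of this post */
    return NETDEV_TX_OK;
}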

 

struct sk_buff *validate_xmit_skb_list( struct sk_buff *skb,  struct net_device *dev)
{
    struct sk_buff *next, *head = NULL, *tail;

    for (; skb != NULL; skb = next) {
        next = skb->next;
        skb->next = NULL;

        /* in case skb wont be segmented, point to itself */
        skb->prev = skb;

        skb = validate_xmit_skb(skb, dev);   // validate each skb in the list
        if (!skb)
            continue ;

        if (!head)
            head = skb;
        else
            tail->next = skb;
        /* If skb was segmented, skb->prev points to
         * the last segment. If not, it still contains skb.
         */
        tail = skb->prev;
    }
    return head;
}
static struct sk_buff *validate_xmit_skb( struct sk_buff *skb,  struct net_device *dev)
{
    netdev_features_t features;

    if (skb->next)        // when called from validate_xmit_skb_list this condition never holds
        return skb;

    features = netif_skb_features(skb);      // get the device's feature flags
    skb = validate_xmit_vlan(skb, features);
    if (unlikely(!skb))
        goto out_null;

    if (netif_needs_gso(skb, features)) {        // the device cannot handle this skb's gso_type, so software GSO is needed
        struct sk_buff *segs;

        segs = skb_gso_segment(skb, features);   // GSO-segment the packet
        if (IS_ERR(segs)) {
            goto out_kfree_skb;
        }  else if (segs) {
            consume_skb(skb);
            skb = segs;
        }
    }  else {
        if (skb_needs_linearize(skb, features) &&
            __skb_linearize(skb))
            goto out_kfree_skb;

        /* If packet is not checksummed and device does not
         * support checksumming for this protocol, complete
         * checksumming here.
         * If the skb was already linearized above, __skb_linearize
         * returns immediately. Note the difference between frags and
         * frag_list: frags put the extra data into separately allocated
         * pages with a single sk_buff, while frag_list chains several
         * sk_buffs together.
         */
        if (skb->ip_summed == CHECKSUM_PARTIAL) {
            if (skb->encapsulation)
                skb_set_inner_transport_header(skb,
                            skb_checksum_start_offset(skb));
            else
                skb_set_transport_header(skb,
                            skb_checksum_start_offset(skb));
            if (!(features & NETIF_F_ALL_CSUM) &&
                skb_checksum_help(skb))
                goto out_kfree_skb;
        }
    }

    return skb;

out_kfree_skb:
    kfree_skb(skb);
out_null:
    return NULL;
}

 

 

/**
 *    skb_needs_linearize - check if we need to linearize a given skb
 *                  depending on the given device features.
 *    @skb: socket buffer to check
 *    @features: net device features
 *
 *    Returns true if either:
 *    1. skb has frag_list and the device doesn't support FRAGLIST, or
 *    2. skb is fragmented and the device does not support SG.
 //If the skb has fragments but the sending device does not support them,
 or some fragment sits in high memory the device cannot DMA to, all the
 segments must be merged back into one. __skb_linearize here is really
 __pskb_pull_tail(skb, skb->data_len), which is essentially equivalent to
 pskb_may_pull: pskb_may_pull checks whether the skb's main buffer has
 enough room to pull in len bytes, and if not it reallocates the skb and
 copies the data in frags into the newly allocated main buffer. With len
 set to skb->data_len, all of the data gets copied into the main buffer,
 and that is how the skb is linearized.
*/
static inline bool skb_needs_linearize(struct sk_buff *skb,
                                       netdev_features_t features)
{
    return skb_is_nonlinear(skb) &&
           ((skb_has_frag_list(skb) && !(features & NETIF_F_FRAGLIST)) ||
            (skb_shinfo(skb)->nr_frags && !(features & NETIF_F_SG)));
}
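For completeness, a hedged sketch of the consumer side: a driver that cannot do scatter-gather flattening an skb before DMA-mapping it. skb_linearize() is the kernel wrapper around this machinery and returns 0 on success; my_flatten_for_dma() is a made-up helper for illustration:

static int my_flatten_for_dma(struct sk_buff *skb)
{
    /* merge frags/frag_list into the main buffer if needed */
    if (skb_is_nonlinear(skb) && skb_linearize(skb))
        return -ENOMEM;   /* reallocation failed */
    return 0;
}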

 

1. The Qdisc satisfies the three conditions above: TCQ_F_CAN_BYPASS is set, zero packets are queued, and it was not running. Then sch_direct_xmit is called directly. My feeling is that this is what happens when the very first packet goes out...

2. One or more of the three conditions fail: then the packet is enqueued straight away and the qdisc is run.

Personally I think the first situation is what you hit while the network is flowing freely: the qdisc queue stays essentially empty and everything is handed straight to the driver. The second is the congested case, where sends have started to fail

and the queue still holds packets waiting to go out. To make sure the queued data leaves according to the Qdisc's rules, first-in first-out for example, new packets must be enqueued first and then dequeued for transmission!

PS: on the driver side of transmission, freeing the packet comes into play. A hard-interrupt handler generally does not free the skb right away but defers the free to the softirq; the point is to get out of the hard interrupt quickly.

void dev_kfree_skb_irq(struct sk_buff *skb)
{
    if (atomic_dec_and_test(&skb->users)) {
        struct softnet_data *sd;
        unsigned long flags;

        local_irq_save(flags);
        sd = &__get_cpu_var(softnet_data);
        skb->next = sd->completion_queue;
        sd->completion_queue = skb;
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_restore(flags);
    }
}
EXPORT_SYMBOL(dev_kfree_skb_irq);

void dev_kfree_skb_any(struct sk_buff *skb)
{
    if (in_irq() || irqs_disabled())
        dev_kfree_skb_irq(skb);
    else
        dev_kfree_skb(skb);
}
in_irq: tells whether we are in hard-interrupt context. From the kernel's own comments:
 * in_irq()       - We're in (hard) IRQ context
 * in_softirq()   - We have BH disabled, or are processing softirqs
 * in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
 * in_serving_softirq() - We're in softirq context

 * in_task()      - We're in task context
#define in_irq()		(hardirq_count())
#define in_softirq()		(softirq_count())
#define in_interrupt()		(irq_count())
#define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)
#define in_nmi()		(preempt_count() & NMI_MASK)
#define in_task()		(!(preempt_count() & \
				   (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
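Tying this back to the PS above, here is a hedged sketch of a TX-completion handler for a made-up device (my_hw_reap_tx is hypothetical). We are in hard-IRQ context, so in_irq() is true and the skbs go through dev_kfree_skb_irq(), which chains them onto softnet_data's completion_queue and raises NET_TX_SOFTIRQ instead of freeing synchronously:

static irqreturn_t my_tx_done_irq(int irq, void *dev_id)
{
    struct net_device *dev = dev_id;
    struct sk_buff *skb;

    /* reap descriptors the hardware has finished with (hypothetical) */
    while ((skb = my_hw_reap_tx(dev)) != NULL)
        dev_kfree_skb_irq(skb);   /* defer the actual free to the softirq */

    netif_wake_queue(dev);        /* let the core resume a stopped txq */
    return IRQ_HANDLED;
}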
 

 
