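1. The sk_buff structure
sk_buff (socket buffer) is the structure the Linux networking subsystem uses to pass packet data between its layers, which makes it the "nerve center" of the network stack.
When sending a packet, the kernel networking code must build an sk_buff that holds the data to transmit and hand it to the next layer down. Each layer adds its own protocol header to the sk_buff until the packet reaches the network device for transmission.
Likewise, when receiving a packet, the network device takes the data from the physical medium, and the driver must wrap it in an sk_buff and pass it up the stack. Each layer strips its own protocol header until the payload is delivered to the user.
The layout of sk_buff is shown in the figure below:
sk_buff is defined as follows: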
/**
 * struct sk_buff - socket buffer
 * @next: Next buffer in list
 * @prev: Previous buffer in list
 * @sk: Socket we are owned by
 * @tstamp: Time we arrived
 * @dev: Device we arrived on/are leaving by
 * @transport_header: Transport layer header
 * @network_header: Network layer header
 * @mac_header: Link layer header
 * @_skb_refdst: destination entry (with norefcount bit)
 * @sp: the security path, used for xfrm
 * @cb: Control buffer. Free for use by every layer. Put private vars here
 * @len: Length of actual data
 * @data_len: Data length
 * @mac_len: Length of link layer header
 * @hdr_len: writable header length of cloned skb
 * @csum: Checksum (must include start/offset pair)
 * @csum_start: Offset from skb->head where checksumming should start
 * @csum_offset: Offset from csum_start where checksum should be stored
 * @local_df: allow local fragmentation
 * @cloned: Head may be cloned (check refcnt to be sure)
 * @nohdr: Payload reference only, must not modify header
 * @pkt_type: Packet class
 * @fclone: skbuff clone status
 * @ip_summed: Driver fed us an IP checksum
 * @priority: Packet queueing priority
 * @users: User count - see {datagram,tcp}.c
 * @protocol: Packet protocol from driver
 * @truesize: Buffer size
 * @head: Head of buffer
 * @data: Data head pointer
 * @tail: Tail pointer
 * @end: End pointer
 * @destructor: Destruct function
 * @mark: Generic packet mark
 * @nfct: Associated connection, if any
 * @ipvs_property: skbuff is owned by ipvs
 * @peeked: this packet has been seen already, so stats have been
 *     done for it, don't do them again
 * @nf_trace: netfilter packet trace flag
 * @nfctinfo: Relationship of this skb to the connection
 * @nfct_reasm: netfilter conntrack re-assembly pointer
 * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 * @skb_iif: ifindex of device we arrived on
 * @rxhash: the packet hash computed on receive
 * @queue_mapping: Queue mapping for multiqueue devices
 * @tc_index: Traffic control index
 * @tc_verd: traffic control verdict
 * @ndisc_nodetype: router type (from link layer)
 * @dma_cookie: a cookie to one of several possible DMA operations
 *     done by skb DMA functions
 * @secmark: security marking
 * @vlan_tci: vlan tag control information
 */
struct sk_buff {
    /* These two members must be first. */
    struct sk_buff *next;   // list pointers: next and previous buffer
    struct sk_buff *prev;

    ktime_t tstamp;         // timestamp of the packet's arrival

    struct sock *sk;        // socket that owns this buffer
    struct net_device *dev; // network device that sends or receives this buffer

    /*
     * This is the control buffer. It is free to use for every
     * layer. Please put your private variables there. If you
     * want to keep them across layers you have to do a skb_clone()
     * first. This is owned by whoever has the skb queued ATM.
     */
    char cb[48] __aligned(8);

    unsigned long _skb_refdst;
#ifdef CONFIG_XFRM
    struct sec_path *sp;
#endif
    unsigned int len,
                 data_len;
    __u16 mac_len,
          hdr_len;
    union {
        __wsum csum;
        struct {
            __u16 csum_start;
            __u16 csum_offset;
        };
    };
    __u32 priority;
    kmemcheck_bitfield_begin(flags1);
    __u8 local_df:1,
         cloned:1,
         ip_summed:2,       // checksum policy for the packet
         nohdr:1,
         nfctinfo:3;
    __u8 pkt_type:3,
         fclone:2,
         ipvs_property:1,
         peeked:1,
         nf_trace:1;
    kmemcheck_bitfield_end(flags1);
    __be16 protocol;

    void (*destructor)(struct sk_buff *skb);
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
    struct nf_conntrack *nfct;
    struct sk_buff *nfct_reasm;
#endif
#ifdef CONFIG_BRIDGE_NETFILTER
    struct nf_bridge_info *nf_bridge;
#endif

    int skb_iif;
#ifdef CONFIG_NET_SCHED
    __u16 tc_index;         /* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
    __u16 tc_verd;          /* traffic control verdict */
#endif
#endif

    __u32 rxhash;

    kmemcheck_bitfield_begin(flags2);
    __u16 queue_mapping:16;
#ifdef CONFIG_IPV6_NDISC_NODETYPE
    __u8 ndisc_nodetype:2,
         deliver_no_wcard:1;
#else
    __u8 deliver_no_wcard:1;
#endif
    kmemcheck_bitfield_end(flags2);

    /* 0/14 bit hole */

#ifdef CONFIG_NET_DMA
    dma_cookie_t dma_cookie;
#endif
#ifdef CONFIG_NETWORK_SECMARK
    __u32 secmark;
#endif
    union {
        __u32 mark;
        __u32 dropcount;
    };

    __u16 vlan_tci;

    sk_buff_data_t transport_header;    // transport-layer header
    sk_buff_data_t network_header;      // network-layer header
    sk_buff_data_t mac_header;          // link-layer header
    /* These elements must be at the end, see alloc_skb() for details. */
    sk_buff_data_t tail;
    sk_buff_data_t end;
    unsigned char *head,
                  *data;
    unsigned int truesize;
    atomic_t users;
};
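The main members of sk_buff are:
1.1 Protocol headers of each layer:
--- transport_header : transport-layer header, e.g. the TCP, UDP, ICMP or IGMP header
--- network_header : network-layer header, e.g. the IP, IPv6 or ARP header
--- mac_header : link-layer header.
--- sk_buff_data_t is either an offset from skb->head (an unsigned int, when NET_SKBUFF_DATA_USES_OFFSET is defined) or a plain unsigned char pointer:
#ifdef NET_SKBUFF_DATA_USES_OFFSET
typedef unsigned int sk_buff_data_t;
#else
typedef unsigned char *sk_buff_data_t;
#endif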
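The two flavours of sk_buff_data_t can be illustrated with a small user-space sketch. It is not kernel code: struct toy_skb and the toy_* helpers are hypothetical names made up for this illustration, and the struct stores one header position in both representations only so that they can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct sk_buff (hypothetical, user space only). */
struct toy_skb {
    unsigned char *head;     /* start of the data buffer */
    unsigned int   net_off;  /* header stored as an offset from head */
    unsigned char *net_ptr;  /* the same header stored as a pointer */
};

/* Record a header position in both representations. */
static void toy_set_header(struct toy_skb *skb, unsigned char *hdr)
{
    assert(hdr >= skb->head);
    skb->net_off = (unsigned int)(hdr - skb->head);
    skb->net_ptr = hdr;
}

/* Offset flavour: the address is recomputed from head on every access. */
static unsigned char *toy_header_from_offset(const struct toy_skb *skb)
{
    return skb->head + skb->net_off;
}

/* Pointer flavour: the address is stored directly. */
static unsigned char *toy_header_from_pointer(const struct toy_skb *skb)
{
    return skb->net_ptr;
}
```

Both accessors yield the same address. The offset form only needs an unsigned int, which is smaller than a pointer on 64-bit machines.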
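1.2 Data buffer pointers: head, data, tail, end
--- *head : points to the start of the memory allocated for the packet data buffer. Once the sk_buff and its data buffer have been allocated, this pointer never changes.
--- *data : points to the start of the valid data for the current protocol layer.
What counts as valid data differs from layer to layer:
a. For the transport layer: the user data plus the transport-layer header.
b. For the network layer: the user data plus the transport-layer and network-layer headers.
c. For the link layer: the user data plus the transport-layer, network-layer and link-layer headers.
The data pointer therefore moves as the sk_buff passes from one protocol layer to the next.
--- tail : points to the end of the valid data for the current protocol layer; it is the counterpart of data.
--- end : points to the end of the allocated data buffer; it is the counterpart of head. Like head, end is fixed once the sk_buff has been allocated.
The relationship between head, data, tail and end is shown in the figure below: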
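The four pointers split the buffer into headroom, valid data and tailroom. The following user-space sketch models that split; struct toy_skb and the toy_* functions are hypothetical names for this illustration, not the kernel's own helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the four buffer pointers (hypothetical, user space only). */
struct toy_skb {
    unsigned char *head;  /* start of the allocated buffer (fixed) */
    unsigned char *data;  /* start of the valid data (moves) */
    unsigned char *tail;  /* end of the valid data (moves) */
    unsigned char *end;   /* end of the allocated buffer (fixed) */
};

/* An empty buffer: data and tail both start at head. */
static void toy_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
}

/* The invariant that every pointer move must preserve. */
static int toy_is_consistent(const struct toy_skb *skb)
{
    return skb->head <= skb->data && skb->data <= skb->tail &&
           skb->tail <= skb->end;
}

/* Free space in front of the valid data. */
static size_t toy_headroom(const struct toy_skb *skb)
{
    assert(toy_is_consistent(skb));
    return (size_t)(skb->data - skb->head);
}

/* Length of the valid data held in this buffer. */
static size_t toy_linear_len(const struct toy_skb *skb)
{
    assert(toy_is_consistent(skb));
    return (size_t)(skb->tail - skb->data);
}

/* Free space behind the valid data. */
static size_t toy_tailroom(const struct toy_skb *skb)
{
    assert(toy_is_consistent(skb));
    return (size_t)(skb->end - skb->tail);
}
```

Headroom, data length and tailroom always add up to the size of the allocated buffer, whichever layer currently owns it.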
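1.3 Length information: len, data_len, truesize
--- len : the length of the valid packet data, including protocol headers and payload.
--- data_len : the length of the data held in fragments, i.e. outside the linear buffer; it is 0 for a purely linear sk_buff.
--- truesize : the total size of the buffer. As __alloc_skb() below shows, it is initialised to the data buffer size plus sizeof(struct sk_buff).
1.4 Packet type
--- pkt_type : the class of the packet. It takes one of the following values:
PACKET_HOST --- the packet is addressed to this host.
PACKET_OTHERHOST --- the packet is addressed to some other host.
PACKET_BROADCAST --- broadcast packet
PACKET_MULTICAST --- multicast packet
An Ethernet driver does not have to set pkt_type explicitly, because eth_type_trans() does it.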
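How a destination MAC address maps onto the four packet classes can be sketched in user space as follows. This is a simplified illustration and not the body of eth_type_trans(); the TOY_* names and toy_classify() are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TOY_ETH_ALEN 6  /* length of an Ethernet address in bytes */

/* Hypothetical stand-ins for the PACKET_* classes. */
enum toy_pkt_type { TOY_HOST, TOY_BROADCAST, TOY_MULTICAST, TOY_OTHERHOST };

/* Classify a frame by comparing its destination with the device address. */
static enum toy_pkt_type toy_classify(const unsigned char *dest,
                                      const unsigned char *dev_addr)
{
    static const unsigned char bcast[TOY_ETH_ALEN] =
        { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    assert(dest != NULL && dev_addr != NULL);
    if (memcmp(dest, bcast, TOY_ETH_ALEN) == 0)
        return TOY_BROADCAST;       /* all-ones address */
    if (dest[0] & 0x01)
        return TOY_MULTICAST;       /* group bit set in the first octet */
    if (memcmp(dest, dev_addr, TOY_ETH_ALEN) == 0)
        return TOY_HOST;            /* addressed to this device */
    return TOY_OTHERHOST;           /* unicast for somebody else */
}
```

The broadcast test has to come before the multicast test, because the broadcast address also has the group bit set.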
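2. Socket buffer operations
2.1 Allocating a socket buffer
struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
alloc_skb() allocates a socket buffer (the struct sk_buff itself) together with a data buffer.
--- size : the size of the data buffer
--- priority : the memory allocation priority (a GFP mask)
static inline struct sk_buff *alloc_skb(unsigned int size, gfp_t priority)
{
    return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
}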
/**
 * __alloc_skb - allocate a network buffer
 * @size: size to allocate
 * @gfp_mask: allocation mask
 * @fclone: allocate from fclone cache instead of head cache
 *     and allocate a cloned (child) skb
 * @node: numa node to allocate memory on
 *
 * Allocate a new &sk_buff. The returned buffer has no headroom and a
 * tail room of size bytes. The object has a reference count of one.
 * The return is the buffer. On a failure the return is %NULL.
 *
 * Buffers may only be allocated from interrupts using a @gfp_mask of
 * %GFP_ATOMIC.
 */
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
                            int fclone, int node)
{
    struct kmem_cache *cache;
    struct skb_shared_info *shinfo;
    struct sk_buff *skb;
    u8 *data;

    cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

    /* Get the HEAD */
    skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); // allocate the socket buffer (struct sk_buff)
    if (!skb)
        goto out;
    prefetchw(skb);

    size = SKB_DATA_ALIGN(size);
    data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info), // allocate the data buffer
                                     gfp_mask, node);
    if (!data)
        goto nodata;
    prefetchw(data + size);

    /*
     * Only clear those fields we need to clear, not those that we will
     * actually initialise below. Hence, don't put any more fields after
     * the tail pointer in struct sk_buff!
     */
    memset(skb, 0, offsetof(struct sk_buff, tail));
    skb->truesize = size + sizeof(struct sk_buff);
    atomic_set(&skb->users, 1);
    skb->head = data;
    skb->data = data;
    skb_reset_tail_pointer(skb);
    skb->end = skb->tail + size;
#ifdef NET_SKBUFF_DATA_USES_OFFSET
    skb->mac_header = ~0U;
#endif

    /* make sure we initialize shinfo sequentially */
    shinfo = skb_shinfo(skb);
    memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
    atomic_set(&shinfo->dataref, 1);

    if (fclone) {
        struct sk_buff *child = skb + 1;
        atomic_t *fclone_ref = (atomic_t *) (child + 1);

        kmemcheck_annotate_bitfield(child, flags1);
        kmemcheck_annotate_bitfield(child, flags2);
        skb->fclone = SKB_FCLONE_ORIG;
        atomic_set(fclone_ref, 1);

        child->fclone = SKB_FCLONE_UNAVAILABLE;
    }
out:
    return skb;
nodata:
    kmem_cache_free(cache, skb);
    skb = NULL;
    goto out;
}
EXPORT_SYMBOL(__alloc_skb);
struct sk_buff *dev_alloc_skb(unsigned int len);
dev_alloc_skb() calls alloc_skb() above with GFP_ATOMIC priority
and reserves NET_SKB_PAD bytes of headroom between skb->head and skb->data (the value of NET_SKB_PAD depends on the kernel version; it was 16 in older kernels).
/**
 * dev_alloc_skb - allocate an skbuff for receiving
 * @length: length to allocate
 *
 * Allocate a new &sk_buff and assign it a usage count of one. The
 * buffer has unspecified headroom built in. Users should allocate
 * the headroom they think they need without accounting for the
 * built in space. The built in space is used for optimisations.
 *
 * %NULL is returned if there is no free memory. Although this function
 * allocates memory it can be called from an interrupt.
 */
struct sk_buff *dev_alloc_skb(unsigned int length)
{
    /*
     * There is more code here than it seems:
     * __dev_alloc_skb is an inline
     */
    return __dev_alloc_skb(length, GFP_ATOMIC);
}
EXPORT_SYMBOL(dev_alloc_skb);
/**
 * __dev_alloc_skb - allocate an skbuff for receiving
 * @length: length to allocate
 * @gfp_mask: get_free_pages mask, passed to alloc_skb
 *
 * Allocate a new &sk_buff and assign it a usage count of one. The
 * buffer has unspecified headroom built in. Users should allocate
 * the headroom they think they need without accounting for the
 * built in space. The built in space is used for optimisations.
 *
 * %NULL is returned if there is no free memory.
 */
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
                                              gfp_t gfp_mask)
{
    struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
    if (likely(skb))
        skb_reserve(skb, NET_SKB_PAD);
    return skb;
}
2.2 Freeing a socket buffer
void kfree_skb(struct sk_buff *skb);
/**
 * kfree_skb - free an sk_buff
 * @skb: buffer to free
 *
 * Drop a reference to the buffer and free it if the usage count has
 * hit zero.
 */
void kfree_skb(struct sk_buff *skb)
{
    if (unlikely(!skb))
        return;
    if (likely(atomic_read(&skb->users) == 1))
        smp_rmb();
    else if (likely(!atomic_dec_and_test(&skb->users)))
        return;
    trace_kfree_skb(skb, __builtin_return_address(0));
    __kfree_skb(skb);
}
EXPORT_SYMBOL(kfree_skb);
--- kfree_skb() is meant for use inside the kernel's network core; network device drivers should use dev_kfree_skb(), dev_kfree_skb_irq() or dev_kfree_skb_any().
void dev_kfree_skb(struct sk_buff *skb);
--- dev_kfree_skb() is used in non-interrupt context.
#define dev_kfree_skb(a) consume_skb(a)
/**
 * consume_skb - free an skbuff
 * @skb: buffer to free
 *
 * Drop a ref to the buffer and free it if the usage count has hit zero
 * Functions identically to kfree_skb, but kfree_skb assumes that the frame
 * is being dropped after a failure and notes that
 */
void consume_skb(struct sk_buff *skb)
{
    if (unlikely(!skb))
        return;
    if (likely(atomic_read(&skb->users) == 1))
        smp_rmb();
    else if (likely(!atomic_dec_and_test(&skb->users)))
        return;
    trace_consume_skb(skb);
    __kfree_skb(skb);
}
EXPORT_SYMBOL(consume_skb);
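kfree_skb() and consume_skb() above follow the same pattern: drop one reference and free the buffer only when the count reaches zero. The user-space sketch below shows just that pattern. It makes several simplifying assumptions: struct toy_skb and the toy_* functions are hypothetical, the counter is a plain int rather than an atomic_t, and malloc()/free() stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy reference-counted buffer (hypothetical, single-threaded). */
struct toy_skb {
    int            users;  /* reference count, plays the role of skb->users */
    unsigned char *head;   /* separately allocated data buffer */
};

/* Allocate the descriptor and its data buffer; the count starts at one. */
static struct toy_skb *toy_alloc(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->users = 1;
    return skb;
}

/* Take an additional reference. */
static void toy_get(struct toy_skb *skb)
{
    skb->users++;
}

/* Drop one reference. Returns 1 if the buffer was really freed. */
static int toy_free(struct toy_skb *skb)
{
    if (!skb)
        return 0;                /* freeing NULL is a no-op */
    assert(skb->users > 0);
    if (--skb->users > 0)
        return 0;                /* somebody else still holds it */
    free(skb->head);
    free(skb);
    return 1;
}
```

With two holders, the first toy_free() only decrements the count and the second one releases the memory.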
void dev_kfree_skb_irq(struct sk_buff *skb);
--- dev_kfree_skb_irq() is used in interrupt context. It does not free the buffer directly: it puts it on the per-CPU completion_queue and raises NET_TX_SOFTIRQ, so the actual free happens later in softirq context.
void dev_kfree_skb_irq(struct sk_buff *skb)
{
    if (atomic_dec_and_test(&skb->users)) {
        struct softnet_data *sd;
        unsigned long flags;

        local_irq_save(flags);
        sd = &__get_cpu_var(softnet_data);
        skb->next = sd->completion_queue;
        sd->completion_queue = skb;
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_restore(flags);
    }
}
EXPORT_SYMBOL(dev_kfree_skb_irq);
void dev_kfree_skb_any(struct sk_buff *skb);
--- dev_kfree_skb_any() can be used in both interrupt and non-interrupt context.
void dev_kfree_skb_any(struct sk_buff *skb)
{
    if (in_irq() || irqs_disabled())
        dev_kfree_skb_irq(skb);
    else
        dev_kfree_skb(skb);
}
EXPORT_SYMBOL(dev_kfree_skb_any);
2.3 Moving the pointers
The pointer operations on a Linux socket buffer are put, push, pull and reserve.
2.3.1 The put operation
unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
Moves the tail pointer down (toward end), increases the len of the sk_buff, and returns the value skb->tail had before the move, i.e. a pointer to the first byte of the newly added area.
It is used to append data at the tail of the buffer.
/**
 * skb_put - add data to a buffer
 * @skb: buffer to use
 * @len: amount of data to add
 *
 * This function extends the used data area of the buffer. If this would
 * exceed the total buffer size the kernel will panic. A pointer to the
 * first byte of the extra data is returned.
 */
unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
{
    unsigned char *tmp = skb_tail_pointer(skb); // tmp = skb->tail
    SKB_LINEAR_ASSERT(skb);
    skb->tail += len;
    skb->len += len;
    if (unlikely(skb->tail > skb->end))
        skb_over_panic(skb, len, __builtin_return_address(0)); // check that the added data does not overrun the buffer
    return tmp;
}
EXPORT_SYMBOL(skb_put);
static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
{
    return skb->tail;
}
unsigned char *__skb_put(struct sk_buff *skb, unsigned int len);
The difference between __skb_put() and skb_put() is that skb_put() checks that the new tail does not run past end (and panics if it does), whereas __skb_put() performs no check.
static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
    unsigned char *tmp = skb_tail_pointer(skb);
    SKB_LINEAR_ASSERT(skb);
    skb->tail += len;
    skb->len += len;
    return tmp;
}
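The put operation can be tried out in user space with the sketch below. It is a simplified model under stated assumptions: struct toy_skb and the toy_* functions are hypothetical, and where skb_put() would panic on overflow, toy_put() returns NULL and leaves the buffer untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Toy linear buffer (hypothetical, user space only). */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int   len;  /* length of the valid data */
};

/* Start with an empty buffer that has no headroom. */
static void toy_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/*
 * Append len bytes at the tail. Returns the old tail, which is where the
 * caller should write the new data, or NULL if the tailroom is too small.
 */
static unsigned char *toy_put(struct toy_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    assert(skb->tail <= skb->end);
    if (len > (size_t)(skb->end - skb->tail))
        return NULL;
    skb->tail += len;
    skb->len += len;
    return old_tail;
}
```

A typical caller copies its data to the returned pointer, for example memcpy(toy_put(&skb, n), src, n) after checking the result for NULL.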
2.3.2 The push operation
unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
skb_push() moves the data pointer up (toward head), i.e. it adds data at the start of the buffer, so it also increases the len of the sk_buff.
/**
 * skb_push - add data to the start of a buffer
 * @skb: buffer to use
 * @len: amount of data to add
 *
 * This function extends the used data area of the buffer at the buffer
 * start. If this would exceed the total buffer headroom the kernel will
 * panic. A pointer to the first byte of the extra data is returned.
 */
unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
{
    skb->data -= len;
    skb->len += len;
    if (unlikely(skb->data < skb->head))
        skb_under_panic(skb, len, __builtin_return_address(0));
    return skb->data;
}
EXPORT_SYMBOL(skb_push);
unsigned char *__skb_push(struct sk_buff *skb, unsigned int len);
static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
{
    skb->data -= len;
    skb->len += len;
    return skb->data;
}
__skb_push() differs from skb_push() in the same way that __skb_put() differs from skb_put(): it skips the bounds check.
The push operation opens up space for packet data at the head of the buffer, while the put operation opens up space at the tail.
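On the transmit side, push and put are typically combined: headroom is reserved first, the payload is put at the tail, and each layer then pushes its header in front. The user-space sketch below models this with hypothetical names (struct toy_skb, toy_*); toy_push() returns NULL where skb_push() would panic, and toy_put() is left unchecked for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Toy linear buffer (hypothetical, user space only). */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int   len;
};

static void toy_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Reserve headroom in an empty buffer: data and tail move together. */
static void toy_reserve(struct toy_skb *skb, unsigned int len)
{
    assert(skb->len == 0);
    skb->data += len;
    skb->tail += len;
}

/* Unchecked append at the tail; the caller must ensure there is room. */
static unsigned char *toy_put(struct toy_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Prepend len bytes; returns the new data pointer or NULL if no headroom. */
static unsigned char *toy_push(struct toy_skb *skb, unsigned int len)
{
    if (len > (size_t)(skb->data - skb->head))
        return NULL;
    skb->data -= len;
    skb->len += len;
    return skb->data;
}
```

With 42 bytes reserved, a UDP header (8 bytes), an IP header without options (20 bytes) and an Ethernet header (14 bytes) fit exactly in front of the payload, and any further push fails.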
2.3.3 The pull operation
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
skb_pull() moves the data pointer down and decreases the len of the skb; it is the inverse of skb_push().
It is mainly used when a lower layer hands a packet to the layer above, so that data points to the upper layer's header.
/**
 * skb_pull - remove data from the start of a buffer
 * @skb: buffer to use
 * @len: amount of data to remove
 *
 * This function removes data from the start of a buffer, returning
 * the memory to the headroom. A pointer to the next data in the buffer
 * is returned. Once the data has been pulled future pushes will overwrite
 * the old data.
 */
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
{
    return skb_pull_inline(skb, len);
}
EXPORT_SYMBOL(skb_pull);
static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
{
    return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
}
static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
    skb->len -= len;
    BUG_ON(skb->len < skb->data_len);
    return skb->data += len;
}
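The length check in skb_pull_inline() means that a pull larger than the remaining data fails without changing anything. A minimal user-space sketch of that behaviour, with the hypothetical names struct toy_skb and toy_pull() and only the two fields a pull touches:

```c
#include <assert.h>
#include <stddef.h>

/* Only the fields a pull touches (hypothetical, user space only). */
struct toy_skb {
    unsigned char *data;  /* start of the valid data */
    unsigned int   len;   /* length of the valid data */
};

/*
 * Strip len bytes from the front. Returns the new data pointer, or NULL
 * (leaving the buffer unchanged) if fewer than len bytes are available.
 */
static unsigned char *toy_pull(struct toy_skb *skb, unsigned int len)
{
    assert(skb->data != NULL);
    if (len > skb->len)
        return NULL;
    skb->len -= len;
    skb->data += len;
    return skb->data;
}
```

Callers should therefore check the return value before reading the next header.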
2.3.4 The reserve operation
void skb_reserve(struct sk_buff *skb, int len);
skb_reserve() moves the data pointer and the tail pointer down together.
It is used to reserve len bytes of headroom at the start of the buffer, and is only allowed on an empty buffer.
/**
 * skb_reserve - adjust headroom
 * @skb: buffer to alter
 * @len: bytes to move
 *
 * Increase the headroom of an empty &sk_buff by reducing the tail
 * room. This is only allowed for an empty buffer.
 */
static inline void skb_reserve(struct sk_buff *skb, int len)
{
    skb->data += len;
    skb->tail += len;
}
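3. Example
The way Linux receives a UDP packet illustrates how sk_buff is manipulated.
Most of this work is done by the kernel; the driver only has to handle the link-layer part.
Suppose the network card receives a UDP packet. Linux processes it as follows:
3.1 After the card receives a UDP packet, the driver creates an sk_buff and a data buffer, copies all the received data into the space that data points to, and makes skb->mac_header point to data.
At this point the valid data starts with the Ethernet header, i.e. the link-layer header.
Sample code:
// allocate a new socket buffer and data buffer
skb = dev_alloc_skb(length + 2);
if (skb == NULL) {
    ... // allocation failed
    return;
}
skb_reserve(skb, 2); // reserve headroom so that the network-layer header is aligned
// copy the data received by the hardware into the data buffer
readwords(ioaddr, RX_FRAME_PORT, skb_put(skb, length), length >> 1);
if (length & 1) {
    skb->data[length - 1] = readword(ioaddr, RX_FRAME_PORT);
}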
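The reason for reserving exactly 2 bytes is arithmetic: an Ethernet header is 14 bytes long, so 2 extra bytes put the IP header at offset 16, a multiple of 4. The sketch below checks this under the assumption that the data pointer returned by the allocator is itself at least 4-byte aligned; TOY_ETH_HLEN and the toy_* functions are hypothetical names.

```c
#include <assert.h>

#define TOY_ETH_HLEN 14  /* length of an Ethernet header in bytes */

/* Offset of the IP header from the original data pointer. */
static unsigned int toy_ip_header_offset(unsigned int reserved)
{
    return reserved + TOY_ETH_HLEN;
}

/* True if offset is a multiple of align. */
static int toy_is_aligned(unsigned int offset, unsigned int align)
{
    assert(align != 0);
    return offset % align == 0;
}
```

Without the reserve the IP header would sit at offset 14, which is only 2-byte aligned.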
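The result is shown in the figure below:
3.2 The link layer strips the Ethernet header by calling skb_pull() and passes the packet up to the network layer (IP).
During the pull, the data pointer moves down by the length of one Ethernet header, sizeof(struct ethhdr), and len decreases by sizeof(struct ethhdr) as well.
The valid data now starts at the IP header. skb->network_header points to data, i.e. the IP header, while skb->mac_header still points to the Ethernet header, i.e. the link-layer header.
This is shown in the figure below:
3.3 The network layer strips the IP header with skb_pull() and passes the packet up to the UDP transport layer.
During the pull, the data pointer moves down by the length of the IP header (sizeof(struct iphdr) when the header carries no options), and len decreases by the same amount.
The valid data now starts at the UDP header, and skb->transport_header points to data, i.e. the UDP header.
skb->network_header still points to the IP header, and skb->mac_header still points to the link-layer header.
This is shown in the figure below:
3.4 When the application calls recv() to receive the data, the copy into the application buffer starts at skb->data + sizeof(struct udphdr).
So the UDP header is never actually stripped.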
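Steps 3.1 to 3.4 can be replayed in user space with the sketch below. It is a deliberately simplified model, not the kernel's receive path: all names (struct toy_skb, toy_pull(), toy_udp_receive(), the TOY_* lengths) are hypothetical, the IP header is assumed to carry no options, the frame buffer itself serves as the data buffer with no headroom, and no header field is validated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_ETH_HLEN = 14, TOY_IP_HLEN = 20, TOY_UDP_HLEN = 8 };

/* Toy sk_buff with the fields used on the receive path (hypothetical). */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int   len;
    unsigned char *mac_header, *network_header, *transport_header;
};

/* Strip len bytes from the front of the valid data. */
static void toy_pull(struct toy_skb *skb, unsigned int len)
{
    assert(len <= skb->len);
    skb->len -= len;
    skb->data += len;
}

/*
 * Walk one received frame up the stack. Returns the number of payload
 * bytes copied into user_buf, or -1 if the frame is too short.
 */
static int toy_udp_receive(struct toy_skb *skb,
                           unsigned char *frame, unsigned int frame_len,
                           unsigned char *user_buf, unsigned int user_len)
{
    unsigned int payload_len;

    if (frame_len < TOY_ETH_HLEN + TOY_IP_HLEN + TOY_UDP_HLEN)
        return -1;

    /* 3.1 driver: the whole frame is valid data, starting at the MAC header */
    skb->head = skb->data = frame;
    skb->tail = skb->end = frame + frame_len;
    skb->len = frame_len;
    skb->mac_header = skb->data;

    /* 3.2 link layer: strip the Ethernet header */
    toy_pull(skb, TOY_ETH_HLEN);
    skb->network_header = skb->data;

    /* 3.3 network layer: strip the IP header */
    toy_pull(skb, TOY_IP_HLEN);
    skb->transport_header = skb->data;

    /* 3.4 recv(): copy from behind the UDP header, which is never pulled */
    payload_len = skb->len - TOY_UDP_HLEN;
    if (payload_len > user_len)
        payload_len = user_len;
    memcpy(user_buf, skb->data + TOY_UDP_HLEN, payload_len);
    return (int)payload_len;
}
```

After the call, data and transport_header both still point at the UDP header, while mac_header and network_header keep pointing at the headers that were pulled earlier.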