sk_buff: Definition and Operations

July 26, 2013

1. The sk_buff structure

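sk_buff, the socket buffer, carries data between the layers of the Linux networking subsystem and acts as its "central nervous system", which makes it one of the most important structures in the network stack.

When a packet is sent, the kernel's networking code must build an sk_buff containing the data to transmit and pass it down to the next layer. Each layer adds its own protocol header to the sk_buff until the buffer is handed to the network device for transmission.

Likewise, when a packet is received, the network device takes the data from the physical medium, wraps it in an sk_buff, and passes it up. Each layer strips its own protocol header until the payload is delivered to the user.

[Figure: structure of sk_buff]

sk_buff is defined as follows: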

/** 
 *	struct sk_buff - socket buffer
 *	@next: Next buffer in list
 *	@prev: Previous buffer in list
 *	@sk: Socket we are owned by
 *	@tstamp: Time we arrived
 *	@dev: Device we arrived on/are leaving by
 *	@transport_header: Transport layer header
 *	@network_header: Network layer header
 *	@mac_header: Link layer header
 *	@_skb_refdst: destination entry (with norefcount bit)
 *	@sp: the security path, used for xfrm
 *	@cb: Control buffer. Free for use by every layer. Put private vars here
 *	@len: Length of actual data
 *	@data_len: Data length
 *	@mac_len: Length of link layer header
 *	@hdr_len: writable header length of cloned skb
 *	@csum: Checksum (must include start/offset pair)
 *	@csum_start: Offset from skb->head where checksumming should start
 *	@csum_offset: Offset from csum_start where checksum should be stored
 *	@local_df: allow local fragmentation
 *	@cloned: Head may be cloned (check refcnt to be sure)
 *	@nohdr: Payload reference only, must not modify header
 *	@pkt_type: Packet class
 *	@fclone: skbuff clone status
 *	@ip_summed: Driver fed us an IP checksum
 *	@priority: Packet queueing priority
 *	@users: User count - see {datagram,tcp}.c
 *	@protocol: Packet protocol from driver
 *	@truesize: Buffer size 
 *	@head: Head of buffer
 *	@data: Data head pointer
 *	@tail: Tail pointer
 *	@end: End pointer
 *	@destructor: Destruct function
 *	@mark: Generic packet mark
 *	@nfct: Associated connection, if any
 *	@ipvs_property: skbuff is owned by ipvs
 *	@peeked: this packet has been seen already, so stats have been
 *		done for it, don't do them again
 *	@nf_trace: netfilter packet trace flag
 *	@nfctinfo: Relationship of this skb to the connection
 *	@nfct_reasm: netfilter conntrack re-assembly pointer
 *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 *	@skb_iif: ifindex of device we arrived on
 *	@rxhash: the packet hash computed on receive
 *	@queue_mapping: Queue mapping for multiqueue devices
 *	@tc_index: Traffic control index
 *	@tc_verd: traffic control verdict
 *	@ndisc_nodetype: router type (from link layer)
 *	@dma_cookie: a cookie to one of several possible DMA operations
 *		done by skb DMA functions
 *	@secmark: security marking
 *	@vlan_tci: vlan tag control information
 */

struct sk_buff {
	/* These two members must be first. */
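	struct sk_buff		*next;    /* list pointers to the next and previous buffers */
	struct sk_buff		*prev;

	ktime_t			tstamp;   /* time the packet arrived */

	struct sock		*sk;      /* socket that owns this buffer */
	struct net_device	*dev;     /* device the buffer arrived on or is leaving by */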

	/*
	 * This is the control buffer. It is free to use for every
	 * layer. Please put your private variables there. If you
	 * want to keep them across layers you have to do a skb_clone()
	 * first. This is owned by whoever has the skb queued ATM.
	 */
	char			cb[48] __aligned(8);

	unsigned long		_skb_refdst;
#ifdef CONFIG_XFRM
	struct	sec_path	*sp;
#endif
	unsigned int		len,
				data_len;
	__u16			mac_len,
				hdr_len;
	union {
		__wsum		csum;
		struct {
			__u16	csum_start;
			__u16	csum_offset;
		};
	};
	__u32			priority;
	kmemcheck_bitfield_begin(flags1);
	__u8			local_df:1,
				cloned:1,
				ip_summed:2,   /* checksum status of the packet */
				nohdr:1,
				nfctinfo:3;
	__u8			pkt_type:3,    
				fclone:2,
				ipvs_property:1,
				peeked:1,
				nf_trace:1;
	kmemcheck_bitfield_end(flags1);
	__be16			protocol;

	void			(*destructor)(struct sk_buff *skb);
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
	struct nf_conntrack	*nfct;
	struct sk_buff		*nfct_reasm;
#endif
#ifdef CONFIG_BRIDGE_NETFILTER
	struct nf_bridge_info	*nf_bridge;
#endif

	int			skb_iif;
#ifdef CONFIG_NET_SCHED
	__u16			tc_index;	/* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
	__u16			tc_verd;	/* traffic control verdict */
#endif
#endif

	__u32			rxhash;

	kmemcheck_bitfield_begin(flags2);
	__u16			queue_mapping:16;
#ifdef CONFIG_IPV6_NDISC_NODETYPE
	__u8			ndisc_nodetype:2,
				deliver_no_wcard:1;
#else
	__u8			deliver_no_wcard:1;
#endif
	kmemcheck_bitfield_end(flags2);

	/* 0/14 bit hole */

#ifdef CONFIG_NET_DMA
	dma_cookie_t		dma_cookie;
#endif
#ifdef CONFIG_NETWORK_SECMARK
	__u32			secmark;
#endif
	union {
		__u32		mark;
		__u32		dropcount;
	};

	__u16			vlan_tci;

	sk_buff_data_t		transport_header;   /* transport-layer header */
	sk_buff_data_t		network_header;     /* network-layer header */
	sk_buff_data_t		mac_header;         /* link-layer header */
	/* These elements must be at the end, see alloc_skb() for details.  */
	sk_buff_data_t		tail;
	sk_buff_data_t		end;
	unsigned char		*head,
				*data;
	unsigned int		truesize;
	atomic_t		users;
};

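The main members of sk_buff are described below.

1.1 Protocol headers of each layer:

--- transport_header : the transport-layer header, e.g. the TCP or UDP header (the kernel also uses this field for ICMP and IGMP headers, which sit directly above IP)

--- network_header : the network-layer header, e.g. the IP, IPv6, or ARP header

--- mac_header : the link-layer header.

--- sk_buff_data_t is either an unsigned int offset from skb->head or an unsigned char pointer, depending on whether NET_SKBUFF_DATA_USES_OFFSET is defined: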

#ifdef NET_SKBUFF_DATA_USES_OFFSET
typedef unsigned int sk_buff_data_t;
#else
typedef unsigned char *sk_buff_data_t;
#endif
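As the typedef shows, a header field holds either an offset or a raw pointer. The sketch below is a user-space model (not kernel code) of the offset case, showing how such a field can be stored and turned back into a pointer. The names mini_skb, hdr_off_t, hdr_set, and hdr_ptr are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Offset flavour of sk_buff_data_t (the NET_SKBUFF_DATA_USES_OFFSET case). */
typedef unsigned int hdr_off_t;

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct mini_skb {
	unsigned char *head;        /* start of the data buffer */
	hdr_off_t transport_header; /* header position as an offset from head */
};

/* Record where a header starts, given a pointer into the buffer. */
void hdr_set(struct mini_skb *skb, unsigned char *p)
{
	assert(p >= skb->head);
	skb->transport_header = (hdr_off_t)(p - skb->head);
}

/* Turn the stored offset back into a pointer. */
unsigned char *hdr_ptr(const struct mini_skb *skb)
{
	return skb->head + skb->transport_header;
}
```

The kernel defines NET_SKBUFF_DATA_USES_OFFSET on 64-bit builds, where a 32-bit offset takes less room in the structure than a pointer.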

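1.2 Data buffer pointers: head, data, tail, end

--- *head : points to the start of the memory allocated for the packet data buffer. Once the sk_buff and its data buffer have been allocated, this pointer does not change.

--- *data : points to the start of the valid data for the current protocol layer.

What counts as valid data differs from layer to layer:

a. For the transport layer, valid data is the user data plus the transport-layer header.

b. For the network layer, valid data is the user data plus the transport-layer and network-layer headers.

c. For the data link layer, valid data is the user data plus the transport-layer, network-layer, and link-layer headers.

The data pointer therefore moves as ownership of the sk_buff passes from one protocol layer to the next.

--- tail : points to the end of the valid data for the current protocol layer; it is the counterpart of data.

--- end : points to the end of the memory allocated for the packet data buffer; it is the counterpart of head. Like head, end is fixed once the sk_buff has been allocated.

[Figure: relationship between head, data, tail, and end]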
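The relationship between the four pointers can be expressed as three distances. The following is a minimal user-space sketch, assuming a linear buffer and plain pointers; mini_skb and the three helper names are hypothetical (the kernel offers skb_headroom() and skb_tailroom() for the first and last quantity).

```c
#include <stddef.h>

/* Hypothetical reduced model: only the four buffer pointers. */
struct mini_skb {
	unsigned char *head; /* start of the allocated buffer (fixed) */
	unsigned char *data; /* start of valid data for the current layer */
	unsigned char *tail; /* end of valid data */
	unsigned char *end;  /* end of the allocated buffer (fixed) */
};

/* Free space in front of the data, where headers can be prepended. */
size_t headroom(const struct mini_skb *skb)
{
	return (size_t)(skb->data - skb->head);
}

/* Length of the valid data in the linear buffer. */
size_t datalen(const struct mini_skb *skb)
{
	return (size_t)(skb->tail - skb->data);
}

/* Free space behind the data, where payload can be appended. */
size_t tailroom(const struct mini_skb *skb)
{
	return (size_t)(skb->end - skb->tail);
}
```

Headroom, data length, and tailroom always add up to the size of the allocated buffer, which is why moving data or tail trades one region for another without reallocating.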

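1.3 Length information: len, data_len, truesize

--- len : the length of the valid data in the packet, including protocol headers and payload.

--- data_len : the length of the data held in fragments, that is, outside the linear buffer; it is 0 for a linear sk_buff.

--- truesize : the total memory accounted to the buffer. __alloc_skb() sets it to the aligned data buffer size plus sizeof(struct sk_buff), not just sizeof(struct sk_buff).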
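How len and data_len relate can be shown with a little arithmetic. This is a user-space sketch with a hypothetical mini_skb type; the helper names mirror the kernel's skb_headlen() and skb_is_nonlinear(), but the code here is only a model.

```c
/* Hypothetical reduced model holding only the two length fields. */
struct mini_skb {
	unsigned int len;      /* all valid data: linear part plus fragments */
	unsigned int data_len; /* data held in fragments only */
};

/* Bytes stored in the linear buffer, i.e. between data and tail. */
unsigned int headlen(const struct mini_skb *skb)
{
	return skb->len - skb->data_len;
}

/* Non-zero when part of the packet lives outside the linear buffer. */
int is_nonlinear(const struct mini_skb *skb)
{
	return skb->data_len != 0;
}
```

For example, a 1500-byte packet with 1000 bytes in fragments has 500 bytes in its linear buffer.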

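1.4 Packet type

--- pkt_type : the packet class, set on the receive path to one of:

PACKET_HOST --- the packet is addressed to this host.

PACKET_OTHERHOST --- the packet is addressed to some other host.

PACKET_BROADCAST --- a broadcast packet.

PACKET_MULTICAST --- a multicast packet.

Setting pkt_type is the driver's responsibility, but an Ethernet driver does not need to assign it explicitly, because eth_type_trans() does so.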
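The classification can be sketched from the destination MAC address alone. The code below is a simplified user-space model of the decision, not the kernel's eth_type_trans(); the enum pkt_class, its constants, and classify() are hypothetical names, and details such as promiscuous mode are left out.

```c
#include <string.h>

/* Hypothetical stand-ins for the PACKET_* classes. */
enum pkt_class { PKT_HOST, PKT_BROADCAST, PKT_MULTICAST, PKT_OTHERHOST };

/* Classify a frame by its destination MAC and the receiving device's MAC. */
enum pkt_class classify(const unsigned char dest[6],
			const unsigned char dev_addr[6])
{
	static const unsigned char bcast[6] = {
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	};

	if (dest[0] & 1) {		/* group bit set: broadcast or multicast */
		if (memcmp(dest, bcast, 6) == 0)
			return PKT_BROADCAST;
		return PKT_MULTICAST;
	}
	if (memcmp(dest, dev_addr, 6) != 0)
		return PKT_OTHERHOST;	/* unicast, but for another station */
	return PKT_HOST;		/* unicast to this device */
}
```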

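2. Socket buffer operations

2.1 Allocating a socket buffer

struct sk_buff *alloc_skb(unsigned int len, gfp_t priority);

alloc_skb() allocates a socket buffer (the sk_buff itself) and a data buffer.

--- len : size of the data buffer

--- priority : memory allocation priority (a GFP mask such as GFP_KERNEL or GFP_ATOMIC)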

static inline struct sk_buff *alloc_skb(unsigned int size,
					gfp_t priority)
{
	return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
}

/**
 *	__alloc_skb	-	allocate a network buffer
 *	@size: size to allocate
 *	@gfp_mask: allocation mask
 *	@fclone: allocate from fclone cache instead of head cache
 *		and allocate a cloned (child) skb
 *	@node: numa node to allocate memory on
 *
 *	Allocate a new &sk_buff. The returned buffer has no headroom and a
 *	tail room of size bytes. The object has a reference count of one.
 *	The return is the buffer. On a failure the return is %NULL.
 *
 *	Buffers may only be allocated from interrupts using a @gfp_mask of
 *	%GFP_ATOMIC.
 */
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
			    int fclone, int node)
{
	struct kmem_cache *cache;
	struct skb_shared_info *shinfo;
	struct sk_buff *skb;
	u8 *data;

	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

	/* Get the HEAD */
	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);  /* allocate the socket buffer */
	if (!skb)
		goto out;
	prefetchw(skb);

	size = SKB_DATA_ALIGN(size);
	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),   /* allocate the data buffer */
			gfp_mask, node);
	if (!data)
		goto nodata;
	prefetchw(data + size);

	/*
	 * Only clear those fields we need to clear, not those that we will
	 * actually initialise below. Hence, don't put any more fields after
	 * the tail pointer in struct sk_buff!
	 */
	memset(skb, 0, offsetof(struct sk_buff, tail));
	skb->truesize = size + sizeof(struct sk_buff);
	atomic_set(&skb->users, 1);
	skb->head = data;
	skb->data = data;
	skb_reset_tail_pointer(skb);
	skb->end = skb->tail + size;
#ifdef NET_SKBUFF_DATA_USES_OFFSET
	skb->mac_header = ~0U;
#endif

	/* make sure we initialize shinfo sequentially */
	shinfo = skb_shinfo(skb);
	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
	atomic_set(&shinfo->dataref, 1);

	if (fclone) {
		struct sk_buff *child = skb + 1;
		atomic_t *fclone_ref = (atomic_t *) (child + 1);

		kmemcheck_annotate_bitfield(child, flags1);
		kmemcheck_annotate_bitfield(child, flags2);
		skb->fclone = SKB_FCLONE_ORIG;
		atomic_set(fclone_ref, 1);

		child->fclone = SKB_FCLONE_UNAVAILABLE;
	}
out:
	return skb;
nodata:
	kmem_cache_free(cache, skb);
	skb = NULL;
	goto out;
}
EXPORT_SYMBOL(__alloc_skb);

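struct sk_buff *dev_alloc_skb(unsigned int len);

dev_alloc_skb() calls alloc_skb() above with GFP_ATOMIC priority

and reserves NET_SKB_PAD bytes of headroom between skb->head and skb->data (the often-quoted 16 bytes applies to older kernels; the value depends on the kernel version and architecture).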

/**
 *	dev_alloc_skb - allocate an skbuff for receiving
 *	@length: length to allocate
 *
 *	Allocate a new &sk_buff and assign it a usage count of one. The
 *	buffer has unspecified headroom built in. Users should allocate
 *	the headroom they think they need without accounting for the
 *	built in space. The built in space is used for optimisations.
 *
 *	%NULL is returned if there is no free memory. Although this function
 *	allocates memory it can be called from an interrupt.
 */
struct sk_buff *dev_alloc_skb(unsigned int length)
{
	/*
	 * There is more code here than it seems:
	 * __dev_alloc_skb is an inline
	 */
	return __dev_alloc_skb(length, GFP_ATOMIC);
}
EXPORT_SYMBOL(dev_alloc_skb);

/**
 *	__dev_alloc_skb - allocate an skbuff for receiving
 *	@length: length to allocate
 *	@gfp_mask: get_free_pages mask, passed to alloc_skb
 *
 *	Allocate a new &sk_buff and assign it a usage count of one. The
 *	buffer has unspecified headroom built in. Users should allocate
 *	the headroom they think they need without accounting for the
 *	built in space. The built in space is used for optimisations.
 *
 *	%NULL is returned if there is no free memory.
 */
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
					      gfp_t gfp_mask)
{
	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
	if (likely(skb))
		skb_reserve(skb, NET_SKB_PAD);
	return skb;
}
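The effect of the two allocation paths on the buffer pointers can be modelled in user space. This is a sketch under stated assumptions: mini_skb, mini_alloc, mini_dev_alloc, and mini_free are hypothetical names, malloc() stands in for the kernel allocators, PAD is an assumed stand-in for NET_SKB_PAD, and the size alignment done by SKB_DATA_ALIGN is omitted.

```c
#include <stdlib.h>

#define PAD 32 /* assumed stand-in for NET_SKB_PAD; the real value varies */

struct mini_skb {
	unsigned char *head, *data, *tail, *end;
};

/* Model of alloc_skb(): no headroom, `size` bytes of tailroom. */
struct mini_skb *mini_alloc(unsigned int size)
{
	struct mini_skb *skb = malloc(sizeof(*skb));

	if (!skb)
		return NULL;
	skb->head = malloc(size);
	if (!skb->head) {
		free(skb);
		return NULL;
	}
	skb->data = skb->tail = skb->head;
	skb->end = skb->head + size;
	return skb;
}

/* Model of __dev_alloc_skb(): over-allocate by PAD, then reserve it. */
struct mini_skb *mini_dev_alloc(unsigned int length)
{
	struct mini_skb *skb = mini_alloc(length + PAD);

	if (skb) {
		skb->data += PAD;
		skb->tail += PAD;
	}
	return skb;
}

/* Release both the data buffer and the descriptor. */
void mini_free(struct mini_skb *skb)
{
	if (skb) {
		free(skb->head);
		free(skb);
	}
}
```

After mini_dev_alloc(100), the buffer has PAD bytes of headroom and 100 bytes of tailroom, and it is still empty because data and tail coincide.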

2.2 Freeing a socket buffer

void kfree_skb(struct sk_buff *skb);

/**
 *	kfree_skb - free an sk_buff
 *	@skb: buffer to free
 *
 *	Drop a reference to the buffer and free it if the usage count has
 *	hit zero.
 */
void kfree_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_kfree_skb(skb, __builtin_return_address(0));
	__kfree_skb(skb);
}
EXPORT_SYMBOL(kfree_skb);
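The reference-counting logic of kfree_skb() can be illustrated with a small user-space model. It is only a sketch: the counter is a plain int rather than an atomic_t, so it is not safe for concurrent use, and mini_skb, mini_alloc, mini_get, and mini_kfree are hypothetical names (mini_get plays the role of the kernel's skb_get()).

```c
#include <stdlib.h>

struct mini_skb {
	int users;           /* reference count (atomic_t in the kernel) */
	unsigned char *head; /* data buffer */
};

/* Allocate a buffer that starts with a single reference. */
struct mini_skb *mini_alloc(unsigned int size)
{
	struct mini_skb *skb = malloc(sizeof(*skb));

	if (!skb)
		return NULL;
	skb->head = malloc(size);
	if (!skb->head) {
		free(skb);
		return NULL;
	}
	skb->users = 1;
	return skb;
}

/* Take an additional reference. */
void mini_get(struct mini_skb *skb)
{
	skb->users++;
}

/* Drop one reference; returns 1 if the buffer was actually freed. */
int mini_kfree(struct mini_skb *skb)
{
	if (!skb)
		return 0;
	if (--skb->users > 0)
		return 0;
	free(skb->head);
	free(skb);
	return 1;
}
```

A buffer held by two owners survives the first mini_kfree() call and is released by the second.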

--- kfree_skb() is meant for use inside the kernel's networking core; network device drivers should use dev_kfree_skb(), dev_kfree_skb_irq(), or dev_kfree_skb_any().

void dev_kfree_skb(struct sk_buff *skb);

--- dev_kfree_skb() is for non-interrupt context.

#define dev_kfree_skb(a)	consume_skb(a)

/**
 *	consume_skb - free an skbuff
 *	@skb: buffer to free
 *
 *	Drop a ref to the buffer and free it if the usage count has hit zero
 *	Functions identically to kfree_skb, but kfree_skb assumes that the frame
 *	is being dropped after a failure and notes that
 */
void consume_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_consume_skb(skb);
	__kfree_skb(skb);
}
EXPORT_SYMBOL(consume_skb);

void dev_kfree_skb_irq(struct sk_buff *skb);

--- dev_kfree_skb_irq() is for interrupt context.

void dev_kfree_skb_irq(struct sk_buff *skb)
{
	if (atomic_dec_and_test(&skb->users)) {
		struct softnet_data *sd;
		unsigned long flags;

		local_irq_save(flags);
		sd = &__get_cpu_var(softnet_data);
		skb->next = sd->completion_queue;
		sd->completion_queue = skb;
		raise_softirq_irqoff(NET_TX_SOFTIRQ);
		local_irq_restore(flags);
	}
}
EXPORT_SYMBOL(dev_kfree_skb_irq);

void dev_kfree_skb_any(struct sk_buff *skb);

--- dev_kfree_skb_any() can be used in either interrupt or non-interrupt context.

void dev_kfree_skb_any(struct sk_buff *skb)
{
	if (in_irq() || irqs_disabled())
		dev_kfree_skb_irq(skb);
	else
		dev_kfree_skb(skb);
}
EXPORT_SYMBOL(dev_kfree_skb_any);

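2.3 Moving the pointers

The pointer-moving operations on a Linux socket buffer are put, push, pull, and reserve.

2.3.1 put operation

unsigned char *skb_put(struct sk_buff *skb, unsigned int len);

Moves the tail pointer down by len bytes (toward end), increases the sk_buff's len, and returns the value skb->tail had before the move.

It is used to append data at the tail of the buffer.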

/**
 *	skb_put - add data to a buffer
 *	@skb: buffer to use
 *	@len: amount of data to add
 *
 *	This function extends the used data area of the buffer. If this would
 *	exceed the total buffer size the kernel will panic. A pointer to the
 *	first byte of the extra data is returned.
 */
unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
{
	unsigned char *tmp = skb_tail_pointer(skb);    // tmp = skb->tail
	SKB_LINEAR_ASSERT(skb);
	skb->tail += len;
	skb->len  += len;
	if (unlikely(skb->tail > skb->end))
		skb_over_panic(skb, len, __builtin_return_address(0));   /* panic: the data would overrun the buffer */
	return tmp;
}
EXPORT_SYMBOL(skb_put);

static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
{
	return skb->tail;
}

unsigned char *__skb_put(struct sk_buff *skb, unsigned int len);

__skb_put() differs from skb_put() in that skb_put() checks that the data fits in the buffer, whereas __skb_put() performs no check.

static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
	unsigned char *tmp = skb_tail_pointer(skb);
	SKB_LINEAR_ASSERT(skb);
	skb->tail += len;
	skb->len  += len;
	return tmp;
}
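The put operation can be modelled in user space as follows. This is a sketch, not the kernel function: mini_skb and mini_put are hypothetical names, and where skb_put() would panic on overflow the model returns NULL instead.

```c
#include <stddef.h>

struct mini_skb {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
};

/*
 * Model of skb_put(): extend the data area at the tail by n bytes and
 * return the OLD tail, which is where the caller writes the new bytes.
 * Returns NULL (and changes nothing) when the tailroom is too small.
 */
unsigned char *mini_put(struct mini_skb *skb, unsigned int n)
{
	unsigned char *old_tail = skb->tail;

	if (n > (size_t)(skb->end - skb->tail))
		return NULL;
	skb->tail += n;
	skb->len += n;
	return old_tail;
}
```

Because the return value is the old tail, a caller can write the new bytes directly through the returned pointer, which is what the receive example later in this article does with skb_put(skb, length).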

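2.3.2 push operation:

unsigned char *skb_push(struct sk_buff *skb, unsigned int len);

skb_push() moves the data pointer up (toward head) by len bytes, i.e. it adds space for data at the start of the buffer, and therefore also increases the sk_buff's len.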

/**
 *	skb_push - add data to the start of a buffer
 *	@skb: buffer to use
 *	@len: amount of data to add
 *
 *	This function extends the used data area of the buffer at the buffer
 *	start. If this would exceed the total buffer headroom the kernel will
 *	panic. A pointer to the first byte of the extra data is returned.
 */
unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	if (unlikely(skb->data<skb->head))
		skb_under_panic(skb, len, __builtin_return_address(0));
	return skb->data;
}
EXPORT_SYMBOL(skb_push);

unsigned char *__skb_push(struct sk_buff *skb, unsigned int len);

static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	return skb->data;
}

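__skb_push() differs from skb_push() in the same way that __skb_put() differs from skb_put(): the double-underscore variant skips the bounds check.

The push operation adds space for packet data at the head of the buffer, whereas the put operation adds space at the tail.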
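A typical use of push is prepending headers on the transmit path. The sketch below models it in user space; mini_skb and mini_push are hypothetical names, the header sizes in the usage note (8, 20, and 14 bytes for UDP, IPv4 without options, and Ethernet) are used purely as an example, and the model returns NULL where skb_push() would panic.

```c
#include <stddef.h>

struct mini_skb {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
};

/*
 * Model of skb_push(): grow the data area toward head by n bytes and
 * return the new data pointer, where the caller writes the header.
 * Returns NULL (and changes nothing) when the headroom is too small.
 */
unsigned char *mini_push(struct mini_skb *skb, unsigned int n)
{
	if (n > (size_t)(skb->data - skb->head))
		return NULL;
	skb->data -= n;
	skb->len += n;
	return skb->data;
}
```

Starting from 42 bytes of headroom and a 10-byte payload, pushing 8, 20, and then 14 bytes consumes the headroom exactly and leaves len at 52.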

2.3.3 pull operation:

unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);

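skb_pull() moves the data pointer down by len bytes and decreases the skb's len; it is the inverse of skb_push().

It is mainly used when a lower-layer protocol hands a packet to the layer above, so that data points at the upper layer's protocol header.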

/**
 *	skb_pull - remove data from the start of a buffer
 *	@skb: buffer to use
 *	@len: amount of data to remove
 *
 *	This function removes data from the start of a buffer, returning
 *	the memory to the headroom. A pointer to the next data in the buffer
 *	is returned. Once the data has been pulled future pushes will overwrite
 *	the old data.
 */
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
{
	return skb_pull_inline(skb, len);
}
EXPORT_SYMBOL(skb_pull);

static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
{
	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
}

static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
	skb->len -= len;
	BUG_ON(skb->len < skb->data_len);
	return skb->data += len;
}
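The receive-side counterpart can be modelled the same way. In this user-space sketch, mini_skb and mini_pull are hypothetical names and only the two fields that pull touches are kept; like skb_pull(), the model refuses to pull more than len bytes.

```c
#include <stddef.h>

struct mini_skb {
	unsigned char *data; /* start of valid data for the current layer */
	unsigned int len;    /* length of valid data */
};

/*
 * Model of skb_pull(): drop n bytes from the start of the data area and
 * return the new data pointer. Returns NULL (and changes nothing) when
 * n exceeds the amount of valid data.
 */
unsigned char *mini_pull(struct mini_skb *skb, unsigned int n)
{
	if (n > skb->len)
		return NULL;
	skb->len -= n;
	skb->data += n;
	return skb->data;
}
```

For a 46-byte frame, pulling 14 bytes leaves data at the network-layer header, a further 20 bytes at the transport-layer header, and a further 8 bytes at a 4-byte payload.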

2.3.4 reserve operation

void skb_reserve(struct sk_buff *skb, int len);

skb_reserve() moves the data and tail pointers down together by len bytes.

It is used to reserve len bytes of headroom at the start of an empty buffer.

/**
 *	skb_reserve - adjust headroom
 *	@skb: buffer to alter
 *	@len: bytes to move
 *
 *	Increase the headroom of an empty &sk_buff by reducing the tail
 *	room. This is only allowed for an empty buffer.
 */
static inline void skb_reserve(struct sk_buff *skb, int len)
{
	skb->data += len;
	skb->tail += len;
}
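Ethernet drivers commonly reserve 2 bytes, because the Ethernet header is 14 bytes long and the extra 2 bytes place the IP header on a 4-byte boundary. The arithmetic is sketched below, assuming the buffer itself starts on an aligned address; ETH_HEADER_LEN, ip_header_offset, and is_aligned are hypothetical names for this illustration.

```c
#include <stddef.h>

#define ETH_HEADER_LEN 14 /* destination MAC + source MAC + EtherType */

/*
 * Offset of the IP header from the start of the buffer after reserving
 * `reserve` bytes and then storing an Ethernet frame.
 */
size_t ip_header_offset(size_t reserve)
{
	return reserve + ETH_HEADER_LEN;
}

/* Non-zero when `offset` is a multiple of `align`. */
int is_aligned(size_t offset, size_t align)
{
	return offset % align == 0;
}
```

With no reservation the IP header starts at offset 14, which is not a multiple of 4; with 2 bytes reserved it starts at offset 16.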

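3. Example:

This section follows Linux's receive path for a UDP packet to show how the sk_buff operations are used.

Most of this work is done by the kernel; the driver only has to handle the data link layer part.

Suppose the NIC receives a UDP packet. Linux processes it as follows:

3.1 After the NIC receives a UDP packet, the driver must create an sk_buff and a data buffer, copy all of the received data into the space that data points to, and set skb->mac_header to point at data.

At this point the valid data, starting at data, begins with the Ethernet header, i.e. the link-layer header.

Sample code:

/* Allocate a new socket buffer and data buffer */

skb = dev_alloc_skb(length + 2);
if (skb == NULL) {
    ... /* allocation failed */
    return;
}

skb_reserve(skb, 2);  /* reserve headroom so that the network-layer header is aligned */

/* Copy the data received by the hardware into the data buffer */
readwords(ioaddr, RX_FRAME_PORT, skb_put(skb, length), length >> 1);
if (length & 1) {
    skb->data[length - 1] = readword(ioaddr, RX_FRAME_PORT);
}

[Figure: buffer layout after these steps]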
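Step 3.1 can be replayed in user space to see where the pointers end up. This is a sketch under stated assumptions: mini_skb, mini_receive, and mini_free are hypothetical names, malloc() and memcpy() stand in for dev_alloc_skb() and the hardware read, and the built-in padding that dev_alloc_skb() adds is ignored.

```c
#include <stdlib.h>
#include <string.h>

struct mini_skb {
	unsigned char *head, *data, *tail, *end;
	unsigned char *mac_header;
	unsigned int len;
};

/* Model of step 3.1: wrap a received frame in a freshly allocated buffer. */
struct mini_skb *mini_receive(const unsigned char *frame, unsigned int length)
{
	struct mini_skb *skb = malloc(sizeof(*skb));

	if (!skb)
		return NULL;
	skb->head = malloc(length + 2);	/* frame plus 2 alignment bytes */
	if (!skb->head) {
		free(skb);
		return NULL;
	}
	skb->data = skb->tail = skb->head;
	skb->end = skb->head + length + 2;
	skb->len = 0;

	skb->data += 2;			/* reserve: align the IP header */
	skb->tail += 2;

	memcpy(skb->tail, frame, length); /* put: copy the frame at the old tail */
	skb->tail += length;
	skb->len += length;

	skb->mac_header = skb->data;	/* data starts at the Ethernet header */
	return skb;
}

/* Release both the data buffer and the descriptor. */
void mini_free(struct mini_skb *skb)
{
	if (skb) {
		free(skb->head);
		free(skb);
	}
}
```

After the call, the buffer has 2 bytes of headroom, no tailroom, len equal to the frame length, and mac_header equal to data.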