1 Cavium OCTEON
This document draws mainly on Cavium's programmer's guide and the CPU hardware documentation. It covers Cavium's packet receive/transmit flow and the performance tuning that the features of that flow make possible.
Start from the generic packet path: typically, when a NIC receives a packet it DMAs it to a designated memory location and then interrupts the CPU to fetch it; after several memory copies the packet reaches the protocol stack. To speed this up, some CPUs employ coprocessors to offload parts of the processing, and over the years this evolved into a class of NP (network processor) designs, led by Freescale, Cavium, and NetLogic, built on the RISC MIPS architecture. Compared with the open-ended workloads a general-purpose processor must handle, the workloads a network processor faces are bounded; network processors therefore grew, beyond the coprocessors that accelerate the receive/transmit path itself, dedicated engines for VPN/SSL-style encryption and decryption. On Cavium, the former mainly means the SSO (POW) and the PKO; the latter has its own processing engines.
2 OCTEON COPROCESSORS
Cavium's OCTEON is heavily optimized for networking, above all through its many coprocessors. Each coprocessor handles a specific task, which greatly simplifies the software, improves performance, and guarantees certain properties, such as packet ordering, in hardware. The figure shows the OCTEON architecture:
The main ones are: FPA, PIP/IPD, SSO, PKO, RAID_Engine, FAU.
2.1 FPA
The FPA (Free Pool Allocator) allocates the packet work entries, packet data buffers, and PKO command buffers used during packet receive and transmit.
There are three main operations on the FPA:
- buffer_allocate (synchronous): the core waits until either a usable buffer address or NULL is returned
- buffer_allocate (asynchronous): the core does not wait for the buffer address; it picks it up later from a designated location
- buffer_free (synchronous): returns the buffer address to the specified FPA pool
The FPA alloc/free operations:
static inline void *cvmx_fpa_alloc(uint64_t pool) {
    uint64_t address;
    address = cvmx_read_csr(CVMX_ADDR_DID(CVMX_FULL_DID(CVMX_OCT_DID_FPA, pool)));
    if (address)
        return cvmx_phys_to_ptr(address);
    return NULL;
}
static inline void cvmx_fpa_free(......) {
    newptr.u64 = cvmx_ptr_to_phys(ptr);
    newptr.sfilldidspace.didspace = CVMX_ADDR_DIDSPACE(CVMX_FULL_DID(CVMX_OCT_DID_FPA,
                                                                     pool));
    cvmx_write_io(newptr.u64, num_cache_lines);
}
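The alloc/free pair above can be modeled in plain C as a LIFO free list. The sketch below is a toy illustration of the pool semantics only (all names here are invented), not SDK code:

```c
#include <stddef.h>

#define POOL_CAPACITY 8

/* Toy model of an FPA pool: a LIFO stack of free buffer pointers. */
struct toy_fpa_pool {
    void *stack[POOL_CAPACITY];
    int   top;   /* number of free buffers currently in the pool */
};

/* Synchronous allocate: pop a buffer, or NULL when the pool is empty
 * (mirrors cvmx_fpa_alloc() returning NULL on an empty hardware pool). */
static void *toy_fpa_alloc(struct toy_fpa_pool *p)
{
    return p->top > 0 ? p->stack[--p->top] : NULL;
}

/* Free: push the buffer back so a later alloc can reuse it
 * (mirrors cvmx_fpa_free() handing the address back to a pool). */
static int toy_fpa_free(struct toy_fpa_pool *p, void *buf)
{
    if (p->top >= POOL_CAPACITY)
        return -1;   /* more frees than the pool can hold: caller bug */
    p->stack[p->top++] = buf;
    return 0;
}
```

From software's point of view the hardware pool behaves the same way: allocation yields NULL once the pool is drained, and freeing replenishes it.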
FPA initialization, in cvmx-helper-fpa.c:
int cvmx_helper_initialize_fpa(.....) {
    return __cvmx_helper_initialize_fpa(
        CVMX_FPA_PACKET_POOL, cvmx_fpa_packet_pool_size_get(), packet_buffers,
        CVMX_FPA_WQE_POOL, CVMX_FPA_WQE_POOL_SIZE, work_queue_entries,
        CVMX_FPA_OUTPUT_BUFFER_POOL, CVMX_FPA_OUTPUT_BUFFER_POOL_SIZE, pko_buffers,
        CVMX_FPA_TIMER_POOL, CVMX_FPA_TIMER_POOL_SIZE, tim_buffers,
        CVMX_FPA_DFA_POOL, CVMX_FPA_DFA_POOL_SIZE, dfa_buffers);
}
An important detail lives in __cvmx_helper_initialize_fpa_pool(). The original SDK did:
memory = cvmx_bootmem_alloc(buffer_size * buffers, align);
but this code does:
memory = KMALLOC(buffer_size * buffers + CVMX_CACHE_LINE_SIZE);
The difference is where cvmx_bootmem_alloc() and KMALLOC() get their memory from.
void *cvmx_bootmem_alloc_range(uint64_t size, uint64_t alignment,
                               uint64_t min_addr, uint64_t max_addr) {
    int64_t address;
    ......
    address = cvmx_bootmem_phy_alloc(size, min_addr, max_addr, alignment, 0);
    ......
    return cvmx_phys_to_ptr(address);
}
So address is a physical address; cvmx_bootmem_phy_alloc() is the low-level memory allocation primitive provided by the executive.
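The conversion back to a dereferenceable pointer is why the allocator ends with a phys-to-ptr call: on 64-bit cnMIPS the physical address is simply placed into the kernel's XKPHYS window (segment bits [63:62] = 2). A sketch of that mapping, with the cache-attribute bits omitted for simplicity (this is an illustration of the MIPS64 address map, not the SDK's exact macro):

```c
#include <stdint.h>

/* MIPS64 XKPHYS: segment bits [63:62] = 2 select the kernel physical
 * window; the low bits are the physical address itself. */
#define MIPS_SPACE_XKPHYS 2ULL

static uint64_t toy_phys_to_virt(uint64_t phys)
{
    return (MIPS_SPACE_XKPHYS << 62) | phys;
}

static uint64_t toy_virt_to_phys(uint64_t virt)
{
    return virt & ~(3ULL << 62);   /* strip the segment bits */
}
```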
The FPA configuration, in cvmx-config.h:
#define CVMX_CACHE_LINE_SIZE (128) //in bytes
#define CVMX_FPA_POOL_0_SIZE (17 * CVMX_CACHE_LINE_SIZE)
#define CVMX_FPA_POOL_1_SIZE (1 * CVMX_CACHE_LINE_SIZE)
#define CVMX_FPA_POOL_2_SIZE (8 * CVMX_CACHE_LINE_SIZE)
#define CVMX_FPA_POOL_3_SIZE (8 * CVMX_CACHE_LINE_SIZE)
#define CVMX_FPA_POOL_4_SIZE (17 * CVMX_CACHE_LINE_SIZE)
#define CVMX_FPA_PACKET_POOL (0) /* packet buffers */
#define CVMX_FPA_PACKET_POOL_SIZE CVMX_FPA_POOL_0_SIZE
int cvmx_fpa_setup_pool(......) {
cvmx_fpa_pool_info[pool].name = name;
cvmx_fpa_pool_info[pool].size = block_size;
cvmx_fpa_pool_info[pool].starting_element_count = num_blocks;
cvmx_fpa_pool_info[pool].base = buffer;
}
Pools 0, 1, and 2 are the ones mainly used: pool 0 holds the packet data of received packets, pool 1 holds the information produced by a simple hash over the packet header (the work queue entry), and pool 2 holds PKO command buffers. In one banfflite BUG, allocating a packet buffer from FPA pool 0 in software caused a crash; Cavium's suggested workaround was to allocate the packet buffer for transmitting that packet from pool 4 instead, since the hardware requires that receive use pool 0. The packet buffer size is therefore 2176 bytes.
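The 2176-byte figure follows directly from the cvmx-config.h values above: pool 0 (and pool 4) buffers are 17 cache lines of 128 bytes each. A one-line check:

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 128   /* CVMX_CACHE_LINE_SIZE from cvmx-config.h */

/* Pool 0 / pool 4 buffers are 17 cache lines long. */
static uint32_t packet_pool_buffer_size(void)
{
    return 17 * CACHE_LINE_SIZE;
}

/* Pool 1 (WQE) buffers are a single cache line. */
static uint32_t wqe_pool_buffer_size(void)
{
    return 1 * CACHE_LINE_SIZE;
}
```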
Besides the name and block_size, cvmx_fpa_setup_pool() mainly configures the pool's base address.
2.2 PIP/IPD
The main job of this coprocessor pair is packet input: when data arrives on an interface (e.g. an SGMII or XAUI port), PIP computes a 5-tuple hash over the packet header to produce the WQE plus the group, tag, and QoS class that the SSO needs, while IPD copies the packet's data portion into buffers allocated from the data FPA pool. Together they provide reception plus basic CRC/error checking and drop handling. OCTEON also offers an alternative input method: when PIP/IPD is not used, packets can be received the kernel-NAPI way, by polling from a cnMIPS core.
- Check packets, including L2/L3 header errors
- Provide congestion control; some packets can be dropped in PIP
- Create the WQE (Work Queue Entry)
- Decide the packet attributes handed to the SSO in the WQE (Group, QoS, Tag-Type, Tag-Value)
- Store received packets in PIP/IPD-internal buffers and in RAM
- Send the WQE to the SSO for scheduling
Typical configuration flow:
- One core calls cvmx_helper_initialize_packet_io_global()
- Every core calls cvmx_helper_initialize_packet_io_local()
- Use the cvmx_pip* and cvmx_ipd* functions to configure PIP/IPD
- Call cvmx_helper_ipd_and_packet_input_enable() to enable packet input
A rough walk through the PIP/IPD configuration code:
cvmx_helper_initialize_packet_io_global() {
......
result |= __cvmx_helper_global_setup_ipd();
cvmx_ipd_config(cvmx_fpa_packet_pool_size_get()/8,
CVMX_HELPER_FIRST_MBUFF_SKIP/8,
CVMX_HELPER_NOT_FIRST_MBUFF_SKIP/8,
(CVMX_HELPER_FIRST_MBUFF_SKIP+8) / 128,
/* The +8 is to account for the next ptr */
(CVMX_HELPER_NOT_FIRST_MBUFF_SKIP+8) / 128,
/* The +8 is to account for the next ptr */
CVMX_FPA_WQE_POOL,
CVMX_HELPER_IPD_DRAM_MODE,
1);
......
}
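The divisions in that call encode units: skip values and the MBUF size are programmed in 8-byte words, and the back-pointer in 128-byte cache lines, with 8 bytes added for the next-pointer. A sketch of those unit conversions, using illustrative byte values (FIRST_MBUFF_SKIP = 184 here is an assumed example, not a quoted SDK default):

```c
#include <stdint.h>

/* Illustrative byte values; the real ones come from cvmx-config.h. */
#define PACKET_POOL_SIZE 2176  /* 17 cache lines, see the FPA config */
#define FIRST_MBUFF_SKIP  184  /* assumed example value */

struct ipd_cfg {
    uint32_t mbuff_size_words;  /* IPD_PACKET_MBUFF_SIZE, 8-byte words */
    uint32_t first_skip_words;  /* IPD_1ST_MBUFF_SKIP, 8-byte words */
    uint32_t first_back_lines;  /* IPD_1ST_NEXT_PTR_BACK, 128B lines */
};

/* Mirror of the conversions done in the cvmx_ipd_config() call:
 * bytes -> 8-byte words for sizes/skips, and bytes (+8 for the next
 * pointer) -> cache lines for the back pointer. */
static struct ipd_cfg ipd_cfg_from_bytes(void)
{
    struct ipd_cfg c;
    c.mbuff_size_words = PACKET_POOL_SIZE / 8;
    c.first_skip_words = FIRST_MBUFF_SKIP / 8;
    c.first_back_lines = (FIRST_MBUFF_SKIP + 8) / 128;
    return c;
}
```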
cvmx_ipd_config(.....) {
......
first_skip.u64 = 0;
first_skip.s.skip_sz = first_mbuff_skip;
cvmx_write_csr(CVMX_IPD_1ST_MBUFF_SKIP, first_skip.u64);
not_first_skip.u64 = 0;
not_first_skip.s.skip_sz = not_first_mbuff_skip;
cvmx_write_csr(CVMX_IPD_NOT_1ST_MBUFF_SKIP, not_first_skip.u64);
size.u64 = 0;
size.s.mb_size = mbuff_size;
cvmx_write_csr(CVMX_IPD_PACKET_MBUFF_SIZE, size.u64);
ipd_ctl_reg.u64 = cvmx_read_csr(CVMX_IPD_CTL_STATUS);
ipd_ctl_reg.s.opc_mode = cache_mode;
ipd_ctl_reg.s.pbp_en = back_pres_enable_flag;
cvmx_write_csr(CVMX_IPD_CTL_STATUS, ipd_ctl_reg.u64);
/* Note: the example RED code that used to be here has been moved to
cvmx_helper_setup_red */
}
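The register-write pattern in cvmx_ipd_config() (zero u64, set a named field, write the CSR) relies on unions that overlay bitfields onto the 64-bit register value. A minimal sketch of the idiom; the field position and width here are illustrative (bitfield layout is compiler/endianness dependent), not the real CSR layout:

```c
#include <stdint.h>

/* Toy overlay in the style of the SDK's CSR types: the same 64 bits
 * viewed either raw (u64) or as named fields (s). */
typedef union {
    uint64_t u64;
    struct {
        uint64_t skip_sz  : 6;   /* legal values 0..32 need 6 bits */
        uint64_t reserved : 58;
    } s;
} toy_mbuff_skip_t;

static uint64_t build_skip_reg(uint32_t skip_words)
{
    toy_mbuff_skip_t r;
    r.u64 = 0;                   /* clear reserved bits first */
    r.s.skip_sz = skip_words;
    return r.u64;                /* the value that would go to the CSR */
}
```

Zeroing u64 before setting fields matters: it guarantees the reserved bits are written as zero rather than stack garbage.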
Some of the registers used in cvmx_ipd_config() deserve a closer look:
- IPD_1ST_MBUFF_SKIP: The number of eight-byte words from the top of the first MBUF that the IPD stores the next pointer. Legal values for this field are 0 to 32
- IPD_NOT_1ST_MBUFF_SKIP: The number of eight-byte words from the top of any MBUF that is not the first MBUF that the IPD writes the next-pointer
- IPD_1ST_NEXT_PTR_BACK: Used to find head of buffer from the next pointer header
- IPD_WQE_FPA_QUEUE: Specifies the FPA queue from which to fetch page-pointers for work-queue entries
- IPD_CTL_STATUS[OPC_MODE]: Selects the style of write to the L2C.
- 0 = all packet data and next-buffer pointers are written through to memory.
- 1 = all packet data and next-buffer pointers are written into the cache.
- 2 = the first aligned cache block holding the packet data and initial next-buffer pointer is written to the L2 cache. All remaining cache blocks are not written to the L2 cache.
- 3 = the first two aligned cache blocks holding the packet data and initial next-buffer pointer are written to the L2 cache. All remaining cache blocks are not written to the L2 cache.
These registers are a good starting point when debugging PIP/IPD.
The QoS drop policy configured at this stage:
- cvmx_helper_setup_red(): per-QoS RED congestion control, the same for all queues
- cvmx_helper_setup_red_queue(): per-queue thresholds, different for each queue
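Before looking at the setup code, the RED idea itself: between the two thresholds the drop probability ramps linearly with the averaged backlog; below pass_thresh everything is accepted, above drop_thresh everything is dropped. A generic textbook-RED sketch (the hardware uses its own fixed-point formulation over free-buffer counts; this only illustrates the ramp):

```c
#include <stdint.h>

/* Generic RED drop probability in percent, given an averaged queue
 * backlog and the two thresholds in the style of
 * cvmx_helper_setup_red(pass_thresh, drop_thresh). */
static uint32_t red_drop_percent(uint32_t avg_backlog,
                                 uint32_t pass_thresh,
                                 uint32_t drop_thresh)
{
    if (avg_backlog <= pass_thresh)
        return 0;      /* under the low watermark: accept everything */
    if (avg_backlog >= drop_thresh)
        return 100;    /* over the high watermark: drop everything */
    return 100 * (avg_backlog - pass_thresh)
               / (drop_thresh - pass_thresh);
}
```

Averaging the backlog (the avg_dly/prb_dly delays configured below) is what keeps RED from reacting to momentary bursts.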
Some QoS-related congestion settings:
int cvmx_helper_setup_red(int pass_thresh, int drop_thresh) {
......
cvmx_write_csr(CVMX_IPD_ON_BP_DROP_PKTX(0), 0);
#define IPD_RED_AVG_DLY 1000
#define IPD_RED_PRB_DLY 1000
......
red_delay.s.avg_dly = IPD_RED_AVG_DLY;
red_delay.s.prb_dly = IPD_RED_PRB_DLY;