Analysis of the Linux cgroup Mechanism: the cpuset Subsystem


Posted February 10, 2013

------------------------------------------

This article is original to this site; reposting is welcome.

When reposting, please credit the source: http://ericxiao.cublog.cn/

------------------------------------------

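1: Introduction

The previous article analyzed the cgroup framework; this one analyzes the cpuset subsystem. A cpuset binds tasks to CPUs and to memory nodes, and is configured from user space through the cgroup filesystem. For a full description of cpusets, see linux-2.6.28-rc7/Documentation/cpusets.txt. This article analyzes cpuset from its source code; all code shown is from linux-2.6.28.

2: cpuset data structures

Each cpuset is represented by a struct cpuset, shown below: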

struct cpuset {
    /* used to convert between a cgroup and its cpuset */
    struct cgroup_subsys_state css;

    /* cpuset flags */
    unsigned long flags;        /* "unsigned long" so bitops work */

    /* CPUs bound to this cpuset */
    cpumask_t cpus_allowed;     /* CPUs allowed to tasks in cpuset */

    /* memory nodes bound to this cpuset */
    nodemask_t mems_allowed;    /* Memory Nodes allowed to tasks */

    /* parent cpuset */
    struct cpuset *parent;      /* my parent */

    /*
     * Copy of global cpuset_mems_generation as of the most
     * recent time this cpuset changed its mems_allowed.
     * The global counter is incremented on every such change.
     */
    int mems_generation;

    /* used for memory_pressure */
    struct fmeter fmeter;       /* memory_pressure filter */

    /* partition number for rebuild_sched_domains() */
    int pn;

    /* for custom sched domain */
    int relax_domain_level;

    /* used for walking a cpuset hierarchy */
    struct list_head stack_list;
};

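There is no need to dig into every member now; each is explained when the code analysis reaches it. The point to note is that struct cpuset embeds a struct cgroup_subsys_state css, so the address of a struct cpuset can be derived from the address of its css member. The kernel therefore converts from a cgroup to its cpuset as follows: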

static inline struct cpuset *cgroup_cs(struct cgroup *cont)

{

return container_of(cgroup_subsys_state(cont, cpuset_subsys_id),

struct cpuset, css);

}

cgroup_subsys_state() is defined as:

static inline struct cgroup_subsys_state *cgroup_subsys_state(

struct cgroup *cgrp, int subsys_id)

{

return cgrp->subsys[subsys_id];

}

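That is, it fetches the subsystem's cgroup_subsys_state from the cgroup, then uses the container_of macro, which subtracts the member offset from the member address, to obtain the cpuset.

The kernel also has the following function: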

static inline struct cpuset *task_cs(struct task_struct *task)

{

return container_of(task_subsys_state(task, cpuset_subsys_id),

struct cpuset, css);

}

Likewise, it obtains the cgroup_subsys_state through the cgroup information held in struct task_struct and then derives the cpuset from it.
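The address arithmetic behind cgroup_cs() and task_cs() can be tried in user space. The following is a minimal sketch, not kernel code: the struct names, the macro definition and css_to_cpuset() are assumptions made for illustration, and the real kernel container_of() adds type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(): step back from a
 * member pointer to the structure that embeds it. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified versions of the kernel structures. */
struct subsys_state_demo {
    int refcnt;
};

struct cpuset_demo {
    unsigned long flags;            /* deliberately placed before css */
    struct subsys_state_demo css;   /* embedded, like cpuset.css */
    int relax_domain_level;
};

/* Same shape as cgroup_cs()/task_cs(): css pointer in, cpuset out. */
struct cpuset_demo *css_to_cpuset(struct subsys_state_demo *css)
{
    assert(css != NULL);
    return container_of(css, struct cpuset_demo, css);
}
```

Because css is not the first member here, the returned pointer differs from the css pointer by offsetof(struct cpuset_demo, css), which is exactly the adjustment the macro performs.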

3: cpuset initialization

cpuset initialization has three parts, as shown below:

asmlinkage void __init start_kernel(void)

{

……

cpuset_init_early();

……

cpuset_init();

……

}

start_kernel() -> kernel_init() -> cpuset_init_smp()

The sections below go through these initialization functions in order.

3.1: cpuset_init_early()

The code is:

int __init cpuset_init_early(void)

{

top_cpuset.mems_generation = cpuset_mems_generation++;

return 0;

}

The function simply initializes top_cpuset.mems_generation. Here we meet the global variable cpuset_mems_generation mentioned in the data structure section. It is defined as follows:

/*
 * Increment this integer everytime any cpuset changes its
 * mems_allowed value. Users of cpusets can track this generation
 * number, and avoid having to lock and reload mems_allowed unless
 * the cpuset they're using changes generation.
 *
 * A single, global generation is needed because cpuset_attach_task() could
 * reattach a task to a different cpuset, which must not have its
 * generation numbers aliased with those of that tasks previous cpuset.
 *
 * Generations are needed for mems_allowed because one task cannot
 * modify another's memory placement. So we must enable every task,
 * on every visit to __alloc_pages(), to efficiently check whether
 * its current->cpuset->mems_allowed has changed, requiring an update
 * of its current->mems_allowed.
 *
 * Since writes to cpuset_mems_generation are guarded by the cgroup lock
 * there is no need to mark it atomic.
 */
static int cpuset_mems_generation;

The comment is thorough. In short, the global cpuset_mems_generation exists for comparison: it is incremented every time any cpuset changes its mems_allowed, and the cpuset that changed stores the current value in cpuset->mems_generation. Each task records, in task->cpuset_mems_generation, the mems_generation of its cpuset as of the last time the two were synchronized. When the task allocates memory, it compares task->cpuset_mems_generation with cpuset->mems_generation; if they differ, the cpuset's mems_allowed has changed, so the task's own mems_allowed must be updated before allocating. For example:

alloc_pages() -> alloc_pages_current() -> cpuset_update_task_memory_state(). Let us trace cpuset_update_task_memory_state():

void cpuset_update_task_memory_state(void)

{

int my_cpusets_mem_gen;

struct task_struct *tsk = current;

struct cpuset *cs;

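/* Get the task's cpuset and read the mems_generation to compare against. */
/*
 * Note the difference between top_cpuset and other cpusets: top_cpuset
 * can be read without holding rcu_read_lock() because it is a static
 * structure that is never freed, so accessing it is always safe.
 */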

if (task_cs(tsk) == &top_cpuset) {

/* Don't need rcu for top_cpuset. It's never freed. */

my_cpusets_mem_gen = top_cpuset.mems_generation;

} else {

rcu_read_lock();

my_cpusets_mem_gen = task_cs(tsk)->mems_generation;

rcu_read_unlock();

}

/*
 * If the cpuset's mems_generation differs from the task's
 * cpuset_mems_generation, the cpuset's mems_allowed has changed,
 * so the task's mems_allowed must be updated.
 */

if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {

mutex_lock(&callback_mutex);

task_lock(tsk);

cs = task_cs(tsk); /* Maybe changed when task not locked */

/* update the task's mems_allowed */
guarantee_online_mems(cs, &tsk->mems_allowed);
/* update the task's cpuset_mems_generation */
tsk->cpuset_mems_generation = cs->mems_generation;
/* mirror the cpuset's spread flags into PF_SPREAD_PAGE and PF_SPREAD_SLAB */

if (is_spread_page(cs))

tsk->flags |= PF_SPREAD_PAGE;

else

tsk->flags &= ~PF_SPREAD_PAGE;

if (is_spread_slab(cs))

tsk->flags |= PF_SPREAD_SLAB;

else

tsk->flags &= ~PF_SPREAD_SLAB;

task_unlock(tsk);

mutex_unlock(&callback_mutex);

/* rebind the task's memory policy to the allowed memory nodes */
mpol_rebind_task(tsk, &tsk->mems_allowed);
}
}

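On each memory request, this function checks whether the mems_allowed of the task's cpuset has changed. If it has, the function updates the task's corresponding fields and finally rebinds the task's memory policy to the allowed memory nodes.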

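Here we meet two cpuset flags: CS_SPREAD_PAGE, tested by is_spread_page(), and CS_SPREAD_SLAB, tested by is_spread_slab(). What do they mean? As the code shows, they correspond to the task flags PF_SPREAD_PAGE and PF_SPREAD_SLAB. Their effect is to spread allocations evenly over the memory nodes the task may use when allocating page cache pages or slab objects such as inodes. For example:

__page_cache_alloc() -> cpuset_mem_spread_node():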

int cpuset_mem_spread_node(void)

{

int node;

node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);

if (node == MAX_NUMNODES)

node = first_node(current->mems_allowed);

current->cpuset_mem_spread_rotor = node;

return node;

}

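This shows how the node is chosen: current->cpuset_mem_spread_rotor holds the node used by the previous page cache allocation, and the function rotates through the task's allowed memory nodes in turn.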
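The rotation is easy to follow with a plain bitmask. The sketch below is a user-space approximation under stated assumptions: MAX_NODES_DEMO and the *_demo helpers are hypothetical, and an unsigned int replaces nodemask_t.

```c
#include <assert.h>

#define MAX_NODES_DEMO 8   /* hypothetical node count for the sketch */

/* Next set bit strictly after `node`, or MAX_NODES_DEMO if there is none. */
int next_node_demo(int node, unsigned mask)
{
    for (int n = node + 1; n < MAX_NODES_DEMO; n++)
        if (mask & (1u << n))
            return n;
    return MAX_NODES_DEMO;
}

/* Lowest set bit, or MAX_NODES_DEMO for an empty mask. */
int first_node_demo(unsigned mask)
{
    return next_node_demo(-1, mask);
}

/* Pick the node for the next allocation and advance the rotor. */
int spread_node_demo(int *rotor, unsigned mems_allowed)
{
    assert(mems_allowed != 0);
    int node = next_node_demo(*rotor, mems_allowed);
    if (node == MAX_NODES_DEMO)      /* ran off the end: wrap around */
        node = first_node_demo(mems_allowed);
    *rotor = node;
    return node;
}
```

With nodes 1 and 3 allowed and the rotor starting at 0, successive calls alternate between node 1 and node 3.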

Back in cpuset_update_task_memory_state(), let us look at the helper functions it calls.

guarantee_online_mems() computes the task's new mems_allowed. The code is:

static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)

{

while (cs && !nodes_intersects(cs->mems_allowed,

node_states[N_HIGH_MEMORY]))

cs = cs->parent;

if (cs)

nodes_and(*pmask, cs->mems_allowed,

node_states[N_HIGH_MEMORY]);

else

*pmask = node_states[N_HIGH_MEMORY];

BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));

}

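In the kernel, all online memory nodes are recorded in node_states[N_HIGH_MEMORY]. This function finds the online memory nodes that the cpuset allows, walking up to the nearest ancestor that has any if the cpuset itself has none. What does "online" mean? Memory on servers can be hot-plugged, that is, added and removed at run time.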
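The walk up the hierarchy can be sketched in user space as follows. The struct and online_mems_demo() are hypothetical names, an unsigned int replaces nodemask_t, and the online mask is passed in rather than read from node_states[N_HIGH_MEMORY].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified cpuset: a node bitmask and a parent link. */
struct cpuset_demo {
    unsigned mems_allowed;
    struct cpuset_demo *parent;   /* NULL for the top cpuset */
};

/* Return the online nodes usable from `cs`: its own allowed nodes masked by
 * `online`, or those of the nearest ancestor that still has an online node,
 * or every online node if no ancestor qualifies. */
unsigned online_mems_demo(const struct cpuset_demo *cs, unsigned online)
{
    assert(online != 0);          /* the system always has some memory */
    while (cs && !(cs->mems_allowed & online))
        cs = cs->parent;
    return cs ? (cs->mems_allowed & online) : online;
}
```

The result is never empty, which is the property the BUG_ON() at the end of the kernel function asserts.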

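The other important helper is mpol_rebind_task(), which rebinds the task to its allowed memory nodes by remapping the policy's old node set onto the new one. This belongs to mempolicy and is not covered in detail here; the code is short and easy to trace.

With the role of the global cpuset_mems_generation covered, let us move on to the next initialization function.

3.2: cpuset_init()

The code for cpuset_init() is: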

int __init cpuset_init(void)

{

int err = 0;

/*
 * Initialize top_cpuset's cpus_allowed and mems_allowed
 * to all CPUs and all memory nodes in the system.
 */

cpus_setall(top_cpuset.cpus_allowed);

nodes_setall(top_cpuset.mems_allowed);

/* initialize top_cpuset.fmeter */
fmeter_init(&top_cpuset.fmeter);
/*
 * top_cpuset's mems_allowed has changed, so take a new
 * value from cpuset_mems_generation.
 */
top_cpuset.mems_generation = cpuset_mems_generation++;
/* set CS_SCHED_LOAD_BALANCE on top_cpuset */
set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
/* set top_cpuset.relax_domain_level */
top_cpuset.relax_domain_level = -1;
/* register the cpuset filesystem */

err = register_filesystem(&cpuset_fs_type);

if (err < 0)

return err;

/* number of cpusets in the system */

number_of_cpusets = 1;

return 0;

}

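This mainly initializes the top-level cpuset. We meet a few more flags here, explained in turn:

CS_SCHED_LOAD_BALANCE:

The load-balancing flag for the CPUs in a cpuset. If it is set, the scheduler balances load across the CPUs of that cpuset.

relax_domain_level:

A scheduling domain attribute that controls how far the scheduler searches for an idle CPU when balancing load. It takes the following values: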

-1 : no request. use system default or follow request of others.

0 : no search.

1 : search siblings (hyperthreads in a core).

2 : search cores in a package.

3 : search cpus in a node [= system wide on non-NUMA system]

( 4 : search nodes in a chunk of node [on NUMA system] )

( 5 : search system wide [on NUMA system] )

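fmeter also appears in this function; it is analyzed later, when we reach it.

In addition, cpuset has its own filesystem type, kept for compatibility with the cpuset interface that predates cgroup. Let us trace it: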

static struct file_system_type cpuset_fs_type = {

.name = "cpuset",

.get_sb = cpuset_get_sb,

};

The code for cpuset_get_sb() is:

static int cpuset_get_sb(struct file_system_type *fs_type,

int flags, const char *unused_dev_name,

void *data, struct vfsmount *mnt)

{

struct file_system_type *cgroup_fs = get_fs_type("cgroup");

int ret = -ENODEV;

if (cgroup_fs) {

char mountopts[] =

"cpuset,noprefix,"

"release_agent=/sbin/cpuset_release_agent";

ret = cgroup_fs->get_sb(cgroup_fs, flags,

unused_dev_name, mountopts, mnt);

put_filesystem(cgroup_fs);

}

return ret;

}

So it simply mounts the cgroup filesystem with the options cpuset,noprefix,release_agent=/sbin/cpuset_release_agent.

That is equivalent to:

mount -t cgroup -o cpuset,noprefix,release_agent=/sbin/cpuset_release_agent cgroup mount_dir

where mount_dir is the mount point.

3.3: cpuset_init_smp()

The code is:

void __init cpuset_init_smp(void)

{

top_cpuset.cpus_allowed = cpu_online_map;

top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];

hotcpu_notifier(cpuset_track_online_cpus, 0);

hotplug_memory_notifier(cpuset_track_online_nodes, 10);

}

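It updates cpus_allowed and mems_allowed to the online CPUs and online memory nodes, then registers hooks for CPU hotplug and memory hotplug.

Before analyzing these two hooks, note that some of the helpers they call are core cpuset functions, and many later code paths call them as well. Understanding this code is therefore key to understanding the whole cpuset subsystem.

The CPU hotplug hook is cpuset_track_online_cpus():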

static int cpuset_track_online_cpus(struct notifier_block *unused_nb,

unsigned long phase, void *unused_cpu)

{

struct sched_domain_attr *attr;

cpumask_t *doms;

int ndoms;

/* only CPU_ONLINE, CPU_ONLINE_FROZEN, CPU_DEAD and CPU_DEAD_FROZEN are handled */

switch (phase) {

case CPU_ONLINE:

case CPU_ONLINE_FROZEN:

case CPU_DEAD:

case CPU_DEAD_FROZEN:

break;

default:

return NOTIFY_DONE;

}

/* update top_cpuset.cpus_allowed */
cgroup_lock();
top_cpuset.cpus_allowed = cpu_online_map;
scan_for_empty_cpusets(&top_cpuset);
/* regenerate the sched domains described by the cpusets */
ndoms = generate_sched_domains(&doms, &attr);
cgroup_unlock();
/* Have scheduler rebuild the domains */

partition_sched_domains(ndoms, doms, attr);

return NOTIFY_OK;

}

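This function handles CPU hotplug: when CPUs are added to or removed from the system, the CPU information in the cpusets must be corrected. First, as noted earlier, top_cpuset contains every CPU and memory node, so its CPU information is updated. Second, the change may leave some cpusets with no CPUs at all, so the tasks in those empty cpusets must be dealt with. Finally, the scheduling domain information must be updated. The helpers involved are analyzed below.

3.3.1: scan_for_empty_cpusets()

The first one is scan_for_empty_cpusets(). It scans for empty cpusets and moves the tasks of each empty cpuset to its nearest non-empty ancestor. The code is: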

static void scan_for_empty_cpusets(struct cpuset *root)

{

LIST_HEAD(queue);

struct cpuset *cp; /* scans cpusets being updated */

struct cpuset *child; /* scans child cpusets of cp */

struct cgroup *cont;

nodemask_t oldmems;

list_add_tail((struct list_head *)&root->stack_list, &queue);

/* walk every cpuset */

while (!list_empty(&queue)) {

cp = list_first_entry(&queue, struct cpuset, stack_list);

list_del(queue.next);

list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {

child = cgroup_cs(cont);

list_add_tail(&child->stack_list, &queue);

}

/* Continue past cpusets with all cpus, mems online */

if (cpus_subset(cp->cpus_allowed, cpu_online_map) &&

nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))

continue;

/* remember the previous mems_allowed */
oldmems = cp->mems_allowed;
/* Remove offline cpus and mems from this cpuset. */

mutex_lock(&callback_mutex);

cpus_and(cp->cpus_allowed, cp->cpus_allowed, cpu_online_map);

nodes_and(cp->mems_allowed, cp->mems_allowed,

node_states[N_HIGH_MEMORY]);

mutex_unlock(&callback_mutex);

/* Move tasks from the empty cpuset to a parent */
if (cpus_empty(cp->cpus_allowed) ||
    nodes_empty(cp->mems_allowed))
    remove_tasks_in_empty_cpuset(cp);
else {
    /* otherwise refresh the CPU and memory masks of the cpuset's tasks */
    update_tasks_cpumask(cp, NULL);
    update_tasks_nodemask(cp, &oldmems);
}
}
}

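Understanding this function requires knowing the cgroup architecture; see the companion article on this site, "Analysis of the Linux cgroup Mechanism: the Framework". The cpuset->stack_list member comes into play here: it links the cpuset into a temporary list. As the code shows, the walk proceeds level by level, downward from top_cpuset.

For each cpuset visited:

1: If the cpuset's CPUs and memory nodes are all valid (subsets of cpu_online_map and node_states[N_HIGH_MEMORY] respectively), nothing needs updating.

2: Otherwise, offline CPUs and memory nodes are dropped by intersecting with cpu_online_map and node_states[N_HIGH_MEMORY].

3: If the adjusted cpuset has no CPUs or no memory nodes left, its attached tasks must be moved. remove_tasks_in_empty_cpuset() does this.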

4: If the adjusted cpuset still has both CPUs and memory nodes, its tasks still have resources to run on, and only their mems_allowed and cpus_allowed masks need updating. update_tasks_cpumask() and update_tasks_nodemask() do this.
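The four steps above can be sketched as a user-space breadth-first walk. This is only an illustration under stated assumptions: the tree node, the fixed-size queue and both function names are hypothetical, bitmasks stand in for cpumask_t and nodemask_t, a task counter stands in for the attached tasks, and step 4 is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN_DEMO 4
#define MAX_QUEUE_DEMO 32

/* Hypothetical simplified cpuset tree node. */
struct cpuset_demo {
    unsigned cpus_allowed;
    unsigned mems_allowed;
    int ntasks;                                    /* tasks attached here */
    struct cpuset_demo *parent;
    struct cpuset_demo *child[MAX_CHILDREN_DEMO];  /* unused slots are NULL */
};

/* Step 3: hand the tasks to the nearest ancestor that still has resources.
 * The root is assumed to keep at least one CPU and one node. */
void move_tasks_up_demo(struct cpuset_demo *cs)
{
    struct cpuset_demo *p = cs->parent;
    while (p->cpus_allowed == 0 || p->mems_allowed == 0)
        p = p->parent;
    p->ntasks += cs->ntasks;
    cs->ntasks = 0;
}

/* Breadth-first walk from `root`; returns how many cpusets became empty. */
int scan_for_empty_demo(struct cpuset_demo *root, unsigned cpus_online,
                        unsigned mems_online)
{
    struct cpuset_demo *queue[MAX_QUEUE_DEMO];
    int head = 0, tail = 0, emptied = 0;

    queue[tail++] = root;
    while (head < tail) {
        struct cpuset_demo *cp = queue[head++];

        for (int i = 0; i < MAX_CHILDREN_DEMO && cp->child[i]; i++) {
            assert(tail < MAX_QUEUE_DEMO);
            queue[tail++] = cp->child[i];
        }
        /* Step 1: nothing offline in this cpuset, leave it alone. */
        if (!(cp->cpus_allowed & ~cpus_online) &&
            !(cp->mems_allowed & ~mems_online))
            continue;
        /* Step 2: drop whatever went offline. */
        cp->cpus_allowed &= cpus_online;
        cp->mems_allowed &= mems_online;
        /* Steps 3 and 4: empty cpusets give up their tasks; the kernel
         * would refresh the per-task masks of the others. */
        if (cp->cpus_allowed == 0 || cp->mems_allowed == 0) {
            if (cp->ntasks > 0 && cp->parent)
                move_tasks_up_demo(cp);
            emptied++;
        }
    }
    return emptied;
}
```

Children are queued before their parent is processed, so the whole tree is visited even when an ancestor turns out to be empty.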

Next, the helpers called from scan_for_empty_cpusets().

3.3.1.1: remove_tasks_in_empty_cpuset()

The code is:

static void remove_tasks_in_empty_cpuset(struct cpuset *cs)

{

struct cpuset *parent;

/*
 * The cgroup's css_sets list is in use if there are tasks
 * in the cpuset; the list is empty if there are none;
 * the cs->css.refcnt seems always 0.
 */

if (list_empty(&cs->css.cgroup->css_sets))

return;

/*
 * Find its next-highest non-empty parent, (top cpuset
 * has online cpus, so can't be empty).
 */

parent = cs->parent;

while (cpus_empty(parent->cpus_allowed) ||

nodes_empty(parent->mems_allowed))

parent = parent->parent;

/* move the tasks of this cpuset to parent */

move_member_tasks_to_cpuset(cs, parent);

}

If the cpuset has attached tasks but no usable resources, the function walks up to an ancestor that has resources and moves the tasks there.

The code for move_member_tasks_to_cpuset() is:

static void move_member_tasks_to_cpuset(struct cpuset *from, struct cpuset *to)

{

struct cpuset_hotplug_scanner scan;

scan.scan.cg = from->css.cgroup;

scan.scan.test_task = NULL; /* select all tasks in cgroup */

scan.scan.process_task = cpuset_do_move_task;

scan.scan.heap = NULL;

scan.to = to->css.cgroup;

if (cgroup_scan_tasks(&scan.scan))

printk(KERN_ERR "move_member_tasks_to_cpuset: "
       "cgroup_scan_tasks failed\n");

}

This uses another cgroup interface, cgroup_scan_tasks(), which is analyzed in detail later. In brief, it iterates over the tasks attached to a cgroup and calls the callback scan.scan.process_task for each one; in the code above that callback is cpuset_do_move_task():

static void cpuset_do_move_task(struct task_struct *tsk,

struct cgroup_scanner *scan)

{

struct cpuset_hotplug_scanner *chsp;

chsp = container_of(scan, struct cpuset_hotplug_scanner, scan);

cgroup_attach_task(chsp->to, tsk);

}

This function calls cgroup_attach_task() to attach the task to chsp->to, which is the parent found in the code above.

3.3.1.2: update_tasks_cpumask()

This function updates the CPU information of every task in a cpuset. The code is:

static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)

{

struct cgroup_scanner scan;

/*
 * Walk every task in the cpuset and call
 * cpuset_change_cpumask() on each of them.
 */

scan.cg = cs->css.cgroup;

scan.test_task = cpuset_test_cpumask;

scan.process_task = cpuset_change_cpumask;

scan.heap = heap;

cgroup_scan_tasks(&scan);

}

cgroup_scan_tasks() was described above; here it calls cpuset_change_cpumask() on every task in the cpuset:

static void cpuset_change_cpumask(struct task_struct *tsk,

struct cgroup_scanner *scan)

{

set_cpus_allowed_ptr(tsk, &((cgroup_cs(scan->cg))->cpus_allowed));

}

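The function simply sets the task's cpus_allowed field; the next time the task is scheduled, it runs on one of the allowed CPUs.

3.3.1.3: update_tasks_nodemask()

This function updates the memory node information of the tasks in a cpuset. The code is: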

static int update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem)

{

struct task_struct *p;

struct mm_struct **mmarray;

int i, n, ntasks;

int migrate;

int fudge;

struct cgroup_iter it;

int retval;

cpuset_being_rebound = cs; /* causes mpol_dup() rebind */

fudge = 10; /* spare mmarray[] slots */

fudge += cpus_weight(cs->cpus_allowed); /* imagine one fork-bomb/cpu */

retval = -ENOMEM;

/*
 * Allocate mmarray[] to hold mm reference for each task
 * in cpuset cs. Can't kmalloc GFP_KERNEL while holding
 * tasklist_lock. We could use GFP_ATOMIC, but with a
 * few more lines of code, we can retry until we get a big
 * enough mmarray[] w/o using GFP_ATOMIC.
 */
/*
 * Count the tasks in the cpuset; fudge is added in case new
 * tasks are forked while we work, so the array is not too small.
 */

while (1) {

ntasks = cgroup_task_count(cs->css.cgroup); /* guess */

ntasks += fudge;

mmarray = kmalloc(ntasks * sizeof(*mmarray), GFP_KERNEL);

if (!mmarray)

goto done;

read_lock(&tasklist_lock); /* block fork */

if (cgroup_task_count(cs->css.cgroup) <= ntasks)

break; /* got enough */

read_unlock(&tasklist_lock); /* try again */

kfree(mmarray);

}

n = 0;

/* Load up mmarray[] with mm reference for each task in cpuset. */
/* n counts the mm references collected */

cgroup_iter_start(cs->css.cgroup, &it);

while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {

struct mm_struct *mm;

if (n >= ntasks) {

printk(KERN_WARNING
       "Cpuset mempolicy rebind incomplete.\n");

break;

}

mm = get_task_mm(p);

if (!mm)

continue;

mmarray[n++] = mm;

}

cgroup_iter_end(cs->css.cgroup, &it);

read_unlock(&tasklist_lock);

/*
 * Now that we've dropped the tasklist spinlock, we can
 * rebind the vma mempolicies of each mm in mmarray[] to their
 * new cpuset, and release that mm. The mpol_rebind_mm()
 * call takes mmap_sem, which we couldn't take while holding
 * tasklist_lock. Forks can happen again now - the mpol_dup()
 * cpuset_being_rebound check will catch such forks, and rebind
 * their vma mempolicies too. Because we still hold the global
 * cgroup_mutex, we know that no other rebind effort will
 * be contending for the global variable cpuset_being_rebound.
 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
 * is idempotent. Also migrate pages in each mm to new nodes.
 */
/*
 * Update each task's memory policy. If CS_MEMORY_MIGRATE is set,
 * the task's pages are also moved from the old nodes to the new ones.
 */

migrate = is_memory_migrate(cs);

for (i = 0; i < n; i++) {

struct mm_struct *mm = mmarray[i];

mpol_rebind_mm(mm, &cs->mems_allowed);

if (migrate)

cpuset_migrate_mm(mm, oldmem, &cs->mems_allowed);

mmput(mm);

}

/* We're done rebinding vmas to this cpuset's new mems_allowed. */

kfree(mmarray);

cpuset_being_rebound = NULL;

retval = 0;

done:

return retval;

}

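With the comments, this code should be fairly easy to follow. One new item appears here: cgroup_iter. It is also the iterator used inside cgroup_scan_tasks(), and it is covered when that function is analyzed later.

The function also uses some mempolicy interfaces, such as mpol_rebind_mm() and cpuset_migrate_mm() -> do_migrate_pages(). They are not analyzed here; interested readers can read the source.

Finally, the function uses the global cpuset_being_rebound, which mpol_dup() consults when copying the current task's memory policy.

Back in cpuset_track_online_cpus(): scan_for_empty_cpusets() is done, so let us turn to the remaining helpers.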

3.3.2: generate_sched_domains()

This function derives the scheduling domain information from the cpusets and returns it through its two output parameters:

static int generate_sched_domains(cpumask_t **domains,

struct sched_domain_attr **attributes)

{

LIST_HEAD(q); /* queue of cpusets to be scanned */

struct cpuset *cp; /* scans q */

struct cpuset **csa; /* array of all cpuset ptrs */

int csn; /* how many cpuset ptrs in csa so far */

int i, j, k; /* indices for partition finding loops */

cpumask_t *doms; /* resulting partition; i.e. sched domains */

struct sched_domain_attr *dattr; /* attributes for custom domains */

int ndoms = 0; /* number of sched domains in result */

int nslot; /* next empty doms[] cpumask_t slot */

doms = NULL;

dattr = NULL;

csa = NULL;

/* Special case for the 99% of systems with one, full, sched domain */
/*
 * If top_cpuset has CS_SCHED_LOAD_BALANCE set, load is
 * balanced across all CPUs in the system.
 */

if (is_sched_load_balance(&top_cpuset)) {

doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);

if (!doms)

goto done;

dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);

if (dattr) {

*dattr = SD_ATTR_INIT;

/* take the largest relax_domain_level of top_cpuset and its descendants */

update_domain_attr_tree(dattr, &top_cpuset);

}

/* the top-level cpus_allowed */

*doms = top_cpuset.cpus_allowed;

ndoms = 1;

goto done;

}

/* array of cpuset pointers */

csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);

if (!csa)

goto done;

csn = 0;

/*
 * Walk the cpuset tree and store every cpuset that has
 * CS_SCHED_LOAD_BALANCE set in csa[]; csn counts the entries.
 */

list_add(&top_cpuset.stack_list, &q);

while (!list_empty(&q)) {

struct cgroup *cont;

struct cpuset *child; /* scans child cpusets of cp */

cp = list_first_entry(&q, struct cpuset, stack_list);

list_del(q.next);

if (cpus_empty(cp->cpus_allowed))

continue;

/*
 * All child cpusets contain a subset of the parent's cpus, so
 * just skip them, and then we call update_domain_attr_tree()
 * to calc relax_domain_level of the corresponding sched
 * domain.
 */

if (is_sched_load_balance(cp)) {

csa[csn++] = cp;

continue;

}

list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
    child = cgroup_cs(cont);
    list_add_tail(&child->stack_list, &q);
}
}
/* set each cpuset's pn to its index in csa[] */

for (i = 0; i < csn; i++)

csa[i]->pn = i;

ndoms = csn;

restart:

/* Find the best partition (set of sched domains) */
/*
 * Give overlapping cpusets in csa[] the same pn. ndoms ends up
 * as the number of groups of mutually non-overlapping cpusets.
 */

for (i = 0; i < csn; i++) {

struct cpuset *a = csa[i];

int apn = a->pn;

for (j = 0; j < csn; j++) {

struct cpuset *b = csa[j];

int bpn = b->pn;

if (apn != bpn && cpusets_overlap(a, b)) {

for (k = 0; k < csn; k++) {

struct cpuset *c = csa[k];

if (c->pn == bpn)

c->pn = apn;

}

ndoms--; /* one less element */
goto restart;
}
}
}
/*
 * Now we know how many domains to create.
 * Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
 */
/*
 * There is one sched domain per group of non-overlapping
 * cpusets that have CS_SCHED_LOAD_BALANCE set.
 */

doms = kmalloc(ndoms * sizeof(cpumask_t), GFP_KERNEL);

if (!doms)

goto done;

/*
 * The rest of the code, including the scheduler, can deal with
 * dattr==NULL case. No need to abort if alloc fails.
 */
/* one sched_domain_attr per sched domain */
dattr = kmalloc(ndoms * sizeof(struct sched_domain_attr), GFP_KERNEL);
/*
 * Fill doms and dattr: the union of cpus_allowed over each
 * partition, and the largest relax_domain_level below it.
 */

for (nslot = 0, i = 0; i < csn; i++) {

struct cpuset *a = csa[i];

cpumask_t *dp;

int apn = a->pn;

if (apn < 0) {

/* Skip completed partitions */

continue;

}

dp = doms + nslot;

/*
 * nslot should never reach ndoms here: ndoms is the number of
 * sched domains, and nslot, which counts from 0, is the index
 * of the next partition slot to fill.
 */

if (nslot == ndoms) {

static int warnings = 10;

if (warnings) {

printk(KERN_WARNING
       "rebuild_sched_domains confused:"
       " nslot %d, ndoms %d, csn %d, i %d,"
       " apn %d\n",
       nslot, ndoms, csn, i, apn);

warnings--;

}

continue;

}

cpus_clear(*dp);

if (dattr)

*(dattr + nslot) = SD_ATTR_INIT;

for (j = i; j < csn; j++) {

struct cpuset *b = csa[j];

if (apn == b->pn) {

cpus_or(*dp, *dp, b->cpus_allowed);

if (dattr)

update_domain_attr_tree(dattr + nslot, b);

/* Done with this partition */
b->pn = -1;
}
}
nslot++;
}
BUG_ON(nslot != ndoms);

done:

kfree(csa);

/*
 * Fallback to the default domain if kmalloc() failed.
 * See comments in partition_sched_domains().
 */

if (doms == NULL)

ndoms = 1;

*domains = doms;

*attributes = dattr;

return ndoms;

}

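The function is straightforward given the added comments: it collects the load-balanced cpusets, merges those whose CPUs overlap into partitions, and emits one scheduling domain per partition.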
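The partition-merging core of generate_sched_domains() can be isolated in a short user-space sketch. The names below (build_domains_demo, MAX_SETS_DEMO) are hypothetical, unsigned bitmasks replace cpumask_t, and the sched_domain_attr handling is left out.

```c
#include <assert.h>

#define MAX_SETS_DEMO 16

/* Merge overlapping CPU masks into partitions, the way the restart loop
 * merges pn values. Writes one merged mask per partition into doms[] and
 * returns the number of partitions. */
int build_domains_demo(const unsigned *masks, int csn, unsigned *doms)
{
    int pn[MAX_SETS_DEMO];
    int ndoms = csn, nslot = 0;

    assert(csn >= 0 && csn <= MAX_SETS_DEMO);
    for (int i = 0; i < csn; i++)
        pn[i] = i;                 /* every set starts in its own partition */

restart:
    for (int i = 0; i < csn; i++) {
        for (int j = 0; j < csn; j++) {
            if (pn[i] != pn[j] && (masks[i] & masks[j])) {
                int from = pn[j], to = pn[i];
                for (int k = 0; k < csn; k++)
                    if (pn[k] == from)
                        pn[k] = to;
                ndoms--;           /* two partitions became one */
                goto restart;
            }
        }
    }

    for (int i = 0; i < csn; i++) {
        int apn = pn[i];
        if (apn < 0)
            continue;              /* already folded into an earlier slot */
        doms[nslot] = 0;
        for (int j = i; j < csn; j++) {
            if (pn[j] == apn) {
                doms[nslot] |= masks[j];
                pn[j] = -1;        /* done with this member */
            }
        }
        nslot++;
    }
    assert(nslot == ndoms);
    return ndoms;
}
```

Restarting the scan after every merge keeps the logic simple at the cost of repeated passes, which is acceptable because the number of cpusets is small.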

That completes the analysis of cpuset initialization.

4: cpuset operations

Next come the operations cpuset implements. The cpuset subsystem is declared as:

struct cgroup_subsys cpuset_subsys = {

.name = "cpuset",

.create = cpuset_create,

.destroy = cpuset_destroy,

.can_attach = cpuset_can_attach,

.attach = cpuset_attach,

.populate = cpuset_populate,

.post_clone = cpuset_post_clone,

.subsys_id = cpuset_subsys_id,

.early_init = 1,

};

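From this structure, together with the cgroup framework analyzed earlier, the operation flow follows.

4.1: Creating a cgroup

As shown earlier, creating a cgroup calls the subsystem's create callback, which for cpuset is cpuset_create():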

static struct cgroup_subsys_state *cpuset_create(

struct cgroup_subsys *ss,

struct cgroup *cont)

{

struct cpuset *cs;

struct cpuset *parent;

/* for the root cgroup, simply return top_cpuset */

if (!cont->parent) {

/* This is early initialization for the top cgroup */

top_cpuset.mems_generation = cpuset_mems_generation++;

return &top_cpuset.css;

}

/* get the parent's cpuset */
parent = cgroup_cs(cont->parent);
/* allocate and initialize a cpuset */

cs = kmalloc(sizeof(*cs), GFP_KERNEL);

if (!cs)

return ERR_PTR(-ENOMEM);

cpuset_update_task_memory_state();

cs->flags = 0;

if (is_spread_page(parent))

set_bit(CS_SPREAD_PAGE, &cs->flags);

if (is_spread_slab(parent))

set_bit(CS_SPREAD_SLAB, &cs->flags);

set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);

/* clear cpus_allowed and mems_allowed */

cpus_clear(cs->cpus_allowed);

nodes_clear(cs->mems_allowed);

cs->mems_generation = cpuset_mems_generation++;

fmeter_init(&cs->fmeter);

cs->relax_domain_level = -1;

/* set the parent */

cs->parent = parent;

number_of_cpusets++;

return &cs->css ;

}

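The code above is simple. It returns cpuset->css, so the owning cpuset can later be found from the cgroup_subsys_state.

Note also that a newly created cpuset has empty mems_allowed and cpus_allowed, and that relax_domain_level takes its default value of -1.

4.2: Attaching a task

When a task is attached to a cgroup, subsys->can_attach() is called first to decide whether the attachment is allowed; a return value of 0 means it is. If so, subsys->attach() is then called to carry it out. Both callbacks are analyzed below.

4.2.1: cpuset_can_attach()

The code is: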

static int cpuset_can_attach(struct cgroup_subsys *ss,

struct cgroup *cont, struct task_struct *tsk)

{

struct cpuset *cs = cgroup_cs(cont);

/* a cpuset with no CPUs or no memory nodes cannot run tasks, so refuse */

if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))

return -ENOSPC;

/*
 * A task bound to specific CPUs (PF_THREAD_BOUND) can only be
 * attached if its CPU set equals the cpuset's.
 */

if (tsk->flags & PF_THREAD_BOUND) {

cpumask_t mask;

mutex_lock(&callback_mutex);

mask = cs->cpus_allowed;

mutex_unlock(&callback_mutex);

if (!cpus_equal(tsk->cpus_allowed, mask))

return -EINVAL;

}

/* generic security check */

return security_task_setscheduler(tsk, 0, NULL);

}

This function needs no further explanation.

4.2.2: cpuset_attach()

The code is:

static void cpuset_attach(struct cgroup_subsys *ss,

struct cgroup *cont, struct cgroup *oldcont,

struct task_struct *tsk)

{

cpumask_t cpus;

nodemask_t from, to;

struct mm_struct *mm;

struct cpuset *cs = cgroup_cs(cont);

struct cpuset *oldcs = cgroup_cs(oldcont);

int err;

/* cs is the cpuset the task is moving to; oldcs is the one it came from */
/* update the task's CPU mask */

mutex_lock(&callback_mutex);

guarantee_online_cpus(cs, &cpus);

err = set_cpus_allowed_ptr(tsk, &cpus);

mutex_unlock(&callback_mutex);

if (err)

return;

/*
 * Update the task's memory node mask. If CS_MEMORY_MIGRATE is set,
 * also migrate the task's pages from the old nodes to the new ones.
 */

from = oldcs->mems_allowed;

to = cs->mems_allowed;

mm = get_task_mm(tsk);

if (mm) {

mpol_rebind_mm(mm, &to);

if (is_memory_migrate(cs))

cpuset_migrate_mm(mm, &from, &to);

mmput(mm);
}
}

This function is also simple; the comments explain it.

4.3: Creating the control files

When a cpuset is created, its control files are created in the filesystem through subsys->populate():

static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)

{

int err;

err = cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));

if (err)

return err;

/* memory_pressure_enabled is in root cpuset only */

if (!cont->parent)

err = cgroup_add_file(cont, ss,

&cft_memory_pressure_enabled);

return err;

}

As the code shows, the top-level cpuset has one extra file, described by the cftype cft_memory_pressure_enabled:

static struct cftype cft_memory_pressure_enabled = {

.name = "memory_pressure_enabled",

.read_u64 = cpuset_read_u64,

.write_u64 = cpuset_write_u64,

.private = FILE_MEMORY_PRESSURE_ENABLED,

};

That is, a file named "memory_pressure_enabled".

The files present in every cpuset directory are described by the files[] array of cftype:

static struct cftype files[] = {
    {
        .name = "cpus",
        .read = cpuset_common_file_read,
        .write_string = cpuset_write_resmask,
        .max_write_len = (100U + 6 * NR_CPUS),
        .private = FILE_CPULIST,
    },
    {
        .name = "mems",
        .read = cpuset_common_file_read,
        .write_string = cpuset_write_resmask,
        .max_write_len = (100U + 6 * MAX_NUMNODES),
        .private = FILE_MEMLIST,
    },
    {
        .name = "cpu_exclusive",
        .read_u64 = cpuset_read_u64,
        .write_u64 = cpuset_write_u64,
        .private = FILE_CPU_EXCLUSIVE,
    },
    {
        .name = "mem_exclusive",
        .read_u64 = cpuset_read_u64,
        .write_u64 = cpuset_write_u64,
        .private = FILE_MEM_EXCLUSIVE,
    },
    {
        .name = "mem_hardwall",
        .read_u64 = cpuset_read_u64,
        .write_u64 = cpuset_write_u64,
        .private = FILE_MEM_HARDWALL,
    },
    {
        .name = "sched_load_balance",
        .read_u64 = cpuset_read_u64,
        .write_u64 = cpuset_write_u64,
        .private = FILE_SCHED_LOAD_BALANCE,
    },
    {
        .name = "sched_relax_domain_level",
        .read_s64 = cpuset_read_s64,
        .write_s64 = cpuset_write_s64,
        .private = FILE_SCHED_RELAX_DOMAIN_LEVEL,
    },
    {
        .name = "memory_migrate",
        .read_u64 = cpuset_read_u64,
        .write_u64 = cpuset_write_u64,
        .private = FILE_MEMORY_MIGRATE,
    },
    {
        .name = "memory_pressure",
        .read_u64 = cpuset_read_u64,
        .write_u64 = cpuset_write_u64,
        .private = FILE_MEMORY_PRESSURE,
    },
    {
        .name = "memory_spread_page",
        .read_u64 = cpuset_read_u64,
        .write_u64 = cpuset_write_u64,
        .private = FILE_SPREAD_PAGE,
    },
    {
        .name = "memory_spread_slab",
        .read_u64 = cpuset_read_u64,
        .write_u64 = cpuset_write_u64,
        .private = FILE_SPREAD_SLAB,
    },
};

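These are the files cpus, mems, cpu_exclusive, mem_exclusive, mem_hardwall, sched_load_balance, sched_relax_domain_level, memory_migrate, memory_pressure, memory_spread_page and memory_spread_slab.

The meaning of several of them was covered above: cpus, mems, sched_load_balance, sched_relax_domain_level, memory_migrate, memory_spread_page and memory_spread_slab. The rest are analyzed below.

5: cpuset file operations

5.1: The memory_pressure_enabled file

Starting at the top-level directory: the cpuset subsystem has one file that exists only there, memory_pressure_enabled. It controls whether memory pressure is computed for cpusets. Memory pressure here is the rate at which memory allocation requests cannot be satisfied from the currently free memory. The kernel documentation describes the calculation in detail.

The file's cftype is: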

static struct cftype cft_memory_pressure_enabled = {

.name = "memory_pressure_enabled",

.read_u64 = cpuset_read_u64,

.write_u64 = cpuset_write_u64,

.private = FILE_MEMORY_PRESSURE_ENABLED,

};

The read handler is cpuset_read_u64 and the write handler is cpuset_write_u64. As we will see, most cpuset files use these two handlers, which tell the files apart by the private member.

First, the read path:

static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)

{

struct cpuset *cs = cgroup_cs(cont);

cpuset_filetype_t type = cft->private;

switch (type) {

case FILE_CPU_EXCLUSIVE:

return is_cpu_exclusive(cs);

case FILE_MEM_EXCLUSIVE:

return is_mem_exclusive(cs);

case FILE_MEM_HARDWALL:

return is_mem_hardwall(cs);

case FILE_SCHED_LOAD_BALANCE:

return is_sched_load_balance(cs);

case FILE_MEMORY_MIGRATE:

return is_memory_migrate(cs);

case FILE_MEMORY_PRESSURE_ENABLED:

return cpuset_memory_pressure_enabled;

case FILE_MEMORY_PRESSURE:

return fmeter_getrate(&cs->fmeter);

case FILE_SPREAD_PAGE:

return is_spread_page(cs);

case FILE_SPREAD_SLAB:

return is_spread_slab(cs);

default:

BUG();

}

/* Unreachable but makes gcc happy */

return 0;

}

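For the memory_pressure_enabled file, private is FILE_MEMORY_PRESSURE_ENABLED, so the handler returns the value of cpuset_memory_pressure_enabled, which is defined as:

int cpuset_memory_pressure_enabled;

Although it is an int, it is used as a boolean and only ever holds 0 or 1, as the write path shows.

The write handler is cpuset_write_u64():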

static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)

{

int retval = 0;

struct cpuset *cs = cgroup_cs(cgrp);

cpuset_filetype_t type = cft->private;

if (!cgroup_lock_live_group(cgrp))

return -ENODEV;

switch (type) {

case FILE_CPU_EXCLUSIVE:

retval = update_flag(CS_CPU_EXCLUSIVE, cs, val);

break;

case FILE_MEM_EXCLUSIVE:

retval = update_flag(CS_MEM_EXCLUSIVE, cs, val);

break;

case FILE_MEM_HARDWALL:

retval = update_flag(CS_MEM_HARDWALL, cs, val);

break;

case FILE_SCHED_LOAD_BALANCE:

retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val);

break;

case FILE_MEMORY_MIGRATE:

retval = update_flag(CS_MEMORY_MIGRATE, cs, val);

break;

case FILE_MEMORY_PRESSURE_ENABLED:

cpuset_memory_pressure_enabled = !!val;

break;

case FILE_MEMORY_PRESSURE:

retval = -EACCES;

break;

case FILE_SPREAD_PAGE:

retval = update_flag(CS_SPREAD_PAGE, cs, val);

cs->mems_generation = cpuset_mems_generation++;

break;

case FILE_SPREAD_SLAB:

retval = update_flag(CS_SPREAD_SLAB, cs, val);

cs->mems_generation = cpuset_mems_generation++;

break;

default:

retval = -EINVAL;

break;

}

cgroup_unlock();

return retval;

}

For the memory_pressure_enabled file, the operation is:

cpuset_memory_pressure_enabled = !!val;

which sets cpuset_memory_pressure_enabled to 0 if 0 is written and to 1 for any other value.

So both handlers operate on cpuset_memory_pressure_enabled. What is that variable used for?

In __alloc_pages_internal(), when free memory cannot satisfy an allocation request, cpuset_memory_pressure_bump() is called:

#define cpuset_memory_pressure_bump() \
    do { \
        if (cpuset_memory_pressure_enabled) \
            __cpuset_memory_pressure_bump(); \
    } while (0)

This is just a macro. If memory pressure is enabled, that is, cpuset_memory_pressure_enabled is 1, it calls __cpuset_memory_pressure_bump():

void __cpuset_memory_pressure_bump(void)

{

task_lock(current);

fmeter_markevent(&task_cs(current)->fmeter);

task_unlock(current);

}

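Here the purpose of cpuset->fmeter becomes clear: it measures memory pressure. fmeter_markevent() is not analyzed here; it derives the pressure value from the rate at which requests find memory short, and the result is stored in fmeter.val.

5.2: The memory_pressure file

The memory_pressure file reports the current memory pressure of a cpuset. Its cftype is: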

{
    .name = "memory_pressure",
    .read_u64 = cpuset_read_u64,
    .write_u64 = cpuset_write_u64,
    .private = FILE_MEMORY_PRESSURE,
},

The handlers are the same ones analyzed above.

Read path:

static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
{
    ......
    case FILE_MEMORY_PRESSURE:
        return fmeter_getrate(&cs->fmeter);
    ......
}

The code for fmeter_getrate() is:

static int fmeter_getrate(struct fmeter *fmp)

{

int val;

spin_lock(&fmp->lock);

fmeter_update(fmp);

val = fmp->val;

spin_unlock(&fmp->lock);

return val;

}

It returns the current memory pressure value of the cpuset.

Write path:

static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
    ......
    case FILE_MEMORY_PRESSURE:
        retval = -EACCES;
    ......
}

As this shows, the file is not writable.

5.3: The cpus file

The cpus file configures the CPUs bound to a cpuset. Its cftype is:

{

.name = "cpus",

.read = cpuset_common_file_read,

.write_string = cpuset_write_resmask,

.max_write_len = (100U + 6 * NR_CPUS),

.private = FILE_CPULIST,

}

The read handler is cpuset_common_file_read():

static ssize_t cpuset_common_file_read(struct cgroup *cont,

struct cftype *cft,

struct file *file,

char __user *buf,

size_t nbytes, loff_t *ppos)

{

struct cpuset *cs = cgroup_cs(cont);

cpuset_filetype_t type = cft->private;

char *page;

ssize_t retval = 0;

char *s;

if (!(page = (char *)__get_free_page(GFP_TEMPORARY)))

return -ENOMEM;

s = page;

switch (type) {

case FILE_CPULIST:

/* format cpuset->cpus_allowed as a string into s */

s += cpuset_sprintf_cpulist(s, cs);

break;

case FILE_MEMLIST:

/* format cpuset->mems_allowed as a string into s */

s += cpuset_sprintf_memlist(s, cs);

break;

default:

retval = -EINVAL;

goto out;

}

/* terminate with a newline */
*s++ = '\n';
/* copy to user space */

retval = simple_read_from_buffer(buf, nbytes, ppos, page, s - page);

out:

free_page((unsigned long)page);

retur


Author: horus

Posted by horus 11 years ago in the General category; last updated February 10, 2013.