aboutsummaryrefslogtreecommitdiff
path: root/kernel/cgroup.c
AgeCommit message (Collapse)AuthorFilesLines
2013-08-02cgroup: Merge branch 'for-3.11-fixes' into for-3.12Tejun Heo1-13/+22
for-3.12 branch is about to receive invasive updates which are dependent on da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails"). Given the amount of scheduled changes, I think it'd less painful to pull in for-3.11-fixes as preparation. Pull in for-3.11-fixes into for-3.12. Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-01cgroup: rename cgroup_pidlist->mutexLi Zefan1-11/+11
It's a rw_semaphore not a mutex. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-01cgroup: restructure the failure path in cgroup_write_event_control()Li Zefan1-21/+18
It uses a single label and checks the validity of each pointer. This is err-prone, and actually we had a bug because one of the check was insufficient. Use multi lables as we do in other places. v2: - drop initializations of local variables. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-07-31cgroup: convert cgroup_ida to cgroup_idrLi Zefan1-6/+27
This enables us to lookup a cgroup by its id. v4: - add a comment for idr_remove() in cgroup_offline_fn(). v3: - on success, idr_alloc() returns the id but not 0, so fix the BUG_ON() in cgroup_init(). - pass the right value to idr_alloc() so that the id for dummy cgroup is 0. Signed-off-by: Li Zefan <lizefan@huawei.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-07-31cgroup: more naming cleanupsLi Zefan1-13/+13
Constantly use @cset for css_set variables and use @cgrp as cgroup variables. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-07-31cgroup: remove struct cgroup_seqfile_stateLi Zefan1-32/+13
We can use struct cfent instead. v2: - remove cgroup_seqfile_release(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-07-31cgroup: remove sparse tags from offline_css()Li Zefan1-3/+2
This should have been removed in commit d7eeac1913ff ("cgroup: hold cgroup_mutex before calling css_offline"). While at it, update the comments. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-07-31cgroup: fix a leak when percpu_ref_init() failsLi Zefan1-1/+3
ss->css_free() is not called when perfcpu_ref_init() fails. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-07-23Merge branch 'for-3.11-fixes' of ↵Linus Torvalds1-12/+19
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "This contains two patches, both of which aren't fixes per-se but I think it'd be better to fast-track them. One removes bcache_subsys_id which was added without proper review through the block tree. Fortunately, bcache cgroup code is unconditionally disabled, so this was never exposed to userland. The cgroup subsys_id is removed. Kent will remove the affected (disabled) code through bcache branch. The other simplifies task_group_path_from_hierarchy(). The function doesn't currently have in-kernel users but there are external code and development going on dependent on the function and making the function available for 3.11 would make things go smoother" * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: replace task_cgroup_path_from_hierarchy() with task_cgroup_path() cgroup: remove bcache_subsys_id which got added stealthily
2013-07-16cgroup: remove gratuituous BUG_ON()s from rebind_subsystems()Tejun Heo1-9/+0
rebind_subsystems() performs santiy checks even on subsystems which aren't specified to be added or removed and the checks aren't all that useful given that these are in a very cold path while the violations they check would trip up in much hotter paths. Let's remove these from rebind_subsystems(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-16cgroup: move module ref handling into rebind_subsystems()Tejun Heo1-65/+28
Module ref handling in cgroup is rather weird. parse_cgroupfs_options() grabs all the modules for the specified subsystems. A module ref is kept if the specified subsystem is newly bound to the hierarchy. If not, or the operation fails, the refs are dropped. This scatters module ref handling across multiple functions making it difficult to track. It also make the function nasty to use for dynamic subsystem binding which is necessary for the planned unified hierarchy. There's nothing which requires the subsystem modules to be pinned between parse_cgroupfs_options() and rebind_subsystems() in both mount and remount paths. parse_cgroupfs_options() can just parse and rebind_subsystems() can handle pinning the subsystems that it wants to bind, which is a natural part of its task - binding - anyway. Move module ref handling into rebind_subsystems() which makes the code a lot simpler - modules are gotten iff it's gonna be bound and put iff unbound or binding fails. v2: Li pointed out that if a controller module is unloaded between parsing and binding, rebind_subsystems() won't notice the missing controller as it only iterates through existing controllers. Fix it by updating rebind_subsystems() to compare @added_mask to @pinned and fail with -ENOENT if they don't match. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-14cgroup: we can use simple_lookup() nowAl Viro1-10/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-12cgroup: replace task_cgroup_path_from_hierarchy() with task_cgroup_path()Tejun Heo1-12/+19
task_cgroup_path_from_hierarchy() was added for the planned new users and none of the currently planned users wants to know about multiple hierarchies. This patch drops the multiple hierarchy part and makes it always return the path in the first non-dummy hierarchy. As unified hierarchy will always have id 1, this is guaranteed to return the path for the unified hierarchy if mounted; otherwise, it will return the path from the hierarchy which happens to occupy the lowest hierarchy id, which will usually be the first hierarchy mounted after boot. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Lennart Poettering <lennart@poettering.net> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kaluža <jkaluza@redhat.com>
2013-07-12cgroup: move number_of_cgroups test out of rebind_subsystems() into ↵Tejun Heo1-7/+6
cgroup_remount() rebind_subsystems() currently fails if the hierarchy has any !root cgroups; however, on the planned unified hierarchy, rebind_subsystems() will be used while populated. Move the test to cgroup_remount(), which is the only place the test is necessary anyway. As it's impossible for the other two callers of rebind_subsystems() to have populated hierarchy, this doesn't make any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-12cgroup: make rebind_subsystems() handle file additions and removals with ↵Tejun Heo1-32/+41
proper error handling Currently, creating and removing cgroup files in the root directory are handled separately from the actual subsystem binding and unbinding which happens in rebind_subsystems(). Also, rebind_subsystems() users aren't handling file creation errors properly. Let's integrate top_cgroup file handling into rebind_subsystems() so that it's simpler to use and everyone handles file creation errors correctly. * On a successful return, rebind_subsystems() is guaranteed to have created all files of the new subsystems and deleted the ones belonging to the removed subsystems. After a failure, no file is created or removed. * cgroup_remount() no longer needs to make explicit populate/clear calls as it's all handled by rebind_subsystems(), and it gets proper error handling automatically. * cgroup_mount() has been updated such that the root dentry and cgroup are linked before rebind_subsystems(). Also, the init_cred dancing and base file handling are moved right above rebind_subsystems() call and proper error handling for the base files is added. While at it, add a comment explaining what's going on with the cred thing. * cgroup_kill_sb() calls rebind_subsystems() to unbind all subsystems which now implies removing all subsystem files which requires the directory's i_mutex. Grab it. This means that files on the root cgroup are removed earlier - they used to be deleted from generic super_block cleanup from vfs. This doesn't lead to any functional difference and it's cleaner to do the clean up explicitly for all files. Combined with the previous changes, this makes all cgroup file creation errors handled correctly. v2: Added comment on init_cred. v3: Li spotted that cgroup_mount() wasn't freeing tmp_links after base file addition failure. Fix it by adding free_tmp_links error handling label. v4: v3 introduced build bugs which got noticed by Fengguang's awesome kbuild test robot. Fixed, and shame on me. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com>
2013-07-12cgroup: use for_each_subsys() instead of for_each_root_subsys() in ↵Tejun Heo1-5/+8
cgroup_populate/clear_dir() rebind_subsystems() will be updated to handle file creations and removals with proper error handling and to do that will need to perform file operations before actually adding the subsystem to the hierarchy. To enable such usage, update cgroup_populate/clear_dir() to use for_each_subsys() instead of for_each_root_subsys() so that they operate on all subsystems specified by @subsys_mask whether that subsystem is currently bound to the hierarchy or not. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-12cgroup: update error handling in cgroup_populate_dir()Tejun Heo1-2/+11
cgroup_populate_dir() didn't use to check whether the actual file creations were successful and could return success with only subset of the requested files created, which is nasty. This patch udpates cgroup_populate_dir() so that it either succeeds with all files or fails with no file. v2: The original patch also converted for_each_root_subsys() usages to for_each_subsys() without explaining why. That part has been moved to a separate patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-12cgroup: separate out cgroup_base_files[] handling out of ↵Tejun Heo1-27/+19
cgroup_populate/clear_dir() cgroup_populate/clear_dir() currently take @base_files and adds and removes, respectively, cgroup_base_files[] to the directory. File additions and removals are being reorganized for proper error handling and more dynamic handling for the unified hierarchy, and mixing base and subsys file handling into the same functions gets a bit confusing. This patch moves base file handling out of cgroup_populate/clear_dir() into their users - cgroup_mount(), cgroup_create() and cgroup_destroy_locked(). Note that this changes the behavior of base file removal. If @base_files is %true, cgroup_clear_dir() used to delete files regardless of cftype until there's no files left. Now, only files with matching cfts are removed. As files can only be created by the base or registered cftypes, this shouldn't result in any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-12cgroup: fix cgroup_add_cftypes() error handlingTejun Heo1-8/+18
cgroup_add_cftypes() uses cgroup_cfts_commit() to actually create the files; however, both functions ignore actual file creation errors and just assume success. This can lead to, for example, blkio hierarchy with some of the cgroups with only subset of interface files populated after cfq-iosched is loaded under heavy memory pressure, which is nasty. This patch updates cgroup_cfts_commit() and cgroup_add_cftypes() to guarantee that all files are created on success and no file is created on failure. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-12cgroup: fix error path of cgroup_addrm_files()Tejun Heo1-6/+22
cgroup_addrm_files() mishandled error return value from cgroup_add_file() and returns error iff the last file fails to create. As we're in the process of cleaning up file add/rm error handling and will reliably propagate file creation failures, there's no point in keeping adding files after a failure. Replace the broken error collection logic with immediate error return. While at it, add lockdep assertions and function comment. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-12cgroup: minor updates around cgroup_clear_directory()Tejun Heo1-11/+8
* Rename it to cgroup_clear_dir() and make it take the pointer to the target cgroup instead of the the dentry. This makes the function consistent with its counterpart - cgroup_populate_dir(). * Move cgroup_clear_directory() invocation from cgroup_d_remove_dir() to cgroup_remount() so that the function doesn't have to determine the cgroup pointer back from the dentry. cgroup_d_remove_dir() now only deals with vfs, which is slightly cleaner. This patch doesn't introduce any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-29cgroup: CGRP_ROOT_SUBSYS_BOUND should also be ignored when mounting an ↵Tejun Heo1-1/+1
existing hierarchy 0ce6cba357 ("cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options") only updated the remount path but CGRP_ROOT_SUBSYS_BOUND should also be ignored when comparing options while mounting an existing hierarchy. As option mismatch triggers a warning but doesn't fail the mount without sane_behavior, this only triggers a spurious warning message. Fix it by only comparing CGRP_ROOT_OPTION_MASK bits when comparing new and existing root options. Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-27cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount optionsTejun Heo1-1/+4
1672d04070 ("cgroup: fix cgroupfs_root early destruction path") introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of subsys binding on a new root; however, this broke remounts. cgroup_remount() doesn't allow changing root options via remount and CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots, makes the function reject all remounts. Fix it by putting the options part in the lower 16 bits of root->flags and masking the comparions. While at it, make cgroup_remount() emit an error message explaining why it's rejecting a remount request, so that it's less of a mystery. Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-27cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts()Tejun Heo1-2/+2
eb178d06332 ("cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()") made drop_parsed_module_refcounts() grab cgroup_mutex to make lockdep assertion in for_each_subsys() happy. Unfortunately, cgroup_remount() calls the function while holding cgroup_mutex in its failure path leading to the following deadlock. # mount -t cgroup -o remount,memory,blkio cgroup blkio cgroup: option changes via remount are deprecated (pid=525 comm=mount) ============================================= [ INFO: possible recursive locking detected ] 3.10.0-rc4-work+ #1 Not tainted --------------------------------------------- mount/525 is trying to acquire lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0 but task is already holding lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(cgroup_mutex); lock(cgroup_mutex); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by mount/525: #0: (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30 #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200 #2: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200 #3: (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200 stack backtrace: CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8 ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056 ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640 Call Trace: [<ffffffff81c24bb1>] dump_stack+0x19/0x1b [<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60 [<ffffffff810f531c>] lock_acquire+0x9c/0x1f0 [<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410 [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0 [<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200 [<ffffffff811c9bb2>] do_remount_sb+0x82/0x190 [<ffffffff811e9d41>] do_mount+0x5f1/0xa30 [<ffffffff811ea203>] SyS_mount+0x83/0xc0 [<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b Fix it by moving the drop_parsed_module_refcounts() invocation outside cgroup_mutex. Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-26cgroup: always use RCU accessors for protected accessesTejun Heo1-9/+12
kernel/cgroup.c still has places where a RCU pointer is set and accessed directly without going through RCU_INIT_POINTER() or rcu_dereference_protected(). They're all properly protected accesses so nothing is broken but it leads to spurious sparse RCU address space warnings. Substitute direct accesses with RCU_INIT_POINTER() and rcu_dereference_protected(). Note that %true is specified as the extra condition for all derference updates. This isn't ideal as all it does is suppressing warning without actually policing synchronization rules; however, most are scheduled to be removed pretty soon along with css_id itself, so no reason to be more elaborate. Combined with the previous changes, this removes all RCU related sparse warnings from cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by; Li Zefan <lizefan@huawei.com>
2013-06-26cgroup: fix RCU accesses around task->cgroupsTejun Heo1-11/+13
There are several places in kernel/cgroup.c where task->cgroups is accessed and modified without going through proper RCU accessors. None is broken as they're all lock protected accesses; however, this still triggers sparse RCU address space warnings. * Consistently use task_css_set() for task->cgroups dereferencing. * Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on exit. * Remove unnecessary rcu_dereference_raw() from cset->subsys[] dereference in cgroup_exit(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-26cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()Tejun Heo1-5/+5
This isn't strictly necessary as all subsystems specified in @subsys_mask are guaranteed to be pinned; however, it does spuriously trigger lockdep warning. Let's grab cgroup_mutex around it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-26cgroup: fix cgroupfs_root early destruction pathTejun Heo1-3/+19
cgroupfs_root used to have ->actual_subsys_mask in addition to ->subsys_mask. a8a648c4ac ("cgroup: remove cgroup->actual_subsys_mask") removed it noting that the subsys_mask is essentially temporary and doesn't belong in cgroupfs_root; however, the patch made it impossible to tell whether a cgroupfs_root actually has the subsystems bound or just have the bits set leading to the following BUG when trying to mount with subsystems which are already mounted elsewhere. kernel BUG at kernel/cgroup.c:1038! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ... CPU: 1 PID: 7973 Comm: mount Tainted: G W 3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105 task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000 RIP: 0010:[<ffffffff81249b29>] [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0 ... Call Trace: [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210 [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90 [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0 [<ffffffff81257169>] cpuset_mount+0xd9/0x110 [<ffffffff813d2580>] mount_fs+0xb0/0x2d0 [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180 [<ffffffff814070b5>] do_new_mount+0x145/0x2c0 [<ffffffff814085d6>] do_mount+0x356/0x3c0 [<ffffffff8140873d>] SyS_mount+0xfd/0x140 [<ffffffff854eb600>] tracesys+0xdd/0xe2 We still want rebind_subsystems() to take added/removed masks, so let's fix it by marking whether a cgroupfs_root has finished binding or not. Also, document what's going on around ->subsys_mask initialization so that similar mistakes aren't repeated. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-25cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchyTejun Heo1-4/+6
Before 1a57423166 ("cgroup: make hierarchy_id use cyclic idr"), hierarchy IDs were allocated from 0. As the dummy hierarchy was always the one first initialized, it got assigned 0 and all other hierarchies from 1. The patch accidentally changed the minimum useable ID to 2. Let's restore ID 0 for dummy_root and while at it reserve 1 for unified hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
2013-06-25cgroup: implement for_each_[builtin_]subsys()Tejun Heo1-76/+71
There are quite a few places where all loaded [builtin] subsys are iterated. Implement for_each_[builtin_]subsys() and replace manual iterations with those to simplify those places a bit. The new iterators automatically skip NULL subsystems. This shouldn't cause any functional difference. Iteration loops which scan all subsystems and then skipping modular ones explicitly are converted to use for_each_builtin_subsys(). While at it, reorder variable declarations and adjust whitespaces a bit in the affected functions. v2: Add lockdep_assert_held() in for_each_subsys() and add comments about synchronization as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-25cgroup: move init_css_set initialization inside cgroup_mutexTejun Heo1-4/+4
cgroup_init() was doing init_css_set initialization outside cgroup_mutex, which is fine but we want to add lockdep annotation on subsystem iterations and cgroup_init() will trigger it spuriously. Move init_css_set initialization inside cgroup_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24cgroup: s/for_each_subsys()/for_each_root_subsys()/Tejun Heo1-25/+22
for_each_subsys() walks over subsystems attached to a hierarchy and we're gonna add iterators which walk over all available subsystems. Rename for_each_subsys() to for_each_root_subsys() so that it's more appropriately named and for_each_subsys() can be used to iterate all subsystems. While at it, remove unnecessary underbar prefix from macro arguments, put them inside parentheses, and adjust indentation for the two for_each_*() macros. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24cgroup: clean up find_css_set() and friendsTejun Heo1-21/+17
find_css_set() passes uninitialized on-stack template[] array to find_existing_css_set() which sets the entries for all subsystems. Passing around an uninitialized array is a bit icky and we want to introduce an iterator which only iterates loaded subsystems. Let's initialize it on definition. While at it, also make the following cosmetic cleanups. * Convert to proper /** comments. * Reorder variable declarations. * Replace comment on synchronization with lockdep_assert_held(). This patch doesn't make any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24cgroup: remove cgroup->actual_subsys_maskTejun Heo1-10/+12
cgroup curiously has two subsystem masks, ->subsys_mask and ->actual_subsys_mask. The latter only exists because the new target subsys_mask is passed into rebind_subsystems() via @root>subsys_mask. rebind_subsystems() needs to know what the current mask is to decide how to reach the target mask so ->actual_subsys_mask is used as the temp location to remember the current state. Adding a temporary field to a permanent data structure is rather silly and can be misleading. Update rebind_subsystems() to take @added_mask and @removed_mask instead and remove @root->actual_subsys_mask. This patch shouldn't introduce any behavior changes. v2: Comment and description updated as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24cgroup: prefix global variables with "cgroup_"Tejun Heo1-76/+77
Global variable names in kernel/cgroup.c are asking for trouble - subsys, roots, rootnode and so on. Rename them to have "cgroup_" prefix. * s/subsys/cgroup_subsys/ * s/rootnode/cgroup_dummy_root/ * s/dummytop/cgroup_cummy_top/ * s/roots/cgroup_roots/ * s/root_count/cgroup_root_count/ This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-19cgroup: rename cont to cgrpLi Zefan1-11/+11
Cont is short for container. control group was named process container at first, but then people found container already has a meaning in linux kernel. Clean up the leftover variable name @cont. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18cgroup: clean up cgroup_serial_nr_cursorTejun Heo1-7/+8
cgroup_serial_nr_cursor was created atomic64_t because I thought it was never gonna used for anything other than assigning unique numbers to cgroups and didn't want to worry about synchronization; however, now we're using it as an event-stamp to distinguish cgroups created before and after certain point which assumes that it's protected by cgroup_mutex. Let's make it clear by making it a u64. Also, rename it to cgroup_serial_nr_next and make it point to the next nr to allocate so that where it's pointing to is clear and more conventional. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com>
2013-06-18cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre()Li Zefan1-36/+44
We used root->allcg_list to iterate cgroup hierarchy because at that time cgroup_for_each_descendant_pre() hasn't been invented. tj: In cgroup_cfts_commit(), s/@serial_nr/@update_upto/, move the assignment right above releasing cgroup_mutex and explain what's going on there. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18cgroup: make serial_nr_cursor available throughout cgroup.cLi Zefan1-8/+10
The next patch will use it to determine if a cgroup is newly created while we're iterating the cgroup hierarchy. tj: Rephrased the comment on top of cgroup_serial_nr_cursor. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18cgroup: fix memory leak in cgroup_rm_cftypes()Li Zefan1-1/+2
The memory allocated in cgroup_add_cftypes() should be freed. The effect of this bug is we leak a bit memory everytime we unload cfq-iosched module if blkio cgroup is enabled. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18cgroup: fix umount vs cgroup_event_remove() raceLi Zefan1-6/+19
commit 5db9a4d99b0157a513944e9a44d29c9cec2e91dc Author: Tejun Heo <tj@kernel.org> Date: Sat Jul 7 16:08:18 2012 -0700 cgroup: fix cgroup hierarchy umount race This commit fixed a race caused by the dput() in css_dput_fn(), but the dput() in cgroup_event_remove() can also lead to the same BUG(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2013-06-18cgroup: fix umount vs cgroup_cfts_commit() raceLi Zefan1-1/+8
cgroup_cfts_commit() uses dget() to keep cgroup alive after cgroup_mutex is dropped, but dget() won't prevent cgroupfs from being umounted. When the race happens, vfs will see some dentries with non-zero refcnt while umount is in process. Keep running this: mount -t cgroup -o blkio xxx /cgroup umount /cgroup And this: modprobe cfq-iosched rmmod cfs-iosched After a while, the BUG() in shrink_dcache_for_umount_subtree() may be triggered: BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2013-06-18cgroup: disallow rename(2) if sane_behaviorTejun Heo1-0/+7
cgroup's rename(2) isn't a proper migration implementation - it can't move the cgroup to a different parent in the hierarchy. All it can do is swapping the name string for that cgroup. This isn't useful and can mislead users to think that cgroup supports proper cgroup-level migration. Disallow rename(2) if sane_behavior. v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs return value when ->rename is not implemented as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13cgroup: use percpu refcnt for cgroup_subsys_statesTejun Heo1-61/+104
A css (cgroup_subsys_state) is how each cgroup is represented to a controller. As such, it can be used in hot paths across the various subsystems different controllers are associated with. One of the common operations is reference counting, which up until now has been implemented using a global atomic counter and can have significant adverse impact on scalability. For example, css refcnt can be gotten and put multiple times by blkcg for each IO request. For highops configurations which try to do as much per-cpu as possible, the global frequent refcnting can be very expensive. In general, given the various and hugely diverse paths css's end up being used from, we need to make it cheap and highly scalable. In its usage, css refcnting isn't very different from module refcnting. This patch converts css refcnting to use the recently added percpu_ref. css_get/tryget/put() directly maps to the matching percpu_ref operations and the deactivation logic is no longer necessary as percpu_ref already has refcnt killing. The only complication is that as the refcnt is per-cpu, percpu_ref_kill() in itself doesn't ensure that further tryget operations will fail, which we need to guarantee before invoking ->css_offline()'s. This is resolved collecting kill confirmation using percpu_ref_kill_and_confirm() and initiating the offline phase of destruction after all css refcnt's are confirmed to be seen as killed on all CPUs. The previous patches already splitted destruction into two phases, so percpu_ref_kill_and_confirm() can be hooked up easily. This patch removes css_refcnt() which is used for rcu dereference sanity check in css_id(). While we can add a percpu refcnt API to ask the same question, css_id() itself is scheduled to be removed fairly soon, so let's not bother with it. Just drop the sanity check and use rcu_dereference_raw() instead. v2: - init_cgroup_css() was calling percpu_ref_init() without checking the return value. This causes two problems - the obvious lack of error handling and percpu_ref_init() being called from cgroup_init_subsys() before the allocators are up, which triggers warnings but doesn't cause actual problems as the refcnt isn't used for roots anyway. Fix both by moving percpu_ref_init() to cgroup_create(). - The base references were put too early by percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the refs one extra time. This wasn't noticeable because css's go through another RCU grace period before being freed. Update cgroup_destroy_locked() to grab an extra reference before killing the refcnts. This problem was noticed by Kent. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <koverstreet@google.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Alasdair G. Kergon" <agk@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Glauber Costa <glommer@gmail.com>
2013-06-13Merge branch 'for-3.11' of ↵Tejun Heo1-5/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.11 This is to receive percpu_refcount which will replace atomic_t reference count in cgroup_subsys_state. Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13cgroup: split cgroup destruction into two stepsTejun Heo1-11/+27
Split cgroup_destroy_locked() into two steps and put the latter half into cgroup_offline_fn() which is executed from a work item. The latter half is responsible for offlining the css's, removing the cgroup from internal lists, and propagating release notification to the parent. The separation is to allow using percpu refcnt for css. Note that this allows for other cgroup operations to happen between the first and second halves of destruction, including creating a new cgroup with the same name. As the target cgroup is marked DEAD in the first half and cgroup internals don't care about the names of cgroups, this should be fine. A comment explaining this will be added by the next patch which implements the actual percpu refcnting. As RCU freeing is guaranteed to happen after the second step of destruction, we can use the same work item for both. This patch renames cgroup->free_work to ->destroy_work and uses it for both purposes. INIT_WORK() is now performed right before queueing the work item. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13cgroup: reorder the operations in cgroup_destroy_locked()Tejun Heo1-26/+35
This patch reorders the operations in cgroup_destroy_locked() such that the userland visible parts happen before css offlining and removal from the ->sibling list. This will be used to make css use percpu refcnt. While at it, split out CGRP_DEAD related comment from the refcnt deactivation one and correct / clarify how different guarantees are met. While this patch changes the specific order of operations, it shouldn't cause any noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13cgroup: remove cgroup->count and useTejun Heo1-16/+5
cgroup->count tracks the number of css_sets associated with the cgroup and used only to verify that no css_set is associated when the cgroup is being destroyed. It's superflous as the destruction path can simply check whether cgroup->cset_links is empty instead. Drop cgroup->count and check ->cset_links directly from cgroup_destroy_locked(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13cgroup: drop unnecessary RCU dancing from __put_css_set()Tejun Heo1-10/+10
__put_css_set() does RCU read access on @cgrp across dropping @cgrp->count so that it can continue accessing @cgrp even if the count reached zero and destruction of the cgroup commenced. Given that both sides - __css_put() and cgroup_destroy_locked() - are cold paths, this is unnecessary. Just making cgroup_destroy_locked() grab css_set_lock while checking @cgrp->count is enough. Remove the RCU read locking from __put_css_set() and make cgroup_destroy_locked() read-lock css_set_lock when checking @cgrp->count. This will also allow removing @cgrp->count. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13cgroup: rename CGRP_REMOVED to CGRP_DEADTejun Heo1-16/+14
We will add another flag indicating that the cgroup is in the process of being killed. REMOVING / REMOVED is more difficult to distinguish and cgroup_is_removing()/cgroup_is_removed() are a bit awkward. Also, later percpu_ref usage will involve "kill"ing the refcnt. s/CGRP_REMOVED/CGRP_DEAD/ s/cgroup_is_removed()/cgroup_is_dead() This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>