aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2013-08-12Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull w/w mutex deadlock injection fix from Ingo Molnar. This bug made the CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y option largely useless, but wouldn't affect normal users. * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mutex: Fix w/w mutex deadlock injection
2013-08-12Merge branch 'fortglx/3.11/time' of ↵Ingo Molnar1-1/+1
git://git.linaro.org/people/jstultz/linux into timers/urgent Pull small fix for v3.11 from John Stultz. Signed-off-by: Ingo Molnar <[email protected]>
2013-08-09jump_label: Split jumplabel ratelimitAndrew Jones1-0/+1
Commit b202952075f62603bea9bfb6ebc6b0420db11949 ("perf, core: Rate limit perf_sched_events jump_label patching") introduced rate limiting for jump label disabling. The changes were made in the jump label code in order to be more widely available and to keep things tidier. This is all fine, except now jump_label.h includes linux/workqueue.h, which makes it impossible to include jump_label.h from anything that workqueue.h needs. For example, it's now impossible to include jump_label.h from asm/spinlock.h, which is done in proposed pv-ticketlock patches. This patch splits out the rate limiting related changes from jump_label.h into a new file, jump_label_ratelimit.h, to resolve the issue. Signed-off-by: Andrew Jones <[email protected]> Link: http://lkml.kernel.org/r/1376058122-8248-10-git-send-email-raghavendra.kt@linux.vnet.ibm.com Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Ingo Molnar <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2013-08-08cgroup: make css_for_each_descendant() and friends include the origin css in ↵Tejun Heo3-47/+53
the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Vivek Goyal <[email protected]> Acked-by: Aristeu Rozanski <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Matt Helsley <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]>
2013-08-08cgroup: unexport cgroup_css()Tejun Heo1-0/+13
cgroup_css() no longer has any user left outside cgroup.c proper and we don't want subsystems to grow new usages of the function. cgroup core should always provide the css to use to the subsystems, which will make dynamic creation and destruction of css's across the lifetime of a cgroup much more manageable than exposing the cgroup directly to subsystems and let them dereference css's from it. Make cgroup_css() a static function in cgroup.c. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroupTejun Heo5-21/+18
cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. cgroup_taskset which is used by the subsystem attach methods is the last cgroup subsystem API which isn't using css as the handle. Update cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp. The conversions are pretty mechanical. One exception is cpuset::cgroup_cs(), which lost its last user and got removed. This patch shouldn't introduce any functional changes. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Daniel Wagner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Matt Helsley <[email protected]> Cc: Steven Rostedt <[email protected]>
2013-08-08cgroup: make cftype->[un]register_event() deal with cgroup_subsys_state ↵Tejun Heo1-7/+8
instead of cgroup cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. cftype->[un]register_event() is among the remaining couple interfaces which still use struct cgroup. Convert it to cgroup_subsys_state. The conversion is mostly mechanical and removes the last users of mem_cgroup_from_cont() and cg_to_vmpressure(), which are removed. v2: indentation update as suggested by Li Zefan. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]>
2013-08-08cgroup: make task iterators deal with cgroup_subsys_state instead of cgroupTejun Heo3-91/+88
cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. This patch converts task iterators to deal with css instead of cgroup. Note that under unified hierarchy, different sets of tasks will be considered belonging to a given cgroup depending on the subsystem in question and making the iterators deal with css instead cgroup provides them with enough information about the iteration. While at it, fix several function comment formats in cpuset.c. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Matt Helsley <[email protected]>
2013-08-08cgroup: remove struct cgroup_scannerTejun Heo2-87/+69
cgroup_scan_tasks() takes a pointer to struct cgroup_scanner as its sole argument and the only function of that struct is packing the arguments of the function call which are consisted of five fields. It's not too unusual to pack parameters into a struct when the number of arguments gets excessive or the whole set needs to be passed around a lot, but neither holds here making it just weird. Drop struct cgroup_scanner and pass the params directly to cgroup_scan_tasks(). Note that struct cpuset_change_nodemask_arg was added to cpuset.c to pass both ->cs and ->newmems pointer to cpuset_change_nodemask() using single data pointer. This doesn't make any functional differences. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: make cgroup_task_iter remember the cgroup being iteratedTejun Heo2-23/+21
Currently all cgroup_task_iter functions require @cgrp to be passed in, which is superflous and increases chance of usage error. Make cgroup_task_iter remember the cgroup being iterated and drop @cgrp argument from next and end functions. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Matt Helsley <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]>
2013-08-08cgroup: rename cgroup_iter to cgroup_task_iterTejun Heo2-49/+89
cgroup now has multiple iterators and it's quite confusing to have something which walks over tasks of a single cgroup named cgroup_iter. Let's rename it to cgroup_task_iter. While at it, reformat / update comments and replace the overview comment above the interface function decls with proper function comments. Such overview can be useful but function comments should be more than enough here. This is pure rename and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Matt Helsley <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]>
2013-08-08cgroup: relocate cgroup_advance_iter()Tejun Heo1-24/+24
For some reason, cgroup_advance_iter() is standing lonely all away from its iter comrades. Relocate it. This is cosmetic. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroupTejun Heo3-102/+112
cgroup is currently in the process of transitioning to using css (cgroup_subsys_state) as the primary handle instead of cgroup in subsystem API. For hierarchy iterators, this is beneficial because * In most cases, css is the only thing subsystems care about anyway. * On the planned unified hierarchy, iterations for different subsystems will need to skip over different subtrees of the hierarchy depending on which subsystems are enabled on each cgroup. Passing around css makes it unnecessary to explicitly specify the subsystem in question as css is intersection between cgroup and subsystem * For the planned unified hierarchy, css's would need to be created and destroyed dynamically independent from cgroup hierarchy. Having cgroup core manage css iteration makes enforcing deref rules a lot easier. Most subsystem conversions are straight-forward. Noteworthy changes are * blkio: cgroup_to_blkcg() is no longer used. Removed. * freezer: cgroup_freezer() is no longer used. Removed. * devices: cgroup_to_devcgroup() is no longer used. Removed. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vivek Goyal <[email protected]> Acked-by: Aristeu Rozanski <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Matt Helsley <[email protected]> Cc: Jens Axboe <[email protected]>
2013-08-08cgroup: always use cgroup_next_child() to walk the children listTejun Heo1-4/+3
There are several places where the children list is accessed directly. This patch converts those places to use cgroup_next_child(). This will help updating the hierarchy iterators to use @css instead of @cgrp. While cgroup_next_child() can be heavy in pathological cases - e.g. a lot of dead children, this shouldn't cause any noticeable behavior differences. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: convert cgroup_next_sibling() to cgroup_next_child()Tejun Heo1-29/+30
cgroup is transitioning to using css (cgroup_subsys_state) as the main subsys interface handle instead of cgroup and the iterators will be updated to use css too. The iterators need to walk the cgroup hierarchy and return the css's matching the origin css, which is a bit cumbersome to open code. This patch converts cgroup_next_sibling() to cgroup_next_child() so that it can handle all steps of direct child iteration. This will be used to update iterators to take @css instead of @cgrp. In addition to the new iteration init handling, cgroup_next_child() is restructured so that the different branches share the end of iteration condition check. This patch doesn't change any behavior. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: pass around cgroup_subsys_state instead of cgroup in file methodsTejun Heo5-165/+165
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vivek Goyal <[email protected]> Acked-by: Aristeu Rozanski <[email protected]> Acked-by: Daniel Wagner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Matt Helsley <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Steven Rostedt <[email protected]>
2013-08-08cgroup: add cgroup->dummy_cssTejun Heo1-4/+5
cgroup subsystem API is being converted to use css (cgroup_subsys_state) as the main handle, which makes things a bit awkward for subsystem agnostic core features - the "cgroup.*" interface files and various iterations - a bit awkward as they don't have a css to use. This patch adds cgroup->dummy_css which has NULL ->ss and whose only role is pointing back to the cgroup. This will be used to support subsystem agnostic features on the coming css based API. css_parent() is updated to handle dummy_css's. Note that css will soon grow its own ->parent field and css_parent() will be made trivial. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: pin cgroup_subsys_state when opening a cgroupfs fileTejun Heo1-11/+32
Previously, each file read/write operation relied on the inode reference count pinning the cgroup and simply checked whether the cgroup was marked dead before proceeding to invoke the per-subsystem callback. This was rather silly as it didn't have any synchronization or css pinning around the check and the cgroup may be removed and all css refs drained between the DEAD check and actual method invocation. This patch pins the css between open() and release() so that it is guaranteed to be alive for all file operations and remove the silly DEAD checks from cgroup_file_read/write(). Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: add subsys backlink pointer to cftypeTejun Heo1-35/+43
cgroup is transitioning to using css (cgroup_subsys_state) instead of cgroup as the primary subsystem handle. The cgroupfs file interface will be converted to use css's which requires finding out the subsystem from cftype so that the matching css can be determined from the cgroup. This patch adds cftype->ss which points to the subsystem the file belongs to. The field is initialized while a cftype is being registered. This makes it unnecessary to explicitly specify the subsystem for other cftype handling functions. @ss argument dropped from various cftype handling functions. This patch shouldn't introduce any behavior differences. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Vivek Goyal <[email protected]> Cc: Jens Axboe <[email protected]>
2013-08-08cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methodsTejun Heo6-91/+111
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup * in subsystem implementations for the following reasons. * With unified hierarchy, subsystems will be dynamically bound and unbound from cgroups and thus css's (cgroup_subsys_state) may be created and destroyed dynamically over the lifetime of a cgroup, which is different from the current state where all css's are allocated and destroyed together with the associated cgroup. This in turn means that cgroup_css() should be synchronized and may return NULL, making it more cumbersome to use. * Differing levels of per-subsystem granularity in the unified hierarchy means that the task and descendant iterators should behave differently depending on the specific subsystem the iteration is being performed for. * In majority of the cases, subsystems only care about its part in the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods often obtain the matching css pointer from the cgroup and don't bother with the cgroup pointer itself. Passing around css fits much better. This patch converts all cgroup_subsys methods to take @css instead of @cgroup. The conversions are mostly straight-forward. A few noteworthy changes are * ->css_alloc() now takes css of the parent cgroup rather than the pointer to the new cgroup as the css for the new cgroup doesn't exist yet. Knowing the parent css is enough for all the existing subsystems. * In kernel/cgroup.c::offline_css(), unnecessary open coded css dereference is replaced with local variable access. This patch shouldn't cause any behavior differences. v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced with local variable @css as suggested by Li Zefan. Rebased on top of new for-3.12 which includes for-3.11-fixes so that ->css_free() invocation added by da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails") is converted too. Suggested by Li Zefan. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vivek Goyal <[email protected]> Acked-by: Aristeu Rozanski <[email protected]> Acked-by: Daniel Wagner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Matt Helsley <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Steven Rostedt <[email protected]>
2013-08-08cgroup: add css_parent()Tejun Heo4-26/+8
Currently, controllers have to explicitly follow the cgroup hierarchy to find the parent of a given css. cgroup is moving towards using cgroup_subsys_state as the main controller interface construct, so let's provide a way to climb the hierarchy using just csses. This patch implements css_parent() which, given a css, returns its parent. The function is guarnateed to valid non-NULL parent css as long as the target css is not at the top of the hierarchy. freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices are converted to use css_parent() instead of accessing cgroup->parent directly. * __parent_ca() is dropped from cpuacct and its usage is replaced with parent_ca(). The only difference between the two was NULL test on cgroup->parent which is now embedded in css_parent() making the distinction moot. Note that eventually a css->parent field will be added to css and the NULL check in css_parent() will go away. This patch shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: add/update accessors which obtain subsys specific data from cssTejun Heo4-14/+27
css (cgroup_subsys_state) is usually embedded in a subsys specific data structure. Subsystems either use container_of() directly to cast from css to such data structure or has an accessor function wrapping such cast. As cgroup as whole is moving towards using css as the main interface handle, add and update such accessors to ease dealing with css's. All accessors explicitly handle NULL input and return NULL in those cases. While this looks like an extra branch in the code, as all controllers specific data structures have css as the first field, the casting doesn't involve any offsetting and the compiler can trivially optimize out the branch. * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such accessor. Added. * memory, hugetlb and devices already had one but didn't explicitly handle NULL input. Updated. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: add subsystem pointer to cgroup_subsys_stateTejun Heo1-0/+1
Currently, given a cgroup_subsys_state, there's no way to find out which subsystem the css is for, which we'll need to convert the cgroup controller API to primarily use @css instead of @cgroup. This patch adds cgroup_subsys_state->ss which points to the subsystem the @css belongs to. While at it, remove the comment about accessing @css->cgroup to determine the hierarchy. cgroup core will provide API to traverse hierarchy of css'es and we don't want subsystems to directly walk cgroup hierarchies anymore. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cpuset: drop "const" qualifiers from struct cpuset instancesTejun Heo1-9/+8
cpuset uses "const" qualifiers on struct cpuset in some functions; however, it doesn't work well when a value derived from returned const pointer has to be passed to an accessor. It's C after all. Drop the "const" qualifiers except for the trivially leaf ones. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/Tejun Heo7-16/+16
The names of the two struct cgroup_subsys_state accessors - cgroup_subsys_state() and task_subsys_state() - are somewhat awkward. The former clashes with the type name and the latter doesn't even indicate it's somehow related to cgroup. We're about to revamp large portion of cgroup API, so, let's rename them so that they're less awkward. Most per-controller usages of the accessors are localized in accessor wrappers and given the amount of scheduled changes, this isn't gonna add any noticeable headache. Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state() to task_css(). This patch is pure rename. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-08userns: limit the maximum depth of user_namespace->parent chainOleg Nesterov1-0/+4
Ensure that user_namespace->parent chain can't grow too much. Currently we use the hardroded 32 as limit. Reported-by: Andy Lutomirski <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-08-07perf: Do not get values from disabled counters in group format readJiri Olsa1-1/+2
It's possible some of the counters in the group could be disabled when sampling member of the event group is reading the rest via PERF_SAMPLE_READ sample type processing. Disabled counters could then produce wrong numbers. Fixing that by reading only enabled counters for PERF_SAMPLE_READ sample type processing. Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Namhyung Kim <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Corey Ashford <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2013-08-07perf: Add PERF_EVENT_IOC_ID ioctl to return event IDJiri Olsa1-0/+9
The only way to get the event ID is by reading the event fd, followed by parsing the ID value out of the returned data. While this is ok for current read format used by perf tool, it is not ok when we use PERF_FORMAT_GROUP format. With this format the data are returned for the whole group and there's no way to find out what ID belongs to our fd (if we are not group leader event). Adding a simple ioctl that returns event primary ID for given fd. Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Namhyung Kim <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Corey Ashford <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2013-08-07Merge tag 'trace-fixes-3.11-rc3' of ↵Linus Torvalds6-131/+272
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Oleg Nesterov has been working hard in closing all the holes that can lead to race conditions between deleting an event and accessing an event debugfs file. This included a fix to the debugfs system (acked by Greg Kroah-Hartman). We think that all the holes have been patched and hopefully we don't find more. I haven't marked all of them for stable because I need to examine them more to figure out how far back some of the changes need to go. Along the way, some other fixes have been made. Alexander Z Lam fixed some logic where the wrong buffer was being modifed. Andrew Vagin found a possible corruption for machines that actually allocate cpumask, as a reference to one was being zeroed out by mistake. Dhaval Giani found a bad prototype when tracing is not configured. And I not only had some changes to help Oleg, but also finally fixed a long standing bug that Dave Jones and others have been hitting, where a module unload and reload can cause the function tracing accounting to get screwed up" * tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix reset of time stamps during trace_clock changes tracing: Make TRACE_ITER_STOP_ON_FREE stop the correct buffer tracing: Fix trace_dump_stack() proto when CONFIG_TRACING is not set tracing: Fix fields of struct trace_iterator that are zeroed by mistake tracing/uprobes: Fail to unregister if probe event files are in use tracing/kprobes: Fail to unregister if probe event files are in use tracing: Add comment to describe special break case in probe_remove_event_call() tracing: trace_remove_event_call() should fail if call/file is in use debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs) ftrace: Check module functions being traced on reload ftrace: Consolidate some duplicate code for updating ftrace ops tracing: Change remove_event_file_dir() to clear "d_subdirs"->i_private tracing: Introduce remove_event_file_dir() tracing: Change f_start() to take event_mutex and verify i_private != NULL tracing: Change event_filter_read/write to verify i_private != NULL tracing: Change event_enable/disable_read() to verify i_private != NULL tracing: Turn event/id->i_private into call->event.type
2013-08-06x86, asmlinkage, power: Make various symbols used by the suspend asm code ↵Andi Kleen1-1/+1
visible Signed-off-by: Andi Kleen <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: H. Peter Anvin <[email protected]>
2013-08-06Merge branch 'for-3.11-fixes' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "Fix for a minor memory leak bug in the cgroup init failure path" * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix a leak when percpu_ref_init() fails
2013-08-06Merge branch 'for-3.11-fixes' of ↵Linus Torvalds1-10/+34
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull two workqueue fixes from Tejun Heo: "A lockdep notation update so that nested work_on_cpu() invocations don't lead to spurious lockdep warnings and fix for an unbound attr bug which made what's shown in sysfs deviate from the actual ones. Both patches have pretty limited scope" * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: copy workqueue_attrs with all fields workqueue: allow work_on_cpu() to be called recursively
2013-08-06printk: Fix return of braille_register_console()Steven Rostedt1-1/+2
Some of my configs I test with have CONFIG_A11Y_BRAILLE_CONSOLE set. When I started testing against v3.11-rc4 my console went bonkers. Using ktest to bisect the issue, it came down to: commit bbeddf52a "printk: move braille console support into separate braille.[ch] files" Looking into the patch I found the problem. It's with the return of braille_register_console(). As anything other than NULL is considered a failure. But for those of us that have CONFIG_A11Y_BRAILLE_CONSOLE set but do not define a "brl" or "brl=" on the command line, we still may want a console that those with sight can still use. Return NULL (success) if "brl" or "brl=" is not on the console line. Signed-off-by: Steven Rostedt <[email protected]> Acked-by: Joe Perches <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-08-06Revert "ptrace: PTRACE_DETACH should do flush_ptrace_hw_breakpoint(child)"Oleg Nesterov1-1/+0
This reverts commit fab840fc2d542fabcab903db8e03589a6702ba5f. This commit even has the test-case to prove that the tracee can be killed by SIGTRAP if the debugger does not remove the breakpoints before PTRACE_DETACH. However, this is exactly what wineserver deliberately does, set_thread_context() calls PTRACE_ATTACH + PTRACE_DETACH just for PTRACE_POKEUSER(DR*) in between. So we should revert this fix and document that PTRACE_DETACH should keep the breakpoints. Reported-by: Felipe Contreras <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-08-06userns: unshare_userns(&cred) should not populate cred on failureOleg Nesterov1-4/+9
unshare_userns(new_cred) does *new_cred = prepare_creds() before create_user_ns() which can fail. However, the caller expects that it doesn't need to take care of new_cred if unshare_userns() fails. We could change the single caller, sys_unshare(), but I think it would be more clean to avoid the side effects on failure, so with this patch unshare_userns() does put_cred() itself and initializes *new_cred only if create_user_ns() succeeeds. Cc: [email protected] Signed-off-by: Oleg Nesterov <[email protected]> Reviewed-by: Andy Lutomirski <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-08-05register_console: prevent adding the same console twiceAndreas Bießmann1-0/+7
This patch guards the console_drivers list to be corrupted. The for_each_console() macro insist on a strictly forward list ended by NULL: con0->next->con1->next->NULL Without this patch it may happen easily to destroy this list for example by adding 'earlyprintk' twice, especially on embedded devices where the early console is often a single static instance. This will result in the following list: con0->next->con0 This in turn will result in an endless loop in console_unlock() later on by printing the first __log_buf line endlessly. Signed-off-by: Andreas Bießmann <[email protected]> Cc: Kay Sievers <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2013-08-02tracing: Fix reset of time stamps during trace_clock changesAlexander Z Lam1-12/+12
Fixed two issues with changing the timestamp clock with trace_clock: - The global buffer was reset on instance clock changes. Change this to pass the correct per-instance buffer - ftrace_now() is used to set buf->time_start in tracing_reset_online_cpus(). This was incorrect because ftrace_now() used the global buffer's clock to return the current time. Change this to use buffer_ftrace_now() which returns the current time for the correct per-instance buffer. Also removed tracing_reset_current() because it is not used anywhere Link: http://lkml.kernel.org/r/[email protected] Cc: Vaibhav Nagarnaik <[email protected]> Cc: David Sharp <[email protected]> Cc: Alexander Z Lam <[email protected]> Cc: [email protected] # 3.10 Signed-off-by: Alexander Z Lam <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2013-08-02tracing: Make TRACE_ITER_STOP_ON_FREE stop the correct bufferAlexander Z Lam1-1/+1
Releasing the free_buffer file in an instance causes the global buffer to be stopped when TRACE_ITER_STOP_ON_FREE is enabled. Operate on the correct buffer. Link: http://lkml.kernel.org/r/[email protected] Cc: Vaibhav Nagarnaik <[email protected]> Cc: David Sharp <[email protected]> Cc: Alexander Z Lam <[email protected]> Cc: [email protected] # 3.10 Signed-off-by: Alexander Z Lam <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2013-08-02tracing: Fix fields of struct trace_iterator that are zeroed by mistakeAndrew Vagin1-0/+1
tracing_read_pipe zeros all fields bellow "seq". The declaration contains a comment about that, but it doesn't help. The first field is "snapshot", it's true when current open file is snapshot. Looks obvious, that it should not be zeroed. The second field is "started". It was converted from cpumask_t to cpumask_var_t (v2.6.28-4983-g4462344), in other words it was converted from cpumask to pointer on cpumask. Currently the reference on "started" memory is lost after the first read from tracing_read_pipe and a proper object will never be freed. The "started" is never dereferenced for trace_pipe, because trace_pipe can't have the TRACE_FILE_ANNOTATE options. Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] # 2.6.30 Signed-off-by: Andrew Vagin <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2013-08-02cgroup: Merge branch 'for-3.11-fixes' into for-3.12Tejun Heo1-13/+22
for-3.12 branch is about to receive invasive updates which are dependent on da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails"). Given the amount of scheduled changes, I think it'd less painful to pull in for-3.11-fixes as preparation. Pull in for-3.11-fixes into for-3.12. Signed-off-by: Tejun Heo <[email protected]>
2013-08-02Merge tag 'pm+acpi-3.11-rc4' of ↵Linus Torvalds3-8/+14
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: - Revert two cpuidle commits added during the 3.8 development cycle that turn out to have introduced a significant performance regression as requested by Jeremy Eder. - The recent patches that made the freezer less heavy-weight introduced a regression causing user-space-driven hibernation using the ioctl() interface to block indefinitely when the hibernate process executes try_to_freeze(). Fix from Colin Cross addresses this by adding a process flag to mark the hibernate/suspend process to inform the freezer that that process should be ignored. - One of the recent cpufreq reverts uncovered a problem in the core causing the cpufreq driver module refcount to become negative after a system suspend-resume cycle. Fix from Rafael J Wysocki. - The evaluation of the ACPI battery _BIX method has never worked correctly, because the commit that added support for it forgot to take the "Revision" field in the return package into account. As a result, the reading of battery info doesn't work at all on some systems, which is addressed by a fix from Lan Tianyu. * tag 'pm+acpi-3.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: freezer: set PF_SUSPEND_TASK flag on tasks that call freeze_processes ACPI / battery: Fix parsing _BIX return value cpufreq: Fix cpufreq driver module refcount balance after suspend/resume Revert "cpuidle: Quickly notice prediction failure for repeat mode" Revert "cpuidle: Quickly notice prediction failure in general case"
2013-08-02hung_task debugging: Print more info when reporting the problemOleg Nesterov1-4/+9
printk(KERN_ERR) from check_hung_task() likely means we have a bug, but unlike BUG_ON()/WARN_ON ()it doesn't show the kernel version, this complicates the bug-reports investigation. Add the additional pr_err() to print tainted/release/version like dump_stack_print_info() does, the output becomes: INFO: task perl:504 blocked for more than 2 seconds. Not tainted 3.11.0-rc1-10367-g136bb46-dirty #1763 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ... While at it, turn the old printk's into pr_err(). Signed-off-by: Oleg Nesterov <[email protected]> Cc: [email protected] Cc: Christopher Williams <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Mandeep Singh Baines <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-08-01tracing/uprobes: Fail to unregister if probe event files are in useSteven Rostedt (Red Hat)1-13/+38
Uprobes suffer the same problem that kprobes have. There's a race between writing to the "enable" file and removing the probe. The probe checks for it being in use and if it is not, goes about deleting the probe and the event that represents it. But the problem with that is, after it checks if it is in use it can be enabled, and the deletion of the event (access to the probe) will fail, as it is in use. But the uprobe will still be deleted. This is a problem as the event can reference the uprobe that was deleted. The fix is to remove the event first, and check to make sure the event removal succeeds. Then it is safe to remove the probe. When the event exists, either ftrace or perf can enable the probe and prevent the event from being removed. Link: http://lkml.kernel.org/r/[email protected] Acked-by: Oleg Nesterov <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2013-08-01cgroup: rename cgroup_pidlist->mutexLi Zefan1-11/+11
It's a rw_semaphore not a mutex. Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2013-08-01cgroup: restructure the failure path in cgroup_write_event_control()Li Zefan1-21/+18
It uses a single label and checks the validity of each pointer. This is err-prone, and actually we had a bug because one of the check was insufficient. Use multi lables as we do in other places. v2: - drop initializations of local variables. Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2013-08-01workqueue: copy workqueue_attrs with all fieldsShaohua Li1-0/+12
$echo '0' > /sys/bus/workqueue/devices/xxx/numa $cat /sys/bus/workqueue/devices/xxx/numa I got 1. It should be 0, the reason is copy_workqueue_attrs() called in apply_workqueue_attrs() doesn't copy no_numa field. Fix it by making copy_workqueue_attrs() copy ->no_numa too. This would also make get_unbound_pool() set a pool's ->no_numa attribute according to the workqueue attributes used when the pool was created. While harmelss, as ->no_numa isn't a pool attribute, this is a bit confusing. Clear it explicitly. tj: Updated description and comments a bit. Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: [email protected]
2013-07-31tracing/kprobes: Fail to unregister if probe event files are in useSteven Rostedt (Red Hat)1-6/+15
When a probe is being removed, it cleans up the event files that correspond to the probe. But there is a race between writing to one of these files and deleting the probe. This is especially true for the "enable" file. CPU 0 CPU 1 ----- ----- fd = open("enable",O_WRONLY); probes_open() release_all_trace_probes() unregister_trace_probe() if (trace_probe_is_enabled(tp)) return -EBUSY write(fd, "1", 1) __ftrace_set_clr_event() call->class->reg() (kprobe_register) enable_trace_probe(tp) __unregister_trace_probe(tp); list_del(&tp->list) unregister_probe_event(tp) <-- fails! free_trace_probe(tp) write(fd, "0", 1) __ftrace_set_clr_event() call->class->unreg (kprobe_register) disable_trace_probe(tp) <-- BOOM! A test program was written that used two threads to simulate the above scenario adding a nanosleep() interval to change the timings and after several thousand runs, it was able to trigger this bug and crash: BUG: unable to handle kernel paging request at 00000005000000f9 IP: [<ffffffff810dee70>] probes_open+0x3b/0xa7 PGD 7808a067 PUD 0 Oops: 0000 [#1] PREEMPT SMP Dumping ftrace buffer: --------------------------------- Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 3.11.0-rc3-test+ #47 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 task: ffff880077756440 ti: ffff880076e52000 task.ti: ffff880076e52000 RIP: 0010:[<ffffffff810dee70>] [<ffffffff810dee70>] probes_open+0x3b/0xa7 RSP: 0018:ffff880076e53c38 EFLAGS: 00010203 RAX: 0000000500000001 RBX: ffff88007844f440 RCX: 0000000000000003 RDX: 0000000000000003 RSI: 0000000000000003 RDI: ffff880076e52000 RBP: ffff880076e53c58 R08: ffff880076e53bd8 R09: 0000000000000000 R10: ffff880077756440 R11: 0000000000000006 R12: ffffffff810dee35 R13: ffff880079250418 R14: 0000000000000000 R15: ffff88007844f450 FS: 00007f87a276f700(0000) GS:ffff88007d480000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000005000000f9 CR3: 0000000077262000 CR4: 00000000000007e0 Stack: ffff880076e53c58 ffffffff81219ea0 ffff88007844f440 ffffffff810dee35 ffff880076e53ca8 ffffffff81130f78 ffff8800772986c0 ffff8800796f93a0 ffffffff81d1b5d8 ffff880076e53e04 0000000000000000 ffff88007844f440 Call Trace: [<ffffffff81219ea0>] ? security_file_open+0x2c/0x30 [<ffffffff810dee35>] ? unregister_trace_probe+0x4b/0x4b [<ffffffff81130f78>] do_dentry_open+0x162/0x226 [<ffffffff81131186>] finish_open+0x46/0x54 [<ffffffff8113f30b>] do_last+0x7f6/0x996 [<ffffffff8113cc6f>] ? inode_permission+0x42/0x44 [<ffffffff8113f6dd>] path_openat+0x232/0x496 [<ffffffff8113fc30>] do_filp_open+0x3a/0x8a [<ffffffff8114ab32>] ? __alloc_fd+0x168/0x17a [<ffffffff81131f4e>] do_sys_open+0x70/0x102 [<ffffffff8108f06e>] ? trace_hardirqs_on_caller+0x160/0x197 [<ffffffff81131ffe>] SyS_open+0x1e/0x20 [<ffffffff81522742>] system_call_fastpath+0x16/0x1b Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 6c 31 f6 48 c7 RIP [<ffffffff810dee70>] probes_open+0x3b/0xa7 RSP <ffff880076e53c38> CR2: 00000005000000f9 ---[ end trace 35f17d68fc569897 ]--- The unregister_trace_probe() must be done first, and if it fails it must fail the removal of the kprobe. Several changes have already been made by Oleg Nesterov and Masami Hiramatsu to allow moving the unregister_probe_event() before the removal of the probe and exit the function if it fails. This prevents the tp structure from being used after it is freed. Link: http://lkml.kernel.org/r/[email protected] Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2013-07-31Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds7-105/+197
Merge more patches from Andrew Morton: "A bunch of fixes. Plus Joe's printk move and rework. It's not a -rc3 thing but now would be a nice time to offload it, while things are quiet. I've been sitting on it all for a couple of weeks, no issues" * emailed patches from Andrew Morton <[email protected]>: vmpressure: make sure there are no events queued after memcg is offlined vmpressure: do not check for pending work to prevent from new work vmpressure: change vmpressure::sr_lock to spinlock printk: rename struct log to struct printk_log printk: use pointer for console_cmdline indexing printk: move braille console support into separate braille.[ch] files printk: add console_cmdline.h printk: move to separate directory for easier modification drivers/rtc/rtc-twl.c: fix: rtcX/wakealarm attribute isn't created mm: zbud: fix condition check on allocation size thp, mm: avoid PageUnevictable on active/inactive lru lists mm/swap.c: clear PageActive before adding pages onto unevictable list arch/x86/platform/ce4100/ce4100.c: include reboot.h mm: sched: numa: fix NUMA balancing when !SCHED_DEBUG rapidio: fix use after free in rio_unregister_scan() .gitignore: ignore *.lz4 files MAINTAINERS: dynamic debug: Jason's not there... dmi_scan: add comments on dmi_present() and the loop in dmi_scan_machine() ocfs2/refcounttree: add the missing NULL check of the return value of find_or_create_page() mm: mempolicy: fix mbind_range() && vma_adjust() interaction
2013-07-31printk: rename struct log to struct printk_logJoe Perches1-40/+40
Rename the struct to enable moving portions of printk.c to separate files. The rename changes output of /proc/vmcoreinfo. Signed-off-by: Joe Perches <[email protected]> Cc: Samuel Thibault <[email protected]> Cc: Ming Lei <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-07-31printk: use pointer for console_cmdline indexingJoe Perches1-23/+26
Make the code a bit more compact by always using a pointer for the active console_cmdline. Move overly indented code to correct indent level. Signed-off-by: Joe Perches <[email protected]> Cc: Samuel Thibault <[email protected]> Cc: Ming Lei <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>