aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2014-03-20Rename TAINT_UNSAFE_SMP to TAINT_CPU_OUT_OF_SPECDave Jones2-2/+2
Rename TAINT_UNSAFE_SMP to TAINT_CPU_OUT_OF_SPEC, so we can repurpose the flag to encompass a wider range of pushing the CPU beyond its warrany. Signed-off-by: Dave Jones <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: H. Peter Anvin <[email protected]>
2014-03-20tracing: Fix array size mismatch in format stringVaibhav Nagarnaik2-11/+2
In event format strings, the array size is reported in two locations. One in array subscript and then via the "size:" attribute. The values reported there have a mismatch. For e.g., in sched:sched_switch the prev_comm and next_comm character arrays have subscript values as [32] where as the actual field size is 16. name: sched_switch ID: 301 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1;signed:0; field:int common_pid; offset:4; size:4; signed:1; field:char prev_comm[32]; offset:8; size:16; signed:1; field:pid_t prev_pid; offset:24; size:4; signed:1; field:int prev_prio; offset:28; size:4; signed:1; field:long prev_state; offset:32; size:8; signed:1; field:char next_comm[32]; offset:40; size:16; signed:1; field:pid_t next_pid; offset:56; size:4; signed:1; field:int next_prio; offset:60; size:4; signed:1; After bisection, the following commit was blamed: 92edca0 tracing: Use direct field, type and system names This commit removes the duplication of strings for field->name and field->type assuming that all the strings passed in __trace_define_field() are immutable. This is not true for arrays, where the type string is created in event_storage variable and field->type for all array fields points to event_storage. Use __stringify() to create a string constant for the type string. Also, get rid of event_storage and event_storage_mutex that are not needed anymore. also, an added benefit is that this reduces the overhead of events a bit more: text data bss dec hex filename 8424787 2036472 1302528 11763787 b3804b vmlinux 8420814 2036408 1302528 11759750 b37086 vmlinux.patched Link: http://lkml.kernel.org/r/[email protected] Cc: Laurent Chavey <[email protected]> Cc: [email protected] # 3.10+ Signed-off-by: Vaibhav Nagarnaik <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-03-20cgroup: break kernfs active_ref protection in cgroup directory operationsTejun Heo1-1/+26
cgroup_tree_mutex should nest above the kernfs active_ref protection; however, cgroup_create() and cgroup_rename() were grabbing cgroup_tree_mutex while under kernfs active_ref protection. This has actualy possibility to lead to deadlocks in case these operations race against cgroup_rmdir() which invokes kernfs_remove() on directory kernfs_node while holding cgroup_tree_mutex. Neither cgroup_create() or cgroup_rename() requires active_ref protection. The former already has enough synchronization through cgroup_lock_live_group() and the latter doesn't care, so this can be fixed by updating both functions to break all active_ref protections before grabbing cgroup_tree_mutex. While this patch fixes the immediate issue, it probably needs further work in the long term - kernfs directories should enable lockdep annotations and maybe the better way to handle this is marking directory nodes as not needing active_ref protection rather than breaking it in each operation. Signed-off-by: Tejun Heo <[email protected]> Cc: Greg Kroah-Hartman <[email protected]>
2014-03-20syscall_get_arch: remove useless function argumentsEric Paris1-2/+2
Every caller of syscall_get_arch() uses current for the task and no implementors of the function need args. So just get rid of both of those things. Admittedly, since these are inline functions we aren't wasting stack space, but it just makes the prototypes better. Signed-off-by: Eric Paris <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected]
2014-03-20audit: remove stray newline from audit_log_execve_info() audit_panic() callJoe Perches1-1/+1
There's an unnecessary use of a \n in audit_panic. Signed-off-by: Richard Guy Briggs <[email protected]>
2014-03-20audit: remove stray newlines from audit_log_lost messagesJosh Boyer1-2/+2
Calling audit_log_lost with a \n in the format string leads to extra newlines in dmesg. That function will eventually call audit_panic which uses pr_err with an explicit \n included. Just make these calls match the others that lack \n. Reported-by: Jonathan Kamens <[email protected]> Signed-off-by: Josh Boyer <[email protected]> Signed-off-by: Richard Guy Briggs <[email protected]>
2014-03-20audit: include subject in login recordsEric Paris1-6/+4
The login uid change record does not include the selinux context of the task logging in. Add that information. (Updated from 2011-01: RHBZ:670328 -- RGB) Reported-by: Steve Grubb <[email protected]> Acked-by: James Morris <[email protected]> Signed-off-by: Eric Paris <[email protected]> Signed-off-by: Aristeu Rozanski <[email protected]> Signed-off-by: Richard Guy Briggs <[email protected]>
2014-03-20audit: remove superfluous new- prefix in AUDIT_LOGIN messagesRichard Guy Briggs1-1/+1
The new- prefix on ses and auid are un-necessary and break ausearch. Signed-off-by: Richard Guy Briggs <[email protected]>
2014-03-20audit: allow user processes to log from another PID namespaceRichard Guy Briggs1-3/+7
Still only permit the audit logging daemon and control to operate from the initial PID namespace, but allow processes to log from another PID namespace. Cc: "Eric W. Biederman" <[email protected]> (informed by ebiederman's c776b5d2) Signed-off-by: Richard Guy Briggs <[email protected]>
2014-03-20audit: anchor all pid references in the initial pid namespaceRichard Guy Briggs3-10/+28
Store and log all PIDs with reference to the initial PID namespace and use the access functions task_pid_nr() and task_tgid_nr() for task->pid and task->tgid. Cc: "Eric W. Biederman" <[email protected]> (informed by ebiederman's c776b5d2) Signed-off-by: Richard Guy Briggs <[email protected]>
2014-03-20audit: convert PPIDs to the inital PID namespace.Richard Guy Briggs2-3/+3
sys_getppid() returns the parent pid of the current process in its own pid namespace. Since audit filters are based in the init pid namespace, a process could avoid a filter or trigger an unintended one by being in an alternate pid namespace or log meaningless information. Switch to task_ppid_nr() for PPIDs to anchor all audit filters in the init_pid_ns. (informed by ebiederman's 6c621b7e) Cc: [email protected] Cc: Eric W. Biederman <[email protected]> Signed-off-by: Richard Guy Briggs <[email protected]>
2014-03-20audit: rename the misleading audit_get_context() to audit_take_context()Richard Guy Briggs1-3/+4
"get" usually implies incrementing a refcount into a structure to indicate a reference being held by another part of code. Change this function name to indicate it is in fact being taken from it, returning the value while clearing it in the supplying structure. Signed-off-by: Richard Guy Briggs <[email protected]>
2014-03-20audit: Send replies in the proper network namespace.Eric W. Biederman2-13/+15
In perverse cases of file descriptor passing the current network namespace of a process and the network namespace of a socket used by that socket may differ. Therefore use the network namespace of the appropiate socket to ensure replies always go to the appropiate socket. Signed-off-by: "Eric W. Biederman" <[email protected]> Acked-by: Richard Guy Briggs <[email protected]> Signed-off-by: Eric Paris <[email protected]>
2014-03-20audit: Use struct net not pid_t to remember the network namespce to reply inEric W. Biederman3-6/+9
While reading through 3.14-rc1 I found a pretty siginficant mishandling of network namespaces in the recent audit changes. In struct audit_netlink_list and audit_reply add a reference to the network namespace of the caller and remove the userspace pid of the caller. This cleanly remembers the callers network namespace, and removes a huge class of races and nasty failure modes that can occur when attempting to relook up the callers network namespace from a pid_t (including the caller's network namespace changing, pid wraparound, and the pid simply not being present). Signed-off-by: "Eric W. Biederman" <[email protected]> Acked-by: Richard Guy Briggs <[email protected]> Signed-off-by: Eric Paris <[email protected]>
2014-03-20audit: Audit proc/<pid>/cmdline aka proctitleWilliam Roberts2-0/+73
During an audit event, cache and print the value of the process's proctitle value (proc/<pid>/cmdline). This is useful in situations where processes are started via fork'd virtual machines where the comm field is incorrect. Often times, setting the comm field still is insufficient as the comm width is not very wide and most virtual machine "package names" do not fit. Also, during execution, many threads have their comm field set as well. By tying it back to the global cmdline value for the process, audit records will be more complete in systems with these properties. An example of where this is useful and applicable is in the realm of Android. With Android, their is no fork/exec for VM instances. The bare, preloaded Dalvik VM listens for a fork and specialize request. When this request comes in, the VM forks, and the loads the specific application (specializing). This was done to take advantage of COW and to not require a load of basic packages by the VM on very app spawn. When this spawn occurs, the package name is set via setproctitle() and shows up in procfs. Many of these package names are longer then 16 bytes, the historical width of task->comm. Having the cmdline in the audit records will couple the application back to the record directly. Also, on my Debian development box, some audit records were more useful then what was printed under comm. The cached proctitle is tied to the life-cycle of the audit_context structure and is built on demand. Proctitle is controllable by userspace, and thus should not be trusted. It is meant as an aid to assist in debugging. The proctitle event is emitted during syscall audits, and can be filtered with auditctl. Example: type=AVC msg=audit(1391217013.924:386): avc: denied { getattr } for pid=1971 comm="mkdir" name="/" dev="selinuxfs" ino=1 scontext=system_u:system_r:consolekit_t:s0-s0:c0.c255 tcontext=system_u:object_r:security_t:s0 tclass=filesystem type=SYSCALL msg=audit(1391217013.924:386): arch=c000003e syscall=137 success=yes exit=0 a0=7f019dfc8bd7 a1=7fffa6aed2c0 a2=fffffffffff4bd25 a3=7fffa6aed050 items=0 ppid=1967 pid=1971 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mkdir" exe="/bin/mkdir" subj=system_u:system_r:consolekit_t:s0-s0:c0.c255 key=(null) type=UNKNOWN[1327] msg=audit(1391217013.924:386): proctitle=6D6B646972002D70002F7661722F72756E2F636F6E736F6C65 Acked-by: Steve Grubb <[email protected]> (wrt record formating) Signed-off-by: William Roberts <[email protected]> Signed-off-by: Eric Paris <[email protected]>
2014-03-20profile: Fix CPU hotplug callback registrationSrivatsa S. Bhat1-5/+15
Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Instead, the correct and race-free way of performing the callback registration is: cpu_notifier_register_begin(); for_each_online_cpu(cpu) init_cpu(cpu); /* Note the use of the double underscored version of the API */ __register_cpu_notifier(&foobar_cpu_notifier); cpu_notifier_register_done(); Fix the profile code by using this latter form of callback registration. Cc: Al Viro <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2014-03-20trace, ring-buffer: Fix CPU hotplug callback registrationSrivatsa S. Bhat1-8/+11
Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Instead, the correct and race-free way of performing the callback registration is: cpu_notifier_register_begin(); for_each_online_cpu(cpu) init_cpu(cpu); /* Note the use of the double underscored version of the API */ __register_cpu_notifier(&foobar_cpu_notifier); cpu_notifier_register_done(); Fix the tracing ring-buffer code by using this latter form of callback registration. Cc: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Acked-by: Steven Rostedt <[email protected]> Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2014-03-20CPU hotplug: Provide lockless versions of callback registration functionsSrivatsa S. Bhat1-2/+19
The following method of CPU hotplug callback registration is not safe due to the possibility of an ABBA deadlock involving the cpu_add_remove_lock and the cpu_hotplug.lock. get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); The deadlock is shown below: CPU 0 CPU 1 ----- ----- Acquire cpu_hotplug.lock [via get_online_cpus()] CPU online/offline operation takes cpu_add_remove_lock [via cpu_maps_update_begin()] Try to acquire cpu_add_remove_lock [via register_cpu_notifier()] CPU online/offline operation tries to acquire cpu_hotplug.lock [via cpu_hotplug_begin()] *** DEADLOCK! *** The problem here is that callback registration takes the locks in one order whereas the CPU hotplug operations take the same locks in the opposite order. To avoid this issue and to provide a race-free method to register CPU hotplug callbacks (along with initialization of already online CPUs), introduce new variants of the callback registration APIs that simply register the callbacks without holding the cpu_add_remove_lock during the registration. That way, we can avoid the ABBA scenario. However, we will need to hold the cpu_add_remove_lock throughout the entire critical section, to protect updates to the callback/notifier chain. This can be achieved by writing the callback registration code as follows: cpu_maps_update_begin(); [ or cpu_notifier_register_begin(); see below ] for_each_online_cpu(cpu) init_cpu(cpu); /* This doesn't take the cpu_add_remove_lock */ __register_cpu_notifier(&foobar_cpu_notifier); cpu_maps_update_done(); [ or cpu_notifier_register_done(); see below ] Note that we can't use get_online_cpus() here instead of cpu_maps_update_begin() because the cpu_hotplug.lock is dropped during the invocation of CPU_POST_DEAD notifiers, and hence get_online_cpus() cannot provide the necessary synchronization to protect the callback/notifier chains against concurrent reads and writes. On the other hand, since the cpu_add_remove_lock protects the entire hotplug operation (including CPU_POST_DEAD), we can use cpu_maps_update_begin/done() to guarantee proper synchronization. Also, since cpu_maps_update_begin/done() is like a super-set of get/put_online_cpus(), the former naturally protects the critical sections from concurrent hotplug operations. Since the names cpu_maps_update_begin/done() don't make much sense in CPU hotplug callback registration scenarios, we'll introduce new APIs named cpu_notifier_register_begin/done() and map them to cpu_maps_update_begin/done(). In summary, introduce the lockless variants of un/register_cpu_notifier() and also export the cpu_notifier_register_begin/done() APIs for use by modules. This way, we provide a race-free way to register hotplug callbacks as well as perform initialization for the CPUs that are already online. Cc: Thomas Gleixner <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: Toshi Kani <[email protected]> Reviewed-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2014-03-20CPU hotplug: Add lockdep annotations to get/put_online_cpus()Gautham R. Shenoy1-0/+17
Add lockdep annotations for get/put_online_cpus() and cpu_hotplug_begin()/cpu_hotplug_end(). Cc: Ingo Molnar <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2014-03-20Merge branches 'pm-runtime' and 'pm-sleep'Rafael J. Wysocki6-14/+20
* pm-runtime: PM / Runtime: Update runtime_idle() documentation for return value meaning * pm-sleep: PM / sleep: Correct whitespace errors in <linux/pm.h> PM: Add missing "freeze" state PM / Hibernate: Spelling s/anonymouns/anonymous/ PM / Runtime: Add missing "it" in comment PM / suspend: Remove unnecessary !! PCI / PM: Resume runtime-suspended devices later during system suspend ACPI / PM: Resume runtime-suspended devices later during system suspend PM / sleep: Set pm_generic functions to NULL for !CONFIG_PM_SLEEP PM: fix typo in comment PM / hibernate: use name_to_dev_t to parse resume PM / wakeup: Include appropriate header file in kernel/power/wakelock.c PM / sleep: Move prototype declaration to header file kernel/power/power.h PM / sleep: Asynchronous threads for suspend_late PM / sleep: Asynchronous threads for suspend_noirq PM / sleep: Asynchronous threads for resume_early PM / sleep: Asynchronous threads for resume_noirq PM / sleep: Two flags for async suspend_noirq and suspend_late
2014-03-20Merge branches 'pm-qos', 'pm-domains' and 'pm-drivers'Rafael J. Wysocki1-6/+12
* pm-qos: PM / QoS: Add type to dev_pm_qos_add_ancestor_request() arguments ACPI / LPSS: Support for device latency tolerance PM QoS ACPI / scan: Add bind/unbind callbacks to struct acpi_scan_handler PM / QoS: Introcuce latency tolerance device PM QoS type PM / QoS: Add no_constraints_value field to struct pm_qos_constraints PM / QoS: Rename device resume latency QoS items * pm-domains: PM / domains: Turn latency warning into debug message * pm-drivers: PM: Add pm_runtime_suspend|resume_force functions PM / runtime: Fetch runtime PM callbacks using a macro
2014-03-20timer: Remove code redundancy while calling get_nohz_timer_target()Viresh Kumar3-21/+6
There are only two users of get_nohz_timer_target(): timer and hrtimer. Both call it under same circumstances, i.e. #ifdef CONFIG_NO_HZ_COMMON if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu)) return get_nohz_timer_target(); #endif So, it makes more sense to get all this as part of get_nohz_timer_target() instead of duplicating code at two places. For this another parameter is required to be passed to this routine, pinned. Signed-off-by: Viresh Kumar <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.org Signed-off-by: Thomas Gleixner <[email protected]>
2014-03-20timer: Use variable head instead of &work_list in __run_timers()Viresh Kumar1-1/+1
We already have a variable 'head' that points to '&work_list', and so we should use that instead wherever possible. Signed-off-by: Viresh Kumar <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/0d8645a6efc8360c4196c9797d59343abbfdcc5e.1395129136.git.viresh.kumar@linaro.org Signed-off-by: Thomas Gleixner <[email protected]>
2014-03-19Merge branch 'for-3.14-fixes' of ↵Linus Torvalds1-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "One really late cgroup patch to fix error path in create_css(). Hitting this bug would be pretty rare but still possible and it gets delayed we'd need to backport it through -stable anyway. It only updates error path in create_css() and has low chance of new breakages" * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix a failure path in create_css()
2014-03-19cgroup: fix cgroup_taskset walking orderTejun Heo1-5/+19
cgroup_taskset is used to track and iterate target tasks while migrating a task or process and should guarantee that the first task iterated is the task group leader if a process is being migrated. b3dc094e9390 ("cgroup: use css_set->mg_tasks to track target tasks during migration") replaced flex array cgroup_taskset->tc_array with css_set->mg_tasks list to remove process size limit and dynamic allocation during migration; unfortunately, it incorrectly used list operations which don't preserve order breaking the guarantee that cgroup_taskset_first() returns the leader for a process target. Fix it by using order preserving list operations. Note that as multiple src_csets may map to a single dst_cset, the iteration order may change across cgroup_task_migrate(); however, the leader is still guaranteed to be the first entry. The switch to list_splice_tail_init() at the end of cgroup_migrate() isn't strictly necessary. Let's still do it for consistency. Signed-off-by: Tejun Heo <[email protected]>
2014-03-19resources: Set type in __request_region()Bjorn Helgaas1-2/+2
We don't set the type (I/O, memory, etc.) of resources added by __request_region(), which leads to confusing messages like this: address space collision: [io 0x1000-0x107f] conflicts with ACPI CPU throttle [??? 0x00001010-0x00001015 flags 0x80000000] Set the type of a new resource added by __request_region() (used by request_region() and request_mem_region()) to the type of its parent. This makes the resource tree internally consistent and fixes messages like the above, where the ACPI CPU throttle resource really is an I/O port region, but request_region() didn't fill in the type, so %pR didn't know how to print it. Sample dmesg showing the issue at the link below. Link: https://bugzilla.kernel.org/show_bug.cgi?id=71611 Reported-by: Paul Bolle <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2014-03-19cgroup: implement CFTYPE_ONLY_ON_DFLTejun Heo1-0/+2
This cftype flag makes the file only appear on the default hierarchy. This will later be used for cgroup.controllers file. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-03-19cgroup: make cgrp_dfl_root mountableTejun Heo1-33/+61
cgrp_dfl_root will be used as the default unified hierarchy. This patch makes cgrp_dfl_root mountable by making the following changes. * cgroup_init_early() now initializes cgrp_dfl_root w/ CGRP_ROOT_SANE_BEHAVIOR. The default hierarchy is always sane. * parse_cgroupfs_options() and cgroup_mount() are updated such that cgrp_dfl_root is mounted if sane_behavior is specified w/o any subsystems. * rebind_subsystems() now populates the root directory of cgrp_dfl_root. Note that the function still guarantees success of rebinding subsystems to cgrp_dfl_root. If populating fails while rebinding to cgrp_dfl_root, it whines but ignores the error. * For backward compatibility, the default hierarchy shows up in /proc/$PID/cgroup only after it's explicitly mounted so that userland which doesn't make use of it doesn't see any change. * "current_css_set_cg_links" file of debug cgroup now treats the default hierarchy the same as other hierarchies. This is visible to userland. Given that it's for debug controller, this should be fine. * While at it, implement cgroup_on_dfl() which tests whether a give cgroup is on the default hierarchy or not. The above changes make cgrp_dfl_root mostly equivalent to other controllers but the actual unified hierarchy behaviors are not implemented yet. Let's plug child cgroup creation in cgrp_dfl_root from create_cgroup() for now. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-03-19cgroup: drop const from @buffer of cftype->write_string()Tejun Heo3-3/+3
cftype->write_string() just passes on the writeable buffer from kernfs and there's no reason to add const restriction on the buffer. The only thing const achieves is unnecessarily complicating parsing of the buffer. Drop const from @buffer. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]>
2014-03-19cgroup: rename cgroup_dummy_root and related namesTejun Heo1-87/+81
The dummy root will be repurposed to serve as the default unified hierarchy. Let's rename things in preparation. * s/cgroup_dummy_root/cgrp_dfl_root/ * s/cgroupfs_root/cgroup_root/ as we don't do fs part directly anymore * s/cgroup_root->top_cgroup/cgroup_root->cgrp/ for brevity This is pure rename. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-03-19cgroup: move ->subsys_mask from cgroupfs_root to cgroupTejun Heo1-22/+39
cgroupfs_root->subsys_mask represents the controllers attached to the hierarchy. This patch moves the field to cgroup. Subsystem initialization and rebinding updates the top cgroup's subsys_mask. For !root cgroups, the subsys_mask bits are set from create_css() and cleared from kill_css(), which effectively means that all cgroups will have the same subsys_mask as the top cgroup. While this doesn't make any difference now, this will help implementation of the default unified hierarchy where !root cgroups may have subsets of the top_cgroup's subsys_mask. While at it, __kill_css() is split out of kill_css(). The former doesn't care about the subsys_mask while the latter becomes noop if the controller is already killed and clears the matching bit if not before proceeding to killing the css. This will be used later by the default unified hierarchy implementation. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-03-19cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebindingTejun Heo1-44/+56
Currently, while rebinding, cgroup_dummy_root serves as the anchor point. In addition to the target root, rebind_subsystems() takes @added_mask and @removed_mask. The subsystems specified in the former are expected to be on the dummy root and then moved to the target root. The ones in the latter are moved from non-dummy root to dummy. Now that the dummy root is a fully functional one and we're planning to use it for the default unified hierarchy, this level of distinction between dummy and non-dummy roots is quite awkward. This patch updates rebind_subsystems() to take the target root and one subsystem mask and move the specified subsystmes to the target root which may or may not be the dummy root. IOW, unbinding now becomes moving the subsystems to the dummy root and binding to non-dummy root. This makes the dummy root mostly equivalent to other hierarchies in terms of the mechanism of moving subsystems around; however, we still retain all the semantical restrictions so that this patch doesn't introduce any visible behavior differences. Another noteworthy detail is that rebind_subsystems() guarantees that moving a subsystem to the dummy root never fails so that valid unmounting attempts always succeed. This unifies binding and unbinding of subsystems. The invocation points of ->bind() were inconsistent between the two and now moved after whole rebinding is complete. This doesn't break the current users and generally makes more sense. All rebind_subsystems() users are converted accordingly. Note that cgroup_remount() now makes two calls to rebind_subsystems() to bind and then unbind the requested subsystems. This will allow repurposing of the dummy hierarchy as the default unified hierarchy and shouldn't make any userland visible behavior difference. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-03-19cgroup: use cgroup_setup_root() to initialize cgroup_dummy_rootTejun Heo1-23/+20
cgroup_dummy_root is used to host controllers which aren't attached to any other hierarchy. The root is minimally set up during kernfs bootstrap and didn't go through full hierarchy initialization. We're planning to use cgroup_dummy_root for the default unified hierarchy and thus want it to be fully functional. Replace the special initialization, which was collected into cgroup_init() by the previous patch, with an invocation of cgroup_setup_root(). This simplifies the init path and makes cgroup_dummy_root a full hierarchy with its own kernfs_root and all. As this puts the dummy hierarchy on the cgroup_roots list, rename for_each_active_root() to for_each_root() and update its users to skip the dummy root for now. This patch doesn't cause any userland visible behavior changes at this point. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-03-19cgroup: reorganize cgroup bootstrappingTejun Heo1-51/+49
* Fields of init_css_set and css_set_count are now set using initializer instead of programmatically from cgroup_init_early(). * init_cgroup_root() now also takes @opts and performs the optional part of initialization too. The leftover part of cgroup_root_from_opts() is collapsed into its only caller - cgroup_mount(). * Initialization of cgroup_root_count and linking of init_css_set are moved from cgroup_init_early() to to cgroup_init(). None of the early_init users depends on init_css_set being linked. * Subsystem initializations are moved after dummy hierarchy init and init_css_set linking. These changes reorganize the bootstrap logic so that the dummy hierarchy can share the usual hierarchy init path and be made more normal. These changes don't make noticeable behavior changes. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-03-19cgroup: relocate setting of CGRP_DEADTejun Heo1-9/+9
In cgroup_destroy_locked(), move setting of CGRP_DEAD above invocations of kill_css(). This doesn't make any visible behavior difference now but will be used to inhibit manipulating controller enable states of a dying cgroup on the unified hierarchy. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-03-19genirq: procfs: Make smp_affinity values go+rChema Gonzalez1-4/+4
Includes: - /proc/irq/default_smp_affinity - /proc/irq/*/affinity_hint - /proc/irq/*/smp_affinity - /proc/irq/*/smp_affinity_list Users can distill the same information by reading /proc/interrupts. Signed-off-by: Chema Gonzalez <[email protected]> Cc: Eric Dumazet <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-03-19softirq: Add linux/irq.h to make it compile againThomas Gleixner1-0/+1
On Sparc and S390 the removal of irq.h from kernel_stat.h causes: kernel/softirq.c:774:9: error: 'NR_IRQS_LEGACY' undeclared Reported-by: Peter Zijlstra <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2014-03-18cgroup: fix a failure path in create_css()Li Zefan1-4/+7
If online_css() fails, we should remove cgroup files belonging to css->ss. Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2014-03-18uprobes: allow ignoring of probe hitsDavid A. Long1-0/+9
Allow arches to decided to ignore a probe hit. ARM will use this to only call handlers if the conditions to execute a conditionally executed instruction are satisfied. Signed-off-by: David A. Long <[email protected]> Acked-by: Oleg Nesterov <[email protected]>
2014-03-18uprobes: Kconfig dependency fixDavid A. Long1-0/+1
Suggested change from Oleg Nesterov. Fixes incomplete dependencies for uprobes feature. Signed-off-by: David A. Long <[email protected]> Acked-by: Oleg Nesterov <[email protected]>
2014-03-16Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds3-3/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Three small fixes" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock: Prevent tracing recursion in sched_clock_cpu() stop_machine: Fix^2 race between stop_two_cpus() and stop_cpus() sched/deadline: Deny unprivileged users to set/change SCHED_DEADLINE policy
2014-03-14genirq: Add a new IRQCHIP_EOI_THREADED flagThomas Gleixner3-9/+42
The flag is necessary for interrupt chips which require an ACK/EOI after the handler has run. In case of threaded handlers this needs to happen after the threaded handler has completed before the unmask of the interrupt. The flag is only unseful in combination with the handle_fasteoi_irq flow control handler. It can be combined with the flag IRQCHIP_EOI_IF_HANDLED, so the EOI is not issued when the interrupt is disabled or in progress. Tested-by: Hans de Goede <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Maxime Ripard <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-03-13block: remove old blk_iopoll_enabled variableJens Axboe1-12/+0
This was a debugging measure to toggle enabled/disabled when testing. But for real production setups, it's not safe to toggle this setting without either reloading drivers of quiescing IO first. Neither of which the toggle enforces. Additionally, it makes drivers deal with the conditional state. Remove it completely. It's up to the driver whether iopoll is enabled or not. Signed-off-by: Jens Axboe <[email protected]>
2014-03-13sched: Remove needless round trip nsecs <-> tick conversion of steal timeFrederic Weisbecker2-16/+0
When update_rq_clock_task() accounts the pending steal time for a task, it converts the steal delta from nsecs to tick then from tick to nsecs. There is no apparent good reason for doing that though because both the task clock and the prev steal delta are u64 and store values in nsecs. So lets remove the needless conversion. Cc: Ingo Molnar <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Acked-by: Rik van Riel <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2014-03-13cputime: Fix jiffies based cputime assumption on steal accountingFrederic Weisbecker1-5/+11
The steal guest time accounting code assumes that cputime_t is based on jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t is based on nsecs, steal_account_process_tick() passes the delta in jiffies to account_steal_time() which then accounts it as if it's a value in nsecs. As a result, accounting 1 second of steal time (with HZ=100 that would be 100 jiffies) is spuriously accounted as 100 nsecs. As such /proc/stat may report 0 values of steal time even when two guests have run concurrently for a few seconds on the same host and same CPU. In order to fix this, lets convert the nsecs based steal delta to cputime instead of jiffies by using the right conversion API. Given that the steal time is stored in cputime_t and this type can have a smaller granularity than nsecs, we only account the rounded converted value and leave the remaining nsecs for the next deltas. Reported-by: Huiqingding <[email protected]> Reported-by: Marcelo Tosatti <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Acked-by: Rik van Riel <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2014-03-13Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULEMathieu Desnoyers3-3/+8
Users have reported being unable to trace non-signed modules loaded within a kernel supporting module signature. This is caused by tracepoint.c:tracepoint_module_coming() refusing to take into account tracepoints sitting within force-loaded modules (TAINT_FORCED_MODULE). The reason for this check, in the first place, is that a force-loaded module may have a struct module incompatible with the layout expected by the kernel, and can thus cause a kernel crash upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y. Tracepoints, however, specifically accept TAINT_OOT_MODULE and TAINT_CRAP, since those modules do not lead to the "very likely system crash" issue cited above for force-loaded modules. With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed module is tainted re-using the TAINT_FORCED_MODULE taint flag. Unfortunately, this means that Tracepoints treat that module as a force-loaded module, and thus silently refuse to consider any tracepoint within this module. Since an unsigned module does not fit within the "very likely system crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag to specifically address this taint behavior, and accept those modules within Tracepoints. We use the letter 'X' as a taint flag character for a module being loaded that doesn't know how to sign its name (proposed by Steven Rostedt). Also add the missing 'O' entry to trace event show_module_flags() list for the sake of completeness. Signed-off-by: Mathieu Desnoyers <[email protected]> Acked-by: Steven Rostedt <[email protected]> NAKed-by: Ingo Molnar <[email protected]> CC: Thomas Gleixner <[email protected]> CC: David Howells <[email protected]> CC: Greg Kroah-Hartman <[email protected]> Signed-off-by: Rusty Russell <[email protected]>
2014-03-13module: use pr_contJiri Slaby1-3/+3
When dumping loaded modules, we print them one by one in separate printks. Let's use pr_cont as they are continuation prints. Signed-off-by: Jiri Slaby <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Rusty Russell <[email protected]>
2014-03-12Merge branch 'irq/for-gpio' into irq/coreThomas Gleixner20-98/+168
Merge the request/release callbacks which are in a separate branch for consumption by the gpio folks. Signed-off-by: Thomas Gleixner <[email protected]>
2014-03-12genirq: Provide irq_request/release_resources chip callbacksThomas Gleixner1-1/+27
For certain irq types, e.g. gpios, it's necessary to request resources before starting up the irq. This might fail so we cannot use the irq_startup() callback because we might call the irq_set_type() callback before that which does not make sense when the resource is not available. Calling irq_startup() before irq_set_type() can lead to spurious interrupts which is not desired either. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jean-Jacques Hiblot <[email protected]> Cc: Grant Likely <[email protected]> Cc: [email protected] Reviewed-by: Linus Walleij <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-03-12locking/mutex: Fix debug checksPeter Zijlstra2-0/+13
OK, so commit: 1d8fe7dc8078 ("locking/mutexes: Unlock the mutex without the wait_lock") generates this boot warning when CONFIG_DEBUG_MUTEXES=y: WARNING: CPU: 0 PID: 139 at /usr/src/linux-2.6/kernel/locking/mutex-debug.c:82 debug_mutex_unlock+0x155/0x180() DEBUG_LOCKS_WARN_ON(lock->owner != current) And that makes sense, because as soon as we release the lock a new owner can come in... One would think that !__mutex_slowpath_needs_to_unlock() implementations suffer the same, but for DEBUG we fall back to mutex-null.h which has an unconditional 1 for that. The mutex debug code requires the mutex to be unlocked after doing the debug checks, otherwise it can find inconsistent state. Reported-by: Ingo Molnar <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>