aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2012-09-18workqueue: remove @delayed from cwq_dec_nr_in_flight()Lai Jiangshan1-13/+8
@delayed is now always false for all callers, remove it. tj: Updated description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-09-18workqueue: fix possible stall on try_to_grab_pending() of a delayed work itemLai Jiangshan1-3/+22
Currently, when try_to_grab_pending() grabs a delayed work item, it leaves its linked work items alone on the delayed_works. The linked work items are always NO_COLOR and will cause future cwq_activate_first_delayed() increase cwq->nr_active incorrectly, and may cause the whole cwq to stall. For example, state: cwq->max_active = 1, cwq->nr_active = 1 one work in cwq->pool, many in cwq->delayed_works. step1: try_to_grab_pending() removes a work item from delayed_works but leaves its NO_COLOR linked work items on it. step2: Later on, cwq_activate_first_delayed() activates the linked work item increasing ->nr_active. step3: cwq->nr_active = 1, but all activated work items of the cwq are NO_COLOR. When they finish, cwq->nr_active will not be decreased due to NO_COLOR, and no further work items will be activated from cwq->delayed_works. the cwq stalls. Fix it by ensuring the target work item is activated before stealing PENDING in try_to_grab_pending(). This ensures that all the linked work items are activated without incorrectly bumping cwq->nr_active. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: [email protected]
2012-09-18workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()Lai Jiangshan1-1/+1
workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so hotcpu_notifier() fits better than cpu_notifier(). When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same. When HOTPLUG_CPU=n, if we use cpu_notifier(), workqueue_cpu_down_callback() will be called during boot to do nothing, and the memory of workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discarded after boot. If we use hotcpu_notifier(), we can avoid the no-op call of workqueue_cpu_down_callback() and the memory of workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discard at build time: $ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier -rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier -rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier $ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier text data bss dec hex filename 18513 2387 1221 22121 5669 kernel/workqueue.o.cpu_notifier 18082 2355 1221 21658 549a kernel/workqueue.o.hotcpu_notifier tj: Updated description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-09-18workqueue: use __cpuinit instead of __devinit for cpu callbacksLai Jiangshan1-2/+2
For workqueue hotplug callbacks, it makes less sense to use __devinit which discards the memory after boot if !HOTPLUG. __cpuinit, which discards the memory after boot if !HOTPLUG_CPU fits better. tj: Updated description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-09-18workqueue: rename manager_mutex to assoc_mutexLai Jiangshan1-19/+19
Now that manager_mutex's role has changed from synchronizing manager role to excluding hotplug against manager, the name is misleading. As it is protecting the CPU-association of the gcwq now, rename it to assoc_mutex. This patch is pure rename and doesn't introduce any functional change. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-09-18workqueue: WORKER_REBIND is no longer necessary for idle rebindingLai Jiangshan1-26/+15
Now both worker destruction and idle rebinding remove the worker from idle list while it's still idle, so list_empty(&worker->entry) can be used to test whether either is pending and WORKER_DIE to distinguish between the two instead making WORKER_REBIND unnecessary. Use list_empty(&worker->entry) to determine whether destruction or rebinding is pending. This simplifies worker state transitions. WORKER_REBIND is not needed anymore. Remove it. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-09-18workqueue: WORKER_REBIND is no longer necessary for busy rebindingLai Jiangshan1-16/+2
Because the old unbind/rebinding implementation wasn't atomic w.r.t. GCWQ_DISASSOCIATED manipulation which is protected by global_cwq->lock, we had to use two flags, WORKER_UNBOUND and WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with back-to-back CPU hotplug operations; otherwise, completion of rebinding while another unbinding is in progress could clear UNBIND prematurely. Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED, there's no need to use two flags. Just one is enough. Don't use WORKER_REBIND for busy rebinding. tj: Updated description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-09-18workqueue: reimplement idle worker rebindingLai Jiangshan1-99/+42
Currently rebind_workers() uses rebinds idle workers synchronously before proceeding to requesting busy workers to rebind. This is necessary because all workers on @worker_pool->idle_list must be bound before concurrency management local wake-ups from the busy workers take place. Unfortunately, the synchronous idle rebinding is quite complicated. This patch reimplements idle rebinding to simplify the code path. Rather than trying to make all idle workers bound before rebinding busy workers, we simply remove all to-be-bound idle workers from the idle list and let them add themselves back after completing rebinding (successful or not). As only workers which finished rebinding can on on the idle worker list, the idle worker list is guaranteed to have only bound workers unless CPU went down again and local wake-ups are safe. After the change, @worker_pool->nr_idle may deviate than the actual number of idle workers on @worker_pool->idle_list. More specifically, nr_idle may be non-zero while ->idle_list is empty. All users of ->nr_idle and ->idle_list are audited. The only affected one is too_many_workers() which is updated to check %false if ->idle_list is empty regardless of ->nr_idle. After this patch, rebind_workers() no longer performs the nasty idle-rebind retries which require temporary release of gcwq->lock, and both unbinding and rebinding are atomic w.r.t. global_cwq->lock. worker->idle_rebind and global_cwq->rebind_hold are now unnecessary and removed along with the definition of struct idle_rebind. Changed from V1: 1) remove unlikely from too_many_workers(), ->idle_list can be empty anytime, even before this patch, no reason to use unlikely. 2) fix a small rebasing mistake. (which is from rebasing the orignal fixing patch to for-next) 3) add a lot of comments. 4) clear WORKER_REBIND unconditionaly in idle_worker_rebind() tj: Updated comments and description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-09-18userns: Add kprojid_t and associated infrastructure in projid.hEric W. Biederman2-1/+135
Implement kprojid_t a cousin of the kuid_t and kgid_t. The per user namespace mapping of project id values can be set with /proc/<pid>/projid_map. A full compliment of helpers is provided: make_kprojid, from_kprojid, from_kprojid_munged, kporjid_has_mapping, projid_valid, projid_eq, projid_eq, projid_lt. Project identifiers are part of the generic disk quota interface, although it appears only xfs implements project identifiers currently. The xfs code allows anyone who has permission to set the project identifier on a file to use any project identifier so when setting up the user namespace project identifier mappings I do not require a capability. Cc: Dave Chinner <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-18userns: Teach trace to use from_kuidEric W. Biederman2-2/+3
- When tracing capture the kuid. - When displaying the data to user space convert the kuid into the user namespace of the process that opened the report file. Cc: Steven Rostedt <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2012-09-18userns: Convert bsd process accounting to use kuid and kgid where appropriateEric W. Biederman1-2/+2
BSD process accounting conveniently passes the file the accounting records will be written into to do_acct_process. The file credentials captured the user namespace of the opener of the file. Use the file credentials to format the uid and the gid of the current process into the user namespace of the user that started the bsd process accounting. Cc: Pavel Emelyanov <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2012-09-18userns: Convert taskstats to handle the user and pid namespaces.Eric W. Biederman2-11/+24
- Explicitly limit exit task stat broadcast to the initial user and pid namespaces, as it is already limited to the initial network namespace. - For broadcast task stats explicitly generate all of the idenitiers in terms of the initial user namespace and the initial pid namespace. - For request stats report them in terms of the current user namespace and the current pid namespace. Netlink messages are delivered syncrhonously to the kernel allowing us to get the user namespace and the pid namespace from the current task. - Pass the namespaces for representing pids and uids and gids into bacct_add_task. Cc: Balbir Singh <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2012-09-18userns: Convert audit to work with user namespaces enabledEric W. Biederman3-23/+34
- Explicitly format uids gids in audit messges in the initial user namespace. This is safe because auditd is restrected to be in the initial user namespace. - Convert audit_sig_uid into a kuid_t. - Enable building the audit code and user namespaces at the same time. The net result is that the audit subsystem now uses kuid_t and kgid_t whenever possible making it almost impossible to confuse a raw uid_t with a kuid_t preventing bugs. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2012-09-17userns: Convert the audit loginuid to be a kuidEric W. Biederman4-23/+26
Always store audit loginuids in type kuid_t. Print loginuids by converting them into uids in the appropriate user namespace, and then printing the resulting uid. Modify audit_get_loginuid to return a kuid_t. Modify audit_set_loginuid to take a kuid_t. Modify /proc/<pid>/loginuid on read to convert the loginuid into the user namespace of the opener of the file. Modify /proc/<pid>/loginud on write to convert the loginuid rom the user namespace of the opener of the file. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Cc: Paul Moore <[email protected]> ? Cc: David Miller <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2012-09-17audit: Add typespecific uid and gid comparatorsEric W. Biederman3-89/+182
The audit filter code guarantees that uid are always compared with uids and gids are always compared with gids, as the comparason operations are type specific. Take advantage of this proper to define audit_uid_comparator and audit_gid_comparator which use the type safe comparasons from uidgid.h. Build on audit_uid_comparator and audit_gid_comparator and replace audit_compare_id with audit_compare_uid and audit_compare_gid. This is one of those odd cases where being type safe and duplicating code leads to simpler shorter and more concise code. Don't allow bitmask operations in uid and gid comparisons in audit_data_to_entry. Bitmask operations are already denined in audit_rule_to_entry. Convert constants in audit_rule_to_entry and audit_data_to_entry into kuids and kgids when appropriate. Convert the uid and gid field in struct audit_names to be of type kuid_t and kgid_t respectively, so that the new uid and gid comparators can be applied in a type safe manner. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-17audit: Don't pass pid or uid to audit_log_common_recv_msgEric W. Biederman1-15/+14
The only place we use the uid and the pid that we calculate in audit_receive_msg is in audit_log_common_recv_msg so move the calculation of these values into the audit_log_common_recv_msg. Simplify the calcuation of the current pid and uid by reading them from current instead of reading them from NETLINK_CREDS. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-17audit: Remove the unused uid parameter from audit_receive_filterEric W. Biederman2-3/+3
Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-17audit: Properly set the origin port id of audit messages.Eric W. Biederman1-1/+1
For user generated audit messages set the portid field in the netlink header to the netlink port where the user generated audit message came from. Reporting the process id in a port id field was just nonsense. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-17audit: Simply AUDIT_TTY_SET and AUDIT_TTY_GETEric W. Biederman1-25/+13
Use current instead of looking up the current up the current task by process identifier. Netlink requests are processed in trhe context of the sending task so this is safe. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-17audit: kill audit_prepare_user_ttyEric W. Biederman1-19/+1
Now that netlink messages are processed in the context of the sender tty_audit_push_task can be called directly and audit_prepare_user_tty which only added looking up the task of the tty by process id is not needed. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-17audit: Use current instead of NETLINK_CREDS() in audit_filterEric W. Biederman2-8/+7
Get caller process uid and gid and pid values from the current task instead of the NETLINK_CB. This is simpler than passing NETLINK_CREDS from from audit_receive_msg to audit_filter_user_rules and avoid the chance of being hit by the occassional bugs in netlink uid/gid credential passing. This is a safe changes because all netlink requests are processed in the task of the sending process. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-17audit: Limit audit requests to processes in the initial pid and user namespaces.Eric W. Biederman1-0/+6
This allows the code to safely make the assumption that all of the uids gids and pids that need to be send in audit messages are in the initial namespaces. If someone cares we may lift this restriction someday but start with limiting access so at least the code is always correct. Cc: Al Viro <[email protected]> Cc: Eric Paris <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-09-17Merge branch 'for-3.6-fixes' of ↵Tejun Heo1-23/+99
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.7 This merge is necessary as Lai's CPU hotplug restructuring series depends on the CPU hotplug bug fixes in for-3.6-fixes. The merge creates one trivial conflict between the following two commits. 96e65306b8 "workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic" e2b6a6d570 "workqueue: use system_highpri_wq for highpri workers in rebind_workers()" Both add local variable definitions to the same block and can be merged in any order. Signed-off-by: Tejun Heo <[email protected]>
2012-09-17Merge branch 'for-3.6-fixes' of ↵Linus Torvalds1-2/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull another workqueue fix from Tejun Heo: "Unfortunately, yet another late fix. This too is discovered and fixed by Lai. This bug was introduced during this merge window by commit 25511a477657 ("workqueue: reimplement CPU online rebinding to handle idle workers") which started using WORKER_REBIND flag for idle rebind too. The bug is relatively easy to trigger if the CPU rapidly goes through off, on and then off (and stay off). The fix is on the safer side. This hasn't been on linux-next yet but I'm pushing early so that it can get more exposure before v3.6 release." * 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
2012-09-17workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()Lai Jiangshan1-2/+10
busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed (CPU is down again). This used to be okay because the flag wasn't used for anything else. However, after 25511a477 "workqueue: reimplement CPU online rebinding to handle idle workers", WORKER_REBIND is also used to command idle workers to rebind. If not cleared, the worker may confuse the next CPU_UP cycle by having REBIND spuriously set or oops / get stuck by prematurely calling idle_worker_rebind(). WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x5 00() Hardware name: Bochs Modules linked in: test_wq(O-) Pid: 33, comm: kworker/1:1 Tainted: G O 3.6.0-rc1-work+ #3 Call Trace: [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500 [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 ---[ end trace e977cf20f4661968 ]--- BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: test_wq(O-) CPU 0 Pid: 33, comm: kworker/1:1 Tainted: G W O 3.6.0-rc1-work+ #3 Bochs Bochs RIP: 0010:[<ffffffff810b3db0>] [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP: 0018:ffff88001e1c9de0 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009 RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580 R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900) Stack: ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010 ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340 Call Trace: [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b RIP [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP <ffff88001e1c9de0> CR2: 0000000000000000 There was no reason to keep WORKER_REBIND on failure in the first place - WORKER_UNBOUND is guaranteed to be set in such cases preventing incorrectly activating concurrency management. Always clear WORKER_REBIND. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-09-17pid-namespace: limit value of ns_last_pid to (0, max_pid)Andrew Vagin1-1/+5
The kernel doesn't check the pid for negative values, so if you try to write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic. The crash happens because the next pid is -1, and alloc_pidmap() will try to access to a nonexistent pidmap. map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; Signed-off-by: Andrew Vagin <[email protected]> Acked-by: Cyrill Gorcunov <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-09-17Merge branch 'pm-qos'Rafael J. Wysocki1-0/+1
* pm-qos: PM / QoS: Add return code to pm_qos_get_value function.
2012-09-17Merge branch 'pm-sleep'Rafael J. Wysocki2-2/+2
* pm-sleep: properly __init-annotate pm_sysrq_init() PM / wakeup: Use irqsave/irqrestore for events_lock PM / Freezer: Fix small typo "regrigerator" PM / Sleep: Print name of wakeup source that aborts suspend
2012-09-17Merge branch 'pm-timers'Rafael J. Wysocki3-0/+30
* pm-timers: PM: Do not use the syscore flag for runtime PM sh: MTU2: Basic runtime PM support sh: CMT: Basic runtime PM support sh: TMU: Basic runtime PM support PM / Domains: Do not measure start time for "irq safe" devices PM / Domains: Move syscore flag from subsys data to struct device PM / Domains: Rename the always_on device flag to syscore PM / Runtime: Allow helpers to be called by early platform drivers PM: Reorganize device PM initialization sh: MTU2: Introduce clock events suspend/resume routines sh: CMT: Introduce clocksource/clock events suspend/resume routines sh: TMU: Introduce clocksource/clock events suspend/resume routines timekeeping: Add suspend and resume of clock event devices PM / Domains: Add power off/on function for system core suspend stage PM / Domains: Introduce simplified power on routine for system resume
2012-09-17arm64: Add support for /proc/sys/debug/exception-traceCatalin Marinas1-1/+1
This patch allows setting of the show_unhandled_signals variable via /proc/sys/debug/exception-trace. The default value is currently 1 showing unhandled user faults (undefined instructions, data aborts) and invalid signal stack frames. Signed-off-by: Catalin Marinas <[email protected]> Acked-by: Tony Lindgren <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Nicolas Pitre <[email protected]> Acked-by: Olof Johansson <[email protected]> Acked-by: Santosh Shilimkar <[email protected]>
2012-09-16Revert "sched: Improve scalability via 'CPU buddies', which withstand random ↵Linus Torvalds2-45/+22
perturbations" This reverts commit 970e178985cadbca660feb02f4d2ee3a09f7fdda. Nikolay Ulyanitsky reported thatthe 3.6-rc5 kernel has a 15-20% performance drop on PostgreSQL 9.2 on his machine (running "pgbench"). Borislav Petkov was able to reproduce this, and bisected it to this commit 970e178985ca ("sched: Improve scalability via 'CPU buddies' ...") apparently because the new single-idle-buddy model simply doesn't find idle CPU's to reschedule on aggressively enough. Mike Galbraith suspects that it is likely due to the user-mode spinlocks in PostgreSQL not reacting well to preemption, but we don't really know the details - I'll just revert the commit for now. There are hopefully other approaches to improve scheduler scalability without it causing these kinds of downsides. Reported-by: Nikolay Ulyanitsky <[email protected]> Bisected-by: Borislav Petkov <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-09-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller11-89/+189
Conflicts: net/netfilter/nfnetlink_log.c net/netfilter/xt_LOG.c Rather easy conflict resolution, the 'net' tree had bug fixes to make sure we checked if a socket is a time-wait one or not and elide the logging code if so. Whereas on the 'net-next' side we are calculating the UID and GID from the creds using different interfaces due to the user namespace changes from Eric Biederman. Signed-off-by: David S. Miller <[email protected]>
2012-09-15uprobes: Introduce arch_uprobe_enable/disable_step()Sebastian Andrzej Siewior1-2/+12
As Oleg pointed out in [0] uprobe should not use the ptrace interface for enabling/disabling single stepping. [0] http://lkml.kernel.org/r/[email protected] Add the new "__weak arch" helpers which simply call user_*_single_step() as a preparation. This is only needed to not break the powerpc port, we will fold this logic into arch_uprobe_pre/post_xol() hooks later. We should also change handle_singlestep(), _disable_step(&uprobe->arch) should be called before put_uprobe(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]>
2012-09-15uprobes: Teach find_active_uprobe() to clear MMF_HAS_UPROBESOleg Nesterov1-0/+22
The wrong MMF_HAS_UPROBES doesn't really hurt, just it triggers the "slow" and unnecessary handle_swbp() path if the task hits the non-uprobe breakpoint. So this patch changes find_active_uprobe() to check every valid vma and clear MMF_HAS_UPROBES if no uprobes were found. This is adds the slow O(n) path, but it is only called in unlikely case when the task hits the normal breakpoint first time after uprobe_unregister(). Note the "not strictly accurate" comment in mmf_recalc_uprobes(). We can fix this, we only need to teach vma_has_uprobes() to return a bit more more info, but I am not sure this worth the trouble. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]>
2012-09-15uprobes: Introduce MMF_RECALC_UPROBESOleg Nesterov1-4/+35
Add the new MMF_RECALC_UPROBES flag, it means that MMF_HAS_UPROBES can be false positive after remove_breakpoint() or uprobe_munmap(). It is also set by uprobe_dup_mmap(), this is not optimal but simple. We could add the new hook, uprobe_dup_vma(), to set MMF_HAS_UPROBES only if the new mm actually has uprobes, but I don't think this makes sense. The next patch will use this flag to clear MMF_HAS_UPROBES. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]>
2012-09-15uprobes: uprobes_treelock should not disable irqsOleg Nesterov1-13/+8
Nobody plays with uprobes_tree/uprobes_treelock in interrupt context, no need to disable irqs. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]>
2012-09-15uprobes: Don't put NULL pointer in uprobe_register()Sebastian Andrzej Siewior1-1/+2
alloc_uprobe() might return a NULL pointer, put_uprobe() can't deal with this. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]>
2012-09-14Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds5-29/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Smaller fixlets" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix kernel-doc warnings in kernel/sched/fair.c sched: Unthrottle rt runqueues in __disable_runtime() sched: Add missing call to calc_load_exit_idle() sched: Fix load avg vs cpu-hotplug
2012-09-14Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2-31/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This tree includes various fixes" Ingo really needs to improve on the whole "explain git pull" part. "Various fixes" indeed. * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled perf/x86: Enable Intel Cedarview Atom suppport perf_event: Switch to internal refcount, fix race with close() oprofile, s390: Fix uninitialized memory access when writing to oprofilefs perf/x86: Fix microcode revision check for SNB-PEBS
2012-09-14cgroup: mark subsystems with broken hierarchy support and whine if cgroups ↵Tejun Heo3-1/+26
are nested for them Currently, cgroup hierarchy support is a mess. cpu related subsystems behave correctly - configuration, accounting and control on a parent properly cover its children. blkio and freezer completely ignore hierarchy and treat all cgroups as if they're directly under the root cgroup. Others show yet different behaviors. These differing interpretations of cgroup hierarchy make using cgroup confusing and it impossible to co-mount controllers into the same hierarchy and obtain sane behavior. Eventually, we want full hierarchy support from all subsystems and probably a unified hierarchy. Users using separate hierarchies expecting completely different behaviors depending on the mounted subsystem is deterimental to making any progress on this front. This patch adds cgroup_subsys.broken_hierarchy and sets it to %true for controllers which are lacking in hierarchy support. The goal of this patch is two-fold. * Move users away from using hierarchy on currently non-hierarchical subsystems, so that implementing proper hierarchy support on those doesn't surprise them. * Keep track of which controllers are broken how and nudge the subsystems to implement proper hierarchy support. For now, start with a single warning message. We can whine louder later on. v2: Fixed a typo spotted by Michal. Warning message updated. v3: Updated memcg part so that it doesn't generate warning in the cases where .use_hierarchy=false doesn't make the behavior different from root.use_hierarchy=true. Fixed a typo spotted by Glauber. v4: Check ->broken_hierarchy after cgroup creation is complete so that ->create() can affect the result per Michal. Dropped unnecessary memcg root handling per Michal. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Serge E. Hallyn <[email protected]> Cc: Glauber Costa <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Thomas Graf <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Neil Horman <[email protected]> Cc: Aneesh Kumar K.V <[email protected]>
2012-09-14cgroup: Assign subsystem IDs during compile timeDaniel Wagner1-19/+3
WARNING: With this change it is impossible to load external built controllers anymore. In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is set, corresponding subsys_id should also be a constant. Up to now, net_prio_subsys_id and net_cls_subsys_id would be of the type int and the value would be assigned during runtime. By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN to IS_ENABLED, all *_subsys_id will have constant value. That means we need to remove all the code which assumes a value can be assigned to net_prio_subsys_id and net_cls_subsys_id. A close look is necessary on the RCU part which was introduces by following patch: commit f845172531fb7410c7fb7780b1a6e51ee6df7d52 Author: Herbert Xu <[email protected]> Mon May 24 09:12:34 2010 Committer: David S. Miller <[email protected]> Mon May 24 09:12:34 2010 cls_cgroup: Store classid in struct sock Tis code was added to init_cgroup_cls() /* We can't use rcu_assign_pointer because this is an int. */ smp_wmb(); net_cls_subsys_id = net_cls_subsys.subsys_id; respectively to exit_cgroup_cls() net_cls_subsys_id = -1; synchronize_rcu(); and in module version of task_cls_classid() rcu_read_lock(); id = rcu_dereference(net_cls_subsys_id); if (id >= 0) classid = container_of(task_subsys_state(p, id), struct cgroup_cls_state, css)->classid; rcu_read_unlock(); Without an explicit explaination why the RCU part is needed. (The rcu_deference was fixed by exchanging it to rcu_derefence_index_check() in a later commit, but that is a minor detail.) So here is my pondering why it was introduced and why it safe to remove it now. Note that this code was copied over to net_prio the reasoning holds for that subsystem too. The idea behind the RCU use for net_cls_subsys_id is to make sure we get a valid pointer back from task_subsys_state(). task_subsys_state() is just blindly accessing the subsys array and returning the pointer. Obviously, passing in -1 as id into task_subsys_state() returns an invalid value (out of lower bound). So this code makes sure that only after module is loaded and the subsystem registered, the id is assigned. Before unregistering the module all old readers must have left the critical section. This is done by assigning -1 to the id and issuing a synchronized_rcu(). Any new readers wont call task_subsys_state() anymore and therefore it is safe to unregister the subsystem. The new code relies on the same trick, but it looks at the subsys pointer return by task_subsys_state() (remember the id is constant and therefore we allways have a valid index into the subsys array). No precautions need to be taken during module loading module. Eventually, all CPUs will get a valid pointer back from task_subsys_state() because rebind_subsystem() which is called after the module init() function will assigned subsys[net_cls_subsys_id] the newly loaded module subsystem pointer. When the subsystem is about to be removed, rebind_subsystem() will called before the module exit() function. In this case, rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer and then it calls synchronize_rcu(). All old readers have left by then the critical section. Any new reader wont access the subsystem anymore. At this point we are safe to unregister the subsystem. No synchronize_rcu() call is needed. Signed-off-by: Daniel Wagner <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Neil Horman <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Gao feng <[email protected]> Cc: Glauber Costa <[email protected]> Cc: Herbert Xu <[email protected]> Cc: Jamal Hadi Salim <[email protected]> Cc: John Fastabend <[email protected]> Cc: Kamezawa Hiroyuki <[email protected]> Cc: [email protected] Cc: [email protected]
2012-09-14cgroup: Do not depend on a given order when populating the subsys arrayDaniel Wagner1-1/+1
The *_subsys_id will be used as index to access the subsys. Therefore we need to care we populate the subsystem at the correct position by using designated initialization. With this change we are able to interleave builtin and modules in the subsys array. Signed-off-by: Daniel Wagner <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Neil Horman <[email protected]> Cc: Gao feng <[email protected]> Cc: Jamal Hadi Salim <[email protected]> Cc: John Fastabend <[email protected]> Cc: [email protected] Cc: [email protected]
2012-09-14cgroup: Wrap subsystem selection macroDaniel Wagner1-0/+1
Before we are able to define all subsystem ids at compile time we need a more fine grained control what gets defined when we include cgroup_subsys.h. For example we define the enums for the subsystems or to declare for struct cgroup_subsys (builtin subsystem) by including cgroup_subsys.h and defining SUBSYS accordingly. Currently, the decision if a subsys is used is defined inside the header by testing if CONFIG_*=y is true. By moving this test outside of cgroup_subsys.h we are able to control it on the include level. This is done by introducing IS_SUBSYS_ENABLED which then is defined according the task, e.g. is CONFIG_*=y or CONFIG_*=m. Signed-off-by: Daniel Wagner <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Neil Horman <[email protected]> Cc: Gao feng <[email protected]> Cc: Jamal Hadi Salim <[email protected]> Cc: John Fastabend <[email protected]> Cc: [email protected] Cc: [email protected]
2012-09-14cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNTDaniel Wagner1-28/+40
CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when looping over the subsys array looking either at the builtin or the module subsystems. Since all the builtin subsystems have an id which is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids are sorted. We are about to change id assignment to happen only at compile time later in this series. That means we can't rely on the above trick since all ids will always be defined at compile time. Furthermore, ordering the builtin subsystems and the module subsystems is not really necessary. So we need a different way to know which subsystem is a builtin or a module one. We can use the subsys[]->module pointer for this. Any place where we need to know if a subsys is module we just check for the pointer. If it is NULL then the subsystem is a builtin one. With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT enum. Though we need to introduce a temporary placeholder so that we don't get a compilation error when only CONFIG_CGROUP is selected and no single controller. An empty enum definition is not valid. Later in this series we are able to remove the placeholder again. And with this change we get a fix for this: kernel/cgroup.c: In function ‘cgroup_load_subsys’: kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds] when CONFIG_CGROUP=y and no built in controller was enabled. Signed-off-by: Daniel Wagner <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Acked-by: Neil Horman <[email protected]> Cc: Gao feng <[email protected]> Cc: Jamal Hadi Salim <[email protected]> Cc: John Fastabend <[email protected]> Cc: [email protected] Cc: [email protected]
2012-09-14Merge branch 'tip/perf/core' of ↵Ingo Molnar5-10/+10
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core Pull tracing updates from Steve Rostedt. Signed-off-by: Ingo Molnar <[email protected]>
2012-09-13kprobes/x86: Fix to support jprobes on ftrace-based kprobeMasami Hiramatsu1-3/+0
Fix kprobes/x86 to support jprobes on ftrace-based kprobes. Because of -mfentry support of ftrace, ftrace is now put on the beginning of function where jprobes are put. Originally ftrace-based kprobes doesn't support jprobe because it will change regs->ip and ftrace doesn't support changing IP and ftrace itself doesn't conflict jprobe. However, ftrace -mfentry support moves mcount call on the top of functions where jprobes are put. This means that jprobe always conflicts with ftrace-based kprobe and fails. This patch allows ftrace-based kprobes to support jprobes by allowing to modify regs->ip and kprobes breakpoint handler also allows to skip singlestepping because there is a ftrace call (not an original instruction). Link: http://lkml.kernel.org/r/[email protected] Reported-by: Fengguang Wu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2012-09-13trace: Stop compiling in trace_clock unconditionallyJosh Triplett3-6/+7
Commit 56449f437 "tracing: make the trace clocks available generally", in April 2009, made trace_clock available unconditionally, since CONFIG_X86_DS used it too. Commit faa4602e47 "x86, perf, bts, mm: Delete the never used BTS-ptrace code", in March 2010, removed CONFIG_X86_DS, and now only CONFIG_RING_BUFFER (split out from CONFIG_TRACING for general use) has a dependency on trace_clock. So, only compile in trace_clock with CONFIG_RING_BUFFER or CONFIG_TRACING enabled. Link: http://lkml.kernel.org/r/20120903024513.GA19583@leaf Cc: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Josh Triplett <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2012-09-13tracing: Skip printing "OK" if failed to disable eventYuanhan Liu1-1/+3
No acutal case found. But logically, we should skip "OK" in case any error met. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yuanhan Liu <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2012-09-13locking: Adjust spin lock inlining Kconfig optionsJan Beulich1-40/+63
Break out the DEBUG_SPINLOCK dependency (requires moving up UNINLINE_SPIN_UNLOCK, as this was the only one in that block not depending on that option). Avoid putting values not selected into the resulting .config - they are not useful for anything, make the output less legible, and just consume space: Use "depends on" rather than directly setting the default from the combined dependency values. Signed-off-by: Jan Beulich <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-09-13time: Fix timeekeping_get_ns overflow on 32bit systemsJohn Stultz1-7/+12
Daniel Lezcano reported seeing multi-second stalls from keyboard input on his T61 laptop when NOHZ and CPU_IDLE were enabled on a 32bit kernel. He bisected the problem down to commit 1e75fa8be9fb6 ("time: Condense timekeeper.xtime into xtime_sec"). After reproducing this issue, I narrowed the problem down to the fact that timekeeping_get_ns() returns a 64bit nsec value that hasn't been accumulated. In some cases this value was being then stored in timespec.tv_nsec (which is a long). On 32bit systems, with idle times larger then 4 seconds (or less, depending on the value of xtime_nsec), the returned nsec value would overflow 32bits. This limited kept time from increasing, causing timers to not expire. The fix is to make sure we don't directly store the result of timekeeping_get_ns() into a tv_nsec field, instead using a 64bit nsec value which can then be added into the timespec via timespec_add_ns(). Reported-and-bisected-by: Daniel Lezcano <[email protected]> Tested-by: Daniel Lezcano <[email protected]> Signed-off-by: John Stultz <[email protected]> Acked-by: Prarit Bhargava <[email protected]> Cc: Richard Cochran <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>