aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2015-10-16genirq/msi: Do not use pci_msi_[un]mask_irq as default methodsMarc Zyngier1-5/+1
When we create a generic MSI domain, that MSI_FLAG_USE_DEF_CHIP_OPS is set, and that any of .mask or .unmask are NULL in the irq_chip structure, we set them to pci_msi_[un]mask_irq. This is a bad idea for at least two reasons: - PCI_MSI might not be selected, kernel fails to build (yes, this is legitimate, at least on arm64!) - This may not be a PCI/MSI domain at all (platform MSI, for example) Either way, this looks wrong. Move the overriding of mask/unmask to the PCI counterpart, and panic is any of these two methods is not set in the core code (they really should be present). Signed-off-by: Marc Zyngier <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Bjorn Helgaas <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-16bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_putTom Herbert1-0/+1
Currently, is only called from __prog_put_rcu in the bpf_prog_release path. Need this to call this from bpf_prog_put also to get correct accounting. Fixes: aaac3ba95e4c8b49 ("bpf: charge user for creation of BPF maps and programs") Signed-off-by: Tom Herbert <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-15cgroup: drop cgroup__DEVEL__legacy_files_on_dflTejun Heo1-28/+2
Now that interfaces for the major three controllers - cpu, memory, io - are shaping up, there's no reason to have an option to force legacy files to show up on the unified hierarchy for testing. Drop it. Signed-off-by: Tejun Heo <[email protected]> Cc: Li Zefan <[email protected]> Cc: Johannes Weiner <[email protected]>
2015-10-15cgroup: replace error handling in cgroup_init() with WARN_ON()sTejun Heo1-11/+4
The init sequence shouldn't fail short of bugs and even when it does it's better to continue with the rest of initialization and we were silently ignoring /proc/cgroups creation failure. Drop the explicit error handling and wrap sysfs_create_mount_point(), register_filesystem() and proc_create() with WARN_ON()s. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Johannes Weiner <[email protected]>
2015-10-15cgroup: add cgroup_subsys->free() method and use it to fix pids controllerTejun Heo2-2/+9
pids controller is completely broken in that it uncharges when a task exits allowing zombies to escape resource control. With the recent updates, cgroup core now maintains cgroup association till task free and pids controller can be fixed by uncharging on free instead of exit. This patch adds cgroup_subsys->free() method and update pids controller to use it instead of ->exit() for uncharging. Signed-off-by: Tejun Heo <[email protected]> Cc: Aleksa Sarai <[email protected]>
2015-10-15cgroup: keep zombies associated with their original cgroupsTejun Heo5-53/+37
cgroup_exit() is called when a task exits and disassociates the exiting task from its cgroups and half-attach it to the root cgroup. This is unnecessary and undesirable. No controller actually needs an exiting task to be disassociated with non-root cgroups. Both cpu and perf_event controllers update the association to the root cgroup from their exit callbacks just to keep consistent with the cgroup core behavior. Also, this disassociation makes it difficult to track resources held by zombies or determine where the zombies came from. Currently, pids controller is completely broken as it uncharges on exit and zombies always escape the resource restriction. With cgroup association being reset on exit, fixing it is pretty painful. There's no reason to reset cgroup membership on exit. The zombie can be removed from its css_set so that it doesn't show up on "cgroup.procs" and thus can't be migrated or interfere with cgroup removal. It can still pin and point to the css_set so that its cgroup membership is maintained. This patch makes cgroup core keep zombies associated with their cgroups at the time of exit. * Previous patches decoupled populated_cnt tracking from css_set lifetime, so a dying task can be simply unlinked from its css_set while pinning and pointing to the css_set. This keeps css_set association from task side alive while hiding it from "cgroup.procs" and populated_cnt tracking. The css_set reference is dropped when the task_struct is freed. * ->exit() callback no longer needs the css arguments as the associated css never changes once PF_EXITING is set. Removed. * cpu and perf_events controllers no longer need ->exit() callbacks. There's no reason to explicitly switch away on exit. The final schedule out is enough. The callbacks are removed. * On traditional hierarchies, nothing changes. "/proc/PID/cgroup" still reports "/" for all zombies. On the default hierarchy, "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged to at the time of exit. If the cgroup gets removed before the task is reaped, " (deleted)" is appended. v2: Build brekage due to missing dummy cgroup_free() when !CONFIG_CGROUP fixed. Signed-off-by: Tejun Heo <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]>
2015-10-15cgroup: make css_set_rwsem a spinlock and rename it to css_set_lockTejun Heo1-72/+72
css_set_rwsem is the inner lock protecting css_sets and is accessed from hot paths such as fork and exit. Internally, it has no reason to be a rwsem or even mutex. There are no internal blocking operations while holding it. This was rwsem because css task iteration used to expose it to external iterator users. As the previous patch updated css task iteration such that the locking is not leaked to its users, there's no reason to keep it a rwsem. This patch converts css_set_rwsem to a spinlock and rename it to css_set_lock. It uses bh-safe operations as a planned usage needs to access it from RCU callback context. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: don't hold css_set_rwsem across css task iterationTejun Heo1-14/+73
css_sets are synchronized through css_set_rwsem but the locking scheme is kinda bizarre. The hot paths - fork and exit - have to write lock the rwsem making the rw part pointless; furthermore, many readers already hold cgroup_mutex. One of the readers is css task iteration. It read locks the rwsem over the entire duration of iteration. This leads to silly locking behavior. When cpuset tries to migrate processes of a cgroup to a different NUMA node, css_set_rwsem is held across the entire migration attempt which can take a long time locking out forking, exiting and other cgroup operations. This patch updates css task iteration so that it locks css_set_rwsem only while the iterator is being advanced. css task iteration involves two levels - css_set and task iteration. As css_sets in use are practically immutable, simply pinning the current one is enough for resuming iteration afterwards. Task iteration is tricky as tasks may leave their css_set while iteration is in progress. This is solved by keeping track of active iterators and advancing them if their next task leaves its css_set. v2: put_task_struct() in css_task_iter_next() moved outside css_set_rwsem. A later patch will add cgroup operations to task_struct free path which may grab the same lock and this avoids deadlock possibilities. css_set_move_task() updated to use list_for_each_entry_safe() when walking task_iters and advancing them. This is necessary as advancing an iter may remove it from the list. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: reorganize css_task_iter functionsTejun Heo1-21/+28
* Rename css_advance_task_iter() to css_task_iter_advance_css_set() and make it clear it->task_pos too at the end of the iteration. * Factor out css_task_iter_advance() from css_task_iter_next(). The new function whines if called on a terminated iterator. Except for the termination check, this is pure reorganization and doesn't introduce any behavior changes. This will help the planned locking update for css_task_iter. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: factor out css_set_move_task()Tejun Heo1-48/+56
A task is associated and disassociated with its css_set in three places - during migration, after a new task is created and when a task exits. The first is handled by cgroup_task_migrate() and the latter two are open-coded. These are similar operations and spreading them over multiple places makes it harder to follow and update. This patch collects all task css_set [dis]association operations into css_set_move_task(). While css_set_move_task() may check whether populated state needs to be updated when not strictly necessary, the behavior is essentially equivalent before and after this patch. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: keep css_set and task lists in chronological orderTejun Heo1-6/+5
css task iteration will be updated to not leak cgroup internal locking to iterator users. In preparation, update css_set and task lists to be in chronological order. For tasks, as migration path is already using list_splice_tail_init(), only cgroup_enable_task_cg_lists() and cgroup_post_fork() need updating. For css_sets, link_css_set() is the only place which needs to be updated. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: make cgroup_destroy_locked() test cgroup_is_populated()Tejun Heo1-6/+5
cgroup_destroy_locked() currently tests whether any css_sets are associated to reject removal if the cgroup contains tasks. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch updates cgroup_destroy_locked() so that it tests cgroup_is_populated(), which counts the number of populated css_sets, instead of whether cgrp->cset_links is empty to determine whether the cgroup is populated or not. This ensures that rmdirs won't be incorrectly rejected for cgroups which only contain zombie tasks. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: make css_sets pin the associated cgroupsTejun Heo1-4/+6
Currently, css_sets don't pin the associated cgroups. This is okay as a cgroup with css_sets associated are not allowed to be removed; however, to help resource tracking for zombie tasks, this is scheduled to change such that a cgroup can be removed even when it has css_sets associated as long as none of them are populated. To ensure that a cgroup doesn't go away while css_sets are still associated with it, make each associated css_set hold a reference on the cgroup if non-root. v2: Root cgroups are special and shouldn't be ref'd by css_sets. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: relocate cgroup_[try]get/put()Tejun Heo1-16/+16
Relocate cgroup_get(), cgroup_tryget() and cgroup_put() upwards. This is pure code reorganization to prepare for future changes. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: move check_for_release() invocationTejun Heo1-7/+1
To trigger release agent when the last task leaves the cgroup, check_for_release() is called from put_css_set_locked(); however, css_set being unlinked is being decoupled from task leaving the cgroup and the correct condition to test is cgroup->nr_populated dropping to zero which check_for_release() is already updated to test. This patch moves check_for_release() invocation from put_css_set_locked() to cgroup_update_populated(). Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: replace cgroup_has_tasks() with cgroup_is_populated()Tejun Heo2-4/+4
Currently, cgroup_has_tasks() tests whether the target cgroup has any css_set linked to it. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch replaces cgroup_has_tasks() with cgroup_is_populated() which tests cgroup->nr_populated instead which locally counts the number of populated css_sets. Unlike cgroup_has_tasks(), cgroup_is_populated() is recursive - if any of the descendants is populated, the cgroup is populated too. While this changes the meaning of the test, all the existing users are okay with the change. While at it, replace the open-coded ->populated_cnt test in cgroup_events_show() with cgroup_is_populated(). Signed-off-by: Tejun Heo <[email protected]> Cc: Li Zefan <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]>
2015-10-15cgroup: make cgroup->nr_populated count the number of populated css_setsTejun Heo1-13/+52
Currently, cgroup->nr_populated counts whether the cgroup has any css_sets linked to it and the number of children which has non-zero ->nr_populated. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch updates cgroup->nr_populated so that for the cgroup itself it counts the number of css_sets which have tasks associated with them so that empty css_sets don't skew the populated test. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15cgroup: remove an unused parameter from cgroup_task_migrate()Tejun Heo1-5/+2
cgroup_task_migrate() no longer uses @old_cgrp. Remove it. Signed-off-by: Tejun Heo <[email protected]>
2015-10-15Merge branch 'for-4.3-fixes' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixlet from Tejun Heo: "Single patch to make delayed work always be queued on the local CPU" This is not actually something we should guarantee, but it's something we by accident have historically done, and at least one call site has grown to depend on it. I'm going to fix that known broken callsite, but in the meantime this makes the accidental behavior be explicit, just in case there are other cases that might depend on it. * 'for-4.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: make sure delayed work run in local cpu
2015-10-15posix_cpu_timer: Reduce unnecessary sighand lock contentionJason Low1-2/+24
It was found while running a database workload on large systems that significant time was spent trying to acquire the sighand lock. The issue was that whenever an itimer expired, many threads ended up simultaneously trying to send the signal. Most of the time, nothing happened after acquiring the sighand lock because another thread had just already sent the signal and updated the "next expire" time. The fastpath_timer_check() didn't help much since the "next expire" time was updated after the threads exit fastpath_timer_check(). This patch addresses this by having the thread_group_cputimer structure maintain a boolean to signify when a thread in the group is already checking for process wide timers, and adds extra logic in the fastpath to check the boolean. Signed-off-by: Jason Low <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Reviewed-by: George Spelvin <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-15posix_cpu_timer: Convert cputimer->running to boolJason Low2-3/+3
In the next patch in this series, a new field 'checking_timer' will be added to 'struct thread_group_cputimer'. Both this and the existing 'running' integer field are just used as boolean values. To save space in the structure, we can make both of these fields booleans. This is a preparatory patch to convert the existing running integer field to a boolean. Suggested-by: George Spelvin <[email protected]> Signed-off-by: Jason Low <[email protected]> Reviewed: George Spelvin <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-15posix_cpu_timer: Check thread timers only when there are active thread timersJason Low1-6/+16
The fastpath_timer_check() contains logic to check for if any timers are set by checking if !task_cputime_zero(). Similarly, we can do this before calling check_thread_timers(). In the case where there are only process-wide timers, this will skip all of the computations for per-thread timers when there are no per-thread timers. As suggested by George, we can put the task_cputime_zero() check in check_thread_timers(), since that is more of an optization to the function. Similarly, we move the existing check of cputimer->running to check_process_timers(). Signed-off-by: Jason Low <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Reviewed-by: George Spelvin <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-15posix_cpu_timer: Optimize fastpath_timer_check()Jason Low1-8/+3
In fastpath_timer_check(), the task_cputime() function is always called to compute the utime and stime values. However, this is not necessary if there are no per-thread timers to check for. This patch modifies the code such that we compute the task_cputime values only when there are per-thread timers set. Signed-off-by: Jason Low <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Reviewed-by: George Spelvin <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13ftrace: Remove redundant strsep in mod_callbackDmitry Safonov1-10/+5
By now there isn't any subcommand for mod. Before: sh$ echo '*:mod:ipv6:a' > set_ftrace_filter sh$ echo '*:mod:ipv6' > set_ftrace_filter had the same results, but now first will result in: sh$ echo '*:mod:ipv6:a' > set_ftrace_filter -bash: echo: write error: Invalid argument Also, I clarified ftrace_mod_callback code a little. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Safonov <[email protected]> [ converted 'if (ret == 0)' to 'if (!ret)' ] Signed-off-by: Steven Rostedt <[email protected]>
2015-10-14PM / hibernate: fix a comment typoGeliang Tang1-1/+1
Just fix a typo in a function name in kerneldoc comments. Signed-off-by: Geliang Tang <[email protected]> Acked-by: Pavel Machek <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2015-10-14PM / sleep: Add flags to indicate platform firmware involvementRafael J. Wysocki1-0/+4
There are quite a few cases in which device drivers, bus types or even the PM core itself may benefit from knowing whether or not the platform firmware will be involved in the upcoming system power transition (during system suspend) or whether or not it was involved in it (during system resume). For this reason, introduce global system suspend flags that can be used by the platform code to expose that information for the benefit of the other parts of the kernel and make the ACPI core set them as appropriate. Users of the new flags will be added later. Signed-off-by: Rafael J. Wysocki <[email protected]>
2015-10-13irqdomain/msi: Use fwnode instead of of_nodeMarc Zyngier1-4/+4
As we continue to push of_node towards the outskirts of irq domains, let's start tackling the case of msi_create_irq_domain and its little friends. This has limited impact in both PCI/MSI, platform MSI, and a few drivers. Signed-off-by: Marc Zyngier <[email protected]> Tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13irqdomain: Introduce irq_domain_create_hierarchyMarc Zyngier1-6/+6
As we're about to start converting the various MSI layers to use fwnode_handle instead of device_node, add irq_domain_create_hierarchy as a directly equivalent of irq_domain_add_hierarchy (which still exists as a compatibility interface). Signed-off-by: Marc Zyngier <[email protected]> Tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13irqdomain: Add a fwnode_handle allocatorMarc Zyngier1-0/+51
In order to be able to reference an irqdomain from ACPI, we need to be able to create an identifier, which is usually a struct device_node. This device node does't really fit the ACPI infrastructure, so we cunningly allocate a new structure containing a fwnode_handle, and return that. This structure doesn't really point to a device (interrupt controllers are not "real" devices in Linux), but as we cannot really deny that they exist, we create them with a new fwnode_type (FWNODE_IRQCHIP). Signed-off-by: Marc Zyngier <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Reviewed-and-tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13irqdomain: Introduce irq_domain_create_{linear, tree}Marc Zyngier1-5/+6
Just like we have irq_domain_add_{linear,tree} to create a irq domain identified by an of_node, introduce irq_domain_create_{linear,tree} that do the same thing, except that they take a struct fwnode_handle. Existing functions get rewritten in terms of the new ones so that everything keeps working as before (and __irq_domain_add is now fwnode_handle based as well). Signed-off-by: Marc Zyngier <[email protected]> Reviewed-and-tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13irqdomain: Introduce irq_create_fwspec_mappingMarc Zyngier1-15/+15
Just like we have irq_create_of_mapping, irq_create_fwspec_mapping creates a IRQ domain mapping for an interrupt described in a struct irq_fwspec. irq_create_of_mapping gets rewritten in terms of the new function, and the hack we introduced before gets removed (now that no stacked irqchip uses of_phandle_args anymore). Signed-off-by: Marc Zyngier <[email protected]> Reviewed-and-tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13irqdomain: Introduce a firmware-specific IRQ specifier structureMarc Zyngier1-11/+48
So far the closest thing to a generic IRQ specifier structure is of_phandle_args, which happens to be pretty OF specific (the of_node pointer in there is quite annoying). Let's introduce 'struct irq_fwspec' that can be used in place of of_phandle_args for OF, but also for other firmware implementations (that'd be ACPI). This is used together with a new 'translate' method that is the pendent of 'xlate'. We convert irq_create_of_mapping to use this new structure (with a small hack that will be removed later). Signed-off-by: Marc Zyngier <[email protected]> Reviewed-and-tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13irqdomain: Allow irq domain lookup by fwnodeMarc Zyngier1-9/+7
So far, our irq domains are still looked up by device node. Let's change this and allow a domain to be looked up using a fwnode_handle pointer. The existing interfaces are preserved with a couple of helpers. Signed-off-by: Marc Zyngier <[email protected]> Reviewed-and-tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13irqdomain: Convert irqdomain-%3Eof_node to fwnodeMarc Zyngier1-1/+5
Now that we have everyone accessing the of_node field via the irq_domain_get_of_node accessor, it is pretty easy to swap it for a pointer to a fwnode_handle. This translates into a few limited changes in __irq_domain_add, and an updated irq_domain_get_of_node. Signed-off-by: Marc Zyngier <[email protected]> Reviewed-and-tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13irqdomain: Use irq_domain_get_of_node() instead of direct field accessMarc Zyngier1-9/+21
The struct irq_domain contains a "struct device_node *" field (of_node) that is almost the only link between the irqdomain and the device tree infrastructure. In order to prepare for the removal of that field, convert all users to use irq_domain_get_of_node() instead. Signed-off-by: Marc Zyngier <[email protected]> Reviewed-and-tested-by: Hanjun Guo <[email protected]> Tested-by: Lorenzo Pieralisi <[email protected]> Cc: <[email protected]> Cc: Tomasz Nowicki <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Graeme Gregory <[email protected]> Cc: Jake Oshins <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-13Merge branch 'linus' into irq/coreThomas Gleixner8-52/+129
Bring in upstream updates for patches which depend on them
2015-10-12bpf: charge user for creation of BPF maps and programsAlexei Starovoitov3-1/+68
since eBPF programs and maps use kernel memory consider it 'locked' memory from user accounting point of view and charge it against RLIMIT_MEMLOCK limit. This limit is typically set to 64Kbytes by distros, so almost all bpf+tracing programs would need to increase it, since they use maps, but kernel charges maximum map size upfront. For example the hash map of 1024 elements will be charged as 64Kbyte. It's inconvenient for current users and changes current behavior for root, but probably worth doing to be consistent root vs non-root. Similar accounting logic is done by mmap of perf_event. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-12bpf: enable non-root eBPF programsAlexei Starovoitov3-14/+116
In order to let unprivileged users load and execute eBPF programs teach verifier to prevent pointer leaks. Verifier will prevent - any arithmetic on pointers (except R10+Imm which is used to compute stack addresses) - comparison of pointers (except if (map_value_ptr == 0) ... ) - passing pointers to helper functions - indirectly passing pointers in stack to helper functions - returning pointer from bpf program - storing pointers into ctx or maps Spill/fill of pointers into stack is allowed, but mangling of pointers stored in the stack or reading them byte by byte is not. Within bpf programs the pointers do exist, since programs need to be able to access maps, pass skb pointer to LD_ABS insns, etc but programs cannot pass such pointer values to the outside or obfuscate them. Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs, so that socket filters (tcpdump), af_packet (quic acceleration) and future kcm can use it. tracing and tc cls/act program types still require root permissions, since tracing actually needs to be able to see all kernel pointers and tc is for root only. For example, the following unprivileged socket filter program is allowed: int bpf_prog1(struct __sk_buff *skb) { u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); u64 *value = bpf_map_lookup_elem(&my_map, &index); if (value) *value += skb->len; return 0; } but the following program is not: int bpf_prog1(struct __sk_buff *skb) { u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); u64 *value = bpf_map_lookup_elem(&my_map, &index); if (value) *value += (u64) skb; return 0; } since it would leak the kernel address into the map. Unprivileged socket filter bpf programs have access to the following helper functions: - map lookup/update/delete (but they cannot store kernel pointers into them) - get_random (it's already exposed to unprivileged user space) - get_smp_processor_id - tail_call into another socket filter program - ktime_get_ns The feature is controlled by sysctl kernel.unprivileged_bpf_disabled. This toggle defaults to off (0), but can be set true (1). Once true, bpf programs and maps cannot be accessed from unprivileged process, and the toggle cannot be set back to false. Signed-off-by: Alexei Starovoitov <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-12Merge back earlier 'pm-sleep' material for v4.4.Rafael J. Wysocki2-1/+18
2015-10-12workqueue: Allocate the unbound pool using local node memoryXunlei Pang1-12/+14
Currently, get_unbound_pool() uses kzalloc() to allocate the worker pool. Actually, we can use the right node to do the allocation, achieving local memory access. This patch selects target node first, and uses kzalloc_node() instead. Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2015-10-12Merge tag 'v4.3-rc5' into timers/core, to pick up fixes before applying new ↵Ingo Molnar11-82/+220
changes Signed-off-by: Ingo Molnar <[email protected]>
2015-10-12sched, tracing: Stop/start critical timings around the idle=poll idle loopDaniel Bristot de Oliveira1-0/+2
When using idle=poll, the preemptoff tracer is always showing the idle task as the culprit for long latencies. That happens because critical timings are not stopped before idle loop. This patch stops critical timings before entering the idle loop, starting it again after the idle loop. This problem does not affect the irqsoff tracer because interruptions are enabled before entering the idle loop. Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Reviewed-by: Luis Claudio R. Goncalves <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/10fc3705874aef11dbe152a068b591a7be1899b4.1444314899.git.bristot@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2015-10-11timers: Use __fls in apply_slack()Rasmus Villemoes1-1/+1
In apply_slack(), find_last_bit() is applied to a bitmask consisting of precisely BITS_PER_LONG bits. Since mask is non-zero, we might as well eliminate the function call and use __fls() directly. On x86_64, this shaves 23 bytes of the only caller, mod_timer(). This also gets rid of Coverity CID 1192106, but that is a false positive: Coverity is not aware that mask != 0 implies that find_last_bit will not return BITS_PER_LONG. Signed-off-by: Rasmus Villemoes <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-11clocksource: Remove return statement from void functionsGuillaume Gomez1-3/+2
Signed-off-by: Guillaume Gomez <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/CAAOQCfSDgmqSWDBsetau%2ByF8x0%[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-10-11Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2-7/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "Fix a long standing state race in finish_task_switch()" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix TASK_DEAD race in finish_task_switch()
2015-10-11bpf: fix cb access in socket filter programsAlexei Starovoitov1-1/+1
eBPF socket filter programs may see junk in 'u32 cb[5]' area, since it could have been used by protocol layers earlier. For socket filter programs used in af_packet we need to clean 20 bytes of skb->cb area if it could be used by the program. For programs attached to TCP/UDP sockets we need to save/restore these 20 bytes, since it's used by protocol layers. Remove SK_RUN_FILTER macro, since it's no longer used. Long term we may move this bpf cb area to per-cpu scratch, but that requires addition of new 'per-cpu load/store' instructions, so not suitable as a short term fix. Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields") Reported-by: Eric Dumazet <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-11genirq: Add flag to force mask in disable_irq[_nosync]()Thomas Gleixner3-0/+22
If an irq chip does not implement the irq_disable callback, then we use a lazy approach for disabling the interrupt. That means that the interrupt is marked disabled, but the interrupt line is not immediately masked in the interrupt chip. It only becomes masked if the interrupt is raised while it's marked disabled. We use this to avoid possibly expensive mask/unmask operations for common case operations. Unfortunately there are devices which do not allow the interrupt to be disabled easily at the device level. They are forced to use disable_irq_nosync(). This can result in taking each interrupt twice. Instead of enforcing the non lazy mode on all interrupts of a irq chip, provide a settings flag, which can be set by the driver for that particular interrupt line. Reported-and-tested-by: Duc Dang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Jason Cooper <[email protected]> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1510092348370.6097@nanos
2015-10-09pmem, memremap: convert to numa aware allocationsDan Williams1-3/+4
Given that pmem ranges come with numa-locality hints, arrange for the resulting driver objects to be obtained from node-local memory. Reviewed-by: Tejun Heo <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2015-10-09devm_memremap_pages: use numa_mem_idDan Williams1-1/+1
Hint to closest numa node for the placement of newly allocated pages. As that is where the device's other allocations will originate by default when it does not specify a NUMA node. Cc: Ross Zwisler <[email protected]> Reviewed-by: Tejun Heo <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2015-10-09devm_memremap: convert to return ERR_PTRDan Williams1-1/+1
Make devm_memremap consistent with the error return scheme of devm_memremap_pages to remove special casing in the pmem driver. Cc: Christoph Hellwig <[email protected]> Cc: Ross Zwisler <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>