aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig2-4/+3
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-08-23workqueue: Use TASK_IDLEPeter Zijlstra1-2/+2
Workqueues don't use signals, it (ab)uses TASK_INTERRUPTIBLE to avoid increasing the loadavg numbers. We've 'recently' introduced TASK_IDLE for this case: 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE") use it. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2017-08-23genirq: Let irq_set_vcpu_affinity() iterate over hierarchyMarc Zyngier1-2/+12
When assigning an interrupt to a vcpu, it is not unlikely that the level of the hierarchy implementing irq_set_vcpu_affinity is not the top level (think a generic MSI domain on top of a virtualization aware interrupt controller). In such a case, let's iterate over the hierarchy until we find an irqchip implementing it. Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2017-08-22bpf: minor cleanups for dev_mapDaniel Borkmann1-59/+41
Some minor code cleanups, while going over it I also noticed that we're accounting the bitmap only for one CPU currently, so fix that up as well. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-22ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES.Martijn Coenen1-0/+1
These will be required going forward. Signed-off-by: Martijn Coenen <[email protected]> Cc: stable <[email protected]> # 4.11+ Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-08-22bpf: fix map value attribute for hash of mapsDaniel Borkmann1-13/+17
Currently, iproute2's BPF ELF loader works fine with array of maps when retrieving the fd from a pinned node and doing a selfcheck against the provided map attributes from the object file, but we fail to do the same for hash of maps and thus refuse to get the map from pinned node. Reason is that when allocating hash of maps, fd_htab_map_alloc() will set the value size to sizeof(void *), and any user space map creation requests are forced to set 4 bytes as value size. Thus, selfcheck will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID returns the value size used to create the map. Fix it by handling it the same way as we do for array of maps, which means that we leave value size at 4 bytes and in the allocation phase round up value size to 8 bytes. alloc_htab_elem() needs an adjustment in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem() calling into htab_map_update_elem() with the pointer of the map pointer as value. Unlike array of maps where we just xchg(), we're using the generic htab_map_update_elem() callback also used from helper calls, which published the key/value already on return, so we need to ensure to memcpy() the right size. Fixes: bcc6b1b7ebf8 ("bpf: Add hash of maps support") Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-22bpf: fix map value attribute for hash of mapsDaniel Borkmann1-13/+17
Currently, iproute2's BPF ELF loader works fine with array of maps when retrieving the fd from a pinned node and doing a selfcheck against the provided map attributes from the object file, but we fail to do the same for hash of maps and thus refuse to get the map from pinned node. Reason is that when allocating hash of maps, fd_htab_map_alloc() will set the value size to sizeof(void *), and any user space map creation requests are forced to set 4 bytes as value size. Thus, selfcheck will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID returns the value size used to create the map. Fix it by handling it the same way as we do for array of maps, which means that we leave value size at 4 bytes and in the allocation phase round up value size to 8 bytes. alloc_htab_elem() needs an adjustment in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem() calling into htab_map_update_elem() with the pointer of the map pointer as value. Unlike array of maps where we just xchg(), we're using the generic htab_map_update_elem() callback also used from helper calls, which published the key/value already on return, so we need to ensure to memcpy() the right size. Fixes: bcc6b1b7ebf8 ("bpf: Add hash of maps support") Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller9-28/+149
2017-08-21pids: make task_tgid_nr_ns() safeOleg Nesterov1-7/+4
This was reported many times, and this was even mentioned in commit 52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is not safe because task->group_leader points to nowhere after the exiting task passes exit_notify(), rcu_read_lock() can not help. We really need to change __unhash_process() to nullify group_leader, parent, and real_parent, but this needs some cleanups. Until then we can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and fix the problem. Reported-by: Troy Kensinger <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-21Merge branch 'for-mingo' of ↵Ingo Molnar25-787/+544
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Removal of spin_unlock_wait() - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes - CPU-hotplug fixes - Miscellaneous non-RCU fixes Signed-off-by: Ingo Molnar <[email protected]>
2017-08-20bpf: fix double free from dev_map_notification()Daniel Borkmann1-7/+5
In the current code, dev_map_free() can still race with dev_map_notification(). In dev_map_free(), we remove dtab from the list of dtabs after we purged all entries from it. However, we don't do xchg() with NULL or the like, so the entry at that point is still pointing to the device. If a unregister notification comes in at the same time, we therefore risk a double-free, since the pointer is still present in the map, and then pushed again to __dev_map_entry_free(). All this is completely unnecessary. Just remove the dtab from the list right before the synchronize_rcu(), so all outstanding readers from the notifier list have finished by then, thus we don't need to deal with this corner case anymore and also wouldn't need to nullify dev entires. This is fine because we iterate over the map releasing all entries and therefore dev references anyway. Fixes: 4cc7b9544b9a ("bpf: devmap fix mutex in rcu critical section") Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-20Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-8/+39
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two fixes for the perf subsystem: - Fix an inconsistency of RDPMC mm struct tagging across exec() which causes RDPMC to fault. - Correct the timestamp mechanics across IOC_DISABLE/ENABLE which causes incorrect timestamps and total time calculations" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix time on IOC_ENABLE perf/x86: Fix RDPMC vs. mm_struct tracking
2017-08-20Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds2-4/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A pile of smallish changes all over the place: - Add a missing ISB in the GIC V1 driver - Remove an ACPI version check in the GIC V3 ITS driver - Add the missing irq_pm_shutdown function for BRCMSTB-L2 to avoid spurious wakeups - Remove the artifical limitation of ITS instances to the number of NUMA nodes which prevents utilizing the ITS hardware correctly - Prevent a infinite parsing loop in the GIC-V3 ITS/MSI code - Honour the force affinity argument in the GIC-V3 driver which is required to make perf work correctly - Correctly report allocation failures in GIC-V2/V3 to avoid using half allocated and initialized interrupts. - Fixup checks against nr_cpu_ids in the generic IPI code" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/ipi: Fixup checks against nr_cpu_ids genirq: Restore trigger settings in irq_modify_status() MAINTAINERS: Remove Jason Cooper's irqchip git tree irqchip/gic-v3-its-platform-msi: Fix msi-parent parsing loop irqchip/gic-v3-its: Allow GIC ITS number more than MAX_NUMNODES irqchip: brcmstb-l2: Define an irq_pm_shutdown function irqchip/gic: Ensure we have an ISB between ack and ->handle_irq irqchip/gic-v3-its: Remove ACPICA version check for ACPI NUMA irqchip/gic-v3: Honor forced affinity setting irqchip/gic-v3: Report failures in gic_irq_domain_alloc irqchip/gic-v2: Report failures in gic_irq_domain_alloc irqchip/atmel-aic: Remove root argument from ->fixup() prototype irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup() irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
2017-08-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds2-0/+60
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull watchdog fix from Thomas Gleixner: "A fix for the hardlockup watchdog to prevent false positives with extreme Turbo-Modes which make the perf/NMI watchdog fire faster than the hrtimer which is used to verify. Slightly larger than the minimal fix, which just would increase the hrtimer frequency, but comes with extra overhead of more watchdog timer interrupts and thread wakeups for all users. With this change we restrict the overhead to the extreme Turbo-Mode systems" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/watchdog: Prevent false positives with turbo modes
2017-08-20Merge branch 'fortglx/4.14/time' of ↵Thomas Gleixner2-3/+10
https://git.linaro.org/people/john.stultz/linux into timers/core Pull timekeepig updates from John Stultz - kselftest improvements - Use the proper timekeeper in the debug code - Prevent accessing an unavailable wakeup source in the alarmtimer sysfs interface.
2017-08-20genirq/ipi: Fixup checks against nr_cpu_idsAlexey Dobriyan1-2/+2
Valid CPU ids are [0, nr_cpu_ids-1] inclusive. Fixes: 3b8e29a82dd1 ("genirq: Implement ipi_send_mask/single()") Fixes: f9bce791ae2a ("genirq: Add a new function to get IPI reverse mapping") Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/20170819095751.GB27864@avx2
2017-08-19bpf: inline map in map lookup functions for array and htabDaniel Borkmann2-0/+43
Avoid two successive functions calls for the map in map lookup, first is the bpf_map_lookup_elem() helper call, and second the callback via map->ops->map_lookup_elem() to get to the map in map implementation. Implementation inlines array and htab flavor for map in map lookups. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-19bpf: make htab inlining more robust wrt assumptionsDaniel Borkmann1-1/+5
Commit 9015d2f59535 ("bpf: inline htab_map_lookup_elem()") was making the assumption that a direct call emission to the function __htab_map_lookup_elem() will always work out for JITs. This is currently true since all JITs we have are for 64 bit archs, but in case of 32 bit JITs like upcoming arm32, we get a NULL pointer dereference when executing the call to __htab_map_lookup_elem() since passed arguments are of a different size (due to pointer args) than what we do out of BPF. Guard and thus limit this for now for the current 64 bit JITs only. Reported-by: Shubham Bansal <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-19bpf: Allow selecting numa node during map creationMartin KaFai Lau7-21/+55
The current map creation API does not allow to provide the numa-node preference. The memory usually comes from where the map-creation-process is running. The performance is not ideal if the bpf_prog is known to always run in a numa node different from the map-creation-process. One of the use case is sharding on CPU to different LRU maps (i.e. an array of LRU maps). Here is the test result of map_perf_test on the INNER_LRU_HASH_PREALLOC test if we force the lru map used by CPU0 to be allocated from a remote numa node: [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ] ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000 5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec 4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec 3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec 6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec 2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec 1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec 7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec 0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec #<<< After specifying numa node: ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000 5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec 3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec 1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec 6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec 2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec 4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec 7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec 0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<< This patch adds one field, numa_node, to the bpf_attr. Since numa node 0 is a valid node, a new flag BPF_F_NUMA_NODE is also added. The numa_node field is honored if and only if the BPF_F_NUMA_NODE flag is set. Numa node selection is not supported for percpu map. This patch does not change all the kmalloc. F.e. 'htab = kzalloc()' is not changed since the object is small enough to stay in the cache. Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-18bpf: Fix map-in-map checking in the verifierMartin KaFai Lau1-0/+1
In check_map_func_compatibility(), a 'break' has been accidentally removed for the BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS cases. This patch adds it back. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Cc: John Fastabend <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-18signal: don't remove SIGNAL_UNKILLABLE for traced tasks.Jamie Iles1-1/+5
When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive faults, but this is undesirable when tracing. For example, debugging an init process (whether global or namespace), hitting a breakpoint and SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE. Everything continues fine, but then once debugging has finished, the init process is left killable which is unlikely what the user expects, resulting in either an accidentally killed init or an init that stops reaping zombies. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jamie Iles <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-18kmod: fix wait on recursive loopLuis R. Rodriguez1-2/+23
Recursive loops with module loading were previously handled in kmod by restricting the number of modprobe calls to 50 and if that limit was breached request_module() would return an error and a user would see the following on their kernel dmesg: request_module: runaway loop modprobe binfmt-464c Starting init:/sbin/init exists but couldn't execute it (error -8) This issue could happen for instance when a 64-bit kernel boots a 32-bit userspace on some architectures and has no 32-bit binary format hanlders. This is visible, for instance, when a CONFIG_MODULES enabled 64-bit MIPS kernel boots a into o32 root filesystem and the binfmt handler for o32 binaries is not built-in. After commit 6d7964a722af ("kmod: throttle kmod thread limit") we now don't have any visible signs of an error and the kernel just waits for the loop to end somehow. Although this *particular* recursive loop could also be addressed by doing a sanity check on search_binary_handler() and disallowing a modular binfmt to be required for modprobe, a generic solution for any recursive kernel kmod issues is still needed. This should catch these loops. We can investigate each loop and address each one separately as they come in, this however puts a stop gap for them as before. Link: http://lkml.kernel.org/r/[email protected] Fixes: 6d7964a722af ("kmod: throttle kmod thread limit") Signed-off-by: Luis R. Rodriguez <[email protected]> Reported-by: Matt Redfearn <[email protected]> Tested-by: Matt Redfearn <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Daniel Mentz <[email protected]> Cc: David Binderman <[email protected]> Cc: Dmitry Torokhov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: Michal Marek <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-18bpf: fix a return in sockmap_get_from_fd()Dan Carpenter1-2/+2
"map" is a valid pointer. We wanted to return "err" instead. Also let's return a zero literal at the end. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Signed-off-by: Dan Carpenter <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-18cpuset: Allow v2 behavior in v1 cgroupWaiman Long1-13/+20
Cpuset v2 has some useful behaviors that are not present in v1 because of backward compatibility concern. One of that is the restoration of the original cpu and memory node mask after a hot removal and addition event sequence. This patch makes the cpuset controller to check the CGRP_ROOT_CPUSET_V2_MODE flag and use the v2 behavior if it is set. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2017-08-18cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroupWaiman Long1-0/+6
A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs filesystem to enable cpuset controller to use v2 behavior in a v1 cgroup. This mount option applies only to cpuset controller and have no effect on other controllers. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2017-08-18posix-cpu-timers: Use dedicated helper to access rlimit valuesKrzysztof Opasiak1-8/+6
Use rlimit() and rlimit_max() helper instead of manually writing whole chain from task to rlimit value Signed-off-by: Krzysztof Opasiak <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-08-18kernel/watchdog: Prevent false positives with turbo modesThomas Gleixner2-0/+60
The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-18genirq: Restore trigger settings in irq_modify_status()Marc Zyngier1-2/+8
irq_modify_status starts by clearing the trigger settings from irq_data before applying the new settings, but doesn't restore them, leaving them to IRQ_TYPE_NONE. That's pretty confusing to the potential request_irq() that could follow. Instead, snapshot the settings before clearing them, and restore them if the irq_modify_status() invocation was not changing the trigger. Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ") Reported-and-tested-by: jeffy <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jon Hunter <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18Merge branch 'irq/for-gpio' into irq/coreThomas Gleixner3-30/+313
Merge the flow handlers and irq domain extensions which are in a separate branch so they can be consumed by the gpio folks.
2017-08-18irqdomain: Add irq_domain_{push,pop}_irq() functionsDavid Daney1-0/+169
For an already existing irqdomain hierarchy, as might be obtained via a call to pci_enable_msix_range(), a PCI driver wishing to add an additional irqdomain to the hierarchy needs to be able to insert the irqdomain to that already initialized hierarchy. Calling irq_domain_create_hierarchy() allows the new irqdomain to be created, but no existing code allows for initializing the associated irq_data. Add a couple of helper functions (irq_domain_push_irq() and irq_domain_pop_irq()) to initialize the irq_data for the new irqdomain added to an existing hierarchy. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18irqdomain: Check for NULL function pointer in irq_domain_free_irqs_hierarchy()David Daney1-1/+2
A follow-on patch will call irq_domain_free_irqs_hierarchy() when the free() function pointer may be NULL. Add a NULL pointer check to handle this new use case. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18irqdomain: Factor out code to add and remove items to and from the revmapDavid Daney1-29/+29
The code to add and remove items to and from the revmap occurs several times. In preparation for the follow on patches that add more uses of this code, factor this out in to separate static functions. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18genirq: Add handle_fasteoi_{level,edge}_irq flow handlersDavid Daney2-0/+110
Follow-on patch for gpio-thunderx uses a irqdomain hierarchy which requires slightly different flow handlers, add them to chip.c which contains most of the other flow handlers. Make these conditionally compiled based on CONFIG_IRQ_FASTEOI_HIERARCHY_HANDLERS. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18genirq: Export more irq_chip_*_parent() functionsDavid Daney1-0/+3
Many of the family of functions including irq_chip_mask_parent(), irq_chip_unmask_parent() are exported, but not all. Add EXPORT_SYMBOL_GPL to irq_chip_enable_parent, irq_chip_disable_parent and irq_chip_set_affinity_parent, so they likewise are usable from modules. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18genirq/proc: Use the the accessor to report the effective affinityMarc Zyngier1-1/+1
If CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK is defined, but that the interrupt is not single target, the effective affinity reported in /proc/irq/x/effective_affinity will be empty, which is not the truth. Instead, use the accessor to report the affinity, which will pick the right mask. Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Andrew Lunn <[email protected]> Cc: James Hogan <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Paul Burton <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Kevin Cernekee <[email protected]> Cc: Wei Xu <[email protected]> Cc: Max Filippov <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Gregory Clement <[email protected]> Cc: Matt Redfearn <[email protected]> Cc: Sebastian Hesselbarth <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-08-18genirq/debugfs: Triggering of interrupts from userspaceMarc Zyngier1-1/+49
When developing new (and therefore buggy) interrupt related code, it can sometimes be useful to inject interrupts without having to rely on a device to actually generate them. This functionnality relies either on the irqchip driver to expose a irq_set_irqchip_state(IRQCHIP_STATE_PENDING) callback, or on the core code to be able to retrigger a (edge-only) interrupt. To use this feature: echo -n trigger > /sys/kernel/debug/irq/irqs/IRQNUM WARNING: This is DANGEROUS, and strictly a debug feature. Do not use it on a production system. Your HW is likely to catch fire, your data to be corrupted, and reporting this will make you look an even bigger fool than the idiot who wrote this patch. Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-08-18ACPI / PM: Check low power idle constraints for debug onlySrinivas Pandruvada1-1/+1
For SoC to achieve its lowest power platform idle state a set of hardware preconditions must be met. These preconditions or constraints can be obtained by issuing a device specific method (_DSM) with function "1". Refer to the document provided in the link below. Here during initialization (from attach() callback of LPS0 device), invoke function 1 to get the device constraints. Each enabled constraint is stored in a table. The devices in this table are used to check whether they were in required minimum state, while entering suspend. This check is done from platform freeze wake() callback, only when /sys/power/pm_debug_messages attribute is non zero. If any constraint is not met and device is ACPI power managed then it prints the device information to kernel logs. Also if debug is enabled in acpi/sleep.c, the constraint table and state of each device on wake is dumped in kernel logs. Since pm_debug_messages_on setting is used as condition to check constraints outside kernel/power/main.c, pm_debug_messages_on is changed to a global variable. Link: http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf Signed-off-by: Srinivas Pandruvada <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2017-08-18cpufreq: schedutil: Always process remote callback with slow switchingViresh Kumar1-2/+7
The frequency update from the utilization update handlers can be divided into two parts: (A) Finding the next frequency (B) Updating the frequency While any CPU can do (A), (B) can be restricted to a group of CPUs only, depending on the current platform. For platforms where fast cpufreq switching is possible, both (A) and (B) are always done from the same CPU and that CPU should be capable of changing the frequency of the target CPU. But for platforms where fast cpufreq switching isn't possible, after doing (A) we wake up a kthread which will eventually do (B). This kthread is already bound to the right set of CPUs, i.e. only those which can change the frequency of CPUs of a cpufreq policy. And so any CPU can actually do (A) in this case, as the frequency is updated from the right set of CPUs only. Check cpufreq_can_do_remote_dvfs() only for the fast switching case. Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2017-08-18cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarilyViresh Kumar1-1/+5
Utilization update callbacks are now processed remotely, even on the CPUs that don't share cpufreq policy with the target CPU (if dvfs_possible_from_any_cpu flag is set). But in non-fast switch paths, the frequency is changed only from one of policy->related_cpus. This happens because the kthread which does the actual update is bound to a subset of CPUs (i.e. related_cpus). Allow frequency to be remotely updated as well (i.e. call __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set. Reported-by: Pavan Kondeti <[email protected]> Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2017-08-17Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"Kees Cook1-2/+1
This reverts commit 68c4a4f8abc60c9440ede9cd123d48b78325f7a3, with various conflict clean-ups. The capability check required too much privilege compared to simple DAC controls. A system builder was forced to have crash handler processes run with CAP_SYSLOG which would give it the ability to read (and wipe) the _current_ dmesg, which is much more access than being given access only to the historical log stored in pstorefs. With the prior commit to make the root directory 0750, the files are protected by default but a system builder can now opt to give access to a specific group (via chgrp on the pstorefs root directory) without being forced to also give away CAP_SYSLOG. Suggested-by: Nick Kralevich <[email protected]> Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]>
2017-08-17alarmtimer: Fix unavailable wake-up source in sysfsGeert Uytterhoeven1-2/+9
Currently the alarmtimer registers a wake-up source unconditionally, regardless of the system having a (wake-up capable) RTC or not. Hence the alarmtimer will always show up in /sys/kernel/debug/wakeup_sources, even if it is not available, and thus cannot be a wake-up source. To fix this, postpone registration until a wake-up capable RTC device is added. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Signed-off-by: Geert Uytterhoeven <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-08-17timekeeping: Use proper timekeeper for debug codeStafford Horne1-1/+1
When CONFIG_DEBUG_TIMEKEEPING is enabled the timekeeping_check_update() function will update status like last_warning and underflow_seen on the timekeeper. If there are issues found this state is used to rate limit the warnings that get printed. This rate limiting doesn't really really work if stored in real_tk as the shadow timekeeper is overwritten onto real_tk at the end of every update_wall_time() call, resetting last_warning and other statuses. Fix rate limiting by using the shadow_timekeeper for timekeeping_check_update(). Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Fixes: commit 57d05a93ada7 ("time: Rework debugging variables so they aren't global") Signed-off-by: Stafford Horne <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-08-17bpf: don't enable preemption twice in smap_do_verdictDaniel Borkmann1-1/+2
In smap_do_verdict(), the fall-through branch leads to call preempt_enable() twice for the SK_REDIRECT, which creates an imbalance. Only enable it for all remaining cases again. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-17bpf: fix liveness propagation to parent in spilled stack slotsDaniel Borkmann1-1/+1
Using parent->regs[] when propagating REG_LIVE_READ for spilled regs doesn't work since parent->regs[] denote the set of normal registers but not spilled ones. Propagate to the correct regs. Fixes: dc503a8ad984 ("bpf/verifier: track liveness for pruning") Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Edward Cree <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-17Merge branches 'doc.2017.08.17a', 'fixes.2017.08.17a', ↵Paul E. McKenney26-798/+573
'hotplug.2017.07.25b', 'misc.2017.08.17a', 'spin_unlock_wait_no.2017.08.17a', 'srcu.2017.07.27c' and 'torture.2017.07.24c' into HEAD doc.2017.08.17a: Documentation updates. fixes.2017.08.17a: RCU fixes. hotplug.2017.07.25b: CPU-hotplug updates. misc.2017.08.17a: Miscellaneous fixes outside of RCU (give or take conflicts). spin_unlock_wait_no.2017.08.17a: Remove spin_unlock_wait(). srcu.2017.07.27c: SRCU updates. torture.2017.07.24c: Torture-test updates.
2017-08-17locking: Remove spin_unlock_wait() generic definitionsPaul E. McKenney1-117/+0
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes spin_unlock_wait() and related definitions from core code. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Will Deacon <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Alan Stern <[email protected]> Cc: Andrea Parri <[email protected]> Cc: Linus Torvalds <[email protected]>
2017-08-17exit: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney1-1/+2
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in do_exit() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock is a per-task lock, and this is happening only at task-exit time. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Cc: Alan Stern <[email protected]> Cc: Andrea Parri <[email protected]> Cc: Linus Torvalds <[email protected]>
2017-08-17completion: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney1-7/+4
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in completion_done() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock will be held only the wakeup happens really quickly. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Cc: Alan Stern <[email protected]> Cc: Andrea Parri <[email protected]> Cc: Linus Torvalds <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]>
2017-08-17membarrier: Provide expedited private commandMathieu Desnoyers5-71/+178
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built from all runqueues for which current thread's mm is the same as the thread calling sys_membarrier. It executes faster than the non-expedited variant (no blocking). It also works on NOHZ_FULL configurations. Scheduler-wise, it requires a memory barrier before and after context switching between processes (which have different mm). The memory barrier before context switch is already present. For the barrier after context switch: * Our TSO archs can do RELEASE without being a full barrier. Look at x86 spin_unlock() being a regular STORE for example. But for those archs, all atomics imply smp_mb and all of them have atomic ops in switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full barrier. * From all weakly ordered machines, only ARM64 and PPC can do RELEASE, the rest does indeed do smp_mb(), so there the spin_unlock() is a full barrier and we're good. * ARM64 has a very heavy barrier in switch_to(), which suffices. * PPC just removed its barrier from switch_to(), but appears to be talking about adding something to switch_mm(). So add a smp_mb__after_unlock_lock() for now, until this is settled on the PPC side. Changes since v3: - Properly document the memory barriers provided by each architecture. Changes since v2: - Address comments from Peter Zijlstra, - Add smp_mb__after_unlock_lock() after finish_lock_switch() in finish_task_switch() to add the memory barrier we need after storing to rq->curr. This is much simpler than the previous approach relying on atomic_dec_and_test() in mmdrop(), which actually added a memory barrier in the common case of switching between userspace processes. - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full kernel, rather than having the whole membarrier system call returning -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full. Adapt the CMD_QUERY mask accordingly. Changes since v1: - move membarrier code under kernel/sched/ because it uses the scheduler runqueue, - only add the barrier when we switch from a kernel thread. The case where we switch from a user-space thread is already handled by the atomic_dec_and_test() in mmdrop(). - add a comment to mmdrop() documenting the requirement on the implicit memory barrier. CC: Peter Zijlstra <[email protected]> CC: Paul E. McKenney <[email protected]> CC: Boqun Feng <[email protected]> CC: Andrew Hunter <[email protected]> CC: Maged Michael <[email protected]> CC: [email protected] CC: Avi Kivity <[email protected]> CC: Benjamin Herrenschmidt <[email protected]> CC: Paul Mackerras <[email protected]> CC: Michael Ellerman <[email protected]> Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Dave Watson <[email protected]>
2017-08-17rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()Paul E. McKenney1-2/+0
The rcu_idle_exit() and rcu_idle_enter() functions are exported because they were originally used by RCU_NONIDLE(), which was intended to be usable from modules. However, RCU_NONIDLE() now instead uses rcu_irq_enter_irqson() and rcu_irq_exit_irqson(), which are not exported, and there have been no complaints. This commit therefore removes the exports from rcu_idle_exit() and rcu_idle_enter(). Reported-by: Peter Zijlstra <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>