aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2018-06-04Merge branch 'work.aio-1' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull aio updates from Al Viro: "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly. The only thing I'm holding back for a day or so is Adam's aio ioprio - his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case), but let it sit in -next for decency sake..." * 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) aio: sanitize the limit checking in io_submit(2) aio: fold do_io_submit() into callers aio: shift copyin of iocb into io_submit_one() aio_read_events_ring(): make a bit more readable aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way aio: take list removal to (some) callers of aio_complete() aio: add missing break for the IOCB_CMD_FDSYNC case random: convert to ->poll_mask timerfd: convert to ->poll_mask eventfd: switch to ->poll_mask pipe: convert to ->poll_mask crypto: af_alg: convert to ->poll_mask net/rxrpc: convert to ->poll_mask net/iucv: convert to ->poll_mask net/phonet: convert to ->poll_mask net/nfc: convert to ->poll_mask net/caif: convert to ->poll_mask net/bluetooth: convert to ->poll_mask net/sctp: convert to ->poll_mask net/tipc: convert to ->poll_mask ...
2018-06-04bpf: guard bpf_get_current_cgroup_id() with CONFIG_CGROUPSYonghong Song1-0/+2
Commit bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper") introduced a new helper bpf_get_current_cgroup_id(). The helper has a dependency on CONFIG_CGROUPS. When CONFIG_CGROUPS is not defined, using the helper will result the following verifier error: kernel subsystem misconfigured func bpf_get_current_cgroup_id#80 which is hard for users to interpret. Guarding the reference to bpf_get_current_cgroup_id_proto with CONFIG_CGROUPS will result in below better message: unknown func bpf_get_current_cgroup_id#80 Fixes: bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper") Suggested-by: Daniel Borkmann <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-06-04Merge branch 'hch.procfs' of ↵Linus Torvalds11-248/+27
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull procfs updates from Al Viro: "Christoph's proc_create_... cleanups series" * 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits) xfs, proc: hide unused xfs procfs helpers isdn/gigaset: add back gigaset_procinfo assignment proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields tty: replace ->proc_fops with ->proc_show ide: replace ->proc_fops with ->proc_show ide: remove ide_driver_proc_write isdn: replace ->proc_fops with ->proc_show atm: switch to proc_create_seq_private atm: simplify procfs code bluetooth: switch to proc_create_seq_data netfilter/x_tables: switch to proc_create_seq_private netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data neigh: switch to proc_create_seq_data hostap: switch to proc_create_{seq,single}_data bonding: switch to proc_create_seq_data rtc/proc: switch to proc_create_single_data drbd: switch to proc_create_single resource: switch to proc_create_seq_data staging/rtl8192u: simplify procfs code jfs: simplify procfs code ...
2018-06-04Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-blockLinus Torvalds1-7/+7
Pull block updates from Jens Axboe: - clean up how we pass around gfp_t and blk_mq_req_flags_t (Christoph) - prepare us to defer scheduler attach (Christoph) - clean up drivers handling of bounce buffers (Christoph) - fix timeout handling corner cases (Christoph/Bart/Keith) - bcache fixes (Coly) - prep work for bcachefs and some block layer optimizations (Kent). - convert users of bio_sets to using embedded structs (Kent). - fixes for the BFQ io scheduler (Paolo/Davide/Filippo) - lightnvm fixes and improvements (Matias, with contributions from Hans and Javier) - adding discard throttling to blk-wbt (me) - sbitmap blk-mq-tag handling (me/Omar/Ming). - remove the sparc jsflash block driver, acked by DaveM. - Kyber scheduler improvement from Jianchao, making it more friendly wrt merging. - conversion of symbolic proc permissions to octal, from Joe Perches. Previously the block parts were a mix of both. - nbd fixes (Josef and Kevin Vigor) - unify how we handle the various kinds of timestamps that the block core and utility code uses (Omar) - three NVMe pull requests from Keith and Christoph, bringing AEN to feature completeness, file backed namespaces, cq/sq lock split, and various fixes - various little fixes and improvements all over the map * tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits) blk-mq: update nr_requests when switching to 'none' scheduler block: don't use blocking queue entered for recursive bio submits dm-crypt: fix warning in shutdown path lightnvm: pblk: take bitmap alloc. out of critical section lightnvm: pblk: kick writer on new flush points lightnvm: pblk: only try to recover lines with written smeta lightnvm: pblk: remove unnecessary bio_get/put lightnvm: pblk: add possibility to set write buffer size manually lightnvm: fix partial read error path lightnvm: proper error handling for pblk_bio_add_pages lightnvm: pblk: fix smeta write error path lightnvm: pblk: garbage collect lines with failed writes lightnvm: pblk: rework write error recovery path lightnvm: pblk: remove dead function lightnvm: pass flag on graceful teardown to targets lightnvm: pblk: check for chunk size before allocating it lightnvm: pblk: remove unnecessary argument lightnvm: pblk: remove unnecessary indirection lightnvm: pblk: return NVM_ error on failed submission lightnvm: pblk: warn in case of corrupted write buffer ...
2018-06-04Merge branches 'pm-pci', 'acpi-pm', 'pm-sleep' and 'pm-avs'Rafael J. Wysocki4-11/+30
* pm-pci: PCI / PM: Clean up outdated comments in pci_target_state() PCI / PM: Do not clear state_saved for devices that remain suspended * acpi-pm: ACPI: EC: Dispatch the EC GPE directly on s2idle wake ACPICA: Introduce acpi_dispatch_gpe() * pm-sleep: PM / hibernate: Fix oops at snapshot_write() PM / wakeup: Make s2idle_lock a RAW_SPINLOCK PM / s2idle: Make s2idle_wait_head swait based PM / wakeup: Make events_lock a RAW_SPINLOCK PM / suspend: Prevent might sleep splats * pm-avs: PM / AVS: rockchip-io: add io selectors and supplies for PX30
2018-06-04Merge branches 'pm-cpufreq-sched' and 'pm-cpuidle'Rafael J. Wysocki1-83/+179
* pm-cpufreq-sched: cpufreq: schedutil: Avoid missing updates for one-CPU policies schedutil: Allow cpufreq requests to be made even when kthread kicked cpufreq: Rename cpufreq_can_do_remote_dvfs() cpufreq: schedutil: Cleanup and document iowait boost cpufreq: schedutil: Fix iowait boost reset cpufreq: schedutil: Don't set next_freq to UINT_MAX Revert "cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily" * pm-cpuidle: cpuidle: governors: Consolidate PM QoS handling cpuidle: governors: Drop redundant checks related to PM QoS
2018-06-04Merge branches 'pm-qos' and 'pm-core'Rafael J. Wysocki2-1/+1
* pm-qos: PM / QoS: Drop redundant declaration of pm_qos_get_value() * pm-core: PM / runtime: Drop usage count for suppliers at device link removal PM / runtime: Fixup reference counting of device link suppliers at probe PM: wakeup: Use pr_debug() for the "aborting suspend" message PM / core: Drop unused internal inline functions for sysfs PM / core: Drop unused internal functions for pm_qos sysfs PM / core: Drop unused internal inline functions for wakeirqs PM / core: Drop internal unused inline functions for wakeups PM / wakeup: Only update last time for active wakeup sources PM / wakeup: Use seq_open() to show wakeup stats PM / core: Use dev_printk() and symbols in suspend/resume diagnostics PM / core: Simplify initcall_debug_report() timing PM / core: Remove unused initcall_debug_report() arguments PM / core: fix deferred probe breaking suspend resume order
2018-06-03bpf: implement bpf_get_current_cgroup_id() helperYonghong Song3-0/+18
bpf has been used extensively for tracing. For example, bcc contains an almost full set of bpf-based tools to trace kernel and user functions/events. Most tracing tools are currently either filtered based on pid or system-wide. Containers have been used quite extensively in industry and cgroup is often used together to provide resource isolation and protection. Several processes may run inside the same container. It is often desirable to get container-level tracing results as well, e.g. syscall count, function count, I/O activity, etc. This patch implements a new helper, bpf_get_current_cgroup_id(), which will return cgroup id based on the cgroup within which the current task is running. The later patch will provide an example to show that userspace can get the same cgroup id so it could configure a filter or policy in the bpf program based on task cgroup id. The helper is currently implemented for tracing. It can be added to other program types as well when needed. Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds3-18/+35
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: - two patches addressing the problem that the scheduler allows under certain conditions user space tasks to be scheduled on CPUs which are not yet fully booted which causes a few subtle and hard to debug issue - add a missing runqueue clock update in the deadline scheduler which triggers a warning under certain circumstances - fix a silly typo in the scheduler header file * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/headers: Fix typo sched/deadline: Fix missing clock update sched/core: Require cpu_active() in select_task_rq(), for user tasks sched/core: Fix rules for running on online && !active CPUs
2018-06-03bpf/xdp: devmap can avoid calling ndo_xdp_flushJesper Dangaard Brouer1-13/+6
The XDP_REDIRECT map devmap can avoid using ndo_xdp_flush, by instead instructing ndo_xdp_xmit to flush via XDP_XMIT_FLUSH flag in appropriate places. Notice after this patch it is possible to remove ndo_xdp_flush completely, as this is the last user of ndo_xdp_flush. This is left for later patches, to keep driver changes separate. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03xdp: add flags argument to ndo_xdp_xmit APIJesper Dangaard Brouer1-1/+1
This patch only change the API and reject any use of flags. This is an intermediate step that allows us to implement the flush flag operation later, for each individual driver in a separate patch. The plan is to implement flush operation via XDP_XMIT_FLUSH flag and then remove XDP_XMIT_FLAGS_NONE when done. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03bpf: fix context access in tracing progs on 32 bit archsDaniel Borkmann2-3/+10
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT program type in test_verifier report the following errors on x86_32: 172/p unpriv: spill/fill of different pointers ldx FAIL Unexpected error message! 0: (bf) r6 = r10 1: (07) r6 += -8 2: (15) if r1 == 0x0 goto pc+3 R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1 3: (bf) r2 = r10 4: (07) r2 += -76 5: (7b) *(u64 *)(r6 +0) = r2 6: (55) if r1 != 0x0 goto pc+1 R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp 7: (7b) *(u64 *)(r6 +0) = r1 8: (79) r1 = *(u64 *)(r6 +0) 9: (79) r1 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (71) r0 = *(u8 *)(r1 +68) invalid bpf_context access off=68 size=1 379/p check bpf_perf_event_data->sample_period half load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (69) r0 = *(u16 *)(r1 +68) invalid bpf_context access off=68 size=2 380/p check bpf_perf_event_data->sample_period word load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (61) r0 = *(u32 *)(r1 +68) invalid bpf_context access off=68 size=4 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (79) r0 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok() will then bail out saying that off & (size_default - 1) which is 68 & 7 doesn't cleanly align in the case of sample_period access from struct bpf_perf_event_data, hence verifier wrongly thinks we might be doing an unaligned access here though underlying arch can handle it just fine. Therefore adjust this down to machine size and check and rewrite the offset for narrow access on that basis. We also need to fix corresponding pe_prog_is_valid_access(), since we hit the check for off % size != 0 (e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs for tracing work on x86_32. Reported-by: Wang YanQing <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Tested-by: Wang YanQing <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03bpf: avoid retpoline for lookup/update/delete calls on mapsDaniel Borkmann2-22/+58
While some of the BPF map lookup helpers provide a ->map_gen_lookup() callback for inlining the map lookup altogether it is not available for every map, so the remaining ones have to call bpf_map_lookup_elem() helper which does a dispatch to map->ops->map_lookup_elem(). In times of retpolines, this will control and trap speculative execution rather than letting it do its work for the indirect call and will therefore cause a slowdown. Likewise, bpf_map_update_elem() and bpf_map_delete_elem() do not have an inlined version and need to call into their map->ops->map_update_elem() resp. map->ops->map_delete_elem() handlers. Before: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#232656 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call bpf_map_delete_elem#215008 <-- indirect call via 16: (95) exit helper After: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#233328 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call htab_lru_map_delete_elem#238240 <-- direct call 16: (95) exit In all three lookup/update/delete cases however we can use the actual address of the map callback directly if we find that there's only a single path with a map pointer leading to the helper call, meaning when the map pointer has not been poisoned from verifier side. Example code can be seen above for the delete case. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03bpf: show prog and map id in fdinfoDaniel Borkmann1-4/+8
Its trivial and straight forward to expose it for scripts that can then use it along with bpftool in order to inspect an individual application's used maps and progs. Right now we dump some basic information in the fdinfo file but with the help of the map/prog id full introspection becomes possible now. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03bpf: fixup error message from gpl helpers on license mismatchDaniel Borkmann1-1/+1
Stating 'proprietary program' in the error is just silly since it can also be a different open source license than that which is just not compatible. Reference: https://twitter.com/majek04/status/998531268039102465 Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-15/+31
Filling in the padding slot in the bpf structure as a bug fix in 'ne' overlapped with actually using that padding area for something in 'net-next'. Signed-off-by: David S. Miller <[email protected]>
2018-06-02libnvdimm, e820: Register all pmem resourcesDan Williams1-0/+1
There is currently a mismatch between the resources that will trigger the e820_pmem driver to register/load and the resources that will actually be surfaced as pmem ranges. register_e820_pmem() uses walk_iomem_res_desc() which includes children and siblings. In contrast, e820_pmem_probe() only considers top level resources. For example the following resource tree results in the driver being loaded, but no resources being registered: 398000000000-39bfffffffff : PCI Bus 0000:ae 39be00000000-39bf07ffffff : PCI Bus 0000:af 39be00000000-39beffffffff : 0000:af:00.0 39be10000000-39beffffffff : Persistent Memory (legacy) Fix this up to allow definitions of "legacy" pmem ranges anywhere in system-physical address space. Not that it is a recommended or safe to define a pmem range in PCI space, but it is useful for debug / experimentation, and the restriction on being a top-level resource was arbitrary. Cc: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-06-02bpf: btf: Ensure t->type == 0 for BTF_KIND_FWDMartin KaFai Lau1-1/+20
The t->type in BTF_KIND_FWD is not used. It must be 0. This patch ensures that and also adds a test case in test_btf.c Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-02bpf: btf: Check array t->sizeMartin KaFai Lau1-0/+5
This patch ensures array's t->size is 0. The array size is decided by its individual elem's size and the number of elements. Hence, t->size is not used and it must be 0. A test case is added to test_btf.c Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-31Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar4-27/+72
Signed-off-by: Ingo Molnar <[email protected]>
2018-05-31sched/headers: Fix typoDavidlohr Bueso1-1/+1
I cannot spell 'throttling'. Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-31sched/deadline: Fix missing clock updateJuri Lelli1-3/+3
A missing clock update is causing the following warning: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720 Call Trace: <IRQ> __hrtimer_run_queues+0x10f/0x530 hrtimer_interrupt+0xe5/0x240 smp_apic_timer_interrupt+0x79/0x2b0 apic_timer_interrupt+0xf/0x20 </IRQ> do_idle+0x203/0x280 cpu_startup_entry+0x6f/0x80 start_secondary+0x1b0/0x200 secondary_startup_64+0xa5/0xb0 hardirqs last enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360 hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0 softirqs last enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70 softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70 This happens because inactive_task_timer() calls sub_running_bw() (if TASK_DEAD and non_contending) that might trigger a schedutil update, which might access the clock. Clock is however currently updated only later in inactive_task_timer() function. Fix the problem by updating the clock right after task_rq_lock(). Reported-by: kernel test robot <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Claudio Scordino <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luca Abeni <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-31sched/core: Require cpu_active() in select_task_rq(), for user tasksPaul Burton1-2/+1
select_task_rq() is used in a few paths to select the CPU upon which a thread should be run - for example it is used by try_to_wake_up() & by fork or exec balancing. As-is it allows use of any online CPU that is present in the task's cpus_allowed mask. This presents a problem because there is a period whilst CPUs are brought online where a CPU is marked online, but is not yet fully initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state < CPUHP_ONLINE. Usually we don't run any user tasks during this window, but there are corner cases where this can happen. An example observed is: - Some user task A, running on CPU X, forks to create task B. - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's task_struct::cpu field to X. - CPU X is offlined. - Task A, currently somewhere between the __set_task_cpu() in copy_process() and the call to wake_up_new_task(), is migrated to CPU Y by migrate_tasks() when CPU X is offlined. - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The scheduler is now active on CPU X, but there are no user tasks on the runqueue. - Task A runs on CPU Y & reaches wake_up_new_task(). This calls select_task_rq() with cpu=X, taken from task B's task_struct, and select_task_rq() allows CPU X to be returned. - Task A enqueues task B on CPU X's runqueue, via activate_task() & enqueue_task(). - CPU X now has a user task on its runqueue before it has reached the CPUHP_ONLINE state. In most cases, the user tasks that schedule on the newly onlined CPU have no idea that anything went wrong, but one case observed to be problematic is if the task goes on to invoke the sched_setaffinity syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state before the CPU that brought it online calls stop_machine_unpark(). This means that for a portion of the window of time between CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct cpu_stopper has its enabled field set to false. If a user thread is executed on the CPU during this window and it invokes sched_setaffinity with a CPU mask that does not include the CPU it's running on, then when __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke migration_cpu_stop() and perform the actual migration away from the CPU it will simply return -ENOENT rather than calling migration_cpu_stop(). We then return from the sched_setaffinity syscall back to the user task that is now running on a CPU which it just asked not to run on, and which is not present in its cpus_allowed mask. This patch resolves the problem by having select_task_rq() enforce that user tasks run on CPUs that are active - the same requirement that select_fallback_rq() already enforces. This should ensure that newly onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to schedule user tasks, and also implies that bringup_wait_for_ap() will have called stop_machine_unpark() which resolves the sched_setaffinity issue above. I haven't yet investigated them, but it may be of interest to review whether any of the actions performed by hotplug states between CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended effects on user tasks that might schedule before they are reached, which might widen the scope of the problem from just affecting the behaviour of sched_setaffinity. Signed-off-by: Paul Burton <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-31sched/core: Fix rules for running on online && !active CPUsPeter Zijlstra1-12/+30
As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules for running on an online && !active CPU are stricter than just being a kthread, you need to be a per-cpu kthread. If you're not strictly per-CPU, you have better CPUs to run on and don't need the partially booted one to get your work done. The exception is to allow smpboot threads to bootstrap the CPU itself and get kernel 'services' initialized before we allow userspace on it. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-30bpf: devmap: remove redundant assignment of dev = devColin Ian King1-1/+1
The assignment dev = dev is redundant and should be removed. Detected by CoverityScan, CID#1469486 ("Evaluation order violation") Signed-off-by: Colin Ian King <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-30media: rc: introduce BPF_PROG_LIRC_MODE2Sean Young1-0/+7
Add support for BPF_PROG_LIRC_MODE2. This type of BPF program can call rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report that the last key should be repeated. The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; the target_fd must be the /dev/lircN device. Acked-by: Yonghong Song <[email protected]> Signed-off-by: Sean Young <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-30bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not foundSean Young2-2/+11
This makes is it possible for bpf prog detach to return -ENOENT. Acked-by: Yonghong Song <[email protected]> Signed-off-by: Sean Young <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-30Merge branch 'for-mingo' of ↵Ingo Molnar1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU fix from Paul E. McKenney: "This additional v4.18 pull request contains a single commit that fell through the cracks: Provide early rcu_cpu_starting() callback for the benefit of the x86/mtrr code, which needs RCU to be available on incoming CPUs earlier than has been the case in the past." Signed-off-by: Ingo Molnar <[email protected]>
2018-05-29tracing: Allow histogram triggers to access ftrace internal eventsSteven Rostedt (VMware)1-1/+1
Now that trace_marker can have triggers, including a histogram triggers, the onmatch() and onmax() access the trace event. To do so, the search routine to find the event file needs to use the raw __find_event_file() that does not filter out ftrace events. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Have zero size length in filter logic be full stringSteven Rostedt (VMware)1-11/+12
As strings in trace events may not have a nul terminating character, the filter string compares use the defined string length for the field for the compares. The trace_marker records data slightly different than do normal events. It's size is zero, meaning that the string is the rest of the array, and that the string also ends with '\0'. If the size is zero, assume that the string is nul terminated and read the string in question as is. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Add trigger file for trace_markers tracefs/ftrace/printSteven Rostedt (VMware)4-2/+29
Allow writing to the trace_markers file initiate triggers defined in tracefs/ftrace/print/trigger file. This will allow of user space to trigger the same type of triggers (including histograms) that the trace events use. Had to create a ftrace_event_register() function that will become the trace_marker print event's reg() function. This is required because of how triggers are enabled: event_trigger_write() { event_trigger_regex_write() { trigger_process_regex() { for p in trigger_commands { p->func(); /* trigger_snapshot_cmd->func */ event_trigger_callback() { cmd_ops->reg() /* register_trigger() */ { trace_event_trigger_enable_disable() { trace_event_enable_disable() { call->class->reg(); Without the reg() function, the trigger code will call a NULL pointer and crash the system. Cc: Tom Zanussi <[email protected]> Cc: Clark Williams <[email protected]> Cc: Karim Yaghmour <[email protected]> Cc: Brendan Gregg <[email protected]> Suggested-by: Joel Fernandes <[email protected]> Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29Merge tag 'trace-v4.17-rc4-3' of ↵Linus Torvalds3-10/+28
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "While writing selftests for a new feature, I triggered two existing bugs that deal with triggers and instances. - a generic trigger bug where the triggers are not removed from a linked list properly when deleting an instance. - a bug specific to snapshots, where the snapshot is done in the top level buffer, when it is supposed to snapshot the buffer associated to the instance the snapshot trigger exists in" * tag 'trace-v4.17-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Make the snapshot trigger work with instances tracing: Fix crash when freeing instances with event triggers
2018-05-29tracing: Do not show filter file for ftrace internal eventsSteven Rostedt (VMware)1-4/+6
The filter file in the ftrace internal events, like in /sys/kernel/tracing/events/ftrace/function/filter is not attached to any functionality. Do not create them as they are meaningless. In the future, if an ftrace internal event gets filter functionality, then it will need to create it directly. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Add brackets in ftrace event dynamic arraysSteven Rostedt (VMware)1-1/+1
The dynamic arrays defined for ftrace internal events, such as the buf field for trace_marker (ftrace/print) did not have brackets which makes the filter code not accept it as a string. This is not currently an issues because the filter code doesn't do anything for these events, but they will in the future, and this needs to be fixed for when it does. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Have event_trace_init() called by trace_init_tracefs()Steven Rostedt (VMware)3-3/+4
Instead of having both trace_init_tracefs() and event_trace_init() be called by fs_initcall() routines, have event_trace_init() called directly by trace_init_tracefs(). This will guarantee order of how the events are created with respect to the rest of the ftrace infrastructure. This is needed to be able to assoctiate event files with ftrace internal events, such as the trace_marker. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Add __find_event_file() to find event files without restrictionsSteven Rostedt (VMware)2-5/+20
By adding the function __find_event_file() that can search for files without restrictions, such as if the event associated with the file has a reg function, or if it has the "ignore" flag set, the files that are associated to ftrace internal events (like trace_marker and function events) can be found and used. find_event_file() still returns a "filtered" file, as most callers need a valid trace event file. One created by the trace_events.h macros and not one created for parsing ftrace specific events. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Do not reference event data in post call triggersSteven Rostedt (VMware)2-6/+4
Trace event triggers can be called before or after the event has been committed. If it has been called after the commit, there's a possibility that the event no longer exists. Currently, the two post callers is the trigger to disable tracing (traceoff) and the one that will record a stack dump (stacktrace). Neither of them reference the trace event entry record, as that would lead to a race condition that could pass in corrupted data. To prevent any other users of the post data triggers from using the trace event record, pass in NULL to the post call trigger functions for the event record, as they should never need to use them in the first place. This does not fix any bug, but prevents bugs from happening by new post call trigger users. Reviewed-by: Masami Hiramatsu <[email protected]> Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-28PM / QoS: Drop redundant declaration of pm_qos_get_value()Rafael J. Wysocki1-1/+0
The extra forward declaration of pm_qos_get_value() is redundant, so drop it. Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-28tracepoints: Fix the descriptions of tracepoint_probe_register{_prio}Lee, Chun-Yi1-2/+1
The description of tracepoint_probe_register duplicates with tracepoint_probe_register_prio. This patch cleans up the descriptions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Lee, Chun-Yi" <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-28tracing: Make the snapshot trigger work with instancesSteven Rostedt (VMware)3-8/+25
The snapshot trigger currently only affects the main ring buffer, even when it is used by the instances. This can be confusing as the snapshot trigger is listed in the instance. > # cd /sys/kernel/tracing > # mkdir instances/foo > # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger > # echo top buffer > trace_marker > # echo foo buffer > instances/foo/trace_marker > # touch /tmp/bar > # chown rostedt /tmp/bar > # cat instances/foo/snapshot # tracer: nop # # # * Snapshot is freed * # # Snapshot commands: # echo 0 > snapshot : Clears and frees snapshot buffer # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated. # Takes a snapshot of the main buffer. # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free) # (Doesn't have to be '2' works with any number that # is not a '0' or '1') > # cat snapshot # tracer: nop # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | bash-1189 [000] .... 111.488323: tracing_mark_write: top buffer Not only did the snapshot occur in the top level buffer, but the instance snapshot buffer should have been allocated, and it is still free. Cc: [email protected] Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-28bpf: Hooks for sys_sendmsgAndrey Ignatov2-1/+18
In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg. It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that provides access to socket itlself (properties like family, type, protocol) and user-passed `struct sockaddr *` so that BPF program can override destination IP and port for system calls such as sendto(2) or sendmsg(2) and/or assign source IP to the socket. The hooks are implemented as two new attach types: `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and UDPv6 correspondingly. UDPv4 and UDPv6 separate attach types for same reason as sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. The difference with already existing hooks is sys_sendmsg are implemented only for unconnected UDP. For TCP it doesn't make sense to change user-provided `struct sockaddr *` at sendto(2)/sendmsg(2) time since socket either was already connected and has source/destination set or wasn't connected and call to sendto(2)/sendmsg(2) would lead to ENOTCONN anyway. Connected UDP is already handled by sys_connect hooks that can override source/destination at connect time and use fast-path later, i.e. these hooks don't affect UDP fast-path. Rewriting source IP is implemented differently than that in sys_connect hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to just bind socket to desired local IP address since source IP can be set on per-packet basis by using ancillary data (cmsg(3)). So no matter if socket is bound or not, source IP has to be rewritten on every call to sys_sendmsg. To do so two new fields are added to UAPI `struct bpf_sock_addr`; * `msg_src_ip4` to set source IPv4 for UDPv4; * `msg_src_ip6` to set source IPv6 for UDPv6. Signed-off-by: Andrey Ignatov <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-28bpf: avoid -Wmaybe-uninitialized warningArnd Bergmann1-4/+3
The stack_map_get_build_id_offset() function is too long for gcc to track whether 'work' may or may not be initialized at the end of it, leading to a false-positive warning: kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset': kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this function [-Werror=maybe-uninitialized] This removes the 'in_nmi_ctx' flag and uses the state of that variable itself to see if it got initialized. Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-28bpf: btf: avoid -Wreturn-type warningArnd Bergmann1-1/+1
gcc warns about a noreturn function possibly returning in some configurations: kernel/bpf/btf.c: In function 'env_type_is_resolve_sink': kernel/bpf/btf.c:729:1: error: control reaches end of non-void function [-Werror=return-type] Using BUG() instead of BUG_ON() avoids that warning and otherwise does the exact same thing. Fixes: eb3f595dab40 ("bpf: btf: Validate type reference") Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-27tracing: Fix crash when freeing instances with event triggersSteven Rostedt (VMware)1-2/+3
If a instance has an event trigger enabled when it is freed, it could cause an access of free memory. Here's the case that crashes: # cd /sys/kernel/tracing # mkdir instances/foo # echo snapshot > instances/foo/events/initcall/initcall_start/trigger # rmdir instances/foo Would produce: general protection fault: 0000 [#1] PREEMPT SMP PTI Modules linked in: tun bridge ... CPU: 5 PID: 6203 Comm: rmdir Tainted: G W 4.17.0-rc4-test+ #933 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:clear_event_triggers+0x3b/0x70 RSP: 0018:ffffc90003783de0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0 RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016 R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0 R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c FS: 00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0 Call Trace: event_trace_del_tracer+0x2a/0xc5 instance_rmdir+0x15c/0x200 tracefs_syscall_rmdir+0x52/0x90 vfs_rmdir+0xdb/0x160 do_rmdir+0x16d/0x1c0 __x64_sys_rmdir+0x17/0x20 do_syscall_64+0x55/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe This was due to the call the clears out the triggers when an instance is being deleted not removing the trigger from the link list. Cc: [email protected] Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-27PM / hibernate: Fix oops at snapshot_write()Tetsuo Handa1-0/+5
syzbot is reporting NULL pointer dereference at snapshot_write() [1]. This is because data->handle is zero-cleared by ioctl(SNAPSHOT_FREE). Fix this by checking data_of(data->handle) != NULL before using it. [1] https://syzkaller.appspot.com/bug?id=828a3c71bd344a6de8b6a31233d51a72099f27fd Signed-off-by: Tetsuo Handa <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-27PM / wakeup: Make s2idle_lock a RAW_SPINLOCKSebastian Andrzej Siewior1-7/+7
The `s2idle_lock' is acquired during suspend while interrupts are disabled even on RT. The lock is acquired for short sections only. Make it a RAW lock which avoids "sleeping while atomic" warnings on RT. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-27PM / s2idle: Make s2idle_wait_head swait basedSebastian Andrzej Siewior1-4/+5
s2idle_wait_head is used during s2idle with interrupts disabled even on RT. There is no "custom" wake up function so swait could be used instead which is also lower weight compared to the wait_queue. Make s2idle_wait_head a swait_queue_head. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-27PM / suspend: Prevent might sleep splatsThomas Gleixner3-0/+13
timekeeping suspend/resume calls read_persistent_clock() which takes rtc_lock. That results in might sleep warnings because at that point we run with interrupts disabled. We cannot convert rtc_lock to a raw spinlock as that would trigger other might sleep warnings. As a workaround we disable the might sleep warnings by setting system_state to SYSTEM_SUSPEND before calling sysdev_suspend() and restoring it to SYSTEM_RUNNING afer sysdev_resume(). There is no lock contention because hibernate / suspend to RAM is single-CPU at this point. In s2idle's case the system_state is set to SYSTEM_SUSPEND before timekeeping_suspend() which is invoked by the last CPU. In the resume case it set back to SYSTEM_RUNNING after timekeeping_resume() which is invoked by the first CPU in the resume case. The other CPUs will block on tick_freeze_lock. Signed-off-by: Thomas Gleixner <[email protected]> [bigeasy: cover s2idle in tick_freeze() / tick_unfreeze()] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-31/+163
Lots of easy overlapping changes in the confict resolutions here. Signed-off-by: David S. Miller <[email protected]>
2018-05-26Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Three fixes for scheduler and kthread code: - allow calling kthread_park() on an already parked thread - restore the sched_pi_setprio() tracepoint behaviour - clarify the unclear string for the scheduling domain debug output" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched, tracing: Fix trace_sched_pi_setprio() for deboosting kthread: Allow kthread_park() on a parked kthread sched/topology: Clarify root domain(s) debug string