path: root/kernel
Age | Commit message | Author | Files | Lines
2018-12-18 | bpf: btf: refactor btf_int_bits_seq_show() | Yonghong Song | 1 | -14/+21
Refactor function btf_int_bits_seq_show() by creating function btf_bitfield_seq_show() which has no dependence on btf and btf_type. The function btf_bitfield_seq_show() will be used in a later patch to directly dump bitfield member values. Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
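The kind of bit extraction a btf-independent bitfield helper has to perform can be sketched in plain C. This is only an illustration under stated assumptions, not the kernel's btf_bitfield_seq_show(): the name sketch_extract_bits() is hypothetical, and the sketch assumes bits are numbered least-significant-first within each byte.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: read nr_bits bits starting at bit_off from a byte
 * buffer. Assumption: bit 0 is the least significant bit of data[0].
 */
static uint64_t sketch_extract_bits(const uint8_t *data, unsigned int bit_off,
				    unsigned int nr_bits)
{
	uint64_t value = 0;
	unsigned int i;

	assert(nr_bits >= 1 && nr_bits <= 64);

	for (i = 0; i < nr_bits; i++) {
		unsigned int bit = bit_off + i;

		/* Copy one source bit into position i of the result. */
		if ((data[bit / 8] >> (bit % 8)) & 1)
			value |= UINT64_C(1) << i;
	}
	return value;
}
```

Because the helper only needs a data pointer, a bit offset and a width, it can be called for any bitfield member without consulting type metadata.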
2018-12-17 | bpf: remove useless version check for prog load | Daniel Borkmann | 1 | -5/+0
Existing libraries and tracing frameworks work around this kernel version check by automatically deriving the kernel version from uname(3) or similar such that the user does not need to do it manually; these workarounds also make the version check useless at the same time. Moreover, most other BPF tracing types enabling bpf_probe_read()-like functionality have /not/ adopted this check, and in general these days it is well understood anyway that all the tracing programs are not stable with regards to future kernels as kernel internal data structures are subject to change from release to release. Back at last netconf we discussed [0] and agreed to remove this check from bpf_prog_load() and instead document it here in the uapi header that there is no such guarantee for stable API for these programs. [0] http://vger.kernel.org/netconf2018_files/DanielBorkmann_netconf2018.pdf Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Quentin Monnet <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-17 | dma-direct: do not include SME mask in the DMA supported check | Lendacky, Thomas | 1 | -1/+6
The dma_direct_supported() function intends to check the DMA mask against specific values. However, the phys_to_dma() function includes the SME encryption mask, which defeats the intended purpose of the check. This prevents drivers that support less than 48-bit DMA (SME encryption mask is bit 47) from setting the DMA mask successfully when SME is active, which results in the driver failing to initialize. Change the function used to check the mask from phys_to_dma() to __phys_to_dma() so that the SME encryption mask is not part of the check. Fixes: c1d0af1a1d5d ("kernel/dma/direct: take DMA offset into account in dma_direct_supported") Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
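The effect of folding the encryption bit into the comparison can be shown with a small standalone sketch. All names below are hypothetical stand-ins, not the kernel functions, and the sketch assumes an identity physical-to-DMA translation with the SME mask at bit 47 as stated above.

```c
#include <stdint.h>

/* Assumption for this sketch: the SME encryption mask is bit 47. */
#define SKETCH_SME_MASK (UINT64_C(1) << 47)

/* Stand-in for __phys_to_dma(): plain translation, no encryption bit. */
static uint64_t sketch_raw_phys_to_dma(uint64_t paddr)
{
	return paddr;
}

/* Stand-in for phys_to_dma() with SME active: ORs in the encryption bit. */
static uint64_t sketch_enc_phys_to_dma(uint64_t paddr)
{
	return paddr | SKETCH_SME_MASK;
}

/* Old-style check: the required minimum silently includes bit 47. */
static int sketch_mask_ok_old(uint64_t dev_mask, uint64_t min_mask)
{
	return dev_mask >= sketch_enc_phys_to_dma(min_mask);
}

/* New-style check: compare against the unencrypted address only. */
static int sketch_mask_ok_new(uint64_t dev_mask, uint64_t min_mask)
{
	return dev_mask >= sketch_raw_phys_to_dma(min_mask);
}
```

With a 32-bit device mask and a 32-bit minimum, the old-style check fails because the minimum has bit 47 set, while the new-style check succeeds.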
2018-12-17 | kprobes: Blacklist symbols in arch-defined prohibited area | Masami Hiramatsu | 1 | -14/+53
Blacklist symbols in arch-defined probe-prohibited areas. With this change, the user can see in debugfs all symbols which are prohibited from being probed. All architectures which have custom prohibited areas should define their own arch_populate_kprobe_blacklist() function; otherwise, all symbols marked __kprobes are blacklisted. Reported-by: Andrea Righi <[email protected]> Tested-by: Andrea Righi <[email protected]> Signed-off-by: Masami Hiramatsu <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anil S Keshavamurthy <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: David S. Miller <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Yonghong Song <[email protected]> Link: http://lkml.kernel.org/r/154503485491.26176.15823229545155174796.stgit@devbox Signed-off-by: Ingo Molnar <[email protected]>
2018-12-17 | Merge tag 'v4.20-rc7' into perf/core, to pick up fixes | Ingo Molnar | 7 | -18/+183
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-17 | posix-timers: Fix division by zero bug | Thomas Gleixner | 1 | -4/+1
The signal delivery path of posix-timers can try to rearm the timer even if the interval is zero. That's handled for the common case (hrtimer) but not for alarm timers. In that case the forwarding function raises a division by zero exception. The handling for hrtimer based posix timers is wrong because it marks the timer as active despite the fact that it is stopped. Move the check from common_hrtimer_rearm() to posixtimer_rearm() to cure both issues. Reported-by: [email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: John Stultz <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
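The reason a single check in the common rearm path cures both issues can be sketched as follows. This is a simplified model with hypothetical names (struct sketch_timer, sketch_forward(), sketch_rearm()), not the kernel code: the forwarding step divides by the interval, so the zero-interval test has to happen before any clock-specific forwarding is reached, and a stopped timer must not be marked active.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified timer state for illustration only. */
struct sketch_timer {
	int64_t expires_ns;
	int64_t interval_ns;
	int active;
};

/* Clock-specific forwarding: divides by the interval, so it must be > 0. */
static int64_t sketch_forward(struct sketch_timer *t, int64_t now_ns)
{
	int64_t delta = now_ns - t->expires_ns;
	int64_t overruns;

	assert(t->interval_ns > 0);
	if (delta < 0)
		return 0;

	overruns = delta / t->interval_ns + 1;
	t->expires_ns += overruns * t->interval_ns;
	return overruns;
}

/* Common rearm path: one zero-interval check covers every clock type. */
static int64_t sketch_rearm(struct sketch_timer *t, int64_t now_ns)
{
	if (t->interval_ns == 0) {
		t->active = 0;	/* one-shot timer: stays stopped */
		return 0;
	}
	t->active = 1;
	return sketch_forward(t, now_ns);
}
```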
2018-12-15 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | David S. Miller | 2 | -9/+25
Alexei Starovoitov says:

====================
pull-request: bpf 2018-12-15

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix liveness propagation of callee saved registers, from Jakub.
2) fix overflow in bpf_jit_limit knob, from Daniel.
3) bpf_flow_dissector api fix, from Stanislav.
4) bpf_perf_event api fix on powerpc, from Sandipan.
====================

Signed-off-by: David S. Miller <[email protected]>
2018-12-15 | kconfig: warn no new line at end of file | Masahiro Yamada | 1 | -1/+1
It would be nice to warn if a new line is missing at end of file. We could do this by checkpatch.pl for arbitrary files, but new line is rather essential as a statement terminator in Kconfig. The warning message looks like this: kernel/Kconfig.preempt:60:warning: no new line at end of file Currently, kernel/Kconfig.preempt is the only file with no new line at end of file. Fix it. I know there are some false negative cases. For example, no warning is displayed when the last line contains some whitespaces/comments, but no new line. Yet, this commit works well for most cases. Signed-off-by: Masahiro Yamada <[email protected]>
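For comparison, a byte-level version of the same check is easy to write. Note that this is a different technique from the Kconfig change described above, which warns from within the Kconfig parser; the function name below is hypothetical, and unlike the parser-based warning this sketch also flags a final line that holds only whitespace or a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if a non-empty text buffer does not end
 * with a newline character, 0 otherwise.
 */
static int sketch_missing_final_newline(const char *text)
{
	size_t len;

	assert(text != NULL);
	len = strlen(text);
	return len > 0 && text[len - 1] != '\n';
}
```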
2018-12-15 | bpf: add self-check logic to liveness analysis | Alexei Starovoitov | 1 | -1/+107
Introduce REG_LIVE_DONE to check the liveness propagation and prepare the states for merging. See algorithm description in clean_live_states(). Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-12-15 | bpf: improve stacksafe state comparison | Alexei Starovoitov | 1 | -6/+7
"if (old->allocated_stack > cur->allocated_stack)" check is too conservative. In some cases explored stack could have allocated more space, but that stack space was not live. The test case improves from 19 to 15 processed insns and improvement on real programs is significant as well:

                       before  after
bpf_lb-DLB_L3.o        1940    1831
bpf_lb-DLB_L4.o        3089    3029
bpf_lb-DUNKNOWN.o      1065    1064
bpf_lxc-DDROP_ALL.o    28052   26309
bpf_lxc-DUNKNOWN.o     35487   33517
bpf_netdev.o           10864   9713
bpf_overlay.o          6643    6184
bpf_lcx_jit.o          38437   37335

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Edward Cree <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
2018-12-15 | bpf: speed up stacksafe check | Alexei Starovoitov | 1 | -1/+3
Don't check the same stack liveness condition 8 times; once is enough. Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Edward Cree <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-12-14 | bpf: verbose log bpf_line_info in verifier | Martin KaFai Lau | 1 | -5/+69
This patch adds bpf_line_info during the verifier's verbose. It can give error context for debug purpose.

~~~~~~~~~~
Here is the verbose log for backedge:

	while (a) {
		a += bpf_get_smp_processor_id();
		bpf_trace_printk(fmt, sizeof(fmt), a);
	}

~> bpftool prog load ./test_loop.o /sys/fs/bpf/test_loop type tracepoint
13: while (a) {
3: a += bpf_get_smp_processor_id();
back-edge from insn 13 to 3

~~~~~~~~~~
Here is the verbose log for invalid pkt access:

Modification to test_xdp_noinline.c:

	data = (void *)(long)xdp->data;
	data_end = (void *)(long)xdp->data_end;
/*
	if (data + 4 > data_end)
		return XDP_DROP;
*/
	*(u32 *)data = dst->dst;

~> bpftool prog load ./test_xdp_noinline.o /sys/fs/bpf/test_xdp_noinline type xdp
; data = (void *)(long)xdp->data;
224: (79) r2 = *(u64 *)(r10 -112)
225: (61) r2 = *(u32 *)(r2 +0)
; *(u32 *)data = dst->dst;
226: (63) *(u32 *)(r2 +0) = r1
invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0)
R2 offset is outside of the packet

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-14 | bpf: Create a new btf_name_by_offset() for non type name use case | Martin KaFai Lau | 2 | -13/+22
The current btf_name_by_offset() is returning "(anon)" type name for the offset == 0 case and "(invalid-name-offset)" for the out-of-bound offset case. It fits well for the internal BTF verbose log purpose which is focusing on type. For example, offset == 0 => "(anon)" => anonymous type/name. Returning non-NULL for the bad offset case is needed during the BTF verification process because the BTF verifier may complain about another field first before discovering the name_off is invalid. However, it may not be ideal for the newer use case which does not necessarily mean type name. For example, when logging line_info in the BPF verifier in the next patch, it is better to log an empty src line instead of logging "(anon)". The existing btf_name_by_offset() is renamed to __btf_name_by_offset() and made static to btf.c. A new btf_name_by_offset() is added for generic context usage. It returns "\0" for name_off == 0 (note that btf->strings[0] is "\0") and NULL for invalid offset. It allows the caller to decide what is the best output in its context. The new btf_name_by_offset() overlaps with btf_name_offset_valid(). Hence, btf_name_offset_valid() is removed from btf.h to keep the btf.h API minimal. The existing btf_name_offset_valid() usage in btf.c could also be replaced later. Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Yonghong Song <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
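The two lookup behaviours described above can be contrasted in a small standalone sketch. The struct and function names are hypothetical and the string table is modelled as a plain byte array; this is not the code in btf.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical string table: NUL-separated names, strings[0] is "\0". */
struct sketch_strtab {
	const char *strings;
	size_t len;
};

/* Type-oriented lookup: always returns a printable, non-NULL string. */
static const char *sketch_type_name_by_offset(const struct sketch_strtab *t,
					      size_t off)
{
	assert(t != NULL);
	if (off == 0)
		return "(anon)";
	if (off < t->len)
		return &t->strings[off];
	return "(invalid-name-offset)";
}

/* Generic lookup: empty string for offset 0, NULL for a bad offset. */
static const char *sketch_name_by_offset(const struct sketch_strtab *t,
					 size_t off)
{
	assert(t != NULL);
	if (off < t->len)
		return &t->strings[off];
	return NULL;
}
```

A caller that logs source lines can then print nothing for offset 0 and decide for itself how to report a NULL result.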
2018-12-14 | ARM: module: Fix function kallsyms on Thumb-2 | Vincent Whitchurch | 1 | -16/+27
Thumb-2 functions have the lowest bit set in the symbol value in the symtab. When kallsyms are generated for the vmlinux, the kallsyms are generated from the output of nm, and nm clears the lowest bit.

$ arm-linux-gnueabihf-readelf -a vmlinux | grep show_interrupts
 95947: 8015dc89 686 FUNC GLOBAL DEFAULT 2 show_interrupts
$ arm-linux-gnueabihf-nm vmlinux | grep show_interrupts
8015dc88 T show_interrupts
$ cat /proc/kallsyms | grep show_interrupts
8015dc88 T show_interrupts

However, for modules, the kallsyms uses the values in the symbol table without modification, so for functions in modules, the lowest bit is set in kallsyms.

$ arm-linux-gnueabihf-readelf -a drivers/net/tun.ko | grep tun_get_socket
 333: 00002d4d 36 FUNC GLOBAL DEFAULT 1 tun_get_socket
$ arm-linux-gnueabihf-nm drivers/net/tun.ko | grep tun_get_socket
00002d4c T tun_get_socket
$ cat /proc/kallsyms | grep tun_get_socket
7f802d4d t tun_get_socket [tun]

Because of this, the symbol+offset of the crashing instruction shown in oopses is incorrect when the crash is in a module. For example, given a tun_get_socket which starts like this,

00002d4c <tun_get_socket>:
 2d4c: 6943 ldr r3, [r0, #20]
 2d4e: 4a07 ldr r2, [pc, #28]
 2d50: 4293 cmp r3, r2

a crash when tun_get_socket is called with NULL results in:

PC is at tun_xdp+0xa3/0xa4 [tun]
pc : [<7f802d4c>]

As can be seen, the "PC is at" line reports the wrong symbol name, and the symbol+offset will point to the wrong source line if it is passed to gdb.

To solve this, add a way for archs to fixup the reading of these module kallsyms values, and use that to clear the lowest bit for function symbols on Thumb-2.

After the fix:

# cat /proc/kallsyms | grep tun_get_socket
7f802d4c t tun_get_socket [tun]

PC is at tun_get_socket+0x0/0x24 [tun]
pc : [<7f802d4c>]

Signed-off-by: Vincent Whitchurch <[email protected]>
Signed-off-by: Jessica Yu <[email protected]>
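The fixup amounts to masking off bit 0 for function symbols only. A minimal standalone sketch, using hypothetical names and the standard ELF encoding in which the low four bits of st_info hold the symbol type and the function type is 2:

```c
#include <stdint.h>

/* Standard ELF encoding, redefined locally to keep the sketch standalone. */
#define SKETCH_STT_FUNC 2
#define SKETCH_ST_TYPE(info) ((info) & 0xf)

/*
 * Hypothetical arch hook: return the address kallsyms should report for a
 * module symbol. On Thumb-2, bit 0 of a function's st_value marks the
 * instruction set and is not part of the address, so it is cleared.
 */
static uint32_t sketch_kallsyms_value(uint32_t st_value, unsigned char st_info)
{
	if (SKETCH_ST_TYPE(st_info) == SKETCH_STT_FUNC)
		return st_value & ~UINT32_C(1);

	/* Data symbols may legitimately have odd addresses: leave as-is. */
	return st_value;
}
```

The type test is why the symbol's st_info has to remain intact after relocation for this approach to work.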
2018-12-14 | module: Overwrite st_size instead of st_info | Vincent Whitchurch | 1 | -2/+2
st_info is currently overwritten after relocation and used to store the elf_type(). However, we're going to need it to fix kallsyms on ARM's Thumb-2 kernels, so preserve st_info and overwrite the st_size field instead. st_size is neither used by the module core nor by any architecture. Reviewed-by: Miroslav Benes <[email protected]> Reviewed-by: Dave Martin <[email protected]> Signed-off-by: Vincent Whitchurch <[email protected]> Signed-off-by: Jessica Yu <[email protected]>
2018-12-14 | audit: remove duplicated include from audit.c | YueHaibing | 1 | -1/+0
Remove duplicated include. Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Paul Moore <[email protected]>
2018-12-13 | seccomp: fix poor type promotion | Tycho Andersen | 1 | -2/+1
sparse complains, kernel/seccomp.c:1172:13: warning: incorrect type in assignment (different base types) kernel/seccomp.c:1172:13: expected restricted __poll_t [usertype] ret kernel/seccomp.c:1172:13: got int kernel/seccomp.c:1173:13: warning: restricted __poll_t degrades to integer Instead of assigning this to ret, since we don't use this anywhere, let's just test it against 0 directly. Signed-off-by: Tycho Andersen <[email protected]> Reported-by: 0day robot <[email protected]> Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Signed-off-by: Kees Cook <[email protected]>
2018-12-13 | bpf: remove obsolete prog->aux sanitation in bpf_insn_prepare_dump | Daniel Borkmann | 1 | -7/+0
This logic is not needed anymore since we got rid of the verifier rewrite that was using prog->aux address in f6069b9aa993 ("bpf: fix redirect to map under tail calls"). Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-13 | dma-mapping: bypass indirect calls for dma-direct | Christoph Hellwig | 2 | -44/+33
Avoid expensive indirect calls in the fast path DMA mapping operations by directly calling the dma_direct_* ops if we are using the directly mapped DMA operations. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-direct: merge swiotlb_dma_ops into the dma_direct code | Christoph Hellwig | 2 | -251/+94
While the dma-direct code is (relatively) clean and simple we actually have to use the swiotlb ops for the mapping on many architectures due to devices with addressing limits. Instead of keeping two implementations around, this commit allows the dma-direct implementation to call the swiotlb bounce buffering functions and thus share the guts of the mapping implementation. This also simplifies the dma-mapping setup on a few architectures where we don't have to differentiate which implementation to use. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-direct: use dma_direct_map_page to implement dma_direct_map_sg | Christoph Hellwig | 1 | -9/+5
No need to duplicate the mapping logic. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-direct: improve addressability error reporting | Christoph Hellwig | 1 | -21/+15
Only report a DMA addressability problem once to avoid spewing the kernel log with repeated messages. Also provide a stack trace to make it easy to find the actual caller that caused the problem. Last but not least move the actual check into the fast path and only leave the error reporting in a helper. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | swiotlb: remove dma_mark_clean | Christoph Hellwig | 1 | -17/+1
Instead of providing a special dma_mark_clean hook just for ia64, switch ia64 to use the normal arch_sync_dma_for_cpu hooks instead. This means that we now also set the PG_arch_1 bit for pages in the swiotlb buffer, which isn't strictly needed as we will never execute code out of the swiotlb buffer, but otherwise harmless. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | swiotlb: remove SWIOTLB_MAP_ERROR | Christoph Hellwig | 1 | -2/+2
We can use DMA_MAPPING_ERROR instead, which already maps to the same value. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-mapping: factor out dummy DMA ops | Robin Murphy | 2 | -1/+40
The dummy DMA ops are currently used by arm64 for any device which has an invalid ACPI description and is thus barred from using DMA due to not knowing whether it is cache-coherent or not. Factor these out into general dma-mapping code so that they can be referenced from other common code paths. In the process, we can prune all the optional callbacks which just do the same thing as the default behaviour, and fill in .map_resource for completeness. Signed-off-by: Robin Murphy <[email protected]> [hch: moved to a separate source file] Reviewed-by: Rafael J. Wysocki <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-12-13 | dma-mapping: always build the direct mapping code | Christoph Hellwig | 2 | -9/+1
All architectures except for sparc64 use the dma-direct code in some form, and even for sparc64 we had the discussion of a direct mapping mode a while ago. In preparation for directly calling the direct mapping code don't bother making it optional, but always build the code in. This is a minor hardship for some powerpc and arm configs that don't pull it in yet (although they should in a release or two), and sparc64 which currently doesn't need it at all, but it will reduce the ifdef mess we'd otherwise need significantly. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-mapping: move dma_cache_sync out of line | Christoph Hellwig | 1 | -0/+11
This isn't exactly a slow path routine, but it is not super critical either, and moving it out of line will help to keep the include chain clean for the following DMA indirection bypass work. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-mapping: move various slow path functions out of line | Christoph Hellwig | 1 | -2/+138
There is no need to have all setup and coherent allocation / freeing routines inline. Move them out of line to keep the implementation nicely encapsulated and save some kernel text size. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-mapping: move dma_get_required_mask to kernel/dma | Christoph Hellwig | 1 | -1/+33
dma_get_required_mask should really be with the rest of the DMA mapping implementation instead of in drivers/base as a lone outlier. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-mapping: simplify the dma_sync_single_range_for_{cpu,device} implementation | Christoph Hellwig | 1 | -42/+0
We can just call the regular calls after adding the offset to the address instead of reimplementing them. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Tony Luck <[email protected]>
2018-12-13 | dma-mapping: remove a pointless memset in dma_atomic_pool_init | Christoph Hellwig | 1 | -1/+0
We already zero the memory after allocating it from the pool that this function fills, and having the memset here in this form means we can't support CMA highmem allocations. Signed-off-by: Christoph Hellwig <[email protected]> Reported-by: Russell King - ARM Linux <[email protected]>
2018-12-13 | bpf: verifier: make sure callees don't prune with caller differences | Jakub Kicinski | 1 | -3/+10
Currently for liveness and state pruning the register parentage chains don't include states of the callee. This makes some sense as the callee can't access those registers. However, this means that READs done after the callee returns will not propagate into the states of the callee. Callee will then perform pruning disregarding differences in caller state.

Example:

   0: (85) call bpf_user_rnd_u32
   1: (b7) r8 = 0
   2: (55) if r0 != 0x0 goto pc+1
   3: (b7) r8 = 1
   4: (bf) r1 = r8
   5: (85) call pc+4
   6: (15) if r8 == 0x1 goto pc+1
   7: (05) *(u64 *)(r9 - 8) = r3
   8: (b7) r0 = 0
   9: (95) exit
  10: (15) if r1 == 0x0 goto pc+0
  11: (95) exit

Here we acquire unknown state with call to get_random() [1]. Then we store this random state in r8 (either 0 or 1) [1 - 3], and make a call on line 5. Callee does nothing but a trivial conditional jump (to create a pruning point). Upon return caller checks the state of r8 and either performs an unsafe read or not.

Verifier will first explore the path with r8 == 1, creating a pruning point at [11]. The parentage chain for r8 will include only caller's states so once verifier reaches [6] it will mark liveness only on states in the caller, and not [11]. Now when verifier walks the paths with r8 == 0 it will reach [11] and since REG_LIVE_READ on r8 was not propagated there it will prune the walk entirely (stop walking the entire program, not just the callee). Since [6] was never walked with r8 == 0, [7] will be considered dead and replaced with "goto -1" causing hang at runtime.

This patch weaves the callee's explored states onto the caller's parentage chain. Rough parentage for r8 would have looked like this before:

[0] [1] [2] [3] [4] [5]   [10]      [11]      [6]      [7]
     |           |      ,---|----.    |        |        |
  sl0:         sl0:    / sl0:     \ sl0:      sl0:     sl0:
  fr0: r8 <-- fr0: r8<+--fr0: r8   `fr0: r8  ,fr0: r8<-fr0: r8
                       \ fr1: r8 <- fr1: r8 /
                        \__________________/

after:

[0] [1] [2] [3] [4] [5]   [10]      [11]      [6]      [7]
     |           |          |         |        |        |
   sl0:         sl0:      sl0:       sl0:      sl0:     sl0:
   fr0: r8 <-- fr0: r8 <- fr0: r8 <- fr0: r8 <-fr0: r8<-fr0: r8
                          fr1: r8 <- fr1: r8

Now the mark from instruction 6 will travel through callee's states. Note that we don't have to connect r0 because it is overwritten by callee's state on return and r1 - r5 because those are not alive any more once a call is made.

v2:
 - don't connect the callee's registers twice (Alexei: suggestion & code)
 - add more details to the comment (Ed & Alexei)
v1: don't unnecessarily link caller saved regs (Jiong)

Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
Reported-by: David Beckett <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Jiong Wang <[email protected]>
Reviewed-by: Edward Cree <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-13 | arm64: add prctl control for resetting ptrauth keys | Kristina Martsenko | 1 | -0/+8
Add an arm64-specific prctl to allow a thread to reinitialize its pointer authentication keys to random values. This can be useful when exec() is not used for starting new processes, to ensure that different processes still have different keys. Signed-off-by: Kristina Martsenko <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2018-12-13 | bpf: include sub program tags in bpf_prog_info | Song Liu | 1 | -0/+22
Changes v2 -> v3:
1. remove check for bpf_dump_raw_ok().

Changes v1 -> v2:
1. Fix error path as Martin suggested.

This patch adds nr_prog_tags and prog_tags to bpf_prog_info. This is a reliable way for user space to get tags of all sub programs. Before this patch, user space needs to find sub program tags via kallsyms. This feature will be used in BPF introspection, where user space queries information about BPF programs via sys_bpf.

Signed-off-by: Song Liu <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
2018-12-13 | bpf: Remove bpf_dump_raw_ok() check for func_info and line_info | Martin KaFai Lau | 1 | -20/+12
The func_info and line_info have the bpf insn offset but they do not contain kernel address. They will still be useful for the userspace tool to annotate the xlated insn. This patch removes the bpf_dump_raw_ok() guard for the func_info and line_info during bpf_prog_get_info_by_fd(). The guard stays for jited_line_info which contains the kernel address. Although this bpf_dump_raw_ok() guard behavior has started since the earlier func_info patch series, I marked the Fixes tag to the latest line_info patch series which contains both func_info and line_info and this patch is fixing for both of them. Fixes: c454a46b5efd ("bpf: Add bpf_line_info support") Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-12-13 | irq/irq_sim: Store multiple interrupt offsets in a bitmap | Bartosz Golaszewski | 1 | -2/+21
Two threads can try to fire the irq_sim with different offsets and will end up fighting for the irq_work assignment. Thomas Gleixner suggested a solution based on a bitfield where we set a bit for every offset associated with an interrupt that should be fired and then iterate over all set bits in the interrupt handler. This is a slightly modified solution using a bitmap so that we don't impose a limit on the number of interrupts one can allocate with irq_sim. Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Bartosz Golaszewski <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
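The pending-bitmap idea can be sketched in userspace C as follows. All names are hypothetical, the bitmap size is fixed at compile time for brevity, and the sketch uses plain non-atomic operations; kernel code would need atomic bit operations where firing and handling can race.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NUM_IRQS 130u
#define SKETCH_NUM_WORDS \
	((SKETCH_NUM_IRQS + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Hypothetical simulator state: one pending bit per interrupt offset. */
struct sketch_irq_sim {
	unsigned long pending[SKETCH_NUM_WORDS];
};

/* Mark an offset as pending; firing the same offset twice is harmless. */
static void sketch_fire(struct sketch_irq_sim *sim, unsigned int offset)
{
	assert(offset < SKETCH_NUM_IRQS);
	sim->pending[offset / SKETCH_BITS_PER_LONG] |=
		1UL << (offset % SKETCH_BITS_PER_LONG);
}

/*
 * Handler side: walk all set bits in ascending order, clear each one and
 * record its offset in out[]. Returns the number of offsets handled.
 */
static size_t sketch_handle(struct sketch_irq_sim *sim, unsigned int *out,
			    size_t cap)
{
	size_t n = 0;
	unsigned int off;

	for (off = 0; off < SKETCH_NUM_IRQS && n < cap; off++) {
		unsigned long *word = &sim->pending[off / SKETCH_BITS_PER_LONG];
		unsigned long bit = 1UL << (off % SKETCH_BITS_PER_LONG);

		if (*word & bit) {
			*word &= ~bit;
			out[n++] = off;
		}
	}
	return n;
}
```

Since each offset owns its own bit, two callers firing different offsets no longer overwrite each other's request; the handler simply sees both bits set.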
2018-12-12 | Merge tag 'trace-v4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace | Linus Torvalds | 3 | -3/+9
Pull tracing fixes from Steven Rostedt:

"While running various ftrace tests on new development code, the kmemleak detector found some allocations that were not freed correctly. This fixes a couple of leaks in the event trigger code as well as in adding function trace filters in trace instances"

* tag 'trace-v4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix memory leak of instance function hash filters
  tracing: Fix memory leak in set_trigger_filter()
  tracing: Fix memory leak in create_filter()
2018-12-12 | bpf: add bpffs pretty print for cgroup local storage maps | Roman Gushchin | 2 | -1/+114
Implement bpffs pretty printing for cgroup local storage maps (both shared and per-cpu). Output example (captured for tools/testing/selftests/bpf/netcnt_prog.c):

Shared:
  $ cat /sys/fs/bpf/map_2
  # WARNING!! The output is for debug purpose only
  # WARNING!! The output format will change
  {4294968594,1}: {9999,1039896}

Per-cpu:
  $ cat /sys/fs/bpf/map_1
  # WARNING!! The output is for debug purpose only
  # WARNING!! The output format will change
  {4294968594,1}: {
          cpu0: {0,0,0,0,0}
          cpu1: {0,0,0,0,0}
          cpu2: {1,104,0,0,0}
          cpu3: {0,0,0,0,0}
  }

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-12 | bpf: pass struct btf pointer to the map_check_btf() callback | Roman Gushchin | 3 | -1/+4
If key_type or value_type are of non-trivial data types (e.g. structure or typedef), it's not possible to check them without the additional information, which can't be obtained without a pointer to the btf structure. So, let's pass btf pointer to the map_check_btf() callbacks. Signed-off-by: Roman Gushchin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-12 | PM / sleep: convert to DEFINE_SHOW_ATTRIBUTE | Yangtao Li | 1 | -13/+2
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li <[email protected]> Acked-by: Pavel Machek <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-12-12 | printk: Remove print_prefix() calls with NULL buffer. | Tetsuo Handa | 1 | -25/+14
We can save lines/size by removing print_prefix() with buf == NULL. This patch makes no functional change. Link: http://lkml.kernel.org/r/1544521745-11925-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp To: Steven Rostedt <[email protected]> Cc: [email protected] Signed-off-by: Tetsuo Handa <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2018-12-11 | bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K | Daniel Borkmann | 1 | -6/+15
Michael and Sandipan report:

  Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF JIT allocations. At compile time it defaults to PAGE_SIZE * 40000, and is adjusted again at init time if MODULES_VADDR is defined.

  For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with the compile-time default at boot-time, which is 0x9c400000 when using 64K page size. This overflows the signed 32-bit bpf_jit_limit value:

    root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
    -1673527296

  and can cause various unexpected failures throughout the network stack. In one case `strace dhclient eth0` reported:

    setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)

  and similar failures can be seen with tools like tcpdump. This doesn't always reproduce however, and I'm not sure why. The more consistent failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host would time out on systemd/netplan configuring a virtio-net NIC with no noticeable errors in the logs.

Given this and also given that in near future some architectures like arm64 will have a custom area for BPF JIT image allocations we should get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For 4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec() so therefore add another overridable bpf_jit_alloc_exec_limit() helper function which returns the possible size of the memory area for deriving the default heuristic in bpf_jit_charge_init(). Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default JIT memory provider, and therefore in case archs implement their custom module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.

Additionally, for archs supporting large page sizes, we should change the sysctl to be handled as long to not run into sysctl restrictions in future.

Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <[email protected]>
Reported-by: Michael Roth <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Michael Roth <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
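The overflow reported above is plain arithmetic and can be reproduced outside the kernel. The helper names below are hypothetical; the sketch computes PAGE_SIZE * 40000 in 64 bits and then shows what a signed 32-bit variable would hold.

```c
#include <assert.h>
#include <stdint.h>

/* The old compile-time default, PAGE_SIZE * 40000, computed in 64 bits. */
static int64_t sketch_jit_limit_default(int64_t page_size)
{
	assert(page_size > 0);
	return page_size * 40000;
}

/* Does the value fit into a signed 32-bit variable? */
static int sketch_fits_in_int32(int64_t v)
{
	return v >= INT32_MIN && v <= INT32_MAX;
}

/* Value a 32-bit two's-complement variable holds after truncation. */
static int32_t sketch_wrap_to_int32(int64_t v)
{
	uint32_t low = (uint32_t)v;	/* reduction modulo 2^32 */

	if (low > (uint32_t)INT32_MAX)
		return (int32_t)((int64_t)low - INT64_C(4294967296));
	return (int32_t)low;
}
```

For a 64K page size the product is 2621440000 (0x9c400000), which exceeds INT32_MAX and wraps to -1673527296, the value shown in the report; with 4K pages the product is 163840000 and fits. Handling the sysctl as a long avoids the truncation.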
2018-12-11 | timekeeping: Convert to DEFINE_SHOW_ATTRIBUTE | Yangtao Li | 1 | -13/+2
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2018-12-11seccomp: add a return code to trap to userspaceTycho Andersen1-2/+446
This patch introduces a means for syscalls matched in seccomp to notify some other task that a particular filter has been triggered. The motivation for this is primarily for use with containers. For example, if a container does an init_module(), we obviously don't want to load this untrusted code, which may be compiled for the wrong version of the kernel anyway. Instead, we could parse the module image, figure out which module the container is trying to load and load it on the host. As another example, containers cannot mount() in general since various filesystems assume a trusted image. However, if an orchestrator knows that e.g. a particular block device has not been exposed to a container for writing, it may want to allow the container to mount that block device (that is, handle the mount for it). This patch adds functionality that is already possible via at least two other means that I know about, both of which involve ptrace(): first, one could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. Unfortunately this is slow, so a faster version would be to install a filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. Since ptrace allows only one tracer, if the container runtime is that tracer, users inside the container (or outside) trying to debug it will not be able to use ptrace, which is annoying. It also means that older distributions based on Upstart cannot boot inside containers using ptrace, since Upstart itself uses ptrace to monitor services while starting. The actual implementation of this is fairly small, although getting the synchronization right was/is slightly complex. 
Finally, it's worth noting that the classic seccomp TOCTOU of reading memory data from the task still applies here, but can be avoided with careful design of the userspace handler: if the userspace handler reads all of the task memory that is necessary before applying its security policy, the tracee's subsequent memory edits will not be read by the tracer. Signed-off-by: Tycho Andersen <[email protected]> CC: Kees Cook <[email protected]> CC: Andy Lutomirski <[email protected]> CC: Oleg Nesterov <[email protected]> CC: Eric W. Biederman <[email protected]> CC: "Serge E. Hallyn" <[email protected]> Acked-by: Serge Hallyn <[email protected]> CC: Christian Brauner <[email protected]> CC: Tyler Hicks <[email protected]> CC: Akihiro Suda <[email protected]> Signed-off-by: Kees Cook <[email protected]>
2018-12-11seccomp: switch system call argument type to void *Tycho Andersen1-4/+4
The const qualifier causes problems for any code that wants to write to the third argument of the seccomp syscall, as we will do in a future patch in this series. The third argument to the seccomp syscall is documented as void *, so rather than just dropping the const, let's switch everything to use void * as well. I believe this is safe because of 1. the documentation above, 2. there's no real type information exported about syscalls anywhere besides the man pages. Signed-off-by: Tycho Andersen <[email protected]> CC: Kees Cook <[email protected]> CC: Andy Lutomirski <[email protected]> CC: Oleg Nesterov <[email protected]> CC: Eric W. Biederman <[email protected]> CC: "Serge E. Hallyn" <[email protected]> Acked-by: Serge Hallyn <[email protected]> CC: Christian Brauner <[email protected]> CC: Tyler Hicks <[email protected]> CC: Akihiro Suda <[email protected]> Signed-off-by: Kees Cook <[email protected]>
2018-12-11seccomp: hoist struct seccomp_data recalculation higherTycho Andersen1-6/+6
In the next patch, we're going to use the sd pointer passed to __seccomp_filter() as the data to pass to userspace. Except that in some cases (__seccomp_filter(SECCOMP_RET_TRACE), emulate_vsyscall(), every time seccomp is invoked on power, etc.) the sd pointer will be NULL in order to force seccomp to recompute the register data. Previously this recomputation happened one level lower, in seccomp_run_filters(); this patch just moves it up a level higher to __seccomp_filter(). Thanks Oleg for spotting this. Signed-off-by: Tycho Andersen <[email protected]> CC: Kees Cook <[email protected]> CC: Andy Lutomirski <[email protected]> CC: Oleg Nesterov <[email protected]> CC: Eric W. Biederman <[email protected]> CC: "Serge E. Hallyn" <[email protected]> Acked-by: Serge Hallyn <[email protected]> CC: Christian Brauner <[email protected]> CC: Tyler Hicks <[email protected]> CC: Akihiro Suda <[email protected]> Signed-off-by: Kees Cook <[email protected]>
2018-12-11tracing: Fix memory leak of instance function hash filtersSteven Rostedt (VMware)1-0/+1
The following commands will cause a memory leak: # cd /sys/kernel/tracing # mkdir instances/foo # echo schedule > instances/foo/set_ftrace_filter # rmdir instances/foo The reason is that the hashes that hold the filters to set_ftrace_filter and set_ftrace_notrace are not freed if they contain any data on the instance and the instance is removed. Found by kmemleak detector. Cc: [email protected] Fixes: 591dffdade9f ("ftrace: Allow for function tracing instance to filter functions") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-11tracing: Fix memory leak in set_trigger_filter()Steven Rostedt (VMware)1-2/+4
When create_event_filter() fails in set_trigger_filter(), the filter may still be allocated and needs to be freed. The caller expects the data->filter to be updated with the new filter, even if the new filter failed (we could add an error message by setting set_str parameter of create_event_filter(), but that's another update). But because the error would just exit, filter was left hanging and nothing could free it. Found by kmemleak detector. Cc: [email protected] Fixes: bac5fb97a173a ("tracing: Add and use generic set_trigger_filter() implementation") Reviewed-by: Tom Zanussi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-11tracing: Fix memory leak in create_filter()Steven Rostedt (VMware)1-1/+4
The create_filter() calls create_filter_start() which allocates a "parse_error" descriptor, but fails to call create_filter_finish() that frees it. The op_stack and inverts in predicate_parse() were also not freed. Found by kmemleak detector. Cc: [email protected] Fixes: 80765597bc587 ("tracing: Rewrite filter logic to be simpler and faster") Reviewed-by: Tom Zanussi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-11sched/fair: Select an energy-efficient CPU on task wake-upQuentin Perret1-2/+141
If an Energy Model (EM) is available and if the system isn't overutilized, re-route waking tasks into an energy-aware placement algorithm. The selection of an energy-efficient CPU for a task is achieved by estimating the impact on system-level active energy resulting from the placement of the task on the CPU with the highest spare capacity in each performance domain. This strategy spreads tasks in a performance domain and avoids overly aggressive task packing. The best CPU energy-wise is then selected if it saves a large enough amount of energy with respect to prev_cpu. Although it has already shown significant benefits on some existing targets, this approach cannot scale to platforms with numerous CPUs. This is an attempt to do something useful as writing a fast heuristic that performs reasonably well on a broad spectrum of architectures isn't an easy task. As such, the scope of usability of the energy-aware wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag set, and where the EM isn't too complex. Signed-off-by: Quentin Perret <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>