path: root/tools
Age | Commit message | Author | Files | Lines
2021-11-30tools/nolibc: x86-64: Use `mov $60,%eax` instead of `mov $60,%rax`Ammar Faizi1-1/+1
Note that a mov to a 32-bit register zero-extends into the full 64-bit register. Thus `mov $60,%eax` has the same effect as `mov $60,%rax`. Use the shorter opcode to achieve the same thing.
```
b8 3c 00 00 00          mov $60,%eax    (5 bytes)  [1]
48 c7 c0 3c 00 00 00    mov $60,%rax    (7 bytes)  [2]
```
Currently, we use [2]. Change it to [1] for shorter code. Signed-off-by: Ammar Faizi <[email protected]> Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
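For context, a minimal hypothetical sketch (not the actual nolibc code; the function name is made up) of an exit stub written this way:
```c
/* Hypothetical sketch: exit via the raw x86-64 exit syscall.  Writing
 * the 32-bit %eax zero-extends into %rax, so the 5-byte "mov $60,%eax"
 * loads the same syscall number as the 7-byte "mov $60,%rax". */
__attribute__((noreturn)) static void exit_sketch(int status)
{
	__asm__ volatile (
		"movl $60, %%eax\n\t"	/* __NR_exit, zero-extended into %rax */
		"syscall"
		: /* no outputs; the syscall does not return */
		: "D" (status)		/* status goes in %edi/%rdi */
		: "rcx", "r11", "memory");
	__builtin_unreachable();
}
```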
2021-11-30tools/nolibc: x86: Remove `r8`, `r9` and `r10` from the clobber listAmmar Faizi1-14/+19
Linux x86-64 syscall only clobbers rax, rcx and r11 (and "memory"): - rax for the return value. - rcx to save the return address. - r11 to save the rflags. Other registers are preserved. Having r8, r9 and r10 in the syscall clobber list is harmless, but it results in a missed optimization. As the syscall doesn't clobber r8-r10, GCC should be allowed to reuse their values after the syscall returns to userspace. But since they are in the clobber list, GCC will always miss this opportunity. Remove them from the x86-64 syscall clobber list to help GCC generate better code and fix the comment. See also the x86-64 ABI, section A.2 AMD64 Linux Kernel Conventions, A.2.1 Calling Conventions [1]. Extra note: one might think there is no real benefit in removing r8, r9 and r10 from the syscall clobber list, on the impression that a syscall behaves like a C function call, and a function call always clobbers those three registers. However, that is not the case for nolibc.h, because the "syscall" instruction (opcode "0f 05") can be inlined into the user functions. All syscalls in nolibc.h are written as static functions with inline ASM and are likely always inlined when optimization flags are used, so it is a win not to have r8, r9 and r10 in the clobber list. Here is an example where this matters. Consider the following C code:
```
#include "tools/include/nolibc/nolibc.h"

#define read_abc(a, b, c) __asm__ volatile("nop"::"r"(a),"r"(b),"r"(c))

int main(void)
{
	int a = 0xaa;
	int b = 0xbb;
	int c = 0xcc;

	read_abc(a, b, c);
	write(1, "test\n", 5);
	read_abc(a, b, c);

	return 0;
}
```
Compile with: gcc -Os test.c -o test -nostdlib

With r8, r9, r10 in the clobber list, GCC generates this:

0000000000001000 <main>:
    1000: f3 0f 1e fa           endbr64
    1004: 41 54                 push   %r12
    1006: 41 bc cc 00 00 00     mov    $0xcc,%r12d
    100c: 55                    push   %rbp
    100d: bd bb 00 00 00        mov    $0xbb,%ebp
    1012: 53                    push   %rbx
    1013: bb aa 00 00 00        mov    $0xaa,%ebx
    1018: 90                    nop
    1019: b8 01 00 00 00        mov    $0x1,%eax
    101e: bf 01 00 00 00        mov    $0x1,%edi
    1023: ba 05 00 00 00        mov    $0x5,%edx
    1028: 48 8d 35 d1 0f 00 00  lea    0xfd1(%rip),%rsi
    102f: 0f 05                 syscall
    1031: 90                    nop
    1032: 31 c0                 xor    %eax,%eax
    1034: 5b                    pop    %rbx
    1035: 5d                    pop    %rbp
    1036: 41 5c                 pop    %r12
    1038: c3                    ret

GCC thinks that syscall will clobber r8, r9, r10, so it spills 0xaa, 0xbb and 0xcc to callee-saved registers (r12, rbp and rbx). This means extra memory accesses and extra stack space just to preserve them. But syscall does not actually clobber them, so this is a missed optimization. Now without r8, r9, r10 in the clobber list, GCC generates better code:

0000000000001000 <main>:
    1000: f3 0f 1e fa           endbr64
    1004: 41 b8 aa 00 00 00     mov    $0xaa,%r8d
    100a: 41 b9 bb 00 00 00     mov    $0xbb,%r9d
    1010: 41 ba cc 00 00 00     mov    $0xcc,%r10d
    1016: 90                    nop
    1017: b8 01 00 00 00        mov    $0x1,%eax
    101c: bf 01 00 00 00        mov    $0x1,%edi
    1021: ba 05 00 00 00        mov    $0x5,%edx
    1026: 48 8d 35 d3 0f 00 00  lea    0xfd3(%rip),%rsi
    102d: 0f 05                 syscall
    102f: 90                    nop
    1030: 31 c0                 xor    %eax,%eax
    1032: c3                    ret

Cc: Andy Lutomirski <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: [email protected] Cc: "H. Peter Anvin" <[email protected]> Cc: David Laight <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Signed-off-by: Ammar Faizi <[email protected]> Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI [1] Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
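To make the clobber-list point concrete, here is a rough sketch of a nolibc-style three-argument syscall wrapper with the reduced clobber list; the name my_syscall3_sketch is illustrative, not the literal nolibc macro:
```c
/* Sketch of a nolibc-style 3-argument syscall wrapper.  The clobber
 * list names only what the kernel's x86-64 syscall entry actually
 * clobbers: %rcx (return address), %r11 (rflags) and memory.
 * %r8-%r10 are left out so the compiler may keep live values there. */
static inline long my_syscall3_sketch(long nr, long a0, long a1, long a2)
{
	register long rdi __asm__("rdi") = a0;
	register long rsi __asm__("rsi") = a1;
	register long rdx __asm__("rdx") = a2;
	long ret;

	__asm__ volatile (
		"syscall"
		: "=a" (ret)
		: "0" (nr), "r" (rdi), "r" (rsi), "r" (rdx)
		: "rcx", "r11", "memory");
	return ret;
}
```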
2021-11-30tools/nolibc: fix incorrect truncation of exit codeWilly Tarreau1-8/+5
Ammar Faizi reported that our exit code handling is wrong. We truncate it to the lowest 8 bits but the syscall itself is expected to take a regular 32-bit signed integer, not an unsigned char. It's the kernel that later truncates it to the lowest 8 bits. The difference is visible in strace, where the program below used to show exit(255) instead of exit(-1): int main(void) { return -1; } This patch applies the fix to all archs. x86_64, i386, arm64, armv7 and mips were all tested and confirmed to work fine now. Risc-v was not tested but the change is trivial and exactly the same as for other archs. Reported-by: Ammar Faizi <[email protected]> Cc: [email protected] Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
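As a small, hypothetical demonstration of the behaviour this fixes (not part of the patch), the following program makes the kernel-side truncation visible:
```c
#include <sys/syscall.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

/* The exit status passed to the kernel is a plain 32-bit int; the
 * kernel itself keeps only the low 8 bits.  strace on the child shows
 * exit_group(-1), while the parent's wait status reports 255. */
int main(void)
{
	pid_t pid = fork();
	int status;

	if (pid == 0) {
		syscall(SYS_exit_group, -1);	/* no "& 0xff" in userspace */
		return 0;			/* not reached */
	}
	waitpid(pid, &status, 0);
	printf("child exit status: %d\n", WEXITSTATUS(status));	/* 255 */
	return 0;
}
```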
2021-11-30tools/nolibc: i386: fix initial stack alignmentWilly Tarreau1-1/+9
After re-checking the spec and comparing stack offsets with glibc: the last pushed argument must be 16-byte aligned (i.e. the stack must be aligned before the call) so that in the callee esp+4 is a multiple of 16. The principle is the 32-bit equivalent of what Ammar fixed for x86_64. It's possible that 32-bit code using SSE2 or MMX could have been affected. In addition, the frame pointer ought to be zero at the deepest level. Link: https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI Cc: Ammar Faizi <[email protected]> Cc: [email protected] Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-11-30tools/nolibc: x86-64: Fix startup code bugAmmar Faizi1-2/+8
Before this patch, the `_start` function looks like this:
```
0000000000001170 <_start>:
    1170: pop    %rdi
    1171: mov    %rsp,%rsi
    1174: lea    0x8(%rsi,%rdi,8),%rdx
    1179: and    $0xfffffffffffffff0,%rsp
    117d: sub    $0x8,%rsp
    1181: call   1000 <main>
    1186: movzbq %al,%rdi
    118a: mov    $0x3c,%rax
    1191: syscall
    1193: hlt
    1194: data16 cs nopw 0x0(%rax,%rax,1)
    119f: nop
```
Note the "and" of %rsp with $-16: it makes %rsp 16-byte aligned, but then there is a "sub" of $0x8 which makes %rsp no longer 16-byte aligned before main is called. That's the bug! What the x86-64 System V ABI actually mandates is that %rsp must be 16-byte aligned right before the "call", not after it. So the "sub" of $0x8 here breaks the alignment. Remove it. An example where this rule matters is when the callee needs to align its own stack to 16 bytes for aligned move instructions like `movdqa` and `movaps`. If the callee can't align its stack properly, the result is a segmentation fault. The x86-64 System V ABI also mandates that the frame pointer in the deepest stack frame be zero. Just to be safe, let's zero %rbp on startup, as the content of %rbp may be unspecified when the program starts. Now it looks like this:
```
0000000000001170 <_start>:
    1170: pop    %rdi
    1171: mov    %rsp,%rsi
    1174: lea    0x8(%rsi,%rdi,8),%rdx
    1179: xor    %ebp,%ebp                   # zero the %rbp
    117b: and    $0xfffffffffffffff0,%rsp    # align the %rsp
    117f: call   1000 <main>
    1184: movzbq %al,%rdi
    1188: mov    $0x3c,%rax
    118f: syscall
    1191: hlt
    1192: data16 cs nopw 0x0(%rax,%rax,1)
    119d: nopl   (%rax)
```
Cc: Bedirhan KURT <[email protected]> Cc: Louvian Lyndal <[email protected]> Reported-by: Peter Cordes <[email protected]> Signed-off-by: Ammar Faizi <[email protected]> [wt: I did this on purpose due to a misunderstanding of the spec, other archs will thus have to be rechecked, particularly i386] Cc: [email protected] Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-11-30rcu: Remove the RCU_FAST_NO_HZ Kconfig optionPaul E. McKenney3-3/+0
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve systems with RCU callbacks offloaded. In this situation, all that this Kconfig option does is slow down idle entry/exit with an additional always-taken early exit. If this is the only use case, then this Kconfig option is nothing but an attractive nuisance that needs to go away. This commit therefore removes the RCU_FAST_NO_HZ Kconfig option. Signed-off-by: Paul E. McKenney <[email protected]>
2021-11-30torture: Remove RCU_FAST_NO_HZ from rcu scenariosPaul E. McKenney6-6/+0
All of the rcu scenarios that mention CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios. Signed-off-by: Paul E. McKenney <[email protected]>
2021-11-30torture: Remove RCU_FAST_NO_HZ from rcuscale and refscale scenariosPaul E. McKenney6-6/+0
All of the rcuscale and refscale scenarios that mention the Kconfig option CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios. Signed-off-by: Paul E. McKenney <[email protected]>
2021-11-30rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenariosPaul E. McKenney5-0/+5
With CONFIG_PREEMPT_DYNAMIC=y, the kernel builds with CONFIG_PREEMPTION=y because preemption can be enabled at runtime. This prevents any tests of Tiny RCU or Tiny SRCU from running correctly. This commit therefore explicitly sets CONFIG_PREEMPT_DYNAMIC=n for those scenarios. Signed-off-by: Paul E. McKenney <[email protected]>
2021-11-30libbpf: Avoid reload of imm for weak, unresolved, repeating ksymKumar Kartikeya Dwivedi1-3/+2
Alexei pointed out that we can use BPF_REG_0 which already contains imm from move_blob2blob computation. Note that we now compare the second insn's imm, but this should not matter, since both will be zeroed out for the error case for the insn populated earlier. Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-30libbpf: Avoid double stores for success/failure case of ksym relocationsKumar Kartikeya Dwivedi1-16/+21
Instead, jump directly to the success-case stores when ret >= 0; otherwise do the default 0-value store and jump over the success case. This is better in terms of readability. Readjust the code for kfunc relocation as well to follow a similar pattern, which also leads to easier-to-follow code. Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-30selftest/bpf/benchs: Add bpf_loop benchmarkJoanne Koong7-1/+203
Add a benchmark to measure the throughput and latency of the bpf_loop call. Testing this on my dev machine on 1 thread, the data is as follows:

nr_loops: 10       bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
nr_loops: 100      bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
nr_loops: 500      bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
nr_loops: 1000     bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
nr_loops: 5000     bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
nr_loops: 10000    bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
nr_loops: 50000    bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
nr_loops: 100000   bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
nr_loops: 500000   bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
nr_loops: 1000000  bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op

From this data, we can see that the latency per loop decreases as the number of loops increases. On this particular machine, each loop had an overhead of about ~4 ns, and we were able to run ~250 million loops per second. Signed-off-by: Joanne Koong <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-30selftests/bpf: Measure bpf_loop verifier performanceJoanne Koong5-4/+169
This patch tests bpf_loop in pyperf and strobemeta, and measures the verifier performance of replacing the traditional for loop with bpf_loop. The results are as follows:

~strobemeta~
Baseline:       verification time 6808200 usec, stack depth 496, processed 554252 insns (limit 1000000), max_states_per_insn 16, total_states 15878, peak_states 13489, mark_read 3110 -- #192 verif_scale_strobemeta:OK (unrolled loop)
Using bpf_loop: verification time 31589 usec, stack depth 96+400, processed 1513 insns (limit 1000000), max_states_per_insn 2, total_states 106, peak_states 106, mark_read 60 -- #193 verif_scale_strobemeta_bpf_loop:OK

~pyperf600~
Baseline:       verification time 29702486 usec, stack depth 368, processed 626838 insns (limit 1000000), max_states_per_insn 7, total_states 30368, peak_states 30279, mark_read 748 -- #182 verif_scale_pyperf600:OK (unrolled loop)
Using bpf_loop: verification time 148488 usec, stack depth 320+40, processed 10518 insns (limit 1000000), max_states_per_insn 10, total_states 705, peak_states 517, mark_read 38 -- #183 verif_scale_pyperf600_bpf_loop:OK

Using the bpf_loop helper led to approximately a 99% decrease in the verification time and in the number of instructions. Signed-off-by: Joanne Koong <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-30selftests/bpf: Add bpf_loop testJoanne Koong2-0/+257
Add test for bpf_loop testing a variety of cases: various nr_loops, null callback ctx, invalid flags, nested callbacks. Signed-off-by: Joanne Koong <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-30bpf: Add bpf_loop helperJoanne Koong1-0/+25
This patch adds the kernel-side and API changes for a new helper function, bpf_loop:

long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx, u64 flags);

where long (*callback_fn)(u32 index, void *ctx);

bpf_loop invokes the "callback_fn" **nr_loops** times or until the callback_fn returns 1. The callback_fn can only return 0 or 1, and this is enforced by the verifier. The callback_fn index is zero-indexed. A few things to note:
~ The "u64 flags" parameter is currently unused but is included in case a future use case for it arises.
~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c), bpf_callback_t is used as the callback function cast.
~ A program can have nested bpf_loop calls, but the program must still adhere to the verifier constraint on its stack depth (the stack depth cannot exceed MAX_BPF_STACK).
~ Recursive callback_fns do not pass the verifier, due to the call stack for these being too deep.
~ The next patch will include the tests and benchmark.
Signed-off-by: Joanne Koong <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
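A rough usage sketch (not from the patch; the program type, section name and context struct are illustrative) of how a BPF program might call the new helper:
```c
// SPDX-License-Identifier: GPL-2.0
/* Hypothetical bpf_loop usage sketch; all names here are illustrative. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct loop_data {
	long sum;
};

static long add_index(__u32 index, void *ctx)
{
	struct loop_data *d = ctx;

	d->sum += index;
	return 0;	/* 0 = keep looping, 1 = stop early */
}

SEC("tracepoint/syscalls/sys_enter_getpid")
int sum_indices(void *ctx)
{
	struct loop_data d = { .sum = 0 };

	/* flags must currently be 0; the callback runs with index 0..9 */
	bpf_loop(10, add_index, &d, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```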
2021-11-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-80/+257
Pull kvm fixes from Paolo Bonzini:
 "ARM64:
  - Fix constant sign extension affecting TCR_EL2 and preventing running on ARMv8.7 models due to spurious bits being set
  - Fix use of helpers using PSTATE early on exit by always sampling it as soon as the exit takes place
  - Move pkvm's 32bit handling into a common helper

 RISC-V:
  - Fix incorrect KVM_MAX_VCPUS value
  - Unmap stage2 mapping when deleting/moving a memslot

 x86:
  - Fix and downgrade BUG_ON due to uninitialized cache
  - Many APICv and MOVE_ENC_CONTEXT_FROM fixes
  - Correctly emulate TLB flushes around nested vmentry/vmexit and when the nested hypervisor uses VPID
  - Prevent modifications to CPUID after the VM has run
  - Other smaller bugfixes

 Generic:
  - Memslot handling bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
  KVM: fix avic_set_running for preemptable kernels
  KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled
  KVM: SEV: accept signals in sev_lock_two_vms
  KVM: SEV: do not take kvm->lock when destroying
  KVM: SEV: Prohibit migration of a VM that has mirrors
  KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked
  selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
  KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
  KVM: SEV: initialize regions_list of a mirror VM
  KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
  KVM: SEV: do not use list_replace_init on an empty list
  KVM: x86: Use a stable condition around all VT-d PI paths
  KVM: x86: check PIR even for vCPUs with disabled APICv
  KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
  KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem
  KVM: x86/mmu: Handle "default" period when selectively waking kthread
  KVM: MMU: shadow nested paging does not have PKU
  KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path
  KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
  KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()
  ...
2021-11-30tools: Fix math.h breakageMatthew Wilcox (Oracle)3-21/+29
Commit 98e1385ef24b ("include/linux/radix-tree.h: replace kernel.h with the necessary inclusions") broke the radix tree test suite in two different ways; first by including math.h which didn't exist in the tools directory, and second by removing an implicit include of spinlock.h before lockdep.h. Fix both issues. Cc: Andy Shevchenko <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Andy Shevchenko <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-30libbpf: Remove duplicate assignmentsMehrdad Arshad Rad1-1/+0
Remove a duplicate assignment: the same assignment is already done when load_attr.attach_btf_id is initialized. Signed-off-by: Mehrdad Arshad Rad <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-30Bonding: add arp_missed_max optionHangbin Liu1-0/+1
Currently, we use a hard-coded number to verify if we are in the arp_interval timeslice. But some users may want to reduce/extend the verification timeslice. With an option similar to the team driver's 'missed_max', users can change that number based on their own environment. Acked-by: Jay Vosburgh <[email protected]> Signed-off-by: Hangbin Liu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-30KVM: SEV: Prohibit migration of a VM that has mirrorsPaolo Bonzini1-0/+37
VMs that mirror an encryption context rely on the owner to keep the ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM would cause a dangling ASID:

1. copy context from A to B (gets ref to A)
2. move context from A to L (moves ASID from A to L)
3. close L (releases ASID from L, B still references it)

The right way to do the handoff instead is to create a fresh mirror VM on the destination first:

1. copy context from A to B (gets ref to A)
   [later]
2. close B (releases ref to A)
3. move context from A to L (moves ASID from A to L)
4. copy context from L to M

So, catch the situation by adding a count of how many VMs are mirroring this one's encryption context. Fixes: 0b020f5af092 ("KVM: SEV: Add support for SEV-ES intra host migration") Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-30selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROMPaolo Bonzini1-7/+105
I am putting the tests in sev_migrate_tests because the failure conditions are very similar and some of the setup code can be reused, too. The tests cover both successful creation of a mirror VM, and error conditions. Cc: Peter Gonda <[email protected]> Cc: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-30KVM: selftests: page_table_test: fix calculation of guest_test_phys_memMaciej S. Szmigiero1-1/+1
A kvm_page_table_test run with its default settings fails on VMX due to a memory region add failure:

> ==== Test Assertion Failure ====
> lib/kvm_util.c:952: ret == 0
> pid=10538 tid=10538 errno=17 - File exists
> 1 0x00000000004057d1: vm_userspace_mem_region_add at kvm_util.c:947
> 2 0x0000000000401ee9: pre_init_before_test at kvm_page_table_test.c:302
> 3 (inlined by) run_test at kvm_page_table_test.c:374
> 4 0x0000000000409754: for_each_guest_mode at guest_modes.c:53
> 5 0x0000000000401860: main at kvm_page_table_test.c:500
> 6 0x00007f82ae2d8554: ?? ??:0
> 7 0x0000000000401894: _start at ??:?
> KVM_SET_USER_MEMORY_REGION IOCTL failed,
> rc: -1 errno: 17
> slot: 1 flags: 0x0
> guest_phys_addr: 0xc0000000 size: 0x40000000

This is because the memory range that this test is trying to add (0x0c0000000 - 0x100000000) conflicts with the LAPIC mapping at 0x0fee00000. Looking at the code, it seems that the guest_test_*phys*_mem variable gets mistakenly overwritten with guest_test_*virt*_mem while trying to adjust the former for alignment. With the correct variable adjusted, this test runs successfully. Signed-off-by: Maciej S. Szmigiero <[email protected]> Message-Id: <52e487458c3172923549bbcf9dfccfbe6faea60b.1637940473.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-29wireguard: device: reset peer src endpoint when netns exitsJason A. Donenfeld1-1/+23
Each peer's endpoint contains a dst_cache entry that takes a reference to another netdev. When the containing namespace exits, we take down the socket and prevent future sockets from being created (by setting creating_net to NULL), which removes that potential reference on the netns. However, it doesn't release references to the netns that a netdev cached in dst_cache might be taking, so the netns still might fail to exit. Since the socket is gimped anyway, we can simply clear all the dst_caches (by way of clearing the endpoint src), which will release all references. However, the current dst_cache_reset function only releases those references lazily. But it turns out that all of our usages of wg_socket_clear_peer_endpoint_src are called from contexts that are not exactly high-speed or bottle-necked. For example, when there's connection difficulty, or when userspace is reconfiguring the interface. And in particular for this patch, when the netns is exiting. So for those cases, it makes more sense to call dst_release immediately. For that, we add a small helper function to dst_cache. This patch also adds a test to netns.sh from Hangbin Liu to ensure this doesn't regress. Tested-by: Hangbin Liu <[email protected]> Reported-by: Xiumei Mu <[email protected]> Cc: Toke Høiland-Jørgensen <[email protected]> Cc: Paolo Abeni <[email protected]> Fixes: 900575aa33a3 ("wireguard: device: avoid circular netns references") Signed-off-by: Jason A. Donenfeld <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-29wireguard: selftests: rename DEBUG_PI_LIST to DEBUG_PLISTLi Zhijian1-1/+1
DEBUG_PI_LIST was renamed to DEBUG_PLIST in commit 8e18faeac3e4 ("lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST"). Signed-off-by: Li Zhijian <[email protected]> Fixes: 8e18faeac3e4 ("lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST") Signed-off-by: Jason A. Donenfeld <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-29wireguard: selftests: actually test for routing loopsJason A. Donenfeld1-1/+5
We previously removed the restriction on looping to self, and then added a test to make sure the kernel didn't blow up during a routing loop. The kernel didn't blow up, thankfully, but on certain architectures where skb fragmentation is easier, such as ppc64, the skbs weren't actually being discarded after a few rounds through. But the test wasn't catching this. So actually test explicitly for massive increases in tx to see if we have a routing loop. Note that the actual loop problem will need to be addressed in a different commit. Fixes: b673e24aad36 ("wireguard: socket: remove errant restriction on looping to self") Signed-off-by: Jason A. Donenfeld <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-29wireguard: selftests: increase default dmesg log sizeJason A. Donenfeld1-0/+1
The selftests currently parse the kernel log at the end to track potential memory leaks. With these tests now reading off the end of the buffer, due to recent optimizations, some creation messages were lost, making the tests think that there was a free without an alloc. Fix this by increasing the kernel log size. Fixes: 24b70eeeb4f4 ("wireguard: use synchronize_net rather than synchronize_rcu") Signed-off-by: Jason A. Donenfeld <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-29libbpf: Silence uninitialized warning/error in btf_dump_dump_type_dataAlan Maguire1-1/+1
When compiling libbpf with gcc 4.8.5, we see:

  CC       staticobjs/btf_dump.o
btf_dump.c: In function ‘btf_dump_dump_type_data.isra.24’:
btf_dump.c:2296:5: error: ‘err’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
     if (err < 0)
     ^
cc1: all warnings being treated as errors
make: *** [staticobjs/btf_dump.o] Error 1

While gcc 4.8.5 is too old to build the upstream kernel, it's possible it could be used to build standalone libbpf, which suffers from the same problem. Silence the error by initializing 'err' to 0. The warning/error seems to be a false positive, since err is set early in the function. Regardless, we shouldn't prevent libbpf from building because of this. Fixes: 920d16af9b42 ("libbpf: BTF dumper support for typed data") Signed-off-by: Alan Maguire <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-29selftests: net: bridge: fix typo in vlan_filtering dependency testIvan Vecera1-1/+1
Before the patch:

# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh
TEST: Vlan multicast snooping enable                                  [ OK ]
Device "bridge" does not exist.
TEST: Disable multicast vlan snooping when vlan filtering is disabled [FAIL]
	Vlan filtering is disabled but multicast vlan snooping is still enabled

After the patch:

# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh
TEST: Vlan multicast snooping enable                                  [ OK ]
TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]

Fixes: f5a9dd58f48b7c ("selftests: net: bridge: add test for vlan_filtering dependency") Cc: Nikolay Aleksandrov <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-28selftests/bpf: Test BPF_MAP_TYPE_PROG_ARRAY static initializationHengqi Chen2-0/+71
Add testcase for BPF_MAP_TYPE_PROG_ARRAY static initialization. Signed-off-by: Hengqi Chen <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-28libbpf: Support static initialization of BPF_MAP_TYPE_PROG_ARRAYHengqi Chen1-33/+121
Support static initialization of BPF_MAP_TYPE_PROG_ARRAY with a syntax similar to map-in-map initialization ([0]):

SEC("socket")
int tailcall_1(void *ctx)
{
	return 0;
}

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 2);
	__uint(key_size, sizeof(__u32));
	__array(values, int (void *));
} prog_array_init SEC(".maps") = {
	.values = {
		[1] = (void *)&tailcall_1,
	},
};

Here's the relevant part of the libbpf debug log showing what's going on with prog-array initialization:

libbpf: sec '.relsocket': collecting relocation for section(3) 'socket'
libbpf: sec '.relsocket': relo #0: insn #2 against 'prog_array_init'
libbpf: prog 'entry': found map 0 (prog_array_init, sec 4, off 0) for insn #0
libbpf: .maps relo #0: for 3 value 0 rel->r_offset 32 name 53 ('tailcall_1')
libbpf: .maps relo #0: map 'prog_array_init' slot [1] points to prog 'tailcall_1'
libbpf: map 'prog_array_init': created successfully, fd=5
libbpf: map 'prog_array_init': slot [1] set to prog 'tailcall_1' fd=6

[0] Closes: https://github.com/libbpf/libbpf/issues/354 Signed-off-by: Hengqi Chen <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-11-26af_unix: Remove UNIX_ABSTRACT() macro and test sun_path[0] instead.Kuniyuki Iwashima3-4/+2
In BSD and abstract address cases, we store sockets in the hash table with keys between 0 and UNIX_HASH_SIZE - 1. However, the hash saved in a socket varies depending on its address type; sockets with BSD addresses always have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash. This is only used by the UNIX_ABSTRACT() macro to check the address type. The difference in the saved hashes comes down to the first byte of the address in the first place, so we can test that byte directly. Then we can keep a real hash in each socket and replace unix_table_lock with per-hash locks in a later patch. Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
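For illustration only (a hypothetical user-space-style sketch, not the kernel patch itself), the first byte of the address is enough to tell the two address types apart, which is what replaces the sentinel-hash check:
```c
#include <stdbool.h>
#include <sys/un.h>

/* Sketch: an abstract AF_UNIX address starts with a NUL byte, while a
 * BSD (filesystem) address starts with a non-NUL path character, so the
 * first byte alone identifies the address type. */
static bool addr_is_abstract(const struct sockaddr_un *sun)
{
	return sun->sun_path[0] == '\0';
}
```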
2021-11-26selftests: net: bridge: add test for vlan_filtering dependencyNikolay Aleksandrov1-1/+11
Add a test for dependency of mcast_vlan_snooping on vlan_filtering. If vlan_filtering gets disabled, then mcast_vlan_snooping must be automatically disabled as well. TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast_router testsNikolay Aleksandrov1-1/+53
Add tests for the new per-port/vlan mcast_router option, verify that unknown multicast packets are flooded only to router ports. TEST: Port vlan 10 option mcast_router default value [ OK ] TEST: Port vlan 10 mcast_router option changed to 2 [ OK ] TEST: Flood unknown vlan multicast packets to router port only [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast query and query response interval testsNikolay Aleksandrov1-1/+50
Add tests which change the new per-vlan mcast_query_interval and verify the new value is in effect, also add a test to change mcast_query_response_interval's value. TEST: Vlan mcast_query_interval global option default value [ OK ] TEST: Vlan 10 mcast_query_interval option changed to 200 [ OK ] TEST: Vlan mcast_query_response_interval global option default value [ OK ] TEST: Vlan 10 mcast_query_response_interval option changed to 200 [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast_querier_interval testsNikolay Aleksandrov1-1/+39
Add tests which change the new per-vlan mcast_querier_interval and verify that it is taken into account when an outside querier is present. TEST: Vlan mcast_querier_interval global option default value [ OK ] TEST: Vlan 10 mcast_querier_interval option changed to 100 [ OK ] TEST: Vlan 10 mcast_querier_interval expire after outside query [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast_membership_interval testNikolay Aleksandrov1-1/+27
Add a test which changes the new per-vlan mcast_membership_interval and verifies that a newly learned mdb entry would expire in that interval. TEST: Vlan mcast_membership_interval global option default value [ OK ] TEST: Vlan 10 mcast_membership_interval option changed to 200 [ OK ] TEST: Vlan 10 mcast_membership_interval mdb entry expire [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast_startup_query_count/interval testsNikolay Aleksandrov1-1/+41
Add tests which change the new per-vlan startup query count/interval options and verify the proper number of queries are sent in the expected interval. TEST: Vlan mcast_startup_query_interval global option default value [ OK ] TEST: Vlan mcast_startup_query_count global option default value [ OK ] TEST: Vlan 10 mcast_startup_query_interval option changed to 100 [ OK ] TEST: Vlan 10 mcast_startup_query_count option changed to 3 [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast_last_member_count/interval testsNikolay Aleksandrov1-1/+35
Add tests which verify the default values of mcast_last_member_count and mcast_last_member_interval and also try to change them. TEST: Vlan mcast_last_member_count global option default value [ OK ] TEST: Vlan mcast_last_member_interval global option default value [ OK ] TEST: Vlan 10 mcast_last_member_count option changed to 3 [ OK ] TEST: Vlan 10 mcast_last_member_interval option changed to 200 [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast igmp/mld version testsNikolay Aleksandrov1-1/+44
Add tests which change the new per-vlan IGMP/MLD versions and verify that proper tagged general query packets are sent. TEST: Vlan mcast_igmp_version global option default value [ OK ] TEST: Vlan mcast_mld_version global option default value [ OK ] TEST: Vlan 10 mcast_igmp_version option changed to 3 [ OK ] TEST: Vlan 10 tagged IGMPv3 general query sent [ OK ] TEST: Vlan 10 mcast_mld_version option changed to 2 [ OK ] TEST: Vlan 10 tagged MLDv2 general query sent [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast querier testNikolay Aleksandrov1-1/+104
Add a test to try the new global vlan mcast_querier control and also verify that tagged general query packets are properly generated when querier is enabled for a single vlan. TEST: Vlan mcast_querier global option default value [ OK ] TEST: Vlan 10 multicast querier enable [ OK ] TEST: Vlan 10 tagged IGMPv2 general query sent [ OK ] TEST: Vlan 10 tagged MLD general query sent [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26selftests: net: bridge: add vlan mcast snooping control testNikolay Aleksandrov1-0/+148
Add the first test for bridge per-vlan multicast snooping which checks if control of the global and per-vlan options work as expected, joins and leaves are tested at each option value. TEST: Vlan multicast snooping enable [ OK ] TEST: Vlan global options existence [ OK ] TEST: Vlan mcast_snooping global option default value [ OK ] TEST: Vlan 10 multicast snooping control [ OK ] Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski34-294/+1155
drivers/net/ipa/ipa_main.c 8afc7e471ad3 ("net: ipa: separate disabling setup from modem stop") 76b5fbcd6b47 ("net: ipa: kill ipa_modem_init()") Duplicated include, drop one. Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-26bpf, mips: Fix build errors about __NR_bpf undeclaredTiezhu Yang3-0/+22
Add the __NR_bpf definitions to fix the following build errors for mips:

$ cd tools/bpf/bpftool
$ make
[...]
bpf.c:54:4: error: #error __NR_bpf not defined. libbpf does not support your arch.
 # error __NR_bpf not defined. libbpf does not support your arch.
   ^~~~~
bpf.c: In function ‘sys_bpf’:
bpf.c:66:17: error: ‘__NR_bpf’ undeclared (first use in this function); did you mean ‘__NR_brk’?
  return syscall(__NR_bpf, cmd, attr, size);
                 ^~~~~~~~
                 __NR_brk
[...]
In file included from gen_loader.c:15:0:
skel_internal.h: In function ‘skel_sys_bpf’:
skel_internal.h:53:17: error: ‘__NR_bpf’ undeclared (first use in this function); did you mean ‘__NR_brk’?
  return syscall(__NR_bpf, cmd, attr, size);
                 ^~~~~~~~
                 __NR_brk

We can see the following generated definitions:

$ grep -r "#define __NR_bpf" arch/mips
arch/mips/include/generated/uapi/asm/unistd_o32.h:#define __NR_bpf (__NR_Linux + 355)
arch/mips/include/generated/uapi/asm/unistd_n64.h:#define __NR_bpf (__NR_Linux + 315)
arch/mips/include/generated/uapi/asm/unistd_n32.h:#define __NR_bpf (__NR_Linux + 319)

__NR_Linux is defined in arch/mips/include/uapi/asm/unistd.h:

$ grep -r "#define __NR_Linux" arch/mips
arch/mips/include/uapi/asm/unistd.h:#define __NR_Linux 4000
arch/mips/include/uapi/asm/unistd.h:#define __NR_Linux 5000
arch/mips/include/uapi/asm/unistd.h:#define __NR_Linux 6000

That is to say, __NR_bpf is: 4000 + 355 = 4355 for mips o32, 6000 + 319 = 6319 for mips n32, and 5000 + 315 = 5315 for mips n64. So use the GCC pre-defined macros _ABIO32, _ABIN32 and _ABI64 [1] to define the corresponding __NR_bpf. This patch is similar to commit bad1926dd2f6 ("bpf, s390: fix build for libbpf and selftest suite"). [1] https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/mips/mips.h#l549 Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
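A sketch of the kind of fallback described above (the exact file placement and conditional layout of the real patch hunks may differ):
```c
/* Fallback __NR_bpf definitions for mips, keyed off the ABI macros that
 * GCC pre-defines (_ABIO32 / _ABIN32 / _ABI64), as the commit message
 * explains.  This is a sketch of the approach, not the literal patch. */
#ifndef __NR_bpf
# if defined(__mips__) && defined(_ABIO32)
#  define __NR_bpf 4355		/* 4000 + 355, o32 */
# elif defined(__mips__) && defined(_ABIN32)
#  define __NR_bpf 6319		/* 6000 + 319, n32 */
# elif defined(__mips__) && defined(_ABI64)
#  define __NR_bpf 5315		/* 5000 + 315, n64 */
# else
#  error __NR_bpf not defined. libbpf does not support your arch.
# endif
#endif
```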
2021-11-26Merge tag 'net-5.16-rc3' of ↵Linus Torvalds10-184/+1017
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from netfilter.

  Current release - regressions:
   - r8169: fix incorrect mac address assignment
   - vlan: fix underflow for the real_dev refcnt when vlan creation fails
   - smc: avoid warning of possible recursive locking

  Current release - new code bugs:
   - vsock/virtio: suppress used length validation
   - neigh: fix crash in v6 module initialization error path

  Previous releases - regressions:
   - af_unix: fix change in behavior in read after shutdown
   - igb: fix netpoll exit with traffic, avoid warning
   - tls: fix splice_read() when starting mid-record
   - lan743x: fix deadlock in lan743x_phy_link_status_change()
   - marvell: prestera: fix bridge port operation

  Previous releases - always broken:
   - tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows
   - nexthop: fix refcount issues when replacing IPv6 groups
   - nexthop: fix null pointer dereference when IPv6 is not enabled
   - phylink: force link down and retrigger resolve on interface change
   - mptcp: fix delack timer length calculation and incorrect early clearing
   - ieee802154: handle iftypes as u32, prevent shift-out-of-bounds
   - nfc: virtual_ncidev: change default device permissions
   - netfilter: ctnetlink: fix error codes and flags used for kernel side filtering of dumps
   - netfilter: flowtable: fix IPv6 tunnel addr match
   - ncsi: align payload to 32-bit to fix dropped packets
   - iavf: fix deadlock and loss of config during VF interface reset
   - ice: avoid bpf_prog refcount underflow
   - ocelot: fix broken PTP over IP and PTP API violations

  Misc:
   - marvell: mvpp2: increase MTU limit when XDP enabled"

* tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
  net: dsa: microchip: implement multi-bridge support
  net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
  net: mscc: ocelot: set up traps for PTP packets
  net: ptp: add a definition for the UDP port for IEEE 1588 general messages
  net: mscc: ocelot: create a function that replaces an existing VCAP filter
  net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP
  net: hns3: fix incorrect components info of ethtool --reset command
  net: hns3: fix one incorrect value of page pool info when queried by debugfs
  net: hns3: add check NULL address for page pool
  net: hns3: fix VF RSS failed problem after PF enable multi-TCs
  net: qed: fix the array may be out of bound
  net/smc: Don't call clcsock shutdown twice when smc shutdown
  net: vlan: fix underflow for the real_dev refcnt
  ptp: fix filter names in the documentation
  ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()
  nfc: virtual_ncidev: change default device permissions
  net/sched: sch_ets: don't peek at classes beyond 'nbands'
  net: stmmac: Disable Tx queues when reconfiguring the interface
  selftests: tls: test for correct proto_ops
  tls: fix replacing proto_ops
  ...
2021-11-26KVM: selftests: Make sure kvm_create_max_vcpus test won't hit RLIMIT_NOFILEVitaly Kuznetsov1-0/+30
With the elevated 'KVM_CAP_MAX_VCPUS' value, the kvm_create_max_vcpus test may hit RLIMIT_NOFILE limits:

# ./kvm_create_max_vcpus
KVM_CAP_MAX_VCPU_ID: 4096
KVM_CAP_MAX_VCPUS: 1024
Testing creating 1024 vCPUs, with IDs 0...1023.
/dev/kvm not available (errno: 24), skipping test

Adjust RLIMIT_NOFILE limits to make sure KVM_CAP_MAX_VCPUS fds can be opened. Note, raising the hard limit ('rlim_max') requires the CAP_SYS_RESOURCE capability, which is generally not needed to run kvm selftests (but without raising the limit the test is doomed to fail anyway). Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> [Skip the test if the hard limit can be raised. - Paolo] Reviewed-by: Sean Christopherson <[email protected]> Tested-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
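A hedged illustration of the adjustment described (the real selftest uses its own helpers; the headroom value and function name here are made up):
```c
#include <stdio.h>
#include <sys/resource.h>

/* Sketch: raise RLIMIT_NOFILE so one fd per vCPU can be opened, and
 * report "skip" rather than failing when the hard limit cannot be
 * raised (raising rlim_max needs CAP_SYS_RESOURCE). */
static int raise_fd_limit(rlim_t nr_vcpus)
{
	struct rlimit rl;
	rlim_t needed = nr_vcpus + 64;	/* arbitrary headroom for other fds */

	if (getrlimit(RLIMIT_NOFILE, &rl))
		return -1;
	if (rl.rlim_cur >= needed)
		return 0;		/* nothing to do */

	rl.rlim_cur = needed;
	if (rl.rlim_max < needed)
		rl.rlim_max = needed;	/* requires CAP_SYS_RESOURCE */

	if (setrlimit(RLIMIT_NOFILE, &rl)) {
		fprintf(stderr, "cannot raise RLIMIT_NOFILE, skipping test\n");
		return 1;		/* caller should skip, not fail */
	}
	return 0;
}
```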
2021-11-26KVM: selftests: Avoid KVM_SET_CPUID2 after KVM_RUN in hyperv_features testVitaly Kuznetsov1-69/+71
hyperv_features's sole purpose is to test access to various Hyper-V MSRs and hypercalls with different CPUID data. As KVM_SET_CPUID2 after KVM_RUN is deprecated and soon to be forbidden, avoid it by re-creating the test VM for each sub-test. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-26selftests: sev_migrate_tests: free all VMsPaolo Bonzini1-1/+8
Ensure that the ASIDs are freed promptly, which becomes more important when more tests are added to this file. Cc: Peter Gonda <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-26selftests: fix check for circular KVM_CAP_VM_MOVE_ENC_CONTEXT_FROMPaolo Bonzini1-2/+5
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM leaves the source VM in a dead state, so migrating back to the original source VM fails the ioctl. Adjust the test. Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-25selftests: tls: test for correct proto_opsJakub Kicinski1-0/+55
The previous patch fixes callbacks being incorrectly overridden. Triggering the crash in sendpage_locked would be more spectacular, but it's hard to get to, so take the easier path of proving this is broken and call getname. We're currently getting IPv4 socket info on an IPv6 socket. Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-25selftests: tls: test splicing decrypted recordsJakub Kicinski1-0/+49
Add tests for half-received and peeked records. Signed-off-by: Jakub Kicinski <[email protected]>