Age | Commit message (Collapse) | Author | Files | Lines |
|
Note that mov to 32-bit register will zero extend to 64-bit register.
Thus `mov $60,%eax` has the same effect with `mov $60,%rax`. Use the
shorter opcode to achieve the same thing.
```
b8 3c 00 00 00 mov $60,%eax (5 bytes) [1]
48 c7 c0 3c 00 00 00 mov $60,%rax (7 bytes) [2]
```
Currently, we use [2]. Change it to [1] for shorter code.
Signed-off-by: Ammar Faizi <[email protected]>
Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Linux x86-64 syscall only clobbers rax, rcx and r11 (and "memory").
- rax for the return value.
- rcx to save the return address.
- r11 to save the rflags.
Other registers are preserved.
Having r8, r9 and r10 in the syscall clobber list is harmless, but this
results in a missed-optimization.
As the syscall doesn't clobber r8-r10, GCC should be allowed to reuse
their value after the syscall returns to userspace. But since they are
in the clobber list, GCC will always miss this opportunity.
Remove them from the x86-64 syscall clobber list to help GCC generate
better code and fix the comment.
See also the x86-64 ABI, section A.2 AMD64 Linux Kernel Conventions,
A.2.1 Calling Conventions [1].
Extra note:
Some people may think it does not really give a benefit to remove r8,
r9 and r10 from the syscall clobber list because the impression of
syscall is a C function call, and function call always clobbers those 3.
However, that is not the case for nolibc.h, because we have a potential
to inline the "syscall" instruction (which its opcode is "0f 05") to the
user functions.
All syscalls in the nolibc.h are written as a static function with inline
ASM and are likely always inline if we use optimization flag, so this is
a profit not to have r8, r9 and r10 in the clobber list.
Here is the example where this matters.
Consider the following C code:
```
#include "tools/include/nolibc/nolibc.h"
#define read_abc(a, b, c) __asm__ volatile("nop"::"r"(a),"r"(b),"r"(c))
int main(void)
{
int a = 0xaa;
int b = 0xbb;
int c = 0xcc;
read_abc(a, b, c);
write(1, "test\n", 5);
read_abc(a, b, c);
return 0;
}
```
Compile with:
gcc -Os test.c -o test -nostdlib
With r8, r9, r10 in the clobber list, GCC generates this:
0000000000001000 <main>:
1000: f3 0f 1e fa endbr64
1004: 41 54 push %r12
1006: 41 bc cc 00 00 00 mov $0xcc,%r12d
100c: 55 push %rbp
100d: bd bb 00 00 00 mov $0xbb,%ebp
1012: 53 push %rbx
1013: bb aa 00 00 00 mov $0xaa,%ebx
1018: 90 nop
1019: b8 01 00 00 00 mov $0x1,%eax
101e: bf 01 00 00 00 mov $0x1,%edi
1023: ba 05 00 00 00 mov $0x5,%edx
1028: 48 8d 35 d1 0f 00 00 lea 0xfd1(%rip),%rsi
102f: 0f 05 syscall
1031: 90 nop
1032: 31 c0 xor %eax,%eax
1034: 5b pop %rbx
1035: 5d pop %rbp
1036: 41 5c pop %r12
1038: c3 ret
GCC thinks that syscall will clobber r8, r9, r10. So it spills 0xaa,
0xbb and 0xcc to callee saved registers (r12, rbp and rbx). This is
clearly extra memory access and extra stack size for preserving them.
But syscall does not actually clobber them, so this is a missed
optimization.
Now without r8, r9, r10 in the clobber list, GCC generates better code:
0000000000001000 <main>:
1000: f3 0f 1e fa endbr64
1004: 41 b8 aa 00 00 00 mov $0xaa,%r8d
100a: 41 b9 bb 00 00 00 mov $0xbb,%r9d
1010: 41 ba cc 00 00 00 mov $0xcc,%r10d
1016: 90 nop
1017: b8 01 00 00 00 mov $0x1,%eax
101c: bf 01 00 00 00 mov $0x1,%edi
1021: ba 05 00 00 00 mov $0x5,%edx
1026: 48 8d 35 d3 0f 00 00 lea 0xfd3(%rip),%rsi
102d: 0f 05 syscall
102f: 90 nop
1030: 31 c0 xor %eax,%eax
1032: c3 ret
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: "H. Peter Anvin" <[email protected]>
Cc: David Laight <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Signed-off-by: Ammar Faizi <[email protected]>
Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI [1]
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Ammar Faizi reported that our exit code handling is wrong. We truncate
it to the lowest 8 bits but the syscall itself is expected to take a
regular 32-bit signed integer, not an unsigned char. It's the kernel
that later truncates it to the lowest 8 bits. The difference is visible
in strace, where the program below used to show exit(255) instead of
exit(-1):
int main(void)
{
return -1;
}
This patch applies the fix to all archs. x86_64, i386, arm64, armv7 and
mips were all tested and confirmed to work fine now. Risc-v was not
tested but the change is trivial and exactly the same as for other archs.
Reported-by: Ammar Faizi <[email protected]>
Cc: [email protected]
Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
After re-checking in the spec and comparing stack offsets with glibc,
The last pushed argument must be 16-byte aligned (i.e. aligned before the
call) so that in the callee esp+4 is multiple of 16, so the principle is
the 32-bit equivalent to what Ammar fixed for x86_64. It's possible that
32-bit code using SSE2 or MMX could have been affected. In addition the
frame pointer ought to be zero at the deepest level.
Link: https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI
Cc: Ammar Faizi <[email protected]>
Cc: [email protected]
Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Before this patch, the `_start` function looks like this:
```
0000000000001170 <_start>:
1170: pop %rdi
1171: mov %rsp,%rsi
1174: lea 0x8(%rsi,%rdi,8),%rdx
1179: and $0xfffffffffffffff0,%rsp
117d: sub $0x8,%rsp
1181: call 1000 <main>
1186: movzbq %al,%rdi
118a: mov $0x3c,%rax
1191: syscall
1193: hlt
1194: data16 cs nopw 0x0(%rax,%rax,1)
119f: nop
```
Note the "and" to %rsp with $-16, it makes the %rsp be 16-byte aligned,
but then there is a "sub" with $0x8 which makes the %rsp no longer
16-byte aligned, then it calls main. That's the bug!
What actually the x86-64 System V ABI mandates is that right before the
"call", the %rsp must be 16-byte aligned, not after the "call". So the
"sub" with $0x8 here breaks the alignment. Remove it.
An example where this rule matters is when the callee needs to align
its stack at 16-byte for aligned move instruction, like `movdqa` and
`movaps`. If the callee can't align its stack properly, it will result
in segmentation fault.
x86-64 System V ABI also mandates the deepest stack frame should be
zero. Just to be safe, let's zero the %rbp on startup as the content
of %rbp may be unspecified when the program starts. Now it looks like
this:
```
0000000000001170 <_start>:
1170: pop %rdi
1171: mov %rsp,%rsi
1174: lea 0x8(%rsi,%rdi,8),%rdx
1179: xor %ebp,%ebp # zero the %rbp
117b: and $0xfffffffffffffff0,%rsp # align the %rsp
117f: call 1000 <main>
1184: movzbq %al,%rdi
1188: mov $0x3c,%rax
118f: syscall
1191: hlt
1192: data16 cs nopw 0x0(%rax,%rax,1)
119d: nopl (%rax)
```
Cc: Bedirhan KURT <[email protected]>
Cc: Louvian Lyndal <[email protected]>
Reported-by: Peter Cordes <[email protected]>
Signed-off-by: Ammar Faizi <[email protected]>
[wt: I did this on purpose due to a misunderstanding of the spec, other
archs will thus have to be rechecked, particularly i386]
Cc: [email protected]
Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve
systems with RCU callbacks offloaded. In this situation, all that this
Kconfig option does is slow down idle entry/exit with an additional
allways-taken early exit. If this is the only use case, then this
Kconfig option nothing but an attractive nuisance that needs to go away.
This commit therefore removes the RCU_FAST_NO_HZ Kconfig option.
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
All of the rcu scenarios that mentioning CONFIG_RCU_FAST_NO_HZ disable it.
But this Kconfig option is disabled by default, so this commit removes
the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios.
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
All of the rcuscale and refscale scenarios that mention the Kconfig option
CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by
default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n"
lines from these scenarios.
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
With CONFIG_PREEMPT_DYNAMIC=y, the kernel builds with CONFIG_PREEMPTION=y
because preemption can be enabled at runtime. This prevents any tests
of Tiny RCU or Tiny SRCU from running correctly. This commit therefore
explicitly sets CONFIG_PREEMPT_DYNAMIC=n for those scenarios.
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Alexei pointed out that we can use BPF_REG_0 which already contains imm
from move_blob2blob computation. Note that we now compare the second
insn's imm, but this should not matter, since both will be zeroed out
for the error case for the insn populated earlier.
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Instead, jump directly to success case stores in case ret >= 0, else do
the default 0 value store and jump over the success case. This is better
in terms of readability. Readjust the code for kfunc relocation as well
to follow a similar pattern, also leads to easier to follow code now.
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Add benchmark to measure the throughput and latency of the bpf_loop
call.
Testing this on my dev machine on 1 thread, the data is as follows:
nr_loops: 10
bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
nr_loops: 100
bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
nr_loops: 500
bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
nr_loops: 1000
bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
nr_loops: 5000
bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
nr_loops: 10000
bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
nr_loops: 50000
bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
nr_loops: 100000
bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
nr_loops: 500000
bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
nr_loops: 1000000
bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
>From this data, we can see that the latency per loop decreases as the
number of loops increases. On this particular machine, each loop had an
overhead of about ~4 ns, and we were able to run ~250 million loops
per second.
Signed-off-by: Joanne Koong <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This patch tests bpf_loop in pyperf and strobemeta, and measures the
verifier performance of replacing the traditional for loop
with bpf_loop.
The results are as follows:
~strobemeta~
Baseline
verification time 6808200 usec
stack depth 496
processed 554252 insns (limit 1000000) max_states_per_insn 16
total_states 15878 peak_states 13489 mark_read 3110
#192 verif_scale_strobemeta:OK (unrolled loop)
Using bpf_loop
verification time 31589 usec
stack depth 96+400
processed 1513 insns (limit 1000000) max_states_per_insn 2
total_states 106 peak_states 106 mark_read 60
#193 verif_scale_strobemeta_bpf_loop:OK
~pyperf600~
Baseline
verification time 29702486 usec
stack depth 368
processed 626838 insns (limit 1000000) max_states_per_insn 7
total_states 30368 peak_states 30279 mark_read 748
#182 verif_scale_pyperf600:OK (unrolled loop)
Using bpf_loop
verification time 148488 usec
stack depth 320+40
processed 10518 insns (limit 1000000) max_states_per_insn 10
total_states 705 peak_states 517 mark_read 38
#183 verif_scale_pyperf600_bpf_loop:OK
Using the bpf_loop helper led to approximately a 99% decrease
in the verification time and in the number of instructions.
Signed-off-by: Joanne Koong <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Add test for bpf_loop testing a variety of cases:
various nr_loops, null callback ctx, invalid flags, nested callbacks.
Signed-off-by: Joanne Koong <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This patch adds the kernel-side and API changes for a new helper
function, bpf_loop:
long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
u64 flags);
where long (*callback_fn)(u32 index, void *ctx);
bpf_loop invokes the "callback_fn" **nr_loops** times or until the
callback_fn returns 1. The callback_fn can only return 0 or 1, and
this is enforced by the verifier. The callback_fn index is zero-indexed.
A few things to please note:
~ The "u64 flags" parameter is currently unused but is included in
case a future use case for it arises.
~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
bpf_callback_t is used as the callback function cast.
~ A program can have nested bpf_loop calls but the program must
still adhere to the verifier constraint of its stack depth (the stack depth
cannot exceed MAX_BPF_STACK))
~ Recursive callback_fns do not pass the verifier, due to the call stack
for these being too deep.
~ The next patch will include the tests and benchmark
Signed-off-by: Joanne Koong <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix constant sign extension affecting TCR_EL2 and preventing
running on ARMv8.7 models due to spurious bits being set
- Fix use of helpers using PSTATE early on exit by always sampling it
as soon as the exit takes place
- Move pkvm's 32bit handling into a common helper
RISC-V:
- Fix incorrect KVM_MAX_VCPUS value
- Unmap stage2 mapping when deleting/moving a memslot
x86:
- Fix and downgrade BUG_ON due to uninitialized cache
- Many APICv and MOVE_ENC_CONTEXT_FROM fixes
- Correctly emulate TLB flushes around nested vmentry/vmexit and when
the nested hypervisor uses VPID
- Prevent modifications to CPUID after the VM has run
- Other smaller bugfixes
Generic:
- Memslot handling bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
KVM: fix avic_set_running for preemptable kernels
KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled
KVM: SEV: accept signals in sev_lock_two_vms
KVM: SEV: do not take kvm->lock when destroying
KVM: SEV: Prohibit migration of a VM that has mirrors
KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked
selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
KVM: SEV: initialize regions_list of a mirror VM
KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
KVM: SEV: do not use list_replace_init on an empty list
KVM: x86: Use a stable condition around all VT-d PI paths
KVM: x86: check PIR even for vCPUs with disabled APICv
KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem
KVM: x86/mmu: Handle "default" period when selectively waking kthread
KVM: MMU: shadow nested paging does not have PKU
KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path
KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()
...
|
|
Commit 98e1385ef24b ("include/linux/radix-tree.h: replace kernel.h with
the necessary inclusions") broke the radix tree test suite in two
different ways; first by including math.h which didn't exist in the
tools directory, and second by removing an implicit include of
spinlock.h before lockdep.h. Fix both issues.
Cc: Andy Shevchenko <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Andy Shevchenko <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is a same action when load_attr.attach_btf_id is initialized.
Signed-off-by: Mehrdad Arshad Rad <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Currently, we use hard code number to verify if we are in the
arp_interval timeslice. But some user may want to reduce/extend
the verify timeslice. With the similar team option 'missed_max'
the uers could change that number based on their own environment.
Acked-by: Jay Vosburgh <[email protected]>
Signed-off-by: Hangbin Liu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
VMs that mirror an encryption context rely on the owner to keep the
ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
would cause a dangling ASID:
1. copy context from A to B (gets ref to A)
2. move context from A to L (moves ASID from A to L)
3. close L (releases ASID from L, B still references it)
The right way to do the handoff instead is to create a fresh mirror VM
on the destination first:
1. copy context from A to B (gets ref to A)
[later] 2. close B (releases ref to A)
3. move context from A to L (moves ASID from A to L)
4. copy context from L to M
So, catch the situation by adding a count of how many VMs are
mirroring this one's encryption context.
Fixes: 0b020f5af092 ("KVM: SEV: Add support for SEV-ES intra host migration")
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
I am putting the tests in sev_migrate_tests because the failure conditions are
very similar and some of the setup code can be reused, too.
The tests cover both successful creation of a mirror VM, and error
conditions.
Cc: Peter Gonda <[email protected]>
Cc: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
A kvm_page_table_test run with its default settings fails on VMX due to
memory region add failure:
> ==== Test Assertion Failure ====
> lib/kvm_util.c:952: ret == 0
> pid=10538 tid=10538 errno=17 - File exists
> 1 0x00000000004057d1: vm_userspace_mem_region_add at kvm_util.c:947
> 2 0x0000000000401ee9: pre_init_before_test at kvm_page_table_test.c:302
> 3 (inlined by) run_test at kvm_page_table_test.c:374
> 4 0x0000000000409754: for_each_guest_mode at guest_modes.c:53
> 5 0x0000000000401860: main at kvm_page_table_test.c:500
> 6 0x00007f82ae2d8554: ?? ??:0
> 7 0x0000000000401894: _start at ??:?
> KVM_SET_USER_MEMORY_REGION IOCTL failed,
> rc: -1 errno: 17
> slot: 1 flags: 0x0
> guest_phys_addr: 0xc0000000 size: 0x40000000
This is because the memory range that this test is trying to add
(0x0c0000000 - 0x100000000) conflicts with LAPIC mapping at 0x0fee00000.
Looking at the code it seems that guest_test_*phys*_mem variable gets
mistakenly overwritten with guest_test_*virt*_mem while trying to adjust
the former for alignment.
With the correct variable adjusted this test runs successfully.
Signed-off-by: Maciej S. Szmigiero <[email protected]>
Message-Id: <52e487458c3172923549bbcf9dfccfbe6faea60b.1637940473.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Each peer's endpoint contains a dst_cache entry that takes a reference
to another netdev. When the containing namespace exits, we take down the
socket and prevent future sockets from being created (by setting
creating_net to NULL), which removes that potential reference on the
netns. However, it doesn't release references to the netns that a netdev
cached in dst_cache might be taking, so the netns still might fail to
exit. Since the socket is gimped anyway, we can simply clear all the
dst_caches (by way of clearing the endpoint src), which will release all
references.
However, the current dst_cache_reset function only releases those
references lazily. But it turns out that all of our usages of
wg_socket_clear_peer_endpoint_src are called from contexts that are not
exactly high-speed or bottle-necked. For example, when there's
connection difficulty, or when userspace is reconfiguring the interface.
And in particular for this patch, when the netns is exiting. So for
those cases, it makes more sense to call dst_release immediately. For
that, we add a small helper function to dst_cache.
This patch also adds a test to netns.sh from Hangbin Liu to ensure this
doesn't regress.
Tested-by: Hangbin Liu <[email protected]>
Reported-by: Xiumei Mu <[email protected]>
Cc: Toke Høiland-Jørgensen <[email protected]>
Cc: Paolo Abeni <[email protected]>
Fixes: 900575aa33a3 ("wireguard: device: avoid circular netns references")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
DEBUG_PI_LIST was renamed to DEBUG_PLIST since 8e18faeac3 ("lib/plist:
rename DEBUG_PI_LIST to DEBUG_PLIST").
Signed-off-by: Li Zhijian <[email protected]>
Fixes: 8e18faeac3e4 ("lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
We previously removed the restriction on looping to self, and then added
a test to make sure the kernel didn't blow up during a routing loop. The
kernel didn't blow up, thankfully, but on certain architectures where
skb fragmentation is easier, such as ppc64, the skbs weren't actually
being discarded after a few rounds through. But the test wasn't catching
this. So actually test explicitly for massive increases in tx to see if
we have a routing loop. Note that the actual loop problem will need to
be addressed in a different commit.
Fixes: b673e24aad36 ("wireguard: socket: remove errant restriction on looping to self")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
The selftests currently parse the kernel log at the end to track
potential memory leaks. With these tests now reading off the end of the
buffer, due to recent optimizations, some creation messages were lost,
making the tests think that there was a free without an alloc. Fix this
by increasing the kernel log size.
Fixes: 24b70eeeb4f4 ("wireguard: use synchronize_net rather than synchronize_rcu")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
When compiling libbpf with gcc 4.8.5, we see:
CC staticobjs/btf_dump.o
btf_dump.c: In function ‘btf_dump_dump_type_data.isra.24’:
btf_dump.c:2296:5: error: ‘err’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (err < 0)
^
cc1: all warnings being treated as errors
make: *** [staticobjs/btf_dump.o] Error 1
While gcc 4.8.5 is too old to build the upstream kernel, it's possible it
could be used to build standalone libbpf which suffers from the same problem.
Silence the error by initializing 'err' to 0. The warning/error seems to be
a false positive since err is set early in the function. Regardless we
shouldn't prevent libbpf from building for this.
Fixes: 920d16af9b42 ("libbpf: BTF dumper support for typed data")
Signed-off-by: Alan Maguire <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Prior patch:
]# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh
TEST: Vlan multicast snooping enable [ OK ]
Device "bridge" does not exist.
TEST: Disable multicast vlan snooping when vlan filtering is disabled [FAIL]
Vlan filtering is disabled but multicast vlan snooping is still enabled
After patch:
# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh
TEST: Vlan multicast snooping enable [ OK ]
TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]
Fixes: f5a9dd58f48b7c ("selftests: net: bridge: add test for vlan_filtering dependency")
Cc: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Ivan Vecera <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add testcase for BPF_MAP_TYPE_PROG_ARRAY static initialization.
Signed-off-by: Hengqi Chen <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Support static initialization of BPF_MAP_TYPE_PROG_ARRAY with a
syntax similar to map-in-map initialization ([0]):
SEC("socket")
int tailcall_1(void *ctx)
{
return 0;
}
struct {
__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
__uint(max_entries, 2);
__uint(key_size, sizeof(__u32));
__array(values, int (void *));
} prog_array_init SEC(".maps") = {
.values = {
[1] = (void *)&tailcall_1,
},
};
Here's the relevant part of libbpf debug log showing what's
going on with prog-array initialization:
libbpf: sec '.relsocket': collecting relocation for section(3) 'socket'
libbpf: sec '.relsocket': relo #0: insn #2 against 'prog_array_init'
libbpf: prog 'entry': found map 0 (prog_array_init, sec 4, off 0) for insn #0
libbpf: .maps relo #0: for 3 value 0 rel->r_offset 32 name 53 ('tailcall_1')
libbpf: .maps relo #0: map 'prog_array_init' slot [1] points to prog 'tailcall_1'
libbpf: map 'prog_array_init': created successfully, fd=5
libbpf: map 'prog_array_init': slot [1] set to prog 'tailcall_1' fd=6
[0] Closes: https://github.com/libbpf/libbpf/issues/354
Signed-off-by: Hengqi Chen <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
In BSD and abstract address cases, we store sockets in the hash table with
keys between 0 and UNIX_HASH_SIZE - 1. However, the hash saved in a socket
varies depending on its address type; sockets with BSD addresses always
have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash.
This is just for the UNIX_ABSTRACT() macro used to check the address type.
The difference of the saved hashes comes from the first byte of the address
in the first place. So, we can test it directly.
Then we can keep a real hash in each socket and replace unix_table_lock
with per-hash locks in the later patch.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add a test for dependency of mcast_vlan_snooping on vlan_filtering. If
vlan_filtering gets disabled, then mcast_vlan_snooping must be
automatically disabled as well.
TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add tests for the new per-port/vlan mcast_router option, verify that
unknown multicast packets are flooded only to router ports.
TEST: Port vlan 10 option mcast_router default value [ OK ]
TEST: Port vlan 10 mcast_router option changed to 2 [ OK ]
TEST: Flood unknown vlan multicast packets to router port only [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add tests which change the new per-vlan mcast_query_interval and verify
the new value is in effect, also add a test to change
mcast_query_response_interval's value.
TEST: Vlan mcast_query_interval global option default value [ OK ]
TEST: Vlan 10 mcast_query_interval option changed to 200 [ OK ]
TEST: Vlan mcast_query_response_interval global option default value [ OK ]
TEST: Vlan 10 mcast_query_response_interval option changed to 200 [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add tests which change the new per-vlan mcast_querier_interval and
verify that it is taken into account when an outside querier is present.
TEST: Vlan mcast_querier_interval global option default value [ OK ]
TEST: Vlan 10 mcast_querier_interval option changed to 100 [ OK ]
TEST: Vlan 10 mcast_querier_interval expire after outside query [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add a test which changes the new per-vlan mcast_membership_interval and
verifies that a newly learned mdb entry would expire in that interval.
TEST: Vlan mcast_membership_interval global option default value [ OK ]
TEST: Vlan 10 mcast_membership_interval option changed to 200 [ OK ]
TEST: Vlan 10 mcast_membership_interval mdb entry expire [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add tests which change the new per-vlan startup query count/interval
options and verify the proper number of queries are sent in the expected
interval.
TEST: Vlan mcast_startup_query_interval global option default value [ OK ]
TEST: Vlan mcast_startup_query_count global option default value [ OK ]
TEST: Vlan 10 mcast_startup_query_interval option changed to 100 [ OK ]
TEST: Vlan 10 mcast_startup_query_count option changed to 3 [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add tests which verify the default values of mcast_last_member_count
mcast_last_member_count and also try to change them.
TEST: Vlan mcast_last_member_count global option default value [ OK ]
TEST: Vlan mcast_last_member_interval global option default value [ OK ]
TEST: Vlan 10 mcast_last_member_count option changed to 3 [ OK ]
TEST: Vlan 10 mcast_last_member_interval option changed to 200 [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add tests which change the new per-vlan IGMP/MLD versions and verify
that proper tagged general query packets are sent.
TEST: Vlan mcast_igmp_version global option default value [ OK ]
TEST: Vlan mcast_mld_version global option default value [ OK ]
TEST: Vlan 10 mcast_igmp_version option changed to 3 [ OK ]
TEST: Vlan 10 tagged IGMPv3 general query sent [ OK ]
TEST: Vlan 10 mcast_mld_version option changed to 2 [ OK ]
TEST: Vlan 10 tagged MLDv2 general query sent [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add a test to try the new global vlan mcast_querier control and also
verify that tagged general query packets are properly generated when
querier is enabled for a single vlan.
TEST: Vlan mcast_querier global option default value [ OK ]
TEST: Vlan 10 multicast querier enable [ OK ]
TEST: Vlan 10 tagged IGMPv2 general query sent [ OK ]
TEST: Vlan 10 tagged MLD general query sent [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add the first test for bridge per-vlan multicast snooping which checks
if control of the global and per-vlan options work as expected, joins
and leaves are tested at each option value.
TEST: Vlan multicast snooping enable [ OK ]
TEST: Vlan global options existence [ OK ]
TEST: Vlan mcast_snooping global option default value [ OK ]
TEST: Vlan 10 multicast snooping control [ OK ]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
drivers/net/ipa/ipa_main.c
8afc7e471ad3 ("net: ipa: separate disabling setup from modem stop")
76b5fbcd6b47 ("net: ipa: kill ipa_modem_init()")
Duplicated include, drop one.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add the __NR_bpf definitions to fix the following build errors for mips:
$ cd tools/bpf/bpftool
$ make
[...]
bpf.c:54:4: error: #error __NR_bpf not defined. libbpf does not support your arch.
# error __NR_bpf not defined. libbpf does not support your arch.
^~~~~
bpf.c: In function ‘sys_bpf’:
bpf.c:66:17: error: ‘__NR_bpf’ undeclared (first use in this function); did you mean ‘__NR_brk’?
return syscall(__NR_bpf, cmd, attr, size);
^~~~~~~~
__NR_brk
[...]
In file included from gen_loader.c:15:0:
skel_internal.h: In function ‘skel_sys_bpf’:
skel_internal.h:53:17: error: ‘__NR_bpf’ undeclared (first use in this function); did you mean ‘__NR_brk’?
return syscall(__NR_bpf, cmd, attr, size);
^~~~~~~~
__NR_brk
We can see the following generated definitions:
$ grep -r "#define __NR_bpf" arch/mips
arch/mips/include/generated/uapi/asm/unistd_o32.h:#define __NR_bpf (__NR_Linux + 355)
arch/mips/include/generated/uapi/asm/unistd_n64.h:#define __NR_bpf (__NR_Linux + 315)
arch/mips/include/generated/uapi/asm/unistd_n32.h:#define __NR_bpf (__NR_Linux + 319)
The __NR_Linux is defined in arch/mips/include/uapi/asm/unistd.h:
$ grep -r "#define __NR_Linux" arch/mips
arch/mips/include/uapi/asm/unistd.h:#define __NR_Linux 4000
arch/mips/include/uapi/asm/unistd.h:#define __NR_Linux 5000
arch/mips/include/uapi/asm/unistd.h:#define __NR_Linux 6000
That is to say, __NR_bpf is:
4000 + 355 = 4355 for mips o32,
6000 + 319 = 6319 for mips n32,
5000 + 315 = 5315 for mips n64.
So use the GCC pre-defined macro _ABIO32, _ABIN32 and _ABI64 [1] to define
the corresponding __NR_bpf.
This patch is similar with commit bad1926dd2f6 ("bpf, s390: fix build for
libbpf and selftest suite").
[1] https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/mips/mips.h#l549
Signed-off-by: Tiezhu Yang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter.
Current release - regressions:
- r8169: fix incorrect mac address assignment
- vlan: fix underflow for the real_dev refcnt when vlan creation
fails
- smc: avoid warning of possible recursive locking
Current release - new code bugs:
- vsock/virtio: suppress used length validation
- neigh: fix crash in v6 module initialization error path
Previous releases - regressions:
- af_unix: fix change in behavior in read after shutdown
- igb: fix netpoll exit with traffic, avoid warning
- tls: fix splice_read() when starting mid-record
- lan743x: fix deadlock in lan743x_phy_link_status_change()
- marvell: prestera: fix bridge port operation
Previous releases - always broken:
- tcp_cubic: fix spurious Hystart ACK train detections for
not-cwnd-limited flows
- nexthop: fix refcount issues when replacing IPv6 groups
- nexthop: fix null pointer dereference when IPv6 is not enabled
- phylink: force link down and retrigger resolve on interface change
- mptcp: fix delack timer length calculation and incorrect early
clearing
- ieee802154: handle iftypes as u32, prevent shift-out-of-bounds
- nfc: virtual_ncidev: change default device permissions
- netfilter: ctnetlink: fix error codes and flags used for kernel
side filtering of dumps
- netfilter: flowtable: fix IPv6 tunnel addr match
- ncsi: align payload to 32-bit to fix dropped packets
- iavf: fix deadlock and loss of config during VF interface reset
- ice: avoid bpf_prog refcount underflow
- ocelot: fix broken PTP over IP and PTP API violations
Misc:
- marvell: mvpp2: increase MTU limit when XDP enabled"
* tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
net: dsa: microchip: implement multi-bridge support
net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
net: mscc: ocelot: set up traps for PTP packets
net: ptp: add a definition for the UDP port for IEEE 1588 general messages
net: mscc: ocelot: create a function that replaces an existing VCAP filter
net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP
net: hns3: fix incorrect components info of ethtool --reset command
net: hns3: fix one incorrect value of page pool info when queried by debugfs
net: hns3: add check NULL address for page pool
net: hns3: fix VF RSS failed problem after PF enable multi-TCs
net: qed: fix the array may be out of bound
net/smc: Don't call clcsock shutdown twice when smc shutdown
net: vlan: fix underflow for the real_dev refcnt
ptp: fix filter names in the documentation
ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()
nfc: virtual_ncidev: change default device permissions
net/sched: sch_ets: don't peek at classes beyond 'nbands'
net: stmmac: Disable Tx queues when reconfiguring the interface
selftests: tls: test for correct proto_ops
tls: fix replacing proto_ops
...
|
|
With the elevated 'KVM_CAP_MAX_VCPUS' value kvm_create_max_vcpus test
may hit RLIMIT_NOFILE limits:
# ./kvm_create_max_vcpus
KVM_CAP_MAX_VCPU_ID: 4096
KVM_CAP_MAX_VCPUS: 1024
Testing creating 1024 vCPUs, with IDs 0...1023.
/dev/kvm not available (errno: 24), skipping test
Adjust RLIMIT_NOFILE limits to make sure KVM_CAP_MAX_VCPUS fds can be
opened. Note, raising hard limit ('rlim_max') requires CAP_SYS_RESOURCE
capability which is generally not needed to run kvm selftests (but without
raising the limit the test is doomed to fail anyway).
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
[Skip the test if the hard limit can be raised. - Paolo]
Reviewed-by: Sean Christopherson <[email protected]>
Tested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
hyperv_features's sole purpose is to test access to various Hyper-V MSRs
and hypercalls with different CPUID data. As KVM_SET_CPUID2 after KVM_RUN
is deprecated and soon-to-be forbidden, avoid it by re-creating test VM
for each sub-test.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Ensure that the ASID are freed promptly, which becomes more important
when more tests are added to this file.
Cc: Peter Gonda <[email protected]>
Cc: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM leaves the source VM in a dead state,
so migrating back to the original source VM fails the ioctl. Adjust
the test.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Previous patch fixes overriding callbacks incorrectly. Triggering
the crash in sendpage_locked would be more spectacular but it's
hard to get to, so take the easier path of proving this is broken
and call getname. We're currently getting IPv4 socket info on an
IPv6 socket.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add tests for half-received and peeked records.
Signed-off-by: Jakub Kicinski <[email protected]>
|