aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-03-12hv: vmbus: Convert to platform remove callback returning voidUwe Kleine-König1-3/+2
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <[email protected]> Link: https://lore.kernel.org/r/920230729ddbeb9f3c4ff8282a18b0c0e1a37969.1709886922.git.u.kleine-koenig@pengutronix.de Signed-off-by: Wei Liu <[email protected]> Message-ID: <920230729ddbeb9f3c4ff8282a18b0c0e1a37969.1709886922.git.u.kleine-koenig@pengutronix.de>
2024-03-12mshyperv: Introduce hv_get_hypervisor_version functionNuno Das Neves5-29/+56
Introduce x86_64 and arm64 functions to get the hypervisor version information and store it in a structure for simpler parsing. Use the new function to get and parse the version at boot time. While at it, move the printing code to hv_common_init() so it is not duplicated. Signed-off-by: Nuno Das Neves <[email protected]> Acked-by: Wei Liu <[email protected]> Reviewed-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <[email protected]> Message-ID: <1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-03-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski14-82/+86
Merge in late fixes to prepare for the 6.9 net-next PR. Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11Merge branch 'nexthop-fix-two-nexthop-group-statistics-issues'Jakub Kicinski2-26/+38
Ido Schimmel says: ==================== nexthop: Fix two nexthop group statistics issues Fix two issues that were introduced as part of the recent nexthop group statistics submission. See the commit messages for more details. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=yIdo Schimmel1-1/+2
Locally generated packets can increment the new nexthop statistics from process context, resulting in the following splat [1] due to preemption being enabled. Fix by using get_cpu_ptr() / put_cpu_ptr() which will which take care of disabling / enabling preemption. BUG: using smp_processor_id() in preemptible [00000000] code: ping/949 caller is nexthop_select_path+0xcf8/0x1e30 CPU: 12 PID: 949 Comm: ping Not tainted 6.8.0-rc7-custom-gcb450f605fae #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xbd/0xe0 check_preemption_disabled+0xce/0xe0 nexthop_select_path+0xcf8/0x1e30 fib_select_multipath+0x865/0x18b0 fib_select_path+0x311/0x1160 ip_route_output_key_hash_rcu+0xe54/0x2720 ip_route_output_key_hash+0x193/0x380 ip_route_output_flow+0x25/0x130 raw_sendmsg+0xbab/0x34a0 inet_sendmsg+0xa2/0xe0 __sys_sendto+0x2ad/0x430 __x64_sys_sendto+0xe5/0x1c0 do_syscall_64+0xc5/0x1d0 entry_SYSCALL_64_after_hwframe+0x63/0x6b [...] Fixes: f4676ea74b85 ("net: nexthop: Add nexthop group entry stats") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11nexthop: Fix out-of-bounds access during attribute validationIdo Schimmel2-12/+23
Passing a maximum attribute type to nlmsg_parse() that is larger than the size of the passed policy will result in an out-of-bounds access [1] when the attribute type is used as an index into the policy array. Fix by setting the maximum attribute type according to the policy size, as is already done for RTM_NEWNEXTHOP messages. Add a test case that triggers the bug. No regressions in fib nexthops tests: # ./fib_nexthops.sh [...] Tests passed: 236 Tests failed: 0 [1] BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x1e53/0x2940 Read of size 1 at addr ffffffff99ab4d20 by task ip/610 CPU: 3 PID: 610 Comm: ip Not tainted 6.8.0-rc7-custom-gd435d6e3e161 #9 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x8f/0xe0 print_report+0xcf/0x670 kasan_report+0xd8/0x110 __nla_validate_parse+0x1e53/0x2940 __nla_parse+0x40/0x50 rtm_del_nexthop+0x1bd/0x400 rtnetlink_rcv_msg+0x3cc/0xf20 netlink_rcv_skb+0x170/0x440 netlink_unicast+0x540/0x820 netlink_sendmsg+0x8d3/0xdb0 ____sys_sendmsg+0x31f/0xa60 ___sys_sendmsg+0x13a/0x1e0 __sys_sendmsg+0x11c/0x1f0 do_syscall_64+0xc5/0x1d0 entry_SYSCALL_64_after_hwframe+0x63/0x6b [...] The buggy address belongs to the variable: rtm_nh_policy_del+0x20/0x40 Fixes: 2118f9390d83 ("net: nexthop: Adjust netlink policy parsing for a new attribute") Reported-by: Eric Dumazet <[email protected]> Closes: https://lore.kernel.org/netdev/CANn89i+UNcG0PJMW5X7gOMunF38ryMh=L1aeZUKH3kL4UdUqag@mail.gmail.com/ Reported-by: [email protected] Closes: https://lore.kernel.org/netdev/[email protected]/ Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11nexthop: Only parse NHA_OP_FLAGS for dump messages that require itIdo Schimmel1-5/+5
The attribute is parsed in __nh_valid_dump_req() which is called by the dump handlers of RTM_GETNEXTHOP and RTM_GETNEXTHOPBUCKET although it is only used by the former and rejected by the policy of the latter. Move the parsing to nh_valid_dump_req() which is only called by the dump handler of RTM_GETNEXTHOP. This is a preparation for a subsequent patch. Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11nexthop: Only parse NHA_OP_FLAGS for get messages that require itIdo Schimmel1-8/+8
The attribute is parsed into 'op_flags' in nh_valid_get_del_req() which is called from the handlers of three message types: RTM_DELNEXTHOP, RTM_GETNEXTHOPBUCKET and RTM_GETNEXTHOP. The attribute is only used by the latter and rejected by the policies of the other two. Pass 'op_flags' as NULL from the handlers of the other two and only parse the attribute when the argument is not NULL. This is a preparation for a subsequent patch. Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11Merge tag 'x86_tdx_for_6.9' of ↵Linus Torvalds2-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 tdx update from Dave Hansen: - Fix sparse warning from TDX use of movdir64b() * tag 'x86_tdx_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Remove the __iomem annotation of movdir64b()'s dst argument
2024-03-11Merge tag 'x86_mm_for_6.9' of ↵Linus Torvalds3-20/+32
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: - Add a warning when memory encryption conversions fail. These operations require VMM cooperation, even in CoCo environments where the VMM is untrusted. While it's _possible_ that memory pressure could trigger the new warning, the odds are that a guest would only see this from an attacking VMM. - Simplify page fault code by re-enabling interrupts unconditionally - Avoid truncation issues when pfns are passed in to pfn_to_kaddr() with small (<64-bit) types. * tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/cpa: Warn for set_memory_XXcrypted() VMM fails x86/mm: Get rid of conditional IF flag handling in page fault path x86/mm: Ensure input to pfn_to_kaddr() is treated as a 64-bit type
2024-03-11Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds97-562/+667
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-03-11Merge tag 'x86-cleanups-2024-03-11' of ↵Linus Torvalds37-145/+103
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups, including a large series from Thomas Gleixner to cure sparse warnings" * tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Drop unused declaration of proc_nmi_enabled() x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address() x86/percpu: Cure per CPU madness on UP smp: Consolidate smp_prepare_boot_cpu() x86/msr: Add missing __percpu annotations x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h> perf/x86/amd/uncore: Fix __percpu annotation x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP) x86/apm_32: Remove dead function apm_get_battery_status() x86/insn-eval: Fix function param name in get_eff_addr_sib()
2024-03-11Merge tag 'x86-build-2024-03-11' of ↵Linus Torvalds21-109/+142
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build updates from Ingo Molnar: - Reduce <asm/bootparam.h> dependencies - Simplify <asm/efi.h> - Unify *_setup_data definitions into <asm/setup_data.h> - Reduce the size of <asm/bootparam.h> * tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Do not include <asm/bootparam.h> in several files x86/efi: Implement arch_ima_efi_boot_mode() in source file x86/setup: Move internal setup_data structures into setup_data.h x86/setup: Move UAPI setup structures into setup_data.h
2024-03-11Merge tag 'x86-asm-2024-03-11' of ↵Linus Torvalds2-72/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: "Two changes to simplify the x86 decoder logic a bit" * tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/insn: Directly assign x86_64 state in insn_init() x86/insn: Remove superfluous checks from instruction decoding routines
2024-03-11Merge tag 'sched-core-2024-03-11' of ↵Linus Torvalds7-93/+74
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Fix inconsistency in misfit task load-balancing - Fix CPU isolation bugs in the task-wakeup logic - Rework and unify the sched_use_asym_prio() and sched_asym_prefer() logic - Clean up and simplify ->avg_* accesses - Misc cleanups and fixes * tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio() sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer() sched/fair: Remove unused parameter from sched_asym() sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS sched/fair: Simplify the update_sd_pick_busiest() logic sched/fair: Do strict inequality check for busiest misfit task group sched/fair: Remove unnecessary goto in update_sd_lb_stats() sched/fair: Take the scheduling domain into account in select_idle_core() sched/fair: Take the scheduling domain into account in select_idle_smt() sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl sched/core: Simplify code by removing duplicate #ifdefs
2024-03-11Merge tag 'locking-core-2024-03-11' of ↵Linus Torvalds17-49/+154
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Micro-optimize local_xchg() and the rtmutex code on x86 - Fix percpu-rwsem contention tracepoints - Simplify debugging Kconfig dependencies - Update/clarify the documentation of atomic primitives - Misc cleanups * tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters() locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix locking/percpu-rwsem: Trigger contention tracepoints only if contended locking/rwsem: Make DEBUG_RWSEMS and PREEMPT_RT mutually exclusive locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint locking/mutex: Simplify <linux/mutex.h> locking/qspinlock: Fix 'wait_early' set but not used warning locking/atomic: scripts: Clarify ordering of conditional atomics
2024-03-11Merge tag 'edac_updates_for_v6.9' of ↵Linus Torvalds33-336/+5165
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add a FRU (Field Replaceable Unit) memory poison manager which collects and manages previously encountered hw errors in order to save them to persistent storage across reboots. Previously recorded errors are "replayed" upon reboot in order to poison memory which has caused said errors in the past. The main use case is stacked, on-chip memory which cannot simply be replaced so poisoning faulty areas of it and thus making them inaccessible is the only strategy to prolong its lifetime. - Add an AMD address translation library glue which converts the reported addresses of hw errors into system physical addresses in order to be used by other subsystems like memory failure, for example. Add support for MI300 accelerators to that library. - igen6: Add support for Alder Lake-N SoC - i10nm: Add Grand Ridge support - The usual fixlets and cleanups * tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/versal: Convert to platform remove callback returning void RAS/AMD/FMPM: Fix off by one when unwinding on error RAS/AMD/FMPM: Add debugfs interface to print record entries RAS/AMD/FMPM: Save SPA values RAS: Export helper to get ras_debugfs_dir RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2() RAS: Introduce a FRU memory poison manager RAS/AMD/ATL: Add MI300 row retirement support Documentation: Move RAS section to admin-guide EDAC/versal: Make the bit position of injected errors configurable EDAC/i10nm: Add Intel Grand Ridge micro-server support EDAC/igen6: Add one more Intel Alder Lake-N SoC support RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300() RAS/AMD/ATL: Add MI300 support Documentation: RAS: Add index and address translation section EDAC/amd64: Use new AMD Address Translation Library RAS: Introduce AMD Address Translation Library EDAC/synopsys: Convert to devm_platform_ioremap_resource()
2024-03-11Merge tag 'for-netdev' of ↵Jakub Kicinski88-590/+4181
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2024-03-11 We've added 59 non-merge commits during the last 9 day(s) which contain a total of 88 files changed, 4181 insertions(+), 590 deletions(-). The main changes are: 1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena, from Alexei. 2) Introduce bpf_arena which is sparse shared memory region between bpf program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and bpf programs, from Alexei and Andrii. 3) Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it, from Alexei. 4) Use IETF format for field definitions in the BPF standard document, from Dave. 5) Extend struct_ops libbpf APIs to allow specify version suffixes for stuct_ops map types, share the same BPF program between several map definitions, and other improvements, from Eduard. 6) Enable struct_ops support for more than one page in trampolines, from Kui-Feng. 7) Support kCFI + BPF on riscv64, from Puranjay. 8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay. 9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11Merge tag 'x86_misc_for_v6.9_rc1' of ↵Linus Torvalds3-6/+37
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Borislav Petkov: - Fix a wrong check in the function reporting whether a CPU executes (or not) a NMI handler - Ratelimit unknown NMIs messages in order to not potentially slow down the machine - Other fixlets * tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Fix the inverse "in NMI handler" check Documentation/maintainer-tip: Add C++ tail comments exception Documentation/maintainer-tip: Add Closes tag x86/nmi: Rate limit unknown NMI messages Documentation/kernel-parameters: Add spec_rstack_overflow to mitigations=off
2024-03-11Merge tag 'x86_sev_for_v6.9_rc1' of ↵Linus Torvalds45-252/+2623
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Add the x86 part of the SEV-SNP host support. This will allow the kernel to be used as a KVM hypervisor capable of running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal of the AMD confidential computing side, providing the most comprehensive confidential computing environment up to date. This is the x86 part and there is a KVM part which did not get ready in time for the merge window so latter will be forthcoming in the next cycle. - Rework the early code's position-dependent SEV variable references in order to allow building the kernel with clang and -fPIE/-fPIC and -mcmodel=kernel - The usual set of fixes, cleanups and improvements all over the place * tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/sev: Disable KMSAN for memory encryption TUs x86/sev: Dump SEV_STATUS crypto: ccp - Have it depend on AMD_IOMMU iommu/amd: Fix failure return from snp_lookup_rmpentry() x86/sev: Fix position dependent variable references in startup code crypto: ccp: Make snp_range_list static x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT Documentation: virt: Fix up pre-formatted text block for SEV ioctls crypto: ccp: Add the SNP_SET_CONFIG command crypto: ccp: Add the SNP_COMMIT command crypto: ccp: Add the SNP_PLATFORM_STATUS command x86/cpufeatures: Enable/unmask SEV-SNP CPU feature KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown crypto: ccp: Handle legacy SEV commands when SNP is enabled crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled crypto: ccp: Handle the legacy TMR allocation when SNP is enabled x86/sev: Introduce an SNP leaked pages list crypto: ccp: Provide an API to issue SEV and SNP commands ...
2024-03-11Merge tag 'x86_cache_for_v6.9_rc1' of ↵Linus Torvalds9-339/+946
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull resource control updates from Borislav Petkov: - Rework different aspects of the resctrl code like adding arch-specific accessors and splitting the locking, in order to accomodate ARM's MPAM implementation of hw resource control and be able to use the same filesystem control interface like on x86. Work by James Morse - Improve the memory bandwidth throttling heuristic to handle workloads with not too regular load levels which end up penalized unnecessarily - Use CPUID to detect the memory bandwidth enforcement limit on AMD - The usual set of fixes * tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) x86/resctrl: Remove lockdep annotation that triggers false positive x86/resctrl: Separate arch and fs resctrl locks x86/resctrl: Move domain helper migration into resctrl_offline_cpu() x86/resctrl: Add CPU offline callback for resctrl work x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU x86/resctrl: Add CPU online callback for resctrl work x86/resctrl: Add helpers for system wide mon/alloc capable x86/resctrl: Make rdt_enable_key the arch's decision to switch x86/resctrl: Move alloc/mon static keys into helpers x86/resctrl: Make resctrl_mounted checks explicit x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read() x86/resctrl: Allow resctrl_arch_rmid_read() to sleep x86/resctrl: Queue mon_event_read() instead of sending an IPI x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow x86/resctrl: Move CLOSID/RMID matching and setting to use helpers x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding x86/resctrl: Track the number of dirty RMID a CLOSID has x86/resctrl: Allow RMID allocation to be scoped by CLOSID x86/resctrl: Access per-rmid structures by index ...
2024-03-11Merge tag 'x86_mtrr_for_v6.9_rc1' of ↵Linus Torvalds2-9/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MTRR update from Borislav Petkov: - Relax the PAT MSR programming which was unnecessarily using the MTRR programming protocol of disabling the cache around the changes. The reason behind this is the current algorithm triggering a #VE exception for TDX guests and unnecessarily complicating things * tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/pat: Simplify the PAT programming protocol
2024-03-11Merge tag 'x86_cpu_for_v6.9_rc1' of ↵Linus Torvalds1-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu update from Borislav Petkov: - Have AMD Zen common init code run on all families from Zen1 onwards in order to save some future enablement effort * tag 'x86_cpu_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/CPU/AMD: Do the common init on future Zens too
2024-03-11Merge tag 'ras_core_for_v6.9_rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fixlet from Borislav Petkov: - Constify yet another static struct bus_type instance now that the driver core can handle that * tag 'ras_core_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Make mce_subsys const
2024-03-11Revert "dm: use queue_limits_set"Linus Torvalds2-13/+16
This reverts commit 8e0ef412869430d114158fc3b9b1fb111e247bd3. It's broken, and causes the boot to fail on encrypted volumes. Reported-and-bisected-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Acked-by: Jens Axboe <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2024-03-11bpf: move sleepable flag from bpf_prog_aux to bpf_progAndrii Nakryiko9-21/+21
prog->aux->sleepable is checked very frequently as part of (some) BPF program run hot paths. So this extra aux indirection seems wasteful and on busy systems might cause unnecessary memory cache misses. Let's move sleepable flag into prog itself to eliminate unnecessary pointer dereference. Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Jiri Olsa <[email protected]> Message-ID: <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2024-03-11bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()Puranjay Mohan1-1/+6
On some architectures like ARM64, PMD_SIZE can be really large in some configurations. Like with CONFIG_ARM64_64K_PAGES=y the PMD_SIZE is 512MB. Use 2MB * num_possible_nodes() as the size for allocations done through the prog pack allocator. On most architectures, PMD_SIZE will be equal to 2MB in case of 4KB pages and will be greater than 2MB for bigger page sizes. Fixes: ea2babac63d4 ("bpf: Simplify bpf_prog_pack_[size|mask]") Reported-by: "kernelci.org bot" <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Suggested-by: Song Liu <[email protected]> Signed-off-by: Puranjay Mohan <[email protected]> Message-ID: <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2024-03-11Merge tag 'x86-entry-2024-03-11' of ↵Linus Torvalds2-20/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry update from Thomas Gleixner: "A single update for the x86 entry code: The current CR3 handling for kernel page table isolation in the paranoid return paths which are relevant for #NMI, #MCE, #VC, #DB and #DF is unconditionally writing CR3 with the value retrieved on exception entry. In the vast majority of cases when returning to the kernel this is a pointless exercise because CR3 was not modified on exception entry. The only situation where this is necessary is when the exception interrupts a entry from user before switching to kernel CR3 or interrupts an exit to user after switching back to user CR3. As CR3 writes can be expensive on some systems this becomes measurable overhead with high frequency #NMIs such as perf. Avoid this overhead by checking the CR3 value, which was saved on entry, and write it back to CR3 only when it is a user CR3" * tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Avoid redundant CR3 write on paranoid returns
2024-03-11selftests/bpf: Add kprobe multi triggering benchmarksJiri Olsa3-0/+50
Adding kprobe multi triggering benchmarks. It's useful now to bench new fprobe implementation and might be useful later as well. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11ptp: Move from simple ida to xarrayKory Maincent1-14/+18
Move from simple ida to xarray for storing and loading the ptp_clock pointer. This prepares support for future hardware timestamp selection by being able to link the ptp clock index to its pointer. Signed-off-by: Kory Maincent <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11vxlan: Remove generic .ndo_get_stats64Breno Leitao1-2/+0
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is configured") moved the callback to dev_get_tstats64() to net core, so, unless the driver is doing some custom stats collection, it does not need to set .ndo_get_stats64. Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64 function pointer. Signed-off-by: Breno Leitao <[email protected]> Reviewed-by: Subbaraya Sundeep <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11vxlan: Do not alloc tstats manuallyBreno Leitao1-11/+2
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done on net core instead of in this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now. Remove the allocation in the vxlan driver and leverage the network core allocation instead. Signed-off-by: Breno Leitao <[email protected]> Reviewed-by: Subbaraya Sundeep <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11Merge tag 'x86-fred-2024-03-10' of ↵Linus Torvalds56-121/+1372
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ...
2024-03-11devlink: Add comments to use netlink gen toolWilliam Tu1-1/+4
Add the comment to remind people not to manually modify the net/devlink/netlink_gen.c, but to use tools/net/ynl/ynl-regen.sh to generate it. Signed-off-by: William Tu <[email protected]> Suggested-by: Jiri Pirko <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11nfp: flower: handle acti_netdevs allocation failureDuoming Zhou1-0/+5
The kmalloc_array() in nfp_fl_lag_do_work() will return null, if the physical memory has run out. As a result, if we dereference the acti_netdevs, the null pointer dereference bugs will happen. This patch adds a check to judge whether allocation failure occurs. If it happens, the delayed work will be rescheduled and try again. Fixes: bb9a8d031140 ("nfp: flower: monitor and offload LAG groups") Signed-off-by: Duoming Zhou <[email protected]> Reviewed-by: Louis Peens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11net/packet: Add getsockopt support for PACKET_COPY_THRESHJuntong Deng2-3/+6
Currently getsockopt does not support PACKET_COPY_THRESH, and we are unable to get the value of PACKET_COPY_THRESH socket option through getsockopt. This patch adds getsockopt support for PACKET_COPY_THRESH. In addition, this patch converts access to copy_thresh to READ_ONCE/WRITE_ONCE. Signed-off-by: Juntong Deng <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://lore.kernel.org/r/AM6PR03MB58487A9704FD150CF76F542899272@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSIDJuntong Deng1-0/+3
Currently getsockopt does not support NETLINK_LISTEN_ALL_NSID, and we are unable to get the value of NETLINK_LISTEN_ALL_NSID socket option through getsockopt. This patch adds getsockopt support for NETLINK_LISTEN_ALL_NSID. Signed-off-by: Juntong Deng <[email protected]> Link: https://lore.kernel.org/r/AM6PR03MB58482322B7B335308DA56FE599272@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-11Merge tag 'x86-apic-2024-03-10' of ↵Linus Torvalds86-1569/+1555
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 APIC updates from Thomas Gleixner: "Rework of APIC enumeration and topology evaluation. The current implementation has a couple of shortcomings: - It fails to handle hybrid systems correctly. - The APIC registration code which handles CPU number assignents is in the middle of the APIC code and detached from the topology evaluation. - The various mechanisms which enumerate APICs, ACPI, MPPARSE and guest specific ones, tweak global variables as they see fit or in case of XENPV just hack around the generic mechanisms completely. - The CPUID topology evaluation code is sprinkled all over the vendor code and reevaluates global variables on every hotplug operation. - There is no way to analyze topology on the boot CPU before bringing up the APs. This causes problems for infrastructure like PERF which needs to size certain aspects upfront or could be simplified if that would be possible. - The APIC admission and CPU number association logic is incomprehensible and overly complex and needs to be kept around after boot instead of completing this right after the APIC enumeration. This update addresses these shortcomings with the following changes: - Rework the CPUID evaluation code so it is common for all vendors and provides information about the APIC ID segments in a uniform way independent of the number of segments (Thread, Core, Module, ..., Die, Package) so that this information can be computed instead of rewriting global variables of dubious value over and over. - A few cleanups and simplifcations of the APIC, IO/APIC and related interfaces to prepare for the topology evaluation changes. - Seperation of the parser stages so the early evaluation which tries to find the APIC address can be seperately overridden from the late evaluation which enumerates and registers the local APIC as further preparation for sanitizing the topology evaluation. - A new registration and admission logic which - encapsulates the inner workings so that parsers and guest logic cannot longer fiddle in it - uses the APIC ID segments to build topology bitmaps at registration time - provides a sane admission logic - allows to detect the crash kernel case, where CPU0 does not run on the real BSP, automatically. This is required to prevent sending INIT/SIPI sequences to the real BSP which would reset the whole machine. This was so far handled by a tedious command line parameter, which does not even work in nested crash scenarios. - Associates CPU number after the enumeration completed and prevents the late registration of APICs, which was somehow tolerated before. - Converting all parsers and guest enumeration mechanisms over to the new interfaces. This allows to get rid of all global variable tweaking from the parsers and enumeration mechanisms and sanitizes the XEN[PV] handling so it can use CPUID evaluation for the first time. - Mopping up existing sins by taking the information from the APIC ID segment bitmaps. This evaluates hybrid systems correctly on the boot CPU and allows for cleanups and fixes in the related drivers, e.g. PERF. The series has been extensively tested and the minimal late fallout due to a broken ACPI/MADT table has been addressed by tightening the admission logic further" * tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits) x86/topology: Ignore non-present APIC IDs in a present package x86/apic: Build the x86 topology enumeration functions on UP APIC builds too smp: Provide 'setup_max_cpus' definition on UP too smp: Avoid 'setup_max_cpus' namespace collision/shadowing x86/bugs: Use fixed addressing for VERW operand x86/cpu/topology: Get rid of cpuinfo::x86_max_cores x86/cpu/topology: Provide __num_[cores|threads]_per_package x86/cpu/topology: Rename topology_max_die_per_package() x86/cpu/topology: Rename smp_num_siblings x86/cpu/topology: Retrieve cores per package from topology bitmaps x86/cpu/topology: Use topology logical mapping mechanism x86/cpu/topology: Provide logical pkg/die mapping x86/cpu/topology: Simplify cpu_mark_primary_thread() x86/cpu/topology: Mop up primary thread mask handling x86/cpu/topology: Use topology bitmaps for sizing x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT x86/xen/smp_pv: Count number of vCPUs early x86/cpu/topology: Assign hotpluggable CPUIDs during init x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug x86/topology: Add a mechanism to track topology via APIC IDs ...
2024-03-11Merge branch 'bpf-introduce-bpf-arena'Andrii Nakryiko37-40/+2028
Alexei Starovoitov says: ==================== bpf: Introduce BPF arena. From: Alexei Starovoitov <[email protected]> v2->v3: - contains bpf bits only, but cc-ing past audience for continuity - since prerequisite patches landed, this series focus on the main functionality of bpf_arena. - adopted Andrii's approach to support arena in libbpf. - simplified LLVM support. Instead of two instructions it's now only one. - switched to cond_break (instead of open coded iters) in selftests - implemented several follow-ups that will be sent after this set . remember first IP and bpf insn that faulted in arena. report to user space via bpftool . copy paste and tweak glob_match() aka mini-regex as a selftests/bpf - see patch 1 for detailed description of bpf_arena v1->v2: - Improved commit log with reasons for using vmap_pages_range() in arena. Thanks to Johannes - Added support for __arena global variables in bpf programs - Fixed race conditions spotted by Barret - Fixed wrap32 issue spotted by Barret - Fixed bpf_map_mmap_sz() the way Andrii suggested The work on bpf_arena was inspired by Barret's work: https://github.com/google/ghost-userspace/blob/main/lib/queue.bpf.h that implements queues, lists and AVL trees completely as bpf programs using giant bpf array map and integer indices instead of pointers. bpf_arena is a sparse array that allows to use normal C pointers to build such data structures. Last few patches implement page_frag allocator, link list and hash table as bpf programs. v1: bpf programs have multiple options to communicate with user space: - Various ring buffers (perf, ftrace, bpf): The data is streamed unidirectionally from bpf to user space. - Hash map: The bpf program populates elements, and user space consumes them via bpf syscall. - mmap()-ed array map: Libbpf creates an array map that is directly accessed by the bpf program and mmap-ed to user space. It's the fastest way. Its disadvantage is that memory for the whole array is reserved at the start. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Andrii Nakryiko <[email protected]>
2024-03-11selftests/bpf: Add bpf_arena_htab test.Alexei Starovoitov6-0/+243
bpf_arena_htab.h - hash table implemented as bpf program Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11selftests/bpf: Add bpf_arena_list test.Alexei Starovoitov4-0/+314
bpf_arena_alloc.h - implements page_frag allocator as a bpf program. bpf_arena_list.h - doubly linked link list as a bpf program. Compiled as a bpf program and as native C code. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11selftests/bpf: Add unit tests for bpf_arena_alloc/free_pagesAlexei Starovoitov6-2/+227
Add unit tests for bpf_arena_alloc/free_pages() functionality and bpf_arena_common.h with a set of common helpers and macros that is used in this test and the following patches. Also modify test_loader that didn't support running bpf_prog_type_syscall programs. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11bpf: Add helper macro bpf_addr_space_cast()Alexei Starovoitov1-0/+43
Introduce helper macro bpf_addr_space_cast() that emits: rX = rX instruction with off = BPF_ADDR_SPACE_CAST and encodes dest and src address_space-s into imm32. It's useful with older LLVM that doesn't emit this insn automatically. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11libbpf: Recognize __arena global variables.Andrii Nakryiko3-13/+120
LLVM automatically places __arena variables into ".arena.1" ELF section. In order to use such global variables bpf program must include definition of arena map in ".maps" section, like: struct { __uint(type, BPF_MAP_TYPE_ARENA); __uint(map_flags, BPF_F_MMAPABLE); __uint(max_entries, 1000); /* number of pages */ __ulong(map_extra, 2ull << 44); /* start of mmap() region */ } arena SEC(".maps"); libbpf recognizes both uses of arena and creates single `struct bpf_map *` instance in libbpf APIs. ".arena.1" ELF section data is used as initial data image, which is exposed through skeleton and bpf_map__initial_value() to the user, if they need to tune it before the load phase. During load phase, this initial image is copied over into mmap()'ed region corresponding to arena, and discarded. Few small checks here and there had to be added to make sure this approach works with bpf_map__initial_value(), mostly due to hard-coded assumption that map->mmaped is set up with mmap() syscall and should be munmap()'ed. For arena, .arena.1 can be (much) smaller than maximum arena size, so this smaller data size has to be tracked separately. Given it is enforced that there is only one arena for entire bpf_object instance, we just keep it in a separate field. This can be generalized if necessary later. All global variables from ".arena.1" section are accessible from user space via skel->arena->name_of_var. For bss/data/rodata the skeleton/libbpf perform the following sequence: 1. addr = mmap(MAP_ANONYMOUS) 2. user space optionally modifies global vars 3. map_fd = bpf_create_map() 4. bpf_update_map_elem(map_fd, addr) // to store values into the kernel 5. mmap(addr, MAP_FIXED, map_fd) after step 5 user spaces see the values it wrote at step 2 at the same addresses arena doesn't support update_map_elem. Hence skeleton/libbpf do: 1. addr = malloc(sizeof SEC ".arena.1") 2. user space optionally modifies global vars 3. map_fd = bpf_create_map(MAP_TYPE_ARENA) 4. real_addr = mmap(map->map_extra, MAP_SHARED | MAP_FIXED, map_fd) 5. memcpy(real_addr, addr) // this will fault-in and allocate pages At the end look and feel of global data vs __arena global data is the same from bpf prog pov. Another complication is: struct { __uint(type, BPF_MAP_TYPE_ARENA); } arena SEC(".maps"); int __arena foo; int bar; ptr1 = &foo; // relocation against ".arena.1" section ptr2 = &arena; // relocation against ".maps" section ptr3 = &bar; // relocation against ".bss" section Fo the kernel ptr1 and ptr2 has point to the same arena's map_fd while ptr3 points to a different global array's map_fd. For the verifier: ptr1->type == unknown_scalar ptr2->type == const_ptr_to_map ptr3->type == ptr_to_map_value After verification, from JIT pov all 3 ptr-s are normal ld_imm64 insns. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Quentin Monnet <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11bpftool: Recognize arena map typeAlexei Starovoitov2-2/+2
Teach bpftool to recognize arena map type. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Quentin Monnet <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11libbpf: Add support for bpf_arena.Alexei Starovoitov2-8/+46
mmap() bpf_arena right after creation, since the kernel needs to remember the address returned from mmap. This is user_vm_start. LLVM will generate bpf_arena_cast_user() instructions where necessary and JIT will add upper 32-bit of user_vm_start to such pointers. Fix up bpf_map_mmap_sz() to compute mmap size as map->value_size * map->max_entries for arrays and PAGE_SIZE * map->max_entries for arena. Don't set BTF at arena creation time, since it doesn't support it. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11libbpf: Add __arg_arena to bpf_helpers.hAlexei Starovoitov1-0/+1
Add __arg_arena to bpf_helpers.h Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11bpf: Recognize btf_decl_tag("arg: Arena") as PTR_TO_ARENA.Alexei Starovoitov3-4/+31
In global bpf functions recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA. Note, when the verifier sees: __weak void foo(struct bar *p) it recognizes 'p' as PTR_TO_MEM and 'struct bar' has to be a struct with scalars. Hence the only way to use arena pointers in global functions is to tag them with "arg:arena". Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11bpf: Recognize addr_space_cast instruction in the verifier.Alexei Starovoitov5-9/+109
rY = addr_space_cast(rX, 0, 1) tells the verifier that rY->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have to be in 32-bit domain. The verifier will mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as kern_vm_start + 32bit_addr memory accesses. rY = addr_space_cast(rX, 1, 0) tells the verifier that rY->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set then convert cast_user to mov32 as well. Otherwise JIT will convert it to: rY = (u32)rX; if (rY) rY |= arena->user_vm_start & ~(u64)~0U; Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-03-11bpf: Add x86-64 JIT support for bpf_addr_space_cast instruction.Alexei Starovoitov3-1/+47
LLVM generates bpf_addr_space_cast instruction while translating pointers between native (zero) address space and __attribute__((address_space(N))). The addr_space=1 is reserved as bpf_arena address space. rY = addr_space_cast(rX, 0, 1) is processed by the verifier and converted to normal 32-bit move: wX = wY rY = addr_space_cast(rX, 1, 0) has to be converted by JIT: aux_reg = upper_32_bits of arena->user_vm_start aux_reg <<= 32 wX = wY // clear upper 32 bits of dst register if (wX) // if not zero add upper bits of user_vm_start wX |= aux_reg JIT can do it more efficiently: mov dst_reg32, src_reg32 // 32-bit move shl dst_reg, 32 or dst_reg, user_vm_start rol dst_reg, 32 xor r11, r11 test dst_reg32, dst_reg32 // check if lower 32-bit are zero cmove r11, dst_reg // if so, set dst_reg to zero // Intel swapped src/dst register encoding in CMOVcc Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]