aboutsummaryrefslogtreecommitdiff
path: root/include/uapi
AgeCommit message (Collapse)AuthorFilesLines
2021-09-18mptcp: add MPTCP_TCPINFO getsockopt supportFlorian Westphal1-1/+9
Allow users to retrieve TCP_INFO data of all subflows. Users need to pre-initialize a meta header that has to be prepended to the data buffer that will be filled with the tcp info data. The meta header looks like this: struct mptcp_subflow_data { __u32 size_subflow_data;/* size of this structure in userspace */ __u32 num_subflows; /* must be 0, set by kernel */ __u32 size_kernel; /* must be 0, set by kernel */ __u32 size_user; /* size of one element in data[] */ } __attribute__((aligned(8))); size_subflow_data has to be set to 'sizeof(struct mptcp_subflow_data)'. This allows to extend mptcp_subflow_data structure later on without breaking backwards compatibility. If the structure is extended later on, kernel knows where the userspace-provided meta header ends, even if userspace uses an older (smaller) version of the structure. num_subflows must be set to 0. If the getsockopt request succeeds (return value is 0), it will be updated to contain the number of active subflows for the given logical connection. size_kernel must be set to 0. If the getsockopt request is successful, it will contain the size of the 'struct tcp_info' as known by the kernel. This is informational only. size_user must be set to 'sizeof(struct tcp_info)'. This allows the kernel to only fill in the space reserved/expected by userspace. Example: struct my_tcp_info { struct mptcp_subflow_data d; struct tcp_info ti[2]; }; struct my_tcp_info ti; socklen_t olen; memset(&ti, 0, sizeof(ti)); ti.d.size_subflow_data = sizeof(struct mptcp_subflow_data); ti.d.size_user = sizeof(struct tcp_info); olen = sizeof(ti); ret = getsockopt(fd, SOL_MPTCP, MPTCP_TCPINFO, &ti, &olen); if (ret < 0) die_perror("getsockopt MPTCP_TCPINFO"); mptcp_subflow_data.num_subflows is populated with the number of subflows that exist on the kernel side for the logical mptcp connection. This allows userspace to re-try with a larger tcp_info array if the number of subflows was larger than the available space in the ti[] array. olen has to be set to the number of bytes that userspace has allocated to receive the kernel data. It will be updated to contain the real number bytes that have been copied to by the kernel. In the above example, if the number if subflows was 1, olen is equal to 'sizeof(struct mptcp_subflow_data) + sizeof(struct tcp_info). For 2 or more subflows olen is equal to 'sizeof(struct my_tcp_info)'. If there was more data that could not be copied due to lack of space in the option buffer, userspace can detect this by checking mptcp_subflow_data->num_subflows. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-18mptcp: add MPTCP_INFO getsockoptFlorian Westphal1-0/+3
Its not compatible with multipath-tcp.org kernel one. 1. The out-of-tree implementation defines a different 'struct mptcp_info', with embedded __user addresses for additional data such as endpoint addresses. 2. Mat Martineau points out that embedded __user addresses doesn't work with BPF_CGROUP_RUN_PROG_GETSOCKOPT() which assumes that copying in optsize bytes from optval provides all data that got copied to userspace. This provides mptcp_info data for the given mptcp socket. Userspace sets optlen to the size of the structure it expects. The kernel updates it to contain the number of bytes that it copied. This allows to append more information to the structure later. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-17bpf: Clarify data_len param in bpf_snprintf and bpf_seq_printf commentsDave Marchevsky1-2/+3
Since the data_len in these two functions is a byte len of the preceding u64 *data array, it must always be a multiple of 8. If this isn't the case both helpers error out, so let's make the requirement explicit so users don't need to infer it. Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-09-17bpf: Add bpf_trace_vprintk helperDave Marchevsky1-0/+11
This helper is meant to be "bpf_trace_printk, but with proper vararg support". Follow bpf_snprintf's example and take a u64 pseudo-vararg array. Write to /sys/kernel/debug/tracing/trace_pipe using the same mechanism as bpf_trace_printk. The functionality of this helper was requested in the libbpf issue tracker [0]. [0] Closes: https://github.com/libbpf/libbpf/issues/315 Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-09-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski2-21/+60
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-09-17 We've added 63 non-merge commits during the last 12 day(s) which contain a total of 65 files changed, 2653 insertions(+), 751 deletions(-). The main changes are: 1) Streamline internal BPF program sections handling and bpf_program__set_attach_target() in libbpf, from Andrii. 2) Add support for new btf kind BTF_KIND_TAG, from Yonghong. 3) Introduce bpf_get_branch_snapshot() to capture LBR, from Song. 4) IMUL optimization for x86-64 JIT, from Jie. 5) xsk selftest improvements, from Magnus. 6) Introduce legacy kprobe events support in libbpf, from Rafael. 7) Access hw timestamp through BPF's __sk_buff, from Vadim. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (63 commits) selftests/bpf: Fix a few compiler warnings libbpf: Constify all high-level program attach APIs libbpf: Schedule open_opts.attach_prog_fd deprecation since v0.7 selftests/bpf: Switch fexit_bpf2bpf selftest to set_attach_target() API libbpf: Allow skipping attach_func_name in bpf_program__set_attach_target() libbpf: Deprecated bpf_object_open_opts.relaxed_core_relocs selftests/bpf: Stop using relaxed_core_relocs which has no effect libbpf: Use pre-setup sec_def in libbpf_find_attach_btf_id() bpf: Update bpf_get_smp_processor_id() documentation libbpf: Add sphinx code documentation comments selftests/bpf: Skip btf_tag test if btf_tag attribute not supported docs/bpf: Add documentation for BTF_KIND_TAG selftests/bpf: Add a test with a bpf program with btf_tag attributes selftests/bpf: Test BTF_KIND_TAG for deduplication selftests/bpf: Add BTF_KIND_TAG unit tests selftests/bpf: Change NAME_NTH/IS_NAME_NTH for BTF_KIND_TAG format selftests/bpf: Test libbpf API function btf__add_tag() bpftool: Add support for BTF_KIND_TAG libbpf: Add support for BTF_KIND_TAG libbpf: Rename btf_{hash,equal}_int to btf_{hash,equal}_int_tag ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-09-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski8-29/+516
No conflicts! Signed-off-by: Jakub Kicinski <[email protected]>
2021-09-16net/tls: support SM4 GCM/CCM algorithmTianjia Zhang1-0/+30
The RFC8998 specification defines the use of the ShangMi algorithm cipher suites in TLS 1.3, and also supports the GCM/CCM mode using the SM4 algorithm. Signed-off-by: Tianjia Zhang <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-15bpf: Update bpf_get_smp_processor_id() documentationMatteo Croce1-1/+1
BPF programs run with migration disabled regardless of preemption, as they are protected by migrate_disable(). Update the uapi documentation accordingly. Signed-off-by: Matteo Croce <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-09-15xfrm: make user policy API completeNicolas Dichtel1-3/+6
>From a userland POV, this API was based on some magic values: - dirmask and action were bitfields but meaning of bits (XFRM_POL_DEFAULT_*) are not exported; - action is confusing, if a bit is set, does it mean drop or accept? Let's try to simplify this uapi by using explicit field and macros. Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy") Signed-off-by: Nicolas Dichtel <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2021-09-14drivers/cdrom: improved ioctl for media change detectionLukas Prediger1-0/+19
The current implementation of the CDROM_MEDIA_CHANGED ioctl relies on global state, meaning that only one process can detect a disc change while the ioctl call will return 0 for other calling processes afterwards (see bug 213267). This introduces a new cdrom ioctl, CDROM_TIMED_MEDIA_CHANGE, that works by maintaining a timestamp of the last detected disc change instead of a boolean flag: Processes calling this ioctl command can provide a timestamp of the last disc change known to them and receive an indication whether the disc was changed since then and the updated timestamp. I considered fixing the buggy behavior in the original CDROM_MEDIA_CHANGED ioctl but that would require maintaining state for each calling process in the kernel, which seems like a worse solution than introducing this new ioctl. Signed-off-by: Lukas Prediger <[email protected]> Link: https://lore.kernel.org/all/[email protected] Signed-off-by: Phillip Potter <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-09-14bpf: Support for new btf kind BTF_KIND_TAGYonghong Song1-1/+13
LLVM14 added support for a new C attribute ([1]) __attribute__((btf_tag("arbitrary_str"))) This attribute will be emitted to dwarf ([2]) and pahole will convert it to BTF. Or for bpf target, this attribute will be emitted to BTF directly ([3], [4]). The attribute is intended to provide additional information for - struct/union type or struct/union member - static/global variables - static/global function or function parameter. For linux kernel, the btf_tag can be applied in various places to specify user pointer, function pre- or post- condition, function allow/deny in certain context, etc. Such information will be encoded in vmlinux BTF and can be used by verifier. The btf_tag can also be applied to bpf programs to help global verifiable functions, e.g., specifying preconditions, etc. This patch added basic parsing and checking support in kernel for new BTF_KIND_TAG kind. [1] https://reviews.llvm.org/D106614 [2] https://reviews.llvm.org/D106621 [3] https://reviews.llvm.org/D106622 [4] https://reviews.llvm.org/D109560 Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-09-14btf: Change BTF_KIND_* macros to enumsYonghong Song1-19/+22
Change BTF_KIND_* macros to enums so they are encoded in dwarf and appear in vmlinux.h. This will make it easier for bpf programs to use these constants without macro definitions. Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-09-14net/smc: add generic netlink support for system EIDKarsten Graul1-0/+12
With SMC-Dv2 users can configure if the static system EID should be used during CLC handshake, or if only user EIDs are allowed. Add generic netlink support to enable and disable the system EID, and to retrieve the system EID and its current enabled state. Signed-off-by: Karsten Graul <[email protected]> Reviewed-by: Guvenc Gulce <[email protected]> Signed-off-by: Guvenc Gulce <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-14net/smc: add support for user defined EIDsKarsten Graul1-0/+15
SMC-Dv2 allows users to define EIDs which allows to create separate name spaces enabling users to cluster their SMC-Dv2 connections. Add support for user defined EIDs and extent the generic netlink interface so users can add, remove and dump EIDs. Signed-off-by: Karsten Graul <[email protected]> Reviewed-by: Guvenc Gulce <[email protected]> Signed-off-by: Guvenc Gulce <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-14nitro_enclaves: Add fixes for checkpatch spell check reportsAndra Paraschiv1-5/+5
Fix the typos in the words spelling as per the checkpatch script reports. Reviewed-by: George-Aurelian Popescu <[email protected]> Signed-off-by: Andra Paraschiv <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-09-14include/uapi/linux/xfrm.h: Fix XFRM_MSG_MAPPING ABI breakageEugene Syromiatnikov1-3/+3
Commit 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy") broke ABI by changing the value of the XFRM_MSG_MAPPING enum item, thus also evading the build-time check in security/selinux/nlmsgtab.c:selinux_nlmsg_lookup for presence of proper security permission checks in nlmsg_xfrm_perms. Fix it by placing XFRM_MSG_SETDEFAULT/XFRM_MSG_GETDEFAULT to the end of the enum, right before __XFRM_MSG_MAX, and updating the nlmsg_xfrm_perms accordingly. Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy") References: https://lore.kernel.org/netdev/[email protected]/ Signed-off-by: Eugene Syromiatnikov <[email protected]> Acked-by: Antony Antony <[email protected]> Acked-by: Nicolas Dichtel <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2021-09-14Merge drm/drm-next into drm-misc-nextMaxime Ripard71-491/+2881
Kickstart new drm-misc-next cycle. Signed-off-by: Maxime Ripard <[email protected]>
2021-09-14binder: fix freeze raceLi Li1-0/+7
Currently cgroup freezer is used to freeze the application threads, and BINDER_FREEZE is used to freeze the corresponding binder interface. There's already a mechanism in ioctl(BINDER_FREEZE) to wait for any existing transactions to drain out before actually freezing the binder interface. But freezing an app requires 2 steps, freezing the binder interface with ioctl(BINDER_FREEZE) and then freezing the application main threads with cgroupfs. This is not an atomic operation. The following race issue might happen. 1) Binder interface is frozen by ioctl(BINDER_FREEZE); 2) Main thread A initiates a new sync binder transaction to process B; 3) Main thread A is frozen by "echo 1 > cgroup.freeze"; 4) The response from process B reaches the frozen thread, which will unexpectedly fail. This patch provides a mechanism to check if there's any new pending transaction happening between ioctl(BINDER_FREEZE) and freezing the main thread. If there's any, the main thread freezing operation can be rolled back to finish the pending transaction. Furthermore, the response might reach the binder driver before the rollback actually happens. That will still cause failed transaction. As the other process doesn't wait for another response of the response, the response transaction failure can be fixed by treating the response transaction like an oneway/async one, allowing it to reach the frozen thread. And it will be consumed when the thread gets unfrozen later. NOTE: This patch reuses the existing definition of struct binder_frozen_status_info but expands the bit assignments of __u32 member sync_recv. To ensure backward compatibility, bit 0 of sync_recv still indicates there's an outstanding sync binder transaction. This patch adds new information to bit 1 of sync_recv, indicating the binder transaction happens exactly when there's a race. If an existing userspace app runs on a new kernel, a sync binder call will set bit 0 of sync_recv so ioctl(BINDER_GET_FROZEN_INFO) still return the expected value (true). The app just doesn't check bit 1 intentionally so it doesn't have the ability to tell if there's a race. This behavior is aligned with what happens on an old kernel which doesn't set bit 1 at all. A new userspace app can 1) check bit 0 to know if there's a sync binder transaction happened when being frozen - same as before; and 2) check bit 1 to know if that sync binder transaction happened exactly when there's a race - a new information for rollback decision. the same time, confirmed the pending transactions succeeded. Fixes: 432ff1e91694 ("binder: BINDER_FREEZE ioctl") Acked-by: Todd Kjos <[email protected]> Cc: stable <[email protected]> Signed-off-by: Li Li <[email protected]> Test: stress test with apps being frozen and initiating binder calls at Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-09-13cifs: remove pathname for file from SPDX headerSteve French1-1/+0
checkpatch complains about source files with filenames (e.g. in these cases just below the SPDX header in comments at the top of various files in fs/cifs). It also is helpful to change this now so will be less confusing when the parent directory is renamed e.g. from fs/cifs to fs/smb_client (or fs/smbfs) Reviewed-by: Ronnie Sahlberg <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-09-13bpf: Introduce helper bpf_get_branch_snapshotSong Liu1-0/+22
Introduce bpf_get_branch_snapshot(), which allows tracing pogram to get branch trace from hardware (e.g. Intel LBR). To use the feature, the user need to create perf_event with proper branch_record filtering on each cpu, and then calls bpf_get_branch_snapshot in the bpf function. On Intel CPUs, VLBR event (raw event 0x1b00) can be use for this. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-09-13io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg itemsEugene Syromiatnikov1-1/+7
The items passed in the array pointed by the arg parameter of IORING_REGISTER_IOWQ_MAX_WORKERS io_uring_register operation carry certain semantics: they refer to different io-wq worker categories; provide IO_WQ_* constants in the UAPI, so these categories can be referenced in the user space code. Suggested-by: Jens Axboe <[email protected]> Complements: 2e480058ddc21ec5 ("io-wq: provide a way to limit max number of workers") Signed-off-by: Eugene Syromiatnikov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-09-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds3-1/+317
Pull virtio updates from Michael Tsirkin: - vduse driver ("vDPA Device in Userspace") supporting emulated virtio block devices - virtio-vsock support for end of record with SEQPACKET - vdpa: mac and mq support for ifcvf and mlx5 - vdpa: management netlink for ifcvf - virtio-i2c, gpio dt bindings - misc fixes and cleanups * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits) Documentation: Add documentation for VDUSE vduse: Introduce VDUSE - vDPA Device in Userspace vduse: Implement an MMU-based software IOTLB vdpa: Support transferring virtual addressing during DMA mapping vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap() vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() vhost-iotlb: Add an opaque pointer for vhost IOTLB vhost-vdpa: Handle the failure of vdpa_reset() vdpa: Add reset callback in vdpa_config_ops vdpa: Fix some coding style issues file: Export receive_fd() to modules eventfd: Export eventfd_wake_count to modules iova: Export alloc_iova_fast() and free_iova_fast() virtio-blk: remove unneeded "likely" statements virtio-balloon: Use virtio_find_vqs() helper vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro vsock_test: update message bounds test for MSG_EOR af_vsock: rename variables in receive loop virtio/vsock: support MSG_EOR bit processing vhost/vsock: support MSG_EOR bit processing ...
2021-09-10bpf: Add hardware timestamp field to __sk_buffVadim Fedorenko1-0/+2
BPF programs may want to know hardware timestamps if NIC supports such timestamping. Expose this data as hwtstamp field of __sk_buff the same way as gso_segs/gso_size. This field could be accessed from the same programs as tstamp field, but it's read-only field. Explicit test to deny access to padding data is added to bpf_skb_is_valid_access. Also update BPF_PROG_TEST_RUN tests of the feature. Signed-off-by: Vadim Fedorenko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-09-10locks: remove LOCK_MAND flock lock supportJeff Layton1-0/+4
As best I can tell, the logic for these has been broken for a long time (at least before the move to git), such that they never conflict with anything. Also, nothing checks for these flags and prevented opens or read/write behavior on the files. They don't seem to do anything. Given that, we can rip these symbols out of the kernel, and just make flock(2) return 0 when LOCK_MAND is set in order to preserve existing behavior. Cc: Matthew Wilcox <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Jeff Layton <[email protected]>
2021-09-10Merge tag 'char-misc-5.15-rc1-2' of ↵Linus Torvalds1-20/+166
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull habanalabs updates from Greg KH: "Here is another round of misc driver patches for 5.15-rc1. In here is only updates for the Habanalabs driver. This request is late because the previously-objected-to dma-buf patches are all removed and some fixes that you and others found are now included in here as well. All of these have been in linux-next for well over a week with no reports of problems, and they are all self-contained to only this one driver. Full details are in the shortlog" * tag 'char-misc-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (61 commits) habanalabs/gaudi: hwmon default card name habanalabs: add support for f/w reset habanalabs/gaudi: block ICACHE_BASE_ADDERESS_HIGH in TPC habanalabs: cannot sleep while holding spinlock habanalabs: never copy_from_user inside spinlock habanalabs: remove unnecessary device status check habanalabs: disable IRQ in user interrupts spinlock habanalabs: add "in device creation" status habanalabs/gaudi: invalidate PMMU mem cache on init habanalabs/gaudi: size should be printed in decimal habanalabs/gaudi: define DC POWER for secured PMC habanalabs/gaudi: unmask out of bounds SLM access interrupt habanalabs: add userptr_lookup node in debugfs habanalabs/gaudi: fetch TPC/MME ECC errors from F/W habanalabs: modify multi-CS to wait on stream masters habanalabs/gaudi: add monitored SOBs to state dump habanalabs/gaudi: restore user registers when context opens habanalabs/gaudi: increase boot fit timeout habanalabs: update to latest firmware headers habanalabs/gaudi: minimize number of register reads ...
2021-09-10drm: document drm_mode_create_lease object requirementsSimon Ser1-0/+3
validate_lease expects one CRTC, one connector and one plane. Signed-off-by: Simon Ser <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Pekka Paalanen <[email protected]> Cc: Keith Packard <[email protected]> Reviewed-by: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-09-09Merge tag 'for-linus-5.15-rc1' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Support for VMAP_STACK - Support for splice_write in hostfs - Fixes for virt-pci - Fixes for virtio_uml - Various fixes * tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: fix stub location calculation um: virt-pci: fix uapi documentation um: enable VMAP_STACK um: virt-pci: don't do DMA from stack hostfs: support splice_write um: virtio_uml: fix memory leak on init failures um: virtio_uml: include linux/virtio-uml.h lib/logic_iomem: fix sparse warnings um: make PCI emulation driver init/exit static
2021-09-09Merge tag 'cxl-for-5.15' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull CXL (Compute Express Link) updates from Dan Williams: - Fix detection of CXL host bridges to filter out disabled ACPI0016 devices in the ACPI DSDT. - Fix kernel lockdown integration to disable raw commands when raw PCI access is disabled. - Fix a broken debug message. - Add support for "Get Partition Info". I.e. enumerate the split between volatile and persistent capacity on bi-modal CXL memory expanders. - Re-factor the core by subject area. This is a work in progress. - Prepare libnvdimm to understand CXL labels in addition to EFI labels. This is a work in progress. * tag 'cxl-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (25 commits) cxl/registers: Fix Documentation warning cxl/pmem: Fix Documentation warning cxl/uapi: Fix defined but not used warnings cxl/pci: Fix debug message in cxl_probe_regs() cxl/pci: Fix lockdown level cxl/acpi: Do not add DSDT disabled ACPI0016 host bridge ports libnvdimm/labels: Add claim class helpers libnvdimm/labels: Add type-guid helpers libnvdimm/labels: Add blk special cases for nlabel and position helpers libnvdimm/labels: Add blk isetcookie set / validation helpers libnvdimm/labels: Add a checksum calculation helper libnvdimm/labels: Introduce label setter helpers libnvdimm/labels: Add isetcookie validation helper libnvdimm/labels: Introduce getters for namespace label fields cxl/mem: Adjust ram/pmem range to represent DPA ranges cxl/mem: Account for partitionable space in ram/pmem ranges cxl/pci: Store memory capacity values cxl/pci: Simplify register setup cxl/pci: Ignore unknown register block types cxl/core: Move memdev management to core ...
2021-09-09Merge tag 'dmaengine-5.15-rc1' of ↵Linus Torvalds1-0/+24
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine updates from Vinod Koul: "New drivers/devices - Support for Renesas RZ/G2L dma controller - New driver for AMD PTDMA controller Updates: - Big pile of idxd updates - Updates for Altera driver, stm32-dma, dw etc" * tag 'dmaengine-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (83 commits) dmaengine: sh: fix some NULL dereferences dmaengine: sh: Fix unused initialization of pointer lmdesc MAINTAINERS: Fix AMD PTDMA DRIVER entry dmaengine: ptdma: remove PT_OFFSET to avoid redefnition dmaengine: ptdma: Add debugfs entries for PTDMA dmaengine: ptdma: register PTDMA controller as a DMA resource dmaengine: ptdma: Initial driver for the AMD PTDMA dmaengine: fsl-dpaa2-qdma: Fix spelling mistake "faile" -> "failed" dmaengine: idxd: remove interrupt disable for dev_lock dmaengine: idxd: remove interrupt disable for cmd_lock dmaengine: idxd: fix setting up priv mode for dwq dmaengine: xilinx_dma: Set DMA mask for coherent APIs dmaengine: ti: k3-psil-j721e: Add entry for CSI2RX dmaengine: sh: Add DMAC driver for RZ/G2L SoC dmaengine: Extend the dma_slave_width for 128 bytes dt-bindings: dma: Document RZ/G2L bindings dmaengine: ioat: depends on !UML dmaengine: idxd: set descriptor allocation size to threshold for swq dmaengine: idxd: make submit failure path consistent on desc freeing dmaengine: idxd: remove interrupt flag for completion list spinlock ...
2021-09-08compat: remove some compat entry pointsArnd Bergmann1-5/+5
These are all handled correctly when calling the native system call entry point, so remove the special cases. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Feng Tang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-07Merge tag 'net-5.15-rc1' of ↵Linus Torvalds2-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes and stragglers from Jakub Kicinski: "Networking stragglers and fixes, including changes from netfilter, wireless and can. Current release - regressions: - qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi - ip_gre: validate csum_start only on pull - bnxt_en: fix 64-bit doorbell operation on 32-bit kernels - ionic: fix double use of queue-lock, fix a sleeping in atomic - can: c_can: fix null-ptr-deref on ioctl() - cs89x0: disable compile testing on powerpc Current release - new code bugs: - bridge: mcast: fix vlan port router deadlock, consistently disable BH Previous releases - regressions: - dsa: tag_rtl4_a: fix egress tags, only port 0 was working - mptcp: fix possible divide by zero - netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex - netfilter: socket: icmp6: fix use-after-scope - stmmac: fix MAC not working when system resume back with WoL active Previous releases - always broken: - ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address - seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6 - mptcp: only send extra TCP acks in eligible socket states - dsa: lantiq_gswip: fix maximum frame length - stmmac: fix overall budget calculation for rxtx_napi - bnxt_en: fix firmware version reporting via devlink - renesas: sh_eth: add missing barrier to fix freeing wrong tx descriptor Stragglers: - netfilter: conntrack: switch to siphash - netfilter: refuse insertion if chain has grown too large - ncsi: add get MAC address command to get Intel i210 MAC address" * tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits) ieee802154: Remove redundant initialization of variable ret net: stmmac: fix MAC not working when system resume back with WoL active net: phylink: add suspend/resume support net: renesas: sh_eth: Fix freeing wrong tx descriptor bonding: 3ad: pass parameter bond_params by reference cxgb3: fix oops on module removal can: c_can: fix null-ptr-deref on ioctl() can: rcar_canfd: add __maybe_unused annotation to silence warning net: wwan: iosm: Unify IO accessors used in the driver net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors net: qcom/emac: Replace strlcpy with strscpy ip6_gre: Revert "ip6_gre: add validation for csum_start" net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static selftests/bpf: Test XDP bonding nest and unwind bonding: Fix negative jump label count on nested bonding MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry stmmac: dwmac-loongson:Fix missing return value iwlwifi: fix printk format warnings in uefi.c net: create netdev->dev_addr assignment helpers bnxt_en: Fix possible unintended driver initiated error recovery ...
2021-09-07Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-4/+7
Pull KVM updates from Paolo Bonzini: "ARM: - Page ownership tracking between host EL1 and EL2 - Rely on userspace page tables to create large stage-2 mappings - Fix incompatibility between pKVM and kmemleak - Fix the PMU reset state, and improve the performance of the virtual PMU - Move over to the generic KVM entry code - Address PSCI reset issues w.r.t. save/restore - Preliminary rework for the upcoming pKVM fixed feature - A bunch of MM cleanups - a vGIC fix for timer spurious interrupts - Various cleanups s390: - enable interpretation of specification exceptions - fix a vcpu_idx vs vcpu_id mixup x86: - fast (lockless) page fault support for the new MMU - new MMU now the default - increased maximum allowed VCPU count - allow inhibit IRQs on KVM_RUN while debugging guests - let Hyper-V-enabled guests run with virtualized LAPIC as long as they do not enable the Hyper-V "AutoEOI" feature - fixes and optimizations for the toggling of AMD AVIC (virtualized LAPIC) - tuning for the case when two-dimensional paging (EPT/NPT) is disabled - bugfixes and cleanups, especially with respect to vCPU reset and choosing a paging mode based on CR0/CR4/EFER - support for 5-level page table on AMD processors Generic: - MMU notifier invalidation callbacks do not take mmu_lock unless necessary - improved caching of LRU kvm_memory_slot - support for histogram statistics - add statistics for halt polling and remote TLB flush requests" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits) KVM: Drop unused kvm_dirty_gfn_invalid() KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted KVM: MMU: mark role_regs and role accessors as maybe unused KVM: MIPS: Remove a "set but not used" variable x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait KVM: stats: Add VM stat for remote tlb flush requests KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count() KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()" KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710 kvm: x86: Increase MAX_VCPUS to 1024 kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host KVM: s390: index kvm->arch.idle_mask by vcpu_idx KVM: s390: Enable specification exception interpretation KVM: arm64: Trim guest debug exception handling KVM: SVM: Add 5-level page table support for SVM ...
2021-09-07Merge tag 'gpio-updates-for-v5.15' of ↵Linus Torvalds2-0/+48
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio updates from Bartosz Golaszewski: "We mostly have various improvements and refactoring all over the place but also some interesting new features - like the virtio GPIO driver that allows guest VMs to use host's GPIOs. We also have a new/old GPIO driver for rockchip - this one has been split out of the pinctrl driver. Summary: - new driver: gpio-virtio allowing a guest VM running linux to access GPIO lines provided by the host - split the GPIO driver out of the rockchip pin control driver - add support for a new model to gpio-aspeed-sgpio, refactor the driver and use generic device property interfaces, improve property sanitization - add ACPI support to gpio-tegra186 - improve the code setting the line names to support multiple GPIO banks per device - constify a bunch of OF functions in the core GPIO code and make the declaration for one of the core OF functions we use consistent within its header - use software nodes in intel_quark_i2c_gpio - add support for the gpio-line-names property in gpio-mt7621 - use the standard GPIO function for setting the GPIO names in gpio-brcmstb - fix a bunch of leaks and other bugs in gpio-mpc8xxx - use generic pm callbacks in gpio-ml-ioh - improve resource management and PM handling in gpio-mlxbf2 - modernize and improve the gpio-dwapb driver - coding style improvements in gpio-rcar - documentation fixes and improvements - update the MAINTAINERS entry for gpio-zynq - minor tweaks in several drivers" * tag 'gpio-updates-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (35 commits) gpio: mpc8xxx: Use 'devm_gpiochip_add_data()' to simplify the code and avoid a leak gpio: mpc8xxx: Fix a potential double iounmap call in 'mpc8xxx_probe()' gpio: mpc8xxx: Fix a resources leak in the error handling path of 'mpc8xxx_probe()' gpio: viperboard: remove platform_set_drvdata() call in probe gpio: virtio: Add missing mailings lists in MAINTAINERS entry gpio: virtio: Fix sparse warnings gpio: remove the obsolete MX35 3DS BOARD MC9S08DZ60 GPIO functions gpio: max730x: Use the right include gpio: Add virtio-gpio driver gpio: mlxbf2: Use DEFINE_RES_MEM_NAMED() helper macro gpio: mlxbf2: Use devm_platform_ioremap_resource() gpio: mlxbf2: Drop wrong use of ACPI_PTR() gpio: mlxbf2: Convert to device PM ops gpio: dwapb: Get rid of legacy platform data mfd: intel_quark_i2c_gpio: Convert GPIO to use software nodes gpio: dwapb: Read GPIO base from gpio-base property gpio: dwapb: Unify ACPI enumeration checks in get_irq() and configure_irqs() gpiolib: Deduplicate forward declaration in the consumer.h header MAINTAINERS: update gpio-zynq.yaml reference gpio: tegra186: Add ACPI support ...
2021-09-07cxl/uapi: Fix defined but not used warningsBen Widawsky1-1/+1
Fix unused-const-variable warnings emitted by gcc when cxlmem.h is used by pretty much all files except pci.c Signed-off-by: Ben Widawsky <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Link: https://lore.kernel.org/r/163072205652.2250120.16833548560832424468.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]>
2021-09-06vduse: Introduce VDUSE - vDPA Device in UserspaceXie Yongji1-0/+306
This VDUSE driver enables implementing software-emulated vDPA devices in userspace. The vDPA device is created by ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device interface (/dev/vduse/$NAME) is exported to userspace for device emulation. In order to make the device emulation more secure, the device's control path is handled in kernel. A message mechnism is introduced to forward some dataplane related control messages to userspace. And in the data path, the DMA buffer will be mapped into userspace address space through different ways depending on the vDPA bus to which the vDPA device is attached. In virtio-vdpa case, the MMU-based software IOTLB is used to achieve that. And in vhost-vdpa case, the DMA buffer is reside in a userspace memory region which can be shared to the VDUSE userspace processs via transferring the shmfd. For more details on VDUSE design and usage, please see the follow-on Documentation commit. NB(mst): when merging this with b542e383d8c0 ("eventfd: Make signal recursion protection a task bit") replace eventfd_signal_count with eventfd_signal_allowed, and drop the previous ("eventfd: Export eventfd_wake_count to modules"). Signed-off-by: Xie Yongji <[email protected]> Acked-by: Jason Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]>
2021-09-06Merge tag 'kvmarm-5.15' of ↵Paolo Bonzini2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 5.15 - Page ownership tracking between host EL1 and EL2 - Rely on userspace page tables to create large stage-2 mappings - Fix incompatibility between pKVM and kmemleak - Fix the PMU reset state, and improve the performance of the virtual PMU - Move over to the generic KVM entry code - Address PSCI reset issues w.r.t. save/restore - Preliminary rework for the upcoming pKVM fixed feature - A bunch of MM cleanups - a vGIC fix for timer spurious interrupts - Various cleanups
2021-09-05virtio/vsock: add 'VIRTIO_VSOCK_SEQ_EOR' bit.Arseny Krasnov1-0/+1
This bit is used to handle POSIX MSG_EOR flag passed from userspace in 'send*()' system calls. It marks end of each record and is visible to receiver using 'recvmsg()' system call. Signed-off-by: Arseny Krasnov <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]>
2021-09-05virtio/vsock: rename 'EOR' to 'EOM' bit.Arseny Krasnov1-1/+1
This current implemented bit is used to mark end of messages ('EOM' - end of message), not records('EOR' - end of record). Also rename 'record' to 'message' in implementation as it is different things. Signed-off-by: Arseny Krasnov <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]>
2021-09-05uapi: virtio_ids: Sync ids with specificationViresh Kumar1-0/+12
This synchronizes the virtio ids with the latest list from virtio specification. Signed-off-by: Viresh Kumar <[email protected]> Link: https://lore.kernel.org/r/61b27e3bc61fb0c9f067001e95cfafc5d37d414a.1627362340.git.viresh.kumar@linaro.org Signed-off-by: Michael S. Tsirkin <[email protected]>
2021-09-04fq_codel: reject silly quantum parametersEric Dumazet1-0/+2
syzbot found that forcing a big quantum attribute would crash hosts fast, essentially using this: tc qd replace dev eth0 root fq_codel quantum 4294967295 This is because fq_codel_dequeue() would have to loop ~2^31 times in : if (flow->deficit <= 0) { flow->deficit += q->quantum; list_move_tail(&flow->flowchain, &q->old_flows); goto begin; } SFQ max quantum is 2^19 (half a megabyte) Lets adopt a max quantum of one megabyte for FQ_CODEL. Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski1-0/+1
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Protect nft_ct template with global mutex, from Pavel Skripkin. 2) Two recent commits switched inet rt and nexthop exception hashes from jhash to siphash. If those two spots are problematic then conntrack is affected as well, so switch voer to siphash too. While at it, add a hard upper limit on chain lengths and reject insertion if this is hit. Patches from Florian Westphal. 3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN, from Benjamin Hesmans. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: socket: icmp6: fix use-after-scope netfilter: refuse insertion if chain has grown too large netfilter: conntrack: switch to siphash netfilter: conntrack: sanitize table size default settings netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-09-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-1/+4
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <[email protected]>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03mm: wire up syscall process_mreleaseSuren Baghdasaryan1-1/+3
Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Jan Engelhardt <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tim Murray <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodesDave Hansen1-0/+1
Patch series "Introduce multi-preference mempolicy", v7. This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy. This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2) interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a preference for nodes which will fulfil memory allocation requests. Unlike the MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the OOM killer if those preferred nodes are not available. Along with these patches are patches for libnuma, numactl, numademo, and memhog. They still need some polish, but can be found here: https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many It allows new usage: `numactl -P 0,3,4` The goal of the new mode is to enable some use-cases when using tiered memory usage models which I've lovingly named. 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency requirements allowing preference to be given to all nodes with "fast" memory. 1b. The Indiscriminate Hare - An application knows it wants fast memory (or perhaps slow memory), but doesn't care which node it runs on. The application can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator, etc). This reverses the nodes are chosen today where the kernel attempts to use local memory to the CPU whenever possible. This will attempt to use the local accelerator to the memory. 2. The Tortoise - The administrator (or the application itself) is aware it only needs slow memory, and so can prefer that. Much of this is almost achievable with the bind interface, but the bind interface suffers from an inability to fallback to another set of nodes if binding fails to all nodes in the nodemask. Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the preference. > /* Set first two nodes as preferred in an 8 node system. */ > const unsigned long nodes = 0x3 > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8); > /* Mimic interleave policy, but have fallback *. > const unsigned long nodes = 0xaa > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8); Some internal discussion took place around the interface. There are two alternatives which we have discussed, plus one I stuck in: 1. Ordered list of nodes. Currently it's believed that the added complexity is nod needed for expected usecases. 2. A flag for bind to allow falling back to other nodes. This confuses the notion of binding and is less flexible than the current solution. 3. Create flags or new modes that helps with some ordering. This offers both a friendlier API as well as a solution for more customized usage. It's unknown if it's worth the complexity to support this. Here is sample code for how this might work: > // Prefer specific nodes for some something wacky > set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024); > > // Default > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0); > // which is the same as > set_mempolicy(MPOL_DEFAULT, NULL, 0); > > // The Hare > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0); > > // The Tortoise > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0); > > // Prefer the fast memory of the first two sockets > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2); > This patch (of 5): The NUMA APIs currently allow passing in a "preferred node" as a single bit set in a nodemask. If more than one bit it set, bits after the first are ignored. This single node is generally OK for location-based NUMA where memory being allocated will eventually be operated on by a single CPU. However, in systems with multiple memory types, folks want to target a *type* of memory instead of a location. For instance, someone might want some high-bandwidth memory but do not care about the CPU next to which it is allocated. Or, they want a cheap, high capacity allocation and want to target all NUMA nodes which have persistent memory in volatile mode. In both of these cases, the application wants to target a *set* of nodes, but does not want strict MPOL_BIND behavior as that could lead to OOM killer or SIGSEGV. So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes requirement. This is not a pie-in-the-sky dream for an API. This was a response to a specific ask of more than one group at Intel. Specifically: 1. There are existing libraries that target memory types such as https://github.com/memkind/memkind. These are known to suffer from SIGSEGV's when memory is low on targeted memory "kinds" that span more than one node. The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an example of this. 2. Volatile-use persistent memory users want to have a memory policy which is targeted at either "cheap and slow" (PMEM) or "expensive and fast" (DRAM). However, they do not want to experience allocation failures when the targeted type is unavailable. 3. Allocate-then-run. Generally, we let the process scheduler decide on which physical CPU to run a task. That location provides a default allocation policy, and memory availability is not generally considered when placing tasks. For situations where memory is valuable and constrained, some users want to allocate memory first, *then* allocate close compute resources to the allocation. This is the reverse of the normal (CPU) model. Accelerators such as GPUs that operate on core-mm-managed memory are interested in this model. A check is added in sanitize_mpol_flags() to not permit 'prefer_many' policy to be used for now, and will be removed in later patch after all implementations for 'prefer_many' are ready, as suggested by Michal Hocko. [[email protected]: suggest to refine policy_node/policy_nodemask handling] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Co-developed-by: Ben Widawsky <[email protected]> Signed-off-by: Ben Widawsky <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Feng Tang <[email protected]> Cc: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Dan Williams <[email protected]> Cc: Huang Ying <[email protected]>b Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-02Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds2-0/+108
Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (ufs, qla2xxx, target, smartpqi, lpfc, mpt3sas). The core change causing the most churn was replacing the command request field request with a macro, allowing us to offset map to it and remove the redundant field; the same was also done for the tag field. The most impactful change is the final removal of scsi_ioctl, which has been deprecated for over a decade" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits) scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1 scsi: ufs: ufs-exynos: Fix static checker warning scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Copyright updates for 14.0.0.1 patches scsi: lpfc: Update lpfc version to 14.0.0.1 scsi: lpfc: Add bsg support for retrieving adapter cmf data scsi: lpfc: Add cmf_info sysfs entry scsi: lpfc: Add debugfs support for cm framework buffers scsi: lpfc: Add support for maintaining the cm statistics buffer scsi: lpfc: Add rx monitoring statistics scsi: lpfc: Add support for the CM framework scsi: lpfc: Add cmfsync WQE support scsi: lpfc: Add support for cm enablement buffer scsi: lpfc: Add cm statistics buffer support scsi: lpfc: Add EDC ELS support scsi: lpfc: Expand FPIN and RDF receive logging scsi: lpfc: Add MIB feature enablement support scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware scsi: fc: Add EDC ELS definition ...
2021-09-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-2/+15
Pull rdma updates from Jason Gunthorpe: "This is quite a small cycle, no major series stands out. The HNS and rxe drivers saw the most activity this cycle, with rxe being broken for a good chunk of time. The significant deleted line count is due to a SPDX cleanup series. Summary: - Various cleanup and small features for rtrs - kmap_local_page() conversions - Driver updates and fixes for: efa, rxe, mlx5, hfi1, qed, hns - Cache the IB subnet prefix - Rework how CRC is calcuated in rxe - Clean reference counting in iwpm's netlink - Pull object allocation and lifecycle for user QPs to the uverbs core code - Several small hns features and continued general code cleanups - Fix the scatterlist confusion of orig_nents/nents introduced in an earlier patch creating the append operation" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (90 commits) RDMA/mlx5: Relax DCS QP creation checks RDMA/hns: Delete unnecessary blank lines. RDMA/hns: Encapsulate the qp db as a function RDMA/hns: Adjust the order in which irq are requested and enabled RDMA/hns: Remove RST2RST error prints for hw v1 RDMA/hns: Remove dqpn filling when modify qp from Init to Init RDMA/hns: Fix QP's resp incomplete assignment RDMA/hns: Fix query destination qpn RDMA/hfi1: Convert to SPDX identifier IB/rdmavt: Convert to SPDX identifier RDMA/hns: Bugfix for incorrect association between dip_idx and dgid RDMA/hns: Bugfix for the missing assignment for dip_idx RDMA/hns: Bugfix for data type of dip_idx RDMA/hns: Fix incorrect lsn field RDMA/irdma: Remove the repeated declaration RDMA/core/sa_query: Retry SA queries RDMA: Use the sg_table directly and remove the opencoded version from umem lib/scatterlist: Fix wrong update of orig_nents lib/scatterlist: Provide a dedicated function to support table append RDMA/hns: Delete unused hns bitmap interface ...
2021-09-01Merge tag 'drivers-5.15' of ↵Linus Torvalds2-0/+25
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC driver updates from Arnd Bergmann: "These are updates for drivers that are tied to a particular SoC, including the correspondig device tree bindings: - A couple of reset controller changes for unisoc, uniphier, renesas and zte platforms - memory controller driver fixes for omap and tegra - Rockchip io domain driver updates - Lots of updates for qualcomm platforms, mostly touching their firmware and power management drivers - Tegra FUSE and firmware driver updateѕ - Support for virtio transports in the SCMI firmware framework - cleanup of ixp4xx drivers, towards enabling multiplatform support and bringing it up to date with modern platforms - Minor updates for keystone, mediatek, omap, renesas" * tag 'drivers-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (96 commits) reset: simple: remove ZTE details in Kconfig help soc: rockchip: io-domain: Remove unneeded semicolon soc: rockchip: io-domain: add rk3568 support dt-bindings: power: add rk3568-pmu-io-domain support bus: ixp4xx: return on error in ixp4xx_exp_probe() soc: renesas: Prefer memcpy() over strcpy() firmware: tegra: Stop using seq_get_buf() soc/tegra: fuse: Enable fuse clock on suspend for Tegra124 soc/tegra: fuse: Add runtime PM support soc/tegra: fuse: Clear fuse->clk on driver probe failure soc/tegra: pmc: Prevent racing with cpuilde driver soc/tegra: bpmp: Remove unused including <linux/version.h> dt-bindings: soc: ti: pruss: Add dma-coherent property soc: ti: Remove pm_runtime_irq_safe() usage for smartreflex soc: ti: pruss: Enable support for ICSSG subsystems on K3 AM64x SoCs dt-bindings: soc: ti: pruss: Update bindings for K3 AM64x SoCs firmware: arm_scmi: Use WARN_ON() to check configured transports firmware: arm_scmi: Fix boolconv.cocci warnings soc: mediatek: mmsys: Fix missing UFOE component in mt8173 table routing soc: mediatek: mmsys: add MT8365 support ...
2021-09-01Merge tag 'arm64-upstream' of ↵Linus Torvalds1-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - Support for 32-bit tasks on asymmetric AArch32 systems (on top of the scheduler changes merged via the tip tree). - More entry.S clean-ups and conversion to C. - MTE updates: allow a preferred tag checking mode to be set per CPU (the overhead of synchronous mode is smaller for some CPUs than others); optimisations for kernel entry/exit path; optionally disable MTE on the kernel command line. - Kselftest improvements for SVE and signal handling, PtrAuth. - Fix unlikely race where a TLBI could use stale ASID on an ASID roll-over (found by inspection). - Miscellaneous fixes: disable trapping of PMSNEVFR_EL1 to higher exception levels; drop unnecessary sigdelsetmask() call in the signal32 handling; remove BUG_ON when failing to allocate SVE state (just signal the process); SYM_CODE annotations. - Other trivial clean-ups: use macros instead of magic numbers, remove redundant returns, typos. * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (56 commits) arm64: Do not trap PMSNEVFR_EL1 arm64: mm: fix comment typo of pud_offset_phys() arm64: signal32: Drop pointless call to sigdelsetmask() arm64/sve: Better handle failure to allocate SVE register storage arm64: Document the requirement for SCR_EL3.HCE arm64: head: avoid over-mapping in map_memory arm64/sve: Add a comment documenting the binutils needed for SVE asm arm64/sve: Add some comments for sve_save/load_state() kselftest/arm64: signal: Add a TODO list for signal handling tests kselftest/arm64: signal: Add test case for SVE register state in signals kselftest/arm64: signal: Verify that signals can't change the SVE vector length kselftest/arm64: signal: Check SVE signal frame shows expected vector length kselftest/arm64: signal: Support signal frames with SVE register data kselftest/arm64: signal: Add SVE to the set of features we can check for arm64: replace in_irq() with in_hardirq() kselftest/arm64: pac: Fix skipping of tests on systems without PAC Documentation: arm64: describe asymmetric 32-bit support arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0 arm64: Advertise CPUs capable of running 32-bit applications in sysfs ...
2021-09-01Merge branch 'exit-cleanups-for-v5.15' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exit cleanups from Eric Biederman: "In preparation of doing something about PTRACE_EVENT_EXIT I have started cleaning up various pieces of code related to do_exit. Most of that code I did not manage to get tested and reviewed before the merge window opened but a handful of very useful cleanups are ready to be merged. The first change is simply the removal of the bdflush system call. The code has now been disabled long enough that even the oldest userspace working userspace setups anyone can find to test are fine with the bdflush system call being removed. Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of calling do_exit directly is interesting only in that it is nearly the most difficult of the incorrect uses of do_exit to remove. The change to the seccomp code to simply send a signal instead of calling do_coredump directly is a very nice little cleanup made possible by realizing the existing signal sending helpers were missing a little bit of functionality that is easy to provide" * 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal/seccomp: Dump core when there is only one live thread signal/seccomp: Refactor seccomp signal and coredump generation signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die exit/bdflush: Remove the deprecated bdflush system call
2021-09-01Merge branch 'siginfo-si_trapno-for-v5.15' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo si_trapno updates from Eric Biederman: "The full set of si_trapno changes was not appropriate as a fix for the newly added SIGTRAP TRAP_PERF, and so I postponed the rest of the related cleanups. This is the rest of the cleanups for si_trapno that reduces it from being a really weird arch special case that is expect to be always present (but isn't) on the architectures that support it to being yet another field in the _sigfault union of struct siginfo. The changes have been reviewed and marinated in linux-next. With the removal of this awkward special case new code (like SIGTRAP TRAP_PERF) that works across architectures should be easier to write and maintain" * 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency signal: Verify the alignment and size of siginfo_t signal: Remove the generic __ARCH_SI_TRAPNO support signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP arm64: Add compile-time asserts for siginfo_t offsets arm: Add compile-time asserts for siginfo_t offsets sparc64: Add compile-time asserts for siginfo_t offsets