aboutsummaryrefslogtreecommitdiff
path: root/include/uapi/linux
AgeCommit message (Collapse)AuthorFilesLines
2019-11-02bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpersDaniel Borkmann1-40/+82
The current bpf_probe_read() and bpf_probe_read_str() helpers are broken in that they assume they can be used for probing memory access for kernel space addresses /as well as/ user space addresses. However, plain use of probe_kernel_read() for both cases will attempt to always access kernel space address space given access is performed under KERNEL_DS and some archs in-fact have overlapping address spaces where a kernel pointer and user pointer would have the /same/ address value and therefore accessing application memory via bpf_probe_read{,_str}() would read garbage values. Lets fix BPF side by making use of recently added 3d7081822f7f ("uaccess: Add non-pagefault user-space read functions"). Unfortunately, the only way to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}() helpers are kept as-is to retain their current behavior. The two *_user() variants attempt the access always under USER_DS set, the two *_kernel() variants will -EFAULT when accessing user memory if the underlying architecture has non-overlapping address ranges, also avoiding throwing the kernel warning via 00c42373d397 ("x86-64: add warning for non-canonical user access address dereferences"). Fixes: a5e8c07059d0 ("bpf: add bpf_probe_read_str helper") Fixes: 2541517c32be ("tracing, perf: Implement BPF programs attached to kprobes") Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
2019-11-01io_uring: support for generic async request cancelJens Axboe1-0/+2
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to cancel requests that have been punted to async context and are now in-flight. This works for regular read/write requests to files, as long as they haven't been started yet. For socket based IO (or things like accept4(2)), we can cancel work that is already running as well. To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT. Signed-off-by: Jens Axboe <[email protected]>
2019-10-31bpf: Replace prog_raw_tp+btf_id with prog_tracingAlexei Starovoitov1-0/+2
The bpf program type raw_tp together with 'expected_attach_type' was the most appropriate api to indicate BTF-enabled raw_tp programs. But during development it became apparent that 'expected_attach_type' cannot be used and new 'attach_btf_id' field had to be introduced. Which means that the information is duplicated in two fields where one of them is ignored. Clean it up by introducing new program type where both 'expected_attach_type' and 'attach_btf_id' fields have specific meaning. In the future 'expected_attach_type' will be extended with other attach points that have similar semantics to raw_tp. This patch is replacing BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with prog_type = BPF_RPOG_TYPE_TRACING expected_attach_type = BPF_TRACE_RAW_TP attach_btf_id = btf_id of raw tracepoint inside the kernel Future patches will add expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT where programs have the same input context and the same helpers, but different attach points. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2019-10-30net: sched: extend TCA_ACT space with TCA_ACT_FLAGSVlad Buslov1-0/+5
Extend TCA_ACT space with nla_bitfield32 flags. Add TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in tcf_action_init_1() and pass resulting value as additional argument to a_o->init(). Signed-off-by: Vlad Buslov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-30Revert "dma-buf: Add dma-buf heaps framework"Sean Paul1-55/+0
This reverts commit a69b0e855d3fd278ff6f09a23e1edf929538e304. This patchset doesn't meet the UAPI requirements set out in [1] for the DRM subsystem. Once the userspace component is reviewed and ready for merge we can try again. [1]- https://01.org/linuxgraphics/gfx-docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements Fixes: a69b0e855d3f ("dma-buf: Add dma-buf heaps framework") Cc: Laura Abbott <[email protected]> Cc: Benjamin Gaignard <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Liam Mark <[email protected]> Cc: Pratik Patel <[email protected]> Cc: Brian Starkey <[email protected]> Cc: Vincent Donnefort <[email protected]> Cc: Sudipto Paul <[email protected]> Cc: Andrew F. Davis <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chenbo Feng <[email protected]> Cc: Alistair Strachan <[email protected]> Cc: Hridya Valsaraju <[email protected]> Cc: Hillf Danton <[email protected]> Cc: [email protected] Cc: Brian Starkey <[email protected]> Cc: John Stultz <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Rob Herring <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Sean Paul <[email protected]> Cc: "Andrew F. Davis" <[email protected]> Cc: [email protected] Cc: [email protected] Acked-by: David Airlie <[email protected]> Signed-off-by: Sean Paul <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2019-10-30tipc: add smart nagle featureJon Maloy1-0/+1
We introduce a feature that works like a combination of TCP_NAGLE and TCP_CORK, but without some of the weaknesses of those. In particular, we will not observe long delivery delays because of delayed acks, since the algorithm itself decides if and when acks are to be sent from the receiving peer. - The nagle property as such is determined by manipulating a new 'maxnagle' field in struct tipc_sock. If certain conditions are met, 'maxnagle' will define max size of the messages which can be bundled. If it is set to zero no messages are ever bundled, implying that the nagle property is disabled. - A socket with the nagle property enabled enters nagle mode when more than 4 messages have been sent out without receiving any data message from the peer. - A socket leaves nagle mode whenever it receives a data message from the peer. In nagle mode, messages smaller than 'maxnagle' are accumulated in the socket write queue. The last buffer in the queue is marked with a new 'ack_required' bit, which forces the receiving peer to send a CONN_ACK message back to the sender upon reception. The accumulated contents of the write queue is transmitted when one of the following events or conditions occur. - A CONN_ACK message is received from the peer. - A data message is received from the peer. - A SOCK_WAKEUP pseudo message is received from the link level. - The write queue contains more than 64 1k blocks of data. - The connection is being shut down. - There is no CONN_ACK message to expect. I.e., there is currently no outstanding message where the 'ack_required' bit was set. As a consequence, the first message added after we enter nagle mode is always sent directly with this bit set. This new feature gives a 50-100% improvement of throughput for small (i.e., less than MTU size) messages, while it might add up to one RTT to latency time when the socket is in nagle mode. Acked-by: Ying Xue <[email protected]> Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-29io_uring: add support for IORING_OP_ACCEPTJens Axboe1-1/+6
This allows an application to call accept4() in an async fashion. Like other opcodes, we first try a non-blocking accept, then punt to async context if we have to. Signed-off-by: Jens Axboe <[email protected]>
2019-10-29Merge tag 'fuse-fixes-5.4-rc6' of ↵Linus Torvalds1-0/+37
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Mostly virtiofs fixes, but also fixes a regression and couple of longstanding data/metadata writeback ordering issues" * tag 'fuse-fixes-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: redundant get_fuse_inode() calls in fuse_writepages_fill() fuse: Add changelog entries for protocols 7.1 - 7.8 fuse: truncate pending writes on O_TRUNC fuse: flush dirty data/metadata before non-truncate setattr virtiofs: Remove set but not used variable 'fc' virtiofs: Retry request submission from worker context virtiofs: Count pending forgets as in_flight forgets virtiofs: Set FR_SENT flag only after request has been sent virtiofs: No need to check fpq->connected state virtiofs: Do not end request in submission context fuse: don't advise readdirplus for negative lookup fuse: don't dereference req->args on finished request virtio-fs: don't show mount options virtio-fs: Change module name to virtiofs.ko
2019-10-29io_uring: add support for canceling timeout requestsJens Axboe1-0/+1
We might have cases where the need for a specific timeout is gone, add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field. Signed-off-by: Jens Axboe <[email protected]>
2019-10-29io_uring: add support for absolute timeoutsJens Axboe1-0/+5
This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring. Signed-off-by: Jens Axboe <[email protected]>
2019-10-29io_uring: allow application controlled CQ ring sizeJens Axboe1-0/+1
We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE life time is different than that of the IO request itself, the SQE is consumed as soon as the kernel has seen the entry. Certain application don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently. If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two. Signed-off-by: Jens Axboe <[email protected]>
2019-10-29io_uring: add support for IORING_REGISTER_FILES_UPDATEJens Axboe1-0/+6
Allows the application to remove/replace/add files to/from a file set. Passes in a struct: struct io_uring_files_update { __u32 offset; __s32 *fds; }; that holds an array of fds, size of array passed in through the usual nr_args part of the io_uring_register() system call. The logic is as follows: 1) If ->fds[i] is -1, the existing file at i + ->offset is removed from the set. 2) If ->fds[i] is a valid fd, the existing file at i + ->offset is replaced with ->fds[i]. For case #2, is the existing file is currently empty (fd == -1), the new fd is simply added to the array. Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-28net: Fix misspellings of "configure" and "configuration"Geert Uytterhoeven1-1/+1
Fix various misspellings of "configuration" and "configure". Signed-off-by: Geert Uytterhoeven <[email protected]> Acked-by: Kalle Valo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-28seccomp: rework define for SECCOMP_USER_NOTIF_FLAG_CONTINUEChristian Brauner1-1/+1
Switch from BIT(0) to (1UL << 0). First, there are already two different forms used in the header, so there's no need to add a third. Second, the BIT() macros is kernel internal and afaict not actually exposed to userspace. Maybe there's some magic there I'm missing but it definitely causes issues when compiling a program that tries to use SECCOMP_USER_NOTIF_FLAG_CONTINUE. It currently fails in the following way: # github.com/lxc/lxd/lxd /usr/bin/ld: $WORK/b001/_x003.o: in function `__do_user_notification_continue': lxd/main_checkfeature.go:240: undefined reference to `BIT' collect2: error: ld returned 1 exit status Switching to (1UL << 0) should prevent that and is more in line what is already done in the rest of the header. Cc: Kees Cook <[email protected]> Cc: Andy Lutomirski <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2019-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-1/+27
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-10-27 The following pull-request contains BPF updates for your *net-next* tree. We've added 52 non-merge commits during the last 11 day(s) which contain a total of 65 files changed, 2604 insertions(+), 1100 deletions(-). The main changes are: 1) Revolutionize BPF tracing by using in-kernel BTF to type check BPF assembly code. The work here teaches BPF verifier to recognize kfree_skb()'s first argument as 'struct sk_buff *' in tracepoints such that verifier allows direct use of bpf_skb_event_output() helper used in tc BPF et al (w/o probing memory access) that dumps skb data into perf ring buffer. Also add direct loads to probe memory in order to speed up/replace bpf_probe_read() calls, from Alexei Starovoitov. 2) Big batch of changes to improve libbpf and BPF kselftests. Besides others: generalization of libbpf's CO-RE relocation support to now also include field existence relocations, revamp the BPF kselftest Makefile to add test runner concept allowing to exercise various ways to build BPF programs, and teach bpf_object__open() and friends to automatically derive BPF program type/expected attach type from section names to ease their use, from Andrii Nakryiko. 3) Fix deadlock in stackmap's build-id lookup on rq_lock(), from Song Liu. 4) Allow to read BTF as raw data from bpftool. Most notable use case is to dump /sys/kernel/btf/vmlinux through this, from Jiri Olsa. 5) Use bpf_redirect_map() helper in libbpf's AF_XDP helper prog which manages to improve "rx_drop" performance by ~4%., from Björn Töpel. 6) Fix to restore the flow dissector after reattach BPF test and also fix error handling in bpf_helper_defs.h generation, from Jakub Sitnicki. 7) Improve verifier's BTF ctx access for use outside of raw_tp, from Martin KaFai Lau. 8) Improve documentation for AF_XDP with new sections and to reflect latest features, from Magnus Karlsson. 9) Add back 'version' section parsing to libbpf for old kernels, from John Fastabend. 10) Fix strncat bounds error in libbpf's libbpf_prog_type_by_name(), from KP Singh. 11) Turn on -mattr=+alu32 in LLVM by default for BPF kselftests in order to improve insn coverage for built BPF progs, from Yonghong Song. 12) Misc minor cleanups and fixes, from various others. ==================== Signed-off-by: David S. Miller <[email protected]>
2019-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-0/+2
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next, more specifically: * Updates for ipset: 1) Coding style fix for ipset comment extension, from Jeremy Sowden. 2) De-inline many functions in ipset, from Jeremy Sowden. 3) Move ipset function definition from header to source file. 4) Move ip_set_put_flags() to source, export it as a symbol, remove inline. 5) Move range_to_mask() to the source file where this is used. 6) Move ip_set_get_ip_port() to the source file where this is used. * IPVS selftests and netns improvements: 7) Two patches to speedup ipvs netns dismantle, from Haishuang Yan. 8) Three patches to add selftest script for ipvs, also from Haishuang Yan. * Conntrack updates and new nf_hook_slow_list() function: 9) Document ct ecache extension, from Florian Westphal. 10) Skip ct extensions from ctnetlink dump, from Florian. 11) Free ct extension immediately, from Florian. 12) Skip access to ecache extension from nf_ct_deliver_cached_events() this is not correct as reported by Syzbot. 13) Add and use nf_hook_slow_list(), from Florian. * Flowtable infrastructure updates: 14) Move priority to nf_flowtable definition. 15) Dynamic allocation of per-device hooks in flowtables. 16) Allow to include netdevice only once in flowtable definitions. 17) Rise maximum number of devices per flowtable. * Netfilter hardware offload infrastructure updates: 18) Add nft_flow_block_chain() helper function. 19) Pass callback list to nft_setup_cb_call(). 20) Add nft_flow_cls_offload_setup() helper function. 21) Remove rules for the unregistered device via netdevice event. 22) Support for multiple devices in a basechain definition at the ingress hook. 22) Add nft_chain_offload_cmd() helper function. 23) Add nft_flow_block_offload_init() helper function. 24) Rewind in case of failing to bind multiple devices to hook. 25) Typo in IPv6 tproxy module description, from Norman Rasmussen. ==================== Signed-off-by: David S. Miller <[email protected]>
2019-10-25tcp: add TCP_INFO status for failed client TFOJason Baron1-1/+9
The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether or not data-in-SYN was ack'd on both the client and server side. We'd like to gather more information on the client-side in the failure case in order to indicate the reason for the failure. This can be useful for not only debugging TFO, but also for creating TFO socket policies. For example, if a middle box removes the TFO option or drops a data-in-SYN, we can can detect this case, and turn off TFO for these connections saving the extra retransmits. The newly added tcpi_fastopen_client_fail status is 2 bits and has the following 4 states: 1) TFO_STATUS_UNSPEC Catch-all state which includes when TFO is disabled via black hole detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE. 2) TFO_COOKIE_UNAVAILABLE If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie is available in the cache. 3) TFO_DATA_NOT_ACKED Data was sent with SYN, we received a SYN/ACK but it did not cover the data portion. Cookie is not accepted by server because the cookie may be invalid or the server may be overloaded. 4) TFO_SYN_RETRANSMITTED Data was sent with SYN, we received a SYN/ACK which did not cover the data after at least 1 additional SYN was sent (without data). It may be the case that a middle-box is dropping data-in-SYN packets. Thus, it would be more efficient to not use TFO on this connection to avoid extra retransmits during connection establishment. These new fields do not cover all the cases where TFO may fail, but other failures, such as SYN/ACK + data being dropped, will result in the connection not becoming established. And a connection blackhole after session establishment shows up as a stalled connection. Signed-off-by: Jason Baron <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Neal Cardwell <[email protected]> Cc: Christoph Paasch <[email protected]> Cc: Yuchung Cheng <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-25fcntl: fix typo in RWH_WRITE_LIFE_NOT_SET r/w hint nameEugene Syromiatnikov1-1/+8
According to commit message in the original commit c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints"), as well as userspace library[1] and man page update[2], R/W hint constants are intended to have RWH_* prefix. However, RWF_WRITE_LIFE_NOT_SET retained "RWF_*" prefix used in the early versions of the proposed patch set[3]. Rename it and provide the old name as a synonym for the new one for backward compatibility. [1] https://github.com/axboe/fio/commit/bd553af6c849 [2] https://github.com/mkerrisk/man-pages/commit/580082a186fd [3] https://www.mail-archive.com/[email protected]/msg09638.html Fixes: c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints") Acked-by: Song Liu <[email protected]> Signed-off-by: Eugene Syromiatnikov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-26crypto: ccp - Retry SEV INIT command in case of integrity check failure.Ashish Kalra1-0/+3
SEV INIT command loads the SEV related persistent data from NVS and initializes the platform context. The firmware validates the persistent state. If validation fails, the firmware will reset the persisent state and return an integrity check failure status. At this point, a subsequent INIT command should succeed, so retry the command. The INIT command retry is only done during driver initialization. Additional enums along with SEV_RET_SECURE_DATA_INVALID are added to sev_ret_code to maintain continuity and relevance of enum values. Signed-off-by: Ashish Kalra <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Brijesh Singh <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2019-10-25dma-buf: Add dma-buf heaps frameworkAndrew F. Davis1-0/+55
This framework allows a unified userspace interface for dma-buf exporters, allowing userland to allocate specific types of memory for use in dma-buf sharing. Each heap is given its own device node, which a user can allocate a dma-buf fd from using the DMA_HEAP_IOC_ALLOC. This code is an evoluiton of the Android ION implementation, and a big thanks is due to its authors/maintainers over time for their effort: Rebecca Schultz Zavin, Colin Cross, Benjamin Gaignard, Laura Abbott, and many other contributors! Cc: Laura Abbott <[email protected]> Cc: Benjamin Gaignard <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Liam Mark <[email protected]> Cc: Pratik Patel <[email protected]> Cc: Brian Starkey <[email protected]> Cc: Vincent Donnefort <[email protected]> Cc: Sudipto Paul <[email protected]> Cc: Andrew F. Davis <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chenbo Feng <[email protected]> Cc: Alistair Strachan <[email protected]> Cc: Hridya Valsaraju <[email protected]> Cc: Hillf Danton <[email protected]> Cc: [email protected] Reviewed-by: Benjamin Gaignard <[email protected]> Reviewed-by: Brian Starkey <[email protected]> Acked-by: Laura Abbott <[email protected]> Tested-by: Ayan Kumar Halder <[email protected]> Signed-off-by: Andrew F. Davis <[email protected]> Signed-off-by: John Stultz <[email protected]> Signed-off-by: Sumit Semwal <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2019-10-24media: v4l2-core: Add new metadata formatVandana BN1-0/+1
Add new metadata format to support metadata output in vivid. Signed-off-by: Vandana BN <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2019-10-24Merge remote-tracking branch 'kvmarm/kvm-arm64/stolen-time' into ↵Marc Zyngier1-0/+2
kvmarm-master/next
2019-10-23compat_ioctl: handle PPPIOCGIDLE for 64-bit time_tArnd Bergmann2-0/+16
The ppp_idle structure is defined in terms of __kernel_time_t, which is defined as 'long' on all architectures, and this usage is not affected by the y2038 problem since it transports a time interval rather than an absolute time. However, the ppp user space defines the same structure as time_t, which may be 64-bit wide on new libc versions even on 32-bit architectures. It's easy enough to just handle both possible structure layouts on all architectures, to deal with the possibility that a user space ppp implementation comes with its own ppp_idle structure definition, as well as to document the fact that the driver is y2038-safe. Doing this also avoids the need for a special compat mode translation, since 32-bit and 64-bit kernels now support the same interfaces. The old 32-bit structure is also available on native 64-bit architectures now, but this is harmless. Cc: [email protected] Cc: [email protected] Cc: Paul Mackerras <[email protected]> Cc: "David S. Miller" <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2019-10-23fuse: Add changelog entries for protocols 7.1 - 7.8Alan Somers1-0/+37
Retroactively add changelog entry for FUSE protocols 7.1 through 7.8. Signed-off-by: Alan Somers <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2019-10-23netfilter: nf_tables: support for multiple devices per netdev hookPablo Neira Ayuso1-0/+2
This patch allows you to register one netdev basechain to multiple devices. This adds a new NFTA_HOOK_DEVS netlink attribute to specify the list of netdevices. Basechains store a list of hooks. Signed-off-by: Pablo Neira Ayuso <[email protected]>
2019-10-21clone3: add CLONE_CLEAR_SIGHANDChristian Brauner1-0/+3
Reset all signal handlers of the child not set to SIG_IGN to SIG_DFL. Mutually exclusive with CLONE_SIGHAND to not disturb other thread's signal handler. In the spirit of closer cooperation between glibc developers and kernel developers (cf. [2]) this patchset came out of a discussion on the glibc mailing list for improving posix_spawn() (cf. [1], [3], [4]). Kernel support for this feature has been explicitly requested by glibc and I see no reason not to help them with this. The child helper process on Linux posix_spawn must ensure that no signal handlers are enabled, so the signal disposition must be either SIG_DFL or SIG_IGN. However, it requires a sigprocmask to obtain the current signal mask and at least _NSIG sigaction calls to reset the signal handlers for each posix_spawn call or complex state tracking that might lead to data corruption in glibc. Adding this flags lets glibc avoid these problems. [1]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00149.html [3]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00158.html [4]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00160.html [2]: https://lwn.net/Articles/799331/ '[...] by asking for better cooperation with the C-library projects in general. They should be copied on patches containing ABI changes, for example. I noted that there are often times where C-library developers wish the kernel community had done things differently; how could those be avoided in the future? Members of the audience suggested that more glibc developers should perhaps join the linux-api list. The other suggestion was to "copy Florian on everything".' Cc: Florian Weimer <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2019-10-21KVM: arm64: Provide VCPU attributes for stolen timeSteven Price1-0/+2
Allow user space to inform the KVM host where in the physical memory map the paravirtualized time structures should be located. User space can set an attribute on the VCPU providing the IPA base address of the stolen time structure for that VCPU. This must be repeated for every VCPU in the VM. The address is given in terms of the physical address visible to the guest and must be 64 byte aligned. The guest will discover the address via a hypercall. Signed-off-by: Steven Price <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2019-10-21KVM: arm/arm64: Allow user injection of external data abortsChristoffer Dall1-0/+1
In some scenarios, such as buggy guest or incorrect configuration of the VMM and firmware description data, userspace will detect a memory access to a portion of the IPA, which is not mapped to any MMIO region. For this purpose, the appropriate action is to inject an external abort to the guest. The kernel already has functionality to inject an external abort, but we need to wire up a signal from user space that lets user space tell the kernel to do this. It turns out, we already have the set event functionality which we can perfectly reuse for this. Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2019-10-21KVM: arm/arm64: Allow reporting non-ISV data aborts to userspaceChristoffer Dall1-0/+7
For a long time, if a guest accessed memory outside of a memslot using any of the load/store instructions in the architecture which doesn't supply decoding information in the ESR_EL2 (the ISV bit is not set), the kernel would print the following message and terminate the VM as a result of returning -ENOSYS to userspace: load/store instruction decoding not implemented The reason behind this message is that KVM assumes that all accesses outside a memslot is an MMIO access which should be handled by userspace, and we originally expected to eventually implement some sort of decoding of load/store instructions where the ISV bit was not set. However, it turns out that many of the instructions which don't provide decoding information on abort are not safe to use for MMIO accesses, and the remaining few that would potentially make sense to use on MMIO accesses, such as those with register writeback, are not used in practice. It also turns out that fetching an instruction from guest memory can be a pretty horrible affair, involving stopping all CPUs on SMP systems, handling multiple corner cases of address translation in software, and more. It doesn't appear likely that we'll ever implement this in the kernel. What is much more common is that a user has misconfigured his/her guest and is actually not accessing an MMIO region, but just hitting some random hole in the IPA space. In this scenario, the error message above is almost misleading and has led to a great deal of confusion over the years. It is, nevertheless, ABI to userspace, and we therefore need to introduce a new capability that userspace explicitly enables to change behavior. This patch introduces KVM_CAP_ARM_NISV_TO_USER (NISV meaning Non-ISV) which does exactly that, and introduces a new exit reason to report the event to userspace. User space can then emulate an exception to the guest, restart the guest, suspend the guest, or take any other appropriate action as per the policy of the running system. Reported-by: Heinrich Schuchardt <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Reviewed-by: Alexander Graf <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2019-10-21media: videodev2.h: add V4L2_DEC_CMD_FLUSHHans Verkuil1-0/+1
Add this new V4L2_DEC_CMD_FLUSH decoder command and document it. Reviewed-by: Boris Brezillon <[email protected]> Reviewed-by: Alexandre Courbot <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Jernej Skrabec <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2019-10-21media: vb2: add V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUFHans Verkuil1-5/+8
This patch adds support for the V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF flag. It also adds a new V4L2_BUF_CAP_SUPPORTS_M2M_HOLD_CAPTURE_BUF capability. Drivers should set vb2_queue->subsystem_flags to VB2_V4L2_FL_SUPPORTS_M2M_HOLD_CAPTURE_BUF to indicate support for this flag. Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2019-10-21KVM: PPC: Report single stepping capabilityFabiano Rosas1-0/+1
When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request the next instruction to be single stepped via the KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure. This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order to inform userspace about the state of single stepping support. We currently don't have support for guest single stepping implemented in Book3S HV so the capability is only present for Book3S PR and BookE. Signed-off-by: Fabiano Rosas <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2019-10-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-1/+1
Several cases of overlapping changes which were for the most part trivially resolvable. Signed-off-by: David S. Miller <[email protected]>
2019-10-17bpf: Check types of arguments passed into helpersAlexei Starovoitov1-1/+26
Introduce new helper that reuses existing skb perf_event output implementation, but can be called from raw_tracepoint programs that receive 'struct sk_buff *' as tracepoint argument or can walk other kernel data structures to skb pointer. In order to do that teach verifier to resolve true C types of bpf helpers into in-kernel BTF ids. The type of kernel pointer passed by raw tracepoint into bpf program will be tracked by the verifier all the way until it's passed into helper function. For example: kfree_skb() kernel function calls trace_kfree_skb(skb, loc); bpf programs receives that skb pointer and may eventually pass it into bpf_skb_output() bpf helper which in-kernel is implemented via bpf_skb_event_output() kernel function. Its first argument in the kernel is 'struct sk_buff *'. The verifier makes sure that types match all the way. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2019-10-17bpf: Add attach_btf_id attribute to program loadAlexei Starovoitov1-0/+1
Add attach_btf_id attribute to prog_load command. It's similar to existing expected_attach_type attribute which is used in several cgroup based program types. Unfortunately expected_attach_type is ignored for tracing programs and cannot be reused for new purpose. Hence introduce attach_btf_id to verify bpf programs against given in-kernel BTF type id at load time. It is strictly checked to be valid for raw_tp programs only. In a later patches it will become: btf_id == 0 semantics of existing raw_tp progs. btd_id > 0 raw_tp with BTF and additional type safety. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2019-10-16serial: fsl_linflexuart: Be consistent with the nameStefan-Gabriel Mirea1-1/+1
For consistency reasons, spell the controller name as "LINFlexD" in comments and documentation. Signed-off-by: Stefan-Gabriel Mirea <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-10-15ethtool: Add support for 400Gbps (50Gbps per lane) link modesJiri Pirko1-0/+6
Add support for 400Gbps speed, link modes of 50Gbps per lane Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-15iommu: Introduce guest PASID bind functionJacob Pan1-0/+59
Guest shared virtual address (SVA) may require host to shadow guest PASID tables. Guest PASID can also be allocated from the host via enlightened interfaces. In this case, guest needs to bind the guest mm, i.e. cr3 in guest physical address to the actual PASID table in the host IOMMU. Nesting will be turned on such that guest virtual address can go through a two level translation: - 1st level translates GVA to GPA - 2nd level translates GPA to HPA This patch introduces APIs to bind guest PASID data to the assigned device entry in the physical IOMMU. See the diagram below for usage explanation. .-------------. .---------------------------. | vIOMMU | | Guest process mm, FL only | | | '---------------------------' .----------------/ | PASID Entry |--- PASID cache flush - '-------------' | | | V | | GP '-------------' Guest ------| Shadow |----------------------- GP->HP* --------- v v | Host v .-------------. .----------------------. | pIOMMU | | Bind FL for GVA-GPA | | | '----------------------' .----------------/ | | PASID Entry | V (Nested xlate) '----------------\.---------------------. | | |Set SL to GPA-HPA | | | '---------------------' '-------------' Where: - FL = First level/stage one page tables - SL = Second level/stage two page tables - GP = Guest PASID - HP = Host PASID * Conversion needed if non-identity GP-HP mapping option is chosen. Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Liu Yi L <[email protected]> Reviewed-by: Jean-Philippe Brucker <[email protected]> Reviewed-by: Jean-Philippe Brucker <[email protected]> Reviewed-by: Eric Auger <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2019-10-15iommu: Introduce cache_invalidate APIYi L Liu1-0/+110
In any virtualization use case, when the first translation stage is "owned" by the guest OS, the host IOMMU driver has no knowledge of caching structure updates unless the guest invalidation activities are trapped by the virtualizer and passed down to the host. Since the invalidation data can be obtained from user space and will be written into physical IOMMU, we must allow security check at various layers. Therefore, generic invalidation data format are proposed here, model specific IOMMU drivers need to convert them into their own format. Signed-off-by: Yi L Liu <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Ashok Raj <[email protected]> Signed-off-by: Eric Auger <[email protected]> Signed-off-by: Jean-Philippe Brucker <[email protected]> Reviewed-by: Jean-Philippe Brucker <[email protected]> Reviewed-by: Eric Auger <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2019-10-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-16/+16
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-10-14 The following pull-request contains BPF updates for your *net-next* tree. 12 days of development and 85 files changed, 1889 insertions(+), 1020 deletions(-) The main changes are: 1) auto-generation of bpf_helper_defs.h, from Andrii. 2) split of bpf_helpers.h into bpf_{helpers, helper_defs, endian, tracing}.h and move into libbpf, from Andrii. 3) Track contents of read-only maps as scalars in the verifier, from Andrii. 4) small x86 JIT optimization, from Daniel. 5) cross compilation support, from Ivan. 6) bpf flow_dissector enhancements, from Jakub and Stanislav. ==================== Signed-off-by: David S. Miller <[email protected]>
2019-10-14PCI: Add PCI_STD_NUM_BARS for the number of standard BARsDenis Efremov1-0/+1
Code that iterates over all standard PCI BARs typically uses PCI_STD_RESOURCE_END. However, that requires the unusual test "i <= PCI_STD_RESOURCE_END" rather than something the typical "i < PCI_STD_NUM_BARS". Add a definition for PCI_STD_NUM_BARS and change loops to use the more idiomatic C style to help avoid fencepost errors. Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Denis Efremov <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]> Acked-by: Sebastian Ott <[email protected]> # arch/s390/ Acked-by: Bartlomiej Zolnierkiewicz <[email protected]> # video/fbdev/ Acked-by: Gustavo Pimentel <[email protected]> # pci/controller/dwc/ Acked-by: Jack Wang <[email protected]> # scsi/pm8001/ Acked-by: Martin K. Petersen <[email protected]> # scsi/pm8001/ Acked-by: Ulf Hansson <[email protected]> # memstick/
2019-10-13Merge tag 'mac80211-next-for-net-next-2019-10-11' of ↵David S. Miller1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== A few more small things, nothing really stands out: * minstrel improvements from Felix * a TX aggregation simplification * some additional capabilities for hwsim * minor cleanups & docs updates ==================== Signed-off-by: David S. Miller <[email protected]>
2019-10-12Merge tag 'tty-5.4-rc3' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver fixes from Greg KH: "Here are some small tty and serial driver fixes for 5.4-rc3 that resolve a number of reported issues and regressions. None of these are huge, full details are in the shortlog. There's also a MAINTAINERS update that I think you might have already taken in your tree already, but git should handle that merge easily. All have been in linux-next with no reported issues" * tag 'tty-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: MAINTAINERS: kgdb: Add myself as a reviewer for kgdb/kdb tty: serial: imx: Use platform_get_irq_optional() for optional IRQs serial: fix kernel-doc warning in comments serial: 8250_omap: Fix gpio check for auto RTS/CTS serial: mctrl_gpio: Check for NULL pointer tty: serial: fsl_lpuart: Fix lpuart_flush_buffer() tty: serial: Fix PORT_LINFLEXUART definition tty: n_hdlc: fix build on SPARC serial: uartps: Fix uartps_major handling serial: uartlite: fix exit path null pointer tty: serial: linflexuart: Fix magic SysRq handling serial: sh-sci: Use platform_get_irq_optional() for optional interrupts dt-bindings: serial: sh-sci: Document r8a774b1 bindings serial/sifive: select SERIAL_EARLYCON tty: serial: rda: Fix the link time qualifier of 'rda_uart_exit()' tty: serial: owl: Fix the link time qualifier of 'owl_uart_exit()'
2019-10-10seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUEChristian Brauner1-0/+29
This allows the seccomp notifier to continue a syscall. A positive discussion about this feature was triggered by a post to the ksummit-discuss mailing list (cf. [3]) and took place during KSummit (cf. [1]) and again at the containers/checkpoint-restore micro-conference at Linux Plumbers. Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4]) which enables a process (watchee) to retrieve an fd for its seccomp filter. This fd can then be handed to another (usually more privileged) process (watcher). The watcher will then be able to receive seccomp messages about the syscalls having been performed by the watchee. This feature is heavily used in some userspace workloads. For example, it is currently used to intercept mknod() syscalls in user namespaces aka in containers. The mknod() syscall can be easily filtered based on dev_t. This allows us to only intercept a very specific subset of mknod() syscalls. Furthermore, mknod() is not possible in user namespaces toto coelo and so intercepting and denying syscalls that are not in the whitelist on accident is not a big deal. The watchee won't notice a difference. In contrast to mknod(), a lot of other syscall we intercept (e.g. setxattr()) cannot be easily filtered like mknod() because they have pointer arguments. Additionally, some of them might actually succeed in user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we currently cannot tell seccomp to continue from a user notifier we are stuck with performing all of the syscalls in lieu of the container. This is a huge security liability since it is extremely difficult to correctly assume all of the necessary privileges of the calling task such that the syscall can be successfully emulated without escaping other additional security restrictions (think missing CAP_MKNOD for mknod(), or MS_NODEV on a filesystem etc.). This can be solved by telling seccomp to resume the syscall. One thing that came up in the discussion was the problem that another thread could change the memory after userspace has decided to let the syscall continue which is a well known TOCTOU with seccomp which is present in other ways already. The discussion showed that this feature is already very useful for any syscall without pointer arguments. For any accidentally intercepted non-pointer syscall it is safe to continue. For syscalls with pointer arguments there is a race but for any cautious userspace and the main usec cases the race doesn't matter. The notifier is intended to be used in a scenario where a more privileged watcher supervises the syscalls of lesser privileged watchee to allow it to get around kernel-enforced limitations by performing the syscall for it whenever deemed save by the watcher. Hence, if a user tricks the watcher into allowing a syscall they will either get a deny based on kernel-enforced restrictions later or they will have changed the arguments in such a way that they manage to perform a syscall with arguments that they would've been allowed to do anyway. In general, it is good to point out again, that the notifier fd was not intended to allow userspace to implement a security policy but rather to work around kernel security mechanisms in cases where the watcher knows that a given action is safe to perform. /* References */ [1]: https://linuxplumbersconf.org/event/4/contributions/560 [2]: https://linuxplumbersconf.org/event/4/contributions/477 [3]: https://lore.kernel.org/r/[email protected] [4]: commit 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Co-developed-by: Kees Cook <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Tycho Andersen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Will Drewry <[email protected]> CC: Tyler Hicks <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2019-10-10media: add V4L2_CID_UNIT_CELL_SIZE controlRicardo Ribalda Delgado1-0/+1
This control returns the unit cell size in nanometres. The struct provides the width and the height in separated fields to take into consideration asymmetric pixels and/or hardware binning. This control is required for automatic calibration of sensors/cameras. Reviewed-by: Philipp Zabel <[email protected]> Signed-off-by: Ricardo Ribalda Delgado <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2019-10-10media: add V4L2_CTRL_TYPE_AREA control typeRicardo Ribalda Delgado1-0/+6
This type contains the width and the height of a rectangular area. Reviewed-by: Jacopo Mondi <[email protected]> Signed-off-by: Ricardo Ribalda Delgado <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2019-10-09scsi: ch: add include guard to chio.hMasahiro Yamada1-7/+4
Add a header include guard just in case. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2019-10-09sctp: add SCTP_SEND_FAILED_EVENT eventXin Long1-1/+15
This patch is to add a new event SCTP_SEND_FAILED_EVENT described in rfc6458#section-6.1.11. It's a update of SCTP_SEND_FAILED event: struct sctp_sndrcvinfo ssf_info is replaced with struct sctp_sndinfo ssfe_info in struct sctp_send_failed_event. SCTP_SEND_FAILED is being deprecated, but we don't remove it in this patch. Both are being processed in sctp_datamsg_destroy() when the corresp event flag is set. Signed-off-by: Xin Long <[email protected]> Acked-by: Neil Horman <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2019-10-07media: cec-funcs.h: use new CEC_OP_UI_CMD definesHans Verkuil1-14/+14
When the new CEC_OP_UI_CMD defines were added I forgot to update this header to use these new defines. This is now fixed. Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2019-10-07media: cec-funcs.h: add status_req checksHans Verkuil1-2/+4
The CEC_MSG_GIVE_DECK_STATUS and CEC_MSG_GIVE_TUNER_DEVICE_STATUS commands both have a status_req argument: ON, OFF, ONCE. If ON or ONCE, then the follower will reply with a STATUS message. Either once or whenever the status changes (status_req == ON). If status_req == OFF, then it will stop sending continuous status updates, but the follower will *not* send a STATUS message in that case. This means that if status_req == OFF, then msg->reply should be 0 as well since no reply is expected in that case. Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>