aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2020-07-19dma-mapping: make support for dma ops optionalChristoph Hellwig2-1/+6
Avoid the overhead of the dma ops support for tiny builds that only use the direct mapping. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Alexey Kardashevskiy <[email protected]> Reviewed-by: Alexey Kardashevskiy <[email protected]>
2020-07-17inet: Run SK_LOOKUP BPF program on socket lookupJakub Sitnicki1-1/+31
Run a BPF program before looking up a listening socket on the receive path. Program selects a listening socket to yield as result of socket lookup by calling bpf_sk_assign() helper and returning SK_PASS code. Program can revert its decision by assigning a NULL socket with bpf_sk_assign(). Alternatively, BPF program can also fail the lookup by returning with SK_DROP, or let the lookup continue as usual with SK_PASS on return, when no socket has been selected with bpf_sk_assign(). This lets the user match packets with listening sockets freely at the last possible point on the receive path, where we know that packets are destined for local delivery after undergoing policing, filtering, and routing. With BPF code selecting the socket, directing packets destined to an IP range or to a port range to a single socket becomes possible. In case multiple programs are attached, they are run in series in the order in which they were attached. The end result is determined from return codes of all the programs according to following rules: 1. If any program returned SK_PASS and selected a valid socket, the socket is used as result of socket lookup. 2. If more than one program returned SK_PASS and selected a socket, last selection takes effect. 3. If any program returned SK_DROP, and no program returned SK_PASS and selected a socket, socket lookup fails with -ECONNREFUSED. 4. If all programs returned SK_PASS and none of them selected a socket, socket lookup continues to htable-based lookup. Suggested-by: Marek Majkowski <[email protected]> Signed-off-by: Jakub Sitnicki <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-07-17bpf: Introduce SK_LOOKUP program type with a dedicated attach pointJakub Sitnicki3-3/+24
Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer when looking up a listening socket for a new connection request for connection oriented protocols, or when looking up an unconnected socket for a packet for connection-less protocols. When called, SK_LOOKUP BPF program can select a socket that will receive the packet. This serves as a mechanism to overcome the limits of what bind() API allows to express. Two use-cases driving this work are: (1) steer packets destined to an IP range, on fixed port to a socket 192.0.2.0/24, port 80 -> NGINX socket (2) steer packets destined to an IP address, on any port to a socket 198.51.100.1, any port -> L7 proxy socket In its run-time context program receives information about the packet that triggered the socket lookup. Namely IP version, L4 protocol identifier, and address 4-tuple. Context can be further extended to include ingress interface identifier. To select a socket BPF program fetches it from a map holding socket references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...) helper to record the selection. Transport layer then uses the selected socket as a result of socket lookup. In its basic form, SK_LOOKUP acts as a filter and hence must return either SK_PASS or SK_DROP. If the program returns with SK_PASS, transport should look for a socket to receive the packet, or use the one selected by the program if available, while SK_DROP informs the transport layer that the lookup should fail. This patch only enables the user to attach an SK_LOOKUP program to a network namespace. Subsequent patches hook it up to run on local delivery path in ipv4 and ipv6 stacks. Suggested-by: Marek Majkowski <[email protected]> Signed-off-by: Jakub Sitnicki <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-07-17bpf, netns: Handle multiple link attachmentsJakub Sitnicki2-9/+136
Extend the BPF netns link callbacks to rebuild (grow/shrink) or update the prog_array at given position when link gets attached/updated/released. This let's us lift the limit of having just one link attached for the new attach type introduced by subsequent patch. No functional changes intended. Signed-off-by: Jakub Sitnicki <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-07-18futex: Remove unused or redundant includesAndré Almeida1-17/+0
Since 82af7aca ("Removal of FUTEX_FD"), some includes related to file operations aren't needed anymore. More investigation around the includes showed that a lot of includes aren't required for compilation, possible due to redundant includes. Simplify the code by removing unused includes. Signed-off-by: André Almeida <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-18futex: Consistently use fshared as booleanAndré Almeida1-5/+5
Since fshared is only conveying true/false values, declare it as bool. In get_futex_key() the usage of fshared can be restricted to the first part of the function. If fshared is false the function is terminated early and the subsequent code can use a constant 'true' instead of the variable. Signed-off-by: André Almeida <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17futex: Remove needless goto'sAndré Almeida1-24/+16
As stated in the coding style documentation, "if there is no cleanup needed then just return directly", instead of jumping to a label and then returning. Remove such goto's and replace with a return statement. When there's a ternary operator on the return value, replace it with the result of the operation when it is logically possible to determine it by the control flow. Signed-off-by: André Almeida <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17futex: Remove put_futex_key()André Almeida1-49/+12
Since 4b39f99c ("futex: Remove {get,drop}_futex_key_refs()"), put_futex_key() is empty. Remove all references for this function and the then redundant labels. Signed-off-by: André Almeida <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17genirq/affinity: Handle affinity setting on inactive interrupts correctlyThomas Gleixner1-2/+35
Setting interrupt affinity on inactive interrupts is inconsistent when hierarchical irq domains are enabled. The core code should just store the affinity and not call into the irq chip driver for inactive interrupts because the chip drivers may not be in a state to handle such requests. X86 has a hacky workaround for that but all other irq chips have not which causes problems e.g. on GIC V3 ITS. Instead of adding more ugly hacks all over the place, solve the problem in the core code. If the affinity is set on an inactive interrupt then: - Store it in the irq descriptors affinity mask - Update the effective affinity to reflect that so user space has a consistent view - Don't call into the irq chip driver This is the core equivalent of the X86 workaround and works correctly because the affinity setting is established in the irq chip when the interrupt is activated later on. Note, that this is only effective when hierarchical irq domains are enabled by the architecture. Doing it unconditionally would break legacy irq chip implementations. For hierarchial irq domains this works correctly as none of the drivers can have a dependency on affinity setting in inactive state by design. Remove the X86 workaround as it is not longer required. Fixes: 02edee152d6e ("x86/apic/vector: Ignore set_affinity call for inactive interrupts") Reported-by: Ali Saidi <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Ali Saidi <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Lower base clock forwarding thresholdFrederic Weisbecker1-1/+1
There is nothing that prevents from forwarding the base clock if it's one jiffy off. The reason for this arbitrary limit of two jiffies is historical and does not longer exist. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Remove must_forward_clkFrederic Weisbecker1-16/+6
There is no reason to keep this guard around. The code makes sure that base->clk remains sane and won't be forwarded beyond jiffies nor set backward. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Spare timer softirq until next expiryFrederic Weisbecker1-41/+8
Now that the core timer infrastructure doesn't depend anymore on periodic base->clk increments, even when the CPU is not in NO_HZ mode, timer softirqs can be skipped until there are timers to expire. Some spurious softirqs can still remain since base->next_expiry doesn't keep track of canceled timers but this still reduces the number of softirqs significantly: ~15 times less for HZ=1000 and ~5 times less for HZ=100. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Expand clk forward logic beyond nohzFrederic Weisbecker1-22/+4
As for next_expiry, the base->clk catch up logic will be expanded beyond NOHZ in order to avoid triggering useless softirqs. If softirqs should only fire to expire pending timers, periodic base->clk increments must be skippable for random amounts of time. Therefore prepare to catch-up with missing updates whenever an up-to-date base clock is needed. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Reuse next expiry cache after nohz exitFrederic Weisbecker1-4/+2
Now that the next expiry it tracked unconditionally when a timer is added, this information can be reused on a tick firing after exiting nohz. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Always keep track of next expiryFrederic Weisbecker1-21/+21
So far next expiry was only tracked while the CPU was in nohz_idle mode in order to cope with missing ticks that can't increment the base->clk periodically anymore. This logic is going to be expanded beyond nohz in order to spare timer softirqs so do it unconditionally. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Optimize _next_timer_interrupt() level iterationFrederic Weisbecker1-1/+9
If a level has a timer that expires before reaching the next level, there is no need to iterate further. The next level is reached when the 3 lower bits of the current level are cleared. If the next event happens before/during that, the next levels won't provide an earlier expiration. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Add comments about calc_index() ceiling workFrederic Weisbecker1-1/+11
calc_index() adds 1 unit of the level granularity to the expiry passed in parameter to ensure that the timer doesn't expire too early. Add a comment to explain that and the resulting layout in the wheel. Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Move trigger_dyntick_cpu() to enqueue_timer()Frederic Weisbecker1-36/+25
Consolidate the code by calling trigger_dyntick_cpu() from enqueue_timer() instead of calling it from all its callers. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Use only bucket expiry for base->next_expiry valueAnna-Maria Behnsen1-30/+34
The bucket expiry time is the effective expriy time of timers and is greater than or equal to the requested timer expiry time. This is due to the guarantee that timers never expire early and the reduced expiry granularity in the secondary wheel levels. When a timer is enqueued, trigger_dyntick_cpu() checks whether the timer is the new first timer. This check compares next_expiry with the requested timer expiry value and not with the effective expiry value of the bucket into which the timer was queued. Storing the requested timer expiry value in base->next_expiry can lead to base->clk going backwards if the requested timer expiry value is smaller than base->clk. Commit 30c66fc30ee7 ("timer: Prevent base->clk from moving backward") worked around this by preventing the store when timer->expiry is before base->clk, but did not fix the underlying problem. Use the expiry value of the bucket into which the timer is queued to do the new first timer check. This fixes the base->clk going backward problem. The workaround of commit 30c66fc30ee7 ("timer: Prevent base->clk from moving backward") in trigger_dyntick_cpu() is not longer necessary as the timers bucket expiry is guaranteed to be greater than or equal base->clk. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timers: Preserve higher bits of expiration on index calculationFrederic Weisbecker1-1/+1
The higher bits of the timer expiration are cropped while calling calc_index() due to the implicit cast from unsigned long to unsigned int. This loss shouldn't have consequences on the current code since all the computation to calculate the index is done on the lower 32 bits. However to prepare for returning the actual bucket expiration from calc_index() in order to properly fix base->next_expiry updates, the higher bits need to be preserved. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17timer: Fix wheel index calculation on last levelFrederic Weisbecker1-2/+2
When an expiration delta falls into the last level of the wheel, that delta has be compared against the maximum possible delay and reduced to fit in if necessary. However instead of comparing the delta against the maximum, the code compares the actual expiry against the maximum. Then instead of fixing the delta to fit in, it sets the maximum delta as the expiry value. This can result in various undesired outcomes, the worst possible one being a timer expiring 15 days ahead to fire immediately. Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel") Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-16sched/fair: handle case of task_h_load() returning 0Vincent Guittot1-2/+13
task_h_load() can return 0 in some situations like running stress-ng mmapfork, which forks thousands of threads, in a sched group on a 224 cores system. The load balance doesn't handle this correctly because env->imbalance never decreases and it will stop pulling tasks only after reaching loop_max, which can be equal to the number of running tasks of the cfs. Make sure that imbalance will be decreased by at least 1. misfit task is the other feature that doesn't handle correctly such situation although it's probably more difficult to face the problem because of the smaller number of CPUs and running tasks on heterogenous system. We can't simply ensure that task_h_load() returns at least one because it would imply to handle underflow in other places. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Cc: <[email protected]> # v4.4+ Link: https://lkml.kernel.org/r/[email protected]
2020-07-16treewide: Remove uninitialized_var() usageKees Cook10-24/+24
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/[email protected]/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <[email protected]> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <[email protected]> # IB Acked-by: Kalle Valo <[email protected]> # wireless drivers Reviewed-by: Chao Yu <[email protected]> # erofs Signed-off-by: Kees Cook <[email protected]>
2020-07-16bpf: cpumap: Implement XDP_REDIRECT for eBPF programs attached to map entriesLorenzo Bianconi1-2/+15
Introduce XDP_REDIRECT support for eBPF programs attached to cpumap entries. This patch has been tested on Marvell ESPRESSObin using a modified version of xdp_redirect_cpu sample in order to attach a XDP program to CPUMAP entries to perform a redirect on the mvneta interface. In particular the following scenario has been tested: rq (cpu0) --> mvneta - XDP_REDIRECT (cpu0) --> CPUMAP - XDP_REDIRECT (cpu1) --> mvneta $./xdp_redirect_cpu -p xdp_cpu_map0 -d eth0 -c 1 -e xdp_redirect \ -f xdp_redirect_kern.o -m tx_port -r eth0 tx: 285.2 Kpps rx: 285.2 Kpps Attaching a simple XDP program on eth0 to perform XDP_TX gives comparable results: tx: 288.4 Kpps rx: 288.4 Kpps Co-developed-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Lorenzo Bianconi <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Link: https://lore.kernel.org/bpf/2cf8373a731867af302b00c4ff16c122630c4980.1594734381.git.lorenzo@kernel.org
2020-07-16bpf: cpumap: Add the possibility to attach an eBPF program to cpumapLorenzo Bianconi1-13/+108
Introduce the capability to attach an eBPF program to cpumap entries. The idea behind this feature is to add the possibility to define on which CPU run the eBPF program if the underlying hw does not support RSS. Current supported verdicts are XDP_DROP and XDP_PASS. This patch has been tested on Marvell ESPRESSObin using xdp_redirect_cpu sample available in the kernel tree to identify possible performance regressions. Results show there are no observable differences in packet-per-second: $./xdp_redirect_cpu --progname xdp_cpu_map0 --dev eth0 --cpu 1 rx: 354.8 Kpps rx: 356.0 Kpps rx: 356.8 Kpps rx: 356.3 Kpps rx: 356.6 Kpps rx: 356.6 Kpps rx: 356.7 Kpps rx: 355.8 Kpps rx: 356.8 Kpps rx: 356.8 Kpps Co-developed-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Lorenzo Bianconi <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Link: https://lore.kernel.org/bpf/5c9febdf903d810b3415732e5cd98491d7d9067a.1594734381.git.lorenzo@kernel.org
2020-07-16cpumap: Formalize map value as a named structLorenzo Bianconi1-13/+15
As it has been already done for devmap, introduce 'struct bpf_cpumap_val' to formalize the expected values that can be passed in for a CPUMAP. Update cpumap code to use the struct. Signed-off-by: Lorenzo Bianconi <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Link: https://lore.kernel.org/bpf/754f950674665dae6139c061d28c1d982aaf4170.1594734381.git.lorenzo@kernel.org
2020-07-16cpumap: Use non-locked version __ptr_ring_consume_batchedJesper Dangaard Brouer1-1/+1
Commit 77361825bb01 ("bpf: cpumap use ptr_ring_consume_batched") changed away from using single frame ptr_ring dequeue (__ptr_ring_consume) to consume a batched, but it uses a locked version, which as the comment explain isn't needed. Change to use the non-locked version __ptr_ring_consume_batched. Fixes: 77361825bb01 ("bpf: cpumap use ptr_ring_consume_batched") Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Lorenzo Bianconi <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/a9c7d06f9a009e282209f0c8c7b2c5d9b9ad60b9.1594734381.git.lorenzo@kernel.org
2020-07-16dma-mapping: inline the fast path dma-direct callsChristoph Hellwig1-65/+0
Inline the single page map/unmap/sync dma-direct calls into the now out of line generic wrappers. This restores the behavior of a single function call that we had before moving the generic calls out of line. Besides the dma-mapping callers there are just a few callers in IOMMU drivers that have a bypass mode, and more of those are going to be switched to the generic bypass soon. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Alexey Kardashevskiy <[email protected]> Reviewed-by: Alexey Kardashevskiy <[email protected]>
2020-07-16dma-mapping: move the remaining DMA API calls out of lineChristoph Hellwig2-9/+164
For a long time the DMA API has been implemented inline in dma-mapping.h, but the function bodies can be quite large. Move them all out of line. This also removes all the dma_direct_* exports as those are just implementation details and should never be used by drivers directly. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Alexey Kardashevskiy <[email protected]> Reviewed-by: Alexey Kardashevskiy <[email protected]>
2020-07-15bpf: Fix NULL pointer dereference in __btf_resolve_helper_id()Peilin Ye1-0/+5
Prevent __btf_resolve_helper_id() from dereferencing `btf_vmlinux` as NULL. This patch fixes the following syzbot bug: https://syzkaller.appspot.com/bug?id=f823224ada908fa5c207902a5a62065e53ca0fcc Reported-by: [email protected] Signed-off-by: Peilin Ye <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-07-14seccomp: Introduce addfd ioctl to seccomp user notifierSargun Dhillon1-2/+173
The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over an fd. It is often used in settings where a supervising task emulates syscalls on behalf of a supervised task in userspace, either to further restrict the supervisee's syscall abilities or to circumvent kernel enforced restrictions the supervisor deems safe to lift (e.g. actually performing a mount(2) for an unprivileged container). While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall, only a certain subset of syscalls could be correctly emulated. Over the last few development cycles, the set of syscalls which can't be emulated has been reduced due to the addition of pidfd_getfd(2). With this we are now able to, for example, intercept syscalls that require the supervisor to operate on file descriptors of the supervisee such as connect(2). However, syscalls that cause new file descriptors to be installed can not currently be correctly emulated since there is no way for the supervisor to inject file descriptors into the supervisee. This patch adds a new addfd ioctl to remove this restriction by allowing the supervisor to install file descriptors into the intercepted task. By implementing this feature via seccomp the supervisor effectively instructs the supervisee to install a set of file descriptors into its own file descriptor table during the intercepted syscall. This way it is possible to intercept syscalls such as open() or accept(), and install (or replace, like dup2(2)) the supervisor's resulting fd into the supervisee. One replacement use-case would be to redirect the stdout and stderr of a supervisee into log file descriptors opened by the supervisor. The ioctl handling is based on the discussions[1] of how Extensible Arguments should interact with ioctls. Instead of building size into the addfd structure, make it a function of the ioctl command (which is how sizes are normally passed to ioctls). To support forward and backward compatibility, just mask out the direction and size, and match everything. The size (and any future direction) checks are done along with copy_struct_from_user() logic. As a note, the seccomp_notif_addfd structure is laid out based on 8-byte alignment without requiring packing as there have been packing issues with uapi highlighted before[2][3]. Although we could overload the newfd field and use -1 to indicate that it is not to be used, doing so requires changing the size of the fd field, and introduces struct packing complexity. [1]: https://lore.kernel.org/lkml/[email protected]/ [2]: https://lore.kernel.org/lkml/[email protected]/ [3]: https://lore.kernel.org/lkml/[email protected] Cc: Christoph Hellwig <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Tycho Andersen <[email protected]> Cc: Jann Horn <[email protected]> Cc: Robert Sesek <[email protected]> Cc: Chris Palmer <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Suggested-by: Matt Denton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sargun Dhillon <[email protected]> Reviewed-by: Will Drewry <[email protected]> Co-developed-by: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]>
2020-07-14Merge branch 'usermode-driver-cleanup' of ↵Alexei Starovoitov5-174/+211
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace into bpf-next
2020-07-14PM: sleep: spread "const char *" correctnessAlexey Dobriyan2-3/+3
Fixed string literals can be referred to as "const char *". Signed-off-by: Alexey Dobriyan <[email protected]> [ rjw: Minor subject edit ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-07-14PM: hibernate: fix white space in a few placesXiang Chen1-3/+3
In hibernate.c, some places lack of spaces while some places have redundant spaces. So fix them. Signed-off-by: Xiang Chen <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-07-14dma-pool: do not allocate pool memory from CMANicolas Saenz Julienne1-9/+2
There is no guarantee to CMA's placement, so allocating a zone specific atomic pool from CMA might return memory from a completely different memory zone. So stop using it. Fixes: c84dc6e68a1d ("dma-pool: add additional coherent pools to map to gfp mask") Reported-by: Jeremy Linton <[email protected]> Signed-off-by: Nicolas Saenz Julienne <[email protected]> Tested-by: Jeremy Linton <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-07-14dma-pool: make sure atomic pool suits deviceNicolas Saenz Julienne1-20/+37
When allocating DMA memory from a pool, the core can only guess which atomic pool will fit a device's constraints. If it doesn't, get a safer atomic pool and try again. Fixes: c84dc6e68a1d ("dma-pool: add additional coherent pools to map to gfp mask") Reported-by: Jeremy Linton <[email protected]> Suggested-by: Robin Murphy <[email protected]> Signed-off-by: Nicolas Saenz Julienne <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-07-14dma-pool: introduce dma_guess_pool()Nicolas Saenz Julienne1-3/+23
dma-pool's dev_to_pool() creates the false impression that there is a way to grantee a mapping between a device's DMA constraints and an atomic pool. It tuns out it's just a guess, and the device might need to use an atomic pool containing memory from a 'safer' (or lower) memory zone. To help mitigate this, introduce dma_guess_pool() which can be fed a device's DMA constraints and atomic pools already known to be faulty, in order for it to provide an better guess on which pool to use. Signed-off-by: Nicolas Saenz Julienne <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-07-14dma-pool: get rid of dma_in_atomic_pool()Nicolas Saenz Julienne1-10/+1
The function is only used once and can be simplified to a one-liner. Signed-off-by: Nicolas Saenz Julienne <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-07-14dma-direct: provide function to check physical memory area validityNicolas Saenz Julienne1-1/+1
dma_coherent_ok() checks if a physical memory area fits a device's DMA constraints. Signed-off-by: Nicolas Saenz Julienne <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-07-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller6-100/+98
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-07-13 The following pull-request contains BPF updates for your *net-next* tree. We've added 36 non-merge commits during the last 7 day(s) which contain a total of 62 files changed, 2242 insertions(+), 468 deletions(-). The main changes are: 1) Avoid trace_printk warning banner by switching bpf_trace_printk to use its own tracing event, from Alan. 2) Better libbpf support on older kernels, from Andrii. 3) Additional AF_XDP stats, from Ciara. 4) build time resolution of BTF IDs, from Jiri. 5) BPF_CGROUP_INET_SOCK_RELEASE hook, from Stanislav. ==================== Signed-off-by: David S. Miller <[email protected]>
2020-07-13bpf: Use dedicated bpf_trace_printk event instead of trace_printk()Alan Maguire3-5/+73
The bpf helper bpf_trace_printk() uses trace_printk() under the hood. This leads to an alarming warning message originating from trace buffer allocation which occurs the first time a program using bpf_trace_printk() is loaded. We can instead create a trace event for bpf_trace_printk() and enable it in-kernel when/if we encounter a program using the bpf_trace_printk() helper. With this approach, trace_printk() is not used directly and no warning message appears. This work was started by Steven (see Link) and finished by Alan; added Steven's Signed-off-by with his permission. Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Alan Maguire <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/bpf/[email protected]
2020-07-13pidfd: Replace open-coded receive_fd()Kees Cook1-13/+2
Replace the open-coded version of receive_fd() with a call to the new helper. Thanks to Vamshi K Sthambamkadi <[email protected]> for catching a missed fput() in an earlier version of this patch. Cc: Christoph Hellwig <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Sargun Dhillon <[email protected]> Acked-by: Christian Brauner <[email protected]> Signed-off-by: Kees Cook <[email protected]>
2020-07-13pidfd: Add missing sock updates for pidfd_getfd()Kees Cook1-2/+5
The sock counting (sock_update_netprioidx() and sock_update_classid()) was missing from pidfd's implementation of received fd installation. Add a call to the new __receive_sock() helper. Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Sargun Dhillon <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall") Signed-off-by: Kees Cook <[email protected]>
2020-07-13bpf: Use BTF_ID to resolve bpf_ctx_convert structJiri Olsa1-8/+6
This way the ID is resolved during compile time, and we can remove the runtime name search. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Tested-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-07-13bpf: Remove btf_id helpers resolvingJiri Olsa1-84/+5
Now when we moved the helpers btf_id arrays into .BTF_ids section, we can remove the code that resolve those IDs in runtime. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Tested-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-07-13bpf: Resolve BTF IDs in vmlinux imageJiri Olsa2-3/+11
Using BTF_ID_LIST macro to define lists for several helpers using BTF arguments. And running resolve_btfids on vmlinux elf object during linking, so the .BTF_ids section gets the IDs resolved. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Tested-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-07-13doc:kmsg: explicitly state the return value in case of SEEK_CURBruno Meneguele1-0/+8
The commit 625d3449788f ("Revert "kernel/printk: add kmsg SEEK_CUR handling"") reverted a change done to the return value in case a SEEK_CUR operation was performed for kmsg buffer based on the fact that different userspace apps were handling the new return value (-ESPIPE) in different ways, breaking them. At the same time -ESPIPE was the wrong decision because kmsg /does support/ seek() but doesn't follow the "normal" behavior userspace is used to. Because of that and also considering the time -EINVAL has been used, it was decided to keep this way to avoid more userspace breakage. This patch adds an official statement to the kmsg documentation pointing to the current return value for SEEK_CUR, -EINVAL, thus userspace libraries and apps can refer to it for a definitive guide on what to expect. Signed-off-by: Bruno Meneguele <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-07-11Merge tag 'riscv-for-linus-5.8-rc5' of ↵Linus Torvalds1-0/+13
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: "I have a few KGDB-related fixes. They're mostly fixes for build warnings, but there's also: - Support for the qSupported and qXfer packets, which are necessary to pass around GDB XML information which we need for the RISC-V GDB port to fully function. - Users can now select STRICT_KERNEL_RWX instead of forcing it on" * tag 'riscv-for-linus-5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Avoid kgdb.h including gdb_xml.h to solve unused-const-variable warning kgdb: Move the extern declaration kgdb_has_hit_break() to generic kgdb.h riscv: Fix "no previous prototype" compile warning in kgdb.c file riscv: enable the Kconfig prompt of STRICT_KERNEL_RWX kgdb: enable arch to support XML packet.
2020-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller30-281/+461
All conflicts seemed rather trivial, with some guidance from Saeed Mameed on the tc_ct.c one. Signed-off-by: David S. Miller <[email protected]>
2020-07-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds9-96/+202
Pull networking fixes from David Miller: 1) Restore previous behavior of CAP_SYS_ADMIN wrt loading networking BPF programs, from Maciej Żenczykowski. 2) Fix dropped broadcasts in mac80211 code, from Seevalamuthu Mariappan. 3) Slay memory leak in nl80211 bss color attribute parsing code, from Luca Coelho. 4) Get route from skb properly in ip_route_use_hint(), from Miaohe Lin. 5) Don't allow anything other than ARPHRD_ETHER in llc code, from Eric Dumazet. 6) xsk code dips too deeply into DMA mapping implementation internals. Add dma_need_sync and use it. From Christoph Hellwig 7) Enforce power-of-2 for BPF ringbuf sizes. From Andrii Nakryiko. 8) Check for disallowed attributes when loading flow dissector BPF programs. From Lorenz Bauer. 9) Correct packet injection to L3 tunnel devices via AF_PACKET, from Jason A. Donenfeld. 10) Don't advertise checksum offload on ipa devices that don't support it. From Alex Elder. 11) Resolve several issues in TCP MD5 signature support. Missing memory barriers, bogus options emitted when using syncookies, and failure to allow md5 key changes in established states. All from Eric Dumazet. 12) Fix interface leak in hsr code, from Taehee Yoo. 13) VF reset fixes in hns3 driver, from Huazhong Tan. 14) Make loopback work again with ipv6 anycast, from David Ahern. 15) Fix TX starvation under high load in fec driver, from Tobias Waldekranz. 16) MLD2 payload lengths not checked properly in bridge multicast code, from Linus Lüssing. 17) Packet scheduler code that wants to find the inner protocol currently only works for one level of VLAN encapsulation. Allow Q-in-Q situations to work properly here, from Toke Høiland-Jørgensen. 18) Fix route leak in l2tp, from Xin Long. 19) Resolve conflict between the sk->sk_user_data usage of bpf reuseport support and various protocols. From Martin KaFai Lau. 20) Fix socket cgroup v2 reference counting in some situations, from Cong Wang. 21) Cure memory leak in mlx5 connection tracking offload support, from Eli Britstein. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) mlxsw: pci: Fix use-after-free in case of failed devlink reload mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON() net: macb: fix call to pm_runtime in the suspend/resume functions net: macb: fix macb_suspend() by removing call to netif_carrier_off() net: macb: fix macb_get/set_wol() when moving to phylink net: macb: mark device wake capable when "magic-packet" property present net: macb: fix wakeup test in runtime suspend/resume routines bnxt_en: fix NULL dereference in case SR-IOV configuration fails libbpf: Fix libbpf hashmap on (I)LP32 architectures net/mlx5e: CT: Fix memory leak in cleanup net/mlx5e: Fix port buffers cell size value net/mlx5e: Fix 50G per lane indication net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash net/mlx5e: Fix VXLAN configuration restore after function reload net/mlx5e: Fix usage of rcu-protected pointer net/mxl5e: Verify that rpriv is not NULL net/mlx5: E-Switch, Fix vlan or qos setting in legacy mode net/mlx5: Fix eeprom support for SFP module cgroup: Fix sock_cgroup_data on big-endian. selftests: bpf: Fix detach from sockmap tests ...