| Age | Commit message (Collapse) | Author | Files | Lines |
|
Until the very recent commits, many bounded random integers were
calculated using `get_random_u32() % max_plus_one`, which not only
incurs the price of a division -- indicating performance mostly was not
a real issue -- but also does not result in a uniformly distributed
output if max_plus_one is not a power of two. Recent commits moved to
using `prandom_u32_max(max_plus_one)`, which replaces the division with
a faster multiplication, but still does not solve the issue with
non-uniform output.
For some users, maybe this isn't a problem, and for others, maybe it is,
but for the majority of users, probably the question has never been
posed and analyzed, and nobody thought much about it, probably assuming
random is random is random. In other words, the unthinking expectation
of most users is likely that the resultant numbers are uniform.
So we implement here an efficient way of generating uniform bounded
random integers. Through use of compile-time evaluation, and avoiding
divisions as much as possible, this commit introduces no measurable
overhead. At least for hot-path uses tested, any potential difference
was lost in the noise. On both clang and gcc, code generation is pretty
small.
The new function, get_random_u32_below(), lives in random.h, rather than
prandom.h, and has a "get_random_xxx" function name, because it is
suitable for all uses, including cryptography.
In order to be efficient, we implement a kernel-specific variant of
Daniel Lemire's algorithm from "Fast Random Integer Generation in an
Interval", linked below. The kernel's variant takes advantage of
constant folding to avoid divisions entirely in the vast majority of
cases, works on both 32-bit and 64-bit architectures, and requests a
minimal amount of bytes from the RNG.
Link: https://arxiv.org/pdf/1805.10941.pdf
Cc: [email protected] # to ease future backports that use this api
Signed-off-by: Jason A. Donenfeld <[email protected]>
|
|
Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get
the target map file, then it invokes generic_map_update_batch() to do
batch update. generic_map_update_batch() will get the target map file
by using fdget(batch.map_fd) again and pass it to bpf_map_update_value().
The problem is map file returned by the second fdget() may be NULL or a
totally different file compared by map file in bpf_map_do_batch(). The
reason is that the first fdget() only guarantees the liveness of struct
file instead of file descriptor and the file description may be released
by concurrent close() through pick_file().
It doesn't incur any problem as for now, because maps with batch update
support don't use map file in .map_fd_get_ptr() ops. But it is better to
fix the potential access of an invalid map file.
Using __bpf_map_get() again in generic_map_update_batch() can not fix
the problem, because batch.map_fd may be closed and reopened, and the
returned map file may be different with map file got in
bpf_map_do_batch(), so just passing the map file directly to
.map_update_batch() in bpf_map_do_batch().
Signed-off-by: Hou Tao <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Make fb_modesetting_disabled() a static-inline function when it is
defined in the header file. Avoid the linker error shown below.
ld: drivers/video/fbdev/core/fbmon.o: in function `fb_modesetting_disabled':
fbmon.c:(.text+0x1e4): multiple definition of `fb_modesetting_disabled'; drivers/video/fbdev/core/fbmem.o:fbmem.c:(.text+0x1bac): first defined here
A bug report is at [1].
Reported-by: Stephen Rothwell <[email protected]>
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Reviewed-by: Javier Martinez Canillas <[email protected]>
Fixes: 0ba2fa8cbd29 ("fbdev: Add support for the nomodeset kernel parameter")
Cc: Javier Martinez Canillas <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/dri-devel/[email protected]/T/#u # 1
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt,
rather than just sampling the module parameter when the VM is first
created. This restore the original behavior of kvm.halt_poll_ns for VMs
that have not opted into KVM_CAP_HALT_POLL.
Notably, this change restores the ability for admins to disable or
change the maximum halt-polling time system wide for VMs not using
KVM_CAP_HALT_POLL.
Reported-by: Christian Borntraeger <[email protected]>
Fixes: acd05785e48c ("kvm: add capability for halt polling")
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
No more users.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
To disentangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.
Make pci_alloc_irq_vectors() a real function instead of wrapper and add
proper kernel doc to it.
Signed-off-by: Ahmed S. Darwish <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Adjust to reality and remove another layer of pointless Kconfig
indirection. CONFIG_GENERIC_MSI_IRQ is good enough to serve
all purposes.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
What a zoo:
PCI_MSI
select GENERIC_MSI_IRQ
PCI_MSI_IRQ_DOMAIN
def_bool y
depends on PCI_MSI
select GENERIC_MSI_IRQ_DOMAIN
Ergo PCI_MSI enables PCI_MSI_IRQ_DOMAIN which in turn selects
GENERIC_MSI_IRQ_DOMAIN. So all the dependencies on PCI_MSI_IRQ_DOMAIN are
just an indirection to PCI_MSI.
Match the reality and just admit that PCI_MSI requires
GENERIC_MSI_IRQ_DOMAIN.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Add a bus token member to struct msi_domain_info and let
msi_create_irq_domain() set the bus token.
That allows to remove the bus token updates at the call sites.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ahmed S. Darwish <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Split the bus token defines out into a seperate header file to avoid
inclusion of irqdomain.h in msi.h.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now that the last user is gone, confine it to the core code.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
To prepare for removing the exposure of __msi_domain_free_irqs() provide a
post_free() callback in the MSI domain ops which can be used to solve
the problem of the only user of __msi_domain_free_irqs() in arch/powerpc.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Nothing outside of the core code requires this.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When a range of descriptors is freed then all of them are not associated to
a linux interrupt. Remove the filter and add a warning to the free function.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When `timerqueue_getnext()` is called on an empty timer queue, it will
use `rb_entry()` on a NULL pointer, which is invalid. Fix that by using
`rb_entry_safe()` which handles NULL pointers.
This has not caused any issues so far because the offset of the `rb_node`
member in `timerqueue_node` is 0, so `rb_entry()` is essentially a no-op.
Fixes: 511885d7061e ("lib/timerqueue: Rely on rbtree semantics for next timer")
Signed-off-by: Barnabás Pőcze <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
PIN_CONFIG_BIAS_PULL_DOWN and PIN_CONFIG_BIAS_PULL_UP values can
be custom or an SI unit such as ohms
Signed-off-by: Niyas Sait <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Linus Walleij <[email protected]>
|
|
Move the declaration of 'struct trace_array' out of #ifdef
CONFIG_TRACING block, to fix the following warning when CONFIG_TRACING
is not set:
>> include/linux/trace.h:63:45: warning: 'struct trace_array' declared
inside parameter list will not be visible outside of this definition or
declaration
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1a77dd1c2bb5 ("scsi: tracing: Fix compile error in trace_array calls when TRACING is disabled")
Cc: "Martin K. Petersen" <[email protected]>
Cc: Arun Easi <[email protected]>
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Signed-off-by: Aashish Sharma <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
As noted by Michal, the blkg_iostat_set's in the lockless list
hold reference to blkg's to protect against their removal. Those
blkg's hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which is
called when the blkcg reference count reaches 0. This circular dependency
will prevent blkcg from being freed until some other events cause
cgroup_rstat_flush() to be called to flush out the pending blkcg stats.
To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
function to flush stats for a given css and cpu and call it at the blkgs
destruction path, blkcg_destroy_blkgs(), whenever there are still some
pending stats to be flushed. This will ensure that blkcg reference
count can reach 0 ASAP.
Signed-off-by: Waiman Long <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
The fixp_sin32_rad() contains a call to BUG_ON(). If users of
fixp-arith.h have not included the bug.h prior including the fixp-arith.h
the compiler will not find the BUG_ON. Thus every user of fixp-arith.h
must also include bug.h (or roll own variant of BUG_ON()).
Include bug.h from fixp-arith.h so every user of fixp-arith.h does not
need to include the bug.h prior inclusion of fixp-arith.h
Fixes: 559addc25b00 ("[media] fixp-arith: replace sin/cos table by a better precision one")
Signed-off-by: Matti Vaittinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dmitry Torokhov <[email protected]>
|
|
There are no external users of this function.
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Device mappers had always been getting the default 511 dma mask, but
the underlying device might have a larger alignment requirement. Since
this value is used to determine alloweable direct-io alignment, this
needs to be a stackable limit.
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now that dm has been fixed to track of holder registrations before
add_disk, the somewhat buggy block layer code can be safely removed.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Currently the way polling works on the ring buffer is broken. It will
return immediately if there's any data in the ring buffer whereas a read
will block until the watermark (defined by the tracefs buffer_percent file)
is hit.
That is, a select() or poll() will return as if there's data available,
but then the following read will block. This is broken for the way
select()s and poll()s are supposed to work.
Have the polling on the ring buffer also block the same way reads and
splice does on the ring buffer.
Link: https://lkml.kernel.org/r/[email protected]
Cc: Linux Trace Kernel <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Primiano Tucci <[email protected]>
Cc: [email protected]
Fixes: 1e0d6714aceb7 ("ring-buffer: Do not wake up a splice waiter when page is not full")
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
dispatch_hid_bpf_device_event() is supposed to return either an error,
or a valid pointer to memory containing the data.
Returning NULL simply makes a segfault when CONFIG_HID_BPF is not set
for any processed event.
Fixes: 658ee5a64fcfbbf ("HID: bpf: allocate data memory for device_event BPF program")
Signed-off-by: Benjamin Tissoires <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
|
|
Sbitmap code will need to know how many waiters were actually woken for
its batched wakeups implementation. Return the number of woken
exclusive waiters from __wake_up() to facilitate that.
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Introduce a helper mempool_is_saturated(), which tells if the mempool is
under-filled or not. We need it to figure out whether it should be
freed right into the mempool or could be cached with top level caches.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/636aed30be8c35d78f45e244998bc6209283cccc.1667384020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.
Most performance critical paths already use per-cpu
variables, or per-queue ones.
It is reasonable (and more correct) to use atomic operations
for the slow paths.
This patch adds an union for each field of net_device_stats,
so that we can convert paths that are not yet protected
by a spinlock or a mutex.
netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64
Note that the memcpy() we were using on 64bit arches
had no provision to avoid load-tearing,
while atomic_long_read() is providing the needed protection
at no cost.
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Support the kernel's nomodeset parameter for all PCI-based fbdev
drivers that use aperture helpers to remove other, hardware-agnostic
graphics drivers.
The parameter is a simple way of using the firmware-provided scanout
buffer if the hardware's native driver is broken. The same effect
could be achieved with per-driver options, but the importance of the
graphics output for many users makes a single, unified approach
worthwhile.
With nomodeset specified, the fbdev driver module will not load. This
unifies behavior with similar DRM drivers. In DRM helpers, modules
first check the nomodeset parameter before registering the PCI
driver. As fbdev has no such module helpers, we have to modify each
driver individually.
The name 'nomodeset' is slightly misleading, but has been chosen for
historical reasons. Several drivers implemented it before it became a
general option for DRM. So keeping the existing name was preferred over
introducing a new one.
v2:
* print a warning if a driver does not init (Helge)
* wrap video_firmware_drivers_only() in helper
Signed-off-by: Thomas Zimmermann <[email protected]>
Reviewed-by: Javier Martinez Canillas <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Schedule the deferred-I/O worker instead of the damage worker after
writing to the fbdev framebuffer. The deferred-I/O worker then performs
the dirty-fb update. The fbdev emulation will initialize deferred I/O
for all drivers that require damage updates. It is therefore a valid
assumption that the deferred-I/O worker is present.
It would be possible to perform the damage handling directly from within
the write operation. But doing this could increase the overhead of the
write or interfere with a concurrently scheduled deferred-I/O worker.
Instead, scheduling the deferred-I/O worker with its regular delay of
50 ms removes load off the write operation and allows the deferred-I/O
worker to handle multiple write operations that arrived during the delay
time window.
v3:
* remove unused variable (lkp)
v2:
* keep drm_fb_helper_damage() (Daniel)
* use fb_deferred_io_schedule_flush() (Daniel)
* clarify comments (Daniel)
Signed-off-by: Thomas Zimmermann <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.
On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.
uhash_entries sockets length Gbps
64K 1 1 5.69
1Mi 16 5.27
2Mi 32 4.90
4Mi 64 4.09
8Mi 128 2.96
16Mi 256 2.06
32Mi 512 1.12
The per-netns hash table breaks the lengthy lists into shorter ones. It is
useful on a multi-tenant system with thousands of netns. With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.
The max size of the per-netns table is 64K as well. This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns and we cannot make full use of the whole buckets larger than 64K.
/* 0 < num < 64K -> X < hash < X + 64K */
(num + net_hash_mix(net)) & mask;
Also, the min size is 128. We use a bitmap to search for an available
port in udp_lib_get_port(). To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.
The sysctl usage is the same with TCP:
$ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
# sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
# sysctl net.ipv4.udp_child_hash_entries
net.ipv4.udp_child_hash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = -65536 # share the global table
# sysctl -w net.ipv4.udp_child_hash_entries=100
net.ipv4.udp_child_hash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future. Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.
[0]: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Remove support for configuring the device via platform data because
there are no users of wl1251_platform_data left in the mainline kernel.
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
user_regset_copyin_ignore() apparently cannot fail and so always returns 0.
Let's make this function return *void* instead of *int*...
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Shtylyov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
For queueing packets in XDP we want to add a new redirect map type with
support for 64-bit indexes. To prepare fore this, expand the width of the
'key' argument to the bpf_redirect_map() helper. Since BPF registers are
always 64-bit, this should be safe to do after the fact.
Acked-by: Song Liu <[email protected]>
Reviewed-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Move the received_rps counter value next to the other RPS-related members
in softnet_data. This closes two four-byte holes in the structure, making
room for another pointer in the first two cache lines without bumping the
xmit struct to its own line.
Acked-by: Song Liu <[email protected]>
Reviewed-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Add a new tracepoint hid_bpf_rdesc_fixup() so we can trigger a
report descriptor fixup in the bpf world.
Whenever the program gets attached/detached, the device is reconnected
meaning that userspace will see it disappearing and reappearing with
the new report descriptor.
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Benjamin Tissoires <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
|
|
This function can not be called under IRQ, thus it is only available
while in SEC("syscall").
For consistency, this function requires a HID-BPF context to work with,
and so we also provide a helper to create one based on the HID unique
ID.
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Benjamin Tissoires <[email protected]>
--
changes in v12:
- variable dereferenced before check 'ctx'
|Reported-by: kernel test robot <[email protected]>
|Reported-by: Dan Carpenter <[email protected]>
no changes in v11
no changes in v10
changes in v9:
- fixed kfunc declaration aaccording to latest upstream changes
no changes in v8
changes in v7:
- hid_bpf_allocate_context: remove unused variable
- ensures buf is not NULL
changes in v6:
- rename parameter size into buf__sz to teach the verifier about
the actual buffer size used by the call
- remove the allocated data in the user created context, it's not used
new-ish in v5
Signed-off-by: Jiri Kosina <[email protected]>
|
|
We need to also be able to change the size of the report.
Reducing it is easy, because we already have the incoming buffer that is
big enough, but extending it is harder.
Pre-allocate a buffer that is big enough to handle all reports of the
device, and use that as the primary buffer for BPF programs.
To be able to change the size of the buffer, we change the device_event
API and request it to return the size of the buffer.
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Benjamin Tissoires <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
|
|
Declare an entry point that can use fmod_ret BPF programs, and
also an API to access and change the incoming data.
A simpler implementation would consist in just calling
hid_bpf_device_event() for any incoming event and let users deal
with the fact that they will be called for any event of any device.
The goal of HID-BPF is to partially replace drivers, so this situation
can be problematic because we might have programs which will step on
each other toes.
For that, we add a new API hid_bpf_attach_prog() that can be called
from a syscall and we manually deal with a jump table in hid-bpf.
Whenever we add a program to the jump table (in other words, when we
attach a program to a HID device), we keep the number of time we added
this program in the jump table so we can release it whenever there are
no other users.
HID devices have an RCU protected list of available programs in the
jump table, and those programs are called one after the other thanks
to bpf_tail_call().
To achieve the detection of users losing their fds on the programs we
attached, we add 2 tracing facilities on bpf_prog_release() (for when
a fd is closed) and bpf_free_inode() (for when a pinned program gets
unpinned).
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Benjamin Tissoires <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
|
|
Now that static device properties understand notion of child nodes and
references, let's teach gpiolib to handle them:
- GPIOs are represented as a references to software nodes representing
gpiochip
- references must have 2 arguments - GPIO number within the chip and
GPIO flags (GPIO_ACTIVE_LOW/GPIO_ACTIVE_HIGH, etc)
- a new PROPERTY_ENTRY_GPIO() macro is supplied to ensure the above
- name of the software node representing gpiochip must match label of
the gpiochip, as we use it to locate gpiochip structure at runtime
The following illustrates use of software nodes to describe a "System"
button that is currently specified via use of gpio_keys_platform_data
in arch/mips/alchemy/board-mtx1.c. It follows bindings specified in
Documentation/devicetree/bindings/input/gpio-keys.yaml.
static const struct software_node mxt1_gpiochip2_node = {
.name = "alchemy-gpio2",
};
static const struct property_entry mtx1_gpio_button_props[] = {
PROPERTY_ENTRY_U32("linux,code", BTN_0),
PROPERTY_ENTRY_STRING("label", "System button"),
PROPERTY_ENTRY_GPIO("gpios", &mxt1_gpiochip2_node, 7, GPIO_ACTIVE_LOW),
{ }
};
Similarly, arch/arm/mach-tegra/board-paz00.c can be converted to:
static const struct software_node tegra_gpiochip_node = {
.name = "tegra-gpio",
};
static struct property_entry wifi_rfkill_prop[] __initdata = {
PROPERTY_ENTRY_STRING("name", "wifi_rfkill"),
PROPERTY_ENTRY_STRING("type", "wlan"),
PROPERTY_ENTRY_GPIO("reset-gpios",
&tegra_gpiochip_node, 25, GPIO_ACTIVE_HIGH);
PROPERTY_ENTRY_GPIO("shutdown-gpios",
&tegra_gpiochip_node, 85, GPIO_ACTIVE_HIGH);
{ },
};
static struct platform_device wifi_rfkill_device = {
.name = "rfkill_gpio",
.id = -1,
};
...
software_node_register(&tegra_gpiochip_node);
device_create_managed_software_node(&wifi_rfkill_device.dev,
wifi_rfkill_prop, NULL);
Acked-by: Linus Walleij <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Bartosz Golaszewski <[email protected]>
|
|
While the specification allows devices to either deallocate data
or to actually write zeroes on any Write Zeroes command, many SSDs
only do the sensible thing and deallocate data when the DEAC bit
is specific. Set it when it is supported and the caller doesn't
explicitly opt out of deallocation.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
|
|
Currently both io and admin commands are kept under a
coarse-granular CAP_SYS_ADMIN check, disregarding file mode completely.
$ ls -l /dev/ng*
crw-rw-rw- 1 root root 242, 0 Sep 9 19:20 /dev/ng0n1
crw------- 1 root root 242, 1 Sep 9 19:20 /dev/ng0n2
In the example above, ng0n1 appears as if it may allow unprivileged
read/write operation but it does not and behaves same as ng0n2.
This patch implements a shift from CAP_SYS_ADMIN to more fine-granular
control for io-commands.
If CAP_SYS_ADMIN is present, nothing else is checked as before.
Otherwise, following rules are in place
- any admin-cmd is not allowed
- vendor-specific and fabric commmand are not allowed
- io-commands that can write are allowed if matching FMODE_WRITE
permission is present
- io-commands that read are allowed
Add a helper nvme_cmd_allowed that implements above policy.
Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed
for any decision making.
Since file open mode is counted for any approval/denial, change at
various places to keep file-mode information handy.
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Added test cases for basic functions and performance of functions
kallsyms_lookup_name(), kallsyms_on_each_symbol() and
kallsyms_on_each_match_symbol(). It also calculates the compression rate
of the kallsyms compression algorithm for the current symbol set.
The basic functions test begins by testing a set of symbols whose address
values are known. Then, traverse all symbol addresses and find the
corresponding symbol name based on the address. It's impossible to
determine whether these addresses are correct, but we can use the above
three functions along with the addresses to test each other. Due to the
traversal operation of kallsyms_on_each_symbol() is too slow, only 60
symbols can be tested in one second, so let it test on average once
every 128 symbols. The other two functions validate all symbols.
If the basic functions test is passed, print only performance test
results. If the test fails, print error information, but do not perform
subsequent performance tests.
Start self-test automatically after system startup if
CONFIG_KALLSYMS_SELFTEST=y.
Example of output content: (prefix 'kallsyms_selftest:' is omitted
start
---------------------------------------------------------
| nr_symbols | compressed size | original size | ratio(%) |
|---------------------------------------------------------|
| 107543 | 1357912 | 2407433 | 56.40 |
---------------------------------------------------------
kallsyms_lookup_name() looked up 107543 symbols
The time spent on each symbol is (ns): min=630, max=35295, avg=7353
kallsyms_on_each_symbol() traverse all: 11782628 ns
kallsyms_on_each_match_symbol() traverse all: 9261 ns
finish
Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
|
|
Instead of having to pass multiple arguments that describe the register,
pass the bpf_reg_state into the btf_struct_access callback. Currently,
all call sites simply reuse the btf and btf_id of the reg they want to
check the access of. The only exception to this pattern is the callsite
in check_ptr_to_map_access, hence for that case create a dummy reg to
simulate PTR_TO_BTF_ID access.
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Currently, verifier uses MEM_ALLOC type tag to specially tag memory
returned from bpf_ringbuf_reserve helper. However, this is currently
only used for this purpose and there is an implicit assumption that it
only refers to ringbuf memory (e.g. the check for ARG_PTR_TO_ALLOC_MEM
in check_func_arg_reg_off).
Hence, rename MEM_ALLOC to MEM_RINGBUF to indicate this special
relationship and instead open the use of MEM_ALLOC for more generic
allocations made for user types.
Also, since ARG_PTR_TO_ALLOC_MEM_OR_NULL is unused, simply drop it.
Finally, update selftests using 'alloc_' verifier string to 'ringbuf_'.
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Currently, the verifier has two return types, RET_PTR_TO_ALLOC_MEM, and
RET_PTR_TO_ALLOC_MEM_OR_NULL, however the former is confusingly named to
imply that it carries MEM_ALLOC, while only the latter does. This causes
confusion during code review leading to conclusions like that the return
value of RET_PTR_TO_DYNPTR_MEM_OR_NULL (which is RET_PTR_TO_ALLOC_MEM |
PTR_MAYBE_NULL) may be consumable by bpf_ringbuf_{submit,commit}.
Rename it to make it clear MEM_ALLOC needs to be tacked on top of
RET_PTR_TO_MEM.
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Add the support on the map side to parse, recognize, verify, and build
metadata table for a new special field of the type struct bpf_list_head.
To parameterize the bpf_list_head for a certain value type and the
list_node member it will accept in that value type, we use BTF
declaration tags.
The definition of bpf_list_head in a map value will be done as follows:
struct foo {
struct bpf_list_node node;
int data;
};
struct map_value {
struct bpf_list_head head __contains(foo, node);
};
Then, the bpf_list_head only allows adding to the list 'head' using the
bpf_list_node 'node' for the type struct foo.
The 'contains' annotation is a BTF declaration tag composed of four
parts, "contains:name:node" where the name is then used to look up the
type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
the lookup. The node defines name of the member in this type that has
the type struct bpf_list_node, which is actually used for linking into
the linked list. For now, 'kind' part is hardcoded as struct.
This allows building intrusive linked lists in BPF, using container_of
to obtain pointer to entry, while being completely type safe from the
perspective of the verifier. The verifier knows exactly the type of the
nodes, and knows that list helpers return that type at some fixed offset
where the bpf_list_node member used for this list exists. The verifier
also uses this information to disallow adding types that are not
accepted by a certain list.
For now, no elements can be added to such lists. Support for that is
coming in future patches, hence draining and freeing items is done with
a TODO that will be resolved in a future patch.
Note that the bpf_list_head_free function moves the list out to a local
variable under the lock and releases it, doing the actual draining of
the list items outside the lock. While this helps with not holding the
lock for too long pessimizing other concurrent list operations, it is
also necessary for deadlock prevention: unless every function called in
the critical section would be notrace, a fentry/fexit program could
attach and call bpf_map_update_elem again on the map, leading to the
same lock being acquired if the key matches and lead to a deadlock.
While this requires some special effort on part of the BPF programmer to
trigger and is highly unlikely to occur in practice, it is always better
if we can avoid such a condition.
While notrace would prevent this, doing the draining outside the lock
has advantages of its own, hence it is used to also fix the deadlock
related problem.
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
The current offset needs to also skip over the already copied region in
addition to the size of the next field. This case manifests where there
are gaps between adjacent special fields.
It was observed that for a map value with size 48, having fields at:
off: 0, 16, 32
size: 4, 16, 16
The current code does:
memcpy(dst + 0, src + 0, 0)
memcpy(dst + 4, src + 4, 12)
memcpy(dst + 20, src + 20, 12)
memcpy(dst + 36, src + 36, 12)
With the fix, it is done correctly as:
memcpy(dst + 0, src + 0, 0)
memcpy(dst + 4, src + 4, 12)
memcpy(dst + 32, src + 32, 0)
memcpy(dst + 48, src + 48, 0)
Fixes: 4d7d7f69f4b1 ("bpf: Adapt copy_map_value for multiple offset case")
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
In f71b2f64177a ("bpf: Refactor map->off_arr handling"), map->off_arr
was refactored to be btf_field_offs. The number of field offsets is
equal to maximum possible fields limited by BTF_FIELDS_MAX. Hence, reuse
BTF_FIELDS_MAX as spin_lock and timer no longer are to be handled
specially for offset sorting, fix the comment, and remove incorrect
WARN_ON as its rec->cnt can never exceed this value. The reason to keep
separate constant was the it was always more 2 more than total kptrs.
This is no longer the case.
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Pull VFIO fixes from Alex Williamson:
- Fixes for potential container registration leak for drivers not
implementing a close callback, duplicate container de-registrations,
and a regression in support for bus reset on last device close from
a device set (Anthony DeRossi)
* tag 'vfio-v6.1-rc6' of https://github.com/awilliam/linux-vfio:
vfio/pci: Check the device set open count on reset
vfio: Export the device set open count
vfio: Fix container device registration life cycle
|
|
Introduces new helper function to aid in .probe_new() refactors. In order
to use existing i2c_get_device_id() on the probe callback, the device
match table needs to be accessible in that function, which would require
bigger refactors in some drivers using the deprecated .probe callback.
This issue was discussed in more detail in the IIO mailing list.
Link: https://lore.kernel.org/all/[email protected]/
Suggested-by: Nuno Sá <[email protected]>
Suggested-by: Andy Shevchenko <[email protected]>
Suggested-by: Jonathan Cameron <[email protected]>
Signed-off-by: Angel Iglesias <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
|