aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-05-18Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-40/+21
Pull mlx5 fix from Michael Tsirkin: "One last minute fixup The patch has been on list for a while but as it was posted as part of a thread it was missed" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vdpa/mlx5: Use consistent RQT size
2022-05-18blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()Jens Axboe1-1/+0
A previous commit got rid of unnecessary rcu_read_lock() inside the IRQ disabling queue_lock, but this debug statement was left. It's now firing since we are indeed not inside a RCU read lock, but we don't need to be as we're still preempt safe. Get rid of the check, as we have a lockdep assert for holding the queue lock right after it anyway. Link: https://lore.kernel.org/linux-block/[email protected]/ Reported-by: Marek Szyprowski <[email protected]> Fixes: 77c570a1ea85 ("blk-cgroup: Remove unnecessary rcu_read_lock/unlock()") Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: disallow mixed provided buffer group registrationsJens Axboe1-3/+5
It's nonsensical to register a provided buffer ring, if a classic provided buffer group with the same ID exists. Depending on the order of which we decide what type to pick, the other type will never get used. Explicitly disallow it and return an error if this is attempted. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: initialize io_buffer_list head when shared ring is unregisteredJens Axboe1-0/+3
We use ->buf_pages != 0 to tell if this is a shared buffer ring or a classic provided buffer group. If we unregister the shared ring and then attempt to use it, buf_pages is zero yet the classic list head isn't properly initialized. This causes io_buffer_select() to think that we have classic buffers available, but then we crash when we try and get one from the list. Just initialize the list if we unregister a shared buffer ring, leaving it in a sane state for either re-registration or for attempting to use it. And do the same for the initial setup from the classic path. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <[email protected]>
2022-05-18Input: ili210x - use one common reset implementationMarek Vasut1-12/+8
Rename ili251x_hardware_reset() to ili210x_hardware_reset(), change its parameter from struct device * to struct gpio_desc *, and use it as one single consistent reset implementation all over the driver. Also increase the minimum reset duration to 12ms, to make sure the reset is really within the spec. Signed-off-by: Marek Vasut <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dmitry Torokhov <[email protected]>
2022-05-18Input: ili210x - fix reset timingMarek Vasut1-2/+2
According to Ilitek "231x & ILI251x Programming Guide" Version: 2.30 "2.1. Power Sequence", "T4 Chip Reset and discharge time" is minimum 10ms and "T2 Chip initial time" is maximum 150ms. Adjust the reset timings such that T4 is 12ms and T2 is 160ms to fit those figures. This prevents sporadic touch controller start up failures when some systems with at least ILI251x controller boot, without this patch the systems sometimes fail to communicate with the touch controller. Fixes: 201f3c803544c ("Input: ili210x - add reset GPIO support") Signed-off-by: Marek Vasut <[email protected]> Link: https://lore.kernel.org/r/[email protected] Cc: [email protected] Signed-off-by: Dmitry Torokhov <[email protected]>
2022-05-18drm/amd: Don't reset dGPUs if the system is going to s2idleMario Limonciello3-1/+17
An A+A configuration on ASUS ROG Strix G513QY proves that the ASIC reset for handling aborted suspend can't work with s2idle. This functionality was introduced in commit daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)"). A few other commits have gone on top of the ASIC reset, but this still doesn't work on the A+A configuration in s2idle. Avoid doing the reset on dGPUs specifically when using s2idle. Fixes: daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2008 Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-05-18libceph: fix misleading ceph_osdc_cancel_request() commentIlya Dryomov1-2/+7
cancel_request() never guaranteed that after its return the OSD client would be completely done with the OSD request. The callback (if specified) can still be invoked and a ref can still be held. Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Xiubo Li <[email protected]>
2022-05-18libceph: fix potential use-after-free on linger ping and resendsIlya Dryomov2-183/+122
request_reinit() is not only ugly as the comment rightfully suggests, but also unsafe. Even though it is called with osdc->lock held for write in all cases, resetting the OSD request refcount can still race with handle_reply() and result in use-after-free. Taking linger ping as an example: handle_timeout thread handle_reply thread down_read(&osdc->lock) req = lookup_request(...) ... finish_request(req) # unregisters up_read(&osdc->lock) __complete_request(req) linger_ping_cb(req) # req->r_kref == 2 because handle_reply still holds its ref down_write(&osdc->lock) send_linger_ping(lreq) req = lreq->ping_req # same req # cancel_linger_request is NOT # called - handle_reply already # unregistered request_reinit(req) WARN_ON(req->r_kref != 1) # fires request_init(req) kref_init(req->r_kref) # req->r_kref == 1 after kref_init ceph_osdc_put_request(req) kref_put(req->r_kref) # req->r_kref == 0 after kref_put, req is freed <further req initialization/use> !!! This happens because send_linger_ping() always (re)uses the same OSD request for watch ping requests, relying on cancel_linger_request() to unregister it from the OSD client and rip its messages out from the messenger. send_linger() does the same for watch/notify registration and watch reconnect requests. Unfortunately cancel_request() doesn't guarantee that after it returns the OSD client would be completely done with the OSD request -- a ref could still be held and the callback (if specified) could still be invoked too. The original motivation for request_reinit() was inability to deal with allocation failures in send_linger() and send_linger_ping(). Switching to using osdc->req_mempool (currently only used by CephFS) respects that and allows us to get rid of request_reinit(). Cc: [email protected] Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Acked-by: Jeff Layton <[email protected]>
2022-05-18io_uring: add fully sparse buffer registrationPavel Begunkov1-7/+15
Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but fixed buffers as well. It makes the rsrc API more consistent. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-05-18x86/sev: Annotate stack change in the #VC handlerLai Jiangshan1-0/+1
In idtentry_vc(), vc_switch_off_ist() determines a safe stack to switch to, off of the IST stack. Annotate the new stack switch with ENCODE_FRAME_POINTER in case UNWINDER_FRAME_POINTER is used. A stack walk before looks like this: CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc7+ #2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl dump_stack kernel_exc_vmm_communication asm_exc_vmm_communication ? native_read_msr ? __x2apic_disable.part.0 ? x2apic_setup ? cpu_init ? trap_init ? start_kernel ? x86_64_start_reservations ? x86_64_start_kernel ? secondary_startup_64_no_verify </TASK> and with the fix, the stack dump is exact: CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc7+ #3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl dump_stack kernel_exc_vmm_communication asm_exc_vmm_communication RIP: 0010:native_read_msr Code: ... < snipped regs > ? __x2apic_disable.part.0 x2apic_setup cpu_init trap_init start_kernel x86_64_start_reservations x86_64_start_kernel secondary_startup_64_no_verify </TASK> [ bp: Test in a SEV-ES guest and rewrite the commit message to explain what exactly this does. ] Fixes: a13644f3a53d ("x86/entry/64: Add entry code for #VC handler") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-18fs-verity: remove unused parameter desc_size in fsverity_create_info()Zhang Jianhua4-16/+9
The parameter desc_size in fsverity_create_info() is useless and it is not referenced anywhere. The greatest meaning of desc_size here is to indecate the size of struct fsverity_descriptor and futher calculate the size of signature. However, the desc->sig_size can do it also and it is indeed, so remove it. Therefore, it is no need to acquire desc_size by fsverity_get_descriptor() in ensure_verity_info(), so remove the parameter desc_ret in fsverity_get_descriptor() too. Signed-off-by: Zhang Jianhua <[email protected]> Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-18nvme: add support for TP4084 - Time-to-Ready EnhancementsChristoph Hellwig3-6/+102
Add support for using longer timeouts during controller initialization and letting the controller come up with namespaces that are not ready for I/O yet. We skip these not ready namespaces during scanning and only bring them online once anoter scan is kicked off by the AEN that is set when the NRDY bit gets set in the I/O Command Set Independent Identify Namespace Data Structure. This asynchronous probing avoids blocking the kernel boot when controllers take a very long time to recover after unclean shutdowns (up to minutes). Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2022-05-18Fix double fget() in vhost_net_set_backend()Al Viro1-8/+7
Descriptor table is a shared resource; two fget() on the same descriptor may return different struct file references. get_tap_ptr_ring() is called after we'd found (and pinned) the socket we'll be using and it tries to find the private tun/tap data structures associated with it. Redoing the lookup by the same file descriptor we'd used to get the socket is racy - we need to same struct file. Thanks to Jason for spotting a braino in the original variant of patch - I'd missed the use of fd == -1 for disabling backend, and in that case we can end up with sock == NULL and sock != oldsock. Cc: [email protected] Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Jason Wang <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-05-18vdpa/mlx5: Use consistent RQT sizeEli Cohen1-40/+21
The current code evaluates RQT size based on the configured number of virtqueues. This can raise an issue in the following scenario: Assume MQ was negotiated. 1. mlx5_vdpa_set_map() gets called. 2. handle_ctrl_mq() is called setting cur_num_vqs to some value, lower than the configured max VQs. 3. A second set_map gets called, but now a smaller number of VQs is used to evaluate the size of the RQT. 4. handle_ctrl_mq() is called with a value larger than what the RQT can hold. This will emit errors and the driver state is compromised. To fix this, we use a new field in struct mlx5_vdpa_net to hold the required number of entries in the RQT. This value is evaluated in mlx5_vdpa_set_driver_features() where we have the negotiated features all set up. In addition to that, we take into consideration the max capability of RQT entries early when the device is added so we don't need to take consider it when creating the RQT. Last, we remove the use of mlx5_vdpa_max_qps() which just returns the max_vas / 2 and make the code clearer. Fixes: 52893733f2c5 ("vdpa/mlx5: Add multiqueue support") Acked-by: Jason Wang <[email protected]> Signed-off-by: Eli Cohen <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2022-05-18Merge tag 'sound-5.18' of ↵Linus Torvalds4-6/+79
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "A collection of last-minute HD- an USB-audio quirks in addition to a fix for the legacy ISA wavefront driver. All look small and easy" * tag 'sound-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: usb-audio: Restore Rane SL-1 quirk ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machine ALSA: hda/realtek: Add quirk for TongFang devices with pop noise ALSA: hda/realtek: Add quirk for the Framework Laptop ALSA: wavefront: Proper check of get_user() error ALSA: hda/realtek: Add quirk for Dell Latitude 7520 ALSA: hda - fix unused Realtek function when PM is not enabled ALSA: usb-audio: Don't get sample rate for MCT Trigger 5 USB-to-HDMI
2022-05-18netfilter: nf_tables: disable expression reduction infraPablo Neira Ayuso1-10/+1
Either userspace or kernelspace need to pre-fetch keys inconditionally before comparisons for this to work. Otherwise, register tracking data is misleading and it might result in reducing expressions which are not yet registers. First expression is also guaranteed to be evaluated always, however, certain expressions break before writing data to registers, before comparing the data, leaving the register in undetermined state. This patch disables this infrastructure by now. Fixes: b2d306542ff9 ("netfilter: nf_tables: do not reduce read-only expressions") Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-05-18netfilter: flowtable: move dst_check to packet pathRitaro Takenaka2-22/+20
Fixes sporadic IPv6 packet loss when flow offloading is enabled. IPv6 route GC and flowtable GC are not synchronized. When dst_cache becomes stale and a packet passes through the flow before the flowtable GC teardowns it, the packet can be dropped. So, it is necessary to check dst every time in packet path. Fixes: 227e1e4d0d6c ("netfilter: nf_flowtable: skip device lookup from interface index") Signed-off-by: Ritaro Takenaka <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-05-18netfilter: flowtable: fix TCP flow teardownPablo Neira Ayuso2-27/+9
This patch addresses three possible problems: 1. ct gc may race to undo the timeout adjustment of the packet path, leaving the conntrack entry in place with the internal offload timeout (one day). 2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE timeout is reached before the flow offload del. 3. tcp ct is always set to ESTABLISHED with a very long timeout in flow offload teardown/delete even though the state might be already CLOSED. Also as a remark we cannot assume that the FIN or RST packet is hitting flow table teardown as the packet might get bumped to the slow path in nftables. This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN state transition. Moreover, teturn the connection's ownership to conntrack upon teardown by clearing the offload flag and fixing the established timeout value. The flow table GC thread will asynchonrnously free the flow table and hardware offload entries. Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows on which is also misleading since the flow is back to classic conntrack path. If nf_ct_delete() removes the entry from the conntrack table, then it calls nf_ct_put() which decrements the refcnt. This is not a problem because the flowtable holds a reference to the conntrack object from flow_offload_alloc() path which is released via flow_offload_free(). This patch also updates nft_flow_offload to skip packets in SYN_RECV state. Since we might miss or bump packets to slow path, we do not know what will happen there while we are still in SYN_RECV, this patch postpones offload up to the next packet which also aligns to the existing behaviour in tc-ct. flow_offload_teardown() does not reset the existing tcp state from flow_offload_fixup_tcp() to ESTABLISHED anymore, packets bump to slow path might have already update the state to CLOSE/FIN. Joint work with Oz and Sven. Fixes: 1e5b2471bcc4 ("netfilter: nf_flow_table: teardown flow timeout race") Signed-off-by: Oz Shlomo <[email protected]> Signed-off-by: Sven Auhagen <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-05-18random: credit architectural init the exact amountJason A. Donenfeld1-6/+6
RDRAND and RDSEED can fail sometimes, which is fine. We currently initialize the RNG with 512 bits of RDRAND/RDSEED. We only need 256 bits of those to succeed in order to initialize the RNG. Instead of the current "all or nothing" approach, actually credit these contributions the amount that is actually contributed. Reviewed-by: Dominik Brodowski <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random: handle latent entropy and command line from random_init()Jason A. Donenfeld4-23/+21
Currently, start_kernel() adds latent entropy and the command line to the entropy bool *after* the RNG has been initialized, deferring when it's actually used by things like stack canaries until the next time the pool is seeded. This surely is not intended. Rather than splitting up which entropy gets added where and when between start_kernel() and random_init(), just do everything in random_init(), which should eliminate these kinds of bugs in the future. While we're at it, rename the awkwardly titled "rand_initialize()" to the more standard "random_init()" nomenclature. Reviewed-by: Dominik Brodowski <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random: use proper jiffies comparison macroJason A. Donenfeld1-1/+1
This expands to exactly the same code that it replaces, but makes things consistent by using the same macro for jiffy comparisons throughout. Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random: remove ratelimiting for in-kernel unseeded randomnessJason A. Donenfeld2-46/+19
The CONFIG_WARN_ALL_UNSEEDED_RANDOM debug option controls whether the kernel warns about all unseeded randomness or just the first instance. There's some complicated rate limiting and comparison to the previous caller, such that even with CONFIG_WARN_ALL_UNSEEDED_RANDOM enabled, developers still don't see all the messages or even an accurate count of how many were missed. This is the result of basically parallel mechanisms aimed at accomplishing more or less the same thing, added at different points in random.c history, which sort of compete with the first-instance-only limiting we have now. It turns out, however, that nobody cares about the first unseeded randomness instance of in-kernel users. The same first user has been there for ages now, and nobody is doing anything about it. It isn't even clear that anybody _can_ do anything about it. Most places that can do something about it have switched over to using get_random_bytes_wait() or wait_for_random_bytes(), which is the right thing to do, but there is still much code that needs randomness sometimes during init, and as a geeneral rule, if you're not using one of the _wait functions or the readiness notifier callback, you're bound to be doing it wrong just based on that fact alone. So warning about this same first user that can't easily change is simply not an effective mechanism for anything at all. Users can't do anything about it, as the Kconfig text points out -- the problem isn't in userspace code -- and kernel developers don't or more often can't react to it. Instead, show the warning for all instances when CONFIG_WARN_ALL_UNSEEDED_RANDOM is set, so that developers can debug things need be, or if it isn't set, don't show a warning at all. At the same time, CONFIG_WARN_ALL_UNSEEDED_RANDOM now implies setting random.ratelimit_disable=1 on by default, since if you care about one you probably care about the other too. And we can clean up usage around the related urandom_warning ratelimiter as well (whose behavior isn't changing), so that it properly counts missed messages after the 10 message threshold is reached. Cc: Theodore Ts'o <[email protected]> Cc: Dominik Brodowski <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random: move initialization out of reseeding hot pathJason A. Donenfeld1-23/+19
Initialization happens once -- by way of credit_init_bits() -- and then it never happens again. Therefore, it doesn't need to be in crng_reseed(), which is a hot path that is called multiple times. It also doesn't make sense to have there, as initialization activity is better associated with initialization routines. After the prior commit, crng_reseed() now won't be called by multiple concurrent callers, which means that we can safely move the "finialize_init" logic into crng_init_bits() unconditionally. Reviewed-by: Dominik Brodowski <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random: avoid initializing twice in credit raceJason A. Donenfeld1-5/+5
Since all changes of crng_init now go through credit_init_bits(), we can fix a long standing race in which two concurrent callers of credit_init_bits() have the new bit count >= some threshold, but are doing so with crng_init as a lower threshold, checked outside of a lock, resulting in crng_reseed() or similar being called twice. In order to fix this, we can use the original cmpxchg value of the bit count, and only change crng_init when the bit count transitions from below a threshold to meeting the threshold. Reviewed-by: Dominik Brodowski <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random: use symbolic constants for crng_init statesJason A. Donenfeld1-19/+19
crng_init represents a state machine, with three states, and various rules for transitions. For the longest time, we've been managing these with "0", "1", and "2", and expecting people to figure it out. To make the code more obvious, replace these with proper enum values representing the transition, and then redocument what each of these states mean. Reviewed-by: Dominik Brodowski <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random32: use real rng for non-deterministic randomnessJason A. Donenfeld6-395/+15
random32.c has two random number generators in it: one that is meant to be used deterministically, with some predefined seed, and one that does the same exact thing as random.c, except does it poorly. The first one has some use cases. The second one no longer does and can be replaced with calls to random.c's proper random number generator. The relatively recent siphash-based bad random32.c code was added in response to concerns that the prior random32.c was too deterministic. Out of fears that random.c was (at the time) too slow, this code was anonymously contributed. Then out of that emerged a kind of shadow entropy gathering system, with its own tentacles throughout various net code, added willy nilly. Stop👏making👏bespoke👏random👏number👏generators👏. Fortunately, recent advances in random.c mean that we can stop playing with this sketchiness, and just use get_random_u32(), which is now fast enough. In micro benchmarks using RDPMC, I'm seeing the same median cycle count between the two functions, with the mean being _slightly_ higher due to batches refilling (which we can optimize further need be). However, when doing *real* benchmarks of the net functions that actually use these random numbers, the mean cycles actually *decreased* slightly (with the median still staying the same), likely because the additional prandom code means icache misses and complexity, whereas random.c is generally already being used by something else nearby. The biggest benefit of this is that there are many users of prandom who probably should be using cryptographically secure random numbers. This makes all of those accidental cases become secure by just flipping a switch. Later on, we can do a tree-wide cleanup to remove the static inline wrapper functions that this commit adds. There are also some low-ish hanging fruits for making this even faster in the future: a get_random_u16() function for use in the networking stack will give a 2x performance boost there, using SIMD for ChaCha20 will let us compute 4 or 8 or 16 blocks of output in parallel, instead of just one, giving us large buffers for cheap, and introducing a get_random_*_bh() function that assumes irqs are already disabled will shave off a few cycles for ordinary calls. These are things we can chip away at down the road. Acked-by: Jakub Kicinski <[email protected]> Acked-by: Theodore Ts'o <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18siphash: use one source of truth for siphash permutationsJason A. Donenfeld4-61/+52
The SipHash family of permutations is currently used in three places: - siphash.c itself, used in the ordinary way it was intended. - random32.c, in a construction from an anonymous contributor. - random.c, as part of its fast_mix function. Each one of these places reinvents the wheel with the same C code, same rotation constants, and same symmetry-breaking constants. This commit tidies things up a bit by placing macros for the permutations and constants into siphash.h, where each of the three .c users can access them. It also leaves a note dissuading more users of them from emerging. Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random: help compiler out with fast_mix() by using simpler argumentsJason A. Donenfeld1-21/+23
Now that fast_mix() has more than one caller, gcc no longer inlines it. That's fine. But it also doesn't handle the compound literal argument we pass it very efficiently, nor does it handle the loop as well as it could. So just expand the code to spell out this function so that it generates the same code as it did before. Performance-wise, this now behaves as it did before the last commit. The difference in actual code size on x86 is 45 bytes, which is less than a cache line. Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18random: do not use input pool from hard IRQsJason A. Donenfeld1-15/+36
Years ago, a separate fast pool was added for interrupts, so that the cost associated with taking the input pool spinlocks and mixing into it would be avoided in places where latency is critical. However, one oversight was that add_input_randomness() and add_disk_randomness() still sometimes are called directly from the interrupt handler, rather than being deferred to a thread. This means that some unlucky interrupts will be caught doing a blake2s_compress() call and potentially spinning on input_pool.lock, which can also be taken by unprivileged users by writing into /dev/urandom. In order to fix this, add_timer_randomness() now checks whether it is being called from a hard IRQ and if so, just mixes into the per-cpu IRQ fast pool using fast_mix(), which is much faster and can be done lock-free. A nice consequence of this, as well, is that it means hard IRQ context FPU support is likely no longer useful. The entropy estimation algorithm used by add_timer_randomness() is also somewhat different than the one used for add_interrupt_randomness(). The former looks at deltas of deltas of deltas, while the latter just waits for 64 interrupts for one bit or for one second since the last bit. In order to bridge these, and since add_interrupt_randomness() runs after an add_timer_randomness() that's called from hard IRQ, we add to the fast pool credit the related amount, and then subtract one to account for add_interrupt_randomness()'s contribution. A downside of this, however, is that the num argument is potentially attacker controlled, which puts a bit more pressure on the fast_mix() sponge to do more than it's really intended to do. As a mitigating factor, the first 96 bits of input aren't attacker controlled (a cycle counter followed by zeros), which means it's essentially two rounds of siphash rather than one, which is somewhat better. It's also not that much different from add_interrupt_randomness()'s use of the irq stack instruction pointer register. Cc: Thomas Gleixner <[email protected]> Cc: Filipe Manana <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-18arm64/sve: Move sve_free() into SVE code sectionGeert Uytterhoeven1-17/+16
If CONFIG_ARM64_SVE is not set: arch/arm64/kernel/fpsimd.c:294:13: warning: ‘sve_free’ defined but not used [-Wunused-function] Fix this by moving sve_free() and __sve_free() into the existing section protected by "#ifdef CONFIG_ARM64_SVE", now the last user outside that section has been removed. Fixes: a1259dd80719 ("arm64/sve: Delay freeing memory in fpsimd_flush_thread()") Signed-off-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Mark Brown <[email protected]> Link: https://lore.kernel.org/r/cd633284683c24cb9469f8ff429915aedf67f868.1652798894.git.geert+renesas@glider.be Signed-off-by: Catalin Marinas <[email protected]>
2022-05-18arm64: Kconfig.platforms: Add commentsJuerg Haefliger1-1/+1
Add trailing comments to endmenu statements for better readability. Signed-off-by: Juerg Haefliger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
2022-05-18arm64: Kconfig: Fix indentation and add commentsJuerg Haefliger1-48/+47
The convention for indentation seems to be a single tab. Help text is further indented by an additional two whitespaces. Fix the lines that violate these rules. While add it, add trailing comments to endif and endmenu statements for better readability. Signed-off-by: Juerg Haefliger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
2022-05-18net: ftgmac100: Disable hardware checksum on AST2600Joel Stanley1-0/+5
The AST2600 when using the i210 NIC over NC-SI has been observed to produce incorrect checksum results with specific MTU values. This was first observed when sending data across a long distance set of networks. On a local network, the following test was performed using a 1MB file of random data. On the receiver run this script: #!/bin/bash while [ 1 ]; do # Zero the stats nstat -r > /dev/null nc -l 9899 > test-file # Check for checksum errors TcpInCsumErrors=$(nstat | grep TcpInCsumErrors) if [ -z "$TcpInCsumErrors" ]; then echo No TcpInCsumErrors else echo TcpInCsumErrors = $TcpInCsumErrors fi done On an AST2600 system: # nc <IP of receiver host> 9899 < test-file The test was repeated with various MTU values: # ip link set mtu 1410 dev eth0 The observed results: 1500 - good 1434 - bad 1400 - good 1410 - bad 1420 - good The test was repeated after disabling tx checksumming: # ethtool -K eth0 tx-checksumming off And all MTU values tested resulted in transfers without error. An issue with the driver cannot be ruled out, however there has been no bug discovered so far. David has done the work to take the original bug report of slow data transfer between long distance connections and triaged it down to this test case. The vendor suspects this this is a hardware issue when using NC-SI. The fixes line refers to the patch that introduced AST2600 support. Reported-by: David Wilder <[email protected]> Reviewed-by: Dylan Hung <[email protected]> Signed-off-by: Joel Stanley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-05-18igb: skip phy status check where unavailableKevin Mitchell1-1/+2
igb_read_phy_reg() will silently return, leaving phy_data untouched, if hw->ops.read_reg isn't set. Depending on the uninitialized value of phy_data, this led to the phy status check either succeeding immediately or looping continuously for 2 seconds before emitting a noisy err-level timeout. This message went out to the console even though there was no actual problem. Instead, first check if there is read_reg function pointer. If not, proceed without trying to check the phy status register. Fixes: b72f3f72005d ("igb: When GbE link up, wait for Remote receiver status condition") Signed-off-by: Kevin Mitchell <[email protected]> Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-05-18nfc: pn533: Fix buggy cleanup orderLin Ma1-2/+3
When removing the pn533 device (i2c or USB), there is a logic error. The original code first cancels the worker (flush_delayed_work) and then destroys the workqueue (destroy_workqueue), leaving the timer the last one to be deleted (del_timer). This result in a possible race condition in a multi-core preempt-able kernel. That is, if the cleanup (pn53x_common_clean) is concurrently run with the timer handler (pn533_listen_mode_timer), the timer can queue the poll_work to the already destroyed workqueue, causing use-after-free. This patch reorder the cleanup: it uses the del_timer_sync to make sure the handler is finished before the routine will destroy the workqueue. Note that the timer cannot be activated by the worker again. static void pn533_wq_poll(struct work_struct *work) ... rc = pn533_send_poll_frame(dev); if (rc) return; if (cur_mod->len == 0 && dev->poll_mod_count > 1) mod_timer(&dev->listen_timer, ...); That is, the mod_timer can be called only when pn533_send_poll_frame() returns no error, which is impossible because the device is detaching and the lower driver should return ENODEV code. Signed-off-by: Lin Ma <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-05-18Merge tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme into ↵Jens Axboe8-15/+37
for-5.19/drivers Pull NVMe updates from Christoph: "nvme updates for Linux 5.19 - tighten the PCI presence check (Stefan Roese): - fix a potential NULL pointer dereference in an error path (Kyle Miller Smith) - fix interpretation of the DMRSL field (Tom Yan) - relax the data transfer alignment (Keith Busch) - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni, me)" * tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme: nvme: split the enum used for various register constants nvme-fabrics: add a request timeout helper nvme-pci: harden drive presence detect in nvme_dev_disable() nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags nvme: mark internal passthru request RQF_QUIET nvme: remove unneeded include from constants file nvme: add missing status values to verbose logging nvme: set dma alignment to dword nvme: fix interpretation of DMRSL
2022-05-18io_uring: use rcu_dereference in io_closeChristoph Hellwig1-1/+2
Accessing the file table needs a rcu_dereference_protected(). Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: consistently use the EPOLL* definesChristoph Hellwig1-4/+4
POLL* are unannotated values for the userspace ABI, while everything in-kernel should use EPOLL* and the __poll_t type. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: make apoll_events a __poll_tChristoph Hellwig1-3/+4
apoll_events is fed to vfs_poll and the poll tables, so it should be a __poll_t. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: drop a spurious inline on a forward declarationChristoph Hellwig1-1/+1
io_file_get_normal isn't marked inline, so don't claim it as such in the forward declaration. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: don't use ERR_PTR for user pointersChristoph Hellwig1-46/+37
ERR_PTR abuses the high bits of a pointer to transport error information. This is only safe for kernel pointers and not user pointers. Fix io_buffer_select and its helpers to just return NULL for failure and get rid of this abuse. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: use a rwf_t for io_rw.flagsChristoph Hellwig1-1/+1
Use the proper type. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: add support for ring mapped supplied buffersJens Axboe2-12/+258
Provided buffers allow an application to supply io_uring with buffers that can then be grabbed for a read/receive request, when the data source is ready to deliver data. The existing scheme relies on using IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use in real world applications. It's pretty efficient if the application is able to supply back batches of provided buffers when they have been consumed and the application is ready to recycle them, but if fragmentation occurs in the buffer space, it can become difficult to supply enough buffers at the time. This hurts efficiency. Add a register op, IORING_REGISTER_PBUF_RING, which allows an application to setup a shared queue for each buffer group of provided buffers. The application can then supply buffers simply by adding them to this ring, and the kernel can consume then just as easily. The ring shares the head with the application, the tail remains private in the kernel. Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the ring, they must use the mapped ring. Mapped provided buffer rings can co-exist with normal provided buffers, just not within the same group ID. To gauge overhead of the existing scheme and evaluate the mapped ring approach, a simple NOP benchmark was written. It uses a ring of 128 entries, and submits/completes 32 at the time. 'Replenish' is how many buffers are provided back at the time after they have been consumed: Test Replenish NOPs/sec ================================================================ No provided buffers NA ~30M Provided buffers 32 ~16M Provided buffers 1 ~10M Ring buffers 32 ~27M Ring buffers 1 ~27M The ring mapped buffers perform almost as well as not using provided buffers at all, and they don't care if you provided 1 or more back at the same time. This means application can just replenish as they go, rather than need to batch and compact, further reducing overhead in the application. The NOP benchmark above doesn't need to do any compaction, so that overhead isn't even reflected in the above test. Co-developed-by: Dylan Yudaken <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: add io_pin_pages() helperJens Axboe1-27/+50
Abstract this out from io_sqe_buffer_register() so we can use it elsewhere too without duplicating this code. No intended functional changes in this patch. Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: add buffer selection support to IORING_OP_NOPJens Axboe1-1/+12
Obviously not really useful since it's not transferring data, but it is helpful in benchmarking overhead of provided buffers. Signed-off-by: Jens Axboe <[email protected]>
2022-05-18io_uring: fix locking state for empty buffer groupJens Axboe1-11/+14
io_provided_buffer_select() must drop the submit lock, if needed, even in the error handling case. Failure to do so will leave us with the ctx->uring_lock held, causing spew like: ==================================== WARNING: iou-wrk-366/368 still has locks held! 5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted ------------------------------------ 1 lock held by iou-wrk-366/368: #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48 stack backtrace: CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace.part.0+0xa4/0xd4 show_stack+0x14/0x5c dump_stack_lvl+0x88/0xb0 dump_stack+0x14/0x2c debug_check_no_locks_held+0x84/0x90 try_to_freeze.isra.0+0x18/0x44 get_signal+0x94/0x6ec io_wqe_worker+0x1d8/0x2b4 ret_from_fork+0x10/0x20 and triggering later hangs off get_signal() because we attempt to re-grab the lock. Reported-by: [email protected] Fixes: 149c69b04a90 ("io_uring: abstract out provided buffer list selection") Signed-off-by: Jens Axboe <[email protected]>
2022-05-18Merge branch 'mptcp-checksums'David S. Miller3-18/+46
Mat Martineau says: ==================== mptcp: Fix checksum byte order on little-endian These patches address a bug in the byte ordering of MPTCP checksums on little-endian architectures. The __sum16 type is always big endian, but was being cast to u16 and then byte-swapped (on little-endian archs) when reading/writing the checksum field in MPTCP option headers. MPTCP checksums are off by default, but are enabled if one or both peers request it in the SYN/SYNACK handshake. The corrected code is verified to interoperate between big-endian and little-endian machines. Patch 1 fixes the checksum byte order, patch 2 partially mitigates interoperation with peers sending bad checksums by falling back to TCP instead of resetting the connection. ==================== Signed-off-by: David S. Miller <[email protected]>
2022-05-18mptcp: Do TCP fallback on early DSS checksum failureMat Martineau2-4/+20
RFC 8684 section 3.7 describes several opportunities for a MPTCP connection to "fall back" to regular TCP early in the connection process, before it has been confirmed that MPTCP options can be successfully propagated on all SYN, SYN/ACK, and data packets. If a peer acknowledges the first received data packet with a regular TCP header (no MPTCP options), fallback is allowed. If the recipient of that first data packet finds a MPTCP DSS checksum error, this provides an opportunity to fail gracefully with a TCP fallback rather than resetting the connection (as might happen if a checksum failure were detected later). This commit modifies the checksum failure code to attempt fallback on the initial subflow of a MPTCP connection, only if it's a failure in the first data mapping. In cases where the peer initiates the connection, requests checksums, is the first to send data, and the peer is sending incorrect checksums (see https://github.com/multipath-tcp/mptcp_net-next/issues/275), this allows the connection to proceed as TCP rather than reset. Fixes: dd8bcd1768ff ("mptcp: validate the data checksum") Acked-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-05-18mptcp: fix checksum byte orderPaolo Abeni3-14/+26
The MPTCP code typecasts the checksum value to u16 and then converts it to big endian while storing the value into the MPTCP option. As a result, the wire encoding for little endian host is wrong, and that causes interoperabilty interoperability issues with other implementation or host with different endianness. Address the issue writing in the packet the unmodified __sum16 value. MPTCP checksum is disabled by default, interoperating with systems with bad mptcp-level csum encoding should cause fallback to TCP. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/275 Fixes: c5b39e26d003 ("mptcp: send out checksum for DSS") Fixes: 390b95a5fb84 ("mptcp: receive checksum for DSS") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>