Age | Commit message (Collapse) | Author | Files | Lines |
|
If the RBUF logic is not reset when the kernel starts then there
may be some data left over from any network boot loader. If the
64-byte packet headers are enabled then this can be fatal.
Extend bcmgenet_dma_disable to do perform the reset, but not when
called from bcmgenet_resume in order to preserve a wake packet.
N.B. This different handling of resume is just based on a hunch -
why else wouldn't one reset the RBUF as well as the TBUF? If this
isn't the case then it's easy to change the patch to make the RBUF
reset unconditional.
See: https://github.com/raspberrypi/linux/issues/3850
See: https://github.com/raspberrypi/firmware/issues/1882
Signed-off-by: Phil Elwell <[email protected]>
Signed-off-by: Maarten Vanraes <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In function pci1xxxx_spi_probe, there is a potential null pointer that
may be caused by a failed memory allocation by the function devm_kzalloc.
Hence, a null pointer check needs to be added to prevent null pointer
dereferencing later in the code.
To fix this issue, spi_bus->spi_int[iter] should be checked. The memory
allocated by devm_kzalloc will be automatically released, so just directly
return -ENOMEM without worrying about memory leaks.
Fixes: 1cc0cbea7167 ("spi: microchip: pci1xxxx: Add driver for SPI controller of PCI1XXXX PCIe switch")
Signed-off-by: Huai-Yuan Liu <[email protected]>
Link: https://msgid.link/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
|
|
devm_spi_alloc_controller will allocate an SPI controller and
automatically release a reference on it when dev is unbound from
its driver. It doesn't need to call spi_controller_put explicitly
to put the reference when lpspi driver failed initialization.
Fixes: 2ae0ab0143fc ("spi: lpspi: Avoid potential use-after-free in probe()")
Signed-off-by: Carlos Song <[email protected]>
Reviewed-by: Alexander Sverdlin <[email protected]>
Link: https://msgid.link/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
|
|
In rvu_map_cgx_lmac_pf() the 'iter', which is used as an array index, can reach
value (up to 14) that exceed the size (MAX_LMAC_COUNT = 8) of the array.
Fix this bug by adding 'iter' value check.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 91c6945ea1f9 ("octeontx2-af: cn10k: Add RPM MAC support")
Signed-off-by: Aleksandr Mishin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The MSR_PEBS_DATA_CFG MSR register is used to configure which data groups
should be generated into a PEBS record, and it's shared among all counters.
If there are different configurations among counters, perf combines all the
configurations.
The first perf command as below requires a complete PEBS record
(including memory info, GPRs, XMMs, and LBRs). The second perf command
only requires a basic group. However, after the second perf command is
running, the MSR_PEBS_DATA_CFG register is cleared. Only a basic group is
generated in a PEBS record, which is wrong. The required information
for the first perf command is missed.
$ perf record --intr-regs=AX,SP,XMM0 -a -C 8 -b -W -d -c 100000003 -o /dev/null -e cpu/event=0xd0,umask=0x81/upp &
$ sleep 5
$ perf record --per-thread -c 1 -e cycles:pp --no-timestamp --no-tid taskset -c 8 ./noploop 1000
The first PEBS event is a system-wide PEBS event. The second PEBS event
is a per-thread event. When the thread is scheduled out, the
intel_pmu_pebs_del() function is invoked to update the PEBS state.
Since the system-wide event is still available, the cpuc->n_pebs is 1.
The cpuc->pebs_data_cfg is cleared. The data configuration for the
system-wide PEBS event is lost.
The (cpuc->n_pebs == 1) check was introduced in commit:
b6a32f023fcc ("perf/x86: Fix PEBS threshold initialization")
At that time, it indeed didn't hurt whether the state was updated
during the removal, because only the threshold is updated.
The calculation of the threshold takes the last PEBS event into
account.
However, since commit:
b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
we delay the threshold update, and clear the PEBS data config, which triggers
the bug.
The PEBS data config update scope should not be shrunk during removal.
[ mingo: Improved the changelog & comments. ]
Fixes: b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
Reported-by: Stephane Eranian <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
Tony encountered this OOPS when the last CPU of a domain goes
offline while running a kernel built with CONFIG_NO_HZ_FULL:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
...
RIP: 0010:__find_nth_andnot_bit+0x66/0x110
...
Call Trace:
<TASK>
? __die()
? page_fault_oops()
? exc_page_fault()
? asm_exc_page_fault()
cpumask_any_housekeeping()
mbm_setup_overflow_handler()
resctrl_offline_cpu()
resctrl_arch_offline_cpu()
cpuhp_invoke_callback()
cpuhp_thread_fun()
smpboot_thread_fn()
kthread()
ret_from_fork()
ret_from_fork_asm()
</TASK>
The NULL pointer dereference is encountered while searching for another
online CPU in the domain (of which there are none) that can be used to
run the MBM overflow handler.
Because the kernel is configured with CONFIG_NO_HZ_FULL the search for
another CPU (in its effort to prefer those CPUs that aren't marked
nohz_full) consults the mask representing the nohz_full CPUs,
tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y
tick_nohz_full_mask is not allocated unless the kernel is booted with
the "nohz_full=" parameter and because of that any access to
tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled().
Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled().
The latter ensures tick_nohz_full_mask can be accessed safely and can be
used whether kernel is built with CONFIG_NO_HZ_FULL enabled or not.
[ Use Ingo's suggestion that combines the two NO_HZ checks into one. ]
Fixes: a4846aaf3945 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow")
Reported-by: Tony Luck <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com
Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux fix from Paul Moore:
"A single patch for SELinux to fix a problem where we could potentially
dereference an error pointer if we failed to successfully mount
selinuxfs"
* tag 'selinux-pr-20240402' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: avoid dereference of garbage after mount failure
|
|
Add myself as mlx5 core and EN maintainer.
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Acked-by: Saeed Mahameed <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
syzkaller reported infinite recursive calls of fib6_dump_done() during
netlink socket destruction. [1]
From the log, syzkaller sent an AF_UNSPEC RTM_GETROUTE message, and then
the response was generated. The following recvmmsg() resumed the dump
for IPv6, but the first call of inet6_dump_fib() failed at kzalloc() due
to the fault injection. [0]
12:01:34 executing program 3:
r0 = socket$nl_route(0x10, 0x3, 0x0)
sendmsg$nl_route(r0, ... snip ...)
recvmmsg(r0, ... snip ...) (fail_nth: 8)
Here, fib6_dump_done() was set to nlk_sk(sk)->cb.done, and the next call
of inet6_dump_fib() set it to nlk_sk(sk)->cb.args[3]. syzkaller stopped
receiving the response halfway through, and finally netlink_sock_destruct()
called nlk_sk(sk)->cb.done().
fib6_dump_done() calls fib6_dump_end() and nlk_sk(sk)->cb.done() if it
is still not NULL. fib6_dump_end() rewrites nlk_sk(sk)->cb.done() by
nlk_sk(sk)->cb.args[3], but it has the same function, not NULL, calling
itself recursively and hitting the stack guard page.
To avoid the issue, let's set the destructor after kzalloc().
[0]:
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 0
CPU: 1 PID: 432110 Comm: syz-executor.3 Not tainted 6.8.0-12821-g537c2e91d354-dirty #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:117)
should_fail_ex (lib/fault-inject.c:52 lib/fault-inject.c:153)
should_failslab (mm/slub.c:3733)
kmalloc_trace (mm/slub.c:3748 mm/slub.c:3827 mm/slub.c:3992)
inet6_dump_fib (./include/linux/slab.h:628 ./include/linux/slab.h:749 net/ipv6/ip6_fib.c:662)
rtnl_dump_all (net/core/rtnetlink.c:4029)
netlink_dump (net/netlink/af_netlink.c:2269)
netlink_recvmsg (net/netlink/af_netlink.c:1988)
____sys_recvmsg (net/socket.c:1046 net/socket.c:2801)
___sys_recvmsg (net/socket.c:2846)
do_recvmmsg (net/socket.c:2943)
__x64_sys_recvmmsg (net/socket.c:3041 net/socket.c:3034 net/socket.c:3034)
[1]:
BUG: TASK stack guard page was hit at 00000000f2fa9af1 (stack is 00000000b7912430..000000009a436beb)
stack guard page: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 223719 Comm: kworker/1:3 Not tainted 6.8.0-12821-g537c2e91d354-dirty #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events netlink_sock_destruct_work
RIP: 0010:fib6_dump_done (net/ipv6/ip6_fib.c:570)
Code: 3c 24 e8 f3 e9 51 fd e9 28 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 41 57 41 56 41 55 41 54 55 48 89 fd <53> 48 8d 5d 60 e8 b6 4d 07 fd 48 89 da 48 b8 00 00 00 00 00 fc ff
RSP: 0018:ffffc9000d980000 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffffffff84405990 RCX: ffffffff844059d3
RDX: ffff8881028e0000 RSI: ffffffff84405ac2 RDI: ffff88810c02f358
RBP: ffff88810c02f358 R08: 0000000000000007 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000224 R12: 0000000000000000
R13: ffff888007c82c78 R14: ffff888007c82c68 R15: ffff888007c82c68
FS: 0000000000000000(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc9000d97fff8 CR3: 0000000102309002 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<#DF>
</#DF>
<TASK>
fib6_dump_done (net/ipv6/ip6_fib.c:572 (discriminator 1))
fib6_dump_done (net/ipv6/ip6_fib.c:572 (discriminator 1))
...
fib6_dump_done (net/ipv6/ip6_fib.c:572 (discriminator 1))
fib6_dump_done (net/ipv6/ip6_fib.c:572 (discriminator 1))
netlink_sock_destruct (net/netlink/af_netlink.c:401)
__sk_destruct (net/core/sock.c:2177 (discriminator 2))
sk_destruct (net/core/sock.c:2224)
__sk_free (net/core/sock.c:2235)
sk_free (net/core/sock.c:2246)
process_one_work (kernel/workqueue.c:3259)
worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:256)
Modules linked in:
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzkaller <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
On some boards with this chip version the BIOS is buggy and misses
to reset the PHY page selector. This results in the PHY ID read
accessing registers on a different page, returning a more or
less random value. Fix this by resetting the page selector first.
Fixes: f1e911d5d0df ("r8169: add basic phylib support")
Cc: [email protected]
Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
If we look up the kbuf, ensure that it doesn't get unregistered until
after we're done with it. Since we're inside mmap, we cannot safely use
the io_uring lock. Rely on the fact that we can lookup the buffer list
under RCU now and grab a reference to it, preventing it from being
unregistered until we're done with it. The lookup returns the
io_buffer_list directly with it referenced.
Cc: [email protected] # v6.4+
Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU")
Signed-off-by: Jens Axboe <[email protected]>
|
|
No functional changes in this patch, just in preparation for being able
to keep the buffer list alive outside of the ctx->uring_lock.
Cc: [email protected] # v6.4+
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now that xarray is being exclusively used for the buffer_list lookup,
this check is no longer needed. Get rid of it and the is_ready member.
Cc: [email protected] # v6.4+
Signed-off-by: Jens Axboe <[email protected]>
|
|
Just rely on the xarray for any kind of bgid. This simplifies things, and
it really doesn't bring us much, if anything.
Cc: [email protected] # v6.4+
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit 82dfb540aeb2 ("VSOCK: Add virtio vsock vsockmon hooks") added
virtio_transport_deliver_tap_pkt() for handing packets to the
vsockmon device. However, in virtio_transport_send_pkt_work(),
the function is called before actually sending the packet (i.e.
before placing it in the virtqueue with virtqueue_add_sgs() and checking
whether it returned successfully).
Queuing the packet in the virtqueue can fail even multiple times.
However, in virtio_transport_deliver_tap_pkt() we deliver the packet
to the monitoring tap interface only the first time we call it.
This certainly avoids seeing the same packet replicated multiple times
in the monitoring interface, but it can show the packet sent with the
wrong timestamp or even before we succeed to queue it in the virtqueue.
Move virtio_transport_deliver_tap_pkt() after calling virtqueue_add_sgs()
and making sure it returned successfully.
Fixes: 82dfb540aeb2 ("VSOCK: Add virtio vsock vsockmon hooks")
Cc: [email protected]
Signed-off-by: Marco Pinna <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
When the ax25 device is detaching, the ax25_dev_device_down()
calls ax25_ds_del_timer() to cleanup the slave_timer. When
the timer handler is running, the ax25_ds_del_timer() that
calls del_timer() in it will return directly. As a result,
the use-after-free bugs could happen, one of the scenarios
is shown below:
(Thread 1) | (Thread 2)
| ax25_ds_timeout()
ax25_dev_device_down() |
ax25_ds_del_timer() |
del_timer() |
ax25_dev_put() //FREE |
| ax25_dev-> //USE
In order to mitigate bugs, when the device is detaching, use
timer_shutdown_sync() to stop the timer.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Duoming Zhou <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Since commit 63b0cd30b78e ("media: ov2680: Add bus-cfg / endpoint
property verification") the ov2680 no longer probes on a imx7s-warp7:
ov2680 1-0036: error -EINVAL: supported link freq 330000000 not found
ov2680 1-0036: probe with driver ov2680 failed with error -22
Fix it by passing the required 'link-frequencies' property as
recommended by:
https://www.kernel.org/doc/html/v6.9-rc1/driver-api/media/camera-sensor.html#handling-clocks
Cc: [email protected]
Fixes: 63b0cd30b78e ("media: ov2680: Add bus-cfg / endpoint property verification")
Signed-off-by: Fabio Estevam <[email protected]>
Signed-off-by: Shawn Guo <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
In the discard worker, we were failing to validate the bucket state -
meaning a corrupt needs_discard btree could cause us to discard a bucket
that we shouldn't.
If check_alloc_info hasn't run yet we just want to bail out, otherwise
it's a filesystem inconsistent error.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Print out the mode as a string, and also print out the btree and
watermark.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Merge series from Zhang Yi <[email protected]>:
We solved some issues related to headphone detection.And for using
the same configuration in different power conditions,we modified the
clock table
|
|
Pull documentation fixes from Jonathan Corbet:
"Four small documentation fixes"
* tag 'docs-6.9-fixes' of git://git.lwn.net/linux:
docs: zswap: fix shell command format
tracing: Fix documentation on tp_printk cmdline option
docs: Fix bitfield handling in kernel-doc
Documentation: dev-tools: Add link to RV docs
|
|
Some laptops where the thermal control is handled by the EC may
provide trip points that fail the kernels new validation, but still have
working temperature sensors. An example of this is the Framework 13 AMD.
This patch allows the thermal zone to still be registered without trip
points if the trip points fail validation, allowing the temperature
sensor to be viewed and used by the user.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218586
Fixes: 9c8647224e9f ("ACPI: thermal: Use library functions to obtain trip point temperature values")
Signed-off-by: Stephen Horvath <[email protected]>
[ rjw: Subject edits, remove redundant braces ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Pull bcachefs fixes from Kent Overstreet:
"Lots of fixes for situations with extreme filesystem damage.
One fix ("Fix journal pins in btree write buffer") applicable to
normal usage; also a dio performance fix.
New repair/construction code is in the final stages, should be ready
in about a week. Anyone that lost btree interior nodes (or a variety
of other damage) as a result of the splitbrain bug will be able to
repair then"
* tag 'bcachefs-2024-04-01' of https://evilpiepirate.org/git/bcachefs: (32 commits)
bcachefs: On emergency shutdown, print out current journal sequence number
bcachefs: Fix overlapping extent repair
bcachefs: Fix remove_dirent()
bcachefs: Logged op errors should be ignored
bcachefs: Improve -o norecovery; opts.recovery_pass_limit
bcachefs: bch2_run_explicit_recovery_pass_persistent()
bcachefs: Ensure bch_sb_field_ext always exists
bcachefs: Flush journal immediately after replay if we did early repair
bcachefs: Resume logged ops after fsck
bcachefs: Add error messages to logged ops fns
bcachefs: Split out recovery_passes.c
bcachefs: fix backpointer for missing alloc key msg
bcachefs: Fix bch2_btree_increase_depth()
bcachefs: Kill bch2_bkey_ptr_data_type()
bcachefs: Fix use after free in check_root_trans()
bcachefs: Fix repair path for missing indirect extents
bcachefs: Fix use after free in bch2_check_fix_ptrs()
bcachefs: Fix btree node keys accounting in topology repair path
bcachefs: Check btree ptr min_key in .invalid
bcachefs: add REQ_SYNC and REQ_IDLE in write dio
...
|
|
mean_and_variance_test_2 and mean_and_variance_test_4 always fail.
The input parameters to those tests are identical to the input parameters
to tests 1 and 3, yet the expected result for tests 2 and 4 is different
for the mean and stddev tests. That will always fail.
Expected mean_and_variance_get_mean(mv) == mean[i], but
mean_and_variance_get_mean(mv) == 22 (0x16)
mean[i] == 10 (0xa)
Drop the bad tests.
Fixes: 65bc41090720 ("mean and variance: More tests")
Closes: https://lore.kernel.org/lkml/[email protected]/
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
It is possible to clear a root's IN_TRANS tag from the radix tree, but
not clear its PERTRANS, if there is some error in between. Eliminate
that possibility by moving the free up to where we clear the tag.
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around. This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.
Helps with the pass rate of generic/269 and generic/475.
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
The transaction is only able to free PERTRANS reservations for a root
once that root has been recorded with the TRANS tag on the roots radix
tree. Therefore, until we are sure that this root will get tagged, it
isn't safe to convert. Generally, this is not an issue as *some*
transaction will likely tag the root before long and this reservation
will get freed in that transaction, but technically it could stick
around until unmount and result in a warning about leaked metadata
reservation space.
This path is most exercised by running the generic/269 fstest with
CONFIG_BTRFS_DEBUG.
Fixes: a6496849671a ("btrfs: fix start transaction qgroup rsv double free")
CC: [email protected] # 6.6+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When running delayed inode updates, we do not record the inode's root in
the transaction, but we do allocate PREALLOC and thus converted PERTRANS
space for it. To be sure we free that PERTRANS meta rsv, we must ensure
that we record the root in the transaction.
Fixes: 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
CC: [email protected] # 6.1+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.
However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.
The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:
1. any failure before record_root_in_trans is called successfully
results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
transaction owns freeing the reservation.
This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.
Fixes: e85fde5162bf ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: [email protected] # 6.1+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
pertrans reservations for subvolumes when quotas are enabled. The
convert function does not properly increment pertrans after decrementing
prealloc, so the count is not accurate.
Note: we check that the fs is not read-only to mirror the logic in
qgroup_convert_meta, which checks that before adding to the pertrans rsv.
Fixes: 8287475a2055 ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
CC: [email protected] # 6.1+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Signed-off-by: Patryk Wlazlyn <[email protected]>
Signed-off-by: Len Brown <[email protected]>
|
|
If user request --no-msr or is not able to access the MSRs,
turbostat should clear all the counters added with --add.
Because MSR access permission checks are done after the cmdline is
parsed, the decision has to be defered up until the transition into
no-msr mode happen.
Signed-off-by: Len Brown <[email protected]>
|
|
Checking early if the permissions are even needed gets rid of the
warnings about some of them missing. Earlier we issued a warning in case
of missing MSR and/or perf permissions, even when user never asked for
counters that require those.
Signed-off-by: Patryk Wlazlyn <[email protected]>
Signed-off-by: Len Brown <[email protected]>
|
|
To allow unprivileged user to run turbostat seamlessly.
Signed-off-by: Patryk Wlazlyn <[email protected]>
Signed-off-by: Len Brown <[email protected]>
|
|
By using the perf API we spend less time in between the reads of the
counters, resulting in more accurate calculations of the dependent
metrics.
Using perf API is also usually faster overall, although cache miss, if
we get one, is more costly when using perf vs MSR driver.
We would fallback to the msr reads if the sysfs isn't there or when in
--no-perf mode.
Signed-off-by: Patryk Wlazlyn <[email protected]>
Signed-off-by: Len Brown <[email protected]>
|
|
Add the --no-perf option to allow users to run turbostat without
accessing perf.
Signed-off-by: Patryk Wlazlyn <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Len Brown <[email protected]>
|
|
Add --no-msr option to allow users to run turbostat without
accessing MSRs via the MSR driver.
Signed-off-by: Patryk Wlazlyn <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Len Brown <[email protected]>
|
|
Eliminate redundant debug output for core and package scope counters.
Include name and path for all "ADDED" counters.
Signed-off-by: Len Brown <[email protected]>
|
|
Previously a failed read of /dev/cpu_dma_latency erroneously complained
turbostat: capget(CAP_SYS_ADMIN) failed, try "# setcap cap_sys_admin=ep ./turbostat
This went unnoticed because this file is typically visible to root,
and turbostat was typically run as root.
Going forward, when a non-root user can run turbostat...
Complain about failed read access to this file only if --debug is used.
Signed-off-by: Len Brown <[email protected]>
|
|
If MSRs cannot be read, values can be obtained from cpuid.
Signed-off-by: Patryk Wlazlyn <[email protected]>
Signed-off-by: Len Brown <[email protected]>
|
|
HEAD
KVM/riscv fixes for 6.9, take #1
- Fix spelling mistake in arch_timer selftest
- Remove redundant semicolon in num_isa_ext_regs()
- Fix APLIC setipnum_le/be write emulation
- Fix APLIC in_clrip[x] read emulation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.9, part #1
- Ensure perf events programmed to count during guest execution
are actually enabled before entering the guest in the nVHE
configuration.
- Restore out-of-range handler for stage-2 translation faults.
- Several fixes to stage-2 TLB invalidations to avoid stale
translations, possibly including partial walk caches.
- Fix early handling of architectural VHE-only systems to ensure E2H is
appropriately set.
- Correct a format specifier warning in the arch_timer selftest.
- Make the KVM banner message correctly handle all of the possible
configurations.
|
|
nvme_update_zone_info does (admin queue) I/O to the device and can fail.
We fail to abort the queue limits update if that happen, but really
should avoid with the frozen I/O queue as much as possible anyway.
Split the logic into a helper to query the information that can be
called on an unfrozen queue and one to apply it to the queue limits.
Fixes: 9b130d681443 ("nvme: use the atomic queue limits update API")
Reported-by: Kanchan Joshi <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Serialise cifs_construct_tcon() with cifs_mount_mutex to handle
parallel mounts that may end up reusing the session and tcon created
by it.
Cc: [email protected] # 6.4+
Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
The tcons created by cifs_construct_tcon() on multiuser mounts must
also be able to failover and refresh DFS referrals, so set the
appropriate fields in order to get a full DFS tcon. They could be
shared among different superblocks later, too.
Cc: [email protected] # 6.4+
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Avoid refreshing DFS referral with refpath_lock acquired as the I/O
could block for a while due to a potentially disconnected or slow DFS
root server and then making other threads - that use same @server and
don't require a DFS root server - unable to make any progress.
Cc: [email protected] # 6.4+
Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Avoid potential use-after-free bugs when walking DFS referrals,
mounting and performing DFS failover by ensuring that all children
from parent @tcon->ses are also refcounted. They're all needed across
the entire DFS mount. Get rid of @tcon->dfs_ses_list while we're at
it, too.
Cc: [email protected] # 6.4+
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Sometimes the readout of /sys/kernel/debug/swiotlb/io_tlb_used and
io_tlb_used_hiwater can be a huge number (e.g. 18446744073709551615),
which is actually a negative number if we use "%ld" to print the number.
When swiotlb_create_default_debugfs() is running from late_initcall,
mem->total_used may already be non-zero, because the storage driver
may have already started to perform I/O operations: if the storage
driver is built-in, its probe() callback is called before late_initcall.
swiotlb_create_debugfs_files() should not blindly set mem->total_used
and mem->used_hiwater to 0; actually it doesn't have to initialize the
fields at all, because the fields, as part of the global struct
io_tlb_default_mem, have been implicitly initialized to zero.
Also don't explicitly set mem->transient_nslabs to 0.
Fixes: 8b0977ecc8b3 ("swiotlb: track and report io_tlb_used high water marks in debugfs")
Fixes: 02e765697038 ("swiotlb: add debugfs to track swiotlb transient pool usage")
Signed-off-by: Dexuan Cui <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Reviewed-by: ZhangPeng <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
In current code, swiotlb_bounce() may do partial sync's correctly in
some circumstances, but may incorrectly fail in other circumstances.
The failure cases require both of these to be true:
1) swiotlb_align_offset() returns a non-zero "offset" value
2) the tlb_addr of the partial sync area points into the first
"offset" bytes of the _second_ or subsequent swiotlb slot allocated
for the mapping
Code added in commit 868c9ddc182b ("swiotlb: add overflow checks
to swiotlb_bounce") attempts to WARN on the invalid case where
tlb_addr points into the first "offset" bytes of the _first_
allocated slot. But there's no way for swiotlb_bounce() to distinguish
the first slot from the second and subsequent slots, so the WARN
can be triggered incorrectly when #2 above is true.
Related, current code calculates an adjustment to the orig_addr stored
in the swiotlb slot. The adjustment compensates for the difference
in the tlb_addr used for the partial sync vs. the tlb_addr for the full
mapping. The adjustment is stored in the local variable tlb_offset.
But when #1 and #2 above are true, it's valid for this adjustment to
be negative. In such case the arithmetic to adjust orig_addr produces
the wrong result due to tlb_offset being declared as unsigned.
Fix these problems by removing the over-constraining validations added
in 868c9ddc182b. Change the declaration of tlb_offset to be signed
instead of unsigned so the adjustment arithmetic works correctly.
Tested with a test-only hack to how swiotlb_tbl_map_single() calls
swiotlb_bounce(). Instead of calling swiotlb_bounce() just once
for the entire mapped area, do a loop with each iteration doing
only a 128 byte partial sync until the entire mapped area is
sync'ed. Then with swiotlb=force on the kernel boot line, run a
variety of raw disk writes followed by read and verification of
all bytes of the written data. The storage device has DMA
min_align_mask set, and the writes are done with a variety of
original buffer memory address alignments and overall buffer
sizes. For many of the combinations, current code triggers the
WARN statements, or the data verification fails. With the fixes,
no WARNs occur and all verifications pass.
Fixes: 5f89468e2f06 ("swiotlb: manipulate orig_addr when tlb_addr has offset")
Fixes: 868c9ddc182b ("swiotlb: add overflow checks to swiotlb_bounce")
Signed-off-by: Michael Kelley <[email protected]>
Dominique Martinet <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|