Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull vfs fget updates from Al Viro:
"fget() to fdget() conversions"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse_dev_ioctl(): switch to fdget()
cgroup_get_from_fd(): switch to fdget_raw()
bpf: switch to fdget_raw()
build_mount_idmapped(): switch to fdget()
kill the last remaining user of proc_ns_fget()
SVM-SEV: convert the rest of fget() uses to fdget() in there
convert sgx_set_attribute() to fdget()/fdput()
convert setns(2) to fdget()/fdput()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner:
"This adds a new pidfd_prepare() helper which allows the caller to
reserve a pidfd number and allocates a new pidfd file that stashes the
provided struct pid.
It should be avoided installing a file descriptor into a task's file
descriptor table just to close it again via close_fd() in case an
error occurs. The fd has been visible to userspace and might already
be in use. Instead, a file descriptor should be reserved but not
installed into the caller's file descriptor table.
If another failure path is hit then the reserved file descriptor and
file can just be put without any userspace visible side-effects. And
if all failure paths are cleared the file descriptor and file can be
installed into the task's file descriptor table.
This helper is now used in all places that open coded this
functionality before. For example, this is currently done during
copy_process() and fanotify used pidfd_create(), which returns a pidfd
that has already been made visibile in the caller's file descriptor
table, but then closed it using close_fd().
In one of the next merge windows there is also new functionality
coming to unix domain sockets that will have to rely on
pidfd_prepare()"
* tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fanotify: use pidfd_prepare()
fork: use pidfd_prepare()
pid: add pidfd_prepare()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull user work thread updates from Christian Brauner:
"This contains the work generalizing the ability to create a kernel
worker from a userspace process.
Such user workers will run with the same credentials as the userspace
process they were created from providing stronger security and
accounting guarantees than the traditional override_creds() approach
ever could've hoped for.
The original work was heavily based and optimzed for the needs of
io_uring which was the first user. However, as it quickly turned out
the ability to create user workers inherting properties from a
userspace process is generally useful.
The vhost subsystem currently creates workers using the kthread api.
The consequences of using the kthread api are that RLIMITs don't work
correctly as they are inherited from khtreadd. This leads to bugs
where more workers are created than would be allowed by the RLIMITs of
the userspace process in lieu of which workers are created.
Problems like this disappear with user workers created from the
userspace processes for which they perform the work. In addition,
providing this api allows vhost to remove additional complexity. For
example, cgroup and mm sharing will just work out of the box with user
workers based on the relevant userspace process instead of manually
ensuring the correct cgroup and mm contexts are used.
So the vhost subsystem should simply be made to use the same mechanism
as io_uring. To this end the original mechanism used for
create_io_thread() is generalized into user workers:
- Introduce PF_USER_WORKER as a generic indicator that a given task
is a user worker, i.e., a kernel task that was created from a
userspace process. Now a PF_IO_WORKER thread is just a specialized
version of PF_USER_WORKER. So io_uring io workers raise both flags.
- Make copy_process() available to core kernel code
- Extend struct kernel_clone_args with the following bitfields
allowing to indicate to copy_process():
- to create a user worker (raise PF_USER_WORKER)
- to not inherit any files from the userspace process
- to ignore signals
After all generic changes are in place the vhost subsystem implements
a new dedicated vhost api based on user workers. Finally, vhost is
switched to rely on the new api moving it off of kthreads.
Thanks to Mike for sticking it out and making it through this rather
arduous journey"
* tag 'v6.4/kernel.user_worker' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
vhost: use vhost_tasks for worker threads
vhost: move worker thread fields to new struct
vhost_task: Allow vhost layer to use copy_process
fork: allow kernel code to call copy_process
fork: Add kernel_clone_args flag to ignore signals
fork: add kernel_clone_args flag to not dup/clone files
fork/vm: Move common PF_IO_WORKER behavior to new flag
kernel: Make io_thread and kthread bit fields
kthread: Pass in the thread's name during creation
kernel: Allow a kernel thread's name to be set in copy_process
csky: Remove kernel_thread declaration
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure they know what
they're asking for by being explicit:
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull locktorture updates from Paul McKenney:
"This adds tests for nested locking and also adds support for testing
raw spinlocks in PREEMPT_RT kernels"
* tag 'locktorture.2023.04.04a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
locktorture: Add raw_spinlock* torture tests for PREEMPT_RT kernels
locktorture: With nested locks, occasionally skip main lock
locktorture: Add nested locking to rtmutex torture tests
locktorture: Add nested locking to mutex torture tests
locktorture: Add nested_[un]lock() hooks and nlocks parameter
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull KCSAN updates from Paul McKenney:
"Kernel concurrency sanitizer (KCSAN) updates for v6.4
This fixes kernel-doc warnings and also updates instrumentation from
READ_ONCE() to volatile in order to avoid unaligned load-acquire
instructions on arm64 in kernels built with LTO"
* tag 'kcsan.2023.04.04a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
kcsan: Avoid READ_ONCE() in read_instrumented_memory()
instrumented.h: Fix all kernel-doc format warnings
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and bpf.
There are a few fixes for new code bugs, including the Mellanox one
noted in the last networking pull. No known regressions outstanding.
Current release - regressions:
- sched: clear actions pointer in miss cookie init fail
- mptcp: fix accept vs worker race
- bpf: fix bpf_arch_text_poke() with new_addr == NULL on s390
- eth: bnxt_en: fix a possible NULL pointer dereference in unload
path
- eth: veth: take into account peer device for
NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
Current release - new code bugs:
- eth: revert "net/mlx5: Enable management PF initialization"
Previous releases - regressions:
- netfilter: fix recent physdev match breakage
- bpf: fix incorrect verifier pruning due to missing register
precision taints
- eth: virtio_net: fix overflow inside xdp_linearize_page()
- eth: cxgb4: fix use after free bugs caused by circular dependency
problem
- eth: mlxsw: pci: fix possible crash during initialization
Previous releases - always broken:
- sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg
- netfilter: validate catch-all set elements
- bridge: don't notify FDB entries with "master dynamic"
- eth: bonding: fix memory leak when changing bond type to ethernet
- eth: i40e: fix accessing vsi->active_filters without holding lock
Misc:
- Mat is back as MPTCP co-maintainer"
* tag 'net-6.3-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits)
net: bridge: switchdev: don't notify FDB entries with "master dynamic"
Revert "net/mlx5: Enable management PF initialization"
MAINTAINERS: Resume MPTCP co-maintainer role
mailmap: add entries for Mat Martineau
e1000e: Disable TSO on i219-LM card to increase speed
bnxt_en: fix free-runnig PHC mode
net: dsa: microchip: ksz8795: Correctly handle huge frame configuration
bpf: Fix incorrect verifier pruning due to missing register precision taints
hamradio: drop ISA_DMA_API dependency
mlxsw: pci: Fix possible crash during initialization
mptcp: fix accept vs worker race
mptcp: stops worker on unaccepted sockets at listener close
net: rpl: fix rpl header size calculation
net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete()
bonding: Fix memory leak when changing bond type to Ethernet
veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
mlxfw: fix null-ptr-deref in mlxfw_mfa2_tlv_next()
bnxt_en: Fix a possible NULL pointer dereference in unload path
bnxt_en: Do not initialize PTP on older P3/P4 chips
netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"22 hotfixes.
19 are cc:stable and the remainder address issues which were
introduced during this merge cycle, or aren't considered suitable for
-stable backporting.
19 are for MM and the remainder are for other subsystems"
* tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
nilfs2: initialize unused bytes in segment summary blocks
mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
mm/mmap: regression fix for unmapped_area{_topdown}
maple_tree: fix mas_empty_area() search
maple_tree: make maple state reusable after mas_empty_area_rev()
mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
tools/Makefile: do missed s/vm/mm/
mm: fix memory leak on mm_init error handling
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Revert "userfaultfd: don't fail on unrecognized features"
writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug
tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used
mailmap: update jtoppins' entry to reference correct email
mm/mempolicy: fix use-after-free of VMA iterator
mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
mm/mprotect: fix do_mprotect_pkey() return on error
mm/khugepaged: check again on anon uffd-wp during isolation
...
|
|
Juan Jose et al reported an issue found via fuzzing where the verifier's
pruning logic prematurely marks a program path as safe.
Consider the following program:
0: (b7) r6 = 1024
1: (b7) r7 = 0
2: (b7) r8 = 0
3: (b7) r9 = -2147483648
4: (97) r6 %= 1025
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2
7: (97) r6 %= 1
8: (b7) r9 = 0
9: (bd) if r6 <= r9 goto pc+1
10: (b7) r6 = 0
11: (b7) r0 = 0
12: (63) *(u32 *)(r10 -4) = r0
13: (18) r4 = 0xffff888103693400 // map_ptr(ks=4,vs=48)
15: (bf) r1 = r4
16: (bf) r2 = r10
17: (07) r2 += -4
18: (85) call bpf_map_lookup_elem#1
19: (55) if r0 != 0x0 goto pc+1
20: (95) exit
21: (77) r6 >>= 10
22: (27) r6 *= 8192
23: (bf) r1 = r0
24: (0f) r0 += r6
25: (79) r3 = *(u64 *)(r0 +0)
26: (7b) *(u64 *)(r1 +0) = r3
27: (95) exit
The verifier treats this as safe, leading to oob read/write access due
to an incorrect verifier conclusion:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b7) r6 = 1024 ; R6_w=1024
1: (b7) r7 = 0 ; R7_w=0
2: (b7) r8 = 0 ; R8_w=0
3: (b7) r9 = -2147483648 ; R9_w=-2147483648
4: (97) r6 %= 1025 ; R6_w=scalar()
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff00000000; 0xffffffff)) R9_w=-2147483648
7: (97) r6 %= 1 ; R6_w=scalar()
8: (b7) r9 = 0 ; R9=0
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
10: (b7) r6 = 0 ; R6_w=0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 9
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=0
22: (27) r6 *= 8192 ; R6_w=0
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 19
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
last_idx 18 first_idx 9
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
regs=40 stack=0 before 10: (b7) r6 = 0
25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
27: (95) exit
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1
frame 0: propagating r6
last_idx 19 first_idx 11
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
last_idx 9 first_idx 9
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=0 R10=fp0
last_idx 8 first_idx 0
regs=40 stack=0 before 8: (b7) r9 = 0
regs=40 stack=0 before 7: (97) r6 %= 1
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=40 stack=0 before 5: (05) goto pc+0
regs=40 stack=0 before 4: (97) r6 %= 1025
regs=40 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
19: safe
frame 0: propagating r6
last_idx 9 first_idx 0
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=40 stack=0 before 5: (05) goto pc+0
regs=40 stack=0 before 4: (97) r6 %= 1025
regs=40 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
from 6 to 9: safe
verification time 110 usec
stack depth 4
processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2
The verifier considers this program as safe by mistakenly pruning unsafe
code paths. In the above func#0, code lines 0-10 are of interest. In line
0-3 registers r6 to r9 are initialized with known scalar values. In line 4
the register r6 is reset to an unknown scalar given the verifier does not
track modulo operations. Due to this, the verifier can also not determine
precisely which branches in line 6 and 9 are taken, therefore it needs to
explore them both.
As can be seen, the verifier starts with exploring the false/fall-through
paths first. The 'from 19 to 21' path has both r6=0 and r9=0 and the pointer
arithmetic on r0 += r6 is therefore considered safe. Given the arithmetic,
r6 is correctly marked for precision tracking where backtracking kicks in
where it walks back the current path all the way where r6 was set to 0 in
the fall-through branch.
Next, the pruning logics pops the path 'from 9 to 11' from the stack. Also
here, the state of the registers is the same, that is, r6=0 and r9=0, so
that at line 19 the path can be pruned as it is considered safe. It is
interesting to note that the conditional in line 9 turned r6 into a more
precise state, that is, in the fall-through path at the beginning of line
10, it is R6=scalar(umin=1), and in the branch-taken path (which is analyzed
here) at the beginning of line 11, r6 turned into a known const r6=0 as
r9=0 prior to that and therefore (unsigned) r6 <= 0 concludes that r6 must
be 0 (**):
[...] ; R6_w=scalar()
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
[...]
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
[...]
The next path is 'from 6 to 9'. The verifier considers the old and current
state equivalent, and therefore prunes the search incorrectly. Looking into
the two states which are being compared by the pruning logic at line 9, the
old state consists of R6_rwD=Pscalar() R9_rwD=0 R10=fp0 and the new state
consists of R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968)
R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0. While r6 had the reg->precise flag
correctly set in the old state, r9 did not. Both r6'es are considered as
equivalent given the old one is a superset of the current, more precise one,
however, r9's actual values (0 vs 0x80000000) mismatch. Given the old r9
did not have reg->precise flag set, the verifier does not consider the
register as contributing to the precision state of r6, and therefore it
considered both r9 states as equivalent. However, for this specific pruned
path (which is also the actual path taken at runtime), register r6 will be
0x400 and r9 0x80000000 when reaching line 21, thus oob-accessing the map.
The purpose of precision tracking is to initially mark registers (including
spilled ones) as imprecise to help verifier's pruning logic finding equivalent
states it can then prune if they don't contribute to the program's safety
aspects. For example, if registers are used for pointer arithmetic or to pass
constant length to a helper, then the verifier sets reg->precise flag and
backtracks the BPF program instruction sequence and chain of verifier states
to ensure that the given register or stack slot including their dependencies
are marked as precisely tracked scalar. This also includes any other registers
and slots that contribute to a tracked state of given registers/stack slot.
This backtracking relies on recorded jmp_history and is able to traverse
entire chain of parent states. This process ends only when all the necessary
registers/slots and their transitive dependencies are marked as precise.
The backtrack_insn() is called from the current instruction up to the first
instruction, and its purpose is to compute a bitmask of registers and stack
slots that need precision tracking in the parent's verifier state. For example,
if a current instruction is r6 = r7, then r6 needs precision after this
instruction and r7 needs precision before this instruction, that is, in the
parent state. Hence for the latter r7 is marked and r6 unmarked.
For the class of jmp/jmp32 instructions, backtrack_insn() today only looks
at call and exit instructions and for all other conditionals the masks
remain as-is. However, in the given situation register r6 has a dependency
on r9 (as described above in **), so also that one needs to be marked for
precision tracking. In other words, if an imprecise register influences a
precise one, then the imprecise register should also be marked precise.
Meaning, in the parent state both dest and src register need to be tracked
for precision and therefore the marking must be more conservative by setting
reg->precise flag for both. The precision propagation needs to cover both
for the conditional: if the src reg was marked but not the dst reg and vice
versa.
After the fix the program is correctly rejected:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b7) r6 = 1024 ; R6_w=1024
1: (b7) r7 = 0 ; R7_w=0
2: (b7) r8 = 0 ; R8_w=0
3: (b7) r9 = -2147483648 ; R9_w=-2147483648
4: (97) r6 %= 1025 ; R6_w=scalar()
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff80000000; 0x7fffffff),u32_min=-2147483648) R9_w=-2147483648
7: (97) r6 %= 1 ; R6_w=scalar()
8: (b7) r9 = 0 ; R9=0
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
10: (b7) r6 = 0 ; R6_w=0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 9
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=0
22: (27) r6 *= 8192 ; R6_w=0
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 19
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
last_idx 18 first_idx 9
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
regs=40 stack=0 before 10: (b7) r6 = 0
25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
27: (95) exit
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1
frame 0: propagating r6
last_idx 19 first_idx 11
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
last_idx 9 first_idx 9
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
parent didn't have regs=240 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=P0 R10=fp0
last_idx 8 first_idx 0
regs=240 stack=0 before 8: (b7) r9 = 0
regs=40 stack=0 before 7: (97) r6 %= 1
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
19: safe
from 6 to 9: R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
9: (bd) if r6 <= r9 goto pc+1
last_idx 9 first_idx 0
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
last_idx 9 first_idx 0
regs=200 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
11: R6=scalar(umax=18446744071562067968) R9=-2147483648
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=3,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0_w=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=scalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=scalar(umax=18014398507384832,var_off=(0x0; 0x3fffffffffffff))
22: (27) r6 *= 8192 ; R6_w=scalar(smax=9223372036854767616,umax=18446744073709543424,var_off=(0x0; 0xffffffffffffe000),s32_max=2147475456,u32_max=-8192)
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 21
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
parent didn't have regs=40 stack=0 marks: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_r=Pscalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
last_idx 19 first_idx 11
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
last_idx 9 first_idx 0
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
regs=240 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
math between map_value pointer and register with unbounded min value is not allowed
verification time 886 usec
stack depth 4
processed 49 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 2
Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking")
Reported-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com>
Reported-by: Meador Inge <meadori@google.com>
Reported-by: Simon Scannell <simonscannell@google.com>
Reported-by: Nenad Stojanovski <thenenadx@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com>
Reviewed-by: Meador Inge <meadori@google.com>
Reviewed-by: Simon Scannell <simonscannell@google.com>
|
|
commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
introduces a memory leak by missing a call to destroy_context() when a
percpu_counter fails to allocate.
Before introducing the per-cpu counter allocations, init_new_context() was
the last call that could fail in mm_init(), and thus there was no need to
ever invoke destroy_context() in the error paths. Adding the following
percpu counter allocations adds error paths after init_new_context(),
which means its associated destroy_context() needs to be called when
percpu counters fail to allocate.
Link: https://lkml.kernel.org/r/20230330133822.66271-1-mathieu.desnoyers@efficios.com
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.
The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not. It means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if is doing an operation for
which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false.
While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Do not pull tasks to the local scheduling group if its average load
is higher than the average system load
* tag 'sched_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix imbalance overflow
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"This is a relatively big pull request this late in the cycle but the
major contributor is the cpuset bug which is rather significant:
- Fix several cpuset bugs including one where it wasn't applying the
target cgroup when tasks are created with CLONE_INTO_CGROUP
With a few smaller fixes:
- Fix inversed locking order in cgroup1 freezer implementation
- Fix garbage cpu.stat::core_sched.forceidle_usec reporting in the
root cgroup"
* tag 'cgroup-for-6.3-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset
cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex
cgroup/cpuset: Fix partition root's cpuset.cpus update bug
cgroup: fix display of forceidle time at root
|
|
It is found that attaching a task to the top_cpuset does not currently
ignore CPUs allocated to subpartitions in cpuset_attach_task(). So the
code is changed to fix that.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In the case of CLONE_INTO_CGROUP, not all cpusets are ready to accept
new tasks. It is too late to check that in cpuset_fork(). So we need
to add the cpuset_can_fork() and cpuset_cancel_fork() methods to
pre-check it before we can allow attachment to a different cpuset.
We also need to set the attach_in_progress flag to alert other code
that a new task is going to be added to the cpuset.
Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
By default, the clone(2) syscall spawn a child process into the same
cgroup as its parent. With the use of the CLONE_INTO_CGROUP flag
introduced by commit ef2c41cf38a7 ("clone3: allow spawning processes
into cgroups"), the child will be spawned into a different cgroup which
is somewhat similar to writing the child's tid into "cgroup.threads".
The current cpuset_fork() method does not properly handle the
CLONE_INTO_CGROUP case where the cpuset of the child may be different
from that of its parent. Update the cpuset_fork() method to treat the
CLONE_INTO_CGROUP case similar to cpuset_attach().
Since the newly cloned task has not been running yet, its actual
memory usage isn't known. So it is not necessary to make change to mm
in cpuset_fork().
Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
After a successful cpuset_can_attach() call which increments the
attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach()
will be called later. In cpuset_attach(), tasks in cpuset_attach_wq,
if present, will be woken up at the end. That is not the case in
cpuset_cancel_attach(). So missed wakeup is possible if the attach
operation is somehow cancelled. Fix that by doing the wakeup in
cpuset_cancel_attach() as well.
Fixes: e44193d39e8d ("cpuset: let hotplug propagation work wait for task attaching")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
syzbot is reporting circular locking dependency between cpu_hotplug_lock
and freezer_mutex, for commit f5d39b020809 ("freezer,sched: Rewrite core
freezer logic") replaced atomic_inc() in freezer_apply_state() with
static_branch_inc() which holds cpu_hotplug_lock.
cpu_hotplug_lock => cgroup_threadgroup_rwsem => freezer_mutex
cgroup_file_write() {
cgroup_procs_write() {
__cgroup_procs_write() {
cgroup_procs_write_start() {
cgroup_attach_lock() {
cpus_read_lock() {
percpu_down_read(&cpu_hotplug_lock);
}
percpu_down_write(&cgroup_threadgroup_rwsem);
}
}
cgroup_attach_task() {
cgroup_migrate() {
cgroup_migrate_execute() {
freezer_attach() {
mutex_lock(&freezer_mutex);
(...snipped...)
}
}
}
}
(...snipped...)
}
}
}
freezer_mutex => cpu_hotplug_lock
cgroup_file_write() {
freezer_write() {
freezer_change_state() {
mutex_lock(&freezer_mutex);
freezer_apply_state() {
static_branch_inc(&freezer_active) {
static_key_slow_inc() {
cpus_read_lock();
static_key_slow_inc_cpuslocked();
cpus_read_unlock();
}
}
}
mutex_unlock(&freezer_mutex);
}
}
}
Swap locking order by moving cpus_read_lock() in freezer_apply_state()
to before mutex_lock(&freezer_mutex) in freezer_change_state().
Reported-by: syzbot <syzbot+c39682e86c9d84152f93@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=c39682e86c9d84152f93
Suggested-by: Hillf Danton <hdanton@sina.com>
Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When local group is fully busy but its average load is above system load,
computing the imbalance will overflow and local group is not the best
target for pulling this load.
Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
Reported-by: Tingjia Cao <tjcao980311@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tingjia Cao <tjcao980311@gmail.com>
Link: https://lore.kernel.org/lkml/CABcWv9_DAhVBOq2=W=2ypKE9dKM5s2DvoV8-U0+GDwwuKZ89jQ@mail.gmail.com/T/
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"This fixes a pair of bugs in which an improbable but very real
sequence of events can cause kfree_rcu() to be a bit too quick about
freeing the memory passed to it.
It turns out that this pair of bugs is about two years old, and so
this is not a v6.3 regression. However: (1) It just started showing up
in the wild and (2) Its consequences are dire, so its fix needs to go
in sooner rather than later.
Testing is of course being upgraded, and the upgraded tests detect
this situation very quickly. But to the best of my knowledge right
now, the tests are not particularly urgent and will thus most likely
show up in the v6.5 merge window (the one after this coming one).
Kudos to Ziwei Dai and his group for tracking this one down the hard
way!"
* tag 'urgent-rcu.2023.04.07a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Fix "same task" check when redirecting event output
- Do not wait unconditionally for RCU on the event migration path if
there are no events to migrate
* tag 'perf_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix the same task check in perf_event_set_output
perf: Optimize perf_pmu_migrate_context()
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
- fix a braino in the swiotlb alignment check fix (Petr Tesarik)
* tag 'dma-mapping-6.3-2023-04-08' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix a braino in the alignment check fix
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"A couple more minor fixes:
- Reset direct->addr back to its original value on error in updating
the direct trampoline code
- Make lastcmd_mutex static"
* tag 'trace-v6.3-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/synthetic: Make lastcmd_mutex static
ftrace: Fix issue that 'direct->addr' not restored in modify_ftrace_direct()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM fixes from Andrew Morton:
"28 hotfixes.
23 are cc:stable and the other five address issues which were
introduced during this merge cycle.
20 are for MM and the remainder are for other subsystems"
* tag 'mm-hotfixes-stable-2023-04-07-16-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits)
maple_tree: fix a potential concurrency bug in RCU mode
maple_tree: fix get wrong data_end in mtree_lookup_walk()
mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
nilfs2: fix sysfs interface lifetime
mm: take a page reference when removing device exclusive entries
mm: vmalloc: avoid warn_alloc noise caused by fatal signal
nilfs2: initialize "struct nilfs_binfo_dat"->bi_pad field
nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread()
zsmalloc: document freeable stats
zsmalloc: document new fullness grouping
fsdax: force clear dirty mark if CoW
mm/hugetlb: fix uffd wr-protection for CoW optimization path
mm: enable maple tree RCU mode by default
maple_tree: add RCU lock checking to rcu callback functions
maple_tree: add smp_rmb() to dead node detection
maple_tree: fix write memory barrier of nodes once dead for RCU mode
maple_tree: remove extra smp_wmb() from mas_dead_leaves()
maple_tree: fix freeing of nodes in rcu mode
maple_tree: detect dead nodes in mas_start()
maple_tree: be more cautious about dead nodes
...
|
|
The lastcmd_mutex is only used in trace_events_synth.c and should be
static.
Link: https://lore.kernel.org/linux-trace-kernel/202304062033.cRStgOuP-lkp@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230406111033.6e26de93@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Fixes: 4ccf11c4e8a8e ("tracing/synthetic: Fix races on freeing last_cmd")
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode. The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.
On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And
the kfree_rcu_monitor() function does in fact check for this.
Except that the kfree_rcu_monitor() function checks these pointers one
at a time. This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.
To see this, consider the following sequence of events, in which:
o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
then is preempted.
o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
after a later grace period. Except that "from_cset" is freed
right after the previous grace period ended, so that "from_cset"
is immediately freed. Task A resumes and references "from_cset"'s
member, after which nothing good happens.
In full detail:
CPU 0 CPU 1
---------------------- ----------------------
count_memcg_event_mm()
|rcu_read_lock() <---
|mem_cgroup_from_task()
|// css_set_ptr is the "from_cset" mentioned on CPU 1
|css_set_ptr = rcu_dereference((task)->cgroups)
|// Hard irq comes, current task is scheduled out.
cgroup_attach_task()
|cgroup_migrate()
|cgroup_migrate_execute()
|css_set_move_task(task, from_cset, to_cset, true)
|cgroup_move_task(task, to_cset)
|rcu_assign_pointer(.., to_cset)
|...
|cgroup_migrate_finish()
|put_css_set_locked(from_cset)
|from_cset->refcount return 0
|kfree_rcu(cset, rcu_head) // free from_cset after new gp
|add_ptr_to_bulk_krc_lock()
|schedule_delayed_work(&krcp->monitor_work, ..)
kfree_rcu_monitor()
|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
|queue_rcu_work(system_wq, &krwp->rcu_work)
|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
// There is a perious call_rcu(.., rcu_work_rcufn)
// gp end, rcu_work_rcufn() is called.
rcu_work_rcufn()
|__queue_work(.., rwork->wq, &rwork->work);
|kfree_rcu_work()
|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
|The "from_cset" is freed before new gp end.
// the task resumes some time later.
|css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.
This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.
v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Syzkaller report a WARNING: "WARN_ON(!direct)" in modify_ftrace_direct().
Root cause is 'direct->addr' was changed from 'old_addr' to 'new_addr' but
not restored if error happened on calling ftrace_modify_direct_caller().
Then it can no longer find 'direct' by that 'old_addr'.
To fix it, restore 'direct->addr' to 'old_addr' explicitly in error path.
Link: https://lore.kernel.org/linux-trace-kernel/20230330025223.1046087-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <ast@kernel.org>
Cc: <daniel@iogearbox.net>
Fixes: 8a141dd7f706 ("ftrace: Fix modify_ftrace_direct.")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The alignment mask in swiotlb_do_find_slots() masks off the high
bits which are not relevant for the alignment, so multiple
requirements are combined with a bitwise OR rather than AND.
In plain English, the stricter the alignment, the more bits must
be set in iotlb_align_mask.
Confusion may arise from the fact that the same variable is also
used to mask off the offset within a swiotlb slot, which is
achieved with a bitwise AND.
Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks")
Reported-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/all/CAA42JLa1y9jJ7BgQvXeUYQh-K2mDNHd2BYZ4iZUz33r5zY7oAQ@mail.gmail.com/
Reported-by: Kelsey Steele <kelseysteele@linux.microsoft.com>
Link: https://lore.kernel.org/all/20230405003549.GA21326@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Use the maple tree in RCU mode for VMA tracking.
The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock. This is safe as the
writes to the stack have a guard VMA which ensures there will always be a
NULL in the direction of the growth and thus will only update a pivot.
It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs. syzbot has constructed a testcase which sets up a VMA
to grow and consume the empty space. Overwriting the entire NULL entry
causes the tree to be altered in a way that is not safe for concurrent
readers; the readers may see a node being rewritten or one that does not
match the maple state they are using.
Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.
[Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU[
Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When a tracing instance is removed, the error messages that hold errors
that occurred in the instance needs to be freed. The following reports a
memory leak:
# cd /sys/kernel/tracing
# mkdir instances/foo
# echo 'hist:keys=x' > instances/foo/events/sched/sched_switch/trigger
# cat instances/foo/error_log
[ 117.404795] hist:sched:sched_switch: error: Couldn't find field
Command: hist:keys=x
^
# rmdir instances/foo
Then check for memory leaks:
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810d8ec700 (size 192):
comm "bash", pid 869, jiffies 4294950577 (age 215.752s)
hex dump (first 32 bytes):
60 dd 68 61 81 88 ff ff 60 dd 68 61 81 88 ff ff `.ha....`.ha....
a0 30 8c 83 ff ff ff ff 26 00 0a 00 00 00 00 00 .0......&.......
backtrace:
[<00000000dae26536>] kmalloc_trace+0x2a/0xa0
[<00000000b2938940>] tracing_log_err+0x277/0x2e0
[<000000004a0e1b07>] parse_atom+0x966/0xb40
[<0000000023b24337>] parse_expr+0x5f3/0xdb0
[<00000000594ad074>] event_hist_trigger_parse+0x27f8/0x3560
[<00000000293a9645>] trigger_process_regex+0x135/0x1a0
[<000000005c22b4f2>] event_trigger_write+0x87/0xf0
[<000000002cadc509>] vfs_write+0x162/0x670
[<0000000059c3b9be>] ksys_write+0xca/0x170
[<00000000f1cddc00>] do_syscall_64+0x3e/0xc0
[<00000000868ac68c>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
unreferenced object 0xffff888170c35a00 (size 32):
comm "bash", pid 869, jiffies 4294950577 (age 215.752s)
hex dump (first 32 bytes):
0a 20 20 43 6f 6d 6d 61 6e 64 3a 20 68 69 73 74 . Command: hist
3a 6b 65 79 73 3d 78 0a 00 00 00 00 00 00 00 00 :keys=x.........
backtrace:
[<000000006a747de5>] __kmalloc+0x4d/0x160
[<000000000039df5f>] tracing_log_err+0x29b/0x2e0
[<000000004a0e1b07>] parse_atom+0x966/0xb40
[<0000000023b24337>] parse_expr+0x5f3/0xdb0
[<00000000594ad074>] event_hist_trigger_parse+0x27f8/0x3560
[<00000000293a9645>] trigger_process_regex+0x135/0x1a0
[<000000005c22b4f2>] event_trigger_write+0x87/0xf0
[<000000002cadc509>] vfs_write+0x162/0x670
[<0000000059c3b9be>] ksys_write+0xca/0x170
[<00000000f1cddc00>] do_syscall_64+0x3e/0xc0
[<00000000868ac68c>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
The problem is that the error log needs to be freed when the instance is
removed.
Link: https://lore.kernel.org/lkml/76134d9f-a5ba-6a0d-37b3-28310b4a1e91@alu.unizg.hr/
Link: https://lore.kernel.org/linux-trace-kernel/20230404194504.5790b95f@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Fixes: 2f754e771b1a6 ("tracing: Have the error logs show up in the proper instances")
Reported-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
'rcu/staging-kfree', remote-tracking branches 'paul/srcu-cf.2023.04.04a', 'fbq/rcu/lockdep.2023.03.27a' and 'fbq/rcu/rcutorture.2023.03.20a' into rcu/staging
|
|
The kfree_rcu() and kvfree_rcu() macros' single-argument forms are
deprecated. Therefore switch to the new kfree_rcu_mightsleep() and
kvfree_rcu_mightsleep() variants. The goal is to avoid accidental use
of the single-argument forms, which can introduce functionality bugs in
atomic contexts and latency bugs in non-atomic contexts.
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The kvfree_rcu() macro's single-argument form is deprecated. Therefore
switch to the new kvfree_rcu_mightsleep() variant. The goal is to
avoid accidental use of the single-argument forms, which can introduce
functionality bugs in atomic contexts and latency bugs in non-atomic
contexts.
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
For kernels built with CONFIG_PREEMPT_RCU=y, the following scenario can
result in a NULL-pointer dereference:
CPU1 CPU2
rcu_preempt_deferred_qs_irqrestore rcu_print_task_exp_stall
if (special.b.blocked) READ_ONCE(rnp->exp_tasks) != NULL
raw_spin_lock_rcu_node
np = rcu_next_node_entry(t, rnp)
if (&t->rcu_node_entry == rnp->exp_tasks)
WRITE_ONCE(rnp->exp_tasks, np)
....
raw_spin_unlock_irqrestore_rcu_node
raw_spin_lock_irqsave_rcu_node
t = list_entry(rnp->exp_tasks->prev,
struct task_struct, rcu_node_entry)
(if rnp->exp_tasks is NULL, this
will dereference a NULL pointer)
The problem is that CPU2 accesses the rcu_node structure's->exp_tasks
field without holding the rcu_node structure's ->lock and CPU2 did
not observe CPU1's change to rcu_node structure's ->exp_tasks in time.
Therefore, if CPU1 sets rcu_node structure's->exp_tasks pointer to NULL,
then CPU2 might dereference that NULL pointer.
This commit therefore holds the rcu_node structure's ->lock while
accessing that structure's->exp_tasks field.
[ paulmck: Apply Frederic Weisbecker feedback. ]
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
Registering a kprobe on __rcu_irq_enter_check_tick() can cause kernel
stack overflow as shown below. This issue can be reproduced by enabling
CONFIG_NO_HZ_FULL and booting the kernel with argument "nohz_full=",
and then giving the following commands at the shell prompt:
# cd /sys/kernel/tracing/
# echo 'p:mp1 __rcu_irq_enter_check_tick' >> kprobe_events
# echo 1 > events/kprobes/enable
This commit therefore adds __rcu_irq_enter_check_tick() to the kprobes
blacklist using NOKPROBE_SYMBOL().
Insufficient stack space to handle exception!
ESR: 0x00000000f2000004 -- BRK (AArch64)
FAR: 0x0000ffffccf3e510
Task stack: [0xffff80000ad30000..0xffff80000ad38000]
IRQ stack: [0xffff800008050000..0xffff800008058000]
Overflow stack: [0xffff089c36f9f310..0xffff089c36fa0310]
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
pstate: 400003c5 (nZcv DAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __rcu_irq_enter_check_tick+0x0/0x1b8
lr : ct_nmi_enter+0x11c/0x138
sp : ffff80000ad30080
x29: ffff80000ad30080 x28: ffff089c82e20000 x27: 0000000000000000
x26: 0000000000000000 x25: ffff089c02a8d100 x24: 0000000000000000
x23: 00000000400003c5 x22: 0000ffffccf3e510 x21: ffff089c36fae148
x20: ffff80000ad30120 x19: ffffa8da8fcce148 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: ffffa8da8e44ea6c
x14: ffffa8da8e44e968 x13: ffffa8da8e03136c x12: 1fffe113804d6809
x11: ffff6113804d6809 x10: 0000000000000a60 x9 : dfff800000000000
x8 : ffff089c026b404f x7 : 00009eec7fb297f7 x6 : 0000000000000001
x5 : ffff80000ad30120 x4 : dfff800000000000 x3 : ffffa8da8e3016f4
x2 : 0000000000000003 x1 : 0000000000000000 x0 : 0000000000000000
Kernel panic - not syncing: kernel stack overflow
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0xf8/0x108
show_stack+0x20/0x30
dump_stack_lvl+0x68/0x84
dump_stack+0x1c/0x38
panic+0x214/0x404
add_taint+0x0/0xf8
panic_bad_stack+0x144/0x160
handle_bad_stack+0x38/0x58
__bad_stack+0x78/0x7c
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
[...]
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
el1_interrupt+0x28/0x60
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x64/0x68
__ftrace_set_clr_event_nolock+0x98/0x198
__ftrace_set_clr_event+0x58/0x80
system_enable_write+0x144/0x178
vfs_write+0x174/0x738
ksys_write+0xd0/0x188
__arm64_sys_write+0x4c/0x60
invoke_syscall+0x64/0x180
el0_svc_common.constprop.0+0x84/0x160
do_el0_svc+0x48/0xe8
el0_svc+0x34/0xd0
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x190/0x194
SMP: stopping secondary CPUs
Kernel Offset: 0x28da86000000 from 0xffff800008000000
PHYS_OFFSET: 0xfffff76600000000
CPU features: 0x00000,01a00100,0000421b
Memory Limit: none
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/all/20221119040049.795065-1-zhengyejian1@huawei.com/
Fixes: aaf2bc50df1f ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The call to synchronize_srcu() from rcu_tasks_postscan() can be stalled
by a task getting stuck in do_exit() between that function's calls to
exit_tasks_rcu_start() and exit_tasks_rcu_finish(). To ease diagnosis
of this situation, print a stall warning message every rcu_task_stall_info
period when rcu_tasks_postscan() is stalled.
[ paulmck: Adjust to handle CONFIG_SMP=n. ]
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/rcu/20230111212736.GA1062057@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
According to the commit log of the patch that added it to the kernel,
start_poll_synchronize_rcu_expedited() can be invoked very early, as
in long before rcu_init() has been invoked. But before rcu_init(),
the rcu_data structure's ->mynode field has not yet been initialized.
This means that the start_poll_synchronize_rcu_expedited() function's
attempt to set the CPU's leaf rcu_node structure's ->exp_seq_poll_rq
field will result in a segmentation fault.
This commit therefore causes start_poll_synchronize_rcu_expedited() to
set ->exp_seq_poll_rq only after rcu_init() has initialized all CPUs'
rcu_data structures' ->mynode fields. It also removes the check from
the rcu_init() function so that start_poll_synchronize_rcu_expedited(
is unconditionally invoked. Yes, this might result in an unnecessary
boot-time grace period, but this is down in the noise.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The rcu_accelerate_cbs() function is invoked by rcu_report_qs_rdp()
only if there is a grace period in progress that is still blocked
by at least one CPU on this rcu_node structure. This means that
rcu_accelerate_cbs() should never return the value true, and thus that
this function should never set the needwake variable and in turn never
invoke rcu_gp_kthread_wake().
This commit therefore removes the needwake variable and the invocation
of rcu_gp_kthread_wake() in favor of a WARN_ON_ONCE() on the call to
rcu_accelerate_cbs(). The purpose of this new WARN_ON_ONCE() is to
detect situations where the system's opinion differs from ours.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The lazy_rcu_shrink_count() shrinker function is registered even in
kernels built with CONFIG_RCU_LAZY=n, in which case this function
uselessly consumes cycles learning that no CPU has any lazy callbacks
queued.
This commit therefore registers this shrinker function only in the kernels
built with CONFIG_RCU_LAZY=y, where it might actually do something useful.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus enabling
RCU expedited grace periods to actually force-enable scheduling-clock
interrupts on holdout CPUs.
Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
For kernels built with CONFIG_NO_HZ_FULL=y, the following scenario can result
in the scheduling-clock interrupt remaining enabled on a holdout CPU after
its quiescent state has been reported:
CPU1 CPU2
rcu_report_exp_cpu_mult synchronize_rcu_expedited_wait
acquires rnp->lock mask = rnp->expmask;
for_each_leaf_node_cpu_mask(rnp, cpu, mask)
rnp->expmask = rnp->expmask & ~mask; rdp = per_cpu_ptr(&rcu_data, cpu1);
for_each_leaf_node_cpu_mask(rnp, cpu, mask)
rdp = per_cpu_ptr(&rcu_data, cpu1);
if (!rdp->rcu_forced_tick_exp)
continue; rdp->rcu_forced_tick_exp = true;
tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);
The problem is that CPU2's sampling of rnp->expmask is obsolete by the
time it invokes tick_dep_set_cpu(), and CPU1 is not guaranteed to see
CPU2's store to ->rcu_forced_tick_exp in time to clear it. And even if
CPU1 does see that store, it might invoke tick_dep_clear_cpu() before
CPU2 got around to executing its tick_dep_set_cpu(), which would still
leave the victim CPU with its scheduler-clock tick running.
Either way, an nohz_full real-time application running on the victim
CPU would have its latency needlessly degraded.
Note that expedited RCU grace periods look at context-tracking
information, and so if the CPU is executing in nohz_full usermode
throughout, that CPU cannot be victimized in this manner.
This commit therefore causes synchronize_rcu_expedited_wait to hold
the rcu_node structure's ->lock when checking for holdout CPUs, setting
TICK_DEP_BIT_RCU_EXP, and invoking tick_dep_set_cpu(), thus preventing
this race.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined.
However, cpu_is_hotpluggable() still returns true for those CPUs. This causes
torture tests that do offlining to end up trying to offline this CPU causing
test failures. Such failure happens on all architectures.
Fix the repeated error messages thrown by this (even if the hotplug errors are
harmless) by asking the opinion of the nohz subsystem on whether the CPU can be
hotplugged.
[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]
For drivers/base/ portion:
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: rcu <rcu@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
Now that all references to CONFIG_SRCU have been removed, it is time to
remove CONFIG_SRCU itself.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
This commit adds a comment to help explain why the "else" clause of the
in_serving_softirq() "if" statement does not need to enforce a time limit.
The reason is that this "else" clause handles rcuoc kthreads that do not
block handlers for other softirq vectors.
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
There is an smp_mb() named "E" in srcu_flip() immediately before the
increment (flip) of the srcu_struct structure's ->srcu_idx.
The purpose of E is to order the preceding scan's read of lock counters
against the flipping of the ->srcu_idx, in order to prevent new readers
from continuing to use the old ->srcu_idx value, which might needlessly
extend the grace period.
However, this ordering is already enforced because of the control
dependency between the preceding scan and the ->srcu_idx flip.
This control dependency exists because atomic_long_read() is used
to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
->srcu_unlock_count[] counts match. And such a match cannot happen when
there is an in-flight reader that started before the flip (observation
courtesy Mathieu Desnoyers).
The litmus test below (courtesy of Frederic Weisbecker, with changes
for ctrldep by Boqun and Joel) shows this:
C srcu
(*
* bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
*
* So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
* (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
* relations. Combining the ->ppo with ->rf, a cycle is impossible.
*)
{}
// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
int lock1;
int unlock1;
int lock0;
int unlock0;
// SCAN1
unlock1 = READ_ONCE(*UNLOCK1);
smp_mb(); // A
lock1 = READ_ONCE(*LOCK1);
// FLIP
if (lock1 == unlock1) { // Control dep
smp_mb(); // E // Remove E and still passes.
WRITE_ONCE(*IDX, 1);
smp_mb(); // D
// SCAN2
unlock0 = READ_ONCE(*UNLOCK0);
smp_mb(); // A
lock0 = READ_ONCE(*LOCK0);
}
}
// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
int tmp;
int idx1;
int idx2;
// 1st reader
idx1 = READ_ONCE(*IDX);
if (idx1 == 0) { // Control dep
tmp = READ_ONCE(*LOCK0);
WRITE_ONCE(*LOCK0, tmp + 1);
smp_mb(); /* B and C */
tmp = READ_ONCE(*UNLOCK0);
WRITE_ONCE(*UNLOCK0, tmp + 1);
} else {
tmp = READ_ONCE(*LOCK1);
WRITE_ONCE(*LOCK1, tmp + 1);
smp_mb(); /* B and C */
tmp = READ_ONCE(*UNLOCK1);
WRITE_ONCE(*UNLOCK1, tmp + 1);
}
}
exists (0:lock1=1 /\ 1:idx1=1)
More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.
This commit therefore clarifies the comment on memory barrier E.
Why not also remove that redundant smp_mb()?
Because control dependencies are quite fragile due to their not being
recognized by most compilers and tools. Control dependencies therefore
exact an ongoing maintenance burden, and such a burden cannot be justified
in this slowpath. Therefore, that smp_mb() stays until such time as
its overhead becomes a measurable problem in a real workload running on
a real production system, or until such time as compilers start paying
attention to this sort of control dependency.
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The state space of the GP sequence number isn't documented and the
definitions of its special values are scattered. This commit therefore
gathers some common knowledge near the grace-period sequence-number
definitions.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The same task check in perf_event_set_output has some potential issues
for some usages.
For the current perf code, there is a problem if using of
perf_event_open() to have multiple samples getting into the same mmap’d
memory when they are both attached to the same process.
https://lore.kernel.org/all/92645262-D319-4068-9C44-2409EF44888E@gmail.com/
Because the event->ctx is not ready when the perf_event_set_output() is
invoked in the perf_event_open().
Besides the above issue, before the commit bd2756811766 ("perf: Rewrite
core context handling"), perf record can errors out when sampling with
a hardware event and a software event as below.
$ perf record -e cycles,dummy --per-thread ls
failed to mmap with 22 (Invalid argument)
That's because that prior to the commit a hardware event and a software
event are from different task context.
The problem should be a long time issue since commit c3f00c70276d
("perk: Separate find_get_context() from event initialization").
The task struct is stored in the event->hw.target for each per-thread
event. It is a more reliable way to determine whether two events are
attached to the same task.
The event->hw.target was also introduced several years ago by the
commit 50f16a8bf9d7 ("perf: Remove type specific target pointers"). It
can not only be used to fix the issue with the current code, but also
back port to fix the issues with an older kernel.
Note: The event->hw.target was introduced later than commit
c3f00c70276d. The patch may cannot be applied between the commit
c3f00c70276d and commit 50f16a8bf9d7. Anybody that wants to back-port
this at that period may have to find other solutions.
Fixes: c3f00c70276d ("perf: Separate find_get_context() from event initialization")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
Link: https://lkml.kernel.org/r/20230322202449.512091-1-kan.liang@linux.intel.com
|