| Age | Commit message (Collapse) | Author | Files | Lines |
|
handle_percpu_devid_fasteoi_ipi() has no more users, and
handle_percpu_devid_irq() can do all that it was supposed to do. Get rid of
it.
This reverts commit c5e5ec033c4ab25c53f1fd217849e75deb0bf7bf.
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The code for the legacy RTC and the RTC class based update are pretty much
the same. Consolidate the common parts into one function and just invoke
the actual setter functions.
For RTC class based devices the update code checks whether the offset is
valid for the device, which is usually not the case for the first
invocation. If it's not the same it stores the correct offset and lets the
caller try again. That's not much different from the previous approach
where the first invocation had a pretty low probability to actually hit the
allowed window.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Acked-by: Alexandre Belloni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The current RTC set_offset_nsec value is not really intuitive to
understand.
tsched twrite(t2.tv_sec - 1) t2 (seconds increment)
The offset is calculated from twrite based on the assumption that t2 -
twrite == 1s. That means for the MC146818 RTC the offset needs to be
negative so that the write happens 500ms before t2.
It's easier to understand when the whole calculation is based on t2. That
avoids negative offsets and the meaning is obvious:
t2 - twrite: The time defined by the chip when seconds increment
after the write.
twrite - tsched: The time for the transport to the point where the chip
is updated.
==> set_offset_nsec = t2 - tsched
ttransport = twrite - tsched
tRTCinc = t2 - twrite
==> set_offset_nsec = ttransport + tRTCinc
tRTCinc is a chip property and can be obtained from the data sheet.
ttransport depends on how the RTC is connected. It is close to 0 for
directly accessible RTCs. For RTCs behind a slow bus, e.g. i2c, it's the
time required to send the update over the bus. This can be estimated or
even calibrated, but that's a different problem.
Adjust the implementation and update comments accordingly.
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Alexandre Belloni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
rtc_set_ntp_time() is not really RTC functionality as the code is just a
user of RTC. Move it into the NTP code which allows further cleanups.
Requested-by: Alexandre Belloni <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Acked-by: Alexandre Belloni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Miroslav reported that the periodic RTC synchronization in the NTP code
fails more often than not to hit the specified update window.
The reason is that the code uses delayed_work to schedule the update which
needs to be in thread context as the underlying RTC might be connected via
a slow bus, e.g. I2C. In the update function it verifies whether the
current time is correct vs. the requirements of the underlying RTC.
But delayed_work is using the timer wheel for scheduling which is
inaccurate by design. Depending on the distance to the expiry the wheel
gets less granular to allow batching and to avoid the cascading of the
original timer wheel. See 500462a9de65 ("timers: Switch to a non-cascading
wheel") and the code for further details.
The code already deals with this by splitting the 660 seconds period into a
long 659 seconds timer and then retrying with a smaller delta.
But looking at the actual granularities of the timer wheel (which depend on
the HZ configuration) the 659 seconds timer ends up in an outer wheel level
and is affected by a worst case granularity of:
HZ Granularity
1000 32s
250 16s
100 40s
So the initial timer can be already off by max 12.5% which is not a big
issue as the period of the sync is defined as ~11 minutes.
The fine grained second attempt schedules to the desired update point with
a timer expiring less than a second from now. Depending on the actual delta
and the HZ setting even the second attempt can end up in outer wheel levels
which have a large enough granularity to make the correctness check fail.
As this is a fundamental property of the timer wheel there is no way to
make this more accurate short of iterating in one jiffies steps towards the
update point.
Switch it to an hrtimer instead which schedules the actual update work. The
hrtimer will expire precisely (max 1 jiffie delay when high resolution
timers are not available). The actual scheduling delay of the work is the
same as before.
The update is triggered from do_adjtimex() which is a bit racy but not much
more racy than it was before:
if (ntp_synced())
queue_delayed_work(system_power_efficient_wq, &sync_work, 0);
which is racy when the work is currently executed and has not managed to
reschedule itself.
This becomes now:
if (ntp_synced() && !hrtimer_is_queued(&sync_hrtimer))
queue_work(system_power_efficient_wq, &sync_work, 0);
which is racy when the hrtimer has expired and the work is currently
executed and has not yet managed to rearm the hrtimer.
Not a big problem as it just schedules work for nothing.
The new implementation has a safe guard in place to catch the case where
the hrtimer is queued on entry to the work function and avoids an extra
update attempt of the RTC that way.
Reported-by: Miroslav Lichvar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Miroslav Lichvar <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Acked-by: Alexandre Belloni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
idle_balance() has been renamed to newidle_balance(). To differentiate
with nohz_idle_balance, it seems refining the comment will be helpful
for the readers of the code.
Signed-off-by: Barry Song <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The clearing of SMT siblings from the SIS mask before checking for an idle
core is a small but unnecessary cost. Defer the clearing of the siblings
until the scan moves to the next potential target. The cost of this was
not measured as it is borderline noise but it should be self-evident.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Kernel-doc requires that a kernel-doc markup to be immediately
below the function prototype, as otherwise it will rename it.
So, move sys_sched_yield() markup to the right place.
Also fix the cpu_util() markup: Kernel-doc markups
should use this format:
identifier - description
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/50cd6f460aeb872ebe518a8e9cfffda2df8bdb0a.1606823973.git.mchehab+huawei@kernel.org
|
|
Pull networking fixes from David Miller:
1) IPsec compat fixes, from Dmitry Safonov.
2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai.
3) Fix polling in xsk sockets by using sk_poll_wait() instead of
datagram_poll() which keys off of sk_wmem_alloc and such which xsk
sockets do not update. From Xuan Zhuo.
4) Missing init of rekey_data in cfgh80211, from Sara Sharon.
5) Fix destroy of timer before init, from Davide Caratti.
6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd
Bergmann.
7) Missing error return in rtm_to_fib_config() switch case, from Zhang
Changzhong.
8) Fix some src/dest address handling in vrf and add a testcase. From
Stephen Suryaputra.
9) Fix multicast handling in Seville switches driven by mscc-ocelot
driver. From Vladimir Oltean.
10) Fix proto value passed to skb delivery demux in udp, from Xin Long.
11) HW pkt counters not reported correctly in enetc driver, from Claudiu
Manoil.
12) Fix deadlock in bridge, from Joseph Huang.
13) Missing of_node_pur() in dpaa2 driver, fromn Christophe JAILLET.
14) Fix pid fetching in bpftool when there are a lot of results, from
Andrii Nakryiko.
15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso.
16) Various stymmac fixes, from Fugang Duan.
17) Fix null deref in tipc, from Cengiz Can.
18) When mss is biog, coose more resonable rcvq_space in tcp, fromn Eric
Dumazet.
19) Revert a geneve change that likely isnt necessary, from Jakub
Kicinski.
20) Avoid premature rx buffer reuse in various Intel driversm from Björn
Töpel.
21) retain EcT bits during TIS reflection in tcp, from Wei Wang.
22) Fix Tso deferral wrt. cwnd limiting in tcp, from Neal Cardwell.
23) MPLS_OPT_LSE_LABEL attribute is 342 ot 8 bits, from Guillaume Nault
24) Fix propagation of 32-bit signed bounds in bpf verifier and add test
cases, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
selftests: fix poll error in udpgro.sh
selftests/bpf: Fix "dubious pointer arithmetic" test
selftests/bpf: Fix array access with signed variable test
selftests/bpf: Add test for signed 32-bit bound check bug
bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver
net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
net/mlx4_en: Handle TX error CQE
net/mlx4_en: Avoid scheduling restart task if it is already running
tcp: fix cwnd-limited bug for TSO deferral where we send nothing
net: flow_offload: Fix memory leak for indirect flow block
tcp: Retain ECT bits for tos reflection
ethtool: fix stack overflow in ethnl_parse_bitset()
e1000e: fix S0ix flow to allow S0i3.2 subset entry
ice: avoid premature Rx buffer reuse
ixgbe: avoid premature Rx buffer reuse
i40e: avoid premature Rx buffer reuse
igb: avoid transmit queue timeout in xdp path
igb: use xdp_do_flush
igb: skb add metasize for xdp
...
|
|
Alexei Starovoitov says:
====================
pull-request: bpf 2020-12-10
The following pull-request contains BPF updates for your *net* tree.
We've added 21 non-merge commits during the last 12 day(s) which contain
a total of 21 files changed, 163 insertions(+), 88 deletions(-).
The main changes are:
1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.
2) Fix ring_buffer__poll() return value, from Andrii.
3) Fix race in lwt_bpf, from Cong.
4) Fix test_offload, from Toke.
5) Various xsk fixes.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
The 64-bit signed bounds should not affect 32-bit signed bounds unless the
verifier knows that upper 32-bits are either all 1s or all 0s. For example the
register with smin_value==1 doesn't mean that s32_min_value is also equal to 1,
since smax_value could be larger than 32-bit subregister can hold.
The verifier refines the smax/s32_max return value from certain helpers in
do_refine_retval_range(). Teach the verifier to recognize that smin/s32_min
value is also bounded. When both smin and smax bounds fit into 32-bit
subregister the verifier can propagate those bounds.
Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Jean-Philippe Brucker <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex. The problematic lock ordering found by lockdep
was:
perf_event_open (exec_update_mutex -> ovl_i_mutex)
chown (ovl_i_mutex -> sb_writes)
sendfile (sb_writes -> p->lock)
by reading from a proc file and writing to overlayfs
proc_pid_syscall (p->lock -> exec_update_mutex)
While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same. They are all readers. The only writer is
exec.
There is no reason for readers to block on each other. So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.
Cc: Jann Horn <[email protected]>
Cc: Vasiliy Kulikov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Bernd Edlinger <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christopher Yeoh <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Sargun Dhillon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
[0] https://lkml.kernel.org/r/[email protected]
Reported-by: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count. Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.
Using task_lookup_next_fd_rcu simplifies task_file_seq_get_next, by
moving the checking for the maximum file descritor into the generic
code, and by remvoing the need for capturing and releasing a reference
on files_struct. As the reference count of files_struct no longer
needs to be maintained bpf_iter_seq_task_file_info can have it's files
member removed and task_file_seq_get_next no longer needs it's fstruct
argument.
The curr_fd local variable does need to become unsigned to be used
with fnext_task. As curr_fd is assigned from and assigned a u32
making curr_fd an unsigned int won't cause problems and might prevent
them.
[1] https://lkml.kernel.org/r/[email protected]
Suggested-by: Oleg Nesterov <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Modify get_file_raw_ptr to use task_lookup_fd_rcu. The helper
task_lookup_fd_rcu does the work of taking the task lock and verifying
that task->files != NULL and then calls files_lookup_fd_rcu. So let
use the helper to make a simpler implementation of get_file_raw_ptr.
Acked-by: Cyrill Gorcunov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
This change renames fcheck_files to files_lookup_fd_rcu. All of the
remaining callers take the rcu_read_lock before calling this function
so the _rcu suffix is appropriate. This change also tightens up the
debug check to verify that all callers hold the rcu_read_lock.
All callers that used to call files_check with the files->file_lock
held have now been changed to call files_lookup_fd_locked.
This change of name has helped remind me of which locks and which
guarantees are in place helping me to catch bugs later in the
patchset.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Use the helper fget_task to simplify bpf_task_fd_query.
As well as simplifying the code this removes one unnecessary increment of
struct files_struct. This unnecessary increment of files_struct.count can
result in exec unnecessarily unsharing files_struct and breaking posix
locks, and it can result in fget_light having to fallback to fget reducing
performance.
This simplification comes from the observation that none of the
callers of get_files_struct actually need to call get_files_struct
that was made when discussing[1] exec and posix file locks.
[1] https://lkml.kernel.org/r/[email protected]
Suggested-by: Oleg Nesterov <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Use the helper fget_task and simplify the code.
As well as simplifying the code this removes one unnecessary increment of
struct files_struct. This unnecessary increment of files_struct.count can
result in exec unnecessarily unsharing files_struct and breaking posix
locks, and it can result in fget_light having to fallback to fget reducing
performance.
Suggested-by: Oleg Nesterov <[email protected]>
Reviewed-by: Cyrill Gorcunov <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Now that exec no longer needs to return the unshared files to their
previous value there is no reason to return displaced.
Instead when unshare_fd creates a copy of the file table, call
put_files_struct before returning from unshare_files.
Acked-by: Christian Brauner <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
The driver has its own HID descriptor parsing code, that had and still
has several issues discovered by syzbot and other tools. Ideally we
should move the driver over to the HID subsystem, so that it uses proven
parsing code. However the devices in question are EOL, and GTCO is not
willing to extend resources for that, so let's simply remove the driver.
Note that our HID support has greatly improved over the last 10 years,
we may also consider reverting 6f8d9e26e7de ("hid-core.c: Adds all GTCO
CalComp Digitizers and InterWrite School Products to blacklist") and see
if GTCO devices actually work with normal HID drivers.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dmitry Torokhov <[email protected]>
|
|
There are multiple locations in the kernel where a struct fwnode_handle
is initialized. Add fwnode_init() so that we have one way of
initializing a fwnode_handle.
Acked-by: Rob Herring <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
* arm64/for-next/fixes: (26 commits)
arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE
arm64: mte: Fix typo in macro definition
arm64: entry: fix EL1 debug transitions
arm64: entry: fix NMI {user, kernel}->kernel transitions
arm64: entry: fix non-NMI kernel<->kernel transitions
arm64: ptrace: prepare for EL1 irq/rcu tracking
arm64: entry: fix non-NMI user<->kernel transitions
arm64: entry: move el1 irq/nmi logic to C
arm64: entry: prepare ret_to_user for function call
arm64: entry: move enter_from_user_mode to entry-common.c
arm64: entry: mark entry code as noinstr
arm64: mark idle code as noinstr
arm64: syscall: exit userspace before unmasking exceptions
arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
arm64: pgtable: Fix pte_accessible()
ACPI/IORT: Fix doc warnings in iort.c
arm64/fpsimd: add <asm/insn.h> to <asm/kprobes.h> to fix fpsimd build
arm64: cpu_errata: Apply Erratum 845719 to KRYO2XX Silver
arm64: proton-pack: Add KRYO2XX silver CPUs to spectre-v2 safe-list
arm64: kpti: Add KRYO2XX gold/silver CPU cores to kpti safelist
...
# Conflicts:
# arch/arm64/include/asm/exception.h
# arch/arm64/kernel/sdei.c
|
|
* arm64/for-next/scs:
arm64: sdei: Push IS_ENABLED() checks down to callee functions
arm64: scs: use vmapped IRQ and SDEI shadow stacks
scs: switch to vmapped shadow stacks
|
|
Syzbot reported a lock inversion involving perf. The sore point being
perf holding exec_update_mutex() for a very long time, specifically
across a whole bunch of filesystem ops in pmu::event_init() (uprobes)
and anon_inode_getfile().
This then inverts against procfs code trying to take
exec_update_mutex.
Move the permission checks later, such that we need to hold the mutex
over less code.
Reported-by: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
|
|
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It also improves
the chance that a reader can get the lock as writer optimistic spinning
disproportionally favors writers much more than readers.
Since commit d3681e269fff ("locking/rwsem: Wake up almost all readers
in wait queue"), all the waiting readers are woken up so that they can
all get the read lock and run in parallel. When the number of contending
readers is large, allowing reader optimistic spinning will likely cause
reader fragmentation where multiple smaller groups of readers can get
the read lock in a sequential manner separated by writers. That reduces
reader parallelism.
One possible way to address that drawback is to limit the number of
readers (preferably one) that can do optimistic spinning. These readers
act as representatives of all the waiting readers in the wait queue as
they will wake up all those waiting readers once they get the lock.
Alternatively, as reader optimistic lock stealing has already enhanced
fairness to readers, it may be easier to just remove reader optimistic
spinning and simplifying the optimistic spinning code as a result.
Performance measurements (locking throughput kops/s) using a locking
microbenchmark with 50/50 reader/writer distribution and turbo-boost
disabled was done on a 2-socket Cascade Lake system (48-core 96-thread)
to see the impacts of these changes:
1) Vanilla - 5.10-rc3 kernel
2) Before - 5.10-rc3 kernel with previous patches in this series
2) limit-rspin - 5.10-rc3 kernel with limited reader spinning patch
3) no-rspin - 5.10-rc3 kernel with reader spinning disabled
# of threads CS Load Vanilla Before limit-rspin no-rspin
------------ ------- ------- ------ ----------- --------
2 1 5,185 5,662 5,214 5,077
4 1 5,107 4,983 5,188 4,760
8 1 4,782 4,564 4,720 4,628
16 1 4,680 4,053 4,567 3,402
32 1 4,299 1,115 1,118 1,098
64 1 3,218 983 1,001 957
96 1 1,938 944 957 930
2 20 2,008 2,128 2,264 1,665
4 20 1,390 1,033 1,046 1,101
8 20 1,472 1,155 1,098 1,213
16 20 1,332 1,077 1,089 1,122
32 20 967 914 917 980
64 20 787 874 891 858
96 20 730 836 847 844
2 100 372 356 360 355
4 100 492 425 434 392
8 100 533 537 529 538
16 100 548 572 568 598
32 100 499 520 527 537
64 100 466 517 526 512
96 100 406 497 506 509
The column "CS Load" represents the number of pause instructions issued
in the locking critical section. A CS load of 1 is extremely short and
is not likey in real situations. A load of 20 (moderate) and 100 (long)
are more realistic.
It can be seen that the previous patches in this series have reduced
performance in general except in highly contended cases with moderate
or long critical sections that performance improves a bit. This change
is mostly caused by the "Prevent potential lock starvation" patch that
reduce reader optimistic spinning and hence reduce reader fragmentation.
The patch that further limit reader optimistic spinning doesn't seem to
have too much impact on overall performance as shown in the benchmark
data.
The patch that disables reader optimistic spinning shows reduced
performance at lightly loaded cases, but comparable or slightly better
performance on with heavier contention.
This patch just removes reader optimistic spinning for now. As readers
are not going to do optimistic spinning anymore, we don't need to
consider if the OSQ is empty or not when doing lock stealing.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
If the optimistic spinning queue is empty and the rwsem does not have
the handoff or write-lock bits set, it is actually not necessary to
call rwsem_optimistic_spin() to spin on it. Instead, it can steal the
lock directly as its reader bias is in the count already. If it is
the first reader in this state, it will try to wake up other readers
in the wait queue.
With this patch applied, the following were the lock event counts
after rebooting a 2-socket system and a "make -j96" kernel rebuild.
rwsem_opt_rlock=4437
rwsem_rlock=29
rwsem_rlock_steal=19
So lock stealing represents about 0.4% of all the read locks acquired
in the slow path.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The lock handoff bit is added in commit 4f23dbc1e657 ("locking/rwsem:
Implement lock handoff to prevent lock starvation") to avoid lock
starvation. However, allowing readers to do optimistic spinning does
introduce an unlikely scenario where lock starvation can happen.
The lock handoff bit may only be set when a waiter is being woken up.
In the case of reader unlock, wakeup happens only when the reader count
reaches 0. If there is a continuous stream of incoming readers acquiring
read lock via optimistic spinning, it is possible that the reader count
may never reach 0 and so the handoff bit will never be asserted.
One way to prevent this scenario from happening is to disallow optimistic
spinning if the rwsem is currently owned by readers. If the previous
or current owner is a writer, optimistic spinning will be allowed.
If the previous owner is a reader but the reader count has reached 0
before, a wakeup should have been issued. So the handoff mechanism
will be kicked in to prevent lock starvation. As a result, it should
be OK to do optimistic spinning in this case.
This patch may have some impact on reader performance as it reduces
reader optimistic spinning especially if the lock critical sections
are short the number of contending readers are small.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The atomic count value right after reader count increment can be useful
to determine the rwsem state at trylock time. So the count value is
passed down to rwsem_down_read_slowpath() to be used when appropriate.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
There's a lot needless duplication in __down_{read,write}*(), cure
that with a helper.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
One copy of this logic is better than three.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
All users of rwsem_read_trylock() do rwsem_set_reader_owned(sem) on
success, move it into rwsem_read_trylock() proper.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
|
|
In preparation for converting exec_update_mutex to a rwsem so that
multiple readers can execute in parallel and not deadlock, add
down_read_interruptible. This is needed for perf_event_open to be
converted (with no semantic changes) from working on a mutex to
wroking on a rwsem.
Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
In preparation for converting exec_update_mutex to a rwsem so that
multiple readers can execute in parallel and not deadlock, add
down_read_killable_nested. This is needed so that kcmp_lock
can be converted from working on a mutexes to working on rw_semaphores.
Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Since the ringbuffer is lockless, there is no need for it to be
protected by @logbuf_lock. Remove @logbuf_lock writer-protection of
the ringbuffer. The reader-protection is not removed because some
variables, used by readers, are using @logbuf_lock for synchronization:
@syslog_seq, @syslog_time, @syslog_partial, @console_seq,
struct kmsg_dumper.
For PRINTK_NMI_DIRECT_CONTEXT_MASK, @logbuf_lock usage is not removed
because it may be used for dumper synchronization.
Without @logbuf_lock synchronization of vprintk_store() it is no
longer possible to use the single static buffer for temporarily
sprint'ing the message. Instead, use vsnprintf() to determine the
length and perform the real vscnprintf() using the area reserved from
the ringbuffer. This leads to suboptimal packing of the message data,
but will result in less wasted storage than multiple per-cpu buffers
to support lockless temporary sprint'ing.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In preparation for removing logbuf_lock, inline log_output()
and log_store() into vprintk_store(). This will simplify dealing
with the various code branches and fallbacks that are possible.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Apparently there has been a longstanding race between udev/systemd and
the module loader. Currently, the module loader sends a uevent right
after sysfs initialization, but before the module calls its init
function. However, some udev rules expect that the module has
initialized already upon receiving the uevent.
This race has been triggered recently (see link in references) in some
systemd mount unit files. For instance, the configfs module creates the
/sys/kernel/config mount point in its init function, however the module
loader issues the uevent before this happens. sys-kernel-config.mount
expects to be able to mount /sys/kernel/config upon receipt of the
module loading uevent, but if the configfs module has not called its
init function yet, then this directory will not exist and the mount unit
fails. A similar situation exists for sys-fs-fuse-connections.mount, as
the fuse sysfs mount point is created during the fuse module's init
function. If udev is faster than module initialization then the mount
unit would fail in a similar fashion.
To fix this race, delay the module KOBJ_ADD uevent until after the
module has finished calling its init routine.
References: https://github.com/systemd/systemd/issues/17586
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Tested-By: Nicolas Morey-Chaisemartin <[email protected]>
Signed-off-by: Jessica Yu <[email protected]>
|
|
membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as
syncing the core on all sibling threads but not necessarily the calling
thread. This behavior is fundamentally buggy and cannot be used safely.
Suppose a user program has two threads. Thread A is on CPU 0 and thread B
is on CPU 1. Thread A modifies some text and calls
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).
Then thread B executes the modified code. If, at any point after
membarrier() decides which CPUs to target, thread A could be preempted and
replaced by thread B on CPU 0. This could even happen on exit from the
membarrier() syscall. If this happens, thread B will end up running on CPU
0 without having synced.
In principle, this could be fixed by arranging for the scheduler to issue
sync_core_before_usermode() whenever switching between two threads in the
same mm if there is any possibility of a concurrent membarrier() call, but
this would have considerable overhead. Instead, make membarrier() sync the
calling CPU as well.
As an optimization, this avoids an extra smp_mb() in the default
barrier-only mode and an extra rseq preempt on the caller.
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org
|
|
membarrier() does not explicitly sync_core() remote CPUs; instead, it
relies on the assumption that an IPI will result in a core sync. On x86,
this may be true in practice, but it's not architecturally reliable. In
particular, the SDM and APM do not appear to guarantee that interrupt
delivery is serializing. While IRET does serialize, IPI return can
schedule, thereby switching to another task in the same mm that was
sleeping in a syscall. The new task could then SYSRET back to usermode
without ever executing IRET.
Make this more robust by explicitly calling sync_core_before_usermode()
on remote cores. (This also helps people who search the kernel tree for
instances of sync_core() and sync_core_before_usermode() -- one might be
surprised that the core membarrier code doesn't currently show up in a
such a search.)
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org
|
|
It seems that most RSEQ membarrier users will expect any stores done before
the membarrier() syscall to be visible to the target task(s). While this
is extremely likely to be true in practice, nothing actually guarantees it
by a strict reading of the x86 manuals. Rather than providing this
guarantee by accident and potentially causing a problem down the road, just
add an explicit barrier.
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org
|
|
This moves the bpf_sock_from_file definition into net/core/filter.c
which only gets compiled with CONFIG_NET and also moves the helper proto
usage next to other tracing helpers that are conditional on CONFIG_NET.
This avoids
ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file':
bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file'
When compiling a kernel with BPF and without NET.
Reported-by: kernel test robot <[email protected]>
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Florent Revest <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Acked-by: KP Singh <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Return -ENOTSUPP if tracing BPF program is attempted to be attached with
specified attach_btf_obj_fd pointing to non-kernel (neither vmlinux nor
module) BTF object. This scenario might be supported in the future and isn't
outright invalid, so -EINVAL isn't the most appropriate error code.
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Commit 849f3127bb46 ("switch /dev/kmsg to ->write_iter()") refactored
devkmsg_write() and left over a dead assignment on the variable 'len'.
Hence, make clang-analyzer warns:
kernel/printk/printk.c:744:4: warning: Value stored to 'len' is never read
[clang-analyzer-deadcode.DeadStores]
len -= endp - line;
^
Simply remove this obsolete dead assignment here.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lukas Bulwahn <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
|
|
__htab_map_lookup_and_delete_batch() stores a user pointer in the local
variable ubatch and uses that in copy_{from,to}_user(), but ubatch misses a
__user annotation.
So, sparse warns in the various assignments and uses of ubatch:
kernel/bpf/hashtab.c:1415:24: warning: incorrect type in initializer
(different address spaces)
kernel/bpf/hashtab.c:1415:24: expected void *ubatch
kernel/bpf/hashtab.c:1415:24: got void [noderef] __user *
kernel/bpf/hashtab.c:1444:46: warning: incorrect type in argument 2
(different address spaces)
kernel/bpf/hashtab.c:1444:46: expected void const [noderef] __user *from
kernel/bpf/hashtab.c:1444:46: got void *ubatch
kernel/bpf/hashtab.c:1608:16: warning: incorrect type in assignment
(different address spaces)
kernel/bpf/hashtab.c:1608:16: expected void *ubatch
kernel/bpf/hashtab.c:1608:16: got void [noderef] __user *
kernel/bpf/hashtab.c:1609:26: warning: incorrect type in argument 1
(different address spaces)
kernel/bpf/hashtab.c:1609:26: expected void [noderef] __user *to
kernel/bpf/hashtab.c:1609:26: got void *ubatch
Add the __user annotation to repair this chain of propagating __user
annotations in __htab_map_lookup_and_delete_batch().
Signed-off-by: Lukas Bulwahn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
s/ring_buffer_commit_discard/ring_buffer_discard_commit/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Qiujun Huang <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Commit a389d86f7fd0 ("ring-buffer: Have nested events still record running
time stamp") removed the only uses of rb_event_is_commit() in
rb_update_event() and rb_update_write_stamp().
Hence, since then, make CC=clang W=1 warns:
kernel/trace/ring_buffer.c:2763:1:
warning: unused function 'rb_event_is_commit' [-Wunused-function]
Remove this obsolete function.
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Lukas Bulwahn <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Use of bpf_map_charge_init() was making sure hash tables would not use more
than 4GB of memory.
Since the implicit check disappeared, we have to be more careful
about overflows, to support big hash tables.
syzbot triggers a panic using :
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_LRU_HASH, key_size=16384, value_size=8,
max_entries=262200, map_flags=0, inner_map_fd=-1, map_name="",
map_ifindex=0, btf_fd=-1, btf_key_type_id=0, btf_value_type_id=0,
btf_vmlinux_value_type_id=0}, 64) = ...
BUG: KASAN: vmalloc-out-of-bounds in bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
BUG: KASAN: vmalloc-out-of-bounds in bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
Write of size 2 at addr ffffc90017e4a020 by task syz-executor.5/19786
CPU: 0 PID: 19786 Comm: syz-executor.5 Not tainted 5.10.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
prealloc_init kernel/bpf/hashtab.c:319 [inline]
htab_map_alloc+0xf6e/0x1230 kernel/bpf/hashtab.c:507
find_and_alloc_map kernel/bpf/syscall.c:123 [inline]
map_create kernel/bpf/syscall.c:829 [inline]
__do_sys_bpf+0xa81/0x5170 kernel/bpf/syscall.c:4336
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fd93fbc0c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000001a40 RCX: 000000000045deb9
RDX: 0000000000000040 RSI: 0000000020000280 RDI: 0000000000000000
RBP: 000000000119bf60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000119bf2c
R13: 00007ffc08a7be8f R14: 00007fd93fbc19c0 R15: 000000000119bf2c
Fixes: 755e5d55367a ("bpf: Eliminate rlimit-based memory accounting for hashtab maps")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Fixes: a54895fa057c ("block: remove the request_queue to argument request based tracepoints")
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix userstacktrace option for instances
While writing an application that requires user stack trace option to
work in instances, I found that the instance option has a bug that
makes it a nop. The check for performing the user stack trace in an
instance, checks the top level options (not the instance options) to
determine if a user stack trace should be performed or not.
This is not only incorrect, but also confusing for users. It confused
me for a bit!"
* tag 'trace-v5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix userstacktrace option for instances
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of updates for the interrupt subsystem:
- Make multiqueue devices which use the managed interrupt affinity
infrastructure work on PowerPC/Pseries. PowerPC does not use the
generic infrastructure for setting up PCI/MSI interrupts and the
multiqueue changes failed to update the legacy PCI/MSI
infrastructure. Make this work by passing the affinity setup
information down to the mapping and allocation functions.
- Move Jason Cooper from MAINTAINERS to CREDITS as his mail is
bouncing and he's not reachable. We hope all is well with him and
say thanks for his work over the years"
* tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
powerpc/pseries: Pass MSI affinity to irq_create_mapping()
genirq/irqdomain: Add an irq_create_mapping_affinity() function
MAINTAINERS: Move Jason Cooper to CREDITS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 5.10:
- Three commits fixing possible missed TLB invalidations for
multi-threaded processes when CPUs are hotplugged in and out.
- A fix for a host crash triggerable by host userspace (qemu) in KVM
on Power9.
- A fix for a host crash in machine check handling when running HPT
guests on a HPT host.
- One commit fixing potential missed TLB invalidations when using the
hash MMU on Power9 or later.
- A regression fix for machines with CPUs on node 0 but no memory.
Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
Dronamraju"
* tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
powerpc/numa: Fix a regression on memoryless node 0
powerpc/64s: Trim offlined CPUs from mm_cpumasks
kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation
|