|
Add the necessary defines for supporting the GHCB version 2 protocol.
This includes defines for:
- MSR-based AP hlt request/response
- Hypervisor Feature request/response
This is the bare minimum of requests that need to be supported by a GHCB
version 2 implementation. There are more requests in the specification,
but those depend on Secure Nested Paging support being available.
These defines are shared between SEV host and guest support.
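As a sketch only (macro names follow the kernel's GHCB naming; the numeric values are the GHCB spec v2 function codes carried in the low 12 bits of the GHCB MSR, and should be double-checked against the spec):

  /* GHCB MSR protocol: function code in the low 12 bits of the GHCB MSR */
  #define GHCB_MSR_AP_RESET_HOLD_REQ	0x006	/* MSR-based AP hlt request */
  #define GHCB_MSR_AP_RESET_HOLD_RESP	0x007
  #define GHCB_MSR_HV_FT_REQ		0x080	/* Hypervisor Feature request */
  #define GHCB_MSR_HV_FT_RESP		0x081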
[ bp: Fold in https://lkml.kernel.org/r/[email protected] too.
Simplify the brewing macro maze into readability. ]
Co-developed-by: Tom Lendacky <[email protected]>
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
pass through newer vector instructions if vector support is enabled.
Reviewed-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
This enables the NNPA, BEAR enhancement, reset DAT protection and
processor activity counter facilities via the cpu model.
Reviewed-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
Using register asm statements has proven to be very error prone,
especially when using code instrumentation where gcc may add function
calls, which clobber register contents in unexpected ways.
Therefore get rid of register asm statements in kvm code, even though
there is currently nothing wrong with them. This way we know for sure
that this bug class won't be introduced here.
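For illustration (the "insn" mnemonic below is a stand-in, not a real
opcode), the conversion replaces the first form with the second, so that
fixed registers are set up and consumed entirely inside the asm statement:

  /* error prone: r2 must survive from the assignment to the asm
     statement, but instrumentation may insert clobbering calls */
  register unsigned long r2 asm("2") = addr;
  asm volatile("insn" : "+d" (r2));

  /* robust: the compiler picks registers for the C variables; the
     fixed register only lives inside the asm body */
  unsigned long ret;
  asm volatile(
          "	lgr	%%r2,%[addr]\n"
          "	insn\n"
          "	lgr	%[ret],%%r2\n"
          : [ret] "=d" (ret)
          : [addr] "d" (addr)
          : "2", "cc");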
Signed-off-by: Heiko Carstens <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: Thomas Huth <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: checkpatch strict fix]
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
While the disk state has nothing to do with partitions, BLKRRPART is used
to force a full revalidate after things like a disk format for historical
reasons. Restore that behavior.
Link: https://lore.kernel.org/r/[email protected]
Fixes: 471bd0af544b ("sd: use bdev_check_media_change")
Reported-by: Xiang Chen <[email protected]>
Tested-by: Xiang Chen <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
Irrespective of whether CONFIG_MODULE_SIG is configured, specifying
"module.sig_enforce=1" on the boot command line sets "sig_enforce".
Only allow "sig_enforce" to be set when CONFIG_MODULE_SIG is configured.
This patch makes the presence of /sys/module/module/parameters/sig_enforce
dependent on CONFIG_MODULE_SIG=y.
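A minimal sketch of the resulting guard (this mirrors the kernel/module.c
pattern, abridged):

  #ifdef CONFIG_MODULE_SIG
  static bool sig_enforce = IS_ENABLED(CONFIG_MODULE_SIG_FORCE);
  /* Only instantiated for CONFIG_MODULE_SIG=y, so the sysfs node
     /sys/module/module/parameters/sig_enforce vanishes otherwise. */
  module_param(sig_enforce, bool_enable_only, 0644);
  #endif /* CONFIG_MODULE_SIG */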
Fixes: fda784e50aac ("module: export module signature enforcement status")
Reported-by: Nayna Jain <[email protected]>
Tested-by: Mimi Zohar <[email protected]>
Tested-by: Jessica Yu <[email protected]>
Signed-off-by: Mimi Zohar <[email protected]>
Signed-off-by: Jessica Yu <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This got added 14 years ago in 324ae4df00fd ("Btrfs: Add block group
pinned accounting back") but it was never used. Subsequently its
usage got gradually removed in 8790d502e440 ("Btrfs: Add support for
mirroring across drives") and 11833d66be94 ("Btrfs: improve async block
group caching"). Let's remove it for good!
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When the call to platform_get_irq() to obtain the IRQ of the lcd fails, the
returned error code should be propagated. However, we currently do not
explicitly assign this error code to 'ret'. As a result, 0 is incorrectly
returned.
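The fix follows the usual platform_get_irq() pattern; a sketch with
illustrative local names:

  ret = platform_get_irq(pdev, 0);
  if (ret < 0)
          goto setup_fail;        /* propagate the negative errno */
  kmb->irq_lcd = ret;             /* assign only once we know it is valid */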
Fixes: 7f7b96a8a0a1 ("drm/kmb: Add support for KeemBay Display")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Anitha Chrisanthus <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
There were two places where we weren't checking for errors
(e.g. ERESTARTSYS) while waiting for rdma resolution.
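The pattern being fixed is, schematically (wait queue and condition names
are illustrative):

  rc = wait_event_interruptible(info->conn_wait, rdma_resolution_done(info));
  if (rc < 0)
          return rc;      /* e.g. -ERESTARTSYS: don't proceed as if resolved */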
Addresses-Coverity: 1462165 ("Unchecked return value")
Reviewed-by: Tom Talpey <[email protected]>
Reviewed-by: Long Li <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Revert commit 4514d991d992 ("PCI: PM: Do not read power state in
pci_enable_device_flags()") that is reported to cause PCI device
initialization issues on some systems.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=213481
Link: https://lore.kernel.org/linux-acpi/[email protected]
Reported-by: Michael <[email protected]>
Reported-by: Salvatore Bonaccorso <[email protected]>
Fixes: 4514d991d992 ("PCI: PM: Do not read power state in pci_enable_device_flags()")
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Simplify the code.
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Baokun Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently when an unstable clocksource is detected, the raw counters of
that clocksource and the watchdog will be printed, which can only be
understood after doing some math.
So print the delta in nanoseconds as well to make it easier for humans to
check the results.
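Schematically, the cycle deltas can be fed through the existing
clocksource_cyc2ns() helper before printing (a sketch, not the literal
patch):

  wd_nsec = clocksource_cyc2ns(delta_wd, watchdog->mult, watchdog->shift);
  cs_nsec = clocksource_cyc2ns(delta_cs, cs->mult, cs->shift);
  pr_warn("        '%s' wd_nsec: %lld cs_nsec: %lld\n",
          cs->name, wd_nsec, cs_nsec);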
[ paulmck: Fix typo. ]
Signed-off-by: Feng Tang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. It would be good
to have a way of testing the clocksource watchdog's ability to
distinguish between these two causes of clock skew and instability.
Therefore, provide a new clocksource-wdtest module selected by a new
TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module
parameter named "holdoff" that provides the number of seconds of delay
before testing should start, which defaults to zero when built as a module
and to 10 seconds when built directly into the kernel. Very large systems
that boot slowly may need to increase the value of this module parameter.
This module uses hand-crafted clocksource structures to do its testing,
thus avoiding messing up timing for the rest of the kernel and for user
applications. This module first verifies that the ->uncertainty_margin
fields of the clocksource structures are set sanely. It then tests the
delay-detection capability of the clocksource watchdog, increasing the
number of consecutive delays injected, first provoking console messages
complaining about the delays and finally forcing a clock-skew event.
Unexpected test results cause at least one WARN_ON_ONCE() console splat.
If there are no splats, the test has passed. Finally, it fuzzes the
value returned from a clocksource to test the clocksource watchdog's
ability to detect time skew.
This module checks the state of its clocksource after each test, and
uses WARN_ON_ONCE() to emit a console splat if there are any failures.
This should enable all types of test frameworks to detect any such
failures.
This facility is intended for diagnostic use only, and should be avoided
on production systems.
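The holdoff parameter plausibly reduces to something like:

  /* 10s of holdoff when built in (boot is still settling), 0 as a module */
  static int holdoff = IS_BUILTIN(CONFIG_TEST_CLOCKSOURCE_WATCHDOG) ? 10 : 0;
  module_param(holdoff, int, 0444);
  MODULE_PARM_DESC(holdoff, "Time to wait to start test (s).");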
Reported-by: Chris Mason <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Feng Tang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed
by more than 12.5% in order to be marked unstable. Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left. And given that there are now checks for false-positive
skews due to delays between reading the two clocks, it should be possible
to greatly decrease WATCHDOG_THRESHOLD, at least for fine-grained clocks
such as TSC.
Therefore, add a new uncertainty_margin field to the clocksource structure
that contains the maximum uncertainty in nanoseconds for the corresponding
clock. This field may be initialized manually, as it is for
clocksource_tsc_early and clocksource_jiffies, which is copied to
refined_jiffies. If the field is not initialized manually, it will be
computed at clock-registry time as the period of the clock in question
based on the scale and freq parameters to the
__clocksource_update_freq_scale() function. If either of those two
parameters is zero, the
tens-of-milliseconds WATCHDOG_THRESHOLD is used as a cowardly alternative
to dividing by zero. No matter how the uncertainty_margin field is
calculated, it is bounded below by twice WATCHDOG_MAX_SKEW, that is, by 100
microseconds.
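A sketch of that registration-time computation (the real code lives in
__clocksource_update_freq_scale(); details may differ):

  if (scale && freq && !cs->uncertainty_margin) {
          /* one period of the clock, in nanoseconds */
          cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
          if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
                  cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
  } else if (!cs->uncertainty_margin) {
          /* the cowardly alternative to dividing by zero */
          cs->uncertainty_margin = WATCHDOG_THRESHOLD;
  }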
Note that manually initialized uncertainty_margin fields are not adjusted,
but there is a WARN_ON_ONCE() that triggers if any such field is less than
twice WATCHDOG_MAX_SKEW. This WARN_ON_ONCE() is intended to discourage
production use of the one-nanosecond uncertainty_margin values that are
used to test the clock-skew code itself.
The actual clock-skew check uses the sum of the uncertainty_margin fields
of the two clocksource structures being compared. Integer overflow is
avoided because the largest computed value of the uncertainty_margin
fields is one billion (10^9), and double that value fits into an
unsigned int. However, if someone manually specifies (say) UINT_MAX,
they will get what they deserve.
Note that the refined_jiffies uncertainty_margin field is initialized to
TICK_NSEC, which means that skew checks involving this clocksource will
be sufficiently forgiving. In a similar vein, the clocksource_tsc_early
uncertainty_margin field is initialized to 32*NSEC_PER_MSEC, which
replicates the current behavior and allows custom setting if needed
in order to address the rare skews detected for this clocksource in
current mainline.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Feng Tang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU,
that clock is checked on all CPUs. This is thorough, but might not be
what you want on a system with a few tens of CPUs, let alone a few hundred
of them.
Therefore, by default check only up to eight randomly chosen CPUs. Also
provide a new clocksource.verify_n_cpus kernel boot parameter. A value of
-1 says to check all of the CPUs, and a non-negative value says to randomly
select that number of CPUs, without concern about selecting the same CPU
multiple times. However, make use of a cpumask so that a given CPU will be
checked at most once.
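A sketch of the selection, with the cpumask absorbing duplicate picks
(helper usage is illustrative):

  cpumask_clear(&cpus_chosen);
  for (i = 0; i < n; i++) {
          cpu = prandom_u32() % nr_cpu_ids;
          cpu = cpumask_next(cpu - 1, cpu_online_mask);
          if (cpu >= nr_cpu_ids)
                  cpu = cpumask_first(cpu_online_mask);
          /* a repeated pick just sets an already-set bit */
          cpumask_set_cpu(cpu, &cpus_chosen);
  }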
Suggested-by: Thomas Gleixner <[email protected]> # For verify_n_cpus=1.
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Feng Tang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly been
solved in the past ten years. Except that it is all too possible that the
problem has instead simply been made less likely, which might mean that
some of the occasional "Marking clocksource 'tsc' as unstable" messages
might be due to desynchronization. How would anyone know?
Therefore apply CPU-to-CPU synchronization checking to newly unstable
clocksources that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag.
Lists of desynchronized CPUs are printed, with the caveat that if it
is the reporting CPU that is itself desynchronized, it will appear that
all the other clocks are wrong. Just like in real life.
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Feng Tang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there are no shortage of things that can delay
interrupts-disabled regions of code ranging from SMI handlers to vCPU
preemption. It would be good to have some indication as to why the clock
was marked unstable.
Therefore, re-read the watchdog clock on either side of the read from the
clock under test. If the watchdog clock shows an excessive time delta
between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot parameter
clocksource.max_cswd_read_retries, which defaults to three, that is, up to
four reads, one initial and up to three retries. If more than one retry
was required, a message is printed on the console (the occasional single
retry is expected behavior, especially in guest OSes). If the maximum
number of retries is exceeded, the clock under test will be marked
unstable. However, the probability of this happening due to various sorts
of delays is quite small. In addition, the reason (clock-read delays) for
the unstable marking will be apparent.
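Schematically, the retried read looks like this (a sketch of the watchdog
read path, not the verbatim code):

  static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
  {
          unsigned int nretries;
          u64 wd_end, wd_delta;
          int64_t wd_delay;

          for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
                  local_irq_disable();
                  *wdnow = watchdog->read(watchdog);
                  *csnow = cs->read(cs);
                  wd_end = watchdog->read(watchdog);
                  local_irq_enable();

                  wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
                  wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
                                                watchdog->shift);
                  if (wd_delay <= WATCHDOG_MAX_SKEW) {
                          if (nretries > 1)
                                  pr_warn("timekeeping watchdog: '%s' took %u retries\n",
                                          cs->name, nretries);
                          return true;    /* reads were fast enough */
                  }
          }
          return false;   /* persistent read delays: mark unstable */
  }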
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Feng Tang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
If there is no matched result, check_redundant() will return BFS_RNOMATCH.
Signed-off-by: Xiongwei Song <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Boqun Feng <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The FUTEX_LOCK_PI futex operand has used a CLOCK_REALTIME based absolute
timeout since it was implemented, but it does not require the
FUTEX_CLOCK_REALTIME flag to be set, because that flag was introduced later.
In theory as none of the user space implementations can set the
FUTEX_CLOCK_REALTIME flag on this operand, it would be possible to
creatively abuse it and invert the meaning, i.e. select CLOCK_REALTIME
when not set and CLOCK_MONOTONIC when set. But that's nasty hackery.
Another option would be to have a new FUTEX_CLOCK_MONOTONIC flag only for
FUTEX_LOCK_PI, but that's also awkward because it does not allow libraries
to handle the timeout clock selection consistently.
So provide a new FUTEX_LOCK_PI2 operand which implements the timeout
semantics which the other operands use and leave FUTEX_LOCK_PI alone.
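From user space the new operand is used roughly like this (a sketch; glibc
has no wrapper, hence the raw syscall, and the fallback define matches the
value proposed by this series):

  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <time.h>
  #include <unistd.h>

  #ifndef FUTEX_LOCK_PI2
  #define FUTEX_LOCK_PI2	13
  #endif

  /* absolute CLOCK_MONOTONIC timeout: the default for FUTEX_LOCK_PI2,
     unlike FUTEX_LOCK_PI which is hardwired to CLOCK_REALTIME */
  static int futex_lock_pi2(unsigned int *uaddr, const struct timespec *abs_to)
  {
          return syscall(SYS_futex, uaddr, FUTEX_LOCK_PI2, 0, abs_to, NULL, 0);
  }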
Reported-by: Kurt Kanzenbach <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
futex_lock_pi() is the only futex operation which cannot select the clock
for timeouts (CLOCK_MONOTONIC/CLOCK_REALTIME). That's inconsistent and
there is no particular reason why this cannot be supported.
This was overlooked when the FUTEX_CLOCK_REALTIME flag was introduced and
unfortunately not reported when the inconsistency was discovered in glibc.
Prepare the function and enforce the FUTEX_CLOCK_REALTIME flag on FUTEX_LOCK_PI
so that a new FUTEX_LOCK_PI2 can implement it correctly.
Reported-by: Kurt Kanzenbach <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The problem is that rcu_callback_map doesn't have wait_types defined,
and doing so would make it indistinguishable from SOFTIRQ in any case.
Remove it.
Fixes: 9271a40d2a14 ("lockdep/selftest: Add wait context selftests")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Joerg Roedel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When PROVE_RAW_LOCK_NESTING=y many of the selftests FAILED because
HARDIRQ context is out-of-bounds for spinlocks. Instead make the
default hardware context the threaded hardirq context, which preserves
the old locking rules.
The wait-type specific locking selftests will have a non-threaded
HARDIRQ variant.
Fixes: de8f5e4f2dc1 ("lockdep: Introduce wait-type checks")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Joerg Roedel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Even the very first lock can violate the wait-context check; consider
the various IRQ contexts.
Fixes: de8f5e4f2dc1 ("lockdep: Introduce wait-type checks")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Joerg Roedel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Johannes Berg reported a lockdep problem which could be reproduced by
the special test case introduced in this patch, so add it.
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In step #3 of check_irq_usage(), we search backwards to find a lock
whose usage conflicts with the usage of @target_entry1 on safe/unsafe.
However, we should only take the irq-unsafe usage of @target_entry1 into
consideration, because it could be a case where a lock is hardirq-unsafe
but softirq-safe, and in check_irq_usage() we find it because its
hardirq-unsafe usage could result in a hardirq safe->unsafe deadlock.
But since we currently don't filter out the other usage bits, we may
find a lock dependency path softirq-unsafe -> softirq-safe, which in
fact doesn't cause a deadlock, and this may produce misleading lockdep
splats.
Fix this by only keeping LOCKF_ENABLED_IRQ_ALL bits when we try the
backwards search.
Reported-by: Johannes Berg <[email protected]>
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In print_bad_irq_dependency(), save_trace() is called to set the ->trace
for @prev_root to the current call trace. However, @prev_root corresponds
to the held lock, which may not have been acquired in the current call
trace, so it's wrong to use save_trace() to set the ->trace of @prev_root.
Moreover, with our adjustment of printing the backwards dependency path,
the ->trace of @prev_root is unnecessary, so remove it.
Reported-by: Johannes Berg <[email protected]>
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
We use the same code to print the backwards lock dependency path as the
forwards lock dependency path, and this could result in incorrect
printing, because for a backwards lock_list, ->trace is not the call
trace where the lock of ->class was acquired.
Fix this by introducing a separate function for printing the backwards
dependency path. Also add a few comments about the printing while we are
at it.
Reported-by: Johannes Berg <[email protected]>
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now that cpu.uclamp.min acts as a protection, we need to make sure that
the uclamp request of the task is within the allowed range of the cgroup,
that is, it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
tg->uclamp[UCLAMP_MAX].
As reported by Xuewen [1] we can have some corner cases where there's an
inversion between the uclamp requested by the task (p) and the uclamp
values of the taskgroup it's attached to (tg). The following table
demonstrates two corner cases:
| p | tg | effective
-----------+-----+------+-----------
CASE 1
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 60%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
CASE 2
-----------+-----+------+-----------
uclamp_min | 0% | 30% | 30%
-----------+-----+------+-----------
uclamp_max | 20% | 50% | 20%
-----------+-----+------+-----------
With this fix we get:
| p | tg | effective
-----------+-----+------+-----------
CASE 1
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 50%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
CASE 2
-----------+-----+------+-----------
uclamp_min | 0% | 30% | 30%
-----------+-----+------+-----------
uclamp_max | 20% | 50% | 30%
-----------+-----+------+-----------
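A sketch of the restriction that yields the tables above, shaped like the
kernel's uclamp_tg_restrict() (abridged; assumes CONFIG_UCLAMP_TASK_GROUP):

  static inline struct uclamp_se
  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
  {
          struct uclamp_se uc_req = p->uclamp_req[clamp_id];
          unsigned int tg_min = task_group(p)->uclamp[UCLAMP_MIN].value;
          unsigned int tg_max = task_group(p)->uclamp[UCLAMP_MAX].value;

          /* the task's request may never escape [tg_min, tg_max] */
          uclamp_se_set(&uc_req, clamp(uc_req.value, tg_min, tg_max), false);
          return uc_req;
  }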
Additionally uclamp_update_active_tasks() must now unconditionally
update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
instance could have an impact on the effective UCLAMP_MIN of the tasks.
| p | tg | effective
-----------+-----+------+-----------
old
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 50%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
*new*
-----------+-----+------+-----------
uclamp_min | 60% | 0% | *60%*
-----------+-----+------+-----------
uclamp_max | 80% |*70%* | *70%*
-----------+-----+------+-----------
[1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/
Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
Reported-by: Xuewen Yan <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
DL keeps track of the utilization on a per-rq basis with the structure
avg_dl. This utilization is updated during task_tick_dl(),
put_prev_task_dl() and set_next_task_dl(). However, when the current
running task changes its policy, set_next_task_dl(), which would usually
take care of updating the utilization when the rq starts running DL
tasks, will not see such a change, leaving the avg_dl structure outdated.
When that very same task is dequeued later, put_prev_task_dl() will then
update the utilization based on a wrong last_update_time, leading to a
huge spike in the DL utilization signal.
The signal would eventually recover from this issue after a few ms. Even
if no DL tasks are run, avg_dl is also updated in
__update_blocked_others(). But as the CPU capacity depends partly on the
avg_dl, this issue has nonetheless a significant impact on the scheduler.
Fix this issue by ensuring a load update when a running task changes
its policy to DL.
Fixes: 3727e0e ("sched/dl: Add dl_rq utilization tracking")
Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
RT keeps track of the utilization on a per-rq basis with the structure
avg_rt. This utilization is updated during task_tick_rt(),
put_prev_task_rt() and set_next_task_rt(). However, when the current
running task changes its policy, set_next_task_rt(), which would usually
take care of updating the utilization when the rq starts running RT tasks,
will not see such a change, leaving the avg_rt structure outdated. When
that very same task is dequeued later, put_prev_task_rt() will then
update the utilization based on a wrong last_update_time, leading to a
huge spike in the RT utilization signal.
The signal would eventually recover from this issue after a few ms. Even if
no RT tasks are run, avg_rt is also updated in __update_blocked_others().
But as the CPU capacity depends partly on the avg_rt, this issue has
nonetheless a significant impact on the scheduler.
Fix this issue by ensuring a load update when a running task changes
its policy to RT.
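A sketch of the fix in the RT class's switched_to callback (the DL patch
above is analogous; the queued-task handling is elided):

  static void switched_to_rt(struct rq *rq, struct task_struct *p)
  {
          /*
           * If the task is already running, refresh avg_rt now so that
           * last_update_time covers the policy change; otherwise the
           * eventual put_prev_task_rt() computes a bogus, huge delta.
           */
          if (task_current(rq, p))
                  update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);

          /* ... existing enqueue/preemption handling ... */
  }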
Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Add the missing documentation for the @cpu parameter of
tick_cleanup_dead_cpu().
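The added kernel-doc presumably takes the standard form:

  /**
   * tick_cleanup_dead_cpu - Cleanup the tick-related state of a dead cpu
   * @cpu: The dead CPU whose per-cpu tick state is torn down
   */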
Signed-off-by: Baokun Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
LLVM does not emit optimal byteswap assembly, which results in high
stack usage in kvmhv_enter_nested_guest() due to the inlining of
byteswap_pt_regs(). With LLVM 12.0.0:
arch/powerpc/kvm/book3s_hv_nested.c:289:6: error: stack frame size of
2512 bytes in function 'kvmhv_enter_nested_guest' [-Werror,-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
^
1 error generated.
While this gets fixed in LLVM, mark byteswap_pt_regs() as
noinline_for_stack so that it does not get inlined and break the build
due to -Werror by default in arch/powerpc/. Not inlining saves
approximately 800 bytes with LLVM 12.0.0:
arch/powerpc/kvm/book3s_hv_nested.c:290:6: warning: stack frame size of
1728 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
^
1 warning generated.
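For reference, the helper is a simple word-wise loop over pt_regs, and the
change amounts to the added annotation (sketched from the upstream code;
details abridged):

  static noinline_for_stack void byteswap_pt_regs(struct pt_regs *regs)
  {
          unsigned long *addr = (unsigned long *)regs;

          /* swab64() every word; LLVM inlines this poorly and blows up
             the caller's stack frame, hence noinline_for_stack */
          for (; addr < ((unsigned long *)(regs + 1)); addr++)
                  *addr = swab64(*addr);
  }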
Cc: [email protected] # v4.20+
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://github.com/ClangBuiltLinux/linux/issues/1292
Link: https://bugs.llvm.org/show_bug.cgi?id=49610
Link: https://lore.kernel.org/r/[email protected]/
Link: https://gist.github.com/ba710e3703bf45043a31e2806c843ffd
Link: https://lore.kernel.org/r/[email protected]
|
|
KVM/arm64 support for MTE, courtesy of Steven Price.
It allows the guest to use memory tagging, and offers
a new userspace API to save/restore the tags.
* kvm-arm64/mmu/mte:
KVM: arm64: Document MTE capability and ioctl
KVM: arm64: Add ioctl to fetch/store tags in a guest
KVM: arm64: Expose KVM_ARM_CAP_MTE
KVM: arm64: Save/restore MTE registers
KVM: arm64: Introduce MTE VM feature
arm64: mte: Sync tags for pages where PTE is untagged
Signed-off-by: Marc Zyngier <[email protected]>
|
|
syzbot reported a memory leak related to sigqueue caching.
The assumption that a task cannot cache a sigqueue after the signal handler
has been dropped and exit_task_sigqueue_cache() has been invoked turns out
to be wrong.
Such a task can still invoke release_task(other_task), which cleans up the
signals of 'other_task' and ends up in sigqueue_cache_or_free(), which in
turn will cache the signal because task->sigqueue_cache is NULL. That's
obviously bogus because nothing will free the cached signal of that task
anymore, so the cached item is leaked.
This happens when e.g. the last non-leader thread exits and reaps the
zombie leader.
Prevent this by setting tsk::sigqueue_cache to an error pointer value in
exit_task_sigqueue_cache() which forces any subsequent invocation of
sigqueue_cache_or_free() from that task to hand the sigqueue back to the
kmemcache.
Add comments to all relevant places.
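A sketch of the poisoning (the exact error-pointer value and helper shapes
are illustrative):

  void exit_task_sigqueue_cache(struct task_struct *tsk)
  {
          struct sigqueue *q = tsk->sigqueue_cache;

          if (q)
                  kmem_cache_free(sigqueue_cachep, q);
          /*
           * Poison the slot: sigqueue_cache_or_free() only caches into a
           * NULL slot, so a later release_task() issued by this exiting
           * task frees the sigqueue instead of leaking it into the cache.
           */
          tsk->sigqueue_cache = ERR_PTR(-1);
  }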
Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct")
Reported-by: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
H_RPT_INVALIDATE if available. The availability of this hcall
is determined from the "hcall-rpt-invalidate" string in the
ibm,hypertas-functions DT property.
Signed-off-by: Bharata B Rao <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now that we have H_RPT_INVALIDATE fully implemented, enable support for
it via the new KVM_CAP_PPC_RPT_INVALIDATE KVM capability.
Signed-off-by: Bharata B Rao <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Enable support for process-scoped invalidations from nested
guests and partition-scoped invalidations for nested guests.
Process-scoped invalidations for any level of nested guests
are handled by implementing H_RPT_INVALIDATE handler in the
nested guest exit path in L0.
Partition-scoped invalidation requests are forwarded to the
right nested guest, handled there and passed down to L0
for eventual handling.
Signed-off-by: Bharata B Rao <[email protected]>
[aneesh: Nested guest partition-scoped invalidation changes]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
[mpe: Squash in fixup patch]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
We actually need to wait for the moving fence after pinning
the BO to make sure that the pin is completed.
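The shape of the fix, sketched with the amdgpu names used in this series
(the sibling patches below are analogous):

  r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT);
  if (r)
          return r;

  /* the pin only really holds once any pending move has finished */
  if (bo->tbo.moving) {
          r = dma_fence_wait(bo->tbo.moving, true);
          if (r) {
                  amdgpu_bo_unpin(bo);
                  return r;
          }
  }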
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
References: https://lore.kernel.org/dri-devel/[email protected]/
CC: [email protected]
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
We actually need to wait for the moving fence after pinning
the BO to make sure that the pin is completed.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
References: https://lore.kernel.org/dri-devel/[email protected]/
CC: [email protected]
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
We actually need to wait for the moving fence after pinning
the BO to make sure that the pin is completed.
v2: grab the lock while waiting
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
References: https://lore.kernel.org/dri-devel/[email protected]/
CC: [email protected]
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.
A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The VMM may not wish to have its own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).
Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.
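From the VMM side, usage is expected to look roughly like this (struct and
flag names as introduced by this series; the buffer and address variables
are illustrative):

  struct kvm_arm_copy_mte_tags copy = {
          .guest_ipa = gpa,       /* page-aligned guest address */
          .length    = len,       /* must be a multiple of the page size */
          .addr      = tag_buf,   /* userspace buffer for the tags */
          .flags     = KVM_ARM_TAGS_FROM_GUEST,   /* read tags, e.g. to migrate */
  };

  ret = ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy);
  /* ret < 0 on error; writing tags back uses KVM_ARM_TAGS_TO_GUEST */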
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.
Actually exposing the new capability to user space happens in a later
patch.
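Once exposed, the VMM enables the feature per-VM, before any vCPUs are
created, along the lines of:

  struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_MTE };

  /* VM-scoped capability; must be enabled before vCPU creation */
  ret = ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  /* fails if MTE is unsupported or vCPUs already exist */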
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
[maz: move VM_SHARED sampling into the critical section]
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
We used this in may_commit_transaction() in order to determine if we
needed to commit the transaction. However we no longer have that logic
and thus have no use of this counter anymore, so delete it.
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
This was a trick implemented to handle the case where we had a giant
reservation in front of a bunch of little reservations in the ticket
queue. If the giant reservation was too large for the transaction
commit to make a difference we'd ENOSPC everybody out instead of
committing the transaction. This logic was put in to force us to go
back and re-try the transaction commit logic to see if we could make
progress.
Instead now we know we've committed the transaction, so any space that
would have been recovered is now available, and would be caught by the
btrfs_try_granting_tickets() in this loop, so we no longer need this
code and can simply delete it.
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Since we unconditionally commit the transaction now we no longer need to
run the delayed refs to make sure our total_bytes_pinned value is
up to date; we can simply commit the transaction. Remove this stage from
the data flushing list.
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
may_commit_transaction was introduced before the ticketing
infrastructure existed. There was a problem where we'd legitimately be
out of space, but every reservation would trigger a transaction commit
and then fail. Thus if you had 1000 things trying to make a
reservation, they'd all do the flushing loop and thus commit the
transaction 1000 times before they'd get their ENOSPC.
This helper was introduced to short circuit this, if there wasn't space
that could be reclaimed by committing the transaction then simply ENOSPC
out. This made true ENOSPC tests much faster as we didn't waste a bunch
of time.
However many of our bugs over the years have been from cases where we
didn't account for some space that would be reclaimed by committing a
transaction. The delayed refs rsv space, delayed rsv, many pinned bytes
miscalculations, etc. And in the meantime the original problem has been
solved with ticketing. We no longer will commit the transaction 1000
times. Instead we'll get 1000 waiters, we will go through the flushing
mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
out. The ticketing infrastructure gives us a deterministic way to see
if we're making progress or not, thus we avoid a lot of extra work.
So simplify this step by simply unconditionally committing the
transaction. This removes what is arguably our most common source of
early ENOSPC bugs and will allow us to drastically simplify many of the
things we track because we simply won't need them with this stuff gone.
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When doing a send we don't expect the task to ever start a transaction
after the initial check that verifies if commit roots match the regular
roots. This is because after that we set current->journal_info with a
stub (special value) that signals we are in send context, so that we take
a read lock on an extent buffer when reading it from disk and verifying
it is valid (its generation matches the generation stored in the parent).
This stub was introduced in 2014 by commit a26e8c9f75b0bf ("Btrfs: don't
clear uptodate if the eb is under IO") in order to fix a concurrency issue
between send and balance.
However there is one particular exception where we end up needing to start
a transaction and when this happens it results in a crash with a stack
trace like the following:
[60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
[60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
[60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
[60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
[60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
[60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
[60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
[60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
[60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
[60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
[60015.902408] kernel: FS: 00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
[60015.902410] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
[60015.902414] kernel: Call Trace:
[60015.902419] kernel: kfence_unprotect+0x13/0x30
[60015.902423] kernel: page_fault_oops+0x89/0x270
[60015.902427] kernel: ? search_module_extables+0xf/0x40
[60015.902431] kernel: ? search_bpf_extables+0x57/0x70
[60015.902435] kernel: kernelmode_fixup_or_oops+0xd6/0xf0
[60015.902437] kernel: __bad_area_nosemaphore+0x142/0x180
[60015.902440] kernel: exc_page_fault+0x67/0x150
[60015.902445] kernel: asm_exc_page_fault+0x1e/0x30
[60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
[60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
[60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
[60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
[60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
[60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
[60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
[60015.902468] kernel: btrfs_commit_inode_delayed_inode+0x5d/0x120
[60015.902473] kernel: btrfs_evict_inode+0x2c5/0x3f0
[60015.902476] kernel: evict+0xd1/0x180
[60015.902480] kernel: inode_lru_isolate+0xe7/0x180
[60015.902483] kernel: __list_lru_walk_one+0x77/0x150
[60015.902487] kernel: ? iput+0x1a0/0x1a0
[60015.902489] kernel: ? iput+0x1a0/0x1a0
[60015.902491] kernel: list_lru_walk_one+0x47/0x70
[60015.902495] kernel: prune_icache_sb+0x39/0x50
[60015.902497] kernel: super_cache_scan+0x161/0x1f0
[60015.902501] kernel: do_shrink_slab+0x142/0x240
[60015.902505] kernel: shrink_slab+0x164/0x280
[60015.902509] kernel: shrink_node+0x2c8/0x6e0
[60015.902512] kernel: do_try_to_free_pages+0xcb/0x4b0
[60015.902514] kernel: try_to_free_pages+0xda/0x190
[60015.902516] kernel: __alloc_pages_slowpath.constprop.0+0x373/0xcc0
[60015.902521] kernel: ? __memcg_kmem_charge_page+0xc2/0x1e0
[60015.902525] kernel: __alloc_pages_nodemask+0x30a/0x340
[60015.902528] kernel: pipe_write+0x30b/0x5c0
[60015.902531] kernel: ? set_next_entity+0xad/0x1e0
[60015.902534] kernel: ? switch_mm_irqs_off+0x58/0x440
[60015.902538] kernel: __kernel_write+0x13a/0x2b0
[60015.902541] kernel: kernel_write+0x73/0x150
[60015.902543] kernel: send_cmd+0x7b/0xd0
[60015.902545] kernel: send_extent_data+0x5a3/0x6b0
[60015.902549] kernel: process_extent+0x19b/0xed0
[60015.902551] kernel: btrfs_ioctl_send+0x1434/0x17e0
[60015.902554] kernel: ? _btrfs_ioctl_send+0xe1/0x100
[60015.902557] kernel: _btrfs_ioctl_send+0xbf/0x100
[60015.902559] kernel: ? enqueue_entity+0x18c/0x7b0
[60015.902562] kernel: btrfs_ioctl+0x185f/0x2f80
[60015.902564] kernel: ? psi_task_change+0x84/0xc0
[60015.902569] kernel: ? _flat_send_IPI_mask+0x21/0x40
[60015.902572] kernel: ? check_preempt_curr+0x2f/0x70
[60015.902576] kernel: ? selinux_file_ioctl+0x137/0x1e0
[60015.902579] kernel: ? expand_files+0x1cb/0x1d0
[60015.902582] kernel: ? __x64_sys_ioctl+0x82/0xb0
[60015.902585] kernel: __x64_sys_ioctl+0x82/0xb0
[60015.902588] kernel: do_syscall_64+0x33/0x40
[60015.902591] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
[60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
[60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
[60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
[60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
[60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
[60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
[60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
[60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---
This happens because when writing to the pipe, by calling kernel_write(),
we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
gfp flags, which allow reclaim to happen if there is memory pressure. This
allocation happens at fs/pipe.c:pipe_write().
If the reclaim is triggered, inode eviction can be triggered and that in
turn can result in starting a transaction if the inode has a link count
of 0. The transaction start happens early on during eviction, when we call
btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
there is currently an open file descriptor for an inode with a link count
of 0 and the reclaim task gets a reference on the inode before that
descriptor is closed, in which case the reclaim task ends up doing the
final iput that triggers the inode eviction.
When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
the following assertion at transaction.c:start_transaction():
/* Send isn't supposed to start transactions. */
ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
And when assertions are not enabled, it triggers a crash since after that
assertion we cast current->journal_info into a transaction handle pointer
and then dereference it:
if (current->journal_info) {
WARN_ON(type & TRANS_EXTWRITERS);
h = current->journal_info;
refcount_inc(&h->use_count);
(...)
Which obviously results in a crash due to an invalid memory access.
The same type of issue can happen during other memory allocations we
do directly in the send code with kmalloc (and friends) as they use
GFP_KERNEL and therefore may trigger reclaim too; this has been
possible since 2016, after commit e780b0d1c1523e ("btrfs: send: use
GFP_KERNEL everywhere").
The issue could be solved by setting up a NOFS context for the entire
send operation so that reclaim could not be triggered when allocating
memory or pages through kernel_write(). However that is not very friendly
and we can in fact get rid of the send stub because:
1) The stub was introduced way back in 2014 by commit a26e8c9f75b0bf
("Btrfs: don't clear uptodate if the eb is under IO") to solve an
issue exclusive to when send and balance are running in parallel,
however there were other problems between balance and send and we do
not allow anymore to have balance and send run concurrently since
commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due
to concurrent relocation"). More generically the issues are between
send and relocation, and that last commit eliminated only the
possibility of having send and balance run concurrently, but shrinking
a device also can trigger relocation, and on zoned filesystems we have
relocation of partially used block groups triggered automatically as
well. The previous patch, with the subject:
"btrfs: ensure relocation never runs while we have send operations running"
addresses all the remaining cases that can trigger relocation.
2) We can actually allow starting and even committing transactions while
in a send context if needed because send is not holding any locks that
would block the start or the commit of a transaction.
So get rid of all the logic added by commit a26e8c9f75b0bf ("Btrfs: don't
clear uptodate if the eb is under IO"). We can now always call
clear_extent_buffer_uptodate() at verify_parent_transid() since send is
the only case that uses commit roots without having a transaction open or
without holding the commit_root_sem.
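For contrast, the NOFS scoping dismissed above would have looked roughly
like this around the whole send operation (memalloc_nofs_save()/restore()
are the standard scoped-allocation API; the send call is a placeholder):

  unsigned int nofs_flags = memalloc_nofs_save();

  /*
   * Inside the scope, GFP_KERNEL allocations (including those done by
   * kernel_write() -> pipe_write()) implicitly behave as GFP_NOFS, so
   * direct reclaim cannot re-enter filesystem code such as evict().
   */
  ret = do_send(ctx);             /* illustrative placeholder */

  memalloc_nofs_restore(nofs_flags);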
Reported-by: Chris Murphy <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|