2021-06-23  x86/sev: Add defines for GHCB version 2 MSR protocol requests  (Brijesh Singh)  1 file, -1/+15
Add the necessary defines for supporting the GHCB version 2 protocol. This includes defines for: - MSR-based AP hlt request/response - Hypervisor Feature request/response This is the bare minimum of requests that need to be supported by a GHCB version 2 implementation. There are more requests in the specification, but those depend on Secure Nested Paging support being available. These defines are shared between SEV host and guest support. [ bp: Fold in https://lkml.kernel.org/r/[email protected] too. Simplify the brewing macro maze into readability. ] Co-developed-by: Tom Lendacky <[email protected]> Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
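As a rough illustration only (the macro names below are hypothetical, not the defines added by this patch): GHCB MSR protocol exchanges place a request/response code in the low 12 bits of the GHCB MSR and the request payload in the bits above it, so a request value might be composed like this:

    /* hypothetical sketch of the GHCB MSR protocol encoding; the real
     * request codes and field widths come from the GHCB specification */
    #define GHCB_MSR_INFO_MASK      ((1ULL << 12) - 1)      /* bits 11:0  */
    #define GHCB_MSR_DATA_SHIFT     12                      /* bits 63:12 */

    static inline u64 ghcb_msr_request(u64 code, u64 data)
    {
            return (data << GHCB_MSR_DATA_SHIFT) | (code & GHCB_MSR_INFO_MASK);
    }

    static inline u64 ghcb_msr_response_code(u64 msr_val)
    {
            return msr_val & GHCB_MSR_INFO_MASK;
    }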
2021-06-23  KVM: s390: allow facility 192 (vector-packed-decimal-enhancement facility 2)  (Christian Borntraeger)  1 file, -0/+4
Pass through newer vector instructions if vector support is enabled. Reviewed-by: Claudio Imbrenda <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Acked-by: Cornelia Huck <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2021-06-23  KVM: s390: gen_facilities: allow facilities 165, 193, 194 and 196  (Christian Borntraeger)  1 file, -0/+4
This enables the NNPA, BEAR enhancement, reset DAT protection and processor activity counter facilities via the CPU model. Reviewed-by: Claudio Imbrenda <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Acked-by: Cornelia Huck <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2021-06-23  KVM: s390: get rid of register asm usage  (Heiko Carstens)  1 file, -9/+9
Using register asm statements has been proven to be very error prone, especially when using code instrumentation where gcc may add function calls, which clobbers register contents in an unexpected way. Therefore get rid of register asm statements in kvm code, even though there is currently nothing wrong with them. This way we know for sure that this bug class won't be introduced here. Signed-off-by: Heiko Carstens <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Reviewed-by: Thomas Huth <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Link: https://lore.kernel.org/r/[email protected] [[email protected]: checkpatch strict fix] Signed-off-by: Christian Borntraeger <[email protected]>
2021-06-22  scsi: sd: Call sd_revalidate_disk() for ioctl(BLKRRPART)  (Christoph Hellwig)  1 file, -4/+18
While the disk state has nothing to do with partitions, BLKRRPART is used to force a full revalidate after things like a disk format for historical reasons. Restore that behavior. Link: https://lore.kernel.org/r/[email protected] Fixes: 471bd0af544b ("sd: use bdev_check_media_change") Reported-by: Xiang Chen <[email protected]> Tested-by: Xiang Chen <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2021-06-22  module: limit enabling module.sig_enforce  (Mimi Zohar)  1 file, -5/+9
Irrespective as to whether CONFIG_MODULE_SIG is configured, specifying "module.sig_enforce=1" on the boot command line sets "sig_enforce". Only allow "sig_enforce" to be set when CONFIG_MODULE_SIG is configured. This patch makes the presence of /sys/module/module/parameters/sig_enforce dependent on CONFIG_MODULE_SIG=y. Fixes: fda784e50aac ("module: export module signature enforcement status") Reported-by: Nayna Jain <[email protected]> Tested-by: Mimi Zohar <[email protected]> Tested-by: Jessica Yu <[email protected]> Signed-off-by: Mimi Zohar <[email protected]> Signed-off-by: Jessica Yu <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
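A minimal sketch of the gating described above, assuming the existing sig_enforce parameter (not the literal patch):

    #ifdef CONFIG_MODULE_SIG
    static bool sig_enforce = IS_ENABLED(CONFIG_MODULE_SIG_FORCE);
    /* bool_enable_only: can be flipped on at runtime, never back off */
    module_param(sig_enforce, bool_enable_only, 0644);
    #endif

With the parameter registration inside the #ifdef, /sys/module/module/parameters/sig_enforce only exists on CONFIG_MODULE_SIG=y kernels and "module.sig_enforce=1" has no effect otherwise.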
2021-06-22  btrfs: remove unused btrfs_fs_info::total_pinned  (Nikolay Borisov)  1 file, -2/+0
This got added 14 years ago in 324ae4df00fd ("Btrfs: Add block group pinned accounting back") but it was not ever used. Subsequently its usage got gradually removed in 8790d502e440 ("Btrfs: Add support for mirroring across drives") and 11833d66be94 ("Btrfs: improve async block group caching"). Let's remove it for good! Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-22  drm/kmb: Fix error return code in kmb_hw_init()  (Zhen Lei)  1 file, -0/+1
When the call to platform_get_irq() to obtain the IRQ of the LCD fails, the returned error code should be propagated. However, it is not explicitly assigned to 'ret', so 0 is incorrectly returned instead. Fixes: 7f7b96a8a0a1 ("drm/kmb: Add support for KeemBay Display") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Zhen Lei <[email protected]> Signed-off-by: Anitha Chrisanthus <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
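A sketch of the bug class being fixed (the variable and label names are illustrative, patterned on the driver): the error from platform_get_irq() has to end up in the value the function actually returns.

    irq_lcd = platform_get_irq(pdev, 0);
    if (irq_lcd < 0) {
            drm_err(&kmb->drm, "irq_lcd not found");
            ret = irq_lcd;          /* previously missing, so 0 was returned */
            goto setup_fail;
    }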
2021-06-22  smbdirect: missing rc checks while waiting for rdma events  (Steve French)  1 file, -2/+12
There were two places where we weren't checking for error (e.g. ERESTARTSYS) while waiting for rdma resolution. Addresses-Coverity: 1462165 ("Unchecked return value") Reviewed-by: Tom Talpey <[email protected]> Reviewed-by: Long Li <[email protected]> Signed-off-by: Steve French <[email protected]>
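A generic sketch of the pattern being fixed (names are illustrative, not the smbdirect internals): the return value of the interruptible wait has to be inspected rather than discarded.

    static int wait_for_rdma_event(wait_queue_head_t *wq, bool *done)
    {
            int rc;

            rc = wait_event_interruptible_timeout(*wq, *done,
                                                  msecs_to_jiffies(5000));
            if (rc < 0)             /* interrupted, e.g. -ERESTARTSYS */
                    return rc;
            if (rc == 0)            /* timed out waiting for the RDMA event */
                    return -ETIMEDOUT;
            return 0;               /* condition became true */
    }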
2021-06-22  Revert "PCI: PM: Do not read power state in pci_enable_device_flags()"  (Rafael J. Wysocki)  1 file, -3/+13
Revert commit 4514d991d992 ("PCI: PM: Do not read power state in pci_enable_device_flags()") that is reported to cause PCI device initialization issues on some systems. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=213481 Link: https://lore.kernel.org/linux-acpi/[email protected] Reported-by: Michael <[email protected]> Reported-by: Salvatore Bonaccorso <[email protected]> Fixes: 4514d991d992 ("PCI: PM: Do not read power state in pci_enable_device_flags()") Signed-off-by: Rafael J. Wysocki <[email protected]>
2021-06-22  clockevents: Use list_move() instead of list_del()/list_add()  (Baokun Li)  1 file, -4/+2
Simplify the code. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Baokun Li <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
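The transformation is the standard one below; clockevent_devices is the existing list in kernel/time/clockevents.c, the surrounding context is illustrative.

    /* before: open-coded move */
    list_del(&dev->list);
    list_add(&dev->list, &clockevent_devices);

    /* after: same effect in a single call */
    list_move(&dev->list, &clockevent_devices);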
2021-06-22  clocksource: Print deviation in nanoseconds when a clocksource becomes unstable  (Feng Tang)  1 file, -4/+4
Currently when an unstable clocksource is detected, the raw counters of that clocksource and watchdog will be printed, which can only be understood after some math calculation. So print the delta in nanoseconds as well to make it easier for humans to check the results. [ paulmck: Fix typo. ] Signed-off-by: Feng Tang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  clocksource: Provide kernel module to test clocksource watchdog  (Paul E. McKenney)  6 files, -2/+228
When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. It would be good to have a way of testing the clocksource watchdog's ability to distinguish between these two causes of clock skew and instability. Therefore, provide a new clocksource-wdtest module selected by a new TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module parameter named "holdoff" that provides the number of seconds of delay before testing should start, which defaults to zero when built as a module and to 10 seconds when built directly into the kernel. Very large systems that boot slowly may need to increase the value of this module parameter. This module uses hand-crafted clocksource structures to do its testing, thus avoiding messing up timing for the rest of the kernel and for user applications. This module first verifies that the ->uncertainty_margin fields of the clocksource structures are set sanely. It then tests the delay-detection capability of the clocksource watchdog, increasing the number of consecutive delays injected, first provoking console messages complaining about the delays and finally forcing a clock-skew event. Unexpected test results cause at least one WARN_ON_ONCE() console splat. If there are no splats, the test has passed. Finally, it fuzzes the value returned from a clocksource to test the clocksource watchdog's ability to detect time skew. This module checks the state of its clocksource after each test, and uses WARN_ON_ONCE() to emit a console splat if there are any failures. This should enable all types of test frameworks to detect any such failures. This facility is intended for diagnostic use only, and should be avoided on production systems. Reported-by: Chris Mason <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Feng Tang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  clocksource: Reduce clocksource-skew threshold  (Paul E. McKenney)  4 files, -17/+50
Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed by more than 12.5% in order to be marked unstable. Except that a clock that is skewed by that much is probably destroying unsuspecting software right and left. And given that there are now checks for false-positive skews due to delays between reading the two clocks, it should be possible to greatly decrease WATCHDOG_THRESHOLD, at least for fine-grained clocks such as TSC. Therefore, add a new uncertainty_margin field to the clocksource structure that contains the maximum uncertainty in nanoseconds for the corresponding clock. This field may be initialized manually, as it is for clocksource_tsc_early and clocksource_jiffies, which is copied to refined_jiffies. If the field is not initialized manually, it will be computed at clock-registry time as the period of the clock in question based on the scale and freq parameters to the __clocksource_update_freq_scale() function. If either of those two parameters is zero, the tens-of-milliseconds WATCHDOG_THRESHOLD is used as a cowardly alternative to dividing by zero. No matter how the uncertainty_margin field is calculated, it is bounded below by twice WATCHDOG_MAX_SKEW, that is, by 100 microseconds. Note that manually initialized uncertainty_margin fields are not adjusted, but there is a WARN_ON_ONCE() that triggers if any such field is less than twice WATCHDOG_MAX_SKEW. This WARN_ON_ONCE() is intended to discourage production use of the one-nanosecond uncertainty_margin values that are used to test the clock-skew code itself. The actual clock-skew check uses the sum of the uncertainty_margin fields of the two clocksource structures being compared. Integer overflow is avoided because the largest computed value of the uncertainty_margin fields is one billion (10^9), and double that value fits into an unsigned int. However, if someone manually specifies (say) UINT_MAX, they will get what they deserve. Note that the refined_jiffies uncertainty_margin field is initialized to TICK_NSEC, which means that skew checks involving this clocksource will be sufficiently forgiving. In a similar vein, the clocksource_tsc_early uncertainty_margin field is initialized to 32*NSEC_PER_MSEC, which replicates the current behavior and allows custom setting if needed in order to address the rare skews detected for this clocksource in current mainline. Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Feng Tang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
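Roughly, the registration-time calculation described above could look like the sketch below (a simplification of what __clocksource_update_freq_scale() would do, not the exact code):

    if (!cs->uncertainty_margin) {                  /* not set manually */
            if (scale && freq)                      /* one clock period, in ns */
                    cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
            else                                    /* cowardly fallback */
                    cs->uncertainty_margin = WATCHDOG_THRESHOLD;
            /* computed margins are bounded below by 100 microseconds */
            if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
                    cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
    } else {
            /* manual margins are not adjusted, only sanity-checked */
            WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
    }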
2021-06-22  clocksource: Limit number of CPUs checked for clock synchronization  (Paul E. McKenney)  2 files, -2/+82
Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU, that clock is checked on all CPUs. This is thorough, but might not be what you want on a system with a few tens of CPUs, let alone a few hundred of them. Therefore, by default check only up to eight randomly chosen CPUs. Also provide a new clocksource.verify_n_cpus kernel boot parameter. A value of -1 says to check all of the CPUs, and a non-negative value says to randomly select that number of CPUs, without concern about selecting the same CPU multiple times. However, make use of a cpumask so that a given CPU will be checked at most once. Suggested-by: Thomas Gleixner <[email protected]> # For verify_n_cpus=1. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Feng Tang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  clocksource: Check per-CPU clock synchronization when marked unstable  (Paul E. McKenney)  3 files, -2/+63
Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedly been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know? Therefore apply CPU-to-CPU synchronization checking to newly unstable clocksources that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life. Reported-by: Chris Mason <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Feng Tang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
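How a per-CPU clocksource would opt in, sketched against the x86 TSC clocksource (the field values are illustrative of mainline, not taken from this patch):

    static struct clocksource clocksource_tsc = {
            .name   = "tsc",
            .rating = 300,
            .read   = read_tsc,
            .mask   = CLOCKSOURCE_MASK(64),
            .flags  = CLOCK_SOURCE_IS_CONTINUOUS |
                      CLOCK_SOURCE_VALID_FOR_HRES |
                      CLOCK_SOURCE_VERIFY_PERCPU,   /* new: cross-check CPUs on skew */
    };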
2021-06-22  clocksource: Retry clock read if long delays detected  (Paul E. McKenney)  2 files, -6/+53
When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there is no shortage of things that can delay interrupts-disabled regions of code, ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. Therefore, re-read the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_cswd_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If more than one retry was required, a message is printed on the console (the occasional single retry is expected behavior, especially in guest OSes). If the maximum number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Reported-by: Chris Mason <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Feng Tang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
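A simplified sketch of the retry logic described above (clocksource_delta() and clocksource_cyc2ns() are existing helpers; the rest is illustrative, not the exact watchdog code):

    static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
    {
            unsigned int nretries;
            u64 wd_end, wd_delta, wd_delay;

            for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
                    local_irq_disable();
                    *wdnow = watchdog->read(watchdog);
                    *csnow = cs->read(cs);
                    wd_end = watchdog->read(watchdog);
                    local_irq_enable();

                    wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
                    wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
                                                  watchdog->shift);
                    if (wd_delay <= WATCHDOG_MAX_SKEW)
                            return true;    /* watchdog read pair was fast enough */
                    /* otherwise the reads were delayed; retry */
            }
            return false;   /* persistent delays: caller marks 'cs' unstable */
    }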
2021-06-22  locking/lockdep: Correct the description error for check_redundant()  (Xiongwei Song)  1 file, -1/+1
If there is no matched result, check_redundant() will return BFS_RNOMATCH. Signed-off-by: Xiongwei Song <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Boqun Feng <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-06-22  futex: Provide FUTEX_LOCK_PI2 to support clock selection  (Thomas Gleixner)  2 files, -1/+8
The FUTEX_LOCK_PI futex operation has used a CLOCK_REALTIME based absolute timeout since it was implemented, but it does not require that the FUTEX_CLOCK_REALTIME flag is set, because that flag was introduced later. In theory, as none of the user space implementations can set the FUTEX_CLOCK_REALTIME flag on this operation, it would be possible to creatively abuse it and invert the meaning, i.e. select CLOCK_REALTIME when the flag is not set and CLOCK_MONOTONIC when it is set. But that's nasty hackery. Another option would be to have a new FUTEX_CLOCK_MONOTONIC flag only for FUTEX_LOCK_PI, but that's also awkward because it does not allow libraries to handle the timeout clock selection consistently. So provide a new FUTEX_LOCK_PI2 operation which implements the timeout semantics that the other operations use and leave FUTEX_LOCK_PI alone. Reported-by: Kurt Kanzenbach <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
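From user space the new operation would be used roughly as below (a hedged sketch; FUTEX_LOCK_PI2 honours FUTEX_CLOCK_REALTIME, while plain FUTEX_LOCK_PI keeps its historical CLOCK_REALTIME behaviour):

    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #ifndef FUTEX_LOCK_PI2
    #define FUTEX_LOCK_PI2  13
    #endif

    static long futex_lock_pi2(uint32_t *uaddr, struct timespec *abs_timeout,
                               int use_realtime)
    {
            int op = FUTEX_LOCK_PI2 | (use_realtime ? FUTEX_CLOCK_REALTIME : 0);

            /* val is unused for the PI lock operations; the timeout is absolute,
             * CLOCK_MONOTONIC by default and CLOCK_REALTIME with the flag set */
            return syscall(SYS_futex, uaddr, op, 0, abs_timeout, NULL, 0);
    }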
2021-06-22  futex: Prepare futex_lock_pi() for runtime clock selection  (Thomas Gleixner)  1 file, -1/+2
futex_lock_pi() is the only futex operation which cannot select the clock for timeouts (CLOCK_MONOTONIC/CLOCK_REALTIME). That's inconsistent and there is no particular reason why this cannot be supported. This was overlooked when FUTEX_CLOCK_REALTIME was introduced and unfortunately not reported when the inconsistency was discovered in glibc. Prepare the function and enforce the FUTEX_CLOCK_REALTIME flag on FUTEX_LOCK_PI so that a new FUTEX_LOCK_PI2 can implement it correctly. Reported-by: Kurt Kanzenbach <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  lockdep/selftest: Remove wait-type RCU_CALLBACK tests  (Peter Zijlstra)  1 file, -17/+0
The problem is that rcu_callback_map doesn't have wait_types defined, and doing so would make it indistinguishable from SOFTIRQ in any case. Remove it. Fixes: 9271a40d2a14 ("lockdep/selftest: Add wait context selftests") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Joerg Roedel <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  lockdep/selftests: Fix selftests vs PROVE_RAW_LOCK_NESTING  (Peter Zijlstra)  1 file, -0/+1
When PROVE_RAW_LOCK_NESTING=y many of the selftests FAILED because HARDIRQ context is out-of-bounds for spinlocks. Instead make the default hardware context the threaded hardirq context, which preserves the old locking rules. The wait-type specific locking selftests will have a non-threaded HARDIRQ variant. Fixes: de8f5e4f2dc1 ("lockdep: Introduce wait-type checks") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Joerg Roedel <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  lockdep: Fix wait-type for empty stack  (Peter Zijlstra)  1 file, -1/+1
Even the very first lock can violate the wait-context check, consider the various IRQ contexts. Fixes: de8f5e4f2dc1 ("lockdep: Introduce wait-type checks") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Joerg Roedel <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  locking/selftests: Add a selftest for check_irq_usage()  (Boqun Feng)  1 file, -0/+65
Johannes Berg reported a lockdep problem which could be reproduced by the special test case introduced in this patch, so add it. Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage()  (Boqun Feng)  1 file, -1/+11
In step #3 of check_irq_usage(), we search backwards to find a lock whose usage conflicts with the usage of @target_entry1 on the safe/unsafe axis. However, we should only take the irq-unsafe usage of @target_entry1 into consideration, because it could be a case where a lock is hardirq-unsafe but softirq-safe, and check_irq_usage() finds it because its hardirq-unsafe usage could result in a hardirq-safe -> hardirq-unsafe deadlock. But since we currently don't filter out the other usage bits, we may find a lock dependency path softirq-unsafe -> softirq-safe, which in fact doesn't cause a deadlock, and this may produce misleading lockdep splats. Fix this by only keeping the LOCKF_ENABLED_IRQ_ALL bits when we do the backwards search. Reported-by: Johannes Berg <[email protected]> Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  locking/lockdep: Remove the unnecessary trace saving  (Boqun Feng)  1 file, -3/+0
In print_bad_irq_dependency(), save_trace() is called to set the ->trace for @prev_root as the current call trace, however @prev_root corresponds to the held lock, which may not have been acquired in the current call trace, therefore it's wrong to use save_trace() to set ->trace of @prev_root. Moreover, with our adjustment of printing the backwards dependency path, the ->trace of @prev_root is unnecessary, so remove it. Reported-by: Johannes Berg <[email protected]> Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  locking/lockdep: Fix the dep path printing for backwards BFS  (Boqun Feng)  1 file, -2/+106
We use the same code to print the backwards lock dependency path as the forwards lock dependency path, and this could result in incorrect printing, because for a backwards lock_list ->trace is not the call trace where the lock of ->class is acquired. Fix this by introducing a separate function for printing the backwards dependency path. Also add a few comments about the printing while we are at it. Reported-by: Johannes Berg <[email protected]> Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  sched/uclamp: Fix uclamp_tg_restrict()  (Qais Yousef)  1 file, -31/+18
Now that cpu.uclamp.min acts as a protection, we need to make sure that the uclamp request of the task is within the allowed range of the cgroup, that is, it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and tg->uclamp[UCLAMP_MAX]. As reported by Xuewen [1] we can have some corner cases where there's an inversion between the uclamp requested by the task (p) and the uclamp values of the taskgroup it's attached to (tg). The following table demonstrates 2 corner cases:

           |  p   |  tg  | effective
-----------+------+------+-----------
CASE 1
-----------+------+------+-----------
uclamp_min | 60%  |  0%  |   60%
-----------+------+------+-----------
uclamp_max | 80%  | 50%  |   50%
-----------+------+------+-----------
CASE 2
-----------+------+------+-----------
uclamp_min |  0%  | 30%  |   30%
-----------+------+------+-----------
uclamp_max | 20%  | 50%  |   20%
-----------+------+------+-----------

With this fix we get:

           |  p   |  tg  | effective
-----------+------+------+-----------
CASE 1
-----------+------+------+-----------
uclamp_min | 60%  |  0%  |   50%
-----------+------+------+-----------
uclamp_max | 80%  | 50%  |   50%
-----------+------+------+-----------
CASE 2
-----------+------+------+-----------
uclamp_min |  0%  | 30%  |   30%
-----------+------+------+-----------
uclamp_max | 20%  | 50%  |   30%
-----------+------+------+-----------

Additionally, uclamp_update_active_tasks() must now unconditionally update both UCLAMP_MIN and UCLAMP_MAX because changing the tg's UCLAMP_MAX, for instance, could have an impact on the effective UCLAMP_MIN of the tasks:

           |  p   |  tg  | effective
-----------+------+------+-----------
old
-----------+------+------+-----------
uclamp_min | 60%  |  0%  |   50%
-----------+------+------+-----------
uclamp_max | 80%  | 50%  |   50%
-----------+------+------+-----------
*new*
-----------+------+------+-----------
uclamp_min | 60%  |  0%  |  *60%*
-----------+------+------+-----------
uclamp_max | 80%  |*70%* |  *70%*
-----------+------+------+-----------

[1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/ Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min") Reported-by: Xuewen Yan <[email protected]> Signed-off-by: Qais Yousef <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
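The effective values in the first two tables follow from clamping the task's request into the task group's range, roughly as below (a simplified sketch, not the kernel's uclamp_tg_restrict() itself):

    /* CASE 1: p requests min=60%, max=80%; tg allows [0%, 50%] */
    eff_min = clamp(60, 0, 50);     /* -> 50 */
    eff_max = clamp(80, 0, 50);     /* -> 50 */

    /* CASE 2: p requests min=0%, max=20%; tg allows [30%, 50%] */
    eff_min = clamp(0, 30, 50);     /* -> 30 */
    eff_max = clamp(20, 30, 50);    /* -> 30 */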
2021-06-22  sched/rt: Fix Deadline utilization tracking during policy change  (Vincent Donnefort)  1 file, -0/+2
DL keeps track of the utilization on a per-rq basis with the structure avg_dl. This utilization is updated during task_tick_dl(), put_prev_task_dl() and set_next_task_dl(). However, when the currently running task changes its policy, set_next_task_dl(), which would usually take care of updating the utilization when the rq starts running DL tasks, will not see such a change, leaving the avg_dl structure outdated. When that very same task is dequeued later, put_prev_task_dl() will then update the utilization based on a wrong last_update_time, leading to a huge spike in the DL utilization signal. The signal would eventually recover from this issue after a few ms. Even if no DL tasks are run, avg_dl is also updated in __update_blocked_others(). But as the CPU capacity depends partly on avg_dl, this issue nonetheless has a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to DL. Fixes: 3727e0e ("sched/dl: Add dl_rq utilization tracking") Signed-off-by: Vincent Donnefort <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  sched/rt: Fix RT utilization tracking during policy change  (Vincent Donnefort)  1 file, -5/+12
RT keeps track of the utilization on a per-rq basis with the structure avg_rt. This utilization is updated during task_tick_rt(), put_prev_task_rt() and set_next_task_rt(). However, when the currently running task changes its policy, set_next_task_rt(), which would usually take care of updating the utilization when the rq starts running RT tasks, will not see such a change, leaving the avg_rt structure outdated. When that very same task is dequeued later, put_prev_task_rt() will then update the utilization based on a wrong last_update_time, leading to a huge spike in the RT utilization signal. The signal would eventually recover from this issue after a few ms. Even if no RT tasks are run, avg_rt is also updated in __update_blocked_others(). But as the CPU capacity depends partly on avg_rt, this issue nonetheless has a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to RT. Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking") Signed-off-by: Vincent Donnefort <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  clockevents: Add missing parameter documentation  (Baokun Li)  1 file, -0/+1
Add the missing documentation for the @cpu parameter of tick_cleanup_dead_cpu(). Signed-off-by: Baokun Li <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-23  KVM: PPC: Book3S HV: Workaround high stack usage with clang  (Nathan Chancellor)  1 file, -1/+2
LLVM does not emit optimal byteswap assembly, which results in high stack usage in kvmhv_enter_nested_guest() due to the inlining of byteswap_pt_regs(). With LLVM 12.0.0: arch/powerpc/kvm/book3s_hv_nested.c:289:6: error: stack frame size of 2512 bytes in function 'kvmhv_enter_nested_guest' [-Werror,-Wframe-larger-than=] long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu) ^ 1 error generated. While this gets fixed in LLVM, mark byteswap_pt_regs() as noinline_for_stack so that it does not get inlined and break the build due to -Werror by default in arch/powerpc/. Not inlining saves approximately 800 bytes with LLVM 12.0.0: arch/powerpc/kvm/book3s_hv_nested.c:290:6: warning: stack frame size of 1728 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=] long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu) ^ 1 warning generated. Cc: [email protected] # v4.20+ Reported-by: kernel test robot <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://github.com/ClangBuiltLinux/linux/issues/1292 Link: https://bugs.llvm.org/show_bug.cgi?id=49610 Link: https://lore.kernel.org/r/[email protected]/ Link: https://gist.github.com/ba710e3703bf45043a31e2806c843ffd Link: https://lore.kernel.org/r/[email protected]
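The workaround amounts to annotating the helper so that it keeps its own stack frame; a sketch (the loop body is illustrative, the relevant part is the noinline_for_stack attribute):

    static noinline_for_stack void byteswap_pt_regs(struct pt_regs *regs)
    {
            unsigned long *addr = (unsigned long *)regs;

            /* byteswap the saved register file in place; kept out of line so
             * the expansion does not land in kvmhv_enter_nested_guest()'s frame */
            for (; addr < (unsigned long *)(regs + 1); addr++)
                    *addr = swab64(*addr);
    }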
2021-06-22  Merge branch kvm-arm64/mmu/mte into kvmarm-master/next  (Marc Zyngier)  21 files, -17/+423
KVM/arm64 support for MTE, courtesy of Steven Price. It allows the guest to use memory tagging, and offers a new userspace API to save/restore the tags. * kvm-arm64/mmu/mte: KVM: arm64: Document MTE capability and ioctl KVM: arm64: Add ioctl to fetch/store tags in a guest KVM: arm64: Expose KVM_ARM_CAP_MTE KVM: arm64: Save/restore MTE registers KVM: arm64: Introduce MTE VM feature arm64: mte: Sync tags for pages where PTE is untagged Signed-off-by: Marc Zyngier <[email protected]>
2021-06-22  signal: Prevent sigqueue caching after task got released  (Thomas Gleixner)  1 file, -1/+16
syzbot reported a memory leak related to sigqueue caching. The assumption that a task cannot cache a sigqueue after the signal handler has been dropped and exit_task_sigqueue_cache() has been invoked turns out to be wrong. Such a task can still invoke release_task(other_task), which cleans up the signals of 'other_task' and ends up in sigqueue_cache_or_free(), which in turn will cache the signal because task->sigqueue_cache is NULL. That's obviously bogus because nothing will free the cached signal of that task anymore, so the cached item is leaked. This happens when e.g. the last non-leader thread exits and reaps the zombie leader. Prevent this by setting tsk::sigqueue_cache to an error pointer value in exit_task_sigqueue_cache() which forces any subsequent invocation of sigqueue_cache_or_free() from that task to hand the sigqueue back to the kmemcache. Add comments to all relevant places. Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct") Reported-by: [email protected] Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
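One way to realize the error-pointer idea described above (a sketch under assumptions about the surrounding signal code, not necessarily the exact patch):

    void exit_task_sigqueue_cache(struct task_struct *tsk)
    {
            struct sigqueue *q = tsk->sigqueue_cache;

            if (q)
                    kmem_cache_free(sigqueue_cachep, q);
            /*
             * Poison the pointer: a later sigqueue_cache_or_free() issued by
             * this task (e.g. via release_task()) must free the sigqueue
             * instead of caching it, as nothing would ever reclaim it.
             */
            tsk->sigqueue_cache = ERR_PTR(-1);
    }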
2021-06-22  KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM  (Bharata B Rao)  2 files, -7/+32
In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall H_RPT_INVALIDATE if available. The availability of this hcall is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions DT property. Signed-off-by: Bharata B Rao <[email protected]> Reviewed-by: Fabiano Rosas <[email protected]> Reviewed-by: David Gibson <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  KVM: PPC: Book3S HV: Add KVM_CAP_PPC_RPT_INVALIDATE capability  (Bharata B Rao)  3 files, -0/+22
Now that we have H_RPT_INVALIDATE fully implemented, enable support for it via the KVM_CAP_PPC_RPT_INVALIDATE KVM capability. Signed-off-by: Bharata B Rao <[email protected]> Reviewed-by: David Gibson <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
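A user-space sketch of how a VMM might probe for the new capability before relying on H_RPT_INVALIDATE (the capability name comes from this patch, the surrounding code is assumed):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    static int rpt_invalidate_supported(int kvm_fd)
    {
            /* > 0 means the host implements H_RPT_INVALIDATE for guests */
            return ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                         KVM_CAP_PPC_RPT_INVALIDATE) > 0;
    }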
2021-06-22  KVM: PPC: Book3S HV: Nested support in H_RPT_INVALIDATE  (Bharata B Rao)  5 files, -7/+170
Enable support for process-scoped invalidations from nested guests and partition-scoped invalidations for nested guests. Process-scoped invalidations for any level of nested guests are handled by implementing H_RPT_INVALIDATE handler in the nested guest exit path in L0. Partition-scoped invalidation requests are forwarded to the right nested guest, handled there and passed down to L0 for eventual handling. Signed-off-by: Bharata B Rao <[email protected]> [aneesh: Nested guest partition-scoped invalidation changes] Signed-off-by: Aneesh Kumar K.V <[email protected]> [mpe: Squash in fixup patch] Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  drm/amdgpu: wait for moving fence after pinning  (Christian König)  1 file, -1/+13
We actually need to wait for the moving fence after pinning the BO to make sure that the pin is completed. Signed-off-by: Christian König <[email protected]> Reviewed-by: Daniel Vetter <[email protected]> References: https://lore.kernel.org/dri-devel/[email protected]/ CC: [email protected] Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-06-22  drm/radeon: wait for moving fence after pinning  (Christian König)  1 file, -3/+13
We actually need to wait for the moving fence after pinning the BO to make sure that the pin is completed. Signed-off-by: Christian König <[email protected]> Reviewed-by: Daniel Vetter <[email protected]> References: https://lore.kernel.org/dri-devel/[email protected]/ CC: [email protected] Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-06-22  drm/nouveau: wait for moving fence after pinning v2  (Christian König)  1 file, -1/+16
We actually need to wait for the moving fence after pinning the BO to make sure that the pin is completed. v2: grab the lock while waiting Signed-off-by: Christian König <[email protected]> Reviewed-by: Daniel Vetter <[email protected]> References: https://lore.kernel.org/dri-devel/[email protected]/ CC: [email protected] Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
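The shared pattern in the three patches above is roughly the following (sketched from the amdgpu variant; bo->tbo.moving is TTM's moving fence, the error handling is illustrative):

    /* the BO may still be in flight from a previous move; the pin is only
     * really complete once that moving fence has signalled */
    if (bo->tbo.moving) {
            r = dma_fence_wait(bo->tbo.moving, false);
            if (r) {
                    DRM_ERROR("Failed to wait for moving fence: %d\n", r);
                    return r;
            }
    }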
2021-06-22  KVM: arm64: Document MTE capability and ioctl  (Steven Price)  1 file, -0/+61
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports granting a guest access to the tags, and provides a mechanism for the VMM to enable it. A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to access the tags of a guest without having to maintain a PROT_MTE mapping in userspace. The above capability gates access to the ioctl. Reviewed-by: Catalin Marinas <[email protected]> Signed-off-by: Steven Price <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
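A VMM would enable the capability per VM roughly like this (hedged sketch; KVM_CAP_ARM_MTE is the capability described above, the rest is standard KVM ioctl usage):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    static int enable_mte(int vm_fd)
    {
            struct kvm_enable_cap cap = {
                    .cap = KVM_CAP_ARM_MTE,   /* VM-scoped capability */
            };

            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }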
2021-06-22  KVM: arm64: Add ioctl to fetch/store tags in a guest  (Steven Price)  6 files, -0/+105
The VMM may not wish to have its own mapping of guest memory mapped with PROT_MTE because this causes problems if the VMM has tag checking enabled (the guest controls the tags in physical RAM and it's unlikely the tags are correct for the VMM). Instead add a new ioctl which allows the VMM to easily read/write the tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE while the VMM can still read/write the tags for the purpose of migration. Reviewed-by: Catalin Marinas <[email protected]> Signed-off-by: Steven Price <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
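Usage from a VMM would look roughly like the sketch below (the structure follows the uapi addition as described above; treat the details as illustrative):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* read the MTE tags covering 'len' bytes of guest IPA space into 'tag_buf';
     * one tag byte is transferred per 16-byte MTE granule */
    static int fetch_guest_tags(int vm_fd, __u64 ipa, __u64 len, void *tag_buf)
    {
            struct kvm_arm_copy_mte_tags copy = {
                    .guest_ipa = ipa,
                    .length    = len,
                    .addr      = tag_buf,
                    .flags     = KVM_ARM_TAGS_FROM_GUEST,
            };

            return ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy);
    }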
2021-06-22  KVM: arm64: Expose KVM_ARM_CAP_MTE  (Steven Price)  3 files, -0/+16
It's now safe for the VMM to enable MTE in a guest, so expose the capability to user space. Reviewed-by: Catalin Marinas <[email protected]> Signed-off-by: Steven Price <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  KVM: arm64: Save/restore MTE registers  (Steven Price)  8 files, -6/+124
Define the new system registers that MTE introduces and context switch them. The MTE feature is still hidden from the ID register as it isn't supported in a VM yet. Reviewed-by: Catalin Marinas <[email protected]> Signed-off-by: Steven Price <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  KVM: arm64: Introduce MTE VM feature  (Steven Price)  6 files, -2/+83
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging for a VM. This will expose the feature to the guest and automatically tag memory pages touched by the VM as PG_mte_tagged (and clear the tag storage) to ensure that the guest cannot see stale tags, and so that the tags are correctly saved/restored across swap. Actually exposing the new capability to user space happens in a later patch. Reviewed-by: Catalin Marinas <[email protected]> Signed-off-by: Steven Price <[email protected]> [maz: move VM_SHARED sampling into the critical section] Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-22  btrfs: rip out btrfs_space_info::total_bytes_pinned  (Josef Bacik)  7 files, -97/+0
We used this in may_commit_transaction() in order to determine if we needed to commit the transaction. However, we no longer have that logic and thus have no use for this counter anymore, so delete it. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-22  btrfs: rip the first_ticket_bytes logic from fail_all_tickets  (Josef Bacik)  1 file, -16/+0
This was a trick implemented to handle the case where we had a giant reservation in front of a bunch of little reservations in the ticket queue. If the giant reservation was too large for the transaction commit to make a difference we'd ENOSPC everybody out instead of committing the transaction. This logic was put in to force us to go back and re-try the transaction commit logic to see if we could make progress. Instead now we know we've committed the transaction, so any space that would have been recovered is now available, and would be caught by the btrfs_try_granting_tickets() in this loop, so we no longer need this code and can simply delete it. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-22  btrfs: remove FLUSH_DELAYED_REFS from data ENOSPC flushing  (Josef Bacik)  1 file, -16/+0
Since we now unconditionally commit the transaction, we no longer need to run the delayed refs to make sure our total_bytes_pinned value is up to date; we can simply commit the transaction. Remove this stage from the data flushing list. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-22  btrfs: rip out may_commit_transaction  (Josef Bacik)  3 files, -128/+12
may_commit_transaction was introduced before the ticketing infrastructure existed. There was a problem where we'd legitimately be out of space, but every reservation would trigger a transaction commit and then fail. Thus if you had 1000 things trying to make a reservation, they'd all do the flushing loop and thus commit the transaction 1000 times before they'd get their ENOSPC. This helper was introduced to short circuit this, if there wasn't space that could be reclaimed by committing the transaction then simply ENOSPC out. This made true ENOSPC tests much faster as we didn't waste a bunch of time. However many of our bugs over the years have been from cases where we didn't account for some space that would be reclaimed by committing a transaction. The delayed refs rsv space, delayed rsv, many pinned bytes miscalculations, etc. And in the meantime the original problem has been solved with ticketing. We no longer will commit the transaction 1000 times. Instead we'll get 1000 waiters, we will go through the flushing mechanisms, and if there's no progress after 2 loops we ENOSPC everybody out. The ticketing infrastructure gives us a deterministic way to see if we're making progress or not, thus we avoid a lot of extra work. So simplify this step by simply unconditionally committing the transaction. This removes what is arguably our most common source of early ENOSPC bugs and will allow us to drastically simplify many of the things we track because we simply won't need them with this stuff gone. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-22  btrfs: send: fix crash when memory allocations trigger reclaim  (Filipe Manana)  5 files, -31/+2
When doing a send we don't expect the task to ever start a transaction after the initial check that verifies if commit roots match the regular roots. This is because after that we set current->journal_info with a stub (special value) that signals we are in send context, so that we take a read lock on an extent buffer when reading it from disk and verifying it is valid (its generation matches the generation stored in the parent). This stub was introduced in 2014 by commit a26e8c9f75b0bf ("Btrfs: don't clear uptodate if the eb is under IO") in order to fix a concurrency issue between send and balance. However there is one particular exception where we end up needing to start a transaction and when this happens it results in a crash with a stack trace like the following: [60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80 [60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...) [60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1 [60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015 [60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80 [60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...) [60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246 [60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224 [60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000 [60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002 [60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001 [60015.902408] kernel: FS: 00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000 [60015.902410] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0 [60015.902414] kernel: Call Trace: [60015.902419] kernel: kfence_unprotect+0x13/0x30 [60015.902423] kernel: page_fault_oops+0x89/0x270 [60015.902427] kernel: ? search_module_extables+0xf/0x40 [60015.902431] kernel: ? search_bpf_extables+0x57/0x70 [60015.902435] kernel: kernelmode_fixup_or_oops+0xd6/0xf0 [60015.902437] kernel: __bad_area_nosemaphore+0x142/0x180 [60015.902440] kernel: exc_page_fault+0x67/0x150 [60015.902445] kernel: asm_exc_page_fault+0x1e/0x30 [60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580 [60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...) [60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246 [60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000 [60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039 [60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000 [60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001 [60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000 [60015.902468] kernel: btrfs_commit_inode_delayed_inode+0x5d/0x120 [60015.902473] kernel: btrfs_evict_inode+0x2c5/0x3f0 [60015.902476] kernel: evict+0xd1/0x180 [60015.902480] kernel: inode_lru_isolate+0xe7/0x180 [60015.902483] kernel: __list_lru_walk_one+0x77/0x150 [60015.902487] kernel: ? iput+0x1a0/0x1a0 [60015.902489] kernel: ? 
iput+0x1a0/0x1a0 [60015.902491] kernel: list_lru_walk_one+0x47/0x70 [60015.902495] kernel: prune_icache_sb+0x39/0x50 [60015.902497] kernel: super_cache_scan+0x161/0x1f0 [60015.902501] kernel: do_shrink_slab+0x142/0x240 [60015.902505] kernel: shrink_slab+0x164/0x280 [60015.902509] kernel: shrink_node+0x2c8/0x6e0 [60015.902512] kernel: do_try_to_free_pages+0xcb/0x4b0 [60015.902514] kernel: try_to_free_pages+0xda/0x190 [60015.902516] kernel: __alloc_pages_slowpath.constprop.0+0x373/0xcc0 [60015.902521] kernel: ? __memcg_kmem_charge_page+0xc2/0x1e0 [60015.902525] kernel: __alloc_pages_nodemask+0x30a/0x340 [60015.902528] kernel: pipe_write+0x30b/0x5c0 [60015.902531] kernel: ? set_next_entity+0xad/0x1e0 [60015.902534] kernel: ? switch_mm_irqs_off+0x58/0x440 [60015.902538] kernel: __kernel_write+0x13a/0x2b0 [60015.902541] kernel: kernel_write+0x73/0x150 [60015.902543] kernel: send_cmd+0x7b/0xd0 [60015.902545] kernel: send_extent_data+0x5a3/0x6b0 [60015.902549] kernel: process_extent+0x19b/0xed0 [60015.902551] kernel: btrfs_ioctl_send+0x1434/0x17e0 [60015.902554] kernel: ? _btrfs_ioctl_send+0xe1/0x100 [60015.902557] kernel: _btrfs_ioctl_send+0xbf/0x100 [60015.902559] kernel: ? enqueue_entity+0x18c/0x7b0 [60015.902562] kernel: btrfs_ioctl+0x185f/0x2f80 [60015.902564] kernel: ? psi_task_change+0x84/0xc0 [60015.902569] kernel: ? _flat_send_IPI_mask+0x21/0x40 [60015.902572] kernel: ? check_preempt_curr+0x2f/0x70 [60015.902576] kernel: ? selinux_file_ioctl+0x137/0x1e0 [60015.902579] kernel: ? expand_files+0x1cb/0x1d0 [60015.902582] kernel: ? __x64_sys_ioctl+0x82/0xb0 [60015.902585] kernel: __x64_sys_ioctl+0x82/0xb0 [60015.902588] kernel: do_syscall_64+0x33/0x40 [60015.902591] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae [60015.902595] kernel: RIP: 0033:0x7f158e38f0ab [60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...) [60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab [60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004 [60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640 [60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000 [60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0 [60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]--- This happens because when writing to the pipe, by calling kernel_write(), we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the gfp flags, which allow reclaim to happen if there is memory pressure. This allocation happens at fs/pipe.c:pipe_write(). If the reclaim is triggered, inode eviction can be triggered and that in turn can result in starting a transaction if the inode has a link count of 0. The transaction start happens early on during eviction, when we call btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if there is currently an open file descriptor for an inode with a link count of 0 and the reclaim task gets a reference on the inode before that descriptor is closed, in which case the reclaim task ends up doing the final iput that triggers the inode eviction. When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers the following assertion at transaction.c:start_transaction(): /* Send isn't supposed to start transactions. 
*/ ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB); And when assertions are not enabled, it triggers a crash since after that assertion we cast current->journal_info into a transaction handle pointer and then dereference it: if (current->journal_info) { WARN_ON(type & TRANS_EXTWRITERS); h = current->journal_info; refcount_inc(&h->use_count); (...) Which obviously results in a crash due to an invalid memory access. The same type of issue can happen during other memory allocations we do directly in the send code with kmalloc (and friends) as they use GFP_KERNEL and therefore may trigger reclaim too, which started to happen since 2016 after commit e780b0d1c1523e ("btrfs: send: use GFP_KERNEL everywhere"). The issue could be solved by setting up a NOFS context for the entire send operation so that reclaim could not be triggered when allocating memory or pages through kernel_write(). However that is not very friendly and we can in fact get rid of the send stub because: 1) The stub was introduced way back in 2014 by commit a26e8c9f75b0bf ("Btrfs: don't clear uptodate if the eb is under IO") to solve an issue exclusive to when send and balance are running in parallel, however there were other problems between balance and send and we do not allow anymore to have balance and send run concurrently since commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due to concurrent relocation"). More generically the issues are between send and relocation, and that last commit eliminated only the possibility of having send and balance run concurrently, but shrinking a device also can trigger relocation, and on zoned filesystems we have relocation of partially used block groups triggered automatically as well. The previous patch that has a subject of: "btrfs: ensure relocation never runs while we have send operations running" Addresses all the remaining cases that can trigger relocation. 2) We can actually allow starting and even committing transactions while in a send context if needed because send is not holding any locks that would block the start or the commit of a transaction. So get rid of all the logic added by commit a26e8c9f75b0bf ("Btrfs: don't clear uptodate if the eb is under IO"). We can now always call clear_extent_buffer_uptodate() at verify_parent_transid() since send is the only case that uses commit roots without having a transaction open or without holding the commit_root_sem. Reported-by: Chris Murphy <[email protected]> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/ Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>