aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-11-11KVM: SEV: provide helpers to charge/uncharge misc_cgPaolo Bonzini1-7/+15
Avoid code duplication across all callers of misc_cg_try_charge and misc_cg_uncharge. The resource type for KVM is always derived from sev->es_active, and the quantity is always 1. Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11KVM: generalize "bugged" VM to "dead" VMPaolo Bonzini3-8/+16
Generalize KVM_REQ_VM_BUGGED so that it can be called even in cases where it is by design that the VM cannot be operated upon. In this case any KVM_BUG_ON should still warn, so introduce a new flag kvm->vm_dead that is separate from kvm->vm_bugged. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11KVM: SEV: Refactor out sev_es_state structPeter Gonda3-56/+61
Move SEV-ES vCPU metadata into new sev_es_state struct from vcpu_svm. Signed-off-by: Peter Gonda <[email protected]> Suggested-by: Tom Lendacky <[email protected]> Acked-by: Tom Lendacky <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Cc: Marc Orr <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dr. David Alan Gilbert <[email protected]> Cc: Brijesh Singh <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Jim Mattson <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: [email protected] Cc: [email protected] Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11ALSA: fireworks: add support for Loud Onyx 1200f quirkTakashi Sakamoto1-2/+3
Loud Technologies shipped Onyx 1200f 2008 in its Mackie brand and already discontinued. The model uses component of Fireworks board module as its communication and DSP function. The latest firmware (v4.6.0) has a quirk that tx packet includes wrong value (0x1f) in its dbs field at middle and higher sampling transfer frequency. It brings ALSA fireworks driver discontinuity of data block counter. This commit fixes it by assuming it as a quirk of firmware version 4.6.0. $ cd linux-firewire-tools/src $ python crpp < /sys/bus/firewire/devices/fw1/config_rom ROM header and bus information block ----------------------------------------------------------------- 400 0404b9ef bus_info_length 4, crc_length 4, crc 47599 404 31333934 bus_name "1394" 408 e064a212 irmc 1, cmc 1, isc 1, bmc 0, pmc 0, cyc_clk_acc 100, max_rec 10 (2048), max_rom 2, gen 1, spd 2 (S400) 40c 000ff209 company_id 000ff2 | 410 62550ce0 device_id 0962550ce0 | EUI-64 000ff20962550ce0 root directory ----------------------------------------------------------------- 414 0008088e directory_length 8, crc 2190 418 03000ff2 vendor 41c 8100000f --> descriptor leaf at 458 420 1701200f model 424 81000018 --> descriptor leaf at 484 428 0c008380 node capabilities 42c 8d000003 --> eui-64 leaf at 438 430 d1000005 --> unit directory at 444 434 08000ff2 (immediate value) eui-64 leaf at 438 ----------------------------------------------------------------- 438 000281ae leaf_length 2, crc 33198 43c 000ff209 company_id 000ff2 | 440 62550ce0 device_id 0962550ce0 | EUI-64 000ff20962550ce0 unit directory at 444 ----------------------------------------------------------------- 444 00045d94 directory_length 4, crc 23956 448 1200a02d specifier id: 1394 TA 44c 13010000 version 450 1701200f model 454 8100000c --> descriptor leaf at 484 descriptor leaf at 458 ----------------------------------------------------------------- 458 000a199d leaf_length 10, crc 6557 45c 00000000 textual descriptor 460 00000000 minimal ASCII 464 4d61636b "Mack" 468 69650000 "ie" 46c 00000000 470 00000000 474 00000000 478 00000000 47c 00000000 480 00000000 descriptor leaf at 484 ----------------------------------------------------------------- 484 000a0964 leaf_length 10, crc 2404 488 00000000 textual descriptor 48c 00000000 minimal ASCII 490 4f6e7978 "Onyx" 494 20313230 " 120" 498 30460000 "0F" 49c 00000000 4a0 00000000 4a4 00000000 4a8 00000000 4ac 00000000 Signed-off-by: Takashi Sakamoto <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>
2021-11-11Merge branch 'kvm-guest-sev-migration' into kvm-masterPaolo Bonzini10-9/+202
Add guest api and guest kernel support for SEV live migration. Introduces a new hypercall to notify the host of changes to the page encryption status. If the page is encrypted then it must be migrated through the SEV firmware or a helper VM sharing the key. If page is not encrypted then it can be migrated normally by userspace. This new hypercall is invoked using paravirt_ops. Conflicts: sev_active() replaced by cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).
2021-11-11x86/kvm: Add kexec support for SEV Live Migration.Ashish Kalra1-0/+25
Reset the host's shared pages list related to kernel specific page encryption status settings before we load a new kernel by kexec. We cannot reset the complete shared pages list here as we need to retain the UEFI/OVMF firmware specific settings. The host's shared pages list is maintained for the guest to keep track of all unencrypted guest memory regions, therefore we need to explicitly mark all shared pages as encrypted again before rebooting into the new guest kernel. Signed-off-by: Ashish Kalra <[email protected]> Reviewed-by: Steve Rutherford <[email protected]> Message-Id: <3e051424ab839ea470f88333273d7a185006754f.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11x86/kvm: Add guest support for detecting and enabling SEV Live Migration ↵Ashish Kalra3-0/+91
feature. The guest support for detecting and enabling SEV Live migration feature uses the following logic : - kvm_init_plaform() checks if its booted under the EFI - If not EFI, i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), issue a wrmsrl() to enable the SEV live migration support - If EFI, i) If kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), read the UEFI variable which indicates OVMF support for live migration ii) the variable indicates live migration is supported, issue a wrmsrl() to enable the SEV live migration support The EFI live migration check is done using a late_initcall() callback. Also, ensure that _bss_decrypted section is marked as decrypted in the hypervisor's guest page encryption status tracking. Signed-off-by: Ashish Kalra <[email protected]> Reviewed-by: Steve Rutherford <[email protected]> Message-Id: <b4453e4c87103ebef12217d2505ea99a1c3e0f0f.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11EFI: Introduce the new AMD Memory Encryption GUID.Ashish Kalra1-0/+1
Introduce a new AMD Memory Encryption GUID which is currently used for defining a new UEFI environment variable which indicates UEFI/OVMF support for the SEV live migration feature. This variable is setup when UEFI/OVMF detects host/hypervisor support for SEV live migration and later this variable is read by the kernel using EFI runtime services to verify if OVMF supports the live migration feature. Signed-off-by: Ashish Kalra <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Message-Id: <1cea22976d2208f34d47e0c1ce0ecac816c13111.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11mm: x86: Invoke hypercall when page encryption status is changedBrijesh Singh6-9/+73
Invoke a hypercall when a memory region is changed from encrypted -> decrypted and vice versa. Hypervisor needs to know the page encryption status during the guest migration. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Steve Rutherford <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Message-Id: <0a237d5bb08793916c7790a3e653a2cbe7485761.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11x86/kvm: Add AMD SEV specific Hypercall3Brijesh Singh1-0/+12
KVM hypercall framework relies on alternative framework to patch the VMCALL -> VMMCALL on AMD platform. If a hypercall is made before apply_alternative() is called then it defaults to VMCALL. The approach works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor will be able to decode the instruction and do the right things. But when SEV is active, guest memory is encrypted with guest key and hypervisor will not be able to decode the instruction bytes. To highlight the need to provide this interface, capturing the flow of apply_alternatives() : setup_arch() call init_hypervisor_platform() which detects the hypervisor platform the kernel is running under and then the hypervisor specific initialization code can make early hypercalls. For example, KVM specific initialization in case of SEV will try to mark the "__bss_decrypted" section's encryption state via early page encryption status hypercalls. Now, apply_alternatives() is called much later when setup_arch() calls check_bugs(), so we do need some kind of an early, pre-alternatives hypercall interface. Other cases of pre-alternatives hypercalls include marking per-cpu GHCB pages as decrypted on SEV-ES and per-cpu apf_reason, steal_time and kvm_apic_eoi as decrypted for SEV generally. Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall will be used by the SEV guest to notify encrypted pages to the hypervisor. This kvm_sev_hypercall3() function is abstracted and used as follows : All these early hypercalls are made through early_set_memory_XX() interfaces, which in turn invoke pv_ops (paravirt_ops). This early_set_memory_XX() -> pv_ops.mmu.notify_page_enc_status_changed() is a generic interface and can easily have SEV, TDX and any other future platform specific abstractions added to it. Currently, pv_ops.mmu.notify_page_enc_status_changed() callback is setup to invoke kvm_sev_hypercall3() in case of SEV. Similarly, in case of TDX, pv_ops.mmu.notify_page_enc_status_changed() can be setup to a TDX specific callback. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Steve Rutherford <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> Message-Id: <6fd25c749205dd0b1eb492c60d41b124760cc6ae.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11selftests/net: udpgso_bench_rx: fix port argumentWillem de Bruijn1-4/+7
The below commit added optional support for passing a bind address. It configures the sockaddr bind arguments before parsing options and reconfigures on options -b and -4. This broke support for passing port (-p) on its own. Configure sockaddr after parsing all arguments. Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO") Reported-by: Eric Dumazet <[email protected]> Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-11perf/core: Avoid put_page() when GUP failsGreg Thelen1-5/+5
PEBS PERF_SAMPLE_PHYS_ADDR events use perf_virt_to_phys() to convert PMU sampled virtual addresses to physical using get_user_page_fast_only() and page_to_phys(). Some get_user_page_fast_only() error cases return false, indicating no page reference, but still initialize the output page pointer with an unreferenced page. In these error cases perf_virt_to_phys() calls put_page(). This causes page reference count underflow, which can lead to unintentional page sharing. Fix perf_virt_to_phys() to only put_page() if get_user_page_fast_only() returns a referenced page. Fixes: fc7ce9c74c3ad ("perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR") Signed-off-by: Greg Thelen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-11perf/x86/vlbr: Add c->flags to vlbr event constraintsLike Xu1-1/+3
Just like what we do in the x86_get_event_constraints(), the PERF_X86_EVENT_LBR_SELECT flag should also be propagated to event->hw.flags so that the host lbr driver can save/restore MSR_LBR_SELECT for the special vlbr event created by KVM or BPF. Fixes: 097e4311cda9 ("perf/x86: Add constraint to create guest LBR event without hw counter") Reported-by: Wanpeng Li <[email protected]> Signed-off-by: Like Xu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Wanpeng Li <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-11-11perf/x86/lbr: Reset LBR_SELECT during vlbr resetWanpeng Li1-0/+2
lbr_select in kvm guest has residual data even if kvm guest is poweroff. We can get residual data in the next boot. Because lbr_select is not reset during kvm vlbr release. Let's reset LBR_SELECT during vlbr reset. Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-11preempt: Restore preemption model selection configsValentin Schneider5-27/+27
Commit c597bfddc9e9 ("sched: Provide Kconfig support for default dynamic preempt mode") changed the selectable config names for the preemption model. This means a config file must now select CONFIG_PREEMPT_BEHAVIOUR=y rather than CONFIG_PREEMPT=y to get a preemptible kernel. This means all arch config files would need to be updated - right now they'll all end up with the default CONFIG_PREEMPT_NONE_BEHAVIOUR. Rather than touch a good hundred of config files, restore usage of CONFIG_PREEMPT{_NONE, _VOLUNTARY}. Make them configure: o The build-time preemption model when !PREEMPT_DYNAMIC o The default boot-time preemption model when PREEMPT_DYNAMIC Add siblings of those configs with the _BUILD suffix to unconditionally designate the build-time preemption model (PREEMPT_DYNAMIC is built with the "highest" preemption model it supports, aka PREEMPT). Downstream configs should by now all be depending / selected by CONFIG_PREEMPTION rather than CONFIG_PREEMPT, so only a few sites need patching up. Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Marco Elver <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-11-11arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology()Wang ShaoBo1-0/+2
When testing cpu online and offline, warning happened like this: [ 146.746743] WARNING: CPU: 92 PID: 974 at kernel/sched/topology.c:2215 build_sched_domains+0x81c/0x11b0 [ 146.749988] CPU: 92 PID: 974 Comm: kworker/92:2 Not tainted 5.15.0 #9 [ 146.750402] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.79 08/21/2021 [ 146.751213] Workqueue: events cpuset_hotplug_workfn [ 146.751629] pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 146.752048] pc : build_sched_domains+0x81c/0x11b0 [ 146.752461] lr : build_sched_domains+0x414/0x11b0 [ 146.752860] sp : ffff800040a83a80 [ 146.753247] x29: ffff800040a83a80 x28: ffff20801f13a980 x27: ffff20800448ae00 [ 146.753644] x26: ffff800012a858e8 x25: ffff800012ea48c0 x24: 0000000000000000 [ 146.754039] x23: ffff800010ab7d60 x22: ffff800012f03758 x21: 000000000000005f [ 146.754427] x20: 000000000000005c x19: ffff004080012840 x18: ffffffffffffffff [ 146.754814] x17: 3661613030303230 x16: 30303078303a3239 x15: ffff800011f92b48 [ 146.755197] x14: ffff20be3f95cef6 x13: 2e6e69616d6f642d x12: 6465686373204c4c [ 146.755578] x11: ffff20bf7fc83a00 x10: 0000000000000040 x9 : 0000000000000000 [ 146.755957] x8 : 0000000000000002 x7 : ffffffffe0000000 x6 : 0000000000000002 [ 146.756334] x5 : 0000000090000000 x4 : 00000000f0000000 x3 : 0000000000000001 [ 146.756705] x2 : 0000000000000080 x1 : ffff800012f03860 x0 : 0000000000000001 [ 146.757070] Call trace: [ 146.757421] build_sched_domains+0x81c/0x11b0 [ 146.757771] partition_sched_domains_locked+0x57c/0x978 [ 146.758118] rebuild_sched_domains_locked+0x44c/0x7f0 [ 146.758460] rebuild_sched_domains+0x2c/0x48 [ 146.758791] cpuset_hotplug_workfn+0x3fc/0x888 [ 146.759114] process_one_work+0x1f4/0x480 [ 146.759429] worker_thread+0x48/0x460 [ 146.759734] kthread+0x158/0x168 [ 146.760030] ret_from_fork+0x10/0x20 [ 146.760318] ---[ end trace 82c44aad6900e81a ]--- For some architectures like risc-v and arm64 which use common code clear_cpu_topology() in shutting down CPUx, When CONFIG_SCHED_CLUSTER is set, cluster_sibling in cpu_topology of each sibling adjacent to CPUx is missed clearing, this causes checking failed in topology_span_sane() and rebuilding topology failure at end when CPU online. Different sibling's cluster_sibling in cpu_topology[] when CPU92 offline (CPU 92, 93, 94, 95 are in one cluster): Before revision: CPU [92] [93] [94] [95] cluster_sibling [92] [92-95] [92-95] [92-95] After revision: CPU [92] [93] [94] [95] cluster_sibling [92] [93-95] [93-95] [93-95] Signed-off-by: Wang ShaoBo <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Acked-by: Barry Song <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-11-11sched/fair: Prevent dead task groups from regaining cfs_rq'sMathias Krause5-16/+49
Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we've live cfs_rq's (on_list=1) in an about to be kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen. His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP which made the UAF fail fast. Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time. When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing. To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below: CPU1: CPU2: : timer IRQ: : do_sched_cfs_period_timer(): : : : distribute_cfs_runtime(): : rcu_read_lock(); : : : unthrottle_cfs_rq(): sched_offline_group(): : : walk_tg_tree_from(…,tg_unthrottle_up,…): list_del_rcu(&tg->list); : (1) : list_for_each_entry_rcu(child, &parent->children, siblings) : : (2) list_del_rcu(&tg->siblings); : : tg_unthrottle_up(): unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; : : list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : : : : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running) (3) : list_add_leaf_cfs_rq(cfs_rq); : : : : : : : : : : (4) : rcu_read_unlock(); CPU 2 walks the task group list in parallel to sched_offline_group(), specifically, it'll read the soon to be unlinked task group entry at (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of these at (3), which is the cause of the UAF later on. To prevent this additional race from happening, we need to wait until walk_tg_tree_from() has finished traversing the task groups, i.e. after the RCU read critical section ends in (4). Afterwards we're safe to call unregister_fair_sched_group(), as each new walk won't see the dying task group any more. On top of that, we need to wait yet another RCU grace period after unregister_fair_sched_group() to ensure print_cfs_stats(), which might run concurrently, always sees valid objects, i.e. not already free'd ones. This patch survives Michal's reproducer[2] for 8h+ now, which used to trigger within minutes before. [1] https://lore.kernel.org/lkml/[email protected]/ [2] https://lore.kernel.org/lkml/[email protected]/ Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") [peterz: shuffle code around a bit] Reported-by: Kevin Tanguy <[email protected]> Signed-off-by: Mathias Krause <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2021-11-11sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()Vincent Donnefort1-0/+3
Nothing protects the access to the per_cpu variable sd_llc_id. When testing the same CPU (i.e. this_cpu == that_cpu), a race condition exists with update_top_cache_domain(). One scenario being: CPU1 CPU2 ================================================================== per_cpu(sd_llc_id, CPUX) => 0 partition_sched_domains_locked() detach_destroy_domains() cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX) per_cpu(sd_llc_id, CPUX) => 0 per_cpu(sd_llc_id, CPUX) = CPUX per_cpu(sd_llc_id, CPUX) => CPUX return false ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result is a warning triggered from ttwu_queue_wakelist(). Avoid a such race in cpus_share_cache() by always returning true when this_cpu == that_cpu. Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries") Reported-by: Jing-Ting Wu <[email protected]> Signed-off-by: Vincent Donnefort <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-11-11x86/smp: Factor out parts of native_smp_prepare_cpus()Boris Ostrovsky3-16/+15
Commit 66558b730f25 ("sched: Add cluster scheduler level for x86") introduced cpu_l2c_shared_map mask which is expected to be initialized by smp_op.smp_prepare_cpus(). That commit only updated native_smp_prepare_cpus() version but not xen_pv_smp_prepare_cpus(). As result Xen PV guests crash in set_cpu_sibling_map(). While the new mask can be allocated in xen_pv_smp_prepare_cpus() one can see that both versions of smp_prepare_cpus ops share a number of common operations that can be factored out. So do that instead. Fixes: 66558b730f25 ("sched: Add cluster scheduler level for x86") Signed-off-by: Boris Ostrovsky <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-11static_call,x86: Robustify trampoline patchingPeter Zijlstra3-4/+14
Add a few signature bytes after the static call trampoline and verify those bytes match before patching the trampoline. This avoids patching random other JMPs (such as CFI jump-table entries) instead. These bytes decode as: d: 53 push %rbx e: 43 54 rex.XB push %r12 And happen to spell "SCT". Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-11net: wwan: iosm: fix compilation warningM Chetan Kumar1-2/+0
curr_phase is unused. Removed the dead code. Fixes: 8d9be0634181 ("net: wwan: iosm: transport layer support for fw flashing/cd") Reported-by: kernel test robot <[email protected]> Signed-off-by: M Chetan Kumar <[email protected]> Reviewed-by: Loic Poulain <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-11cxgb4: fix eeprom len when diagnostics not implementedRahul Lakkireddy2-2/+7
Ensure diagnostics monitoring support is implemented for the SFF 8472 compliant port module and set the correct length for ethtool port module eeprom read. Fixes: f56ec6766dcf ("cxgb4: Add support for ethtool i2c dump") Signed-off-by: Manoj Malviya <[email protected]> Signed-off-by: Rahul Lakkireddy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-11drm/nouveau: hdmigv100.c: fix corrupted HDMI Vendor InfoFrameHans Verkuil1-1/+0
gv100_hdmi_ctrl() writes vendor_infoframe.subpack0_high to 0x6f0110, and then overwrites it with 0. Just drop the overwrite with 0, that's clearly a mistake. Because of this issue the HDMI VIC is 0 instead of 1 in the HDMI Vendor InfoFrame when transmitting 4kp30. Signed-off-by: Hans Verkuil <[email protected]> Fixes: 290ffeafcc1a ("drm/nouveau/disp/gv100: initial support") Reviewed-by: Ben Skeggs <[email protected]> Signed-off-by: Karol Herbst <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-11-11PCI/MSI: Destroy sysfs before freeing entriesThomas Gleixner1-5/+5
free_msi_irqs() frees the MSI entries before destroying the sysfs entries which are exposing them. Nothing prevents a concurrent free while a sysfs file is read and accesses the possibly freed entry. Move the sysfs release ahead of freeing the entries. Fixes: 1c51b50c2995 ("PCI/MSI: Export MSI mode using attributes, not kobjects") Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/87sfw5305m.ffs@tglx
2021-11-11PCI: Add MSI masking quirk for Nvidia ION AHCIMarc Zyngier1-0/+6
The ION AHCI device pretends that MSI masking isn't a thing, while it actually implements it and needs MSIs to be unmasked to work. Add a quirk to that effect. Reported-by: Rui Salvaterra <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Rui Salvaterra <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Bjorn Helgaas <[email protected]> Link: https://lore.kernel.org/r/CALjTZvbzYfBuLB+H=fj2J+9=DxjQ2Uqcy0if_PvmJ-nU-qEgkg@mail.gmail.com Link: https://lore.kernel.org/r/[email protected]
2021-11-11PCI/MSI: Deal with devices lying about their MSI mask capabilityMarc Zyngier2-0/+5
It appears that some devices are lying about their mask capability, pretending that they don't have it, while they actually do. The net result is that now that we don't enable MSIs on such endpoint. Add a new per-device flag to deal with this. Further patches will make use of it, sadly. Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Cc: Bjorn Helgaas <[email protected]>
2021-11-11PCI/MSI: Move non-mask check back into low level accessorsThomas Gleixner3-15/+17
The recent rework of PCI/MSI[X] masking moved the non-mask checks from the low level accessors into the higher level mask/unmask functions. This missed the fact that these accessors can be invoked from other places as well. The missing checks break XEN-PV which sets pci_msi_ignore_mask and also violates the virtual MSIX and the msi_attrib.maskbit protections. Instead of sprinkling checks all over the place, lift them back into the low level accessor functions. To avoid checking three different conditions combine them into one property of msi_desc::msi_attrib. [ josef: Fixed the missed conversion in the core code ] Fixes: fcacdfbef5a1 ("PCI/MSI: Provide a new set of mask and unmask functions") Reported-by: Josef Johansson <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Josef Johansson <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: [email protected]
2021-11-11drm/ttm: Double check mem_type of BO while evictionxinhui pan1-1/+2
BO might sit in a wrong lru list as there is a small period of memory moving and lru list updating. Lets skip eviction if we hit such mismatch. Suggested-by: Christian König <[email protected]> Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Christian König <[email protected]>
2021-11-11ata: sata_highbank: Remove unnecessary print function dev_err()Xu Wang1-3/+1
The print function dev_err() is redundant because platform_get_irq() already prints an error. Signed-off-by: Xu Wang <[email protected]> Signed-off-by: Damien Le Moal <[email protected]>
2021-11-11libata: fix read log timeout valueDamien Le Moal2-1/+9
Some ATA drives are very slow to respond to READ_LOG_EXT and READ_LOG_DMA_EXT commands issued from ata_dev_configure() when the device is revalidated right after resuming a system or inserting the ATA adapter driver (e.g. ahci). The default 5s timeout (ATA_EH_CMD_DFL_TIMEOUT) used for these commands is too short, causing errors during the device configuration. Ex: ... ata9: SATA max UDMA/133 abar m524288@0x9d200000 port 0x9d200400 irq 209 ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300) ata9.00: ATA-9: XXX XXXXXXXXXXXXXXX, XXXXXXXX, max UDMA/133 ata9.00: qc timeout (cmd 0x2f) ata9.00: Read log page 0x00 failed, Emask 0x4 ata9.00: Read log page 0x00 failed, Emask 0x40 ata9.00: NCQ Send/Recv Log not supported ata9.00: Read log page 0x08 failed, Emask 0x40 ata9.00: 27344764928 sectors, multi 16: LBA48 NCQ (depth 32), AA ata9.00: Read log page 0x00 failed, Emask 0x40 ata9.00: ATA Identify Device Log not supported ata9.00: failed to set xfermode (err_mask=0x40) ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300) ata9.00: configured for UDMA/133 ... The timeout error causes a soft reset of the drive link, followed in most cases by a successful revalidation as that give enough time to the drive to become fully ready to quickly process the read log commands. However, in some cases, this also fails resulting in the device being dropped. Fix this by using adding the ata_eh_revalidate_timeouts entries for the READ_LOG_EXT and READ_LOG_DMA_EXT commands. This defines a timeout increased to 15s, retriable one time. Reported-by: Geert Uytterhoeven <[email protected]> Tested-by: Geert Uytterhoeven <[email protected]> Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]>
2021-11-10net: fix premature exit from NAPI state polling in napi_disable()Alexander Lobakin1-2/+5
Commit 719c57197010 ("net: make napi_disable() symmetric with enable") accidentally introduced a bug sometimes leading to a kernel BUG when bringing an iface up/down under heavy traffic load. Prior to this commit, napi_disable() was polling n->state until none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then always flip them. Now there's a possibility to get away with the NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg() call with an uninitialized variable, rather than straight to another round of the state check. Error path looks like: napi_disable(): unsigned long val, new; /* new is uninitialized */ do { val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or NAPIF_STATE_SCHED is set */ if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */ usleep_range(20, 200); continue; /* go straight to the condition check */ } new = val | <...> } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg() writes garbage */ napi_enable(): do { val = READ_ONCE(n->state); BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */ <...> while the typical BUG splat is like: [ 172.652461] ------------[ cut here ]------------ [ 172.652462] kernel BUG at net/core/dev.c:6937! [ 172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G I 5.15.0 #42 [ 172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021 [ 172.680646] RIP: 0010:napi_enable+0x5a/0xd0 [ 172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48 [ 172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246 [ 172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0 < snip > [ 172.782403] Call Trace: [ 172.784857] <TASK> [ 172.786963] ice_up_complete+0x6f/0x210 [ice] [ 172.791349] ice_xdp+0x136/0x320 [ice] [ 172.795108] ? ice_change_mtu+0x180/0x180 [ice] [ 172.799648] dev_xdp_install+0x61/0xe0 [ 172.803401] dev_xdp_attach+0x1e0/0x550 [ 172.807240] dev_change_xdp_fd+0x1e6/0x220 [ 172.811338] do_setlink+0xee8/0x1010 [ 172.814917] rtnl_setlink+0xe5/0x170 [ 172.818499] ? bpf_lsm_binder_set_context_mgr+0x10/0x10 [ 172.823732] ? security_capable+0x36/0x50 < snip > Fix this by replacing 'do { } while (cmpxchg())' with an "infinite" for-loop with an explicit break. From v1 [0]: - just use a for-loop to simplify both the fix and the existing code (Eric). [0] https://lore.kernel.org/netdev/[email protected] Fixes: 719c57197010 ("net: make napi_disable() symmetric with enable") Suggested-by: Eric Dumazet <[email protected]> # for-loop Signed-off-by: Alexander Lobakin <[email protected]> Reviewed-by: Jesse Brandeburg <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-10Merge tag 'ext4_for_linus' of ↵Linus Torvalds8-250/+300
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Only bug fixes and cleanups for ext4 this merge window. Of note are fixes for the combination of the inline_data and fast_commit fixes, and more accurately calculating when to schedule additional lazy inode table init, especially when CONFIG_HZ is 100HZ" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix error code saved on super block during file system abort ext4: inline data inode fast commit replay fixes ext4: commit inline data during fast commit ext4: scope ret locally in ext4_try_to_trim_range() ext4: remove an unused variable warning with CONFIG_QUOTA=n ext4: fix boolreturn.cocci warnings in fs/ext4/name.c ext4: prevent getting empty inode buffer ext4: move ext4_fill_raw_inode() related functions ext4: factor out ext4_fill_raw_inode() ext4: prevent partial update of the extent blocks ext4: check for inconsistent extents between index and leaf block ext4: check for out-of-order index extents in ext4_valid_extent_entries() ext4: convert from atomic_t to refcount_t on ext4_io_end->count ext4: refresh the ext4_ext_path struct after dropping i_data_sem. ext4: ensure enough credits in ext4_ext_shift_path_extents ext4: correct the left/middle/right debug message for binsearch ext4: fix lazy initialization next schedule time computation in more granular unit Revert "ext4: enforce buffer head state assertion in ext4_da_map_blocks"
2021-11-10Merge tag 'for-5.16-deadlock-fix-tag' of ↵Linus Torvalds1-16/+123
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "Fix for a deadlock when direct/buffered IO is done on a mmaped file and a fault happens (details in the patch). There's a fstest generic/647 that triggers the problem and makes testing hard" * tag 'for-5.16-deadlock-fix-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix deadlock due to page faults during direct IO reads and writes
2021-11-10Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linuxLinus Torvalds43-1050/+1155
Pull nfsd updates from Bruce Fields: "A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping support for a filehandle format deprecated 20 years ago, and further xdr-related cleanup from Chuck" * tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits) nfsd4: remove obselete comment nfsd: document server-to-server-copy parameters NFSD:fix boolreturn.cocci warning nfsd: update create verifier comment SUNRPC: Change return value type of .pc_encode SUNRPC: Replace the "__be32 *p" parameter to .pc_encode NFSD: Save location of NFSv4 COMPOUND status SUNRPC: Change return value type of .pc_decode SUNRPC: Replace the "__be32 *p" parameter to .pc_decode SUNRPC: De-duplicate .pc_release() call sites SUNRPC: Simplify the SVC dispatch code path SUNRPC: Capture value of xdr_buf::page_base SUNRPC: Add trace event when alloc_pages_bulk() makes no progress svcrdma: Split svcrmda_wc_{read,write} tracepoints svcrdma: Split the svcrdma_wc_send() tracepoint svcrdma: Split the svcrdma_wc_receive() tracepoint NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment() SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases NFSD: Initialize pointer ni with NULL and not plain integer 0 NFSD: simplify struct nfsfh ...
2021-11-10Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds59-1814/+2011
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN - Optimisations for READDIR and the 'ls -l' style workload - Further replacements of dprintk() with tracepoints and other tracing improvements - Ensure we re-probe NFSv4 server capabilities when the user does a "mount -o remount" Bugfixes: - Fix an Oops in pnfs_mark_request_commit() - Fix up deadlocks in the commit code - Fix regressions in NFSv2/v3 attribute revalidation due to the change_attr_type optimisations - Fix some dentry verifier races - Fix some missing dentry verifier settings - Fix a performance regression in nfs_set_open_stateid_locked() - SUNRPC was sending multiple SYN calls when re-establishing a TCP connection. - Fix multiple NFSv4 issues due to missing sanity checking of server return values - Fix a potential Oops when FREE_STATEID races with an unmount Cleanups: - Clean up the labelled NFS code - Remove unused header <linux/pnfs_osd_xdr.h>" * tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits) NFSv4: Sanity check the parameters in nfs41_update_target_slotid() NFS: Remove the nfs4_label argument from decode_getattr_*() functions NFS: Remove the nfs4_label argument from nfs_setsecurity NFS: Remove the nfs4_label argument from nfs_fhget() NFS: Remove the nfs4_label argument from nfs_add_or_obtain() NFS: Remove the nfs4_label argument from nfs_instantiate() NFS: Remove the nfs4_label from the nfs_setattrres NFS: Remove the nfs4_label from the nfs4_getattr_res NFS: Remove the f_label from the nfs4_opendata and nfs_openres NFS: Remove the nfs4_label from the nfs4_lookupp_res struct NFS: Remove the label from the nfs4_lookup_res struct NFS: Remove the nfs4_label from the nfs4_link_res struct NFS: Remove the nfs4_label from the nfs4_create_res struct NFS: Remove the nfs4_label from the nfs_entry struct NFS: Create a new nfs_alloc_fattr_with_label() function NFS: Always initialise fattr->label in nfs_fattr_alloc() NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open() NFSv4: Remove unnecessary 'minor version' check NFSv4: Fix potential Oops in decode_op_map() ...
2021-11-10Merge branch 'exit-cleanups-for-v5.16' of ↵Linus Torvalds47-96/+97
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exit cleanups from Eric Biederman: "While looking at some issues related to the exit path in the kernel I found several instances where the code is not using the existing abstractions properly. This set of changes introduces force_fatal_sig a way of sending a signal and not allowing it to be caught, and corrects the misuse of the existing abstractions that I found. A lot of the misuse of the existing abstractions are silly things such as doing something after calling a no return function, rolling BUG by hand, doing more work than necessary to terminate a kernel thread, or calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL). In the review a deficiency in force_fatal_sig and force_sig_seccomp where ptrace or sigaction could prevent the delivery of the signal was found. I have added a change that adds SA_IMMUTABLE to change that makes it impossible to interrupt the delivery of those signals, and allows backporting to fix force_sig_seccomp And Arnd found an issue where a function passed to kthread_run had the wrong prototype, and after my cleanup was failing to build." * 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits) soc: ti: fix wkup_m3_rproc_boot_thread return type signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV) exit/r8188eu: Replace the macro thread_exit with a simple return 0 exit/rtl8712: Replace the macro thread_exit with a simple return 0 exit/rtl8723bs: Replace the macro thread_exit with a simple return 0 signal/x86: In emulate_vsyscall force a signal instead of calling do_exit signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails exit/syscall_user_dispatch: Send ordinary signals on failure signal: Implement force_fatal_sig exit/kthread: Have kernel threads return instead of calling do_exit signal/s390: Use force_sigsegv in default_trap_handler signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved. signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON signal/sparc: In setup_tsb_params convert open coded BUG into BUG signal/powerpc: On swapcontext failure force SIGSEGV signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT signal/sparc32: Remove unreachable do_exit in do_sparc_fault ...
2021-11-11Merge tag 'amd-drm-fixes-5.16-2021-11-10' of ↵Dave Airlie32-79/+205
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-fixes-5.16-2021-11-10: amdgpu: - Don't allow partial copy from user for DC debugfs - SRIOV fixes - GFX9 CSB pin count fix - Various IP version check fixes - DP 2.0 fixes - Limit DCN1 MPO fix to DCN1 amdkfd: - SVM fixes - Reset fixes Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-11-10Merge tag 'kernel.sys.v5.16' of ↵Linus Torvalds3-2/+10
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull prctl updates from Christian Brauner: "This contains the missing prctl uapi pieces for PR_SCHED_CORE. In order to activate core scheduling the caller is expected to specify the scope of the new core scheduling domain. For example, passing 2 in the 4th argument of prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, <pid>, 2, 0); would indicate that the new core scheduling domain encompasses all tasks in the process group of <pid>. Specifying 0 would only create a core scheduling domain for the thread identified by <pid> and 2 would encompass the whole thread-group of <pid>. Note, the values 0, 1, and 2 correspond to PIDTYPE_PID, PIDTYPE_TGID, and PIDTYPE_PGID. A first version tried to expose those values directly to which I objected because: - PIDTYPE_* is an enum that is kernel internal which we should not expose to userspace directly. - PIDTYPE_* indicates what a given struct pid is used for it doesn't express a scope. But what the 4th argument of PR_SCHED_CORE prctl() expresses is the scope of the operation, i.e. the scope of the core scheduling domain at creation time. So Eugene's patch now simply introduces three new defines PR_SCHED_CORE_SCOPE_THREAD, PR_SCHED_CORE_SCOPE_THREAD_GROUP, and PR_SCHED_CORE_SCOPE_PROCESS_GROUP. They simply express what happens. This has been on the mailing list for quite a while with all relevant scheduler folks Cced. I announced multiple times that I'd pick this up if I don't see or her anyone else doing it. None of this touches proper scheduler code but only concerns uapi so I think this is fine. With core scheduling being quite common now for vm managers (e.g. moving individual vcpu threads into their own core scheduling domain) and container managers (e.g. moving the init process into its own core scheduling domain and letting all created children inherit it) having to rely on raw numbers passed as the 4th argument in prctl() is a bit annoying and everyone is starting to come up with their own defines" * tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
2021-11-10Merge tag 'pidfd.v5.16' of ↵Linus Torvalds4-24/+43
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd updates from Christian Brauner: "Various places in the kernel have picked up pidfds. The two most recent additions have probably been the ability to use pidfds in bpf maps and the usage of pidfds in mm-based syscalls such as process_mrelease() and process_madvise(). The same pattern to turn a pidfd into a struct task exists in two places. One of those places used PIDTYPE_TGID while the other one used PIDTYPE_PID even though it is clearly documented in all pidfd-helpers that pidfds __currently__ only refer to thread-group leaders (subject to change in the future if need be). This isn't a bug per se but has the potential to be one if we allow pidfds to refer to individual threads. If that happens we want to audit all codepaths that make use of them to ensure they can deal with pidfds refering to individual threads. This adds a simple helper to turn a pidfd into a struct task making it easy to grep for such places. Plus, it gets rid of code-duplication" * tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: mm: use pidfd_get_task() pid: add pidfd_get_task() helper
2021-11-10smb3: remove trivial dfs compile warningSteve French1-2/+2
Fix warning caused by recent changes to the dfs code: symbol 'tree_connect_dfs_target' was not declared. Should it be static? Reviewed-by: Paulo Alcantara (SUSE) <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-11-10cifs: support nested dfs links over reconnectPaulo Alcantara9-693/+660
Mounting a dfs link that has nested links was already supported at mount(2), so make it work over reconnect as well. Make the following case work: * mount //root/dfs/link /mnt -o ... - final share: /server/share * in server settings - change target folder of /root/dfs/link3 to /server/share2 - change target folder of /root/dfs/link2 to /root/dfs/link3 - change target folder of /root/dfs/link to /root/dfs/link2 * mount -o remount,... /mnt - refresh all dfs referrals - mark current connection for failover - cifs_reconnect() reconnects to root server - tree_connect() * checks that /root/dfs/link2 is a link, then chase it * checks that root/dfs/link3 is a link, then chase it * finally tree connect to /server/share2 If the mounted share is no longer accessible and a reconnect had been triggered, the client will retry it from both last referral path (/root/dfs/link3) and original referral path (/root/dfs/link). Any new referral paths found while chasing dfs links over reconnect, it will be updated to TCP_Server_Info::leaf_fullpath, accordingly. Signed-off-by: Paulo Alcantara (SUSE) <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-11-10smb3: do not error on fsync when readonlySteve French1-6/+29
Linux allows doing a flush/fsync on a file open for read-only, but the protocol does not allow that. If the file passed in on the flush is read-only try to find a writeable handle for the same inode, if that is not possible skip sending the fsync call to the server to avoid breaking the apps. Reported-by: Julian Sikorski <[email protected]> Tested-by: Julian Sikorski <[email protected]> Suggested-by: Jeremy Allison <[email protected]> Reviewed-by: Paulo Alcantara (SUSE) <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-11-11Merge tag 'drm-misc-next-fixes-2021-11-10' of ↵Dave Airlie14-193/+23
git://anongit.freedesktop.org/drm/drm-misc into drm-next Removed the TTM Huge Page functionnality to address a crash, a timeout fix for udl, CONFIG_FB dependency improvements, a fix for a circular locking depency in imx, a NULL pointer dereference fix for virtio, and a naming collision fix for drm/locking. Signed-off-by: Dave Airlie <[email protected]> From: Maxime Ripard <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/20211110082114.vfpkpnecwdfg27lk@gilmour
2021-11-10ALSA: hda: fix general protection fault in azx_runtime_idleKai Vehmanen1-0/+1
Fix a corner case between PCI device driver remove callback and runtime PM idle callback. Following sequence of events can happen: - at azx_create, context is allocated with devm_kzalloc() and stored as pci_set_drvdata() - user-space requests to unbind audio driver - dd.c:__device_release_driver() calls PCI remove - pci-driver.c:pci_device_remove() calls the audio driver azx_remove() callback and this is completed - pci-driver.c:pm_runtime_put_sync() leads to a call to rpm_idle() which again calls azx_runtime_idle() - the azx context object, as returned by dev_get_drvdata(), is no longer valid -> access fault in azx_runtime_idle when executing struct snd_card *card = dev_get_drvdata(dev); chip = card->private_data; if (chip->disabled || hda->init_failed) This was discovered by i915_module_load test with 5.15.0 based linux-next tree. Example log caught by i915_module_load test with linux-next https://intel-gfx-ci.01.org/tree/linux-next/ <4> [264.038232] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b73f0: 0000 [#1] PREEMPT SMP NOPTI <4> [264.038248] CPU: 0 PID: 5374 Comm: i915_module_loa Not tainted 5.15.0-next-20211109-gc8109c2ba35e-next-20211109 #1 [...] <4> [264.038267] RIP: 0010:azx_runtime_idle+0x12/0x60 [snd_hda_intel] [...] <4> [264.038355] Call Trace: <4> [264.038359] <TASK> <4> [264.038362] __rpm_callback+0x3d/0x110 <4> [264.038371] rpm_idle+0x27f/0x380 <4> [264.038376] __pm_runtime_idle+0x3b/0x100 <4> [264.038382] pci_device_remove+0x6d/0xa0 <4> [264.038388] device_release_driver_internal+0xef/0x1e0 <4> [264.038395] unbind_store+0xeb/0x120 <4> [264.038400] kernfs_fop_write_iter+0x11a/0x1c0 Fix the issue by setting drvdata to NULL at end of azx_remove(). Signed-off-by: Kai Vehmanen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>
2021-11-10afs: Use folios in directory handlingDavid Howells2-209/+174
Convert the AFS directory handling code to use folios. With these changes, afs passes -g quick xfstests. Signed-off-by: David Howells <[email protected]> Tested-by: [email protected] cc: Matthew Wilcox (Oracle) <[email protected]> cc: Jeff Layton <[email protected]> cc: Marc Dionne <[email protected]> cc: [email protected] cc: [email protected] Link: https://lore.kernel.org/r/162877312172.3085614.992850861791211206.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/162981154845.1901565.2078707403143240098.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/163005746215.2472992.8321380998443828308.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163584190457.4023316.10544419117563104940.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/CAH2r5mtECQA6K_OGgU=_G8qLY3G-6-jo1odVyF9EK+O2-EWLFg@mail.gmail.com/ # v3 Link: https://lore.kernel.org/r/163649330345.309189.11182522282723655658.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/163657854055.834781.5800946340537517009.stgit@warthog.procyon.org.uk/ # v5
2021-11-10netfs, 9p, afs, ceph: Use foliosDavid Howells9-416/+428
Convert the netfs helper library to use folios throughout, convert the 9p and afs filesystems to use folios in their file I/O paths and convert the ceph filesystem to use just enough folios to compile. With these changes, afs passes -g quick xfstests. Changes ======= ver #5: - Got rid of folio_end{io,_read,_write}() and inlined the stuff it does instead (Willy decided he didn't want this after all). ver #4: - Fixed a bug in afs_redirty_page() whereby it didn't set the next page index in the loop and returned too early. - Simplified a check in v9fs_vfs_write_folio_locked()[1]. - Undid a change to afs_symlink_readpage()[1]. - Used offset_in_folio() in afs_write_end()[1]. - Changed from using page_endio() to folio_end{io,_read,_write}()[1]. ver #2: - Add 9p foliation. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Tested-by: Jeff Layton <[email protected]> Tested-by: Dominique Martinet <[email protected]> Tested-by: [email protected] cc: Matthew Wilcox (Oracle) <[email protected]> cc: Marc Dionne <[email protected]> cc: Ilya Dryomov <[email protected]> cc: Dominique Martinet <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] Link: https://lore.kernel.org/r/YYKa3bfQZxK5/[email protected]/ [1] Link: https://lore.kernel.org/r/[email protected]/ # rfc Link: https://lore.kernel.org/r/162877311459.3085614.10601478228012245108.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/162981153551.1901565.3124454657133703341.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/163005745264.2472992.9852048135392188995.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163584187452.4023316.500389675405550116.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/163649328026.309189.1124218109373941936.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/163657852454.834781.9265101983152100556.stgit@warthog.procyon.org.uk/ # v5
2021-11-10folio: Add a function to get the host inode for a folioDavid Howells2-1/+15
Add a convenience function, folio_inode() that will get the host inode from a folio's mapping. Changes: ver #3: - Fix mistake in function description[2]. ver #2: - Fix contradiction between doc and implementation by disallowing use with swap caches[1]. Signed-off-by: David Howells <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Jeff Layton <[email protected]> Tested-by: Dominique Martinet <[email protected]> Tested-by: [email protected] Link: https://lore.kernel.org/r/[email protected]/ [1] Link: https://lore.kernel.org/r/[email protected]/ [2] Link: https://lore.kernel.org/r/162880453171.3369675.3704943108660112470.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162981151155.1901565.7010079316994382707.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/163005744370.2472992.18324470937328925723.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163584184628.4023316.9386282630968981869.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/163649325519.309189.15072332908703129455.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/163657850401.834781.1031963517399283294.stgit@warthog.procyon.org.uk/ # v5
2021-11-10folio: Add a function to change the private data attached to a folioDavid Howells1-0/+19
Add a function, folio_change_private(), that will change the private data attached to a folio, without the need to twiddle the private bit or the refcount. It assumes that folio_add_private() has already been called on the page. Signed-off-by: David Howells <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Jeff Layton <[email protected]> Tested-by: Dominique Martinet <[email protected]> Tested-by: [email protected] Link: https://lore.kernel.org/r/162981149911.1901565.17776700811659843340.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/163005743485.2472992.5100702469503007023.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163584180781.4023316.5037526301198034310.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/163649324326.309189.17817587229450840783.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/163657848531.834781.14269656212269187893.stgit@warthog.procyon.org.uk/ # v5
2021-11-10Merge tag 'thermal-5.16-rc1-2' of ↵Linus Torvalds5-20/+27
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more thermal control updates from Rafael Wysocki: "These fix two issues in the thermal core and one in the int340x thermal driver. Specifics: - Replace pr_warn() with pr_warn_once() in user_space_bind() to reduce kernel log noise (Rafael Wysocki). - Extend the RFIM mailbox interface in the int340x thermal driver to return 64 bit values to allow all values returned by the hardware to be handled correctly (Srinivas Pandruvada). - Fix possible NULL pointer dereferences in the of_thermal_ family of functions (Subbaraman Narayanamurthy)" * tag 'thermal-5.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thermal: Replace pr_warn() with pr_warn_once() in user_space_bind() thermal: Fix NULL pointer dereferences in of_thermal_ functions thermal/drivers/int340x: processor_thermal: Suppot 64 bit RFIM responses
2021-11-10Merge tag 'pm-5.16-rc1-2' of ↵Linus Torvalds9-108/+229
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These fix three intel_pstate driver regressions, fix locking in the core code suspending and resuming devices during system PM transitions, fix the handling of cpuidle drivers based on runtime PM during system-wide suspend, fix two issues in the operating performance points (OPP) framework and resource-managed helpers to it. Specifics: - Fix two intel_pstate driver regressions related to the HWP interrupt handling added recently (Srinivas Pandruvada). - Fix intel_pstate driver regression introduced during the 5.11 cycle and causing HWP desired performance to be mishandled in some cases when switching driver modes and during system suspend and shutdown (Rafael Wysocki). - Fix system-wide device suspend and resume locking to avoid deadlocks when device objects are deleted during a system-wide PM transition (Rafael Wysocki). - Modify system-wide suspend of devices to prevent cpuidle drivers based on runtime PM from misbehaving during the "no IRQ" phase of it (Ulf Hansson). - Fix return value of _opp_add_static_v2() helper (YueHaibing). - Fix required-opp handle count (Pavankumar Kondeti). - Add resource managed OPP helpers, update dev_pm_opp_attach_genpd(), update their devfreq users, and make minor DT binding change (Dmitry Osipenko)" * tag 'pm-5.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: sleep: Avoid calling put_device() under dpm_list_mtx cpufreq: intel_pstate: Clear HWP Status during HWP Interrupt enable cpufreq: intel_pstate: Fix unchecked MSR 0x773 access cpufreq: intel_pstate: Clear HWP desired on suspend/shutdown and offline PM: sleep: Fix runtime PM based cpuidle support dt-bindings: opp: Allow multi-worded OPP entry name opp: Fix return in _opp_add_static_v2() PM / devfreq: tegra30: Check whether clk_round_rate() returns zero rate PM / devfreq: tegra30: Use resource-managed helpers PM / devfreq: Add devm_devfreq_add_governor() opp: Add more resource-managed variants of dev_pm_opp_of_add_table() opp: Change type of dev_pm_opp_attach_genpd(names) argument opp: Fix required-opps phandle array count check