aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-09-09Merge tag 'trace-v5.15-2' of ↵Linus Torvalds14-39/+122
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing updates from Steven Rostedt: - Add migrate-disable counter to tracing header - Fix error handling in event probes - Fix missed unlock in osnoise in error path - Fix merge issue with tools/bootconfig - Clean up bootconfig data when init memory is removed - Fix bootconfig to loop only on subkeys - Have kernel command lines override bootconfig options - Increase field counts for synthetic events - Have histograms dynamic allocate event elements to save space - Fixes in testing and documentation * tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/boot: Fix to loop on only subkeys selftests/ftrace: Exclude "(fault)" in testing add/remove eprobe events tracing: Dynamically allocate the per-elt hist_elt_data array tracing: synth events: increase max fields count tools/bootconfig: Show whole test command for each test case bootconfig: Fix missing return check of xbc_node_compose_key function tools/bootconfig: Fix tracing_on option checking in ftrace2bconf.sh docs: bootconfig: Add how to use bootconfig for kernel parameters init/bootconfig: Reorder init parameter from bootconfig and cmdline init: bootconfig: Remove all bootconfig data when the init memory is removed tracing/osnoise: Fix missed cpus_read_unlock() in start_per_cpu_kthreads() tracing: Fix some alloc_event_probe() error handling bugs tracing: Add migrate-disabled counter to tracing output.
2021-09-09Merge tag 's390-5.15-2' of ↵Linus Torvalds38-562/+139
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Heiko Carstens: "Except for the xpram device driver removal it is all about fixes and cleanups. - Fix topology update on cpu hotplug, so notifiers see expected masks. This bug was uncovered with SCHED_CORE support. - Fix stack unwinding so that the correct number of entries are omitted like expected by common code. This fixes KCSAN selftests. - Add kmemleak annotation to stack_alloc to avoid false positive kmemleak warnings. - Avoid layering violation in common I/O code and don't unregister subchannel from child-drivers. - Remove xpram device driver for which no real use case exists since the kernel is 64 bit only. Also all hypervisors got required support removed in the meantime, which means the xpram device driver is dead code. - Fix -ENODEV handling of clp_get_state in our PCI code. - Enable KFENCE in debug defconfig. - Cleanup hugetlbfs s390 specific Kconfig dependency. - Quite a lot of trivial fixes to get rid of "W=1" warnings, and and other simple cleanups" * tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: hugetlbfs: s390 is always 64bit s390/ftrace: remove incorrect __va usage s390/zcrypt: remove incorrect kernel doc indicators scsi: zfcp: fix kernel doc comments s390/sclp: add __nonstring annotation s390/hmcdrv_ftp: fix kernel doc comment s390: remove xpram device driver s390/pci: read clp_list_pci_req only once s390/pci: fix clp_get_state() handling of -ENODEV s390/cio: fix kernel doc comment s390/ctrlchar: fix kernel doc comment s390/con3270: use proper type for tasklet function s390/cpum_cf: move array from header to C file s390/mm: fix kernel doc comments s390/topology: fix topology information when calling cpu hotplug notifiers s390/unwind: use current_frame_address() to unwind current task s390/configs: enable CONFIG_KFENCE in debug_defconfig s390/entry: make oklabel within CHKSTG macro local s390: add kmemleak annotation in stack_alloc() s390/cio: dont unregister subchannel from child-drivers
2021-09-09Merge branch 'work.gfs2' of ↵Linus Torvalds3-21/+35
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull gfs2 setattr updates from Al Viro: "Make it possible for filesystems to use a generic 'may_setattr()' and switch gfs2 to using it" * 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: gfs2: Switch to may_setattr in gfs2_setattr fs: Move notify_change permission checks into may_setattr
2021-09-09Merge branch 'work.init' of ↵Linus Torvalds3-36/+83
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull root filesystem type handling updates from Al Viro: "Teach init/do_mounts.c to handle non-block filesystems, hopefully preventing even more special-cased kludges (such as root=/dev/nfs, etc)" * 'work.init' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: simplify get_filesystem_list / get_all_fs_names init: allow mounting arbitrary non-blockdevice filesystems as root init: split get_fs_names
2021-09-09Merge branch 'work.iov_iter' of ↵Linus Torvalds2-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter fixes from Al Viro: "Fixes for io-uring handling of iov_iter reexpands" * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: io_uring: reexpand under-reexpanded iters iov_iter: track truncated size
2021-09-09Merge tag 'cxl-for-5.15' of ↵Linus Torvalds19-889/+1352
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull CXL (Compute Express Link) updates from Dan Williams: - Fix detection of CXL host bridges to filter out disabled ACPI0016 devices in the ACPI DSDT. - Fix kernel lockdown integration to disable raw commands when raw PCI access is disabled. - Fix a broken debug message. - Add support for "Get Partition Info". I.e. enumerate the split between volatile and persistent capacity on bi-modal CXL memory expanders. - Re-factor the core by subject area. This is a work in progress. - Prepare libnvdimm to understand CXL labels in addition to EFI labels. This is a work in progress. * tag 'cxl-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (25 commits) cxl/registers: Fix Documentation warning cxl/pmem: Fix Documentation warning cxl/uapi: Fix defined but not used warnings cxl/pci: Fix debug message in cxl_probe_regs() cxl/pci: Fix lockdown level cxl/acpi: Do not add DSDT disabled ACPI0016 host bridge ports libnvdimm/labels: Add claim class helpers libnvdimm/labels: Add type-guid helpers libnvdimm/labels: Add blk special cases for nlabel and position helpers libnvdimm/labels: Add blk isetcookie set / validation helpers libnvdimm/labels: Add a checksum calculation helper libnvdimm/labels: Introduce label setter helpers libnvdimm/labels: Add isetcookie validation helper libnvdimm/labels: Introduce getters for namespace label fields cxl/mem: Adjust ram/pmem range to represent DPA ranges cxl/mem: Account for partitionable space in ram/pmem ranges cxl/pci: Store memory capacity values cxl/pci: Simplify register setup cxl/pci: Ignore unknown register block types cxl/core: Move memdev management to core ...
2021-09-09Merge tag 'libnvdimm-for-5.15' of ↵Linus Torvalds10-172/+120
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Fix a race condition in the teardown path of raw mode pmem namespaces. - Cleanup the code that filesystems use to detect filesystem-dax capabilities of their underlying block device. * tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: remove bdev_dax_supported xfs: factor out a xfs_buftarg_is_dax helper dax: stub out dax_supported for !CONFIG_FS_DAX dax: remove __generic_fsdax_supported dax: move the dax_read_lock() locking into dax_supported dax: mark dax_get_by_host static dm: use fs_dax_get_by_bdev instead of dax_get_by_host dax: stop using bdevname fsdax: improve the FS_DAX Kconfig description and help text libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind
2021-09-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds4-6/+8
Pull rdma fixes from Jason Gunthorpe: "I don't usually send a second PR in the merge window, but the fix to mlx5 is significant enough that it should start going through the process ASAP. Along with it comes some of the usual -rc stuff that would normally wait for a -rc2 or so. Summary: Important error case regression fixes in mlx5: - Wrong size used when computing the error path smaller allocation request leads to corruption - Confusing but ultimately harmless alignment mis-calculation Static checker warning fixes: - NULL pointer subtraction in qib - kcalloc in bnxt_re - Missing static on global variable in hfi1" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: IB/hfi1: make hist static RDMA/bnxt_re: Prefer kcalloc over open coded arithmetic IB/qib: Fix null pointer subtraction compiler warning RDMA/mlx5: Fix xlt_chunk_align calculation RDMA/mlx5: Fix number of allocated XLT entries
2021-09-09Merge tag 'dmaengine-5.15-rc1' of ↵Linus Torvalds54-928/+4080
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine updates from Vinod Koul: "New drivers/devices - Support for Renesas RZ/G2L dma controller - New driver for AMD PTDMA controller Updates: - Big pile of idxd updates - Updates for Altera driver, stm32-dma, dw etc" * tag 'dmaengine-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (83 commits) dmaengine: sh: fix some NULL dereferences dmaengine: sh: Fix unused initialization of pointer lmdesc MAINTAINERS: Fix AMD PTDMA DRIVER entry dmaengine: ptdma: remove PT_OFFSET to avoid redefnition dmaengine: ptdma: Add debugfs entries for PTDMA dmaengine: ptdma: register PTDMA controller as a DMA resource dmaengine: ptdma: Initial driver for the AMD PTDMA dmaengine: fsl-dpaa2-qdma: Fix spelling mistake "faile" -> "failed" dmaengine: idxd: remove interrupt disable for dev_lock dmaengine: idxd: remove interrupt disable for cmd_lock dmaengine: idxd: fix setting up priv mode for dwq dmaengine: xilinx_dma: Set DMA mask for coherent APIs dmaengine: ti: k3-psil-j721e: Add entry for CSI2RX dmaengine: sh: Add DMAC driver for RZ/G2L SoC dmaengine: Extend the dma_slave_width for 128 bytes dt-bindings: dma: Document RZ/G2L bindings dmaengine: ioat: depends on !UML dmaengine: idxd: set descriptor allocation size to threshold for swq dmaengine: idxd: make submit failure path consistent on desc freeing dmaengine: idxd: remove interrupt flag for completion list spinlock ...
2021-09-09arm64: mm: limit linear region to 51 bits for KVM in nVHE modeArd Biesheuvel1-1/+15
KVM in nVHE mode divides up its VA space into two equal halves, and picks the half that does not conflict with the HYP ID map to map its linear region. This worked fine when the kernel's linear map itself was guaranteed to cover precisely as many bits of VA space, but this was changed by commit f4693c2716b35d08 ("arm64: mm: extend linear region for 52-bit VA configurations"). The result is that, depending on the placement of the ID map, kernel-VA to hyp-VA translations may produce addresses that either conflict with other HYP mappings (including the ID map itself) or generate addresses outside of the 52-bit addressable range, neither of which is likely to lead to anything useful. Given that 52-bit capable cores are guaranteed to implement VHE, this only affects configurations such as pKVM where we opt into non-VHE mode even if the hardware is VHE capable. So just for these configurations, let's limit the kernel linear map to 51 bits and work around the problem. Fixes: f4693c2716b3 ("arm64: mm: extend linear region for 52-bit VA configurations") Signed-off-by: Ard Biesheuvel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
2021-09-09io_uring: fail links of cancelled timeoutsPavel Begunkov1-0/+2
When we cancel a timeout we should mark it with REQ_F_FAIL, so linked requests are cancelled as well, but not queued for further execution. Cc: [email protected] Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.1631192135.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-09-09thermal/drivers/qcom/spmi-adc-tm5: Don't abort probing if a sensor is not usedMatthias Kaehlcke1-0/+6
adc_tm5_register_tzd() registers the thermal zone sensors for all channels of the thermal monitor. If the registration of one channel fails the function skips the processing of the remaining channels and returns an error, which results in _probe() being aborted. One of the reasons the registration could fail is that none of the thermal zones is using the channel/sensor, which hardly is a critical error (if it is an error at all). If this case is detected emit a warning and continue with processing the remaining channels. Fixes: ca66dca5eda6 ("thermal: qcom: add support for adc-tm5 PMIC thermal monitor") Signed-off-by: Matthias Kaehlcke <[email protected]> Reported-by: Stephen Boyd <[email protected]> Reviewed-by: Stephen Boyd <[email protected]> Reviewed-by: Dmitry Baryshkov <[email protected]> Signed-off-by: Daniel Lezcano <[email protected]> Link: https://lore.kernel.org/r/20210823134726.1.I1dd23ddf77e5b3568625d80d6827653af071ce19@changeid
2021-09-09thermal/drivers/intel: Allow processing of HWP interruptSrinivas Pandruvada2-1/+9
Add a weak function to process HWP (Hardware P-states) notifications and move updating HWP_STATUS MSR to this function. This allows HWP interrupts to be processed by the intel_pstate driver in HWP mode by overriding the implementation. Signed-off-by: Srinivas Pandruvada <[email protected]> Acked-by: Zhang Rui <[email protected]> Signed-off-by: Daniel Lezcano <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-09-09io-wq: fix memory leak in create_io_worker()Qiang.zhang1-0/+3
BUG: memory leak unreferenced object 0xffff888126fcd6c0 (size 192): comm "syz-executor.1", pid 11934, jiffies 4294983026 (age 15.690s) backtrace: [<ffffffff81632c91>] kmalloc_node include/linux/slab.h:609 [inline] [<ffffffff81632c91>] kzalloc_node include/linux/slab.h:732 [inline] [<ffffffff81632c91>] create_io_worker+0x41/0x1e0 fs/io-wq.c:739 [<ffffffff8163311e>] io_wqe_create_worker fs/io-wq.c:267 [inline] [<ffffffff8163311e>] io_wqe_enqueue+0x1fe/0x330 fs/io-wq.c:866 [<ffffffff81620b64>] io_queue_async_work+0xc4/0x200 fs/io_uring.c:1473 [<ffffffff8162c59c>] __io_queue_sqe+0x34c/0x510 fs/io_uring.c:6933 [<ffffffff8162c7ab>] io_req_task_submit+0x4b/0xa0 fs/io_uring.c:2233 [<ffffffff8162cb48>] io_async_task_func+0x108/0x1c0 fs/io_uring.c:5462 [<ffffffff816259e3>] tctx_task_work+0x1b3/0x3a0 fs/io_uring.c:2158 [<ffffffff81269b43>] task_work_run+0x73/0xb0 kernel/task_work.c:164 [<ffffffff812dcdd1>] tracehook_notify_signal include/linux/tracehook.h:212 [inline] [<ffffffff812dcdd1>] handle_signal_work kernel/entry/common.c:146 [inline] [<ffffffff812dcdd1>] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] [<ffffffff812dcdd1>] exit_to_user_mode_prepare+0x151/0x180 kernel/entry/common.c:209 [<ffffffff843ff25d>] __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] [<ffffffff843ff25d>] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:302 [<ffffffff843fa4a2>] do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae when create_io_thread() return error, and not retry, the worker object need to be freed. Reported-by: [email protected] Signed-off-by: Qiang.zhang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-09-09iommu: Clarify default domain KconfigRobin Murphy1-1/+1
Although strictly it is the AMD and Intel drivers which have an existing expectation of lazy behaviour by default, it ends up being rather unintuitive to describe this literally in Kconfig. Express it instead as an architecture dependency, to clarify that it is a valid config-time decision. The end result is the same since virtio-iommu doesn't support lazy mode and thus falls back to strict at runtime regardless. The per-architecture disparity is a matter of historical expectations: the AMD and Intel drivers have been lazy by default since 2008, and changing that gets noticed by people asking where their I/O throughput has gone. Conversely, Arm-based systems with their wider assortment of IOMMU drivers mostly only support strict mode anyway; only the Arm SMMU drivers have later grown support for passthrough and lazy mode, for users who wanted to explicitly trade off isolation for performance. These days, reducing the default level of isolation in a way which may go unnoticed by users who expect otherwise hardly seems worth risking for the sake of one line of Kconfig, so here's where we are. Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Link: https://lore.kernel.org/r/69a0c6f17b000b54b8333ee42b3124c1d5a869e2.1631105737.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <[email protected]>
2021-09-09iommu/vt-d: Fix a deadlock in intel_svm_drain_prq()Fenghua Yu1-0/+12
pasid_mutex and dev->iommu->param->lock are held while unbinding mm is flushing IO page fault workqueue and waiting for all page fault works to finish. But an in-flight page fault work also need to hold the two locks while unbinding mm are holding them and waiting for the work to finish. This may cause an ABBA deadlock issue as shown below: idxd 0000:00:0a.0: unbind PASID 2 ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc7+ #549 Not tainted [ 186.615245] ---------- dsa_test/898 is trying to acquire lock: ffff888100d854e8 (&param->lock){+.+.}-{3:3}, at: iopf_queue_flush_dev+0x29/0x60 but task is already holding lock: ffffffff82b2f7c8 (pasid_mutex){+.+.}-{3:3}, at: intel_svm_unbind+0x34/0x1e0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (pasid_mutex){+.+.}-{3:3}: __mutex_lock+0x75/0x730 mutex_lock_nested+0x1b/0x20 intel_svm_page_response+0x8e/0x260 iommu_page_response+0x122/0x200 iopf_handle_group+0x1c2/0x240 process_one_work+0x2a5/0x5a0 worker_thread+0x55/0x400 kthread+0x13b/0x160 ret_from_fork+0x22/0x30 -> #1 (&param->fault_param->lock){+.+.}-{3:3}: __mutex_lock+0x75/0x730 mutex_lock_nested+0x1b/0x20 iommu_report_device_fault+0xc2/0x170 prq_event_thread+0x28a/0x580 irq_thread_fn+0x28/0x60 irq_thread+0xcf/0x180 kthread+0x13b/0x160 ret_from_fork+0x22/0x30 -> #0 (&param->lock){+.+.}-{3:3}: __lock_acquire+0x1134/0x1d60 lock_acquire+0xc6/0x2e0 __mutex_lock+0x75/0x730 mutex_lock_nested+0x1b/0x20 iopf_queue_flush_dev+0x29/0x60 intel_svm_drain_prq+0x127/0x210 intel_svm_unbind+0xc5/0x1e0 iommu_sva_unbind_device+0x62/0x80 idxd_cdev_release+0x15a/0x200 [idxd] __fput+0x9c/0x250 ____fput+0xe/0x10 task_work_run+0x64/0xa0 exit_to_user_mode_prepare+0x227/0x230 syscall_exit_to_user_mode+0x2c/0x60 do_syscall_64+0x48/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: &param->lock --> &param->fault_param->lock --> pasid_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pasid_mutex); lock(&param->fault_param->lock); lock(pasid_mutex); lock(&param->lock); *** DEADLOCK *** 2 locks held by dsa_test/898: #0: ffff888100cc1cc0 (&group->mutex){+.+.}-{3:3}, at: iommu_sva_unbind_device+0x53/0x80 #1: ffffffff82b2f7c8 (pasid_mutex){+.+.}-{3:3}, at: intel_svm_unbind+0x34/0x1e0 stack backtrace: CPU: 2 PID: 898 Comm: dsa_test Not tainted 5.14.0-rc7+ #549 Hardware name: Intel Corporation Kabylake Client platform/KBL S DDR4 UD IMM CRB, BIOS KBLSE2R1.R00.X050.P01.1608011715 08/01/2016 Call Trace: dump_stack_lvl+0x5b/0x74 dump_stack+0x10/0x12 print_circular_bug.cold+0x13d/0x142 check_noncircular+0xf1/0x110 __lock_acquire+0x1134/0x1d60 lock_acquire+0xc6/0x2e0 ? iopf_queue_flush_dev+0x29/0x60 ? pci_mmcfg_read+0xde/0x240 __mutex_lock+0x75/0x730 ? iopf_queue_flush_dev+0x29/0x60 ? pci_mmcfg_read+0xfd/0x240 ? iopf_queue_flush_dev+0x29/0x60 mutex_lock_nested+0x1b/0x20 iopf_queue_flush_dev+0x29/0x60 intel_svm_drain_prq+0x127/0x210 ? intel_pasid_tear_down_entry+0x22e/0x240 intel_svm_unbind+0xc5/0x1e0 iommu_sva_unbind_device+0x62/0x80 idxd_cdev_release+0x15a/0x200 pasid_mutex protects pasid and svm data mapping data. It's unnecessary to hold pasid_mutex while flushing the workqueue. To fix the deadlock issue, unlock pasid_pasid during flushing the workqueue to allow the works to be handled. Fixes: d5b9e4bfe0d8 ("iommu/vt-d: Report prq to io-pgfault framework") Reported-and-tested-by: Dave Jiang <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] [joro: Removed timing information from kernel log messages] Signed-off-by: Joerg Roedel <[email protected]>
2021-09-09iommu/vt-d: Fix PASID leak in intel_svm_unbind_mm()Fenghua Yu1-3/+0
The mm->pasid will be used in intel_svm_free_pasid() after load_pasid() during unbinding mm. Clearing it in load_pasid() will cause PASID cannot be freed in intel_svm_free_pasid(). Additionally mm->pasid was updated already before load_pasid() during pasid allocation. No need to update it again in load_pasid() during binding mm. Don't update mm->pasid to avoid the issues in both binding mm and unbinding mm. Fixes: 4048377414162 ("iommu/vt-d: Use iommu_sva_alloc(free)_pasid() helpers") Reported-and-tested-by: Dave Jiang <[email protected]> Co-developed-by: Jacob Pan <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2021-09-09iommu/amd: Remove iommu_init_ga()Suravee Suthikulpanit1-13/+4
Since the function has been simplified and only call iommu_init_ga_log(), remove the function and replace with iommu_init_ga_log() instead. Signed-off-by: Suravee Suthikulpanit <[email protected]> Link: https://lore.kernel.org/r/[email protected] Fixes: 8bda0cfbdc1a ("iommu/amd: Detect and initialize guest vAPIC log") Signed-off-by: Joerg Roedel <[email protected]>
2021-09-09iommu/amd: Relocate GAMSup check to early_enable_iommusWei Huang1-7/+24
Currently, iommu_init_ga() checks and disables IOMMU VAPIC support (i.e. AMD AVIC support in IOMMU) when GAMSup feature bit is not set. However it forgets to clear IRQ_POSTING_CAP from the previously set amd_iommu_irq_ops.capability. This triggers an invalid page fault bug during guest VM warm reboot if AVIC is enabled since the irq_remapping_cap(IRQ_POSTING_CAP) is incorrectly set, and crash the system with the following kernel trace. BUG: unable to handle page fault for address: 0000000000400dd8 RIP: 0010:amd_iommu_deactivate_guest_mode+0x19/0xbc Call Trace: svm_set_pi_irte_mode+0x8a/0xc0 [kvm_amd] ? kvm_make_all_cpus_request_except+0x50/0x70 [kvm] kvm_request_apicv_update+0x10c/0x150 [kvm] svm_toggle_avic_for_irq_window+0x52/0x90 [kvm_amd] svm_enable_irq_window+0x26/0xa0 [kvm_amd] vcpu_enter_guest+0xbbe/0x1560 [kvm] ? avic_vcpu_load+0xd5/0x120 [kvm_amd] ? kvm_arch_vcpu_load+0x76/0x240 [kvm] ? svm_get_segment_base+0xa/0x10 [kvm_amd] kvm_arch_vcpu_ioctl_run+0x103/0x590 [kvm] kvm_vcpu_ioctl+0x22a/0x5d0 [kvm] __x64_sys_ioctl+0x84/0xc0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes by moving the initializing of AMD IOMMU interrupt remapping mode (amd_iommu_guest_ir) earlier before setting up the amd_iommu_irq_ops.capability with appropriate IRQ_POSTING_CAP flag. [joro: Squashed the two patches and limited check_features_on_all_iommus() to CONFIG_IRQ_REMAP to fix a compile warning.] Signed-off-by: Wei Huang <[email protected]> Co-developed-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Fixes: 8bda0cfbdc1a ("iommu/amd: Detect and initialize guest vAPIC log") Signed-off-by: Joerg Roedel <[email protected]>
2021-09-09parisc: Mark sched_clock unstable only if clocks are not syncronizedHelge Deller2-6/+3
We check at runtime if the cr16 clocks are stable across CPUs. Only mark the sched_clock unstable by calling clear_sched_clock_stable() if we know that we run on a system which isn't syncronized across CPUs. Signed-off-by: Helge Deller <[email protected]>
2021-09-09parisc: Move pci_dev_is_behind_card_dino to where it is usedGuenter Roeck1-9/+9
parisc build test images fail to compile with the following error. drivers/parisc/dino.c:160:12: error: 'pci_dev_is_behind_card_dino' defined but not used Move the function just ahead of its only caller to avoid the error. Fixes: 5fa1659105fa ("parisc: Disable HP HSC-PCI Cards to prevent kernel crash") Cc: Helge Deller <[email protected]> Signed-off-by: Guenter Roeck <[email protected]> Signed-off-by: Helge Deller <[email protected]>
2021-09-09parisc: Reduce sigreturn trampoline to 3 instructionsHelge Deller3-9/+8
We can move the INSN_LDI_R20 instruction into the branch delay slot. Signed-off-by: Helge Deller <[email protected]>
2021-09-09parisc: Check user signal stack trampoline is inside TASK_SIZEHelge Deller1-7/+10
Add some additional checks to ensure the signal stack is inside userspace bounds. Signed-off-by: Helge Deller <[email protected]>
2021-09-09parisc: Drop useless debug info and comments from signal.cHelge Deller1-15/+0
Signed-off-by: Helge Deller <[email protected]>
2021-09-09parisc: Drop strnlen_user() in favour of generic versionHelge Deller4-38/+1
As suggested by Arnd Bergmann, drop the parisc version of strnlen_user() and switch to the generic version. Suggested-by: Arnd Bergmann <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Signed-off-by: Helge Deller <[email protected]>
2021-09-09parisc: Add missing FORCE prerequisite in MakefileHelge Deller1-9/+9
Signed-off-by: Helge Deller <[email protected]>
2021-09-09sched: Prevent balance_push() on remote runqueuesThomas Gleixner1-3/+3
sched_setscheduler() and rt_mutex_setprio() invoke the run-queue balance callback after changing priorities or the scheduling class of a task. The run-queue for which the callback is invoked can be local or remote. That's not a problem for the regular rq::push_work which is serialized with a busy flag in the run-queue struct, but for the balance_push() work which is only valid to be invoked on the outgoing CPU that's wrong. It not only triggers the debug warning, but also leaves the per CPU variable push_work unprotected, which can result in double enqueues on the stop machine list. Remove the warning and validate that the function is invoked on the outgoing CPU. Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()") Reported-by: Sebastian Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/87zgt1hdw7.ffs@tglx
2021-09-09sched/idle: Make the idle timer expire in hard interrupt contextSebastian Andrzej Siewior1-2/+2
The intel powerclamp driver will setup a per-CPU worker with RT priority. The worker will then invoke play_idle() in which it remains in the idle poll loop until it is stopped by the timer it started earlier. That timer needs to expire in hard interrupt context on PREEMPT_RT. Otherwise the timer will expire in ksoftirqd as a SOFT timer but that task won't be scheduled on the CPU because its priority is lower than the priority of the worker which is in the idle loop. Always expire the idle timer in hard interrupt context. Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-09-09locking/rtmutex: Fix ww_mutex deadlock checkPeter Zijlstra1-1/+1
Dan reported that rt_mutex_adjust_prio_chain() can be called with .orig_waiter == NULL however commit a055fcc132d4 ("locking/rtmutex: Return success on deadlock for ww_mutex waiters") unconditionally dereferences it. Since both call-sites that have .orig_waiter == NULL don't care for the return value, simply disable the deadlock squash by adding the NULL check. Notably, both callers use the deadlock condition as a termination condition for the iteration; once detected, it is sure that (de)boosting is done. Arguably step [3] would be a more natural termination point, but it's dubious whether adding a third deadlock detection state would improve the code. Fixes: a055fcc132d4 ("locking/rtmutex: Return success on deadlock for ww_mutex waiters") Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-09-09rtc: rx8010: select REGMAP_I2CYu-Tung Chang1-0/+1
The rtc-rx8010 uses the I2C regmap but doesn't select it in Kconfig so depending on the configuration the build may fail. Fix it. Signed-off-by: Yu-Tung Chang <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-09-08Input: analog - always use ktime functionsGuenter Roeck1-94/+13
m68k, mips, s390, and sparc allmodconfig images fail to build with the following error. drivers/input/joystick/analog.c:160:2: error: #warning Precise timer not defined for this architecture. Remove architecture specific time handling code and always use ktime functions to determine time deltas. Also remove the now useless use_ktime kernel parameter. Signed-off-by: Guenter Roeck <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> Acked-by: Randy Dunlap <[email protected]> # build-tested Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dmitry Torokhov <[email protected]>
2021-09-09cifs: move SMB FSCTL definitions to common codeSteve French2-1/+1
The FSCTL definitions are in smbfsctl.h which should be shared by client and server. Move the updated version of smbfsctl.h into smbfs_common and have the client code use it (subsequent patch will change the server to use this common version of the header). Reviewed-by: Ronnie Sahlberg <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-09-08cifs: rename cifs_common to smbfs_commonSteve French9-10/+10
As we move to common code between client and server, we have been asked to make the names less confusing, and refer less to "cifs" and more to words which include "smb" instead to e.g. "smbfs" for the client (we already have "ksmbd" for the kernel server, and "smbd" for the user space Samba daemon). So to be more consistent in the naming of common code between client and server and reduce the risk of merge conflicts as more common code is added - rename "cifs_common" to "smbfs_common" (in future releases we also will rename the fs/cifs directory to fs/smbfs) Reviewed-by: Ronnie Sahlberg <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-09-08cifs: update FSCTL definitionsSteve French1-3/+13
Add some missing defines used by ksmbd to the client version of smbfsctl.h, and add a missing newer define mentioned in the protocol definitions (MS-FSCC). This will also make it easier to move to common code. Reviewed-by: Ronnie Sahlberg <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-09-09Merge tag 'drm-misc-next-fixes-2021-09-03' of ↵Dave Airlie8-28/+21
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next-fixes for v5.15: - Fix ttm_bo_move_memcpy() when ttm_resource is subclassed. - Small fixes to panfrost, mgag200, vc4. - Small ttm compilation fixes. Signed-off-by: Dave Airlie <[email protected]> From: Maarten Lankhorst <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-09-09Merge tag 'amd-drm-next-5.15-2021-09-01' of ↵Dave Airlie29-94/+198
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-5.15-2021-09-01: amdgpu: - Misc cleanups, typo fixes - EEPROM fix - Add some new PCI IDs - Scatter/Gather display support for Yellow Carp - PCIe DPM fix for RKL platforms - RAS fix amdkfd: - SVM fix Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-09-08io-wq: fix silly logic error in io_task_work_match()Jens Axboe1-2/+7
We check for the func with an OR condition, which means it always ends up being false and we never match the task_work we want to cancel. In the unexpected case that we do exit with that pending, we can trigger a hang waiting for a worker to exit, but it was never created. syzbot reports that as such: INFO: task syz-executor687:8514 blocked for more than 143 seconds. Not tainted 5.14.0-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor687 state:D stack:27296 pid: 8514 ppid: 8479 flags:0x00024004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0x940/0x26f0 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x176/0x280 kernel/sched/completion.c:138 io_wq_exit_workers fs/io-wq.c:1162 [inline] io_wq_put_and_exit+0x40c/0xc70 fs/io-wq.c:1197 io_uring_clean_tctx fs/io_uring.c:9607 [inline] io_uring_cancel_generic+0x5fe/0x740 fs/io_uring.c:9687 io_uring_files_cancel include/linux/io_uring.h:16 [inline] do_exit+0x265/0x2a30 kernel/exit.c:780 do_group_exit+0x125/0x310 kernel/exit.c:922 get_signal+0x47f/0x2160 kernel/signal.c:2868 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:865 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:209 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x445cd9 RSP: 002b:00007fc657f4b308 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: 0000000000000001 RBX: 00000000004cb448 RCX: 0000000000445cd9 RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000004cb44c RBP: 00000000004cb440 R08: 000000000000000e R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049b154 R13: 0000000000000003 R14: 00007fc657f4b400 R15: 0000000000022000 While in there, also decrement accr->nr_workers. This isn't strictly needed as we're exiting, but let's make sure the accounting matches up. Fixes: 3146cba99aa2 ("io-wq: make worker creation resilient against signals") Reported-by: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-09-08Merge branches 'akpm' and 'akpm-hotfixes' (patches from Andrew)Linus Torvalds47-663/+267
Merge yet more updates and hotfixes from Andrew Morton: "Post-linux-next material, based upon latest upstream to catch the now-merged dependencies: - 10 patches. Subsystems affected by this patch series: mm (vmstat and migration) and compat. And bunch of hotfixes, mostly cc:stable: - 8 patches. Subsystems affected by this patch series: mm (hmm, hugetlb, vmscan, pagealloc, pagemap, kmemleak, mempolicy, and memblock)" * emailed patches from Andrew Morton <[email protected]>: arch: remove compat_alloc_user_space compat: remove some compat entry points mm: simplify compat numa syscalls mm: simplify compat_sys_move_pages kexec: avoid compat_alloc_user_space kexec: move locking into do_kexec_load mm: migrate: change to use bool type for 'page_was_mapped' mm: migrate: fix the incorrect function name in comments mm: migrate: introduce a local variable to get the number of pages mm/vmstat: protect per cpu variables with preempt disable on RT * emailed hotfixes from Andrew Morton <[email protected]>: nds32/setup: remove unused memblock_region variable in setup_memory() mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfp mmap_lock: change trace and locking order mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype mm,vmscan: fix divide by zero in get_scan_count mm/hugetlb: initialize hugetlb_usage in mm_init mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled
2021-09-08nds32/setup: remove unused memblock_region variable in setup_memory()Mike Rapoport1-1/+0
kernel test robot reports unused variable warning: arch/nds32/kernel/setup.c:247:26: warning: Unused variable: region [unusedVariable] struct memblock_region *region; ^ Remove the unused variable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Nick Hu <[email protected]> Cc: Vincent Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/mempolicy: fix a race between offset_il_node and mpol_rebind_taskyanghui1-4/+13
Servers happened below panic: Kernel version:5.4.56 BUG: unable to handle page fault for address: 0000000000002c48 RIP: 0010:__next_zones_zonelist+0x1d/0x40 Call Trace: __alloc_pages_nodemask+0x277/0x310 alloc_page_interleave+0x13/0x70 handle_mm_fault+0xf99/0x1390 __do_page_fault+0x288/0x500 do_page_fault+0x30/0x110 page_fault+0x3e/0x50 The reason for the panic is that MAX_NUMNODES is passed in the third parameter in __alloc_pages_nodemask(preferred_nid). So access to zonelist->zoneref->zone_idx in __next_zones_zonelist will cause a panic. In offset_il_node(), first_node() returns nid from pol->v.nodes, after this other threads may chang pol->v.nodes before next_node(). This race condition will let next_node return MAX_NUMNODES. So put pol->nodes in a local variable. The race condition is between offset_il_node and cpuset_change_task_nodemask: CPU0: CPU1: alloc_pages_vma() interleave_nid(pol,) offset_il_node(pol,) first_node(pol->v.nodes) cpuset_change_task_nodemask //nodes==0xc mpol_rebind_task mpol_rebind_policy mpol_rebind_nodemask(pol,nodes) //nodes==0x3 next_node(nid, pol->v.nodes)//return MAX_NUMNODES Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: yanghui <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfpNaohiro Aota1-1/+2
In a memory pressure situation, I'm seeing the lockdep WARNING below. Actually, this is similar to a known false positive which is already addressed by commit 6dcde60efd94 ("xfs: more lockdep whackamole with kmem_alloc*"). This warning still persists because it's not from kmalloc() itself but from an allocation for kmemleak object. While kmalloc() itself suppress the warning with __GFP_NOLOCKDEP, gfp_kmemleak_mask() is dropping the flag for the kmemleak's allocation. Allow __GFP_NOLOCKDEP to be passed to kmemleak's allocation, so that the warning for it is also suppressed. ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc7-BTRFS-ZNS+ #37 Not tainted ------------------------------------------------------ kswapd0/288 is trying to acquire lock: ffff88825ab45df0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0x8a/0x250 but task is already holding lock: ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x112/0x160 kmem_cache_alloc+0x48/0x400 create_object.isra.0+0x42/0xb10 kmemleak_alloc+0x48/0x80 __kmalloc+0x228/0x440 kmem_alloc+0xd3/0x2b0 kmem_alloc_large+0x5a/0x1c0 xfs_attr_copy_value+0x112/0x190 xfs_attr_shortform_getvalue+0x1fc/0x300 xfs_attr_get_ilocked+0x125/0x170 xfs_attr_get+0x329/0x450 xfs_get_acl+0x18d/0x430 get_acl.part.0+0xb6/0x1e0 posix_acl_xattr_get+0x13a/0x230 vfs_getxattr+0x21d/0x270 getxattr+0x126/0x310 __x64_sys_fgetxattr+0x1a6/0x2a0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}: __lock_acquire+0x2c0f/0x5a00 lock_acquire+0x1a1/0x4b0 down_read_nested+0x50/0x90 xfs_ilock+0x8a/0x250 xfs_can_free_eofblocks+0x34f/0x570 xfs_inactive+0x411/0x520 xfs_fs_destroy_inode+0x2c8/0x710 destroy_inode+0xc5/0x1a0 evict+0x444/0x620 dispose_list+0xfe/0x1c0 prune_icache_sb+0xdc/0x160 super_cache_scan+0x31e/0x510 do_shrink_slab+0x337/0x8e0 shrink_slab+0x362/0x5c0 shrink_node+0x7a7/0x1a40 balance_pgdat+0x64e/0xfe0 kswapd+0x590/0xa80 kthread+0x38c/0x460 ret_from_fork+0x22/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&xfs_nondir_ilock_class); lock(fs_reclaim); lock(&xfs_nondir_ilock_class); *** DEADLOCK *** 3 locks held by kswapd0/288: #0: ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff848a08d8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x269/0x5c0 #2: ffff8881a7a820e8 (&type->s_umount_key#60){++++}-{3:3}, at: super_cache_scan+0x5a/0x510 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naohiro Aota <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: "Darrick J . Wong" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mmap_lock: change trace and locking orderLiam Howlett1-4/+4
Print to the trace log before releasing the lock to avoid racing with other trace log printers of the same lock type. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Suggested-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/page_alloc.c: avoid accessing uninitialized pcp page migratetypeMiaohe Lin1-1/+3
If it's not prepared to free unref page, the pcp page migratetype is unset. Thus we will get rubbish from get_pcppage_migratetype() and might list_del(&page->lru) again after it's already deleted from the list leading to grumble about data corruption. Link: https://lkml.kernel.org/r/[email protected] Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock") Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm,vmscan: fix divide by zero in get_scan_countRik van Riel1-1/+1
Commit f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim") introduced a divide by zero corner case when oomd is being used in combination with cgroup memory.low protection. When oomd decides to kill a cgroup, it will force the cgroup memory to be reclaimed after killing the tasks, by writing to the memory.max file for that cgroup, forcing the remaining page cache and reclaimable slab to be reclaimed down to zero. Previously, on cgroups with some memory.low protection that would result in the memory being reclaimed down to the memory.low limit, or likely not at all, having the page cache reclaimed asynchronously later. With f56ce412a59d the oomd write to memory.max tries to reclaim all the way down to zero, which may race with another reclaimer, to the point of ending up with the divide by zero below. This patch implements the obvious fix. Link: https://lkml.kernel.org/r/[email protected] Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim") Signed-off-by: Rik van Riel <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Chris Down <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/hugetlb: initialize hugetlb_usage in mm_initLiu Zixian2-0/+10
After fork, the child process will get incorrect (2x) hugetlb_usage. If a process uses 5 2MB hugetlb pages in an anonymous mapping, HugetlbPages: 10240 kB and then forks, the child will show, HugetlbPages: 20480 kB The reason for double the amount is because hugetlb_usage will be copied from the parent and then increased when we copy page tables from parent to child. Child will have 2x actual usage. Fix this by adding hugetlb_count_init in mm_init. Link: https://lkml.kernel.org/r/[email protected] Fixes: 5d317b2b6536 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status") Signed-off-by: Liu Zixian <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/hmm: bypass devmap pte when all pfn requested flags are fulfilledLi Zhijian1-1/+4
Previously, we noticed the one rpma example was failed[1] since commit 36f30e486dce ("IB/core: Improve ODP to use hmm_range_fault()"), where it will use ODP feature to do RDMA WRITE between fsdax files. After digging into the code, we found hmm_vma_handle_pte() will still return EFAULT even though all the its requesting flags has been fulfilled. That's because a DAX page will be marked as (_PAGE_SPECIAL | PAGE_DEVMAP) by pte_mkdevmap(). Link: https://github.com/pmem/rpma/issues/1142 [1] Link: https://lkml.kernel.org/r/[email protected] Fixes: 405506274922 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling") Signed-off-by: Li Zhijian <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08io_uring: drop ctx->uring_lock before acquiring sqd->lockJens Axboe1-0/+7
The SQPOLL thread dictates the lock order, and we hold the ctx->uring_lock for all the registration opcodes. We also hold a ref to the ctx, and we do drop the lock for other reasons to quiesce, so it's fine to drop the ctx lock temporarily to grab the sqd->lock. This fixes the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.5/25433 is trying to acquire lock: ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: io_register_iowq_max_workers fs/io_uring.c:10551 [inline] ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __io_uring_register fs/io_uring.c:10757 [inline] ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792 but task is already holding lock: ffff8880885b40a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x2e1/0x2e70 fs/io_uring.c:10791 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&ctx->uring_lock){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:596 [inline] __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729 __io_sq_thread fs/io_uring.c:7291 [inline] io_sq_thread+0x65a/0x1370 fs/io_uring.c:7368 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 -> #0 (&sqd->lock){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3051 [inline] check_prevs_add kernel/locking/lockdep.c:3174 [inline] validate_chain kernel/locking/lockdep.c:3789 [inline] __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015 lock_acquire kernel/locking/lockdep.c:5625 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590 __mutex_lock_common kernel/locking/mutex.c:596 [inline] __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729 io_register_iowq_max_workers fs/io_uring.c:10551 [inline] __io_uring_register fs/io_uring.c:10757 [inline] __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ctx->uring_lock); lock(&sqd->lock); lock(&ctx->uring_lock); lock(&sqd->lock); *** DEADLOCK *** Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Reported-by: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-09-08Merge tag 'tag-chrome-platform-for-v5.15' of ↵Linus Torvalds5-23/+123
git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux Pull chrome platform updates from Benson Leung: "cros_ec_typec: - make the cros_ec_typec driver to use the pre-existing cros_ec_check_features() function sensorhub: - add trace events for sample misc: - cros_ec_proto - re-send commands in the event of a timeout (for the FPMCU) - fix warnings in cros_ec_trace related to format output" * tag 'tag-chrome-platform-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: platform/chrome: cros_ec_trace: Fix format warnings platform/chrome: cros_ec_typec: Use existing feature check platform/chrome: cros_ec_proto: Send command again when timeout occurs platform/chrome: sensorhub: Add trace events for sample
2021-09-08Merge tag 'pm-5.15-rc1-2' of ↵Linus Torvalds39-767/+1441
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These are mostly ARM cpufreq driver updates, including one new MediaTek driver that has just passed all of the reviews, with the addition of a revert of a recent intel_pstate commit, some core cpufreq changes and a DT-related update of the operating performance points (OPP) support code. Specifics: - Add new cpufreq driver for the MediaTek MT6779 platform called mediatek-hw along with corresponding DT bindings (Hector.Yuan). - Add DCVS interrupt support to the qcom-cpufreq-hw driver (Thara Gopinath). - Make the qcom-cpufreq-hw driver set the dvfs_possible_from_any_cpu policy flag (Taniya Das). - Blocklist more Qualcomm platforms in cpufreq-dt-platdev (Bjorn Andersson). - Make the vexpress cpufreq driver set the CPUFREQ_IS_COOLING_DEV flag (Viresh Kumar). - Add new cpufreq driver callback to allow drivers to register with the Energy Model in a consistent way and make several drivers use it (Viresh Kumar). - Change the remaining users of the .ready() cpufreq driver callback to move the code from it elsewhere and drop it from the cpufreq core (Viresh Kumar). - Revert recent intel_pstate change adding HWP guaranteed performance change notification support to it that led to problems, because the notification in question is triggered prematurely on some systems (Rafael Wysocki). - Convert the OPP DT bindings to DT schema and clean them up while at it (Rob Herring)" * tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification" cpufreq: mediatek-hw: Add support for CPUFREQ HW cpufreq: Add of_perf_domain_get_sharing_cpumask dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW cpufreq: Remove ready() callback cpufreq: sh: Remove sh_cpufreq_cpu_ready() cpufreq: acpi: Remove acpi_cpufreq_cpu_ready() cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support cpufreq: scmi: Use .register_em() to register with energy model cpufreq: vexpress: Use .register_em() to register with energy model cpufreq: scpi: Use .register_em() to register with energy model dt-bindings: opp: Convert to DT schema dt-bindings: Clean-up OPP binding node names in examples ARM: dts: omap: Drop references to opp.txt cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model cpufreq: omap: Use .register_em() to register with energy model cpufreq: mediatek: Use .register_em() to register with energy model cpufreq: imx6q: Use .register_em() to register with energy model ...
2021-09-08Merge tag 'acpi-5.15-rc1-2' of ↵Linus Torvalds6-52/+197
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI updates from Rafael Wysocki: "These add ACPI support to the PCI VMD driver, improve suspend-to-idle support for AMD platforms and update documentation. Specifics: - Add ACPI support to the PCI VMD driver (Rafael Wysocki) - Rearrange suspend-to-idle support code to reflect the platform firmware expectations on some AMD platforms (Mario Limonciello) - Make SSDT overlays documentation follow the code documented by it more closely (Andy Shevchenko)" * tag 'acpi-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: PM: s2idle: Run both AMD and Microsoft methods if both are supported Documentation: ACPI: Align the SSDT overlays file with the code PCI: VMD: ACPI: Make ACPI companion lookup work for VMD bus