aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2023-03-29mm, page_flags: remove PG_slob_freeVlastimil Babka1-4/+0
With SLOB removed we no longer need the PG_slob_free alias for PG_private. Also update tools/mm/page-types. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Hyeonggon Yoo <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]>
2023-03-29usb: gadget: Add function wakeup supportElson Roy Serrao2-0/+7
USB3.2 spec section 9.2.5.4 quotes that a function may signal that it wants to exit from Function Suspend by sending a Function Wake Notification to the host if it is enabled for function remote wakeup. Add an api in composite layer that can be used by the function drivers to support this feature. Also expose a gadget op so that composite layer can trigger a wakeup request to the UDC driver. Reviewed-by: Thinh Nguyen <[email protected]> Signed-off-by: Elson Roy Serrao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-29usb: gadget: Properly configure the device for remote wakeupElson Roy Serrao2-0/+10
The wakeup bit in the bmAttributes field indicates whether the device is configured for remote wakeup. But this field should be allowed to set only if the UDC supports such wakeup mechanism. So configure this field based on UDC capability. Also inform the UDC whether the device is configured for remote wakeup by implementing a gadget op. Reviewed-by: Thinh Nguyen <[email protected]> Signed-off-by: Elson Roy Serrao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-29vsock: support sockmapBobby Eshleman1-0/+1
This patch adds sockmap support for vsock sockets. It is intended to be usable by all transports, but only the virtio and loopback transports are implemented. SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported. Signed-off-by: Bobby Eshleman <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-29net/mlx5: Introduce other vport query for Q-countersPatrisious Haddad1-3/+10
These new fields in QUERY_Q_COUNTER command allow us to access another vport counters during the query command, which is specially useful to query representor vports. In addition also add the required caps to check if this capability is actually supported. Signed-off-by: Patrisious Haddad <[email protected]> Reviewed-by: Michael Guralnik <[email protected]> Link: https://lore.kernel.org/r/75c73a4a0e60f18c37b35a4a11ca2e2415e4a6f3.1679566038.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]>
2023-03-29xhci: use pm_ptr() instead of #ifdef for CONFIG_PM conditionalsArnd Bergmann2-4/+1
A recent patch caused an unused-function warning in builds with CONFIG_PM disabled, after the function became marked 'static': drivers/usb/host/xhci-pci.c:91:13: error: 'xhci_msix_sync_irqs' defined but not used [-Werror=unused-function] 91 | static void xhci_msix_sync_irqs(struct xhci_hcd *xhci) | ^~~~~~~~~~~~~~~~~~~ This could be solved by adding another #ifdef, but as there is a trend towards removing CONFIG_PM checks in favor of helper macros, do the same conversion here and use pm_ptr() to get either a function pointer or NULL but avoid the warning. As the hidden functions reference some other symbols, make sure those are visible at compile time, at the minimal cost of a few extra bytes for 'struct usb_device'. Fixes: 9abe15d55dcc ("xhci: Move xhci MSI sync function to to xhci-pci") Signed-off-by: Arnd Bergmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-28Merge tag 'mlx5-updates-2023-03-20' of ↵Jakub Kicinski2-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-03-20 mlx5 dynamic msix This patch series adds support for dynamic msix vectors allocation in mlx5. Eli Cohen Says: ================ The following series of patches modifies mlx5_core to work with the dynamic MSIX API. Currently, mlx5_core allocates all the interrupt vectors it needs and distributes them amongst the consumers. With the introduction of dynamic MSIX support, which allows for allocation of interrupts more than once, we now allocate vectors as we need them. This allows other drivers running on top of mlx5_core to allocate interrupt vectors for their own use. An example for this is mlx5_vdpa, which uses these vectors to propagate interrupts directly from the hardware to the vCPU [1]. As a preparation for using this series, a use after free issue is fixed in lib/cpu_rmap.c and the allocator for rmap entries has been modified. A complementary API for irq_cpu_rmap_add() has also been introduced. [1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e ================ * tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Provide external API for allocating vectors net/mlx5: Use one completion vector if eth is disabled net/mlx5: Refactor calculation of required completion vectors net/mlx5: Move devlink registration before mlx5_load net/mlx5: Use dynamic msix vectors allocation net/mlx5: Refactor completion irq request/release code net/mlx5: Improve naming of pci function vectors net/mlx5: Use newer affinity descriptor net/mlx5: Modify struct mlx5_irq to use struct msi_map net/mlx5: Fix wrong comment net/mlx5e: Coding style fix, add empty line lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add lib: cpu_rmap: Use allocator for rmap entries lib: cpu_rmap: Avoid use after free on rmap->obj array entries ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-29driver core: class: mark the struct class for sysfs callbacks as constantGreg Kroah-Hartman1-4/+5
struct class should never be modified in a sysfs callback as there is nothing in the structure to modify, and frankly, the structure is almost never used in a sysfs callback, so mark it as constant to allow struct class to be moved to read-only memory. While we are touching all class sysfs callbacks also mark the attribute as constant as it can not be modified. The bonding code still uses this structure so it can not be removed from the function callbacks. Cc: "David S. Miller" <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Bartosz Golaszewski <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Berg <[email protected]> Cc: Linus Walleij <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Namjae Jeon <[email protected]> Cc: Paolo Abeni <[email protected]> Cc: Russ Weight <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steve French <[email protected]> Cc: Vignesh Raghavendra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Luis Chamberlain <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-28Merge branch 'locking/rcuref' of ↵Jakub Kicinski5-11/+464
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pulling rcurefs from Peter for tglx's work. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-28mm/thp: rename TRANSPARENT_HUGEPAGE_NEVER_DAX to _UNSUPPORTEDPeter Xu1-1/+1
TRANSPARENT_HUGEPAGE_NEVER_DAX has nothing to do with DAX. It's set when has_transparent_hugepage() returns false, checked in hugepage_vma_check() and will disable THP completely if false. Rename it to TRANSPARENT_HUGEPAGE_UNSUPPORTED to reflect its real purpose. [[email protected]: fix comment, per David] Link: https://lkml.kernel.org/r/ZBMzQW674oHQJV7F@x1n Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28mm: vmscan: add a map_nr_max field to shrinker_infoQi Zheng1-0/+1
Patch series "make slab shrink lockless", v5. This patch series aims to make slab shrink lockless. 1. Background ============= On our servers, we often find the following system cpu hotspots: 52.22% [kernel] [k] down_read_trylock 19.60% [kernel] [k] up_read 8.86% [kernel] [k] shrink_slab 2.44% [kernel] [k] idr_find 1.25% [kernel] [k] count_shadow_nodes 1.18% [kernel] [k] shrink lruvec 0.71% [kernel] [k] mem_cgroup_iter 0.71% [kernel] [k] shrink_node 0.55% [kernel] [k] find_next_bit And we used bpftrace to capture its calltrace as follows: @[ down_read_trylock+1 shrink_slab+128 shrink_node+371 do_try_to_free_pages+232 try_to_free_pages+243 _alloc_pages_slowpath+771 _alloc_pages_nodemask+702 pagecache_get_page+255 filemap_fault+1361 ext4_filemap_fault+44 __do_fault+76 handle_mm_fault+3543 do_user_addr_fault+442 do_page_fault+48 page_fault+62 ]: 1161690 @[ down_read_trylock+1 shrink_slab+128 shrink_node+371 balance_pgdat+690 kswapd+389 kthread+246 ret_from_fork+31 ]: 8424884 @[ down_read_trylock+1 shrink_slab+128 shrink_node+371 do_try_to_free_pages+232 try_to_free_pages+243 __alloc_pages_slowpath+771 __alloc_pages_nodemask+702 __do_page_cache_readahead+244 filemap_fault+1674 ext4_filemap_fault+44 __do_fault+76 handle_mm_fault+3543 do_user_addr_fault+442 do_page_fault+48 page_fault+62 ]: 20917631 We can see that down_read_trylock() of shrinker_rwsem is being called with high frequency at that time. Because of the poor multicore scalability of atomic operations, this can lead to a significant drop in IPC (instructions per cycle). And more, the shrinker_rwsem is a global read-write lock in shrinkers subsystem, which protects most operations such as slab shrink, registration and unregistration of shrinkers, etc. This can easily cause problems in the following cases. 1) When the memory pressure is high and there are many filesystems mounted or unmounted at the same time, slab shrink will be affected (down_read_trylock() failed). Such as the real workload mentioned by Kirill Tkhai: ``` One of the real workloads from my experience is start of an overcommitted node containing many starting containers after node crash (or many resuming containers after reboot for kernel update). In these cases memory pressure is huge, and the node goes round in long reclaim. ``` 2) If a shrinker is blocked (such as the case mentioned in [1]) and a writer comes in (such as mount a fs), then this writer will be blocked and cause all subsequent shrinker-related operations to be blocked. [1]. https://lore.kernel.org/lkml/[email protected]/ All the above cases can be solved by replacing the shrinker_rwsem trylocks with SRCU. 2. Survey ========= Before doing the code implementation, I found that there were many similar submissions in the community: a. Davidlohr Bueso submitted a patch in 2015. Subject: [PATCH -next v2] mm: srcu-ify shrinkers Link: https://lore.kernel.org/all/[email protected]/ Result: It was finally merged into the linux-next branch, but failed on arm allnoconfig (without CONFIG_SRCU) b. Tetsuo Handa submitted a patchset in 2017. Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock. Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/ Result: Finally chose to use the current simple way (break when rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU, but SRCU was not unconditionally enabled at the time. c. Kirill Tkhai submitted a patchset in 2018. Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab() Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/ Result: At that time, SRCU was not unconditionally enabled, and there were some objections to enabling SRCU. Later, because Kirill's focus was moved to other things, this patchset was not continued to be updated. d. Sultan Alsawaf submitted a patch in 2021. Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection Link: https://lore.kernel.org/lkml/[email protected]/ Result: Rejected because SRCU was not unconditionally enabled. We can find that almost all these historical commits were abandoned because SRCU was not unconditionally enabled. But now SRCU has been unconditionally enable by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks with SRCU. [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/ 3. Reproduction and testing =========================== We can reproduce the down_read_trylock() hotspot through the following script: ``` #!/bin/bash DIR="/root/shrinker/memcg/mnt" do_create() { mkdir -p /sys/fs/cgroup/memory/test mkdir -p /sys/fs/cgroup/perf_event/test echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes for i in `seq 0 $1`; do mkdir -p /sys/fs/cgroup/memory/test/$i; echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs; echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs; mkdir -p $DIR/$i; done } do_mount() { for i in `seq $1 $2`; do mount -t tmpfs $i $DIR/$i; done } do_touch() { for i in `seq $1 $2`; do echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs; echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs; dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 & done } case "$1" in touch) do_touch $2 $3 ;; test) do_create 4000 do_mount 0 4000 do_touch 0 3000 ;; *) exit 1 ;; esac ``` Save the above script, then run test and touch commands. Then we can use the following perf command to view hotspots: perf top -U -F 999 1) Before applying this patchset: 32.31% [kernel] [k] down_read_trylock 19.40% [kernel] [k] pv_native_safe_halt 16.24% [kernel] [k] up_read 15.70% [kernel] [k] shrink_slab 4.69% [kernel] [k] _find_next_bit 2.62% [kernel] [k] shrink_node 1.78% [kernel] [k] shrink_lruvec 0.76% [kernel] [k] do_shrink_slab 2) After applying this patchset: 27.83% [kernel] [k] _find_next_bit 16.97% [kernel] [k] shrink_slab 15.82% [kernel] [k] pv_native_safe_halt 9.58% [kernel] [k] shrink_node 8.31% [kernel] [k] shrink_lruvec 5.64% [kernel] [k] do_shrink_slab 3.88% [kernel] [k] mem_cgroup_iter At the same time, we use the following perf command to capture IPC information: perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10 1) Before applying this patchset: Performance counter stats for 'system wide' (5 runs): 454187219766 cycles test ( +- 1.84% ) 78896433101 instructions test # 0.17 insn per cycle ( +- 0.44% ) 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% ) 2) After applying this patchset: Performance counter stats for 'system wide' (5 runs): 841954709443 cycles test ( +- 15.80% ) (98.69%) 527258677936 instructions test # 0.63 insn per cycle ( +- 15.11% ) (98.68%) 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% ) We can see that IPC drops very seriously when calling down_read_trylock() at high frequency. After using SRCU, the IPC is at a normal level. This patch (of 8): To prepare for the subsequent lockless memcg slab shrink, add a map_nr_max field to struct shrinker_info to records its own real shrinker_nr_max. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Suggested-by: Kirill Tkhai <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill Tkhai <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Christian König <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sultan Alsawaf <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28kasan: remove PG_skip_kasan_poison flagPeter Collingbourne2-26/+13
Code inspection reveals that PG_skip_kasan_poison is redundant with kasantag, because the former is intended to be set iff the latter is the match-all tag. It can also be observed that it's basically pointless to poison pages which have kasantag=0, because any pages with this tag would have been pointed to by pointers with match-all tags, so poisoning the pages would have little to no effect in terms of bug detection. Therefore, change the condition in should_skip_kasan_poison() to check kasantag instead, and remove PG_skip_kasan_poison and associated flags. Link: https://lkml.kernel.org/r/[email protected] Link: https://linux-review.googlesource.com/id/I57f825f2eaeaf7e8389d6cf4597c8a5821359838 Signed-off-by: Peter Collingbourne <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28io-mapping: don't disable preempt on RT in io_mapping_map_atomic_wc().Sebastian Andrzej Siewior1-4/+16
io_mapping_map_atomic_wc() disables preemption and pagefaults for historical reasons. The conversion to io_mapping_map_local_wc(), which only disables migration, cannot be done wholesale because quite some call sites need to be updated to accommodate with the changed semantics. On PREEMPT_RT enabled kernels the io_mapping_map_atomic_wc() semantics are problematic due to the implicit disabling of preemption which makes it impossible to acquire 'sleeping' spinlocks within the mapped atomic sections. PREEMPT_RT replaces the preempt_disable() with a migrate_disable() for more than a decade. It could be argued that this is a justification to do this unconditionally, but PREEMPT_RT covers only a limited number of architectures and it disables some functionality which limits the coverage further. Limit the replacement to PREEMPT_RT for now. This is also done kmap_atomic(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Reported-by: Richard Weinberger <[email protected]> Link: https://lore.kernel.org/CAFLxGvw0WMxaMqYqJ5WgvVSbKHq2D2xcXTOgMCpgq9nDC-MWTQ@mail.gmail.com Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28shmem: add support to ignore swapLuis Chamberlain1-0/+1
In doing experimentations with shmem having the option to avoid swap becomes a useful mechanism. One of the *raves* about brd over shmem is you can avoid swap, but that's not really a good reason to use brd if we can instead use shmem. Using brd has its own good reasons to exist, but just because "tmpfs" doesn't let you do that is not a great reason to avoid it if we can easily add support for it. I don't add support for reconfiguring incompatible options, but if we really wanted to we can add support for that. To avoid swap we use mapping_set_unevictable() upon inode creation, and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Luis Chamberlain <[email protected]> Acked-by: Christian Brauner <[email protected]> Tested-by: Xin Hao <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: Adam Manzanares <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28mm,jfs: move write_one_page/folio_write_one to jfsChristoph Hellwig1-6/+0
The last remaining user of folio_write_one through the write_one_page wrapper is jfs, so move the functionality there and hard code the call to metapage_writepage. Note that the use of the pagecache by the JFS 'metapage' buffer cache is a bit odd, and we could probably do without VM-level dirty tracking at all, but that's a change for another time. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Dave Kleikamp <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Evgeniy Dushistov <[email protected]> Cc: Gang He <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jan Kara via Ocfs2-devel <[email protected]> Cc: Joel Becker <[email protected]> Cc: Joseph Qi <[email protected]> Cc: Joseph Qi <[email protected]> Cc: Jun Piao <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28mm, memcg: Prevent memory.swappiness load/store tearingYue Zhao1-4/+4
The knob for cgroup v1 memory controller: memory.swappiness is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ|WRITE]_ONCE to prevent compiler from doing anything funky. The access of memcg->swappiness and vm_swappiness is lockless, so both of them can be concurrently set at the same time as we are trying to read them. All occurrences of memcg->swappiness and vm_swappiness are updated with [READ|WRITE]_ONCE. [[email protected]: v3] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yue Zhao <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Tang Yizhou <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()Gerald Schaefer1-1/+1
s390 can do more fine-grained handling of spurious TLB protection faults, when there also is the PTE pointer available. Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an additional parameter. This will add no functional change to other architectures, but those with private flush_tlb_fix_spurious_fault() implementations need to be made aware of the new parameter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Gerald Schaefer <[email protected]> Reviewed-by: Alexander Gordeev <[email protected]> Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Michael Ellerman <[email protected]> [powerpc] Acked-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28mm: swap: remove unneeded cgroup_throttle_swaprate()Kefeng Wang1-8/+4
All the callers of cgroup_throttle_swaprate() are converted to folio_throttle_swaprate(), so make __cgroup_throttle_swaprate() to take a folio, and rename it to __folio_throttle_swaprate(), also rename gfp_mask to gfp and drop redundant extern keyword. finally, drop unused cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28kasan: call clear_page with a match-all tag instead of changing page tagPeter Collingbourne1-5/+3
Instead of changing the page's tag solely in order to obtain a pointer with a match-all tag and then changing it back again, just convert the pointer that we get from kmap_atomic() into one with a match-all tag before passing it to clear_page(). On a certain microarchitecture, this has been observed to cause a measurable improvement in microbenchmark performance, presumably as a result of being able to avoid the atomic operations on the page tag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Collingbourne <[email protected]> Link: https://linux-review.googlesource.com/id/I0249822cc29097ca7a04ad48e8eb14871f80e711 Reviewed-by: Andrey Konovalov <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28mm, printk: introduce new format %pGt for page_typeHyeonggon Yoo1-1/+6
%pGp format is used to display 'flags' field of a struct page. However, some page flags (i.e. PG_buddy, see page-flags.h for more details) are stored in page_type field. To display human-readable output of page_type, introduce %pGt format. It is important to note the meaning of bits are different in page_type. if page_type is 0xffffffff, no flags are set. Setting PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing a bit actually means setting a flag. Bits in page_type are inverted when displaying type names. Only values for which page_type_has_type() returns true are considered as page_type, to avoid confusion with mapcount values. if it returns false, only raw values are displayed and not page type names. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hyeonggon Yoo <[email protected]> Reviewed-by: Petr Mladek <[email protected]> [vsprintf part] Cc: Andy Shevchenko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Joe Perches <[email protected]> Cc: John Ogness <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28lazy tlb: allow lazy tlb mm refcounting to be configurableNicholas Piggin1-3/+15
Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, so the patch introduces no functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nicholas Piggin <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28lazy tlb: introduce lazy tlb mm refcount helper functionsNicholas Piggin1-0/+16
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nicholas Piggin <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28mm: multi-gen LRU: clean up sysfs codeT.J. Alumbaugh1-1/+1
This patch cleans up the sysfs code. Specifically, 1. use sysfs_emit(), 2. use __ATTR_RW(), and 3. constify multi-gen LRU struct attribute_group. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28x86/mm/pat: clear VM_PAT if copy_p4d_range failedMa Wupeng1-3/+4
Syzbot reports a warning in untrack_pfn(). Digging into the root we found that this is due to memory allocation failure in pmd_alloc_one. And this failure is produced due to failslab. In copy_page_range(), memory alloaction for pmd failed. During the error handling process in copy_page_range(), mmput() is called to remove all vmas. While untrack_pfn this empty pfn, warning happens. Here's a simplified flow: dup_mm dup_mmap copy_page_range copy_p4d_range copy_pud_range copy_pmd_range pmd_alloc __pmd_alloc pmd_alloc_one page = alloc_pages(gfp, 0); if (!page) return NULL; mmput exit_mmap unmap_vmas unmap_single_vma untrack_pfn follow_phys WARN_ON_ONCE(1); Since this vma is not generate successfully, we can clear flag VM_PAT. In this case, untrack_pfn() will not be called while cleaning this vma. Function untrack_pfn_moved() has also been renamed to fit the new logic. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ma Wupeng <[email protected]> Reported-by: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suresh Siddha <[email protected]> Cc: Toshi Kani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-28fprobe: Skip exit_handler if entry_handler returns !0Masami Hiramatsu (Google)1-2/+2
Skip hooking function return and calling exit_handler if the entry_handler() returns !0. Link: https://lkml.kernel.org/r/167526699798.433354.10998365726830117303.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-28fprobe: Add nr_maxactive to specify rethook_node pool sizeMasami Hiramatsu (Google)1-0/+2
Add nr_maxactive to specify rethook_node pool size. This means the maximum number of actively running target functions concurrently for probing by exit_handler. Note that if the running function is preempted or sleep, it is still counted as 'active'. Link: https://lkml.kernel.org/r/167526697917.433354.17779774988245113106.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-28fprobe: Pass entry_data to handlersMasami Hiramatsu (Google)1-2/+6
Pass the private entry_data to the entry and exit handlers so that they can share the context data, something like saved function arguments etc. User must specify the private entry_data size by @entry_data_size field before registering the fprobe. Link: https://lkml.kernel.org/r/167526696173.433354.17408372048319432574.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-28dma-buf/dma-resv: Add a way to set fence deadlineRob Clark1-0/+2
Add a way to set a deadline on remaining resv fences according to the requested usage. Signed-off-by: Rob Clark <[email protected]> Reviewed-by: Christian König <[email protected]>
2023-03-28firmware: arm_sdei: Fix sleep from invalid context BUGPierre Gondois1-1/+0
Running a preempt-rt (v6.2-rc3-rt1) based kernel on an Ampere Altra triggers: BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 24, name: cpuhp/0 preempt_count: 0, expected: 0 RCU nest depth: 0, expected: 0 3 locks held by cpuhp/0/24: #0: ffffda30217c70d0 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x5c/0x248 #1: ffffda30217c7120 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x5c/0x248 #2: ffffda3021c711f0 (sdei_list_lock){....}-{3:3}, at: sdei_cpuhp_up+0x3c/0x130 irq event stamp: 36 hardirqs last enabled at (35): [<ffffda301e85b7bc>] finish_task_switch+0xb4/0x2b0 hardirqs last disabled at (36): [<ffffda301e812fec>] cpuhp_thread_fun+0x21c/0x248 softirqs last enabled at (0): [<ffffda301e80b184>] copy_process+0x63c/0x1ac0 softirqs last disabled at (0): [<0000000000000000>] 0x0 CPU: 0 PID: 24 Comm: cpuhp/0 Not tainted 5.19.0-rc3-rt5-[...] Hardware name: WIWYNN Mt.Jade Server [...] Call trace: dump_backtrace+0x114/0x120 show_stack+0x20/0x70 dump_stack_lvl+0x9c/0xd8 dump_stack+0x18/0x34 __might_resched+0x188/0x228 rt_spin_lock+0x70/0x120 sdei_cpuhp_up+0x3c/0x130 cpuhp_invoke_callback+0x250/0xf08 cpuhp_thread_fun+0x120/0x248 smpboot_thread_fn+0x280/0x320 kthread+0x130/0x140 ret_from_fork+0x10/0x20 sdei_cpuhp_up() is called in the STARTING hotplug section, which runs with interrupts disabled. Use a CPUHP_AP_ONLINE_DYN entry instead to execute the cpuhp cb later, with preemption enabled. SDEI originally got its own cpuhp slot to allow interacting with perf. It got superseded by pNMI and this early slot is not relevant anymore. [1] Some SDEI calls (e.g. SDEI_1_0_FN_SDEI_PE_MASK) take actions on the calling CPU. It is checked that preemption is disabled for them. _ONLINE cpuhp cb are executed in the 'per CPU hotplug thread'. Preemption is enabled in those threads, but their cpumask is limited to 1 CPU. Move 'WARN_ON_ONCE(preemptible())' statements so that SDEI cpuhp cb don't trigger them. Also add a check for the SDEI_1_0_FN_SDEI_PRIVATE_RESET SDEI call which acts on the calling CPU. [1]: https://lore.kernel.org/all/[email protected]/ Suggested-by: James Morse <[email protected]> Signed-off-by: Pierre Gondois <[email protected]> Reviewed-by: James Morse <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2023-03-28atomics: Provide rcuref - scalable reference countingThomas Gleixner2-0/+161
atomic_t based reference counting, including refcount_t, uses atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is implemented with a atomic_try_cmpxchg() loop. High contention of the reference count leads to retry loops and scales badly. There is nothing to improve on this implementation as the semantics have to be preserved. Provide rcuref as a scalable alternative solution which is suitable for RCU managed objects. Similar to refcount_t it comes with overflow and underflow detection and mitigation. rcuref treats the underlying atomic_t as an unsigned integer and partitions this space into zones: 0x00000000 - 0x7FFFFFFF valid zone (1 .. (INT_MAX + 1) references) 0x80000000 - 0xBFFFFFFF saturation zone 0xC0000000 - 0xFFFFFFFE dead zone 0xFFFFFFFF no reference rcuref_get() unconditionally increments the reference count with atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the reference count with atomic_add_negative_release(). This unconditional increment avoids the inc_not_zero() problem, but requires a more complex implementation on the put() side when the count drops from 0 to -1. When this transition is detected then it is attempted to mark the reference count dead, by setting it to the midpoint of the dead zone with a single atomic_cmpxchg_release() operation. This operation can fail due to a concurrent rcuref_get() elevating the reference count from -1 to 0 again. If the unconditional increment in rcuref_get() hits a reference count which is marked dead (or saturated) it will detect it after the fact and bring back the reference count to the midpoint of the respective zone. The zones provide enough tolerance which makes it practically impossible to escape from a zone. The racy implementation of rcuref_put() requires to protect rcuref_put() against a grace period ending in order to prevent a subtle use after free. As RCU is the only mechanism which allows to protect against that, it is not possible to fully replace the atomic_inc_not_zero() based implementation of refcount_t with this scheme. The final drop is slightly more expensive than the atomic_dec_return() counterpart, but that's not the case which this is optimized for. The optimization is on the high frequeunt get()/put() pairs and their scalability. The performance of an uncontended rcuref_get()/put() pair where the put() is not dropping the last reference is still on par with the plain atomic operations, while at the same time providing overflow and underflow detection and mitigation. The performance of rcuref compared to plain atomic_inc_not_zero() and atomic_dec_return() based reference counting under contention: - Micro benchmark: All CPUs running a increment/decrement loop on an elevated reference count, which means the 0 to -1 transition never happens. The performance gain depends on microarchitecture and the number of CPUs and has been observed in the range of 1.3X to 4.7X - Conversion of dst_entry::__refcnt to rcuref and testing with the localhost memtier/memcached benchmark. That benchmark shows the reference count contention prominently. The performance gain depends on microarchitecture and the number of CPUs and has been observed in the range of 1.1X to 2.6X over the previous fix for the false sharing issue vs. struct dst_entry::__refcnt. When memtier is run over a real 1Gb network connection, there is a small gain on top of the false sharing fix. The two changes combined result in a 2%-5% total gain for that networked test. Reported-by: Wangyang Guo <[email protected]> Reported-by: Arjan Van De Ven <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-28atomics: Provide atomic_add_negative() variantsThomas Gleixner3-11/+303
atomic_add_negative() does not provide the relaxed/acquire/release variants. Provide them in preparation for a new scalable reference count algorithm. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Mark Rutland <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-28driver core: bus: constify class_unregister/destroy()Greg Kroah-Hartman1-2/+2
The class_unregister() and class_destroy() function should be taking a const * to struct class, not just a *, so fix that up. Cc: "Rafael J. Wysocki" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-27fscrypt: new helper function - fscrypt_prepare_lookup_partial()Luís Henriques1-0/+7
This patch introduces a new helper function which can be used both in lookups and in atomic_open operations by filesystems that want to handle filename encryption and no-key dentries themselves. The reason for this function to be used in atomic open is that this operation can act as a lookup if handed a dentry that is negative. And in this case we may need to set DCACHE_NOKEY_NAME. Signed-off-by: Luís Henriques <[email protected]> Tested-by: Xiubo Li <[email protected]> Reviewed-by: Xiubo Li <[email protected]> [ebiggers: improved the function comment, and moved the function to just below __fscrypt_prepare_lookup()] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Eric Biggers <[email protected]>
2023-03-27ethtool: Add support for configuring tx_push_buf_lenShay Agroskin1-4/+10
This attribute, which is part of ethtool's ring param configuration allows the user to specify the maximum number of the packet's payload that can be written directly to the device. Example usage: # ethtool -G [interface] tx-push-buf-len [number of bytes] Co-developed-by: Jakub Kicinski <[email protected]> Signed-off-by: Shay Agroskin <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-27netlink: Add a macro to set policy message with format stringShay Agroskin1-0/+22
Similar to NL_SET_ERR_MSG_FMT, add a macro which sets netlink policy error message with a format string. Signed-off-by: Shay Agroskin <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-27net: introduce a config option to tweak MAX_SKB_FRAGSEric Dumazet1-11/+5
Currently, MAX_SKB_FRAGS value is 17. For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg() attempts order-3 allocations, stuffing 32768 bytes per frag. But with zero copy, we use order-0 pages. For BIG TCP to show its full potential, we add a config option to be able to fit up to 45 segments per skb. This is also needed for BIG TCP rx zerocopy, as zerocopy currently does not support skbs with frag list. We have used MAX_SKB_FRAGS=45 value for years at Google before we deployed 4K MTU, with no adverse effect, other than a recent issue in mlx4, fixed in commit 26782aad00cc ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS") Back then, goal was to be able to receive full size (64KB) GRO packets without the frag_list overhead. Note that /proc/sys/net/core/max_skb_frags can also be used to limit the number of fragments TCP can use in tx packets. By default we keep the old/legacy value of 17 until we get more coverage for the updated values. Sizes of struct skb_shared_info on 64bit arches MAX_SKB_FRAGS | sizeof(struct skb_shared_info): ============================================== 17 320 21 320+64 = 384 25 320+128 = 448 29 320+192 = 512 33 320+256 = 576 37 320+320 = 640 41 320+384 = 704 45 320+448 = 768 This inflation might cause problems for drivers assuming they could pack both the incoming packet (for MTU=1500) and skb_shared_info in half a page, using build_skb(). v3: fix build error when CONFIG_NET=n v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long" Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-27locking/lockdep: Improve the deadlock scenario print for sync and read lockBoqun Feng1-1/+2
Lock scenario print is always a weak spot of lockdep splats. Improvement can be made if we rework the dependency search and the error printing. However without touching the graph search, we can improve a little for the circular deadlock case, since we have the to-be-added lock dependency, and know whether these two locks are read/write/sync. In order to know whether a held_lock is sync or not, a bit was "stolen" from ->references, which reduce our limit for the same lock class nesting from 2^12 to 2^11, and it should still be good enough. Besides, since we now have bit in held_lock for sync, we don't need the "hardirqoffs being 1" trick, and also we can avoid the __lock_release() if we jump out of __lock_acquire() before the held_lock stored. With these changes, a deadlock case evolved with read lock and sync gets a better print-out from: [...] Possible unsafe locking scenario: [...] [...] CPU0 CPU1 [...] ---- ---- [...] lock(srcuA); [...] lock(srcuB); [...] lock(srcuA); [...] lock(srcuB); to [...] Possible unsafe locking scenario: [...] [...] CPU0 CPU1 [...] ---- ---- [...] rlock(srcuA); [...] lock(srcuB); [...] lock(srcuA); [...] sync(srcuB); Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2023-03-27rcu: Annotate SRCU's update-side lockdep dependenciesBoqun Feng1-2/+32
Although all flavors of RCU readers are annotated correctly with lockdep as recursive read locks, they do not set the lock_acquire 'check' parameter. This means that RCU read locks are not added to the lockdep dependency graph, which in turn means that lockdep cannot detect RCU-based deadlocks. This is not a problem for RCU flavors having atomic read-side critical sections because context-based annotations can catch these deadlocks, see for example the RCU_LOCKDEP_WARN() statement in synchronize_rcu(). But context-based annotations are not helpful for sleepable RCU, especially given that it is perfectly legal to do synchronize_srcu(&srcu1) within an srcu_read_lock(&srcu2). However, we can detect SRCU-based by: (1) Making srcu_read_lock() a 'check'ed recursive read lock and (2) Making synchronize_srcu() a empty write lock critical section. Even better, with the newly introduced lock_sync(), we can avoid false positives about irq-unsafe/safe. This commit therefore makes it so. Note that NMI-safe SRCU read side critical sections are currently not annotated, but might be annotated in the future. Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> [ boqun: Add comments for annotation per Waiman's suggestion ] [ boqun: Fix comment warning reported by Stephen Rothwell ] Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2023-03-27locking/lockdep: Introduce lock_sync()Boqun Feng1-0/+5
Currently, functions like synchronize_srcu() do not have lockdep annotations resembling those of other write-side locking primitives. Such annotations might look as follows: lock_acquire(); lock_release(); Such annotations would tell lockdep that synchronize_srcu() acts like an empty critical section that waits for other (read-side) critical sections to finish. This would definitely catch some deadlock, but as pointed out by Paul Mckenney [1], this could also introduce false positives because of irq-safe/unsafe detection. Of course, there are tricks could help with this: might_sleep(); // Existing statement in __synchronize_srcu(). if (IS_ENABLED(CONFIG_PROVE_LOCKING)) { local_irq_disable(); lock_acquire(); lock_release(); local_irq_enable(); } But it would be better for lockdep to provide a separate annonation for functions like synchronize_srcu(), so that people won't need to repeat the ugly tricks above. Therefore introduce lock_sync(), which is simply an lock+unlock pair with no irq safe/unsafe deadlock check. This works because the to-be-annontated functions do not create real critical sections, and there is therefore no way that irq can create extra dependencies. [1]: https://lore.kernel.org/lkml/20180412021233.ewncg5jjuzjw3x62@tardis/ Signed-off-by: Boqun Feng <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> [ boqun: Fix typos reported by Davidlohr Bueso and Paul E. Mckenney ] Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2023-03-27driver core: move sysfs_dev_char_kobj out of class.hGreg Kroah-Hartman1-1/+0
The structure sysfs_dev_char_kobj is local only to the driver core code, so move it out of the global class.h file and into the internal base.h file as no one else should be touching this symbol. Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-27device property: Remove unused struct net_device forward declarationAndy Shevchenko1-1/+0
There is no users in the property.h for the struct net_device. Remove the latter for good. Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-27KVM: x86/ioapic: Resample the pending state of an IRQ when unmaskingDmytro Maluka1-0/+10
KVM irqfd based emulation of level-triggered interrupts doesn't work quite correctly in some cases, particularly in the case of interrupts that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device in its threaded irq handler, i.e. later than it is acked to the interrupt controller (EOI at the end of hardirq), not earlier. Linux keeps such interrupt masked until its threaded handler finishes, to prevent the EOI from re-asserting an unacknowledged interrupt. However, with KVM + vfio (or whatever is listening on the resamplefd) we always notify resamplefd at the EOI, so vfio prematurely unmasks the host physical IRQ, thus a new physical interrupt is fired in the host. This extra interrupt in the host is not a problem per se. The problem is that it is unconditionally queued for injection into the guest, so the guest sees an extra bogus interrupt. [*] There are observed at least 2 user-visible issues caused by those extra erroneous interrupts for a oneshot irq in the guest: 1. System suspend aborted due to a pending wakeup interrupt from ChromeOS EC (drivers/platform/chrome/cros_ec.c). 2. Annoying "invalid report id data" errors from ELAN0000 touchpad (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg every time the touchpad is touched. The core issue here is that by the time when the guest unmasks the IRQ, the physical IRQ line is no longer asserted (since the guest has acked the interrupt to the device in the meantime), yet we unconditionally inject the interrupt queued into the guest by the previous resampling. So to fix the issue, we need a way to detect that the IRQ is no longer pending, and cancel the queued interrupt in this case. With IOAPIC we are not able to probe the physical IRQ line state directly (at least not if the underlying physical interrupt controller is an IOAPIC too), so in this patch we use irqfd resampler for that. Namely, instead of injecting the queued interrupt, we just notify the resampler that this interrupt is done. If the IRQ line is actually already deasserted, we are done. If it is still asserted, a new interrupt will be shortly triggered through irqfd and injected into the guest. In the case if there is no irqfd resampler registered for this IRQ, we cannot fix the issue, so we keep the existing behavior: immediately unconditionally inject the queued interrupt. This patch fixes the issue for x86 IOAPIC only. In the long run, we can fix it for other irqchips and other architectures too, possibly taking advantage of reading the physical state of the IRQ line, which is possible with some other irqchips (e.g. with arm64 GIC, maybe even with the legacy x86 PIC). [*] In this description we assume that the interrupt is a physical host interrupt forwarded to the guest e.g. by vfio. Potentially the same issue may occur also with a purely virtual interrupt from an emulated device, e.g. if the guest handles this interrupt, again, as a oneshot interrupt. Signed-off-by: Dmytro Maluka <[email protected]> Link: https://lore.kernel.org/kvm/[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/ Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2023-03-27KVM: irqfd: Make resampler_list an RCU listDmytro Maluka2-1/+2
It is useful to be able to do read-only traversal of the list of all the registered irqfd resamplers without locking the resampler_lock mutex. In particular, we are going to traverse it to search for a resampler registered for the given irq of an irqchip, and that will be done with an irqchip spinlock (ioapic->lock) held, so it is undesirable to lock a mutex in this context. So turn this list into an RCU list. For protecting the read side, reuse kvm->irq_srcu which is already used for protecting a number of irq related things (kvm->irq_routing, irqfd->resampler->list, kvm->irq_ack_notifier_list, kvm->arch.mask_notifier_list). Signed-off-by: Dmytro Maluka <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2023-03-27arm64: perf: Abstract system register accesses awayMarc Zyngier1-0/+45
As we want to enable 32bit support, we need to distanciate the PMUv3 driver from the AArch64 system register names. This patch moves all system register accesses to an architecture specific include file, allowing the 32bit counterpart to be slotted in at a later time. Signed-off-by: Marc Zyngier <[email protected]> Co-developed-by: Zaid Al-Bassam <[email protected]> Signed-off-by: Zaid Al-Bassam <[email protected]> Tested-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2023-03-27arm64: perf: Move PMUv3 driver to drivers/perfMarc Zyngier1-0/+258
Having the ARM PMUv3 driver sitting in arch/arm64/kernel is getting in the way of being able to use perf on ARMv8 cores running a 32bit kernel, such as 32bit KVM guests. This patch moves it into drivers/perf/arm_pmuv3.c, with an include file in include/linux/perf/arm_pmuv3.h. The only thing left in arch/arm64 is some mundane perf stuff. Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Zaid Al-Bassam <[email protected]> Tested-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2023-03-27Merge back thermal control material for 6.4-rc1.Rafael J. Wysocki1-0/+19
2023-03-27Merge 6.3-rc4 into usb-nextGreg Kroah-Hartman27-49/+183
We need the USB fixes here, and the USB gadget update for future development patches to be based on. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-27net: phy: constify fwnode_get_phy_node() fwnode argumentRussell King (Oracle)1-1/+1
fwnode_get_phy_node() does not motify the fwnode structure, so make the argument const, Signed-off-by: Russell King (Oracle) <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-27net: sfp: make sfp_bus_find_fwnode() take a const fwnodeRussell King (Oracle)1-2/+3
sfp_bus_find_fwnode() does not write to the fwnode, so let's make it const. Signed-off-by: Russell King (Oracle) <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-26Merge tag 'core_urgent_for_v6.3_rc4' of ↵Linus Torvalds2-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Borislav Petkov: - Do the delayed RCU wakeup for kthreads in the proper order so that former doesn't get ignored - A noinstr warning fix * tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up entry: Fix noinstr warning in __enter_from_user_mode()