path: root/include/linux
Age    Commit message    Author    Files    Lines
2023-10-25mm, pcp: decrease PCP high if free pages < high watermarkHuang Ying1-0/+1
One target of PCP is to minimize the number of pages in PCP if the system's free pages are too few. To reach that target, when page reclaiming is active for the zone (ZONE_RECLAIM_ACTIVE), we stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path. But this may be too late, because the background page reclaiming may introduce latency for some workloads. So, in this patch, during page allocation we detect whether the number of free pages of the zone is below the high watermark. If so, we stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path. With this, we can reduce the possibility of premature background page reclaiming caused by a too-large PCP. The high watermark check is done in the allocating path to reduce the overhead in the hotter freeing path. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
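A minimal sketch of the decision described above, using simplified stand-in types and names (the real code lives in mm/page_alloc.c and differs in detail): the allocation path does a cheap watermark comparison, and the freeing path uses the result to decay pcp->high instead of growing it.

  /* Illustrative stand-in types; not the kernel's actual structures. */
  struct zone_model {
          unsigned long free_pages;
          unsigned long high_wmark;
  };

  struct pcp_model {
          int high, high_min;
          int below_wmark;        /* set in the allocation path */
  };

  /* Allocation path: only reads zone counters, so the check stays cheap. */
  static void pcp_alloc_check(struct zone_model *z, struct pcp_model *p)
  {
          p->below_wmark = z->free_pages < z->high_wmark;
  }

  /* Freeing path: while below the watermark, decay pcp->high toward high_min. */
  static int pcp_free_high_target(struct pcp_model *p)
  {
          int decayed = p->high - (p->high >> 3);

          if (!p->below_wmark)
                  return p->high;
          return decayed > p->high_min ? decayed : p->high_min;
  }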
2023-10-25mm: tune PCP high automaticallyHuang Ying1-0/+1
The targets of tuning PCP high automatically are as follows:
- Minimize allocation/freeing from/to the shared zone
- Minimize idle pages in PCP
- Minimize pages in PCP if the system's free pages are too few
To reach these targets, the following tuning algorithm is designed:
- When we refill PCP via allocating from the zone, increase PCP high. Because if we had a larger PCP, we could avoid allocating from the zone.
- In the periodic vmstat updating kworker (via refresh_cpu_vm_stats()), decrease PCP high to try to free possible idle PCP pages.
- When page reclaiming is active for the zone, stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path.
So, the PCP high can eventually be tuned to the page allocating/freeing depth of the workload. One issue of the algorithm is that if the number of pages allocated is much larger than that of pages freed on a CPU, the PCP high may reach the maximal value even if the allocating/freeing depth is small. But this isn't a severe issue, because there are no idle pages in this case. One alternative choice is to increase PCP high when we drain PCP via trying to free pages to the zone, but not increase PCP high during PCP refilling. This would avoid the issue above. But if the number of pages allocated is much smaller than that of pages freed on a CPU, there will be many idle pages in PCP and it is hard to free these idle pages. PCP high will be decreased by 1/8 (>> 3) periodically. The value 1/8 is somewhat arbitrary; it just makes sure that the idle PCP pages will be freed eventually. On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroups. This simulates the kbuild server that is used by the 0-Day kbuild service. With the patch, the build time decreases by 3.5%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 11.0% to 0.5%. The number of PCP drainings for high-order page freeing (free_high) decreases by 65.6%. The number of pages allocated from the zone (instead of from PCP) decreases by 83.9%. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Suggested-by: Mel Gorman <[email protected]> Suggested-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
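The three rules above can be summarized in a small, self-contained C model; the names, clamping details and call sites are assumptions, not the kernel implementation (which spans mm/page_alloc.c and mm/vmstat.c).

  /* Toy model of the tuning rules; not the kernel implementation. */
  struct pcp_tune {
          int high, high_min, high_max;
          int batch;
  };

  /* Rule 1: refilling from the zone means pcp->high was too small -- grow it. */
  static void pcp_on_refill(struct pcp_tune *p)
  {
          int grown = p->high + p->batch;

          p->high = grown < p->high_max ? grown : p->high_max;
  }

  /* Rule 2: the periodic vmstat kworker decays pcp->high by 1/8 so that
   * idle pages cached in the PCP are eventually returned to the zone. */
  static void pcp_on_vmstat_tick(struct pcp_tune *p)
  {
          int decayed = p->high - (p->high >> 3);

          p->high = decayed > p->high_min ? decayed : p->high_min;
  }

  /* Rule 3: active page reclaim -- stop growing and shrink toward high_min. */
  static void pcp_on_reclaim_active(struct pcp_tune *p)
  {
          p->high = p->high_min;
  }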
2023-10-25mm: add framework for PCP high auto-tuningHuang Ying1-1/+4
The page allocation performance requirements of different workloads are usually different. So, we need to tune PCP (per-CPU pageset) high to optimize the workload page allocation performance. Now, we have a system wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand. But, it's hard to find out the best value by hand. And one global configuration may not work best for the different workloads that run on the same system. One solution to these issues is to tune PCP high of each CPU automatically. This patch adds the framework for PCP high auto-tuning. With it, pcp->high of each CPU will be changed automatically by tuning algorithm at runtime. The minimal high (pcp->high_min) is the original PCP high value calculated based on the low watermark pages. While the maximal high (pcp->high_max) is the PCP high value when percpu_pagelist_high_fraction sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION. That is, the maximal pcp->high that can be set via sysctl knob by hand. It's possible that PCP high auto-tuning doesn't work well for some workloads. So, when PCP high is tuned by hand via the sysctl knob, the auto-tuning will be disabled. The PCP high set by hand will be used instead. This patch only adds the framework, so pcp->high will be set to pcp->high_min (original default) always. We will add actual auto-tuning algorithm in the following patches in the series. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-25mm, page_alloc: scale the number of pages that are batch allocatedHuang Ying1-1/+2
When a task is allocating a large number of order-0 pages, it may acquire the zone->lock multiple times allocating pages in batches. This may unnecessarily contend on the zone lock when allocating very large number of pages. This patch adapts the size of the batch based on the recent pattern to scale the batch size for subsequent allocations. On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroup. This simulates the kbuild server that is used by 0-Day kbuild service. With the patch, the cycles% of the spinlock contention (mostly for zone lock) decreases from 12.6% to 11.0% (with PCP size == 367). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Suggested-by: Mel Gorman <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-25mm, pcp: reduce lock contention for draining high-order pagesHuang Ying2-0/+7
In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when PCP is mostly used for high-order pages freeing to improve the cache-hot pages reusing between page allocating and freeing CPUs. On system with small per-CPU data cache slice, pages shouldn't be cached before draining to guarantee cache-hot. But on a system with large per-CPU data cache slice, some pages can be cached before draining to reduce zone lock contention. So, in this patch, instead of draining without any caching, "pcp->batch" pages will be cached in PCP before draining if the size of the per-CPU data cache slice is more than "3 * batch". In theory, if the size of per-CPU data cache slice is more than "2 * batch", we can reuse cache-hot pages between CPUs. But considering the other usage of cache (code, other data accessing, etc.), "3 * batch" is used. Note: "3 * batch" is chosen to make sure the optimization works on recent x86_64 server CPUs. If you want to increase it, please check whether it breaks the optimization. On a 2-socket Intel server with 128 logical CPU, with the patch, the network bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite with 16-pair processes increase 70.5%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 46.1% to 21.3%. The number of PCP draining for high order pages freeing (free_high) decreases 89.9%. The cache miss rate keeps 0.2%. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Sudeep Holla <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Arjan van de Ven <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
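A sketch of the "cache slice vs. batch" decision described above; the function name and the way the page size is passed in are assumptions for illustration only.

  /* Sketch of the decision only; not the mm/page_alloc.c code. */
  static int pcp_pages_to_keep(unsigned long dcache_slice_bytes,
                               int batch, unsigned long page_size)
  {
          /*
           * Keep "batch" pages in the PCP before draining only when the
           * per-CPU data cache slice can hold roughly 3 * batch pages;
           * the extra slack leaves room for code and other data sharing
           * the cache.
           */
          if (dcache_slice_bytes > 3UL * batch * page_size)
                  return batch;
          return 0;       /* small cache slice: drain without caching */
  }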
2023-10-25cacheinfo: calculate size of per-CPU data cache sliceHuang Ying1-0/+1
This can be used to estimate the size of the data cache slice that can be used by one CPU under ideal circumstances. Both DATA caches and UNIFIED caches are used in the calculation. So, users need to consider the impact of code cache usage. Because the cache inclusive/non-inclusive information isn't available now, we just use the size of the per-CPU slice of the LLC to make the result more predictable across architectures. This may be improved when more cache information is available in the future. A brute-force algorithm that iterates over all online CPUs is used to avoid allocating an extra cpumask, especially in the offline callback. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Sudeep Holla <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Arjan van de Ven <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
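A rough model of the estimate, assuming the LLC size and the number of CPUs sharing it are already known (the real code walks the cacheinfo of each online CPU):

  /* Each CPU's ideal share of the last-level cache. */
  static unsigned int per_cpu_dcache_slice(unsigned int llc_size_bytes,
                                           unsigned int nr_cpus_sharing_llc)
  {
          if (!nr_cpus_sharing_llc)
                  return 0;
          return llc_size_bytes / nr_cpus_sharing_llc;
  }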
2023-10-25mm, pcp: avoid to drain PCP when process exitHuang Ying1-1/+11
Patch series "mm: PCP high auto-tuning", v3. The page allocation performance requirements of different workloads are often different. So, we need to tune the PCP (Per-CPU Pageset) high on each CPU automatically to optimize the page allocation performance. The list of patches in series is as follows, [1/9] mm, pcp: avoid to drain PCP when process exit [2/9] cacheinfo: calculate per-CPU data cache size [3/9] mm, pcp: reduce lock contention for draining high-order pages [4/9] mm: restrict the pcp batch scale factor to avoid too long latency [5/9] mm, page_alloc: scale the number of pages that are batch allocated [6/9] mm: add framework for PCP high auto-tuning [7/9] mm: tune PCP high automatically [8/9] mm, pcp: decrease PCP high if free pages < high watermark [9/9] mm, pcp: reduce detecting time of consecutive high order page freeing Patch [1/9], [2/9], [3/9] optimize the PCP draining for consecutive high-order pages freeing. Patch [4/9], [5/9] optimize batch freeing and allocating. Patch [6/9], [7/9], [8/9] implement and optimize a PCP high auto-tuning method. Patch [9/9] optimize the PCP draining for consecutive high order page freeing based on PCP high auto-tuning. The test results for patches with performance impact are as follows, kbuild ====== On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroup. This simulates the kbuild server that is used by 0-Day kbuild service. build time lock contend% free_high alloc_zone ---------- ---------- --------- ---------- base 100.0 14.0 100.0 100.0 patch1 99.5 12.8 19.5 95.6 patch3 99.4 12.6 7.1 95.6 patch5 98.6 11.0 8.1 97.1 patch7 95.1 0.5 2.8 15.6 patch9 95.0 1.0 8.8 20.0 The PCP draining optimization (patch [1/9], [3/9]) and PCP batch allocation optimization (patch [5/9]) reduces zone lock contention a little. The PCP high auto-tuning (patch [7/9], [9/9]) reduces build time visibly. Where the tuning target: the number of pages allocated from zone reduces greatly. So, the zone contention cycles% reduces greatly. With PCP tuning patches (patch [7/9], [9/9]), the average used memory during test increases up to 18.4% because more pages are cached in PCP. But at the end of the test, the number of the used memory decreases to the same level as that of the base patch. That is, the pages cached in PCP will be released to zone after not being used actively. netperf SCTP_STREAM_MANY ======================== On a 2-socket Intel server with 128 logical CPU, we tested SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes. score lock contend% free_high alloc_zone cache miss rate% ----- ---------- --------- ---------- ---------------- base 100.0 2.1 100.0 100.0 1.3 patch1 99.4 2.1 99.4 99.4 1.3 patch3 106.4 1.3 13.3 106.3 1.3 patch5 106.0 1.2 13.2 105.9 1.3 patch7 103.4 1.9 6.7 90.3 7.6 patch9 108.6 1.3 13.7 108.6 1.3 The PCP draining optimization (patch [1/9]+[3/9]) improves performance. The PCP high auto-tuning (patch [7/9]) reduces performance a little because PCP draining cannot be triggered in time sometimes. So, the cache miss rate% increases. The further PCP draining optimization (patch [9/9]) based on PCP tuning restore the performance. lmbench3 UNIX (AF_UNIX) ======================= On a 2-socket Intel server with 128 logical CPU, we tested UNIX (AF_UNIX socket) test case of lmbench3 test suite with 16-pair processes. 
score lock contend% free_high alloc_zone cache miss rate% ----- ---------- --------- ---------- ---------------- base 100.0 51.4 100.0 100.0 0.2 patch1 116.8 46.1 69.5 104.3 0.2 patch3 199.1 21.3 7.0 104.9 0.2 patch5 200.0 20.8 7.1 106.9 0.3 patch7 191.6 19.9 6.8 103.8 2.8 patch9 193.4 21.7 7.0 104.7 2.1 The PCP draining optimization (patch [1/9], [3/9]) improves performance much. The PCP tuning (patch [7/9]) reduces performance a little because PCP draining cannot be triggered in time sometimes. The further PCP draining optimization (patch [9/9]) based on PCP tuning restores the performance partly. The patchset adds several fields in struct per_cpu_pages. The struct layout before/after the patchset is as follows, base ==== struct per_cpu_pages { spinlock_t lock; /* 0 4 */ int count; /* 4 4 */ int high; /* 8 4 */ int batch; /* 12 4 */ short int free_factor; /* 16 2 */ short int expire; /* 18 2 */ /* XXX 4 bytes hole, try to pack */ struct list_head lists[13]; /* 24 208 */ /* size: 256, cachelines: 4, members: 7 */ /* sum members: 228, holes: 1, sum holes: 4 */ /* padding: 24 */ } __attribute__((__aligned__(64))); patched ======= struct per_cpu_pages { spinlock_t lock; /* 0 4 */ int count; /* 4 4 */ int high; /* 8 4 */ int high_min; /* 12 4 */ int high_max; /* 16 4 */ int batch; /* 20 4 */ u8 flags; /* 24 1 */ u8 alloc_factor; /* 25 1 */ u8 expire; /* 26 1 */ /* XXX 1 byte hole, try to pack */ short int free_count; /* 28 2 */ /* XXX 2 bytes hole, try to pack */ struct list_head lists[13]; /* 32 208 */ /* size: 256, cachelines: 4, members: 11 */ /* sum members: 237, holes: 2, sum holes: 3 */ /* padding: 16 */ } __attribute__((__aligned__(64))); The size of the struct doesn't changed with the patchset. This patch (of 9): In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when PCP is mostly used for high-order pages freeing to improve the cache-hot pages reusing between page allocation and freeing CPUs. But, the PCP draining mechanism may be triggered unexpectedly when process exits. With some customized trace point, it was found that PCP draining (free_high == true) was triggered with the order-1 page freeing with the following call stack, => free_unref_page_commit => free_unref_page => __mmdrop => exit_mm => do_exit => do_group_exit => __x64_sys_exit_group => do_syscall_64 Checking the source code, this is the page table PGD freeing (mm_free_pgd()). It's a order-1 page freeing if CONFIG_PAGE_TABLE_ISOLATION=y. Which is a common configuration for security. Just before that, page freeing with the following call stack was found, => free_unref_page_commit => free_unref_page_list => release_pages => tlb_batch_pages_flush => tlb_finish_mmu => exit_mmap => __mmput => exit_mm => do_exit => do_group_exit => __x64_sys_exit_group => do_syscall_64 So, when a process exits, - a large number of user pages of the process will be freed without page allocation, it's highly possible that pcp->free_factor becomes > 0. In fact, this is expected behavior to improve process exit performance. - after freeing all user pages, the PGD will be freed, which is a order-1 page freeing, PCP will be drained. All in all, when a process exits, it's high possible that the PCP will be drained. This is an unexpected behavior. To avoid this, in the patch, the PCP draining will only be triggered for 2 consecutive high-order page freeing. 
On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroup. This simulates the kbuild server that is used by 0-Day kbuild service. With the patch, the cycles% of the spinlock contention (mostly for zone lock) decreases from 14.0% to 12.8% (with PCP size == 367). The number of PCP draining for high order pages freeing (free_high) decreases 80.5%. This helps network workload too for reduced zone lock contention. On a 2-socket Intel server with 128 logical CPU, with the patch, the network bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite with 16-pair processes increase 16.8%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 51.4% to 46.1%. The number of PCP draining for high order pages freeing (free_high) decreases 30.5%. The cache miss rate keeps 0.2%. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
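A sketch of the gating idea from this patch, with made-up field names and threshold: count only consecutive high-order frees, and consider draining once more than one has been seen in a row, so the single order-1 PGD free at process exit no longer triggers free_high.

  /* Illustrative only; field names and the threshold are assumptions. */
  struct pcp_free_gate {
          int consecutive_high_order;
  };

  static int pcp_should_free_high(struct pcp_free_gate *g, unsigned int order)
  {
          if (order)
                  g->consecutive_high_order++;
          else
                  g->consecutive_high_order = 0;  /* order-0 free resets the run */

          /* Require at least two high-order frees in a row before draining. */
          return g->consecutive_high_order >= 2;
  }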
2023-10-25buffer: remove folio_create_empty_buffers()Matthew Wilcox (Oracle)1-3/+1
With all users converted, remove the old create_empty_buffers() and rename folio_create_empty_buffers() to create_empty_buffers(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-25buffer: add get_nth_bh()Matthew Wilcox (Oracle)1-0/+22
Extract this useful helper from nilfs_page_get_nth_block() Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Ryusuke Konishi <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Pankaj Raghav <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
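Roughly what such a helper looks like (a sketch assuming the usual <linux/buffer_head.h> types; the in-tree version may differ in detail): the buffers attached to a folio form a circular b_this_page list, so the helper walks n links and takes a reference on the result.

  /* Sketch; assumes struct buffer_head and get_bh() from <linux/buffer_head.h>. */
  static inline struct buffer_head *nth_bh(struct buffer_head *bh,
                                           unsigned int count)
  {
          while (count--)
                  bh = bh->b_this_page;   /* circular list of the folio's buffers */
          get_bh(bh);                     /* caller now owns a reference */
          return bh;
  }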
2023-10-25buffer: make folio_create_empty_buffers() return a buffer_headMatthew Wilcox (Oracle)1-2/+2
Patch series "Finish the create_empty_buffers() transition", v2. Pankaj recently added folio_create_empty_buffers() as the folio equivalent to create_empty_buffers(). This patch set finishes the conversion by first converting all remaining filesystems to call folio_create_empty_buffers(), then renaming it back to create_empty_buffers(). I took the opportunity to make a few simplifications like making folio_create_empty_buffers() return the head buffer and extracting get_nth_bh() from nilfs2. A few of the patches in this series aren't directly related to create_empty_buffers(), but I saw them while I was working on this and thought they'd be easy enough to add to this series. Compile-tested only, other than ext4. This patch (of 26): Almost all callers want to know the first BH that was allocated for this folio. We already have that handy, so return it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Pankaj Raghav <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-25Merge tag 'qcom-drivers-for-6.7-2' of ↵Arnd Bergmann1-0/+6
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/drivers More Qualcomm driver updates for v6.7 The Qualcomm SMC and QSEECOM drivers are moved into a "qcom" subdirectory, to declutter the base directory. Missing include guards are added to the qseecom header file. Unneeded extern specifiers are removed from the scm call wrappers. __counted_by is added to the apr_rx_buf structure, in the APR driver. Lastly, in the pmic_glink driver the pmic_glink drm_bridge type is corrected to DisplayPort, over the incorrect "USB" value. The return values are added to error prints for the various typec set() calls. * tag 'qcom-drivers-for-6.7-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: soc: qcom: pmic_glink_altmode: Print return value on error firmware: qcom: scm: remove unneeded 'extern' specifiers firmware: qcom: scm: add a missing forward declaration for struct device firmware: qcom: move Qualcomm code into its own directory soc: qcom: apr: Add __counted_by for struct apr_rx_buf and use struct_size() soc: qcom: pmic_glink: fix connector type to be DisplayPort firmware: qcom: qseecom: add missing include guards Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
2023-10-25file, i915: fix file reference for mmap_singleton()Christian Brauner1-0/+1
Today we got a report at [1] for rcu stalls on the i915 testsuite in [2] due to the conversion of files to SLAB_TYPSSAFE_BY_RCU. Afaict, get_file_rcu() goes into an infinite loop trying to carefully verify that i915->gem.mmap_singleton hasn't changed - see the splat below. So I stared at this code to figure out what it actually does. It seems that the i915->gem.mmap_singleton pointer itself never had rcu semantics. The i915->gem.mmap_singleton is replaced in file->f_op->release::singleton_release(): static int singleton_release(struct inode *inode, struct file *file) { struct drm_i915_private *i915 = file->private_data; cmpxchg(&i915->gem.mmap_singleton, file, NULL); drm_dev_put(&i915->drm); return 0; } The cmpxchg() is ordered against a concurrent update of i915->gem.mmap_singleton from mmap_singleton(). IOW, when mmap_singleton() fails to get a reference on i915->gem.mmap_singleton: While mmap_singleton() does rcu_read_lock(); file = get_file_rcu(&i915->gem.mmap_singleton); rcu_read_unlock(); it allocates a new file via anon_inode_getfile() and does smp_store_mb(i915->gem.mmap_singleton, file); So, then what happens in the case of this bug is that at some point fput() is called and drops the file->f_count to zero leaving the pointer in i915->gem.mmap_singleton in tact. Now, there might be delays until file->f_op->release::singleton_release() is called and i915->gem.mmap_singleton is set to NULL. Say concurrently another task hits mmap_singleton() and does: rcu_read_lock(); file = get_file_rcu(&i915->gem.mmap_singleton); rcu_read_unlock(); When get_file_rcu() fails to get a reference via atomic_inc_not_zero() it will try the reload from i915->gem.mmap_singleton expecting it to be NULL, assuming it has comparable semantics as we expect in __fget_files_rcu(). But it hasn't so it reloads the same pointer again, trying the same atomic_inc_not_zero() again and doing so until file->f_op->release::singleton_release() of the old file has been called. So, in contrast to __fget_files_rcu() here we want to not retry when atomic_inc_not_zero() has failed. We only want to retry in case we managed to get a reference but the pointer did change on reload. <3> [511.395679] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: <3> [511.395716] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-9): P6238 <3> [511.395934] rcu: (detected by 16, t=65002 jiffies, g=123977, q=439 ncpus=20) <6> [511.395944] task:i915_selftest state:R running task stack:10568 pid:6238 tgid:6238 ppid:1001 flags:0x00004002 <6> [511.395962] Call Trace: <6> [511.395966] <TASK> <6> [511.395974] ? __schedule+0x3a8/0xd70 <6> [511.395995] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20 <6> [511.396003] ? lockdep_hardirqs_on+0xc3/0x140 <6> [511.396013] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20 <6> [511.396029] ? get_file_rcu+0x10/0x30 <6> [511.396039] ? get_file_rcu+0x10/0x30 <6> [511.396046] ? i915_gem_object_mmap+0xbc/0x450 [i915] <6> [511.396509] ? i915_gem_mmap+0x272/0x480 [i915] <6> [511.396903] ? mmap_region+0x253/0xb60 <6> [511.396925] ? do_mmap+0x334/0x5c0 <6> [511.396939] ? vm_mmap_pgoff+0x9f/0x1c0 <6> [511.396949] ? rcu_is_watching+0x11/0x50 <6> [511.396962] ? igt_mmap_offset+0xfc/0x110 [i915] <6> [511.397376] ? __igt_mmap+0xb3/0x570 [i915] <6> [511.397762] ? igt_mmap+0x11e/0x150 [i915] <6> [511.398139] ? __trace_bprintk+0x76/0x90 <6> [511.398156] ? __i915_subtests+0xbf/0x240 [i915] <6> [511.398586] ? __pfx___i915_live_setup+0x10/0x10 [i915] <6> [511.399001] ? __pfx___i915_live_teardown+0x10/0x10 [i915] <6> [511.399433] ? 
__run_selftests+0xbc/0x1a0 [i915] <6> [511.399875] ? i915_live_selftests+0x4b/0x90 [i915] <6> [511.400308] ? i915_pci_probe+0x106/0x200 [i915] <6> [511.400692] ? pci_device_probe+0x95/0x120 <6> [511.400704] ? really_probe+0x164/0x3c0 <6> [511.400715] ? __pfx___driver_attach+0x10/0x10 <6> [511.400722] ? __driver_probe_device+0x73/0x160 <6> [511.400731] ? driver_probe_device+0x19/0xa0 <6> [511.400741] ? __driver_attach+0xb6/0x180 <6> [511.400749] ? __pfx___driver_attach+0x10/0x10 <6> [511.400756] ? bus_for_each_dev+0x77/0xd0 <6> [511.400770] ? bus_add_driver+0x114/0x210 <6> [511.400781] ? driver_register+0x5b/0x110 <6> [511.400791] ? i915_init+0x23/0xc0 [i915] <6> [511.401153] ? __pfx_i915_init+0x10/0x10 [i915] <6> [511.401503] ? do_one_initcall+0x57/0x270 <6> [511.401515] ? rcu_is_watching+0x11/0x50 <6> [511.401521] ? kmalloc_trace+0xa3/0xb0 <6> [511.401532] ? do_init_module+0x5f/0x210 <6> [511.401544] ? load_module+0x1d00/0x1f60 <6> [511.401581] ? init_module_from_file+0x86/0xd0 <6> [511.401590] ? init_module_from_file+0x86/0xd0 <6> [511.401613] ? idempotent_init_module+0x17c/0x230 <6> [511.401639] ? __x64_sys_finit_module+0x56/0xb0 <6> [511.401650] ? do_syscall_64+0x3c/0x90 <6> [511.401659] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8 <6> [511.401684] </TASK> Link: [1]: https://lore.kernel.org/intel-gfx/SJ1PR11MB6129CB39EED831784C331BAFB9DEA@SJ1PR11MB6129.namprd11.prod.outlook.com Link: [2]: https://intel-gfx-ci.01.org/tree/linux-next/next-20231013/bat-dg2-11/igt@i915_selftest@[email protected]#dmesg-warnings10963 Cc: Jann Horn <[email protected]>, Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/20231025-formfrage-watscheln-84526cd3bd7d@brauner Signed-off-by: Christian Brauner <[email protected]>
2023-10-25highmem: Add folio_release_kmap()Matthew Wilcox (Oracle)1-2/+16
This is the folio equivalent of unmap_and_put_page(), which remains as a wrapper for it. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Jan Kara <[email protected]> Message-Id: <[email protected]>
2023-10-25HID: core: remove #ifdef CONFIG_PM from hid_driverThomas Weißschuh1-2/+2
Allow HID drivers to pass ->suspend, ->resume and ->reset_resume via pm_ptr(). Through the usage of pm_ptr() the CONFIG_PM-dependent code will always be compiled, protecting against bitrot. The linker will then garbage-collect the unused function avoiding any overhead. The only overhead in the final kernel image and at runtime are a few extra bytes in 'struct hid_driver'. The same approach is chosen by 'struct usb_driver' and other subsystems. Signed-off-by: Thomas Weißschuh <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Benjamin Tissoires <[email protected]>
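A hypothetical driver fragment showing the pattern the commit enables (the driver and function names are made up): the callbacks are always compiled, and pm_ptr() evaluates to NULL when CONFIG_PM is disabled, letting the linker discard the unused code.

  /* Hypothetical example; not an in-tree driver. Assumes <linux/hid.h> and <linux/pm.h>. */
  static int example_hid_suspend(struct hid_device *hdev, pm_message_t message)
  {
          /* quiesce the device */
          return 0;
  }

  static int example_hid_resume(struct hid_device *hdev)
  {
          /* restore device state */
          return 0;
  }

  static struct hid_driver example_hid_driver = {
          .name         = "example-hid",
          /* Compiled unconditionally; dropped by the linker when CONFIG_PM=n. */
          .suspend      = pm_ptr(example_hid_suspend),
          .resume       = pm_ptr(example_hid_resume),
          .reset_resume = pm_ptr(example_hid_resume),
  };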
2023-10-25sh: Remove superhyway bus supportArnd Bergmann1-107/+0
The superhyway bus driver was only referenced on SH4-202, which is now gone, so remove it all as well. I could find no trace of anything ever calling superhyway_register_driver(), neither in the git history nor on the web, so I assume this has never served any purpose on mainline kernels. Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: John Paul Adrian Glaubitz <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: John Paul Adrian Glaubitz <[email protected]>
2023-10-25Merge tag 'opp-updates-6.7' of ↵Rafael J. Wysocki3-4/+44
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm Merge OPP (operating performance points) updates for 6.7 from Viresh Kumar: "- Extend support for the opp-level beyond required-opps (Ulf Hansson). - Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru). - dt-bindings: Allow opp-peak-kBpsfor kryo CPUs, support Qualcomm Krait SoCs and document named opp-microvolt property (Bjorn Andersson, Dmitry Baryshkov and Christian Marangi). - Fix -Wunsequenced warning (Nathan Chancellor). - General cleanup (Viresh Kumar)." * tag 'opp-updates-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm: dt-bindings: opp: opp-v2-kryo-cpu: Document named opp-microvolt property OPP: No need to defer probe from _opp_attach_genpd() OPP: Remove genpd_virt_dev_lock OPP: Reorder code in _opp_set_required_opps_genpd() OPP: Add _link_required_opps() to avoid code duplication OPP: Fix formatting of if/else block dt-bindings: opp: opp-v2-kryo-cpu: support Qualcomm Krait SoCs OPP: Fix -Wunsequenced in _of_add_opp_table_v1() dt-bindings: opp: opp-v2-kryo-cpu: Allow opp-peak-kBps OPP: debugfs: Fix warning with W=1 builds OPP: Remove doc style comments for internal routines OPP: Add dev_pm_opp_find_level_floor() OPP: Extend support for the opp-level beyond required-opps OPP: Switch to use dev_pm_domain_set_performance_state() OPP: Extend dev_pm_opp_data with a level OPP: Add dev_pm_opp_add_dynamic() to allow more flexibility PM: domains: Implement the ->set_performance_state() callback for genpd PM: domains: Introduce dev_pm_domain_set_performance_state()
2023-10-24kexec: Annotate struct crash_mem with __counted_byKees Cook1-1/+1
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct crash_mem. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Eric Biederman <[email protected]> Cc: [email protected] Acked-by: Baoquan He <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
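The annotation pattern looks roughly like this; the struct below is illustrative rather than a verbatim copy of the kexec one, and the choice of counted member is an assumption.

  /* Illustrative example of the __counted_by pattern. */
  struct example_mem {
          unsigned int max_nr_ranges;
          unsigned int nr_ranges;
          struct range ranges[] __counted_by(max_nr_ranges);
  };

  /*
   * Allocating with struct_size() keeps the bound and the allocation in
   * sync, so UBSAN_BOUNDS/FORTIFY_SOURCE can check ranges[] accesses:
   *
   *      mem = kzalloc(struct_size(mem, ranges, nr), GFP_KERNEL);
   *      mem->max_nr_ranges = nr;
   */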
2023-10-24Merge tag 'wireless-2023-10-24' of ↵Jakub Kicinski1-0/+29
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Three more fixes: - don't drop all unprotected public action frames since some don't have a protected dual - fix pointer confusion in scanning code - fix warning in some connections with multiple links * tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: don't drop all unprotected public action frames wifi: cfg80211: fix assoc response warning on failed links wifi: cfg80211: pass correct pointer to rdev_inform_bss() ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-24net: dsa: Use conduit and user termsFlorian Fainelli1-1/+1
Use more inclusive terms throughout the DSA subsystem by moving away from "master" which is replaced by "conduit" and "slave" which is replaced by "user". No functional changes. Acked-by: Rob Herring <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Reviewed-by: Vladimir Oltean <[email protected]> Signed-off-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-24Merge tag 'mm-hotfixes-stable-2023-10-24-09-40' of ↵Linus Torvalds2-5/+42
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes. 12 are cc:stable and the remainder address post-6.5 issues or aren't considered necessary for earlier kernel versions" * tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: maple_tree: add GFP_KERNEL to allocations in mas_expected_entries() selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier mailmap: correct email aliasing for Oleksij Rempel mailmap: map Bartosz's old address to the current one mm/damon/sysfs: check DAMOS regions update progress from before_terminate() MAINTAINERS: Ondrej has moved kasan: disable kasan_non_canonical_hook() for HW tags kasan: print the original fault addr when access invalid shadow hugetlbfs: close race between MADV_DONTNEED and page fault hugetlbfs: extend hugetlb_vma_lock to private VMAs hugetlbfs: clear resv_map pointer if mmap fails mm: zswap: fix pool refcount bug around shrink_worker() mm/migrate: fix do_pages_move for compat pointers riscv: fix set_huge_pte_at() for NAPOT mappings when a swap entry is set riscv: handle VM_FAULT_[HWPOISON|HWPOISON_LARGE] faults instead of panicking mmap: fix error paths with dup_anon_vma() mmap: fix vma_iterator in error path of vma_merge() mm: fix vm_brk_flags() to not bail out while holding lock mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer mm/page_alloc: correct start page when guard page debug is enabled
2023-10-24ACPI: utils: Introduce acpi_dev_uid_match() for matching _UIDRaag Jadav1-0/+5
Introduce acpi_dev_uid_match() helper that matches the device with supplied _UID string. Signed-off-by: Raag Jadav <[email protected]> Reviewed-by: Mika Westerberg <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-10-24KVM: arm64: Add PMU event filter bits required if EL3 is implementedOliver Upton1-3/+6
Suzuki noticed that KVM's PMU emulation is oblivious to the NSU and NSK event filter bits. On systems that have EL3 these bits modify the filter behavior in non-secure EL0 and EL1, respectively. Even though the kernel doesn't use these bits, it is entirely possible some other guest OS does. Additionally, it would appear that these and the M bit are required by the architecture if EL3 is implemented. Allow the EL3 event filter bits to be set if EL3 is advertised in the guest's ID register. Implement the behavior of NSU and NSK according to the pseudocode, and entirely ignore the M bit for perf event creation. Reported-by: Suzuki K Poulose <[email protected]> Reviewed-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Oliver Upton <[email protected]>
2023-10-24exportfs: add helpers to check if filesystem can encode/decode file handlesAmir Goldstein1-0/+27
The logic of whether filesystem can encode/decode file handles is open coded in many places. In preparation to changing the logic, move the open coded logic into inline helpers. Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2023-10-24iommu: Add iommu_domain ops for dirty trackingJoao Martins2-0/+74
Add to the iommu domain operations a set of callbacks to perform dirty tracking, particularly to start and stop tracking and to read and clear the dirty data. Drivers are generally expected to dynamically change their translation structures to toggle the tracking and to flush some form of control state structure that stands in the IOVA translation path. Though it's not mandatory, as drivers can also enable dirty tracking at boot, and just clear the dirty bits before setting dirty tracking. For each of the newly added IOMMU core APIs:
iommu_cap::IOMMU_CAP_DIRTY_TRACKING: new device iommu_capable value when probing for capabilities of the device.
.set_dirty_tracking(): an iommu driver is expected to change its translation structures and enable dirty tracking for the devices in the iommu_domain. For drivers making dirty tracking always-enabled, it should just return 0.
.read_and_clear_dirty(): an iommu driver is expected to walk the pagetables for the iova range passed in and use iommu_dirty_bitmap_record() to record dirty info per IOVA. When detecting that a given IOVA is dirty it should also clear its dirty state from the PTE, *unless* the flag IOMMU_DIRTY_NO_CLEAR is passed in -- flushing is steered from the caller of the domain_op via iotlb_gather. The iommu core APIs use the same data structure in use for dirty tracking for VFIO device dirty (struct iova_bitmap) abstracted by the iommu_dirty_bitmap_record() helper function.
domain::dirty_ops: IOMMU domains will store the dirty ops depending on whether the iommu device supports dirty tracking or not. iommu drivers can then use this field to figure out whether dirty tracking is supported+enforced on attach. The enforcement is enabled via domain_alloc_user(), which is done via an IOMMUFD hwpt flag introduced later.
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
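From the description above, the new callbacks have roughly this shape (a sketch derived from the text, not a verbatim copy of include/linux/iommu.h):

  /* Sketch of the dirty-tracking ops described above. */
  struct iommu_dirty_ops_sketch {
          /* Toggle dirty tracking for all devices attached to the domain. */
          int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
          /* Walk the page tables for [iova, iova + size) and record dirty
           * IOVAs via iommu_dirty_bitmap_record(); clear the PTE dirty state
           * unless IOMMU_DIRTY_NO_CLEAR is passed in flags. */
          int (*read_and_clear_dirty)(struct iommu_domain *domain,
                                      unsigned long iova, size_t size,
                                      unsigned long flags,
                                      struct iommu_dirty_bitmap *dirty);
  };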
2023-10-24vfio: Move iova_bitmap into iommufdJoao Martins1-0/+26
Both VFIO and IOMMUFD will need the iova bitmap for storing dirties and walking the user bitmaps, so move the common dependency into IOMMUFD. In doing so, create the symbol IOMMUFD_DRIVER which designates the builtin code that will be used by drivers when selected. Today this means MLX5_VFIO_PCI and PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when supporting dirty tracking and select IOMMUFD_DRIVER accordingly. Given that the symbol may be disabled, add header definitions in iova_bitmap.h for when IOMMUFD_DRIVER=n. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Brett Creeley <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Alex Williamson <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2023-10-24arm64, irqchip/gic-v3, ACPI: Move MADT GICC enabled check into a helperJames Morse1-0/+5
ACPI, irqchip and the architecture code all inspect the MADT enabled bit for a GICC entry in the MADT. The addition of an 'online capable' bit means all these sites need updating. Move the current checks behind a helper to make future updates easier. Signed-off-by: James Morse <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Gavin Shan <[email protected]> Signed-off-by: "Russell King (Oracle)" <[email protected]> Acked-by: "Rafael J. Wysocki" <[email protected]> Reviewed-by: Sudeep Holla <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
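The helper boils down to something like the following (a sketch; the exact name and header placement may differ), giving a single place to extend once the 'online capable' bit has to be honoured as well:

  /* Sketch: one place to test the MADT GICC "enabled" flag. */
  static inline bool gicc_is_usable(struct acpi_madt_generic_interrupt *gicc)
  {
          return gicc->flags & ACPI_MADT_ENABLED;
  }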
2023-10-24sched/fair: Remove SIS_PROPPeter Zijlstra1-2/+0
SIS_UTIL seems to work well, lets remove the old thing. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-10-24sched: Add cpus_share_resources APIBarry Song2-1/+14
Add the cpus_share_resources() API. This is preparation for the optimization of select_idle_cpu() on platforms with a cluster scheduler level. On a machine with clusters, cpus_share_resources() will test whether two cpus are within the same cluster. On a non-cluster machine it will behave the same as cpus_share_cache(). So we use "resources" here for cache resources. Signed-off-by: Barry Song <[email protected]> Signed-off-by: Yicong Yang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Gautham R. Shenoy <[email protected]> Reviewed-by: Tim Chen <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Tested-and-reviewed-by: Chen Yu <[email protected]> Tested-by: K Prateek Nayak <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
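A toy model of the intended semantics (not the scheduler's code): identical cluster IDs mean shared resources; when the platform exposes no cluster level, fall back to the LLC test that cpus_share_cache() already performs.

  #define NO_CLUSTER (-1)

  /* Toy model; the kernel derives these values from the topology code. */
  static int cpus_share_resources_model(int cluster_a, int cluster_b,
                                        int share_llc)
  {
          if (cluster_a == NO_CLUSTER || cluster_b == NO_CLUSTER)
                  return share_llc;       /* behaves like cpus_share_cache() */
          return cluster_a == cluster_b;
  }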
2023-10-23bpf: correct loop detection for iterators convergenceEduard Zingerman1-0/+15
It turns out that .branches > 0 in is_state_visited() is not a sufficient condition to identify if two verifier states form a loop when iterators convergence is computed. This commit adds logic to distinguish situations like below: (I) initial (II) initial | | V V .---------> hdr .. | | | | V V | .------... .------.. | | | | | | V V V V | ... ... .-> hdr .. | | | | | | | V V | V V | succ <- cur | succ <- cur | | | | | V | V | ... | ... | | | | '----' '----' For both (I) and (II) successor 'succ' of the current state 'cur' was previously explored and has branches count at 0. However, loop entry 'hdr' corresponding to 'succ' might be a part of current DFS path. If that is the case 'succ' and 'cur' are members of the same loop and have to be compared exactly. Co-developed-by: Andrii Nakryiko <[email protected]> Co-developed-by: Alexei Starovoitov <[email protected]> Reviewed-by: Andrii Nakryiko <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-23bpf: exact states comparison for iterator convergence checksEduard Zingerman1-0/+1
Convergence for open coded iterators is computed in is_state_visited() by examining states with branches count > 1 and using states_equal(). states_equal() computes sub-state relation using read and precision marks. Read and precision marks are propagated from children states, thus are not guaranteed to be complete inside a loop when branches count > 1. This could be demonstrated using the following unsafe program: 1. r7 = -16 2. r6 = bpf_get_prandom_u32() 3. while (bpf_iter_num_next(&fp[-8])) { 4. if (r6 != 42) { 5. r7 = -32 6. r6 = bpf_get_prandom_u32() 7. continue 8. } 9. r0 = r10 10. r0 += r7 11. r8 = *(u64 *)(r0 + 0) 12. r6 = bpf_get_prandom_u32() 13. } Here verifier would first visit path 1-3, create a checkpoint at 3 with r7=-16, continue to 4-7,3 with r7=-32. Because instructions at 9-12 had not been visitied yet existing checkpoint at 3 does not have read or precision mark for r7. Thus states_equal() would return true and verifier would discard current state, thus unsafe memory access at 11 would not be caught. This commit fixes this loophole by introducing exact state comparisons for iterator convergence logic: - registers are compared using regs_exact() regardless of read or precision marks; - stack slots have to have identical type. Unfortunately, this is too strict even for simple programs like below: i = 0; while(iter_next(&it)) i++; At each iteration step i++ would produce a new distinct state and eventually instruction processing limit would be reached. To avoid such behavior speculatively forget (widen) range for imprecise scalar registers, if those registers were not precise at the end of the previous iteration and do not match exactly. This a conservative heuristic that allows to verify wide range of programs, however it precludes verification of programs that conjure an imprecise value on the first loop iteration and use it as precise on the second. Test case iter_task_vma_for_each() presents one of such cases: unsigned int seen = 0; ... bpf_for_each(task_vma, vma, task, 0) { if (seen >= 1000) break; ... seen++; } Here clang generates the following code: <LBB0_4>: 24: r8 = r6 ; stash current value of ... body ... 'seen' 29: r1 = r10 30: r1 += -0x8 31: call bpf_iter_task_vma_next 32: r6 += 0x1 ; seen++; 33: if r0 == 0x0 goto +0x2 <LBB0_6> ; exit on next() == NULL 34: r7 += 0x10 35: if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000 <LBB0_6>: ... exit ... Note that counter in r6 is copied to r8 and then incremented, conditional jump is done using r8. Because of this precision mark for r6 lags one state behind of precision mark on r8 and widening logic kicks in. Adding barrier_var(seen) after conditional is sufficient to force clang use the same register for both counting and conditional jump. This issue was discussed in the thread [1] which was started by Andrew Werner <[email protected]> demonstrating a similar bug in callback functions handling. The callbacks would be addressed in a followup patch. [1] https://lore.kernel.org/bpf/[email protected]/ Co-developed-by: Andrii Nakryiko <[email protected]> Co-developed-by: Alexei Starovoitov <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-24Merge tag 'topic/vmemdup-user-array-2023-10-24-1' of ↵Dave Airlie1-0/+40
git://anongit.freedesktop.org/drm/drm into drm-next vmemdup-user-array API and changes with it. This is just a process PR to merge the topic branch into drm-next, this contains some core kernel and drm changes. Signed-off-by: Dave Airlie <[email protected]> From: Dave Airlie <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2023-10-23ASoC: Merge up v6.6-rc7Mark Brown44-120/+174
Get fixes needed so we can enable build of ams-delta in more configurations.
2023-10-23EDAC/versal: Add a Xilinx Versal memory controller driverShubhrajyoti Datta1-0/+12
Add an EDAC driver for the RAS capabilities on the Xilinx integrated DDR Memory Controllers (DDRMCs), which support both DDR4 and LPDDR4/4X memory interfaces. Each controller has four programmable Network-on-Chip (NoC) interface ports and is designed to handle multiple streams of traffic. The driver reports correctable and uncorrectable errors, and also creates debugfs entries for testing through error injection. [ bp: - Add a pointer to the documentation about the register unlock code. - Squash in a fix for a Smatch static checker issue as reported by Dan Carpenter: https://lore.kernel.org/r/[email protected] ] Co-developed-by: Sai Krishna Potthuri <[email protected]> Signed-off-by: Sai Krishna Potthuri <[email protected]> Signed-off-by: Shubhrajyoti Datta <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-10-23Merge branches 'rcu/torture', 'rcu/fixes', 'rcu/docs', 'rcu/refscale', ↵Frederic Weisbecker6-9/+51
'rcu/tasks' and 'rcu/stall' into rcu/next rcu/torture: RCU torture, locktorture and generic torture infrastructure rcu/fixes: Generic and misc fixes rcu/docs: RCU documentation updates rcu/refscale: RCU reference scalability test updates rcu/tasks: RCU tasks updates rcu/stall: Stall detection updates
2023-10-23wifi: mac80211: don't drop all unprotected public action framesAvraham Stern1-0/+29
Not all public action frames have a protected variant. When MFP is enabled drop only public action frames that have a dual protected variant. Fixes: 76a3059cf124 ("wifi: mac80211: drop some unprotected action frames") Signed-off-by: Avraham Stern <[email protected]> Signed-off-by: Gregory Greenman <[email protected]> Link: https://lore.kernel.org/r/20231016145213.2973e3c8d3bb.I6198b8d3b04cf4a97b06660d346caec3032f232a@changeid Signed-off-by: Johannes Berg <[email protected]>
2023-10-23wifi: remove unused argument of ieee80211_get_tdls_action()Dmitry Antipov1-2/+1
Remove unused 'hdr_size' argument of 'ieee80211_get_tdls_action()' and adjust 'ieee80211_report_used_skb()' accordingly. Signed-off-by: Dmitry Antipov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-10-23Merge tag 'v6.6-rc7' into sched/core, to pick up fixesIngo Molnar17-11/+65
Pick up recent sched/urgent fixes merged upstream. Signed-off-by: Ingo Molnar <[email protected]>
2023-10-23tcp: add support for usec resolution in TCP TS valuesEric Dumazet1-1/+3
Back in 2015, Van Jacobson suggested using usec resolution in TCP TS values. This has been implemented in our private kernels. Goals were: 1) better observability of delays in networking stacks. 2) better disambiguation of events based on TSval/ecr values. 3) building block for congestion control modules needing usec resolution. Back then we implemented a scheme based on private SYN options to negotiate the feature. For upstream submission, we chose to use a route attribute, because this feature is probably going to be used in private networks [1] [2]. ip route add 10/8 ... features tcp_usec_ts Note that RFC 7323 recommends a "timestamp clock frequency in the range 1 ms to 1 sec per tick.", but also mentions "the maximum acceptable clock frequency is one tick every 59 ns." [1] Unfortunately RFC 7323 5.5 (Outdated Timestamps) suggests invalidating TS.Recent values after a flow was idle for more than 24 days. This is the part making usec_ts a problem for peers following this recommendation for long-living idle flows. [2] Attempts to standardize usec ts went nowhere: https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf https://datatracker.ietf.org/doc/draft-wang-tcpm-low-latency-opt/ Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-10-23tcp: add RTAX_FEATURE_TCP_USEC_TSEric Dumazet1-0/+5
This new dst feature flag will be used to allow TCP to use usec based timestamps instead of msec ones. ip route .... feature tcp_usec_ts Also document that RTAX_FEATURE_SACK and RTAX_FEATURE_TIMESTAMP are unused. RTAX_FEATURE_ALLFRAG is also going away soon. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-10-23BackMerge tag 'v6.6-rc7' into drm-nextDave Airlie37-100/+150
This is needed to add the msm pr which is based on a higher base. Signed-off-by: Dave Airlie <[email protected]>
2023-10-22bcachefs: Update export_operations for snapshotsKent Overstreet1-0/+6
When support for snapshots was merged, export operations weren't updated yet. This patch adds new filehandle types for bcachefs that include the subvolume ID and updates export operations for subvolumes - and also .get_parent, support for which was added just prior to snapshots. Signed-off-by: Kent Overstreet <[email protected]>
2023-10-21Merge tag 'perf-urgent-2023-10-21' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fix from Ingo Molnar: "Fix group event semantics" * tag 'perf-urgent-2023-10-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Disallow mis-matched inherited group reads
2023-10-21nvmem: add explicit config option to read old syntax fixed OF cellsRafał Miłecki1-0/+2
The binding for fixed NVMEM cells defined directly as NVMEM device subnodes has been deprecated. It has been replaced by the "fixed-layout" NVMEM layout binding. The new syntax is meant to be clearer and should help avoid imprecise bindings. The NVMEM subsystem already supports the new binding. It seems like a good idea to limit support for the old syntax to existing drivers that actually support & use it (we can't break backward compatibility!). That way we additionally encourage new bindings & drivers to ignore the deprecated binding. It wasn't clear (to me) if the rtc and w1 code actually uses old-syntax fixed cells. I enabled them so as not to risk any breakage. Signed-off-by: Rafał Miłecki <[email protected]> [for meson-{efuse,mx-efuse}.c] Acked-by: Martin Blumenstingl <[email protected]> [for mtk-efuse.c, nvmem/core.c, nvmem-provider.h] Reviewed-by: AngeloGioacchino Del Regno <[email protected]> [MT8192, MT8195 Chromebooks] Tested-by: AngeloGioacchino Del Regno <[email protected]> [for microchip-otpc.c] Reviewed-by: Claudiu Beznea <[email protected]> [SAMA7G5-EK] Tested-by: Claudiu Beznea <[email protected]> Acked-by: Jernej Skrabec <[email protected]> Signed-off-by: Srinivas Kandagatla <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-21usb: chipidea: add CI_HDRC_FORCE_VBUS_ACTIVE_ALWAYS flagTomer Maimon1-0/+1
Add the CI_HDRC_FORCE_VBUS_ACTIVE_ALWAYS flag to set the vbus_active parameter to active in case the ChipIdea USB IP role is device-only and there is no otgsc register. Signed-off-by: Tomer Maimon <[email protected]> Acked-by: Peter Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-21staging: vc04_services: Support module autoloading using MODULE_DEVICE_TABLEUmang Jain1-0/+4
VC04 now has an independent bus, vchiq_bus, to register its devices. However, module auto-loading for bcm2835-audio and bcm2835-camera currently happens through an explicitly specified MODULE_ALIAS() macro. The correct way to auto-load a module is when the alias is picked out from MODULE_DEVICE_TABLE(). In order to get there, we need to introduce vchiq_device_id and add relevant entries to the file2alias.c infrastructure so that aliases can be generated. This patch targets adding vchiq_device_id and do_vchiq_entry, in order to generate those aliases using scripts/mod/file2alias.c. Going forward, the MODULE_ALIAS() from bcm2835-camera and bcm2835-audio will be dropped, in favour of MODULE_DEVICE_TABLE being used there. The alias format for vchiq_bus devices will be "vchiq:<dev_name>". Adjust vchiq_bus_uevent() to reflect that. Signed-off-by: Umang Jain <[email protected]> Reviewed-by: Laurent Pinchart <[email protected]> Reviewed-by: Kieran Bingham <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
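A hypothetical driver fragment showing how the table would be used once this infrastructure is in place; the device name is invented and the exact layout of struct vchiq_device_id is an assumption based on the "vchiq:<dev_name>" alias format.

  /* Hypothetical; assumes vchiq_device_id carries the device name. */
  static const struct vchiq_device_id example_vchiq_id_table[] = {
          { .name = "bcm2835-example" },
          { }
  };
  MODULE_DEVICE_TABLE(vchiq, example_vchiq_id_table);
  /* file2alias.c would then emit the equivalent of MODULE_ALIAS("vchiq:bcm2835-example"). */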
2023-10-20bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl()Hou Tao1-1/+1
The following warning was reported when running "./test_progs -t test_bpf_ma/percpu_free_through_map_free": ------------[ cut here ]------------ WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342 CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:bpf_mem_refill+0x21c/0x2a0 ...... Call Trace: <IRQ> ? bpf_mem_refill+0x21c/0x2a0 irq_work_single+0x27/0x70 irq_work_run_list+0x2a/0x40 irq_work_run+0x18/0x40 __sysvec_irq_work+0x1c/0xc0 sysvec_irq_work+0x73/0x90 </IRQ> <TASK> asm_sysvec_irq_work+0x1b/0x20 RIP: 0010:unit_free+0x50/0x80 ...... bpf_mem_free+0x46/0x60 __bpf_obj_drop_impl+0x40/0x90 bpf_obj_free_fields+0x17d/0x1a0 array_map_free+0x6b/0x170 bpf_map_free_deferred+0x54/0xa0 process_scheduled_works+0xba/0x370 worker_thread+0x16d/0x2e0 kthread+0x105/0x140 ret_from_fork+0x39/0x60 ret_from_fork_asm+0x1b/0x30 </TASK> ---[ end trace 0000000000000000 ]--- The reason is simple: __bpf_obj_drop_impl() does not know the freeing field is a per-cpu pointer and it uses bpf_global_ma to free the pointer. Because bpf_global_ma is not a per-cpu allocator, so ksize() is used to select the corresponding cache. The bpf_mem_cache with 16-bytes unit_size will always be selected to do the unmatched free and it will trigger the warning in free_bulk() eventually. Because per-cpu kptr doesn't support list or rb-tree now, so fix the problem by only checking whether or not the type of kptr is per-cpu in bpf_obj_free_fields(), and using bpf_global_percpu_ma to these kptrs. Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-20bpf: Move the declaration of __bpf_obj_drop_impl() to bpf.hHou Tao1-0/+1
Both syscall.c and helpers.c have the declaration of __bpf_obj_drop_impl(), so just move it to a common header file. Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-20bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()Hou Tao1-0/+1
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by kmalloc() and its size is fixed (16 bytes on x86-64). So no matter which cache allocates the dynamic per-cpu area, on x86-64 cache[2] will always be used to free the per-cpu area. Fix the imbalance by checking whether the bpf memory allocator is per-cpu or not and using pcpu_alloc_size() instead of ksize() to find the correct cache for the per-cpu free. Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-20mm/percpu.c: introduce pcpu_alloc_size()Hou Tao1-0/+1
Introduce pcpu_alloc_size() to get the size of the dynamic per-cpu area. It will be used by bpf memory allocator in the following patches. BPF memory allocator maintains per-cpu area caches for multiple area sizes and its free API only has the to-be-freed per-cpu pointer, so it needs the size of dynamic per-cpu area to select the corresponding cache when bpf program frees the dynamic per-cpu pointer. Acked-by: Dennis Zhou <[email protected]> Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-20seq_buf: fix a misleading commentJonathan Corbet1-1/+1
The comment for seq_buf_has_overflowed() says that an overflow condition is marked by len == size, but that's not what the code is testing. Make the comment match reality. Link: https://lkml.kernel.org/r/[email protected] Fixes: 8cd709ae7658a ("tracing: Have seq_buf use full buffer") Signed-off-by: Jonathan Corbet <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
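For reference, the check in question amounts to the following (a sketch assuming the post-"use full buffer" semantics; see include/linux/seq_buf.h for the real definition):

  /* Overflow is recorded by pushing len past size, not by len == size. */
  static inline bool seq_buf_overflowed_sketch(const struct seq_buf *s)
  {
          return s->len > s->size;
  }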