|
In current PCP auto-tuning design, if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become the
maximal value even if the allocating/freeing depth is small, for example,
in the sender of network workloads. If a CPU is used as a sender
originally and then as a receiver after a context switch, the whole PCP
needs to be filled to the maximal high before PCP draining is triggered
for consecutive high-order freeing. This will hurt the performance of some
network workloads.
To solve the issue, in this patch, we will track the consecutive page
freeing with a counter instead of relying on PCP draining. So, we can
detect consecutive page freeing much earlier.
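Roughly, the detection looks like this (illustrative sketch; free_count
matches the field visible in the struct layout later in this log, but
the logic is simplified):

struct pcp_sketch {
	int batch;	/* pages moved to/from the zone at once */
	int free_count;	/* consecutively freed pages, saturating */
};

/* Called on each free; returns true once a freeing burst is detected. */
static bool note_free(struct pcp_sketch *pcp, unsigned int order, int scale_max)
{
	if (pcp->free_count < (pcp->batch << scale_max))
		pcp->free_count += 1 << order;
	/*
	 * One batch freed back-to-back is enough to flag the burst,
	 * long before the PCP fills up to the maximal high.
	 * Allocations scale free_count back down again (not shown).
	 */
	return order && pcp->free_count >= pcp->batch;
}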
On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64-pair
processes.
With the patch, the network bandwidth improves by 5.0%. This restores the
performance drop caused by PCP auto-tuning.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
One target of PCP is to minimize the number of pages in PCP when the
system's free pages are too few. To reach that target, when page
reclaiming is active for the zone (ZONE_RECLAIM_ACTIVE), we stop
increasing PCP high in the allocating path, and decrease PCP high and
free some pages in the freeing path. But this may
be too late because the background page reclaiming may introduce latency
for some workloads. So, in this patch, during page allocation we detect
whether the number of free pages of the zone is below the high watermark.
If so, we stop increasing PCP high in the allocating path, and decrease
PCP high and free some pages in the freeing path. With this, we can
reduce the possibility of the premature background page reclaiming caused
by too large PCP.
The high watermark check is done in the allocating path to reduce the
overhead in the hotter freeing path.
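A minimal sketch of the two halves, assuming a zone flag like the
ZONE_BELOW_HIGH bit this patch introduces (details simplified):

static void note_zone_free_pages(struct zone *zone)
{
	/* Allocating path: a single watermark check. */
	if (zone_page_state(zone, NR_FREE_PAGES) < high_wmark_pages(zone))
		set_bit(ZONE_BELOW_HIGH, &zone->flags);
	else
		clear_bit(ZONE_BELOW_HIGH, &zone->flags);
}

static int freeing_path_high(struct per_cpu_pages *pcp, struct zone *zone)
{
	/* Freeing path: shrink toward high_min while below the watermark. */
	if (test_bit(ZONE_BELOW_HIGH, &zone->flags))
		return pcp->high_min;
	return pcp->high;
}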
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The targets of tuning PCP high automatically are as follows,
- Minimize allocation/freeing from/to shared zone
- Minimize idle pages in PCP
- Minimize pages in PCP if the system's free pages are too few
To reach these targets, a tuning algorithm as follows is designed,
- When we refill PCP via allocating from the zone, increase PCP high.
Because if we had a larger PCP, we could avoid allocating from the
zone.
- In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
decrease PCP high to try to free possible idle PCP pages.
- When page reclaiming is active for the zone, stop increasing PCP
high in allocating path, decrease PCP high and free some pages in
freeing path.
So, the PCP high can be tuned to the page allocating/freeing depth of
workloads eventually.
One issue of the algorithm is that if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become the
maximal value even if the allocating/freeing depth is small. But this
isn't a severe issue, because there are no idle pages in this case.
One alternative choice is to increase PCP high when we drain PCP via
trying to free pages to the zone, but don't increase PCP high during PCP
refilling. This can avoid the issue above. But if the number of pages
allocated is much less than that of pages freed on a CPU, there will be
many idle pages in PCP and it is hard to free these idle pages.
PCP high will be decreased by 1/8 (>> 3) periodically. The value 1/8 is
somewhat arbitrary; it just makes sure that the idle PCP pages will be
freed eventually.
On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups. This
simulates the
kbuild server that is used by 0-Day kbuild service. With the patch, the
build time decreases by 3.5%. The cycles% of the spinlock contention
(mostly for zone lock) decreases from 11.0% to 0.5%. The number of PCP
drains for high-order page freeing (free_high) decreases by 65.6%. The
number of pages allocated from the zone (instead of from the PCP)
decreases by 83.9%.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Mel Gorman <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The page allocation performance requirements of different workloads are
usually different. So, we need to tune PCP (per-CPU pageset) high to
optimize the workload page allocation performance. Now, we have a
system-wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high
by hand.
But, it's hard to find out the best value by hand. And one global
configuration may not work best for the different workloads that run on
the same system. One solution to these issues is to tune PCP high of each
CPU automatically.
This patch adds the framework for PCP high auto-tuning. With it,
pcp->high of each CPU will be changed automatically by tuning algorithm at
runtime. The minimal high (pcp->high_min) is the original PCP high value
calculated based on the low watermark pages, while the maximal high
(pcp->high_max) is the PCP high value when percpu_pagelist_high_fraction
sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION. That is, the
maximal pcp->high that can be set via sysctl knob by hand.
It's possible that PCP high auto-tuning doesn't work well for some
workloads. So, when PCP high is tuned by hand via the sysctl knob, the
auto-tuning will be disabled. The PCP high set by hand will be used
instead.
This patch only adds the framework, so pcp->high will always be set to
pcp->high_min (the original default). The actual auto-tuning algorithm
will be added in the following patches in the series.
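One way the manual override can work, sketched under the assumption
that pinning high_min == high_max leaves the tuner no room to move
pcp->high (the real patch wires this through the sysctl handler):

static void pcp_set_high_by_hand(struct per_cpu_pages *pcp, int high)
{
	pcp->high_min = high;
	pcp->high_max = high;
	pcp->high = high;
}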
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When a task is allocating a large number of order-0 pages, it may acquire
the zone->lock multiple times, allocating pages in batches. This may
unnecessarily contend on the zone lock when allocating a very large number
of pages. This patch adapts the batch size based on the recent allocation
pattern, scaling it up for subsequent allocations.
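A hedged sketch of the scaling idea, reusing the alloc_factor field
shown in the struct layout later in this log (logic simplified):

static int next_alloc_batch(struct per_cpu_pages *pcp, int scale_max)
{
	int batch = READ_ONCE(pcp->batch) << pcp->alloc_factor;

	if (pcp->alloc_factor < scale_max)
		pcp->alloc_factor++;	/* larger batch next time */
	/* Frees scale alloc_factor back down again (not shown). */
	return min(batch, pcp->high);
}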
On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups. This
simulates the
kbuild server that is used by 0-Day kbuild service. With the patch, the
cycles% of the spinlock contention (mostly for zone lock) decreases from
12.6% to 11.0% (with PCP size == 367).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Mel Gorman <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages
on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
PCP is mostly used for high-order page freeing, to improve the reuse of
cache-hot pages between the page allocating and freeing CPUs.
On a system with a small per-CPU data cache slice, pages shouldn't be
cached before draining, to guarantee that they stay cache-hot. But on a
system with a large per-CPU data cache slice, some pages can be cached
before draining to reduce zone lock contention.
So, in this patch, instead of draining without any caching, "pcp->batch"
pages will be cached in PCP before draining if the size of the per-CPU
data cache slice is more than "3 * batch".
In theory, if the size of the per-CPU data cache slice is more than "2 *
batch", we can reuse cache-hot pages between CPUs. But considering the
other cache usage (code, other data accesses, etc.), "3 * batch" is
used.
Note: "3 * batch" is chosen to make sure the optimization works on recent
x86_64 server CPUs. If you want to increase it, please check whether it
breaks the optimization.
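The threshold decision, sketched with an assumed helper name for the
per-CPU cache slice size (provided by the cacheinfo patch in this
series):

static bool keep_batch_before_drain(struct per_cpu_pages *pcp)
{
	int batch = READ_ONCE(pcp->batch);

	/*
	 * With a large enough per-CPU cache slice, one PCP batch can
	 * stay cached and still be cache-hot for the allocating CPU,
	 * while cutting the number of zone->lock acquisitions.
	 */
	return per_cpu_data_slice_size() > 3 * batch * PAGE_SIZE;
}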
On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16-pair processes increases by 70.5%. The cycles% of the
spinlock contention (mostly for zone lock) decreases from 46.1% to 21.3%.
The number of PCP drains for high-order page freeing (free_high)
decreases by 89.9%. The cache miss rate stays at 0.2%.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This can be used to estimate the size of the data cache slice that can be
used by one CPU under ideal circumstances. Both DATA caches and UNIFIED
caches are used in the calculation. So, users need to consider the
impact of code cache usage.
Because the cache inclusive/non-inclusive information isn't available now,
we just use the size of the per-CPU slice of LLC to make the result more
predictable across architectures. This may be improved when more cache
information is available in the future.
A brute-force algorithm that iterates over all online CPUs is used to
avoid allocating an extra cpumask, especially in the offline callback.
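A hedged sketch of the per-LLC part of the calculation, using struct
cacheinfo fields (the merged code iterates online CPUs rather than
relying on shared_cpu_map directly):

static unsigned int llc_slice_size(struct cacheinfo *llc)
{
	/* Both DATA and UNIFIED caches count toward the slice. */
	if (llc->type != CACHE_TYPE_DATA && llc->type != CACHE_TYPE_UNIFIED)
		return 0;
	/* Split the shared cache evenly among the CPUs sharing it. */
	return llc->size / cpumask_weight(&llc->shared_cpu_map);
}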
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: PCP high auto-tuning", v3.
The page allocation performance requirements of different workloads are
often different. So, we need to tune the PCP (Per-CPU Pageset) high on
each CPU automatically to optimize the page allocation performance.
The list of patches in the series is as follows,
[1/9] mm, pcp: avoid to drain PCP when process exit
[2/9] cacheinfo: calculate per-CPU data cache size
[3/9] mm, pcp: reduce lock contention for draining high-order pages
[4/9] mm: restrict the pcp batch scale factor to avoid too long latency
[5/9] mm, page_alloc: scale the number of pages that are batch allocated
[6/9] mm: add framework for PCP high auto-tuning
[7/9] mm: tune PCP high automatically
[8/9] mm, pcp: decrease PCP high if free pages < high watermark
[9/9] mm, pcp: reduce detecting time of consecutive high order page freeing
Patch [1/9], [2/9], [3/9] optimize the PCP draining for consecutive
high-order pages freeing.
Patch [4/9], [5/9] optimize batch freeing and allocating.
Patch [6/9], [7/9], [8/9] implement and optimize a PCP high
auto-tuning method.
Patch [9/9] optimize the PCP draining for consecutive high order page
freeing based on PCP high auto-tuning.
The test results for patches with performance impact are as follows,
kbuild
======
On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups. This
simulates the
kbuild server that is used by 0-Day kbuild service.
         build time  lock contend%  free_high  alloc_zone
         ----------  -------------  ---------  ----------
base          100.0           14.0      100.0       100.0
patch1         99.5           12.8       19.5        95.6
patch3         99.4           12.6        7.1        95.6
patch5         98.6           11.0        8.1        97.1
patch7         95.1            0.5        2.8        15.6
patch9         95.0            1.0        8.8        20.0
The PCP draining optimization (patch [1/9], [3/9]) and PCP batch
allocation optimization (patch [5/9]) reduce zone lock contention a
little. The PCP high auto-tuning (patch [7/9], [9/9]) reduces build time
visibly. The tuning target, the number of pages allocated from the zone,
is reduced greatly, so the zone lock contention cycles% is reduced
greatly.
With the PCP tuning patches (patch [7/9], [9/9]), the average used memory
during the test increases by up to 18.4% because more pages are cached in
the PCP. But at the end of the test, the amount of used memory decreases
to the same level as that of the base. That is, the pages cached in the
PCP will be released to the zone once they are no longer used actively.
netperf SCTP_STREAM_MANY
========================
On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64-pair
processes.
         score  lock contend%  free_high  alloc_zone  cache miss rate%
         -----  -------------  ---------  ----------  ----------------
base     100.0            2.1      100.0       100.0               1.3
patch1    99.4            2.1       99.4        99.4               1.3
patch3   106.4            1.3       13.3       106.3               1.3
patch5   106.0            1.2       13.2       105.9               1.3
patch7   103.4            1.9        6.7        90.3               7.6
patch9   108.6            1.3       13.7       108.6               1.3
The PCP draining optimization (patch [1/9]+[3/9]) improves performance.
The PCP high auto-tuning (patch [7/9]) reduces performance a little
because PCP draining cannot be triggered in time sometimes. So, the cache
miss rate% increases. The further PCP draining optimization (patch [9/9])
based on PCP tuning restores the performance.
lmbench3 UNIX (AF_UNIX)
=======================
On a 2-socket Intel server with 128 logical CPUs, we tested the UNIX
(AF_UNIX socket) test case of the lmbench3 test suite with 16-pair
processes.
         score  lock contend%  free_high  alloc_zone  cache miss rate%
         -----  -------------  ---------  ----------  ----------------
base     100.0           51.4      100.0       100.0               0.2
patch1   116.8           46.1       69.5       104.3               0.2
patch3   199.1           21.3        7.0       104.9               0.2
patch5   200.0           20.8        7.1       106.9               0.3
patch7   191.6           19.9        6.8       103.8               2.8
patch9   193.4           21.7        7.0       104.7               2.1
The PCP draining optimization (patch [1/9], [3/9]) improves performance
greatly. The PCP tuning (patch [7/9]) reduces performance a little because
PCP draining cannot be triggered in time sometimes. The further PCP
draining optimization (patch [9/9]) based on PCP tuning restores the
performance partly.
The patchset adds several fields to struct per_cpu_pages. The struct
layout before/after the patchset is as follows,
base
====
struct per_cpu_pages {
	spinlock_t lock;		/*     0     4 */
	int count;			/*     4     4 */
	int high;			/*     8     4 */
	int batch;			/*    12     4 */
	short int free_factor;		/*    16     2 */
	short int expire;		/*    18     2 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head lists[13];	/*    24   208 */

	/* size: 256, cachelines: 4, members: 7 */
	/* sum members: 228, holes: 1, sum holes: 4 */
	/* padding: 24 */
} __attribute__((__aligned__(64)));
patched
=======
struct per_cpu_pages {
	spinlock_t lock;		/*     0     4 */
	int count;			/*     4     4 */
	int high;			/*     8     4 */
	int high_min;			/*    12     4 */
	int high_max;			/*    16     4 */
	int batch;			/*    20     4 */
	u8 flags;			/*    24     1 */
	u8 alloc_factor;		/*    25     1 */
	u8 expire;			/*    26     1 */

	/* XXX 1 byte hole, try to pack */

	short int free_count;		/*    28     2 */

	/* XXX 2 bytes hole, try to pack */

	struct list_head lists[13];	/*    32   208 */

	/* size: 256, cachelines: 4, members: 11 */
	/* sum members: 237, holes: 2, sum holes: 3 */
	/* padding: 16 */
} __attribute__((__aligned__(64)));
The size of the struct doesn't change with the patchset.
This patch (of 9):
In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages
on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
PCP is mostly used for high-order page freeing, to improve the reuse of
cache-hot pages between the page allocating and freeing CPUs.
But, the PCP draining mechanism may be triggered unexpectedly when a
process exits. With some customized trace points, it was found that PCP
draining (free_high == true) was triggered by an order-1 page freeing
with the following call stack,
=> free_unref_page_commit
=> free_unref_page
=> __mmdrop
=> exit_mm
=> do_exit
=> do_group_exit
=> __x64_sys_exit_group
=> do_syscall_64
Checking the source code, this is the page table PGD freeing
(mm_free_pgd()). It's an order-1 page freeing if
CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
security.
Just before that, page freeing with the following call stack was found,
=> free_unref_page_commit
=> free_unref_page_list
=> release_pages
=> tlb_batch_pages_flush
=> tlb_finish_mmu
=> exit_mmap
=> __mmput
=> exit_mm
=> do_exit
=> do_group_exit
=> __x64_sys_exit_group
=> do_syscall_64
So, when a process exits,
- a large number of user pages of the process will be freed without
page allocation, so it's highly possible that pcp->free_factor becomes
> 0. In fact, this is expected behavior to improve process exit
performance.
- after freeing all user pages, the PGD will be freed, which is an
order-1 page freeing, so the PCP will be drained.
All in all, when a process exits, it's highly possible that the PCP will
be drained. This is unexpected behavior.
To avoid this, in this patch, the PCP draining will only be triggered for
2 consecutive high-order page freeings.
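The detection can be sketched as follows (illustrative; the flag name
matches the PCPF_PREV_FREE_HIGH_ORDER flag the patch adds, but the
surrounding logic is simplified):

static bool want_free_high(struct per_cpu_pages *pcp, unsigned int order)
{
	bool prev_high = pcp->flags & PCPF_PREV_FREE_HIGH_ORDER;

	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
		return prev_high;	/* drain only on the 2nd in a row */
	}
	pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
	return false;
}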
On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups. This
simulates the
kbuild server that is used by 0-Day kbuild service. With the patch, the
cycles% of the spinlock contention (mostly for zone lock) decreases from
14.0% to 12.8% (with PCP size == 367). The number of PCP drains for
high-order page freeing (free_high) decreases by 80.5%.
This helps network workloads too, due to the reduced zone lock
contention. On a 2-socket Intel server with 128 logical CPUs, with the
patch, the network bandwidth of the UNIX (AF_UNIX) test case of the
lmbench test suite with 16-pair processes increases by 16.8%. The cycles%
of the spinlock contention (mostly for zone lock) decreases from 51.4% to
46.1%. The number of PCP drains for high-order page freeing (free_high)
decreases by 30.5%. The cache miss rate stays at 0.2%.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
With all users converted, remove the old create_empty_buffers() and rename
folio_create_empty_buffers() to create_empty_buffers().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Extract this useful helper from nilfs_page_get_nth_block().
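A sketch of the extracted helper, assuming the shape of the nilfs
original (walk the folio's circular buffer list, then take a reference):

static inline struct buffer_head *get_nth_bh(struct buffer_head *bh,
					     unsigned int count)
{
	while (count-- > 0)
		bh = bh->b_this_page;
	get_bh(bh);
	return bh;
}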
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Ryusuke Konishi <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Finish the create_empty_buffers() transition", v2.
Pankaj recently added folio_create_empty_buffers() as the folio equivalent
to create_empty_buffers(). This patch set finishes the conversion by
first converting all remaining filesystems to call
folio_create_empty_buffers(), then renaming it back to
create_empty_buffers(). I took the opportunity to make a few
simplifications like making folio_create_empty_buffers() return the head
buffer and extracting get_nth_bh() from nilfs2.
A few of the patches in this series aren't directly related to
create_empty_buffers(), but I saw them while I was working on this and
thought they'd be easy enough to add to this series. Compile-tested only,
other than ext4.
This patch (of 26):
Almost all callers want to know the first BH that was allocated for this
folio. We already have that handy, so return it.
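For reference, the resulting signature looks like this (consistent with
the description above):

struct buffer_head *folio_create_empty_buffers(struct folio *folio,
					       unsigned long blocksize,
					       unsigned long b_state);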
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Pankaj Raghav <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
This pull request contains two late Netfilter flowtable fixes for net:
1) Flowtable GC pushes back packets to classic path in every GC run,
ie. every second. This is because NF_FLOW_HW_ESTABLISHED is only
used by sched/act_ct (never set) and IPS_SEEN_REPLY might be unset
by the time the flow is offloaded (this status bit is only reliable
in the sched/act_ct datapath).
2) sched/act_ct logic to push back packets to classic path to reevaluate
if a UDP flow is unidirectional only applies if IPS_HW_OFFLOAD_BIT is
set and no hardware offload request is pending to be handled.
From Vlad Buslov.
These two patches fix two problems that were introduced in the
previous 6.5 development cycle.
* tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
net/sched: act_ct: additional checks for outdated flows
netfilter: flowtable: GC pushes back packets to classic path
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/drivers
More Qualcomm driver updates for v6.7
The Qualcomm SMC and QSEECOM drivers are moved into a "qcom"
subdirectory, to declutter the base directory. Missing include guards
are added to the qseecom header file. Unneeded extern specifiers are
removed from the scm call wrappers.
__counted_by is added to the apr_rx_buf structure, in the APR driver.
Lastly in the pmic_glink driver the pmic_glink drm_bridge type is
corrected to DisplayPort, over the incorrect "USB" value. The return
values are added to error prints for the various typec set() calls.
* tag 'qcom-drivers-for-6.7-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
soc: qcom: pmic_glink_altmode: Print return value on error
firmware: qcom: scm: remove unneeded 'extern' specifiers
firmware: qcom: scm: add a missing forward declaration for struct device
firmware: qcom: move Qualcomm code into its own directory
soc: qcom: apr: Add __counted_by for struct apr_rx_buf and use struct_size()
soc: qcom: pmic_glink: fix connector type to be DisplayPort
firmware: qcom: qseecom: add missing include guards
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
|
|
Today we got a report at [1] for rcu stalls on the i915 testsuite in [2]
due to the conversion of files to SLAB_TYPESAFE_BY_RCU. Afaict,
get_file_rcu() goes into an infinite loop trying to carefully verify
that i915->gem.mmap_singleton hasn't changed - see the splat below.
So I stared at this code to figure out what it actually does. It seems
that the i915->gem.mmap_singleton pointer itself never had rcu semantics.
The i915->gem.mmap_singleton is replaced in
file->f_op->release::singleton_release():
static int singleton_release(struct inode *inode, struct file *file)
{
	struct drm_i915_private *i915 = file->private_data;

	cmpxchg(&i915->gem.mmap_singleton, file, NULL);
	drm_dev_put(&i915->drm);

	return 0;
}
The cmpxchg() is ordered against a concurrent update of
i915->gem.mmap_singleton from mmap_singleton(). IOW, when
mmap_singleton() fails to get a reference on i915->gem.mmap_singleton:
While mmap_singleton() does
	rcu_read_lock();
	file = get_file_rcu(&i915->gem.mmap_singleton);
	rcu_read_unlock();
it allocates a new file via anon_inode_getfile() and does
	smp_store_mb(i915->gem.mmap_singleton, file);
So, then what happens in the case of this bug is that at some point
fput() is called and drops the file->f_count to zero, leaving the pointer
in i915->gem.mmap_singleton intact.
Now, there might be delays until
file->f_op->release::singleton_release() is called and
i915->gem.mmap_singleton is set to NULL.
Say concurrently another task hits mmap_singleton() and does:
	rcu_read_lock();
	file = get_file_rcu(&i915->gem.mmap_singleton);
	rcu_read_unlock();
When get_file_rcu() fails to get a reference via atomic_inc_not_zero()
it will try the reload from i915->gem.mmap_singleton expecting it to be
NULL, assuming it has comparable semantics as we expect in
__fget_files_rcu().
But it doesn't, so it reloads the same pointer again, trying the same
atomic_inc_not_zero() again and doing so until
file->f_op->release::singleton_release() of the old file has been
called.
So, in contrast to __fget_files_rcu(), here we do not want to retry when
atomic_inc_not_zero() has failed. We only want to retry if we managed to
get a reference but the pointer changed on reload.
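The fixed loop shape can be sketched like this (simplified, not the
literal fs/file.c code):

static struct file *get_file_rcu_sketch(struct file __rcu **f)
{
	for (;;) {
		struct file *file = rcu_dereference(*f);

		if (!file)
			return NULL;
		/* Count already zero: do NOT retry, report failure. */
		if (!atomic_long_inc_not_zero(&file->f_count))
			return NULL;
		/*
		 * Retry only if we got a reference but the pointer was
		 * swapped out from under us in the meantime.
		 */
		if (file == rcu_dereference(*f))
			return file;
		fput(file);
	}
}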
<3> [511.395679] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
<3> [511.395716] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-9): P6238
<3> [511.395934] rcu: (detected by 16, t=65002 jiffies, g=123977, q=439 ncpus=20)
<6> [511.395944] task:i915_selftest state:R running task stack:10568 pid:6238 tgid:6238 ppid:1001 flags:0x00004002
<6> [511.395962] Call Trace:
<6> [511.395966] <TASK>
<6> [511.395974] ? __schedule+0x3a8/0xd70
<6> [511.395995] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
<6> [511.396003] ? lockdep_hardirqs_on+0xc3/0x140
<6> [511.396013] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
<6> [511.396029] ? get_file_rcu+0x10/0x30
<6> [511.396039] ? get_file_rcu+0x10/0x30
<6> [511.396046] ? i915_gem_object_mmap+0xbc/0x450 [i915]
<6> [511.396509] ? i915_gem_mmap+0x272/0x480 [i915]
<6> [511.396903] ? mmap_region+0x253/0xb60
<6> [511.396925] ? do_mmap+0x334/0x5c0
<6> [511.396939] ? vm_mmap_pgoff+0x9f/0x1c0
<6> [511.396949] ? rcu_is_watching+0x11/0x50
<6> [511.396962] ? igt_mmap_offset+0xfc/0x110 [i915]
<6> [511.397376] ? __igt_mmap+0xb3/0x570 [i915]
<6> [511.397762] ? igt_mmap+0x11e/0x150 [i915]
<6> [511.398139] ? __trace_bprintk+0x76/0x90
<6> [511.398156] ? __i915_subtests+0xbf/0x240 [i915]
<6> [511.398586] ? __pfx___i915_live_setup+0x10/0x10 [i915]
<6> [511.399001] ? __pfx___i915_live_teardown+0x10/0x10 [i915]
<6> [511.399433] ? __run_selftests+0xbc/0x1a0 [i915]
<6> [511.399875] ? i915_live_selftests+0x4b/0x90 [i915]
<6> [511.400308] ? i915_pci_probe+0x106/0x200 [i915]
<6> [511.400692] ? pci_device_probe+0x95/0x120
<6> [511.400704] ? really_probe+0x164/0x3c0
<6> [511.400715] ? __pfx___driver_attach+0x10/0x10
<6> [511.400722] ? __driver_probe_device+0x73/0x160
<6> [511.400731] ? driver_probe_device+0x19/0xa0
<6> [511.400741] ? __driver_attach+0xb6/0x180
<6> [511.400749] ? __pfx___driver_attach+0x10/0x10
<6> [511.400756] ? bus_for_each_dev+0x77/0xd0
<6> [511.400770] ? bus_add_driver+0x114/0x210
<6> [511.400781] ? driver_register+0x5b/0x110
<6> [511.400791] ? i915_init+0x23/0xc0 [i915]
<6> [511.401153] ? __pfx_i915_init+0x10/0x10 [i915]
<6> [511.401503] ? do_one_initcall+0x57/0x270
<6> [511.401515] ? rcu_is_watching+0x11/0x50
<6> [511.401521] ? kmalloc_trace+0xa3/0xb0
<6> [511.401532] ? do_init_module+0x5f/0x210
<6> [511.401544] ? load_module+0x1d00/0x1f60
<6> [511.401581] ? init_module_from_file+0x86/0xd0
<6> [511.401590] ? init_module_from_file+0x86/0xd0
<6> [511.401613] ? idempotent_init_module+0x17c/0x230
<6> [511.401639] ? __x64_sys_finit_module+0x56/0xb0
<6> [511.401650] ? do_syscall_64+0x3c/0x90
<6> [511.401659] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<6> [511.401684] </TASK>
Link: [1]: https://lore.kernel.org/intel-gfx/SJ1PR11MB6129CB39EED831784C331BAFB9DEA@SJ1PR11MB6129.namprd11.prod.outlook.com
Link: [2]: https://intel-gfx-ci.01.org/tree/linux-next/next-20231013/bat-dg2-11/igt@i915_selftest@[email protected]#dmesg-warnings10963
Cc: Jann Horn <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/20231025-formfrage-watscheln-84526cd3bd7d@brauner
Signed-off-by: Christian Brauner <[email protected]>
|
|
So that other users can access it. Notably, MPTCP will use it in the
next patch.
No functional change intended.
Acked-by: Matthieu Baerts <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
This is the folio equivalent of unmap_and_put_page(), which remains as
a wrapper for it.
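The wrapper shape described above (a sketch; see include/linux/highmem.h
for the merged version):

static inline void unmap_and_put_page(struct page *page, void *addr)
{
	folio_release_kmap(page_folio(page), addr);
}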
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Message-Id: <[email protected]>
|
|
Allow HID drivers to pass ->suspend, ->resume and ->reset_resume via
pm_ptr().
Through the usage of pm_ptr() the CONFIG_PM-dependent code will always be
compiled, protecting against bitrot.
The linker will then garbage-collect the unused functions, avoiding any overhead.
The only overhead in the final kernel image and at runtime are a few
extra bytes in 'struct hid_driver'.
The same approach is chosen by 'struct usb_driver' and other subsystems.
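A hedged usage sketch for a driver (driver and function names are
hypothetical):

static int my_hid_suspend(struct hid_device *hdev, pm_message_t message)
{
	/* power down driver-specific state here */
	return 0;
}

static struct hid_driver my_hid_driver = {
	.name    = "my-hid",
	/* Compiled unconditionally; with CONFIG_PM=n, pm_ptr() yields
	 * NULL and the linker discards the unreferenced function. */
	.suspend = pm_ptr(my_hid_suspend),
};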
Signed-off-by: Thomas Weißschuh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Benjamin Tissoires <[email protected]>
|
|
The main motivation is to repeat that dumb buffers should not be
abused for anything else than basic software rendering with KMS.
User-space devs are more likely to look at the IOCTL docs than to
actively search for the driver-oriented "Dumb Buffer Objects"
section.
v2: reference DRM_CAP_DUMB_BUFFER, DRM_CAP_DUMB_PREFERRED_DEPTH and
DRM_CAP_DUMB_PREFER_SHADOW (Pekka)
Signed-off-by: Simon Ser <[email protected]>
Acked-by: Daniel Vetter <[email protected]>
Acked-by: Pekka Paalanen <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
The superhyway bus driver was only referenced on SH4-202, which is now gone,
so remove it all as well.
I could find no trace of anything ever calling superhyway_register_driver(),
neither in the git history nor on the web, so I assume this has never
served any purpose on mainline kernels.
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: John Paul Adrian Glaubitz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: John Paul Adrian Glaubitz <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux
Merge devfreq updates for v6.7 from Chanwoo Choi:
" Detailed description for this pull request:
1. Update devfreq core
- Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs
to support specific devices like UFS which handle multiple clocks
through the OPP (Operating Performance Point) framework.
2. Update the devfreq / devfreq-event drivers
- Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-event driver.
: Generalize rockchip-dfi.c to support the new RK3568/RK3588 using different DDR types.
: Convert the devicetree binding document format to yaml.
: DFI is a unit which is suitable for measuring DDR utilization
for the DDR frequency scaling driver. Add perf support feature
to rockchip-dfi.c to extend DFI usage. The perf support has been tested
on a RK3568 and a RK3399.
- Protect the OPP handling code in critical section
because the voltage of shared OPP might be changed by multiple drivers
on Mediatek CCI devfreq driver.
- Use device_get_match_data() on Samsung Exynos PPMU devfreq-event driver."
* tag 'devfreq-next-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux: (26 commits)
dt-bindings: devfreq: event: rockchip,dfi: Add rk3588 support
dt-bindings: devfreq: event: rockchip,dfi: Add rk3568 support
dt-bindings: devfreq: event: convert Rockchip DFI binding to yaml
PM / devfreq: rockchip-dfi: add support for RK3588
PM / devfreq: rockchip-dfi: account for multiple DDRMON_CTRL registers
PM / devfreq: rockchip-dfi: make register stride SoC specific
PM / devfreq: rockchip-dfi: Add perf support
PM / devfreq: rockchip-dfi: give variable a better name
PM / devfreq: rockchip-dfi: Prepare for multiple users
PM / devfreq: rockchip-dfi: Pass private data struct to internal functions
PM / devfreq: rockchip-dfi: Handle LPDDR4X
PM / devfreq: rockchip-dfi: Handle LPDDR2 correctly
PM / devfreq: rockchip-dfi: Add RK3568 support
PM / devfreq: rockchip-dfi: Clean up DDR type register defines
PM / devfreq: rk3399_dmc,dfi: generalize DDRTYPE defines
PM / devfreq: rockchip-dfi: introduce channel mask
PM / devfreq: rockchip-dfi: Use free running counter
PM / devfreq: mediatek: unlock on error in mtk_ccifreq_target()
PM / devfreq: exynos-ppmu: Use device_get_match_data()
PM / devfreq: rockchip-dfi: dfi store raw values in counter struct
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Merge OPP (operating performance points) updates for 6.7 from Viresh
Kumar:
"- Extend support for the opp-level beyond required-opps (Ulf Hansson).
- Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru).
- dt-bindings: Allow opp-peak-kBps for kryo CPUs, support Qualcomm Krait
SoCs and document named opp-microvolt property (Bjorn Andersson,
Dmitry Baryshkov and Christian Marangi).
- Fix -Wunsequenced warning (Nathan Chancellor).
- General cleanup (Viresh Kumar)."
* tag 'opp-updates-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
dt-bindings: opp: opp-v2-kryo-cpu: Document named opp-microvolt property
OPP: No need to defer probe from _opp_attach_genpd()
OPP: Remove genpd_virt_dev_lock
OPP: Reorder code in _opp_set_required_opps_genpd()
OPP: Add _link_required_opps() to avoid code duplication
OPP: Fix formatting of if/else block
dt-bindings: opp: opp-v2-kryo-cpu: support Qualcomm Krait SoCs
OPP: Fix -Wunsequenced in _of_add_opp_table_v1()
dt-bindings: opp: opp-v2-kryo-cpu: Allow opp-peak-kBps
OPP: debugfs: Fix warning with W=1 builds
OPP: Remove doc style comments for internal routines
OPP: Add dev_pm_opp_find_level_floor()
OPP: Extend support for the opp-level beyond required-opps
OPP: Switch to use dev_pm_domain_set_performance_state()
OPP: Extend dev_pm_opp_data with a level
OPP: Add dev_pm_opp_add_dynamic() to allow more flexibility
PM: domains: Implement the ->set_performance_state() callback for genpd
PM: domains: Introduce dev_pm_domain_set_performance_state()
|
|
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux
Merge thermal control (ARM drivers mostly) updates for 6.7-rc1 from
Daniel Lezcano:
"- Add support for Mediatek LVTS MT8192 driver along with the
suspend/resume routines (Balsam Chihi)
- Fix probe for THERMAL_V2 for the Mediatek LVTS driver (Markus
Schneider-Pargmann)
- Remove duplicate error message in the max77620 driver when
thermal_of_zone_register() fails, as the subroutine already shows one
(Thierry Reding)
- Add i.MX7D compatible bindings to fix a warning from dtbs_check for
the imx6ul platform (Alexander Stein)
- Add sa8775p compatible for the QCom tsens driver (Priyansh Jain)
- Fix error check in lvts_debugfs_init() which is checking against
NULL instead of PTR_ERR() on the LVTS Mediatek driver (Minjie Du)
- Remove unused variable in the thermal/tools (Kuan-Wei Chiu)
- Document the imx8dl thermal sensor (Fabio Estevam)
- Add variable names in callback prototypes to prevent warning from
checkpatch.pl for the imx8mm driver (Bragatheswaran Manickavel)
- Add missing unevaluatedProperties on child node schemas for tegra124
(Rob Herring)
- Add mt7988 support for the Mediatek LVTS driver (Frank Wunderlich)"
* tag 'thermal-v6.7-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux:
thermal/qcom/tsens: Drop ops_v0_1
thermal/drivers/mediatek/lvts_thermal: Update calibration data documentation
thermal/drivers/mediatek/lvts_thermal: Add mt8192 support
thermal/drivers/mediatek/lvts_thermal: Add suspend and resume
dt-bindings: thermal: mediatek: Add LVTS thermal controller definition for mt8192
thermal/drivers/mediatek: Fix probe for THERMAL_V2
thermal/drivers/max77620: Remove duplicate error message
dt-bindings: timer: add imx7d compatible
dt-bindings: net: microchip: Allow nvmem-cell usage
dt-bindings: imx-thermal: Add #thermal-sensor-cells property
dt-bindings: thermal: tsens: Add sa8775p compatible
thermal/drivers/mediatek/lvts_thermal: Fix error check in lvts_debugfs_init()
tools/thermal: Remove unused 'mds' and 'nrhandler' variables
dt-bindings: thermal: fsl,scu-thermal: Document imx8dl
thermal/drivers/imx8mm_thermal: Fix function pointer declaration by adding identifier name
dt-bindings: thermal: nvidia,tegra124-soctherm: Add missing unevaluatedProperties on child node schemas
thermal/drivers/mediatek/lvts_thermal: Add mt7988 support
thermal/drivers/mediatek/lvts_thermal: Make coeff configurable
dt-bindings: thermal: mediatek: Add LVTS thermal sensors for mt7988
dt-bindings: thermal: mediatek: Add mt7988 lvts compatible
|
|
Since 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded
unreplied tuple"), flowtable GC pushes back flows with IPS_SEEN_REPLY
back to classic path in every run, ie. every second. This is because of
a new check for NF_FLOW_HW_ESTABLISHED which is specific of sched/act_ct.
In Netfilter's flowtable case, NF_FLOW_HW_ESTABLISHED never gets set on
and IPS_SEEN_REPLY is unreliable since users decide when to offload the
flow before, such bit might be set on at a later stage.
Fix it by adding a custom .gc handler that sched/act_ct can use to
deal with its NF_FLOW_HW_ESTABLISHED bit.
Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
Reported-by: Vladimir Smelhaus <[email protected]>
Reviewed-by: Paul Blakey <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
net->ct.labels_used was meant to convey 'number of ip/nftables rules
that need the label extension allocated'.
act_ct enables this for each net namespace, which voids all attempts
to avoid ct->ext allocation when possible.
Move this increment to the control plane to request label extension
space allocation only when it's needed.
Signed-off-by: Florian Westphal <[email protected]>
Reviewed-by: Pedro Tammela <[email protected]>
Reviewed-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add a comment to indicate that the target_destroy callback in the
scsi_host_template must not sleep.
Signed-off-by: Wenchao Hao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the driver's xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.
One of the goals was that in the case of Pod egress traffic, this allows
moving BPF programs from hostns tcx ingress into the device itself,
providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).
In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.
Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.
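A hedged sketch of attaching via libbpf's netkit API, which was added
alongside this series (object and program names are hypothetical):

#include <bpf/libbpf.h>

static struct bpf_link *attach_netkit(struct bpf_object *obj, int ifindex)
{
	struct bpf_program *prog =
		bpf_object__find_program_by_name(obj, "nk_xmit");
	/* bpf_mprog-style dependency controls go into the opts. */
	LIBBPF_OPTS(bpf_netkit_opts, opts);

	/* Attach on the primary device; the peer cannot manage programs. */
	return bpf_program__attach_netkit(prog, ifindex, &opts);
}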
Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic leaks between the time when a Pod is
brought up by the CNI plugin and the time programs are attached by the
agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.
An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.
Co-developed-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
For unimplemented counters, the registers PM{C,I}NTEN{SET,CLR}
and PMOVS{SET,CLR} are expected to have the corresponding bits RAZ.
Hence, to ensure KVM's PMU emulation is correct, mask out the RES0 bits.
Defer this work to the point that userspace can no longer change the
number of advertised PMCs.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
The number of PMU event counters is indicated in PMCR_EL0.N.
For a vCPU with PMUv3 configured, the value is set to the same
value as the current PE on every vCPU reset. Unless the vCPU is
pinned to PEs that have the PMU associated with the guest from the
initial vCPU reset, the value might be different from the PMU's
PMCR_EL0.N on heterogeneous PMU systems.
Fix this by setting the vCPU's PMCR_EL0.N to the PMU's PMCR_EL0.N
value. Track the PMCR_EL0.N per guest, as only one PMU can be set
for the guest (PMCR_EL0.N must be the same for all vCPUs of the
guest), and it is convenient for updating the value.
To achieve this, the patch introduces a helper,
kvm_arm_pmu_get_max_counters(), that reads the maximum number of
counters from the arm_pmu associated to the VM. Make the function
global as upcoming patches will be interested in knowing the value
while setting the PMCR.N of the guest from userspace.
KVM does not yet support userspace modifying PMCR_EL0.N.
The following patch will add support for that.
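A hedged sketch of the helper: arm_pmu->num_events includes the cycle
counter, so PMCR_EL0.N (the number of generic event counters) is one
less. Simplified from the helper the patch introduces:

u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm)
{
	struct arm_pmu *arm_pmu = kvm->arch.arm_pmu;

	return arm_pmu->num_events - 1;
}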
Reviewed-by: Sebastian Ott <[email protected]>
Co-developed-by: Marc Zyngier <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Reiji Watanabe <[email protected]>
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Add a helper to read a vCPU's PMCR_EL0, and use it whenever KVM
reads a vCPU's PMCR_EL0.
Currently, the PMCR_EL0 value is tracked per vCPU. The following
patches will make (only) PMCR_EL0.N track per guest. Having the
new helper will be useful to combine the PMCR_EL0.N field
(tracked per guest) and the other fields (tracked per vCPU)
to provide the value of PMCR_EL0.
No functional change intended.
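A hedged sketch of the helper's eventual shape once PMCR_EL0.N is
tracked per guest (the pmcr_n field name is an assumption):

u64 kvm_vcpu_read_pmcr(struct kvm_vcpu *vcpu)
{
	u64 pmcr = __vcpu_sys_reg(vcpu, PMCR_EL0);

	/* Per-vCPU bits, with the per-guest N field (bits [15:11]) merged. */
	pmcr &= ~GENMASK(15, 11);
	return pmcr | ((u64)vcpu->kvm->arch.pmcr_n << 11);
}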
Reviewed-by: Sebastian Ott <[email protected]>
Signed-off-by: Reiji Watanabe <[email protected]>
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Future changes to KVM's sysreg emulation will rely on having a valid PMU
instance to determine the number of implemented counters (PMCR_EL0.N).
This is earlier than when userspace is expected to modify the vPMU
device attributes, where the default is selected today.
Select the default PMU when handling KVM_ARM_VCPU_INIT such that it is
available in time for sysreg emulation.
Reviewed-by: Sebastian Ott <[email protected]>
Co-developed-by: Marc Zyngier <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Reiji Watanabe <[email protected]>
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[Oliver: rewrite changelog]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Use FIELD_GET() to remove dependencies on the field position, i.e., the
shift value. No functional change intended.
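An example of the transformation (illustrative; FIELD_GET() comes from
linux/bitfield.h):

/* Before: open-coded mask-and-shift tied to the field's position. */
static u16 link_width_old(u16 lnksta)
{
	return (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
}

/* After: FIELD_GET() derives the shift from the mask at compile time. */
static u16 link_width_new(u16 lnksta)
{
	return FIELD_GET(PCI_EXP_LNKSTA_NLW, lnksta);
}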
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
|
|
Use FIELD_GET() to remove dependencies on the field position, i.e., the
shift value. No functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
|
|
The PASID Capability and Control registers are both 16 bits wide. Use
16-bit wide constants in field names to match the register width. No
functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct crash_mem.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
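The resulting annotation (sketch; the flexible array is bounded by the
allocation-capacity field):

struct crash_mem {
	unsigned int max_nr_ranges;
	unsigned int nr_ranges;
	struct range ranges[] __counted_by(max_nr_ranges);
};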
Cc: Eric Biederman <[email protected]>
Cc: [email protected]
Acked-by: Baoquan He <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Three more fixes:
- don't drop all unprotected public action frames since
some don't have a protected dual
- fix pointer confusion in scanning code
- fix warning in some connections with multiple links
* tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: don't drop all unprotected public action frames
wifi: cfg80211: fix assoc response warning on failed links
wifi: cfg80211: pass correct pointer to rdev_inform_bss()
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
This preserves the existing IFLA_DSA_MASTER which is part of the uAPI
and creates an alias named IFLA_DSA_CONDUIT.
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Use more inclusive terms throughout the DSA subsystem by moving away
from "master" which is replaced by "conduit" and "slave" which is
replaced by "user". No functional changes.
Acked-by: Rob Herring <[email protected]>
Acked-by: Stephen Hemminger <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
generated with:
$ ./tools/net/ynl/ynl-gen-c.py --mode uapi \
> --spec Documentation/netlink/specs/mptcp.yaml \
> --header -o include/uapi/linux/mptcp_pm.h
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
In the current MPTCP control plane, all operations use a netlink
attribute of the same type "MPTCP_PM_ATTR". However, add/del/get/flush
operations only parse the first element in the message, the one that
describes MPTCP endpoints (that was named MPTCP_PM_ATTR_ADDR and
mostly used in ADD_ADDR operations; probably the similarity of "attr",
"addr" and "add" might cause some confusion to human readers).
Convert MPTCP from 'small_ops' to 'ops', thus allowing different attributes
for each single operation, which hopefully makes all this clearer to human
readers.
- use a separate attribute set for add/del/get/flush address operation,
binary compatible with the existing one, to store the endpoint address.
MPTCP_PM_ENDPOINT_ADDR is added to the uAPI (with the same value as
MPTCP_PM_ATTR_ADDR) for these operations.
- convert mptcp_pm_ops[] and add policy files accordingly.
This prepares the MPTCP control plane to be described as a YAML spec.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 12 are cc:stable and the remainder address post-6.5
issues or aren't considered necessary for earlier kernel versions"
* tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier
mailmap: correct email aliasing for Oleksij Rempel
mailmap: map Bartosz's old address to the current one
mm/damon/sysfs: check DAMOS regions update progress from before_terminate()
MAINTAINERS: Ondrej has moved
kasan: disable kasan_non_canonical_hook() for HW tags
kasan: print the original fault addr when access invalid shadow
hugetlbfs: close race between MADV_DONTNEED and page fault
hugetlbfs: extend hugetlb_vma_lock to private VMAs
hugetlbfs: clear resv_map pointer if mmap fails
mm: zswap: fix pool refcount bug around shrink_worker()
mm/migrate: fix do_pages_move for compat pointers
riscv: fix set_huge_pte_at() for NAPOT mappings when a swap entry is set
riscv: handle VM_FAULT_[HWPOISON|HWPOISON_LARGE] faults instead of panicking
mmap: fix error paths with dup_anon_vma()
mmap: fix vma_iterator in error path of vma_merge()
mm: fix vm_brk_flags() to not bail out while holding lock
mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer
mm/page_alloc: correct start page when guard page debug is enabled
|
|
Introduce acpi_dev_uid_match() helper that matches the device with
supplied _UID string.
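A hedged usage sketch ('adev' is a hypothetical struct acpi_device
pointer):

static bool has_uid_one(struct acpi_device *adev)
{
	/* True when the device's _UID string is exactly "1". */
	return acpi_dev_uid_match(adev, "1");
}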
Signed-off-by: Raag Jadav <[email protected]>
Reviewed-by: Mika Westerberg <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
The DRM_FORMAT_NV20 and DRM_FORMAT_NV30 formats are the 2x1 and
non-subsampled variants of NV15, a 10-bit 2-plane YUV format that has no
padding between
components. Instead, luminance and chrominance samples are grouped into 4s
so that each group is packed into an integer number of bytes:
YYYY = UVUV = 4 * 10 bits = 40 bits = 5 bytes
The '20' and '30' suffix refers to the optimum effective bits per pixel
which is achieved when the total number of luminance samples is a multiple
of 4.
V2: Added NV30 format
Signed-off-by: Jonas Karlman <[email protected]>
Reviewed-by: Sandy Huang <[email protected]>
Reviewed-by: Christopher Obbard <[email protected]>
Tested-by: Christopher Obbard <[email protected]>
Signed-off-by: Heiko Stuebner <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Suzuki noticed that KVM's PMU emulation is oblivious to the NSU and NSK
event filter bits. On systems that have EL3 these bits modify the
filter behavior in non-secure EL0 and EL1, respectively. Even though the
kernel doesn't use these bits, it is entirely possible some other guest
OS does. Additionally, it would appear that these and the M bit are
required by the architecture if EL3 is implemented.
Allow the EL3 event filter bits to be set if EL3 is advertised in the
guest's ID register. Implement the behavior of NSU and NSK according to
the pseudocode, and entirely ignore the M bit for perf event creation.
Reported-by: Suzuki K Poulose <[email protected]>
Reviewed-by: Suzuki K Poulose <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
The NSH bit, which filters event counting at EL2, is required by the
architecture if an implementation has EL2. Even though KVM doesn't
support nested virt yet, it makes no effort to hide the existence of EL2
from the ID registers. Userspace can, however, change the value of PFR0
to hide EL2. Align KVM's sysreg emulation with the architecture and make
NSH RES0 if EL2 isn't advertised. Keep in mind the bit is ignored when
constructing the backing perf event.
While at it, build the event type mask using explicit field definitions
instead of relying on ARMV8_PMU_EVTYPE_MASK. KVM probably should've been
doing this in the first place, as it avoids changes to the
aforementioned mask affecting sysreg emulation.
Reviewed-by: Suzuki K Poulose <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Add new defines for DPC reason fields and use them instead of literals.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ilpo Järvinen <[email protected]>
[bhelgaas: shorten comments]
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Use FIELD_GET() to remove dependencies on the field position, i.e., the
shift value. No functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Convert open-coded variants of PCI field access into FIELD_GET/PREP()
to make the code easier to understand.
Add two missing defines into pci_regs.h. Logically, the Max No-Snoop
Latency Register is a separate word-sized register in the PCIe spec,
but the pre-existing LTR defines in pci_regs.h with dword-long values
seem to consider the registers together (the same goes for the only
user). Thus, follow the custom and make the new values also treat both
word-long LTR registers as a joint dword register.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
VFIO has an operation where it unmaps an IOVA while returning a bitmap with
the dirty data. In reality the operation doesn't actually query the IO
pagetables for whether each PTE was dirty or not. Instead it marks as
dirty anything that was mapped, and does so in one syscall.
In IOMMUFD the equivalent is done in two operations by querying with
GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
flushes given that after clearing dirty bits IOMMU implementations require
invalidating their IOTLB, plus another invalidation needed for the UNMAP.
To allow dirty bits to be queried faster, add a flag
(IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR) that requests to not clear the dirty
bits from the PTE (but just reading them), under the expectation that the
next operation is the unmap. An alternative is to unmap and just
perpetually mark as dirty, as that's the same behaviour as today. So here
equivalent functionality can be provided with unmap alone, and if real
dirty info is required it will amortize the cost while querying.
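A hedged userspace sketch of the faster flow; struct and ioctl names
follow the iommufd uAPI, error handling elided:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int query_dirty_no_clear(int iommufd, __u32 hwpt_id, __u64 iova,
				__u64 length, __u64 page_size, void *bitmap)
{
	struct iommu_hwpt_get_dirty_bitmap get = {
		.size = sizeof(get),
		.hwpt_id = hwpt_id,
		/* Read the dirty bits but leave the PTEs untouched: no
		 * extra IOTLB flush; the following unmap performs the
		 * only invalidation needed. */
		.flags = IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR,
		.iova = iova,
		.length = length,
		.page_size = page_size,
		.data = (uintptr_t)bitmap,
	};

	return ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_BITMAP, &get);
}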
There's still a race against DMA where in theory the unmap of the IOVA
(when the guest invalidates the IOTLB via emulated iommu) would race
against the VF performing DMA on the same IOVA. As discussed in [0], we are
accepting to resolve this race as throwing away the DMA and it doesn't
matter if it hit physical DRAM or not, the VM can't tell if we threw it
away because the DMA was blocked or because we failed to copy the DRAM.
[0] https://lore.kernel.org/linux-iommu/[email protected]/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Extend IOMMUFD_CMD_GET_HW_INFO op to query generic iommu capabilities for a
given device.
Capabilities are IOMMU agnostic and use device_iommu_capable() API passing
one of the IOMMU_CAP_*. Enumerate IOMMU_CAP_DIRTY_TRACKING for now in the
out_capabilities field returned back to userspace.
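A hedged sketch of probing this from userspace (simplified, error
handling elided):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static bool supports_dirty_tracking(int iommufd, __u32 dev_id)
{
	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,
	};

	if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
		return false;
	return info.out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING;
}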
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Connect a hw_pagetable to the IOMMU core dirty tracking
read_and_clear_dirty iommu domain op. It exposes all of the functionality
for the UAPI that read the dirtied IOVAs while clearing the Dirty bits from
the PTEs.
In doing so, add an IO pagetable API iopt_read_and_clear_dirty_data() that
performs the reading of dirty IOPTEs for a given IOVA range and then
copying back to userspace bitmap.
Underneath it uses the IOMMU domain kernel API which will read the dirty
bits, as well as atomically clearing the IOPTE dirty bit and flushing the
IOTLB at the end. The IOVA bitmaps usage takes care of the iteration of the
bitmaps user pages efficiently and without copies. Within the iterator
function we iterate over io-pagetable contiguous areas that have been
mapped.
Contrary to a past incarnation of a similar interface in VFIO, the IOVA
range to be scanned is tied to the bitmap size, thus the application needs
to pass an appropriately sized bitmap address taking into account the iova
range being passed *and* the page size ... as opposed to allowing bitmap-iova
!= iova.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|