Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"This contains the work to retrieve detailed information about mounts
via two new system calls. This is hopefully the beginning of the end
of the saga that started with fsinfo() years ago.
The LWN articles in [1] and [2] can serve as a summary so we can avoid
rehashing everything here.
At LSFMM in May 2022 we got into a room and agreed on what we want to
do about fsinfo(). Basically, split it into pieces. This is the first
part of that agreement. Specifically, it is concerned with retrieving
information about mounts. So this only concerns the mount information
retrieval, not the mount table change notification, or the extended
filesystem specific mount option work. That is separate work.
Currently mounts have a 32bit id. Mount ids are already in heavy use
by libmount and other low-level userspace but they can't be relied
upon because they're recycled very quickly. We agreed that mounts
should carry a unique 64bit id by which they can be referenced
directly. This is now implemented as part of this work.
The new 64bit mount id is exposed in statx() through the new
STATX_MNT_ID_UNIQUE flag. If the flag isn't raised the old mount id is
returned. If it is raised and the kernel supports the new 64bit mount
id the flag is raised in the result mask and the new 64bit mount id is
returned. New and old mount ids do not overlap so they cannot be
conflated.
Two new system calls are introduced that operate on the 64bit mount
id: statmount() and listmount(). A summary of the api and usage can be
found on LWN as well (cf. [3]) but of course, I'll provide a summary
here as well.
Both system calls rely on struct mnt_id_req. Which is the request
struct used to pass the 64bit mount id identifying the mount to
operate on. It is extensible to allow for the addition of new
parameters and for future use in other apis that make use of mount
ids.
statmount() mimicks the semantics of statx() and exposes a set flags
that userspace may raise in mnt_id_req to request specific information
to be retrieved. A statmount() call returns a struct statmount filled
in with information about the requested mount. Supported requests are
indicated by raising the request flag passed in struct mnt_id_req in
the @mask argument in struct statmount.
Currently we do support:
- STATMOUNT_SB_BASIC:
Basic filesystem info
- STATMOUNT_MNT_BASIC
Mount information (mount id, parent mount id, mount attributes etc)
- STATMOUNT_PROPAGATE_FROM
Propagation from what mount in current namespace
- STATMOUNT_MNT_ROOT
Path of the root of the mount (e.g., mount --bind /bla /mnt returns /bla)
- STATMOUNT_MNT_POINT
Path of the mount point (e.g., mount --bind /bla /mnt returns /mnt)
- STATMOUNT_FS_TYPE
Name of the filesystem type as the magic number isn't enough due to submounts
The string options STATMOUNT_MNT_{ROOT,POINT} and STATMOUNT_FS_TYPE
are appended to the end of the struct. Userspace can use the offsets
in @fs_type, @mnt_root, and @mnt_point to reference those strings
easily.
The struct statmount reserves quite a bit of space currently for
future extensibility. This isn't really a problem and if this bothers
us we can just send a follow-up pull request during this cycle.
listmount() is given a 64bit mount id via mnt_id_req just as
statmount(). It takes a buffer and a size to return an array of the
64bit ids of the child mounts of the requested mount. Userspace can
thus choose to either retrieve child mounts for a mount in batches or
iterate through the child mounts. For most use-cases it will be
sufficient to just leave space for a few child mounts. But for big
mount tables having an iterator is really helpful. Iterating through a
mount table works by setting @param in mnt_id_req to the mount id of
the last child mount retrieved in the previous listmount() call"
Link: https://lwn.net/Articles/934469 [1]
Link: https://lwn.net/Articles/829212 [2]
Link: https://lwn.net/Articles/950569 [3]
* tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
add selftest for statmount/listmount
fs: keep struct mnt_id_req extensible
wire up syscalls for statmount/listmount
add listmount(2) syscall
statmount: simplify string option retrieval
statmount: simplify numeric option retrieval
add statmount(2) syscall
namespace: extract show_path() helper
mounts: keep list of mounts in an rbtree
add unique mount ID
|
|
Common KVM changes for 6.8:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
|
|
CONFIG_HAVE_KVM is currently used by some architectures to either
enabled the KVM config proper, or to enable host-side code that is
not part of the KVM module. However, CONFIG_KVM's "select" statement
in virt/kvm/Kconfig corresponds to a third meaning, namely to
enable common Kconfigs required by all architectures that support
KVM.
These three meanings can be replaced respectively by an
architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by
a new Kconfig symbol that is in turn selected by the
architecture-specific "config KVM".
Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON.
Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by
architecture code, and it brings in all dependencies of common
KVM code. In particular, INTERVAL_TREE was missing in loongarch
and riscv, so that is another thing that is fixed.
Fixes: 8132d887a702 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08)
Reported-by: Randy Dunlap <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc mm fixes from Andrew Morton:
"12 hotfixes.
Two are cc:stable and the remainder either address post-6.7 issues or
aren't considered necessary for earlier kernel versions"
* tag 'mm-hotfixes-stable-2024-01-05-11-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info()
mailmap: add entries for Mathieu Othacehe
MAINTAINERS: change vmware.com addresses to broadcom.com
arch/mm/fault: fix major fault accounting when retrying under per-VMA lock
mm/mglru: skip special VMAs in lru_gen_look_around()
MAINTAINERS: hand over hwpoison maintainership to Miaohe Lin
MAINTAINERS: remove hugetlb maintainer Mike Kravetz
mm: fix unmap_mapping_range high bits shift bug
mm: memcg: fix split queue list crash when large folio migration
mm: fix arithmetic for max_prop_frac when setting max_ratio
mm: fix arithmetic for bdi min_ratio
mm: align larger anonymous mappings on THP boundaries
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
e009b2efb7a8 ("bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()")
0f2b21477988 ("bnxt_en: Fix compile error without CONFIG_RFS_ACCEL")
https://lore.kernel.org/all/[email protected]/
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.8
1. Optimization for memslot hugepage checking.
2. Cleanup and fix some HW/SW timer issues.
3. Add LSX/LASX (128bit/256bit SIMD) support.
|
|
Now that we are retiring the IPT action.
Reviewed-by: Victor Noguiera <[email protected]>
Reviewed-by: Pedro Tammela <[email protected]>
Signed-off-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
bcachefs's six locks need kvm_guest, via
ower_on_cpu() -> vcpu_is_preempted() -> is_kvm_guest()
Signed-off-by: Kent Overstreet <[email protected]>
Cc: [email protected]
Acked-by: Michael Ellerman <[email protected]> (powerpc)
|
|
A test [1] in Android test suite started failing after [2] was merged. It
turns out that after handling a major fault under per-VMA lock, the
process major fault counter does not register that fault as major. Before
[2] read faults would be done under mmap_lock, in which case
FAULT_FLAG_TRIED flag is set before retrying. That in turn causes
mm_account_fault() to account the fault as major once retry completes.
With per-VMA locks we often retry because a fault can't be handled without
locking the whole mm using mmap_lock. Therefore such retries do not set
FAULT_FLAG_TRIED flag. This logic does not work after [2] because we can
now handle read major faults under per-VMA lock and upon retry the fact
there was a major fault gets lost. Fix this by setting FAULT_FLAG_TRIED
after retrying under per-VMA lock if VM_FAULT_MAJOR was returned. Ideally
we would use an additional VM_FAULT bit to indicate the reason for the
retry (could not handle under per-VMA lock vs other reason) but this
simpler solution seems to work, so keeping it simple.
[1] https://cs.android.com/android/platform/superproject/+/master:test/vts-testcase/kernel/api/drop_caches_prop/drop_caches_test.cpp
[2] https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock")
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Merge our topic branch containing KVM related patches.
|
|
Commit 8c5fa3b5c4df ("powerpc/64: Make ELFv2 the default for big-endian
builds"), merged in Linux-6.5-rc1 changes the calling ABI in a way
that is incompatible with the current code for the PS3's LV1 hypervisor
calls.
This change just adds the line '# CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 is not set'
to the ps3_defconfig file so that the PPC64_ELF_ABI_V1 is used.
Fixes run time errors like these:
BUG: Kernel NULL pointer dereference at 0x00000000
Faulting instruction address: 0xc000000000047cf0
Oops: Kernel access of bad area, sig: 11 [#1]
Call Trace:
[c0000000023039e0] [c00000000100ebfc] ps3_create_spu+0xc4/0x2b0 (unreliable)
[c000000002303ab0] [c00000000100d4c4] create_spu+0xcc/0x3c4
[c000000002303b40] [c00000000100eae4] ps3_enumerate_spus+0xa4/0xf8
Fixes: 8c5fa3b5c4df ("powerpc/64: Make ELFv2 the default for big-endian builds")
Cc: [email protected] # v6.5+
Signed-off-by: Geoff Levand <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
The MPC8610 symbol used to be default y if MPC8610_HPCD, but since
MPC8610_HPCD was removed MPC8610 is now never used. Remove it.
Fixes: 248667f8bbde ("powerpc: drop HPCD/MPC8610 evaluation platform support")
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues
or are not considered backporting material"
* tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add an old address for Naoya Horiguchi
mm/memory-failure: cast index to loff_t before shifting it
mm/memory-failure: check the mapcount of the precise page
mm/memory-failure: pass the folio and the page to collect_procs()
selftests: secretmem: floor the memory size to the multiple of page_size
mm: migrate high-order folios in swap cache correctly
maple_tree: do not preallocate nodes for slot stores
mm/filemap: avoid buffered read/write race to read inconsistent data
kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset
kexec: select CRYPTO from KEXEC_FILE instead of depending on it
kexec: fix KEXEC_FILE dependencies
|
|
by moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big
sched.h dependency.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
We're trying to get sched.h down to more or less just types only, not
code - rseq can live in its own header.
This helps us kill the dependency on preempt.h in sched.h.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
In the opal_prd_range_is_valid function within opal-prd.c,
error handling was missing for the of_get_address call.
This patch adds necessary error checking, ensuring that the
function gracefully handles scenarios where of_get_address fails.
Signed-off-by: Haoran Liu <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
Reorder the newly introduced hcall opcodes for Nestedv2 to follow the
increasing-opcode-number convention followed in 'hvcall.h'.
Also updates the value for MAX_HCALL_OPCODE which is used in various
places in arch code for range checking. Notably in the KVM enabled-hcall
logic, and in hcall tracing.
Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests")
Suggested-by: Michael Ellerman <[email protected]>
Signed-off-by: Vaibhav Jain <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
The kernel thread function ps3_probe_thread() invokes the try_to_freeze()
in its loop. But all the kernel threads are non-freezable by default.
So if we want to make a kernel thread to be freezable, we have to invoke
set_freezable() explicitly.
Signed-off-by: Kevin Hao <[email protected]>
Acked-by: Geoff Levand <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
A freezable kernel thread can enter frozen state during freezing by
either calling try_to_freeze() or using wait_event_freezable() and its
variants. So for the following snippet of code in a kernel thread loop:
wait_event_interruptible();
try_to_freeze();
We can change it to a simple wait_event_freezable() and then eliminate
a function call.
Signed-off-by: Kevin Hao <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
The kernel thread function agent_thread_fn() invokes the try_to_freeze()
in its loop. But all the kernel threads are non-freezable by default.
So if we want to make a kernel thread to be freezable, we have to invoke
set_freezable() explicitly.
Signed-off-by: Kevin Hao <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
Then when specifying '-d' for kexec_file_load interface, loaded locations
of kernel/initrd/cmdline etc can be printed out to help debug.
Here replace pr_debug() with the newly added kexec_dprintk() in kexec_file
loading related codes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Conor Dooley <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The cleanup for the CONFIG_KEXEC Kconfig logic accidentally changed the
'depends on CRYPTO=y' dependency to a plain 'depends on CRYPTO', which
causes a link failure when all the crypto support is in a loadable module
and kexec_file support is built-in:
x86_64-linux-ld: vmlinux.o: in function `__x64_sys_kexec_file_load':
(.text+0x32e30a): undefined reference to `crypto_alloc_shash'
x86_64-linux-ld: (.text+0x32e58e): undefined reference to `crypto_shash_update'
x86_64-linux-ld: (.text+0x32e6ee): undefined reference to `crypto_shash_final'
Both s390 and x86 have this problem, while ppc64 and riscv have the
correct dependency already. On riscv, the dependency is only used for the
purgatory, not for the kexec_file code itself, which may be a bit
surprising as it means that with CONFIG_CRYPTO=m, it is possible to enable
KEXEC_FILE but then the purgatory code is silently left out.
Move this into the common Kconfig.kexec file in a way that is correct
everywhere, using the dependency on CRYPTO_SHA256=y only when the
purgatory code is available. This requires reversing the dependency
between ARCH_SUPPORTS_KEXEC_PURGATORY and KEXEC_FILE, but the effect
remains the same, other than making riscv behave like the other ones.
On s390, there is an additional dependency on CRYPTO_SHA256_S390, which
should technically not be required but gives better performance. Remove
this dependency here, noting that it was not present in the initial
Kconfig code but was brought in without an explanation in commit
71406883fd357 ("s390/kexec_file: Add kexec_file_load system call").
[[email protected]: fix riscv build]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6af5138083005 ("x86/kexec: refactor for kernel/Kconfig.kexec")
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Eric DeVolder <[email protected]>
Tested-by: Eric DeVolder <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Conor Dooley <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
fsl,tmu-calibration is defined as a u32 matrix in
Documentation/devicetree/bindings/thermal/qoriq-thermal.yaml.
Use matching property syntax. No functional changes.
Signed-off-by: David Heidelberg <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a bug where heavy VAS (accelerator) usage could race with
partition migration and prevent the migration from completing.
- Update MAINTAINERS to add Aneesh & Naveen.
Thanks to Haren Myneni.
* tag 'powerpc-6.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
MAINTAINERS: powerpc: Add Aneesh & Naveen
powerpc/pseries/vas: Migration suspend waits for no in-progress open windows
|
|
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Merge a branch containing SMP topology updates from Srikar, purely so we can
include the cover letter which has a lot of good detail here:
PowerVM systems configured in shared processors mode have some unique
challenges. Some device-tree properties will be missing on a shared
processor. Hence some sched domains may not make sense for shared processor
systems.
Most shared processor systems are over-provisioned. Underlying PowerVM
Hypervisor would schedule at a Big Core (SMT8) granularity. The most recent
power processors support two almost independent cores. In a lightly loaded
condition, it helps the overall system performance if we pack to lesser number
of Big Cores.
Since each thread-group is independent, running threads on both the
thread-groups of a SMT8 core, should have a minimal adverse impact in
non over provisioned scenarios. These changes in this patchset will not
affect in the over provisioned scenario. If there are more threads than
SMT domains, then asym_packing will not kick-in.
System Configuration
type=Shared mode=Uncapped smt=8 lcpu=96 mem=1066409344 kB cpus=96 ent=64.00
So *64 Entitled cores/ 96 Virtual processor* Scenario
lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 768
On-line CPU(s) list: 0-767
Model name: POWER10 (architected), altivec supported
Model: 2.0 (pvr 0080 0200)
Thread(s) per core: 8
Core(s) per socket: 16
Socket(s): 6
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 6 MiB (192 instances)
L1i cache: 9 MiB (192 instances)
NUMA node(s): 6
NUMA node0 CPU(s): 0-7,32-39,80-87,128-135,176-183,224-231,272-279,320-327,368-375,416-423,464-471,512-519,560-567,608-615,656-663,704-711,752-759
NUMA node1 CPU(s): 8-15,40-47,88-95,136-143,184-191,232-239,280-287,328-335,376-383,424-431,472-479,520-527,568-575,616-623,664-671,712-719,760-767
NUMA node4 CPU(s): 64-71,112-119,160-167,208-215,256-263,304-311,352-359,400-407,448-455,496-503,544-551,592-599,640-647,688-695,736-743
NUMA node5 CPU(s): 16-23,48-55,96-103,144-151,192-199,240-247,288-295,336-343,384-391,432-439,480-487,528-535,576-583,624-631,672-679,720-727
NUMA node6 CPU(s): 72-79,120-127,168-175,216-223,264-271,312-319,360-367,408-415,456-463,504-511,552-559,600-607,648-655,696-703,744-751
NUMA node7 CPU(s): 24-31,56-63,104-111,152-159,200-207,248-255,296-303,344-351,392-399,440-447,488-495,536-543,584-591,632-639,680-687,728-735
ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel N Min Max Median Avg Stddev %Change
6.6.0-rc3 5 3840178 4059268 3978042 3973936.6 84264.456
+patch 5 3768393 3927901 3874994 3854046 71532.926 -3.01692
>From lparstat (when the workload stabilized)
Kernel %user %sys %wait %idle physc %entc lbusy app vcsw phint
6.6.0-rc3 4.16 0.00 0.00 95.84 26.06 40.72 4.16 69.88 276906989 578
+patch 4.16 0.00 0.00 95.83 17.70 27.66 4.17 78.26 70436663 119
ebizzy -t 128 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel N Min Max Median Avg Stddev %Change
6.6.0-rc3 5 5520692 5981856 5717709 5727053.2 176093.2
+patch 5 5305888 6259610 5854590 5843311 375917.03 2.02998
>From lparstat (when the workload stabilized)
Kernel %user %sys %wait %idle physc %entc lbusy app vcsw phint
6.6.0-rc3 16.66 0.00 0.00 83.33 45.49 71.08 16.67 50.50 288778533 581
+patch 16.65 0.00 0.00 83.35 30.15 47.11 16.65 65.76 85196150 133
ebizzy -t 512 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel N Min Max Median Avg Stddev %Change
6.6.0-rc3 5 19563921 20049955 19701510 19728733 198295.18
+patch 5 19455992 20176445 19718427 19832017 304094.05 0.523521
>From lparstat (when the workload stabilized)
%Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint
66.6.0-rc3 6.44 0.01 0.00 33.55 94.14 147.09 66.45 1.33 313345175 621
6+patch 6.44 0.01 0.00 33.55 94.15 147.11 66.45 1.33 109193889 309
System Configuration
type=Shared mode=Uncapped smt=8 lcpu=40 mem=1067539392 kB cpus=96 ent=40.00
So *40 Entitled cores/ 40 Virtual processor* Scenario
lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 320
On-line CPU(s) list: 0-319
Model name: POWER10 (architected), altivec supported
Model: 2.0 (pvr 0080 0200)
Thread(s) per core: 8
Core(s) per socket: 10
Socket(s): 4
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 2.5 MiB (80 instances)
L1i cache: 3.8 MiB (80 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303
NUMA node4 CPU(s): 16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279,304-311
NUMA node5 CPU(s): 24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287,312-319
ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel N Min Max Median Avg Stddev %Change
6.6.0-rc3 5 3535518 3864532 3745967 3704233.2 130216.76
+patch 5 3608385 3708026 3649379 3651596.6 37862.163 -1.42099
%Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint
6.6.0-rc3 10.00 0.01 0.00 89.99 22.98 57.45 10.01 41.01 1135139 262
+patch 10.00 0.00 0.00 90.00 16.95 42.37 10.00 47.05 925561 19
ebizzy -t 64 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel N Min Max Median Avg Stddev %Change
6.6.0-rc3 5 4434984 4957281 4548786 4591298.2 211770.2
+patch 5 4461115 4835167 4544716 4607795.8 151474.85 0.359323
%Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint
6.6.0-rc3 20.01 0.00 0.00 79.99 38.22 95.55 20.01 25.77 1287553 265
+patch 19.99 0.00 0.00 80.01 25.55 63.88 19.99 38.44 1077341 20
ebizzy -t 256 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel N Min Max Median Avg Stddev %Change
6.6.0-rc3 5 8850648 8982659 8951911 8936869.2 52278.031
+patch 5 8751038 9060510 8981409 8942268.4 117070.6 0.0604149
%Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint
6.6.0-rc3 80.02 0.01 0.01 19.96 40.00 100.00 80.03 24.00 1597665 276
+patch 80.02 0.01 0.01 19.96 40.00 100.00 80.03 23.99 1383921 63
Observation:
We are able to see Improvement in ebizzy throughput even with lesser
core utilization (almost half the core utilization) in low utilization
scenarios while still retaining throughput in mid and higher utilization
scenarios.
Note: The numbers are with Uncapped + no-noise case. In the Capped and/or
noise case, due to contention on the Cores, the numbers are expected to
further improve.
Note: The numbers included (sched/fair: Enable group_asym_packing in find_idlest_group)
https://lore.kernel.org/all/[email protected]/
|
|
Currently there are four Powerpc specific sched topologies. These are
all statically defined. However not all these topologies are used by
all Powerpc systems.
To avoid unnecessary degenerations by the scheduler, masks and flags
are compared. However if the sched topologies are build dynamically then
the code is simpler and there are greater chances of avoiding
degenerations.
Note:
Even X86 builds its sched topologies dynamically and proposed changes
are very similar to the way X86 is building its topologies.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
PowerVM Hypervisor will schedule at a core granularity. However each
core can have more than one thread_groups. For better utilization in
case of a shared processor, its preferable for the scheduler to pack to
the lowest core. However there is no benefit of moving a thread between
two thread groups of the same core.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
There are some variables that are only updated at boot time.
So add __ro_after_init attribute to such variables
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
Like L2-cache info, coregroup information which is used to determine MC
sched domains is only present on dedicated LPARs. i.e PowerVM doesn't
export coregroup information for shared processor LPARs. Hence disable
creating MC domains on shared LPAR Systems.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
If there are shared processor LPARs, underlying Hypervisor can have more
virtual cores to handle than actual physical cores.
Starting with Power 9, a big core (aka SMT8 core) has 2 nearly
independent thread groups. On a shared processors LPARs, it helps to
pack threads to lesser number of cores so that the overall system
performance and utilization improves. PowerVM schedules at a big core
level. Hence packing to fewer cores helps.
Since each thread-group is independent, running threads on both the
thread-groups of a SMT8 core, should have a minimal adverse impact in
non over provisioned scenarios. These changes in this patchset will not
affect in the over provisioned scenario. If there are more threads than
SMT domains, then asym_packing will not kick-in
For example: Lets says there are two 8-core Shared LPARs that are
actually sharing a 8 Core shared physical pool, each running 8 threads
each. Then Consolidating 8 threads to 4 cores on each LPAR would help
them to perform better. This is because each of the LPAR will get
100% time to run applications and there will no switching required by
the Hypervisor.
To achieve this, enable SD_ASYM_PACKING flag at CACHE, MC and DIE level
when the system is running in shared processor mode and has big cores.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
No functional change in this patch. A helper is added to find if
vcpu is dispatched by hypervisor. Use that instead of opencoding.
Also clarify some of the comments.
Signed-off-by: "Aneesh Kumar K.V" <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
Wire up all archs.
Signed-off-by: Miklos Szeredi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Ian Kent <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
CPU features can be determined in makedumpfile, using
'cur_cpu_spec.cpu_features'.
This provides more data to makedumpfile about the crashed system, and
can help in filtering the vmcore accordingly.
Signed-off-by: Aditya Gupta <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure.
Fixes: 885dcd709ba9 ("powerpc/perf: Add nest IMC PMU support")
Signed-off-by: Kunwu Chan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure.
Fixes: b9ef7b4b867f ("powerpc: Convert to using %pOFn instead of device_node.name")
Signed-off-by: Kunwu Chan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure.
Fixes: 2717a33d6074 ("powerpc/opal-irqchip: Use interrupt names if present")
Signed-off-by: Kunwu Chan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure.
Add a null pointer check, and release 'ent' to avoid memory leaks.
Fixes: bfd2f0d49aef ("powerpc/powernv: Get rid of old scom_controller abstraction")
Signed-off-by: Kunwu Chan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure. Ensure the allocation was successful
by checking the pointer validity.
Suggested-by: Christophe Leroy <[email protected]>
Suggested-by: Michael Ellerman <[email protected]>
Signed-off-by: Kunwu Chan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
The hypervisor returns migration failure if all VAS windows are not
closed. During pre-migration stage, vas_migration_handler() sets
migration_in_progress flag and closes all windows from the list.
The allocate VAS window routine checks the migration flag, setup
the window and then add it to the list. So there is possibility of
the migration handler missing the window that is still in the
process of setup.
t1: Allocate and open VAS t2: Migration event
window
lock vas_pseries_mutex
If migration_in_progress set
unlock vas_pseries_mutex
return
open window HCALL
unlock vas_pseries_mutex
Modify window HCALL lock vas_pseries_mutex
setup window migration_in_progress=true
Closes all windows from the list
// May miss windows that are
// not in the list
unlock vas_pseries_mutex
lock vas_pseries_mutex return
if nr_closed_windows == 0
// No DLPAR CPU or migration
add window to the list
// Window will be added to the
// list after the setup is completed
unlock vas_pseries_mutex
return
unlock vas_pseries_mutex
Close VAS window
// due to DLPAR CPU or migration
return -EBUSY
This patch resolves the issue with the following steps:
- Set the migration_in_progress flag without holding mutex.
- Introduce nr_open_wins_progress counter in VAS capabilities
struct
- This counter tracks the number of open windows are still in
progress
- The allocate setup window thread closes windows if the migration
is set and decrements nr_open_window_progress counter
- The migration handler waits for no in-progress open windows.
The code flow with the fix is as follows:
t1: Allocate and open VAS t2: Migration event
window
lock vas_pseries_mutex
If migration_in_progress set
unlock vas_pseries_mutex
return
open window HCALL
nr_open_wins_progress++
// Window opened, but not
// added to the list yet
unlock vas_pseries_mutex
Modify window HCALL migration_in_progress=true
setup window lock vas_pseries_mutex
Closes all windows from the list
While nr_open_wins_progress {
unlock vas_pseries_mutex
lock vas_pseries_mutex sleep
if nr_closed_windows == 0 // Wait if any open window in
or migration is not started // progress. The open window
// No DLPAR CPU or migration // thread closes the window without
add window to the list // adding to the list and return if
nr_open_wins_progress-- // the migration is in progress.
unlock vas_pseries_mutex
return
Close VAS window
nr_open_wins_progress--
unlock vas_pseries_mutex
return -EBUSY lock vas_pseries_mutex
}
unlock vas_pseries_mutex
return
Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler")
Signed-off-by: Haren Myneni <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
|
|
Commit d49a0626216b95 ("arch: Introduce CONFIG_FUNCTION_ALIGNMENT")
introduced a generic function-alignment infrastructure. Move to using
FUNCTION_ALIGNMENT_4B on powerpc, to use the same alignment as that of
the existing _GLOBAL macro.
Signed-off-by: Sathvika Vasireddy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/21892186ec44abe24df0daf64f577dac0e78783f.1702045299.git.naveen@kernel.org
|
|
ftrace_stub is within the same CU, so there is no need for a subsequent
nop instruction.
Signed-off-by: Naveen N Rao <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/8ee5ec520e37d5523654bb2cd65a17512fb774e2.1702045299.git.naveen@kernel.org
|
|
Replace seven spaces with a tab character to fix an indentation issue
reported by the kernel test robot.
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Naveen N Rao <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/9f058227bd9243f0842786ef7228d87ab10d29f6.1702045299.git.naveen@kernel.org
|
|
Until now the papr_sysparm APIs have been kernel-internal. But user
space needs access to PAPR system parameters too. The only method
available to user space today to get or set system parameters is using
sys_rtas() and /dev/mem to pass RTAS-addressable buffers between user
space and firmware. This is incompatible with lockdown and should be
deprecated.
So provide an alternative ABI to user space in the form of a
/dev/papr-sysparm character device with just two ioctl commands (get
and set). The data payloads involved are small enough to fit in the
ioctl argument buffer, making the code relatively simple.
Exposing the system parameters through sysfs has been considered but
it would be too awkward:
* The kernel currently does not have to contain an exhaustive list of
defined system parameters. This is a convenient property to maintain
because we don't have to update the kernel whenever a new parameter
is added to PAPR. Exporting a named attribute in sysfs for each
parameter would negate this.
* Some system parameters are text-based and some are not.
* Retrieval of at least one system parameter requires input data,
which a simple read-oriented interface can't support.
Signed-off-by: Nathan Lynch <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-11-e9eafd0c8c6c@linux.ibm.com
|
|
The ability to get and set system parameters will be exposed to user
space, so let's get a little more strict about malformed
papr_sysparm_buf objects.
* Create accessors for the length field of struct papr_sysparm_buf.
The length is always stored in MSB order and this is better than
spreading the necessary conversions all over.
* Reject attempts to submit invalid buffers to RTAS.
* Warn if RTAS returns a buffer with an invalid length, clamping the
returned length to a safe value that won't overrun the buffer.
These are meant as precautionary measures to mitigate both firmware
and kernel bugs in this area, should they arise, but I am not aware of
any.
Signed-off-by: Nathan Lynch <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-10-e9eafd0c8c6c@linux.ibm.com
|
|
PowerVM LPARs may retrieve Vital Product Data (VPD) for system
components using the ibm,get-vpd RTAS function.
We can expose this to user space with a /dev/papr-vpd character
device, where the programming model is:
struct papr_location_code plc = { .str = "", }; /* obtain all VPD */
int devfd = open("/dev/papr-vpd", O_RDONLY);
int vpdfd = ioctl(devfd, PAPR_VPD_CREATE_HANDLE, &plc);
size_t size = lseek(vpdfd, 0, SEEK_END);
char *buf = malloc(size);
pread(devfd, buf, size, 0);
When a file descriptor is obtained from ioctl(PAPR_VPD_CREATE_HANDLE),
the file contains the result of a complete ibm,get-vpd sequence. The
file contents are immutable from the POV of user space. To get a new
view of the VPD, the client must create a new handle.
This design choice insulates user space from most of the complexities
that ibm,get-vpd brings:
* ibm,get-vpd must be called more than once to obtain complete
results.
* Only one ibm,get-vpd call sequence should be in progress at a time;
interleaved sequences will disrupt each other. Callers must have a
protocol for serializing their use of the function.
* A call sequence in progress may receive a "VPD changed, try again"
status, requiring the client to abandon the sequence and start
over.
The memory required for the VPD buffers seems acceptable, around 20KB
for all VPD on one of my systems. And the value of the
/rtas/ibm,vpd-size DT property (the estimated maximum size of VPD) is
consistently 300KB across various systems I've checked.
I've implemented support for this new ABI in the rtas_get_vpd()
function in librtas, which the vpdupdate command currently uses to
populate its VPD database. I've verified that an unmodified vpdupdate
binary generates an identical database when using a librtas.so that
prefers the new ABI.
Along with the papr-vpd.h header exposed to user space, this
introduces a common papr-miscdev.h uapi header to share a base ioctl
ID with similar drivers to come.
Tested-by: Michal Suchánek <[email protected]>
Signed-off-by: Nathan Lynch <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-9-e9eafd0c8c6c@linux.ibm.com
|
|
If the function descriptor has a populated lock member, then callers
are required to hold it across calls. Now that the firmware activation
sequence is appropriately guarded, we can warn when the requirement
isn't satisfied.
__do_enter_rtas_trace() gets reorganized a bit as a result of
performing the function descriptor lookup unconditionally now.
Reviewed-by: "Aneesh Kumar K.V (IBM)" <[email protected]>
Signed-off-by: Nathan Lynch <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-8-e9eafd0c8c6c@linux.ibm.com
|
|
Use rtas_ibm_activate_firmware_lock to prevent interleaving call
sequences of the ibm,activate-firmware RTAS function, which typically
requires multiple calls to complete the update. While the spec does
not specifically prohibit interleaved sequences, there's almost
certainly no advantage to allowing them.
Reviewed-by: "Aneesh Kumar K.V (IBM)" <[email protected]>
Signed-off-by: Nathan Lynch <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-7-e9eafd0c8c6c@linux.ibm.com
|
|
On RTAS platforms there is a general restriction that the OS must not
enter RTAS on more than one CPU at a time. This low-level
serialization requirement is satisfied by holding a spin
lock (rtas_lock) across most RTAS function invocations.
However, some pseries RTAS functions require multiple successive calls
to complete a logical operation. Beginning a new call sequence for such a
function may disrupt any other sequences of that function already in
progress. Safe and reliable use of these functions effectively
requires higher-level serialization beyond what is already done at the
level of RTAS entry and exit.
Where a sequence-based RTAS function is invoked only through
sys_rtas(), with no in-kernel users, there is no issue as far as the
kernel is concerned. User space is responsible for appropriately
serializing its call sequences. (Whether user space code actually
takes measures to prevent sequence interleaving is another matter.)
Examples of such functions currently include ibm,platform-dump and
ibm,get-vpd.
But where a sequence-based RTAS function has both user space and
in-kernel uesrs, there is a hazard. Even if the in-kernel call sites
of such a function serialize their sequences correctly, a user of
sys_rtas() can invoke the same function at any time, potentially
disrupting a sequence in progress.
So in order to prevent disruption of kernel-based RTAS call sequences,
they must serialize not only with themselves but also with sys_rtas()
users, somehow. Preferably without adding more function-specific hacks
to sys_rtas(). This is a prerequisite for adding an in-kernel call
sequence of ibm,get-vpd, which is in a change to follow.
Note that it has never been feasible for the kernel to prevent
sys_rtas()-based sequences from being disrupted because control
returns to user space on every call. sys_rtas()-based users of these
functions have always been, and continue to be, responsible for
coordinating their call sequences with other users, even those which
may invoke the RTAS functions through less direct means than
sys_rtas(). This is an unavoidable consequence of exposing
sequence-based RTAS functions through sys_rtas().
* Add an optional mutex member to struct rtas_function.
* Statically define a mutex for each RTAS function with known call
sequence serialization requirements, and assign its address to the
.lock member of the corresponding function table entry, along with
justifying commentary.
* In sys_rtas(), if the table entry for the RTAS function being
called has a populated lock member, acquire it before taking
rtas_lock and entering RTAS.
* Kernel-based RTAS call sequences are expected to access the
appropriate mutex explicitly by name. For example, a user of the
ibm,activate-firmware RTAS function would do:
int token = rtas_function_token(RTAS_FN_IBM_ACTIVATE_FIRMWARE);
int fwrc;
mutex_lock(&rtas_ibm_activate_firmware_lock);
do {
fwrc = rtas_call(token, 0, 1, NULL);
} while (rtas_busy_delay(fwrc));
mutex_unlock(&rtas_ibm_activate_firmware_lock);
There should be no perceivable change introduced here except that
concurrent callers of the same RTAS function via sys_rtas() may block
on a mutex instead of spinning on rtas_lock.
Signed-off-by: Nathan Lynch <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/20231212-papr-sys_rtas-vs-lockdown-v6-6-e9eafd0c8c6c@linux.ibm.com
|