aboutsummaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)AuthorFilesLines
2024-01-10Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefsLinus Torvalds2-0/+2
Pull header cleanups from Kent Overstreet: "The goal is to get sched.h down to a type only header, so the main thing happening in this patchset is splitting out various _types.h headers and dependency fixups, as well as moving some things out of sched.h to better locations. This is prep work for the memory allocation profiling patchset which adds new sched.h interdepencencies" * tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits) Kill sched.h dependency on rcupdate.h kill unnecessary thread_info.h include Kill unnecessary kernel.h include preempt.h: Kill dependency on list.h rseq: Split out rseq.h from sched.h LoongArch: signal.c: add header file to fix build error restart_block: Trim includes lockdep: move held_lock to lockdep_types.h sem: Split out sem_types.h uidgid: Split out uidgid_types.h seccomp: Split out seccomp_types.h refcount: Split out refcount_types.h uapi/linux/resource.h: fix include x86/signal: kill dependency on time.h syscall_user_dispatch.h: split out *_types.h mm_types_task.h: Trim dependencies Split out irqflags_types.h ipc: Kill bogus dependency on spinlock.h shm: Slim down dependencies workqueue: Split out workqueue_types.h ...
2024-01-10Merge tag 'bcachefs-2024-01-10' of https://evilpiepirate.org/git/bcachefsLinus Torvalds1-0/+2
Pull bcachefs updates from Kent Overstreet: - btree write buffer rewrite: instead of adding keys to the btree write buffer at transaction commit time, we now journal them with a different journal entry type and copy them from the journal to the write buffer just prior to journal write. This reduces the number of atomic operations on shared cachelines in the transaction commit path and is a signicant performance improvement on some workloads: multithreaded 4k random writes went from ~650k iops to ~850k iops. - Bring back optimistic spinning for six locks: the new implementation doesn't use osq locks; instead we add to the lock waitlist as normal, and then spin on the lock_acquired bit in the waitlist entry, _not_ the lock itself. - New ioctls: - BCH_IOCTL_DEV_USAGE_V2, which allows for new data types - BCH_IOCTL_OFFLINE_FSCK, which runs the kernel implementation of fsck but without mounting: useful for transparently using the kernel version of fsck from 'bcachefs fsck' when the kernel version is a better match for the on disk filesystem. - BCH_IOCTL_ONLINE_FSCK: online fsck. Not all passes are supported yet, but the passes that are supported are fully featured - errors may be corrected as normal. The new ioctls use the new 'thread_with_file' abstraction for kicking off a kthread that's tied to a file descriptor returned to userspace via the ioctl. - btree_paths within a btree_trans are now dynamically growable, instead of being limited to 64. This is important for the check_directory_structure phase of fsck, and also fixes some issues we were having with btree path overflow in the reflink btree. - Trigger refactoring; prep work for the upcoming disk space accounting rewrite - Numerous bugfixes :) * tag 'bcachefs-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (226 commits) bcachefs: eytzinger0_find() search should be const bcachefs: move "ptrs not changing" optimization to bch2_trigger_extent() bcachefs: fix simulateously upgrading & downgrading bcachefs: Restart recovery passes more reliably bcachefs: bch2_dump_bset() doesn't choke on u64s == 0 bcachefs: improve checksum error messages bcachefs: improve validate_bset_keys() bcachefs: print sb magic when relevant bcachefs: __bch2_sb_field_to_text() bcachefs: %pg is banished bcachefs: Improve would_deadlock trace event bcachefs: fsck_err()s don't need to manually check c->sb.version anymore bcachefs: Upgrades now specify errors to fix, like downgrades bcachefs: no thread_with_file in userspace bcachefs: Don't autofix errors we can't fix bcachefs: add missing bch2_latency_acct() call bcachefs: increase max_active on io_complete_wq bcachefs: add time_stats for btree_node_read_done() bcachefs: don't clear accessed bit in btree node fill bcachefs: Add an option to control btree node prefetching ...
2024-01-10Merge tag 'v6.8-p1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add incremental lskcipher/skcipher processing Algorithms: - Remove SHA1 from drbg - Remove CFB and OFB Drivers: - Add comp high perf mode configuration in hisilicon/zip - Add support for 420xx devices in qat - Add IAA Compression Accelerator driver" * tag 'v6.8-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (172 commits) crypto: iaa - Account for cpu-less numa nodes crypto: scomp - fix req->dst buffer overflow crypto: sahara - add support for crypto_engine crypto: sahara - remove error message for bad aes request size crypto: sahara - remove unnecessary NULL assignments crypto: sahara - remove 'active' flag from sahara_aes_reqctx struct crypto: sahara - use dev_err_probe() crypto: sahara - use devm_clk_get_enabled() crypto: sahara - use BIT() macro crypto: sahara - clean up macro indentation crypto: sahara - do not resize req->src when doing hash operations crypto: sahara - fix processing hash requests with req->nbytes < sg->length crypto: sahara - improve error handling in sahara_sha_process() crypto: sahara - fix wait_for_completion_timeout() error handling crypto: sahara - fix ahash reqsize crypto: sahara - handle zero-length aes requests crypto: skcipher - remove excess kerneldoc members crypto: shash - remove excess kerneldoc members crypto: qat - generate dynamically arbiter mappings crypto: qat - add support for ring pair level telemetry ...
2024-01-09Merge tag 'lsm-pr-20240105' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull security module updates from Paul Moore: - Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and lsm_set_self_attr(). The first syscall simply lists the LSMs enabled, while the second and third get and set the current process' LSM attributes. Yes, these syscalls may provide similar functionality to what can be found under /proc or /sys, but they were designed to support multiple, simultaneaous (stacked) LSMs from the start as opposed to the current /proc based solutions which were created at a time when only one LSM was allowed to be active at a given time. We have spent considerable time discussing ways to extend the existing /proc interfaces to support multiple, simultaneaous LSMs and even our best ideas have been far too ugly to support as a kernel API; after +20 years in the kernel, I felt the LSM layer had established itself enough to justify a handful of syscalls. Support amongst the individual LSM developers has been nearly unanimous, with a single objection coming from Tetsuo (TOMOYO) as he is worried that the LSM_ID_XXX token concept will make it more difficult for out-of-tree LSMs to survive. Several members of the LSM community have demonstrated the ability for out-of-tree LSMs to continue to exist by picking high/unused LSM_ID values as well as pointing out that many kernel APIs rely on integer identifiers, e.g. syscalls (!), but unfortunately Tetsuo's objections remain. My personal opinion is that while I have no interest in penalizing out-of-tree LSMs, I'm not going to penalize in-tree development to support out-of-tree development, and I view this as a necessary step forward to support the push for expanded LSM stacking and reduce our reliance on /proc and /sys which has occassionally been problematic for some container users. Finally, we have included the linux-api folks on (all?) recent revisions of the patchset and addressed all of their concerns. - Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit ioctls on 64-bit systems problem. This patch includes support for all of the existing LSMs which provide ioctl hooks, although it turns out only SELinux actually cares about the individual ioctls. It is worth noting that while Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this patch, they did both indicate they are okay with the changes. - Fix a potential memory leak in the CALIPSO code when IPv6 is disabled at boot. While it's good that we are fixing this, I doubt this is something users are seeing in the wild as you need to both disable IPv6 and then attempt to configure IPv6 labeled networking via NetLabel/CALIPSO; that just doesn't make much sense. Normally this would go through netdev, but Jakub asked me to take this patch and of all the trees I maintain, the LSM tree seemed like the best fit. - Update the LSM MAINTAINERS entry with additional information about our process docs, patchwork, bug reporting, etc. I also noticed that the Lockdown LSM is missing a dedicated MAINTAINERS entry so I've added that to the pull request. I've been working with one of the major Lockdown authors/contributors to see if they are willing to step up and assume a Lockdown maintainer role; hopefully that will happen soon, but in the meantime I'll continue to look after it. - Add a handful of mailmap entries for Serge Hallyn and myself. * tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits) lsm: new security_file_ioctl_compat() hook lsm: Add a __counted_by() annotation to lsm_ctx.ctx calipso: fix memory leak in netlbl_calipso_add_pass() selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test MAINTAINERS: add an entry for the lockdown LSM MAINTAINERS: update the LSM entry mailmap: add entries for Serge Hallyn's dead accounts mailmap: update/replace my old email addresses lsm: mark the lsm_id variables are marked as static lsm: convert security_setselfattr() to use memdup_user() lsm: align based on pointer length in lsm_fill_user_ctx() lsm: consolidate buffer size handling into lsm_fill_user_ctx() lsm: correct error codes in security_getselfattr() lsm: cleanup the size counters in security_getselfattr() lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation lsm: drop LSM_ID_IMA LSM: selftests for Linux Security Module syscalls SELinux: Add selfattr hooks AppArmor: Add selfattr hooks Smack: implement setselfattr and getselfattr hooks ...
2024-01-09Merge tag 'mm-nonmm-stable-2024-01-09-10-33' of ↵Linus Torvalds3-15/+14
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Quite a lot of kexec work this time around. Many singleton patches in many places. The notable patch series are: - nilfs2 folio conversion from Matthew Wilcox in 'nilfs2: Folio conversions for file paths'. - Additional nilfs2 folio conversion from Ryusuke Konishi in 'nilfs2: Folio conversions for directory paths'. - IA64 remnant removal in Heiko Carstens's 'Remove unused code after IA-64 removal'. - Arnd Bergmann has enabled the -Wmissing-prototypes warning everywhere in 'Treewide: enable -Wmissing-prototypes'. This had some followup fixes: - Nathan Chancellor has cleaned up the hexagon build in the series 'hexagon: Fix up instances of -Wmissing-prototypes'. - Nathan also addressed some s390 warnings in 's390: A couple of fixes for -Wmissing-prototypes'. - Arnd Bergmann addresses the same warnings for MIPS in his series 'mips: address -Wmissing-prototypes warnings'. - Baoquan He has made kexec_file operate in a top-down-fitting manner similar to kexec_load in the series 'kexec_file: Load kernel at top of system RAM if required' - Baoquan He has also added the self-explanatory 'kexec_file: print out debugging message if required'. - Some checkstack maintenance work from Tiezhu Yang in the series 'Modify some code about checkstack'. - Douglas Anderson has disentangled the watchdog code's logging when multiple reports are occurring simultaneously. The series is 'watchdog: Better handling of concurrent lockups'. - Yuntao Wang has contributed some maintenance work on the crash code in 'crash: Some cleanups and fixes'" * tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (157 commits) crash_core: fix and simplify the logic of crash_exclude_mem_range() x86/crash: use SZ_1M macro instead of hardcoded value x86/crash: remove the unused image parameter from prepare_elf_headers() kdump: remove redundant DEFAULT_CRASH_KERNEL_LOW_SIZE scripts/decode_stacktrace.sh: strip unexpected CR from lines watchdog: if panicking and we dumped everything, don't re-enable dumping watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting watchdog/hardlockup: adopt softlockup logic avoiding double-dumps kexec_core: fix the assignment to kimage->control_page x86/kexec: fix incorrect end address passed to kernel_ident_mapping_init() lib/trace_readwrite.c:: replace asm-generic/io with linux/io nilfs2: cpfile: fix some kernel-doc warnings stacktrace: fix kernel-doc typo scripts/checkstack.pl: fix no space expression between sp and offset x86/kexec: fix incorrect argument passed to kexec_dprintk() x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs nilfs2: add missing set_freezable() for freezable kthread kernel: relay: remove relay_file_splice_read dead code, doesn't work docs: submit-checklist: remove all of "make namespacecheck" ...
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds4-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-08Merge tag 'perf-core-2024-01-08' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: - Add branch stack counters ABI extension to better capture the growing amount of information the PMU exposes via branch stack sampling. There's matching tooling support. - Fix race when creating the nr_addr_filters sysfs file - Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support - Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU support - Misc cleanups & fixes * tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Factor out topology_gidnid_map() perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology() perf/x86/amd: Reject branch stack for IBS events perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge perf/x86/intel/uncore: Support IIO free-running counters on GNR perf/x86/intel/uncore: Support Granite Rapids perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR perf: Fix the nr_addr_filters fix perf/x86/intel/cstate: Add Grand Ridge support perf/x86/intel/cstate: Add Sierra Forest support x86/smp: Export symbol cpu_clustergroup_mask() perf/x86/intel/cstate: Cleanup duplicate attr_groups perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file perf/x86/intel: Support branch counters logging perf/x86/intel: Reorganize attrs and is_visible perf: Add branch_sample_call_stack perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag perf: Add branch stack counters
2024-01-08Merge tag 'powerpc-6.8-1' of ↵Linus Torvalds81-552/+1507
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add initial support to recognise the HeXin C2000 processor. - Add papr-vpd and papr-sysparm character device drivers for VPD & sysparm retrieval, so userspace tools can be adapted to avoid doing raw firmware calls from userspace. - Sched domains optimisations for shared processor partitions on P9/P10. - A series of optimisations for KVM running as a nested HV under PowerVM. - Other small features and fixes. Thanks to Aditya Gupta, Aneesh Kumar K.V, Arnd Bergmann, Christophe Leroy, Colin Ian King, Dario Binacchi, David Heidelberg, Geoff Levand, Gustavo A. R. Silva, Haoran Liu, Jordan Niethe, Kajol Jain, Kevin Hao, Kunwu Chan, Li kunyu, Li zeming, Masahiro Yamada, Michal Suchánek, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Randy Dunlap, Sathvika Vasireddy, Srikar Dronamraju, Stephen Rothwell, Vaibhav Jain, and Zhao Ke. * tag 'powerpc-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (96 commits) powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2 powerpc/86xx: Drop unused CONFIG_MPC8610 powerpc/powernv: Add error handling to opal_prd_range_is_valid selftests/powerpc: Fix spelling mistake "EACCESS" -> "EACCES" powerpc/hvcall: Reorder Nestedv2 hcall opcodes powerpc/ps3: Add missing set_freezable() for ps3_probe_thread() powerpc/mpc83xx: Use wait_event_freezable() for freezable kthread powerpc/mpc83xx: Add the missing set_freezable() for agent_thread_fn() powerpc/fsl: Fix fsl,tmu-calibration to match the schema powerpc/smp: Dynamically build Powerpc topology powerpc/smp: Avoid asym packing within thread_group of a core powerpc/smp: Add __ro_after_init attribute powerpc/smp: Disable MC domain for shared processor powerpc/smp: Enable Asym packing for cores on shared processor powerpc/sched: Cleanup vcpu_is_preempted() powerpc: add cpu_spec.cpu_features to vmcoreinfo powerpc/imc-pmu: Add a null pointer check in update_events_in_group() powerpc/powernv: Add a null pointer check in opal_powercap_init() powerpc/powernv: Add a null pointer check in opal_event_init() powerpc/powernv: Add a null pointer check to scom_debug_init_one() ...
2024-01-08mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDERKirill A. Shutemov4-4/+4
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-01-08Merge tag 'vfs-6.8.mount' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains the work to retrieve detailed information about mounts via two new system calls. This is hopefully the beginning of the end of the saga that started with fsinfo() years ago. The LWN articles in [1] and [2] can serve as a summary so we can avoid rehashing everything here. At LSFMM in May 2022 we got into a room and agreed on what we want to do about fsinfo(). Basically, split it into pieces. This is the first part of that agreement. Specifically, it is concerned with retrieving information about mounts. So this only concerns the mount information retrieval, not the mount table change notification, or the extended filesystem specific mount option work. That is separate work. Currently mounts have a 32bit id. Mount ids are already in heavy use by libmount and other low-level userspace but they can't be relied upon because they're recycled very quickly. We agreed that mounts should carry a unique 64bit id by which they can be referenced directly. This is now implemented as part of this work. The new 64bit mount id is exposed in statx() through the new STATX_MNT_ID_UNIQUE flag. If the flag isn't raised the old mount id is returned. If it is raised and the kernel supports the new 64bit mount id the flag is raised in the result mask and the new 64bit mount id is returned. New and old mount ids do not overlap so they cannot be conflated. Two new system calls are introduced that operate on the 64bit mount id: statmount() and listmount(). A summary of the api and usage can be found on LWN as well (cf. [3]) but of course, I'll provide a summary here as well. Both system calls rely on struct mnt_id_req. Which is the request struct used to pass the 64bit mount id identifying the mount to operate on. It is extensible to allow for the addition of new parameters and for future use in other apis that make use of mount ids. statmount() mimicks the semantics of statx() and exposes a set flags that userspace may raise in mnt_id_req to request specific information to be retrieved. A statmount() call returns a struct statmount filled in with information about the requested mount. Supported requests are indicated by raising the request flag passed in struct mnt_id_req in the @mask argument in struct statmount. Currently we do support: - STATMOUNT_SB_BASIC: Basic filesystem info - STATMOUNT_MNT_BASIC Mount information (mount id, parent mount id, mount attributes etc) - STATMOUNT_PROPAGATE_FROM Propagation from what mount in current namespace - STATMOUNT_MNT_ROOT Path of the root of the mount (e.g., mount --bind /bla /mnt returns /bla) - STATMOUNT_MNT_POINT Path of the mount point (e.g., mount --bind /bla /mnt returns /mnt) - STATMOUNT_FS_TYPE Name of the filesystem type as the magic number isn't enough due to submounts The string options STATMOUNT_MNT_{ROOT,POINT} and STATMOUNT_FS_TYPE are appended to the end of the struct. Userspace can use the offsets in @fs_type, @mnt_root, and @mnt_point to reference those strings easily. The struct statmount reserves quite a bit of space currently for future extensibility. This isn't really a problem and if this bothers us we can just send a follow-up pull request during this cycle. listmount() is given a 64bit mount id via mnt_id_req just as statmount(). It takes a buffer and a size to return an array of the 64bit ids of the child mounts of the requested mount. Userspace can thus choose to either retrieve child mounts for a mount in batches or iterate through the child mounts. For most use-cases it will be sufficient to just leave space for a few child mounts. But for big mount tables having an iterator is really helpful. Iterating through a mount table works by setting @param in mnt_id_req to the mount id of the last child mount retrieved in the previous listmount() call" Link: https://lwn.net/Articles/934469 [1] Link: https://lwn.net/Articles/829212 [2] Link: https://lwn.net/Articles/950569 [3] * tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: add selftest for statmount/listmount fs: keep struct mnt_id_req extensible wire up syscalls for statmount/listmount add listmount(2) syscall statmount: simplify string option retrieval statmount: simplify numeric option retrieval add statmount(2) syscall namespace: extract show_path() helper mounts: keep list of mounts in an rbtree add unique mount ID
2024-01-08Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-1/+0
Common KVM changes for 6.8: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
2024-01-08KVM: introduce CONFIG_KVM_COMMONPaolo Bonzini1-2/+1
CONFIG_HAVE_KVM is currently used by some architectures to either enabled the KVM config proper, or to enable host-side code that is not part of the KVM module. However, CONFIG_KVM's "select" statement in virt/kvm/Kconfig corresponds to a third meaning, namely to enable common Kconfigs required by all architectures that support KVM. These three meanings can be replaced respectively by an architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by a new Kconfig symbol that is in turn selected by the architecture-specific "config KVM". Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON. Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by architecture code, and it brings in all dependencies of common KVM code. In particular, INTERVAL_TREE was missing in loongarch and riscv, so that is another thing that is fixed. Fixes: 8132d887a702 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08) Reported-by: Randy Dunlap <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-01-05Merge tag 'mm-hotfixes-stable-2024-01-05-11-35' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc mm fixes from Andrew Morton: "12 hotfixes. Two are cc:stable and the remainder either address post-6.7 issues or aren't considered necessary for earlier kernel versions" * tag 'mm-hotfixes-stable-2024-01-05-11-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info() mailmap: add entries for Mathieu Othacehe MAINTAINERS: change vmware.com addresses to broadcom.com arch/mm/fault: fix major fault accounting when retrying under per-VMA lock mm/mglru: skip special VMAs in lru_gen_look_around() MAINTAINERS: hand over hwpoison maintainership to Miaohe Lin MAINTAINERS: remove hugetlb maintainer Mike Kravetz mm: fix unmap_mapping_range high bits shift bug mm: memcg: fix split queue list crash when large folio migration mm: fix arithmetic for max_prop_frac when setting max_ratio mm: fix arithmetic for bdi min_ratio mm: align larger anonymous mappings on THP boundaries
2024-01-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+2
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c e009b2efb7a8 ("bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()") 0f2b21477988 ("bnxt_en: Fix compile error without CONFIG_RFS_ACCEL") https://lore.kernel.org/all/[email protected]/ Signed-off-by: Jakub Kicinski <[email protected]>
2024-01-02Merge tag 'loongarch-kvm-6.8' of ↵Paolo Bonzini7-13/+66
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support.
2024-01-02net/sched: Remove CONFIG_NET_ACT_IPT from default configsJamal Hadi Salim1-1/+0
Now that we are retiring the IPT action. Reviewed-by: Victor Noguiera <[email protected]> Reviewed-by: Pedro Tammela <[email protected]> Signed-off-by: Jamal Hadi Salim <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2024-01-01powerpc: Export kvm_guest static key, for bcachefs six locksKent Overstreet1-0/+2
bcachefs's six locks need kvm_guest, via ower_on_cpu() -> vcpu_is_preempted() -> is_kvm_guest() Signed-off-by: Kent Overstreet <[email protected]> Cc: [email protected] Acked-by: Michael Ellerman <[email protected]> (powerpc)
2023-12-29arch/mm/fault: fix major fault accounting when retrying under per-VMA lockSuren Baghdasaryan1-0/+2
A test [1] in Android test suite started failing after [2] was merged. It turns out that after handling a major fault under per-VMA lock, the process major fault counter does not register that fault as major. Before [2] read faults would be done under mmap_lock, in which case FAULT_FLAG_TRIED flag is set before retrying. That in turn causes mm_account_fault() to account the fault as major once retry completes. With per-VMA locks we often retry because a fault can't be handled without locking the whole mm using mmap_lock. Therefore such retries do not set FAULT_FLAG_TRIED flag. This logic does not work after [2] because we can now handle read major faults under per-VMA lock and upon retry the fact there was a major fault gets lost. Fix this by setting FAULT_FLAG_TRIED after retrying under per-VMA lock if VM_FAULT_MAJOR was returned. Ideally we would use an additional VM_FAULT bit to indicate the reason for the retry (could not handle under per-VMA lock vs other reason) but this simpler solution seems to work, so keeping it simple. [1] https://cs.android.com/android/platform/superproject/+/master:test/vts-testcase/kernel/api/drop_caches_prop/drop_caches_test.cpp [2] https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock") Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29Merge branch 'topic/ppc-kvm' into nextMichael Ellerman8-39/+107
Merge our topic branch containing KVM related patches.
2023-12-29powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2Geoff Levand1-0/+1
Commit 8c5fa3b5c4df ("powerpc/64: Make ELFv2 the default for big-endian builds"), merged in Linux-6.5-rc1 changes the calling ABI in a way that is incompatible with the current code for the PS3's LV1 hypervisor calls. This change just adds the line '# CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 is not set' to the ps3_defconfig file so that the PPC64_ELF_ABI_V1 is used. Fixes run time errors like these: BUG: Kernel NULL pointer dereference at 0x00000000 Faulting instruction address: 0xc000000000047cf0 Oops: Kernel access of bad area, sig: 11 [#1] Call Trace: [c0000000023039e0] [c00000000100ebfc] ps3_create_spu+0xc4/0x2b0 (unreliable) [c000000002303ab0] [c00000000100d4c4] create_spu+0xcc/0x3c4 [c000000002303b40] [c00000000100eae4] ps3_enumerate_spus+0xa4/0xf8 Fixes: 8c5fa3b5c4df ("powerpc/64: Make ELFv2 the default for big-endian builds") Cc: [email protected] # v6.5+ Signed-off-by: Geoff Levand <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-29powerpc/86xx: Drop unused CONFIG_MPC8610Michael Ellerman1-7/+0
The MPC8610 symbol used to be default y if MPC8610_HPCD, but since MPC8610_HPCD was removed MPC8610 is now never used. Remove it. Fixes: 248667f8bbde ("powerpc: drop HPCD/MPC8610 evaluation platform support") Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-27Merge tag 'mm-hotfixes-stable-2023-12-27-15-00' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues or are not considered backporting material" * tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add an old address for Naoya Horiguchi mm/memory-failure: cast index to loff_t before shifting it mm/memory-failure: check the mapcount of the precise page mm/memory-failure: pass the folio and the page to collect_procs() selftests: secretmem: floor the memory size to the multiple of page_size mm: migrate high-order folios in swap cache correctly maple_tree: do not preallocate nodes for slot stores mm/filemap: avoid buffered read/write race to read inconsistent data kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset kexec: select CRYPTO from KEXEC_FILE instead of depending on it kexec: fix KEXEC_FILE dependencies
2023-12-27Kill sched.h dependency on rcupdate.hKent Overstreet1-0/+1
by moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big sched.h dependency. Signed-off-by: Kent Overstreet <[email protected]>
2023-12-27rseq: Split out rseq.h from sched.hKent Overstreet1-0/+1
We're trying to get sched.h down to more or less just types only, not code - rseq can live in its own header. This helps us kill the dependency on preempt.h in sched.h. Signed-off-by: Kent Overstreet <[email protected]>
2023-12-21powerpc/powernv: Add error handling to opal_prd_range_is_validHaoran Liu1-0/+2
In the opal_prd_range_is_valid function within opal-prd.c, error handling was missing for the of_get_address call. This patch adds necessary error checking, ensuring that the function gracefully handles scenarios where of_get_address fails. Signed-off-by: Haoran Liu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-21powerpc/hvcall: Reorder Nestedv2 hcall opcodesVaibhav Jain1-10/+10
Reorder the newly introduced hcall opcodes for Nestedv2 to follow the increasing-opcode-number convention followed in 'hvcall.h'. Also updates the value for MAX_HCALL_OPCODE which is used in various places in arch code for range checking. Notably in the KVM enabled-hcall logic, and in hcall tracing. Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests") Suggested-by: Michael Ellerman <[email protected]> Signed-off-by: Vaibhav Jain <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-21powerpc/ps3: Add missing set_freezable() for ps3_probe_thread()Kevin Hao1-0/+1
The kernel thread function ps3_probe_thread() invokes the try_to_freeze() in its loop. But all the kernel threads are non-freezable by default. So if we want to make a kernel thread to be freezable, we have to invoke set_freezable() explicitly. Signed-off-by: Kevin Hao <[email protected]> Acked-by: Geoff Levand <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-21powerpc/mpc83xx: Use wait_event_freezable() for freezable kthreadKevin Hao1-2/+1
A freezable kernel thread can enter frozen state during freezing by either calling try_to_freeze() or using wait_event_freezable() and its variants. So for the following snippet of code in a kernel thread loop: wait_event_interruptible(); try_to_freeze(); We can change it to a simple wait_event_freezable() and then eliminate a function call. Signed-off-by: Kevin Hao <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-21powerpc/mpc83xx: Add the missing set_freezable() for agent_thread_fn()Kevin Hao1-0/+2
The kernel thread function agent_thread_fn() invokes the try_to_freeze() in its loop. But all the kernel threads are non-freezable by default. So if we want to make a kernel thread to be freezable, we have to invoke set_freezable() explicitly. Signed-off-by: Kevin Hao <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-20kexec_file, power: print out debugging message if requiredBaoquan He2-13/+13
Then when specifying '-d' for kexec_file_load interface, loaded locations of kernel/initrd/cmdline etc can be printed out to help debug. Here replace pr_debug() with the newly added kexec_dprintk() in kexec_file loading related codes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Cc: Conor Dooley <[email protected]> Cc: Joe Perches <[email protected]> Cc: Nathan Chancellor <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-20kexec: fix KEXEC_FILE dependenciesArnd Bergmann1-2/+2
The cleanup for the CONFIG_KEXEC Kconfig logic accidentally changed the 'depends on CRYPTO=y' dependency to a plain 'depends on CRYPTO', which causes a link failure when all the crypto support is in a loadable module and kexec_file support is built-in: x86_64-linux-ld: vmlinux.o: in function `__x64_sys_kexec_file_load': (.text+0x32e30a): undefined reference to `crypto_alloc_shash' x86_64-linux-ld: (.text+0x32e58e): undefined reference to `crypto_shash_update' x86_64-linux-ld: (.text+0x32e6ee): undefined reference to `crypto_shash_final' Both s390 and x86 have this problem, while ppc64 and riscv have the correct dependency already. On riscv, the dependency is only used for the purgatory, not for the kexec_file code itself, which may be a bit surprising as it means that with CONFIG_CRYPTO=m, it is possible to enable KEXEC_FILE but then the purgatory code is silently left out. Move this into the common Kconfig.kexec file in a way that is correct everywhere, using the dependency on CRYPTO_SHA256=y only when the purgatory code is available. This requires reversing the dependency between ARCH_SUPPORTS_KEXEC_PURGATORY and KEXEC_FILE, but the effect remains the same, other than making riscv behave like the other ones. On s390, there is an additional dependency on CRYPTO_SHA256_S390, which should technically not be required but gives better performance. Remove this dependency here, noting that it was not present in the initial Kconfig code but was brought in without an explanation in commit 71406883fd357 ("s390/kexec_file: Add kexec_file_load system call"). [[email protected]: fix riscv build] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 6af5138083005 ("x86/kexec: refactor for kernel/Kconfig.kexec") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Eric DeVolder <[email protected]> Tested-by: Eric DeVolder <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Conor Dooley <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David S. Miller <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Herbert Xu <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-19powerpc/fsl: Fix fsl,tmu-calibration to match the schemaDavid Heidelberg2-74/+76
fsl,tmu-calibration is defined as a u32 matrix in Documentation/devicetree/bindings/thermal/qoriq-thermal.yaml. Use matching property syntax. No functional changes. Signed-off-by: David Heidelberg <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-17Merge tag 'powerpc-6.7-5' of ↵Linus Torvalds2-7/+46
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix a bug where heavy VAS (accelerator) usage could race with partition migration and prevent the migration from completing. - Update MAINTAINERS to add Aneesh & Naveen. Thanks to Haren Myneni. * tag 'powerpc-6.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: MAINTAINERS: powerpc: Add Aneesh & Naveen powerpc/pseries/vas: Migration suspend waits for no in-progress open windows
2023-12-15cred: get rid of CONFIG_DEBUG_CREDENTIALSJens Axboe1-1/+0
This code is rarely (never?) enabled by distros, and it hasn't caught anything in decades. Let's kill off this legacy debug code. Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2023-12-15Merge branch 'smp-topo' into nextMichael Ellerman1-54/+70
Merge a branch containing SMP topology updates from Srikar, purely so we can include the cover letter which has a lot of good detail here: PowerVM systems configured in shared processors mode have some unique challenges. Some device-tree properties will be missing on a shared processor. Hence some sched domains may not make sense for shared processor systems. Most shared processor systems are over-provisioned. Underlying PowerVM Hypervisor would schedule at a Big Core (SMT8) granularity. The most recent power processors support two almost independent cores. In a lightly loaded condition, it helps the overall system performance if we pack to lesser number of Big Cores. Since each thread-group is independent, running threads on both the thread-groups of a SMT8 core, should have a minimal adverse impact in non over provisioned scenarios. These changes in this patchset will not affect in the over provisioned scenario. If there are more threads than SMT domains, then asym_packing will not kick-in. System Configuration type=Shared mode=Uncapped smt=8 lcpu=96 mem=1066409344 kB cpus=96 ent=64.00 So *64 Entitled cores/ 96 Virtual processor* Scenario lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 768 On-line CPU(s) list: 0-767 Model name: POWER10 (architected), altivec supported Model: 2.0 (pvr 0080 0200) Thread(s) per core: 8 Core(s) per socket: 16 Socket(s): 6 Hypervisor vendor: pHyp Virtualization type: para L1d cache: 6 MiB (192 instances) L1i cache: 9 MiB (192 instances) NUMA node(s): 6 NUMA node0 CPU(s): 0-7,32-39,80-87,128-135,176-183,224-231,272-279,320-327,368-375,416-423,464-471,512-519,560-567,608-615,656-663,704-711,752-759 NUMA node1 CPU(s): 8-15,40-47,88-95,136-143,184-191,232-239,280-287,328-335,376-383,424-431,472-479,520-527,568-575,616-623,664-671,712-719,760-767 NUMA node4 CPU(s): 64-71,112-119,160-167,208-215,256-263,304-311,352-359,400-407,448-455,496-503,544-551,592-599,640-647,688-695,736-743 NUMA node5 CPU(s): 16-23,48-55,96-103,144-151,192-199,240-247,288-295,336-343,384-391,432-439,480-487,528-535,576-583,624-631,672-679,720-727 NUMA node6 CPU(s): 72-79,120-127,168-175,216-223,264-271,312-319,360-367,408-415,456-463,504-511,552-559,600-607,648-655,696-703,744-751 NUMA node7 CPU(s): 24-31,56-63,104-111,152-159,200-207,248-255,296-303,344-351,392-399,440-447,488-495,536-543,584-591,632-639,680-687,728-735 ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 3840178 4059268 3978042 3973936.6 84264.456 +patch 5 3768393 3927901 3874994 3854046 71532.926 -3.01692 >From lparstat (when the workload stabilized) Kernel %user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 4.16 0.00 0.00 95.84 26.06 40.72 4.16 69.88 276906989 578 +patch 4.16 0.00 0.00 95.83 17.70 27.66 4.17 78.26 70436663 119 ebizzy -t 128 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 5520692 5981856 5717709 5727053.2 176093.2 +patch 5 5305888 6259610 5854590 5843311 375917.03 2.02998 >From lparstat (when the workload stabilized) Kernel %user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 16.66 0.00 0.00 83.33 45.49 71.08 16.67 50.50 288778533 581 +patch 16.65 0.00 0.00 83.35 30.15 47.11 16.65 65.76 85196150 133 ebizzy -t 512 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 19563921 20049955 19701510 19728733 198295.18 +patch 5 19455992 20176445 19718427 19832017 304094.05 0.523521 >From lparstat (when the workload stabilized) %Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint 66.6.0-rc3 6.44 0.01 0.00 33.55 94.14 147.09 66.45 1.33 313345175 621 6+patch 6.44 0.01 0.00 33.55 94.15 147.11 66.45 1.33 109193889 309 System Configuration type=Shared mode=Uncapped smt=8 lcpu=40 mem=1067539392 kB cpus=96 ent=40.00 So *40 Entitled cores/ 40 Virtual processor* Scenario lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 320 On-line CPU(s) list: 0-319 Model name: POWER10 (architected), altivec supported Model: 2.0 (pvr 0080 0200) Thread(s) per core: 8 Core(s) per socket: 10 Socket(s): 4 Hypervisor vendor: pHyp Virtualization type: para L1d cache: 2.5 MiB (80 instances) L1i cache: 3.8 MiB (80 instances) NUMA node(s): 4 NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295 NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 NUMA node4 CPU(s): 16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279,304-311 NUMA node5 CPU(s): 24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287,312-319 ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 3535518 3864532 3745967 3704233.2 130216.76 +patch 5 3608385 3708026 3649379 3651596.6 37862.163 -1.42099 %Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 10.00 0.01 0.00 89.99 22.98 57.45 10.01 41.01 1135139 262 +patch 10.00 0.00 0.00 90.00 16.95 42.37 10.00 47.05 925561 19 ebizzy -t 64 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 4434984 4957281 4548786 4591298.2 211770.2 +patch 5 4461115 4835167 4544716 4607795.8 151474.85 0.359323 %Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 20.01 0.00 0.00 79.99 38.22 95.55 20.01 25.77 1287553 265 +patch 19.99 0.00 0.00 80.01 25.55 63.88 19.99 38.44 1077341 20 ebizzy -t 256 -S 200 (5 iterations) Records per second. (Higher is better) Kernel N Min Max Median Avg Stddev %Change 6.6.0-rc3 5 8850648 8982659 8951911 8936869.2 52278.031 +patch 5 8751038 9060510 8981409 8942268.4 117070.6 0.0604149 %Kernel user %sys %wait %idle physc %entc lbusy app vcsw phint 6.6.0-rc3 80.02 0.01 0.01 19.96 40.00 100.00 80.03 24.00 1597665 276 +patch 80.02 0.01 0.01 19.96 40.00 100.00 80.03 23.99 1383921 63 Observation: We are able to see Improvement in ebizzy throughput even with lesser core utilization (almost half the core utilization) in low utilization scenarios while still retaining throughput in mid and higher utilization scenarios. Note: The numbers are with Uncapped + no-noise case. In the Capped and/or noise case, due to contention on the Cores, the numbers are expected to further improve. Note: The numbers included (sched/fair: Enable group_asym_packing in find_idlest_group) https://lore.kernel.org/all/[email protected]/
2023-12-15powerpc/smp: Dynamically build Powerpc topologySrikar Dronamraju1-50/+28
Currently there are four Powerpc specific sched topologies. These are all statically defined. However not all these topologies are used by all Powerpc systems. To avoid unnecessary degenerations by the scheduler, masks and flags are compared. However if the sched topologies are build dynamically then the code is simpler and there are greater chances of avoiding degenerations. Note: Even X86 builds its sched topologies dynamically and proposed changes are very similar to the way X86 is building its topologies. Signed-off-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-15powerpc/smp: Avoid asym packing within thread_group of a coreSrikar Dronamraju1-0/+13
PowerVM Hypervisor will schedule at a core granularity. However each core can have more than one thread_groups. For better utilization in case of a shared processor, its preferable for the scheduler to pack to the lowest core. However there is no benefit of moving a thread between two thread groups of the same core. Signed-off-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-15powerpc/smp: Add __ro_after_init attributeSrikar Dronamraju1-5/+5
There are some variables that are only updated at boot time. So add __ro_after_init attribute to such variables Signed-off-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-15powerpc/smp: Disable MC domain for shared processorSrikar Dronamraju1-0/+4
Like L2-cache info, coregroup information which is used to determine MC sched domains is only present on dedicated LPARs. i.e PowerVM doesn't export coregroup information for shared processor LPARs. Hence disable creating MC domains on shared LPAR Systems. Signed-off-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-15powerpc/smp: Enable Asym packing for cores on shared processorSrikar Dronamraju1-2/+23
If there are shared processor LPARs, underlying Hypervisor can have more virtual cores to handle than actual physical cores. Starting with Power 9, a big core (aka SMT8 core) has 2 nearly independent thread groups. On a shared processors LPARs, it helps to pack threads to lesser number of cores so that the overall system performance and utilization improves. PowerVM schedules at a big core level. Hence packing to fewer cores helps. Since each thread-group is independent, running threads on both the thread-groups of a SMT8 core, should have a minimal adverse impact in non over provisioned scenarios. These changes in this patchset will not affect in the over provisioned scenario. If there are more threads than SMT domains, then asym_packing will not kick-in For example: Lets says there are two 8-core Shared LPARs that are actually sharing a 8 Core shared physical pool, each running 8 threads each. Then Consolidating 8 threads to 4 cores on each LPAR would help them to perform better. This is because each of the LPAR will get 100% time to run applications and there will no switching required by the Hypervisor. To achieve this, enable SD_ASYM_PACKING flag at CACHE, MC and DIE level when the system is running in shared processor mode and has big cores. Signed-off-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-15powerpc/sched: Cleanup vcpu_is_preempted()Aneesh Kumar K.V1-8/+25
No functional change in this patch. A helper is added to find if vcpu is dispatched by hypervisor. Use that instead of opencoding. Also clarify some of the comments. Signed-off-by: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-14wire up syscalls for statmount/listmountMiklos Szeredi1-0/+2
Wire up all archs. Signed-off-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ian Kent <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-12-13powerpc: add cpu_spec.cpu_features to vmcoreinfoAditya Gupta1-0/+1
CPU features can be determined in makedumpfile, using 'cur_cpu_spec.cpu_features'. This provides more data to makedumpfile about the crashed system, and can help in filtering the vmcore accordingly. Signed-off-by: Aditya Gupta <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-13powerpc/imc-pmu: Add a null pointer check in update_events_in_group()Kunwu Chan1-0/+6
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Fixes: 885dcd709ba9 ("powerpc/perf: Add nest IMC PMU support") Signed-off-by: Kunwu Chan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-13powerpc/powernv: Add a null pointer check in opal_powercap_init()Kunwu Chan1-0/+6
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Fixes: b9ef7b4b867f ("powerpc: Convert to using %pOFn instead of device_node.name") Signed-off-by: Kunwu Chan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-13powerpc/powernv: Add a null pointer check in opal_event_init()Kunwu Chan1-0/+2
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Fixes: 2717a33d6074 ("powerpc/opal-irqchip: Use interrupt names if present") Signed-off-by: Kunwu Chan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-13powerpc/powernv: Add a null pointer check to scom_debug_init_one()Kunwu Chan1-0/+5
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Add a null pointer check, and release 'ent' to avoid memory leaks. Fixes: bfd2f0d49aef ("powerpc/powernv: Get rid of old scom_controller abstraction") Signed-off-by: Kunwu Chan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-13powerpc/mm: Fix null-pointer dereference in pgtable_cache_addKunwu Chan1-2/+3
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Ensure the allocation was successful by checking the pointer validity. Suggested-by: Christophe Leroy <[email protected]> Suggested-by: Michael Ellerman <[email protected]> Signed-off-by: Kunwu Chan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-13powerpc/pseries/vas: Migration suspend waits for no in-progress open windowsHaren Myneni2-7/+46
The hypervisor returns migration failure if all VAS windows are not closed. During pre-migration stage, vas_migration_handler() sets migration_in_progress flag and closes all windows from the list. The allocate VAS window routine checks the migration flag, setup the window and then add it to the list. So there is possibility of the migration handler missing the window that is still in the process of setup. t1: Allocate and open VAS t2: Migration event window lock vas_pseries_mutex If migration_in_progress set unlock vas_pseries_mutex return open window HCALL unlock vas_pseries_mutex Modify window HCALL lock vas_pseries_mutex setup window migration_in_progress=true Closes all windows from the list // May miss windows that are // not in the list unlock vas_pseries_mutex lock vas_pseries_mutex return if nr_closed_windows == 0 // No DLPAR CPU or migration add window to the list // Window will be added to the // list after the setup is completed unlock vas_pseries_mutex return unlock vas_pseries_mutex Close VAS window // due to DLPAR CPU or migration return -EBUSY This patch resolves the issue with the following steps: - Set the migration_in_progress flag without holding mutex. - Introduce nr_open_wins_progress counter in VAS capabilities struct - This counter tracks the number of open windows are still in progress - The allocate setup window thread closes windows if the migration is set and decrements nr_open_window_progress counter - The migration handler waits for no in-progress open windows. The code flow with the fix is as follows: t1: Allocate and open VAS t2: Migration event window lock vas_pseries_mutex If migration_in_progress set unlock vas_pseries_mutex return open window HCALL nr_open_wins_progress++ // Window opened, but not // added to the list yet unlock vas_pseries_mutex Modify window HCALL migration_in_progress=true setup window lock vas_pseries_mutex Closes all windows from the list While nr_open_wins_progress { unlock vas_pseries_mutex lock vas_pseries_mutex sleep if nr_closed_windows == 0 // Wait if any open window in or migration is not started // progress. The open window // No DLPAR CPU or migration // thread closes the window without add window to the list // adding to the list and return if nr_open_wins_progress-- // the migration is in progress. unlock vas_pseries_mutex return Close VAS window nr_open_wins_progress-- unlock vas_pseries_mutex return -EBUSY lock vas_pseries_mutex } unlock vas_pseries_mutex return Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler") Signed-off-by: Haren Myneni <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2023-12-13powerpc/Kconfig: Select FUNCTION_ALIGNMENT_4BSathvika Vasireddy2-3/+1
Commit d49a0626216b95 ("arch: Introduce CONFIG_FUNCTION_ALIGNMENT") introduced a generic function-alignment infrastructure. Move to using FUNCTION_ALIGNMENT_4B on powerpc, to use the same alignment as that of the existing _GLOBAL macro. Signed-off-by: Sathvika Vasireddy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/21892186ec44abe24df0daf64f577dac0e78783f.1702045299.git.naveen@kernel.org