blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2023-10-18	hugetlbfs: close race between MADV_DONTNEED and page fault	Rik van Riel	1	-2/+33
	Malloc libraries, like jemalloc and tcalloc, take decisions on when to call madvise independently from the code in the main application. This sometimes results in the application page faulting on an address, right after the malloc library has shot down the backing memory with MADV_DONTNEED. Usually this is harmless, because we always have some 4kB pages sitting around to satisfy a page fault. However, with hugetlbfs systems often allocate only the exact number of huge pages that the application wants. Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of any lock taken on the page fault path, which can open up the following race condition: CPU 1 CPU 2 MADV_DONTNEED unmap page shoot down TLB entry page fault fail to allocate a huge page killed with SIGBUS free page Fix that race by pulling the locking from __unmap_hugepage_final_range into helper functions called from zap_page_range_single. This ensures page faults stay locked out of the MADV_DONTNEED VMA until the huge pages have actually been freed. Link: https://lkml.kernel.org/r/[email protected] Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing") Signed-off-by: Rik van Riel <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18	hugetlbfs: extend hugetlb_vma_lock to private VMAs	Rik van Riel	1	-0/+6
	Extend the locking scheme used to protect shared hugetlb mappings from truncate vs page fault races, in order to protect private hugetlb mappings (with resv_map) against MADV_DONTNEED. Add a read-write semaphore to the resv_map data structure, and use that from the hugetlb_vma_(un)lock_* functions, in preparation for closing the race between MADV_DONTNEED and page faults. Link: https://lkml.kernel.org/r/[email protected] Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing") Signed-off-by: Rik van Riel <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18	Merge tag 'qcom-drivers-for-6.7' of ↵	Arnd Bergmann	5	-55/+123
	https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/drivers Qualcomm driver updates for v6.7 This introduces partial support for the Qualcomm Secure Execution Environment SCM interface, and uses this to implement EFI variable access on the Windows On Snapdragon devices (for now). The 32/64-bit calling convention detector of the SCM interface is updated to not choose 64-bit convention when Linux is 32-bit. The "extern" specifier is dropped from the interface include file. The LLCC driver gains support for carrying configuration for multiple different system/DDR configurations for a given platform, and selecting between them. Support for Q[DR]U1000 is added to the driver. All exported symbols are transitioned to EXPORT_SYMBOL_GPL(). The platform_drivers in the Qualcomm SoC are transitioned to the void-returning remove_new implementation. The rmtfs memory driver gains support for leaving guard pages around the used area, to avoid issues if the allocation happens to be placed adjacent to another protected memory region. The socinfo driver gains knowledge about IPQ8174, QCM6490, SM7150P and various PMICs used together with SM8550. * tag 'qcom-drivers-for-6.7' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (44 commits) soc: qcom: socinfo: Convert to platform remove callback returning void soc: qcom: smsm: Convert to platform remove callback returning void soc: qcom: smp2p: Convert to platform remove callback returning void soc: qcom: smem: Convert to platform remove callback returning void soc: qcom: rmtfs_mem: Convert to platform remove callback returning void soc: qcom: qcom_stats: Convert to platform remove callback returning void soc: qcom: qcom_gsbi: Convert to platform remove callback returning void soc: qcom: qcom_aoss: Convert to platform remove callback returning void soc: qcom: pmic_glink: Convert to platform remove callback returning void soc: qcom: ocmem: Convert to platform remove callback returning void soc: qcom: llcc-qcom: Convert to platform remove callback returning void soc: qcom: icc-bwmon: Convert to platform remove callback returning void firmware: qcom_scm: use 64-bit calling convention only when client is 64-bit soc: qcom: llcc: Handle a second device without data corruption soc: qcom: Switch to EXPORT_SYMBOL_GPL() soc: qcom: smem: Annotate struct qcom_smem with __counted_by soc: qcom: rmtfs: Support discarding guard pages dt-bindings: reserved-memory: rmtfs: Allow guard pages dt-bindings: firmware: qcom,scm: document IPQ5018 compatible firmware: qcom_scm: disable SDI if required ... Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
2023-10-19	kprobes: freelist.h removed	wuqiang.matt	1	-129/+0
	This patch will remove freelist.h from kernel source tree, since the only use cases (kretprobe and rethook) are converted to objpool. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: wuqiang.matt <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-10-18	kprobes: kretprobe scalability improvement	wuqiang.matt	2	-20/+7
	kretprobe is using freelist to manage return-instances, but freelist, as LIFO queue based on singly linked list, scales badly and reduces the overall throughput of kretprobed routines, especially for high contention scenarios. Here's a typical throughput test of sys_prctl (counts in 10 seconds, measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl): OS: Debian 10 X86_64, Linux 6.5rc7 with freelist HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s 1T 2T 4T 8T 16T 24T 24150045 29317964 15446741 12494489 18287272 17708768 32T 48T 64T 72T 96T 128T 16200682 13737658 11645677 11269858 10470118 9931051 This patch introduces objpool to replace freelist. objpool is a high performance queue, which can bring near-linear scalability to kretprobed routines. Tests of kretprobe throughput show the biggest ratio as 159x of original freelist. Here's the result: 1T 2T 4T 8T 16T native: 41186213 82336866 164250978 328662645 658810299 freelist: 24150045 29317964 15446741 12494489 18287272 objpool: 23926730 48010314 96125218 191782984 385091769 32T 48T 64T 96T 128T native: 1330338351 1969957941 2512291791 2615754135 2671040914 freelist: 16200682 13737658 11645677 10470118 9931051 objpool: 764481096 1147149781 1456220214 1502109662 1579015050 Testings on 96-core ARM64 output similarly, but with the biggest ratio up to 448x: OS: Debian 10 AARCH64, Linux 6.5rc7 HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s 1T 2T 4T 8T 16T native: . 30066096 63569843 126194076 257447289 505800181 freelist: 16152090 11064397 11124068 7215768 5663013 objpool: 13997541 28032100 55726624 110099926 221498787 24T 32T 48T 64T 96T native: 763305277 1015925192 1521075123 2033009392 3021013752 freelist: 5015810 4602893 3766792 3382478 2945292 objpool: 328192025 439439564 668534502 887401381 1319972072 Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: wuqiang.matt <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-10-18	lib: objpool added: ring-array based lockless MPMC	wuqiang.matt	1	-0/+181
	objpool is a scalable implementation of high performance queue for object allocation and reclamation, such as kretprobe instances. With leveraging percpu ring-array to mitigate hot spots of memory contention, it delivers near-linear scalability for high parallel scenarios. The objpool is best suited for the following cases: 1) Memory allocation or reclamation are prohibited or too expensive 2) Consumers are of different priorities, such as irqs and threads Limitations: 1) Maximum objects (capacity) is fixed after objpool creation 2) All pre-allocated objects are managed in percpu ring array, which consumes more memory than linked lists Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: wuqiang.matt <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-10-18	fs: rename inode i_atime and i_mtime fields	Jeff Layton	1	-10/+10
	Rename these two fields to discourage direct access (and to help ensure that we mop up any leftover direct accesses). Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-10-18	linux: convert to new timestamp accessors	Jeff Layton	1	-3/+3
	Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2023-10-18	fs: new accessor methods for atime and mtime	Jeff Layton	1	-13/+72
	Recently, we converted the ctime accesses in the kernel to use new accessor functions. Linus recently pointed out though that if we add accessors for the atime and mtime, then that would allow us to seamlessly change how these timestamps are stored in the inode. Add new accessor functions for the atime and mtime that mirror the accessors for the ctime. Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2023-10-18	Merge tag 'nf-next-23-10-18' of ↵	David S. Miller	1	-0/+10
	https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter next pull request 2023-10-18 This series contains initial netfilter skb drop_reason support, from myself. First few patches fix up a few spots to make sure we won't trip when followup patches embed error numbers in the upper bits (we already do this in some places). Then, nftables and bridge netfilter get converted to call kfree_skb_reason directly to let tooling pinpoint exact location of packet drops, rather than the existing NF_DROP catchall in nf_hook_slow(). I would like to eventually convert all netfilter modules, but as some callers cannot deal with NF_STOLEN (notably act_ct), more preparation work is needed for this. Last patch gets rid of an ugly 'de-const' cast in nftables. ==================== Signed-off-by: David S. Miller <[email protected]>
2023-10-17	workqueue: Provide one lock class key per work_on_cpu() callsite	Frederic Weisbecker	1	-7/+39
	All callers of work_on_cpu() share the same lock class key for all the functions queued. As a result the workqueue related locking scenario for a function A may be spuriously accounted as an inversion against the locking scenario of function B such as in the following model: long A(void arg) { mutex_lock(&mutex); mutex_unlock(&mutex); } long B(void arg) { } void launchA(void) { work_on_cpu(0, A, NULL); } void launchB(void) { mutex_lock(&mutex); work_on_cpu(1, B, NULL); mutex_unlock(&mutex); } launchA and launchB running concurrently have no chance to deadlock. However the above can be reported by lockdep as a possible locking inversion because the works containing A() and B() are treated as belonging to the same locking class. The following shows an existing example of such a spurious lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted ------------------------------------------------------ kworker/0:1/9 is trying to acquire lock: ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0 but task is already holding lock: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __flush_work+0x83/0x4e0 work_on_cpu+0x97/0xc0 rcu_nocb_cpu_offload+0x62/0xb0 rcu_nocb_toggle+0xd0/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 -> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}: __mutex_lock+0x81/0xc80 rcu_nocb_cpu_deoffload+0x38/0xb0 rcu_nocb_toggle+0x144/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 -> #0 (cpu_hotplug_lock){++++}-{0:0}: __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 percpu_down_write+0x31/0x200 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 other info that might help us debug this: Chain exists of: cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work) Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&wfc.work)); lock(rcu_state.barrier_mutex); lock((work_completion)(&wfc.work)); lock(cpu_hotplug_lock); * DEADLOCK * 2 locks held by kworker/0:1/9: #0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500 #1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500 stack backtrace: CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: events work_for_cpu_fn Call Trace: rcu-torture: rcu_torture_read_exit: Start of episode <TASK> dump_stack_lvl+0x4a/0x80 check_noncircular+0x132/0x150 __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 ? _cpu_down+0x57/0x2b0 percpu_down_write+0x31/0x200 ? _cpu_down+0x57/0x2b0 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 ? __pfx_worker_thread+0x10/0x10 kthread+0xe6/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK Fix this with providing one lock class key per work_on_cpu() caller. Reported-and-tested-by: Paul E. McKenney <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2023-10-18	ethtool: Add forced speed to supported link modes maps	Paul Greenwalt	2	-14/+52
	The need to map Ethtool forced speeds to Ethtool supported link modes is common among drivers. To support this, add a common structure for forced speed maps and a function to init them. This is solution was originally introduced in commit 1d4e4ecccb11 ("qede: populate supported link modes maps on module init") for qede driver. ethtool_forced_speed_maps_init() should be called during driver init with an array of struct ethtool_forced_speed_map to populate the mapping. Definitions for maps themselves are left in the driver code, as the sets of supported link modes may vary between the devices. Suggested-by: Andrew Lunn <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Pawel Chmielewski <[email protected]> Signed-off-by: Paul Greenwalt <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-10-18	netfilter: make nftables drops visible in net dropmonitor	Florian Westphal	1	-0/+10
	net_dropmonitor blames core.c:nf_hook_slow. Add NF_DROP_REASON() helper and use it in nft_do_chain(). The helper releases the skb, so exact drop location becomes available. Calling code will observe the NF_STOLEN verdict instead. Adjust nf_hook_slow so we can embed an erro value wih NF_STOLEN verdicts, just like we do for NF_DROP. After this, drop in nftables can be pinpointed to a drop due to a rule or the chain policy. Signed-off-by: Florian Westphal <[email protected]>
2023-10-18	parport: Use kasprintf() instead of fixed buffer formatting	Andy Shevchenko	1	-2/+0
	Improve readability and maintainability by replacing a hardcoded string allocation and formatting by the use of the kasprintf() helper. Signed-off-by: Andy Shevchenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-18	mei: docs: fix spelling errors	Tomas Winkler	1	-2/+2
	Fix spelling errors in the mei code base. Signed-off-by: Tomas Winkler <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-18	mei: bus: add send and recv api with timeout	Alexander Usyskin	1	-0/+8
	Add variation of the send and recv functions on bus that define timeout. Caller can use such functions in flow that can stuck to bail out and not to put down the whole system. Signed-off-by: Alexander Usyskin <[email protected]> Signed-off-by: Alan Previn <[email protected]> Signed-off-by: Tomas Winkler <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-17	Merge tag 'mlx5-updates-2023-10-10' of ↵	Jakub Kicinski	1	-14/+2
	git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-10-10 1) Adham Faris, Increase max supported channels number to 256 2) Leon Romanovsky, Allow IPsec soft/hard limits in bytes 3) Shay Drory, Replace global mlx5_intf_lock with HCA devcom component lock 4) Wei Zhang, Optimize SF creation flow During SF creation, HCA state gets changed from INVALID to IN_USE step by step. Accordingly, FW sends vhca event to driver to inform about this state change asynchronously. Each vhca event is critical because all related SW/FW operations are triggered by it. Currently there is only a single mlx5 general event handler which not only handles vhca event but many other events. This incurs huge bottleneck because all events are forced to be handled in serial manner. Moreover, all SFs share same table_lock which inevitably impacts each other when they are created in parallel. This series will solve this issue by: 1. A dedicated vhca event handler is introduced to eliminate the mutual impact with other mlx5 events. 2. Max FW threads work queues are employed in the vhca event handler to fully utilize FW capability. 3. Redesign SF active work logic to completely remove table_lock. With above optimization, SF creation time is reduced by 25%, i.e. from 80s to 60s when creating 100 SFs. Patches summary: Patch 1 - implement dedicated vhca event handler with max FW cmd threads of work queues. Patch 2 - remove table_lock by redesigning SF active work logic. * tag 'mlx5-updates-2023-10-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: Allow IPsec soft/hard limits in bytes net/mlx5e: Increase max supported channels number to 256 net/mlx5e: Preparations for supporting larger number of channels net/mlx5e: Refactor mlx5e_rss_init() and mlx5e_rss_free() API's net/mlx5e: Refactor mlx5e_rss_set_rxfh() and mlx5e_rss_get_rxfh() net/mlx5e: Refactor rx_res_init() and rx_res_free() APIs net/mlx5e: Use PTR_ERR_OR_ZERO() to simplify code net/mlx5: Use PTR_ERR_OR_ZERO() to simplify code net/mlx5: fix config name in Kconfig parameter documentation net/mlx5: Remove unused declaration net/mlx5: Replace global mlx5_intf_lock with HCA devcom component lock net/mlx5: Refactor LAG peer device lookout bus logic to mlx5 devcom net/mlx5: Avoid false positive lockdep warning by adding lock_class_key net/mlx5: Redesign SF active work to remove table_lock net/mlx5: Parallelize vhca event handling ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-17	net: phylink: remove a bunch of unused validation methods	Russell King (Oracle)	1	-11/+0
	Remove exports for phylink_caps_to_linkmodes(), phylink_get_capabilities(), phylink_validate_mask_caps() and phylink_generic_validate(). Also, as phylink_generic_validate() is no longer called, we can remove its implementation as well. Signed-off-by: Russell King (Oracle) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-17	net: phylink: remove .validate() method	Russell King (Oracle)	1	-38/+0
	The MAC .validate() method is no longer used, so remove it from the phylink_mac_ops structure, and remove the callsite in phylink_validate_mac_and_pcs(). Signed-off-by: Russell King (Oracle) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-17	net: phylink: provide mac_get_caps() method	Russell King (Oracle)	1	-0/+15
	Provide a new method, mac_get_caps() to get the MAC capabilities for the specified interface mode. This is for MACs which have special requirements, such as not supporting half-duplex in certain interface modes, and will replace the validate() method. Signed-off-by: Russell King (Oracle) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-17	Merge tag 'wireless-next-2023-10-16' of ↵	Jakub Kicinski	1	-0/+2
	git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.7 The second pull request for v6.7, with only driver changes this time. We have now support for mt7925 PCIe and USB variants, few new features and of course some fixes. Major changes: mt76 - mt7925 support ath12k - read board data variant name from SMBIOS wfx - Remain-On-Channel (ROC) support * tag 'wireless-next-2023-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (109 commits) wifi: rtw89: mac: do bf_monitor only if WiFi 6 chips wifi: rtw89: mac: set bf_assoc capabilities according to chip gen wifi: rtw89: mac: set bfee_ctrl() according to chip gen wifi: rtw89: mac: add registers of MU-EDCA parameters for WiFi 7 chips wifi: rtw89: mac: generalize register of MU-EDCA switch according to chip gen wifi: rtw89: mac: update RTS threshold according to chip gen wifi: rtlwifi: simplify TX command fill callbacks wifi: hostap: remove unused ioctl function wifi: atmel: remove unused ioctl function wifi: rtw89: coex: add annotation __counted_by() to struct rtw89_btc_btf_set_mon_reg wifi: rtw89: coex: add annotation __counted_by() for struct rtw89_btc_btf_set_slot_table wifi: rtw89: add EHT radiotap in monitor mode wifi: rtw89: show EHT rate in debugfs wifi: rtw89: parse TX EHT rate selected by firmware from RA C2H report wifi: rtw89: Add EHT rate mask as parameters of RA H2C command wifi: rtw89: parse EHT information from RX descriptor and PPDU status packet wifi: radiotap: add bandwidth definition of EHT U-SIG wifi: rtlwifi: use convenient list_count_nodes() wifi: p54: Annotate struct p54_cal_database with __counted_by wifi: brcmfmac: fweh: Add __counted_by for struct brcmf_fweh_queue_item and use struct_size() ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-17	Merge tag 'nvme-6.7-2023-10-17' of git://git.infradead.org/nvme into ↵	Jens Axboe	5	-2/+58
	for-6.7/block Pull NVMe updates from Keith: "nvme updates for Linux 6.7 - nvme-auth updates (Mark) - nvme-tcp tls (Hannes) - nvme-fc annotaions (Kees)" * tag 'nvme-6.7-2023-10-17' of git://git.infradead.org/nvme: (24 commits) nvme-auth: allow mixing of secret and hash lengths nvme-auth: use transformed key size to create resp nvme-auth: alloc nvme_dhchap_key as single buffer nvmet-tcp: use 'spin_lock_bh' for state_lock() nvme: rework NVME_AUTH Kconfig selection nvmet-tcp: peek icreq before starting TLS nvmet-tcp: control messages for recvmsg() nvmet-tcp: enable TLS handshake upcall nvmet: Set 'TREQ' to 'required' when TLS is enabled nvmet-tcp: allocate socket file nvmet-tcp: make nvmet_tcp_alloc_queue() a void function nvmet: make TCP sectype settable via configfs nvme-fabrics: parse options 'keyring' and 'tls_key' nvme-tcp: improve icreq/icresp logging nvme-tcp: control message handling for recvmsg() nvme-tcp: enable TLS handshake upcall nvme-tcp: allocate socket file security/keys: export key_lookup() nvme-keyring: implement nvme_tls_psk_default() nvme-tcp: add definitions for TLS cipher suites ...
2023-10-17	nvme-auth: use transformed key size to create resp	Mark O'Donovan	1	-1/+2
	This does not change current behaviour as the driver currently verifies that the secret size is the same size as the length of the transformation hash. Co-developed-by: Akash Appaiah <[email protected]> Signed-off-by: Akash Appaiah <[email protected]> Signed-off-by: Mark O'Donovan <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2023-10-17	nvme-auth: alloc nvme_dhchap_key as single buffer	Mark O'Donovan	1	-1/+3
	Co-developed-by: Akash Appaiah <[email protected]> Signed-off-by: Akash Appaiah <[email protected]> Signed-off-by: Mark O'Donovan <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2023-10-17	PCI/MSI: Provide stubs for IMS functions	Reinette Chatre	1	-8/+26
	The IMS related functions (pci_create_ims_domain(), pci_ims_alloc_irq(), and pci_ims_free_irq()) are not declared when CONFIG_PCI_MSI is disabled. Provide definitions of these functions for use when callers are compiled with CONFIG_PCI_MSI disabled. Fixes: 0194425af0c8 ("PCI/MSI: Provide IMS (Interrupt Message Store) support") Fixes: c9e5bea27383 ("PCI/MSI: Provide pci_ims_alloc/free_irq()") Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/14ff656899a3757453f8584c1109d7a9b98fa258.1697564731.git.reinette.chatre@intel.com
2023-10-17	Merge branch 'linus' into smp/core	Thomas Gleixner	352	-3000/+7736
	Pull in upstream to get the fixes so depending changes can be applied.
2023-10-17	block:sed-opal: SED Opal keystore	Greg Joyce	1	-0/+26
	Add read and write functions that allow SED Opal keys to stored in a permanent keystore. Signed-off-by: Greg Joyce <[email protected]> Reviewed-by: Jonathan Derrick <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-10-17	Merge branch 'for-6.7/io_uring' into for-6.7/block	Jens Axboe	2	-2/+26
	Merge in io_uring fixes, as the ublk simplifying cancelations and aborts depend on the two patches from Ming adding cancelation support for uring_cmd. * for-6.7/io_uring: io_uring/kbuf: Use slab for struct io_buffer objects io_uring/kbuf: Allow the full buffer id space for provided buffers io_uring/kbuf: Fix check of BID wrapping in provided buffers io_uring/rsrc: cleanup io_pin_pages() io_uring: cancelable uring_cmd io_uring: retain top 8bits of uring_cmd flags for kernel internal use io_uring: add IORING_OP_WAITID support exit: add internal include file with helpers exit: add kernel_waitid_prepare() helper exit: move core of do_wait() into helper exit: abstract out should_wake helper for child_wait_callback() io_uring/rw: add support for IORING_OP_READ_MULTISHOT io_uring/rw: mark readv/writev as vectored in the opcode definition io_uring/rw: split io_read() into a helper
2023-10-17	net, bpf: Add a warning if NAPI cb missed xdp_do_flush().	Sebastian Andrzej Siewior	1	-0/+3
	A few drivers were missing a xdp_do_flush() invocation after XDP_REDIRECT. Add three helper functions each for one of the per-CPU lists. Return true if the per-CPU list is non-empty and flush the list. Add xdp_do_check_flushed() which invokes each helper functions and creates a warning if one of the functions had a non-empty list. Hide everything behind CONFIG_DEBUG_NET. Suggested-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-17	locking/seqlock: Fix grammar in comment	Cuda-Chen	1	-1/+1
	The "neither writes before and after ..." for the description of do_write_seqcount_end() should be "neither writes before nor after". Signed-off-by: Cuda-Chen <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-10-17	i3c: Add support for bus enumeration & notification	Jeremy Kerr	1	-0/+11
	This allows other drivers to be notified when new i3c busses are attached, referring to a whole i3c bus as opposed to individual devices. Signed-off-by: Jeremy Kerr <[email protected]> Signed-off-by: Matt Johnston <[email protected]> Acked-by: Alexandre Belloni <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-10-17	pmdomain: mediatek: Add support for MT8365	Fabien Parent	1	-0/+41
	Add the needed board data to support MT8365 SoC. Signed-off-by: Fabien Parent <[email protected]> Signed-off-by: Markus Schneider-Pargmann <[email protected]> Reviewed-by: AngeloGioacchino Del Regno <[email protected]> Reviewed-by: Alexandre Mergnat <[email protected]> Tested-by: Alexandre Mergnat <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Ulf Hansson <[email protected]>
2023-10-17	printk: Constify name for add_preferred_console()	Tony Lindgren	1	-1/+1
	While adding a preferred console handling for serial_core for serial port hardware based device addressing, Jiri suggested we constify name for add_preferred_console(). The name gets copied anyways. This allows serial core to add a preferred console using serial drv->dev_name without copying it. Note that constifying options causes changes all over the place because of struct console for match(). Suggested-by: Jiri Slaby <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Tony Lindgren <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-17	printk: Check valid console index for preferred console	Tony Lindgren	1	-1/+1
	Let's check for valid console index values for preferred console to avoid bogus console index numbers from kernel command line. Let's also return an error for negative index numbers for the preferred console. Unlike for device drivers, a negative index is not valid for the preferred console. Let's also constify idx while at it. Signed-off-by: Tony Lindgren <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-17	vgacon: remove screen_info dependency	Arnd Bergmann	1	-0/+7
	The vga console driver is fairly self-contained, and only used by architectures that explicitly initialize the screen_info settings. Chance every instance that picks the vga console by setting conswitchp to call a function instead, and pass a reference to the screen_info there. Reviewed-by: Javier Martinez Canillas <[email protected]> Acked-by: Khalid Azzi <[email protected]> Acked-by: Helge Deller <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-17	Merge tag 'drm-habanalabs-next-2023-10-10' of ↵	Dave Airlie	2	-0/+2209
	https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next This tag contains habanalabs driver changes for v6.7. The notable changes are: - uAPI changes: - Expose tsc clock sampling to better sync clock information in profiler. - Enhance engine error reporting in the info ioctl. - Block access to the eventfd operations through the control device. - Disable the option of the user to register multiple times with the same offset for timestamp dump by the driver. If a user wants to use the same offset in the timestamp buffer for different interrupt, it needs to first de-register the offset. - When exporting dma-buf (for p2p), force the user to specify size/offset in multiples of PAGE_SIZE. This is instead of the driver doing the rounding to PAGE_SIZE, which has caused the driver to map more memory than was intended by the user. - New features and improvements: - Complete the move of the driver to the accel subsystem by removing the custom habanalabs class and major and registering to accel subsystem. - Move the firmware interface files to include/linux/habanalabs. This is a pre-requisite for upstreaming the NIC drivers of Gaudi (as they need to include those files). - Perform device hard-reset upon PCIe AXI drain event to prevent the failure from cascading to different IP blocks in the SoC. In secured environments, this is done automatically by the firmware. - Print device name when it is removed for better debuggability. - Add support for trace of dma map sgtable operations. - Optimize handling of user interrupts by splitting the interrupts to two lists. One list for fast handling and second list for handling with timestamp recording, which is slower. - Prevent double device hard-reset due to 2 adjacent H/W events. - Set device status 'malfunction' while in rmmod. - Firmware related fixes: - Extend preboot timeout because preboot loading might take longer than expected in certain cases. - Add a protection mechanism for the Event Queue. In case it is full, the firmware will be able to notify about it through a dedicated interrupt. - Perform device hard-reset in case scrubbing of memory has failed. - Bug fixes and code cleanups: - Small fixes of dma-buf handling in Gaudi2, such as handling an offset != 0, using the correct exported size, creation of sg table. - Fix spmu mask creation. - Fix bug in wait for cs completion for decoder workloads. - Cleanup Greco name from documentation. - Fix bug in recording timestamp during cs completion interrupt handling. - Fix CoreSight ETF configuration and flush logic. - Fix small bug in hpriv_list handling (the list that contains the private data per process that opens our device). Signed-off-by: Dave Airlie <[email protected]> # -----BEGIN PGP SIGNATURE----- # # iQEzBAABCgAdFiEE7TEboABC71LctBLFZR1NuKta54AFAmUlHoQACgkQZR1NuKta # 54DsXQf8CW+W4iWJf5UDTj/E/giu9rVRrsUsU0hhCcXbecIxRsLObYXtulENu5/u # VuEAo/tAvo0LUKi8pdIv6ernDKaxZ1+fimlfXMCzllAA/ts3yp1NgunprsIsx3tv # YgcJ2GNR8UlVZ1qYuZl+4dOTyD0yfRMROUXBe7wqKnUXOEepOiLBxq6W15tZiJnx # L+V0yGkNk6pAoADIXLW9EgEXiN/bJZCXGPWp06i/Nz7cHIHJGoV59wAqftqllCtk # 8ZMkLByjlQKPhc5AgWBtKE8EGVip3sm7b/Q2Gq0ZXdZiebyVJ+AjuuDOdtq1UCIw # Rcp2576E7rByIBu3RAFlrioWhuR5Zw== # =2ien # -----END PGP SIGNATURE----- # gpg: Signature made Tue 10 Oct 2023 19:51:00 AEST # gpg: using RSA key ED311BA00042EF52DCB412C5651D4DB8AB5AE780 # gpg: Can't check signature: No public key From: Oded Gabbay <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2023-10-16	Merge tag 'for-netdev' of ↵	Jakub Kicinski	5	-41/+83
	https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16 We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-). The main changes are: 1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa. 2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer. 3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich. 4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko. 5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis. 6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky. 7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly. 8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare. 9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel. 10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet. 11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti. 12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen. 13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers. 14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao. 15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits) bpf: Ensure proper register state printing for cond jumps bpf: Disambiguate SCALAR register state output in verifier logs selftests/bpf: Make align selftests more robust selftests/bpf: Improve missed_kprobe_recursion test robustness selftests/bpf: Improve percpu_alloc test robustness selftests/bpf: Add tests for open-coded task_vma iter bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Don't explicitly emit BTF for struct btf_iter_num bpf: Change syscall_nr type to int in struct syscall_tp_t net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set bpf: Avoid unnecessary audit log for CPU security mitigations selftests/bpf: Add tests for cgroup unix socket address hooks selftests/bpf: Make sure mount directory exists documentation/bpf: Document cgroup unix socket address hooks bpftool: Add support for cgroup unix socket address hooks libbpf: Add support for cgroup unix socket address hooks bpf: Implement cgroup sockaddr hooks for unix sockets bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf bpf: Propagate modified uaddrlen from cgroup sockaddr programs ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-16	page_pool: fragment API support for 32-bit arch with 64-bit DMA	Yunsheng Lin	1	-12/+1
	Currently page_pool_alloc_frag() is not supported in 32-bit arch with 64-bit DMA because of the overlap issue between pp_frag_count and dma_addr_upper in 'struct page' for those arches, which seems to be quite common, see [1], which means driver may need to handle it when using fragment API. It is assumed that the combination of the above arch with an address space >16TB does not exist, as all those arches have 64b equivalent, it seems logical to use the 64b version for a system with a large address space. It is also assumed that dma address is page aligned when we are dma mapping a page aligned buffer, see [2]. That means we're storing 12 bits of 0 at the lower end for a dma address, we can reuse those bits for the above arches to support 32b+12b, which is 16TB of memory. If we make a wrong assumption, a warning is emitted so that user can report to us. 1. https://lore.kernel.org/all/[email protected]/ 2. https://lore.kernel.org/all/[email protected]/ Tested-by: Alexander Lobakin <[email protected]> Signed-off-by: Yunsheng Lin <[email protected]> CC: Lorenzo Bianconi <[email protected]> CC: Alexander Duyck <[email protected]> CC: Liang Chen <[email protected]> CC: Guillaume Tucker <[email protected]> CC: Matthew Wilcox <[email protected]> CC: Linux-MM <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-16	bitmap: move bitmap_*_region() functions to bitmap.h	Yury Norov	1	-3/+61
	Now that bitmap_*_region() functions are implemented as thin wrappers around others, it's worth to move them to the header, as it opens room for compile-time optimizations. CC: Andy Shevchenko <[email protected]> CC: Rasmus Villemoes <[email protected]> CC: Greg Kroah-Hartman <[email protected]> Signed-off-by: Yury Norov <[email protected]>
2023-10-16	dax, kmem: calculate abstract distance with general interface	Huang Ying	1	-0/+2
	Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is used for slow memory type in kmem driver. This limits the usage of kmem driver, for example, it cannot be used for HBM (high bandwidth memory). So, we use the general abstract distance calculation mechanism in kmem drivers to get more accurate abstract distance on systems with proper support. The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as fallback only. Now, multiple memory types may be managed by kmem. These memory types are put into the "kmem_memory_types" list and protected by kmem_memory_type_lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Tested-by: Bharata B Rao <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Wei Xu <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Rafael J Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-16	acpi, hmat: calculate abstract distance with HMAT	Huang Ying	1	-0/+18
	A memory tiering abstract distance calculation algorithm based on ACPI HMAT is implemented. The basic idea is as follows. The performance attributes of system default DRAM nodes are recorded as the base line. Whose abstract distance is MEMTIER_ADISTANCE_DRAM. Then, the ratio of the abstract distance of a memory node (target) to MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance attributes of the node to that of the default DRAM nodes. The functions to record the read/write latency/bandwidth of the default DRAM nodes and calculate abstract distance according to read/write latency/bandwidth ratio will be used by CXL CDAT (Coherent Device Attribute Table) and other memory device drivers. So, they are put in memory-tiers.c. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Tested-by: Bharata B Rao <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Wei Xu <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Rafael J Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-16	memory tiering: add abstract distance calculation algorithms management	Huang Ying	1	-0/+19
	Patch series "memory tiering: calculate abstract distance based on ACPI HMAT", v4. We have the explicit memory tiers framework to manage systems with multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices. Where, same kind of memory devices will be grouped into memory types, then put into memory tiers. To describe the performance of a memory type, abstract distance is defined. Which is in direct proportion to the memory latency and inversely proportional to the memory bandwidth. To keep the code as simple as possible, fixed abstract distance is used in dax/kmem to describe slow memory such as Optane DCPMM. To support more memory types, in this series, we added the abstract distance calculation algorithm management mechanism, provided a algorithm implementation based on ACPI HMAT, and used the general abstract distance calculation interface in dax/kmem driver. So, dax/kmem can support HBM (high bandwidth memory) in addition to the original Optane DCPMM. This patch (of 4): The abstract distance may be calculated by various drivers, such as ACPI HMAT, CXL CDAT, etc. While it may be used by various code which hot-add memory node, such as dax/kmem etc. To decouple the algorithm users and the providers, the abstract distance calculation algorithms management mechanism is implemented in this patch. It provides interface for the providers to register the implementation, and interface for the users. Multiple algorithm implementations can cooperate via calculating abstract distance for different memory nodes. The preference of algorithm implementations can be specified via priority (notifier_block.priority). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Tested-by: Bharata B Rao <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Wei Xu <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Rafael J Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-16	mm/filemap: remove hugetlb special casing in filemap.c	Sidhartha Kumar	2	-30/+14
	Remove special cased hugetlb handling code within the page cache by changing the granularity of ->index to the base page size rather than the huge page size. The motivation of this patch is to reduce complexity within the filemap code while also increasing performance by removing branches that are evaluated on every page cache lookup. To support the change in index, new wrappers for hugetlb page cache interactions are added. These wrappers perform the conversion to a linear index which is now expected by the page cache for huge pages. ========================= PERFORMANCE ====================================== Perf was used to check the performance differences after the patch. Overall the performance is similar to mainline with a very small larger overhead that occurs in __filemap_add_folio() and hugetlb_add_to_page_cache(). This is because of the larger overhead that occurs in xa_load() and xa_store() as the xarray is now using more entries to store hugetlb folios in the page cache. Timing aarch64 2MB Page Size 6.5-rc3 + this patch: [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt real 1m49.568s user 0m0.000s sys 1m49.461s 6.5-rc3: [root]# time fallocate -l 700GB test.txt real 1m47.495s user 0m0.000s sys 1m47.370s 1GB Page Size 6.5-rc3 + this patch: [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt real 1m47.024s user 0m0.000s sys 1m46.921s 6.5-rc3: [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt real 1m44.551s user 0m0.000s sys 1m44.438s x86 2MB Page Size 6.5-rc3 + this patch: [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt real 0m22.383s user 0m0.000s sys 0m22.255s 6.5-rc3: [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt real 0m22.735s user 0m0.038s sys 0m22.567s 1GB Page Size 6.5-rc3 + this patch: [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt real 0m25.786s user 0m0.001s sys 0m25.589s 6.5-rc3: [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt real 0m33.454s user 0m0.001s sys 0m33.193s aarch64: workload - fallocate a 700GB file backed by huge pages 6.5-rc3 + this patch: 2MB Page Size: --100.00%--__arm64_sys_fallocate ksys_fallocate vfs_fallocate hugetlbfs_fallocate \| \|--95.04%--__pi_clear_page \| \|--3.57%--clear_huge_page \| \| \| \|--2.63%--rcu_all_qs \| \| \| --0.91%--__cond_resched \| --0.67%--__cond_resched 0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache 0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio 6.5-rc3 2MB Page Size: --100.00%--__arm64_sys_fallocate ksys_fallocate vfs_fallocate hugetlbfs_fallocate \| \|--94.91%--__pi_clear_page \| \|--4.11%--clear_huge_page \| \| \| \|--3.00%--rcu_all_qs \| \| \| --1.10%--__cond_resched \| --0.59%--__cond_resched 0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache 0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio x86 workload - fallocate a 100GB file backed by huge pages 6.5-rc3 + this patch: 2MB Page Size: hugetlbfs_fallocate \| --99.57%--clear_huge_page \| --98.47%--clear_page_erms \| --0.53%--asm_sysvec_apic_timer_interrupt 0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load 0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache 0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio 0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store 6.5-rc3 2MB Page Size: --99.93%--__x64_sys_fallocate vfs_fallocate hugetlbfs_fallocate \| --99.38%--clear_huge_page \| \|--98.40%--clear_page_erms \| --0.59%--__cond_resched 0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio ========================= TESTING ====================================== This patch passes libhugetlbfs tests and LTP hugetlb tests ********** TEST SUMMARY * 2M * 32-bit 64-bit * Total testcases: 110 113 * Skipped: 0 0 * PASS: 107 113 * FAIL: 0 0 * Killed by signal: 3 0 * Bad configuration: 0 0 * Expected FAIL: 0 0 * Unexpected PASS: 0 0 * Test not present: 0 0 * Strange test result: 0 0 ********** Done executing testcases. LTP Version: 20220527-178-g2761a81c4 page migration was also tested using Mike Kravetz's test program.[8] [[email protected]: fix an NULL vs IS_ERR() bug] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Reported-and-tested-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-16	mm/ksm: support fork/exec for prctl	Stefan Roesch	1	-5/+8
	Patch series "mm/ksm: add fork-exec support for prctl", v4. A process can enable KSM with the prctl system call. When the process is forked the KSM flag is inherited by the child process. However if the process is executing an exec system call directly after the fork, the KSM setting is cleared. This patch series addresses this problem. 1) Change the mask in coredump.h for execing a new process 2) Add a new test case in ksm_functional_tests This patch (of 2): Today we have two ways to enable KSM: 1) madvise system call This allows to enable KSM for a memory region for a long time. 2) prctl system call This is a recent addition to enable KSM for the complete process. In addition when a process is forked, the KSM setting is inherited. This change only affects the second case. One of the use cases for (2) was to support the ability to enable KSM for cgroups. This allows systemd to enable KSM for the seed process. By enabling it in the seed process all child processes inherit the setting. This works correctly when the process is forked. However it doesn't support fork/exec workflow. From the previous cover letter: .... Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. .... In addition it can also be a bit surprising that fork keeps the KSM setting and fork/exec does not. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") Reviewed-by: David Hildenbrand <[email protected]> Reported-by: Carl Klemm <[email protected]> Tested-by: Carl Klemm <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-16	sched/numa, mm: make numa migrate functions to take a folio	Kefeng Wang	1	-3/+3
	The cpupid (or access time) is stored in the head page for THP, so it is safely to make should_numa_migrate_memory() and numa_hint_fault_latency() to take a folio. This is in preparation for large folio numa balancing. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-16	mm: mempolicy: make mpol_misplaced() to take a folio	Kefeng Wang	1	-2/+3
	In preparation for large folio numa balancing, make mpol_misplaced() to take a folio, no functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-16	mm: memory: add vm_normal_folio_pmd()	Kefeng Wang	1	-0/+2
	Patch series "mm: convert numa balancing functions to use a folio", v2. do_numa_pages() only handles non-compound pages, and only PMD-mapped THPs are handled in do_huge_pmd_numa_page(). But a large, PTE-mapped folio will be supported so let's convert more numa balancing functions to use/take a folio in preparation for that, no functional change intended for now. This patch (of 6): The new vm_normal_folio_pmd() wrapper is similar to vm_normal_folio(), which allow them to completely replace the struct page variables with struct folio variables. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-16	Merge tag 'amlogic-drivers-for-v6.7' of ↵	Arnd Bergmann	1	-1/+1
	https://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux into soc/drivers Amlogic drivers changes for v6.7: - correct meson_sm_* API retval handling - Use device_get_match_data() in meson SM * tag 'amlogic-drivers-for-v6.7' of https://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux: firmware: meson: Use device_get_match_data() drivers: meson: sm: correct meson_sm_* API retval handling Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
2023-10-16	Merge tag 'ffa-updates-6.7' of ↵	Arnd Bergmann	1	-12/+67
	git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into soc/drivers Arm FF-A updates for v6.7 The main addition is the initial support for the notifications and memory transaction descriptor changes added in FF-A v1.1 specification. The notification mechanism enables a requester/sender endpoint to notify a service provider/receiver endpoint about an event with non-blocking semantics. A notification is akin to the doorbell between two endpoints in a communication protocol that is based upon the doorbell/mailbox mechanism. The framework is responsible for the delivery of the notification from the ender to the receiver without blocking the sender. The receiver endpoint relies on the OS scheduler for allocation of CPU cycles to handle a notification. OS is referred as the receiver’s scheduler in the context of notifications. The framework is responsible for informing the receiver’s scheduler that the receiver must be run since it has a pending notification. The series also includes support for the new format of memory transaction descriptors introduced in v1.1 specification. Apart from the main additions, it includes minor fixes to re-enable FF-A drivers usage of 32bit mode of messaging and kernel warning due to the missing assignment of IDR allocation ID to the FFA device. It also adds emitting 'modalias' to the base attribute of FF-A devices. * tag 'ffa-updates-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux: firmware: arm_ffa: Upgrade the driver version to v1.1 firmware: arm_ffa: Update memory descriptor to support v1.1 format firmware: arm_ffa: Switch to using ffa_mem_desc_offset() accessor KVM: arm64: FFA: Remove access of endpoint memory access descriptor array firmware: arm_ffa: Simplify the computation of transmit and fragment length firmware: arm_ffa: Add notification handling mechanism firmware: arm_ffa: Add interface to send a notification to a given partition firmware: arm_ffa: Add interfaces to request notification callbacks firmware: arm_ffa: Add schedule receiver callback mechanism firmware: arm_ffa: Initial support for scheduler receiver interrupt firmware: arm_ffa: Implement the NOTIFICATION_INFO_GET interface firmware: arm_ffa: Implement the FFA_NOTIFICATION_GET interface firmware: arm_ffa: Implement the FFA_NOTIFICATION_SET interface firmware: arm_ffa: Implement the FFA_RUN interface firmware: arm_ffa: Implement the notification bind and unbind interface firmware: arm_ffa: Implement notification bitmap create and destroy interfaces firmware: arm_ffa: Update the FF-A command list with v1.1 additions firmware: arm_ffa: Emit modalias for FF-A devices firmware: arm_ffa: Allow the FF-A drivers to use 32bit mode of messaging firmware: arm_ffa: Assign the missing IDR allocation ID to the FFA device Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
2023-10-16	Merge tag 'scmi-updates-6.7' of ↵	Arnd Bergmann	4	-14/+73
	git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into soc/drivers Arm SCMI updates for v6.7 Main additions this time include: 1. SCMI v3.2 clock configuration support: This helps to retrieve the enabled state of a clock as well as allow to set OEM specific clock configurations. 2. Support for generic performance scaling(DVFS): The current SCMI DVFS support is limited to the CPUs in the kernel. This extension enables it to used for all kind of devices and not only for the CPUs. It updates the SCMI cpufreq to utilize the power domain bindings. It also adds a more generic SCMI performance domain based on the genpd framework that as be used for all the non-CPU devices. 3. Extend the generic performance scaling(DVFS) support for firmware driver OPPs: Consumer drivers for devices that are attached to the SCMI performance domain can't make use of the current OPP library to scale performance as the OPPs are firmware driven and often obtained from the firmware rather than the device tree. These changes extend the generic OPP and genpd PM domain frameworks to identify and utilise these firmware driven OPPs. 4. SCMI v3.2 clock parent support: This enables the support for discovering and changing parent clocks and extending the SCMI clk driver to use the same. 5. Qualcom SMC/HVC transport support: The Qualcomm virtual platforms require capability id in the hypervisor call to identify which doorbell to assert when supporting multiple SMC/HVC based SCMI transport channels. Extra parameter is added to support the same and the same is obtained at the fixed address in the shared memory which is initialised by the firmware. 6. Move the existing SCMI power domain driver under drivers/pmdomain Apart from the above main changes, it also include couple of minor fixes and cosmetic reworks. * tag 'scmi-updates-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux: (37 commits) firmware: arm_scmi: Add qcom smc/hvc transport support dt-bindings: arm: Add new compatible for smc/hvc transport for SCMI firmware: arm_scmi: Convert u32 to unsigned long to align with arm_smccc_1_1_invoke() clk: scmi: Add support for clock {set,get}_parent firmware: arm_scmi: Add support for clock parents clk: scmi: Free scmi_clk allocated when the clocks with invalid info are skipped firmware: arm_scpi: Use device_get_match_data() firmware: arm_scmi: Add generic OPP support to the SCMI performance domain firmware: arm_scmi: Specify the performance level when adding an OPP firmware: arm_scmi: Simplify error path in scmi_dvfs_device_opps_add() OPP: Extend support for the opp-level beyond required-opps OPP: Switch to use dev_pm_domain_set_performance_state() OPP: Extend dev_pm_opp_data with a level OPP: Add dev_pm_opp_add_dynamic() to allow more flexibility PM: domains: Implement the ->set_performance_state() callback for genpd PM: domains: Introduce dev_pm_domain_set_performance_state() firmware: arm_scmi: Rename scmi_{msg_,}clock_config_{get,set}_{2,21} firmware: arm_scmi: Do not use !! on boolean when setting msg->flags firmware: arm_scmi: Move power-domain driver to the pmdomain dir pmdomain: arm: Add the SCMI performance domain ... Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>