aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2024-07-03mm: drop page_index and simplify folio_indexKairui Song2-17/+4
There are two helpers for retrieving the index within address space for mixed usage of swap cache and page cache: - page_index - folio_index This commit drops page_index, as we have eliminated all users, and converts folio_index's helper __page_file_index to use folio to avoid the page conversion. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Barry Song <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Li <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Marc Dionne <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Minchan Kim <[email protected]> Cc: NeilBrown <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiubo Li <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm: remove page_file_offset and folio_file_posKairui Song1-17/+0
These two helpers were useful for mixed usage of swap cache and page cache, which help retrieve the corresponding file or swap device offset of a page or folio. They were introduced in commit f981c5950fa8 ("mm: methods for teaching filesystems about PG_swapcache pages") and used in commit d56b4ddf7781 ("nfs: teach the NFS client how to treat PG_swapcache pages"), suppose to be used with direct_IO for swap over fs. But after commit e1209d3a7a67 ("mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space"), swap with direct_IO is no more, and swap cache mapping is never exposed to fs. Now we have dropped all users of page_file_offset and folio_file_pos, so they can be deleted. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kairui Song <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Barry Song <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Li <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Marc Dionne <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Minchan Kim <[email protected]> Cc: NeilBrown <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiubo Li <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm/huge_memory: mark racy access onhuge_anon_orders_alwaysRan Xiaokai1-2/+2
huge_anon_orders_always is accessed lockless, it is better to use the READ_ONCE() wrapper. This is not fixing any visible bug, hopefully this can cease some KCSAN complains in the future. Also do that for huge_anon_orders_madvise. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ran Xiaokai <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Lu Zhongjun <[email protected]> Reviewed-by: xu xin <[email protected]> Cc: Yang Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm: add folio_alloc_mpol()Kefeng Wang1-0/+8
Patch series "mm: convert to folio_alloc_mpol()". This patch (of 4): This adds a new folio_alloc_mpol() like folio_alloc() but allocate folio according to NUMA mempolicy. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-04firewire: ohci: add tracepoints event for data of Self-ID DMATakashi Sakamoto1-0/+54
In 1394 OHCI, the SelfIDComplete event occurs when the hardware has finished transmitting all of the self ID packets received during the bus initialization process to the host memory by DMA. This commit adds a tracepoints event for this event to trace the timing and packet data of Self-ID DMA. It is the part of following tracepoints events helpful to debug some events at bus reset; e.g. the issue addressed at a commit d0b06dc48fb1 ("firewire: core: use long bus reset on gap count error")[1]: * firewire_ohci:irqs * firewire_ohci:self_id_complete * firewire:bus_reset_handle * firewire:self_id_sequence They would be also helpful in the problem about invocation timing of hardIRQ and process (workqueue) contexts. We can often see this kind of problem with -rt kernel[2]. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0b06dc48fb1 [2] https://lore.kernel.org/linux-rt-users/YAwPoaUZ1gTD5y+k@hmbx/ Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Sakamoto <[email protected]>
2024-07-03drm/xe/uapi: Rename xe perf layer as xe observation layerAshutosh Dixit1-50/+52
In Xe, the perf layer allows capture of HW counter streams. These HW counters are generally performance related but don't have to be necessarily so. Also, the name "perf" is a carryover from i915 and is not preferred. Here we propose the name "observation" for this common layer which allows capture of different types of these counter streams. v2: Rename observability layer to observation layer (Lucas/Rodrigo) v3: Rename sysctl file to "observation_paranoid" (Jose) Fixes: 52c2e956dceb ("drm/xe/perf/uapi: "Perf" layer to support multiple perf counter stream types") Fixes: fe8929bdf835 ("drm/xe/perf/uapi: Add perf_stream_paranoid sysctl") Acked-by: Lucas De Marchi <[email protected]> Acked-by: Rodrigo Vivi <[email protected]> Signed-off-by: Ashutosh Dixit <[email protected]> Reviewed-by: Umesh Nerlige Ramappa <[email protected]> Acked-by: José Roberto de Souza <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-07-03Merge tag 'trace-v6.10-rc6' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: "Fix ioctl conflict with memmapped ring buffer ioctl It was reported that the ioctl() number used to update the ring buffer memory mapping conflicted with the TCGETS ioctl causing strace to report: $ strace -e ioctl stty ioctl(0, TCGETS or TRACE_MMAP_IOCTL_GET_READER, {c_iflag=ICRNL|IXON, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B38400|CS8|CREAD, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0 Since this ioctl hasn't been in a full release yet, change it from "T", 0x1 to "R" 0x20, and also reserve 0x20-0x2F for future ioctl commands, as some more are being worked on for the future" * tag 'trace-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Have memmapped ring buffer use ioctl of "R" range 0x20-2F
2024-07-03ASoC: dapm: Use unsigned for number of widgets in snd_soc_dapm_new_controls()Krzysztof Kozlowski1-1/+1
Number of widgets in array passed to snd_soc_dapm_new_controls() cannot be negative, so make it explicit by using 'unsigned int', just like snd_soc_add_component_controls() is doing. Reviewed-by: Dmitry Baryshkov <[email protected]> Signed-off-by: Krzysztof Kozlowski <[email protected]> Link: https://patch.msgid.link/20240701-b4-qcom-audio-lpass-codec-cleanups-v3-4-6d98d4dd1ef5@linaro.org Signed-off-by: Mark Brown <[email protected]>
2024-07-03tracing: Have memmapped ring buffer use ioctl of "R" range 0x20-2FSteven Rostedt (Google)1-1/+1
To prevent conflicts with other ioctl numbers to allow strace to have an idea of what is happening, add the range of ioctls for the trace buffer mapping from _IO("T", 0x1) to the range of "R" 0x20 - 0x2F. Link: https://lore.kernel.org/linux-trace-kernel/[email protected]/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected]/ Cc: Jonathan Corbet <[email protected]> Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Link: https://lore.kernel.org/[email protected] Reported-by: "Dmitry V. Levin" <[email protected]> Reviewed-by: Mathieu Desnoyers <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-07-03vdso: Add comment about reason for vdso struct orderingAnna-Maria Behnsen1-0/+4
struct vdso_data is optimized for fast access to the often required struct members. The optimization is not documented in the struct description but it should be kept in mind, when working with the vdso_data struct. Add a comment to the struct description. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03cgroup/misc: Introduce misc.peakXiu Jianfeng1-0/+2
Introduce misc.peak to record the historical maximum usage of the resource, as in some scenarios the value of misc.max could be adjusted based on the peak usage of the resource. Signed-off-by: Xiu Jianfeng <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-03block: don't free the integrity payload in bio_integrity_unmap_free_userChristoph Hellwig1-2/+2
Now that the integrity payload is always freed in bio_uninit, don't bother freeing it a little earlier in bio_integrity_unmap_free_user. With that the separate bio_integrity_unmap_free_user can go away by just passing the bio to bio_integrity_unmap_user. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Kanchan Joshi <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-07-03block: don't free submitter owned integrity payload on I/O completionChristoph Hellwig1-2/+1
Currently __bio_integrity_endio frees the integrity payload unless it is explicitly marked as user-mapped. This means in-kernel callers that allocate their own integrity payload never get to see it on I/O completion. The current two users don't need it as they just pre-mapped PI tuples received over the network, but this limits uses of integrity data lot. Change bio_integrity_endio to call __bio_integrity_endio for block layer generated integrity data only, and leave freeing of submitter allocated integrity data to bio_uninit which also gets called from the final bio_put. This requires that unmapping user mapped or copied integrity data is now always done by the caller, and the special BIP_INTEGRITY_USER flag can go away. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Kanchan Joshi <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-07-03block: also return bio_integrity_payload * from stubsChristoph Hellwig1-3/+3
struct bio_integrity_payload is defined unconditionally. No need to return void * from bio_integrity() and bio_integrity_alloc(). Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Kanchan Joshi <[email protected]> Reviewed-by: Anuj Gupta <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-07-03block: split integrity support out of bio.hChristoph Hellwig3-156/+154
Split struct bio_integrity_payload and the related prototypes out of bio.h into a separate bio-integrity.h header so that it is only pulled in by the few places that need it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Anuj Gupta <[email protected]> Reviewed-by: Kanchan Joshi <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-07-03Merge tag 'v6.10-rc6' into for-6.11/block-postJens Axboe50-148/+367
Pull in v6.10-rc6 to resolve a conflict for the integrity cleanups. * tag 'v6.10-rc6': (778 commits) Linux 6.10-rc6 ata: ahci: Clean up sysfs file on error ata: libata-core: Fix double free on error ata,scsi: libata-core: Do not leak memory for ata_port struct members ata: libata-core: Fix null pointer dereference on error x86-32: fix cmpxchg8b_emu build error with clang x86: stop playing stack games in profile_pc() i2c: testunit: discard write requests while old command is running i2c: testunit: don't erase registers after STOP tty: mxser: Remove __counted_by from mxser_board.ports[] randomize_kstack: Remove non-functional per-arch entropy filtering string: kunit: add missing MODULE_DESCRIPTION() macros ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models MAINTAINERS: Update IOMMU tree location tools/power turbostat: Add local build_bug.h header for snapshot target tools/power turbostat: Fix unc freq columns not showing with '-q' or '-l' tools/power turbostat: option '-n' is ambiguous drm/drm_file: Fix pid refcounting race kallsyms: rework symbol lookup return codes gpiolib: cdev: Ignore reconfiguration without direction ... Signed-off-by: Jens Axboe <[email protected]>
2024-07-03drm/display/dsc: Add a helper to dump the DSC configurationImre Deak1-0/+3
Add a helper to dump the Display Stream Compression configuration, taken into use in the i915 driver by a later patch. v2: - Rebase on the s/DRM_X16/FXP_Q4 change. - s/DSC configration/DSC configuration in the function documentation. Acked-by: Jani Nikula <[email protected]> Signed-off-by: Imre Deak <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-07-03drm: Add helpers for q4 fixed point valuesImre Deak1-0/+23
Add helpers to convert between q4 fixed point and integer/fraction values. Also add the format/argument macros required to printk q4 fixed point variables. The q4 notation is based on the short variant described by https://en.wikipedia.org/wiki/Q_(number_format) where only the number of fraction bits in the fixed point value are defined, while the full size is deducted from the container type, that is the size of int for these helpers. Using the fxp_ prefix, which makes moving these helpers outside of drm to a more generic place easier, if they prove to be useful. These are needed by later patches dumping the Display Stream Compression configuration in DRM core and in the i915 driver to replace the corresponding bpp_x16 helpers defined locally in the driver. v2: Use the more generic/descriptive fxp_q4 prefix instead of drm_x16. (Jani) Cc: Jani Nikula <[email protected]> Acked-by: Jani Nikula <[email protected]> Signed-off-by: Imre Deak <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-07-03iommu/arm-smmu-v3: Add support for dirty tracking in domain allocJoao Martins1-0/+3
This provides all the infrastructure to enable dirty tracking if the hardware has the capability and domain alloc request for it. Also, add a device_iommu_capable() check in iommufd core for IOMMU_CAP_DIRTY_TRACKING before we request a user domain with dirty tracking support. Please note, we still report no support for IOMMU_CAP_DIRTY_TRACKING as it will finally be enabled in a subsequent patch. Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Shameer Kolothum <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2024-07-03parport: Remove parport_driver.devmodelDr. David Alan Gilbert1-4/+0
'devmodel' hasn't actually been used since: 'commit 3275158fa52a ("parport: remove use of devmodel")' and everyone now has it set to true and has been fixed up; remove the flag. (There are still comments all over about it) Signed-off-by: Dr. David Alan Gilbert <[email protected]> Acked-by: Sudip Mukherjee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-07-03parport: Remove attach function pointerDr. David Alan Gilbert1-1/+0
The attach function pointers haven't actually been called since: 'commit 3275158fa52a ("parport: remove use of devmodel")' topped adding entries to the drivers list. If you're converting a driver, look at the 'match_port' function pointer instead. (There are lots of comment references to 'attach' all over, but they probably need some deeper understanding to check the semantics to see if they can be replaced by match_port). Signed-off-by: Dr. David Alan Gilbert <[email protected]> Acked-by: Sudip Mukherjee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-07-03parport: Remove 'drivers' listDr. David Alan Gilbert1-1/+0
The list has been empty since: 'commit 3275158fa52a ("parport: remove use of devmodel")' This also means we can remove the 'list_head' from struct parport_driver. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Acked-by: Sudip Mukherjee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-07-03misc: fastrpc: Restrict untrusted app to attach to privileged PDEkansh Gupta1-0/+3
Untrusted application with access to only non-secure fastrpc device node can attach to root_pd or static PDs if it can make the respective init request. This can cause problems as the untrusted application can send bad requests to root_pd or static PDs. Add changes to reject attach to privileged PDs if the request is being made using non-secure fastrpc device node. Fixes: 0871561055e6 ("misc: fastrpc: Add support for audiopd") Cc: stable <[email protected]> Signed-off-by: Ekansh Gupta <[email protected]> Reviewed-by: Dmitry Baryshkov <[email protected]> Signed-off-by: Srinivas Kandagatla <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-07-03usb: typec: tcpci: add support to set connector orientationMarco Felsch1-0/+8
This add the support to set the optional connector orientation bit which is part of the optional CONFIG_STANDARD_OUTPUT register 0x18 [1]. This allows system designers to connect the tcpc orientation pin directly to the 2:1 ss-mux. [1] https://www.usb.org/sites/default/files/documents/usb-port_controller_specification_rev2.0_v1.0_0.pdf Signed-off-by: Marco Felsch <[email protected]> Reviewed-by: Heikki Krogerus <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-07-03Merge tag 'w1-drv-6.11' of ↵Greg Kroah-Hartman1-3/+4
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/krzk/linux-w1 into char-misc-next Krzysztof writes: 1-Wire bus drivers for v6.11 Just two cleanups for W1 core code. * tag 'w1-drv-6.11' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/krzk/linux-w1: w1: Drop allocation error message w1: Add missing newline and fix typos in w1_bus_master comment
2024-07-03bus: mhi: host: Allow controller drivers to specify name for the MHI controllerSlark Xiao1-0/+2
MHI devices usually have a product/device name to identify each device uniquely. So let's specify that name in 'struct mhi_controller' so that the client drivers can use this name to uniquely identify the devices and apply any device specific quirks. Signed-off-by: Slark Xiao <[email protected]> Reviewed-by: Manivannan Sadhasivam <[email protected]> Link: https://lore.kernel.org/r/[email protected] [mani: reworked subject and description] Signed-off-by: Manivannan Sadhasivam <[email protected]>
2024-07-03driver core: have match() callback in struct bus_type take a const *Greg Kroah-Hartman38-86/+46
In the match() callback, the struct device_driver * should not be changed, so change the function callback to be a const *. This is one step of many towards making the driver core safe to have struct device_driver in read-only memory. Because the match() callback is in all busses, all busses are modified to handle this properly. This does entail switching some container_of() calls to container_of_const() to properly handle the constant *. For some busses, like PCI and USB and HV, the const * is cast away in the match callback as those busses do want to modify those structures at this point in time (they have a local lock in the driver structure.) That will have to be changed in the future if they wish to have their struct device * in read-only-memory. Cc: Rafael J. Wysocki <[email protected]> Reviewed-by: Alex Elder <[email protected]> Acked-by: Sumit Garg <[email protected]> Link: https://lore.kernel.org/r/2024070136-wrongdoer-busily-01e8@gregkh Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-07-03perf: arm_pmuv3: Include asm/arm_pmuv3.h from linux/perf/arm_pmuv3.hRob Herring (Arm)1-0/+2
The arm64 asm/arm_pmuv3.h depends on defines from linux/perf/arm_pmuv3.h. Rather than depend on include order, follow the usual pattern of "linux" headers including "asm" headers of the same name. With this change, the include of linux/kvm_host.h is problematic due to circular includes: In file included from ../arch/arm64/include/asm/arm_pmuv3.h:9, from ../include/linux/perf/arm_pmuv3.h:312, from ../include/kvm/arm_pmu.h:11, from ../arch/arm64/include/asm/kvm_host.h:38, from ../arch/arm64/mm/init.c:41: ../include/linux/kvm_host.h:383:30: error: field 'arch' has incomplete type Switching to asm/kvm_host.h solves the issue. Signed-off-by: Rob Herring (Arm) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2024-07-03ASoC: cs35l56: Limit Speaker Volume to +12dB maximumRichard Fitzgerald1-1/+1
Change CS35L56_MAIN_RENDER_USER_VOLUME_MAX to 48, to limit the maximum value of the Speaker Volume control to +12dB. The minimum value is unchanged so that the default 0dB has the same integer control value. The original maximum of 400 (+100dB) was the largest value that can be mathematically handled by the DSP. The actual maximum amplification is +12dB. Signed-off-by: Richard Fitzgerald <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]>
2024-07-03um: add shared memory optimisation for time-travel=extJohannes Berg1-14/+165
With external time travel, a LOT of message can end up being exchanged on the socket, taking a significant amount of time just to do that. Add a new shared memory optimisation to that, where a number of changes are made: - the controller sends a client ID and a shared memory FD (and a logging FD we don't use) in the ACK message to the initial START - the shared memory holds the current time and the free_until value, so that there's no need to exchange messages for that - if the client that's running has shared memory support, any client (the running one included) can request the next time it wants to run inside the shared memory, rather than sending a message, by also updating the free_until value - when shared memory is enabled, RUN/WAIT messages no longer have an ACK, further cutting down on messages Together, this can reduce the number of messages very significantly, and reduce overall test/simulation run time. Co-developed-by: Mordechay Goodstein <[email protected]> Signed-off-by: Mordechay Goodstein <[email protected]> Link: https://patch.msgid.link/20240702192118.6ad0a083f574.Ie41206c8ce4507fe26b991937f47e86c24ca7a31@changeid Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: time-travel: support time-travel protocol broadcast messagesMordechay Goodstein1-0/+11
Add a message type to the time-travel protocol to broadcast a small (64-bit) value to all participants in a simulation. The main use case is to have an identical message come to all participants in a simulation, e.g. to separate out logs for different tests running in a single simulation. Down in the guts of time_travel_handle_message() we can't use printk() and not even printk_deferred(), so just store the message and print it at the start of the userspace() function. Unfortunately this means that other prints in the kernel can actually bypass the message, but in most cases where this is used, for example to separate test logs, userspace will be involved. Also, even if we could use printk_deferred(), we'd still need to flush it out in the userspace() function since otherwise userspace messages might cross it. As a result, this is a reasonable compromise, there's no need to have any core changes and it solves the main use case we have for it. Signed-off-by: Mordechay Goodstein <[email protected]> Link: https://patch.msgid.link/20240702192118.c4093bc5b15e.I2ca8d006b67feeb866ac2017af7b741c9e06445a@changeid Signed-off-by: Johannes Berg <[email protected]>
2024-07-03mm/slab: Introduce kmem_buckets_create() and familyKees Cook1-0/+12
Dedicated caches are available for fixed size allocations via kmem_cache_alloc(), but for dynamically sized allocations there is only the global kmalloc API's set of buckets available. This means it isn't possible to separate specific sets of dynamically sized allocations into a separate collection of caches. This leads to a use-after-free exploitation weakness in the Linux kernel since many heap memory spraying/grooming attacks depend on using userspace-controllable dynamically sized allocations to collide with fixed size allocations that end up in same cache. While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense against these kinds of "type confusion" attacks, including for fixed same-size heap objects, we can create a complementary deterministic defense for dynamically sized allocations that are directly user controlled. Addressing these cases is limited in scope, so isolating these kinds of interfaces will not become an unbounded game of whack-a-mole. For example, many pass through memdup_user(), making isolation there very effective. In order to isolate user-controllable dynamically-sized allocations from the common system kmalloc allocations, introduce kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for where caller tracking is needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback is needed. Note that these caches are specifically flagged with SLAB_NO_MERGE, since merging would defeat the entire purpose of the mitigation. This can also be used in the future to extend allocation profiling's use of code tagging to implement per-caller allocation cache isolation[1] even for dynamic allocations. Memory allocation pinning[2] is still needed to plug the Use-After-Free cross-allocator weakness (where attackers can arrange to free an entire slab page and have it reallocated to a different cache), but that is an existing and separate issue which is complementary to this improvement. Development continues for that feature via the SLAB_VIRTUAL[3] series (which could also provide guard pages -- another complementary improvement). Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1] Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2] Link: https://lore.kernel.org/lkml/[email protected]/ [3] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-07-03mm/slab: Introduce kvmalloc_buckets_node() that can take kmem_buckets argumentKees Cook1-1/+3
Plumb kmem_buckets arguments through kvmalloc_node_noprof() so it is possible to provide an API to perform kvmalloc-style allocations with a particular set of buckets. Introduce kvmalloc_buckets_node() that takes a kmem_buckets argument. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-07-03mm/slab: Plumb kmem_buckets into __do_kmalloc_node()Kees Cook1-5/+22
Introduce CONFIG_SLAB_BUCKETS which provides the infrastructure to support separated kmalloc buckets (in the following kmem_buckets_create() patches and future codetag-based separation). Since this will provide a mitigation for a very common case of exploits, it is recommended to enable this feature for general purpose distros. By default, the new Kconfig will be enabled if CONFIG_SLAB_FREELIST_HARDENED is enabled (and it is added to the hardening.config Kconfig fragment). To be able to choose which buckets to allocate from, make the buckets available to the internal kmalloc interfaces by adding them as the second argument, rather than depending on the buckets being chosen from the fixed set of global buckets. Where the bucket is not available, pass NULL, which means "use the default system kmalloc bucket set" (the prior existing behavior), as implemented in kmalloc_slab(). To avoid adding the extra argument when !CONFIG_SLAB_BUCKETS, only the top-level macros and static inlines use the buckets argument (where they are stripped out and compiled out respectively). The actual extern functions can then be built without the argument, and the internals fall back to the global kmalloc buckets unconditionally. Co-developed-by: Vlastimil Babka <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-07-03mm/slab: Introduce kmem_buckets typedefKees Cook1-2/+3
Encapsulate the concept of a single set of kmem_caches that are used for the kmalloc size buckets. Redefine kmalloc_caches as an array of these buckets (for the different global cache buckets). Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-07-03slab, rust: extend kmalloc() alignment guarantees to remove Rust paddingVlastimil Babka1-1/+2
Slab allocators have been guaranteeing natural alignment for power-of-two sizes since commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)"), while any other sizes are guaranteed to be aligned only to ARCH_KMALLOC_MINALIGN bytes (although in practice are aligned more than that in non-debug scenarios). Rust's allocator API specifies size and alignment per allocation, which have to satisfy the following rules, per Alice Ryhl [1]: 1. The alignment is a power of two. 2. The size is non-zero. 3. When you round up the size to the next multiple of the alignment, then it must not overflow the signed type isize / ssize_t. In order to map this to kmalloc()'s guarantees, some requested allocation sizes have to be padded to the next power-of-two size [2]. For example, an allocation of size 96 and alignment of 32 will be padded to an allocation of size 128, because the existing kmalloc-96 bucket doesn't guarantee alignent above ARCH_KMALLOC_MINALIGN. Without slab debugging active, the layout of the kmalloc-96 slabs however naturally align the objects to 32 bytes, so extending the size to 128 bytes is wasteful. To improve the situation we can extend the kmalloc() alignment guarantees in a way that 1) doesn't change the current slab layout (and thus does not increase internal fragmentation) when slab debugging is not active 2) reduces waste in the Rust allocator use case 3) is a superset of the current guarantee for power-of-two sizes. The extended guarantee is that alignment is at least the largest power-of-two divisor of the requested size. For power-of-two sizes the largest divisor is the size itself, but let's keep this case documented separately for clarity. For current kmalloc size buckets, it means kmalloc-96 will guarantee alignment of 32 bytes and kmalloc-196 will guarantee 64 bytes. This covers the rules 1 and 2 above of Rust's API as long as the size is a multiple of the alignment. The Rust layer should now only need to round up the size to the next multiple if it isn't, while enforcing the rule 3. Implementation-wise, this changes the alignment calculation in create_boot_cache(). While at it also do the calulation only for caches with the SLAB_KMALLOC flag, because the function is also used to create the initial kmem_cache and kmem_cache_node caches, where no alignment guarantee is necessary. In the Rust allocator's krealloc_aligned(), remove the code that padded sizes to the next power of two (suggested by Alice Ryhl) as it's no longer necessary with the new guarantees. Reported-by: Alice Ryhl <[email protected]> Reported-by: Boqun Feng <[email protected]> Link: https://lore.kernel.org/all/CAH5fLggjrbdUuT-H-5vbQfMazjRDpp2%2Bk3%[email protected]/ [1] Link: https://lore.kernel.org/all/CAH5fLghsZRemYUwVvhk77o6y1foqnCeDzW4WZv6ScEWna2+_jw@mail.gmail.com/ [2] Reviewed-by: Boqun Feng <[email protected]> Acked-by: Roman Gushchin <[email protected]> Reviewed-by: Alice Ryhl <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-07-03cachefiles: fix slab-use-after-free in fscache_withdraw_volume()Baokun Li1-0/+4
We got the following issue in our fault injection stress test: ================================================================== BUG: KASAN: slab-use-after-free in fscache_withdraw_volume+0x2e1/0x370 Read of size 4 at addr ffff88810680be08 by task ondemand-04-dae/5798 CPU: 0 PID: 5798 Comm: ondemand-04-dae Not tainted 6.8.0-dirty #565 Call Trace: kasan_check_range+0xf6/0x1b0 fscache_withdraw_volume+0x2e1/0x370 cachefiles_withdraw_volume+0x31/0x50 cachefiles_withdraw_cache+0x3ad/0x900 cachefiles_put_unbind_pincount+0x1f6/0x250 cachefiles_daemon_release+0x13b/0x290 __fput+0x204/0xa00 task_work_run+0x139/0x230 Allocated by task 5820: __kmalloc+0x1df/0x4b0 fscache_alloc_volume+0x70/0x600 __fscache_acquire_volume+0x1c/0x610 erofs_fscache_register_volume+0x96/0x1a0 erofs_fscache_register_fs+0x49a/0x690 erofs_fc_fill_super+0x6c0/0xcc0 vfs_get_super+0xa9/0x140 vfs_get_tree+0x8e/0x300 do_new_mount+0x28c/0x580 [...] Freed by task 5820: kfree+0xf1/0x2c0 fscache_put_volume.part.0+0x5cb/0x9e0 erofs_fscache_unregister_fs+0x157/0x1b0 erofs_kill_sb+0xd9/0x1c0 deactivate_locked_super+0xa3/0x100 vfs_get_super+0x105/0x140 vfs_get_tree+0x8e/0x300 do_new_mount+0x28c/0x580 [...] ================================================================== Following is the process that triggers the issue: mount failed | daemon exit ------------------------------------------------------------ deactivate_locked_super cachefiles_daemon_release erofs_kill_sb erofs_fscache_unregister_fs fscache_relinquish_volume __fscache_relinquish_volume fscache_put_volume(fscache_volume, fscache_volume_put_relinquish) zero = __refcount_dec_and_test(&fscache_volume->ref, &ref); cachefiles_put_unbind_pincount cachefiles_daemon_unbind cachefiles_withdraw_cache cachefiles_withdraw_volumes list_del_init(&volume->cache_link) fscache_free_volume(fscache_volume) cache->ops->free_volume cachefiles_free_volume list_del_init(&cachefiles_volume->cache_link); kfree(fscache_volume) cachefiles_withdraw_volume fscache_withdraw_volume fscache_volume->n_accesses // fscache_volume UAF !!! The fscache_volume in cache->volumes must not have been freed yet, but its reference count may be 0. So use the new fscache_try_get_volume() helper function try to get its reference count. If the reference count of fscache_volume is 0, fscache_put_volume() is freeing it, so wait for it to be removed from cache->volumes. If its reference count is not 0, call cachefiles_withdraw_volume() with reference count protection to avoid the above issue. Fixes: fe2140e2f57f ("cachefiles: Implement volume support") Signed-off-by: Baokun Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-07-03netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()Baokun Li1-0/+6
Export fscache_put_volume() and add fscache_try_get_volume() helper function to allow cachefiles to get/put fscache_volume via linux/fscache-cache.h. Signed-off-by: Baokun Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-07-03fs: fix dentry sizeChristian Brauner1-1/+1
On CONFIG_SMP=y and on 32bit we need to decrease DNAME_INLINE_LEN to 36 btyes to end up with 128 bytes in total. Reported-by: Linus Torvalds <[email protected]> Links: https://lore.kernel.org/r/CAHk-=whtoqTSCcAvV-X-KPqoDWxS4vxmWpuKLB+Vv8=FtUd5vA@mail.gmail.com Signed-off-by: Christian Brauner <[email protected]>
2024-07-03vfs: move d_lockref out of the area used by RCU lookupMateusz Guzik1-1/+6
Stock kernel scales worse than FreeBSD when doing a 20-way stat(2) on the same tmpfs-backed file. According to perf top: 38.09% lockref_put_return 26.08% lockref_get_not_dead 25.60% __d_lookup_rcu 0.89% clear_bhb_loop __d_lookup_rcu is participating in cacheline ping pong due to the embedded name sharing a cacheline with lockref. Moving it out resolves the problem: 41.50% lockref_put_return 41.03% lockref_get_not_dead 1.54% clear_bhb_loop benchmark (will-it-scale, Sapphire Rapids, tmpfs, ops/s): FreeBSD:7219334 before: 5038006 after: 7842883 (+55%) One minor remark: the 'after' result is unstable, fluctuating in the range ~7.8 mln to ~9 mln during different runs. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-07-03drm/panthor: Fix sync-only jobsBoris Brezillon1-0/+5
A sync-only job is meant to provide a synchronization point on a queue, so we can't return a NULL fence there, we have to add a signal operation to the command stream which executes after all other previously submitted jobs are done. v2: - Fixed a UAF bug - Added R-bs Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block") Signed-off-by: Boris Brezillon <[email protected]> Reviewed-by: Liviu Dudau <[email protected]> Reviewed-by: Steven Price <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-07-02page_pool: convert to use netmemMina Almasry5-40/+114
Abstract the memory type from the page_pool so we can later add support for new memory types. Convert the page_pool to use the new netmem type abstraction, rather than use struct page directly. As of this patch the netmem type is a no-op abstraction: it's always a struct page underneath. All the page pool internals are converted to use struct netmem instead of struct page, and the page pool now exports 2 APIs: 1. The existing struct page API. 2. The new struct netmem API. Keeping the existing API is transitional; we do not want to refactor all the current drivers using the page pool at once. The netmem abstraction is currently a no-op. The page_pool uses page_to_netmem() to convert allocated pages to netmem, and uses netmem_to_page() to convert the netmem back to pages to pass to mm APIs, Follow up patches to this series add non-paged netmem support to the page_pool. This change is factored out on its own to limit the code churn to this 1 patch, for ease of code review. Signed-off-by: Mina Almasry <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-07-02cxl/events: Use a common struct for DRAM and General Media eventsFabio M. De Francesco1-26/+19
cxl_event_common was an unfortunate naming choice and caused confusion with the existing Common Event Record. Furthermore, its fields didn't map all the common information between DRAM and General Media Events. Remove cxl_event_common and introduce cxl_event_media_hdr to record common information between DRAM and General Media events. cxl_event_media_hdr, which is embedded in both cxl_event_gen_media and cxl_event_dram, leverages the commonalities between the two events to simplify their respective handling. Suggested-by: Dan Williams <[email protected]> Reviewed-by: Alison Schofield <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Signed-off-by: Fabio M. De Francesco <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dave Jiang <[email protected]>
2024-07-02x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor filesTony Luck1-0/+3
When SNC is enabled, monitoring data is collected at the SNC node granularity, but must be reported at L3-cache granularity for backwards compatibility in addition to reporting at the node level. Add a "ci" field to the rdt_mon_domain structure to save the cache information about the enclosing L3 cache for the domain. This provides: 1) The cache id which is needed to compose the name of the legacy monitoring directory, and to determine which domains should be summed to provide L3-scoped data. 2) The shared_cpu_map which is needed to determine which CPUs can be used to read the RMID counters with the MSR interface. This is the first step to an eventual goal of monitor reporting files like this (for a system with two SNC nodes per L3): $ cd /sys/fs/resctrl/mon_data $ tree mon_L3_00 mon_L3_00 <- 00 here is L3 cache id ├── llc_occupancy \ These files provide legacy support ├── mbm_local_bytes > for non-SNC aware monitor apps ├── mbm_total_bytes / that expect data at L3 cache level ├── mon_sub_L3_00 <- 00 here is SNC node id │   ├── llc_occupancy \ These files are finer grained │   ├── mbm_local_bytes > data from each SNC node │   └── mbm_total_bytes / └── mon_sub_L3_01 ├── llc_occupancy \ ├── mbm_local_bytes > As above, but for node 1. └── mbm_total_bytes / [ bp: Massage commit message. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Add node-scope to the options for feature scopeTony Luck1-0/+1
Currently supported resctrl features are all domain scoped the same as the scope of the L2 or L3 caches. Add RESCTRL_L3_NODE as a new option for features that are scoped at the same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA Cluster (SNC) feature where monitoring features are divided between nodes that share an L3 cache. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Split the rdt_domain and rdt_hw_domain structuresTony Luck1-20/+28
The same rdt_domain structure is used for both control and monitor functions. But this results in wasted memory as some of the fields are only used by control functions, while most are only used for monitor functions. Split into separate rdt_ctrl_domain and rdt_mon_domain structures with just the fields required for control and monitoring respectively. Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain and rdt_hw_mon_domain. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare for different scope for control/monitor operationsTony Luck1-6/+19
Resctrl assumes that control and monitor operations on a resource are performed at the same scope. Prepare for systems that use different scope (specifically Intel needs to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control and NODE scope for cache occupancy and memory bandwidth monitoring). Create separate domain lists for control and monitor operations. Note that errors during initialization of either control or monitor functions on a domain would previously result in that domain being excluded from both control and monitor operations. Now the domains are allocated independently it is no longer required to disable both control and monitor operations if either fail. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare to split rdt_domain structureTony Luck1-4/+12
The rdt_domain structure is used for both control and monitor features. It is about to be split into separate structures for these two usages because the scope for control and monitoring features for a resource will be different for future resources. To allow for common code that scans a list of domains looking for a specific domain id, move all the common fields ("list", "id", "cpu_mask") into their own structure within the rdt_domain structure. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare for new domain scopeTony Luck1-2/+7
Resctrl resources operate on subsets of CPUs in the system with the defining attribute of each subset being an instance of a particular level of cache. E.g. all CPUs sharing an L3 cache would be part of the same domain. In preparation for features that are scoped at the NUMA node level, change the code from explicit references to "cache_level" to a more generic scope. At this point the only options for this scope are groups of CPUs that share an L2 cache or L3 cache. Clean up the error handling when looking up domains. Report invalid ids before calling rdt_find_domain() in preparation for better messages when scope can be other than cache scope. This means that rdt_find_domain() will never return an error. So remove checks for error from the call sites. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02cgroup_misc: add kernel-doc comments for enum misc_res_typeRandy Dunlap1-3/+4
Fully document enum misc_res_type with kernel-doc comments to prevent kernel-doc warnings: misc_cgroup.h:12: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Types of misc cgroup entries supported by the host. misc_cgroup.h:12: warning: missing initial short description on line: * Types of misc cgroup entries supported by the host. Fixes: a72232eabdfc ("cgroup: Add misc cgroup controller") Signed-off-by: Randy Dunlap <[email protected]> Cc: [email protected] Signed-off-by: Tejun Heo <[email protected]>