|
Generic implementations of fb_pgprotect() and fb_is_primary_device()
have been in the source code for a long time. Prepare the header file
to make use of them.
Improve the code by using an inline function for fb_pgprotect()
and by removing include statements. The default mode set by
fb_pgprotect() is now writecombine, which is what most platforms
want.
Symbols are protected by preprocessor guards. Architectures that
provide a symbol need to define a preprocessor token of the same
name and value. Otherwise the header file will provide a generic
implementation. This pattern has been taken from <asm/io.h>.
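A minimal sketch of the guard pattern, as it would appear in the generic
header (the writecombine body follows from the description above; treat
the details as illustrative):
  #ifndef fb_pgprotect
  #define fb_pgprotect fb_pgprotect
  /* Generic implementation: map framebuffer memory write-combined. */
  static inline void fb_pgprotect(struct file *file,
                                  struct vm_area_struct *vma,
                                  unsigned long off)
  {
          vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
  }
  #endif
An architecture that provides its own fb_pgprotect() defines a token of
the same name beforehand, so the generic version drops out.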
v3:
* include the correct header files
v2:
* use writecombine mappings by default (Arnd)
Signed-off-by: Thomas Zimmermann <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Helge Deller <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Create a uapi header include/uapi/linux/ext4.h, move the ioctls and
associated data structures to the uapi header, and include it from
fs/ext4/ext4.h.
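A sketch of the resulting split, with one representative ioctl
(illustrative excerpt, not the full header):
  /* include/uapi/linux/ext4.h */
  #define EXT4_IOC_GETVERSION     _IOR('f', 3, long)

  /* fs/ext4/ext4.h */
  #include <uapi/linux/ext4.h>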
Signed-off-by: Josh Triplett <[email protected]>
Link: https://lore.kernel.org/r/680175260970d977d16b5cc7e7606483ec99eb63.1680402881.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
This reverts commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4.
Paul reports that it causes a regression with IB on CX4
and FW 12.18.1000. In addition I think that the concept
of "management PF" is not fully accepted and requires
a discussion.
Fixes: fe998a3c77b9 ("net/mlx5: Enable management PF initialization")
Reported-by: Paul Moore <[email protected]>
Link: https://lore.kernel.org/all/CAHC9VhQ7A4+msL38WpbOMYjAqLp0EtOjeLh4Dc6SQtD6OUvCQg@mail.gmail.com/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
To enable kernel consumers of TLS to request a TLS handshake, add
support to net/handshake/ to request a handshake upcall.
This patch also acts as a template for adding handshake upcall
support for other kernel transport layer security providers.
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
When a kernel consumer needs a transport layer security session, it
first needs a handshake to negotiate and establish a session. This
negotiation can be done in user space via one of the several
existing library implementations, or it can be done in the kernel.
No in-kernel handshake implementations yet exist. In their absence,
we add a netlink service that can:
a. Notify a user space daemon that a handshake is needed.
b. Once notified, the daemon calls the kernel back via this
netlink service to get the handshake parameters, including an
open socket on which to establish the session.
c. Once the handshake is complete, the daemon reports the
session status and other information via a second netlink
operation. This operation marks that it is safe for the
kernel to use the open socket and the security session
established there.
The notification service uses a multicast group. Each handshake
mechanism (eg, tlshd) adopts its own group number so that the
handshake services are completely independent of one another. The
kernel can then tell via netlink_has_listeners() whether a handshake
service is active and prepared to handle a handshake request.
A new netlink operation, ACCEPT, acts like accept(2) in that it
instantiates a file descriptor in the user space daemon's fd table.
If this operation is successful, the reply carries the fd number,
which can be treated as an open and ready file descriptor.
While user space is performing the handshake, the kernel keeps its
muddy paws off the open socket. A second new netlink operation,
DONE, indicates that the user space daemon is finished with the
socket and it is safe for the kernel to use again. The operation
also indicates whether a session was established successfully.
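A rough userspace sketch of the ACCEPT/DONE round trip using libnl. The
family name and the HANDSHAKE_* constants are assumptions drawn from the
description above and the uapi header this patch adds; they are not
verified here:
  #include <errno.h>
  #include <netlink/netlink.h>
  #include <netlink/genl/genl.h>
  #include <netlink/genl/ctrl.h>
  #include <linux/handshake.h>    /* assumed uapi header */

  /* ACCEPT: ask the kernel for an fd on which a handshake is pending;
   * the reply carries HANDSHAKE_A_ACCEPT_SOCKFD, an open, ready fd. */
  static int handshake_accept(struct nl_sock *sk, int family)
  {
          struct nl_msg *msg = nlmsg_alloc();

          if (!msg)
                  return -ENOMEM;
          genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
                      HANDSHAKE_CMD_ACCEPT, 1);
          nla_put_u32(msg, HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
                      HANDSHAKE_HANDLER_CLASS_TLSHD);
          nl_send_auto(sk, msg);
          nlmsg_free(msg);
          return nl_recvmsgs_default(sk); /* reply parsed by caller */
  }

  /* DONE: report session status; the kernel may use the socket again. */
  static int handshake_done(struct nl_sock *sk, int family, int sockfd)
  {
          struct nl_msg *msg = nlmsg_alloc();

          if (!msg)
                  return -ENOMEM;
          genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
                      HANDSHAKE_CMD_DONE, 1);
          nla_put_u32(msg, HANDSHAKE_A_DONE_STATUS, 0);
          nla_put_u32(msg, HANDSHAKE_A_DONE_SOCKFD, sockfd);
          nl_send_auto(sk, msg);
          nlmsg_free(msg);
          return 0;
  }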
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"22 hotfixes.
19 are cc:stable and the remainder address issues which were
introduced during this merge cycle, or aren't considered suitable for
-stable backporting.
19 are for MM and the remainder are for other subsystems"
* tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
nilfs2: initialize unused bytes in segment summary blocks
mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
mm/mmap: regression fix for unmapped_area{_topdown}
maple_tree: fix mas_empty_area() search
maple_tree: make maple state reusable after mas_empty_area_rev()
mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
tools/Makefile: do missed s/vm/mm/
mm: fix memory leak on mm_init error handling
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Revert "userfaultfd: don't fail on unrecognized features"
writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug
tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used
mailmap: update jtoppins' entry to reference correct email
mm/mempolicy: fix use-after-free of VMA iterator
mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
mm/mprotect: fix do_mprotect_pkey() return on error
mm/khugepaged: check again on anon uffd-wp during isolation
...
|
|
Currently call_bind_status places a hard limit of 3 on the number of
retries on EACCES errors. This limit was added to prevent NLM unlock
requests from hanging forever when the server keeps returning garbage.
However, this change causes problems for cases where the NLM service
takes longer than 9 seconds to register with the port mapper after a
restart.
This patch removes the hard-coded limit and lets RPC handle the retry
based on the standard hard/soft task semantics.
Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
Reported-by: Helen Chao <[email protected]>
Tested-by: Helen Chao <[email protected]>
Signed-off-by: Dai Ngo <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Locking range start and locking range length
attributes may be required to satisfy restrictions
exposed by the OPAL2 geometry reporting feature.
The geometry reporting feature is described in the TCG OPAL SSC,
section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity
and LowestAlignedLBA).
4.3.5.2.1.1 RangeStart Behavior:
[ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ]
When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:
a) the AlignmentRequired (ALIGN above) column in the LockingInfo
table is TRUE;
b) RangeStart is non-zero; and
c) StartAlignment is non-zero, then the method SHALL fail and
return an error status code INVALID_PARAMETER.
4.3.5.2.1.2 RangeLength Behavior:
If RangeStart is zero, then
[ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ]
If RangeStart is non-zero, then
[ LengthAlignment = (RangeLength modulo AlignmentGranularity) ]
When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:
a) the AlignmentRequired (ALIGN above) column in the LockingInfo
table is TRUE;
b) RangeLength is non-zero; and
c) LengthAlignment is non-zero, then the method SHALL fail and
return an error status code INVALID_PARAMETER
In userspace we stick to the logical block size reported by the generic
block device (via sysfs or ioctl), but we cannot read
'AlignmentGranularity' or 'LowestAlignedLBA' anywhere else, so we need
to get those values from the sed-opal interface; otherwise we will not
be able to report or avoid the locking range setup INVALID_PARAMETER
errors above.
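For illustration, a small helper applying the two formulas quoted above
to values obtained from the extended sed-opal discovery (a sketch, not
part of the patch):
  #include <stdbool.h>
  #include <stdint.h>

  /* Check a locking range against 4.3.5.2.1.1 and 4.3.5.2.1.2. */
  static bool opal_range_aligned(uint64_t range_start,
                                 uint64_t range_length,
                                 uint64_t alignment_granularity,
                                 uint64_t lowest_aligned_lba)
  {
          /* StartAlignment = (RangeStart % AlignmentGranularity)
           *                  - LowestAlignedLBA */
          if (range_start &&
              (range_start % alignment_granularity) - lowest_aligned_lba)
                  return false;

          if (range_length) {
                  uint64_t rem = range_length % alignment_granularity;

                  /* LowestAlignedLBA is subtracted only when
                   * RangeStart is zero */
                  if (range_start == 0)
                          rem -= lowest_aligned_lba;
                  if (rem)
                          return false;
          }
          return true;
  }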
Signed-off-by: Ondrej Kozina <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Tested-by: Milan Broz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Raw NAND core changes:
* Convert to platform remove callback returning void
* Fix spelling mistake waifunc() -> waitfunc()
Raw NAND controller driver changes:
* imx: Remove unused is_imx51_nfc and imx53_nfc functions
* omap2: Drop obsolete dependency on COMPILE_TEST
* orion: Use devm_platform_ioremap_resource()
* qcom:
- Use of_property_present() for testing DT property presence
- Use devm_platform_get_and_ioremap_resource()
* stm32_fmc2: Depends on ARCH_STM32 instead of MACH_STM32MP157
* tmio: Remove reference to config MTD_NAND_TMIO in the parsers
Raw NAND manufacturer driver changes:
* hynix: Fix up bit 0 of sdr_timing_mode
SPI-NAND changes:
* Add support for ESMT F50x1G41LB
Signed-off-by: Miquel Raynal <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into driver-core-next
Sudeep writes:
cacheinfo and arch_topology updates for v6.4
The cache information can be extracted from either a Device Tree (DT),
the PPTT ACPI table, or arch registers (clidr_el1 for arm64).
When the DT is used but no cache properties are advertised, the current
code doesn't correctly fall back to using the arch information. The
changes here fix that, and also assume that L1 data/instruction caches
are private and L2/higher caches are shared when the cache information
is missing from DT/ACPI and is derived from clidr_el1/arch registers.
Currently the cacheinfo is built from the primary CPU prior to secondary
CPU boot, if the DT/ACPI description contains cache information.
However, if it is not present, the code still reverts to the old
behavior, which allocates the cacheinfo memory on each secondary CPU
and causes RT kernels to trigger a "BUG: sleeping function called from
invalid context".
The changes here attempt to enable automatic detection for RT kernels
when no DT/ACPI cache information is available, by pre-allocating
cacheinfo memory on the primary CPU.
* tag 'cacheinfo-updates-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
cacheinfo: Add use_arch[|_cache]_info field/function
arch_topology: Remove early cacheinfo error message if -ENOENT
cacheinfo: Check cache properties are present in DT
cacheinfo: Check sib_leaf in cache_leaves_are_shared()
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
cacheinfo: Add arm64 early level initializer implementation
cacheinfo: Add arch specific early level initializer
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc into char-misc-next
Georgi writes:
interconnect changes for 6.4
This pull request contains the interconnect changes for the 6.4-rc1 merge
window, which this time are mostly cleanups.
Core changes:
interconnect: Skip call into provider if initial bw is zero
interconnect: Use of_property_present() for testing DT property presence
interconnect: drop racy registration API
interconnect: drop unused icc_link_destroy() interface
Driver changes:
interconnect: qcom: Drop obsolete dependency on COMPILE_TEST
interconnect: qcom: drop obsolete OSM_L3/EPSS defines
interconnect: qcom: osm-l3: drop unuserd header inclusion
interconnect: qcom: rpm: drop bogus pm domain attach
interconnect: qcom: rpm: make QoS INVALID default
interconnect: qcom: rpm: Add support for specifying channel num
interconnect: qcom: Sort kerneldoc entries
dt-bindings: interconnect: OSM L3: Add SM6375 CPUCP compatible
dt-bindings: interconnect: qcom,msm8998-bwmon: Resolve MSM8998 support
Signed-off-by: Georgi Djakov <[email protected]>
* tag 'icc-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc:
dt-bindings: interconnect: qcom,msm8998-bwmon: Resolve MSM8998 support
dt-bindings: interconnect: OSM L3: Add SM6375 CPUCP compatible
interconnect: qcom: Sort kerneldoc entries
interconnect: qcom: rpm: Add support for specifying channel num
interconnect: qcom: rpm: make QoS INVALID default
interconnect: qcom: rpm: drop bogus pm domain attach
interconnect: drop unused icc_link_destroy() interface
interconnect: drop racy registration API
interconnect: Use of_property_present() for testing DT property presence
interconnect: qcom: osm-l3: drop unuserd header inclusion
interconnect: qcom: drop obsolete OSM_L3/EPSS defines
interconnect: Skip call into provider if initial bw is zero
interconnect: qcom: Drop obsolete dependency on COMPILE_TEST
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mani/mhi into char-misc-next
Manivannan writes:
MHI Host
========
Core
----
- Removed the "mhi_poll()" API as there are no in-kernel users available at the
moment.
- Added range check for the CHDBOFF and ERDBOFF registers in case the device
reports bad values.
- Fixed the errno for the rest of the range checks to use -ERANGE.
- Modified the event ring handlers to ring the doorbell only if there are
any pending elements in the ring to process for the device.
- Removed the check for EE (Execution Environment) while processing the SYS_ERR
transition as it creates device recovery issues when SBL (Secondary
Bootloader) crashes early.
- Used mhi_tryset_pm_state() API to set the error state instead of open coding
if the firmware loading fails. This avoids the race with other pm_state
updates.
pci_generic
-----------
- Dropped the redundant pci_{enable/disable}_pcie_error_reporting() calls from
driver probe's error path as the PCI core itself takes care of that now.
- Reverted commit 2d5253a096c6 ("bus: mhi: host: pci_generic: Add a secondary
AT port to Telit FN990") as it turned out to be erroneous. The patch adding
the secondary AT port for FN990 was applied through both the NET and MHI
trees, which resulted in two commits for the same functionality, one of
which ended up wrong.
- Added support for Foxconn T99W510 modem based on SDX24 chipset from Qualcomm.
MHI Endpoint
============
- Demoted the channel not supported error log to debug as not all devices will
support all channels defined in MHI spec and this may spam users.
* tag 'mhi-for-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mani/mhi:
bus: mhi: host: Use mhi_tryset_pm_state() for setting fw error state
bus: mhi: host: Remove duplicate ee check for syserr
bus: mhi: host: Avoid ringing EV DB if there are no elements to process
bus: mhi: pci_generic: Add Foxconn T99W510
bus: mhi: host: Use ERANGE for BHIOFF/BHIEOFF range check
bus: mhi: host: Range check CHDBOFF and ERDBOFF
bus: mhi: host: pci_generic: Revert "Add a secondary AT port to Telit FN990"
bus: mhi: host: pci_generic: Drop redundant pci_enable_pcie_error_reporting()
bus: mhi: ep: Demote unsupported channel error log to debug
bus: mhi: host: Remove mhi_poll() API
|
|
Accesses to nf_trace and ipvs_property are already wrapped
by ifdefs where necessary. Don't allocate the bits for those
fields at all if possible.
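The resulting layout looks roughly like this; the exact Kconfig
conditions are illustrative assumptions, not copied from the patch:
  /* inside struct sk_buff: bits exist only when compiled in */
  #if IS_ENABLED(CONFIG_IP_VS)
          __u8                    ipvs_property:1;
  #endif
  #if IS_ENABLED(CONFIG_NF_TABLES) || \
      IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE)
          __u8                    nf_trace:1;
  #endif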
Acked-by: Florian Westphal <[email protected]>
Acked-by: Simon Horman <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
nf_trace is a debug feature, AFAIU, and yet it sits oddly
high in the sk_buff bitfield. Move it down, pushing up
dst_pending_confirm and inner_protocol_type.
The next change will make nf_trace optional (under Kconfig), and all
optional fields should be placed after the 2-bit fields to avoid 2-bit
fields straddling bytes.
dst_pending_confirm is L3, so it makes sense next to ignore_df.
inner_protocol_type goes up just to keep the balance.
Acked-by: Florian Westphal <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
alloc_cpu currently sits between 4-byte fields, so it's almost
guaranteed to create a 2B hole. It has the knock-on effect of
creating a 4B hole after @end (with @end and @tail ending up in
different cachelines).
None of this matters hugely, but for kernel configs which
don't enable all the features there may well be a 2B hole
after the bitfield. Move alloc_cpu there.
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
SCTP is not universally deployed, allow hiding its bit
from the skb.
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Datacenter kernel builds will very likely not include WIRELESS,
so let them shave 2 bits off the skb by hiding the wifi fields.
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Johannes Berg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Linux LEDs can be requested to perform hardware accelerated
blinking. Pass this to the PHY driver, if it implements the op.
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Linux LEDs can be software controlled via the brightness file in /sys.
LED drivers need to implement a brightness_set function which the core
will call. Implement an intermediary in phy_device, which will call
into the phy driver if it implements the necessary function.
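A sketch of the intermediary; the phy_led wrapper type and the driver op
name are assumptions based on this description:
  static int phy_led_set_brightness(struct led_classdev *led_cdev,
                                    enum led_brightness value)
  {
          struct phy_led *phyled = to_phy_led(led_cdev);
          struct phy_device *phydev = phyled->phydev;
          int err;

          /* serialize against other MDIO access to the PHY */
          mutex_lock(&phydev->lock);
          err = phydev->drv->led_brightness_set(phydev, phyled->index,
                                                value);
          mutex_unlock(&phydev->lock);

          return err;
  }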
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Define common binding parsing for all PHY drivers with LEDs using
phylib. Parse the DT as part of the phy_probe and add LEDs to the
linux LED class infrastructure. For the moment, provide a dummy
brightness function, which will later be replaced with a call into the
PHY driver. This allows testing since the LED core might otherwise
reject an LED whose brightness cannot be set.
Add a dependency on LED_CLASS. It either needs to be built in, or not
enabled, since a modular build can result in linker errors.
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Provide stubs for devm_led_classdev_register_ext() and
led_init_default_state_get() so that LED drivers embedded within other
drivers such as PHYs and Ethernet switches still build when LEDS_CLASS
or NEW_LEDS are disabled. This also helps with Kconfig dependencies,
which are somewhat hairy for phylib and mdio and only get worse when
adding a dependency on LED_CLASS.
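A sketch of what the stubs look like when LEDS_CLASS is disabled (the
return values are assumptions; registration simply pretends to succeed):
  static inline int
  devm_led_classdev_register_ext(struct device *parent,
                                 struct led_classdev *led_cdev,
                                 struct led_init_data *init_data)
  {
          return 0;       /* no LED is created, but callers still build */
  }

  static inline enum led_default_state
  led_init_default_state_get(struct fwnode_handle *fwnode)
  {
          return LEDS_DEFSTATE_OFF;
  }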
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
(struct nf_conn)->timeout is an interval before the conntrack is
confirmed. After confirmation, it becomes a timestamp.
It is observed that the timeout of an unconfirmed conntrack is:
- Set by calling ctnetlink_change_timeout(). As a result,
`nfct_time_stamp` was wrongly added to `ct->timeout` twice.
- Read by calling ctnetlink_dump_timeout(). As a result,
`nfct_time_stamp` was wrongly subtracted.
Call Trace:
<TASK>
dump_stack_lvl
ctnetlink_dump_timeout
__ctnetlink_glue_build
ctnetlink_glue_build
__nfqnl_enqueue_packet
nf_queue
nf_hook_slow
ip_mc_output
? __pfx_ip_finish_output
ip_send_skb
? __pfx_dst_output
udp_send_skb
udp_sendmsg
? __pfx_ip_generic_getfrag
sock_sendmsg
Separate the 2 cases in:
- Setting `ct->timeout` in __nf_ct_set_timeout().
- Getting `ct->timeout` in ctnetlink_dump_timeout().
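A sketch of the setter side of that split (close to the description
above; the INT_MAX clamp is an assumption):
  static inline void __nf_ct_set_timeout(struct nf_conn *ct, u64 timeout)
  {
          if (timeout > INT_MAX)
                  timeout = INT_MAX;

          if (nf_ct_is_confirmed(ct))
                  /* confirmed: ct->timeout is a timestamp */
                  WRITE_ONCE(ct->timeout,
                             nfct_time_stamp + (u32)timeout);
          else
                  /* unconfirmed: ct->timeout is an interval */
                  ct->timeout = (u32)timeout;
  }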
Pablo appends:
Update ctnetlink to set up the timeout _after_ the IPS_CONFIRMED flag is
set, otherwise conntrack creation via ctnetlink breaks.
Note that the problem described in this patch occurs since the
introduction of the nfnetlink_queue conntrack support, select a
sufficiently old Fixes: tag for -stable kernel to pick up this fix.
Fixes: a4b4766c3ceb ("netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack info")
Signed-off-by: Tzung-Bi Shih <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Currently, bonding only obtains the timestamp (ts) information of
the active slave, which is available only for modes 1, 5, and 6.
For other modes, bonding only has software rx timestamping support.
However, some users who use modes such as LACP also want tx timestamp
support. To address this, check the ts information of each
slave: if all slaves support tx timestamping, we can enable tx
timestamping support for the bond.
Add a note in ethtool.h that get_ts_info may be called with RCU, or
rtnl or a reference on the device held.
Suggested-by: Miroslav Lichvar <[email protected]>
Signed-off-by: Hangbin Liu <[email protected]>
Acked-by: Jay Vosburgh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Unbreak br_netfilter physdev match support, from Florian Westphal.
2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian.
3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal.
4) Fix validation of catch-all set element.
5) Tighten requirements for catch-all set elements.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
netfilter: nf_tables: validate catch-all set elements
netfilter: nf_tables: fix ifdef to also consider nf_tables=m
netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT
netfilter: br_netfilter: fix recent physdev match breakage
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
All ublk commands (control, IO) should have used ioctl command encoding
from the beginning: 1) ioctl command encoding defines each code
uniquely, so the driver can easily detect a wrong command sent from
userspace; 2) it might help the security subsystem audit uring cmds [1].
Unfortunately we didn't do it that way, and it could be one lesson for
the ublk driver.
So switch to ioctl command encoding now. We still support commands
encoded the old way, but they become legacy definitions. Any new command
should use ioctl encoding.
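For illustration, an ioctl-encoded control command next to its legacy
counterpart (the command number and struct name are assumptions modeled
on the ublk uapi):
  #include <linux/ioctl.h>

  /* legacy: a bare command number, nothing for the driver to check */
  #define UBLK_CMD_GET_DEV_INFO   0x02

  /* ioctl encoding: direction, magic 'u', number and argument size
   * are all part of the code, so bad commands are easy to reject */
  #define UBLK_U_CMD_GET_DEV_INFO \
          _IOR('u', UBLK_CMD_GET_DEV_INFO, struct ublksrv_ctrl_cmd)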
See ublksrv code for switching to ioctl command encoding in [2].
[1] https://lore.kernel.org/io-uring/CAHC9VhSVzujW9LOj5Km80AjU0EfAuukoLrxO6BEfnXeK_s6bAg@mail.gmail.com/
[2] https://github.com/ming1/ubdsrv/commits/ioctl_cmd_encoding
Cc: Christoph Hellwig <[email protected]>
Cc: Ken Kurematsu <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
A multishot timeout submission will repeatedly generate completions with
the IORING_CQE_F_MORE cflag set. Depending on the value of the `off'
field in the submission, these timeouts can either repeat indefinitely
until cancelled (`off' = 0) or for a fixed number of times (`off' > 0).
Only noseq timeouts (i.e. not dependent on the number of I/O
completions) are supported.
An indefinite timer will be cancelled if the CQ ever overflows.
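A usage sketch with liburing; IORING_TIMEOUT_MULTISHOT is the flag this
patch adds, and the liburing wrapper maps its `count' argument onto the
sqe's `off' field:
  #include <errno.h>
  #include <liburing.h>

  /* Arm a 1-second timeout that repeats until cancelled. */
  static int arm_multishot_timeout(struct io_uring *ring)
  {
          struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

          if (!sqe)
                  return -EBUSY;
          /* count == 0: indefinite; count > 0: that many firings */
          io_uring_prep_timeout(sqe, &ts, 0, IORING_TIMEOUT_MULTISHOT);
          return io_uring_submit(ring);
  }
Each firing posts a CQE with IORING_CQE_F_MORE set; the final completion
clears it.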
Signed-off-by: David Wei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
hwpoison_user_mappings() is updated to support ksm pages, and
collect_procs_ksm() is added to collect processes when the error hits a
ksm page. The difference from collect_procs_anon() is that it also needs
to traverse the rmap-item list on the stable node of the ksm page. At
the same time, add_to_kill_ksm() is added to handle ksm pages, and
task_in_to_kill_list() is added to avoid adding tsk to the to_kill list
twice. This matters because, when scanning the list, if the pages that
make up the ksm page all come from the same process, that process could
otherwise be added repeatedly.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Longlong Xia <[email protected]>
Tested-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Nanyong Sun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
typeof is (still) a GNU extension, which means that it cannot be used when
building ISO C (e.g. -std=c99). It should therefore be avoided in uapi
headers in favour of the ISO-friendly __typeof__.
Unfortunately this issue could not be detected by
CONFIG_UAPI_HEADER_TEST=y as the __ALIGN_KERNEL() macro is not expanded in
any uapi header.
This matters from a userspace perspective, not a kernel one. uapi
headers and their contents are expected to be usable in a variety of
situations, and in particular when building ISO C applications (with
-std=c99 or similar).
This particular problem can be reproduced by trying to use the
__ALIGN_KERNEL macro directly in application code, say:

  #include <linux/const.h>

  int align(int x, int a)
  {
          return __ALIGN_KERNEL(x, a);
  }

and trying to build that with -std=c99.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: a79ff731a1b2 ("netfilter: xtables: make XT_ALIGN() usable in exported headers by exporting __ALIGN_KERNEL()")
Signed-off-by: Kevin Brodsky <[email protected]>
Reported-by: Ruben Ayrapetyan <[email protected]>
Tested-by: Ruben Ayrapetyan <[email protected]>
Reviewed-by: Petr Vorel <[email protected]>
Tested-by: Petr Vorel <[email protected]>
Reviewed-by: Masahiro Yamada <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Delay accounting does not track the delay of IRQ/SOFTIRQ, even though
IRQ/SOFTIRQ can have an obvious impact on the productivity of some
workloads, such as when a workload runs on a system that is busy
handling network IRQ/SOFTIRQ.
Getting the delay of IRQ/SOFTIRQ can help users reduce such delays,
for example by setting interrupt affinity or task affinity, or by using
a kernel thread for NAPI. This is inspired by "sched/psi: Add PSI_IRQ
to track IRQ/SOFTIRQ pressure" [1]. Also fix some code indentation
problems in older code.
And update tools/accounting/getdelays.c:
/ # ./getdelays -p 156 -di
print delayacct stats ON
printing IO accounting
PID     156

CPU             count     real total  virtual total    delay total  delay average
                   15       15836008       16218149      275700790       18.380ms
IO              count    delay total  delay average
                    0              0        0.000ms
SWAP            count    delay total  delay average
                    0              0        0.000ms
RECLAIM         count    delay total  delay average
                    0              0        0.000ms
THRASHING       count    delay total  delay average
                    0              0        0.000ms
COMPACT         count    delay total  delay average
                    0              0        0.000ms
WPCOPY          count    delay total  delay average
                   36        7586118        0.211ms
IRQ             count    delay total  delay average
                   42         929161        0.022ms
[1] commit 52b1364ba0b1("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Yang <[email protected]>
Cc: Jiang Xuexin <[email protected]>
Cc: wangyong <[email protected]>
Cc: junhua huang <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This has a slight benefit for x86 and has no effect on other targets.
The benefit to x86 is that it changes the codegen for setting a node to
black from `mov %r0, %r1; or $RB_BLACK, %r1` to `lea RB_BLACK(%r0), %r1`,
which saves an instruction.
In all other cases it just replaces one ALU instruction with another
(or -> add), which performs the same on all machines I am aware of.
Total instructions in rbtree.o:
Before - 802
After - 782
so it saves about 20 `mov` instructions.
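The change itself boils down to one operator (a sketch of
rb_set_parent_color; since the parent pointer's low bits are known to be
clear, '+' and '|' compute the same value here):
  static inline void rb_set_parent_color(struct rb_node *rb,
                                         struct rb_node *p, int color)
  {
          /* '+' lets x86 fold the mov and the ALU op into one lea */
          rb->__rb_parent_color = (unsigned long)p + color;
  }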
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Noah Goldstein <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The interface for fcntl expects the argument passed for the command
F_ADD_SEALS to be of type int. The current code wrongly treats it as a
long. In order to avoid access to undefined bits, we should explicitly
cast the argument to int.
This commit changes the signature of all the related and helper functions
so that they treat the argument as int instead of long.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luca Vizzarro <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: David Laight <[email protected]>
Cc: Mark Rutland <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Android 14 and later default to MGLRU [1] and field telemetry showed
occasional long tail latency (>100ms) in the reclaim path.
Tracing revealed priority inversion in the reclaim path. In
try_to_inc_max_seq(), when high priority tasks were blocked on
wait_event_killable(), the preemption of the low priority task to call
wake_up_all() caused those high priority tasks to wait longer than
necessary. In general, this problem is not different from others of its
kind, e.g., one caused by mutex_lock(). However, it is specific to MGLRU
because it introduced the new wait queue lruvec->mm_state.wait.
The purpose of this new wait queue is to avoid the thundering herd
problem. If many direct reclaimers rush into try_to_inc_max_seq(), only
one can succeed, i.e., the one to wake up the rest, and the rest who
failed might cause premature OOM kills if they do not wait. So far there
is no evidence supporting this scenario, based on how often the wait has
been hit, which raises the question of how useful the wait queue is in
practice.
Based on Minchan's recommendation, which is in line with his commit
6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path") and the
rest of the MGLRU code which also uses trylock when possible, remove the
wait queue.
[1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Kalesh Singh <[email protected]>
Suggested-by: Minchan Kim <[email protected]>
Reported-by: Wei Wang <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
During reclaim, we keep track of pages reclaimed by means other than
LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab;
a pointer to the reclaim_state is stashed in the current task_struct.
However, we keep track of more than just reclaimed slab pages through
this. We also use it for clean file pages dropped through pruned inodes,
and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add a
helper function that wraps updating it through current, so that future
changes to this logic are contained within include/linux/swap.h.
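The helper can be as small as this (a sketch; the name is assumed from
the description):
  /* include/linux/swap.h */
  static inline void mm_account_reclaimed_pages(unsigned long pages)
  {
          if (current->reclaim_state)
                  current->reclaim_state->reclaimed += pages;
  }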
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Non-void KMSAN hooks may return error codes that indicate that KMSAN
failed to reflect the changed memory state in the metadata (e.g. it could
not create the necessary memory mappings). In such cases the callers
should handle the errors to prevent the tool from using the inconsistent
metadata in the future.
We mark non-void hooks with __must_check so that error handling is not
skipped.
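For example, one of the hooks named in this series, with the annotation
applied (treat the exact signature as illustrative):
  int __must_check kmsan_ioremap_page_range(unsigned long start,
                                            unsigned long end,
                                            phys_addr_t phys_addr,
                                            pgprot_t prot,
                                            unsigned int page_shift);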
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dipanjan Das <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
copy-on-write of hugetlb user pages with uncorrectable errors will result
in a kernel crash. This is because the copy is performed in kernel mode
and in general we can not handle accessing memory with such errors while
in kernel mode. Commit a873dfe1032a ("mm, hwpoison: try to recover from
copy-on write faults") introduced the routine copy_user_highpage_mc() to
gracefully handle copying of user pages with uncorrectable errors.
However, the separate hugetlb copy-on-write code paths were not modified
as part of commit a873dfe1032a.
Modify hugetlb copy-on-write code paths to use copy_mc_user_highpage() so
that they can also gracefully handle uncorrectable errors in user pages.
This involves changing the hugetlb specific routine
copy_user_large_folio() from type void to int so that it can return an
error. Modify the hugetlb userfaultfd code in the same way so that it can
return -EHWPOISON if it encounters an uncorrectable error.
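The resulting call-site pattern in the hugetlb COW path, sketched (the
label and exact error plumbing are assumptions):
  if (copy_user_large_folio(new_folio, old_folio, addr, vma)) {
          /* a poisoned source page was hit: undo instead of crashing */
          ret = VM_FAULT_HWPOISON_LARGE;
          goto out_release_all;   /* hypothetical unwind label */
  }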
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liu Shixin <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We now use the ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP config option to
indicate devdax and hugetlb vmemmap optimization support. Hence rename
it to the generic ARCH_WANT_OPTIMIZE_VMEMMAP.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Tarun Sahu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for
compound devmaps") added support for using an optimized vmemmap for
devdax devices. But how vmemmap mappings are created is architecture
specific. For example, powerpc with hash translation doesn't have
vmemmap mappings in the init_mm page table; instead they are bolted
entries in the hardware page table.
vmemmap_populate_compound_pages(), used by the vmemmap optimization
code, is not aware of these architecture-specific mappings. Hence allow
architectures to opt in to this feature. I selected architectures
supporting the HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting
this feature.
This patch fixes the below crash on ppc64.
This patch fixes the below crash on ppc64.
BUG: Unable to handle kernel data access on write at 0xc00c000100400038
Faulting instruction address: 0xc000000001269d90
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 7 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc5-150500.34-default+ #2 5c90a668b6bbd142599890245c2fb5de19d7d28a
Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.40 (VL950_099) hv:phyp pSeries
NIP: c000000001269d90 LR: c0000000004c57d4 CTR: 0000000000000000
REGS: c000000003632c30 TRAP: 0300 Not tainted (6.3.0-rc5-150500.34-default+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24842228 XER: 00000000
CFAR: c0000000004c57d0 DAR: c00c000100400038 DSISR: 42000000 IRQMASK: 0
....
NIP [c000000001269d90] __init_single_page.isra.74+0x14/0x4c
LR [c0000000004c57d4] __init_zone_device_page+0x44/0xd0
Call Trace:
[c000000003632ed0] [c000000003632f60] 0xc000000003632f60 (unreliable)
[c000000003632f10] [c0000000004c5ca0] memmap_init_zone_device+0x170/0x250
[c000000003632fe0] [c0000000005575f8] memremap_pages+0x2c8/0x7f0
[c0000000036330c0] [c000000000557b5c] devm_memremap_pages+0x3c/0xa0
[c000000003633100] [c000000000d458a8] dev_dax_probe+0x108/0x3e0
[c0000000036331a0] [c000000000d41430] dax_bus_probe+0xb0/0x140
[c0000000036331d0] [c000000000cef27c] really_probe+0x19c/0x520
[c000000003633260] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
[c0000000036332e0] [c000000000cef888] driver_probe_device+0x58/0x120
[c000000003633320] [c000000000cefa6c] __device_attach_driver+0x11c/0x1e0
[c0000000036333a0] [c000000000cebc58] bus_for_each_drv+0xa8/0x130
[c000000003633400] [c000000000ceefcc] __device_attach+0x15c/0x250
[c0000000036334a0] [c000000000ced458] bus_probe_device+0x108/0x110
[c0000000036334f0] [c000000000ce92dc] device_add+0x7fc/0xa10
[c0000000036335b0] [c000000000d447c8] devm_create_dev_dax+0x1d8/0x530
[c000000003633640] [c000000000d46b60] __dax_pmem_probe+0x200/0x270
[c0000000036337b0] [c000000000d46bf0] dax_pmem_probe+0x20/0x70
[c0000000036337d0] [c000000000d2279c] nvdimm_bus_probe+0xac/0x2b0
[c000000003633860] [c000000000cef27c] really_probe+0x19c/0x520
[c0000000036338f0] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
[c000000003633970] [c000000000cef888] driver_probe_device+0x58/0x120
[c0000000036339b0] [c000000000cefd08] __driver_attach+0x1d8/0x240
[c000000003633a30] [c000000000cebb04] bus_for_each_dev+0xb4/0x130
[c000000003633a90] [c000000000cee564] driver_attach+0x34/0x50
[c000000003633ab0] [c000000000ced878] bus_add_driver+0x218/0x300
[c000000003633b40] [c000000000cf1144] driver_register+0xa4/0x1b0
[c000000003633bb0] [c000000000d21a0c] __nd_driver_register+0x5c/0x100
[c000000003633c10] [c00000000206a2e8] dax_pmem_init+0x34/0x48
[c000000003633c30] [c0000000000132d0] do_one_initcall+0x60/0x320
[c000000003633d00] [c0000000020051b0] kernel_init_freeable+0x360/0x400
[c000000003633de0] [c000000000013764] kernel_init+0x34/0x1d0
[c000000003633e50] [c00000000000de14] ret_from_kernel_thread+0x5c/0x64
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reported-by: Tarun Sahu <[email protected]>
Reviewed-by: Joao Martins <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Convert mfill_atomic_pte_copy(), shmem_mfill_atomic_pte() and
mfill_atomic_pte() to take in a folio pointer.
Convert mfill_atomic() to use a folio. Convert page_kaddr to kaddr in
mfill_atomic().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Replace copy_user_huge_page() with copy_user_large_folio().
copy_user_large_folio() does the same as copy_user_huge_page(), but takes
in folios instead of pages. Remove pages_per_huge_page from
copy_user_large_folio(), because we can get that from folio_nr_pages(dst).
Convert copy_user_gigantic_page() to take in folios.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Convert hugetlb_mfill_atomic_pte() to take in a folio pointer instead of
a page pointer.
Convert mfill_atomic_hugetlb() to use a folio.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Replace copy_huge_page_from_user() with copy_folio_from_user().
copy_folio_from_user() does the same as copy_huge_page_from_user(), but
takes in a folio instead of a page.
Convert page_kaddr to kaddr in copy_folio_from_user() to do indenting
cleanup.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When !CONFIG_SHMEM, smaps_shmem_walk_ops is defined but not used,
triggering a compiler warning. To avoid the warning, remove the #ifdef
around the usage. This has no effect because shmem_mapping() is a stub
returning false when !CONFIG_SHMEM, so the code will be compiled out;
however, we now also need to provide a stub for shmem_swap_usage().
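The new stub mirrors the existing !CONFIG_SHMEM pattern (a sketch):
  #ifndef CONFIG_SHMEM
  static inline unsigned long shmem_swap_usage(struct vm_area_struct *vma)
  {
          return 0;
  }
  #endif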
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7b86ac3371b7 ("pagewalk: separate function pointers from iterator data")
Signed-off-by: Steven Price <[email protected]>
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
If a library wants to get information from auxv (for instance,
AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable
or ideal:
- Be main or the pre-main startup code, and grub through the stack above
main. Doesn't work for a library.
- Call libc getauxval. Not ideal for libraries that are trying to be
libc-independent and/or don't otherwise require anything from other
libraries.
- Open and read /proc/self/auxv. Doesn't work for libraries that may run
in arbitrarily constrained environments that may not have /proc
mounted (e.g. libraries that might be used by an init program or a
container setup tool).
- Assume you're on the main thread and still on the original stack, and
try to walk the stack upwards, hoping to find auxv. Extremely bad
idea.
- Ask the caller to pass auxv in for you. Not ideal for a user-friendly
library, and then your caller may have the same problem.
Add a prctl that copies current->mm->saved_auxv to a userspace buffer.
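A usage sketch; the PR_GET_AUXV constant value and the exact return
semantics are assumptions based on this description:
  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_GET_AUXV
  #define PR_GET_AUXV     0x41555856      /* "AUXV"; assumed value */
  #endif

  int main(void)
  {
          unsigned long auxv[64];
          long ret = prctl(PR_GET_AUXV, auxv, sizeof(auxv), 0, 0);

          if (ret < 0) {
                  perror("PR_GET_AUXV");
                  return 1;
          }
          /* auxv holds (type, value) pairs terminated by AT_NULL (0) */
          for (int i = 0; i + 1 < 64 && auxv[i]; i += 2)
                  printf("type %lu value %#lx\n", auxv[i], auxv[i + 1]);
          return 0;
  }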
Link: https://lkml.kernel.org/r/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org
Signed-off-by: Josh Triplett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Make sure that collapse_file respects any userfaultfds registered with
MODE_MISSING. If userspace has any such userfaultfds registered, then for
any page which it knows to be missing, it may expect a
UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
collapsing a shmem range would result in replacing an empty page with a
THP, to avoid breaking userfaultfd.
Synchronization when checking for userfaultfds in collapse_file is tricky
because the mmap locks can't be used to prevent races with the
registration of new userfaultfds. Instead, we provide synchronization by
ensuring that userspace cannot observe the fact that pages are missing
before we check for userfaultfds. Although this allows registration of a
userfaultfd to race with collapse_file, it ensures that userspace cannot
observe any pages transition from missing to present after such a race
occurs. This makes such a race indistinguishable from the collapse
occurring immediately before the userfaultfd registration.
The first step to provide this synchronization is to stop filling gaps
during the loop iterating over the target range, since the page cache lock
can be dropped during that loop. The second step is to fill the gaps with
XA_RETRY_ENTRY after the page cache lock is acquired the final time, to
avoid races with accesses to the page cache that only take the RCU read
lock.
The fact that we don't fill holes during the initial iteration means that
collapse_file now has to handle faults occurring during the collapse.
This is done by re-validating the number of missing pages after acquiring
the page cache lock for the final time.
This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
return EBUSY if there are any missing pages (instead of succeeding on
shmem and returning EINVAL on anonymous memory). There is also now a
window during MADV_COLLAPSE where a fault on a missing page will cause the
syscall to fail with EAGAIN.
The fact that intermediate page cache state can no longer be observed
before the rollback of a failed collapse is also technically a
userspace-visible change (via at least SEEK_DATA and SEEK_END), but it is
exceedingly unlikely that anything relies on being able to observe that
transient state.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Stevens <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Similar to how copy_mc_user_highpage is implemented for
copy_user_highpage on #MC-supported architectures, introduce the
#MC-handled version of copy_highpage.
This helper has immediate usage when khugepaged wants to copy file-backed
memory pages and tolerate #MC.
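A sketch of the new helper, mirroring how copy_mc_user_highpage is built
on copy_mc_to_kernel (details are assumptions):
  static inline int copy_mc_highpage(struct page *to, struct page *from)
  {
          char *vfrom, *vto;
          int ret;

          vfrom = kmap_local_page(from);
          vto = kmap_local_page(to);
          /* non-zero return means a machine check interrupted the copy */
          ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
          kunmap_local(vto);
          kunmap_local(vfrom);

          return ret;
  }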
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: David Stevens <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tong Tiangen <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Problem
=======
Memory DIMMs are subject to multi-bit flips, i.e. memory errors. As
memory size and density increase, the chances of and number of memory
errors increase. The increasing size and density of server RAM in the
data center and cloud have shown increased uncorrectable memory errors.
There are already mechanisms in the kernel to recover from uncorrectable
memory errors. This series of patches provides the recovery mechanism for
the particular kernel agent khugepaged when it collapses memory pages.
Impact
======
The main reason we chose to make khugepaged collapsing tolerant of memory
failures was its high likelihood of accessing poisoned memory while
performing functionally optional compaction actions. Standard
applications typically don't have strict requirements on the size of
their pages, so they are given 4K pages by the kernel. The kernel is able
to improve application performance by either
1) giving applications 2M pages to begin with, or
2) collapsing 4K pages into 2M pages when possible.
This collapsing operation is done by khugepaged, a kernel agent that is
constantly scanning memory. When collapsing 4K pages into a 2M page, it
must copy the data from the 4K pages into a physically contiguous 2M page.
Therefore, as long as there exists one poisoned cache line in collapsible
4K pages, khugepaged will eventually access it. The current impact to
users is a machine check exception triggered kernel panic. However,
khugepaged’s compaction operations are not functionally required kernel
actions. Therefore making khugepaged tolerant to poisoned memory will
greatly improve user experience.
This patch series is for cases where khugepaged is the first guy that
detects the memory errors on the poisoned pages. IOW, the pages are not
known to have memory errors when khugepaged collapsing gets to them. In
our observation, this happens frequently when the huge page ratio of the
system is relatively low, which is fairly common in virtual machines
running on cloud.
Solution
========
As stated before, it is less desirable to crash the system only because
khugepaged accesses poisoned pages while it is collapsing 4K pages. The
high level idea of this patch series is to skip the group of pages
(usually 512 4K-size pages) once khugepaged finds one of them is poisoned,
as these pages have become ineligible to be collapsed.
We are also careful to unwind operations khugepaged has performed before
it detects memory failures. For example, before copying and collapsing a
group of anonymous pages into a huge page, the source pages will be
isolated and their page table is unlinked from their PMD. These
operations need to be undone in order to ensure these pages are not
changed/lost from the perspective of other threads (both user and kernel
space). As for file backed memory pages, there already exists a rollback
case. This patch just extends it so that khugepaged also correctly rolls
back when it fails to copy poisoned 4K pages.
This patch (of 3):
Make __collapse_huge_page_copy return whether copying anonymous pages
succeeded, and make collapse_huge_page handle the return status.
Break existing PTE scan loop into two for-loops. The first loop copies
source pages into target huge page, and can fail gracefully when running
into memory errors in source pages. If copying all pages succeeds, the
second loop releases and clears up these normal pages. Otherwise, the
second loop rolls back the page table and page states by:
- re-establishing the original PTEs-to-PMD connection.
- releasing source pages back to their LRU list.
Tested manually:
0. Enable khugepaged on system under test.
1. Start a two-thread application. Each thread allocates a chunk of
non-huge anonymous memory buffer.
2. Pick 4 random buffer locations (2 in each thread) and inject
uncorrectable memory errors at corresponding physical addresses.
3. Signal both threads to make their memory buffer collapsible, i.e.
calling madvise(MADV_HUGEPAGE).
4. Wait and check kernel log: khugepaged is able to recover from poisoned
pages and skips collapsing them.
5. Signal both threads to inspect their buffer contents and make sure no
data corruption.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Cc: David Stevens <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tong Tiangen <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In workingset_refault(), we call
mem_cgroup_flush_stats_atomic_ratelimited() to read accurate stats within
an RCU read section and with sleeping disallowed. Move the call above the
RCU read section to make it non-atomic.
Flushing is an expensive operation that scales with the number of cpus and
the number of cgroups in the system, so avoid doing it atomically where
possible.
Since workingset_refault() is the only caller of
mem_cgroup_flush_stats_atomic_ratelimited(), just make it non-atomic, and
rename it to mem_cgroup_flush_stats_ratelimited().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Currently, all contexts that flush memcg stats do so with sleeping not
allowed. Some of these contexts are perfectly safe to sleep in, such as
reading cgroup files from userspace or the background periodic flusher.
Flushing is an expensive operation that scales with the number of cpus and
the number of cgroups in the system, so avoid doing it atomically where
possible.
Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
sleepable), and provide a separate atomic version. The atomic version is
used in reclaim, refault, writeback, and in mem_cgroup_usage(). All other
code paths are left to use the non-atomic version. This includes
callbacks for userspace reads and the periodic flusher.
Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and
refault code paths are modified to do non-atomic flushing in separate
later patches -- so it will eventually be changed back to
mem_cgroup_flush_stats_ratelimited().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
mem_cgroup_flush_stats_delayed() suggests this is using a delayed_work,
but it actually sometimes flushes directly from the callsite.
What it's doing is ratelimited calls. A better name would be
mem_cgroup_flush_stats_ratelimited().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "memcg: avoid flushing stats atomically where possible", v3.
rstat flushing is an expensive operation that scales with the number of
cpus and the number of cgroups in the system. The purpose of this series
is to minimize the contexts where we flush stats atomically.
Patches 1 and 2 are cleanups requested during reviews of prior versions of
this series.
Patch 3 makes sure we never try to flush from within an irq context.
Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for
atomic and non-atomic flushing, and make sure we only flush the stats
atomically when necessary.
Patch 8 is a slightly tangential optimization that limits the work done by
rstat flushing in some scenarios.
This patch (of 8):
cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
"irqs are disabled throughout", which is what the current implementation
does (currently under discussion [1]), but is not the intention. The
intention is that this function is safe to call from atomic contexts.
Name it as such.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|