This adds IOMMU_HWPT_DATA_VTD_S1 for the stage-1 hw_pagetable of Intel
VT-d and the corresponding data structure for the userspace-specified
parameters for domain allocation.
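For orientation, a minimal sketch of how such a userspace-specified stage-1
parameter structure could be laid out (the field names and widths below are
illustrative assumptions for this log, not a quote of the uAPI header):

  /* Illustrative only: a VT-d stage-1 user-data layout as it might
   * appear in a uAPI header; exact names/widths are assumptions. */
  #include <linux/types.h>

  struct example_hwpt_vtd_s1 {
      __aligned_u64 flags;        /* stage-1 translation attributes */
      __aligned_u64 pgtbl_addr;   /* address of the stage-1 page table */
      __u32 addr_width;           /* address width of the stage-1 table */
      __u32 __reserved;           /* must be zero */
  };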
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Wrap the data type/pointer/len sanity checks and a copy_struct_from_user()
call into a helper for iommu drivers to copy driver-specific data via
struct iommu_user_data. It is expected to be used in the domain_alloc_user
op, for example.
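Roughly, the helper bundles the checks and the bounded copy like the sketch
below (hedged: the struct iommu_user_data field names and the exact checks
are assumptions for illustration; copy_struct_from_user() is the existing
kernel primitive being wrapped):

  /* Sketch, not the actual helper: validate the tagged user data and
   * copy it into a fixed-size driver-specific structure. */
  static int example_copy_user_data(void *dst, size_t dst_size,
                                    const struct iommu_user_data *user_data,
                                    unsigned int expected_type)
  {
      if (!user_data || !user_data->uptr || !user_data->len)
          return -EINVAL;                 /* pointer/len sanity */
      if (user_data->type != expected_type)
          return -EINVAL;                 /* data type sanity */
      /* bounded copy that tolerates smaller/larger userspace structs */
      return copy_struct_from_user(dst, dst_size, user_data->uptr,
                                   user_data->len);
  }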
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Nicolin Chen <[email protected]>
Co-developed-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
IOMMU_HWPT_ALLOC already supports iommu_domain allocation for userspace.
But it can only allocate a hw_pagetable that associates to a given IOAS,
i.e. only a kernel-managed hw_pagetable of IOMMUFD_OBJ_HWPT_PAGING type.
IOMMU drivers can now support user-managed hw_pagetables, for two-stage
translation use cases that require user data input from userspace.
Add a new IOMMUFD_OBJ_HWPT_NESTED type with its abort/destroy(). Pair it
with a new iommufd_hwpt_nested structure and its to_hwpt_nested() helper.
Update the to_hwpt_paging() helper, so a NESTED-type hw_pagetable can be
handled in the callers, for example iommufd_hw_pagetable_enforce_rr().
Screen the inputs, including the parent PAGING-type hw_pagetable, which
requires a new nest_parent flag in the iommufd_hwpt_paging structure.
Extend the IOMMU_HWPT_ALLOC ioctl to accept IOMMU driver-specific data
input tagged by enum iommu_hwpt_data_type. Also, update @pt_id to accept
an hwpt_id in addition to an ioas_id. Then, use them to allocate a
hw_pagetable of IOMMUFD_OBJ_HWPT_NESTED type using the
iommufd_hw_pagetable_alloc_nested() allocator.
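From userspace, allocating a NESTED hw_pagetable through the extended ioctl
could look roughly like the sketch below (hedged: it assumes the
data_type/data_len/data_uptr fields added by this change and a previously
allocated parent PAGING hwpt ID; error handling is minimal):

  #include <linux/iommufd.h>
  #include <sys/ioctl.h>
  #include <stdint.h>

  /* Sketch: allocate a user-managed (nested) hw_pagetable on top of a
   * parent PAGING hwpt, passing driver-specific stage-1 data. */
  static int alloc_nested_hwpt(int iommufd, uint32_t dev_id,
                               uint32_t parent_hwpt_id, uint32_t data_type,
                               void *driver_data, uint32_t data_len)
  {
      struct iommu_hwpt_alloc cmd = {
          .size = sizeof(cmd),
          .dev_id = dev_id,
          .pt_id = parent_hwpt_id,    /* @pt_id now accepts an hwpt_id */
          .data_type = data_type,     /* enum iommu_hwpt_data_type value */
          .data_len = data_len,
          .data_uptr = (uintptr_t)driver_data,
      };

      if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
          return -1;
      return (int)cmd.out_hwpt_id;    /* ID of the new NESTED hwpt object */
  }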
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Nicolin Chen <[email protected]>
Co-developed-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
The domain_alloc_user op already accepts user flags for domain allocation;
add a parent domain pointer and driver-specific user data support as well.
The user data is tagged with a type so that iommu drivers can add their
own driver-specific user data per hw_pagetable.
Add a struct iommu_user_data as a bundle of data_ptr/data_len/type from an
iommufd core uAPI structure. Make the user data opaque to the core, since
a userspace driver must match the kernel driver. In the future, if drivers
share some common parameters, there could be a generic parameter as well.
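As a sketch of the resulting shape (hedged: the prototype and field names
below follow this description but are illustrative, not a quote of the
header):

  /* Sketch: the op gains a parent domain and an opaque, type-tagged
   * user-data bundle that only the matching driver interprets. */
  struct iommu_user_data {
      unsigned int type;      /* enum tag understood by the driver */
      void __user *uptr;      /* driver-specific payload from userspace */
      size_t len;             /* payload length */
  };

  struct iommu_domain *(*domain_alloc_user)(struct device *dev, u32 flags,
                                            struct iommu_domain *parent,
                                            const struct iommu_user_data *user_data);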
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lu Baolu <[email protected]>
Co-developed-by: Nicolin Chen <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Introduce a new domain type for a user I/O page table, which is nested on
top of another address space represented by a PAGING domain. This
new domain can be allocated by the domain_alloc_user op, and attached to
a device through the existing iommu_attach_device/group() interfaces.
The mappings of a nested domain are managed by user space software, so it
is not necessary to have map/unmap callbacks.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Merge updates related to system sleep handling, one power capping update
and one PM utility update for 6.7-rc1:
- Use __get_safe_page() rather than touching the list in hibernation
snapshot code (Brian Geffon).
- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav).
- Clean up sync_read handling in snapshot_write_next() (Brian Geffon).
- Fix kerneldoc comments for swsusp_check() and swsusp_close() to
better match code (Christoph Hellwig).
- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
capping driver to pr_debug() (Ville Syrjälä).
- Change the minimum python version for the intel_pstate_tracer utility
from 2.7 to 3.6 (Doug Smythies).
* pm-sleep:
PM: hibernate: fix the kerneldoc comment for swsusp_check() and swsusp_close()
PM: hibernate: Clean up sync_read handling in snapshot_write_next()
PM: sleep: Fix symbol export for _SIMPLE_ variants of _PM_OPS()
PM: hibernate: Use __get_safe_page() rather than touching the list
* powercap:
powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug()
* pm-tools:
tools/power/x86/intel_pstate_tracer: python minimum version
|
|
Merge cpufreq updates for 6.7-rc1:
- Add support for several Qualcomm SoC versions and other similar
changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
Gerhold and Varadarajan Narayanan).
- Clean up the tegra cpufreq driver (Sumit Gupta).
- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
Herring).
- Add support for TI's am62p5 SoC (Bryan Brattlof).
- Make ARM_BRCMSTB_AVS_CPUFREQ depend on !ARM_SCMI_CPUFREQ (Florian
Fainelli).
- Update Kconfig to mention i.MX7 as well (Alexander Stein).
- Revise global turbo disable check in intel_pstate (Srinivas
Pandruvada).
- Carry out initialization of sg_cpu in the schedutil cpufreq governor
in one loop (Liao Chang).
- Simplify the condition for storing 'down_threshold' in the
conservative cpufreq governor (Liao Chang).
- Use fine-grained mutex in the userspace cpufreq governor (Liao
Chang).
- Move is_managed indicator in the userspace cpufreq governor into a
per-policy structure (Liao Chang).
- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois).
- Fix buffer overflow detection in trans_stats() (Christian Marangi).
* pm-cpufreq: (32 commits)
dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
cpufreq: qcom-nvmem: add support for IPQ8064
cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
cpufreq: qcom-nvmem: add support for IPQ6018
dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
cpufreq: qcom-nvmem: Add MSM8909
cpufreq: qcom-nvmem: Simplify driver data allocation
cpufreq: stats: Fix buffer overflow detection in trans_stats()
dt-bindings: cpufreq: cpufreq-qcom-hw: Add SDX75 compatible
cpufreq: ARM_BRCMSTB_AVS_CPUFREQ cannot be used with ARM_SCMI_CPUFREQ
cpufreq: ti-cpufreq: Add opp support for am62p5 SoCs
cpufreq: dt-platdev: add am62p5 to blocklist
cpufreq: tegra194: remove redundant AND with cpu_online_mask
cpufreq: tegra194: use refclk delta based loop instead of udelay
cpufreq: tegra194: save CPU data to avoid repeated SMP calls
cpufreq: Rebuild sched-domains when removing cpufreq driver
cpufreq: userspace: Move is_managed indicator into per-policy structure
...
|
|
Merge devfreq updates for 6.7-rc1:
- Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to support
specific devices like UFS which handle multiple clocks through the OPP
(Operating Performance Point) framework (Manivannan Sadhasivam).
- Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-
event driver:
* Generalize rockchip-dfi.c to support new RK3568/RK3588 using
different DDR type (Sascha Hauer).
* Convert devicetree binding document format to YAML (Sascha Hauer).
* Add perf support for DFI (a unit suitable for measuring DDR
utilization) to rockchip-dfi.c to extend DFI usage (Sascha Hauer).
- Add locking to the OPP handling code in the Mediatek CCI devfreq
driver, because the voltage of shared OPP might be changed by
multiple drivers (Mark Tseng, Dan Carpenter).
- Use device_get_match_data() in the Samsung Exynos PPMU devfreq-event
driver (Rob Herring).
* pm-devfreq: (26 commits)
dt-bindings: devfreq: event: rockchip,dfi: Add rk3588 support
dt-bindings: devfreq: event: rockchip,dfi: Add rk3568 support
dt-bindings: devfreq: event: convert Rockchip DFI binding to yaml
PM / devfreq: rockchip-dfi: add support for RK3588
PM / devfreq: rockchip-dfi: account for multiple DDRMON_CTRL registers
PM / devfreq: rockchip-dfi: make register stride SoC specific
PM / devfreq: rockchip-dfi: Add perf support
PM / devfreq: rockchip-dfi: give variable a better name
PM / devfreq: rockchip-dfi: Prepare for multiple users
PM / devfreq: rockchip-dfi: Pass private data struct to internal functions
PM / devfreq: rockchip-dfi: Handle LPDDR4X
PM / devfreq: rockchip-dfi: Handle LPDDR2 correctly
PM / devfreq: rockchip-dfi: Add RK3568 support
PM / devfreq: rockchip-dfi: Clean up DDR type register defines
PM / devfreq: rk3399_dmc,dfi: generalize DDRTYPE defines
PM / devfreq: rockchip-dfi: introduce channel mask
PM / devfreq: rockchip-dfi: Use free running counter
PM / devfreq: mediatek: unlock on error in mtk_ccifreq_target()
PM / devfreq: exynos-ppmu: Use device_get_match_data()
PM / devfreq: rockchip-dfi: dfi store raw values in counter struct
...
|
|
Merge updates of the ACPI AC and ACPI PAD drivers and PNP updates for
6.7-rc1:
- Switch over the ACPI AC and ACPI PAD drivers to using the platform
driver interface, which is more logically consistent than binding a
driver directly to an ACPI device object, and clean them up (Michal
Wilczynski).
- Replace strncpy() in the PNP code with either memcpy() or strscpy()
as appropriate (Justin Stitt).
- Clean up coding style in pnp.h (GuoHua Cheng).
* acpi-ac:
ACPI: AC: Rename ACPI device from device to adev
ACPI: AC: Replace acpi_driver with platform_driver
ACPI: AC: Use string_choices API instead of ternary operator
ACPI: AC: Remove redundant checks
* acpi-pad:
ACPI: acpi_pad: Rename ACPI device from device to adev
ACPI: acpi_pad: Use dev groups for sysfs
ACPI: acpi_pad: Replace acpi_driver with platform_driver
* pnp:
PNP: replace deprecated strncpy() with memcpy()
PNP: ACPI: replace deprecated strncpy() with strscpy()
PNP: Clean up coding style in pnp.h
|
|
Merge ACPI bus type driver updates for 6.7-rc1:
- Add context argument to acpi_dev_install_notify_handler() (Rafael
Wysocki).
- Clarify ACPI bus concepts in the ACPI device enumeration
documentation (Rafael Wysocki).
* acpi-bus:
ACPI: bus: Add context argument to acpi_dev_install_notify_handler()
ACPI: docs: enumeration: Clarify ACPI bus concepts
|
|
Merge ACPI EC driver updates, ACPI sysfs interface updates, misc updates
related to ACPI and changes related to ACPI _UID handling for 6.7-rc1:
- Add EC GPE detection quirk for HP 250 G7 Notebook PC (Jonathan
Denose).
- Fix and clean up create_pnp_modalias() and create_of_modalias()
(Christophe JAILLET).
- Modify 2 pieces of code to use acpi_evaluate_dsm_typed() (Andy
Shevchenko).
- Define acpi_dev_uid_match() for matching _UID and use it in several
places (Raag Jadav).
- Use acpi_device_uid() for fetching _UID in 2 places (Raag Jadav).
* acpi-ec:
ACPI: EC: Add quirk for HP 250 G7 Notebook PC
* acpi-sysfs:
ACPI: sysfs: Clean up create_pnp_modalias() and create_of_modalias()
ACPI: sysfs: Fix create_pnp_modalias() and create_of_modalias()
* acpi-misc:
ACPI: x86: s2idle: Switch to use acpi_evaluate_dsm_typed()
ACPI: PCI: Switch to use acpi_evaluate_dsm_typed()
* acpi-uid:
perf: arm_cspmu: use acpi_dev_hid_uid_match() for matching _HID and _UID
ACPI: x86: use acpi_dev_uid_match() for matching _UID
ACPI: utils: use acpi_dev_uid_match() for matching _UID
pinctrl: intel: use acpi_dev_uid_match() for matching _UID
ACPI: utils: Introduce acpi_dev_uid_match() for matching _UID
perf: qcom: use acpi_device_uid() for fetching _UID
ACPI: sysfs: use acpi_device_uid() for fetching _UID
|
|
Merge ACPI backlight driver updates, ACPI APEI updates, ACPI PRM updates
and changes related to ACPI PCC for 6.7-rc1:
- Add acpi_backlight=vendor quirk for Toshiba Portégé R100 (Ondrej
Zary).
- Add "vendor" backlight quirks for 3 Lenovo x86 Android tablets (Hans
de Goede).
- Move Xiaomi Mi Pad 2 backlight quirk to its own section (Hans de
Goede).
- Annotate struct prm_module_info with __counted_by (Kees Cook).
- Fix AER info corruption in aer_recover_queue() when error status data
has multiple sections (Shiju Jose).
- Make APEI use ERST max execution time value for slow devices (Jeshua
Smith).
- Add support for platform notification handling to the PCC mailbox
driver and modify it to support shared interrupts for multiple
subspaces (Huisong Li).
- Define common macros to use when referring to various bitfields in the
PCC generic communications channel command and status fields and use
them in some drivers (Sudeep Holla).
* acpi-video:
ACPI: video: Add acpi_backlight=vendor quirk for Toshiba Portégé R100
ACPI: video: Add "vendor" quirks for 3 Lenovo x86 Android tablets
ACPI: video: Move Xiaomi Mi Pad 2 quirk to its own section
* acpi-prm:
ACPI: PRM: Annotate struct prm_module_info with __counted_by
* acpi-apei:
ACPI: APEI: Use ERST timeout for slow devices
ACPI: APEI: Fix AER info corruption when error status data has multiple sections
* acpi-pcc:
soc: kunpeng_hccs: Migrate to use generic PCC shmem related macros
hwmon: (xgene) Migrate to use generic PCC shmem related macros
i2c: xgene-slimpro: Migrate to use generic PCC shmem related macros
ACPI: PCC: Add PCC shared memory region command and status bitfields
mailbox: pcc: Support shared interrupt for multiple subspaces
mailbox: pcc: Add support for platform notification handling
|
|
Merge ACPI utilities updates, ACPI resource management updates, ACPI
device properties management updates and ACPI LPSS (Intel SoC) driver
update for 6.7-rc1:
- Rework acpi_handle_list handling so as to manage it dynamically,
including size computation (Rafael Wysocki).
- Clean up ACPI utilities code so as to make it follow the kernel
coding style (Jonathan Bergh).
- Consolidate IRQ trigger-type override DMI tables and drop .ident
values from dmi_system_id tables used for ACPI resources management
quirks (Hans de Goede).
- Add ACPI IRQ override for TongFang GMxXGxx (Werner Sembach).
- Allow _DSD buffer data only for byte accessors and document the _DSD
data buffer GUID (Andy Shevchenko).
- Drop BayTrail and Lynxpoint pinctrl device IDs from the ACPI LPSS
driver, because it does not need them (Raag Jadav).
* acpi-utils:
ACPI: utils: Remove redundant braces around individual statement
ACPI: utils: Fix up white space in a few places
ACPI: utils: Dynamically determine acpi_handle_list size
ACPI: thermal: Merge trip initialization functions
ACPI: thermal: Collapse trip devices update function wrappers
ACPI: thermal: Collapse trip devices update functions
ACPI: thermal: Add device list to struct acpi_thermal_trip
ACPI: thermal: Fix a small leak in acpi_thermal_add()
ACPI: thermal: Drop valid flag from struct acpi_thermal_trip
ACPI: thermal: Drop redundant trip point flags
ACPI: thermal: Untangle initialization and updates of active trips
ACPI: thermal: Untangle initialization and updates of the passive trip
ACPI: thermal: Simplify critical and hot trips representation
ACPI: thermal: Create and populate trip points table earlier
ACPI: thermal: Determine the number of trip points earlier
ACPI: thermal: Fold acpi_thermal_get_info() into its caller
ACPI: thermal: Simplify initialization of critical and hot trips
* acpi-resource:
ACPI: resource: Do IRQ override on TongFang GMxXGxx
ACPI: resource: Drop .ident values from dmi_system_id tables
ACPI: resource: Consolidate IRQ trigger-type override DMI tables
* acpi-property:
ACPI: property: Document the _DSD data buffer GUID
ACPI: property: Allow _DSD buffer data only for byte accessors
* acpi-soc:
ACPI: LPSS: drop BayTrail and Lynxpoint pinctrl HIDs
|
|
Commit ef8dd01538ea ("genirq/msi: Make interrupt allocation less
convoluted") reworked the code so that the x86-specific quirk for affinity
setting of non-maskable PCI/MSI interrupts is no longer activated when
necessary.
This could be solved by restoring the original logic in the core MSI code,
but after a deeper analysis it turned out that the quirk flag is not
required at all.
The quirk is only required when the PCI/MSI device cannot mask the MSI
interrupts, which in turn also prevents reservation mode from being enabled
for the affected interrupt.
This allows removing the NOMASK quirk bit completely as msi_set_affinity()
can instead check whether reservation mode is enabled for the interrupt,
which gives exactly the same answer.
Even in the currently non-existent case that reservation mode is not set
for a maskable MSI interrupt, this would not cause any harm as it would
just cause msi_set_affinity() to go needlessly through the functionally
equivalent slow path, which works perfectly fine with maskable interrupts
as well.
Rework msi_set_affinity() to query the reservation mode and remove all
NOMASK quirk logic from the core code.
[ tglx: Massaged changelog ]
Fixes: ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted")
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Koichiro Den <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next. Mostly
nf_tables updates with two patches for connlabel and br_netfilter.
1) Rename function name to perform on-demand GC for rbtree elements,
and replace async GC in rbtree by sync GC. Patches from Florian Westphal.
2) Use commit_mutex for NFT_MSG_GETRULE_RESET to ensure that two
concurrent threads invoking this command do not underrun stateful
objects. Patches from Phil Sutter.
3) Use single hook to deal with IP and ARP packets in br_netfilter.
Patch from Florian Westphal.
4) Use atomic_t in netns->connlabel use counter instead of using a
spinlock, also patch from Florian.
5) Cleanups for stateful objects infrastructure in nf_tables.
Patches from Phil Sutter.
6) Flush path uses the opaque set element offered by the iterator, instead
of calling pipapo_deactivate() which looks it up again.
7) The set backend .flush interface always succeeds, so make it return void
instead.
8) Add a struct nft_elem_priv placeholder structure and use it to replace
the void * that passes the opaque set element representation from backend
to frontend and defeats compiler type checks.
9) Shrink memory consumption of set element transactions, by reducing
struct nft_trans_elem object size and reducing stack memory usage.
10) Use struct nft_elem_priv for the set backend .insert operation too.
11) Carry reset flag in nft_set_dump_ctx structure, instead of passing it
as a function argument, from Phil Sutter.
netfilter pull request 23-10-25
* tag 'nf-next-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx
netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST
netfilter: nf_tables: shrink memory consumption of set elements
netfilter: nf_tables: expose opaque set element as struct nft_elem_priv
netfilter: nf_tables: set backend .flush always succeeds
netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush
netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx
netfilter: nf_tables: nft_obj_filter fits into cb->ctx
netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx
netfilter: nf_tables: A better name for nft_obj_filter
netfilter: nf_tables: Unconditionally allocate nft_obj_filter
netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj
netfilter: conntrack: switch connlabels to atomic_t
br_netfilter: use single forward hook for ip and arp
netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests
netfilter: nf_tables: Introduce nf_tables_getrule_single()
netfilter: nf_tables: Open-code audit log call in nf_tables_getrule()
netfilter: nft_set_rbtree: prefer sync gc to async worker
netfilter: nft_set_rbtree: rename gc deactivate+erase function
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Merge an ACPICA change for 6.7-rc1 which adds symbol definitions related
to CDAT (Dave Jiang).
* acpica:
ACPICA: Add defines for CDAT SSLBIS
|
|
Replace the old __attribute__((packed)) with the new __packed.
Only cleanup, no functional changes.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
The header file contains lots of outdated comments and definitions.
Drop those as cleanup.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
Replace the old __attribute__((packed)) with the new __packed.
Only cleanup, no functional changes.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
Replace the old __attribute__((packed)) with the new __packed.
Only cleanup, no functional changes.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
RTAX_FEATURE_ALLFRAG was added before the first git commit:
https://www.mail-archive.com/[email protected]/msg03399.html
The feature would send packets to the fragmentation path if a box
received a PMTU value of less than 1280 bytes. However, since commit
9d289715eb5c ("ipv6: stop sending PTB packets for MTU < 1280"), such
messages are simply discarded. The feature flag is not supported by the
iproute2 utility either. In theory one can still manipulate it with a
direct netlink message, but that is not ideal because it was based on the
obsolete guidance of RFC 2460 (since replaced by RFC 8200).
The feature always tests false at the moment, so remove the related code
or mark it as unused.
Signed-off-by: Yan Zhai <[email protected]>
Reviewed-by: Florian Westphal <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/d78e44dcd9968a252143ffe78460446476a472a1.1698156966.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
mbind(2) holds down_write of current task's mmap_lock throughout
(exclusive because it needs to set the new mempolicy on the vmas);
migrate_pages(2) holds down_read of pid's mmap_lock throughout.
They both hold mmap_lock across the internal migrate_pages(), under which
all new page allocations (huge or small) are made. I'm nervous about it;
and migrate_pages() certainly does not need mmap_lock itself. It's done
this way for mbind(2), because its page allocator is vma_alloc_folio() or
alloc_hugetlb_folio_vma(), both of which depend on vma and address.
Now that we have alloc_pages_mpol(), depending on (refcounted) memory
policy and interleave index, mbind(2) can be modified to use that or
alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the
internal migrate_pages() at all: add alloc_migration_target_by_mpol() to
replace mbind's new_page().
(After that change, alloc_hugetlb_folio_vma() is used by nothing but a
userfaultfd function: move it out of hugetlb.h and into the #ifdef.)
migrate_pages(2) has chosen its target node before migrating, so can
continue to use the standard alloc_migration_target(); but let it take and
drop mmap_lock just around migrate_to_node()'s queue_pages_range():
neither the node-to-node calculations nor the page migrations need it.
It seems unlikely, but it is conceivable that some userspace depends on
the kernel's mmap_lock exclusion here, instead of doing its own locking:
more likely in a testsuite than in real life. It is also possible, of
course, that some pages on the list will be munmapped by another thread
before they are migrated, or a newer memory policy applied to the range by
that time: but such races could happen before, as soon as mmap_lock was
dropped, so it does not appear to be a concern.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
principal actor for passing mempolicy choice down to __alloc_pages(),
rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).
vma_alloc_folio() and alloc_pages() remain, but as wrappers around
alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the
additional args to policy_nodemask(), which subsumes policy_node().
Cleanup throughout, cutting out some unhelpful "helpers".
It would all be much simpler without MPOL_INTERLEAVE, but that adds a
dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8bfd
("tmpfs: distribute interleave better across nodes"), which added ino bias
to the interleave, hidden from mm/mempolicy.c until this commit.
Hence "ilx" throughout, the "interleave index". Originally I thought it
could be done just with nid, but that's wrong: the nodemask may come from
the shared policy layer below a shmem vma, or it may come from the task
layer above a shmem vma; and without the final nodemask then nodeid cannot
be decided. And how ilx is applied depends also on page order.
The interleave index is almost always irrelevant unless MPOL_INTERLEAVE:
with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX
passed down from vma-less alloc_pages() is also used as hint not to use
THP-style hugepage allocation - to avoid the overhead of a hugepage arg
(though I don't understand why we never just added a GFP bit for THP - if
it actually needs a different allocation strategy from other pages of the
same order). vma_alloc_folio() still carries its hugepage arg here, but
it is not used, and should be removed when agreed.
get_vma_policy() no longer allows a NULL vma: over time I believe we've
eradicated all the places which used to need it e.g. swapoff and madvise
used to pass NULL vma to read_swap_cache_async(), but now know the vma.
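A rough sketch of the resulting call shape (hedged: the argument names
follow the text above, but the body is illustrative and not the actual
mm/mempolicy.c code):

  /* Sketch: vma_alloc_folio() reduced to a wrapper that resolves the
   * mempolicy and interleave index for (vma, addr), then defers to
   * alloc_pages_mpol(), the single funnel into __alloc_pages(). */
  struct folio *vma_alloc_folio(gfp_t gfp, int order,
                                struct vm_area_struct *vma,
                                unsigned long addr, bool hugepage)
  {
      pgoff_t ilx;
      struct mempolicy *pol = get_vma_policy(vma, addr, order, &ilx);
      struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order,
                                           pol, ilx, numa_node_id());

      mpol_cond_put(pol);
      return page ? page_folio(page) : NULL;
  }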
[[email protected]: handle NULL mpol being passed to __read_swap_cache_async()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
v3.8 commit b24f53a0bea3 ("mm: mempolicy: Add MPOL_MF_LAZY") introduced
MPOL_MF_LAZY, and included it in the MPOL_MF_VALID flags; but a720094ded8
("mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now")
immediately removed it from MPOL_MF_VALID flags, pending further review.
"This will need to be revisited", but it has not been reinstated.
The present state is confusing: there is dead code in mm/mempolicy.c to
handle MPOL_MF_LAZY cases which can never occur. Remove that: it can be
resurrected later if necessary. But keep the definition of MPOL_MF_LAZY,
which must remain in the UAPI, even though it always fails with EINVAL.
https://lore.kernel.org/linux-mm/[email protected]/
links to a previous request to remove MPOL_MF_LAZY.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Prefer the more explicit "pgoff_t" to "unsigned long" when dealing with a
shared mempolicy tree. Delete confusing comment about pseudo mm vmas.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Before getting down to work, do a little cleanup, mainly of inconsistent
variable naming. I gave up trying to rationalize mpol versus pol versus
policy, and node versus nid, but let's avoid p and nd. Remove a few
superfluous blank lines, but add one; and here prefer vma->vm_policy to
vma_policy(vma) - the latter being appropriate in other sources, which
have to allow for !CONFIG_NUMA. That intriguing line about KERNEL_DS?
should have gone in v2.6.15, when numa_policy_init() stopped using
set_mempolicy(2)'s system call handler.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mempolicy: cleanups leading to NUMA mpol without vma", v2.
Mostly cleanups in mm/mempolicy.c, but finally removing the pseudo-vma
from shmem folio allocation, and removing the mmap_lock around folio
migration for mbind and migrate_pages syscalls.
This patch (of 12):
hugetlbfs_fallocate() goes through the motions of pasting a shared NUMA
mempolicy onto its pseudo-vma, but how could there ever be a shared NUMA
mempolicy for this file? hugetlb_vm_ops has never offered a set_policy
method, and hugetlbfs_parse_param() has never supported any mpol options
for a mount-wide default policy.
It's just an illusion: clean it away so as not to confuse others, giving
us more freedom to adjust shmem's set_policy/get_policy implementation.
But hugetlbfs_inode_info is still required, just to accommodate seals.
Yes, shared NUMA mempolicy support could be added to hugetlbfs, with a
set_policy method and/or mpol mount option (Andi's first posting did
include an admitted-unsatisfactory hugetlb_set_policy()); but it seems
that nobody has bothered to add that in the nineteen years since v2.6.7
made it possible, and there is at least one company that has invested
enough into hugetlbfs, that I guess they have learnt well enough how to
manage its NUMA, without needing shared mempolicy.
Remove linux/mempolicy.h from linux/hugetlb.h: include linux/pagemap.h in
its place, because hugetlb.h's recently added use of filemap_lock_folio()
requires that (although most .configs and .c's get it in some other way).
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "avoid divide-by-zero due to max_nr_accesses overflow".
The maximum nr_accesses of a given DAMON context can be calculated by
dividing the aggregation interval by the sampling interval. Some logic
in DAMON uses the maximum nr_accesses as a divisor. Hence, the value
shouldn't be zero. Such a case is avoided since DAMON prevents the
aggregation interval from being set smaller than the sampling interval.
However, since nr_accesses is unsigned int while the intervals are
unsigned long, the maximum nr_accesses could become zero due to the cast.
Avoid the divide-by-zero by implementing a function that handles the
corner case (first patch), and replacing the vulnerable direct maximum
nr_accesses calculations (remaining patches).
Note that the patches for the replacements are split per broken commit,
to make backporting to the required trees easier. In particular, the last
patch is for a commit that has not yet been merged into the mainline but
is in the mm tree.
This patch (of 4):
The maximum nr_accesses of a given DAMON context can be calculated by
dividing the aggregation interval by the sampling interval. Some logic
in DAMON uses the maximum nr_accesses as a divisor. Hence, the value
shouldn't be zero. Such a case is avoided since DAMON prevents the
aggregation interval from being set smaller than the sampling interval.
However, since nr_accesses is unsigned int while the intervals are
unsigned long, the maximum nr_accesses could become zero due to the cast.
Implement a function that handles the corner case.
Note that this commit is not fixing the real issue, since it only
introduces the safe function that will replace the problematic divisions.
The replacements will be made by follow-up commits, to make backporting
to stable series easier.
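A small userspace model of the corner case and the kind of clamped helper
described above (names and the exact clamp are assumptions for
illustration; 64-bit unsigned long is assumed):

  #include <limits.h>
  #include <stdio.h>

  /* Sketch: dividing two unsigned longs and truncating to unsigned int
   * can yield 0 (e.g. when the quotient is a multiple of 2^32); clamp
   * the result so callers never see a zero divisor. */
  static unsigned int max_nr_accesses(unsigned long aggr_interval,
                                      unsigned long sample_interval)
  {
      unsigned long max = aggr_interval / sample_interval;

      return max > UINT_MAX ? UINT_MAX : (unsigned int)max;
  }

  int main(void)
  {
      /* A naive cast would truncate 2^32 to 0; the helper clamps it. */
      printf("%u\n", max_nr_accesses(1UL << 42, 1UL << 10));
      return 0;
  }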
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 198f0f4c58b9 ("mm/damon/vaddr,paddr: support pageout prioritization")
Signed-off-by: SeongJae Park <[email protected]>
Reported-by: Jakub Acs <[email protected]>
Cc: <[email protected]> [5.16+]
Signed-off-by: Andrew Morton <[email protected]>
|
|
Also remove count_memcg_page_event now that its last caller no longer uses
it, and rename hpage_collapse_alloc_page() to hpage_collapse_alloc_folio().
This removes one call to compound_head() and helps convert khugepaged to
use folios throughout.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add nr_split to trace_mm_migrate_pages for large folio (including THP)
split events.
[[email protected]: cleanup per Huang, Ying]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since kmemleak_alloc_phys() rather than kmemleak_alloc() was called from
memblock_alloc_range_nid(), kmemleak_free_part_phys() should be used to
delete the kmemleak object in free_bootmem_page(). In debug mode, there is
the following warning:
kmemleak: Partially freeing unknown object at 0xffff97345aff7000 (size 4096)
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 028725e73375 ("bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page")
Signed-off-by: Liu Shixin <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Patrick Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since all calls use folio_xchg_last_cpupid(), remove
page_cpupid_xchg_last().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Make finish_mkwrite_fault static since it is not used outside of
memory.c.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add folio_xchg_last_cpupid() wrapper, which is required to convert
page_cpupid_xchg_last() to the folio version later in the series.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since all calls use folio_xchg_access_time(), remove
xchg_page_access_time().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add folio_xchg_access_time() wrapper, which is required to convert
xchg_page_access_time() to the folio version later in the series.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since all calls use folio_last_cpupid(), remove page_cpupid_last().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add folio_last_cpupid() wrapper, which is required to convert
page_cpupid_last() to the folio version later in the series.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: convert page cpupid functions to folios", v3.
The cpupid (or access time) used by NUMA balancing is stored in the page's
flags or in _last_cpupid (if LAST_CPUPID_NOT_IN_PAGE_FLAGS). This series
converts the page cpupid to a folio cpupid: a new _last_cpupid is added to
struct folio, which lets us use folio->_last_cpupid directly, and the page
cpupid functions are converted to folio ones.
page_cpupid_last() -> folio_last_cpupid()
xchg_page_access_time() -> folio_xchg_access_time()
page_cpupid_xchg_last() -> folio_xchg_last_cpupid()
This patch (of 19):
If WANT_PAGE_VIRTUAL and LAST_CPUPID_NOT_IN_PAGE_FLAGS are defined, the
'virtual' and '_last_cpupid' fields are in struct page. Since _last_cpupid
is used by the NUMA balancing feature, it is better to move it ahead of
the KMSAN metadata in struct page, and also to add both fields to struct
folio so that we can access them from the folio directly.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Reimplement get_obj_cgroup_from_current() using current_obj_cgroup().
get_obj_cgroup_from_current() and current_obj_cgroup() share 80% of the
code, so the new implementation is almost trivial.
get_obj_cgroup_from_current() is a convenient function used by the
bpf subsystem, so there is no reason to get rid of it completely.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin (Cruise) <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naresh Kamboju <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Switch to a scope-based protection of the objcg pointer on slab/kmem
allocation paths. Instead of using the get_() semantics in the
pre-allocation hook and putting the reference afterwards, let's rely on the
fact that objcg is pinned by the scope.
It's possible because:
1) if the objcg is received from the current task struct, the task is
keeping a reference to the objcg.
2) if the objcg is received from an active memcg (remote charging),
the memcg is pinned by the scope and has a reference to the
corresponding objcg.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin (Cruise) <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Keep a reference to the original objcg object for the entire life of a
memcg structure.
This allows simplifying the synchronization on the kernel memory
allocation paths: pinning a (live) memcg will also pin the corresponding
objcg.
The memory overhead of this change is minimal because object cgroups
usually outlive their corresponding memory cgroups even without this
change, so it's only an additional pointer per memcg.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin (Cruise) <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
To charge a freshly allocated kernel object to a memory cgroup, the kernel
needs to obtain an objcg pointer. Currently it does it indirectly by
obtaining the memcg pointer first and then calling
__get_obj_cgroup_from_memcg().
Usually tasks spend their entire life belonging to the same object cgroup.
So it makes sense to save the objcg pointer on task_struct directly, so
it can be obtained faster. It requires some work on fork, exit and cgroup
migrate paths, but these paths are way colder.
To avoid any costly synchronization the following rules are applied:
1) A task sets its objcg pointer itself.
2) If a task is being migrated to another cgroup, the least
significant bit of the objcg pointer is set atomically.
3) On the allocation path the objcg pointer is obtained locklessly
using the READ_ONCE() macro and the least significant bit is
checked. If it's set, the following procedure is used to update
it locklessly:
- task->objcg is zeroed using cmpxchg
- new objcg pointer is obtained
- task->objcg is updated using try_cmpxchg
- the operation is repeated if try_cmpxchg fails
This guarantees that no updates will be lost if task migration races
against an objcg pointer update. It also allows keeping both read and
write paths fully lockless.
Because the task is keeping a reference to the objcg, it can't go away
while the task is alive.
This commit doesn't change the way the remote memcg charging works.
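To make the procedure above concrete, here is a small generic model using
C11 atomics (hedged: it mirrors the rules in this changelog but is not the
actual memcontrol.c code; names are illustrative and reference counting is
omitted):

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define OBJCG_MIGRATION_PENDING 0x1UL   /* least significant pointer bit */

  struct objcg { int id; };               /* opaque placeholder */
  static struct objcg example_objcg = { .id = 42 };

  /* Stand-in for "obtain the objcg of the task's current memcg". */
  static struct objcg *lookup_current_objcg(void)
  {
      return &example_objcg;
  }

  static struct objcg *current_objcg(_Atomic(uintptr_t) *task_objcg)
  {
      uintptr_t old;
      struct objcg *objcg;

      do {
          old = atomic_load(task_objcg);
          if (!(old & OBJCG_MIGRATION_PENDING))
              return (struct objcg *)old;      /* fast, fully lockless */
          /* task->objcg is zeroed using cmpxchg */
          atomic_compare_exchange_strong(task_objcg, &old, 0);
          /* new objcg pointer is obtained */
          objcg = lookup_current_objcg();
          old = 0;
          /* task->objcg is updated using try_cmpxchg; repeat on failure */
      } while (!atomic_compare_exchange_strong(task_objcg, &old,
                                               (uintptr_t)objcg));
      return objcg;
  }

  int main(void)
  {
      _Atomic(uintptr_t) task_objcg =
          (uintptr_t)&example_objcg | OBJCG_MIGRATION_PENDING;

      printf("objcg id: %d\n", current_objcg(&task_objcg)->id);
      return 0;
  }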
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin (Cruise) <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In current PCP auto-tuning design, if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become the
maximal value even if the allocating/freeing depth is small, for example,
in the sender of network workloads. If a CPU was originally used as a
sender and is then used as a receiver after context switching, we need to
fill the whole PCP up to the maximal high before triggering PCP draining
for consecutive high-order freeing. This will hurt the performance of some
network workloads.
To solve the issue, in this patch, we track consecutive page freeing with
a counter instead of relying on PCP draining. So, we can detect
consecutive page freeing much earlier.
On a 2-socket Intel server with 128 logical CPU, we tested
SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.
With the patch, the network bandwidth improves 5.0%. This restores the
performance drop caused by PCP auto-tuning.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
One target of PCP is to minimize the pages in PCP if the system's free
pages are too few. To reach that target, when page reclaiming is active for the
zone (ZONE_RECLAIM_ACTIVE), we will stop increasing PCP high in allocating
path, decrease PCP high and free some pages in freeing path. But this may
be too late because the background page reclaiming may introduce latency
for some workloads. So, in this patch, during page allocation we will
detect whether the number of free pages of the zone is below high
watermark. If so, we will stop increasing PCP high in allocating path,
decrease PCP high and free some pages in freeing path. With this, we can
reduce the possibility of the premature background page reclaiming caused
by too large PCP.
The high watermark checking is done in allocating path to reduce the
overhead in hotter freeing path.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The target to tune PCP high automatically is as follows,
- Minimize allocation/freeing from/to shared zone
- Minimize idle pages in PCP
- Minimize pages in PCP if the system's free pages are too few
To reach these target, a tuning algorithm as follows is designed,
- When we refill PCP via allocating from the zone, increase PCP high.
Because if we had larger PCP, we could avoid to allocate from the
zone.
- In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
decrease PCP high to try to free possible idle PCP pages.
- When page reclaiming is active for the zone, stop increasing PCP
high in allocating path, decrease PCP high and free some pages in
freeing path.
So, the PCP high can be tuned to the page allocating/freeing depth of
workloads eventually.
One issue of the algorithm is that if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become the
maximal value even if the allocating/freeing depth is small. But this
isn't a severe issue, because there are no idle pages in this case.
One alternative choice is to increase PCP high when we drain PCP via
trying to free pages to the zone, but don't increase PCP high during PCP
refilling. This can avoid the issue above. But if the number of pages
allocated is much less than that of pages freed on a CPU, there will be
many idle pages in PCP and it is hard to free these idle pages.
1/8 (>> 3) of PCP high will be decreased periodically. The value 1/8 is
kind of arbitrary. Just to make sure that the idle PCP pages will be
freed eventually.
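As a toy illustration of that periodic 1/8 decay (a sketch only, not the
kernel's actual helper):

  /* Sketch: trim pcp->high by 1/8 per period, never below high_min, so
   * idle PCP pages are eventually freed back to the zone. */
  static unsigned int decay_pcp_high(unsigned int high, unsigned int high_min)
  {
      unsigned int step = (high >> 3) ? (high >> 3) : 1;

      if (high <= high_min)
          return high_min;
      return (high - step > high_min) ? high - step : high_min;
  }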
On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup. This simulates the
kbuild server that is used by 0-Day kbuild service. With the patch, the
build time decreases 3.5%. The cycles% of the spinlock contention (mostly
for zone lock) decreases from 11.0% to 0.5%. The number of PCP draining
for high order pages freeing (free_high) decreases 65.6%. The number of
pages allocated from zone (instead of from PCP) decreases 83.9%.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Mel Gorman <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The page allocation performance requirements of different workloads are
usually different. So, we need to tune PCP (per-CPU pageset) high to
optimize the workload page allocation performance. Now, we have a system
wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand.
But, it's hard to find out the best value by hand. And one global
configuration may not work best for the different workloads that run on
the same system. One solution to these issues is to tune PCP high of each
CPU automatically.
This patch adds the framework for PCP high auto-tuning. With it,
pcp->high of each CPU will be changed automatically by tuning algorithm at
runtime. The minimal high (pcp->high_min) is the original PCP high value
calculated based on the low watermark pages. While the maximal high
(pcp->high_max) is the PCP high value when percpu_pagelist_high_fraction
sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION. That is, the
maximal pcp->high that can be set via sysctl knob by hand.
It's possible that PCP high auto-tuning doesn't work well for some
workloads. So, when PCP high is tuned by hand via the sysctl knob, the
auto-tuning will be disabled. The PCP high set by hand will be used
instead.
This patch only adds the framework, so pcp->high will be set to
pcp->high_min (original default) always. We will add actual auto-tuning
algorithm in the following patches in the series.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When a task is allocating a large number of order-0 pages, it may acquire
the zone->lock multiple times allocating pages in batches. This may
unnecessarily contend on the zone lock when allocating a very large number
of pages. This patch adapts the size of the batch based on the recent
pattern to scale the batch size for subsequent allocations.
On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup. This simulates the
kbuild server that is used by 0-Day kbuild service. With the patch, the
cycles% of the spinlock contention (mostly for zone lock) decreases from
12.6% to 11.0% (with PCP size == 367).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Mel Gorman <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages
on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
PCP is mostly used for high-order pages freeing to improve the cache-hot
pages reusing between page allocating and freeing CPUs.
On a system with a small per-CPU data cache slice, pages shouldn't be
cached before draining, to guarantee they remain cache-hot. But on a
system with a large per-CPU data cache slice, some pages can be cached
before draining to reduce zone lock contention.
So, in this patch, instead of draining without any caching, "pcp->batch"
pages will be cached in PCP before draining if the size of the per-CPU
data cache slice is more than "3 * batch".
In theory, if the size of per-CPU data cache slice is more than "2 *
batch", we can reuse cache-hot pages between CPUs. But considering the
other usage of cache (code, other data accessing, etc.), "3 * batch" is
used.
Note: "3 * batch" is chosen to make sure the optimization works on recent
x86_64 server CPUs. If you want to increase it, please check whether it
breaks the optimization.
On a 2-socket Intel server with 128 logical CPU, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite
with 16-pair processes increase 70.5%. The cycles% of the spinlock
contention (mostly for zone lock) decreases from 46.1% to 21.3%. The
number of PCP draining for high order pages freeing (free_high) decreases
89.9%. The cache miss rate keeps 0.2%.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This can be used to estimate the size of the data cache slice that can be
used by one CPU under ideal circumstances. Both DATA caches and UNIFIED
caches are used in calculation. So, the users need to consider the impact
of the code cache usage.
Because the cache inclusive/non-inclusive information isn't available now,
we just use the size of the per-CPU slice of LLC to make the result more
predictable across architectures. This may be improved when more cache
information is available in the future.
A brute-force algorithm that iterates over all online CPUs is used to
avoid allocating an extra cpumask, especially in the offline callback.
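For intuition, a userspace approximation of the same estimate, dividing
the LLC size by the number of CPUs sharing it (a rough sketch: the sysfs
paths, index choice and parsing are simplifying assumptions, not the
kernel's cacheinfo code):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Count CPUs in a sysfs cpu list such as "0-35,72-107". */
  static long count_cpus_in_list(const char *list)
  {
      char buf[256];
      long count = 0;

      strncpy(buf, list, sizeof(buf) - 1);
      buf[sizeof(buf) - 1] = '\0';
      for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
          long a, b;

          if (sscanf(tok, "%ld-%ld", &a, &b) == 2)
              count += b - a + 1;
          else if (sscanf(tok, "%ld", &a) == 1)
              count += 1;
      }
      return count;
  }

  int main(void)
  {
      char size_str[64], cpus[256];
      FILE *f;

      /* index3 is typically the unified LLC on x86_64 servers. */
      f = fopen("/sys/devices/system/cpu/cpu0/cache/index3/size", "r");
      if (!f || !fgets(size_str, sizeof(size_str), f))
          return 1;
      fclose(f);
      f = fopen("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list", "r");
      if (!f || !fgets(cpus, sizeof(cpus), f))
          return 1;
      fclose(f);

      long kib = atol(size_str);              /* e.g. "36608K" -> 36608 */
      long sharers = count_cpus_in_list(cpus);

      if (sharers > 0)
          printf("per-CPU LLC slice: ~%ld KiB\n", kib / sharers);
      return 0;
  }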
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|