Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
"This is just a few documentation fixes for percpu refcount and bitmap
helpers that went in v5.6, and moving my emails to all be at korg"
* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: update copyright emails to dennis@kernel.org
include/bitmap.h: add new functions to documentation
include/bitmap.h: add missing parameter in docs
percpu_ref: Fix comment regarding percpu_ref_init flags
|
|
Pull kvm updates from Paolo Bonzini:
"ARM:
- GICv4.1 support
- 32bit host removal
PPC:
- secure (encrypted) using under the Protected Execution Framework
ultravisor
s390:
- allow disabling GISA (hardware interrupt injection) and protected
VMs/ultravisor support.
x86:
- New dirty bitmap flag that sets all bits in the bitmap when dirty
page logging is enabled; this is faster because it doesn't require
bulk modification of the page tables.
- Initial work on making nested SVM event injection more similar to
VMX, and less buggy.
- Various cleanups to MMU code (though the big ones and related
optimizations were delayed to 5.8). Instead of using cr3 in
function names which occasionally means eptp, KVM too has
standardized on "pgd".
- A large refactoring of CPUID features, which now use an array that
parallels the core x86_features.
- Some removal of pointer chasing from kvm_x86_ops, which will also
be switched to static calls as soon as they are available.
- New Tigerlake CPUID features.
- More bugfixes, optimizations and cleanups.
Generic:
- selftests: cleanups, new MMU notifier stress test, steal-time test
- CSV output for kvm_stat"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (277 commits)
x86/kvm: fix a missing-prototypes "vmread_error"
KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y
KVM: VMX: Add a trampoline to fix VMREAD error handling
KVM: SVM: Annotate svm_x86_ops as __initdata
KVM: VMX: Annotate vmx_x86_ops as __initdata
KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup()
KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection
KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes
KVM: VMX: Configure runtime hooks using vmx_x86_ops
KVM: VMX: Move hardware_setup() definition below vmx_x86_ops
KVM: x86: Move init-only kvm_x86_ops to separate struct
KVM: Pass kvm_init()'s opaque param to additional arch funcs
s390/gmap: return proper error code on ksm unsharing
KVM: selftests: Fix cosmetic copy-paste error in vm_mem_region_move()
KVM: Fix out of range accesses to memslots
KVM: X86: Micro-optimize IPI fastpath delay
KVM: X86: Delay read msr data iff writes ICR MSR
KVM: PPC: Book3S HV: Add a capability for enabling secure guests
KVM: arm64: GICv4.1: Expose HW-based SGIs in debugfs
KVM: arm64: GICv4.1: Allow non-trapping WFI when using HW SGIs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"Just a couple of updates for linux-5.7:
- A new Kconfig option to enable IMA architecture specific runtime
policy rules needed for secure and/or trusted boot, as requested.
- Some message cleanup (eg. pr_fmt, additional error messages)"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: add a new CONFIG for loading arch-specific policies
integrity: Remove duplicate pr_fmt definitions
IMA: Add log statements for failure conditions
IMA: Update KBUILD_MODNAME for IMA files to ima
|
|
Merge updates from Andrew Morton:
"A large amount of MM, plenty more to come.
Subsystems affected by this patch series:
- tools
- kthread
- kbuild
- scripts
- ocfs2
- vfs
- mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
hugetlbfs, hugetlb"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
selftests/vm: fix map_hugetlb length used for testing read and write
mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
mm/hugetlb.c: clean code by removing unnecessary initialization
hugetlb_cgroup: add hugetlb_cgroup reservation docs
hugetlb_cgroup: add hugetlb_cgroup reservation tests
hugetlb: support file_region coalescing again
hugetlb_cgroup: support noreserve mappings
hugetlb_cgroup: add accounting for shared mappings
hugetlb: disable region_add file_region coalescing
hugetlb_cgroup: add reservation accounting for private mappings
mm/hugetlb_cgroup: fix hugetlb_cgroup migration
hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
hugetlb_cgroup: add hugetlb_cgroup reservation counter
hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
mm/memblock.c: remove redundant assignment to variable max_addr
mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pathwalk sanitizing from Al Viro:
"Massive pathwalk rewrite and cleanups.
Several iterations have been posted; hopefully this thing is getting
readable and understandable now. Pretty much all parts of pathname
resolutions are affected...
The branch is identical to what has sat in -next, except for commit
message in "lift all calls of step_into() out of follow_dotdot/
follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
commit message changed there."
* 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
lookup_open(): don't bother with fallbacks to lookup+create
atomic_open(): no need to pass struct open_flags anymore
open_last_lookups(): move complete_walk() into do_open()
open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
open_last_lookups(): consolidate fsnotify_create() calls
take post-lookup part of do_last() out of loop
link_path_walk(): sample parent's i_uid and i_mode for the last component
__nd_alloc_stack(): make it return bool
reserve_stack(): switch to __nd_alloc_stack()
pick_link(): take reserving space on stack into a new helper
pick_link(): more straightforward handling of allocation failures
fold path_to_nameidata() into its only remaining caller
pick_link(): pass it struct path already with normal refcounting rules
fs/namei.c: kill follow_mount()
non-RCU analogue of the previous commit
helper for mount rootwards traversal
follow_dotdot(): be lazy about changing nd->path
follow_dotdot_rcu(): be lazy about changing nd->path
follow_dotdot{,_rcu}(): massage loops
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exec/proc updates from Eric Biederman:
"This contains two significant pieces of work: the work to sort out
proc_flush_task, and the work to solve a deadlock between strace and
exec.
Fixing proc_flush_task so that it no longer requires a persistent
mount makes improvements to proc possible. The removal of the
persistent mount solves an old regression that that caused the hidepid
mount option to only work on remount not on mount. The regression was
found and reported by the Android folks. This further allows Alexey
Gladkov's work making proc mount options specific to an individual
mount of proc to move forward.
The work on exec starts solving a long standing issue with exec that
it takes mutexes of blocking userspace applications, which makes exec
extremely deadlock prone. For the moment this adds a second mutex with
a narrower scope that handles all of the easy cases. Which makes the
tricky cases easy to spot. With a little luck the code to solve those
deadlocks will be ready by next merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
signal: Extend exec_id to 64bits
pidfd: Use new infrastructure to fix deadlocks in execve
perf: Use new infrastructure to fix deadlocks in execve
proc: io_accounting: Use new infrastructure to fix deadlocks in execve
proc: Use new infrastructure to fix deadlocks in execve
kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
kernel: doc: remove outdated comment cred.c
mm: docs: Fix a comment in process_vm_rw_core
selftests/ptrace: add test cases for dead-locks
exec: Fix a deadlock in strace
exec: Add exec_update_mutex to replace cred_guard_mutex
exec: Move exec_mmap right after de_thread in flush_old_exec
exec: Move cleanup of posix timers on exec out of de_thread
exec: Factor unshare_sighand out of de_thread and call it separately
exec: Only compute current once in flush_old_exec
pid: Improve the comment about waiting in zap_pid_ns_processes
proc: Remove the now unnecessary internal mount of proc
uml: Create a private mount of proc for mconsole
uml: Don't consult current to find the proc_mnt in mconsole_proc
proc: Use a list of inodes to flush from proc
...
|
|
It's even more important to check that we don't have a tail page when
calling hpage_nr_pages() when THP are disabled.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When CONFIG_HUGETLB_PAGE is set but not CONFIG_HUGETLBFS, the following
build failure is encoutered:
In file included from arch/powerpc/mm/fault.c:33:0:
include/linux/hugetlb.h: In function 'hstate_inode':
include/linux/hugetlb.h:477:9: error: implicit declaration of function 'HUGETLBFS_SB' [-Werror=implicit-function-declaration]
return HUGETLBFS_SB(i->i_sb)->hstate;
^
include/linux/hugetlb.h:477:30: error: invalid type argument of '->' (have 'int')
return HUGETLBFS_SB(i->i_sb)->hstate;
^
Gate hstate_inode() with CONFIG_HUGETLBFS instead of CONFIG_HUGETLB_PAGE.
Fixes: a137e1cc6d6e ("hugetlbfs: per mount huge page sizes")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Link: http://lkml.kernel.org/r/7e8c3a3c9a587b9cd8a2f146df32a421b961f3a2.1584432148.git.christophe.leroy@c-s.fr
Link: https://patchwork.ozlabs.org/patch/1255548/#2386036
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.
After a call to region_chg, we charge the approprate hugetlb_cgroup, and
if successful, we pass on the hugetlb_cgroup info to a follow up
region_add call. When a file_region entry is added to the resv_map via
region_add, we put the pointer to that cgroup in
file_region->reservation_counter. If charging doesn't succeed, we report
the error to the caller, so that the kernel fails the reservation.
On region_del, which is when the hugetlb memory is unreserved, we also
uncharge the file_region->reservation_counter.
[akpm@linux-foundation.org: forward declare struct file_region]
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Normally the pointer to the cgroup to uncharge hangs off the struct page,
and gets queried when it's time to free the page. With hugetlb_cgroup
reservations, this is not possible. Because it's possible for a page to
be reserved by one task and actually faulted in by another task.
The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code patch to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.
For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in private mapping, the tasks
hugetlb_cgroup is charged and the hugetlb_cgroup is placed is
resv_map->reservation_counter.
On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
[akpm@linux-foundation.org: forward declare struct resv_map]
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage
or hugetlb reservation counter.
Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.
Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
These counters will track hugetlb reservations rather than hugetlb memory
faulted in. This patch only adds the counter, following patches add the
charging and uncharging of the counter.
This is patch 1 of an 9 patch series.
Problem:
Currently tasks attempting to reserve more hugetlb memory than is
available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
Reservations [1]. However, if a task attempts to reserve more hugetlb
memory than its hugetlb_cgroup limit allows, the kernel will allow the
mmap/shmget call, but will SIGBUS the task when it attempts to fault in
the excess memory.
We have users hitting their hugetlb_cgroup limits and thus we've been
looking at this failure mode. We'd like to improve this behavior such
that users violating the hugetlb_cgroup limits get an error on mmap/shmget
time, rather than getting SIGBUS'd when they try to fault the excess
memory in. This gives the user an opportunity to fallback more gracefully
to non-hugetlbfs memory for example.
The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
enforcing the hugetlb_cgroup limit only happens at fault time, and the
offending task gets SIGBUS'd.
Proposed Solution:
A new page counter named
'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
slightly different semantics than
'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
- While usage_in_bytes tracks all *faulted* hugetlb memory,
rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than rsvd.limit_in_bytes, the kernel will fail this
reservation.
This proposal is implemented in this patch series, with tests to verify
functionality and show the usage.
Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to the
existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
duplication with hugetlb_cgroup. Keeping hugetlb related page counters
under hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that
modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
accounting at reservation time rather than fault time. Adding a new
page_counter seems better as userspace could, if it wants, choose to
enforce different cgroups differently: one via limit_in_bytes, and
another via rsvd.limit_in_bytes. This could be very useful if you're
transitioning how hugetlb memory is partitioned on your system one
cgroup at a time, for example. Also, someone may find usage for both
limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
gives them the option to do so.
Testing:
- Added tests passing.
- Used libhugetlbfs for regression testing.
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:
1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.
A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.
To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.
To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.
The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.
[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
This patch (of 2):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.
mapping->i_mmap_rwsem
hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
page->flags PG_locked (lock_page)
To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.
In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
vma_migratable() is called to check if pages in vma can be migrated before
go ahead to further actions. Currently it is used in below code path:
- task_numa_work
- mbind
- move_pages
For hugetlb mapping, whether vma is migratable or not is determined by:
- CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
- arch_hugetlb_migration_supported
Issue: current code only checks for CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
alone, and no code should use it directly. (note that current code in
vma_migratable don't cause failure or bug because
unmap_and_move_huge_page() will catch unsupported hugepage and handle it
properly)
This patch checks the two factors by hugepage_migration_supported for
impoving code logic and robustness. It will enable early bail out of
hugepage migration procedure, but because currently all architecture
supporting hugepage migration is able to support all page size, we would
not see performance gain with this patch applied.
vma_migratable() is moved to mm/mempolicy.c, because of the circular
reference of mempolicy.h and hugetlb.h cause defining it as inline not
feasible.
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: http://lkml.kernel.org/r/1579786179-30633-1-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "fix the missing underflow in memory operation function", v4.
The patchset helps to produce a KASAN report when size is negative in
memory operation functions. It is helpful for programmer to solve an
undefined behavior issue. Patch 1 based on Dmitry's review and
suggestion, patch 2 is a test in order to verify the patch 1.
[1]https://bugzilla.kernel.org/show_bug.cgi?id=199341
[2]https://lore.kernel.org/linux-arm-kernel/20190927034338.15813-1-walter-zh.wu@mediatek.com/
This patch (of 2):
KASAN missed detecting size is a negative number in memset(), memcpy(),
and memmove(), it will cause out-of-bounds bug. So needs to be detected
by KASAN.
If size is a negative number, then it has a reason to be defined as
out-of-bounds bug type. Casting negative numbers to size_t would indeed
turn up as a large size_t and its value will be larger than ULONG_MAX/2,
so that this can qualify as out-of-bounds.
KASAN report is shown below:
BUG: KASAN: out-of-bounds in kmalloc_memmove_invalid_size+0x70/0xa0
Read of size 18446744073709551608 at addr ffffff8069660904 by task cat/72
CPU: 2 PID: 72 Comm: cat Not tainted 5.4.0-rc1-next-20191004ajb-00001-gdb8af2f372b2-dirty #1
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x288
show_stack+0x14/0x20
dump_stack+0x10c/0x164
print_address_description.isra.9+0x68/0x378
__kasan_report+0x164/0x1a0
kasan_report+0xc/0x18
check_memory_region+0x174/0x1d0
memmove+0x34/0x88
kmalloc_memmove_invalid_size+0x70/0xa0
[1] https://bugzilla.kernel.org/show_bug.cgi?id=199341
[cai@lca.pw: fix -Wdeclaration-after-statement warn]
Link: http://lkml.kernel.org/r/1583509030-27939-1-git-send-email-cai@lca.pw
[peterz@infradead.org: fix objtool warning]
Link: http://lkml.kernel.org/r/20200305095436.GV2596@hirez.programming.kicks-ass.net
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Link: http://lkml.kernel.org/r/20191112065302.7015-1-walter-zh.wu@mediatek.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After introducing mem sub section concept, pfn_present() loses its literal
meaning, and will not be necessary a truth on partial populated mem
section.
Since all of the callers use it to judge an absent section, it is better
to rename pfn_present() as pfn_in_present_section().
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Leonardo Bras <leonardo@linux.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Link: http://lkml.kernel.org/r/1581919110-29575-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: mmap: add mmap trace point", v3.
Create mmap trace file and add trace point of vm_unmapped_area().
This patch (of 2):
In preparation for next patch remove inline of vm_unmapped_area and move
code to mmap.c. There is no logical change.
Also remove unmapped_area[_topdown] out of mm.h, there is no code
calling to them.
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20200320055823.27089-2-jaewon31.kim@samsung.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to retry once. We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time. This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle the
page fault on a single page. However that should hardly happen, and after
all for each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
now can retry the page fault for multiple times if necessary without the
need to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault allows to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In existing code we have multiple places that has taken special care of
the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
even the 2nd try will have the ALLOW_RETRY set, then use that helper in
all existing special paths. One example is in __lock_page_or_retry(), now
we'll drop the mmap_sem only in the first attempt of page fault and we'll
keep it in follow up retries, so old locking behavior will be retained.
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
handle_userfaultfd() is currently the only one place in the kernel page
fault procedures that can respond to non-fatal userspace signals. It was
trying to detect such an allowance by checking against USER & KILLABLE
flags, which was "un-official".
In this patch, we introduced a new flag (FAULT_FLAG_INTERRUPTIBLE) to show
that the fault handler allows the fault procedure to respond even to
non-fatal signals. Meanwhile, add this new flag to the default fault
flags so that all the page fault handlers can benefit from the new flag.
With that, replacing the userfault check to this one.
Since the line is getting even longer, clean up the fault flags a bit too
to ease TTY users.
Although we've got a new flag and applied it, we shouldn't have any
functional change with this patch so far.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220195348.16302-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Although there're tons of arch-specific page fault handlers, most of them
are still sharing the same initial value of the page fault flags. Say,
merely all of the page fault handlers would allow the fault to be retried,
and they also allow the fault to respond to SIGKILL.
Let's define a default value for the fault flags to replace those initial
page fault flags that were copied over. With this, it'll be far easier to
introduce new fault flag that can be used by all the architectures instead
of touching all the archs.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160238.9694-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The idea comes from the upstream discussion between Linus and Andrea:
https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
A summary to the issue: there was a special path in handle_userfault() in
the past that we'll return a VM_FAULT_NOPAGE when we detected non-fatal
signals when waiting for userfault handling. We did that by reacquiring
the mmap_sem before returning. However that brings a risk in that the
vmas might have changed when we retake the mmap_sem and even we could be
holding an invalid vma structure.
This patch is a preparation of removing that special path by allowing the
page fault to return even faster if we were interrupted by a non-fatal
signal during a user-mode page fault handling routine.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160230.9598-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For most architectures, we've got a quick path to detect fatal signal
after a handle_mm_fault(). Introduce a helper for that quick path.
It cleans the current codes a bit so we don't need to duplicate the same
check across archs. More importantly, this will be an unified place that
we handle the signal immediately right after an interrupted page fault, so
it'll be much easier for us if we want to change the behavior of handling
signals later on for all the archs.
Note that currently only part of the archs are using this new helper,
because some archs have their own way to handle signals. In the follow up
patches, we'll try to apply this helper to all the rest of archs.
Another note is that the "regs" parameter in the new helper is not used
yet. It'll be used very soon. Now we kept it in this patch only to avoid
touching all the archs again in the follow up patches.
[peterx@redhat.com: fix sparse warnings]
Link: http://lkml.kernel.org/r/20200311145921.GD479302@xz-x1
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220155353.8676-4-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The documentation currently does not include the deathless prose written
to describe functions in pagemap.h because it's not included in any rst
file. Fix up the mismatches between parameter names and the documentation
and add the file to mm-api.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Link: http://lkml.kernel.org/r/20200221220045.24989-1-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently the declaration and definition for is_vma_temporary_stack() are
scattered. Lets make is_vma_temporary_stack() helper available for
general use and also drop the declaration from (include/linux/huge_mm.h)
which is no longer required. While at this, rename this as
vma_is_temporary_stack() in line with existing helpers. This should not
cause any functional change.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Idea of a foreign VMA with respect to the present context is very generic.
But currently there are two identical definitions for this in powerpc and
x86 platforms. Lets consolidate those redundant definitions while making
vma_is_foreign() available for general use later. This should not cause
any functional change.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Link: http://lkml.kernel.org/r/1582782965-3274-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm/vma: some more minor changes", v2.
The motivation here is to consolidate VMA flags and helpers in generic
memory header and reduce code duplication when ever applicable. If there
are other possible similar instances which might be missing here, please
do let me me know. I will be happy to incorporate them.
This patch (of 3):
Move VM_NO_KHUGEPAGED into generic header (include/linux/mm.h). This just
makes sure that no VMA flag is scattered in individual function files any
longer. While at this, fix an old comment which is no longer valid. This
should not cause any functional change.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1582782965-3274-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Drop the _memcg suffix from (__)memcg_kmem_(un)charge functions. It's
shorter and more obvious.
These are the most basic functions which are just (un)charging the given
cgroup with the given amount of pages.
Also fix up the corresponding comments.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-7-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
These functions are charging the given number of kernel pages to the given
memory cgroup. The number doesn't have to be a power of two. Let's make
them to take the unsigned int nr_pages as an argument instead of the page
order.
It makes them look consistent with the corresponding uncharge functions
and functions like: mem_cgroup_charge_skmem(memcg, nr_pages).
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-5-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:
1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
the current memcg
2) set or clear the PageKmemcg flag
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Drop the unused page argument and put the memcg pointer at the first
place. This make the function consistent with its peers:
__memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: memcg: kmem API cleanup", v2.
This patchset aims to clean up the kernel memory charging API. It doesn't
bring any functional changes, just removes unused arguments, renames some
functions and fixes some comments.
Currently it's not obvious which functions are most basic
(memcg_kmem_(un)charge_memcg()) and which are based on them
(memcg_kmem_(un)charge()). The patchset renames these functions and
removes unused arguments:
TL;DR:
was:
memcg_kmem_charge_memcg(page, gfp, order, memcg)
memcg_kmem_uncharge_memcg(memcg, nr_pages)
memcg_kmem_charge(page, gfp, order)
memcg_kmem_uncharge(page, order)
now:
memcg_kmem_charge(memcg, gfp, nr_pages)
memcg_kmem_uncharge(memcg, nr_pages)
memcg_kmem_charge_page(page, gfp, order)
memcg_kmem_uncharge_page(page, order)
This patch (of 6):
The first argument of memcg_kmem_charge_memcg() and
__memcg_kmem_charge_memcg() is the page pointer and it's not used. Let's
drop it.
Memcg pointer is passed as the last argument. Move it to the first place
for consistency with other memcg functions, e.g.
__memcg_kmem_uncharge_memcg() or try_charge().
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping
pagevecs") to our 4.9 kernel, our test bench noticed around 10% down with
a couple of vm-scalability's test cases (lru-file-readonce,
lru-file-readtwice and lru-file-mmap-read). I didn't see that much down
on my VM (32c-64g-2nodes). It might be caused by the test configuration,
which is 32c-256g with NUMA disabled and the tests were run in root memcg,
so the tests actually stress only one inactive and active lru. It sounds
not very usual in mordern production environment.
That commit did two major changes:
1. Call page_evictable()
2. Use smp_mb to force the PG_lru set visible
It looks they contribute the most overhead. The page_evictable() is a
function which does function prologue and epilogue, and that was used by
page reclaim path only. However, lru add is a very hot path, so it sounds
better to make it inline. However, it calls page_mapping() which is not
inlined either, but the disassemble shows it doesn't do push and pop
operations and it sounds not very straightforward to inline it.
Other than this, it sounds smp_mb() is not necessary for x86 since
SetPageLRU is atomic which enforces memory barrier already, replace it
with smp_mb__after_atomic() in the following patch.
With the two fixes applied, the tests can get back around 5% on that test
bench and get back normal on my VM. Since the test bench configuration is
not that usual and I also saw around 6% up on the latest upstream, so it
sounds good enough IMHO.
The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
mainline w/ inline fix
150MB 154MB
With this patch the throughput gets 2.67% up. The data with using
smp_mb__after_atomic() is showed in the following patch.
Shakeel Butt did the below test:
On a real machine with limiting the 'dd' on a single node and reading 100
GiB sparse file (less than a single node). Just ran a single instance to
not cause the lru lock contention. The cmdline used is "dd if=file-100GiB
of=/dev/null bs=4k". Ran the cmd 10 times with drop_caches in between and
measured the time it took.
Without patch: 56.64143 +- 0.672 sec
With patches: 56.10 +- 0.21 sec
[akpm@linux-foundation.org: move page_evictable() to internal.h]
Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages. These pages need to be made accessible
before the host can access them.
While cpu accesses will trigger a fault that can be resolved, I/O accesses
will just fail. We need to add a callback into architecture code for
places that will do I/O, namely when writeback is started or when a page
reference is taken.
This is not only to enable paging, file backing etc, it is also necessary
to protect the host against a malicious user space. For example a bad
QEMU could simply start direct I/O on such protected memory. We do not
want userspace to be able to trigger I/O errors and thus the logic is
"whenever somebody accesses that page (gup) or does I/O, make sure that
this page can be accessed". When the guest tries to access that page we
will wait in the page fault handler for writeback to have finished and for
the page_ref to be the expected value.
On s390x the function is not supposed to fail, so it is ok to use a
WARN_ON on failure. If we ever need some more finegrained handling we can
tackle this when we know the details.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.
Add two new fields to /proc/vmstat:
nr_foll_pin_acquired
nr_foll_pin_released
These are documented in Documentation/core-api/pin_user_pages.rst. They
represent the number of pages (since boot time) that have been pinned
("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
pin_user_pages*() and unpin_user_pages*().
In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.
Also: update Documentation/core-api/pin_user_pages.rst, to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat.
Also: update Documentation/core-api/pin_user_pages.rst to rename the new
/proc/vmstat entries, to the names listed here.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily, each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number
of huge pages that can be pinned.
This patch removes that limitation, by using an exact form of pin counting
for compound pages of order > 1. The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order 1
compound pages only have two pages, so that won't work there.
A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).
This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_dma_pinned() comment block. That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.
Documentation/core-api/pin_user_pages.rst is updated accordingly.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add tracking of pages that were pinned via FOLL_PIN. This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK).
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.
As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().
Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section. (That limitation will be
removed in a following patch.)
The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:
bool page_maybe_dma_pinned(struct page *page);
What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].
This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
[jhubbard@nvidia.com: add kerneldoc]
Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
[imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
[akpm@linux-foundation.org: fix put_compound_head defined but not used]
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
An upcoming patch requires subtracting a large chunk of refcounts from a
page, and checking what the resulting refcount is. This is a little
different than the usual "check for zero refcount" that many of the page
ref functions already do. However, it is similar to a few other routines
that (like this one) are generally useful for things such as 1-based
refcounting.
Add page_ref_sub_return(), that subtracts a chunk of refcounts atomically,
and returns an atomic snapshot of the result.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-4-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This isn't just a random struct page, it's known to be a head page, and
calling it head makes the function better self-documenting. The pgoff_t
is less confusing if it's named index instead of offset. Also add a
couple of comments to explain why we're doing various things.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-3-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
fallback node"
This reverts commit ad2c8144418c6a81cefe65379fd47bbe8344cef2.
The function node_to_mem_node() was introduced by that commit for use in SLUB
on systems with memoryless nodes, but it turned out to be unreliable on some
architectures/configurations and a simpler solution exists than fixing it up.
Thus commit 0715e6c516f1 ("mm, slub: prevent kmalloc_node crashes and
memory leaks") removed the only user of node_to_mem_node() and we can
revert the commit that introduced the function.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20200320115533.9604-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The timer used by delayed kthread works are IRQ safe because the used
kthread_delayed_work_timer_fn() is IRQ safe.
It is properly marked when initialized by KTHREAD_DELAYED_WORK_INIT().
But TIMER_IRQSAFE flag is missing when initialized by
kthread_init_delayed_work().
The missing flag might trigger invalid warning from del_timer_sync() when
kthread_mod_delayed_work() is called with interrupts disabled.
This patch is result of a discussion about using the API, see
https://lkml.kernel.org/r/cfa886ad-e3b7-c0d2-3ff8-58d94170eab5@ti.com
Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200217120709.1974-1-pmladek@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull rdma updates from Jason Gunthorpe:
"The majority of the patches are cleanups, refactorings and clarity
improvements.
This cycle saw some more activity from Syzkaller, I think we are now
clean on all but one of those bugs, including the long standing and
obnoxious rdma_cm locking design defect. Continue to see many drivers
getting cleanups, with a few new user visible features.
Summary:
- Various driver updates for siw, bnxt_re, rxe, efa, mlx5, hfi1
- Lots of cleanup patches for hns
- Convert more places to use refcount
- Aggressively lock the RDMA CM code that syzkaller says isn't
working
- Work to clarify ib_cm
- Use the new ib_device lifecycle model in bnxt_re
- Fix mlx5's MR cache which seems to be failing more often with the
new ODP code
- mlx5 'dynamic uar' and 'tx steering' user interfaces"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (144 commits)
RDMA/bnxt_re: make bnxt_re_ib_init static
IB/qib: Delete struct qib_ivdev.qp_rnd
RDMA/hns: Fix uninitialized variable bug
RDMA/hns: Modify the mask of QP number for CQE of hip08
RDMA/hns: Reduce the maximum number of extend SGE per WQE
RDMA/hns: Reduce PFC frames in congestion scenarios
RDMA/mlx5: Add support for RDMA TX flow table
net/mlx5: Add support for RDMA TX steering
IB/hfi1: Call kobject_put() when kobject_init_and_add() fails
IB/hfi1: Fix memory leaks in sysfs registration and unregistration
IB/mlx5: Move to fully dynamic UAR mode once user space supports it
IB/mlx5: Limit the scope of struct mlx5_bfreg_info to mlx5_ib
IB/mlx5: Extend QP creation to get uar page index from user space
IB/mlx5: Extend CQ creation to get uar page index from user space
IB/mlx5: Expose UAR object and its alloc/destroy commands
IB/hfi1: Get rid of a warning
RDMA/hns: Remove redundant judgment of qp_type
RDMA/hns: Remove redundant assignment of wc->smac when polling cq
RDMA/hns: Remove redundant qpc setup operations
RDMA/hns: Remove meaningless prints
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This series focuses on corner case bug fixes and general clarity
improvements to hmm_range_fault(). It arose from a review of
hmm_range_fault() by Christoph, Ralph and myself.
hmm_range_fault() is being used by these 'SVM' style drivers to
non-destructively read the page tables. It is very similar to
get_user_pages() except that the output is an array of PFNs and
per-pfn flags, and it has various modes of reading.
This is necessary before RDMA ODP can be converted, as we don't want
to have weird corner case regressions, which is still a looking
forward item. Ralph has a nice tester for this routine, but it is
waiting for feedback from the selftests maintainers.
Summary:
- 9 bug fixes
- Allow pgmap to track the 'owner' of a DEVICE_PRIVATE - in this case
the owner tells the driver if it can understand the DEVICE_PRIVATE
page or not. Use this to resolve a bug in nouveau where it could
touch DEVICE_PRIVATE pages from other drivers.
- Remove a bunch of dead, redundant or unused code and flags
- Clarity improvements to hmm_range_fault()"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (25 commits)
mm/hmm: return error for non-vma snapshots
mm/hmm: do not set pfns when returning an error code
mm/hmm: do not unconditionally set pfns when returning EBUSY
mm/hmm: use device_private_entry_to_pfn()
mm/hmm: remove HMM_FAULT_SNAPSHOT
mm/hmm: remove unused code and tidy comments
mm/hmm: return the fault type from hmm_pte_need_fault()
mm/hmm: remove pgmap checking for devmap pages
mm/hmm: check the device private page owner in hmm_range_fault()
mm: simplify device private page handling in hmm_range_fault
mm: handle multiple owners of device private pages in migrate_vma
memremap: add an owner field to struct dev_pagemap
mm: merge hmm_vma_do_fault into into hmm_vma_walk_hole_
mm/hmm: don't handle the non-fault case in hmm_vma_walk_hole_()
mm/hmm: simplify hmm_vma_walk_hugetlb_entry()
mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flag
mm/hmm: don't provide a stub for hmm_range_fault()
mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()
mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling
mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages
...
|
|
Pull XArray updates from Matthew Wilcox:
- Fix two bugs which affected multi-index entries larger than 2^26
indices
- Fix some documentation
- Remove unused IDA macros
- Add a small optimisation for tiny configurations
- Fix a bug which could cause an RCU walker to terminate a marked walk
early
* tag 'xarray-5.7' of git://git.infradead.org/users/willy/linux-dax:
xarray: Fix early termination of xas_for_each_marked
radix tree test suite: Support kmem_cache alignment
XArray: Optimise xas_sibling() if !CONFIG_XARRAY_MULTI
ida: remove abandoned macros
XArray: Fix incorrect comment in header file
XArray: Fix xas_pause for large multi-index entries
XArray: Fix xa_find_next for large multi-index entries
|
|
Pull drm updates from Dave Airlie:
"This is the main drm pull request for 5.7-rc1.
Highlights:
- i915 enables Tigerlake by default
- i915 and amdgpu have initial OLED backlight support
[ Jani Nikula pipes up and points out that we've had a bunch of
"initial support" code for a long time already, but only now
Lyude made it actually work on real world machines ]
- vmwgfx add support to enable OpenGL 4 userspace
- zero length arrays are mostly removed.
Detailed summary:
new driver:
- tidss: TI Keystone platform display subsystem
core:
- new drm device warn macros
- mode config valid for memory constrained devices
- bridge bus format negotation
- consolidated fake vblank event handling
- dma_alloc related cleanups
- drop get_crtc callback
- dp: DP1.4 EDID corruption test
- EDID CEA detailed timings improvements
- relicense some code to dual GPL2/MIT
- convert core vblank support to per-crtc support
- rework drm_global_mutex
- bridge rework to allow omap_dss custom driver removeal
- remove drm_fb_helper connector interrfaces
- zero-length array removal
scheduler:
- support for modifying the sched list
- revert job distribution optimization
- helper to pick least loaded scheduler
- race condition fix
mst:
- various fixes
- remove register_connector callback
i915:
- uapi to allows userspace specific CS ring buffer sizes
- Tigerlake enablement patches + Tigerlake enabled by default
- new sysfs entries for engine properties
- display/logging refactors
- eDP/DP fixes for DPCD
- Gen7 back to aliasing-ppgtt
- Gen8+ irq refactor
- Avoid globals
- GEM locking fixes and simplifications
- Ice Lake and Elkhart Lake fixes and workarounds
- Baytrail/Haswell instability fix
- GVT - VFIO edid better support
amdgpu:
- Rework VM update handling in preparation for HMM support
- drm load/unload removal fixups
- USB-C PD firmware updates
- HDCP srm support
- Navi/renoir PM watermark fixes
- OLED panel support
- Optimize debugging vram access
- Use BACO for runtime pm
- DC clock programming optimizations and fixes
- PSP fw loading sequence updates
- Drop DRIVER_USE_AGP
- Remove legacy drm load and unload callbacks
- ACP Kconfig fix
- Lots of fixes across the driver
amdkfd:
- runtime pm support
- more gfx config details in amdgpu
radeon:
- drop DRIVER_USE_AGP
vmwgfx:
- Disable DMA when SEV encryption in use
- Shader Model 5 support - needed for GL4 support
msm:
- DPU resource manager refactor
- dpu using atomic global state
mediatek:
- MT8183 DPI support
etnaviv:
- out-of-bounds read fix
- expose feature flags for GC400 STM32MP1 SoC
- runtime suspend entry fix
- dma32 zone fix
hisilicon:
- mode selection fixes
meson:
- YUV420 support
lima:
- add support for heap buffers
tinydrm:
- removal of owner field
- explicit DT dependency removal
- YAML schema conversion
tegra:
- misc cleanups
tidss:
- new driver
virtio:
- better batching of notifications to host
- memory handling reworked
- shmem + gpu context fixes
hibmc:
- add gamma_set support
- improve DPMS support
pl111:
- Integrator IM-PD1 support
sun4i:
- LVDS support for A20 + A33
- DSI panel handling improvements"
* tag 'drm-next-2020-04-01' of git://anongit.freedesktop.org/drm/drm: (1537 commits)
drm/i915/display: Fix mode private_flags comparison at atomic_check
drm/i915/gt: Stage the transfer of the virtual breadcrumb
drm/i915/gt: Select the deepest available parking mode for rc6
drm/i915: Avoid live-lock with i915_vma_parked()
drm/i915/gt: Treat idling as a RPS downclock event
drm/i915/gt: Cancel a hung context if already closed
drm/i915: Use explicit flag to mark unreachable intel_context
drm/amdgpu: don't try to reserve training bo for sriov (v2)
drm/amdgpu/smu11: add support for SMU AC/DC interrupts
drm/amdgpu/swSMU: handle manual AC/DC notifications
drm/amdgpu/swSMU: handle DC controlled by GPIO for navi1x
drm/amdgpu/swSMU: set AC/DC mode based on the current system state (v2)
drm/amdgpu/swSMU: correct the bootup power source for Navi1X (v2)
drm/amdgpu/swSMU: use the smu11 power source helper for navi1x
drm/amdgpu/smu11: add a helper to set the power source
drm/amd/swSMU: add callback to set AC/DC power source (v2)
drm/scheduler: fix rare NULL ptr race
drm/amdgpu: fix the coverage issue to clear ArcVPGRs
drm/amd/display: Fix pageflip event race condition for DCN.
drm/[radeon|amdgpu]: Remove HAINAN board from max_sclk override check
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
"My attempt to revitalize trivial queue I've been neglecting for years
(what a disaster that was for this world, right? :) ) with patches
collected from backlog that were still relevant and not applied
elsewhere in the meantime"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
err.h: remove deprecated PTR_RET for good
blk-mq: Fix typo in comment
x86/boot: Fix comment spelling
sh: mach-highlander: Fix comment spelling
s390/dasd: Fix comment spelling
mfd: wm8994: Fix comment spelling
docs: Add reference in binfmt-misc.rst
genirq: fix kerneldoc comment for irq_desc
drm/amdgpu: fix two documentation mismatch issues
HID: fix Kconfig word ordering
list/hashtable: minor documentation corrections.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Fix out-of-sync IVs in self-test for IPsec AEAD algorithms
Algorithms:
- Use formally verified implementation of x86/curve25519
Drivers:
- Enhance hwrng support in caam
- Use crypto_engine for skcipher/aead/rsa/hash in caam
- Add Xilinx AES driver
- Add uacce driver
- Register zip engine to uacce in hisilicon
- Add support for OCTEON TX CPT engine in marvell"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits)
crypto: af_alg - bool type cosmetics
crypto: arm[64]/poly1305 - add artifact to .gitignore files
crypto: caam - limit single JD RNG output to maximum of 16 bytes
crypto: caam - enable prediction resistance in HRWNG
bus: fsl-mc: add api to retrieve mc version
crypto: caam - invalidate entropy register during RNG initialization
crypto: caam - check if RNG job failed
crypto: caam - simplify RNG implementation
crypto: caam - drop global context pointer and init_done
crypto: caam - use struct hwrng's .init for initialization
crypto: caam - allocate RNG instantiation descriptor with GFP_DMA
crypto: ccree - remove duplicated include from cc_aead.c
crypto: chelsio - remove set but not used variable 'adap'
crypto: marvell - enable OcteonTX cpt options for build
crypto: marvell - add the Virtual Function driver for CPT
crypto: marvell - add support for OCTEON TX CPT engine
crypto: marvell - create common Kconfig and Makefile for Marvell
crypto: arm/neon - memzero_explicit aes-cbc key
crypto: bcm - Use scnprintf() for avoiding potential buffer overflow
crypto: atmel-i2c - Fix wakeup fail
...
|
|
Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter. With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent. This
bypasses the signal sending checks if the parent changes their
credentials during exec.
The severity of this problem can been seen that in my limited testing
of a 32bit exec_id it can take as little as 19s to exec 65536 times.
Which means that it can take as little as 14 days to wrap a 32bit
exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
days. Even my slower timing is in the uptime of a typical server.
Which means self_exec_id is simply a speed bump today, and if exec
gets noticably faster self_exec_id won't even be a speed bump.
Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions. Which means that is is possible to hit
a window where the read value of exec_id does not match the written
value. So with very lucky timing after this change this still
remains expoiltable.
I have updated the update of exec_id on exec to use WRITE_ONCE
and the read of exec_id in do_notify_parent to use READ_ONCE
to make it clear that there is no locking between these two
locations.
Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
Fixes: 2.3.23pre2
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
Pull networking updates from David Miller:
"Highlights:
1) Fix the iwlwifi regression, from Johannes Berg.
2) Support BSS coloring and 802.11 encapsulation offloading in
hardware, from John Crispin.
3) Fix some potential Spectre issues in qtnfmac, from Sergey
Matyukevich.
4) Add TTL decrement action to openvswitch, from Matteo Croce.
5) Allow paralleization through flow_action setup by not taking the
RTNL mutex, from Vlad Buslov.
6) A lot of zero-length array to flexible-array conversions, from
Gustavo A. R. Silva.
7) Align XDP statistics names across several drivers for consistency,
from Lorenzo Bianconi.
8) Add various pieces of infrastructure for offloading conntrack, and
make use of it in mlx5 driver, from Paul Blakey.
9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.
10) Lots of parallelization improvements during configuration changes
in mlxsw driver, from Ido Schimmel.
11) Add support to devlink for generic packet traps, which report
packets dropped during ACL processing. And use them in mlxsw
driver. From Jiri Pirko.
12) Support bcmgenet on ACPI, from Jeremy Linton.
13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei
Starovoitov, and your's truly.
14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.
15) Fix sysfs permissions when network devices change namespaces, from
Christian Brauner.
16) Add a flags element to ethtool_ops so that drivers can more simply
indicate which coalescing parameters they actually support, and
therefore the generic layer can validate the user's ethtool
request. Use this in all drivers, from Jakub Kicinski.
17) Offload FIFO qdisc in mlxsw, from Petr Machata.
18) Support UDP sockets in sockmap, from Lorenz Bauer.
19) Fix stretch ACK bugs in several TCP congestion control modules,
from Pengcheng Yang.
20) Support virtual functiosn in octeontx2 driver, from Tomasz
Duszynski.
21) Add region operations for devlink and use it in ice driver to dump
NVM contents, from Jacob Keller.
22) Add support for hw offload of MACSEC, from Antoine Tenart.
23) Add support for BPF programs that can be attached to LSM hooks,
from KP Singh.
24) Support for multiple paths, path managers, and counters in MPTCP.
From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
and others.
25) More progress on adding the netlink interface to ethtool, from
Michal Kubecek"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
cxgb4/chcr: nic-tls stats in ethtool
net: dsa: fix oops while probing Marvell DSA switches
net/bpfilter: remove superfluous testing message
net: macb: Fix handling of fixed-link node
net: dsa: ksz: Select KSZ protocol tag
netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
net: stmmac: create dwmac-intel.c to contain all Intel platform
net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
net: dsa: bcm_sf2: Add support for matching VLAN TCI
net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
net: dsa: bcm_sf2: Disable learning for ASP port
net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
net: dsa: b53: Restore VLAN entries upon (re)configuration
net: dsa: bcm_sf2: Fix overflow checks
hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial updates from Greg KH:
"Here is the big set of TTY / Serial patches for 5.7-rc1
Lots of console fixups and reworking in here, serial core tweaks
(doesn't that ever get old, why are we still creating new serial
devices?), serial driver updates, line-protocol driver updates, and
some vt cleanups and fixes included in here as well.
All have been in linux-next with no reported issues"
* tag 'tty-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (161 commits)
serial: 8250: Optimize irq enable after console write
serial: 8250: Fix rs485 delay after console write
vt: vt_ioctl: fix use-after-free in vt_in_use()
vt: vt_ioctl: fix VT_DISALLOCATE freeing in-use virtual console
tty: serial: make SERIAL_SPRD depend on COMMON_CLK
tty: serial: fsl_lpuart: fix return value checking
tty: serial: fsl_lpuart: move dma_request_chan()
ARM: dts: tango4: Make /serial compatible with ns16550a
ARM: dts: mmp*: Make the serial ports compatible with xscale-uart
ARM: dts: mmp*: Fix serial port names
ARM: dts: mmp2-brownstone: Don't redeclare phandle references
ARM: dts: pxa*: Make the serial ports compatible with xscale-uart
ARM: dts: pxa*: Fix serial port names
ARM: dts: pxa*: Don't redeclare phandle references
serial: omap: drop unused dt-bindings header
serial: 8250: 8250_omap: Add DMA support for UARTs on K3 SoCs
serial: 8250: 8250_omap: Work around errata causing spurious IRQs with DMA
serial: 8250: 8250_omap: Extend driver data to pass FIFO trigger info
serial: 8250: 8250_omap: Move locking out from __dma_rx_do_complete()
serial: 8250: 8250_omap: Account for data in flight during DMA teardown
...
|