|
Augments hugetlb_cgroup_charge_cgroup to be able to charge either the
hugetlb usage counter or the hugetlb reservation counter.
Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.
Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
Signed-off-by: Mina Almasry <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Sandipan Das <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These counters will track hugetlb reservations rather than hugetlb memory
faulted in. This patch only adds the counter; following patches add the
charging and uncharging of the counter.
This is patch 1 of a 9-patch series.
Problem:
Currently tasks attempting to reserve more hugetlb memory than is
available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
Reservations [1]. However, if a task attempts to reserve more hugetlb
memory than its hugetlb_cgroup limit allows, the kernel will allow the
mmap/shmget call, but will SIGBUS the task when it attempts to fault in
the excess memory.
We have users hitting their hugetlb_cgroup limits and thus we've been
looking at this failure mode. We'd like to improve this behavior such
that users violating the hugetlb_cgroup limits get an error at mmap/shmget
time, rather than getting SIGBUS'd when they try to fault the excess
memory in. This gives the user an opportunity to fall back more gracefully
to non-hugetlbfs memory, for example.
The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
enforcing the hugetlb_cgroup limit only happens at fault time, and the
offending task gets SIGBUS'd.
Proposed Solution:
Add a new page counter named
'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
slightly different semantics from
'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
- While usage_in_bytes tracks all *faulted* hugetlb memory,
rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than rsvd.limit_in_bytes allows, the kernel will fail this
reservation.
This proposal is implemented in this patch series, with tests to verify
functionality and show the usage.
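As an illustration only (a hedged userspace sketch; the 2MB page size, the
1 GiB request, and the prior write to hugetlb.2MB.rsvd.limit_in_bytes are
assumptions for the example), the intended difference is that an over-limit
reservation now fails at mmap() time instead of SIGBUS-ing later at fault
time:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 30;  /* try to reserve 1 GiB of hugetlb memory */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            /* fails here if the reservation exceeds rsvd.limit_in_bytes */
            perror("mmap");
            return 1;
        }
        /* With only the non-rsvd limit set, the failure would instead show
         * up as a SIGBUS when the excess pages are first touched. */
        munmap(p, len);
        return 0;
    }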
Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to the
existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
duplication with hugetlb_cgroup. Keeping hugetlb related page counters
under hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that
modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
accounting at reservation time rather than fault time. Adding a new
page_counter seems better as userspace could, if it wants, choose to
enforce different cgroups differently: one via limit_in_bytes, and
another via rsvd.limit_in_bytes. This could be very useful if you're
transitioning how hugetlb memory is partitioned on your system one
cgroup at a time, for example. Also, someone may find usage for both
limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
gives them the option to do so.
Testing:
- Added tests passing.
- Used libhugetlbfs for regression testing.
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Signed-off-by: Mina Almasry <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Sandipan Das <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:
1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.
A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.
To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.
To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive, as hugetlbfs
processing is done separately from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.
The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/lkml/[email protected]/
This patch (of 2):
While looking at BUGs associated with invalid huge page map counts, it was
discovered that a huge pte pointer could become 'invalid' and point to
another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.
mapping->i_mmap_rwsem
hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
page->flags PG_locked (lock_page)
To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.
In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.
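A minimal sketch of that locking order, assuming the names used above; the
arguments to hugetlb_fault_mutex_hash() are an assumption (they have changed
across kernel versions) and all error handling is omitted:

    static void hugetlb_fault_lock_example(struct address_space *mapping,
                                           pgoff_t idx, struct page *page)
    {
        u32 hash;

        i_mmap_lock_read(mapping);                       /* 1. i_mmap_rwsem */
        hash = hugetlb_fault_mutex_hash(mapping, idx);   /* assumed signature */
        mutex_lock(&hugetlb_fault_mutex_table[hash]);    /* 2. fault mutex */
        lock_page(page);                                 /* 3. PG_locked */

        /* ... page fault processing ... */

        unlock_page(page);
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
        i_mmap_unlock_read(mapping);
    }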
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Aneesh Kumar K . V" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Prakash Sangappa <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
vma_migratable() is called to check whether pages in a vma can be migrated
before proceeding with further actions. It is currently used in the
following code paths:
- task_numa_work
- mbind
- move_pages
For hugetlb mapping, whether vma is migratable or not is determined by:
- CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
- arch_hugetlb_migration_supported
Issue: the current code checks CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
alone, and no code should use it directly. (Note that the current code in
vma_migratable() does not cause a failure or bug, because
unmap_and_move_huge_page() will catch an unsupported hugepage and handle it
properly.)
This patch checks both factors via hugepage_migration_supported(),
improving code logic and robustness. It enables early bail-out from the
hugepage migration procedure, but because all architectures that currently
support hugepage migration support it for all page sizes, we would not see
a performance gain with this patch applied.
vma_migratable() is moved to mm/mempolicy.c, because the circular
dependency between mempolicy.h and hugetlb.h makes defining it as inline
infeasible.
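A hedged sketch of the added check (simplified from the description above,
not the verbatim patch): for hugetlb mappings, migratability is decided by
hugepage_migration_supported() on the vma's hstate rather than by the
config option alone.

    /* Sketch only: the other vma_migratable() checks are elided. */
    bool vma_migratable(struct vm_area_struct *vma)
    {
        if (vma->vm_flags & (VM_IO | VM_PFNMAP))
            return false;

        /* Ask the architecture whether this hstate can be migrated. */
        if (is_vm_hugetlb_page(vma) &&
            !hugepage_migration_supported(hstate_vma(vma)))
            return false;

        return true;
    }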
Signed-off-by: Li Xinhai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "fix the missing underflow in memory operation function", v4.
The patchset makes KASAN produce a report when the size passed to a memory
operation function is negative. This helps programmers track down
undefined-behavior issues. Patch 1 is based on Dmitry's review and
suggestions; patch 2 is a test that verifies patch 1.
[1]https://bugzilla.kernel.org/show_bug.cgi?id=199341
[2]https://lore.kernel.org/linux-arm-kernel/[email protected]/
This patch (of 2):
KASAN misses the case where size is a negative number in memset(),
memcpy(), and memmove(), which can cause an out-of-bounds bug, so it needs
to be detected by KASAN.
If size is a negative number, there is good reason to classify it as an
out-of-bounds bug type: casting a negative number to size_t yields a very
large value, larger than ULONG_MAX/2, so it qualifies as out-of-bounds.
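A hedged sketch of such a check (simplified, not the exact mm/kasan code):
a size whose signed value is negative is reported as out-of-bounds before
the region is checked.

    static __always_inline bool check_memory_region_inline(unsigned long addr,
                                                           size_t size,
                                                           bool write,
                                                           unsigned long ret_ip)
    {
        if (unlikely(size == 0))
            return true;

        /* A "negative" size shows up as a huge size_t (> ULONG_MAX/2). */
        if (unlikely((long)size < 0)) {
            kasan_report(addr, size, write, ret_ip);
            return false;
        }

        /* ... existing shadow-memory checks continue here ... */
        return true;
    }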
KASAN report is shown below:
BUG: KASAN: out-of-bounds in kmalloc_memmove_invalid_size+0x70/0xa0
Read of size 18446744073709551608 at addr ffffff8069660904 by task cat/72
CPU: 2 PID: 72 Comm: cat Not tainted 5.4.0-rc1-next-20191004ajb-00001-gdb8af2f372b2-dirty #1
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x288
show_stack+0x14/0x20
dump_stack+0x10c/0x164
print_address_description.isra.9+0x68/0x378
__kasan_report+0x164/0x1a0
kasan_report+0xc/0x18
check_memory_region+0x174/0x1d0
memmove+0x34/0x88
kmalloc_memmove_invalid_size+0x70/0xa0
[1] https://bugzilla.kernel.org/show_bug.cgi?id=199341
[[email protected]: fix -Wdeclaration-after-statement warn]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix objtool warning]
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: kernel test robot <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Suggested-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Walter Wu <[email protected]>
Signed-off-by: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
After introducing the memory sub-section concept, pfn_present() loses its
literal meaning and is not necessarily true for a partially populated
memory section.
Since all of the callers use it to check for an absent section, it is
better to rename pfn_present() to pfn_in_present_section().
Signed-off-by: Pingfan Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michael Ellerman <[email protected]> [powerpc]
Cc: Dan Williams <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Leonardo Bras <[email protected]>
Cc: Nathan Fontenot <[email protected]>
Cc: Nathan Lynch <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: mmap: add mmap trace point", v3.
Create mmap trace file and add trace point of vm_unmapped_area().
This patch (of 2):
In preparation for the next patch, remove the inline of vm_unmapped_area()
and move the code to mmap.c. There is no logical change.
Also remove unmapped_area[_topdown] from mm.h; there is no code calling
them.
Signed-off-by: Jaewon Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allowed a page fault to be retried once. We
achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when calling
handle_mm_fault() the second time. This was mainly used to avoid
unexpected starvation of the system from looping forever over the page
fault on a single page. However, that should hardly happen: after all, for
each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
can now retry the page fault multiple times if necessary, without the need
to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so the page fault handler can still identify whether
a page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault is allowed to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault is allowed to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault is not allowed
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In the existing code we have multiple places that have taken special care
of the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED), because now
even the 2nd try will have ALLOW_RETRY set, then uses that helper in
all existing special paths. One example is in __lock_page_or_retry(): now
we'll drop the mmap_sem only in the first attempt of the page fault and
we'll keep it in follow-up retries, so the old locking behavior is retained.
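A minimal sketch of such a helper, assuming it matches the
fault_flag_allow_retry_first() described above:

    static inline bool fault_flag_allow_retry_first(unsigned int flags)
    {
        /* True only on the first attempt of a fault that may retry. */
        return (flags & FAULT_FLAG_ALLOW_RETRY) &&
               !(flags & FAULT_FLAG_TRIED);
    }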
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Brian Geffon <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Denis Plotnikov <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Martin Cracauer <[email protected]>
Cc: Marty McFadden <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maya Gokhale <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
handle_userfaultfd() is currently the only place in the kernel page
fault procedures that can respond to non-fatal userspace signals. It was
trying to detect such an allowance by checking against the USER & KILLABLE
flags, which was "un-official".
In this patch, we introduce a new flag (FAULT_FLAG_INTERRUPTIBLE) to show
that the fault handler allows the fault procedure to respond even to
non-fatal signals. Meanwhile, add this new flag to the default fault
flags so that all the page fault handlers can benefit from the new flag.
With that, replace the userfault check with this one.
Since the line is getting even longer, clean up the fault flags a bit too
to ease TTY users.
Although we've got a new flag and applied it, we shouldn't have any
functional change with this patch so far.
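A hedged sketch of how the new flag could replace the USER & KILLABLE
check; the helper name here is hypothetical and only illustrates the idea:

    /* Hypothetical helper: map fault flags to the task state a blocked
     * userfault should wait in. */
    static inline unsigned int userfault_blocking_state(unsigned int flags)
    {
        if (flags & FAULT_FLAG_INTERRUPTIBLE)
            return TASK_INTERRUPTIBLE;   /* non-fatal signals wake us */
        if (flags & FAULT_FLAG_KILLABLE)
            return TASK_KILLABLE;        /* only fatal signals wake us */
        return TASK_UNINTERRUPTIBLE;
    }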
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Brian Geffon <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: Denis Plotnikov <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Martin Cracauer <[email protected]>
Cc: Marty McFadden <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maya Gokhale <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Although there are tons of arch-specific page fault handlers, most of them
still share the same initial value of the page fault flags. Nearly all of
the page fault handlers allow the fault to be retried, and they also allow
the fault to respond to SIGKILL.
Let's define a default value for the fault flags to replace those initial
page fault flags that were copied over. With this, it'll be far easier to
introduce a new fault flag that can be used by all the architectures,
instead of touching all the archs.
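A minimal sketch of what such a default could look like (assumed to
resemble FAULT_FLAG_DEFAULT; the FAULT_FLAG_INTERRUPTIBLE entry earlier in
this log is OR-ed in by a later change):

    #define FAULT_FLAG_DEFAULT	(FAULT_FLAG_ALLOW_RETRY | \
				 FAULT_FLAG_KILLABLE)

An arch fault handler would then start from
"unsigned int flags = FAULT_FLAG_DEFAULT;" instead of open-coding the same
pair of flags.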
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Brian Geffon <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: Denis Plotnikov <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Martin Cracauer <[email protected]>
Cc: Marty McFadden <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maya Gokhale <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The idea comes from the upstream discussion between Linus and Andrea:
https://lore.kernel.org/lkml/[email protected]/
A summary of the issue: there was a special path in handle_userfault() in
the past where we would return VM_FAULT_NOPAGE when we detected non-fatal
signals while waiting for userfault handling. We did that by reacquiring
the mmap_sem before returning. However, that brings a risk that the
vmas might have changed when we retake the mmap_sem, and we could even be
holding an invalid vma structure.
This patch prepares for removing that special path by allowing the
page fault to return even faster if we were interrupted by a non-fatal
signal during a user-mode page fault handling routine.
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Brian Geffon <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Denis Plotnikov <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Martin Cracauer <[email protected]>
Cc: Marty McFadden <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maya Gokhale <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For most architectures, we've got a quick path to detect a fatal signal
after a handle_mm_fault(). Introduce a helper for that quick path.
It cleans up the current code a bit so we don't need to duplicate the same
check across archs. More importantly, this will be a unified place where
we handle the signal immediately right after an interrupted page fault, so
it'll be much easier for us if we want to change the behavior of handling
signals later on for all the archs.
Note that currently only some of the archs are using this new helper,
because some archs have their own way to handle signals. In the follow-up
patches, we'll try to apply this helper to all the rest of the archs.
Another note is that the "regs" parameter in the new helper is not used
yet. It'll be used very soon. For now we keep it in this patch only to
avoid touching all the archs again in the follow-up patches.
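A hedged sketch of the helper as described (the regs argument is
deliberately unused here, as noted above):

    static inline bool fault_signal_pending(vm_fault_t fault_flags,
                                            struct pt_regs *regs)
    {
        /* Bail out only if the fault was interrupted (VM_FAULT_RETRY)
         * and a fatal signal is pending for the current task. */
        return unlikely((fault_flags & VM_FAULT_RETRY) &&
                        fatal_signal_pending(current));
    }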
[[email protected]: fix sparse warnings]
Link: http://lkml.kernel.org/r/20200311145921.GD479302@xz-x1
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Brian Geffon <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Denis Plotnikov <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Martin Cracauer <[email protected]>
Cc: Marty McFadden <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maya Gokhale <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The documentation currently does not include the deathless prose written
to describe functions in pagemap.h because it's not included in any rst
file. Fix up the mismatches between parameter names and the documentation
and add the file to mm-api.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently the declaration and definition of is_vma_temporary_stack() are
scattered. Let's make is_vma_temporary_stack() available for
general use and also drop the declaration from include/linux/huge_mm.h,
which is no longer required. While at it, rename it to
vma_is_temporary_stack() in line with existing helpers. This should not
cause any functional change.
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The idea of a foreign VMA with respect to the present context is very
generic, but currently there are two identical definitions of this on the
powerpc and x86 platforms. Let's consolidate those redundant definitions
while making vma_is_foreign() available for general use later. This should
not cause any functional change.
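A minimal sketch of the consolidated helper, assuming it matches the
generic definition: a VMA is foreign when it does not belong to the current
task's mm.

    static inline bool vma_is_foreign(struct vm_area_struct *vma)
    {
        /* No mm at all (e.g. a kernel thread): everything is foreign. */
        if (!current->mm)
            return true;
        /* A VMA from someone else's address space is foreign. */
        if (current->mm != vma->vm_mm)
            return true;
        return false;
    }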
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/vma: some more minor changes", v2.
The motivation here is to consolidate VMA flags and helpers in the generic
memory header and reduce code duplication wherever applicable. If there
are other possible similar instances which might be missing here, please
do let me know. I will be happy to incorporate them.
This patch (of 3):
Move VM_NO_KHUGEPAGED into generic header (include/linux/mm.h). This just
makes sure that no VMA flag is scattered in individual function files any
longer. While at this, fix an old comment which is no longer valid. This
should not cause any functional change.
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subtree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
           A              A: low = 2G
          / \             A1: low = 1G
        A1   A2           A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
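A hedged sketch of that arithmetic (an illustration only, not the kernel's
reclaim code): each child's effective protection is its own memory.low plus
a usage-proportional share of the parent's unclaimed low.

    /* Assumes usage + siblings_usage is nonzero; overflow ignored. */
    static unsigned long effective_low(unsigned long own_low,
                                       unsigned long usage,
                                       unsigned long siblings_usage,
                                       unsigned long parent_unclaimed)
    {
        /* Split the unclaimed parental protection in proportion to usage. */
        return own_low + parent_unclaimed * usage / (usage + siblings_usage);
    }

    /* In the example above (equal usage): parent_unclaimed = 2G - 1G = 1G,
     * so A1 gets 1G + 0.5G = 1.5G and A2 gets 0G + 0.5G = 0.5G. */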
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Chris Down <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Drop the _memcg suffix from (__)memcg_kmem_(un)charge functions. It's
shorter and more obvious.
These are the most basic functions which are just (un)charging the given
cgroup with the given amount of pages.
Also fix up the corresponding comments.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These functions charge the given number of kernel pages to the given
memory cgroup. The number doesn't have to be a power of two. Let's make
them take an unsigned int nr_pages argument instead of the page
order.
It makes them look consistent with the corresponding uncharge functions
and functions like: mem_cgroup_charge_skmem(memcg, nr_pages).
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:
1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
the current memcg
2) set or clear the PageKmemcg flag
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Drop the unused page argument and put the memcg pointer in the first
place. This makes the function consistent with its peers:
__memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: memcg: kmem API cleanup", v2.
This patchset aims to clean up the kernel memory charging API. It doesn't
bring any functional changes, just removes unused arguments, renames some
functions and fixes some comments.
Currently it's not obvious which functions are most basic
(memcg_kmem_(un)charge_memcg()) and which are based on them
(memcg_kmem_(un)charge()). The patchset renames these functions and
removes unused arguments:
TL;DR:
was:
memcg_kmem_charge_memcg(page, gfp, order, memcg)
memcg_kmem_uncharge_memcg(memcg, nr_pages)
memcg_kmem_charge(page, gfp, order)
memcg_kmem_uncharge(page, order)
now:
memcg_kmem_charge(memcg, gfp, nr_pages)
memcg_kmem_uncharge(memcg, nr_pages)
memcg_kmem_charge_page(page, gfp, order)
memcg_kmem_uncharge_page(page, order)
This patch (of 6):
The first argument of memcg_kmem_charge_memcg() and
__memcg_kmem_charge_memcg() is the page pointer and it's not used. Let's
drop it.
Memcg pointer is passed as the last argument. Move it to the first place
for consistency with other memcg functions, e.g.
__memcg_kmem_uncharge_memcg() or try_charge().
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping
pagevecs") to our 4.9 kernel, our test bench noticed around a 10%
regression with a couple of vm-scalability test cases (lru-file-readonce,
lru-file-readtwice and lru-file-mmap-read). I didn't see that much of a
regression on my VM (32c-64g-2nodes). It might be caused by the test
configuration, which is 32c-256g with NUMA disabled, and the tests were run
in the root memcg, so the tests actually stress only one inactive and one
active lru. That is not very common in a modern production environment.
That commit did two major changes:
1. Call page_evictable()
2. Use smp_mb to force the PG_lru set visible
It looks like they contribute most of the overhead. page_evictable() is a
function which does a function prologue and epilogue, and it was used by
the page reclaim path only. However, lru add is a very hot path, so it is
better to make it inline. It also calls page_mapping(), which is not
inlined either, but the disassembly shows it doesn't do push and pop
operations, and inlining it is not very straightforward.
Other than this, smp_mb() is not necessary for x86 since SetPageLRU is
atomic, which already enforces a memory barrier; it is replaced with
smp_mb__after_atomic() in the following patch.
With the two fixes applied, the tests regain around 5% on that test
bench and get back to normal on my VM. Since the test bench configuration
is not that usual and I also saw around a 6% improvement on the latest
upstream, this sounds good enough IMHO.
The below is test data (lru-file-readtwice throughput) against v5.6-rc4:
                 mainline    w/ inline fix
   throughput     150MB          154MB
With this patch the throughput improves by 2.67%. The data with
smp_mb__after_atomic() is shown in the following patch.
Shakeel Butt ran the test below:
On a real machine, with 'dd' limited to a single node and reading a 100
GiB sparse file (less than a single node). Just a single instance was run,
so as not to cause lru lock contention. The cmdline used is "dd
if=file-100GiB of=/dev/null bs=4k". The command was run 10 times with
drop_caches in between, and the time it took was measured.
Without patch: 56.64143 +- 0.672 sec
With patches: 56.10 +- 0.21 sec
[[email protected]: move page_evictable() to internal.h]
Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Shakeel Butt <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages. These pages need to be made accessible
before the host can access them.
While cpu accesses will trigger a fault that can be resolved, I/O accesses
will just fail. We need to add a callback into architecture code for
places that will do I/O, namely when writeback is started or when a page
reference is taken.
This is not only to enable paging, file backing etc, it is also necessary
to protect the host against a malicious user space. For example a bad
QEMU could simply start direct I/O on such protected memory. We do not
want userspace to be able to trigger I/O errors and thus the logic is
"whenever somebody accesses that page (gup) or does I/O, make sure that
this page can be accessed". When the guest tries to access that page we
will wait in the page fault handler for writeback to have finished and for
the page_ref to be the expected value.
On s390x the function is not supposed to fail, so it is ok to use a
WARN_ON on failure. If we ever need some more fine-grained handling we can
tackle this when we know the details.
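A minimal sketch of the hook's generic fallback; the name
arch_make_page_accessible() follows the mainline callback this commit
describes, and the no-op default is an assumption for architectures without
inaccessible pages:

    /* Architectures with inaccessible (e.g. protected-guest) pages
     * override this; everyone else gets a successful no-op. */
    static inline int arch_make_page_accessible(struct page *page)
    {
        return 0;
    }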
Signed-off-by: Claudio Imbrenda <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.
Add two new fields to /proc/vmstat:
nr_foll_pin_acquired
nr_foll_pin_released
These are documented in Documentation/core-api/pin_user_pages.rst. They
represent the number of pages (since boot time) that have been pinned
("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
pin_user_pages*() and unpin_user_pages*().
In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.
Also: update Documentation/core-api/pin_user_pages.rst, to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat.
Also: update Documentation/core-api/pin_user_pages.rst to rename the new
/proc/vmstat entries, to the names listed here.
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily: each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number
of huge pages that can be pinned.
This patch removes that limitation, by using an exact form of pin counting
for compound pages of order > 1. The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order 1
compound pages only have two pages, so that won't work there.
A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).
This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_dma_pinned() comment block. That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.
Documentation/core-api/pin_user_pages.rst is updated accordingly.
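A hedged sketch of the exact-counting helpers (assumed to resemble the
mainline ones built on the new field in the third struct page):

    static inline atomic_t *compound_pincount_ptr(struct page *page)
    {
        /* The counter lives in the third struct page of the compound
         * page, which is why order > 1 is required. */
        return &page[2].hpage_pinned_refcount;
    }

    static inline int compound_pincount(struct page *page)
    {
        return atomic_read(compound_pincount_ptr(page));
    }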
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add tracking of pages that were pinned via FOLL_PIN. This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK).
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.
As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().
Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section. (That limitation will be
removed in a following patch.)
The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:
bool page_maybe_dma_pinned(struct page *page);
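A hedged sketch of that check at this point in the series (huge-page
handling is refined by a later patch in this log): the refcount bias makes
pinning detectable, fuzzily.

    static inline bool page_maybe_dma_pinned(struct page *page)
    {
        /* A refcount at or above the bias suggests (but does not prove,
         * hence "maybe") that at least one FOLL_PIN reference exists. */
        return page_ref_count(compound_head(page)) >= GUP_PIN_COUNTING_BIAS;
    }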
What to do in response to encountering such a page is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].
This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
[[email protected]: add kerneldoc]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: if pin fails, we need to unpin, a simple put_page will not be enough]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix put_compound_head defined but not used]
Suggested-by: Jan Kara <[email protected]>
Suggested-by: Jérôme Glisse <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Claudio Imbrenda <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
An upcoming patch requires subtracting a large chunk of refcounts from a
page and checking what the resulting refcount is. This is a little
different from the usual "check for zero refcount" that many of the page
ref functions already do. However, it is similar to a few other routines
that (like this one) are generally useful for things such as 1-based
refcounting.
Add page_ref_sub_return(), that subtracts a chunk of refcounts atomically,
and returns an atomic snapshot of the result.
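A minimal sketch of the new helper (tracepoint hooks omitted):

    static inline int page_ref_sub_return(struct page *page, int nr)
    {
        /* Subtract nr references and return an atomic snapshot of the
         * resulting refcount. */
        return atomic_sub_return(nr, &page->_refcount);
    }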
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This isn't just a random struct page; it's known to be a head page, and
calling it head makes the function more self-documenting. The pgoff_t
is less confusing if it's named index instead of offset. Also add a
couple of comments to explain why we're doing various things.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Pankaj Gupta <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
fallback node"
This reverts commit ad2c8144418c6a81cefe65379fd47bbe8344cef2.
The function node_to_mem_node() was introduced by that commit for use in SLUB
on systems with memoryless nodes, but it turned out to be unreliable on some
architectures/configurations and a simpler solution exists than fixing it up.
Thus commit 0715e6c516f1 ("mm, slub: prevent kmalloc_node crashes and
memory leaks") removed the only user of node_to_mem_node() and we can
revert the commit that introduced the function.
Signed-off-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nathan Lynch <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: PUVICHAKRAVARTHY RAMACHANDRAN <[email protected]>
Cc: Sachin Sant <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The timer used by delayed kthread works are IRQ safe because the used
kthread_delayed_work_timer_fn() is IRQ safe.
It is properly marked when initialized by KTHREAD_DELAYED_WORK_INIT().
But the TIMER_IRQSAFE flag is missing when it is initialized by
kthread_init_delayed_work().
The missing flag might trigger an invalid warning from del_timer_sync()
when kthread_mod_delayed_work() is called with interrupts disabled.
This patch is the result of a discussion about using the API; see
https://lkml.kernel.org/r/[email protected]
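A hedged sketch of the fix (assumed to match the macro's shape): pass
TIMER_IRQSAFE when initializing the delayed work's timer, just as
KTHREAD_DELAYED_WORK_INIT() already does.

    #define kthread_init_delayed_work(dwork, fn)                    \
        do {                                                        \
            kthread_init_work(&(dwork)->work, (fn));                \
            __init_timer(&(dwork)->timer,                           \
                         kthread_delayed_work_timer_fn,             \
                         TIMER_IRQSAFE);                            \
        } while (0)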
Reported-by: Grygorii Strashko <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Grygorii Strashko <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pull rdma updates from Jason Gunthorpe:
"The majority of the patches are cleanups, refactorings and clarity
improvements.
This cycle saw some more activity from Syzkaller, I think we are now
clean on all but one of those bugs, including the long standing and
obnoxious rdma_cm locking design defect. Continue to see many drivers
getting cleanups, with a few new user visible features.
Summary:
- Various driver updates for siw, bnxt_re, rxe, efa, mlx5, hfi1
- Lots of cleanup patches for hns
- Convert more places to use refcount
- Aggressively lock the RDMA CM code that syzkaller says isn't
working
- Work to clarify ib_cm
- Use the new ib_device lifecycle model in bnxt_re
- Fix mlx5's MR cache which seems to be failing more often with the
new ODP code
- mlx5 'dynamic uar' and 'tx steering' user interfaces"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (144 commits)
RDMA/bnxt_re: make bnxt_re_ib_init static
IB/qib: Delete struct qib_ivdev.qp_rnd
RDMA/hns: Fix uninitialized variable bug
RDMA/hns: Modify the mask of QP number for CQE of hip08
RDMA/hns: Reduce the maximum number of extend SGE per WQE
RDMA/hns: Reduce PFC frames in congestion scenarios
RDMA/mlx5: Add support for RDMA TX flow table
net/mlx5: Add support for RDMA TX steering
IB/hfi1: Call kobject_put() when kobject_init_and_add() fails
IB/hfi1: Fix memory leaks in sysfs registration and unregistration
IB/mlx5: Move to fully dynamic UAR mode once user space supports it
IB/mlx5: Limit the scope of struct mlx5_bfreg_info to mlx5_ib
IB/mlx5: Extend QP creation to get uar page index from user space
IB/mlx5: Extend CQ creation to get uar page index from user space
IB/mlx5: Expose UAR object and its alloc/destroy commands
IB/hfi1: Get rid of a warning
RDMA/hns: Remove redundant judgment of qp_type
RDMA/hns: Remove redundant assignment of wc->smac when polling cq
RDMA/hns: Remove redundant qpc setup operations
RDMA/hns: Remove meaningless prints
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This series focuses on corner case bug fixes and general clarity
improvements to hmm_range_fault(). It arose from a review of
hmm_range_fault() by Christoph, Ralph and myself.
hmm_range_fault() is being used by these 'SVM' style drivers to
non-destructively read the page tables. It is very similar to
get_user_pages() except that the output is an array of PFNs and
per-pfn flags, and it has various modes of reading.
This is necessary before RDMA ODP can be converted, as we don't want
to have weird corner case regressions, which is still a looking
forward item. Ralph has a nice tester for this routine, but it is
waiting for feedback from the selftests maintainers.
Summary:
- 9 bug fixes
- Allow pgmap to track the 'owner' of a DEVICE_PRIVATE - in this case
the owner tells the driver if it can understand the DEVICE_PRIVATE
page or not. Use this to resolve a bug in nouveau where it could
touch DEVICE_PRIVATE pages from other drivers.
- Remove a bunch of dead, redundant or unused code and flags
- Clarity improvements to hmm_range_fault()"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (25 commits)
mm/hmm: return error for non-vma snapshots
mm/hmm: do not set pfns when returning an error code
mm/hmm: do not unconditionally set pfns when returning EBUSY
mm/hmm: use device_private_entry_to_pfn()
mm/hmm: remove HMM_FAULT_SNAPSHOT
mm/hmm: remove unused code and tidy comments
mm/hmm: return the fault type from hmm_pte_need_fault()
mm/hmm: remove pgmap checking for devmap pages
mm/hmm: check the device private page owner in hmm_range_fault()
mm: simplify device private page handling in hmm_range_fault
mm: handle multiple owners of device private pages in migrate_vma
memremap: add an owner field to struct dev_pagemap
mm: merge hmm_vma_do_fault into into hmm_vma_walk_hole_
mm/hmm: don't handle the non-fault case in hmm_vma_walk_hole_()
mm/hmm: simplify hmm_vma_walk_hugetlb_entry()
mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flag
mm/hmm: don't provide a stub for hmm_range_fault()
mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()
mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling
mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages
...
|
|
Pull XArray updates from Matthew Wilcox:
- Fix two bugs which affected multi-index entries larger than 2^26
indices
- Fix some documentation
- Remove unused IDA macros
- Add a small optimisation for tiny configurations
- Fix a bug which could cause an RCU walker to terminate a marked walk
early
* tag 'xarray-5.7' of git://git.infradead.org/users/willy/linux-dax:
xarray: Fix early termination of xas_for_each_marked
radix tree test suite: Support kmem_cache alignment
XArray: Optimise xas_sibling() if !CONFIG_XARRAY_MULTI
ida: remove abandoned macros
XArray: Fix incorrect comment in header file
XArray: Fix xas_pause for large multi-index entries
XArray: Fix xa_find_next for large multi-index entries
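For context on the marked-walk fix above, here is a minimal, hedged sketch
of an RCU-protected xas_for_each_marked() iteration; the array, the choice
of XA_MARK_0, and handle_entry() are assumptions made for illustration.
#include <linux/rcupdate.h>
#include <linux/xarray.h>

void handle_entry(unsigned long index, void *entry);	/* hypothetical callback */

static void walk_marked_entries(struct xarray *array)
{
	XA_STATE(xas, array, 0);		/* cursor starting at index 0 */
	void *entry;

	rcu_read_lock();
	xas_for_each_marked(&xas, entry, ULONG_MAX, XA_MARK_0) {
		if (xas_retry(&xas, entry))	/* skip internal retry entries */
			continue;
		/* A concurrent store must not make this walk stop early. */
		handle_entry(xas.xa_index, entry);
	}
	rcu_read_unlock();
}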
|
|
Pull drm updates from Dave Airlie:
"This is the main drm pull request for 5.7-rc1.
Highlights:
- i915 enables Tigerlake by default
- i915 and amdgpu have initial OLED backlight support
[ Jani Nikula pipes up and points out that we've had a bunch of
"initial support" code for a long time already, but only now
Lyude made it actually work on real world machines ]
- vmwgfx add support to enable OpenGL 4 userspace
- zero-length arrays are mostly removed.
Detailed summary:
new driver:
- tidss: TI Keystone platform display subsystem
core:
- new drm device warn macros
- mode config valid for memory constrained devices
- bridge bus format negotiation
- consolidated fake vblank event handling
- dma_alloc related cleanups
- drop get_crtc callback
- dp: DP1.4 EDID corruption test
- EDID CEA detailed timings improvements
- relicense some code to dual GPL2/MIT
- convert core vblank support to per-crtc support
- rework drm_global_mutex
- bridge rework to allow omap_dss custom driver removal
- remove drm_fb_helper connector interfaces
- zero-length array removal
scheduler:
- support for modifying the sched list
- revert job distribution optimization
- helper to pick least loaded scheduler
- race condition fix
mst:
- various fixes
- remove register_connector callback
i915:
- uapi to allow userspace-specified CS ring buffer sizes
- Tigerlake enablement patches + Tigerlake enabled by default
- new sysfs entries for engine properties
- display/logging refactors
- eDP/DP fixes for DPCD
- Gen7 back to aliasing-ppgtt
- Gen8+ irq refactor
- Avoid globals
- GEM locking fixes and simplifications
- Ice Lake and Elkhart Lake fixes and workarounds
- Baytrail/Haswell instability fix
- GVT: better VFIO EDID support
amdgpu:
- Rework VM update handling in preparation for HMM support
- drm load/unload removal fixups
- USB-C PD firmware updates
- HDCP srm support
- Navi/renoir PM watermark fixes
- OLED panel support
- Optimize debugging vram access
- Use BACO for runtime pm
- DC clock programming optimizations and fixes
- PSP fw loading sequence updates
- Drop DRIVER_USE_AGP
- Remove legacy drm load and unload callbacks
- ACP Kconfig fix
- Lots of fixes across the driver
amdkfd:
- runtime pm support
- more gfx config details in amdgpu
radeon:
- drop DRIVER_USE_AGP
vmwgfx:
- Disable DMA when SEV encryption is in use
- Shader Model 5 support - needed for GL4 support
msm:
- DPU resource manager refactor
- dpu using atomic global state
mediatek:
- MT8183 DPI support
etnaviv:
- out-of-bounds read fix
- expose feature flags for GC400 STM32MP1 SoC
- runtime suspend entry fix
- dma32 zone fix
hisilicon:
- mode selection fixes
meson:
- YUV420 support
lima:
- add support for heap buffers
tinydrm:
- removal of owner field
- explicit DT dependency removal
- YAML schema conversion
tegra:
- misc cleanups
tidss:
- new driver
virtio:
- better batching of notifications to host
- memory handling reworked
- shmem + gpu context fixes
hibmc:
- add gamma_set support
- improve DPMS support
pl111:
- Integrator IM-PD1 support
sun4i:
- LVDS support for A20 + A33
- DSI panel handling improvements"
* tag 'drm-next-2020-04-01' of git://anongit.freedesktop.org/drm/drm: (1537 commits)
drm/i915/display: Fix mode private_flags comparison at atomic_check
drm/i915/gt: Stage the transfer of the virtual breadcrumb
drm/i915/gt: Select the deepest available parking mode for rc6
drm/i915: Avoid live-lock with i915_vma_parked()
drm/i915/gt: Treat idling as a RPS downclock event
drm/i915/gt: Cancel a hung context if already closed
drm/i915: Use explicit flag to mark unreachable intel_context
drm/amdgpu: don't try to reserve training bo for sriov (v2)
drm/amdgpu/smu11: add support for SMU AC/DC interrupts
drm/amdgpu/swSMU: handle manual AC/DC notifications
drm/amdgpu/swSMU: handle DC controlled by GPIO for navi1x
drm/amdgpu/swSMU: set AC/DC mode based on the current system state (v2)
drm/amdgpu/swSMU: correct the bootup power source for Navi1X (v2)
drm/amdgpu/swSMU: use the smu11 power source helper for navi1x
drm/amdgpu/smu11: add a helper to set the power source
drm/amd/swSMU: add callback to set AC/DC power source (v2)
drm/scheduler: fix rare NULL ptr race
drm/amdgpu: fix the coverage issue to clear ArcVPGRs
drm/amd/display: Fix pageflip event race condition for DCN.
drm/[radeon|amdgpu]: Remove HAINAN board from max_sclk override check
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
"My attempt to revitalize trivial queue I've been neglecting for years
(what a disaster that was for this world, right? :) ) with patches
collected from backlog that were still relevant and not applied
elsewhere in the meantime"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
err.h: remove deprecated PTR_RET for good
blk-mq: Fix typo in comment
x86/boot: Fix comment spelling
sh: mach-highlander: Fix comment spelling
s390/dasd: Fix comment spelling
mfd: wm8994: Fix comment spelling
docs: Add reference in binfmt-misc.rst
genirq: fix kerneldoc comment for irq_desc
drm/amdgpu: fix two documentation mismatch issues
HID: fix Kconfig word ordering
list/hashtable: minor documentation corrections.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Fix out-of-sync IVs in self-test for IPsec AEAD algorithms
Algorithms:
- Use formally verified implementation of x86/curve25519
Drivers:
- Enhance hwrng support in caam
- Use crypto_engine for skcipher/aead/rsa/hash in caam
- Add Xilinx AES driver
- Add uacce driver
- Register zip engine to uacce in hisilicon
- Add support for OCTEON TX CPT engine in marvell"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits)
crypto: af_alg - bool type cosmetics
crypto: arm[64]/poly1305 - add artifact to .gitignore files
crypto: caam - limit single JD RNG output to maximum of 16 bytes
crypto: caam - enable prediction resistance in HRWNG
bus: fsl-mc: add api to retrieve mc version
crypto: caam - invalidate entropy register during RNG initialization
crypto: caam - check if RNG job failed
crypto: caam - simplify RNG implementation
crypto: caam - drop global context pointer and init_done
crypto: caam - use struct hwrng's .init for initialization
crypto: caam - allocate RNG instantiation descriptor with GFP_DMA
crypto: ccree - remove duplicated include from cc_aead.c
crypto: chelsio - remove set but not used variable 'adap'
crypto: marvell - enable OcteonTX cpt options for build
crypto: marvell - add the Virtual Function driver for CPT
crypto: marvell - add support for OCTEON TX CPT engine
crypto: marvell - create common Kconfig and Makefile for Marvell
crypto: arm/neon - memzero_explicit aes-cbc key
crypto: bcm - Use scnprintf() for avoiding potential buffer overflow
crypto: atmel-i2c - Fix wakeup fail
...
|
|
If we have to retransmit requests, try to join their page groups
first.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Refactor nfs_lock_and_join_requests() in order to separate out the
subrequest merging into its own function nfs_lock_and_join_group()
that can be used by O_DIRECT.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Clean up nfs_lock_and_join_requests() to simplify the calculation
of the range covered by the page group, taking into account the
presence of mirrors.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
When a subrequest is being detached from the subgroup, we want to
ensure that it is not holding the group lock, or in the process
of waiting for the group lock.
Fixes: 5b2b5187fa85 ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases")
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter. With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent. This
bypasses the signal sending checks if the parent changes their
credentials during exec.
The severity of this problem can be seen in that, in my limited testing
with a 32bit exec_id, it can take as little as 19s to exec 65536 times,
which means it can take as little as 14 days to wrap a 32bit exec_id.
Adam Zabrocki has succeeded in wrapping the self_exec_id in 7 days.
Even my slower timing is within the uptime of a typical server, which
means self_exec_id is simply a speed bump today, and if exec gets
noticeably faster self_exec_id won't even be a speed bump.
Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions, which means it is possible to hit a window
where the read value of exec_id does not match the written value. So
with very lucky timing, this still remains exploitable after this
change.
I have updated the update of exec_id on exec to use WRITE_ONCE
and the read of exec_id in do_notify_parent to use READ_ONCE
to make it clear that there is no locking between these two
locations.
Link: https://lore.kernel.org/kernel-hardening/[email protected]
Fixes: 2.3.23pre2
Cc: [email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
A hrtimer can be released in its callback, but lockdep_hrtimer_exit()
dereferences the pointer after the callback returns, i.e. a potential use
after free.
Retrieve the context in which the hrtimer expires before the callback is
invoked and use it in lockdep_hrtimer_exit().
Fixes: 40db173965c0 ("lockdep: Add hrtimer context tracing bits")
Reported-by: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
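The shape of the fix is roughly the following hedged sketch; the
bool-returning lockdep_hrtimer_enter()/lockdep_hrtimer_exit() pair
reflects my recollection of the upstream change, and the wrapper
function name is illustrative.
static void sketch_run_hrtimer(struct hrtimer *timer,
			       enum hrtimer_restart (*fn)(struct hrtimer *))
{
	/* Read the expiry context while the hrtimer is guaranteed alive. */
	bool expires_in_hardirq = lockdep_hrtimer_enter(timer);

	fn(timer);	/* the callback may free or rearm the hrtimer */

	/*
	 * Only the captured value is used from here on; dereferencing
	 * 'timer' again would be the potential use-after-free being fixed.
	 */
	lockdep_hrtimer_exit(expires_in_hardirq);
}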
|
|
This reverts commit 7c8978c0837d40c302f5e90d24c298d9ca9fc097; a new
version will come in the next release cycle.
Cc: <[email protected]>
Cc: Russell King <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ludovic Barre <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Ulf Hansson <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
This reverts commit 5caf6102e32ead7ed5d21b5309c1a4a7d70e6a9f. It still
needs some more work and that will happen for the next release cycle,
not this one.
Cc: <[email protected]>
Cc: Russell King <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ludovic Barre <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Ulf Hansson <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Pull networking updates from David Miller:
"Highlights:
1) Fix the iwlwifi regression, from Johannes Berg.
2) Support BSS coloring and 802.11 encapsulation offloading in
hardware, from John Crispin.
3) Fix some potential Spectre issues in qtnfmac, from Sergey
Matyukevich.
4) Add TTL decrement action to openvswitch, from Matteo Croce.
5) Allow parallelization through flow_action setup by not taking the
RTNL mutex, from Vlad Buslov.
6) A lot of zero-length array to flexible-array conversions, from
Gustavo A. R. Silva.
7) Align XDP statistics names across several drivers for consistency,
from Lorenzo Bianconi.
8) Add various pieces of infrastructure for offloading conntrack, and
make use of it in mlx5 driver, from Paul Blakey.
9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.
10) Lots of parallelization improvements during configuration changes
in mlxsw driver, from Ido Schimmel.
11) Add support to devlink for generic packet traps, which report
packets dropped during ACL processing. And use them in mlxsw
driver. From Jiri Pirko.
12) Support bcmgenet on ACPI, from Jeremy Linton.
13) Make BPF compatible with RT, from Thomas Gleixner, Alexei
Starovoitov, and yours truly.
14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.
15) Fix sysfs permissions when network devices change namespaces, from
Christian Brauner.
16) Add a flags element to ethtool_ops so that drivers can more simply
indicate which coalescing parameters they actually support, and
therefore the generic layer can validate the user's ethtool
request. Use this in all drivers, from Jakub Kicinski.
17) Offload FIFO qdisc in mlxsw, from Petr Machata.
18) Support UDP sockets in sockmap, from Lorenz Bauer.
19) Fix stretch ACK bugs in several TCP congestion control modules,
from Pengcheng Yang.
20) Support virtual functions in octeontx2 driver, from Tomasz
Duszynski.
21) Add region operations for devlink and use it in ice driver to dump
NVM contents, from Jacob Keller.
22) Add support for hw offload of MACSEC, from Antoine Tenart.
23) Add support for BPF programs that can be attached to LSM hooks,
from KP Singh.
24) Support for multiple paths, path managers, and counters in MPTCP.
From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
and others.
25) More progress on adding the netlink interface to ethtool, from
Michal Kubecek"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
cxgb4/chcr: nic-tls stats in ethtool
net: dsa: fix oops while probing Marvell DSA switches
net/bpfilter: remove superfluous testing message
net: macb: Fix handling of fixed-link node
net: dsa: ksz: Select KSZ protocol tag
netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
net: stmmac: create dwmac-intel.c to contain all Intel platform
net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
net: dsa: bcm_sf2: Add support for matching VLAN TCI
net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
net: dsa: bcm_sf2: Disable learning for ASP port
net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
net: dsa: b53: Restore VLAN entries upon (re)configuration
net: dsa: bcm_sf2: Fix overflow checks
hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial updates from Greg KH:
"Here is the big set of TTY / Serial patches for 5.7-rc1
Lots of console fixups and reworking in here, serial core tweaks
(doesn't that ever get old, why are we still creating new serial
devices?), serial driver updates, line-protocol driver updates, and
some vt cleanups and fixes included in here as well.
All have been in linux-next with no reported issues"
* tag 'tty-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (161 commits)
serial: 8250: Optimize irq enable after console write
serial: 8250: Fix rs485 delay after console write
vt: vt_ioctl: fix use-after-free in vt_in_use()
vt: vt_ioctl: fix VT_DISALLOCATE freeing in-use virtual console
tty: serial: make SERIAL_SPRD depend on COMMON_CLK
tty: serial: fsl_lpuart: fix return value checking
tty: serial: fsl_lpuart: move dma_request_chan()
ARM: dts: tango4: Make /serial compatible with ns16550a
ARM: dts: mmp*: Make the serial ports compatible with xscale-uart
ARM: dts: mmp*: Fix serial port names
ARM: dts: mmp2-brownstone: Don't redeclare phandle references
ARM: dts: pxa*: Make the serial ports compatible with xscale-uart
ARM: dts: pxa*: Fix serial port names
ARM: dts: pxa*: Don't redeclare phandle references
serial: omap: drop unused dt-bindings header
serial: 8250: 8250_omap: Add DMA support for UARTs on K3 SoCs
serial: 8250: 8250_omap: Work around errata causing spurious IRQs with DMA
serial: 8250: 8250_omap: Extend driver data to pass FIFO trigger info
serial: 8250: 8250_omap: Move locking out from __dma_rx_do_complete()
serial: 8250: 8250_omap: Account for data in flight during DMA teardown
...
|
|
Pull MMC updates from Ulf Hansson:
"MMC core:
- Add support for host software queue for (e)MMC/SD
- Throttle polling rate for CMD6
- Update CMD13 busy condition check for CMD6 commands
- Improve busy detect polling for erase/trim/discard/HPI
- Fixup support for HW busy detection for HPI commands
- Re-work and improve support for eMMC sanitize commands
MMC host:
- mmci:
* Add support for sdmmc variant revision 2.0
- mmci_sdmmc:
* Improve support for busyend detection
* Fixup support for signal voltage switch
* Add support for tuning with delay block
- mtk-sd:
* Fix another SDIO irq issue
- sdhci:
* Disable native card detect when a GPIO-based type exists
* Add option to defer request completion
- sdhci_am654:
* Add support to set a tap value per speed mode
- sdhci-esdhc-imx:
* Add support for i.MX8MM based variant
* Fixup support for standard tuning on i.MX8 usdhc
* Optimize for strobe/clock dll settings
* Fixup support for system and runtime suspend/resume
- sdhci-iproc:
* Update regulator/bus-voltage management for bcm2711
- sdhci-msm:
* Prevent clock gating with PWRSAVE_DLL on broken variants
* Fix management of CQE during SDHCI reset
- sdhci-of-arasan:
* Add support for auto tuning on ZynqMP based platforms
- sdhci-omap:
* Add support for system suspend/resume
- sdhci-sprd:
* Add support for HW busy detection
* Enable support host software queue
- sdhci-tegra:
* Add support for HW busy detection
- tmio/renesas_sdhi:
* Enforce retune after runtime suspend
- renesas_sdhi:
* Use manual tap correction for HS400 on some variants
* Add support for manual correction of tap values for tunings"
* tag 'mmc-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (86 commits)
mmc: cavium-octeon: remove nonsense variable coercion
mmc: mediatek: fix SDIO irq issue
mmc: mmci_sdmmc: Fix clear busyd0end irq flag
dt-bindings: mmc: Fix node name in an example
mmc: core: Re-work the code for eMMC sanitize
mmc: sdhci: use FIELD_GET for preset value bit masks
mmc: sdhci-of-at91: Display clock changes for debug purpose only
mmc: sdhci: iproc: Add custom set_power() callback for bcm2711
mmc: sdhci: am654: Use sdhci_set_power_and_voltage()
mmc: sdhci: at91: Use sdhci_set_power_and_voltage()
mmc: sdhci: milbeaut: Use sdhci_set_power_and_voltage()
mmc: sdhci: arasan: Use sdhci_set_power_and_voltage()
mmc: sdhci: Introduce sdhci_set_power_and_bus_voltage()
mmc: vub300: Use scnprintf() for avoiding potential buffer overflow
dt-bindings: mmc: synopsys-dw-mshc: fix clock-freq-min-max in example
sdhci: tegra: Enable MMC_CAP_WAIT_WHILE_BUSY host capability
sdhci: tegra: Implement Tegra specific set_timeout callback
mmc: sdhci-omap: Add Support for Suspend/Resume
mmc: renesas_sdhi: simplify execute_tuning
mmc: renesas_sdhi: Use BITS_PER_LONG helper
...
|
|
git://git.kernel.org:/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
"Two minor updates for the core security subsystem:
- kernel-doc warning fixes from Randy Dunlap
- header cleanup from YueHaibing"
* 'next-general' of git://git.kernel.org:/pub/scm/linux/kernel/git/jmorris/linux-security:
security: remove duplicated include from security.h
security: <linux/lsm_hooks.h>: fix all kernel-doc warnings
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull SELinux updates from Paul Moore:
"We've got twenty SELinux patches for the v5.7 merge window, the
highlights are below:
- Deprecate setting /sys/fs/selinux/checkreqprot to 1.
This flag was originally created to deal with legacy userspace and
the READ_IMPLIES_EXEC personality flag. We changed the default from
1 to 0 back in Linux v4.4 and now we are taking the next step of
deprecating it; at some point in the future we will take the final
step of rejecting 1.
- Allow kernfs symlinks to inherit the SELinux label of the parent
directory. In order to preserve backwards compatibility this is
protected by the genfs_seclabel_symlinks SELinux policy capability.
- Optimize how we store filename transitions in the kernel, resulting
in some significant improvements to policy load times.
- Do a better job calculating our internal hash table sizes which
resulted in additional policy load improvements and likely general
SELinux performance improvements as well.
- Remove the unused initial SIDs (labels) and improve how we handle
initial SIDs.
- Enable per-file labeling for the bpf filesystem.
- Ensure that we properly label NFS v4.2 filesystems to avoid a
temporary unlabeled condition.
- Add some missing XFS quota command types to the SELinux quota
access controls.
- Fix a problem where we were not updating the seq_file position
index correctly in selinuxfs.
- We consolidate some duplicated code into helper functions.
- A number of list to array conversions.
- Update Stephen Smalley's email address in MAINTAINERS"
* tag 'selinux-pr-20200330' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: clean up indentation issue with assignment statement
NFS: Ensure security label is set for root inode
MAINTAINERS: Update my email address
selinux: avtab_init() and cond_policydb_init() return void
selinux: clean up error path in policydb_init()
selinux: remove unused initial SIDs and improve handling
selinux: reduce the use of hard-coded hash sizes
selinux: Add xfs quota command types
selinux: optimize storage of filename transitions
selinux: factor out loop body from filename_trans_read()
security: selinux: allow per-file labeling for bpffs
selinux: generalize evaluate_cond_node()
selinux: convert cond_expr to array
selinux: convert cond_av_list to array
selinux: convert cond_list to array
selinux: sel_avc_get_stat_idx should increase position index
selinux: allow kernfs symlinks to inherit parent directory context
selinux: simplify evaluate_cond_node()
Documentation,selinux: deprecate setting checkreqprot to 1
selinux: move status variables out of selinux_ss
|