Age | Commit message (Collapse) | Author | Files | Lines |
|
Generalizing pkey_alloc__scnprintf_access_rights(), so that we can use
it with other flags-like arguments, such as mount's mountflags argument.
Cc: Adrian Hunter <[email protected]>
Cc: Benjamin Peterson <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Wang Nan <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The intention is to have this as a library, since it is not perf
specific at all.
I did the switch for the files where I'm the only contributor, with the
exception of a few lines changed by Jiri Olsa.
Acked-by: Jiri Olsa <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Wang Nan <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
It'll use tools/include copy of linux/fs.h to generate a table to be
used by tools, initially by the 'mount' and 'umount' beautifiers in
'perf trace', but that could also be used to translate from a string
constant to the integer value to be used in a eBPF or tracefs tracepoint
filter.
When used without any args it produces:
$ tools/perf/trace/beauty/mount_flags.sh
static const char *mount_flags[] = {
[1 ? (ilog2(1) + 1) : 0] = "RDONLY",
[2 ? (ilog2(2) + 1) : 0] = "NOSUID",
[4 ? (ilog2(4) + 1) : 0] = "NODEV",
[8 ? (ilog2(8) + 1) : 0] = "NOEXEC",
[16 ? (ilog2(16) + 1) : 0] = "SYNCHRONOUS",
[32 ? (ilog2(32) + 1) : 0] = "REMOUNT",
[64 ? (ilog2(64) + 1) : 0] = "MANDLOCK",
[128 ? (ilog2(128) + 1) : 0] = "DIRSYNC",
[1024 ? (ilog2(1024) + 1) : 0] = "NOATIME",
[2048 ? (ilog2(2048) + 1) : 0] = "NODIRATIME",
[4096 ? (ilog2(4096) + 1) : 0] = "BIND",
[8192 ? (ilog2(8192) + 1) : 0] = "MOVE",
[16384 ? (ilog2(16384) + 1) : 0] = "REC",
[32768 ? (ilog2(32768) + 1) : 0] = "SILENT",
[16 + 1] = "POSIXACL",
[17 + 1] = "UNBINDABLE",
[18 + 1] = "PRIVATE",
[19 + 1] = "SLAVE",
[20 + 1] = "SHARED",
[21 + 1] = "RELATIME",
[22 + 1] = "KERNMOUNT",
[23 + 1] = "I_VERSION",
[24 + 1] = "STRICTATIME",
[25 + 1] = "LAZYTIME",
[26 + 1] = "SUBMOUNT",
[27 + 1] = "NOREMOTELOCK",
[28 + 1] = "NOSEC",
[29 + 1] = "BORN",
[30 + 1] = "ACTIVE",
[31 + 1] = "NOUSER",
};
$
Cc: Adrian Hunter <[email protected]>
Cc: Benjamin Peterson <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Wang Nan <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
We'll use it to create tables for the 'flags' argument to the 'mount'
and 'umount' syscalls.
Add it to check_headers.sh so that when a new protocol gets added we get
a notification during the build process.
Cc: Adrian Hunter <[email protected]>
Cc: Benjamin Peterson <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Wang Nan <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Replace a bunch of spaces with tab, cleans up indentation
Signed-off-by: Colin Ian King <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Commit 9ee3e06610fd ("HID: i2c-hid: override HID descriptors for certain
devices") added a new dmi_system_id quirk table to override certain HID
report descriptors for some systems that lack them.
But the table wasn't properly terminated, causing the dmi matching to
walk off into la-la-land, and starting to treat random data as dmi
descriptor pointers, causing boot-time oopses if you were at all
unlucky.
Terminate the array.
We really should have some way to just statically check that arrays that
should be terminated by an empty entry actually are so. But the HID
people really should have caught this themselves, rather than have me
deal with an oops during the merge window. Tssk, tssk.
Cc: Julian Sax <[email protected]>
Cc: Benjamin Tissoires <[email protected]>
Cc: Jiri Kosina <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <[email protected]>: (132 commits)
hugetlbfs: dirty pages as they are added to pagecache
mm: export add_swap_extent()
mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
mm/gup: cache dev_pagemap while pinning pages
Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
mm: return zero_resv_unavail optimization
mm: zero remaining unavailable struct pages
tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
mm/gup_benchmark.c: add additional pinning methods
mm/gup_benchmark.c: time put_page()
mm: don't raise MEMCG_OOM event due to failed high-order allocation
mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
...
|
|
Pull networking fixes from David Miller:
"What better way to start off a weekend than with some networking bug
fixes:
1) net namespace leak in dump filtering code of ipv4 and ipv6, fixed
by David Ahern and Bjørn Mork.
2) Handle bad checksums from hardware when using CHECKSUM_COMPLETE
properly in UDP, from Sean Tranchetti.
3) Remove TCA_OPTIONS from policy validation, it turns out we don't
consistently use nested attributes for this across all packet
schedulers. From David Ahern.
4) Fix SKB corruption in cadence driver, from Tristram Ha.
5) Fix broken WoL handling in r8169 driver, from Heiner Kallweit.
6) Fix OOPS in pneigh_dump_table(), from Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits)
net/neigh: fix NULL deref in pneigh_dump_table()
net: allow traceroute with a specified interface in a vrf
bridge: do not add port to router list when receives query with source 0.0.0.0
net/smc: fix smc_buf_unuse to use the lgr pointer
ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called
net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
lan743x: Remove SPI dependency from Microchip group.
drivers: net: remove <net/busy_poll.h> inclusion when not needed
net: phy: genphy_10g_driver: Avoid NULL pointer dereference
r8169: fix broken Wake-on-LAN from S5 (poweroff)
octeontx2-af: Use GFP_ATOMIC under spin lock
net: ethernet: cadence: fix socket buffer corruption problem
net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
net: sched: Remove TCA_OPTIONS from policy
ice: Poll for link status change
ice: Allocate VF interrupts and set queue map
ice: Introduce ice_dev_onetime_setup
net: hns3: Fix for warning uninitialized symbol hw_err_lst3
octeontx2-af: Copy the right amount of memory
net: udp: fix handling of CHECKSUM_COMPLETE packets
...
|
|
Pull sparc fixes from David Miller:
"Some more sparc fixups, mostly aimed at getting the allmodconfig build
up and clean again"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Rework xchg() definition to avoid warnings.
sparc64: Export __node_distance.
sparc64: Make corrupted user stacks more debuggable.
|
|
Some test systems were experiencing negative huge page reserve counts and
incorrect file block counts. This was traced to /proc/sys/vm/drop_caches
removing clean pages from hugetlbfs file pagecaches. When non-hugetlbfs
explicit code removes the pages, the appropriate accounting is not
performed.
This can be recreated as follows:
fallocate -l 2M /dev/hugepages/foo
echo 1 > /proc/sys/vm/drop_caches
fallocate -l 2M /dev/hugepages/foo
grep -i huge /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
HugePages_Total: 2048
HugePages_Free: 2047
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 4194304 kB
ls -lsh /dev/hugepages/foo
4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
To address this issue, dirty pages as they are added to pagecache. This
can easily be reproduced with fallocate as shown above. Read faulted
pages will eventually end up being marked dirty. But there is a window
where they are clean and could be impacted by code such as drop_caches.
So, just dirty them all as they are added to the pagecache.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 6bda666a03f0 ("hugepages: fold find_or_alloc_pages into huge_no_page()")
Signed-off-by: Mike Kravetz <[email protected]>
Acked-by: Mihcla Hocko <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Aneesh Kumar K . V" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Btrfs currently does not support swap files because swap's use of bmap
does not work with copy-on-write and multiple devices. See 35054394c4b3
("Btrfs: stop providing a bmap operation to avoid swapfile corruptions").
However, the swap code has a mechanism for the filesystem to manually add
swap extents using add_swap_extent() from the ->swap_activate() aop.
iomap has done this since 67482129cdab ("iomap: add a swapfile activation
function"). Btrfs will do the same in a later patch, so export
add_swap_extent().
Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com
Signed-off-by: Omar Sandoval <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
through the filesystem, and to make swapoff() call ->swap_deactivate().
For Btrfs, we want the latter but not the former, so split this flag into
two. This makes us always call ->swap_deactivate() if ->swap_activate()
succeeded, not just if it didn't add any swap extents itself.
This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.
Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
Signed-off-by: Omar Sandoval <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Sterba <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
MAP_FIXED_NOREPLACE
Add a test for MAP_FIXED_NOREPLACE, based on some code originally by Jann
Horn. This would have caught the overlap bug reported by Daniel Micay.
I originally suggested to Michal that we create MAP_FIXED_NOREPLACE, but
instead of writing a selftest I spent my time bike-shedding whether it
should be called MAP_FIXED_SAFE/NOCLOBBER/WEAK/NEW .. mea culpa.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Ellerman <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: Joel Stanley <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: David Goldblatt <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There should be no cache left by the time we overwrite the old transhuge
pmd with the new one. It's already too late to flush through the virtual
address because we already copied the page data to the new physical
address.
So flush the cache before the data copy.
Also delete the "end" variable to shutoff a "unused variable" warning on
x86 where flush_cache_range() is a noop.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
right away. do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page(). By the time do_huge_pmd_numa_page()
runs some CPU could still access the page through the TLB.
change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start(). So there's no need of
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() too, because by the time
migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
invalidated in the secondary MMUs. It has to or if a secondary MMU can
still write to the page, the migrate_page_copy() would lose data.
However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page or the below can happen for MMU notifier users sharing the
primary MMU pagetables and only implementing ->invalidate_range:
CPU0 CPU1 GPU sharing linux pagetables using
only ->invalidate_range
----------- ------------ ---------
GPU secondary MMU writes to the page
mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
do_huge_pmd_numa_page()
CPU TLB flush globally (1)
CPU cannot write to page
migrate_misplaced_transhuge_page()
GPU writes to the page...
migrate_page_copy()
...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_stop() noop
->invalidate_range()
GPU secondary MMU is invalidated
and cannot write to the page anymore
(too late)
Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
too late, we also need a mmu_notifier_invalidate_range() before calling
migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
(3) also arrives too late.
This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing the
TLB and without first calling mmu_notifier_invalidate_range().
Even converting the removed mmu_notifier_invalidate_range_only_end() into
a mmu_notifier_invalidate_range_end() would not have been enough to fix
this, because it run after migrate_page_copy().
After the hugepage data copy is done migrate_misplaced_transhuge_page()
can proceed and call set_pmd_at without having to flush the TLB nor any
secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
flush, has to happen before the migrate_page_copy() is called or it would
be a bug in the first place (and it was for drivers using
->invalidate_range()).
KVM is unaffected because it doesn't implement ->invalidate_range().
The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
uses the generic migrate_pages which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
and all mmu notifiers there before copying the page.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Aaron Tomlin <[email protected]>
Cc: Jerome Glisse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "migrate_misplaced_transhuge_page race conditions".
Aaron found a new instance of the THP MADV_DONTNEED race against
pmdp_clear_flush* variants, that was apparently left unfixed.
While looking into the race found by Aaron, I may have found two more
issues in migrate_misplaced_transhuge_page.
These race conditions would not cause kernel instability, but they'd
corrupt userland data or leave data non zero after MADV_DONTNEED.
I did only minor testing, and I don't expect to be able to reproduce this
(especially the lack of ->invalidate_range before migrate_page_copy,
requires the latest iommu hardware or infiniband to reproduce). The last
patch is noop for x86 and it needs further review from maintainers of
archs that implement flush_cache_range() (not in CC yet).
To avoid confusion, it's not the first patch that introduces the bug fixed
in the second patch, even before removing the
pmdp_huge_clear_flush_notify, that _notify suffix was called after
migrate_page_copy already run.
This patch (of 3):
This is a corollary of ced108037c2aa ("thp: fix MADV_DONTNEED vs. numa
balancing race"), 58ceeb6bec8 ("thp: fix MADV_DONTNEED vs. MADV_FREE
race") and 5b7abeae3af8c ("thp: fix MADV_DONTNEED vs clear soft dirty
race).
When the above three fixes where posted Dave asked
https://lkml.kernel.org/r/[email protected]
but apparently this was missed.
The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
in a54a407fbf7 ("mm: Close races between THP migration and PMD numa
clearing").
The important part of such commit is only the part where the page lock is
not released until the first do_huge_pmd_numa_page() finished disarming
the pagenuma/protnone.
The addition of pmdp_clear_flush() wasn't beneficial to such commit and
there's no commentary about such an addition either.
I guess the pmdp_clear_flush() in such commit was added just in case for
safety, but it ended up introducing the MADV_DONTNEED race condition found
by Aaron.
At that point in time nobody thought of such kind of MADV_DONTNEED race
conditions yet (they were fixed later) so the code may have looked more
robust by adding the pmdp_clear_flush().
This specific race condition won't destabilize the kernel, but it can
confuse userland because after MADV_DONTNEED the memory won't be zeroed
out.
This also optimizes the code and removes a superfluous TLB flush.
[[email protected]: reflow comment to 80 cols, fix grammar and typo (beacuse)]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Reported-by: Aaron Tomlin <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Jerome Glisse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The static lock quarantine_lock is used in quarantine.c to protect the
quarantine queue datastructures. It is taken inside quarantine queue
manipulation routines (quarantine_put(), quarantine_reduce() and
quarantine_remove_cache()), with IRQs disabled. This is not a problem on
a stock kernel but is problematic on an RT kernel where spin locks are
sleeping spinlocks, which can sleep and can not be acquired with disabled
interrupts.
Convert the quarantine_lock to a raw spinlock_t. The usage of
quarantine_lock is confined to quarantine.c and the work performed while
the lock is held is used for debug purpose.
[[email protected]: slightly altered the commit message]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Clark Williams <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Getting pages from ZONE_DEVICE memory needs to check the backing device's
live-ness, which is tracked in the device's dev_pagemap metadata. This
metadata is stored in a radix tree and looking it up adds measurable
software overhead.
This patch avoids repeating this relatively costly operation when
dev_pagemap is used by caching the last dev_pagemap while getting user
pages. The gup_benchmark kernel self test reports this reduces time to
get user pages to as low as 1/3 of the previous time.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved") breaks movable_node kernel option because it changed
the memory gap range to reserved memblock. So, the node is marked as
Normal zone even if the SRAT has Hot pluggable affinity.
=====================================================================
kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
...
kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
...
kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
...
kernel: Movable zone start for each node
kernel: Node 3: 0x00001c0000000000
kernel: Early memory node ranges
...
=====================================================================
The original issue is fixed by the former patches, so let's revert commit
124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved").
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masayoshi Mizuma <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When checking for valid pfns in zero_resv_unavail(), it is not necessary
to verify that pfns within pageblock_nr_pages ranges are valid, only the
first one needs to be checked. This is because memory for pages are
allocated in contiguous chunks that contain pageblock_nr_pages struct
pages.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Masayoshi Mizuma <[email protected]>
Reviewed-by: Masayoshi Mizuma <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: Fix for movable_node boot option", v3.
This patch series contains a fix for the movable_node boot option issue
which was introduced by commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
regions into memblock.reserved").
The commit breaks the option because it changed the memory gap range to
reserved memblock. So, the node is marked as Normal zone even if the SRAT
has Hot pluggable affinity.
First and second patch fix the original issue which the commit tried to
fix, then revert the commit.
This patch (of 3):
There is a kernel panic that is triggered when reading /proc/kpageflags on
the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
BUG: unable to handle kernel paging request at fffffffffffffffe
PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
RIP: 0010:stable_page_flags+0x27/0x3c0
Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
Call Trace:
kpageflags_read+0xc7/0x120
proc_reg_read+0x3c/0x60
__vfs_read+0x36/0x170
vfs_read+0x89/0x130
ksys_pread64+0x71/0x90
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7efc42e75e23
Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
According to kernel bisection, this problem became visible due to commit
f7f99100d8d9 which changes how struct pages are initialized.
Memblock layout affects the pfn ranges covered by node/zone. Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
default (no memmap= given) memblock layout is like below:
MEMBLOCK configuration:
memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x4
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...
If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:
MEMBLOCK configuration:
memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x3
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...
This causes shrinking node 0's pfn range because it is calculated by the
address range of memblock.memory. So some of struct pages in the gap
range are left uninitialized.
We have a function zero_resv_unavail() which does zeroing the struct pages
outside memblock.memory, but currently it covers only the reserved
unavailable range (i.e. memblock.memory && !memblock.reserved). This
patch extends it to cover all unavailable range, which fixes the reported
issue.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by-by: Masayoshi Mizuma <[email protected]>
Tested-by: Oscar Salvador <[email protected]>
Tested-by: Masayoshi Mizuma <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new option '-H' to the gup benchmark to help understand how hugetlb
mapping pages compare with the default.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new benchmark option, -S, to request MAP_SHARED. This can be used
to compare with MAP_PRIVATE, or for files that require this option, like
dax.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Allow a user to specify a file to map by adding a new option, '-f',
providing a means to test various file backings.
If not specified, the benchmark will use a private mapping of /dev/zero,
which produces an anonymous mapping as before.
[[email protected]: avoid using comma operator]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If the '-w' parameter was provided, the benchmark would exit due to a
mssing 'break'.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Keith Busch <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Provide new gup benchmark ioctl commands to run different user page
pinning methods, get_user_pages_longterm() and get_user_pages(), in
addition to the existing get_user_pages_fast().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Keith Busch <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We'd like to measure time to unpin user pages, so this adds a second
benchmark timer on put_page, separate from get_page.
Adding the field breaks this ioctl ABI, but should be okay since this an
in-tree kernel selftest.
[[email protected]: add expansion to struct gup_benchmark for future use]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Keith Busch <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It was reported that on some of our machines containers were restarted
with OOM symptoms without an obvious reason. Despite there were almost no
memory pressure and plenty of page cache, MEMCG_OOM event was raised
occasionally, causing the container management software to think, that OOM
has happened. However, no tasks have been killed.
The following investigation showed that the problem is caused by a failing
attempt to charge a high-order page. In such case, the OOM killer is
never invoked. As shown below, it can happen under conditions, which are
very far from a real OOM: e.g. there is plenty of clean page cache and no
memory pressure.
There is no sense in raising an OOM event in this case, as it might
confuse a user and lead to wrong and excessive actions (e.g. restart the
workload, as in my case).
Let's look at the charging path in try_charge(). If the memory usage is
about memory.max, which is absolutely natural for most memory cgroups, we
try to reclaim some pages. Even if we were able to reclaim enough memory
for the allocation, the following check can fail due to a race with
another concurrent allocation:
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
goto retry;
For regular pages the following condition will save us from triggering
the OOM:
if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
goto retry;
But for high-order allocation this condition will intentionally fail. The
reason behind is that we'll likely fall to regular pages anyway, so it's
ok and even preferred to return ENOMEM.
In this case the idea of raising MEMCG_OOM looks dubious.
Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
order check, so that the event won't be raised for high order allocations.
This change doesn't affect regular pages allocation and charging.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process application
doing concurrent writes and fsyncs to a small range of a file.
range_cyclic
writeback Process 1 Process 2
xfs_vm_writepages
write_cache_pages
writeback_index = 2
cycled = 0
....
find page 2 dirty
lock Page 2
->writepage
page 2 writeback
page 2 clean
page 2 added to bio
no more pages
write()
locks page 1
dirties page 1
locks page 2
dirties page 1
fsync()
....
xfs_vm_writepages
write_cache_pages
start index 0
find page 1 towrite
lock Page 1
->writepage
page 1 writeback
page 1 clean
page 1 added to bio
find page 2 towrite
lock Page 2
page 2 is writeback
<blocks>
write()
locks page 1
dirties page 1
fsync()
....
xfs_vm_writepages
write_cache_pages
start index 0
!done && !cycled
sets index to 0, restarts lookup
find page 1 dirty
find page 1 towrite
lock Page 1
page 1 is writeback
<blocks>
lock Page 1
<blocks>
DEADLOCK because:
- process 1 needs page 2 writeback to complete to make
enough progress to issue IO pending for page 1
- writeback needs page 1 writeback to complete so process 2
can progress and unlock the page it is blocked on, then it
can issue the IO pending for page 2
- process 2 can't make progress until process 1 issues IO
for page 1
The underlying cause of the problem here is that range_cyclic writeback is
processing pages in descending index order as we hold higher index pages
in a structure controlled from above write_cache_pages(). The
write_cache_pages() caller needs to be able to submit these pages for IO
before write_cache_pages restarts writeback at mapping index 0 to avoid
wcp inverting the page lock/writeback wait order.
generic_writepages() is not susceptible to this bug as it has no private
context held across write_cache_pages() - filesystems using this
infrastructure always submit pages in ->writepage immediately and so there
is no problem with range_cyclic going back to mapping index 0.
However:
mpage_writepages() has a private bio context,
exofs_writepages() has page_collect
fuse_writepages() has fuse_fill_wb_data
nfs_writepages() has nfs_pageio_descriptor
xfs_vm_writepages() has xfs_writepage_ctx
All of these ->writepages implementations can hold pages under writeback
in their private structures until write_cache_pages() returns, and hence
they are all susceptible to this deadlock.
Also worth noting is that ext4 has it's own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock. I looked
at the code long enough to understand that it has a similar retry loop for
range_cyclic writeback reaching the end of the file and then promptly ran
away before my eyes bled too much. I'll leave it for the ext4 developers
to determine if their code is actually has this deadlock and how to fix it
if it has.
There's a few ways I can see avoid this deadlock. There's probably more,
but these are the first I've though of:
1. get rid of range_cyclic altogether
2. range_cyclic always stops at EOF, and we start again from
writeback index 0 on the next call into write_cache_pages()
2a. wcp also returns EAGAIN to ->writepages implementations to
indicate range cyclic has hit EOF. writepages implementations can
then flush the current context and call wpc again to continue. i.e.
lift the retry into the ->writepages implementation
3. range_cyclic uses trylock_page() rather than lock_page(), and it
skips pages it can't lock without blocking. It will already do this
for pages under writeback, so this seems like a no-brainer
3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
blocking as per pages under writeback.
I don't think #1 is an option - range_cyclic prevents frequently
dirtied lower file offset from starving background writeback of
rarely touched higher file offsets.
#2 is simple, and I don't think it will have any impact on
performance as going back to the start of the file implies an
immediate seek. We'll have exactly the same number of seeks if we
switch writeback to another inode, and then come back to this one
later and restart from index 0.
#2a is pretty much "status quo without the deadlock". Moving the
retry loop up into the wcp caller means we can issue IO on the
pending pages before calling wcp again, and so avoid locking or
waiting on pages in the wrong order. I'm not convinced we need to do
this given that we get the same thing from #2 on the next writeback
call from the writeback infrastructure.
#3 is really just a band-aid - it doesn't fix the access/wait
inversion problem, just prevents it from becoming a deadlock
situation. I'd prefer we fix the inversion, not sweep it under the
carpet like this.
#3a is really an optimisation that just so happens to include the
band-aid fix of #3.
So it seems that the simplest way to fix this issue is to implement
solution #2
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
memmap_init_zone, is getting complex, because it is called from different
contexts: hotplug, and during boot, and also because it must handle some
architecture quirks. One of them is mirrored memory.
Move the code that decides whether to skip mirrored memory outside of
memmap_init_zone, into a separate function.
[[email protected]: uninline overlap_memmap_init()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Steven Sistare <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
update_defer_init() should be called only when struct page is about to be
initialized. Because it counts number of initialized struct pages, but
there we may skip struct pages if there is some mirrored memory.
So move, update_defer_init() after checking for mirrored memory.
Also, rename update_defer_init() to defer_init() and reverse the return
boolean to emphasize that this is a boolean function, that tells that the
reset of memmap initialization should be deferred.
Make this function self-contained: do not pass number of already
initialized pages in this zone by using static counters.
I found this bug by reading the code. The effect is that fewer than
expected struct pages are initialized early in boot, and it is possible
that in some corner cases we may fail to boot when mirrored pages are
used. The deferred on demand code should somewhat mitigate this. But
this still brings some inconsistencies compared to when booting without
mirrored pages, so it is better to fix.
[[email protected]: add comment about defer_init's lack of locking]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: make defer_init non-inline, __meminit]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Steven Sistare <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
memmap_init is sometimes a macro sometimes a function based on
__HAVE_ARCH_MEMMAP_INIT. It is only a function on ia64. Make memmap_init
a weak function instead, and let ia64 redefine it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Steven Sistare <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This will allow to use generic refcount_t interfaces to check counters
overflow instead of currently existing VM_BUG_ON(). The only difference
after the patch is VM_BUG_ON() may cause BUG(), while refcount_t fires
with WARN(). But this seems not to be significant here, since such the
problems are usually caught by syzbot with panic-on-warn enabled.
Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrea Parri <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If move_freepages_block() returns 0 because !zone_spans_pfn(),
*num_movable can hold the value from the stack because it does not get
initialized in move_freepages().
Move the initialization to move_freepages_block() to guarantee the value
actually makes sense.
This currently doesn't affect its only caller where num_movable != NULL,
so no bug fix, but just more robust.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace "fallthru" with a proper "fall through" annotation.
This fix is part of the ongoing efforts to enabling
-Wimplicit-fallthrough
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now we recycle the uffd servicing threads earlier than the lock threads.
It might happen that when the lock thread is still blocked at a pthread
mutex lock while the servicing thread has already quitted for the cpu so
the lock thread will be blocked forever and hang the test program. To fix
the possible race, recycle the lock threads first.
This never happens with current missing-only tests, but when I start to
run the write-protection tests (the feature is not yet posted upstream) it
happens every time of the run possibly because in that new test we'll need
to service two page faults for each lock operation.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We do very similar things in read and poll modes, but we're copying the
codes around. Share the codes properly on reading the message and
handling the page fault to make the code cleaner. Meanwhile this solves
previous mismatch of behaviors between the two modes on that the old code:
- did not check EAGAIN case in read() mode
- ignored BOUNCE_VERIFY check in read() mode
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Firstly, the help in the comment region is obsolete, now we support
three parameters. Since at it, change it and move it into the help
message of the program.
Also, the help messages dumped here and there is obsolete too. Use a
single usage() helper.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Having two gigantic arrays that must manually be kept in sync, including
ifdefs, isn't exactly robust. To make it easier to catch such issues in
the future, add a BUILD_BUG_ON().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Kemi Wang <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We clear the pte temporarily during read/modify/write update of the pte.
If we take a page fault while the pte is cleared, the application can get
SIGBUS. One such case is with remap_pfn_range without a backing
vm_ops->fault callback. do_fault will return SIGBUS in that case.
cpu 0 cpu1
mprotect()
ptep_modify_prot_start()/pte cleared.
.
. page fault.
.
.
prep_modify_prot_commit()
Fix this by taking page table lock and rechecking for pte_none.
[[email protected]: fix crash observed with syzkaller run]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Willem de Bruijn <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ido Schimmel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The comment for PFN_SPECIAL is missed in pfn_t.h. Add comment to get
consistent with other pfn flags.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
brk might be used to shrink memory mapping too other than munmap(). So,
it may hold write mmap_sem for long time when shrinking large mapping, as
what commit ("mm: mmap: zap pages with read mmap_sem in munmap")
described.
The brk() will not manipulate vmas anymore after __do_munmap() call for
the mapping shrink use case. But, it may set mm->brk after __do_munmap(),
which needs hold write mmap_sem.
However, a simple trick can workaround this by setting mm->brk before
__do_munmap(). Then restore the original value if __do_munmap() fails.
With this trick, it is safe to downgrade to read mmap_sem.
So, the same optimization, which downgrades mmap_sem to read for zapping
pages, is also feasible and reasonable to this case.
The period of holding exclusive mmap_sem for shrinking large mapping would
be reduced significantly with this optimization.
[[email protected]: tweak comment]
[[email protected]: fix unsigned compare against 0 issue]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Colin Ian King <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Other than munmap, mremap might be used to shrink memory mapping too.
So, it may hold write mmap_sem for long time when shrinking large
mapping, as what commit ("mm: mmap: zap pages with read mmap_sem in
munmap") described.
The mremap() will not manipulate vmas anymore after __do_munmap() call for
the mapping shrink use case, so it is safe to downgrade to read mmap_sem.
So, the same optimization, which downgrades mmap_sem to read for zapping
pages, is also feasible and reasonable to this case.
The period of holding exclusive mmap_sem for shrinking large mapping
would be reduced significantly with this optimization.
MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adopt this
optimization since they need manipulate vmas after do_munmap(),
downgrading mmap_sem may create race window.
Simple mapping shrink is the low hanging fruit, and it may cover the
most cases of unmap with munmap together.
[[email protected]: tweak comment]
[[email protected]: fix unsigned compare against 0 issue]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Colin Ian King <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These codes can be replaced with new inline vmf_error().
Link: http://lkml.kernel.org/r/20180927171411.GA23331@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ia64, mips, parisc, powerpc, sh, sparc, x86 architectures use the same
version of huge_ptep_get, so move this generic implementation into
asm-generic/hugetlb.h.
[[email protected]: fix ARM 3level page tables]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Luiz Capitulino <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Helge Deller <[email protected]> [parisc]
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Paul Burton <[email protected]> [MIPS]
Acked-by: Ingo Molnar <[email protected]> [x86]
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
arm, ia64, sh, x86 architectures use the same version
of huge_ptep_set_access_flags, so move this generic implementation
into asm-generic/hugetlb.h.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Luiz Capitulino <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Helge Deller <[email protected]> [parisc]
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Paul Burton <[email protected]> [MIPS]
Acked-by: Ingo Molnar <[email protected]> [x86]
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
arm, ia64, mips, powerpc, sh, x86 architectures use the same version of
huge_ptep_set_wrprotect, so move this generic implementation into
asm-generic/hugetlb.h.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Luiz Capitulino <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Helge Deller <[email protected]> [parisc]
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Paul Burton <[email protected]> [MIPS]
Acked-by: Ingo Molnar <[email protected]> [x86]
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
arm, arm64, powerpc, sparc, x86 architectures use the same version of
prepare_hugepage_range, so move this generic implementation into
asm-generic/hugetlb.h.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Luiz Capitulino <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Helge Deller <[email protected]> [parisc]
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Paul Burton <[email protected]> [MIPS]
Acked-by: Ingo Molnar <[email protected]> [x86]
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
arm, arm64, ia64, mips, parisc, powerpc, sh, sparc, x86 architectures use
the same version of huge_pte_wrprotect, so move this generic
implementation into asm-generic/hugetlb.h.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Luiz Capitulino <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Helge Deller <[email protected]> [parisc]
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Paul Burton <[email protected]> [MIPS]
Acked-by: Ingo Molnar <[email protected]> [x86]
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|