Age | Commit message (Collapse) | Author | Files | Lines |
|
Convert comments that reference mmap_sem to reference mmap_lock instead.
[[email protected]: fix up linux-next leftovers]
[[email protected]: s/lockaphore/lock/, per Vlastimil]
[[email protected]: more linux-next fixups, per Michel]
Signed-off-by: Michel Lespinasse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ying Han <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Convert comments that reference old mmap_sem APIs to reference
corresponding new mmap locking APIs instead.
Signed-off-by: Michel Lespinasse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ying Han <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.
The change is generated using coccinelle with the following rule:
// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)
Signed-off-by: Michel Lespinasse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ying Han <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is a typo in comment, fix it.
Signed-off-by: Ethon Paul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Swapin faults were the last event to charge pages after they had already
been put on the LRU list. Now that we charge directly on swapin, the
lrucare portion of the charge code is unused.
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Joonsoo Kim <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Shakeel Butt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With the page->mapping requirement gone from memcg, we can charge anon and
file-thp pages in one single step, right after they're allocated.
This removes two out of three API calls - especially the tricky commit
step that needed to happen at just the right time between when the page is
"set up" and when it's "published" - somewhat vague and fluid concepts
that varied by page type. All we need is a freshly allocated page and a
memcg context to charge.
v2: prevent double charges on pre-allocated hugepages in khugepaged
[[email protected]: Fix crash - *hpage could be ERR_PTR instead of NULL]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Joonsoo Kim <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Qian Cai <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
divergence from the generic VM accounting means unnecessary code overhead,
and creates a dependency for memcg that page->mapping is set up at the
time of charging, so that page types can be told apart.
Convert the generic accounting sites to mod_lruvec_page_state and friends
to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
The page is already locked in these places, so page->mem_cgroup is stable;
we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
it's set up in time.
Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
NR_SHMEM accounting sites.
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Joonsoo Kim <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Balbir Singh <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The try/commit/cancel protocol that memcg uses dates back to when pages
used to be uncharged upon removal from the page cache, and thus couldn't
be committed before the insertion had succeeded. Nowadays, pages are
uncharged when they are physically freed; it doesn't matter whether the
insertion was successful or not. For the page cache, the transaction
dance has become unnecessary.
Introduce a mem_cgroup_charge() function that simply charges a newly
allocated page to a cgroup and sets up page->mem_cgroup in one single
step. If the insertion fails, the caller doesn't have to do anything but
free/put the page.
Then switch the page cache over to this new API.
Subsequent patches will also convert anon pages, but it needs a bit more
prep work. Right now, memcg depends on page->mapping being already set up
at the time of charging, so that it can maintain its own MEMCG_CACHE and
MEMCG_RSS counters. For anon, page->mapping is set under the same pte
lock under which the page is publishd, so a single charge point that can
block doesn't work there just yet.
The following prep patches will replace the private memcg counters with
the generic vmstat counters, thus removing the page->mapping dependency,
then complete the transition to the new single-point charge API and delete
the old transactional scheme.
v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
v3: rebase on preceeding shmem simplification patch
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Balbir Singh <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The memcg charging API carries a boolean @compound parameter that tells
whether the page we're dealing with is a hugepage.
mem_cgroup_commit_charge() has another boolean @lrucare that indicates
whether the page needs LRU locking or not while charging. The majority of
callsites know those parameters at compile time, which results in a lot of
naked "false, false" argument lists. This makes for cryptic code and is a
breeding ground for subtle mistakes.
Thankfully, the huge page state can be inferred from the page itself and
doesn't need to be passed along. This is safe because charging completes
before the page is published and somebody may split it.
Simplify the callsites by removing @compound, and let memcg infer the
state by using hpage_nr_pages() unconditionally. That function does
PageTransHuge() to identify huge pages, which also helpfully asserts that
nobody passes in tail pages by accident.
The following patches will introduce a new charging API, best not to carry
over unnecessary weight.
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
Reviewed-by: Joonsoo Kim <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Balbir Singh <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: memcontrol: charge swapin pages on instantiation", v2.
This patch series reworks memcg to charge swapin pages directly at
swapin time, rather than at fault time, which may be much later, or
not happen at all.
Changes in version 2:
- prevent double charges on pre-allocated hugepages in khugepaged
- leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
- fix temporary accounting bug by switching rmap<->commit (Joonsoo)
- fix double swap charge bug in cgroup1/cgroup2 code gating
- simplify swapin error checking (Joonsoo)
- mm: memcontrol: document the new swap control behavior (Alex)
- review tags
The delayed swapin charging scheme we have right now causes problems:
- Alex's per-cgroup lru_lock patches rely on pages that have been
isolated from the LRU to have a stable page->mem_cgroup; otherwise
the lock may change underneath him. Swapcache pages are charged only
after they are added to the LRU, and charging doesn't follow the LRU
isolation protocol.
- Joonsoo's anon workingset patches need a suitable LRU at the time
the page enters the swap cache and displaces the non-resident
info. But the correct LRU is only available after charging.
- It's a containment hole / DoS vector. Users can trigger arbitrarily
large swap readahead using MADV_WILLNEED. The memory is never
charged unless somebody actually touches it.
- It complicates the page->mem_cgroup stabilization rules
In order to charge pages directly at swapin time, the memcg code base
needs to be prepared, and several overdue cleanups become a necessity:
To charge pages at swapin time, we need to always have cgroup
ownership tracking of swap records. We also cannot rely on
page->mapping to tell apart page types at charge time, because that's
only set up during a page fault.
To eliminate the page->mapping dependency, memcg needs to ditch its
private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
of the generic vmstat counters and accounting sites, such as
NR_FILE_PAGES, NR_ANON_MAPPED etc.
To switch to generic vmstat counters, the charge sequence must be
adjusted such that page->mem_cgroup is set up by the time these
counters are modified.
The series is structured as follows:
1. Bug fixes
2. Decoupling charging from rmap
3. Swap controller integration into memcg
4. Direct swapin charging
This patch (of 19):
When replacing one page with another one in the cache, we have to decrease
the file count of the old page's NUMA node and increase the one of the new
NUMA node, otherwise the old node leaks the count and the new node
eventually underflows its counter.
Fixes: 74d609585d8b ("page cache: Add and replace pages using the XArray")
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Joonsoo Kim <[email protected]>
Reviewed-by: Balbir Singh <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Roman Gushchin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Highlights:
- speedup dead root detection during orphan cleanup, eg. when there
are many deleted subvolumes waiting to be cleaned, the trees are
now looked up in radix tree instead of a O(N^2) search
- snapshot creation with inherited qgroup will mark the qgroup
inconsistent, requires a rescan
- send will emit file capabilities after chown, this produces a
stream that does not need postprocessing to set the capabilities
again
- direct io ported to iomap infrastructure, cleaned up and simplified
code, notably removing last use of struct buffer_head in btrfs code
Core changes:
- factor out backreference iteration, to be used by ordinary
backreferences and relocation code
- improved global block reserve utilization
* better logic to serialize requests
* increased maximum available for unlink
* improved handling on large pages (64K)
- direct io cleanups and fixes
* simplify layering, where cloned bios were unnecessarily created
for some cases
* error handling fixes (submit, endio)
* remove repair worker thread, used to avoid deadlocks during
repair
- refactored block group reading code, preparatory work for new type
of block group storage that should improve mount time on large
filesystems
Cleanups:
- cleaned up (and slightly sped up) set/get helpers for metadata data
structure members
- root bit REF_COWS got renamed to SHAREABLE to reflect the that the
blocks of the tree get shared either among subvolumes or with the
relocation trees
Fixes:
- when subvolume deletion fails due to ENOSPC, the filesystem is not
turned read-only
- device scan deals with devices from other filesystems that changed
ownership due to overwrite (mkfs)
- fix a race between scrub and block group removal/allocation
- fix long standing bug of a runaway balance operation, printing the
same line to the syslog, caused by a stale status bit on a reloc
tree that prevented progress
- fix corrupt log due to concurrent fsync of inodes with shared
extents
- fix space underflow for NODATACOW and buffered writes when it for
some reason needs to fallback to COW mode"
* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
btrfs: fix space_info bytes_may_use underflow during space cache writeout
btrfs: fix space_info bytes_may_use underflow after nocow buffered write
btrfs: fix wrong file range cleanup after an error filling dealloc range
btrfs: remove redundant local variable in read_block_for_search
btrfs: open code key_search
btrfs: split btrfs_direct_IO to read and write part
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
fs: remove dio_end_io()
btrfs: switch to iomap_dio_rw() for dio
iomap: remove lockdep_assert_held()
iomap: add a filesystem hook for direct I/O bio submission
fs: export generic_file_buffered_read()
btrfs: turn space cache writeout failure messages into debug messages
btrfs: include error on messages about failure to write space/inode caches
btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
btrfs: make checksum item extension more efficient
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
btrfs: unexport btrfs_compress_set_level()
btrfs: simplify iget helpers
...
|
|
We no longer return 0 here and the comment doesn't tell us anything that
we don't already know (SIGBUS is a pretty good indicator that things
didn't work out).
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Export generic_file_buffered_read() to be used to supplement incomplete
direct reads.
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Yang Shi writes:
Currently, when truncating a shmem file, if the range is partly in a THP
(start or end is in the middle of THP), the pages actually will just get
cleared rather than being freed, unless the range covers the whole THP.
Even though all the subpages are truncated (randomly or sequentially), the
THP may still be kept in page cache.
This might be fine for some usecases which prefer preserving THP, but
balloon inflation is handled in base page size. So when using shmem THP
as memory backend, QEMU inflation actually doesn't work as expected since
it doesn't free memory. But the inflation usecase really needs to get the
memory freed. (Anonymous THP will also not get freed right away, but will
be freed eventually when all subpages are unmapped: whereas shmem THP
still stays in page cache.)
Split THP right away when doing partial hole punch, and if split fails
just clear the page so that read of the punched area will return zeroes.
Hugh Dickins adds:
Our earlier "team of pages" huge tmpfs implementation worked in the way
that Yang Shi proposes; and we have been using this patch to continue to
split the huge page when hole-punched or truncated, since converting over
to the compound page implementation. Although huge tmpfs gives out huge
pages when available, if the user specifically asks to truncate or punch a
hole (perhaps to free memory, perhaps to reduce the memcg charge), then
the filesystem should do so as best it can, splitting the huge page.
That is not always possible: any additional reference to the huge page
prevents split_huge_page() from succeeding, so the result can be flaky.
But in practice it works successfully enough that we've not seen any
problem from that.
Add shmem_punch_compound() to encapsulate the decision of when a split is
needed, and doing the split if so. Using this simplifies the flow in
shmem_undo_range(); and the first (trylock) pass does not need to do any
page clearing on failure, because the second pass will either succeed or
do that clearing. Following the example of zero_user_segment() when
clearing a partial page, add flush_dcache_page() and set_page_dirty() when
clearing a hole - though I'm not certain that either is needed.
But: split_huge_page() would be sure to fail if shmem_undo_range()'s
pagevec holds further references to the huge page. The easiest way to fix
that is for find_get_entries() to return early, as soon as it has put one
compound head or tail into the pagevec. At first this felt like a hack;
but on examination, this convention better suits all its callers - or will
do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
speedup by checking for compound pages there.
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to retry once. We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time. This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle the
page fault on a single page. However that should hardly happen, and after
all for each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
now can retry the page fault for multiple times if necessary without the
need to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault allows to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In existing code we have multiple places that has taken special care of
the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
even the 2nd try will have the ALLOW_RETRY set, then use that helper in
all existing special paths. One example is in __lock_page_or_retry(), now
we'll drop the mmap_sem only in the first attempt of page fault and we'll
keep it in follow up retries, so old locking behavior will be retained.
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Brian Geffon <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Denis Plotnikov <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Martin Cracauer <[email protected]>
Cc: Marty McFadden <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maya Gokhale <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- These were never called PCG flags; they've been called FGP flags since
their introduction in 2014.
- The FGP_FOR_MMAP flag was misleadingly documented as if it was an
alternative to FGP_CREAT instead of an option to it.
- Rename the 'offset' parameter to 'index'.
- Capitalisation, formatting, rewording.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
No in-tree users (proc, madvise, memcg, mincore) can be built as a module.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM).
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The first argument of shrink_readahead_size_eio() is not used. Hence
remove it from the function definition and from all the callers.
Signed-off-by: Souptick Joarder <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Mount failure issue happens under the scenario: Application forked dozens
of threads to mount the same number of cramfs images separately in docker,
but several mounts failed with high probability. Mount failed due to the
checking result of the page(read from the superblock of loop dev) is not
uptodate after wait_on_page_locked(page) returned in function cramfs_read:
wait_on_page_locked(page);
if (!PageUptodate(page)) {
...
}
The reason of the checking result of the page not uptodate: systemd-udevd
read the loopX dev before mount, because the status of loopX is Lo_unbound
at this time, so loop_make_request directly trigger the calling of io_end
handler end_buffer_async_read, which called SetPageError(page). So It
caused the page can't be set to uptodate in function
end_buffer_async_read:
if(page_uptodate && !PageError(page)) {
SetPageUptodate(page);
}
Then mount operation is performed, it used the same page which is just
accessed by systemd-udevd above, Because this page is not uptodate, it
will launch a actual read via submit_bh, then wait on this page by calling
wait_on_page_locked(page). When the I/O of the page done, io_end handler
end_buffer_async_read is called, because no one cleared the page
error(during the whole read path of mount), which is caused by
systemd-udevd reading, so this page is still in "PageError" status, which
can't be set to uptodate in function end_buffer_async_read, then caused
mount failure.
But sometimes mount succeed even through systemd-udeved read loopX dev
just before, The reason is systemd-udevd launched other loopX read just
between step 3.1 and 3.2, the steps as below:
1, loopX dev default status is Lo_unbound;
2, systemd-udved read loopX dev (page is set to PageError);
3, mount operation
1) set loopX status to Lo_bound;
==>systemd-udevd read loopX dev<==
2) read loopX dev(page has no error)
3) mount succeed
As the loopX dev status is set to Lo_bound after step 3.1, so the other
loopX dev read by systemd-udevd will go through the whole I/O stack, part
of the call trace as below:
SYS_read
vfs_read
do_sync_read
blkdev_aio_read
generic_file_aio_read
do_generic_file_read:
ClearPageError(page);
mapping->a_ops->readpage(filp, page);
here, mapping->a_ops->readpage() is blkdev_readpage. In latest kernel,
some function name changed, the call trace as below:
blkdev_read_iter
generic_file_read_iter
generic_file_buffered_read:
/*
* A previous I/O error may have been due to temporary
* failures, eg. mutipath errors.
* Pg_error will be set again if readpage fails.
*/
ClearPageError(page);
/* Start the actual read. The read will unlock the page*/
error=mapping->a_ops->readpage(flip, page);
We can see ClearPageError(page) is called before the actual read,
then the read in step 3.2 succeed.
This patch is to add the calling of ClearPageError just before the actual
read of read path of cramfs mount. Without the patch, the call trace as
below when performing cramfs mount:
do_mount
cramfs_read
cramfs_blkdev_read
read_cache_page
do_read_cache_page:
filler(data, page);
or
mapping->a_ops->readpage(data, page);
With the patch, the call trace as below when performing mount:
do_mount
cramfs_read
cramfs_blkdev_read
read_cache_page:
do_read_cache_page:
ClearPageError(page); <== new add
filler(data, page);
or
mapping->a_ops->readpage(data, page);
With the patch, mount operation trigger the calling of
ClearPageError(page) before the actual read, the page has no error if no
additional page error happen when I/O done.
Signed-off-by: Xianting Tian <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When handling a page fault, we drop mmap_sem to start async readahead so
that we don't block on IO submission with mmap_sem held. However there's
no point to drop mmap_sem in case readahead is disabled. Handle that case
to avoid pointless dropping of mmap_sem and retrying the fault. This was
actually reported to block mlockall(MCL_CURRENT) indefinitely.
Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
Reported-by: Minchan Kim <[email protected]>
Reported-by: Robert Stupp <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
At some point filemap_write_and_wait() and
filemap_write_and_wait_range() got the exact same implementation with
the exception of the range being specified in *_range()
Similar to other functions in fs.h which call *_range(..., 0,
LLONG_MAX), change filemap_write_and_wait() to be a static inline which
calls filemap_write_and_wait_range()
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
One of our services is observing hanging ps/top/etc under heavy write
IO, and the task states show this is an mmap_sem priority inversion:
A write fault is holding the mmap_sem in read-mode and waiting for
(heavily cgroup-limited) IO in balance_dirty_pages():
balance_dirty_pages+0x724/0x905
balance_dirty_pages_ratelimited+0x254/0x390
fault_dirty_shared_page.isra.96+0x4a/0x90
do_wp_page+0x33e/0x400
__handle_mm_fault+0x6f0/0xfa0
handle_mm_fault+0xe4/0x200
__do_page_fault+0x22b/0x4a0
page_fault+0x45/0x50
Somebody tries to change the address space, contending for the mmap_sem in
write-mode:
call_rwsem_down_write_failed_killable+0x13/0x20
do_mprotect_pkey+0xa8/0x330
SyS_mprotect+0xf/0x20
do_syscall_64+0x5b/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The waiting writer locks out all subsequent readers to avoid lock
starvation, and several threads can be seen hanging like this:
call_rwsem_down_read_failed+0x14/0x30
proc_pid_cmdline_read+0xa0/0x480
__vfs_read+0x23/0x140
vfs_read+0x87/0x130
SyS_read+0x42/0x90
do_syscall_64+0x5b/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
To fix this, do what we do for cache read faults already: drop the
mmap_sem before calling into anything IO bound, in this case the
balance_dirty_pages() function, and return VM_FAULT_RETRY.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
generic_file_direct_write() tries to invalidate pagecache after O_DIRECT
write. Unlike to similar code in dio_complete() this silently ignores
error returned from invalidate_inode_pages2_range().
According to comment this code here because not all filesystems call
dio_complete() to do proper invalidation after O_DIRECT write. Noticeable
example is a blkdev_direct_IO().
This patch calls dio_warn_stale_pagecache() if invalidation fails.
Link: http://lkml.kernel.org/r/157270038294.4812.2238891109785106069.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This helper prints warning if direct I/O write failed to invalidate cache,
and set EIO at inode to warn usersapce about possible data corruption.
See also commit 5a9d929d6e13 ("iomap: report collisions between directio
and buffered writes to userspace").
Direct I/O is supported by non-disk filesystems, for example NFS. Thus
generic code needs this even in kernel without CONFIG_BLOCK.
Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
generic_file_direct_write() invalidates cache at entry. Second time this
should be done when request completes. But this function calls second
invalidation at exit unconditionally even for async requests.
This patch skips second invalidation for async requests (-EIOCBQUEUED).
Link: http://lkml.kernel.org/r/157270037850.4812.15036239021726025572.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The generic_file_vm_ops is defined in <linux/ramfs.h> so include it to
fix the following warning:
mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ben Dooks <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In previous patch, an application could put part of its text section in
THP via madvise(). These THPs will be protected from writes when the
application is still running (TXTBSY). However, after the application
exits, the file is available for writes.
This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write. A new counter nr_thps is added to struct
address_space. In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch is (hopefully) the first step to enable THP for non-shmem
filesystems.
This patch enables an application to put part of its text sections to THP
via madvise, for example:
madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);
We tried to reuse the logic for THP on tmpfs.
Currently, write is not supported for non-shmem THP. khugepaged will only
process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
(see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
execve(). This requirement limits non-shmem THP to text sections.
The next patch will handle writes, which would only happen when the all
the vmas with VM_DENYWRITE are unmapped.
An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
feature.
[[email protected]: fix build without CONFIG_SHMEM]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix double unlock in collapse_file()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With THP, current check of offset:
VM_BUG_ON_PAGE(page->index != offset, page);
is no longer accurate. Update it to:
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Similar to previous patch, pagecache_get_page() avoids race condition with
truncate by checking page->mapping == mapping. This does not work for
compound pages. This patch let it check compound_head(page)->mapping
instead.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Enable THP for text section of non-shmem files", v10;
This patchset follows up discussion at LSF/MM 2019. The motivation is to
put text section of an application in THP, and thus reduces iTLB miss rate
and improves performance. Both Facebook and Oracle showed strong
interests to this feature.
To make reviews easier, this set aims a mininal valid product. Current
version of the work does not have any changes to file system specific
code. This comes with some limitations (discussed later).
This set enables an application to "hugify" its text section by simply
running something like:
madvise(0x600000, 0x80000, MADV_HUGEPAGE);
Before this call, the /proc/<pid>/maps looks like:
00400000-074d0000 r-xp 00000000 00:27 2006927 app
After this call, part of the text section is split out and mapped to
THP:
00400000-00425000 r-xp 00000000 00:27 2006927 app
00600000-00e00000 r-xp 00200000 00:27 2006927 app <<< on THP
00e00000-074d0000 r-xp 00a00000 00:27 2006927 app
Limitations:
1. This only works for text section (vma with VM_DENYWRITE).
2. Original limitation #2 is removed in v3.
We gated this feature with an experimental config, READ_ONLY_THP_FOR_FS.
Once we get better support on the write path, we can remove the config and
enable it by default.
Tested cases:
1. Tested with btrfs and ext4.
2. Tested with real work application (memcache like caching service).
3. Tested with "THP aware uprobe":
https://patchwork.kernel.org/project/linux-mm/list/?series=131339
This patch (of 7):
Currently, filemap_fault() avoids race condition with truncate by checking
page->mapping == mapping. This does not work for compound pages. This
patch let it check compound_head(page)->mapping instead.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.
Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/[email protected]/
Kirill and Huang Ying contributed several fixes.
[[email protected]: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Jan Kara <[email protected]>
Reviewed-by: Kirill Shutemov <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Tested-by: William Kucharski <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Tested-by: Qian Cai <[email protected]>
Tested-by: Mikhail Gavrilov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Song Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This actually checks that writeback is needed or in progress.
Link: http://lkml.kernel.org/r/156378817069.1087.1302816672037672488.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Functions like filemap_write_and_wait_range() should do nothing if inode
has no dirty pages or pages currently under writeback. But they anyway
construct struct writeback_control and this does some atomic operations if
CONFIG_CGROUP_WRITEBACK=y - on fast path it locks inode->i_lock and
updates state of writeback ownership, on slow path might be more work.
Current this path is safely avoided only when inode mapping has no pages.
For example generic_file_read_iter() calls filemap_write_and_wait_range()
at each O_DIRECT read - pretty hot path.
This patch skips starting new writeback if mapping has no dirty tags set.
If writeback is already in progress filemap_write_and_wait_range() will
wait for it.
Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace 1 << compound_order(page) with compound_nr(page). Minor
improvements in readability.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
Commit 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking
operations") changed when mmap_sem is dropped during filemap page fault
and when returning VM_FAULT_RETRY.
Correct the comment to reflect the change.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Acked-by: Song Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We can just pass a NULL filler and do the right thing inside of
do_read_cache_page based on the NULL parameter.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "fix filler_t callback type mismatches", v2.
Casting mapping->a_ops->readpage to filler_t causes an indirect call
type mismatch with Control-Flow Integrity checking. This change fixes
the mismatch in read_cache_page_gfp and read_mapping_page by adding
using a NULL filler argument as an indication to call ->readpage
directly, and by passing the right parameter callbacks in nfs and jffs2.
This patch (of 4):
Code cleanup.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Many bug fixes and cleanups, and an optimization for case-insensitive
lookups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix coverity warning on error path of filename setup
ext4: replace ktype default_attrs with default_groups
ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()
ext4: refactor initialize_dirent_tail()
ext4: rename "dirent_csum" functions to use "dirblock"
ext4: allow directory holes
jbd2: drop declaration of journal_sync_buffer()
ext4: use jbd2_inode dirty range scoping
jbd2: introduce jbd2_inode dirty range scoping
mm: add filemap_fdatawait_range_keep_errors()
ext4: remove redundant assignment to node
ext4: optimize case-insensitive lookups
ext4: make __ext4_get_inode_loc plug
ext4: clean up kerneldoc warnigns when building with W=1
ext4: only set project inherit bit for directory
ext4: enforce the immutable flag on open files
ext4: don't allow any modifications to an immutable file
jbd2: fix typo in comment of journal_submit_inode_data_buffers
jbd2: fix some print format mistakes
ext4: gracefully handle ext4_break_layouts() failure during truncate
|
|
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull copy_file_range updates from Darrick Wong:
"This fixes numerous parameter checking problems and inconsistent
behaviors in the new(ish) copy_file_range system call.
Now the system call will actually check its range parameters
correctly; refuse to copy into files for which the caller does not
have sufficient privileges; update mtime and strip setuid like file
writes are supposed to do; and allows copying up to the EOF of the
source file instead of failing the call like we used to.
Summary:
- Create a generic copy_file_range handler and make individual
filesystems responsible for calling it (i.e. no more assuming that
do_splice_direct will work or is appropriate)
- Refactor copy_file_range and remap_range parameter checking where
they are the same
- Install missing copy_file_range parameter checking(!)
- Remove suid/sgid and update mtime like any other file write
- Change the behavior so that a copy range crossing the source file's
eof will result in a short copy to the source file's eof instead of
EINVAL
- Permit filesystems to decide if they want to handle
cross-superblock copy_file_range in their local handlers"
* tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fuse: copy_file_range needs to strip setuid bits and update timestamps
vfs: allow copy_file_range to copy across devices
xfs: use file_modified() helper
vfs: introduce file_modified() helper
vfs: add missing checks to copy_file_range
vfs: remove redundant checks from generic_remap_checks()
vfs: introduce generic_file_rw_checks()
vfs: no fallback for ->copy_file_range
vfs: introduce generic_copy_file_range()
|
|
This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.
Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
__delete_from_swap_cache() to trigger:
page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
anon
flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
page dumped because: VM_BUG_ON_PAGE(entry != page)
page->mem_cgroup:ffff978433ace000
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:170!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
RIP: 0010:__delete_from_swap_cache+0x20d/0x240
Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
Call Trace:
delete_from_swap_cache+0x46/0xa0
try_to_free_swap+0xbc/0x110
swap_writepage+0x13/0x70
pageout.isra.0+0x13c/0x350
shrink_page_list+0xc14/0xdf0
shrink_inactive_list+0x1e5/0x3c0
shrink_node_memcg+0x202/0x760
shrink_node+0xe0/0x470
balance_pgdat+0x2d1/0x510
kswapd+0x220/0x420
kthread+0xfb/0x130
ret_from_fork+0x22/0x40
and it's not immediately obvious why it happens. It's too late in the
rc cycle to do anything but revert for now.
Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
Reported-and-bisected-by: Mikhail Gavrilov <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.
Signed-off-by: Ross Zwisler <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: [email protected]
|
|
Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.
[Amir] Shorten copy length instead of checking pos_in limits
because input file size already abides by the limits.
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
The access limit checks on input file range in generic_remap_checks()
are redundant because the input file size is guaranteed to be within
limits and pos+len are already checked to be within input file size.
Beyond the fact that the check cannot fail, if it would have failed,
it could return -EFBIG for input file range error. There is no precedent
for that. -EFBIG is returned in syscalls that would change file length.
With that call removed, we can fold generic_access_check_limits() into
generic_write_check_limits().
Signed-off-by: Amir Goldstein <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Factor out helper with some checks on in/out file that are
common to clone_file_range and copy_file_range.
Suggested-by: Darrick J. Wong <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
I removed the only user of this and hadn't noticed it was now unused.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Laurent Dufour <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|