<feed xmlns='http://www.w3.org/2005/Atom'>
<title>blaster4385/linux-IllusionX/mm, branch v6.12.1</title>
<subtitle>Linux kernel with personal config changes for arch linux</subtitle>
<id>https://git.tablaster.dev/blaster4385/linux-IllusionX/atom?h=v6.12.1</id>
<link rel='self' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/atom?h=v6.12.1'/>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/'/>
<updated>2024-11-22T14:30:26Z</updated>
<entry>
<title>mm/mmap: fix __mmap_region() error handling in rare merge failure case</title>
<updated>2024-11-22T14:30:26Z</updated>
<author>
<name>Liam R. Howlett</name>
<email>Liam.Howlett@Oracle.com</email>
</author>
<published>2024-11-19T17:59:45Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=3cb721578ad99677ac2dede8b084ae3e306f5a3f'/>
<id>urn:sha1:3cb721578ad99677ac2dede8b084ae3e306f5a3f</id>
<content type='text'>
The mmap_region() function tries to install a new vma, which requires a
pre-allocation for the maple tree write due to the complex locking
scenarios involved.

Recent efforts to simplify the error recovery required the relocation of
the preallocation of the maple tree nodes (via vma_iter_prealloc()
calling mas_preallocate()) higher in the function.

The relocation of the preallocation meant that, if there was a file
associated with the vma and the driver call (mmap_file()) modified the
vma flags, then a new merge of the new vma with existing vmas is
attempted.

During the attempt to merge the existing vma with the new vma, the vma
iterator is used - the same iterator that would be used for the next
write attempt to the tree.  In the event of needing a further allocation
and if the new allocations fails, the vma iterator (and contained maple
state) will cleaned up, including freeing all previous allocations and
will be reset internally.

Upon returning to the __mmap_region() function, the error is available
in the vma_merge_struct and can be used to detect the -ENOMEM status.

Hitting an -ENOMEM scenario after the driver callback leaves the system
in a state that undoing the mapping is worse than continuing by dipping
into the reserve.

A preallocation should be performed in the case of an -ENOMEM and the
allocations were lost during the failure scenario.  The __GFP_NOFAIL
flag is used in the allocation to ensure the allocation succeeds after
implicitly telling the driver that the mapping was happening.

The range is already set in the vma_iter_store() call below, so it is
not necessary and is dropped.

Reported-by: syzbot+bc6bfc25a68b7a020ee1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/x/log.txt?x=17b0ace8580000
Fixes: 5de195060b2e2 ("mm: resolve faulty mmap_region() error path behaviour")
Signed-off-by: Liam R. Howlett &lt;Liam.Howlett@Oracle.com&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Reviewed-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2024-11-17T00:00:38Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-11-17T00:00:38Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=4a5df37964673effcd9f84041f7423206a5ae5f2'/>
<id>urn:sha1:4a5df37964673effcd9f84041f7423206a5ae5f2</id>
<content type='text'>
Pull hotfixes from Andrew Morton:
 "10 hotfixes, 7 of which are cc:stable. All singletons, please see the
  changelogs for details"

* tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: revert "mm: shmem: fix data-race in shmem_getattr()"
  ocfs2: uncache inode which has failed entering the group
  mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
  mm, doc: update read_ahead_kb for MADV_HUGEPAGE
  fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args()
  sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
  crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
  mm/mremap: fix address wraparound in move_page_tables()
  tools/mm: fix compile error
  mm, swap: fix allocation and scanning race with swapoff
</content>
</entry>
<entry>
<title>mm: revert "mm: shmem: fix data-race in shmem_getattr()"</title>
<updated>2024-11-16T23:30:32Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2024-11-16T00:57:24Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=d1aa0c04294e29883d65eac6c2f72fe95cc7c049'/>
<id>urn:sha1:d1aa0c04294e29883d65eac6c2f72fe95cc7c049</id>
<content type='text'>
Revert d949d1d14fa2 ("mm: shmem: fix data-race in shmem_getattr()") as
suggested by Chuck [1].  It is causing deadlocks when accessing tmpfs over
NFS.

As Hugh commented, "added just to silence a syzbot sanitizer splat: added
where there has never been any practical problem".

Link: https://lkml.kernel.org/r/ZzdxKF39VEmXSSyN@tissot.1015granger.net [1]
Fixes: d949d1d14fa2 ("mm: shmem: fix data-race in shmem_getattr()")
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Cc: Jeongjun Park &lt;aha310510@gmail.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix NULL pointer dereference in alloc_pages_bulk_noprof</title>
<updated>2024-11-15T06:43:48Z</updated>
<author>
<name>Jinjiang Tu</name>
<email>tujinjiang@huawei.com</email>
</author>
<published>2024-11-13T08:32:35Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=8ce41b0f9d77cca074df25afd39b86e2ee3aa68e'/>
<id>urn:sha1:8ce41b0f9d77cca074df25afd39b86e2ee3aa68e</id>
<content type='text'>
We triggered a NULL pointer dereference for ac.preferred_zoneref-&gt;zone in
alloc_pages_bulk_noprof() when the task is migrated between cpusets.

When cpuset is enabled, in prepare_alloc_pages(), ac-&gt;nodemask may be
&amp;current-&gt;mems_allowed.  when first_zones_zonelist() is called to find
preferred_zoneref, the ac-&gt;nodemask may be modified concurrently if the
task is migrated between different cpusets.  Assuming we have 2 NUMA Node,
when traversing Node1 in ac-&gt;zonelist, the nodemask is 2, and when
traversing Node2 in ac-&gt;zonelist, the nodemask is 1.  As a result, the
ac-&gt;preferred_zoneref points to NULL zone.

In alloc_pages_bulk_noprof(), for_each_zone_zonelist_nodemask() finds a
allowable zone and calls zonelist_node_idx(ac.preferred_zoneref), leading
to NULL pointer dereference.

__alloc_pages_noprof() fixes this issue by checking NULL pointer in commit
ea57485af8f4 ("mm, page_alloc: fix check for NULL preferred_zone") and
commit df76cee6bbeb ("mm, page_alloc: remove redundant checks from alloc
fastpath").

To fix it, check NULL pointer for preferred_zoneref-&gt;zone.

Link: https://lkml.kernel.org/r/20241113083235.166798-1-tujinjiang@huawei.com
Fixes: 387ba26fb1cb ("mm/page_alloc: add a bulk page allocator")
Signed-off-by: Jinjiang Tu &lt;tujinjiang@huawei.com&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Alexander Lobakin &lt;alobakin@pm.me&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Nanyong Sun &lt;sunnanyong@huawei.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mremap: fix address wraparound in move_page_tables()</title>
<updated>2024-11-15T06:43:48Z</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2024-11-11T19:34:30Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=a4a282daf1a190f03790bf163458ea3c8d28d217'/>
<id>urn:sha1:a4a282daf1a190f03790bf163458ea3c8d28d217</id>
<content type='text'>
On 32-bit platforms, it is possible for the expression `len + old_addr &lt;
old_end` to be false-positive if `len + old_addr` wraps around. 
`old_addr` is the cursor in the old range up to which page table entries
have been moved; so if the operation succeeded, `old_addr` is the *end* of
the old region, and adding `len` to it can wrap.

The overflow causes mremap() to mistakenly believe that PTEs have been
copied; the consequence is that mremap() bails out, but doesn't move the
PTEs back before the new VMA is unmapped, causing anonymous pages in the
region to be lost.  So basically if userspace tries to mremap() a
private-anon region and hits this bug, mremap() will return an error and
the private-anon region's contents appear to have been zeroed.

The idea of this check is that `old_end - len` is the original start
address, and writing the check that way also makes it easier to read; so
fix the check by rearranging the comparison accordingly.

(An alternate fix would be to refactor this function by introducing an
"orig_old_start" variable or such.)


Tested in a VM with a 32-bit X86 kernel; without the patch:

```
user@horn:~/big_mremap$ cat test.c
#define _GNU_SOURCE
#include &lt;stdlib.h&gt;
#include &lt;stdio.h&gt;
#include &lt;err.h&gt;
#include &lt;sys/mman.h&gt;

#define ADDR1 ((void*)0x60000000)
#define ADDR2 ((void*)0x10000000)
#define SIZE          0x50000000uL

int main(void) {
  unsigned char *p1 = mmap(ADDR1, SIZE, PROT_READ|PROT_WRITE,
      MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0);
  if (p1 == MAP_FAILED)
    err(1, "mmap 1");
  unsigned char *p2 = mmap(ADDR2, SIZE, PROT_NONE,
      MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0);
  if (p2 == MAP_FAILED)
    err(1, "mmap 2");
  *p1 = 0x41;
  printf("first char is 0x%02hhx\n", *p1);
  unsigned char *p3 = mremap(p1, SIZE, SIZE,
      MREMAP_MAYMOVE|MREMAP_FIXED, p2);
  if (p3 == MAP_FAILED) {
    printf("mremap() failed; first char is 0x%02hhx\n", *p1);
  } else {
    printf("mremap() succeeded; first char is 0x%02hhx\n", *p3);
  }
}
user@horn:~/big_mremap$ gcc -static -o test test.c
user@horn:~/big_mremap$ setarch -R ./test
first char is 0x41
mremap() failed; first char is 0x00
```

With the patch:

```
user@horn:~/big_mremap$ setarch -R ./test
first char is 0x41
mremap() succeeded; first char is 0x41
```

Link: https://lkml.kernel.org/r/20241111-fix-mremap-32bit-wrap-v1-1-61d6be73b722@google.com
Fixes: af8ca1c14906 ("mm/mremap: optimize the start addresses in move_page_tables()")
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Reviewed-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Acked-by: Qi Zheng &lt;zhengqi.arch@bytedance.com&gt;
Reviewed-by: Liam R. Howlett &lt;Liam.Howlett@Oracle.com&gt;
Cc: Joel Fernandes (Google) &lt;joel@joelfernandes.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, swap: fix allocation and scanning race with swapoff</title>
<updated>2024-11-14T23:25:07Z</updated>
<author>
<name>Kairui Song</name>
<email>kasong@tencent.com</email>
</author>
<published>2024-11-12T08:34:14Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=0ec8bc9e880eb576dc4492e8e0c7153ed0a71031'/>
<id>urn:sha1:0ec8bc9e880eb576dc4492e8e0c7153ed0a71031</id>
<content type='text'>
There are two flags used to synchronize allocation and scanning with
swapoff: SWP_WRITEOK and SWP_SCANNING.

SWP_WRITEOK: Swapoff will first unset this flag, at this point any further
swap allocation or scanning on this device should just abort so no more
new entries will be referencing this device.  Swapoff will then unuse all
existing swap entries.

SWP_SCANNING: This flag is set when device is being scanned.  Swapoff will
wait for all scanner to stop before the final release of the swap device
structures to avoid UAF.  Note this flag is the highest used bit of
si-&gt;flags so it could be added up arithmetically, if there are multiple
scanner.

commit 5f843a9a3a1e ("mm: swap: separate SSD allocation from
scan_swap_map_slots()") ignored SWP_SCANNING and SWP_WRITEOK flags while
separating cluster allocation path from the old allocation path.  Add the
flags back to fix swapoff race.  The race is hard to trigger as si-&gt;lock
prevents most parallel operations, but si-&gt;lock could be dropped for
reclaim or discard.  This issue is found during code review.

This commit fixes this problem.  For SWP_SCANNING, Just like before, set
the flag before scan and remove it afterwards.

For SWP_WRITEOK, there are several places where si-&gt;lock could be dropped,
it will be error-prone and make the code hard to follow if we try to cover
these places one by one.  So just do one check before the real allocation,
which is also very similar like before.  With new cluster allocator it may
waste a bit of time iterating the clusters but won't take long, and
swapoff is not performance sensitive.

Link: https://lkml.kernel.org/r/20241112083414.78174-1-ryncsn@gmail.com
Fixes: 5f843a9a3a1e ("mm: swap: separate SSD allocation from scan_swap_map_slots()")
Reported-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Closes: https://lore.kernel.org/linux-mm/87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com/
Signed-off-by: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Barry Song &lt;v-songbaohua@oppo.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2024-11-13T16:58:11Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-11-13T16:58:11Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=4b49c0ba4eeb31b44462303cac4162476b72c831'/>
<id>urn:sha1:4b49c0ba4eeb31b44462303cac4162476b72c831</id>
<content type='text'>
Pull misc fixes from Andrew Morton:
 "10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
  singletons"

* tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: swapfile: fix cluster reclaim work crash on rotational devices
  selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
  mm/thp: fix deferred split queue not partially_mapped: fix
  mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
  nommu: pass NULL argument to vma_iter_prealloc()
  ocfs2: fix UBSAN warning in ocfs2_verify_volume()
  nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
  nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
  mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
  mm: count zeromap read and set for swapout and swapin
</content>
</entry>
<entry>
<title>mm: swapfile: fix cluster reclaim work crash on rotational devices</title>
<updated>2024-11-13T00:01:36Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2024-11-07T14:08:36Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=dcf32ea7ecede94796fb30231b3969d7c838374c'/>
<id>urn:sha1:dcf32ea7ecede94796fb30231b3969d7c838374c</id>
<content type='text'>
syzbot and Daan report a NULL pointer crash in the new full swap cluster
reclaim work:

&gt; Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
&gt; KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
&gt; CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0
&gt; Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
&gt; Workqueue: events swap_reclaim_work
&gt; RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49
&gt; Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 &lt;80&gt; 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00
&gt; RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202
&gt; RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078
&gt; RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
&gt; RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
&gt; R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000
&gt; R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000
&gt; FS:  0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
&gt; CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
&gt; CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0
&gt; DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
&gt; DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
&gt; Call Trace:
&gt;  &lt;TASK&gt;
&gt;  __list_del_entry_valid include/linux/list.h:124 [inline]
&gt;  __list_del_entry include/linux/list.h:215 [inline]
&gt;  list_move_tail include/linux/list.h:310 [inline]
&gt;  swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748
&gt;  swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779

The syzbot console output indicates a virtual environment where swapfile
is on a rotational device.  In this case, clusters aren't actually used,
and si-&gt;full_clusters is not initialized.  Daan's report is from qemu, so
likely rotational too.

Make sure to only schedule the cluster reclaim work when clusters are
actually in use.

Link: https://lkml.kernel.org/r/20241107142335.GB1172372@cmpxchg.org
Link: https://lore.kernel.org/lkml/672ac50b.050a0220.2edce.1517.GAE@google.com/
Link: https://github.com/systemd/systemd/issues/35044
Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters")
Reported-by: syzbot+078be8bfa863cb9e0c6b@syzkaller.appspotmail.com
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-by: Daan De Meyer &lt;daan.j.demeyer@gmail.com&gt;
Cc: Kairui Song &lt;ryncsn@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/thp: fix deferred split queue not partially_mapped: fix</title>
<updated>2024-11-12T18:14:00Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2024-11-10T21:11:21Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=a3477c9e02cc9d62a7c8bfc4e7453f5af9a175aa'/>
<id>urn:sha1:a3477c9e02cc9d62a7c8bfc4e7453f5af9a175aa</id>
<content type='text'>
Though even more elusive than before, list_del corruption has still been
seen on THP's deferred split queue.

The idea in commit e66f3185fa04 was right, but its implementation wrong. 
The context omitted an important comment just before the critical test:
"split_folio() removes folio from list on success." In ignoring that
comment, when a THP split succeeded, the code went on to release the
preceding safe folio, preserving instead an irrelevant (formerly head)
folio: which gives no safety because it's not on the list.  Fix the logic.

Link: https://lkml.kernel.org/r/3c995a30-31ce-0998-1b9f-3a2cb9354c91@google.com
Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Usama Arif &lt;usamaarif642@gmail.com&gt;
Reviewed-by: Zi Yan &lt;ziy@nvidia.com&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Barry Song &lt;baohua@kernel.org&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases</title>
<updated>2024-11-12T18:14:00Z</updated>
<author>
<name>John Hubbard</name>
<email>jhubbard@nvidia.com</email>
</author>
<published>2024-11-05T03:29:44Z</published>
<link rel='alternate' type='text/html' href='https://git.tablaster.dev/blaster4385/linux-IllusionX/commit/?id=94efde1d15399f5c88e576923db9bcd422d217f2'/>
<id>urn:sha1:94efde1d15399f5c88e576923db9bcd422d217f2</id>
<content type='text'>
commit 53ba78de064b ("mm/gup: introduce
check_and_migrate_movable_folios()") created a new constraint on the
pin_user_pages*() API family: a potentially large internal allocation must
now occur, for FOLL_LONGTERM cases.

A user-visible consequence has now appeared: user space can no longer pin
more than 2GB of memory anymore on x86_64.  That's because, on a 4KB
PAGE_SIZE system, when user space tries to (indirectly, via a device
driver that calls pin_user_pages()) pin 2GB, this requires an allocation
of a folio pointers array of MAX_PAGE_ORDER size, which is the limit for
kmalloc().

In addition to the directly visible effect described above, there is also
the problem of adding an unnecessary allocation.  The **pages array
argument has already been allocated, and there is no need for a redundant
**folios array allocation in this case.

Fix this by avoiding the new allocation entirely.  This is done by
referring to either the original page[i] within **pages, or to the
associated folio.  Thanks to David Hildenbrand for suggesting this
approach and for providing the initial implementation (which I've tested
and adjusted slightly) as well.

[jhubbard@nvidia.com: whitespace tweak, per David]
  Link: https://lkml.kernel.org/r/131cf9c8-ebc0-4cbb-b722-22fa8527bf3c@nvidia.com
[jhubbard@nvidia.com: bypass pofs_get_folio(), per Oscar]
  Link: https://lkml.kernel.org/r/c1587c7f-9155-45be-bd62-1e36c0dd6923@nvidia.com
Link: https://lkml.kernel.org/r/20241105032944.141488-2-jhubbard@nvidia.com
Fixes: 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()")
Signed-off-by: John Hubbard &lt;jhubbard@nvidia.com&gt;
Suggested-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Vivek Kasireddy &lt;vivek.kasireddy@intel.com&gt;
Cc: Dave Airlie &lt;airlied@redhat.com&gt;
Cc: Gerd Hoffmann &lt;kraxel@redhat.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Cc: Dongwon Kim &lt;dongwon.kim@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Junxiao Chang &lt;junxiao.chang@intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
