Age | Commit message (Collapse) | Author | Files | Lines |
|
Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.
Problem
=======
huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.
huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).
The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.
There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.
One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.
A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):
======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>
#define MSIZE (1UL << 30) /* 1GB */
#define PSIZE (2UL << 20) /* 2MB */
#define HOLD_SEC (1)
int pipefd[2];
void *buf;
void *do_map(int fd)
{
unsigned char *tmpbuf, *p;
int ret;
ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
if (ret) {
perror("posix_memalign() failed");
return NULL;
}
tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_FIXED, fd, 0);
if (tmpbuf == MAP_FAILED) {
perror("mmap() failed");
return NULL;
}
printf("mmap() -> %p\n", tmpbuf);
for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
*p = 1;
}
return tmpbuf;
}
void do_unmap(void *buf)
{
munmap(buf, MSIZE);
}
void proc2(int fd)
{
unsigned char c;
buf = do_map(fd);
if (!buf)
return;
read(pipefd[0], &c, 1);
/*
* This frees the shared pgtable page, causing use-after-free in
* proc1_thread1 when soft walking hugetlb pgtable.
*/
do_unmap(buf);
printf("Proc2 quitting\n");
}
void *proc1_thread1(void *data)
{
/*
* Trigger follow-page on 1st 2m page. Kernel hack patch needed to
* withhold this procedure for easier reproduce.
*/
madvise(buf, PSIZE, MADV_POPULATE_WRITE);
printf("Proc1-thread1 quitting\n");
return NULL;
}
void *proc1_thread2(void *data)
{
unsigned char c;
/* Wait a while until proc1_thread1() start to wait */
sleep(0.5);
/* Trigger pmd unshare */
madvise(buf, PSIZE, MADV_DONTNEED);
/* Kick off proc2 to release the pgtable */
write(pipefd[1], &c, 1);
printf("Proc1-thread2 quitting\n");
return NULL;
}
void proc1(int fd)
{
pthread_t tid1, tid2;
int ret;
buf = do_map(fd);
if (!buf)
return;
ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
assert(ret == 0);
ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
assert(ret == 0);
/* Kick the child to share the PUD entry */
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);
do_unmap(buf);
}
int main(void)
{
int fd, ret;
fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
if (fd < 0) {
perror("open failed");
return -1;
}
ret = ftruncate(fd, MSIZE);
if (ret) {
perror("ftruncate() failed");
return -1;
}
ret = pipe(pipefd);
if (ret) {
perror("pipe() failed");
return -1;
}
if (fork()) {
proc1(fd);
} else {
proc2(fd);
}
close(pipefd[0]);
close(pipefd[1]);
close(fd);
return 0;
}
======8<=======
The kernel patch needed to present such a race so it'll trigger 100%:
======8<=======
: diff --git a/mm/hugetlb.c b/mm/hugetlb.c
: index 9d97c9a2a15d..f8d99dad5004 100644
: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
: #include <asm/page.h>
: #include <asm/pgalloc.h>
: #include <asm/tlb.h>
: +#include <asm/delay.h>
:
: #include <linux/io.h>
: #include <linux/hugetlb.h>
: @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: bool unshare = false;
: int absent;
: struct page *page;
: + unsigned long c = 0;
:
: /*
: * If we have a pending SIGKILL, don't keep faulting pages and
: @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: */
: pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
: huge_page_size(h));
: +
: + pr_info("%s: withhold 1 sec...\n", __func__);
: + for (c = 0; c < 100; c++) {
: + udelay(10000);
: + }
: + pr_info("%s: withhold 1 sec...done\n", __func__);
: +
: if (pte)
: ptl = huge_pte_lock(h, mm, pte);
: absent = !pte || huge_pte_none(huge_ptep_get(pte));
: ======8<=======
It'll trigger use-after-free of the pgtable spinlock:
======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======
Resolution
==========
This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.
Patch Layout
============
Patch 1-2: cleanup, or dependency of the follow up patches
Patch 3: before fixing, document huge_pte_offset() on lock required
Patch 4-8: each patch resolves one possible race condition
Patch 9: introduce hugetlb_walk() to replace huge_pte_offset()
Tests
=====
The series is verified with the above reproducer so the race cannot
trigger anymore. It also passes all hugetlb kselftests.
This patch (of 9):
Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.
Make it return the real value of the start vaddr, and it also helps for
all the callers because whenever the retval is used, it'll be ultimately
added into the vma->vm_start anyway, so it's better.
Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The check for whether a hugetlb vma lock exists partially depends on the
vma's flags. Currently, it checks for either VM_MAYSHARE or VM_SHARED.
The reason both flags are used is because VM_MAYSHARE was previously
cleared in hugetlb vmas as they are tore down. This is no longer the
case, and only the VM_MAYSHARE check is required.
Link: https://lkml.kernel.org/r/20221212235042.178355-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Tests to verify MFD_NOEXEC, MFD_EXEC and vm.memfd_noexec sysctl.
Link: https://lkml.kernel.org/r/20221215001205.51969-6-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@google.com>
Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: kernel test robot <lkp@intel.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In order to avoid WX mappings, add F_SEAL_WRITE when apply F_SEAL_EXEC to
an executable memfd, so W^X from start.
This implys application need to fill the content of the memfd first, after
F_SEAL_EXEC is applied, application can no longer modify the content of
the memfd.
Typically, application seals the memfd right after writing to it.
For example:
1. memfd_create(MFD_EXEC).
2. write() code to the memfd.
3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X.
4. call exec() on the memfd.
Link: https://lkml.kernel.org/r/20221215001205.51969-5-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: kernel test robot <lkp@intel.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set
executable bit at creation time (memfd_create).
When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be
executable (mode: 0777) after creation.
when MFD_EXEC flag is set, memfd is created with executable bit
(mode:0777), this is the same as the old behavior of memfd_create.
The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_EXEC was set.
1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_NOEXEC_SEAL was set.
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
The sysctl allows finer control of memfd_create for old-software that
doesn't set the executable bit, for example, a container with
vm.memfd_noexec=1 means the old-software will create non-executable memfd
by default. Also, the value of memfd_noexec is passed to child namespace
at creation time. For example, if the init namespace has
vm.memfd_noexec=2, all its children namespaces will be created with 2.
[akpm@linux-foundation.org: add stub functions to fix build]
[akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
[akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review]
[akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning]
Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@google.com>
Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Basic tests to ensure that user/group/other execute bits cannot be changed
after applying F_SEAL_EXEC to a memfd.
Link: https://lkml.kernel.org/r/20221215001205.51969-3-jeffxu@google.com
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Co-developed-by: Jeff Xu <jeffxu@google.com>
Signed-off-by: Jeff Xu <jeffxu@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: kernel test robot <lkp@intel.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8.
Since Linux introduced the memfd feature, memfd have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting it
differently.
However, in a secure by default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass and
enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code and
root escalation. [2] lists more VRP in this kind.
On the other hand, executable memfd has its legit use, runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them, for such system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].
To address those above, this set of patches add following:
1> Let memfd_create() set X bit at creation time.
2> Let memfd to be sealed for modifying X bit.
3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
X bit.For example, if a container has vm.memfd_noexec=2, then
memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create(). This make it possible to a new
LSM, which rejects or allows executable memfd based on its security policy.
This patch (of 5):
The new F_SEAL_EXEC flag will prevent modification of the exec bits:
written as traditional octal mask, 0111, or as named flags, S_IXUSR |
S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify
any of these bits after the seal is applied will fail with errno EPERM.
This will preserve the execute bits as they are at the time of sealing, so
the memfd will become either permanently executable or permanently
un-executable.
Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com
Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Co-developed-by: Jeff Xu <jeffxu@google.com>
Signed-off-by: Jeff Xu <jeffxu@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.
The reasons I still think this patch is worthwhile, are:
(1) It is a cleanup already; diffstat tells.
(2) It just feels natural after I thought about this, if the pte is uffd
protected, let's remove the write bit no matter what it was.
(2) Since x86 is the only arch that supports uffd-wp, it also redefines
pte|pmd_mkuffd_wp() in that it should always contain removals of
write bits. It means any future arch that want to implement uffd-wp
should naturally follow this rule too. It's good to make it a
default, even if with vm_page_prot changes on VM_UFFD_WP.
(3) It covers more than vm_page_prot. So no chance of any potential
future "accident" (like pte_mkdirty() sparc64 or loongarch, even
though it just got its pte_mkdirty fixed <1 month ago). It'll be
fairly clear when reading the code too that we don't worry anything
before a pte_mkuffd_wp() on uncertainty of the write bit.
We may call pte_wrprotect() one more time in some paths (e.g. thp split),
but that should be fully local bitop instruction so the overhead should be
negligible.
Although this patch should logically also fix all the known issues on
uffd-wp too recently on page migration (not for numa hint recovery - that
may need another explcit pte_wrprotect), but this is not the plan for that
fix. So no fixes, and stable doesn't need this.
Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ives van Hoorne <ives@codesandbox.io>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
folio_set_compound_order() is moved to an mm-internal location so external
folio users cannot misuse this function. Change the name of the function
to folio_set_order() and use WARN_ON_ONCE() rather than BUG_ON. Also,
handle the case if a non-large folio is passed and add clarifying comments
to the function.
Link: https://lore.kernel.org/lkml/20221207223731.32784-1-sidhartha.kumar@oracle.com/T/
Link: https://lkml.kernel.org/r/20221215061757.223440-1-sidhartha.kumar@oracle.com
Fixes: 9fd330582b2f ("mm: add folio dtor and order setter functions")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Merge branch 'mm-hotfixes-stable' into mm-stable
|
|
This patch should harden commit 15520a3f0469 ("mm: use pte markers for
swap errors") on using pte markers for swapin errors on a few corner
cases.
1. Propagate swapin errors across fork()s: if there're swapin errors in
the parent mm, after fork()s the child should sigbus too when an error
page is accessed.
2. Fix a rare condition race in pte_marker_clear() where a uffd-wp pte
marker can be quickly switched to a swapin error.
3. Explicitly ignore swapin error pte markers in change_protection().
I mostly don't worry on (2) or (3) at all, but we should still have them.
Case (1) is special because it can potentially cause silent data corrupt
on child when parent has swapin error triggered with swapoff, but since
swapin error is rare itself already it's probably not easy to trigger
either.
Currently there is a priority difference between the uffd-wp bit and the
swapin error entry, in which the swapin error always has higher priority
(e.g. we don't need to wr-protect a swapin error pte marker).
If there will be a 3rd bit introduced, we'll probably need to consider a
more involved approach so we may need to start operate on the bits. Let's
leave that for later.
This patch is tested with case (1) explicitly where we'll get corrupted
data before in the child if there's existing swapin error pte markers, and
after patch applied the child can be rightfully killed.
We don't need to copy stable for this one since 15520a3f0469 just landed
as part of v6.2-rc1, only "Fixes" applied.
Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com
Fixes: 15520a3f0469 ("mm: use pte markers for swap errors")
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: Fixes on pte markers".
Patch 1 resolves the syzkiller report from Pengfei.
Patch 2 further harden pte markers when used with the recent swapin error
markers. The major case is we should persist a swapin error marker after
fork(), so child shouldn't read a corrupted page.
This patch (of 2):
When fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may
have it and has pte marker installed. The warning is improper along with
the comment. The right thing is to inherit the pte marker when needed, or
keep the dst pte empty.
A vague guess is this happened by an accident when there's the prior patch
to introduce src/dst vma into this helper during the uffd-wp feature got
developed and I probably messed up in the rebase, since if we replace
dst_vma with src_vma the warning & comment it all makes sense too.
Hugetlb did exactly the right here (copy_hugetlb_page_range()). Fix the
general path.
Reproducer:
https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c
Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808
Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com
Fixes: c56d1b62cce8 ("mm/shmem: handle uffd-wp during fork()")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org> # 5.19+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Merge branch 'master' into mm-hotfixes-stable
|
|
Merge branch 'master' into mm-stable
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure the poking PGD is pinned for Xen PV as it requires it this
way
- Fixes for two resctrl races when moving a task or creating a new
monitoring group
- Fix SEV-SNP guests running under HyperV where MTRRs are disabled to
not return a UC- type mapping type on memremap() and thus cause a
serious slowdown
- Fix insn mnemonics in bioscall.S now that binutils is starting to fix
confusing insn suffixes
* tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: fix poking_init() for Xen PV guests
x86/resctrl: Fix event counts regression in reused RMIDs
x86/resctrl: Fix task CLOSID/RMID update race
x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case
x86/boot: Avoid using Intel mnemonics in AT&T syntax asm
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fixes from Borislav Petkov:
- Fix the EDAC device's confusion in the polling setting units
- Fix a memory leak in highbank's probing function
* tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/highbank: Fix memory leak in highbank_mc_probe()
EDAC/device: Fix period calculation in edac_device_reset_delay_period()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a build failure with some versions of ld that have an odd version
string
- Fix incorrect use of mutex in the IMC PMU driver
Thanks to Kajol Jain, Michael Petlan, Ojaswin Mujoo, Peter Zijlstra, and
Yang Yingliang.
* tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/hash: Make stress_hpt_timer_fn() static
powerpc/imc-pmu: Fix use of mutex in IRQs disabled section
powerpc/boot: Fix incorrect version calculation issue in ld_version
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Core: Fix an iommu-group refcount leak
- Fix overflow issue in IOVA alloc path
- ARM-SMMU fixes from Will:
- Fix VFIO regression on NXP SoCs by reporting IOMMU_CAP_CACHE_COHERENCY
- Fix SMMU shutdown paths to avoid device unregistration race
- Error handling fix for Mediatek IOMMU driver
* tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe()
iommu/iova: Fix alloc iova overflows issue
iommu: Fix refcount leak in iommu_device_claim_dma_owner
iommu/arm-smmu-v3: Don't unregister on shutdown
iommu/arm-smmu: Don't unregister on shutdown
iommu/arm-smmu: Report IOMMU_CAP_CACHE_COHERENCY even betterer
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock fix from Mike Rapoport:
"memblock: always release pages to the buddy allocator in
memblock_free_late()
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the
deferred init process.
memblock_free_pages() is called by memblock_free_late(), which is used
to free reserved ranges after memblock_free_all() has run. All pages
in reserved ranges have been initialized at that point, and
accordingly, those pages are not touched by the deferred init process.
This means that currently, if the pages that memblock_free_late()
intends to release are in the deferred range, they will never be
released to the buddy allocator. They will forever be reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call
__free_pages_core() directly instead.
One case where this issue can occur in the wild is EFI boot on x86_64.
The x86 EFI code reserves all EFI boot services memory ranges via
memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively).
If any of those ranges happens to fall within the deferred init range,
the pages will not be released and that memory will be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages"
* tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
mm: Always release pages to the buddy allocator in memblock_free_late().
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kernel hardening fixes from Kees Cook:
- Fix CFI hash randomization with KASAN (Sami Tolvanen)
- Check size of coreboot table entry and use flex-array
* tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
kbuild: Fix CFI hash randomization with KASAN
firmware: coreboot: Check size of table entry and use flex-array
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module fix from Luis Chamberlain:
"Just one fix for modules by Nick"
* tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
kallsyms: Fix scheduling with interrupts disabled in self-test
|
|
Pull cifs fixes from Steve French:
- memory leak and double free fix
- two symlink fixes
- minor cleanup fix
- two smb1 fixes
* tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix uninitialized memory read for smb311 posix symlink create
cifs: fix potential memory leaks in session setup
cifs: do not query ifaces on smb1 mounts
cifs: fix double free on failed kerberos auth
cifs: remove redundant assignment to the variable match
cifs: fix file info setting in cifs_open_file()
cifs: fix file info setting in cifs_query_path_info()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two minor fixes in the hisi_sas driver which only impact enterprise
style multi-expander and shared disk situations and no core changes"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: hisi_sas: Set a port invalid only if there are no devices attached when refreshing port id
scsi: hisi_sas: Use abort task set to reset SAS disks when discovered
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA fix from Damien Le Moal:
"A single fix to prevent building the pata_cs5535 driver with user mode
linux as it uses msr operations that are not defined with UML"
* tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: pata_cs5535: Don't build on UML
|
|
Pull block fixes from Jens Axboe:
"Nothing major in here, just a collection of NVMe fixes and dropping a
wrong might_sleep() that static checkers tripped over but which isn't
valid"
* tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux:
MAINTAINERS: stop nvme matching for nvmem files
nvme: don't allow unprivileged passthrough on partitions
nvme: replace the "bool vec" arguments with flags in the ioctl path
nvme: remove __nvme_ioctl
nvme-pci: fix error handling in nvme_pci_enable()
nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers
nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression
block: Drop spurious might_sleep() from blk_put_queue()
|
|
Pull io_uring fixes from Jens Axboe:
"A fix for a regression that happened last week, rest is fixes that
will be headed to stable as well. In detail:
- Fix for a regression added with the leak fix from last week (me)
- In writing a test case for that leak, inadvertently discovered a
case where we a poll request can race. So fix that up and mark it
for stable, and also ensure that fdinfo covers both the poll tables
that we have. The latter was an oversight when the split poll table
were added (me)
- Fix for a lockdep reported issue with IOPOLL (Pavel)"
* tag 'io_uring-6.2-2023-01-13' of git://git.kernel.dk/linux:
io_uring: lock overflowing for IOPOLL
io_uring/poll: attempt request issue after racy poll wakeup
io_uring/fdinfo: include locked hash table in fdinfo output
io_uring/poll: add hash if ready poll request can't complete inline
io_uring/io-wq: only free worker if it was allocated for creation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci fixes from Bjorn Helgaas:
- Work around apparent firmware issue that made Linux reject MMCONFIG
space, which broke PCI extended config space (Bjorn Helgaas)
- Fix CONFIG_PCIE_BT1 dependency due to mid-air collision between a
PCI_MSI_IRQ_DOMAIN -> PCI_MSI change and addition of PCIE_BT1 (Lukas
Bulwahn)
* tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space
x86/pci: Simplify is_mmconf_reserved() messages
PCI: dwc: Adjust to recent removal of PCI_MSI_IRQ_DOMAIN
|
|
Clang emits a asan.module_ctor constructor to each object file
when KASAN is enabled, and these functions are indirectly called
in do_ctors. With CONFIG_CFI_CLANG, the compiler also emits a CFI
type hash before each address-taken global function so they can
pass indirect call checks.
However, in commit 0c3e806ec0f9 ("x86/cfi: Add boot time hash
randomization"), x86 implemented boot time hash randomization,
which relies on the .cfi_sites section generated by objtool. As
objtool is run against vmlinux.o instead of individual object
files with X86_KERNEL_IBT (enabled by default), CFI types in
object files that are not part of vmlinux.o end up not being
included in .cfi_sites, and thus won't get randomized and trip
CFI when called.
Only .vmlinux.export.o and init/version-timestamp.o are linked
into vmlinux separately from vmlinux.o. As these files don't
contain any functions, disable KASAN for both of them to avoid
breaking hash randomization.
Link: https://github.com/ClangBuiltLinux/linux/issues/1742
Fixes: 0c3e806ec0f9 ("x86/cfi: Add boot time hash randomization")
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230112224948.1479453-2-samitolvanen@google.com
|
|
The memcpy() of the data following a coreboot_table_entry couldn't
be evaluated by the compiler under CONFIG_FORTIFY_SOURCE. To make it
easier to reason about, add an explicit flexible array member to struct
coreboot_device so the entire entry can be copied at once. Additionally,
validate the sizes before copying. Avoids this run-time false positive
warning:
memcpy: detected field-spanning write (size 168) of single field "&device->entry" at drivers/firmware/google/coreboot_table.c:103 (size 8)
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/all/03ae2704-8c30-f9f0-215b-7cdf4ad35a9a@molgen.mpg.de/
Cc: Jack Rosenthal <jrosenth@chromium.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Julius Werner <jwerner@chromium.org>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Link: https://lore.kernel.org/r/20230107031406.gonna.761-kees@kernel.org
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Jack Rosenthal <jrosenth@chromium.org>
Link: https://lore.kernel.org/r/20230112230312.give.446-kees@kernel.org
|
|
kallsyms_on_each* may schedule so must not be called with interrupts
disabled. The iteration function could disable interrupts, but this
also changes lookup_symbol() to match the change to the other timing
code.
Reported-by: Erhard F. <erhard_f@mailbox.org>
Link: https://lore.kernel.org/all/bug-216902-206035@https.bugzilla.kernel.org%2F/
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202212251728.8d0872ff-oliver.sang@intel.com
Fixes: 30f3bb09778d ("kallsyms: Add self-test facility")
Tested-by: "Erhard F." <erhard_f@mailbox.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
This driver uses MSR functions that aren't implemented under UML.
Avoid building it to prevent tripping up allyesconfig.
e.g.
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x3a3): undefined reference to `__tracepoint_read_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x3d2): undefined reference to `__tracepoint_write_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x457): undefined reference to `__tracepoint_write_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x481): undefined reference to `do_trace_write_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x4d5): undefined reference to `do_trace_write_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x4f5): undefined reference to `do_trace_read_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x51c): undefined reference to `do_trace_write_msr'
Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix the PMCR_EL0 reset value after the PMU rework
- Correctly handle S2 fault triggered by a S1 page table walk by not
always classifying it as a write, as this breaks on R/O memslots
- Document why we cannot exit with KVM_EXIT_MMIO when taking a write
fault from a S1 PTW on a R/O memslot
- Put the Apple M2 on the naughty list for not being able to
correctly implement the vgic SEIS feature, just like the M1 before
it
- Reviewer updates: Alex is stepping down, replaced by Zenghui
x86:
- Fix various rare locking issues in Xen emulation and teach lockdep
to detect them
- Documentation improvements
- Do not return host topology information from KVM_GET_SUPPORTED_CPUID"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lock
KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule
KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()
KVM: x86/xen: Fix lockdep warning on "recursive" gpc locking
Documentation: kvm: fix SRCU locking order docs
KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID
KVM: nSVM: clarify recalc_intercepts() wrt CR8
MAINTAINERS: Remove myself as a KVM/arm64 reviewer
MAINTAINERS: Add Zenghui Yu as a KVM/arm64 reviewer
KVM: arm64: vgic: Add Apple M2 cpus to the list of broken SEIS implementations
KVM: arm64: Convert FSC_* over to ESR_ELx_FSC_*
KVM: arm64: Document the behaviour of S1PTW faults on RO memslots
KVM: arm64: Fix S1PTW handling on RO memslots
KVM: arm64: PMU: Fix PMCR_EL0 reset value
|
|
On the x86-64 architecture even a failing cmpxchg grants exclusive
access to the cacheline, making it preferable to retry the failed op
immediately instead of stalling with the pause instruction.
To illustrate the impact, below are benchmark results obtained by
running various will-it-scale tests on top of the 6.2-rc3 kernel and
Cascade Lake (2 sockets * 24 cores * 2 threads) CPU.
All results in ops/s. Note there is some variance in re-runs, but the
code is consistently faster when contention is present.
open3 ("Same file open/close"):
proc stock no-pause
1 805603 814942 (+%1)
2 1054980 1054781 (-0%)
8 1544802 1822858 (+18%)
24 1191064 2199665 (+84%)
48 851582 1469860 (+72%)
96 609481 1427170 (+134%)
fstat2 ("Same file fstat"):
proc stock no-pause
1 3013872 3047636 (+1%)
2 4284687 4400421 (+2%)
8 3257721 5530156 (+69%)
24 2239819 5466127 (+144%)
48 1701072 5256609 (+209%)
96 1269157 6649326 (+423%)
Additionally, a kernel with a private patch to help access() scalability:
access2 ("Same file access"):
proc stock patched patched
+nopause
24 2378041 2005501 5370335 (-15% / +125%)
That is, fixing the problems in access itself *reduces* scalability
after the cacheline ping-pong only happens in lockref with the pause
instruction.
Note that fstat and access benchmarks are not currently integrated into
will-it-scale, but interested parties can find them in pull requests to
said project.
Code at hand has a rather tortured history. First modification showed
up in commit d472d9d98b46 ("lockref: Relax in cmpxchg loop"), written
with Itanium in mind. Later it got patched up to use an arch-dependent
macro to stop doing it on s390 where it caused a significant regression.
Said macro had undergone revisions and was ultimately eliminated later,
going back to cpu_relax.
While I intended to only remove cpu_relax for x86-64, I got the
following comment from Linus:
I would actually prefer just removing it entirely and see if
somebody else hollers. You have the numbers to prove it hurts on
real hardware, and I don't think we have any numbers to the
contrary.
So I think it's better to trust the numbers and remove it as a
failure, than say "let's just remove it on x86-64 and leave
everybody else with the potentially broken code"
Additionally, Will Deacon (maintainer of the arm64 port, one of the
architectures previously benchmarked):
So, from the arm64 side of the fence, I'm perfectly happy just
removing the cpu_relax() calls from lockref.
As such, come back full circle in history and whack it altogether.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com/
Acked-by: Tony Luck <tony.luck@intel.com> # ia64
Acked-by: Nicholas Piggin <npiggin@gmail.com> # powerpc
Acked-by: Will Deacon <will@kernel.org> # arm64
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Normally we reject ECAM space unless it is reported as reserved in the E820
table or via a PNP0C02 _CRS method (PCI Firmware, r3.3, sec 4.1.2).
07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map"), removes
E820 entries that correspond to EfiMemoryMappedIO regions because some
other firmware uses EfiMemoryMappedIO for PCI host bridge windows, and the
E820 entries prevent Linux from allocating BAR space for hot-added devices.
Some firmware doesn't report ECAM space via PNP0C02 _CRS methods, but does
mention it as an EfiMemoryMappedIO region via EFI GetMemoryMap(), which is
normally converted to an E820 entry by a bootloader or EFI stub. After
07eab0901ede, that E820 entry is removed, so we reject this ECAM space,
which makes PCI extended config space (offsets 0x100-0xfff) inaccessible.
The lack of extended config space breaks anything that relies on it,
including perf, VSEC telemetry, EDAC, QAT, SR-IOV, etc.
Allow use of ECAM for extended config space when the region is covered by
an EfiMemoryMappedIO region, even if it's not included in E820 or PNP0C02
_CRS.
Link: https://lore.kernel.org/r/ac2693d8-8ba3-72e0-5b66-b3ae008d539d@linux.intel.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216891
Fixes: 07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map")
Link: https://lore.kernel.org/r/20230110180243.1590045-3-helgaas@kernel.org
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Reported-by: Tony Luck <tony.luck@intel.com>
Reported-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reported-by: Yunying Sun <yunying.sun@intel.com>
Reported-by: Baowen Zheng <baowen.zheng@corigine.com>
Reported-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reported-by: Yang Lixiao <lixiao.yang@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yunying Sun <yunying.sun@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- avoid a potential crash on the efi_subsys_init() error path
- use more appropriate error code for runtime services calls issued
after a crash in the firmware occurred
- avoid READ_ONCE() for accessing firmware tables that may appear
misaligned in memory
* tag 'efi-fixes-for-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: tpm: Avoid READ_ONCE() for accessing the event log
efi: rt-wrapper: Add missing include
efi: fix userspace infinite retry read efivars after EFI runtime services page fault
efi: fix NULL-deref in init error path
|
|
Pull documentation fixes from Jonathan Corbet:
"Three documentation fixes (or rather two and one warning):
- Sphinx 6.0 broke our configuration mechanism, so fix it
- I broke our configuration for non-Alabaster themes; Akira fixed it
- Deprecate Sphinx < 2.4 with an eye toward future removal"
* tag 'docs-6.2-fixes' of git://git.lwn.net/linux:
docs/conf.py: Use about.html only in sidebar of alabaster theme
docs: Deprecate use of Sphinx < 2.4.x
docs: Fix the docs build with Sphinx 6.0
|
|
Nathan reports that recent kernels built with LTO will crash when doing
EFI boot using Fedora's GRUB and SHIM. The culprit turns out to be a
misaligned load from the TPM event log, which is annotated with
READ_ONCE(), and under LTO, this gets translated into a LDAR instruction
which does not tolerate misaligned accesses.
Interestingly, this does not happen when booting the same kernel
straight from the UEFI shell, and so the fact that the event log may
appear misaligned in memory may be caused by a bug in GRUB or SHIM.
However, using READ_ONCE() to access firmware tables is slightly unusual
in any case, and here, we only need to ensure that 'event' is not
dereferenced again after it gets unmapped, but this is already taken
care of by the implicit barrier() semantics of the early_memunmap()
call.
Cc: <stable@vger.kernel.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1782
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
syzbot reports an issue with overflow filling for IOPOLL:
WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
io_fill_cqe_req io_uring/io_uring.h:168 [inline]
io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
Reported-and-tested-by: syzbot+6805087452d72929404e@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This became a slightly big update, but it's more or less expected, as
the first batch after holidays.
All changes (but for the last two last-minute fixes) have been stewed
in linux-next long enough, so it's fairly safe to take:
- PCM UAF fix in 32bit compat layer
- ASoC board-specific fixes for Intel, AMD, Medathek, Qualcomm
- SOF power management fixes
- ASoC Intel link failure fixes
- A series of fixes for USB-audio regressions
- CS35L41 HD-audio codec regression fixes
- HD-audio device-specific fixes / quirks
Note that one SPI patch has been taken in ASoC subtree mistakenly, and
the same fix is found in spi tree, but it should be OK to apply"
* tag 'sound-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (39 commits)
ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF
ALSA: usb-audio: Fix possible NULL pointer dereference in snd_usb_pcm_has_fixed_rate()
ALSA: hda/realtek: Enable mute/micmute LEDs on HP Spectre x360 13-aw0xxx
ASoC: fsl-asoc-card: Fix naming of AC'97 CODEC widgets
ASoC: fsl_ssi: Rename AC'97 streams to avoid collisions with AC'97 CODEC
ALSA: hda/hdmi: Add a HP device 0x8715 to force connect list
ALSA: control-led: use strscpy in set_led_id()
ALSA: usb-audio: Always initialize fixed_rate in snd_usb_find_implicit_fb_sync_format()
ASoC: dt-bindings: qcom,lpass-tx-macro: correct clocks on SC7280
ASoC: dt-bindings: qcom,lpass-wsa-macro: correct clocks on SM8250
ASoC: qcom: Fix building APQ8016 machine driver without SOUNDWIRE
ALSA: hda: cs35l41: Check runtime suspend capability at runtime_idle
ALSA: hda: cs35l41: Don't return -EINVAL from system suspend/resume
ASoC: fsl_micfil: Correct the number of steps on SX controls
ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform
Revert "ALSA: usb-audio: Drop superfluous interface setup at parsing"
ALSA: usb-audio: More refactoring of hw constraint rules
ALSA: usb-audio: Relax hw constraints for implicit fb sync
ALSA: usb-audio: Make sure to stop endpoints before closing EPs
ALSA: hda - Enable headset mic on another Dell laptop with ALC3254
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix assorted issues in the ARM cpufreq drivers and in the AMD
P-state driver.
Specifics:
- Fix cpufreq policy reference counting in amd-pstate to prevent it
from crashing on removal (Perry Yuan)
- Fix double initialization and set suspend-freq for Apple's cpufreq
driver (Arnd Bergmann, Hector Martin)
- Fix reading of "reg" property, update cpufreq-dt's blocklist and
update DT documentation for Qualcomm's cpufreq driver (Konrad
Dybcio, Krzysztof Kozlowski)
- Replace 0 with NULL in the Armada cpufreq driver (Miles Chen)
- Fix potential overflows in the CPPC cpufreq driver (Pierre Gondois)
- Update blocklist for the Tegra234 Soc cpufreq driver (Sumit Gupta)"
* tag 'pm-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: amd-pstate: fix kernel hang issue while amd-pstate unregistering
cpufreq: armada-37xx: stop using 0 as NULL pointer
cpufreq: apple-soc: Switch to the lowest frequency on suspend
dt-bindings: cpufreq: cpufreq-qcom-hw: document interrupts
cpufreq: Add SM6375 to cpufreq-dt-platdev blocklist
cpufreq: Add Tegra234 to cpufreq-dt-platdev blocklist
cpufreq: qcom-hw: Fix reading "reg" with address/size-cells != 2
cpufreq: CPPC: Add u64 casts to avoid overflowing
cpufreq: apple: remove duplicate intializer
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These add one more ACPI IRQ override quirk, improve ACPI companion
lookup for backlight devices and add missing kernel command line
option values for backlight detection.
Specifics:
- Improve ACPI companion lookup for backlight devices in the cases
when there is more than one candidate ACPI device object (Hans de
Goede)
- Add missing support for manual selection of NVidia-WMI-EC or Apple
GMUX backlight in the kernel command line to the ACPI backlight
driver (Hans de Goede)
- Skip ACPI IRQ override on Asus Expertbook B2402CBA (Tamim Khan)"
* tag 'acpi-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: Fix selecting wrong ACPI fwnode for the iGPU on some Dell laptops
ACPI: video: Allow selecting NVidia-WMI-EC or Apple GMUX backlight from the cmdline
ACPI: resource: Skip IRQ override on Asus Expertbook B2402CBA
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
"A set of assorted fixes and hardware-id additions"
* tag 'platform-drivers-x86-v6.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: thinkpad_acpi: Fix profile mode display in AMT mode
platform/x86: int3472/discrete: Ensure the clk/power enable pins are in output mode
platform/x86/amd: Fix refcount leak in amd_pmc_probe
platform/x86: intel/pmc/core: Add Meteor Lake mobile support
platform/x86: simatic-ipc: add another model
platform/x86: simatic-ipc: correct name of a model
platform/x86: dell-privacy: Only register SW_CAMERA_LENS_COVER if present
platform/x86: dell-privacy: Fix SW_CAMERA_LENS_COVER reporting
platform/x86: asus-wmi: Don't load fan curves without fan
platform/x86: asus-wmi: Ignore fan on E410MA
platform/x86: asus-wmi: Add quirk wmi_ignore_fan
platform/x86: asus-nb-wmi: Add alternate mapping for KEY_SCREENLOCK
platform/x86: asus-nb-wmi: Add alternate mapping for KEY_CAMERA
platform/surface: aggregator: Add missing call to ssam_request_sync_free()
platform/surface: aggregator: Ignore command messages not intended for us
platform/x86: touchscreen_dmi: Add info for the CSL Panther Tab HD
platform/x86: ideapad-laptop: Add Legion 5 15ARH05 DMI id to set_fn_lock_led_list[]
platform/x86: sony-laptop: Don't turn off 0x153 keyboard backlight during probe
|
|
Pull drm fixes from Dave Airlie:
"There is a bit of a post-holiday build up here I expect, small fixes
across the board, amdgpu and msm being the main leaders, with others
having a few. One code removal patch for nouveau:
buddy:
- benchmark regression fix for top-down buddy allocation
panel:
- add Lenovo panel orientation quirk
ttm:
- fix kernel oops regression
amdgpu:
- fix missing fence references
- fix missing pipeline sync fencing
- SMU13 fan speed fix
- SMU13 fix power cap handling
- SMU13 BACO fix
- Fix a possible segfault in bo validation error case
- Delay removal of firmware framebuffer
- Fix error when unloading
amdkfd:
- SVM fix when clearing vram
- GC11 fix for multi-GPU
i915:
- Reserve enough fence slot for i915_vma_unbind_vsync
- Fix potential use after free
- Reset engines twice in case of reset failure
- Use multi-cast registers for SVG Unit registers
msm:
- display:
- doc warning fixes
- dt attribs cleanups
- memory leak fix
- error handing in hdmi probe fix
- dp_aux_isr incorrect signalling fix
- shutdown path fix
- accel:
- a5xx: fix quirks to be a bitmask
- a6xx: fix gx halt to avoid 1s hang
- kexec shutdown fix
- fix potential double free
vmwgfx:
- drop rcu usage to make code more robust
virtio:
- fix use-after-free in gem handle code
nouveau:
- drop unused nouveau_fbcon.c"
* tag 'drm-fixes-2023-01-13' of git://anongit.freedesktop.org/drm/drm: (35 commits)
drm: Optimize drm buddy top-down allocation method
drm/ttm: Fix a regression causing kernel oops'es
drm/i915/gt: Cover rest of SVG unit MCR registers
drm/nouveau: Remove file nouveau_fbcon.c
drm/amdkfd: Fix NULL pointer error for GC 11.0.1 on mGPU
drm/amd/pm/smu13: BACO is supported when it's in BACO state
drm/amdkfd: Add sync after creating vram bo
drm/i915/gt: Reset twice
drm/amdgpu: fix pipeline sync v2
drm/vmwgfx: Remove rcu locks from user resources
drm/virtio: Fix GEM handle creation UAF
drm/amdgpu: Fixed bug on error when unloading amdgpu
drm/amd: Delay removal of the firmware framebuffer
drm/amdgpu: Fix potential NULL dereference
drm/i915: Fix potential context UAFs
drm/i915: Reserve enough fence slot for i915_vma_unbind_async
drm: Add orientation quirk for Lenovo ideapad D330-10IGL
drm/msm/a6xx: Avoid gx gbit halt during rpm suspend
drm/msm/adreno: Make adreno quirks not overwrite each other
drm/msm: another fix for the headless Adreno GPU
...
|
|
Takes rwsem lock inside snd_ctl_elem_read instead of snd_ctl_elem_read_user
like it was done for write in commit 1fa4445f9adf1 ("ALSA: control - introduce
snd_ctl_notify_one() helper"). Doing this way we are also fixing the following
locking issue happening in the compat path which can be easily triggered and
turned into an use-after-free.
64-bits:
snd_ctl_ioctl
snd_ctl_elem_read_user
[takes controls_rwsem]
snd_ctl_elem_read [lock properly held, all good]
[drops controls_rwsem]
32-bits:
snd_ctl_ioctl_compat
snd_ctl_elem_write_read_compat
ctl_elem_write_read
snd_ctl_elem_read [missing lock, not good]
CVE-2023-0266 was assigned for this issue.
Cc: stable@kernel.org # 5.13+
Signed-off-by: Clement Lecigne <clecigne@google.com>
Reviewed-by: Jaroslav Kysela <perex@perex.cz>
Link: https://lore.kernel.org/r/20230113120745.25464-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Here's a sizeable batch of Friday the 13th arm64 fixes for -rc4. What
could possibly go wrong?
The obvious reason we have so much here is because of the holiday
season right after the merge window, but we've also brought back an
erratum workaround that was previously dropped at the last minute and
there's an MTE coredumping fix that strays outside of the arch/arm64
directory.
Summary:
- Fix PAGE_TABLE_CHECK failures on hugepage splitting path
- Fix PSCI encoding of MEM_PROTECT_RANGE function in UAPI header
- Fix NULL deref when accessing debugfs node if PSCI is not present
- Fix MTE core dumping when VMA list is being updated concurrently
- Fix SME signal frame handling when SVE is not implemented by the
CPU
- Fix asm constraints for cmpxchg_double() to hazard both words
- Fix build failure with stack tracer and older versions of Clang
- Bring back workaround for Cortex-A715 erratum 2645198"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Fix build with CC=clang, CONFIG_FTRACE=y and CONFIG_STACK_TRACER=y
arm64/mm: Define dummy pud_user_exec() when using 2-level page-table
arm64: errata: Workaround possible Cortex-A715 [ESR|FAR]_ELx corruption
firmware/psci: Don't register with debugfs if PSCI isn't available
firmware/psci: Fix MEM_PROTECT_RANGE function numbers
arm64/signal: Always allocate SVE signal frames on SME only systems
arm64/signal: Always accept SVE signal frames on SME only systems
arm64/sme: Fix context switch for SME only systems
arm64: cmpxchg_double*: hazard against entire exchange variable
arm64/uprobes: change the uprobe_opcode_t typedef to fix the sparse warning
arm64: mte: Avoid the racy walk of the vma list during core dump
elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}
arm64: mte: Fix double-freeing of the temporary tag storage during coredump
arm64: ptrace: Use ARM64_SME to guard the SME register enumerations
arm64/mm: add pud_user_exec() check in pud_user_accessible_page()
arm64/mm: fix incorrect file_map_count for invalid pmd
|
|
A clk, prepared and enabled in mtk_iommu_v1_hw_init(), is not released in
the error handling path of mtk_iommu_v1_probe().
Add the corresponding clk_disable_unprepare(), as already done in the
remove function.
Fixes: b17336c55d89 ("iommu/mediatek: add support for mtk iommu generation one HW")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lore.kernel.org/r/593e7b7d97c6e064b29716b091a9d4fd122241fb.1671473163.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
In __alloc_and_insert_iova_range, there is an issue that retry_pfn
overflows. The value of iovad->anchor.pfn_hi is ~0UL, then when
iovad->cached_node is iovad->anchor, curr_iova->pfn_hi + 1 will
overflow. As a result, if the retry logic is executed, low_pfn is
updated to 0, and then new_pfn < low_pfn returns false to make the
allocation successful.
This issue occurs in the following two situations:
1. The first iova size exceeds the domain size. When initializing
iova domain, iovad->cached_node is assigned as iovad->anchor. For
example, the iova domain size is 10M, start_pfn is 0x1_F000_0000,
and the iova size allocated for the first time is 11M. The
following is the log information, new->pfn_lo is smaller than
iovad->cached_node.
Example log as follows:
[ 223.798112][T1705487] sh: [name:iova&]__alloc_and_insert_iova_range
start_pfn:0x1f0000,retry_pfn:0x0,size:0xb00,limit_pfn:0x1f0a00
[ 223.799590][T1705487] sh: [name:iova&]__alloc_and_insert_iova_range
success start_pfn:0x1f0000,new->pfn_lo:0x1efe00,new->pfn_hi:0x1f08ff
2. The node with the largest iova->pfn_lo value in the iova domain
is deleted, iovad->cached_node will be updated to iovad->anchor,
and then the alloc iova size exceeds the maximum iova size that can
be allocated in the domain.
After judging that retry_pfn is less than limit_pfn, call retry_pfn+1
to fix the overflow issue.
Signed-off-by: jianjiao zeng <jianjiao.zeng@mediatek.com>
Signed-off-by: Yunfei Wang <yf.wang@mediatek.com>
Cc: <stable@vger.kernel.org> # 5.15.*
Fixes: 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails")
Acked-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20230111063801.25107-1-yf.wang@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
iommu_group_get() returns the group with the reference incremented.
Move iommu_group_get() after owner check to fix the refcount leak.
Fixes: 89395ccedbc1 ("iommu: Add device-centric DMA ownership interfaces")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221230083100.1489569-1-linmq006@gmail.com
[ joro: Remove *group = NULL initialization ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Similar to SMMUv2, this driver calls iommu_device_unregister() from the
shutdown path, which removes the IOMMU groups with no coordination
whatsoever with their users - shutdown methods are optional in device
drivers. This can lead to NULL pointer dereferences in those drivers'
DMA API calls, or worse.
Instead of calling the full arm_smmu_device_remove() from
arm_smmu_device_shutdown(), let's pick only the relevant function call -
arm_smmu_device_disable() - more or less the reverse of
arm_smmu_device_reset() - and call just that from the shutdown path.
Fixes: 57365a04c921 ("iommu: Move bus setup to IOMMU device registration")
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20221215141251.3688780-2-vladimir.oltean@nxp.com
Signed-off-by: Will Deacon <will@kernel.org>
|