Age | Commit message (Collapse) | Author | Files | Lines |
|
Friedrich Weber reported a kernel crash problem and bisected to commit
81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine").
The root cause is that we use "list_move_tail(&rq->queuelist, pending)"
in the PREFLUSH/POSTFLUSH sequences. But rq->queuelist.next == xxx since
it's popped out from plug->cached_rq in __blk_mq_alloc_requests_batch().
We don't initialize its queuelist just for this first request, although
the queuelist of all later popped requests will be initialized.
Fix it by changing to use "list_add_tail(&rq->queuelist, pending)" so
rq->queuelist doesn't need to be initialized. It should be ok since rq
can't be on any list when PREFLUSH or POSTFLUSH, has no move actually.
Please note the commit 81ada09cc25e ("blk-flush: reuse rq queuelist in
flush state machine") also has another requirement that no drivers would
touch rq->queuelist after blk_mq_end_request() since we will reuse it to
add rq to the post-flush pending list in POSTFLUSH. If this is not true,
we will have to revert that commit IMHO.
This updated version adds "list_del_init(&rq->queuelist)" in flush rq
callback since the dm layer may submit request of a weird invalid format
(REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH), which causes double list_add
if without this "list_del_init(&rq->queuelist)". The weird invalid format
problem should be fixed in dm layer.
Reported-by: Friedrich Weber <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Closes: https://lore.kernel.org/lkml/[email protected]/
Fixes: 81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine")
Cc: Christoph Hellwig <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Friedrich Weber <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
For zoned block devices using zone write plugging, an rcu_barrier() call
is needed in disk_free_zone_resources() to synchronize freeing of zone
write plugs and the destrution of the mempool used to allocate the
plugs. The barrier call does slow down a little teardown of zoned block
devices but should not affect teardown of regular block devices or zoned
block devices that do not use zone write plugging (e.g. zoned DM devices
that do not require zone append emulation).
Modify disk_free_zone_resources() to return early if we do not have a
mempool to start with, that is, if the device does not use zone write
plugging. This avoids the costly rcu_barrier() and speeds up disk
teardown.
Reported-by: Mikulas Patocka <[email protected]>
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Tested-by: Mikulas Patocka <[email protected]>
Reviewed-by: Niklas Cassel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Clang static checker (scan-build) warning:
block/sed-opal.c:line 317, column 3
Value stored to 'ret' is never read.
Fix this problem by returning the error code when keyring_search() failed.
Otherwise, 'key' will have a wrong value when 'kerf' stores the error code.
Fixes: 3bfeb6125664 ("block: sed-opal: keyring support for SED keys")
Signed-off-by: Su Hui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull NVMe fixes from Keith:
"nvme fixes Linux 6.10
- Use reserved tags for special fabrics operations (Chunguang)
- Persistent Reservation status masking fix (Weiwen)"
* tag 'nvme-6.10-2024-06-05' of git://git.infradead.org/nvme:
nvme: fix nvme_pr_* status code parsing
nvme-fabrics: use reserved tag for reg read/write command
|
|
Block size should be between 512 and PAGE_SIZE and be a power of 2. The current
check does not validate this, so update the check.
Without this patch, null_blk would Oops due to a null pointer deref when
loaded with bs=1536 [1].
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Andreas Hindborg <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: remove unnecessary braces and != 0 check]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Fix the parsing if extra status bits (e.g. MORE) is present.
Fixes: 7fb42780d06c ("nvme: Convert NVMe errors to PR errors")
Signed-off-by: Weiwen Hu <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
In some scenarios, if too many commands are issued by nvme command in
the same time by user tasks, this may exhaust all tags of admin_q. If
a reset (nvme reset or IO timeout) occurs before these commands finish,
reconnect routine may fail to update nvme regs due to insufficient tags,
which will cause kernel hang forever. In order to workaround this issue,
maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved
tags. This maybe safe for nvmf:
1. For the disable ctrl path, we will not issue connect command
2. For the enable ctrl / fw activate path, since connect and reg_xx()
are called serially.
So the reserved tags may still be enough while reg_xx() use reserved tags.
Signed-off-by: Chunguang Xu <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Commit bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
attempts to revert the code change introduced by commit cd5ab1b0fcb4
("blk-throttle: add .low interface"). However, it leaves behind the
bps_conf[] and iops_conf[] fields in the throtl_grp structure which
aren't set anywhere in the new blk-throttle.c code but are still being
used by tg_prfill_limit() to display the limits in io.max. Now io.max
always displays the following values if a block queue is used:
<m>:<n> rbps=0 wbps=0 riops=0 wiops=0
Fix this problem by removing bps_conf[] and iops_conf[] and use bps[]
and iops[] instead to complete the revert.
Fixes: bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Reported-by: Justin Forbes <[email protected]>
Closes: https://github.com/containers/podman/issues/22701#issuecomment-2120627789
Signed-off-by: Waiman Long <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
A zoned device may have a last sequential write required zone that is
smaller than other zones. However, all tests to check if a zone write
plug write offset exceeds the zone capacity use the same capacity
value stored in the gendisk zone_capacity field. This is incorrect for a
zoned device with a last runt (smaller) zone.
Add the new field last_zone_capacity to struct gendisk to store the
capacity of the last zone of the device. blk_revalidate_seq_zone() and
blk_revalidate_conv_zone() are both modified to get this value when
disk_zone_is_last() returns true. Similarly to zone_capacity, the value
is first stored using the last_zone_capacity field of struct
blk_revalidate_zone_args. Once zone revalidation of all zones is done,
this is used to set the gendisk last_zone_capacity field.
The checks to determine if a zone is full or if a sector offset in a
zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
and blk_zone_wplug_prepare_bio() are modified to use the new helper
functions disk_zone_is_full() and disk_zone_wplug_is_full().
disk_zone_is_full() uses the zone index to determine if the zone being
tested is the last one of the disk and uses the either the disk
zone_capacity or last_zone_capacity accordingly.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Niklas Cassel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit ecfe43b11b02 ("block: Remember zone capacity when revalidating
zones") introduced checks to ensure that the capacity of the zones of
a zoned device is constant for all zones. However, this check ignores
the possibility that a zoned device has a smaller last zone with a size
not equal to the capacity of other zones. Such device correspond in
practice to an SMR drive with a smaller last zone and all zones with a
capacity equal to the zone size, leading to the last zone capacity being
different than the capacity of other zones.
Correctly handle such device by fixing the check for the constant zone
capacity in blk_revalidate_seq_zone() using the new helper function
disk_zone_is_last(). This helper function is also used in
blk_revalidate_zone_cb() when checking the zone size.
Fixes: ecfe43b11b02 ("block: Remember zone capacity when revalidating zones")
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Niklas Cassel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
A zoned device with a smaller last zone together with a zone capacity
smaller than the zone size does make any sense as that does not
correspond to any possible setup for a real device:
1) For ZNS and zoned UFS devices, all zones are always the same size.
2) For SMR HDDs, all zones always have the same capacity.
In other words, if we have a smaller last runt zone, then this zone
capacity should always be equal to the zone size.
Add a check in null_init_zoned_dev() to prevent a configuration to have
both a smaller zone size and a zone capacity smaller than the zone size.
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Niklas Cassel <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull NVMe fixes from Keith:
"nvme fixes for Linux 6.10
- Removing unused fields (Kanchan)
- Large folio offsets support (Kundan)
- Multipath NUMA node initialiazation fix (Nilay)
- Multipath IO stats accounting fixes (Keith)
- Circular lockdep fix (Keith)
- Target race condition fix (Sagi)
- Target memory leak fix (Sagi)"
* tag 'nvme-6.10-2024-05-29' of git://git.infradead.org/nvme:
nvmet: fix a possible leak when destroy a ctrl during qp establishment
nvme: use srcu for iterating namespace list
nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset
nvme: remove sgs and sws
nvmet: fix ns enable/disable possible hang
nvme-multipath: fix io accounting on failover
nvme: fix multipath batched completion accounting
nvme-multipath: find NUMA path only for online numa-node
|
|
In nvmet_sq_destroy we capture sq->ctrl early and if it is non-NULL we
know that a ctrl was allocated (in the admin connect request handler)
and we need to release pending AERs, clear ctrl->sqs and sq->ctrl
(for nvme-loop primarily), and drop the final reference on the ctrl.
However, a small window is possible where nvmet_sq_destroy starts (as
a result of the client giving up and disconnecting) concurrently with
the nvme admin connect cmd (which may be in an early stage). But *before*
kill_and_confirm of sq->ref (i.e. the admin connect managed to get an sq
live reference). In this case, sq->ctrl was allocated however after it was
captured in a local variable in nvmet_sq_destroy.
This prevented the final reference drop on the ctrl.
Solve this by re-capturing the sq->ctrl after all inflight request has
completed, where for sure sq->ctrl reference is final, and move forward
based on that.
This issue was observed in an environment with many hosts connecting
multiple ctrls simoutanuosly, creating a delay in allocating a ctrl
leading up to this race window.
Reported-by: Alex Turin <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
The nvme pci driver synchronizes with all the namespace queues during a
reset to ensure that there's no pending timeout work.
Meanwhile the timeout work potentially iterates those same namespaces to
freeze their queues.
Each of those namespace iterations use the same read lock. If a write
lock should somehow get between the synchronize and freeze steps, then
forward progress is deadlocked.
We had been relying on the nvme controller state machine to ensure the
reset work wouldn't conflict with timeout work. That guarantee may be a
bit fragile to rely on, so iterate the namespace lists without taking
potentially circular locks, as reported by lockdep.
Link: https://lore.kernel.org/all/20220930001943.zdbvolc3gkekfmcv@shindev/
Reported-by: Shinichiro Kawasaki <[email protected]>
Tested-by: Shinichiro Kawasaki <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
In __bch_bucket_alloc_set() the lines after lable 'err:' indeed do
nothing useful after multiple cache devices are removed from bcache
code. This cleanup patch drops the useless code to save a bit CPU
cycles.
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
If there are extreme heavy write I/O continuously hit on relative small
cache device (512GB in my testing), it is possible to make counter
c->gc_stats.in_use continue to increase and exceed CUTOFF_CACHE_ADD.
If 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, all following write
requests will bypass the cache device because check_should_bypass()
returns 'true'. Because all writes bypass the cache device, counter
c->sectors_to_gc has no chance to be negative value, and garbage
collection thread won't be waken up even the whole cache becomes clean
after writeback accomplished. The aftermath is that all write I/Os go
directly into backing device even the cache device is clean.
To avoid the above situation, this patch uses a quite conservative way
to fix: if 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, only wakes
up garbage collection thread when the whole cache device is clean.
Before the fix, the writes-always-bypass situation happens after 10+
hours write I/O pressure on 512GB Intel optane memory which acts as
cache device. After this fix, such situation doesn't happen after 36+
hours testing.
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Currently, if the gc is running, when the allocator found free_inc
is empty, allocator has to wait the gc finish. Before that, the
IO is blocked.
But actually, there would be some buckets is reclaimable before gc,
and gc will never mark this kind of bucket to be unreclaimable.
So we can put these buckets into free_inc in gc running to avoid
IO being blocked.
Signed-off-by: Dongsheng Yang <[email protected]>
Signed-off-by: Mingzhe Zou <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
The logical block size need to be smaller than the max_hw_sector
setting, otherwise we can't even transfer a single LBA.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Garry <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
The max_user_sectors is one of the three factors determining the actual
max_sectors limit for READ/WRITE requests. Because of that it needs to
be stacked at least for the device mapper multi-path case where requests
are directly inserted on the lower device. For SCSI disks this is
important because the sd driver actually sets it's own advisory limit
that is lower than max_hw_sectors based on the block limits VPD page.
While this is a bit odd an unusual, the same effect can happen if a
user or udev script tweaks the value manually.
Fixes: 4f563a64732d ("block: add a max_user_discard_sectors queue limit")
Reported-by: Mike Snitzer <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
sd can set a max_sectors value that is lower than the max_hw_sectors
limit based on the block limits VPD page. While this is rather unusual,
it used to work until the max_user_sectors field was split out to cleanly
deal with conflicting hardware and user limits when the hardware limit
changes. Also set max_user_sectors to ensure the limit can properly be
stacked.
Fixes: 4f563a64732d ("block: add a max_user_discard_sectors queue limit")
Reported-by: Mike Snitzer <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
When changing the maximum number of open zones, print that number
instead of the total number of zones.
Fixes: dc4d137ee3b7 ("null_blk: add support for max open/active zone limit for zoned devices")
Cc: [email protected]
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Niklas Cassel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
blk_stats_alloc_enable was used for block hybrid poll, the related
function definition was removed by patch:
commit 54bdd67d0f88 ("blk-mq: remove hybrid polling")
but the function declaration was not deleted.
Signed-off-by: hexue <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
When powering on a null_blk device that is not already on, the return
value ret that is initialized to be count is reused to check the return
value of null_add_dev(), leading to nullb_device_power_store() to return
null_add_dev() return value (0 on success) instead of "count".
So make sure to set ret to be equal to count when there are no errors.
Fixes: a2db328b0839 ("null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'")
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Don't stuff the values directly into the queue without any
synchronization, but instead delay applying the queue limits in
the caller and let dm_set_zones_restrictions work on the limit
structure.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Fold it into the only caller in preparation to changes in the
queue limits setup.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Keep it together with the rest of the zoned code.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
This is unused now that all the atomic queue limit conversions are
merged.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
|
|
percpu.h depends on smp.h, but doesn't include it directly because of
circular header dependency issues; percpu.h is needed in a bunch of low
level headers.
This fixes a randconfig build error on mips:
include/linux/alloc_tag.h: In function '__alloc_tag_ref_set':
include/asm-generic/percpu.h:31:40: error: implicit declaration of function 'raw_smp_processor_id' [-Werror=implicit-function-declaration]
Reported-by: kernel test robot <[email protected]>
Fixes: 24e44cc22aa3 ("mm: percpu: enable per-cpu allocation tagging")
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tool fix from Arnaldo Carvalho de Melo:
"Revert a patch causing a regression.
This made a simple 'perf record -e cycles:pp make -j199' stop working
on the Ampere ARM64 system Linus uses to test ARM64 kernels".
* tag 'perf-tools-fixes-for-v6.10-1-2024-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
Revert "perf parse-events: Prefer sysfs/JSON hardware events over legacy"
|
|
This reverts commit 617824a7f0f73e4de325cf8add58e55b28c12493.
This made a simple 'perf record -e cycles:pp make -j199' stop working on
the Ampere ARM64 system Linus uses to test ARM64 kernels, as discussed
at length in the threads in the Link tags below.
The fix provided by Ian wasn't acceptable and work to fix this will take
time we don't have at this point, so lets revert this and work on it on
the next devel cycle.
Reported-by: Linus Torvalds <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Bhaskar Chowdhury <[email protected]>
Cc: Ethan Adams <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thomas Richter <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Yang Jihong <[email protected]>
Link: https://lore.kernel.org/lkml/CAHk-=wi5Ri=yR2jBVk-4HzTzpoAWOgstr1LEvg_-OXtJvXXJOA@mail.gmail.com
Link: https://lore.kernel.org/lkml/CAHk-=wiWvtFyedDNpoV7a8Fq_FpbB+F5KmWK2xPY3QoYseOf_A@mail.gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Pull smb client fixes from Steve French:
- two important netfs integration fixes - including for a data
corruption and also fixes for multiple xfstests
- reenable swap support over SMB3
* tag '6.10-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix missing set of remote_i_size
cifs: Fix smb3_insert_range() to move the zero_point
cifs: update internal version number
smb3: reenable swapfiles over SMB3 mounts
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"16 hotfixes, 11 of which are cc:stable.
A few nilfs2 fixes, the remainder are for MM: a couple of selftests
fixes, various singletons fixing various issues in various parts"
* tag 'mm-hotfixes-stable-2024-05-25-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/ksm: fix possible UAF of stable_node
mm/memory-failure: fix handling of dissolved but not taken off from buddy pages
mm: /proc/pid/smaps_rollup: avoid skipping vma after getting mmap_lock again
nilfs2: fix potential hang in nilfs_detach_log_writer()
nilfs2: fix unexpected freezing of nilfs_segctor_sync()
nilfs2: fix use-after-free of timer for log writer thread
selftests/mm: fix build warnings on ppc64
arm64: patching: fix handling of execmem addresses
selftests/mm: compaction_test: fix bogus test success and reduce probability of OOM-killer invocation
selftests/mm: compaction_test: fix incorrect write of zero to nr_hugepages
selftests/mm: compaction_test: fix bogus test success on Aarch64
mailmap: update email address for Satya Priya
mm/huge_memory: don't unpoison huge_zero_folio
kasan, fortify: properly rename memintrinsics
lib: add version into /proc/allocinfo output
mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
- Fix x86 IRQ vector leak caused by a CPU offlining race
- Fix build failure in the riscv-imsic irqchip driver
caused by an API-change semantic conflict
- Fix use-after-free in irq_find_at_or_after()
* tag 'irq-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/irqdesc: Prevent use-after-free in irq_find_at_or_after()
genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline
irqchip/riscv-imsic: Fixup riscv_ipi_set_virq_range() conflict
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix regressions of the new x86 CPU VFM (vendor/family/model)
enumeration/matching code
- Fix crash kernel detection on buggy firmware with
non-compliant ACPI MADT tables
- Address Kconfig warning
* tag 'x86-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Fix x86_match_cpu() to match just X86_VENDOR_INTEL
crypto: x86/aes-xts - switch to new Intel CPU model defines
x86/topology: Handle bogus ACPI tables correctly
x86/kconfig: Select ARCH_WANT_FRAME_POINTERS again when UNWINDER_FRAME_POINTER=y
|
|
Pull ipmi updates from Corey Minyard:
"Mostly updates for deprecated interfaces, platform.remove and
converting from a tasklet to a BH workqueue.
Also use HAS_IOPORT for disabling inb()/outb()"
* tag 'for-linus-6.10-1' of https://github.com/cminyard/linux-ipmi:
ipmi: kcs_bmc_npcm7xx: Convert to platform remove callback returning void
ipmi: kcs_bmc_aspeed: Convert to platform remove callback returning void
ipmi: ipmi_ssif: Convert to platform remove callback returning void
ipmi: ipmi_si_platform: Convert to platform remove callback returning void
ipmi: ipmi_powernv: Convert to platform remove callback returning void
ipmi: bt-bmc: Convert to platform remove callback returning void
char: ipmi: handle HAS_IOPORT dependencies
ipmi: Convert from tasklet to BH workqueue
|
|
Pull ceph updates from Ilya Dryomov:
"A series from Xiubo that adds support for additional access checks
based on MDS auth caps which were recently made available to clients.
This is needed to prevent scenarios where the MDS quietly discards
updates that a UID-restricted client previously (wrongfully) acked to
the user.
Other than that, just a documentation fixup"
* tag 'ceph-for-6.10-rc1' of https://github.com/ceph/ceph-client:
doc: ceph: update userspace command to get CephFS metadata
ceph: add CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK feature bit
ceph: check the cephx mds auth access for async dirop
ceph: check the cephx mds auth access for open
ceph: check the cephx mds auth access for setattr
ceph: add ceph_mds_check_access() helper
ceph: save cap_auths in MDS client when session is opened
|
|
https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"Fixes:
- reusing of the file index (could cause the file to be trimmed)
- infinite dir enumeration
- taking DOS names into account during link counting
- le32_to_cpu conversion, 32 bit overflow, NULL check
- some code was refactored
Changes:
- removed max link count info display during driver init
Remove:
- atomic_open has been removed for lack of use"
* tag 'ntfs3_for_6.10' of https://github.com/Paragon-Software-Group/linux-ntfs3:
fs/ntfs3: Break dir enumeration if directory contents error
fs/ntfs3: Fix case when index is reused during tree transformation
fs/ntfs3: Mark volume as dirty if xattr is broken
fs/ntfs3: Always make file nonresident on fallocate call
fs/ntfs3: Redesign ntfs_create_inode to return error code instead of inode
fs/ntfs3: Use variable length array instead of fixed size
fs/ntfs3: Use 64 bit variable to avoid 32 bit overflow
fs/ntfs3: Check 'folio' pointer for NULL
fs/ntfs3: Missed le32_to_cpu conversion
fs/ntfs3: Remove max link count info display during driver init
fs/ntfs3: Taking DOS names into account during link counting
fs/ntfs3: remove atomic_open
fs/ntfs3: use kcalloc() instead of kzalloc()
|
|
Pull smb server fixes from Steve French:
"Two ksmbd server fixes, both for stable"
* tag '6.10-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: ignore trailing slashes in share paths
ksmbd: avoid to send duplicate oplock break notifications
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"There is one new driver and then most of the changes are the device
tree bindings conversions to yaml.
New driver:
- Epson RX8111
Drivers:
- Many Device Tree bindings conversions to dtschema
- pcf8563: wakeup-source support"
* tag 'rtc-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
pcf8563: add wakeup-source support
rtc: rx8111: handle VLOW flag
rtc: rx8111: demote warnings to debug level
rtc: rx6110: Constify struct regmap_config
dt-bindings: rtc: convert trivial devices into dtschema
dt-bindings: rtc: stmp3xxx-rtc: convert to dtschema
dt-bindings: rtc: pxa-rtc: convert to dtschema
rtc: Add driver for Epson RX8111
dt-bindings: rtc: Add Epson RX8111
rtc: mcp795: drop unneeded MODULE_ALIAS
rtc: nuvoton: Modify part number value
rtc: test: Split rtc unit test into slow and normal speed test
dt-bindings: rtc: nxp,lpc1788-rtc: convert to dtschema
dt-bindings: rtc: digicolor-rtc: move to trivial-rtc
dt-bindings: rtc: alphascale,asm9260-rtc: convert to dtschema
dt-bindings: rtc: armada-380-rtc: convert to dtschema
rtc: cros-ec: provide ID table for avoiding fallback match
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
Pull i3c updates from Alexandre Belloni:
"Runtime PM (power management) is improved and hot-join support has
been added to the dw controller driver.
Core:
- Allow device driver to trigger controller runtime PM
Drivers:
- dw: hot-join support
- svc: better IBI handling"
* tag 'i3c/for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
i3c: dw: Add hot-join support.
i3c: master: Enable runtime PM for master controller
i3c: master: svc: fix invalidate IBI type and miss call client IBI handler
i3c: master: svc: change ENXIO to EAGAIN when IBI occurs during start frame
i3c: Add comment for -EAGAIN in i3c_device_do_priv_xfers()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull jffs2 updates from Richard Weinberger:
- Fix illegal memory access in jffs2_free_inode()
- Kernel-doc fixes
- print symbolic error names
* tag 'jffs2-for-linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
jffs2: Fix potential illegal address access in jffs2_free_inode
jffs2: Simplify the allocation of slab caches
jffs2: nodemgmt: fix kernel-doc comments
jffs2: print symbolic error name instead of error code
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull UML updates from Richard Weinberger:
- Fixes for -Wmissing-prototypes warnings and further cleanup
- Remove callback returning void from rtc and virtio drivers
- Fix bash location
* tag 'uml-for-linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (26 commits)
um: virtio_uml: Convert to platform remove callback returning void
um: rtc: Convert to platform remove callback returning void
um: Remove unused do_get_thread_area function
um: Fix -Wmissing-prototypes warnings for __vdso_*
um: Add an internal header shared among the user code
um: Fix the declaration of kasan_map_memory
um: Fix the -Wmissing-prototypes warning for get_thread_reg
um: Fix the -Wmissing-prototypes warning for __switch_mm
um: Fix -Wmissing-prototypes warnings for (rt_)sigreturn
um: Stop tracking host PID in cpu_tasks
um: process: remove unused 'n' variable
um: vector: remove unused len variable/calculation
um: vector: fix bpfflash parameter evaluation
um: slirp: remove set but unused variable 'pid'
um: signal: move pid variable where needed
um: Makefile: use bash from the environment
um: Add winch to winch_handlers before registering winch IRQ
um: Fix -Wmissing-prototypes warnings for __warp_* and foo
um: Fix -Wmissing-prototypes warnings for text_poke*
um: Move declarations to proper headers
...
|
|
Pull drm fixes from Dave Airlie:
"Some fixes for the end of the merge window, mostly amdgpu and panthor,
with one nouveau uAPI change that fixes a bad decision we made a few
months back.
nouveau:
- fix bo metadata uAPI for vm bind
panthor:
- Fixes for panthor's heap logical block.
- Reset on unrecoverable fault
- Fix VM references.
- Reset fix.
xlnx:
- xlnx compile and doc fixes.
amdgpu:
- Handle vbios table integrated info v2.3
amdkfd:
- Handle duplicate BOs in reserve_bo_and_cond_vms
- Handle memory limitations on small APUs
dp/mst:
- MST null deref fix.
bridge:
- Don't let next bridge create connector in adv7511 to make probe
work"
* tag 'drm-next-2024-05-25' of https://gitlab.freedesktop.org/drm/kernel:
drm/amdgpu/atomfirmware: add intergrated info v2.3 table
drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2
drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
drm/amdkfd: handle duplicate BOs in reserve_bo_and_cond_vms
drm/bridge: adv7511: Attach next bridge without creating connector
drm/buddy: Fix the warn on's during force merge
drm/nouveau: use tile_mode and pte_kind for VM_BIND bo allocations
drm/panthor: Call panthor_sched_post_reset() even if the reset failed
drm/panthor: Reset the FW VM to NULL on unplug
drm/panthor: Keep a ref to the VM at the panthor_kernel_bo level
drm/panthor: Force an immediate reset on unrecoverable faults
drm/panthor: Document drm_panthor_tiler_heap_destroy::handle validity constraints
drm/panthor: Fix an off-by-one in the heap context retrieval logic
drm/panthor: Relax the constraints on the tiler chunk size
drm/panthor: Make sure the tiler initial/max chunks are consistent
drm/panthor: Fix tiler OOM handling to allow incremental rendering
drm: xlnx: zynqmp_dpsub: Fix compilation error
drm: xlnx: zynqmp_dpsub: Fix few function comments
|
|
Occasionally, the generic/001 xfstest will fail indicating corruption in
one of the copy chains when run on cifs against a server that supports
FSCTL_DUPLICATE_EXTENTS_TO_FILE (eg. Samba with a share on btrfs). The
problem is that the remote_i_size value isn't updated by cifs_setsize()
when called by smb2_duplicate_extents(), but i_size *is*.
This may cause cifs_remap_file_range() to then skip the bit after calling
->duplicate_extents() that sets sizes.
Fix this by calling netfs_resize_file() in smb2_duplicate_extents() before
calling cifs_setsize() to set i_size.
This means we don't then need to call netfs_resize_file() upon return from
->duplicate_extents(), but we also fix the test to compare against the pre-dup
inode size.
[Note that this goes back before the addition of remote_i_size with the
netfs_inode struct. It should probably have been setting cifsi->server_eof
previously.]
Fixes: cfc63fc8126a ("smb3: fix cached file size problems in duplicate extents (reflink)")
Signed-off-by: David Howells <[email protected]>
cc: Steve French <[email protected]>
cc: Paulo Alcantara <[email protected]>
cc: Shyam Prasad N <[email protected]>
cc: Rohith Surabattula <[email protected]>
cc: Jeff Layton <[email protected]>
cc: [email protected]
cc: [email protected]
Signed-off-by: Steve French <[email protected]>
|
|
Fix smb3_insert_range() to move the zero_point over to the new EOF.
Without this, generic/147 fails as reads of data beyond the old EOF point
return zeroes.
Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib")
Signed-off-by: David Howells <[email protected]>
cc: Shyam Prasad N <[email protected]>
cc: Rohith Surabattula <[email protected]>
cc: Jeff Layton <[email protected]>
cc: [email protected]
cc: [email protected]
Signed-off-by: Steve French <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more mm updates from Andrew Morton:
"Jeff Xu's implementation of the mseal() syscall"
* tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
selftest mm/mseal read-only elf memory segment
mseal: add documentation
selftest mm/mseal memory sealing
mseal: add mseal syscall
mseal: wire up mseal syscall
|
|
The commit 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page
deduplication limit") introduced a possible failure case in the
stable_tree_insert(), where we may free the new allocated stable_node_dup
if we fail to prepare the missing chain node.
Then that kfolio return and unlock with a freed stable_node set... And
any MM activities can come in to access kfolio->mapping, so UAF.
Fix it by moving folio_set_stable_node() to the end after stable_node
is inserted successfully.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page deduplication limit")
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When I did memory failure tests recently, below panic occurs:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:1009!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:__del_page_from_free_list+0x151/0x180
RSP: 0018:ffffa49c90437998 EFLAGS: 00000046
RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0
RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69
R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80
R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009
FS: 00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0
Call Trace:
<TASK>
__rmqueue_pcplist+0x23b/0x520
get_page_from_freelist+0x26b/0xe40
__alloc_pages_noprof+0x113/0x1120
__folio_alloc_noprof+0x11/0xb0
alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130
__alloc_fresh_hugetlb_folio+0xe7/0x140
alloc_pool_huge_folio+0x68/0x100
set_max_huge_pages+0x13d/0x340
hugetlb_sysctl_handler_common+0xe8/0x110
proc_sys_call_handler+0x194/0x280
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xc2/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff916114887
RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887
RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003
RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0
R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004
R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00
</TASK>
Modules linked in: mce_inject hwpoison_inject
---[ end trace 0000000000000000 ]---
And before the panic, there had an warning about bad page state:
BUG: Bad page state in process page-types pfn:8cee00
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
page_type: 0xffffff7f(buddy)
raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000
raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000
page dumped because: nonzero mapcount
Modules linked in: mce_inject hwpoison_inject
CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22
Call Trace:
<TASK>
dump_stack_lvl+0x83/0xa0
bad_page+0x63/0xf0
free_unref_page+0x36e/0x5c0
unpoison_memory+0x50b/0x630
simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
debugfs_attr_write+0x42/0x60
full_proxy_write+0x5b/0x80
vfs_write+0xcd/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xc2/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f189a514887
RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887
RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003
RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8
R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040
</TASK>
The root cause should be the below race:
memory_failure
try_memory_failure_hugetlb
me_huge_page
__page_handle_poison
dissolve_free_hugetlb_folio
drain_all_pages -- Buddy page can be isolated e.g. for compaction.
take_page_off_buddy -- Failed as page is not in the buddy list.
-- Page can be putback into buddy after compaction.
page_ref_inc -- Leads to buddy page with refcnt = 1.
Then unpoison_memory() can unpoison the page and send the buddy page back
into buddy list again leading to the above bad page state warning. And
bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy
page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to
allocate this page.
Fix this issue by only treating __page_handle_poison() as successful when
it returns 1.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ceaf8fbea79a ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage")
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
After switching smaps_rollup to use VMA iterator, searching for next entry
is part of the condition expression of the do-while loop. So the current
VMA needs to be addressed before the continue statement.
Otherwise, with some VMAs skipped, userspace observed memory
consumption from /proc/pid/smaps_rollup will be smaller than the sum of
the corresponding fields from /proc/pid/smaps.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end")
Signed-off-by: Yuanyuan Zhong <[email protected]>
Reviewed-by: Mohamed Khalfella <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|