Age | Commit message (Collapse) | Author | Files | Lines |
|
There are sleep in atomic context bugs when the request to secure
element of st-nci is timeout. The root cause is that nci_skb_alloc
with GFP_KERNEL parameter is called in st_nci_se_wt_timeout which is
a timer handler. The call paths that could trigger bugs are shown below:
(interrupt context 1)
st_nci_se_wt_timeout
nci_hci_send_event
nci_hci_send_data
nci_skb_alloc(..., GFP_KERNEL) //may sleep
(interrupt context 2)
st_nci_se_wt_timeout
nci_hci_send_event
nci_hci_send_data
nci_send_data
nci_queue_tx_data_frags
nci_skb_alloc(..., GFP_KERNEL) //may sleep
This patch changes allocation mode of nci_skb_alloc from GFP_KERNEL to
GFP_ATOMIC in order to prevent atomic context sleeping. The GFP_ATOMIC
flag makes memory allocation operation could be used in atomic context.
Fixes: ed06aeefdac3 ("nfc: st-nci: Rename st21nfcb to st-nci")
Signed-off-by: Duoming Zhou <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
test_bit() tests if one bit is set or not.
Here the logic seems to check of bit QL_RESET_PER_SCSI (i.e. 4) OR bit
QL_RESET_START (i.e. 3) is set.
In fact, it checks if bit 7 (4 | 3 = 7) is set, that is to say
QL_ADAPTER_UP.
This looks harmless, because this bit is likely be set, and when the
ql_reset_work() delayed work is scheduled in ql3xxx_isr() (the only place
that schedule this work), QL_RESET_START or QL_RESET_PER_SCSI is set.
This has been spotted by smatch.
Fixes: 5a4faa873782 ("[PATCH] qla3xxx NIC driver")
Signed-off-by: Christophe JAILLET <[email protected]>
Link: https://lore.kernel.org/r/80e73e33f390001d9c0140ffa9baddf6466a41a2.1652637337.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- Avoid putting Elo i2 PCIe Ports in D3cold because downstream devices
are inaccessible after going back to D0 (Rafael J. Wysocki)
- Qualcomm SM8250 has a ddrss_sf_tbu clock but SC8180X does not; make a
SC8180X-specific config without the clock so it probes correctly
(Bjorn Andersson)
- Revert aardvark chained IRQ handler rewrite because it broke
interrupt affinity (Pali Rohár)
* tag 'pci-v5.18-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
Revert "PCI: aardvark: Rewrite IRQ code to chained IRQ handler"
PCI: qcom: Remove ddrss_sf_tbu clock from SC8180X
PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"Fix up a recent change in the int340x thermal driver that
inadvertently broke thermal zone handling on some systems
(Srinivas Pandruvada)"
* tag 'thermal-5.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: int340x: Mode setting with new OS handshake
|
|
The code attempts to free the 'new' pointer using kmem_cache_free(),
which is wrong because this function isn't responsible of freeing it.
Instead, the function should free new->htable and clear the contents of
*new (to prevent double-free).
Cc: [email protected]
Fixes: c7c556f1e81b ("selinux: refactor changing booleans")
Reported-by: Wander Lairson Costa <[email protected]>
Signed-off-by: Ondrej Mosnacek <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
|
|
Introduce arch_try_cmpxchg64 for 64-bit and 32-bit targets to improve
code using cmpxchg64. On 64-bit targets, the generated assembly improves
from:
ab: 89 c8 mov %ecx,%eax
ad: 48 89 4c 24 60 mov %rcx,0x60(%rsp)
b2: 83 e0 fd and $0xfffffffd,%eax
b5: 89 54 24 64 mov %edx,0x64(%rsp)
b9: 88 44 24 60 mov %al,0x60(%rsp)
bd: 48 89 c8 mov %rcx,%rax
c0: c6 44 24 62 f2 movb $0xf2,0x62(%rsp)
c5: 48 8b 74 24 60 mov 0x60(%rsp),%rsi
ca: f0 49 0f b1 34 24 lock cmpxchg %rsi,(%r12)
d0: 48 39 c1 cmp %rax,%rcx
d3: 75 cf jne a4 <t+0xa4>
to:
b3: 89 c2 mov %eax,%edx
b5: 48 89 44 24 60 mov %rax,0x60(%rsp)
ba: 83 e2 fd and $0xfffffffd,%edx
bd: 89 4c 24 64 mov %ecx,0x64(%rsp)
c1: 88 54 24 60 mov %dl,0x60(%rsp)
c5: c6 44 24 62 f2 movb $0xf2,0x62(%rsp)
ca: 48 8b 54 24 60 mov 0x60(%rsp),%rdx
cf: f0 48 0f b1 13 lock cmpxchg %rdx,(%rbx)
d4: 75 d5 jne ab <t+0xab>
where a move and a compare after cmpxchg is saved. The improvements
for 32-bit targets are even more noticeable, because dual-word compare
after cmpxchg8b gets eliminated.
Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Add generic support for try_cmpxchg64{,_acquire,_release,_relaxed}
and their falbacks involving cmpxchg64.
Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
This fires on a Fam16h machine here:
unchecked MSR access error: WRMSR to 0xc000010f (tried to write 0x0000000000000018) \
at rIP: 0xffffffff81007db1 (amd_brs_reset+0x11/0x50)
Call Trace:
<TASK>
amd_pmu_cpu_starting
? x86_pmu_dead_cpu
x86_pmu_starting_cpu
cpuhp_invoke_callback
? x86_pmu_starting_cpu
? x86_pmu_dead_cpu
cpuhp_issue_call
? x86_pmu_starting_cpu
__cpuhp_setup_state_cpuslocked
? x86_pmu_dead_cpu
? x86_pmu_starting_cpu
__cpuhp_setup_state
? map_vsyscall
init_hw_perf_events
? map_vsyscall
do_one_initcall
? _raw_spin_unlock_irqrestore
? try_to_wake_up
kernel_init_freeable
? rest_init
kernel_init
ret_from_fork
because that CPU hotplug callback gets executed on any AMD CPU - not
only on the BRS-enabled ones. Check the BRS feature bit properly.
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-By: Stephane Eranian <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
There's two problems with the current amd_brs_adjust_period() code:
- it isn't in fact AMD specific and wil always adjust the period;
- it adjusts the period, while it should only adjust the event count,
resulting in repoting a short period.
Fix this by using x86_pmu.limit_period, this makes it specific to the
AMD BRS case and ensures only the event count is adjusted while the
reported period is unmodified.
Fixes: ba2fe7500845 ("perf/x86/amd: Add AMD branch sampling period adjustment")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
|
|
drm_dp_mst_get_edid call kmemdup to create mst_edid. So mst_edid need to be
freed after use.
Signed-off-by: Hangyu Hua <[email protected]>
Reviewed-by: Lyude Paul <[email protected]>
Signed-off-by: Lyude Paul <[email protected]>
Cc: [email protected]
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
This change fixes the following:
1) The flags variable is not initialized. Always use raw_spin_lock_irqsave
and raw_spin_unlock_irqrestore to serialize patching.
2) flush_kernel_vmap_range is primarily intended for DMA flushes.
The whole cache flush in flush_kernel_vmap_range is only possible
when interrupts are enabled on SMP machines. Since __patch_text_multiple
calls flush_kernel_vmap_range with interrupts disabled, it is better
to directly call flush_kernel_dcache_range_asm and
flush_kernel_icache_range_asm.
3) The final call to flush_icache_range is unnecessary.
Tested with `[PATCH, V3] parisc: Rewrite cache flush code for
PA8800/PA8900' change on rp3440, c8000 and c3750 (32 and 64-bit).
Note by Helge:
This patch had been temporarily reverted shortly before v5.18-rc6 in order
to fix boot issues. Now it can be re-applied.
Signed-off-by: John David Anglin <[email protected]>
Signed-off-by: Helge Deller <[email protected]>
|
|
Originally, I was convinced that we needed to use tmpalias flushes
everwhere, for both user and kernel flushes. However, when I modified
flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
would crash quite early when booting.
The PDC returns alias values of 0 for the icache and dcache. This
indicates that either the alias boundary is greater than 16MB or
equivalent aliasing doesn't work. I modified the tmpalias code to
make it easy to try alternate boundaries. I tried boundaries up to
128MB but still kernel tmpalias flushes didn't work on c8000.
This led me to conclude that tmpalias flushes don't work on PA8800
and PA8900 machines, and that we needed to flush directly using the
virtual address of user and kernel pages. This is likely the major
cause of instability on the c8000 and rp34xx machines.
Flushing user pages requires doing a temporary context switch as we
have to flush pages that don't belong to the current context. Further,
we have to deal with pages that aren't present. If a page isn't
present, the flush instructions fault on every line.
Other code has been rearranged and simplified based on testing. For
example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
and flush_cache_dup_mm differ in that flush_cache_mm calls
purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
In some implementations, pdc is more efficient than fdc. Based on
my testing, I don't believe there's any performance benefit on the
c8000.
Signed-off-by: John David Anglin <[email protected]>
Signed-off-by: Helge Deller <[email protected]>
|
|
Change the "BUG" to "WARNING" and disable the message because it triggers
occasionally in spite of the check in flush_cache_page_if_present.
The pte value extracted for the "from" page in copy_user_highpage is racy and
occasionally the pte is cleared before the flush is complete. I assume that
the page is simultaneously flushed by flush_cache_mm before the pte is cleared
as nullifying the fdc doesn't seem to cause problems.
I investigated various locking scenarios but I wasn't able to find a way to
sequence the flushes. This code is called for every COW break and locks impact
performance.
This patch is related to the bigger cache flush patch because we need the pte
on PA8800/PA8900 to flush using the vma context.
I have also seen this from copy_to_user_page and copy_from_user_page.
The messages appear infrequently when enabled.
Signed-off-by: John David Anglin <[email protected]>
Signed-off-by: Helge Deller <[email protected]>
|
|
clk_generated_best_diff() helps in finding the parent and the divisor to
compute a rate closest to the required one. However, it doesn't take into
account the request's range for the new rate. Make sure the new rate
is within the required range.
Fixes: 8a8f4bf0c480 ("clk: at91: clk-generated: create function to find best_diff")
Signed-off-by: Codrin Ciubotariu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Claudiu Beznea <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>
|
|
cpufreq_offline() calls offline() and exit() under the policy rwsem
But they are called outside the rwsem in cpufreq_online().
Make cpufreq_online() call offline() and exit() as well as online() and
init() under the policy rwsem to achieve a clear lock relationship.
All of the init() and online() implementations in the tree only
initialize the policy object without attempting to acquire the policy
rwsem and they won't call cpufreq APIs attempting to acquire it.
Signed-off-by: Schspa Shi <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
If policy initialization fails after the sysfs files are created,
there is a possibility to end up running show()/store() callbacks
for half-initialized policies, which may have unpredictable
outcomes.
Abort show()/store() in such a case by making sure the policy is active.
Also dectivate the policy on such failures.
Signed-off-by: Schspa Shi <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Not calling the function for dummy contexts will cause the context to
not be reset. During the next syscall, this will cause an error in
__audit_syscall_entry:
WARN_ON(context->context != AUDIT_CTX_UNUSED);
WARN_ON(context->name_count);
if (context->context != AUDIT_CTX_UNUSED || context->name_count) {
audit_panic("unrecoverable error in audit_syscall_entry()");
return;
}
These problematic dummy contexts are created via the following call
chain:
exit_to_user_mode_prepare
-> arch_do_signal_or_restart
-> get_signal
-> task_work_run
-> tctx_task_work
-> io_req_task_submit
-> io_issue_sqe
-> audit_uring_entry
Cc: [email protected]
Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Julian Orth <[email protected]>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <[email protected]>
|
|
We gate whether to IOPOLL for a request on whether the opcode is allowed
on a ring setup for IOPOLL and if it's got a file assigned. MSG_RING
is the only one that allows a file yet isn't pollable, it's merely
supported to allow communication on an IOPOLL ring, not because we can
poll for completion of it.
Put the assigned file early and clear it, so we don't attempt to poll
for it.
Reported-by: [email protected]
Fixes: 3f1d52abf098 ("io_uring: defer msg-ring file validity check until command issue")
Signed-off-by: Jens Axboe <[email protected]>
|
|
Hulk Robot reported a BUG_ON:
==================================================================
EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0,
block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters
kernel BUG at fs/ext4/ext4_jbd2.c:53!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1
RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline]
RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116
[...]
Call Trace:
ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795
generic_perform_write+0x279/0x3c0 mm/filemap.c:3344
ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270
ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520
do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732
do_iter_write+0x107/0x430 fs/read_write.c:861
vfs_writev fs/read_write.c:934 [inline]
do_pwritev+0x1e5/0x380 fs/read_write.c:1031
[...]
==================================================================
Above issue may happen as follows:
cpu1 cpu2
__________________________|__________________________
do_pwritev
vfs_writev
do_iter_write
ext4_file_write_iter
ext4_buffered_write_iter
generic_perform_write
ext4_da_write_begin
vfs_fallocate
ext4_fallocate
ext4_convert_inline_data
ext4_convert_inline_data_nolock
ext4_destroy_inline_data_nolock
clear EXT4_STATE_MAY_INLINE_DATA
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_regular_allocator
ext4_mb_good_group_nolock
ext4_mb_init_group
ext4_mb_init_cache
ext4_mb_generate_buddy --> error
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_restore_inline_data
set EXT4_STATE_MAY_INLINE_DATA
ext4_block_write_begin
ext4_da_write_end
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_write_inline_data_end
handle=NULL
ext4_journal_stop(handle)
__ext4_journal_stop
ext4_put_nojournal(handle)
ref_cnt = (unsigned long)handle
BUG_ON(ref_cnt == 0) ---> BUG_ON
The lock held by ext4_convert_inline_data is xattr_sem, but the lock
held by generic_perform_write is i_rwsem. Therefore, the two locks can
be concurrent.
To solve above issue, we add inode_lock() for ext4_convert_inline_data().
At the same time, move ext4_convert_inline_data() in front of
ext4_punch_hole(), remove similar handling from ext4_punch_hole().
Fixes: 0c8d414f163f ("ext4: let fallocate handle inline data correctly")
Cc: [email protected]
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Baokun Li <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Symlink's external data block is one kind of metadata block, and now
that almost all ext4 metadata block's page cache (e.g. directory blocks,
quota blocks...) belongs to bdev backing inode except the symlink. It
is essentially worked in data=journal mode like other regular file's
data block because probably in order to make it simple for generic VFS
code handling symlinks or some other historical reasons, but the logic
of creating external data block in ext4_symlink() is complicated. and it
also make things confused if user do not want to let the filesystem
worked in data=journal mode. This patch convert the final exceptional
case and make things clean, move the mapping of the symlink's external
data block to bdev like any other metadata block does.
Signed-off-by: Zhang Yi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Current ext4_getblk() might sleep if some resources are not valid or
could be race with a concurrent extents modifing procedure. So we
cannot call ext4_getblk() and ext4_map_blocks() to get map blocks in
the atomic context in some fast path (e.g. the upcoming procedure of
getting symlink external block in the RCU context), even if the map
extents have already been check and cached.
Signed-off-by: Zhang Yi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In __ext4_super() we always overwrote the user specified journal_ioprio
value with a default value, expecting parse_apply_sb_mount_options() to
later correctly set ctx->journal_ioprio to the user specified value.
However, if parse_apply_sb_mount_options() returned early because of
empty sbi->es_s->s_mount_opts, the correct journal_ioprio value was
never set.
This patch fixes __ext4_super() to only use the default value if the
user has not specified any value for journal_ioprio.
Similarly, the remount behavior was to either use journal_ioprio
value specified during initial mount, or use the default value
irrespective of the journal_ioprio value specified during remount.
This patch modifies this to first check if a new value for ioprio
has been passed during remount and apply it. If no new value is
passed, use the value specified during initial mount.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
Tested-by: Ritesh Harjani <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]
|
|
Otherwise nonaligned fstrim calls will works inconveniently for iterative
scanners, for example:
// trim [0,16MB] for group-1, but mark full group as trimmed
fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m
// handle [16MB,16MB] for group-1, do nothing because group already has the flag.
fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m
[ Update function documentation for ext4_trim_all_free -- TYT ]
Signed-off-by: Dmitry Monakhov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]
|
|
We got issue as follows:
EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue
ext4_get_first_dir_block: bh->b_data=0xffff88810bee6000 len=34478
ext4_get_first_dir_block: *parent_de=0xffff88810beee6ae bh->b_data=0xffff88810bee6000
ext4_rename_dir_prepare: [1] parent_de=0xffff88810beee6ae
==================================================================
BUG: KASAN: use-after-free in ext4_rename_dir_prepare+0x152/0x220
Read of size 4 at addr ffff88810beee6ae by task rep/1895
CPU: 13 PID: 1895 Comm: rep Not tainted 5.10.0+ #241
Call Trace:
dump_stack+0xbe/0xf9
print_address_description.constprop.0+0x1e/0x220
kasan_report.cold+0x37/0x7f
ext4_rename_dir_prepare+0x152/0x220
ext4_rename+0xf44/0x1ad0
ext4_rename2+0x11c/0x170
vfs_rename+0xa84/0x1440
do_renameat2+0x683/0x8f0
__x64_sys_renameat+0x53/0x60
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f45a6fc41c9
RSP: 002b:00007ffc5a470218 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f45a6fc41c9
RDX: 0000000000000005 RSI: 0000000020000180 RDI: 0000000000000005
RBP: 00007ffc5a470240 R08: 00007ffc5a470160 R09: 0000000020000080
R10: 00000000200001c0 R11: 0000000000000246 R12: 0000000000400bb0
R13: 00007ffc5a470320 R14: 0000000000000000 R15: 0000000000000000
The buggy address belongs to the page:
page:00000000440015ce refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x10beee
flags: 0x200000000000000()
raw: 0200000000000000 ffffea00043ff4c8 ffffea0004325608 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88810beee580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810beee600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88810beee680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88810beee700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810beee780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
ext4_rename_dir_prepare: [2] parent_de->inode=3537895424
ext4_rename_dir_prepare: [3] dir=0xffff888124170140
ext4_rename_dir_prepare: [4] ino=2
ext4_rename_dir_prepare: ent->dir->i_ino=2 parent=-757071872
Reason is first directory entry which 'rec_len' is 34478, then will get illegal
parent entry. Now, we do not check directory entry after read directory block
in 'ext4_get_first_dir_block'.
To solve this issue, check directory entry in 'ext4_get_first_dir_block'.
[ Trigger an ext4_error() instead of just warning if the directory is
missing a '.' or '..' entry. Also make sure we return an error code
if the file system is corrupted. -TYT ]
Signed-off-by: Ye Bin <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]
|
|
Zoned devices are expected to have zone sizes in the range of 1-2GB for
ZNS SSDs and SMR HDDs have zone sizes of 256MB, so there is no need to
allow arbitrarily small zone sizes on btrfs.
But for testing purposes with emulated devices it is sometimes desirable
to create devices with as small as 4MB zone size to uncover errors.
So use 4MB as the smallest possible zone size and reject mounts of devices
with a smaller zone size.
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Btrfs defaults to max_inline=2K to make small writes inlined into
metadata.
The default value is always a win, as even DUP/RAID1/RAID10 doubles the
metadata usage, it should still cause less physical space used compared
to a 4K regular extents.
But since the introduction of RAID1C3 and RAID1C4 it's no longer the case,
users may find inlined extents causing too much space wasted, and want
to convert those inlined extents back to regular extents.
Unfortunately defrag will unconditionally skip all inline extents, no
matter if the user is trying to converting them back to regular extents.
So this patch will add a small exception for defrag_collect_targets() to
allow defragging inline extents, if and only if the inlined extents are
larger than max_inline, allowing users to convert them to regular ones.
This also allows us to defrag extents like the following:
item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
generation 7 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 16384 ram 16384
extent compression 1 (zlib)
Previously we're unable to do any defrag, since the first extent is
inlined, and the second one has no extent to merge.
Now we can defrag it to just one single extent, saving 48 bytes metadata
space.
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13635584 nr 4096
extent data offset 0 nr 20480 ram 20480
extent compression 1 (zlib)
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
The following error message lack the "0x" obviously:
cannot mount because of unsupported optional features (4000)
Add the prefix to make it less confusing. This can happen on older
kernels that try to mount a filesystem with newer features so it makes
sense to backport to older trees.
CC: [email protected] # 4.14+
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When reserving metadata units for creating an inode, we don't need to
reserve one extra unit for the inode ref item because when creating the
inode, at btrfs_create_new_inode(), we always insert the inode item and
the inode ref item in a single batch (a single btree insert operation,
and both ending up in the same leaf).
As we have accounted already one unit for the inode item, the extra unit
for the inode ref item is superfluous, it only makes us reserve more
metadata than necessary and often adding more reclaim pressure if we are
low on available metadata space.
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
The block_group->alloc_offset is an offset from the start of the block
group. OTOH, the ->meta_write_pointer is an address in the logical
space. So, we should compare the alloc_offset shifted with the
block_group->start.
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
CC: [email protected] # 5.16+
Signed-off-by: Naohiro Aota <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Adaptive-rx and Adaptive-tx are interrupt moderation settings
that can be enabled/disabled using ethtool:
ethtool -C ethX adaptive-rx on/off adaptive-tx on/off
Unfortunately those settings are getting cleared after
changing number of queues, or in ethtool world 'channels':
ethtool -L ethX rx 1 tx 1
Clearing was happening due to introduction of bit fields
in ice_ring_container struct. This way only itr_setting
bits were rebuilt during ice_vsi_rebuild_set_coalesce().
Introduce an anonymous struct of bitfields and create a
union to refer to them as a single variable.
This way variable can be easily saved and restored.
Fixes: 61dc79ced7aa ("ice: Restore interrupt throttle settings after VSI rebuild")
Signed-off-by: Michal Wilczynski <[email protected]>
Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
|
|
The hardware statistics counters are not cleared during resets so the
drivers first access is to initialize the baseline and then subsequent
reads are for reporting the counters. The statistics counters are read
during the watchdog subtask when the interface is up. If the baseline
is not initialized before the interface is up, then there can be a brief
window in which some traffic can be transmitted/received before the
initial baseline reading takes place.
Directly initialize ethtool statistics in driver open so the baseline will
be initialized when the interface is up, and any dropped packets
incremented before the interface is up won't be reported.
Fixes: 28dc1b86f8ea9 ("ice: ignore dropped packets during init")
Signed-off-by: Paul Greenwalt <[email protected]>
Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
|
|
Do not allow to write timestamps on RX rings if PF is being configured.
When PF is being configured RX rings can be freed or rebuilt. If at the
same time timestamps are updated, the kernel will crash by dereferencing
null RX ring pointer.
PID: 1449 TASK: ff187d28ed658040 CPU: 34 COMMAND: "ice-ptp-0000:51"
#0 [ff1966a94a713bb0] machine_kexec at ffffffff9d05a0be
#1 [ff1966a94a713c08] __crash_kexec at ffffffff9d192e9d
#2 [ff1966a94a713cd0] crash_kexec at ffffffff9d1941bd
#3 [ff1966a94a713ce8] oops_end at ffffffff9d01bd54
#4 [ff1966a94a713d08] no_context at ffffffff9d06bda4
#5 [ff1966a94a713d60] __bad_area_nosemaphore at ffffffff9d06c10c
#6 [ff1966a94a713da8] do_page_fault at ffffffff9d06cae4
#7 [ff1966a94a713de0] page_fault at ffffffff9da0107e
[exception RIP: ice_ptp_update_cached_phctime+91]
RIP: ffffffffc076db8b RSP: ff1966a94a713e98 RFLAGS: 00010246
RAX: 16e3db9c6b7ccae4 RBX: ff187d269dd3c180 RCX: ff187d269cd4d018
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ff187d269cfcc644 R8: ff187d339b9641b0 R9: 0000000000000000
R10: 0000000000000002 R11: 0000000000000000 R12: ff187d269cfcc648
R13: ffffffff9f128784 R14: ffffffff9d101b70 R15: ff187d269cfcc640
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ff1966a94a713ea0] ice_ptp_periodic_work at ffffffffc076dbef [ice]
#9 [ff1966a94a713ee0] kthread_worker_fn at ffffffff9d101c1b
#10 [ff1966a94a713f10] kthread at ffffffff9d101b4d
#11 [ff1966a94a713f50] ret_from_fork at ffffffff9da0023f
Fixes: 77a781155a65 ("ice: enable receive hardware timestamping")
Signed-off-by: Arkadiusz Kubalewski <[email protected]>
Reviewed-by: Michal Schmidt <[email protected]>
Tested-by: Dave Cain <[email protected]>
Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.18, take #3
- Correctly expose GICv3 support even if no irqchip is created
so that userspace doesn't observe it changing pointlessly
(fixing a regression with QEMU)
- Don't issue a hypercall to set the id-mapped vectors when
protected mode is enabled (fix for pKVM in combination with
CPUs affected by Spectre-v3a)
|
|
Clock source is prepared and enabled by clk_prepare_enable()
in probe function, but not disabled or unprepared in remove
function.
Signed-off-by: Yang Yingliang <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Link: https://lore.kernel.org/linux-mtd/[email protected]
|
|
When "-o device" mount option is not specified, scan the device table
and instantiate the devices if there's any in the device table. In this
case, the tag field of each device slot uniquely specifies a device.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Gao Xiang <[email protected]>
|
|
Use asynchronous io to read data from fscache may greatly improve IO
bandwidth for sequential buffered read scenario.
Change erofs_fscache_read_folios to erofs_fscache_read_folios_async,
and read data from fscache asynchronously.
Make .readpage()/.readahead() to use this new helper.
Signed-off-by: Xin Yin <[email protected]>
Reviewed-by: Jeffle Xu <[email protected]>
Signed-off-by: Jeffle Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
[ Gao Xiang: minor styling changes. ]
Signed-off-by: Gao Xiang <[email protected]>
|
|
Introduce 'fsid' mount option to enable on-demand read sementics, in
which case, erofs will be mounted from data blobs. Users could specify
the name of primary data blob by this mount option.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Tested-by: Zichen Tian <[email protected]>
Tested-by: Jia Zhu <[email protected]>
Tested-by: Yan Song <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Implement fscache-based data readahead. Also registers an individual
bdi for each erofs instance to enable readahead.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Implement the data plane of reading data from data blobs over fscache
for inline layout.
For the heading non-inline part, the data plane for non-inline layout is
reused, while only the tail packing part needs special handling.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Implement the data plane of reading data from data blobs over fscache
for non-inline layout.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Implement the data plane of reading metadata from primary data blob
over fscache.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Similar to the multi-device mode, erofs could be mounted from one
primary data blob (mandatory) and multiple extra data blobs (optional).
Register fscache context for each extra data blob.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Registers fscache context for primary data blob. Also move the
initialization of s_op and related fields forward, since anonymous
inode will be allocated under the super block when registering the
fscache context.
Something worth mentioning about the cleanup routine.
1. The fscache context will instantiate anonymous inodes under the super
block. Release these anonymous inodes when .put_super() is called, or
we'll get "VFS: Busy inodes after unmount." warning.
2. The fscache context is initialized prior to the root inode. If
.kill_sb() is called when mount failed, .put_super() won't be called
when root inode has not been initialized yet. Thus .kill_sb() shall
also contain the cleanup routine.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Add erofs_fscache_read_folios() helper reading from fscache. It supports
on-demand read semantics. That is, it will make the backend prepare for
the data when cache miss. Once data ready, it will read from the cache.
This helper can then be used to implement .readpage()/.readahead() of
on-demand read semantics.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Introduce one anonymous inode for data blobs so that erofs can cache
metadata directly within such anonymous inode.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Introduce a context structure for managing data blobs, and helper
functions for initializing and cleaning up this context structure.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
A new fscache based mode is going to be introduced for erofs, in which
case on-demand read semantics is implemented through fscache.
As the first step, register fscache volume for each erofs filesystem.
That means, data blobs can not be shared among erofs filesystems. In the
following iteration, we are going to introduce the domain semantics, in
which case several erofs filesystems can belong to one domain, and data
blobs can be shared among these erofs filesystems of one domain.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
Until then erofs is exactly blockdev based filesystem.
A new fscache-based mode is going to be introduced for erofs to support
scenarios where on-demand read semantics is needed, e.g. container
image distribution. In this case, erofs could be mounted from data blobs
through fscache.
Add a helper checking which mode erofs works in, and twist the code in
preparation for the upcoming fscache mode.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
... so that it can be used in the following introduced fscache mode.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|