aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-07-27md: also clone new io if io accounting is disabledYu Kuai3-43/+42
Currently, 'active_io' is grabbed before make_reqeust() is called, and it's dropped immediately make_reqeust() returns. Hence 'active_io' actually means io is dispatching, not io is inflight. For raid0 and raid456 that io accounting is enabled, 'active_io' will also be grabbed when bio is cloned for io accounting, and this 'active_io' is dropped until io is done. Always clone new bio so that 'active_io' will mean that io is inflight, raid1 and raid10 will switch to use this method in later patches. Now that bio will be cloned even if io accounting is disabled, also rename related structure from '*_acct_*' to '*_clone_*'. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md: move initialization and destruction of 'io_acct_set' to md.cYu Kuai4-63/+23
'io_acct_set' is only used for raid0 and raid456, prepare to use it for raid1 and raid10, so that io accounting from different levels can be consistent. By the way, follow up patches will also use this io clone mechanism to make sure 'active_io' represents in flight io, not io that is dispatching, so that mddev_suspend will wait for io to be done as designed. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md: deprecate bitmap file supportChristoph Hellwig2-1/+3
The support for bitmaps on files is a very bad idea abusing various kernel APIs, and fundamentally requires the file to not be on the actual array without a way to check that this is actually the case. Add a deprecation warning to see if we might be able to eventually drop it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md: make bitmap file support optionalChristoph Hellwig3-0/+32
The support for write intent bitmaps in files on an external files in md is a hot mess that abuses ->bmap to map file offsets into physical device objects, and also abuses buffer_heads in a creative way. Make this code optional so that MD can be built into future kernels without buffer_head support, and so that we can eventually deprecate it. Note this does not affect the internal bitmap support, which has none of the problems. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: don't use ->index for pages backing the bitmap fileChristoph Hellwig2-27/+39
The md driver allocates pages for storing the bitmap file data, which are not page cache pages, and then stores the page granularity file offset in page->index, which is a field that isn't really valid except for page cache pages. Use a separate index for the superblock, and use the scheme used at read size to recalculate the index for the bitmap pages instead. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: account for mddev->bitmap_info.offset in read_sb_pageChristoph Hellwig1-9/+8
Diretly apply mddev->bitmap_info.offset to the sector number to read instead of doing that in both callers. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: cleanup read_sb_pageChristoph Hellwig1-12/+11
Convert read_sb_page to the normal kernel coding style, calculate the target sector only once, and add a local iosize variable to make the call to sync_page_io more readable. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: refactor md_bitmap_init_from_diskChristoph Hellwig1-71/+70
Split the confusing loop in md_bitmap_init_from_disk that iterates over all chunks but also needs to read and map the pages into three separate loops: one that iterates over the pages to read them, a second optional one to iterate over the pages to mark them invalid if the bitmaps are out of date, and a final one that actually iterates over the chunks. Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: rename read_page to read_file_pageChristoph Hellwig1-6/+4
Make the difference to read_sb_page clear. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: split file writes into a separate helperChristoph Hellwig1-24/+24
Split the file write code out of write_page into a separate helper. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: use %pD to print the file name in md_bitmap_file_kickChristoph Hellwig1-10/+2
Don't bother allocating an extra buffer in the I/O failure handler and instead use the printk built-in format to print the last 4 path name components. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: initialize variables at declaration time in md_bitmap_file_unmapChristoph Hellwig1-8/+4
Just a small tidyup to prepare for bigger changes. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md-bitmap: set BITMAP_WRITE_ERROR in write_sb_pageChristoph Hellwig1-13/+8
Set BITMAP_WRITE_ERROR directly in write_sb_page instead of propagating the error to the caller and setting it there. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md: enhance checking in md_check_recovery()Yu Kuai1-7/+15
For md_check_recovery(): 1) if 'MD_RECOVERY_RUNING' is not set, register new sync_thread. 2) if 'MD_RECOVERY_RUNING' is set: a) if 'MD_RECOVERY_DONE' is not set, don't do anything, wait for md_do_sync() to be done. b) if 'MD_RECOVERY_DONE' is set, unregister sync_thread. Current code expects that sync_thread is not NULL, otherwise new sync_thread will be registered, which will corrupt the array. Make sure md_check_recovery() won't register new sync_thread if 'MD_RECOVERY_RUNING' is still set, and a new WARN_ON_ONCE() is added for the above corruption, Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md: wake up 'resync_wait' at last in md_reap_sync_thread()Yu Kuai1-1/+1
md_reap_sync_thread() is just replaced with wait_event(resync_wait, ...) from action_store(), just make sure action_store() will still wait for everything to be done in md_reap_sync_thread(). Signed-off-by: Yu Kuai <[email protected]> Reviewd-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md: refactor idle/frozen_sync_thread() to fix deadlockYu Kuai2-4/+21
Our test found a following deadlock in raid10: 1) Issue a normal write, and such write failed: raid10_end_write_request set_bit(R10BIO_WriteError, &r10_bio->state) one_write_done reschedule_retry // later from md thread raid10d handle_write_completed list_add(&r10_bio->retry_list, &conf->bio_end_io_list) // later from md thread raid10d if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) list_move(conf->bio_end_io_list.prev, &tmp) r10_bio = list_first_entry(&tmp, struct r10bio, retry_list) raid_end_bio_io(r10_bio) Dependency chain 1: normal io is waiting for updating superblock 2) Trigger a recovery: raid10_sync_request raise_barrier Dependency chain 2: sync thread is waiting for normal io 3) echo idle/frozen to sync_action: action_store mddev_lock md_unregister_thread kthread_stop Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread 4) md thread can't update superblock: raid10d md_check_recovery if (mddev_trylock(mddev)) md_update_sb Dependency chain 4: update superblock is waiting for 'reconfig_mutex' Hence cyclic dependency exist, in order to fix the problem, we must break one of them. Dependency 1 and 2 can't be broken because they are foundation design. Dependency 4 may be possible if it can be guaranteed that no io can be inflight, however, this requires a new mechanism which seems complex. Dependency 3 is a good choice, because idle/frozen only requires sync thread to finish, which can be done asynchronously that is already implemented, and 'reconfig_mutex' is not needed anymore. This patch switch 'idle' and 'frozen' to wait sync thread to be done asynchronously, and this patch also add a sequence counter to record how many times sync thread is done, so that 'idle' won't keep waiting on new started sync thread. Noted that raid456 has similiar deadlock([1]), and it's verified[2] this deadlock can be fixed by this patch as well. [1] https://lore.kernel.org/linux-raid/[email protected]/T/#t [2] https://lore.kernel.org/linux-raid/[email protected]/ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md: add a mutex to synchronize idle and frozen in action_store()Yu Kuai2-0/+8
Currently, for idle and frozen, action_store will hold 'reconfig_mutex' and call md_reap_sync_thread() to stop sync thread, however, this will cause deadlock (explained in the next patch). In order to fix the problem, following patch will release 'reconfig_mutex' and wait on 'resync_wait', like md_set_readonly() and do_md_stop() does. Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN' unconditionally, which might cause unexpected problems, for example, frozen just set 'MD_RECOVERY_FROZEN' and is still in progress, while 'idle' clear 'MD_RECOVERY_FROZEN' and new sync thread is started, which might starve in progress frozen. A mutex is added to synchronize idle and frozen from action_store(). Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md: refactor action_store() for 'idle' and 'frozen'Yu Kuai1-16/+45
Prepare to handle 'idle' and 'frozen' differently to fix a deadlock, there are no functional changes except that MD_RECOVERY_RUNNING is checked again after 'reconfig_mutex' is held. Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27Revert "md: unlock mddev before reap sync_thread in action_store"Yu Kuai2-18/+2
This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1. Because it will introduce a defect that sync_thread can be running while MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems, for example: list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0). Call trace: __list_add_valid+0xfc/0x140 insert_work+0x78/0x1a0 __queue_work+0x500/0xcf4 queue_work_on+0xe8/0x12c md_check_recovery+0xa34/0xf30 raid10d+0xb8/0x900 [raid10] md_thread+0x16c/0x2cc kthread+0x1a4/0x1ec ret_from_fork+0x10/0x18 This is because work is requeued while it's still inside workqueue: t1: t2: action_store mddev_lock if (mddev->sync_thread) mddev_unlock md_unregister_thread // first sync_thread is done md_check_recovery mddev_try_lock /* * once MD_RECOVERY_DONE is set, new sync_thread * can start. */ set_bit(MD_RECOVERY_RUNNING, &mddev->recovery) INIT_WORK(&mddev->del_work, md_start_sync) queue_work(md_misc_wq, &mddev->del_work) test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...) // set pending bit insert_work list_add_tail mddev_unlock mddev_lock_nointr md_reap_sync_thread // MD_RECOVERY_RUNNING is cleared mddev_unlock t3: // before queued work started from t2 md_check_recovery // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started INIT_WORK(&mddev->del_work, md_start_sync) work->data = 0 // work pending bit is cleared queue_work(md_misc_wq, &mddev->del_work) insert_work list_add_tail // list is corrupted The above commit is reverted to fix the problem, the deadlock this commit tries to fix will be fixed in following patches. Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-25block: cleanup bio_integrity_prepJinyoung Choi1-5/+1
If a problem occurs in the process of creating an integrity payload, the status of bio is always BLK_STS_RESOURCE. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230725051839epcms2p8e4d20ad6c51326ad032e8406f59d0aaa@epcms2p8 Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: Improve performance for BLK_MQ_F_BLOCKING driversBart Van Assche2-7/+12
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING has been set. This is suboptimal since running the queue asynchronously is slower than running the queue synchronously. This patch modifies blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set: - Run the queue synchronously if it is allowed to sleep. - Run the queue asynchronously if it is not allowed to sleep. Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller may be invoked from atomic context. The following caller chains have been reviewed: blk_mq_run_hw_queue(hctx, false) blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */ blk_execute_rq() /* may sleep */ blk_mq_run_hw_queues(q, async=false) blk_freeze_queue_start() /* may sleep */ blk_mq_requeue_work() /* may sleep */ scsi_kick_queue() scsi_requeue_run_queue() /* may sleep */ scsi_run_host_queues() scsi_ioctl_reset() /* may sleep */ blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false) blk_mq_dispatch_plug_list(plug, from_sched=false) blk_mq_flush_plug_list(plug, from_schedule=false) __blk_flush_plug(plug, from_schedule=false) blk_add_rq_to_plug() blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */ blk_mq_plug_issue_direct() blk_mq_flush_plug_list() /* see above */ blk_mq_dispatch_plug_list(plug, from_sched=false) blk_mq_flush_plug_list() /* see above */ blk_mq_try_issue_directly() blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */ blk_mq_try_issue_list_directly(hctx, list) blk_mq_insert_requests() /* see above */ Cc: Christoph Hellwig <[email protected]> Cc: Ming Lei <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24scsi: Remove a blk_mq_run_hw_queues() callBart Van Assche1-1/+1
blk_mq_kick_requeue_list() calls blk_mq_run_hw_queues() asynchronously. Leave out the direct blk_mq_run_hw_queues() call. This patch causes scsi_run_queue() to call blk_mq_run_hw_queues() asynchronously instead of synchronously. Since scsi_run_queue() is not called from the hot I/O submission path, this patch does not affect the hot path. This patch prepares for allowing blk_mq_run_hw_queue() to sleep if BLK_MQ_F_BLOCKING has been set. scsi_run_queue() may be called from atomic context and must not sleep. Hence the removal of the blk_mq_run_hw_queues(q, false) call. See also scsi_unblock_requests(). Cc: "Martin K. Petersen" <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24scsi: Inline scsi_kick_queue()Bart Van Assche1-7/+2
Inline scsi_kick_queue() to prepare for modifying the second argument passed to blk_mq_run_hw_queues(). Reviewed-by: Christoph Hellwig <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: don't pass a bio to bio_try_merge_hw_segChristoph Hellwig1-9/+7
There is no good reason to pass the bio to bio_try_merge_hw_seg. Just pass the current bvec and rename the function to bvec_try_merge_hw_page. This will allow reusing this function for supporting multi-page integrity payload bvecs. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jinyoung Choi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: move the bi_size update out of __bio_try_merge_pageChristoph Hellwig1-37/+20
The update of bi_size is the only thing in __bio_try_merge_page that needs a bio. Move it to the callers, and merge __bio_try_merge_page and page_is_mergeable into a single bvec_try_merge_page that only takes the current bvec instead of a full bio. This will allow reusing this function for supporting multi-page integrity payload bvecs. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jinyoung Choi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: downgrade a bio_full call in bio_add_pageChristoph Hellwig1-1/+1
bio_add_page already checks that there is space in bi_size a little earlier. So after we failed to add to an existing segment, just check that there is another one available instead of duplicating the bi_size check. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jinyoung Choi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: move the bi_size overflow check in __bio_try_merge_pageChristoph Hellwig1-4/+5
Checking for availability in bi_size in a function that attempts to merge into an existing segment is a bit odd, as the limit also applies when adding a new segment. This code works fine as we always call __bio_try_merge_page, but contributes to sub-optimal calling conventions and doesn't lead to clear code. Move it to two of the callers instead, the third one already has a more strict check that includes max_hw_segments anyway. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jinyoung Choi <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: move the bi_vcnt check out of __bio_try_merge_pageChristoph Hellwig1-23/+22
Move the bi_vcnt out of __bio_try_merge_page and into the two callers that don't already have it in preparation for additional changes to __bio_try_merge_page. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jinyoung Choi <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: move the BIO_CLONED checks out of __bio_try_merge_pageChristoph Hellwig1-3/+6
__bio_try_merge_page is a way too low-level helper to assert that the bio is not cloned. Move the check into bio_add_page and bio_iov_iter_get_pages instead, which are the high level entry points that should enforce this variant. bio_add_hw_page already this check, coverig the third (indirect) caller of __bio_try_merge_page. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jinyoung Choi <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: use SECTOR_SHIFT bio_add_hw_pageChristoph Hellwig1-1/+1
Use the SECTOR_SHIFT magic constant instead of the magic number. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jinyoung Choi <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-24block: tidy up the bio full checks in bio_add_hw_pageChristoph Hellwig1-6/+4
bio_add_hw_page already checks if the number of bytes trying to be added even fit into max_hw_sectors limit of the queue. Remove the call to bio_full and just do a check for the smaller of the number of segments in the bio and the queue max segments limit, and do this cheap check before the more expensive gap to previous check. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jinyoung Choi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-20block: refactor to use helperNitesh Shetty1-3/+1
Reduce some code by making use of bio_integrity_bytes(). Signed-off-by: Nitesh Shetty <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17blk-flush: reuse rq queuelist in flush state machineChengming Zhou2-8/+10
Since we don't need to maintain inflight flush_data requests list anymore, we can reuse rq->queuelist for flush pending list. Note in mq_flush_data_end_io(), we need to re-initialize rq->queuelist before reusing it in the state machine when end, since the rq->rq_next also reuse it, may have corrupted rq->queuelist by the driver. This patch decrease the size of struct request by 16 bytes. Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17blk-flush: count inflight flush_data requestsChengming Zhou2-7/+7
The flush state machine use a double list to link all inflight flush_data requests, to avoid issuing separate post-flushes for these flush_data requests which shared PREFLUSH. So we can't reuse rq->queuelist, this is why we need rq->flush.list In preparation of the next patch that reuse rq->queuelist for flush state machine, we change the double linked list to unsigned long counter, which count all inflight flush_data requests. This is ok since we only need to know if there is any inflight flush_data request, so unsigned long counter is good. Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17blk-flush: fix rq->flush.seq for post-flush requestsChengming Zhou1-1/+1
If the policy == (REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH), it means that the data sequence and post-flush sequence need to be done for this request. The rq->flush.seq should record what sequences have been done (or don't need to be done). So in this case, pre-flush doesn't need to be done, we should init rq->flush.seq to REQ_FSEQ_PREFLUSH not REQ_FSEQ_POSTFLUSH. Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests") Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17blk-mq: use percpu csd to remote complete instead of per-rq csdChengming Zhou2-10/+7
If request need to be completed remotely, we insert it into percpu llist, and smp_call_function_single_async() if llist is empty previously. We don't need to use per-rq csd, percpu csd is enough. And the size of struct request is decreased by 24 bytes. This way is cleaner, and looks correct, given block softirq is guaranteed to be scheduled to consume the list if one new request is added to this percpu list, either smp_call_function_single_async() returns -EBUSY or 0. Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17block: don't allow enabling a cache on devices that don't support itChristoph Hellwig3-6/+13
Currently the write_cache attribute allows enabling the QUEUE_FLAG_WC flag on devices that never claimed the capability. Fix that by adding a QUEUE_FLAG_HW_WC flag that is set by blk_queue_write_cache and guards re-enabling the cache through sysfs. Note that any rescan that calls blk_queue_write_cache will still re-enable the write cache as in the current code. Fixes: 93e9d8e836cb ("block: add ability to flag write back caching on a device") Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17block: cleanup queue_wc_storeChristoph Hellwig1-11/+3
Get rid of the local queue_wc_store variable and handling setting and clearing the QUEUE_FLAG_WC flag diretly instead the if / else if. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17nbd: automatically load module on genl accessThomas Weißschuh1-0/+1
Add a module alias to nbd.ko that allows the generic netlink core to automatically load the module when netlink messages for nbd are received. This frees the user from manually having to load the module before using nbd functionality via netlink. If the system policy allows it this can even be used to load the nbd module from containers which would otherwise not have access to the necessary module files to do a normal "modprobe nbd". For example this avoids the following error when using nbd-client: $ nbd-client localhost 10809 /dev/nbd0 ... Error: Couldn't resolve the nbd netlink family, make sure the nbd module is loaded and your nbd driver supports the netlink interface. Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17blk-wbt: Replace strlcpy with strscpyAzeem Shaikh1-4/+4
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-17kyber: Replace strlcpy with strscpyAzeem Shaikh1-4/+4
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-16Linux 6.5-rc2Linus Torvalds1-1/+1
2023-07-16Merge tag 'xtensa-20230716' of https://github.com/jcmvbkbc/linux-xtensaLinus Torvalds3-22/+18
Pull xtensa fixes from Max Filippov: - fix interaction between unaligned exception handler and load/store exception handler - fix parsing ISS network interface specification string - add comment about etherdev freeing to ISS network driver * tag 'xtensa-20230716' of https://github.com/jcmvbkbc/linux-xtensa: xtensa: fix unaligned and load/store configuration interaction xtensa: ISS: fix call to split_if_spec xtensa: ISS: add comment about etherdev freeing
2023-07-16Merge tag 'perf_urgent_for_v6.5_rc2' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Fix a lockdep warning when the event given is the first one, no event group exists yet but the code still goes and iterates over event siblings * tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
2023-07-16Merge tag 'objtool_urgent_for_v6.5_rc2' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Borislav Petkov: - Mark copy_iovec_from_user() __noclone in order to prevent gcc from doing an inter-procedural optimization and confuse objtool - Initialize struct elf fully to avoid build failures * tag 'objtool_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: iov_iter: Mark copy_iovec_from_user() noclone objtool: initialize all of struct elf
2023-07-16Merge tag 'sched_urgent_for_v6.5_rc2' of ↵Linus Torvalds5-12/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Remove a cgroup from under a polling process properly - Fix the idle sibling selection * tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/psi: use kernfs polling functions for PSI trigger polling sched/fair: Use recent_used_cpu to test p->cpus_ptr
2023-07-16Merge tag 'pinctrl-v6.5-2' of ↵Linus Torvalds4-55/+63
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: "I'm mostly on vacation but what would vacation be without a few critical fixes so people can use their gaming laptops when hiding away from the sun (or rain)? - Fix a really annoying interrupt storm in the AMD driver affecting Asus TUF gaming notebooks - Fix device tree parsing in the Renesas driver" * tag 'pinctrl-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: amd: Unify debounce handling into amd_pinconf_set() pinctrl: amd: Drop pull up select configuration pinctrl: amd: Use amd_pinconf_set() for all config options pinctrl: amd: Only use special debounce behavior for GPIO 0 pinctrl: renesas: rzg2l: Handle non-unique subnode names pinctrl: renesas: rzv2m: Handle non-unique subnode names
2023-07-16Merge tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds7-31/+43
Pull smb client fixes from Steve French: - Two reconnect fixes: important fix to address inFlight count to leak (which can leak credits), and fix for better handling a deleted share - DFS fix - SMB1 cleanup fix - deferred close fix * tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix mid leak during reconnection after timeout threshold cifs: is_network_name_deleted should return a bool smb: client: fix missed ses refcounting smb: client: Fix -Wstringop-overflow issues cifs: if deferred close is disabled then close files immediately
2023-07-16Merge tag 'powerpc-6.5-3' of ↵Linus Torvalds6-46/+51
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix Speculation_Store_Bypass reporting in /proc/self/status on Power10 - Fix HPT with 4K pages since recent changes by implementing pmd_same() - Fix 64-bit native_hpte_remove() to be irq-safe Thanks to Aneesh Kumar K.V, Nageswara R Sastry, and Russell Currey. * tag 'powerpc-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size powerpc/64e: Fix obtool warnings in exceptions-64e.S powerpc/security: Fix Speculation_Store_Bypass reporting on Power10 powerpc/64s: Fix native_hpte_remove() to be irq-safe
2023-07-16Merge tag 'hardening-v6.5-rc2' of ↵Linus Torvalds6-10/+11
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Remove LTO-only suffixes from promoted global function symbols (Yonghong Song) - Remove unused .text..refcount section from vmlinux.lds.h (Petr Pavlu) - Add missing __always_inline to sparc __arch_xchg() (Arnd Bergmann) - Claim maintainership of string routines * tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: sparc: mark __arch_xchg() as __always_inline MAINTAINERS: Foolishly claim maintainership of string routines kallsyms: strip LTO-only suffixes from promoted global functions vmlinux.lds.h: Remove a reference to no longer used sections .text..refcount