aboutsummaryrefslogtreecommitdiff
path: root/drivers/md/raid5-cache.c
AgeCommit message (Collapse)AuthorFilesLines
2017-01-24md/raid5-cache: delete meaningless codeShaohua Li1-2/+0
sector_t is unsigned long, it's never < 0 Reported-by: Julia Lawall <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2017-01-05md/r5cache: fix spelling mistake on "recoverying"Colin Ian King1-1/+1
Trivial fix to spelling mistake "recoverying" to "recovering" in pr_dbg message. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2017-01-05md/r5cache: assign conf->log before r5l_load_log()Song Liu1-1/+3
r5l_load_log() calls functions that requires a proper conf->log, for example, r5c_is_writeback(). Therefore, we should set conf->log before calling r5l_load_log(). If r5l_load_log() fails, conf->log is set back to NULL. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2017-01-05md/r5cache: simplify handling of sh->log_start in recoverySong Liu1-15/+12
We only need to update sh->log_start at the end of recovery, which is r5c_recovery_rewrite_data_only_stripes(), so it is not necessary to set it before that. In this patch, log_start is removed from r5c_recovery_alloc_stripe(). After updating all sh->log_start, rewrite_data_only_stripes() also updates log->next_checkpoints to the last sh->log_start. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2017-01-05md/raid5-cache: removes unnecessary write-through mode judgmentsJackieLiu1-3/+0
The write-through mode has been returned in front of the function, do not need to do it again. Signed-off-by: JackieLiu <[email protected]> Reviewed-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-12-13Merge branch 'md-next' into md-linusShaohua Li1-214/+1619
2016-12-13Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+3
Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...
2016-12-08md: separate flags for superblock changesShaohua Li1-3/+3
The mddev->flags are used for different purposes. There are a lot of places we check/change the flags without masking unrelated flags, we could check/change unrelated flags. These usage are most for superblock write, so spearate superblock related flags. This should make the code clearer and also fix real bugs. Reviewed-by: NeilBrown <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-12-08md/r5cache: after recovery, increase journal seq by 10000Song Liu1-7/+7
Currently, we increase journal entry seq by 10 after recovery. However, this is not sufficient in the following case. After crash the journal looks like | seq+0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 | If +1 is not valid, we dropped all entries from +1 to +12; and write seq+10: | seq+0 | +10 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 | However, if we write a big journal entry with seq+11, it will connect with some stale journal entry: | seq+0 | +10 | +11 | +12 | To reduce the risk of this issue, we increase seq by 10000 instead. Shaohua: use 10000 instead of 1000. The risk should be very unlikely. The total stripe cache size is less than 2k typically, and several stripes can fit into one meta data block. So the total inflight meta data blocks would be quite small, which means the the total sequence number used should be quite small. The 10000 sequence number increase should be far more than safe. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-12-08md/raid5-cache: fix crc in rewrite_data_only_stripes()Song Liu1-4/+6
r5l_recovery_create_empty_meta_block() creates crc for the empty metablock. After the metablock is updated, we need clear the checksum before recalculate it. Shaohua: moved checksum calculation out of r5l_recovery_create_empty_meta_block. We should calculate it after all fields are updated. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-12-08md/raid5-cache: no recovery is required when create super-blockJackieLiu1-2/+8
When create the super-block information, We do not need to do this recovery stage, only need to initialize some variables. Signed-off-by: JackieLiu <[email protected]> Reviewed-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-12-05md/r5cache: do r5c_update_log_state after log recoveryZhengyuan Liu1-5/+3
We should update log state after we did a log recovery, current completion may get wrong log state since log->log_start wasn't initalized until we called r5l_recovery_log. At log recovery stage, no lock needed as there is no race conditon. next_checkpoint field will be initialized in r5l_recovery_log too. Signed-off-by: Zhengyuan Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-12-05md/raid5-cache: adjust the write position of the empty block if no data blocksJackieLiu1-4/+16
When recovery is complete, we write an empty block and record his position first, then make the data-only stripes rewritten done, the location of the empty block as the last checkpoint position to write into the super block. And we should update last_checkpoint to this empty block position. ------------------------------------------------------------------ | old log | empty block | data only stripes | invalid log | ------------------------------------------------------------------ ^ ^ ^ | |- log->last_checkpoint |- log->log_start | |- log->last_cp_seq |- log->next_checkpoint |- log->seq=n |- log->seq=10+n At the same time, if there is no data-only stripes, this scene may appear, | meta1 | meta2 | meta3 | meta 1 is valid, meta 2 is invalid. meta 3 could be valid. so we should The solution is we create a new meta in meta2 with its seq == meta1's seq + 10 and let superblock points to meta2. Signed-off-by: JackieLiu <[email protected]> Reviewed-by: Zhengyuan Liu <[email protected]> Reviewed-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-12-02md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0Song Liu1-1/+13
With writeback cache, we define log space critical as free_space < 2 * reclaim_required_space So the deassert of R5C_LOG_CRITICAL could happen when 1. free_space increases 2. reclaim_required_space decreases Currently, run_no_space_stripes() is called when 1 happens, but not (always) when 2 happens. With this patch, run_no_space_stripes() is call when R5C_LOG_CRITICAL is cleared. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-29md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedlyJackieLiu1-2/+0
R5c_make_stripe_write_out has set this flag, do not need to set again. Signed-off-by: JackieLiu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-29md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_logJackieLiu1-2/+0
The next_cp_seq field is useless, remove it. Signed-off-by: JackieLiu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-29md/raid5-cache: release the stripe_head at the appropriate locationJackieLiu1-6/+7
If we released the 'stripe_head' in r5c_recovery_flush_log, ctx->cached_list will both release the data-parity stripes and data-only stripes, which will become empty. And we also need to use the data-only stripes in r5c_recovery_rewrite_data_only_stripes, so we should wait util rewrite data-only stripes is done before releasing them. Reviewed-by: Zhengyuan Liu <[email protected]> Reviewed-by: Song Liu <[email protected]> Signed-off-by: JackieLiu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-29md/raid5-cache: use ring add to prevent overflowJackieLiu1-1/+1
'write_pos' must be protected with 'r5l_ring_add', or it may overflow Signed-off-by: JackieLiu <[email protected]> Reviewed-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-29md/raid5-cache: remove unnecessary function parametersJackieLiu1-8/+4
The function parameter 'recovery_list' is not used in body, we can delete it Signed-off-by: JackieLiu <[email protected]> Reviewed-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-29raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into ↵Zhengyuan Liu1-3/+1
cache r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine would release the stripe later and add it into neither r5c_cached_full_stripes list or r5c_cached_partial_stripes list and set correct flag. Reviewed-by: JackieLiu <[email protected]> Signed-off-by: Zhengyuan Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-29raid5-cache: add another check conditon before replaying one stripeZhengyuan Liu1-2/+2
New stripe that was just allocated has no STRIPE_R5C_CACHING state too, add this check condition could avoid unnecessary replaying for empty stripe. r5l_recovery_replay_one_stripe would reset stripe for any case, delete it to make code more clean. Signed-off-by: Zhengyuan Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-27md/r5cache: enable IRQs on error pathDan Carpenter1-1/+1
We need to re-enable the IRQs here before returning. Fixes: a39f7afde358 ("md/r5cache: write-out phase and reclaim support") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-27md/r5cache: handle alloc_page failureSong Liu1-1/+26
RMW of r5c write back cache uses an extra page to store old data for prexor. handle_stripe_dirtying() allocates this page by calling alloc_page(). However, alloc_page() may fail. To handle alloc_page() failures, this patch adds an extra page to disk_info. When alloc_page fails, handle_stripe() trys to use these pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE), the stripe is added to delayed_list. Signed-off-by: Song Liu <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-23raid5-cache: suspend reclaim thread instead of shutdownShaohua Li1-13/+5
There is mechanism to suspend a kernel thread. Use it instead of playing create/destroy game. Signed-off-by: Shaohua Li <[email protected]> Reviewed-by: NeilBrown <[email protected]> Cc: Song Liu <[email protected]>
2016-11-22block: bio: pass bvec table to bio_init()Ming Lei1-1/+1
Some drivers often use external bvec table, so introduce this helper for this case. It is always safe to access the bio->bi_io_vec in this way for this case. After converting to this usage, it will becomes a bit easier to evaluate the remaining direct access to bio->bi_io_vec, so it can help to prepare for the following multipage bvec support. Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Fixed up the new O_DIRECT cases. Signed-off-by: Jens Axboe <[email protected]>
2016-11-18md/r5cache: handle FLUSH and FUASong Liu1-18/+144
With raid5 cache, we committing data from journal device. When there is flush request, we need to flush journal device's cache. This was not needed in raid5 journal, because we will flush the journal before committing data to raid disks. This is similar to FUA, except that we also need flush journal for FUA. Otherwise, corruptions in earlier meta data will stop recovery from reaching FUA data. slightly changed the code by Shaohua Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-18md/r5cache: r5cache recovery: part 2Song Liu1-186/+24
1. In previous patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In this patchpatch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-18md/r5cache: r5cache recovery: part 1Song Liu1-0/+602
Recovery of write-back cache has different logic to write-through only cache. Specifically, for write-back cache, the recovery need to scan through all active journal entries before flushing data out. Therefore, large portion of the recovery logic is rewritten here. To make the diffs cleaner, we split the rewrite as follows: 1. In this patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In next patch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions With cache feature, there are 2 different scenarios of recovery: 1. Data-Parity stripe: a stripe with complete parity in journal. 2. Data-Only stripe: a stripe with only data in journal (or partial parity). The code differentiate Data-Parity stripe from Data-Only stripe with flag STRIPE_R5C_CACHING. For Data-Parity stripes, we use the same procedure as raid5 journal, where all the data and parity are replayed to the RAID devices. For Data-Only strips, we need to finish complete calculate parity and finish the full reconstruct write or RMW write. For simplicity, in the recovery, we load the stripe to stripe cache. Once the array is started, the stripe cache state machine will handle these stripes through normal write path. r5c_recovery_flush_log contains the main procedure of recovery. The recovery code first scans through the journal and loads data to stripe cache. The code keeps tracks of all these stripes in a list (use sh->lru and ctx->cached_list), stripes in the list are organized in the order of its first appearance on the journal. During the scan, the recovery code assesses each stripe as Data-Parity or Data-Only. During scan, the array may run out of stripe cache. In these cases, the recovery code will also call raid5_set_cache_size to increase stripe cache size. If the array still runs out of stripe cache because there isn't enough memory, the array will not assemble. At the end of scan, the recovery code replays all Data-Parity stripes, and sets proper states for Data-Only stripes. The recovery code also increases seq number by 10 and rewrites all Data-Only stripes to journal. This is to avoid confusion after repeated crashes. More details is explained in raid5-cache.c before r5c_recovery_rewrite_data_only_stripes(). Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-18md/r5cache: refactoring journal recovery codeSong Liu1-9/+18
1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block(); 2. pull the code that initialize r5l_meta_block from r5l_log_write_empty_meta_block() to a separate function r5l_recovery_create_empty_meta_block(), so that we can reuse this piece of code. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-18md/r5cache: sysfs entry journal_modeSong Liu1-0/+65
With write cache, journal_mode is the knob to switch between write-back and write-through. Below is an example: root@virt-test:~/# cat /sys/block/md0/md/journal_mode [write-through] write-back root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode root@virt-test:~/# cat /sys/block/md0/md/journal_mode write-through [write-back] Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-18md/r5cache: write-out phase and reclaim supportSong Liu1-30/+381
There are two limited resources, stripe cache and journal disk space. For better performance, we priotize reclaim of full stripe writes. To free up more journal space, we free earliest data on the journal. In current implementation, reclaim happens when: 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim if there is no reclaim in the past 5 seconds. 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes, or cached stripes is enough for a full stripe (chunk size / 4k) (r5c_check_cached_full_stripe) 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage) 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data) r5c_do_reclaim() contains new logic of reclaim. For stripe cache: When stripe cache pressure is high (more than 3/4 stripes are cached, or there is empty inactive lists), flush all full stripe. If fewer than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes are flushed, flush some paritial stripes. When stripe cache pressure is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes. For log space: To avoid deadlock due to log space, we need to reserve enough space to flush cached data. The size of required log space depends on total number of cached stripes (stripe_in_journal_count). In current implementation, the writing-out phase automatically include pending data writes with parity writes (similar to write through case). Therefore, we need up to (conf->raid_disks + 1) pages for each cached stripe (1 page for meta data, raid_disks pages for all data and parity). r5c_log_required_to_flush_cache() calculates log space required to flush cache. In the following, we refer to the space calculated by r5c_log_required_to_flush_cache() as reclaim_required_space. Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL is set when free space on the log device is less than 2x of reclaim_required_space. r5c_cache keeps all data in cache (not fully committed to RAID) in a list (stripe_in_journal_list). These stripes are in the order of their first appearance on the journal. So the log tail (last_checkpoint) should point to the journal_start of the first item in the list. When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is set, the state machine only writes data that are already in the log device (in stripe_in_journal_list). This patch includes a fix to improve performance by Shaohua Li <[email protected]>. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-18md/r5cache: caching phase of r5cacheSong Liu1-9/+233
As described in previous patch, write back cache operates in two phases: caching and writing-out. The caching phase works as: 1. write data to journal (r5c_handle_stripe_dirtying, r5c_cache_data) 2. call bio_endio (r5c_handle_data_cached, r5c_return_dev_pending_writes). Then the writing-out phase is as: 1. Mark the stripe as write-out (r5c_make_stripe_write_out) 2. Calcualte parity (reconstruct or RMW) 3. Write parity (and maybe some other data) to journal device 4. Write data and parity to RAID disks This patch implements caching phase. The cache is integrated with stripe cache of raid456. It leverages code of r5l_log to write data to journal device. Writing-out phase of the cache is implemented in the next patch. With r5cache, write operation does not wait for parity calculation and write out, so the write latency is lower (1 write to journal device vs. read and then write to raid disks). Also, r5cache will reduce RAID overhead (multipile IO due to read-modify-write of parity) and provide more opportunities of full stripe writes. This patch adds 2 flags to stripe_head.state: - STRIPE_R5C_PARTIAL_STRIPE, - STRIPE_R5C_FULL_STRIPE, Instead of inactive_list, stripes with cached data are tracked in r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list. STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for stripes in these lists. Note: stripes in r5c_full/partial_stripe_list are not considered as "active". For RMW, the code allocates an extra page for each data block being updated. This is stored in r5dev->orig_page and the old data is read into it. Then the prexor calculation subtracts ->orig_page from the parity block, and the reconstruct calculation adds the ->page data back into the parity block. r5cache naturally excludes SkipCopy. When the array has write back cache, async_copy_data() will not skip copy. There are some known limitations of the cache implementation: 1. Write cache only covers full page writes (R5_OVERWRITE). Writes of smaller granularity are write through. 2. Only one log io (sh->log_io) for each stripe at anytime. Later writes for the same stripe have to wait. This can be improved by moving log_io to r5dev. 3. With writeback cache, read path must enter state machine, which is a significant bottleneck for some workloads. 4. There is no per stripe checkpoint (with r5l_payload_flush) in the log, so recovery code has to replay more than necessary data (sometimes all the log from last_checkpoint). This reduces availability of the array. This patch includes a fix proposed by ZhengYuan Liu <[email protected]> Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-18md/r5cache: State machine for raid5-cache write back modeSong Liu1-3/+140
This patch adds state machine for raid5-cache. With log device, the raid456 array could operate in two different modes (r5c_journal_mode): - write-back (R5C_MODE_WRITE_BACK) - write-through (R5C_MODE_WRITE_THROUGH) Existing code of raid5-cache only has write-through mode. For write-back cache, it is necessary to extend the state machine. With write-back cache, every stripe could operate in two different phases: - caching - writing-out In caching phase, the stripe handles writes as: - write to journal - return IO In writing-out phase, the stripe behaviors as a stripe in write through mode R5C_MODE_WRITE_THROUGH. STRIPE_R5C_CACHING is added to sh->state to differentiate caching and writing-out phase. Please note: this is a "no-op" patch for raid5-cache write-through mode. The following detailed explanation is copied from the raid5-cache.c: /* * raid5 cache state machine * * With rhe RAID cache, each stripe works in two phases: * - caching phase * - writing-out phase * * These two phases are controlled by bit STRIPE_R5C_CACHING: * if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase * if STRIPE_R5C_CACHING == 1, the stripe is in caching phase * * When there is no journal, or the journal is in write-through mode, * the stripe is always in writing-out phase. * * For write-back journal, the stripe is sent to caching phase on write * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off * the write-out phase by clearing STRIPE_R5C_CACHING. * * Stripes in caching phase do not write the raid disks. Instead, all * writes are committed from the log device. Therefore, a stripe in * caching phase handles writes as: * - write to log device * - return IO * * Stripes in writing-out phase handle writes as: * - calculate parity * - write pending data and parity to journal * - write data and parity to raid disks * - return IO for pending writes */ Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-18md/r5cache: Check array size in r5l_init_logSong Liu1-10/+16
Currently, r5l_write_stripe checks meta size for each stripe write, which is not necessary. With this patch, r5l_init_log checks maximal meta size of the array, which is (r5l_meta_block + raid_disks x r5l_payload_data_parity). If this is too big to fit in one page, r5l_init_log aborts. With current meta data, r5l_log support raid_disks up to 203. Signed-off-by: Song Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-17raid5-cache: fix lockdep warningShaohua Li1-2/+14
lockdep reports warning of the rcu_dereference usage. Using normal rdev access pattern to avoid the warning. Signed-off-by: Shaohua Li <[email protected]>
2016-11-07raid5-cache: restrict the use area of the log_offset variableJackieLiu1-11/+8
We can calculate this offset by using ctx->meta_total_blocks, without passing in from the function Signed-off-by: JackieLiu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-11-01block,fs: use REQ_* flags directlyChristoph Hellwig1-2/+2
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-10-28raid5-cache: correct condition for empty metadata writeShaohua Li1-1/+1
As long as we recover one metadata block, we should write the empty metadata write. The original code could make recovery corrupted if only one meta is valid. Reported-by: Zhengyuan Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-10-24md/raid5: write an empty meta-block when creating log super-blockZhengyuan Liu1-0/+1
If superblock points to an invalid meta block, r5l_load_log will set create_super with true and create an new superblock, this runtime path would always happen if we do no writing I/O to this array since it was created. Writing an empty meta block could avoid this unnecessary action at the first time we created log superblock. Another reason is for the corretness of log recovery. Currently we have bellow code to guarantee log revocery to be correct. if (ctx.seq > log->last_cp_seq + 1) { int ret; ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10); if (ret) return ret; log->seq = ctx.seq + 11; log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS); r5l_write_super(log, ctx.pos); } else { log->log_start = ctx.pos; log->seq = ctx.seq; } If we just created a array with a journal device, log->log_start and log->last_checkpoint should all be 0, then we write three meta block which are valid except mid one and supposed crash happened. The ctx.seq would equal to log->last_cp_seq + 1 and log->log_start would be set to position of mid invalid meta block after we did a recovery, this will lead to problems which could be avoided with this patch. Signed-off-by: Zhengyuan Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-10-24md/raid5: initialize next_checkpoint field before useZhengyuan Liu1-0/+3
No initial operation was done to this field when we load/recovery the log, it got assignment only when IO to raid disk was finished. So r5l_quiesce may use wrong next_checkpoint to reclaim log space, that would make reclaimable space calculation confused. Signed-off-by: Zhengyuan Liu <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-08-31raid5-cache: fix a deadlock in superblock writeShaohua Li1-31/+15
There is a potential deadlock in superblock write. Discard could zero data, so before discard we must make sure superblock is updated to new log tail. Updating superblock (either directly call md_update_sb() or depend on md thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called with reconfig_mutex hold. The first step of raid5_quiesce() is waitting for all IO finish, hence waitting for reclaim thread, while reclaim thread is calling this function and waitting for reconfig mutex. So there is a deadlock. We workaround this issue with a trylock. The downside of the solution is we could miss discard if we can't take reconfig mutex. But this should happen rarely (mainly in raid array stop), so miss discard shouldn't be a big problem. Cc: NeilBrown <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-08-07block: rename bio bi_rw to bi_opfJens Axboe1-1/+1
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <[email protected]>
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie1-1/+1
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-06-07md: use bio op accessorsMike Christie1-10/+16
Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-06-07block/fs/drivers: remove rw argument from submit_bioMike Christie1-3/+4
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <[email protected]> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <[email protected]>
2016-05-19Merge tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds1-2/+2
Pull MD updates from Shaohua Li: "Several patches from Guoqing fixing md-cluster bugs and several patches from Heinz fixing dm-raid bugs" * tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md-cluster: check the return value of process_recvd_msg md-cluster: gather resync infos and enable recv_thread after bitmap is ready md: set MD_CHANGE_PENDING in a atomic region md: raid5: add prerequisite to run underneath dm-raid md: raid10: add prerequisite to run underneath dm-raid md: md.c: fix oops in mddev_suspend for raid0 md-cluster: fix ifnullfree.cocci warnings md-cluster/bitmap: unplug bitmap to sync dirty pages to disk md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit md-cluster/bitmap: fix wrong calcuation of offset md-cluster: sync bitmap when node received RESYNCING msg md-cluster: always setup in-memory bitmap md-cluster: wakeup thread if activated a spare disk md-cluster: change array_sectors and update size are not supported md-cluster: fix locking when node joins cluster during message broadcast md-cluster: unregister thread if err happened md-cluster: wake up thread to continue recovery md-cluser: make resync_finish only called after pers->sync_request md-cluster: change resync lock from asynchronous to synchronous
2016-05-09md: set MD_CHANGE_PENDING in a atomic regionGuoqing Jiang1-2/+2
Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Sasha Levin <[email protected]> Cc: <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: Guoqing Jiang <[email protected]> Signed-off-by: Shaohua Li <[email protected]>
2016-04-13block: kill off q->flush_flagsJens Axboe1-1/+2
Now that we converted everything to the newer block write cache interface, kill off the queue flush_flags and queueable flush entries. Signed-off-by: Jens Axboe <[email protected]>
2016-01-14raid5-cache: handle journal hotadd in quiesceShaohua Li1-0/+7
Handle journal hotadd in quiesce to avoid creating duplicated threads. Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: NeilBrown <[email protected]>
2016-01-14md: set MD_HAS_JOURNAL in correct placesShaohua Li1-0/+1
Set MD_HAS_JOURNAL when a array is loaded or journal is initialized. This is to avoid the flags set too early in journal disk hotadd. Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: NeilBrown <[email protected]>