aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/volumes.c
AgeCommit message (Collapse)AuthorFilesLines
2023-02-15btrfs: fix spelling mistakes found using codespellColin Ian King1-1/+1
There quite a few spelling mistakes as found using codespell. Fix them. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-09btrfs: free device in btrfs_close_devices for a single device filesystemAnand Jain1-1/+15
We have this check to make sure we don't accidentally add older devices that may have disappeared and re-appeared with an older generation from being added to an fs_devices (such as a replace source device). This makes sense, we don't want stale disks in our file system. However for single disks this doesn't really make sense. I've seen this in testing, but I was provided a reproducer from a project that builds btrfs images on loopback devices. The loopback device gets cached with the new generation, and then if it is re-used to generate a new file system we'll fail to mount it because the new fs is "older" than what we have in cache. Fix this by freeing the cache when closing the device for a single device filesystem. This will ensure that the mount command passed device path is scanned successfully during the next mount. CC: stable@vger.kernel.org # 5.10+ Reported-by: Daan De Meyer <daandemeyer@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-25btrfs: limit device extents to the device sizeJosef Bacik1-1/+5
There was a recent regression in btrfs/177 that started happening with the size class patches ("btrfs: introduce size class to block group allocator"). This however isn't a regression introduced by those patches, but rather the bug was uncovered by a change in behavior in these patches. The patches triggered more chunk allocations in the ^free-space-tree case, which uncovered a race with device shrink. The problem is we will set the device total size to the new size, and use this to find a hole for a device extent. However during shrink we may have device extents allocated past this range, so we could potentially find a hole in a range past our new shrink size. We don't actually limit our found extent to the device size anywhere, we assume that we will not find a hole past our device size. This isn't true with shrink as we're relocating block groups and thus creating holes past the device size. Fix this by making sure we do not search past the new device size, and if we wander into any device extents that start after our device size simply break from the loop and use whatever hole we've already found. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-16btrfs: stop using write_one_page in btrfs_scratch_superblockChristoph Hellwig1-9/+8
write_one_page is an awkward interface that expects the page locked and ->writepage to be implemented. Replace that by zeroing the signature bytes and synchronize the block device page using the proper bdev helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-16btrfs: factor out scratching of one regular super blockChristoph Hellwig1-25/+26
btrfs_scratch_superblocks open codes scratching super block of a non-zoned super block. Split the code to read, zero and write the superblock for regular devices into a separate helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11btrfs: add extra error messages to cover non-ENOMEM errors from ↵Qu Wenruo1-1/+10
device_add_list() [BUG] When test case btrfs/219 (aka, mount a registered device but with a lower generation) failed, there is not any useful information for the end user to find out what's going wrong. The mount failure just looks like this: # mount -o loop /tmp/219.img2 /mnt/btrfs/ mount: /mnt/btrfs: mount(2) system call failed: File exists. dmesg(1) may have more information after failed mount system call. While the dmesg contains nothing but the loop device change: loop1: detected capacity change from 0 to 524288 [CAUSE] In device_list_add() we have a lot of extra checks to reject invalid cases. That function also contains the regular device scan result like the following prompt: BTRFS: device fsid 6222333e-f9f1-47e6-b306-55ddd4dcaef4 devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (3027) But unfortunately not all errors have their own error messages, thus if we hit something wrong in device_add_list(), there may be no error messages at all. [FIX] Add errors message for all non-ENOMEM errors. For ENOMEM, I'd say we're in a much worse situation, and there should be some OOM messages way before our call sites. CC: stable@vger.kernel.org # 6.0+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: fix extent map use-after-free when handling missing device in ↵void0red1-1/+2
read_one_chunk Store the error code before freeing the extent_map. Though it's reference counted structure, in that function it's the first and last allocation so this would lead to a potential use-after-free. The error can happen eg. when chunk is stored on a missing device and the degraded mount option is missing. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216721 Reported-by: eriri <1527030098@qq.com> Fixes: adfb69af7d8c ("btrfs: add_missing_dev() should return the actual error") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: void0red <void0red@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: split the bio submission path into a separate fileChristoph Hellwig1-291/+5
The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: allocate btrfs_io_context without GFP_NOFAILLi zeming1-1/+4
The __GFP_NOFAIL flag could loop indefinitely when allocation memory in alloc_btrfs_io_context. The callers starting from __btrfs_map_block already handle errors so it's safe to drop the flag. Signed-off-by: Li zeming <zeming@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use btrfs_dev_name() helper to handle missing devices betterQu Wenruo1-8/+8
[BUG] If dev-replace failed to re-construct its data/metadata, the kernel message would be incorrect for the missing device: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault) Note the above "dev (efault)" of the second line. While the first line is properly reporting "<missing disk>". [CAUSE] Although dev-replace is using btrfs_dev_name(), the heavy lifting work is still done by scrub (scrub is reused by both dev-replace and regular scrub). Unfortunately scrub code never uses btrfs_dev_name() helper, as it's only declared locally inside dev-replace.c. [FIX] Fix the output by: - Move the btrfs_dev_name() helper to volumes.h - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls Only zoned code is not touched, as I'm not familiar with degraded zoned code. - Constify return value and parameter Now the output looks pretty sane: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move device->name RCU allocation and assign to btrfs_alloc_device()Anand Jain1-37/+31
There is a repeating code section in the parent function after calling btrfs_alloc_device(), as below: name = rcu_string_strdup(path, GFP_...); if (!name) { btrfs_free_device(device); return ERR_PTR(-ENOMEM); } rcu_assign_pointer(device->name, name); Except in add_missing_dev() for obvious reasons. This patch consolidates that repeating code into the btrfs_alloc_device() itself so that the parent function doesn't have to duplicate code. This consolidation also helps to review issues regarding RCU lock violation with device->name. Parent function device_list_add() and add_missing_dev() use GFP_NOFS for the allocation, whereas the rest of the parent functions use GFP_KERNEL, so bring the NOFS allocation context using memalloc_nofs_save() in the function device_list_add() and add_missing_dev() is already doing it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: drop private_data parameter from extent_io_tree_initDavid Sterba1-2/+1
All callers except one pass NULL, so the parameter can be dropped and the inode::io_tree initialization can be open coded. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: simplify percent calculation helpers, rename div_factorDavid Sterba1-7/+5
The div_factor* helpers calculate fraction or percentage fraction. The name is a bit confusing, we use it only for percentage calculations and there are two helpers. There's a helper mult_frac that's for general fractions, that tries to be accurate but we multiply and divide by small numbers so we can use the div_u64 helper. Rename the div_factor* helpers and use 1..100 percentage range, also drop the case checking for percentage == 100, it's never hit. The conversions: * div_factor calculates tenths and the numbers need to be adjusted * div_factor_fine is direct replacement Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move super_block specific helpers into super.hJosef Bacik1-0/+1
This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move scrub prototypes into scrub.hJosef Bacik1-0/+1
Move these out of ctree.h into scrub.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move relocation prototypes into relocation.hJosef Bacik1-0/+1
Move these out of ctree.h into relocation.h to cut down on code in ctree.h Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move ioctl prototypes into ioctl.hJosef Bacik1-0/+1
Move these out of ctree.h into ioctl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move uuid tree prototypes to uuid-tree.hJosef Bacik1-0/+1
Move these out of ctree.h into uuid-tree.h to cut down on the code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: update function commentsDavid Sterba1-20/+23
Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move accessor helpers into accessors.hJosef Bacik1-0/+1
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move fs wide helpers out of ctree.hJosef Bacik1-0/+1
We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: auto enable discard=async when possibleDavid Sterba1-0/+3
There's a request to automatically enable async discard for capable devices. We can do that, the async mode is designed to wait for larger freed extents and is not intrusive, with limits to iops, kbps or latency. The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . The automatic selection is done if there's at least one discard capable device in the filesystem (not capable devices are skipped). Mounting with any other discard option will honor that option, notably mounting with nodiscard will keep it disabled. Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-07btrfs: zoned: initialize device's zone info for seedingJohannes Thumshirn1-2/+9
When performing seeding on a zoned filesystem it is necessary to initialize each zoned device's btrfs_zoned_device_info structure, otherwise mounting the filesystem will cause a NULL pointer dereference. This was uncovered by fstests' testcase btrfs/163. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-07btrfs: zoned: clone zoned device info when cloning a deviceJohannes Thumshirn1-0/+12
When cloning a btrfs_device, we're not cloning the associated btrfs_zoned_device_info structure of the device in case of a zoned filesystem. Later on this leads to a NULL pointer dereference when accessing the device's zone_info for instance when setting a zone as active. This was uncovered by fstests' testcase btrfs/161. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-07btrfs: fix match incorrectly in dev_args_match_deviceLiu Shixin1-8/+8
syzkaller found a failed assertion: assertion failed: (args->devid != (u64)-1) || args->missing, in fs/btrfs/volumes.c:6921 This can be triggered when we set devid to (u64)-1 by ioctl. In this case, the match of devid will be skipped and the match of device may succeed incorrectly. Patch 562d7b1512f7 introduced this function which is used to match device. This function contains two matching scenarios, we can distinguish them by checking the value of args->missing rather than check whether args->devid and args->uuid is default value. Reported-by: syzbot+031687116258450f9853@syzkaller.appspotmail.com Fixes: 562d7b1512f7 ("btrfs: handle device lookup with btrfs_dev_lookup_args") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-25btrfs: don't use btrfs_chunk::sub_stripes from diskQu Wenruo1-1/+11
[BUG] There are two reports (the earliest one from LKP, a more recent one from kernel bugzilla) that we can have some chunks with 0 as sub_stripes. This will cause divide-by-zero errors at btrfs_rmap_block, which is introduced by a recent kernel patch ac0677348f3c ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block"): if (map->type & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10)) { stripe_nr = stripe_nr * map->num_stripes + i; stripe_nr = div_u64(stripe_nr, map->sub_stripes); <<< } [CAUSE] From the more recent report, it has been proven that we have some chunks with 0 as sub_stripes, mostly caused by older mkfs. It turns out that the mkfs.btrfs fix is only introduced in 6718ab4d33aa ("btrfs-progs: Initialize sub_stripes to 1 in btrfs_alloc_data_chunk") which is included in v5.4 btrfs-progs release. So there would be quite some old filesystems with such 0 sub_stripes. [FIX] Just don't trust the sub_stripes values from disk. We have a trusted btrfs_raid_array[] to fetch the correct sub_stripes numbers for each profile and that are fixed. By this, we can keep the compatibility with older filesystems while still avoid divide-by-zero bugs. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Viktor Kuzmin <kvaster@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216559 Fixes: ac0677348f3c ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block") CC: stable@vger.kernel.org # 6.0 Reviewed-by: Su Yue <glass@fydeos.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: check superblock to ensure the fs was not modified at thaw timeQu Wenruo1-1/+1
[BACKGROUND] There is an incident report that, one user hibernated the system, with one btrfs on removable device still mounted. Then by some incident, the btrfs got mounted and modified by another system/OS, then back to the hibernated system. After resuming from the hibernation, new write happened into the victim btrfs. Now the fs is completely broken, since the underlying btrfs is no longer the same one before the hibernation, and the user lost their data due to various transid mismatch. [REPRODUCER] We can emulate the situation using the following small script: truncate -s 1G $dev mkfs.btrfs -f $dev mount $dev $mnt fsstress -w -d $mnt -n 500 sync xfs_freeze -f $mnt cp $dev $dev.backup # There is no way to mount the same cloned fs on the same system, # as the conflicting fsid will be rejected by btrfs. # Thus here we have to wipe the fs using a different btrfs. mkfs.btrfs -f $dev.backup dd if=$dev.backup of=$dev bs=1M xfs_freeze -u $mnt fsstress -w -d $mnt -n 20 umount $mnt btrfs check $dev The final fsck will fail due to some tree blocks has incorrect fsid. This is enough to emulate the problem hit by the unfortunate user. [ENHANCEMENT] Although such case should not be that common, it can still happen from time to time. From the view of btrfs, we can detect any unexpected super block change, and if there is any unexpected change, we just mark the fs read-only, and thaw the fs. By this we can limit the damage to minimal, and I hope no one would lose their data by this anymore. Suggested-by: Goffredo Baroncelli <kreijack@libero.it> Link: https://lore.kernel.org/linux-btrfs/83bf3b4b-7f4c-387a-b286-9251e3991e34@bluemole.com/ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: stop allocation a btrfs_io_context for simple I/OChristoph Hellwig1-37/+57
The I/O context structure is only used to pass the btrfs_device to the end I/O handler for I/Os that go to a single device. Stop allocating the I/O context for these cases by passing the optional btrfs_io_stripe argument to __btrfs_map_block to query the mapping information and then using a fast path submission and I/O completion handler. As the old btrfs_io_context based I/O submission path is only used for mirrored writes, rename the functions to make that clear and stop setting the btrfs_bio device and mirror_num field that is only used for reads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: add fast path for single device io in __btrfs_map_blockChristoph Hellwig1-14/+46
There is no need for most of the btrfs_io_context when doing I/O to a single device. To support such I/O without the extra btrfs_io_context allocation, turn the mirror_num argument into a pointer so that it can be used to output the selected mirror number, and add an optional argument that points to a btrfs_io_stripe structure, which will be filled with a single extent if provided by the caller. In that case the btrfs_io_context allocation can be skipped as all information for the single device I/O is provided in the mirror_num argument and the on-stack btrfs_io_stripe. A caller that makes use of this new argument will be added in the next commit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: decide bio cloning inside submit_stripe_bioChristoph Hellwig1-9/+5
Remove the orig_bio argument as it can be derived from the bioc, and the clone argument as it can be calculated from bioc and dev_nr. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: factor out low-level bio setup from submit_stripe_bioChristoph Hellwig1-23/+28
Split out a low-level btrfs_submit_dev_bio helper that just submits the bio without any cloning decisions or setting up the end I/O handler for future reuse by a different caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: give struct btrfs_bio a real end_io handlerChristoph Hellwig1-16/+14
Currently btrfs_bio end I/O handling is a bit of a mess. The bi_end_io handler and bi_private pointer of the embedded struct bio are both used to handle the completion of the high-level btrfs_bio and for the I/O completion for the low-level device that the embedded bio ends up being sent to. To support this bi_end_io and bi_private are saved into the btrfs_io_context structure and then restored after the bio sent to the underlying device has completed the actual I/O. Untangle this by adding an end I/O handler and private data to struct btrfs_bio for the high-level btrfs_bio based completions, and leave the actual bio bi_end_io handler and bi_private pointer entirely to the low-level device I/O. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: properly abstract the parity raid bio handlingChristoph Hellwig1-1/+17
The parity raid write/recover functionality is currently not very well abstracted from the bio submission and completion handling in volumes.c: - the raid56 code directly completes the original btrfs_bio fed into btrfs_submit_bio instead of dispatching back to volumes.c - the raid56 code consumes the bioc and bio_counter references taken by volumes.c, which also leads to special casing of the calls from the scrub code into the raid56 code To fix this up supply a bi_end_io handler that calls back into the volumes.c machinery, which then puts the bioc, decrements the bio_counter and completes the original bio, and updates the scrub code to also take ownership of the bioc and bio_counter in all cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: use chained bios when cloningChristoph Hellwig1-44/+50
The stripes_pending in the btrfs_io_context counts number of inflight low-level bios for an upper btrfs_bio. For reads this is generally one as reads are never cloned, while for writes we can trivially use the bio remaining mechanisms that is used for chained bios. To be able to make use of that mechanism, split out a separate trivial end_io handler for the cloned bios that does a minimal amount of error tracking and which then calls bio_endio on the original bio to transfer control to that, with the remaining counter making sure it is completed last. This then allows to merge btrfs_end_bioc into the original bio bi_end_io handler. To make this all work all error handling needs to happen through the bi_end_io handler, which requires a small amount of reshuffling in submit_stripe_bio so that the bio is cloned already by the time the suitability of the device is checked. This reduces the size of the btrfs_io_context and prepares splitting the btrfs_bio at the stripe boundary. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: don't take a bio_counter reference for cloned biosChristoph Hellwig1-4/+2
Stop grabbing an extra bio_counter reference for each clone bio in a mirrored write and instead just release the one original reference in btrfs_end_bioc once all the bios for a single btrfs_bio have completed instead of at the end of btrfs_submit_bio once all bios have been submitted. This means the reference is now carried by the "upper" btrfs_bio only instead of each lower bio. Also remove the now unused btrfs_bio_counter_inc_noblocked helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: pass the operation to btrfs_bio_allocChristoph Hellwig1-2/+2
Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset set does. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: move btrfs_bio allocation to volumes.cChristoph Hellwig1-0/+58
volumes.c is the place that implements the storage layer using the btrfs_bio structure, so move the bio_set and allocation helpers there as well. To make up for the new initialization boilerplate, merge the two init/exit helpers in extent_io.c into a single one. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove lock protection for BLOCK_GROUP_FLAG_RELOCATING_REPAIRJosef Bacik1-3/+0
Before when this was modifying the bit field we had to protect it with the bg->lock, however now we're using bit helpers so we can stop using the bg->lock. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPYJosef Bacik1-2/+0
We use this during device replace for zoned devices, we were simply taking the lock because it was in a bit field and we needed the lock to be safe with other modifications in the bitfield. With the bit helpers we no longer require that locking. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: convert block group bit field to use bit helpersJosef Bacik1-5/+4
We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and makes the code clunky for a few of these flags. Convert these to a properly flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-06btrfs: fix the max chunk size and stripe length calculationQu Wenruo1-0/+3
[BEHAVIOR CHANGE] Since commit f6fca3917b4d ("btrfs: store chunk size in space-info struct"), btrfs no longer can create larger data chunks than 1G: mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4 mount $dev1 $mnt btrfs balance start --full $mnt btrfs balance start --full $mnt umount $mnt btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2 Before that offending commit, what we got is a 4G data chunk: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176 length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 Now what we got is only 1G data chunk: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176 length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 This will increase the number of data chunks by the number of devices, not only increase system chunk usage, but also greatly increase mount time. Without a proper reason, we should not change the max chunk size. [CAUSE] Previously, we set max data chunk size to 10G, while max data stripe length to 1G. Commit f6fca3917b4d ("btrfs: store chunk size in space-info struct") completely ignored the 10G limit, but use 1G max stripe limit instead, causing above shrink in max data chunk size. [FIX] Fix the max data chunk size to 10G, and in decide_stripe_size_regular() we limit stripe_size to 1G manually. This should only affect data chunks, as for metadata chunks we always set the max stripe size the same as max chunk size (256M or 1G depending on fs size). Now the same script result the same old result: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176 length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 Reported-by: Wang Yugui <wangyugui@e16-tech.com> Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()Zixuan Fu1-1/+4
In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if the path is invalid. In this case, btrfs_get_dev_args_from_path() returns directly without freeing args->uuid and args->fsid allocated before, which causes memory leak. To fix these possible leaks, when btrfs_get_bdev_and_sb() fails, btrfs_put_dev_args_from_path() is called to clean up the memory. Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Fixes: faa775c41d655 ("btrfs: add a btrfs_get_dev_args_from_path helper") CC: stable@vger.kernel.org # 5.16 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Zixuan Fu <r33s3n6@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: merge btrfs_dev_stat_print_on_error with its only callerChristoph Hellwig1-5/+0
Fold it into the only caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: raid56: transfer the bio counter reference to the raid submission helpersChristoph Hellwig1-8/+7
Transfer the bio counter reference acquired by btrfs_submit_bio to raid56_parity_write and raid56_parity_recovery together with the bio that the reference was acquired for instead of acquiring another reference in those helpers and dropping the original one in btrfs_submit_bio. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: do not return errors from raid56_parity_recoverChristoph Hellwig1-1/+1
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Also use the proper bool type for the generic_io argument. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: do not return errors from raid56_parity_writeChristoph Hellwig1-1/+1
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: do not return errors from btrfs_map_bioChristoph Hellwig1-5/+7
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. As this requires touching all the callers, rename the function to btrfs_submit_bio, which describes the functionality much better. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block()Qu Wenruo1-1/+4
For profiles other than RAID56, __btrfs_map_block() returns @map_length as min(stripe_end, logical + *length), which is also the same result from btrfs_get_io_geometry(). But for RAID56, __btrfs_map_block() returns @map_length as stripe_len. This strange behavior is going to hurt incoming bio split at btrfs_map_bio() time, as we will use @map_length as bio split size. Fix this behavior by returning @map_length by the same calculation as for other profiles. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: raid56: use fixed stripe length everywhereChristoph Hellwig1-8/+5
The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there are functions passing it as arguments, this is not necessary. The fixed value has been used for a long time and though the stripe length should be configurable by super block member stripesize, this hasn't been implemented and would require more changes so we don't need to keep this code around until then. Partially based on a patch from Qu Wenruo. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: clean up chained assignmentsDavid Sterba1-1/+2
The chained assignments may be convenient to write, but make readability a bit worse as it's too easy to overlook that there are several values set on the same line while this is rather an exception. Making it consistent everywhere avoids surprises. The pattern where inode times are initialized reuses the first value and the order is mtime, ctime. In other blocks the assignments are expanded so the order of variables is similar to the neighboring code. Signed-off-by: David Sterba <dsterba@suse.com>