Age  Commit message  Author  Files  Lines
2021-08-20dm ima: prefix ima event name related to device mapper with dm_Tushar Sugandhi1-9/+10
The event names for the DM events recorded in the ima log do not contain any information to indicate the events are part of the DM devices/targets. Prefix the event names for DM events with "dm_" to indicate that they are part of device-mapper. Signed-off-by: Tushar Sugandhi <[email protected]> Suggested-by: Thore Sommer <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-20dm ima: add version info to dm related events in ima logTushar Sugandhi2-12/+57
The DM events present in the ima log contain various attributes in the key=value format. The attributes' names/values may change in future, and new attributes may also get added. The attestation server needs some versioning to determine which attributes are supported and are expected in the ima log. Add version information to the DM events present in the ima log to help attestation servers to correctly process the attributes across different versions. Signed-off-by: Tushar Sugandhi <[email protected]> Suggested-by: Mimi Zohar <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-20dm ima: prefix dm table hashes in ima log with hash algorithmTushar Sugandhi2-3/+12
The active/inactive table hashes measured in the ima log do not contain the information about hash algorithm. This information is useful for the attestation servers to recreate the hashes and compare them with the ones present in the ima log to verify the table contents. Prefix the table hashes in various DM events in ima log with the hash algorithm used to compute those hashes. Signed-off-by: Tushar Sugandhi <[email protected]> Suggested-by: Mimi Zohar <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
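A verifier could use the algorithm prefix to recompute and compare a table hash roughly as sketched below. This is only an illustration: the `alg:hexdigest` layout with a colon separator is an assumption (the message above does not spell out the separator), and the helper names are hypothetical.

```python
import hashlib


def split_prefixed_hash(value):
    """Split 'alg:hexdigest' into (alg, hexdigest).

    The ':' separator is an assumption made for this sketch.
    """
    alg, sep, digest = value.partition(":")
    if not sep or not alg or not digest:
        raise ValueError("no algorithm prefix in %r" % (value,))
    return alg, digest


def table_matches(table_text, prefixed_hash):
    """Recompute the hash of table_text with the named algorithm and compare."""
    alg, digest = split_prefixed_hash(prefixed_hash)
    return hashlib.new(alg, table_text).hexdigest() == digest.lower()


def is_dm_event(event_name):
    """Device-mapper events carry the 'dm_' prefix described above."""
    return event_name.startswith("dm_")
```

Without the prefix, the verifier would have to guess the algorithm or try each one in turn.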
2021-08-18dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()Arne Welzel1-1/+6
On systems with many cores using dm-crypt, heavy spinlock contention in percpu_counter_compare() can be observed when the page allocation limit for a given device is reached or close to being reached. This is due to percpu_counter_compare() taking a spinlock to compute an exact result on potentially many CPUs at the same time.

Switch to non-exact comparison of allocated and allowed pages by using the value returned by percpu_counter_read_positive() to avoid taking the percpu_counter spinlock.

This may over/under estimate the actual number of allocated pages by at most (batch-1) * num_online_cpus(). Currently, batch is bounded by 32. The system on which this issue was first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt allocations, this seems an acceptable error. Certainly preferred over running into the spinlock contention.

This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs and 192GB RAM as follows, but can be provoked on systems with fewer CPUs as well.

 * Disable swap
 * Tune vm settings to promote regular writeback
     $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
     $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
     $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
 * Create 8 dmcrypt devices based on files on a tmpfs
 * Create and mount an ext4 filesystem on each crypt device
 * Run stress-ng --hdd 8 within one of the above filesystems

Total %system usage collected from sysstat goes to ~35%. Write throughput on the underlying loop device is ~2GB/s. perf profiling an individual kworker kcryptd thread shows the following profile, indicating spinlock contention in percpu_counter_compare():

    99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
      |
      --ret_from_fork
        kthread
        worker_thread
        |
         --99.92%--process_one_work
            |
            |--80.52%--kcryptd_crypt
            |    |
            |    |--62.58%--mempool_alloc
            |    |  |
            |    |   --62.24%--crypt_page_alloc
            |    |     |
            |    |      --61.51%--__percpu_counter_compare
            |    |        |
            |    |         --61.34%--__percpu_counter_sum
            |    |           |
            |    |           |--58.68%--_raw_spin_lock_irqsave
            |    |           |  |
            |    |           |   --58.30%--native_queued_spin_lock_slowpath
            |    |           |
            |    |            --0.69%--cpumask_next
            |    |                |
            |    |                 --0.51%--_find_next_bit
            |    |
            |    |--10.61%--crypt_convert
            |    |          |
            |    |          |--6.05%--xts_crypt
            ...

After applying this patch and running the same test, %system usage is lowered to ~7% and write throughput on the loop device increases to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62% in the profile and not hitting the percpu_counter() spinlock anymore.

    |--8.15%--mempool_alloc
    |    |
    |    |--3.93%--crypt_page_alloc
    |    |    |
    |    |     --3.75%--__alloc_pages
    |    |         |
    |    |          --3.62%--get_page_from_freelist
    |    |              |
    |    |               --3.22%--rmqueue_bulk
    |    |                   |
    |    |                    --2.59%--_raw_spin_lock
    |    |                      |
    |    |                       --2.57%--native_queued_spin_lock_slowpath
    |    |
    |     --3.05%--_raw_spin_lock_irqsave
    |               |
    |                --2.49%--native_queued_spin_lock_slowpath

Suggested-by: DJ Gregor <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Arne Welzel <[email protected]>
Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages")
Cc: [email protected]
Signed-off-by: Mike Snitzer <[email protected]>
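The error bound quoted in this message can be checked with a few lines of arithmetic. The sketch below only restates the formula (batch-1) * num_online_cpus() from the text; the function names are hypothetical and nothing here is kernel code.

```python
def max_counter_error_pages(batch, online_cpus):
    """Worst-case drift of the approximate percpu counter, in pages."""
    return (batch - 1) * online_cpus


def max_counter_error_bytes(batch, online_cpus, page_size):
    """The same bound converted to bytes for a given page size."""
    return max_counter_error_pages(batch, online_cpus) * page_size


if __name__ == "__main__":
    # Figures from the commit message: batch bounded by 32, 256 CPUs, 4k pages.
    err = max_counter_error_bytes(batch=32, online_cpus=256, page_size=4096)
    print(err // 2**20, "MiB")  # 31 MiB
```

Against the ~10G allocation limit mentioned above, 31 MiB is roughly 0.3% of the budget, which is why the inexact comparison is considered acceptable.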
2021-08-10dm: add documentation for IMA measurement supportTushar Sugandhi2-0/+307
To interpret various DM target measurement data in IMA logs, a separate documentation page is needed under Documentation/admin-guide/device-mapper. Add documentation to help system administrators and attestation client/server component owners to interpret the measurement data generated by various DM targets, on various device/table state changes. Signed-off-by: Tushar Sugandhi <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm: update target status functions to support IMA measurementTushar Sugandhi32-2/+328
For device mapper targets to take advantage of IMA's measurement capabilities, the status functions for the individual targets need to be updated to handle the status_type_t case for value STATUSTYPE_IMA.

Update status functions for the following target types, to log their respective attributes to be measured using IMA.

 01. cache
 02. crypt
 03. integrity
 04. linear
 05. mirror
 06. multipath
 07. raid
 08. snapshot
 09. striped
 10. verity

For the rest of the targets, handle the STATUSTYPE_IMA case by setting the measurement buffer to NULL.

For IMA to measure the data on a given system, the IMA policy on the system needs to be updated to have the following line, and the system needs to be restarted for the measurements to take effect.

 /etc/ima/ima-policy
   measure func=CRITICAL_DATA label=device-mapper template=ima-buf

The measurements will be reflected in the IMA logs, which are located at:

 /sys/kernel/security/integrity/ima/ascii_runtime_measurements
 /sys/kernel/security/integrity/ima/binary_runtime_measurements

These IMA logs can later be consumed by various attestation clients running on the system, which can send them to external services for attesting the system.

The DM target data measured by IMA subsystem can alternatively be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with DM_TABLE_STATUS_CMD.

Signed-off-by: Tushar Sugandhi <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
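An attestation client might pull the buffer-template entries out of the ASCII log along the lines below. This is a sketch under stated assumptions: it assumes each ima-buf line has six whitespace-separated fields (PCR, template hash, template name, data hash, event name, hex-encoded buffer), and `DmEvent`, `parse_ima_buf_line` and `read_ima_buf_events` are hypothetical names, not part of any existing tool.

```python
from typing import NamedTuple

IMA_ASCII_LOG = "/sys/kernel/security/integrity/ima/ascii_runtime_measurements"


class DmEvent(NamedTuple):
    name: str    # event name field of the log line
    data: bytes  # decoded measurement buffer


def parse_ima_buf_line(line):
    """Return a DmEvent for an ima-buf line, or None for anything else.

    Assumed field order: PCR, template hash, template name, data hash,
    event name, hex-encoded buffer.
    """
    fields = line.split()
    if len(fields) < 6 or fields[2] != "ima-buf":
        return None
    try:
        data = bytes.fromhex(fields[5])
    except ValueError:
        return None
    return DmEvent(name=fields[4], data=data)


def read_ima_buf_events(path=IMA_ASCII_LOG):
    """Yield every ima-buf entry found in the ASCII measurement log."""
    with open(path, encoding="ascii", errors="replace") as log:
        for line in log:
            event = parse_ima_buf_line(line)
            if event is not None:
                yield event
```

Reading the log normally requires root; selecting only device-mapper entries is left to the caller.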
2021-08-10dm ima: measure data on device renameTushar Sugandhi3-0/+53
A given block device is identified by its name and UUID. However, both these parameters can be renamed. For an external attestation service to correctly attest a given device, it needs to keep track of these rename events. Update the device data with the new values for IMA measurements. Measure both old and new device name/UUID parameters in the same IMA measurement event, so that the old and the new values can be connected later. Signed-off-by: Tushar Sugandhi <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm ima: measure data on table clearTushar Sugandhi3-0/+97
For a given block device, an inactive table slot contains the parameters to configure the device with. The inactive table can be cleared multiple times, accidentally or maliciously, which may impact the functionality of the device, and compromise the system. Therefore it is important to measure and log the event when a table is cleared. Measure device parameters, and table hashes when the inactive table slot is cleared. Signed-off-by: Tushar Sugandhi <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm ima: measure data on device removeTushar Sugandhi3-0/+124
Presence of an active block-device, configured with expected parameters, is important for an external attestation service to determine if a system meets the attestation requirements. Therefore it is important for DM to measure the device remove events. Measure device parameters and table hashes when the device is removed, using either remove or remove_all. Signed-off-by: Tushar Sugandhi <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm ima: measure data on device resumeTushar Sugandhi3-2/+125
A given block device can load a table multiple times, with different input parameters, before eventually resuming it. Further, a device may be suspended and then resumed. The device may never resume after a table-load. Because of the above valid scenarios for a given device, it is important to measure and log the device resume event using IMA. Also, if the table is large, measuring it in clear-text each time the device changes state, will unnecessarily increase the size of IMA log. Since the table clear-text is already measured during table-load event, measuring the hash during resume should be sufficient to validate the table contents. Measure the device parameters, and hash of the active table, when the device is resumed. Signed-off-by: Tushar Sugandhi <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm ima: measure data on table loadTushar Sugandhi9-2/+415
DM configures a block device with various target specific attributes passed to it as a table. DM loads the table, and calls each target's respective constructor with the attributes as input parameters. Some of these attributes are critical to ensure the device meets a certain security bar. Thus, IMA should measure these attributes to ensure they are not tampered with during the lifetime of the device, so that external services can have high confidence in the configuration of the block devices on a given system. Some devices may have large tables, and a given device may change its state (table-load, suspend, resume, rename, remove, table-clear etc.) many times. Measuring these attributes each time the device changes its state will significantly increase the size of the IMA logs. Further, once configured, these attributes are not expected to change unless a new table is loaded, or a device is removed and recreated. Therefore the clear-text of the attributes should only be measured during table load, and the hash of the active/inactive table should be measured for the remaining device state changes. Export IMA function ima_measure_critical_data() to allow measurement of DM device parameters, as well as target specific attributes, during table load. Compute the hash of the inactive table and store it for measurements during future state change. If a load is called multiple times, update the inactive table hash with the hash of the latest populated table, so that the correct inactive table hash is measured when the device transitions to different states like resume, remove, rename, etc. Signed-off-by: Tushar Sugandhi <[email protected]> Signed-off-by: Colin Ian King <[email protected]> # leak fix Signed-off-by: Mike Snitzer <[email protected]>
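The bookkeeping described here (clear-text at load, hashes for later state changes) can be modelled in userspace as a toy state machine. Everything in this sketch is hypothetical: the class, its method names and the choice of SHA-256 are assumptions for illustration, and it mirrors only the behaviour stated in these commit messages plus the usual DM rule that resume makes a loaded table active.

```python
import hashlib


class DeviceTableState:
    """Toy model of the per-device table hashes tracked for measurement."""

    def __init__(self, alg="sha256"):
        self.alg = alg             # hash algorithm is an assumption
        self.inactive_hash = None  # hash of the most recently loaded table
        self.active_hash = None    # hash of the table the device runs with

    def load(self, table_text):
        """Table load: the latest load replaces any earlier inactive hash."""
        self.inactive_hash = hashlib.new(self.alg, table_text).hexdigest()
        return self.inactive_hash

    def resume(self):
        """Resume: a pending inactive table becomes the active one."""
        if self.inactive_hash is not None:
            self.active_hash = self.inactive_hash
            self.inactive_hash = None
        return self.active_hash

    def clear(self):
        """Table clear: drop the inactive table, leave the active one alone."""
        self.inactive_hash = None
```

A verifier replaying the log could keep one such object per device and compare the hash reported in each resume, remove or clear event with the value it expects.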
2021-08-10dm writecache: add event countersMikulas Patocka2-5/+67
Add 10 counters for various events (hit, miss, etc) and export them in the status line (accessed from userspace with "dmsetup status"). Also add a message "clear_stats" that resets these counters. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm writecache: report invalid return from writecache_map helpersMikulas Patocka1-1/+4
If some "writecache_map_*" function returns invalid state, it is a bug. So, we should report it and not fail silently. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm writecache: further writecache_map() cleanupMike Snitzer1-32/+43
Factor out writecache_map_flush() and writecache_map_discard() from writecache_map(). Also eliminate the various goto labels in writecache_map(). Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm writecache: factor out writecache_map_remap_origin()Mike Snitzer1-15/+15
Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10dm writecache: split up writecache_map() to improve code readabilityMike Snitzer1-151/+187
writecache_map() has grown too large and can be confusing to read given all the goto statements. Signed-off-by: Mike Snitzer <[email protected]>
2021-08-10writeback: make the laptop_mode prototypes available unconditionallyChristoph Hellwig1-5/+0
Fix the !CONFIG_BLOCK build after the recent cleanup. Fixes: 5ed964f8e54e ("mm: hide laptop_mode_wb_timer entirely behind the BDI API") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: return ELEVATOR_DISCARD_MERGE if possibleMing Lei5-16/+24
When merging a bio into a request, if they are discard IO and the queue supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE because both block core and related drivers (nvme, virtio-blk) don't handle mixed discard io merge (traditional IO merge together with discard merge) well. Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation, so both blk-mq and drivers just need to handle multi-range discard. Reported-by: Oleksandr Natalenko <[email protected]> Signed-off-by: Ming Lei <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Fixes: 2705dfb20947 ("block: fix discard request merge") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: remove the bd_bdi in struct block_deviceChristoph Hellwig7-20/+9
Just retrieve the bdi from the disk. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: move the bdi from the request_queue to the gendiskChristoph Hellwig14-63/+58
The backing device information only makes sense for file system I/O, and thus belongs into the gendisk and not the lower level request_queue structure. Move it there. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: add a queue_has_disk helperChristoph Hellwig1-0/+1
Add a helper to check if a gendisk is associated with a request_queue. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: pass a gendisk to blk_queue_update_readaheadChristoph Hellwig6-8/+10
.. and rename the function to disk_update_readahead. This is in preparation for moving the BDI from the request_queue to the gendisk. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09mm: hide laptop_mode_wb_timer entirely behind the BDI APIChristoph Hellwig3-7/+3
Don't leak the details of the timer into the block layer, instead initialize the timer in bdi_alloc and delete it in bdi_unregister. Note that this means the timer is initialized (but not armed) for non-block queues as well now. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: remove support for delayed queue registrationsChristoph Hellwig3-29/+7
Now that device mapper has been changed to register the disk once it is fully ready all this code is unused. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09dm: delay registering the gendiskChristoph Hellwig2-13/+11
device mapper is currently the only outlier that tries to call register_disk after add_disk, leading to fairly inconsistent state of these block layer data structures. Instead change device-mapper to just register the gendisk later now that the holder mechanism can cope with that. Note that this introduces a user visible change: the dm kobject is now only visible after the initial table has been loaded. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09dm: move setting md->type into dm_setup_md_queueChristoph Hellwig2-6/+3
Move setting md->type from both callers into dm_setup_md_queue. This ensures that md->type is only set to a valid value after the queue has been fully set up, something we'll rely on in future changes. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09dm: cleanup cleanup_mapped_deviceChristoph Hellwig1-5/+1
md->queue is now always set when md->disk is set, so simplify the conditionals a bit. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: support delayed holder registrationChristoph Hellwig3-17/+66
device mapper needs to register holders before it is ready to do I/O. Currently it does so by registering the disk early, which can leave the disk and queue in a weird half state where the queue is registered with the disk, except for sysfs and the elevator. And this state has been a bit problematic before, and will get more so when sorting out the responsibilities between the queue and the disk. Support registering holders on an initialized but not registered disk instead by delaying the sysfs registration until the disk is registered. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: look up holders by bdevChristoph Hellwig5-17/+15
Invert the way the holder relations are tracked. This very slightly reduces the memory overhead for partitioned devices. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: remove the extra kobject reference in bd_link_disk_holderChristoph Hellwig1-6/+0
Since commit 0d02129e76ed ("block: merge struct block_device and struct hd_struct") there is no way for the bdev to go away as long as there is a holder, so remove the extra references. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09block: make the block holder code optionalChristoph Hellwig8-146/+151
Move the block holder code into a separate file as it is not in any way related to the other block_dev.c code, and add a new selectable config option for it so that we don't have to build it without any remapped drivers selected. The Kconfig symbol contains a _DEPRECATED suffix to match the comments added in commit 49731baa41df ("block: restore multiple bd_link_disk_holder() support"). Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-05loop: Select I/O scheduler 'none' from inside add_disk()Bart Van Assche1-1/+2
We noticed that the user interface of Android devices becomes very slow under memory pressure. This is because Android uses the zram driver on top of the loop driver for swapping, because under memory pressure the swap code alternates reads and writes quickly, because mq-deadline is the default scheduler for loop devices, and because mq-deadline delays writes by five seconds for such a workload with default settings. Fix this by making the kernel select I/O scheduler 'none' from inside add_disk() for loop devices. This default can be overridden at any time from user space, e.g. via a udev rule. This approach has an advantage compared to changing the I/O scheduler from userspace from 'mq-deadline' to 'none', namely that synchronize_rcu() does not get called. This patch changes the default I/O scheduler for loop devices from 'mq-deadline' to 'none'. Additionally, this patch reduces the Android boot time on my test setup by 0.5 seconds compared to configuring the loop I/O scheduler from user space. Cc: Christoph Hellwig <[email protected]> Cc: Ming Lei <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Martijn Coenen <[email protected]> Cc: Jaegeuk Kim <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
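To confirm from user space which scheduler a loop device ended up with, one can read the queue's `scheduler` attribute in sysfs, where the active entry is shown in square brackets. The helper names below are hypothetical; this is a small sketch, not part of the patch.

```python
def active_scheduler(sysfs_text):
    """Return the bracketed (active) entry of a queue/scheduler listing.

    Example input: "[none] mq-deadline kyber". A listing with a single
    unbracketed entry is treated as that entry being active.
    """
    tokens = sysfs_text.split()
    for token in tokens:
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    if len(tokens) == 1:
        return tokens[0]
    raise ValueError("no active scheduler in %r" % (sysfs_text,))


def read_active_scheduler(disk, sysfs_root="/sys/block"):
    """Read the active scheduler of a disk such as 'loop0'."""
    with open("%s/%s/queue/scheduler" % (sysfs_root, disk)) as attr:
        return active_scheduler(attr.read())
```

With this patch applied, `read_active_scheduler("loop0")` would be expected to report `none` unless a udev rule or an administrator has overridden the default.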
2021-08-05blk-mq: Introduce the BLK_MQ_F_NO_SCHED_BY_DEFAULT flagBart Van Assche2-0/+9
elevator_get_default() uses the following algorithm to select an I/O scheduler from inside add_disk():
- In case of a single hardware queue or if sharing hardware queues across multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
- Otherwise, use 'none'.

This is a good choice for most but not for all block drivers. Make it possible to override the selection of mq-deadline with a new flag, namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.

Cc: Christoph Hellwig <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Martijn Coenen <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
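The selection rule, including the new override, can be written down as a small decision function. This is a userspace paraphrase of the algorithm as described in the message, not the kernel implementation; the function and parameter names are hypothetical.

```python
def default_elevator(nr_hw_queues, tags_shared, no_sched_by_default=False):
    """Pick the default I/O scheduler name for a new disk.

    nr_hw_queues        -- number of hardware queues of the device
    tags_shared         -- True if hardware queues are shared across request
                           queues (the BLK_MQ_F_TAG_HCTX_SHARED case)
    no_sched_by_default -- True if the driver set BLK_MQ_F_NO_SCHED_BY_DEFAULT
    """
    if no_sched_by_default:
        return "none"          # the driver opted out of mq-deadline
    if nr_hw_queues == 1 or tags_shared:
        return "mq-deadline"   # single queue or shared tags
    return "none"              # multiple independent hardware queues
```

The flag only changes the first two cases; a multi-queue device without shared tags gets `none` either way.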
2021-08-02block: remove blk-mq-sysfs dead codeDamien Le Moal1-55/+0
In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to define any attribute since the "mq" sysfs directory contains only sub-directories (no attribute files). As a result, blk_mq_sysfs_show(), blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all unused and unnecessary. Remove all this unused code. Signed-off-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02loop: raise media_change eventMatteo Croce1-0/+5
Make the loop device raise a DISK_MEDIA_CHANGE event on attach or detach.

    # udevadm monitor -up |grep -e DISK_MEDIA_CHANGE -e DEVNAME &

    # losetup -f zero
    [    7.454235] loop0: detected capacity change from 0 to 16384
    DISK_MEDIA_CHANGE=1
    DEVNAME=/dev/loop0
    DEVNAME=/dev/loop0
    DEVNAME=/dev/loop0

    # losetup -f zero
    [   10.205245] loop1: detected capacity change from 0 to 16384
    DISK_MEDIA_CHANGE=1
    DEVNAME=/dev/loop1
    DEVNAME=/dev/loop1
    DEVNAME=/dev/loop1

    # losetup -f zero2
    [   13.532368] loop2: detected capacity change from 0 to 40960
    DISK_MEDIA_CHANGE=1
    DEVNAME=/dev/loop2
    DEVNAME=/dev/loop2

    # losetup -D
    DEVNAME=/dev/loop1
    DISK_MEDIA_CHANGE=1
    DEVNAME=/dev/loop1
    DEVNAME=/dev/loop2
    DISK_MEDIA_CHANGE=1
    DEVNAME=/dev/loop2
    DEVNAME=/dev/loop0
    DISK_MEDIA_CHANGE=1
    DEVNAME=/dev/loop0

Signed-off-by: Matteo Croce <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Tested-by: Luca Boccassi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: add a helper to raise a media changed eventMatteo Croce2-15/+47
Refactor disk_check_events() and move some code into disk_event_uevent(). Then add disk_force_media_change(), a helper which will be used by devices to force issuing a DISK_EVENT_MEDIA_CHANGE event. Co-developed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Matteo Croce <[email protected]> Tested-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: export diskseq in sysfsMatteo Croce2-0/+22
Add a new sysfs handle to export the new diskseq value. Place it in <sysfs>/block/<disk>/diskseq and document it.

    $ grep . /sys/class/block/*/diskseq
    /sys/class/block/loop0/diskseq:13
    /sys/class/block/loop1/diskseq:14
    /sys/class/block/loop2/diskseq:5
    /sys/class/block/loop3/diskseq:6
    /sys/class/block/ram0/diskseq:1
    /sys/class/block/ram1/diskseq:2
    /sys/class/block/vda/diskseq:7

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
Tested-by: Luca Boccassi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
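The same listing can be collected programmatically by walking the sysfs directory. A minimal sketch, with a hypothetical function name and the directory passed as a parameter so it can be pointed at a test tree:

```python
import glob
import os


def read_diskseqs(sysfs_class_block="/sys/class/block"):
    """Map each block device name to the integer in its diskseq attribute."""
    result = {}
    pattern = os.path.join(sysfs_class_block, "*", "diskseq")
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(os.path.dirname(path))
        with open(path) as attr:
            result[name] = int(attr.read().strip())
    return result
```

Devices without a `diskseq` attribute simply do not appear in the result.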
2021-08-02block: add ioctl to read the disk sequence numberMatteo Croce2-0/+3
Add a new BLKGETDISKSEQ ioctl which retrieves the disk sequence number from the genhd structure.

    # ./getdiskseq /dev/loop*
    /dev/loop0: 13
    /dev/loop0p1: 13
    /dev/loop0p2: 13
    /dev/loop0p3: 13
    /dev/loop1: 14
    /dev/loop1p1: 14
    /dev/loop1p2: 14
    /dev/loop2: 5
    /dev/loop3: 6

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
Tested-by: Luca Boccassi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: export the diskseq in ueventsMatteo Croce1-0/+9
Export the newly introduced diskseq in uevents:

    $ udevadm info /sys/class/block/* |grep -e DEVNAME -e DISKSEQ
    E: DEVNAME=/dev/loop0
    E: DISKSEQ=1
    E: DEVNAME=/dev/loop1
    E: DISKSEQ=2
    E: DEVNAME=/dev/loop2
    E: DISKSEQ=3
    E: DEVNAME=/dev/loop3
    E: DISKSEQ=4
    E: DEVNAME=/dev/loop4
    E: DISKSEQ=5
    E: DEVNAME=/dev/loop5
    E: DISKSEQ=6
    E: DEVNAME=/dev/loop6
    E: DISKSEQ=7
    E: DEVNAME=/dev/loop7
    E: DISKSEQ=8
    E: DEVNAME=/dev/nvme0n1
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p1
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p2
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p3
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p4
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p5
    E: DISKSEQ=9
    E: DEVNAME=/dev/sda
    E: DISKSEQ=10
    E: DEVNAME=/dev/sda1
    E: DISKSEQ=10
    E: DEVNAME=/dev/sda2
    E: DISKSEQ=10

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
Tested-by: Luca Boccassi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: add disk sequence numberMatteo Croce3-0/+29
Associating uevents with block devices in userspace is difficult and racy: the uevent netlink socket is lossy, and on slow and overloaded systems has a very high latency. Block devices do not have exclusive owners in userspace, any process can set one up (e.g. loop devices). Moreover, device names can be reused (e.g. loop0 can be reused again and again). A userspace process setting up a block device and watching for its events thus cannot reliably tell whether an event relates to the device it just set up or another earlier instance with the same name. Being able to set a UUID on a loop device would solve the race conditions. But it does not allow deriving orderings from uevents: if you see a uevent with a UUID that does not match the device you are waiting for, you cannot tell whether it's because the right uevent has not arrived yet, or it was already sent and you missed it. So you cannot tell whether you should wait for it or not. Associating a unique, monotonically increasing sequential number to the lifetime of each block device, which can be retrieved with an ioctl immediately upon setting it up, makes it possible to solve the race conditions with uevents, and also allows userspace processes to know whether they should wait for the uevent they need or if it was dropped and thus they should move on. Additionally, increment the disk sequence number when the media changes, i.e. on DISK_EVENT_MEDIA_CHANGE event. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Matteo Croce <[email protected]> Tested-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
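The wait-or-move-on decision that the sequence number enables might look like the sketch below. The three-way classification is one reading of the reasoning in this message rather than a documented protocol, and `Verdict` and `classify_uevent` are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    STALE = "stale"    # event belongs to an earlier instance: ignore, keep waiting
    MATCH = "match"    # event belongs to the device instance we set up
    MISSED = "missed"  # a newer sequence was seen: our event was likely dropped


def classify_uevent(my_diskseq, event_diskseq):
    """Compare a uevent's DISKSEQ with the value obtained right after setup.

    my_diskseq    -- sequence number read via the ioctl after creating the device
    event_diskseq -- DISKSEQ carried by an incoming uevent for the same name
    """
    if event_diskseq < my_diskseq:
        return Verdict.STALE
    if event_diskseq == my_diskseq:
        return Verdict.MATCH
    return Verdict.MISSED
```

A UUID comparison can only answer "same or different"; the ordering is what lets the watcher stop waiting once a higher number shows up.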
2021-08-02block: remove cmdline-parser.cChristoph Hellwig7-319/+262
cmdline-parser.c is only used by the cmdline faux partition format, so merge the code into that and avoid an indirect call. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: remove disk_name()Christoph Hellwig2-9/+9
Remove the disk_name function now that all users are gone. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: simplify disk name formatting in check_partitionChristoph Hellwig1-1/+1
disk_name for partition 0 just copies out the disk_name field. Replace the call to disk_name with a %s format specifier. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: simplify printing the device names disk_stack_limitsChristoph Hellwig1-9/+3
Printk ->disk_name directly for the disk and use the %pg format specifier for the block device, which is equivalent to a bdevname call. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: use the %pg format specifier in show_partitionChristoph Hellwig1-4/+2
Simplify printing the partition name by using the %pg format specifier that is equivalent to a bdevname call. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: use the %pg format specifier in printk_all_partitionsChristoph Hellwig1-4/+2
Simplify printing the partition name by using the %pg format specifier that is equivalent to a bdevname call. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: reduce stack usage in diskstats_showAbd-Alrhman Masalkhi1-4/+2
I have compiled the kernel with a cross compiler "hppa-linux-gnu-" v9.3.0 on an x86-64 host machine and got the following warning:

    block/genhd.c: In function ‘diskstats_show’:
    block/genhd.c:1227:1: warning: the frame size of 1688 bytes is larger than 1280 bytes [-Wframe-larger-than=]
     1227 | }

Reduce the stack footprint by using the %pg printk specifier instead of disk_name to remove the need for the on-stack buffer.

Signed-off-by: Abd-Alrhman Masalkhi <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: remove bdputChristoph Hellwig4-10/+3
Now that we've stopped using inode references for anything meaningful in the block layer, get rid of the helper to put it and just open code the call to iput on the block_device inode. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: remove bdgrabChristoph Hellwig2-16/+0
All callers are gone, and no one should grab a pure inode reference to a block device anymore. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02loop: don't grab a reference to the block deviceChristoph Hellwig1-5/+0
The whole device block device won't be removed while the disk is still alive, so don't bother to grab a reference to it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>