|
Move null_blk driver code into the new sub-directory
drivers/block/null_blk.
Suggested-by: Bart Van Assche <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add the module option and configfs attribute max_sectors to allow
configuring the maximum size of a command issued to a null_blk device.
This allows exercising the block layer BIO splitting with different
limits than the default BLK_SAFE_MAX_SECTORS. This is also useful for
testing the zone append write path of file systems as the max_hw_sectors
limit value is also used for the max_zone_append_sectors limit.
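As an illustration only, a module option like this is typically declared and then applied to the queue limit; the wiring below is a simplified sketch under assumed names (null_apply_max_sectors, dev->max_sectors), not the actual patch:
```
/* Sketch: module option declaration and how it would feed the queue limit. */
static int g_max_sectors;
module_param_named(max_sectors, g_max_sectors, int, 0444);
MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");

static void null_apply_max_sectors(struct nullb *nullb)
{
	/* 0 keeps the block layer default (BLK_SAFE_MAX_SECTORS) */
	if (nullb->dev->max_sectors)
		blk_queue_max_hw_sectors(nullb->q, nullb->dev->max_sectors);
}
```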
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
When memory backing is enabled, use null_handle_discard() to free the
backing memory used by a zone when the zone is being reset.
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
null_handle_discard() is called from both null_handle_rq() and
null_handle_bio(). As these functions are only passed a nullb_cmd
structure, this forces pointer dereferences to identify the discard
operation code and to access the sector range to be discarded.
Simplify all this by changing the interface of the functions
null_handle_discard() and null_handle_memory_backed() to pass along
the operation code, operation start sector and number of sectors. With
this change null_handle_discard() can be called directly from
null_handle_memory_backed().
Also add a message warning that the discard configuration attribute has
no effect when memory backing is disabled.
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
When open zone resource management is enabled, that is, when a null_blk
zoned device is created with zone_max_open different than 0, implicitly
or explicitly opening a zone may require implicitly closing a zone
that is already implicitly open. This operation is done using the
function null_close_first_imp_zone(), which searches for an implicitly
open zone to close starting from the first sequential zone. This
implementation is simple but may result in the same zone, namely the
lowest numbered zone that is being written, being constantly
implicitly closed and then implicitly reopened on write.
Avoid this by starting the search for an implicitly open zone from the
zone following the last zone that was implicitly closed. The
function null_close_first_imp_zone() is renamed
null_close_imp_open_zone().
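As an illustration, the round-robin search amounts to something like the following sketch; the field imp_close_zone_no and the closing step are assumptions for illustration, not necessarily the actual driver code:
```
/*
 * Sketch: remember where the last implicit close happened and resume the
 * scan from the next sequential zone instead of always starting over.
 */
static void null_close_imp_open_zone_sketch(struct nullb_device *dev)
{
	unsigned int i, zno = dev->imp_close_zone_no;	/* assumed field */

	if (zno < dev->zone_nr_conv || zno >= dev->nr_zones)
		zno = dev->zone_nr_conv;

	for (i = dev->zone_nr_conv; i < dev->nr_zones; i++) {
		if (dev->zones[zno].cond == BLK_ZONE_COND_IMP_OPEN) {
			/* transition zone zno to the closed condition here */
			dev->imp_close_zone_no = zno + 1;
			return;
		}
		if (++zno >= dev->nr_zones)
			zno = dev->zone_nr_conv;
	}
}
```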
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
With memory backing disabled, using a single spinlock for protecting
zone information and zone resource management prevents the parallel
execution on multiple queues of IO requests to different zones.
Furthermore, regardless of the use of memory backing, if a null_blk
device is created without limits on the number of open and active zones,
accounting for zone resource management is not necessary.
From these observations, zone locking is changed as follows to improve
performance:
1) the zone_lock spinlock is renamed zone_res_lock and used only if
zone resource management is necessary, that is, if either
zone_max_open or zone_max_active is not 0. This is indicated using
the new boolean need_zone_res_mgmt in the nullb_device structure.
null_zone_write() is modified to reduce the amount of code executed
with the zone_res_lock spinlock held.
2) With memory backing disabled, per zone locking is changed to a
spinlock per zone.
3) Introduce the structure nullb_zone to replace the use of
struct blk_zone for zone information. This new structure includes a
union of a spinlock and a mutex for zone locking. The spinlock is
used when memory backing is disabled and the mutex is used with
memory backing.
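For reference, point 3 amounts to a per-zone descriptor roughly like the sketch below (field names are illustrative):
```
/* Sketch of the zone descriptor with the spinlock/mutex union (point 3). */
struct nullb_zone {
	union {
		spinlock_t spinlock;	/* used when memory backing is disabled */
		struct mutex mutex;	/* used when memory backing is enabled */
	};
	enum blk_zone_type type;
	enum blk_zone_cond cond;
	sector_t start;
	sector_t wp;
	unsigned int len;
	unsigned int capacity;
};
```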
With these changes, fio performance with zonemode=zbd for 4K random
read and random write on a dual socket (24 cores per socket) machine
using the none scheduler is as follows:
before patch:
write (psync x 96 jobs) = 465 KIOPS
read (libaio@qd=8 x 96 jobs) = 1361 KIOPS
after patch:
write (psync x 96 jobs) = 456 KIOPS
read (libaio@qd=8 x 96 jobs) = 4096 KIOPS
Write performance remains mostly unchanged but read performance is three
times higher. Performance when using the mq-deadline scheduler is not
changed by this patch as mq-deadline becomes the bottleneck for a
multi-queue device.
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Block device drivers do not have to call blk_queue_max_hw_sectors() to
set a limit on request size if the default limit BLK_SAFE_MAX_SECTORS
is acceptable. However, this limit (255 sectors) may not be aligned
to the device logical block size, in which case it cannot be used as is
as the request maximum size. This is the case for the null_blk device driver.
Modify blk_queue_max_hw_sectors() to make sure that the request size
limits specified by the max_hw_sectors and max_sectors queue limits
are always aligned to the device logical block size. Additionally, to
avoid introducing a dependence on the execution order of this function
with blk_queue_logical_block_size(), also modify
blk_queue_logical_block_size() to perform the same alignment when the
logical block size is set after max_hw_sectors.
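The alignment itself amounts to something like the following simplified sketch (not the exact helper added by the patch):
```
/* Sketch: round a sector limit down to a whole number of logical blocks. */
static unsigned int blk_round_down_sectors_sketch(unsigned int sectors,
						  unsigned int lbs /* bytes */)
{
	sectors = round_down(sectors, lbs >> SECTOR_SHIFT);
	if (!sectors)
		sectors = lbs >> SECTOR_SHIFT;	/* never below one logical block */
	return sectors;
}
```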
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Conventional zones do not have a write pointer and so cannot accept zone
append writes. Make sure to fail any zone append write command issued to
a conventional zone.
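Conceptually the check is as small as the sketch below (the surrounding zone-write handling is omitted):
```
/* Sketch: a conventional zone has no write pointer, so zone append must fail. */
static blk_status_t null_check_zone_append_sketch(struct blk_zone *zone)
{
	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
		return BLK_STS_IOERR;
	return BLK_STS_OK;
}
```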
Reported-by: Naohiro Aota <[email protected]>
Fixes: e0489ed5daeb ("null_blk: Support REQ_OP_ZONE_APPEND")
Cc: [email protected]
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
A null_blk device with zoned mode enabled is currently initialized
with a number of zones equal to the device capacity divided by the zone
size, without considering if the device capacity is a multiple of the
zone size. If the zone size is not a divisor of the capacity, the zones
end up not covering the entire capacity, potentially resulting in out
of bounds accesses to the zone array.
Fix this by adding one last smaller zone with a size equal to the
remainder of the disk capacity divided by the zone size if the capacity
is not a multiple of the zone size. For such smaller last zone, the zone
capacity is also checked so that it does not exceed the smaller zone
size.
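The resulting geometry computation is roughly the following standalone arithmetic sketch (names illustrative, not the driver code):
```
/* Sketch: round the zone count up and size the last zone from the remainder. */
static void null_zone_geometry_sketch(sector_t capacity, sector_t zone_size,
				      unsigned int *nr_zones,
				      sector_t *last_zone_len)
{
	*nr_zones = DIV_ROUND_UP(capacity, zone_size);
	*last_zone_len = capacity - (sector_t)(*nr_zones - 1) * zone_size;
	/* the zone capacity of the last zone is clamped to *last_zone_len */
}
```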
Reported-by: Naohiro Aota <[email protected]>
Fixes: ca4b2a011948 ("null_blk: add zone support")
Cc: [email protected]
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
direct to backing
There is a race condition in detaching as below:
A. detaching B. Write request
(1) writing back
(2) write back done, set bdev
state to clean.
(3) cached_dev_put() and
schedule_work(&dc->detach);
(4) write data [0 - 4K] directly
into backing and ack to user.
(5) power-failure...
When we restart this bcache device, this bdev is clean but not detached,
and a read of [0 - 4K] will return unexpected old data from the cache device.
To fix this problem, set the bdev state to none when writeback is done
during detaching. Then, if a power failure happens as above, the data in
the cache will not be used the next time the bcache device starts (it is
detached), and we will read the correct data directly from the backing device.
Signed-off-by: Dongsheng Yang <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Currently, in the case where dev->blk_symlink_name fails to be allocated,
the error return path attempts to write an end-of-string character to
the unallocated dev->blk_symlink_name, causing a null pointer dereference.
Fix this by returning an explicit -ENOMEM error (the return value was also
left uninitialized in the original code).
Fixes: 1eb54f8f5dd8 ("block/rnbd: client: sysfs interface functions")
Signed-off-by: Colin Ian King <[email protected]>
Addresses-Coverity: ("Dereference after null check")
Signed-off-by: Jens Axboe <[email protected]>
|
|
For every rnbd_clt_dev, we allocate the pathname and blk_symlink_name
statically at NAME_MAX, which is 255 bytes. In most cases we
need less than 10 bytes, so about 500 bytes per block device are wasted.
This commit dynamically allocates memory buffer for pathname and
blk_symlink_name.
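The idea, as a sketch (the struct and field follow the rnbd client naming above; the helper and exact allocation call are illustrative):
```
/* Sketch: allocate only what the string needs instead of a NAME_MAX buffer. */
static int rnbd_clt_set_pathname_sketch(struct rnbd_clt_dev *dev,
					const char *pathname)
{
	dev->pathname = kstrdup(pathname, GFP_KERNEL);
	if (!dev->pathname)
		return -ENOMEM;
	return 0;
}
```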
Signed-off-by: Md Haris Iqbal <[email protected]>
Signed-off-by: Jack Wang <[email protected]>
Reviewed-by: Lutz Pogrell <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Per the comment of kobject_init_and_add(), we need to clean up the memory
by calling kobject_put().
We also need to call kobject_del() for the other failure cases if
kobject_init_and_add() did not fail.
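The pattern being enforced looks like the following generic sketch (not the exact rnbd hunk):
```
/* Sketch: kobject_put() must run even when kobject_init_and_add() fails. */
static int sketch_register_kobj(struct kobject *kobj, struct kobj_type *ktype,
				struct kobject *parent, const char *name)
{
	int ret = kobject_init_and_add(kobj, ktype, parent, "%s", name);

	if (ret)
		kobject_put(kobj);	/* frees the kobject's memory */
	/* later failures, after a successful add, also need kobject_del() */
	return ret;
}
```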
Signed-off-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jack Wang <[email protected]>
Reviewed-by: Md Haris Iqbal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
describe force_close of device
Signed-off-by: Jack Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
The forceful close of an exported device is required
for the use case where the client side hangs, has crashed,
or is not accessible.
Cases have been observed where only some of
the devices are to be cleaned up, but the session shall
remain.
When the device is to be exported to a different
client host, server side cleanup is required.
Signed-off-by: Lutz Pogrell <[email protected]>
Signed-off-by: Jack Wang <[email protected]>
Reviewed-by: Gioh Kim <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
When mapping a device,
/sys/devices/virtual/rnbd-client/ctl/devices/<device_id> was created.
But we found out that it had a problem when mapping the same file
on different servers. So we append the session name after the
device_id as below.
/sys/devices/virtual/rnbd-client/ctl/devices/<device_id>@<session_name>
Signed-off-by: Gioh Kim <[email protected]>
Signed-off-by: Jack Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
/sys/block/rnbd<N> is created, not /sys/block/rnbd_client/rnbd<N>
Signed-off-by: Gioh Kim <[email protected]>
Signed-off-by: Jack Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
different servers
Previously, we could not map the same device name from different sessions
due to a limitation of the sysfs naming mechanism.
root@clt2:~# ls -l /sys/class/rnbd-client/ctl/devices/
total 0
lrwxrwxrwx 1 root 0 Sep 2 16:31 !dev!nullb1 -> ../../../block/rnbd0
Only the device name is used above, so a device with the same name
cannot be mapped from another server. To address
the issue, the sessname is appended to the node to differentiate
where the device comes from.
Also, we need to check whether the pathname exists in a specific
session instead of searching for it in the global sess_list.
Signed-off-by: Guoqing Jiang <[email protected]>
Signed-off-by: Gioh Kim <[email protected]>
Signed-off-by: Jack Wang <[email protected]>
Reviewed-by: Md Haris Iqbal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
During map_device if the given session exists, then the path parameter is
not used. In such a case, the path parameter is redundant.
This commit makes the path parameter optional for map_device. When the
path parameter is not given, if the session exists then that is used to
establish the rtrs connection.
If the session does not exist, and the path parameter is also missing,
then map_device fails.
Signed-off-by: Md Haris Iqbal <[email protected]>
Signed-off-by: Jack Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
for-5.11/drivers
Pull NVMe updates from Christoph:
"nvme updates for 5.11
- nvmet passthrough improvements (Chaitanya Kulkarni)
- fcloop error injection support (James Smart)
- read-only support for zoned namespaces without Zone Append
(Javier González)
- improve some error messages (Minwoo Im)
- reject I/O to offline fabrics namespaces (Victor Gladkov)
- PCI queue allocation cleanups (Niklas Schnelle)
- remove an unused allocation in nvmet (Amit Engel)
- a Kconfig spelling fix (Colin Ian King)
- nvme_req_qid simplification (Baolin Wang)"
* tag 'nvme-5.11-20201202' of git://git.infradead.org/nvme: (23 commits)
nvme: export zoned namespaces without Zone Append support read-only
nvme: rename bdev operations
nvme: rename controller base dev_t char device
nvme: remove unnecessary return values
nvme: print a warning for when listing active namespaces fails
nvme: improve an error message on Identify failure
nvme-fabrics: reject I/O to offline device
nvmet: fix a spelling mistake "incuding" -> "including" in Kconfig
nvmet: make sure discovery change log event is protected
nvmet: remove unused ctrl->cqs
nvme-pci: don't allocate unused I/O queues
nvme-pci: drop min() from nr_io_queues assignment
nvmet: use inline bio for passthru fast path
nvmet: use blk_rq_bio_prep instead of blk_rq_append_bio
nvmet: remove op_flags for passthru commands
nvme: split nvme_alloc_request()
block: move blk_rq_bio_prep() to linux/blk-mq.h
nvmet: add passthru io timeout value attr
nvmet: add passthru admin timeout value attr
nvme: use consistent macro name for timeout
...
|
|
Allow ZNS NVMe SSDs to present a read-only namespace when append is not
supported, instead of rejecting the namespace directly.
This allows (i) the namespace to be used in read-only mode, which is not
a problem as the append command only affects the write path, and (ii) to
use standard management tools such as nvme-cli to choose a different
format or firmware slot that is compatible with the Linux zoned block
device.
Signed-off-by: Javier González <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Rename block device operations in preparation to add char device file
operations.
Signed-off-by: Javier González <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Rename controller base dev_t char device in preparation for adding a
namespace char device.
Signed-off-by: Javier González <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Cleanup unnecessary ret values that are not checked or used in
nvme_alloc_ns().
Signed-off-by: Javier González <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
During the scan_work, an Identify command is issued to figure out which
namespaces are active. If this command fails, the nvme driver falls back
to scanning namespaces sequentially. In this situation, we don't see
any warnings and cannot easily tell whether the list-ns command has
failed or not.
Print a warning when the Identify command execution fails:
[ 1.108399] nvme nvme0: Identify NS List failed (status=0x400b)
[ 1.109583] nvme0n1: detected capacity change from 0 to 1048576
[ 1.112186] nvme nvme0: Identify Descriptors failed (nsid=2, status=0x4002)
[ 1.113929] nvme nvme0: Identify Descriptors failed (nsid=3, status=0x4002)
[ 1.116537] nvme nvme0: Identify Descriptors failed (nsid=4, status=0x4002)
...
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Add the namespace ID to the error message when the Identify command used
to retrieve the Namespace Identification Descriptor list fails.
This avoids rather useless and duplicative messages like the following:
[ 1.321031] nvme nvme0: Identify Descriptors failed (16386)
[ 1.321948] nvme nvme0: Identify Descriptors failed (16386)
[ 1.322872] nvme nvme0: Identify Descriptors failed (16386)
[ 1.323775] nvme nvme0: Identify Descriptors failed (16386)
[ 1.324687] nvme nvme0: Identify Descriptors failed (16386)
...
Also, print the nvme status code in hexadecimal rather than decimal
format for better readability.
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Commands get stuck while Host NVMe-oF controller is in reconnect state.
The controller enters into reconnect state when it loses connection with
the target. It tries to reconnect every 10 seconds (default) until
a successful reconnect or until the reconnect time-out is reached.
The default reconnect timeout is 10 minutes.
Applications are expecting commands to complete with success or error
within a certain timeout (30 seconds by default). The NVMe host is
enforcing that timeout while it is connected, but during reconnect the
timeout is not enforced and commands may get stuck for a long period or
even forever.
To fix this long delay due to the default timeout, introduce a new
"fast_io_fail_tmo" session parameter. The timeout is measured in seconds
from the controller reconnect and any command beyond that timeout is
rejected. The new parameter value may be passed during 'connect'.
The default value of -1 means no timeout (similar to current behavior).
Signed-off-by: Victor Gladkov <[email protected]>
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chao Leng <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
There is a spelling mistake in the Kconfig help text. Fix it.
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
The generation counter is protected by nvmet_config_sem. Make sure the
callers of functions that might change it are calling them
properly.
Signed-off-by: Max Gurtovoy <[email protected]>
Reviewed-by: Israel Rukshin <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Remove the unused cqs from the nvmet_ctrl struct;
this will reduce the allocated memory.
Signed-off-by: Amit <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Currently the NVME_QUIRK_SHARED_TAGS quirk for Apple devices is handled
during the assignment of nr_io_queues in nvme_setup_io_queues().
This however means that for these devices nvme_max_io_queues() will
actually not return the supported maximum, which is confusing and
unexpected, and also means that in nvme_probe() we are allocating
for I/O queues that will never be used.
Fix this by moving the quirk handling into nvme_max_io_queues().
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
In nvme_setup_io_queues() the number of I/O queues is set to either 1, in
case of a quirky Apple device, or to the minimum of nvme_max_io_queues()
and dev->nr_allocated_queues - 1.
This is unnecessarily complicated as dev->nr_allocated_queues is only
assigned once and is nvme_max_io_queues() + 1.
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
nvmet_passthru_execute_cmd() is a high-frequency function that uses
bio_alloc(), which leads to a memory allocation from the fs pool
for each I/O.
For an NVMeoF nvmet_req we already have an inline_bvec allocated as part
of the request allocation that can be used with a preallocated bio, since
we already know the size of the request before the bio is allocated.
Introduce a bio member in the nvmet_req passthru anon union. In the
fast path, check if we can get away with the inline bvec and bio from
the nvmet_req, initialized with a bio_init() call, before actually
allocating with bio_alloc().
This is useful to avoid any new memory allocation under high
memory pressure and to get rid of the extra work of
allocation (bio_alloc()) vs initialization (bio_init()) when the
transfer length is < NVMET_MAX_INLINE_DATA_LEN, which the user can
configure at compile time.
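Roughly, the fast-path choice looks like the sketch below (field names follow the nvmet code as described above, but treat it as a sketch rather than the actual hunk):
```
/* Sketch: reuse the request's inline bio/bvec for small transfers. */
static struct bio *nvmet_passthru_bio_sketch(struct nvmet_req *req)
{
	struct bio *bio;

	if (req->transfer_len <= NVMET_MAX_INLINE_DATA_LEN) {
		bio = &req->p.inline_bio;	/* new member in the passthru union */
		bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec));
	} else {
		bio = bio_alloc(GFP_KERNEL, min(req->sg_cnt, BIO_MAX_PAGES));
		bio->bi_end_io = bio_put;
	}
	return bio;
}
```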
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
The function blk_rq_append_bio() is a generic API written for all
types of drivers (including those with bounce buffers) and different
contexts (where the request may already have a bio, i.e. rq->bio != NULL).
It does mainly three things: calculating the segments, checking the bounce
queue and, if req->bio == NULL, calling blk_rq_bio_prep(), otherwise
handling the low level merge case.
The NVMe PCIe and fabrics transports currently do not use the queue
bounce mechanism, yet blk_rq_append_bio() does this extra work in the
passthru fast path for each request.
When I ran I/Os with different block sizes on the passthru controller
I found that we can reuse req->sg_cnt instead of iterating over the
bvecs to find out nr_segs in blk_rq_append_bio(). This calculation in
blk_rq_append_bio() is a duplication of work given that we have the
value in req->sg_cnt (correct me here if I'm wrong).
With the NVMe passthru request-based driver we allocate a fresh request
each time, so for every call to blk_rq_append_bio() rq->bio will be NULL,
i.e. we don't really need the second condition in blk_rq_append_bio()
and the resulting error condition in its caller.
So for the NVMeOF passthru driver, recalculating the segments, the bounce
check and the ll_back_merge code are not needed, and we can get away with
a minimal version of blk_rq_append_bio() which removes the error check
in the fast path along with the extra variable in nvmet_passthru_map_sg().
This patch updates nvmet_passthru_map_sg() so that it only
appends the bio to the request in the context of the NVMeOF passthru
driver. Following are perf numbers :-
With current implementation (blk_rq_append_bio()) :-
----------------------------------------------------
+ 5.80% 0.02% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.44% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 4.88% 0.00% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.44% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 4.86% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.17% 0.00% kworker/0:2-eve [nvmet] [k] nvmet_passthru_execute_cmd
With this patch using blk_rq_bio_prep() :-
----------------------------------------------------
+ 3.14% 0.02% kworker/0:2-eve [nvmet] [k] nvmet_passthru_execute_cmd
+ 3.26% 0.01% kworker/0:2-eve [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.37% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.18% 0.02% kworker/0:2-eve [nvmet] [k] nvmet_passthru_execute_cmd
+ 4.84% 0.02% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 4.87% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
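For reference, in the passthru driver the minimal append described above boils down to roughly the following sketch (rq is the freshly allocated passthru request, so rq->bio is NULL and req->sg_cnt already holds the segment count):
```
	/* Sketch: direct prep instead of the generic blk_rq_append_bio() path. */
	blk_rq_bio_prep(rq, bio, req->sg_cnt);
```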
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
For passthru commands setting op_flags has no meaning. Remove the code
that sets the op flags in nvmet_passthru_map_sg().
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Right now nvme_alloc_request() allocates a request from the block layer
based on the value of the qid. When qid is set to NVME_QID_ANY it uses
blk_mq_alloc_request(), otherwise blk_mq_alloc_request_hctx().
The function nvme_alloc_request() is called from different contexts. The
only place where it uses a non-NVME_QID_ANY value is for fabrics connect
commands :-
nvme_submit_sync_cmd() NVME_QID_ANY
nvme_features() NVME_QID_ANY
nvme_sec_submit() NVME_QID_ANY
nvmf_reg_read32() NVME_QID_ANY
nvmf_reg_read64() NVME_QID_ANY
nvmf_reg_write32() NVME_QID_ANY
nvmf_connect_admin_queue() NVME_QID_ANY
nvme_submit_user_cmd() NVME_QID_ANY
nvme_alloc_request()
nvme_keep_alive() NVME_QID_ANY
nvme_alloc_request()
nvme_timeout() NVME_QID_ANY
nvme_alloc_request()
nvme_delete_queue() NVME_QID_ANY
nvme_alloc_request()
nvmet_passthru_execute_cmd() NVME_QID_ANY
nvme_alloc_request()
nvmf_connect_io_queue() QID
__nvme_submit_sync_cmd()
nvme_alloc_request()
With passthru, nvme_alloc_request() now falls into the I/O fast path, yet
blk_mq_alloc_request_hctx() never gets called there, and the qid handling
adds an additional branch check in the fast path.
Split nvme_alloc_request() into nvme_alloc_request() and
nvme_alloc_request_qid().
Replace each call of nvme_alloc_request() with the NVME_QID_ANY param
by a call to the newly added nvme_alloc_request() without a qid.
Replace the call to nvme_alloc_request() with a QID param by a call to
either the newly added nvme_alloc_request() or nvme_alloc_request_qid(),
based on the qid value set in __nvme_submit_sync_cmd().
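In outline, the split looks like the heavily simplified sketch below (the real functions also initialize the nvme-specific request fields; names carry a _sketch suffix to mark them as illustrative):
```
/* Sketch: common path without qid vs. the fabrics-connect path with qid. */
static struct request *nvme_alloc_request_sketch(struct request_queue *q,
		struct nvme_command *cmd, blk_mq_req_flags_t flags)
{
	unsigned int op = nvme_is_write(cmd) ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN;

	return blk_mq_alloc_request(q, op, flags);
}

static struct request *nvme_alloc_request_qid_sketch(struct request_queue *q,
		struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid)
{
	unsigned int op = nvme_is_write(cmd) ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN;

	return blk_mq_alloc_request_hctx(q, op, flags, qid ? qid - 1 : 0);
}
```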
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
This is a preparation patch to have minimal block layer request bio
append functionality in the context of the NVMeOF Passthru driver which
falls in the fast path and doesn't need calls from blk_rq_append_bio().
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
An NVMeOF controller in passthru mode is capable of handling a wide set
of I/O commands, including vendor-specific passthru I/O commands.
The vendor specific I/O commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:io_timeout).
Add a configfs attribute so that the user can set the I/O timeout values.
If this configfs value is not set, nvme_alloc_request() will set
the NVME_IO_TIMEOUT value when the request queuedata is NULL.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
An NVMeOF controller in passthru mode is capable of handling a wide set
of admin commands, including vendor-specific passthru admin commands.
The vendor specific admin commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:admin_timeout).
Add a configfs attribute so that the user can set the admin timeout values.
If this configfs value is not set, nvme_alloc_request() will set
the ADMIN_TIMEOUT value when the request queuedata is NULL.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
This is purely a cleanup patch: add the NVME prefix to ADMIN_TIMEOUT to
make it consistent with NVME_IO_TIMEOUT.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
The function nvme_alloc_request() is called from different contexts
(I/O and Admin queue), and callers do not consider the I/O timeout when
calling from the I/O queue context.
Update nvme_alloc_request() to set the default I/O or Admin timeout
value based on whether the queuedata is set or not.
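The resulting default-timeout choice is essentially the following sketch (macro names as used in the surrounding commits; the helper name is illustrative):
```
/* Sketch: admin queues carry no queuedata, so its absence selects ADMIN_TIMEOUT. */
static void nvme_set_default_timeout_sketch(struct request *req)
{
	if (req->q->queuedata)
		req->timeout = NVME_IO_TIMEOUT;
	else
		req->timeout = ADMIN_TIMEOUT;
}
```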
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Use the request's '->mq_hctx->queue_num' directly to simplify the
nvme_req_qid() function.
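The simplified helper amounts to roughly the following sketch of the description above:
```
/* Sketch: derive the qid from the request's hw context. */
static inline u16 nvme_req_qid_sketch(struct request *req)
{
	if (!req->q->queuedata)
		return 0;	/* admin queue */

	return req->mq_hctx->queue_num + 1;
}
```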
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Add sysfs attribute to specify parameters for dropping a command. The
attribute takes a string of:
<opcode>:<starting at what instance>:<number of times>
The opcode field is formatted so that the lower 8 bits are the opcode; for
a fabrics opcode, a bit above bits 7:0 will be set.
Once set, each sqe is looked at. If the opcode matches, the running
instance count is updated. If the instance count is in the range where
commands are to be dropped (based on the starting instance and the number
of times), then drop the command by not passing it to the target layer.
Signed-off-by: James Smart <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.11/drivers
Pull MD changes from Song:
"Summary:
1. Fix race condition in md_ioctl(), by Dae R. Jeong;
2. Initialize read_slot properly for raid10, by Kevin Vigor;
3. Code cleanup, by Pankaj Gupta;
4. md-cluster resync/reshape fix, by Zhao Heming."
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/cluster: fix deadlock when node is doing resync job
md/cluster: block reshape with remote resync job
md: use current request time as base for ktime comparisons
md: add comments in md_flush_request()
md: improve variable names in md_flush_request()
md/raid10: initialize r10_bio->read_slot before use.
md: fix a warning caused by a race between concurrent md_ioctl()s
|
|
md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can exclusively send a msg.
While sending a msg, a node can concurrently receive a msg from another node.
When a node does a resync job, grabbing token_lockres:EX may trigger a deadlock:
```
nodeA nodeB
-------------------- --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
b.
md_do_sync
resync_info_update
send RESYNCING
+ set MD_CLUSTER_SEND_LOCK
+ wait for holding token_lockres:EX
c.
mdadm /dev/md0 --remove /dev/sdg
+ held reconfig_mutex
+ send REMOVE
+ wait_event(MD_CLUSTER_SEND_LOCK)
d.
recv_daemon //METADATA_UPDATED from A
process_metadata_update
+ (mddev_trylock(mddev) ||
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
//this time, both return false forever
```
Explanation:
a. A sends METADATA_UPDATED
   This will block other nodes from sending a msg.
b. B does sync jobs, which will send RESYNCING at intervals.
   This will block waiting to hold the token_lockres:EX lock.
c. B does "mdadm --remove", which will send REMOVE.
   This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
d. B receives the METADATA_UPDATED msg which was sent from A in step <a>.
   This will be blocked by step <c>: the mddev lock is held there, which
   makes the wait_event unable to take the mddev lock. (btw,
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)
There is a similar deadlock in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg").
In that commit, step c is "update sb"; in this patch, step c is
"mdadm --remove".
To fix this issue, we can refer to the solution used in
metadata_update_start(), which does the same grab-token action.
lock_comm() can use the same steps to avoid the deadlock, by moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token() to lock_comm().
This enlarges the MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD window a little bit,
but it is safe and breaks the deadlock.
Repro steps (I only triggered it 3 times in hundreds of tests):
two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
The test script will hang when executing "mdadm --remove".
```
# dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D 0 5329 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? _cond_resched+0x2d/0x40
? schedule+0x4a/0xb0
? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
? wait_woken+0x80/0x80
? process_recvd_msg+0x113/0x1d0 [md_cluster]
? recv_daemon+0x9e/0x120 [md_cluster]
? md_thread+0x94/0x160 [md_mod]
? wait_woken+0x80/0x80
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
mdadm D 0 5423 1 0x00004004
Call Trace:
__schedule+0x1f6/0x560
? __schedule+0x1fe/0x560
? schedule+0x4a/0xb0
? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
? wait_woken+0x80/0x80
? remove_disk+0x4f/0x90 [md_cluster]
? hot_remove_disk+0xb1/0x1b0 [md_mod]
? md_ioctl+0x50c/0xba0 [md_mod]
? wait_woken+0x80/0x80
? blkdev_ioctl+0xa2/0x2a0
? block_ioctl+0x39/0x40
? ksys_ioctl+0x82/0xc0
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x5f/0x150
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
md0_resync D 0 5425 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? schedule+0x4a/0xb0
? dlm_lock_sync+0xa1/0xd0 [md_cluster]
? wait_woken+0x80/0x80
? lock_token+0x2d/0x90 [md_cluster]
? resync_info_update+0x95/0x100 [md_cluster]
? raid1_sync_request+0x7d3/0xa40 [raid1]
? md_do_sync.cold+0x737/0xc8f [md_mod]
? md_thread+0x94/0x160 [md_mod]
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
```
At last, thanks for Xiao's solution.
Cc: [email protected]
Signed-off-by: Zhao Heming <[email protected]>
Suggested-by: Xiao Ni <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
A reshape request should be blocked while a resync job is ongoing. In a
cluster env, a node can start a resync job even if the resync cmd isn't
executed on it, e.g., the user executes "mdadm --grow" on node A and
sometimes node B will start the resync job. However, the current
update_raid_disks() only checks the local recovery status, which is
incomplete. As a result, the user can execute "mdadm --grow" successfully
on the local node, while the remote node refuses to do the reshape job
when it is doing a resync job. The inconsistent handling causes the array
to enter an unexpected state. If the user doesn't notice this issue and
continues executing mdadm commands, the array eventually stops working.
Fix this issue by blocking the reshape request. When a node executes
"--grow" and detects an ongoing resync, it should stop and report an
error to the user.
The following script reproduces the issue with ~100% probability.
(two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
# on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
Cc: [email protected]
Signed-off-by: Zhao Heming <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
The request coalescing logic uses 'prev_flush_start' as the base for
comparing the current request start time, but 'prev_flush_start' is
updated in another context.
This patch changes the ktime comparison to use 'req_start' as the base,
for better code readability.
Signed-off-by: Pankaj Gupta <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
The request coalescing logic depends on the flush time being updated in
another context. This patch adds comments to make the code flow easier
to understand.
Signed-off-by: Pankaj Gupta <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
This patch improves readability by using better variable names
in flush request coalescing logic.
Signed-off-by: Pankaj Gupta <[email protected]>
Reviewed-by: Paul Menzel <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
In __make_request() a new r10bio is allocated and passed to
raid10_read_request(). The read_slot member of the r10bio is not
initialized, and raid10_read_request() uses it to index an
array. This leads to occasional panics.
Fix this by initializing the field to an invalid value and checking for
a valid value in raid10_read_request().
Cc: [email protected]
Signed-off-by: Kevin Vigor <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|