Merge branch 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.4/block
Pull MD fixes from Song.
* 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid5: use bio_end_sector to calculate last_sector
md/raid1: fail run raid1 array when active disk less than one
md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
|
|
Use the common way to get last_sector.
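For illustration, the helper from <linux/bio.h> replaces the open-coded form (the raid5 variable name here is an assumption):

    /* open-coded */
    last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size >> 9);

    /* common helper */
    last_sector = bio_end_sector(bi);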
Signed-off-by: Guoqing Jiang <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
When running the following test case:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
the final mdadm assemble fails, with kernel messages as follows:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
In fact, when a raid1 array has fewer than one active disk, we
need to return failure from raid1_run().
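A minimal sketch of such a check in raid1_run() (exact placement and error code are assumptions):

    /* RAID1 needs at least one active disk */
    if (conf->raid_disks - mddev->degraded < 1) {
        ret = -EINVAL;
        goto abort;
    }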
Reviewed-by: NeilBrown <[email protected]>
Signed-off-by: Yufen Yu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
Currently md raid0/linear have no mechanism to detect whether an array
member was removed or has failed. The driver keeps sending BIOs
regardless of the state of the array members, and the kernel shows state
'clean' in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, a user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.
In other words, no -EIO is returned and writes (except direct ones) appear
normal. That means the user might think the written data is correctly
stored in the array when in fact garbage was written, given that raid0
does striping (and so requires all of its members to be working in order
not to corrupt data). For md/linear, writes to the available members will
work fine, but writes that land on the missing member(s) cause file
corruption, since that portion of the data is never effectively written.
This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.
A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written to 'array_state': since it only
indicates that one or more members are missing while otherwise behaving
like 'clean', writing it wouldn't make sense.
With this patch, the filesystem reacts much faster to a missing array
member: after a few I/O errors, ext4 for instance aborts the journal and
prevents corruption. Without this change, we can keep writing to the disk,
and after a machine reboot e2fsck reports severe fs errors that demand
fixing. This patch was tested on ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.
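A hedged sketch of the submission-path check (the helper name and exact fields are assumptions; GENHD_FL_UP is the 5.4-era "disk is alive" flag):

    static inline bool is_mddev_broken(struct md_rdev *rdev, const char *md_type)
    {
        if (!(rdev->bdev->bd_disk->flags & GENHD_FL_UP)) {
            /* warn only when the flag transitions */
            if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags))
                pr_warn("md: %s: %s array has a missing/failed member\n",
                        mdname(rdev->mddev), md_type);
            return true;
        }
        return false;
    }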
Cc: Song Liu <[email protected]>
Reviewed-by: NeilBrown <[email protected]>
Signed-off-by: Guilherme G. Piccoli <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
The race was when a thread using closure_sync() notices cl->s->done == 1
before the thread calling closure_put() calls wake_up_process(). Then,
it's possible for that thread to return and exit just before
wake_up_process() is called - so we're trying to wake up a process that
no longer exists.
rcu_read_lock() is sufficient to protect against this, as there's an rcu
barrier somewhere in the process teardown path.
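A sketch of the wakeup side (the field names follow the description above and are assumptions):

    struct task_struct *p;

    rcu_read_lock();            /* pins the task_struct across the wakeup */
    p = READ_ONCE(s->task);
    s->done = 1;
    wake_up_process(p);
    rcu_read_unlock();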
Signed-off-by: Kent Overstreet <[email protected]>
Acked-by: Coly Li <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
The copy_to_user() function returns the number of bytes remaining to be
copied, but the intention here was to return -EFAULT if the copy fails.
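The canonical pattern (buffer names are placeholders):

    if (copy_to_user(user_buf, kbuf, len))
        return -EFAULT;         /* not the number of uncopied bytes */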
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Reading /sys/fs/bcache/<uuid>/cacheN/priority_stats can take a very long
time with a huge cache after a long run.
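One plausible mitigation, sketched here and not necessarily the exact upstream change, is to yield inside the comparator used to sort the bucket priorities:

    static int __bch_cache_cmp(const void *l, const void *r)
    {
        cond_resched();    /* sorting a huge priority array can hog the CPU */
        return *((uint16_t *)r) - *((uint16_t *)l);    /* descending */
    }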
Signed-off-by: Shile Zhang <[email protected]>
Tested-by: Heitor Alves de Siqueira <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This argument has not been considered since blk-mq became the default,
so remove this documentation to avoid confusion.
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
.txt file is now .rst
Signed-off-by: Jens Axboe <[email protected]>
|
|
This argument has been ignored since blk-mq became the default, so remove
it from the documentation.
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
.txt file is now .rst
Signed-off-by: Jens Axboe <[email protected]>
|
|
Since the inclusion of blk-mq, the elevator argument has not been
considered anymore; its utility died along with the legacy IO path,
which has now been removed too.
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
Fold with doc removal patch.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit 7211aef86f79 ("block: mq-deadline: Fix write completion
handling") added a call to blk_mq_sched_mark_restart_hctx() in
dd_dispatch_request() to make sure that write request dispatching does
not stall when all target zones are locked. This fix left a subtle race
when a write completion happens during a dispatch execution on another
CPU:
CPU 0: Dispatch request                CPU 1: Write request completion

dd_dispatch_request()
  lock(&dd->lock);
  ...
  lock(&dd->zone_lock);                dd_finish_request()
  rq = find request                      lock(&dd->zone_lock);
  unlock(&dd->zone_lock);
                                         zone write unlock
                                         unlock(&dd->zone_lock);
                                         ...
                                         __blk_mq_free_request
                                           check restart flag (not set)
                                           -> queue not run
  ...
  if (!rq && have writes)
    blk_mq_sched_mark_restart_hctx()
  unlock(&dd->lock)
Since the dispatch context finishes after the write request completion
handling, the queue restart mark is not seen by __blk_mq_free_request(),
blk_mq_sched_restart() is not executed, and dispatching stalls under
100% write workloads.
Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from
dd_dispatch_request() into dd_finish_request() under the zone lock to
ensure full mutual exclusion between write request dispatch selection
and zone unlock on write request completion.
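A hedged sketch of the resulting dd_finish_request() (the zone-lock fields come from the description above; the exact body is an assumption):

    static void dd_finish_request(struct request *rq)
    {
        struct deadline_data *dd = rq->q->elevator->elevator_data;

        if (blk_queue_is_zoned(rq->q)) {
            unsigned long flags;

            spin_lock_irqsave(&dd->zone_lock, flags);
            blk_req_zone_write_unlock(rq);
            /* marking now happens under zone_lock, so a concurrent
             * dispatch cannot complete without seeing it */
            if (!list_empty(&dd->fifo_list[WRITE]))
                blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
            spin_unlock_irqrestore(&dd->zone_lock, flags);
        }
    }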
Fixes: 7211aef86f79 ("block: mq-deadline: Fix write completion handling")
Cc: [email protected]
Reported-by: Hans Holmberg <[email protected]>
Reviewed-by: Hans Holmberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
page->mapping may encode different values, and page_mapping() should
always be used to access the mapping pointer.
The track_foreign_dirty tracepoint was incorrectly accessing
page->mapping directly. Use page_mapping() instead. Also, add NULL
checks while at it.
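A sketch of the safe access pattern in the tracepoint assignment (the entry field name is an assumption):

    struct address_space *mapping = page_mapping(page);    /* not page->mapping */
    struct inode *inode = mapping ? mapping->host : NULL;

    __entry->ino = inode ? inode->i_ino : 0;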
Fixes: 3a8e9ac89e6a ("writeback: add tracepoints for cgroup foreign writebacks")
Reported-by: Jan Kara <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull NVMe changes from Sagi:
"The nvme updates include:
- ana log parse fix from Anton
- nvme quirks support for Apple devices from Ben
- fix missing bio completion tracing for multipath stack devices from
Hannes and Mikhail
- IP TOS settings for nvme rdma and tcp transports from Israel
- rq_dma_dir cleanups from Israel
- tracing for Get LBA Status command from Minwoo
- Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
- Some consolidation between the fabrics transports for handling the CAP
register
- reset race with ns scanning fix for fabrics (move fabrics commands to
a dedicated request queue with a different lifetime from the admin
request queue)."
* 'nvme-5.4' of git://git.infradead.org/nvme: (30 commits)
nvme-rdma: Use rq_dma_dir macro
nvme-fc: Use rq_dma_dir macro
nvme-pci: Tidy up nvme_unmap_data
nvme: make fabrics command run on a separate request queue
nvme-pci: Support shared tags across queues for Apple 2018 controllers
nvme-pci: Add support for Apple 2018+ models
nvme-pci: Add support for variable IO SQ element size
nvme-pci: Pass the queue to SQ_SIZE/CQ_SIZE macros
nvme: trace bio completion
nvme-multipath: fix ana log nsid lookup when nsid is not found
nvmet-tcp: Add TOS for tcp transport
nvme-tcp: Add TOS for tcp transport
nvme-tcp: Use struct nvme_ctrl directly
nvme-rdma: Add TOS for rdma transport
nvme-fabrics: Add type of service (TOS) configuration
nvmet-tcp: fix possible memory leak
nvmet-tcp: fix possible NULL deref
nvmet: trace: parse Get LBA Status command in detail
nvme: trace: parse Get LBA Status command in detail
nvme: trace: support for Get LBA Status opcode parsed
...
|
|
cgroup foreign inode handling has quite a bit of heuristics and
internal states which sometimes makes it difficult to understand
what's going on. Add tracepoints to improve visibility.
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
ioc_cpd_alloc() forgot to check NULL return from kzalloc(). Add it.
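A sketch (the surrounding lines are assumptions):

    static struct blkcg_policy_data *ioc_cpd_alloc(gfp_t gfp)
    {
        struct ioc_cgrp *iocc;

        iocc = kzalloc(sizeof(*iocc), gfp);
        if (!iocc)
            return NULL;    /* the missing check */

        iocc->dfl_weight = CGROUP_WEIGHT_DFL;
        return &iocc->cpd;
    }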
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Remove code duplication.
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Remove code duplication.
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: James Smart <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Remove pointless local variable and use rq_dma_dir macro.
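For illustration, the before/after pattern looks roughly like:

    /* before */
    enum dma_data_direction dma_dir = rq_data_dir(req) ?
                                      DMA_TO_DEVICE : DMA_FROM_DEVICE;
    dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);

    /* after: no local variable needed */
    dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));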
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
We have a fundamental issue: fabrics commands use the admin_q.
The reason is that admin-connect, register reads and writes, and
admin commands cannot be guaranteed ordering while we are running
controller resets.
For example, when we reset a controller we perform:
1. disable the controller
2. teardown the admin queue
3. re-establish the admin queue
4. enable the controller
In order to perform (3), we need to unquiesce the admin queue; however,
we may have admin commands already pending on the quiesced admin_q that
will execute immediately when we unquiesce it, before we execute (4).
The host must not send admin commands to the controller before enabling
the controller.
To fix this, we have the fabric commands (admin connect and property
get/set, but not I/O queue connect) use a separate fabrics_q and make
sure to quiesce the admin_q before we disable the controller, and
unquiesce it only after we enable the controller.
This fixes the error prints from nvmet in a controller reset storm test:
kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0
which indicates that the host is sending an admin command while the
controller is not enabled.
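A hedged sketch of the resulting reset ordering (queue names follow the description; exact call sites vary per transport):

    blk_mq_quiesce_queue(ctrl->admin_q);    /* before disabling: no admin cmds in flight */
    nvme_disable_ctrl(ctrl);                /* steps 1 + 2: teardown */
    /* ... re-establish queues; admin-connect and property get/set
     * now go through ctrl->fabrics_q, not ctrl->admin_q ... */
    nvme_enable_ctrl(ctrl);                 /* step 4 */
    blk_mq_unquiesce_queue(ctrl->admin_q);  /* only now may admin cmds run */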
Reviewed-by: James Smart <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Another issue with the Apple T2-based 2018 controllers seems to be
that they blow up (and shut the machine down) if there's a tag
collision between the IO queue and the admin queue.
My suspicion is that they use our tags for their internal tracking
and don't mix them with the queue id. They also seem to dislike tags
going beyond the IO queue depth, i.e. 128 tags.
This adds a quirk that marks tags 0..31 of the IO queue reserved.
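A hedged sketch of how such a quirk might be wired up (the quirk and constant names are assumptions):

    /* reserve the admin-sized tag range on the IO tagset so IO tags
     * start above it and never collide with admin tags */
    if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
        dev->tagset.reserved_tags = NVME_AQ_DEPTH;    /* 32 */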
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Acked-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Based on reverse engineering and an original patch by
Paul Pawlowski <[email protected]>.
This adds support for Apple's weird implementation of NVMe in their
2018 or later machines. It accounts for the twice-as-big SQ entries
for the IO queues, and for the fact that only interrupt vector 0
appears to function properly.
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Per the spec, the size of a submission queue element should always be
6 (a power-of-two exponent, i.e. 64 bytes).
However some controllers such as Apple's are not properly implementing
the standard and require a different size.
This provides the ground work for the subsequent quirks for these
controllers.
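A hedged sketch of the groundwork (the field and quirk names are assumptions):

    /* per-controller shift for the SQ entry size: 6 (64 bytes) per
     * spec, 7 (128 bytes) on the quirky controllers */
    dev->io_sqes = NVME_NVM_IOSQES;
    if (dev->ctrl.quirks & NVME_QUIRK_128_BYTES_SQES)
        dev->io_sqes = 7;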
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
This will make it easier to handle variable queue entry sizes
later. No functional change.
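For illustration, the macro change looks roughly like (a sketch):

    /* before: takes a depth */
    #define SQ_SIZE(depth)  ((depth) * sizeof(struct nvme_command))

    /* after: takes the queue, so entry size can become per-queue later */
    #define SQ_SIZE(q)      ((q)->q_depth * sizeof(struct nvme_command))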
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
When native multipathing is enabled we cannot enable blktrace for
the underlying paths, so completions are never traced.
Signed-off-by: Hannes Reinecke <[email protected]>
[fixed-up by Mikhail for non-multipath-build]
Signed-off-by: Mikhail Skorzhinskii <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
ANA log parsing invokes nvme_update_ana_state() per ANA group desc.
This updates the state of namespaces with nsids in desc->nsids[].
Both ctrl->namespaces list and desc->nsids[] array are sorted by nsid.
Hence nvme_update_ana_state() performs a single walk over ctrl->namespaces:
- if current namespace matches the current desc->nsids[n],
this namespace is updated, and n is incremented.
- the walk stops when it reaches the end of either
ctrl->namespaces or desc->nsids[]
In case desc->nsids[n] does not match any of ctrl->namespaces,
the remaining nsids following desc->nsids[n] will not be updated.
Such a situation was considered abnormal and generated a WARN_ON_ONCE.
However ANA log MAY contain nsids not (yet) found in ctrl->namespaces.
For example, let's consider the following scenario:
- nvme0 exposes namespaces with nsids = [2, 3] to the host
- a new namespace nsid = 1 is added dynamically
- also, an ANA topology change is triggered
- an NS_CHANGED AEN is generated and triggers scan_work
- before scan_work discovers nsid=1 and creates a namespace, a NOTICE_ANA
AEN is issued and ana_work receives an ANA log with nsids=[1, 2, 3]
Result: ana_work fails to update ANA state on existing namespaces [2, 3]
Solution:
Change the way nvme_update_ana_state() namespace list walk
checks the current namespace against desc->nsids[n] as follows:
a) ns->head->ns_id < desc->nsids[n]: keep walking ctrl->namespaces.
b) ns->head->ns_id == desc->nsids[n]: match, update the namespace
c) ns->head->ns_id > desc->nsids[n]: skip to desc->nsids[n+1]
This enables correct operation in the scenario described above.
This also allows ANA log to contain nsids currently invisible
to the host, i.e. inactive nsids.
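A sketch of the corrected walk (cases a/b/c above; nr_nsids comes from the group descriptor):

    n = 0;
    list_for_each_entry(ns, &ctrl->namespaces, list) {
        /* (c) nsids with no matching namespace yet: skip them */
        while (n < nr_nsids &&
               ns->head->ns_id > le32_to_cpu(desc->nsids[n]))
            n++;
        if (n == nr_nsids)
            break;
        if (ns->head->ns_id == le32_to_cpu(desc->nsids[n])) {
            nvme_update_ns_ana_state(desc, ns);    /* (b) match */
            n++;
        }
        /* (a) ns_id < nsids[n]: just advance to the next namespace */
    }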
Signed-off-by: Anton Eidelman <[email protected]>
Reviewed-by: James Smart <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Set the outgoing packets type of service (TOS) according to the
receiving TOS.
Signed-off-by: Israel Rukshin <[email protected]>
Suggested-by: Sagi Grimberg <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
TOS provides clients the ability to segregate traffic flows for
different types of data.
One use of TOS is bandwidth management, which allows setting bandwidth
limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A
and 20% to controllers at QoS class B.
usage examples:
nvme connect --tos=0 --transport=tcp --traddr=10.0.1.1 --nqn=test-nvme
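For illustration, a hedged sketch of applying the option on the queue's socket (5.4-era kernel_setsockopt(); the call site and field names are assumptions):

    if (nctrl->opts->tos >= 0) {
        int tos = nctrl->opts->tos;

        ret = kernel_setsockopt(queue->sock, SOL_IP, IP_TOS,
                                (char *)&tos, sizeof(tos));
    }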
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
This patch doesn't change any functionality.
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
For RDMA transports, TOS is an extension of IB QoS that provides clients
the ability to segregate traffic flows for different types of data.
RDMA CM abstracts it for ULPs using rdma_set_service_type().
Internally, each traffic flow is represented by a connection with all of
its independent resources, like a normal connection, and is
differentiated by service type. In other words, there can be multiple QP
connections between an IP pair, each supporting a unique service type.
One use of TOS is bandwidth management, which allows setting bandwidth
limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A
and 20% to controllers at QoS class B.
Note: in addition to the TOS configuration, QoS must be configured on the
relevant HCA on the target (which sends RDMA commands) and on the
initiator in order to take effect on the traffic.
usage examples:
nvme connect --tos=0 --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
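A hedged sketch of the RDMA side (rdma_set_service_type() is the RDMA CM helper named above; the surrounding names are assumptions):

    if (ctrl->ctrl.opts->tos >= 0)
        rdma_set_service_type(queue->cm_id, ctrl->ctrl.opts->tos);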
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
TOS is user-defined and needs to be configured via nvme-cli.
It must be set before initiating any traffic and once set the TOS
cannot be changed.
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
When we uninit a command in the error flow, we also need to
free the iovec if it was allocated.
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
We must only call sgl_free() for an sgl that we actually
allocated.
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Four different fields are carried in the CDWs of the Get LBA Status
command, so it is useful to see them in detail when tracing on the
target side as well.
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Four different fields are carried in the CDWs of the Get LBA Status
command, so it is useful to see them in detail when tracing.
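A hedged sketch of the decoder (the four fields per the command definition; the exact offsets are assumptions):

    /* SLBA, MNDW, RL, ATYPE decoded from the command dwords */
    u64 slba  = get_unaligned_le64(cdw10);
    u32 mndw  = get_unaligned_le32(cdw10 + 8);
    u16 rl    = get_unaligned_le16(cdw10 + 12);
    u8  atype = cdw10[15];

    trace_seq_printf(p, "slba=%llu, mndw=%u, rl=%u, atype=%u",
                     slba, mndw, rl, atype);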
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
This patch adds the Get LBA Status command's opcode to the macro used by
the trace feature. Now we can see "get_lba_status" instead of the raw
opcode value.
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
NVMe 1.4 added Get LBA Status command with opcode 0x86.
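For reference, the enum entry amounts to roughly:

    /* in include/linux/nvme.h, enum nvme_opcode */
    nvme_cmd_get_lba_status = 0x86,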
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
In NVMe spec 1.3 there is a definition for the data write/read counters
in the SMART log (see section 5.14.1.2):
This value is reported in thousands (i.e., a value of 1
corresponds to 1000 units of 512 bytes read) and is rounded up.
However, the nvme target reports the value in actual units,
not in thousands of units as the spec requires.
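A hedged sketch of the conversion (the bdev and stat field names are assumptions):

    /* report thousands of 512-byte units, rounded up */
    data_units_read =
        DIV_ROUND_UP(part_stat_read(ns->bdev->bd_part, sectors[READ]), 1000);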
Signed-off-by: Tom Wu <[email protected]>
Reviewed-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Simple polling support via the socket busy_poll interface.
Although we do not shut down interrupts and simply hammer
the socket poll, we can sometimes find completions faster
than through the normal interrupt-driven RX path.
We add a per-queue nr_cqe counter that resets every time
the RX path is invoked, so that the .poll callback can return
it and stay consistent with the semantics.
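A hedged sketch of the .poll callback (the names are assumptions):

    static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx)
    {
        struct nvme_tcp_queue *queue = hctx->driver_data;
        struct sock *sk = queue->sock->sk;

        if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue))
            sk_busy_loop(sk, true);    /* hammer the socket poll */
        nvme_tcp_try_recv(queue);
        return queue->nr_cqe;          /* completions seen this invocation */
    }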
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
The tcp host module now uses these APIs from crypto ahash:
(1) crypto_ahash_final()
(2) crypto_ahash_digest()
(3) crypto_alloc_ahash()
nvme-tcp should therefore depend on CRYPTO_CRC32C.
Cc: Christoph Hellwig <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
All callers pass ctrl->cap, so there is no need to pass it
at all.
Reviewed-by: Minwoo Im <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
nvme_enable_ctrl reads the cap register right afterwards, so
there is no need to do that locally in the transport driver.
Move the sqsize setting into nvme_init_identify.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Align with what the rest of the transports are doing.
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
No need to use a stack cap variable.
Reviewed-by: Minwoo Im <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Using the socket-specific read_sock() call instead of calling
tcp_read_sock() directly allows any handlers registered by an LLD module
to be called from the nvme-tcp host.
This patch therefore replaces tcp_read_sock() with the socket-specific
prot_ops.
Signed-off-by: Potnuri Bharat Teja <[email protected]>
Acked-by: Sagi Grimberg <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
We can return directly from the switch statement.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
blk_iocost_init() forgot to free its percpu stat on the error path.
Fix it.
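The error path then becomes, roughly (the label name is an assumption):

    err_free:
        free_percpu(ioc->pcpu_stat);    /* previously leaked */
        kfree(ioc);
        return ret;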
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Reported-by: Hillf Danton <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add a script which can be used to generate device-specific iocost
linear model coefficients.
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Instead of mucking with debugfs and ->pd_stat(), add a drgn-based
monitoring script.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This patchset implements a work-conserving proportional IO controller
based on an IO cost model.
While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others. In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.
One challenge of controlling IO resources is the lack of trivially
observable cost metric. The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern. However, the cost isn't a complete mystery. Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.
The function which determines the cost of a given IO is the IO cost
model for the device. This controller distributes IO capacity based
on the costs estimated by such a model. The more accurate the cost
model, the better, but the controller adapts based on IO completion
latency, and as long as the relative costs across different IO
patterns are consistent and sensible, it will adapt to the actual
performance of the device.
Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well. All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.
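To make the linear model concrete, a hedged sketch (the coefficient names are illustrative, not the driver's):

    /* cost = per-IO base + per-page cost, plus a penalty for random IO;
     * absolute numbers don't matter, only relative costs do */
    u64 iocost(int rw, bool seq, u64 bytes)
    {
        u64 cost = coef_base[rw];

        cost += coef_page[rw] * (bytes >> PAGE_SHIFT);
        if (!seq)
            cost += coef_rand[rw];
        return cost;
    }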
Please see the top comment in blk-iocost.c and documentation for
more details.
v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
for a divide-by-zero bug in current_hweight() triggered by zero
inuse_sum.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Andy Newell <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|