aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-11-12genirq/proc: Return proper error code when irq_set_affinity() failsWen Yaxng1-2/+3
write_irq_affinity() returns the number of written bytes, which means success, unconditionally whether the actual irq_set_affinity() call succeeded or not. Add proper error handling and pass the error code returned from irq_set_affinity() back to user space in case of failure. [ tglx: Fixed coding style and massaged changelog ] Signed-off-by: Wen Yang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Jiang Biao <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2017-11-12modpost: detect modules without a MODULE_LICENSERandy Dunlap1-1/+1
Partially revert commit 2fa365682943 ("kbuild: soften MODULE_LICENSE check") so that modpost detects modules that do not have a MODULE_LICENSE. Sam's commit also changed the fatal error to a warning, which I am leaving as is. This gives advance notice of when a module has no license and will taint the kernel if the module is loaded. This produces the following warnings on x86_64 allmodconfig: MODPOST 6520 modules WARNING: modpost: missing MODULE_LICENSE() in drivers/auxdisplay/img-ascii-lcd.o WARNING: modpost: missing MODULE_LICENSE() in drivers/gpio/gpio-ath79.o WARNING: modpost: missing MODULE_LICENSE() in drivers/gpio/gpio-iop.o WARNING: modpost: missing MODULE_LICENSE() in drivers/iio/accel/kxsd9-i2c.o WARNING: modpost: missing MODULE_LICENSE() in drivers/iio/adc/qcom-vadc-common.o WARNING: modpost: missing MODULE_LICENSE() in drivers/media/platform/mtk-vcodec/mtk-vcodec-common.o WARNING: modpost: missing MODULE_LICENSE() in drivers/media/platform/soc_camera/soc_scale_crop.o WARNING: modpost: missing MODULE_LICENSE() in drivers/mtd/nand/denali_pci.o WARNING: modpost: missing MODULE_LICENSE() in drivers/net/phy/cortina.o WARNING: modpost: missing MODULE_LICENSE() in drivers/pinctrl/pxa/pinctrl-pxa2xx.o WARNING: modpost: missing MODULE_LICENSE() in drivers/power/reset/zx-reboot.o WARNING: modpost: missing MODULE_LICENSE() in drivers/rpmsg/qcom_glink_native.o WARNING: modpost: missing MODULE_LICENSE() in drivers/staging/comedi/drivers/ni_atmio.o WARNING: modpost: missing MODULE_LICENSE() in net/9p/9pnet_xen.o WARNING: modpost: missing MODULE_LICENSE() in sound/soc/codecs/snd-soc-pcm512x-spi.o Signed-off-by: Randy Dunlap <[email protected]> Cc: Sam Ravnborg <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-12Linux 4.14Linus Torvalds1-1/+1
2017-11-12Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds7-19/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of small fixes: - make KGDB work again which got broken by the conversion of WARN() to #UD. The WARN fixup needs to run before the notifier callchain, otherwise KGDB tries to handle it and crashes. - disable KASAN in the ORC unwinder to prevent false positive KASAN warnings - prevent default mapping above 47bit when 5 level page tables are enabled - make the delay calibration optimization work correctly, which had the conditionals the wrong way around and was operating on data which was not yet updated. - remove the bogus X86_TRAP_BP trap init from the default IDT init table, which broke 32bit int3 handling by overwriting the correct int3 setup. - replace this_cpu* with boot_cpu_data access in the preemptible oprofile init code" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/debug: Handle warnings before the notifier chain, to fix KGDB crash x86/mm: Fix ELF_ET_DYN_BASE for 5-level paging x86/idt: Remove X86_TRAP_BP initialization in idt_setup_traps() x86/oprofile/ppro: Do not use __this_cpu*() in preemptible context x86/unwind: Disable KASAN checking in the ORC unwinder x86/smpboot: Make optimization of delay calibration work correctly
2017-11-12Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds3-2/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf tool fixes from Thomas Gleixner: "A small set of fixes for perf tool: - synchronize the i915 drm header to avoid the 'out of date' warning - make sure that perf trace cleans up its temporary files on exit - unbreak the build with newer flex versions - add missing braces in the eBPF parsing rules" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tooling/headers: Sync the tools/include/uapi/drm/i915_drm.h UAPI header perf trace: Call machine__exit() at exit perf tools: Fix eBPF event specification parsing perf tools: Add "reject" option for parse-events.l
2017-11-12timers: Add a function to start/reduce a timerDavid Howells2-7/+39
Add a function, similar to mod_timer(), that will start a timer if it isn't running and will modify it if it is running and has an expiry time longer than the new time. If the timer is running with an expiry time that's the same or sooner, no change is made. The function looks like: int timer_reduce(struct timer_list *timer, unsigned long expires); This can be used by code such as networking code to make it easier to share a timer for multiple timeouts. For instance, in upcoming AF_RXRPC code, the rxrpc_call struct will maintain a number of timeouts: unsigned long ack_at; unsigned long resend_at; unsigned long ping_at; unsigned long expect_rx_by; unsigned long expect_req_by; unsigned long expect_term_by; each of which is set independently of the others. With timer reduction available, when the code needs to set one of the timeouts, it only needs to look at that timeout and then call timer_reduce() to modify the timer, starting it or bringing it forward if necessary. There is no need to refer to the other timeouts to see which is earliest and no need to take any lock other than, potentially, the timer lock inside timer_reduce(). Note, that this does not protect against concurrent invocations of any of the timer functions. As an example, the expect_rx_by timeout above, which terminates a call if we don't get a packet from the server within a certain time window, would be set something like this: unsigned long now = jiffies; unsigned long expect_rx_by = now + packet_receive_timeout; WRITE_ONCE(call->expect_rx_by, expect_rx_by); timer_reduce(&call->timer, expect_rx_by); The timer service code (which might, say, be in a work function) would then check all the timeouts to see which, if any, had triggered, deal with those: t = READ_ONCE(call->ack_at); if (time_after_eq(now, t)) { cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET); set_bit(RXRPC_CALL_EV_ACK, &call->events); } and then restart the timer if necessary by finding the soonest timeout that hasn't yet passed and then calling timer_reduce(). The disadvantage of doing things this way rather than comparing the timers each time and calling mod_timer() is that you *will* take timer events unless you can finish what you're doing and delete the timer in time. The advantage of doing things this way is that you don't need to use a lock to work out when the next timer should be set, other than the timer's own lock - which you might not have to take. [ tglx: Fixed weird formatting and adopted it to pending changes ] Signed-off-by: David Howells <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
2017-11-12pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()Arnd Bergmann2-4/+2
__getnstimeofday() is a rather odd interface, with a number of quirks: - The caller may come from NMI context, but the implementation is not NMI safe, one way to get there from NMI is NMI handler: something bad panic() kmsg_dump() pstore_dump() pstore_record_init() __getnstimeofday() - The calling conventions are different from any other timekeeping functions, to deal with returning an error code during suspended timekeeping. Address the above issues by using a completely different method to get the time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior when timekeeping is suspended: it returns the time at which it got suspended. As Thomas Gleixner explained, this is safe, as ktime_get_real_fast_ns() does not call into the clocksource driver that might be suspended. The result can easily be transformed into a timespec structure. Since ktime_get_real_fast_ns() was not exported to modules, add the export. The pstore behavior for the suspended case changes slightly, as it now stores the timestamp at which timekeeping was suspended instead of storing a zero timestamp. This change is not addressing y2038-safety, that's subject to a more complex follow up patch. Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Tony Luck <[email protected]> Cc: Anton Vorontsov <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: John Stultz <[email protected]> Cc: Colin Cross <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-11-12irq/work: Use llist_for_each_entry_safeThomas Gleixner1-3/+3
The llist_for_each_entry() loop in irq_work_run_list() is unsafe because once the works PENDING bit is cleared it can be requeued on another CPU. Use llist_for_each_entry_safe() instead. Fixes: 16c0890dc66d ("irq/work: Don't reinvent the wheel but use existing llist API") Reported-by:Chris Wilson <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Byungchul Park <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Petri Latvala <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-11-12x86/intel_rdt: Fix a silent failure when writing zero value schemataXiaochen Shen1-0/+5
Writing an invalid schemata with no domain values (e.g., "(L3|MB):"), results in a silent failure, i.e. the last_cmd_status returns OK, Check for an empty value and set the result string with a proper error message and return -EINVAL. Before the fix: # mkdir /sys/fs/resctrl/p1 # echo "L3:" > /sys/fs/resctrl/p1/schemata (silent failure) # cat /sys/fs/resctrl/info/last_cmd_status ok # echo "MB:" > /sys/fs/resctrl/p1/schemata (silent failure) # cat /sys/fs/resctrl/info/last_cmd_status ok After the fix: # mkdir /sys/fs/resctrl/p1 # echo "L3:" > /sys/fs/resctrl/p1/schemata -bash: echo: write error: Invalid argument # cat /sys/fs/resctrl/info/last_cmd_status Missing 'L3' value # echo "MB:" > /sys/fs/resctrl/p1/schemata -bash: echo: write error: Invalid argument # cat /sys/fs/resctrl/info/last_cmd_status Missing 'MB' value [ Tony: This is an unintended side effect of the patch earlier to allow the user to just write the value they want to change. While allowing user to specify less than all of the values, it also allows an empty value. ] Fixes: c4026b7b95a4 ("x86/intel_rdt: Implement "update" mode when writing schemata file") Signed-off-by: Xiaochen Shen <[email protected]> Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Vikas Shivappa <[email protected]> Cc: Fenghua Yu <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-11-11nvme: fix visibility of "uuid" ns attributeMartin Wilck1-1/+1
"uuid" must be invisible if both ns->uuid and ns->nguid are unset, not if either one is. Fixes: d934f9848a77 "nvme: provide UUID value to userspace" Signed-off-by: Martin Wilck <[email protected]> [hch: rebased to the nvme-4.15 tree to help resolving a conflict] Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds15-34/+68
Pull networking fixes from David Miller: 1) Use after free in vlan, from Cong Wang. 2) Handle NAPI poll with a zero budget properly in mlx5 driver, from Saeed Mahameed. 3) If DMA mapping fails in mlx5 driver, NULL out page, from Inbar Karmy. 4) Handle overrun in RX FIFO of sun4i CAN driver, from Gerhard Bertelsmann. 5) Missing return in mdb and vlan prepare phase of DSA layer, from Vivien Didelot. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: vlan: fix a use-after-free in vlan_device_event() net: dsa: return after vlan prepare phase net: dsa: return after mdb prepare phase can: ifi: Fix transmitter delay calculation tcp: fix tcp_fastretrans_alert warning tcp: gso: avoid refcount_t warning from tcp_gso_segment() can: peak: Add support for new PCIe/M2 CAN FD interfaces can: sun4i: handle overrun in RX FIFO can: c_can: don't indicate triple sampling support for D_CAN net/mlx5e: Increase Striding RQ minimum size limit to 4 multi-packet WQEs net/mlx5e: Set page to null in case dma mapping fails net/mlx5e: Fix napi poll with zero budget net/mlx5: Cancel health poll before sending panic teardown command net/mlx5: Loop over temp list to release delay events rds: ib: Fix NULL pointer dereference in debug code
2017-11-11staging: lustre: add SPDX identifiers to all lustre filesGreg Kroah-Hartman271-0/+271
It's good to have SPDX identifiers in all files to make it easier to audit the kernel tree for correct licenses. Update the drivers/staging/lustre files files with the correct SPDX license identifier based on the license text in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This work is based on a script and data from Thomas Gleixner, Philippe Ombredanne, and Kate Stewart. Cc: Oleg Drokin <[email protected]> Cc: James Simmons <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Kate Stewart <[email protected]> Cc: Philippe Ombredanne <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-11-11staging: greybus: Remove redundant license textGreg Kroah-Hartman63-127/+0
Now that the SPDX tag is in all greybus files, that identifies the license in a specific and legally-defined manner. So the extra GPL text wording can be removed as it is no longer needed at all. This is done on a quest to remove the 700+ different ways that files in the kernel describe the GPL license text. And there's unneeded stuff like the address (sometimes incorrect) for the FSF which is never needed. No copyright headers or other non-license-description text was removed. Cc: Vaibhav Hiremath <[email protected]> Reviewed-by: Alex Elder <[email protected]> Acked-by: Vaibhav Agarwal <[email protected]> Acked-by: David Lin <[email protected]> Acked-by: Johan Hovold <[email protected]> Acked-by: Viresh Kumar <[email protected]> Acked-by: Mark Greer <[email protected]> Acked-by: Rui Miguel Silva <[email protected]> Acked-by: "Bryan O'Donoghue" <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-11-11staging: greybus: add SPDX identifiers to all greybus driver filesGreg Kroah-Hartman74-0/+74
It's good to have SPDX identifiers in all files to make it easier to audit the kernel tree for correct licenses. Update the drivers/staging/greybus files files with the correct SPDX license identifier based on the license text in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This work is based on a script and data from Thomas Gleixner, Philippe Ombredanne, and Kate Stewart. Cc: Vaibhav Hiremath <[email protected]> Cc: "Bryan O'Donoghue" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Kate Stewart <[email protected]> Cc: Philippe Ombredanne <[email protected]> Acked-by: Vaibhav Agarwal <[email protected]> Acked-by: David Lin <[email protected]> Reviewed-by: Alex Elder <[email protected]> Acked-by: Johan Hovold <[email protected]> Acked-by: Viresh Kumar <[email protected]> Acked-by: Mark Greer <[email protected]> Acked-by: Rui Miguel Silva <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-11-11Merge tag 'linux-can-fixes-for-4.14-20171110' of ↵David S. Miller5-9/+25
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2017-11-10 this is a pull request for net/master. The first patch by Richard Schütz for the c_can driver removes the false indication to support triple sampling for d_can. Gerhard Bertelsmann's patch for the sun4i driver improves the RX overrun handling. The patch by Stephane Grosjean for the peak_canfd driver adds the PCI ids for various new PCIe/M2 interfaces. Marek Vasut's patch for the ifi driver fix transmitter delay calculation. ==================== Signed-off-by: David S. Miller <[email protected]>
2017-11-11Merge tag 'mlx5-fixes-2017-11-08' of ↵David S. Miller5-13/+20
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== Mellanox, mlx5 fixes 2017-11-08 The following series includes some fixes for mlx5 core and etherent driver. Sorry for the late submission but as you can see i have some very critical fixes below that i would like them merged into this RC. Please pull and let me know if there is any problem. For -stable: ('net/mlx5e: Set page to null in case dma mapping fails') kernels >= 4.13 ('net/mlx5: FPGA, return -EINVAL if size is zero') kernels >= 4.13 ('net/mlx5: Cancel health poll before sending panic teardown command') kernels >= 4.13 V1->V2: - Fix Reviewed-by tag of the 2nd patch. - Drop the FPGA 0 size fix, it needs some more change log info. ==================== Signed-off-by: David S. Miller <[email protected]>
2017-11-11vlan: fix a use-after-free in vlan_device_event()Cong Wang1-3/+3
After refcnt reaches zero, vlan_vid_del() could free dev->vlan_info via RCU: RCU_INIT_POINTER(dev->vlan_info, NULL); call_rcu(&vlan_info->rcu, vlan_info_rcu_free); However, the pointer 'grp' still points to that memory since it is set before vlan_vid_del(): vlan_info = rtnl_dereference(dev->vlan_info); if (!vlan_info) goto out; grp = &vlan_info->grp; Depends on when that RCU callback is scheduled, we could trigger a use-after-free in vlan_group_for_each_dev() right following this vlan_vid_del(). Fix it by moving vlan_vid_del() before setting grp. This is also symmetric to the vlan_vid_add() we call in vlan_device_event(). Reported-by: Fengguang Wu <[email protected]> Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct") Cc: Alexander Duyck <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Girish Moodalbail <[email protected]> Signed-off-by: Cong Wang <[email protected]> Reviewed-by: Girish Moodalbail <[email protected]> Tested-by: Fengguang Wu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-11-11tooling/headers: Sync the tools/include/uapi/drm/i915_drm.h UAPI headerIngo Molnar1-0/+1
Last minute upstream update to one of the UAPI headers - sync it with tooling, to address this warning: Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h' Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-11-11Merge branch 'perf/urgent' of ↵Ingo Molnar2-2/+13
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf tooling fixes from Arnaldo Carvalho de Melo. Signed-off-by: Ingo Molnar <[email protected]>
2017-11-11net: dsa: return after vlan prepare phaseVivien Didelot1-0/+2
The current code does not return after successfully preparing the VLAN addition on every ports member of a it. Fix this. Fixes: 1ca4aa9cd4cc ("net: dsa: check VLAN capability of every switch") Signed-off-by: Vivien Didelot <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-11-11net: dsa: return after mdb prepare phaseVivien Didelot1-0/+2
The current code does not return after successfully preparing the MDB addition on every ports member of a multicast group. Fix this. Fixes: a1a6b7ea7f2d ("net: dsa: add cross-chip multicast support") Reported-by: Egil Hjelmeland <[email protected]> Signed-off-by: Vivien Didelot <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-11-10blk-mq: fixup some comment typos and lengthsJens Axboe1-7/+10
Various typos and/or spelling errors in comments. Fixes a few > 80 char lines as well. Signed-off-by: Jens Axboe <[email protected]>
2017-11-10ide: ide-atapi: fix compile error with defining macro DEBUGHongxu Jia1-3/+3
Compile ide-atapi failed with defining macro "DEBUG" ... |drivers/ide/ide-atapi.c:285:52: error: 'struct request' has no member named 'cmd'; did you mean 'csd'? | debug_log("%s: rq->cmd[0]: 0x%x\n", __func__, rq->cmd[0]); ... Since we split the scsi_request out of struct request, it missed do the same thing on debug_log Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request") Signed-off-by: Hongxu Jia <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10blk-mq: improve tag waiting setup for non-shared tagsJens Axboe1-26/+55
If we run out of driver tags, we currently treat shared and non-shared tags the same - both cases hook into the tag waitqueue. This is a bit more costly than it needs to be on unshared tags, since we have to both grab the hctx lock, and the waitqueue lock (and disable interrupts). For the non-shared case, we can simply mark the queue as needing a restart. Split blk_mq_dispatch_wait_add() to account for both cases, and rename it to blk_mq_mark_tag_wait() to better reflect what it does now. Without this patch, shared and non-shared performance is about the same with 4 fio thread hammering on a single null_blk device (~410K, at 75% sys). With the patch, the shared case is the same, but the non-shared tags case runs at 431K at 71% sys. Reviewed-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10brd: remove unused brd_mutexMikulas Patocka1-1/+0
Remove unused mutex brd_mutex. It is unused since the commit ff26956875c2 ("brd: remove support for BLKFLSBUF"). Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10blk-mq: only run the hardware queue if IO is pendingJens Axboe4-16/+13
Currently we are inconsistent in when we decide to run the queue. Using blk_mq_run_hw_queues() we check if the hctx has pending IO before running it, but we don't do that from the individual queue run function, blk_mq_run_hw_queue(). This results in a lot of extra and pointless queue runs, potentially, on flush requests and (much worse) on tag starvation situations. This is observable just looking at top output, with lots of kworkers active. For the !async runs, it just adds to the CPU overhead of blk-mq. Move the has-pending check into the run function instead of having callers do it. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10block: avoid null pointer dereference on null diskColin Ian King1-1/+1
It is possible that the pointer disk can be null and hence we can get a null pointer deference when accessing disk->flags. Add a null pointer check to avoid the dereference. Detected by CoverityScan, CID#1461133 ("Explicit null dereferenced") Fixes: 8ddcd653257c ("block: introduce GENHD_FL_HIDDEN") Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10fs: guard_bio_eod() needs to consider partitionsGreg Edwards2-1/+10
guard_bio_eod() needs to look at the partition capacity, not just the capacity of the whole device, when determining if truncation is necessary. [ 60.268688] attempt to access beyond end of device [ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506 [ 60.268693] buffer_io_error: 2 callbacks suppressed [ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index") Cc: [email protected] # v4.13 Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Greg Edwards <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10xtensa/simdisk: fix compile errorChristoph Hellwig1-1/+1
Fixes: d004a5e7d4dd ("block: remove __bio_kmap_atomic") Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: expose subsys attribute to sysfsHannes Reinecke1-0/+48
We should be exposing the subsystem attributes like 'model' and 'subsysnqn' to sysfs to allow for easier identification of the subsystem. Signed-off-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: create 'slaves' and 'holders' entries for hidden controllersHannes Reinecke3-0/+40
When creating nvme multipath devices we should populate the 'slaves' and 'holders' directorys properly to aid userspace topology detection. Signed-off-by: Hannes Reinecke <[email protected]> [hch: split from a larger patch, compile fix for CONFIG_NVME_MULTIPATH=n] Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10block: create 'slaves' and 'holders' entries for hidden gendisksHannes Reinecke1-7/+7
When creating nvme multipath devices we should populate the 'slaves' and 'holders' directorys properly to aid userspace topology detection. Signed-off-by: Hannes Reinecke <[email protected]> [hch: split from a larger patch] Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: also expose the namespace identification sysfs files for mpath nodesChristoph Hellwig3-26/+38
We do this by adding a helper that returns the ns_head for a device that can belong to either the per-controller or per-subsystem block device nodes, and otherwise reuse all the existing code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: implement multipath access to nvme subsystemsChristoph Hellwig5-14/+455
This patch adds native multipath support to the nvme driver. For each namespace we create only single block device node, which can be used to access that namespace through any of the controllers that refer to it. The gendisk for each controllers path to the name space still exists inside the kernel, but is hidden from userspace. The character device nodes are still available on a per-controller basis. A new link from the sysfs directory for the subsystem allows to find all controllers for a given subsystem. Currently we will always send I/O to the first available path, this will be changed once the NVMe Asynchronous Namespace Access (ANA) TP is ratified and implemented, at which point we will look at the ANA state for each namespace. Another possibility that was prototyped is to use the path that is closes to the submitting NUMA code, which will be mostly interesting for PCI, but might also be useful for RDMA or FC transports in the future. There is not plan to implement round robin or I/O service time path selectors, as those are not scalable with the performance rates provided by NVMe. The multipath device will go away once all paths to it disappear, any delay to keep it alive needs to be implemented at the controller level. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: track shared namespacesChristoph Hellwig3-50/+210
Introduce a new struct nvme_ns_head that holds information about an actual namespace, unlike struct nvme_ns, which only holds the per-controller namespace information. For private namespaces there is a 1:1 relation of the two, but for shared namespaces this lets us discover all the paths to it. For now only the identifiers are moved to the new structure, but most of the information in struct nvme_ns should eventually move over. To allow lockless path lookup the list of nvme_ns structures per nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns structure through call_srcu. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Javier González <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: introduce a nvme_ns_ids structureChristoph Hellwig2-34/+49
This allows us to manage the various uniqueue namespace identifiers together instead needing various variables and arguments. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: track subsystemsChristoph Hellwig3-36/+205
This adds a new nvme_subsystem structure so that we can track multiple controllers that belong to a single subsystem. For now we only use it to store the NQN, and to check that we don't have duplicate NQNs unless the involved subsystems support multiple controllers. Includes code originally from Hannes Reinecke to expose the subsystems in sysfs. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10block, nvme: Introduce blk_mq_req_flags_tBart Van Assche8-21/+30
Several block layer and NVMe core functions accept a combination of BLK_MQ_REQ_* flags through the 'flags' argument but there is no verification at compile time whether the right type of block layer flags is passed. Make it possible for sparse to verify this. This patch does not change any functionality. Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Cc: [email protected] Cc: Christoph Hellwig <[email protected]> Cc: Johannes Thumshirn <[email protected]> Cc: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10block, scsi: Make SCSI quiesce and resume work reliablyBart Van Assche6-25/+70
The contexts from which a SCSI device can be quiesced or resumed are: * Writing into /sys/class/scsi_device/*/device/state. * SCSI parallel (SPI) domain validation. * The SCSI device power management methods. See also scsi_bus_pm_ops. It is essential during suspend and resume that neither the filesystem state nor the filesystem metadata in RAM changes. This is why while the hibernation image is being written or restored that SCSI devices are quiesced. The SCSI core quiesces devices through scsi_device_quiesce() and scsi_device_resume(). In the SDEV_QUIESCE state execution of non-preempt requests is deferred. This is realized by returning BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI devices. Avoid that a full queue prevents power management requests to be submitted by deferring allocation of non-preempt requests for devices in the quiesced state. This patch has been tested by running the following commands and by verifying that after each resume the fio job was still running: for ((i=0; i<10; i++)); do ( cd /sys/block/md0/md && while true; do [ "$(<sync_action)" = "idle" ] && echo check > sync_action sleep 1 done ) & pids=($!) for d in /sys/class/block/sd*[a-z]; do bdev=${d#/sys/class/block/} hcil=$(readlink "$d/device") hcil=${hcil#../../../} echo 4 > "$d/queue/nr_requests" echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth" fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \ --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16 \ --iodepth_batch=1 --thread --loops=$((2**31)) & pids+=($!) done sleep 1 echo "$(date) Hibernating ..." >>hibernate-test-log.txt systemctl hibernate sleep 10 kill "${pids[@]}" echo idle > /sys/block/md0/md/sync_action wait echo "$(date) Done." >>hibernate-test-log.txt done Reported-by: Oleksandr Natalenko <[email protected]> References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348). Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Martin Steigerwald <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Cc: Martin K. Petersen <[email protected]> Cc: Ming Lei <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flagBart Van Assche3-0/+37
This flag will be used in the next patch to let the block layer core know whether or not a SCSI request queue has been quiesced. A quiesced SCSI queue namely only processes RQF_PREEMPT requests. Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Martin Steigerwald <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Cc: Ming Lei <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10ide, scsi: Tell the block layer at request allocation time about preempt ↵Bart Van Assche2-5/+5
requests Convert blk_get_request(q, op, __GFP_RECLAIM) into blk_get_request_flags(q, op, BLK_MQ_PREEMPT). This patch does not change any functionality. Signed-off-by: Bart Van Assche <[email protected]> Tested-by: Martin Steigerwald <[email protected]> Acked-by: David S. Miller <[email protected]> [ for IDE ] Acked-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Cc: Ming Lei <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10block: Introduce BLK_MQ_REQ_PREEMPTBart Van Assche3-1/+6
Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to blk_get_request_flags(). Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Martin Steigerwald <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Ming Lei <[email protected]> Cc: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10block: Introduce blk_get_request_flags()Bart Van Assche2-15/+38
A side effect of this patch is that the GFP mask that is passed to several allocation functions in the legacy block layer is changed from GFP_KERNEL into __GFP_DIRECT_RECLAIM. Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Martin Steigerwald <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Ming Lei <[email protected]> Cc: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10block: Make q_usage_counter also track legacy requestsMing Lei2-8/+14
This patch makes it possible to pause request allocation for the legacy block layer by calling blk_mq_freeze_queue() and blk_mq_unfreeze_queue(). Signed-off-by: Ming Lei <[email protected]> [ bvanassche: Combined two patches into one, edited a comment and made sure REQ_NOWAIT is handled properly in blk_old_get_request() ] Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Martin Steigerwald <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Cc: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10blk-mq: fix issue with shared tag queue re-runningJens Axboe3-41/+50
This patch attempts to make the case of hctx re-running on driver tag failure more robust. Without this patch, it's pretty easy to trigger a stall condition with shared tags. An example is using null_blk like this: modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1 which sets up 4 devices, sharing the same tag set with a depth of 1. Running a fio job ala: [global] bs=4k rw=randread norandommap direct=1 ioengine=libaio iodepth=4 [nullb0] filename=/dev/nullb0 [nullb1] filename=/dev/nullb1 [nullb2] filename=/dev/nullb2 [nullb3] filename=/dev/nullb3 will inevitably end with one or more threads being stuck waiting for a scheduler tag. That IO is then stuck forever, until someone else triggers a run of the queue. Ensure that we always re-run the hardware queue, if the driver tag we were waiting for got freed before we added our leftover request entries back on the dispatch list. Reviewed-by: Bart Van Assche <[email protected]> Tested-by: Bart Van Assche <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvmet: kill nvmet_inline_bio_initChristoph Hellwig1-14/+4
Much easier to just opencode this helper. Also use ARRAY_SIZE instead of passing the inline bvec array size manually. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvmet: better data length validationChristoph Hellwig5-25/+34
Currently the NVMe target stores the expexted data length in req->data_len and uses that for data transfer decisions, but that does not take the actual transfer length in the SGLs into account. So this adds a new transfer_len field, into which the transport drivers store the actual transfer length. We then check the two match before actually executing the command. The FC transport driver already had such a field, which is removed in favour of the common one. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme-pci: avoid dereference of symbol from unloaded moduleMing Lei1-1/+2
The 'remove_work' may be scheduled to run after nvme_remove() returns since we can't simply cancel it in nvme_remove() for avoiding deadlock. Once nvme_remove() returns, this module(nvme) can be unloaded. On the other hand, nvme_put_ctrl() calls ctr->ops->free_ctrl which may point to nvme_pci_free_ctrl() in unloaded module. This patch avoids this issue by queuing 'remove_work' via 'nvme_wq', and flush this worqueue in nvme_exit() as suggested by Sagi. Suggested-by: Sagi Grimberg <[email protected]> Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: send uevent for some asynchronous eventsKeith Busch3-0/+33
This will give udev a chance to observe and handle asynchronous event notifications and clear the log to unmask future events of the same type. The driver will create a change uevent of the asyncronuos event result before submitting the next AEN request to the device if a completed AEN event is of type error, smart, command set or vendor specific, Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Guan Junxiong <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-10nvme: unexport starting async event workKeith Busch2-8/+1
Async event work is for core use only and should not be called directly from drivers. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Guan Junxiong <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>