aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2017-04-26uapi: change the type of struct statx_timestamp.tv_nsec to unsignedDmitry V. Levin1-6/+2
The comment asserting that the value of struct statx_timestamp.tv_nsec must be negative when statx_timestamp.tv_sec is negative, is wrong, as could be seen from the following example: #define _FILE_OFFSET_BITS 64 #include <assert.h> #include <fcntl.h> #include <stdio.h> #include <sys/stat.h> #include <unistd.h> #include <asm/unistd.h> #include <linux/stat.h> int main(void) { static const struct timespec ts[2] = { { .tv_nsec = UTIME_OMIT }, { .tv_sec = -2, .tv_nsec = 42 } }; assert(utimensat(AT_FDCWD, ".", ts, 0) == 0); struct stat st; assert(stat(".", &st) == 0); printf("st_mtim.tv_sec = %lld, st_mtim.tv_nsec = %lu\n", (long long) st.st_mtim.tv_sec, (unsigned long) st.st_mtim.tv_nsec); struct statx stx; assert(syscall(__NR_statx, AT_FDCWD, ".", 0, 0, &stx) == 0); printf("stx_mtime.tv_sec = %lld, stx_mtime.tv_nsec = %lu\n", (long long) stx.stx_mtime.tv_sec, (unsigned long) stx.stx_mtime.tv_nsec); return 0; } It expectedly prints: st_mtim.tv_sec = -2, st_mtim.tv_nsec = 42 stx_mtime.tv_sec = -2, stx_mtime.tv_nsec = 42 The more generic comment asserting that the value of struct statx_timestamp.tv_nsec might be negative is confusing to say the least. It contradicts both the struct stat.st_[acm]time_nsec tradition and struct timespec.tv_nsec requirements in utimensat syscall. If statx syscall ever returns a stx_[acm]time containing a negative tv_nsec that cannot be passed unmodified to utimensat syscall, it will cause an immense confusion. Fix this source of confusion by changing the type of struct statx_timestamp.tv_nsec from __s32 to __u32. Fixes: a528d35e8bfc ("statx: Add a system call to make enhanced file info available") Signed-off-by: Dmitry V. Levin <[email protected]> Signed-off-by: David Howells <[email protected]> cc: [email protected] cc: [email protected] Signed-off-by: Al Viro <[email protected]>
2017-04-26srcu: Specify auto-expedite holdoff timePaul E. McKenney1-0/+1
On small systems, in the absence of readers, expedited SRCU grace periods can complete in less than a microsecond. This means that an eight-CPU system can have all CPUs doing synchronize_srcu() in a tight loop and almost always expedite. This might actually be desirable in some situations, but in general it is a good way to needlessly burn CPU cycles. And in those situations where it is desirable, your friend is the function synchronize_srcu_expedited(). For other situations, this commit adds a kernel parameter that specifies a holdoff between completing the last SRCU grace period and auto-expediting the next. If the next grace period starts before the holdoff expires, auto-expediting is disabled. The holdoff is 50 microseconds by default, and can be tuned to the desired number of nanoseconds. A value of zero disables auto-expediting. Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Mike Galbraith <[email protected]>
2017-04-26srcu: Expedited grace periods with reduced memory contentionPaul E. McKenney1-1/+3
Commit f60d231a87c5 ("srcu: Crude control of expedited grace periods") introduced a per-srcu_struct atomic counter to track outstanding requests for grace periods. This works, but represents a memory-contention bottleneck. This commit therefore uses the srcu_node combining tree to remove this bottleneck. This commit adds new ->srcu_gp_seq_needed_exp fields to the srcu_data, srcu_node, and srcu_struct structures, which track the farthest-in-the-future grace period that must be expedited, which in turn requires that all nearer-term grace periods also be expedited. Requests for expediting start with the srcu_data structure, run up through the srcu_node tree, and end at the srcu_struct structure. Note that it may be necessary to expedite a grace period that just now started, and this is handled by a new srcu_funnel_exp_start() function, which is invoked when the grace period itself is already in its way, but when that grace period was not marked as expedited. A new srcu_get_delay() function returns zero if there is at least one expedited SRCU grace period in flight, or SRCU_INTERVAL otherwise. This function is used to calculate delays: Normal grace periods are allowed to extend in order to cover more requests with a given grace-period computation, which decreases per-request overhead. Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Mike Galbraith <[email protected]>
2017-04-27ACPICA: iasl: Fix IORT SMMU GSI disassemblingLv Zheng1-0/+9
ACPICA commit 637b88de24a78c20478728d9d66632b06fcaa5bf If the IORT template is compiled and then iort.aml binary disassembled to iort.dsl, SMMUv1 node lists incorrect offset for SMMU_Nsg_cfg_irpt Interrupt: [0ECh 0236 8] SMMU_Nsg_irpt Interrupt : 0000000000000000 [0ECh 0236 8] SMMU_Nsg_cfg_irpt Interrupt : 0000000000000000 This is because iasl hasn't implemented SMMU GSI decoding yet. This patch fixes this issue by preparing structures for decoding IORT SMMU GSI. ACPICA BZ 1340, reported by Alexei Fedorov, fixed by Lv Zheng. Link: https://github.com/acpica/acpica/commit/637b88de Link: https://bugs.acpica.org/show_bug.cgi?id=1340 Reported-by: Alexei Fedorov <[email protected]> Signed-off-by: Lv Zheng <[email protected]> Signed-off-by: Bob Moore <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2017-04-27ACPI / bus: Introduce a list of ids for "always present" devicesHans de Goede1-0/+9
Several Bay / Cherry Trail devices (all of which ship with Windows 10) hide the LPSS PWM controller in ACPI, typically the _STA method looks like this: Method (_STA, 0, NotSerialized) // _STA: Status { If (OSID == One) { Return (Zero) } Return (0x0F) } Where OSID is some dark magic seen in all Cherry Trail ACPI tables making the machine behave differently depending on which OS it *thinks* it is booting, this gets set in a number of ways which we cannot control, on some newer machines it simple hardcoded to "One" aka win10. This causes the PWM controller to get hidden, which means Linux cannot control the backlight level on cht based tablets / laptops. Since loading the driver for this does no harm (the only in kernel user of it is the i915 driver, which will only uses it when it needs it), this commit makes acpi_bus_get_status() always set status to ACPI_STA_DEFAULT for the LPSS PWM device, fixing the lack of backlight control. Signed-off-by: Hans de Goede <[email protected]> [ rjw: Rename the new file to utils.c ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2017-04-26x86, iommu/vt-d: Add an option to disable Intel IOMMU force onShaohua Li1-0/+1
IOMMU harms performance signficantly when we run very fast networking workloads. It's 40GB networking doing XDP test. Software overhead is almost unaware, but it's the IOTLB miss (based on our analysis) which kills the performance. We observed the same performance issue even with software passthrough (identity mapping), only the hardware passthrough survives. The pps with iommu (with software passthrough) is only about ~30% of that without it. This is a limitation in hardware based on our observation, so we'd like to disable the IOMMU force on, but we do want to use TBOOT and we can sacrifice the DMA security bought by IOMMU. I must admit I know nothing about TBOOT, but TBOOT guys (cc-ed) think not eabling IOMMU is totally ok. So introduce a new boot option to disable the force on. It's kind of silly we need to run into intel_iommu_init even without force on, but we need to disable TBOOT PMR registers. For system without the boot option, nothing is changed. Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2017-04-26ieee80211: fix kernel-doc parsing errorsJohannes Berg1-6/+6
Some of the enum definitions are unnamed but there's still an attempt at documenting them - that doesn't work. Name them to make that work. Signed-off-by: Johannes Berg <[email protected]>
2017-04-26ieee80211: add FT-802.1X AKM suite selectorLuca Coelho1-0/+1
Add the definition for FT-8021.1X AKM selector as defined in IEEE Std 802.11-2016, table 9-133. Signed-off-by: Luca Coelho <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2017-04-26ieee80211: add SUITE_B AKM selectorsLuca Coelho1-12/+14
Add the definitions for SUITE_B and SUITE_B_192 AKM selectors as defined in IEEE802.11REVmc_D5.0, table 9-132. Signed-off-by: Luca Coelho <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2017-04-26cfg80211: add request id parameter to .sched_scan_stop() signatureArend Van Spriel1-7/+8
For multiple scheduled scan support the driver needs to know which scheduled scan request is being stopped. Pass the request id in the .sched_scan_stop() callback. Reviewed-by: Hante Meuleman <[email protected]> Reviewed-by: Pieter-Paul Giesberts <[email protected]> Reviewed-by: Franky Lin <[email protected]> Signed-off-by: Arend van Spriel <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2017-04-26nl80211: add support for BSSIDs in scheduled scan matchsetsArend Van Spriel2-1/+9
This patch allows for the scheduled scan request to specify matchsets for specific BSSIDs. Reviewed-by: Hante Meuleman <[email protected]> Reviewed-by: Pieter-Paul Giesberts <[email protected]> Reviewed-by: Franky Lin <[email protected]> Signed-off-by: Arend van Spriel <[email protected]> [docs, netlink policy fix] Signed-off-by: Johannes Berg <[email protected]>
2017-04-26nl80211: allow multiple active scheduled scan requestsArend Van Spriel2-2/+19
This patch implements the idea to have multiple scheduled scan requests running concurrently. It mainly illustrates how to deal with the incoming request from user-space in terms of backward compatibility. In order to use multiple scheduled scans user-space needs to provide a flag attribute NL80211_ATTR_SCHED_SCAN_MULTI to indicate support. If not the request is treated as a legacy scan. Drivers currently supporting scheduled scan are now indicating they support a single scheduled scan request. This obsoletes WIPHY_FLAG_SUPPORTS_SCHED_SCAN. Reviewed-by: Hante Meuleman <[email protected]> Reviewed-by: Pieter-Paul Giesberts <[email protected]> Reviewed-by: Franky Lin <[email protected]> Signed-off-by: Arend van Spriel <[email protected]> [clean up netlink destroy path to avoid allocations, code cleanups] Signed-off-by: Johannes Berg <[email protected]>
2017-04-26cfg80211: simplify netlink socket owner interface deletionJohannes Berg1-2/+4
There's no need to allocate a portid structure and then, for each of those, walk the interfaces - we can just add a flag to each interface and walk those directly. Due to padding in the struct, we can even do it without any memory cost, and it even simplifies the code. Signed-off-by: Johannes Berg <[email protected]>
2017-04-26blk-mq: Add blk_mq_ops.show_rq()Bart Van Assche1-0/+8
This new callback function will be used in the next patch to show more information about SCSI requests. Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Omar Sandoval <[email protected]> Cc: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-04-26tcp: switch rcv_rtt_est and rcvq_space to high resolution timestampsEric Dumazet1-6/+6
Some devices or distributions use HZ=100 or HZ=250 TCP receive buffer autotuning has poor behavior caused by this choice. Since autotuning happens after 4 ms or 10 ms, short distance flows get their receive buffer tuned to a very high value, but after an initial period where it was frozen to (too small) initial value. With tp->tcp_mstamp introduction, we can switch to high resolution timestamps almost for free (at the expense of 8 additional bytes per TCP structure) Note that some TCP stacks use usec TCP timestamps where this patch makes even more sense : Many TCP flows have < 500 usec RTT. Hopefully this finer TS option can be standardized soon. Tested: HZ=100 kernel ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 & Peer without patch : lpaa24:~# ss -tmi dst lpaa23 ... skmem:(r0,rb8388608,...) rcv_rtt:10 rcv_space:3210000 minrtt:0.017 Peer with the patch : lpaa23:~# ss -tmi dst lpaa24 ... skmem:(r0,rb428800,...) rcv_rtt:0.069 rcv_space:30000 minrtt:0.017 We can see saner RCVBUF, and more precise rcv_rtt information. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-04-26tcp: do not pass timestamp to tcp_rack_advance()Eric Dumazet1-2/+1
No longer needed, since tp->tcp_mstamp holds the information. This is needed to remove sack_state.ack_time in a following patch. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-04-26tcp: do not pass timestamp to tcp_rate_gen()Eric Dumazet1-1/+1
No longer needed, since tp->tcp_mstamp holds the information. This is needed to remove sack_state.ack_time in a following patch. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-04-26tcp: do not pass timestamp to tcp_rack_mark_lost()Eric Dumazet1-1/+1
This is no longer used, since tcp_rack_detect_loss() takes the timestamp from tp->tcp_mstamp Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-04-26tcp: add tp->tcp_mstamp fieldEric Dumazet1-0/+1
We want to use precise timestamps in TCP stack, but we do not want to call possibly expensive kernel time services too often. tp->tcp_mstamp is guaranteed to be updated once per incoming packet. We will use it in the following patches, removing specific skb_mstamp_get() calls, and removing ack_time from struct tcp_sacktag_state. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-04-26rhashtable: remove insecure_max_entries paramFlorian Westphal1-4/+2
no users in the tree, insecure_max_entries is always set to ht->p.max_size * 2 in rhtashtable_init(). Replace only spot that uses it with a ht->p.max_size check. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-04-26net: phy: fix auto-negotiation stall due to unavailable interruptAlexander Kochetkov1-0/+1
The Ethernet link on an interrupt driven PHY was not coming up if the Ethernet cable was plugged before the Ethernet interface was brought up. The patch trigger PHY state machine to update link state if PHY was requested to do auto-negotiation and auto-negotiation complete flag already set. During power-up cycle the PHY do auto-negotiation, generate interrupt and set auto-negotiation complete flag. Interrupt is handled by PHY state machine but doesn't update link state because PHY is in PHY_READY state. After some time MAC bring up, start and request PHY to do auto-negotiation. If there are no new settings to advertise genphy_config_aneg() doesn't start PHY auto-negotiation. PHY continue to stay in auto-negotiation complete state and doesn't fire interrupt. At the same time PHY state machine expect that PHY started auto-negotiation and is waiting for interrupt from PHY and it won't get it. Fixes: 321beec5047a ("net: phy: Use interrupts when available in NOLINK state") Signed-off-by: Alexander Kochetkov <[email protected]> Cc: stable <[email protected]> # v4.9+ Tested-by: Roger Quadros <[email protected]> Tested-by: Alexandre Belloni <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-04-26srcu: Make rcutorture writer stalls print SRCU GP statePaul E. McKenney3-0/+30
In the past, SRCU was simple enough that there was little point in making the rcutorture writer stall messages print the SRCU grace-period number state. With the advent of Tree SRCU, this has changed. This commit therefore makes Classic, Tiny, and Tree SRCU report this state to rcutorture as needed. Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Mike Galbraith <[email protected]>
2017-04-26srcu: Exact tracking of srcu_data structures containing callbacksPaul E. McKenney1-0/+4
The current Tree SRCU implementation schedules a workqueue for every srcu_data covered by a given leaf srcu_node structure having callbacks, even if only one of those srcu_data structures actually contains callbacks. This is clearly inefficient for workloads that don't feature callbacks everywhere all the time. This commit therefore adds an array of masks that are used by the leaf srcu_node structures to track exactly which srcu_data structures contain callbacks. Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Mike Galbraith <[email protected]>
2017-04-26NFSv4: Don't special case "launder"Trond Myklebust1-13/+1
If the client receives a fatal server error from nfs_pageio_add_request(), then we should always truncate the page on which the error occurred. Signed-off-by: Trond Myklebust <[email protected]>
2017-04-26CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional nowAl Viro2-46/+2
all architectures converted Signed-off-by: Al Viro <[email protected]>
2017-04-26Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', ↵Al Viro77-331/+817
'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess
2017-04-26Merge remote-tracking branches 'spi/topic/ti-qspi' and 'spi/topic/xlp' into ↵Mark Brown1-0/+4
spi-next
2017-04-26Merge remote-tracking branches 'spi/topic/devprop', 'spi/topic/fsl', ↵Mark Brown1-0/+4
'spi/topic/fsl-dspi', 'spi/topic/imx' and 'spi/topic/lantiq' into spi-next
2017-04-26Merge remote-tracking branch 'spi/topic/core' into spi-nextMark Brown1-1/+1
2017-04-26netfilter: don't attach a nat extension by defaultFlorian Westphal1-1/+1
nowadays the NAT extension only stores the interface index (used to purge connections that got masqueraded when interface goes down) and pptp nat information. Previous patches moved nf_ct_nat_ext_add to those places that need it. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26netfilter: conntrack: mark extension structs as constFlorian Westphal1-2/+2
Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26netfilter: conntrack: remove prealloc supportFlorian Westphal1-6/+0
It was used by the nat extension, but since commit 7c9664351980 ("netfilter: move nat hlist_head to nf_conn") its only needed for connections that use MASQUERADE target or a nat helper. Also it seems a lot easier to preallocate a fixed size instead. With default settings, conntrack first adds ecache extension (sysctl defaults to 1), so we get 40(ct extension header) + 24 (ecache) == 64 byte on x86_64 for initial allocation. Followup patches can constify the extension structs and avoid the initial zeroing of the entire extension area. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26ebtables: remove nf_hook_register usageFlorian Westphal1-2/+4
Similar to ip_register_table, pass nf_hook_ops to ebt_register_table(). This allows to handle hook registration also via pernet_ops and allows us to avoid use of legacy register_hook api. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26netfilter: synproxy: only register hooks when neededFlorian Westphal1-0/+2
Defer registration of the synproxy hooks until the first SYNPROXY rule is added. Also means we only register hooks in namespaces that need it. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-25svcrdma: Clean out old XDR encodersChuck Lever1-4/+0
Clean up: These have been replaced and are no longer used. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Remove the req_map cacheChuck Lever1-33/+1
req_maps are no longer used by the send path and can thus be removed. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Remove unused RDMA Write completion handlerChuck Lever1-1/+0
Clean up. All RDMA Write completions are now handled by svc_rdma_wc_write_ctx. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxtChuck Lever1-2/+7
The sge array in struct svc_rdma_op_ctxt is no longer used for sending RDMA Write WRs. It need only accommodate the construction of Send and Receive WRs. The maximum inline size is the largest payload it needs to handle now. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Clean up RPC-over-RDMA backchannel reply processingChuck Lever1-1/+1
Replace C structure-based XDR decoding with pointer arithmetic. Pointer arithmetic is considered more portable. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Clean up RDMA_ERROR pathChuck Lever2-5/+3
Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can be refactored to reduce code duplication and remove C structure- based XDR encoding. It is also relocated to the source file that contains its only caller. This is a refactoring change only. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Use rdma_rw API in RPC reply pathChuck Lever1-1/+0
The current svcrdma sendto code path posts one RDMA Write WR at a time. Each of these Writes typically carries a small number of pages (for instance, up to 30 pages for mlx4 devices). That means a 1MB NFS READ reply requires 9 ib_post_send() calls for the Write WRs, and one for the Send WR carrying the actual RPC Reply message. Instead, use the new rdma_rw API. The details of Write WR chain construction and memory registration are taken care of in the RDMA core. svcrdma can focus on the details of the RPC-over-RDMA protocol. This gives three main benefits: 1. All Write WRs for one RDMA segment are posted in a single chain. As few as one ib_post_send() for each Write chunk. 2. The Write path can now use FRWR to register the Write buffers. If the device's maximum page list depth is large, this means a single Write WR is needed for each RPC's Write chunk data. 3. The new code introduces support for RPCs that carry both a Write list and a Reply chunk. This combination can be used for an NFSv4 READ where the data payload is large, and thus is removed from the Payload Stream, but the Payload Stream is still larger than the inline threshold. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Introduce local rdma_rw API helpersChuck Lever1-0/+11
The plan is to replace the local bespoke code that constructs and posts RDMA Read and Write Work Requests with calls to the rdma_rw API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and posting Work Requests. Some design notes: o The structure of RPC-over-RDMA transport headers is flexible, allowing multiple segments per Reply with arbitrary alignment, each with a unique R_key. Write and Send WRs continue to be built and posted in separate code paths. However, one whole chunk (with one or more RDMA segments apiece) gets exactly one ib_post_send and one work completion. o svc_xprt reference counting is modified, since a chain of rdma_rw_ctx structs generates one completion, no matter how many Write WRs are posted. o The current code builds the transport header as it is construct- ing Write WRs. I've replaced that with marshaling of transport header data items in a separate step. This is because the exact structure of client-provided segments may not align with the components of the server's reply xdr_buf, or the pages in the page list. Thus parts of each client-provided segment may be written at different points in the send path. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULTChuck Lever1-1/+0
The Send Queue depth is temporarily reduced to 1 SQE per credit. The new rdma_rw API does an internal computation, during QP creation, to increase the depth of the Send Queue to handle RDMA Read and Write operations. This change has to come before the NFSD code paths are updated to use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases the size of the SQ too much, resulting in memory allocation failures during QP creation. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Add svc_rdma_map_reply_hdr()Chuck Lever1-0/+3
Introduce a helper to DMA-map a reply's transport header before sending it. This will in part replace the map vector cache. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25svcrdma: Move send_wr to svc_rdma_op_ctxtChuck Lever1-0/+4
Clean up: Move the ib_send_wr off the stack, and move common code to post a Send Work Request into a helper. This is a refactoring change only. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25uapi: fix linux/nfsd/cld.h userspace compilation errorsDmitry V. Levin1-6/+8
Include <linux/types.h> and consistently use types it provides to fix the following linux/nfsd/cld.h userspace compilation errors: /usr/include/linux/nfsd/cld.h:40:2: error: unknown type name 'uint16_t' uint16_t cn_len; /* length of cm_id */ /usr/include/linux/nfsd/cld.h:46:2: error: unknown type name 'uint8_t' uint8_t cm_vers; /* upcall version */ /usr/include/linux/nfsd/cld.h:47:2: error: unknown type name 'uint8_t' uint8_t cm_cmd; /* upcall command */ /usr/include/linux/nfsd/cld.h:48:2: error: unknown type name 'int16_t' int16_t cm_status; /* return code */ /usr/include/linux/nfsd/cld.h:49:2: error: unknown type name 'uint32_t' uint32_t cm_xid; /* transaction id */ /usr/include/linux/nfsd/cld.h:51:3: error: unknown type name 'int64_t' int64_t cm_gracetime; /* grace period start time */ Signed-off-by: Dmitry V. Levin <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25nfsd: check for oversized NFSv2/v3 argumentsJ. Bruce Fields1-2/+1
A client can append random data to the end of an NFSv2 or NFSv3 RPC call without our complaining; we'll just stop parsing at the end of the expected data and ignore the rest. Encoded arguments and replies are stored together in an array of pages, and if a call is too large it could leave inadequate space for the reply. This is normally OK because NFS RPC's typically have either short arguments and long replies (like READ) or long arguments and short replies (like WRITE). But a client that sends an incorrectly long reply can violate those assumptions. This was observed to cause crashes. So, insist that the argument not be any longer than we expect. Also, several operations increment rq_next_page in the decode routine before checking the argument size, which can leave rq_next_page pointing well past the end of the page array, causing trouble later in svc_free_pages. As followup we may also want to rewrite the encoding routines to check more carefully that they aren't running off the end of the page array. Reported-by: Tuomas Haanpää <[email protected]> Reported-by: Ari Kauppi <[email protected]> Cc: [email protected] Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25net: move xdp_prog field in RX cache linesEric Dumazet1-1/+1
(struct net_device, xdp_prog) field should be moved in RX cache lines, reducing latencies when a single packet is received on idle host, since netif_elide_gro() needs it. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-04-25x86, dax, pmem: remove indirection around memcpy_from_pmem()Dan Williams2-23/+8
memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper serves no real benefit aside from affording a more generic function name than the x86-specific 'mcsafe'. However this would not be the first time that x86 terminology leaked into the global namespace. For lack of better name, just use memcpy_mcsafe() directly. This conversion also catches a place where we should have been using plain memcpy, acpi_nfit_blk_single_io(). Cc: <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Moyer <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ross Zwisler <[email protected]> Acked-by: Tony Luck <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2017-04-25block: remove block_device_operations ->direct_access()Dan Williams1-17/+0
Now that all the producers and consumers of dax interfaces have been converted to using dax_operations on a dax_device, remove the block device direct_access enabling. Signed-off-by: Dan Williams <[email protected]>