Age | Commit message (Collapse) | Author | Files | Lines |
|
A single NFSv4 WRITE compound can often have three operations:
PUTFH, WRITE, then GETATTR.
When the WRITE payload is sent in a Read chunk, the client places
the GETATTR in the inline part of the RPC/RDMA message, just after
the WRITE operation (sans payload). The position value in the Read
chunk enables the receiver to insert the Read chunk at the correct
place in the received XDR stream; that is between the WRITE and
GETATTR.
According to RFC 8166, an NFS/RDMA client does not have to add XDR
round-up to the Read chunk that carries the WRITE payload. The
receiver adds XDR round-up padding if it is absent and the
receiver's XDR decoder requires it to be present.
Commit 193bcb7b3719 ("svcrdma: Populate tail iovec when receiving")
attempted to add support for receiving such a compound so that just
the WRITE payload appears in rq_arg's page list, and the trailing
GETATTR is placed in rq_arg's tail iovec. (TCP just strings the
whole compound into the head iovec and page list, without regard
to the alignment of the WRITE payload).
The server transport logic also had to accommodate the optional XDR
round-up of the Read chunk, which it did simply by lengthening the
tail iovec when round-up was needed. This approach is adequate for
the NFSv2 and NFSv3 WRITE decoders.
Unfortunately it is not sufficient for nfsd4_decode_write. When the
Read chunk length is a couple of bytes less than PAGE_SIZE, the
computation at the end of nfsd4_decode_write allows argp->pagelen to
go negative, which breaks the logic in read_buf that looks for the
tail iovec.
The result is that a WRITE operation whose payload length is just
less than a multiple of a page succeeds, but the subsequent GETATTR
in the same compound fails with NFS4ERR_OP_ILLEGAL because the XDR
decoder can't find it. Clients ignore the error, but they must
update their attribute cache via a separate round trip.
As nfsd4_decode_write appears to expect the payload itself to always
have appropriate XDR round-up, have svc_rdma_build_normal_read_chunk
add the Read chunk XDR round-up to the page_len rather than
lengthening the tail iovec.
Reported-by: Olga Kornievskaia <[email protected]>
Fixes: 193bcb7b3719 ("svcrdma: Populate tail iovec when receiving")
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Olga Kornievskaia <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
tracepoint tcp_send_reset requires a full socket to work. However, it
may be called when in TCP_TIME_WAIT:
case TCP_TW_RST:
tcp_v6_send_reset(sk, skb);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
To avoid this problem, this patch checks the socket with sk_fullsock()
before calling trace_tcp_send_reset().
Fixes: c24b14c46bb8 ("tcp: add tracepoint trace_tcp_send_reset")
Signed-off-by: Song Liu <[email protected]>
Reviewed-by: Lawrence Brakmo <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In Kernel 4.15.0+, Netem does not work properly.
Netem setup:
tc qdisc add dev h1-eth0 root handle 1: netem delay 10ms 2ms
Result:
PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=22.8 ms
64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=10.9 ms
64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=10.9 ms
64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=11.4 ms
64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=4303 ms
64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.2 ms
64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.3 ms
64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=4304 ms
64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=4303 ms
Patch:
(rnd % (2 * sigma)) - sigma was overflowing s32. After applying the
patch, I found following output which is desirable.
PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=21.1 ms
64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=8.46 ms
64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=9.00 ms
64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=11.8 ms
64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=8.36 ms
64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=8.11 ms
64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=10.0 ms
64 bytes from 172.16.101.2: icmp_seq=9 ttl=64 time=11.3 ms
64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.5 ms
64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.2 ms
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Because of differences in how ipv4 and ipv6 handle fib lookups,
verification of nexthops with onlink flag need to default to the main
table rather than the local table used by IPv4. As it stands an
address within a connected route on device 1 can be used with
onlink on device 2. Updating the table properly rejects the route
due to the egress device mismatch.
Update the extack message as well to show it could be a device
mismatch for the nexthop spec.
Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Verification of nexthops with onlink flag need to handle unreachable
routes. The lookup is only intended to validate the gateway address
is not a local address and if the gateway resolves the egress device
must match the given device. Hence, hitting any default reject route
is ok.
Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
AF_RXRPC is incorrectly sending back to the server any abort it receives
for a client connection. This is due to the final-ACK offload to the
connection event processor patch. The abort code is copied into the
last-call information on the connection channel and then the event
processor is set.
Instead, the following should be done:
(1) In the case of a final-ACK for a successful call, the ACK should be
scheduled as before.
(2) In the case of a locally generated ABORT, the ABORT details should be
cached for sending in response to further packets related to that
call and no further action scheduled at call disconnect time.
(3) In the case of an ACK received from the peer, the call should be
considered dead, no ABORT should be transmitted at this time. In
response to further non-ABORT packets from the peer relating to this
call, an RX_USER_ABORT ABORT should be transmitted.
(4) In the case of a call killed due to network error, an RX_USER_ABORT
ABORT should be cached for transmission in response to further
packets, but no ABORT should be sent at this time.
Fixes: 3136ef49a14c ("rxrpc: Delay terminal ACK transmission on a client call")
Signed-off-by: David Howells <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This should help reduce the latency on replies.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for you net tree, they
are:
1) Restore __GFP_NORETRY in xt_table allocations to mitigate effects of
large memory allocation requests, from Michal Hocko.
2) Release IPv6 fragment queue in case of error in fragmentation header,
this is a follow up to amend patch 83f1999caeb1, from Subash Abhinov
Kasiviswanathan.
3) Flowtable infrastructure depends on NETFILTER_INGRESS as it registers
a hook for each flowtable, reported by John Crispin.
4) Missing initialization of info->priv in xt_cgroup version 1, from
Cong Wang.
5) Give a chance to garbage collector to run after scheduling flowtable
cleanup.
6) Releasing flowtable content on nft_flow_offload module removal is
not required at all, there is not dependencies between this module
and flowtables, remove it.
7) Fix missing xt_rateest_mutex grabbing for hash insertions, also from
Cong Wang.
8) Move nf_flow_table_cleanup() routine to flowtable core, this patch is
a dependency for the next patch in this list.
9) Flowtable resources are not properly released on removal from the
control plane. Fix this resource leak by scheduling removal of all
entries and explicit call to the garbage collector.
10) nf_ct_nat_offset() declaration is dead code, this function prototype
is not used anywhere, remove it. From Taehee Yoo.
11) Fix another flowtable resource leak on entry insertion failures,
this patch also fixes a possible use-after-free. Patch from Felix
Fietkau.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
The response to a write_space notification is very latency sensitive,
so we should queue it to the lower latency xprtiod_workqueue. This
is something we already do for the other cases where an rpc task
holds the transport XPRT_LOCKED bitlock.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
flow_offload_del frees the flow, so all associated resource must be
freed before.
Since the ct entry in struct flow_offload_entry was allocated by
flow_offload_alloc, it should be freed by flow_offload_free to take care
of the error handling path when flow_offload_add fails.
While at it, make flow_offload_del static, since it should never be
called directly, only from the gc step
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
The DEFINE_SHOW_ATTRIBUTE() helper macro would be useful for current
users, which are many of them, and for new comers to decrease code
duplication.
Acked-by: Lee Jones <[email protected]>
Acked-by: Darren Hart (VMware) <[email protected]>
Signed-off-by: Andy Shevchenko <[email protected]>
|
|
Merge misc updates from Andrew Morton:
- kasan updates
- procfs
- lib/bitmap updates
- other lib/ updates
- checkpatch tweaks
- rapidio
- ubsan
- pipe fixes and cleanups
- lots of other misc bits
* emailed patches from Andrew Morton <[email protected]>: (114 commits)
Documentation/sysctl/user.txt: fix typo
MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns
MAINTAINERS: update various PALM patterns
MAINTAINERS: update "ARM/OXNAS platform support" patterns
MAINTAINERS: update Cortina/Gemini patterns
MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern
MAINTAINERS: remove ANDROID ION pattern
mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors
mm: docs: fix parameter names mismatch
mm: docs: fixup punctuation
pipe: read buffer limits atomically
pipe: simplify round_pipe_size()
pipe: reject F_SETPIPE_SZ with size over UINT_MAX
pipe: fix off-by-one error when checking buffer limits
pipe: actually allow root to exceed the pipe buffer limits
pipe, sysctl: remove pipe_proc_fn()
pipe, sysctl: drop 'min' parameter from pipe-max-size converter
kasan: rework Kconfig settings
crash_dump: is_kdump_kernel can be boolean
kernel/mutex: mutex_is_locked can be boolean
...
|
|
This macro is only used by net/ipv6/mcast.c, but there is no reason
why it must be BUILD_BUG_ON_NULL().
Replace it with BUILD_BUG_ON_ZERO(), and remove BUILD_BUG_ON_NULL()
definition from <linux/build_bug.h>.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Cc: Ian Abbott <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Hideaki YOSHIFUJI <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
with bitmap_{from,to}_arr32 over the kernel. Additionally to it:
* __check_eq_bitmap() now takes single nbits argument.
* __check_eq_u32_array is not used in new test but may be used in
future. So I don't remove it here, but annotate as __used.
Tested on arm64 and 32-bit BE mips.
[[email protected]: perf: arm_dsu_pmu: convert to bitmap_from_arr32]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix net/core/ethtool.c]
Link: http://lkml.kernel.org/r/20180205071747.4ekxtsbgxkj5b2fz@yury-thinkpad
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Ben Hutchings <[email protected]>
Cc: David Decotigny <[email protected]>,
Cc: David S. Miller <[email protected]>,
Cc: Geert Uytterhoeven <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Heiner Kallweit <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Every flow_offload entry is added into the table twice. Because of this,
rhashtable_free_and_destroy can't be used, since it would call kfree for
each flow_offload object twice.
This patch cleans up the flowtable via nf_flow_table_iterate() to
schedule removal of entries by setting on the dying bit, then there is
an explicitly invocation of the garbage collector to release resources.
Based on patch from Felix Fietkau.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Move the flowtable cleanup routines to nf_flow_table and expose the
nf_flow_table_cleanup() helper function.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
rateest_hash is supposed to be protected by xt_rateest_mutex,
and, as suggested by Eric, lookup and insert should be atomic,
so we should acquire the xt_rateest_mutex once for both.
So introduce a non-locking helper for internal use and keep the
locking one for external.
Reported-by: <[email protected]>
Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
Signed-off-by: Cong Wang <[email protected]>
Reviewed-by: Florian Westphal <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Most places in the kernel that we need to distinguish functions by the
type of their arguments, we use '_ul' as a suffix for the unsigned long
variant, not '_ext'. Also add kernel-doc.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
No real benefit to this classifier, but since we're allocating a u32
anyway, we should use this function.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Commit e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles)
converted htid allocation to use the IDR. The ID allocated by this
scheme changes; it used to be cyclic, but now always allocates the
lowest available. The IDR supports cyclic allocation, so just use
the right function.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Use the new helper which saves a temporary variable and a few lines
of code.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Use the new helper. This has a modest reduction in both lines of code
and compiled code size.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Use the new helper which saves a temporary variable and a few lines of
code.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Use the new helper.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Use the new helper. Also untangle the error path, and in so doing
noticed that estimator generator failure would lead to us leaking an
ID. Fix that bug.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Simply changing idr_remove's 'id' argument to 'unsigned long' works
for all callers.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Changing idr_replace's 'id' argument to 'unsigned long' works for all
callers. Callers which passed a negative ID now get -ENOENT instead of
-EINVAL. No callers relied on this error value.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
Simply changing idr_remove's 'id' argument to 'unsigned long' suffices
for all callers.
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
git://git.linux-nfs.org/projects/anna/linux-nfs
NFS-over-RDMA client fixes for Linux 4.16 #2
Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet
|
|
Scenario:
1. Port down and do fail over
2. Ap do rds_bind syscall
PID: 47039 TASK: ffff89887e2fe640 CPU: 47 COMMAND: "kworker/u:6"
#0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9
#1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3
#2 [ffff898e35f15b30] oops_end at ffffffff8150f518
#3 [ffff898e35f15b60] no_context at ffffffff8104854c
#4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675
#5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3
#6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8
#7 [ffff898e35f15d10] page_fault at ffffffff8150ea95
[exception RIP: unknown or invalid address]
RIP: 0000000000000000 RSP: ffff898e35f15dc8 RFLAGS: 00010282
RAX: 00000000fffffffe RBX: ffff889b77f6fc00 RCX:ffffffff81c99d88
RDX: 0000000000000000 RSI: ffff896019ee08e8 RDI:ffff889b77f6fc00
RBP: ffff898e35f15df0 R8: ffff896019ee08c8 R9:0000000000000000
R10: 0000000000000400 R11: 0000000000000000 R12:ffff896019ee08c0
R13: ffff889b77f6fe68 R14: ffffffff81c99d80 R15: ffffffffa022a1e0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm]
#9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6
#10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0
#11 [ffff898e35f15ee8] kthread at ffffffff81090fe6
PID: 45659 TASK: ffff880d313d2500 CPU: 31 COMMAND: "oracle_45659_ap"
#0 [ffff881024ccfc98] __schedule at ffffffff8150bac4
#1 [ffff881024ccfd40] schedule at ffffffff8150c2cf
#2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7
#3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb
#4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm]
#5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma]
#6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds]
#7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds]
#8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670
PID: 45659 PID: 47039
rds_ib_laddr_check
/* create id_priv with a null event_handler */
rdma_create_id
rdma_bind_addr
cma_acquire_dev
/* add id_priv to cma_dev->id_list */
cma_attach_to_dev
cma_ndev_work_handler
/* event_hanlder is null */
id_priv->id.event_handler
Signed-off-by: Guanglei Li <[email protected]>
Signed-off-by: Honglei Wang <[email protected]>
Reviewed-by: Junxiao Bi <[email protected]>
Reviewed-by: Yanjun Zhu <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Acked-by: Santosh Shilimkar <[email protected]>
Acked-by: Doug Ledford <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
When an erspan tunnel device receives an erpsan packet with different
tunnel metadata (ex: version, index, hwid, direction), existing code
overwrites the tunnel device's erspan configuration with the received
packet's metadata. The patch fixes it.
Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre")
Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 94d7d8f29287 ("ip6_gre: add erspan v2 support")
Signed-off-by: William Tu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Commit d350a823020e ("net: erspan: create erspan metadata uapi header")
moves the erspan 'version' in front of the 'struct erspan_md2' for
later extensibility reason. This breaks the existing erspan metadata
extraction code because the erspan_md2 then has a 4-byte offset
to between the erspan_metadata and erspan_base_hdr. This patch
fixes it.
Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code")
Signed-off-by: William Tu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Li Shuang reported an Oops with cls_u32 due to an use-after-free
in u32_destroy_key(). The use-after-free can be triggered with:
dev=lo
tc qdisc add dev $dev root handle 1: htb default 10
tc filter add dev $dev parent 1: prio 5 handle 1: protocol ip u32 divisor 256
tc filter add dev $dev protocol ip parent 1: prio 5 u32 ht 800:: match ip dst\
10.0.0.0/8 hashkey mask 0x0000ff00 at 16 link 1:
tc qdisc del dev $dev root
Which causes the following kasan splat:
==================================================================
BUG: KASAN: use-after-free in u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
Read of size 4 at addr ffff881b83dae618 by task kworker/u48:5/571
CPU: 17 PID: 571 Comm: kworker/u48:5 Not tainted 4.15.0+ #87
Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
Workqueue: tc_filter_workqueue u32_delete_key_freepf_work [cls_u32]
Call Trace:
dump_stack+0xd6/0x182
? dma_virt_map_sg+0x22e/0x22e
print_address_description+0x73/0x290
kasan_report+0x277/0x360
? u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
u32_delete_key_freepf_work+0x1c/0x30 [cls_u32]
process_one_work+0xae0/0x1c80
? sched_clock+0x5/0x10
? pwq_dec_nr_in_flight+0x3c0/0x3c0
? _raw_spin_unlock_irq+0x29/0x40
? trace_hardirqs_on_caller+0x381/0x570
? _raw_spin_unlock_irq+0x29/0x40
? finish_task_switch+0x1e5/0x760
? finish_task_switch+0x208/0x760
? preempt_notifier_dec+0x20/0x20
? __schedule+0x839/0x1ee0
? check_noncircular+0x20/0x20
? firmware_map_remove+0x73/0x73
? find_held_lock+0x39/0x1c0
? worker_thread+0x434/0x1820
? lock_contended+0xee0/0xee0
? lock_release+0x1100/0x1100
? init_rescuer.part.16+0x150/0x150
? retint_kernel+0x10/0x10
worker_thread+0x216/0x1820
? process_one_work+0x1c80/0x1c80
? lock_acquire+0x1a5/0x540
? lock_downgrade+0x6b0/0x6b0
? sched_clock+0x5/0x10
? lock_release+0x1100/0x1100
? compat_start_thread+0x80/0x80
? do_raw_spin_trylock+0x190/0x190
? _raw_spin_unlock_irq+0x29/0x40
? trace_hardirqs_on_caller+0x381/0x570
? _raw_spin_unlock_irq+0x29/0x40
? finish_task_switch+0x1e5/0x760
? finish_task_switch+0x208/0x760
? preempt_notifier_dec+0x20/0x20
? __schedule+0x839/0x1ee0
? kmem_cache_alloc_trace+0x143/0x320
? firmware_map_remove+0x73/0x73
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? find_held_lock+0x39/0x1c0
? schedule+0xf3/0x3b0
? lock_downgrade+0x6b0/0x6b0
? __schedule+0x1ee0/0x1ee0
? do_wait_intr_irq+0x340/0x340
? do_raw_spin_trylock+0x190/0x190
? _raw_spin_unlock_irqrestore+0x32/0x60
? process_one_work+0x1c80/0x1c80
? process_one_work+0x1c80/0x1c80
kthread+0x312/0x3d0
? kthread_create_worker_on_cpu+0xc0/0xc0
ret_from_fork+0x3a/0x50
Allocated by task 1688:
kasan_kmalloc+0xa0/0xd0
__kmalloc+0x162/0x380
u32_change+0x1220/0x3c9e [cls_u32]
tc_ctl_tfilter+0x1ba6/0x2f80
rtnetlink_rcv_msg+0x4f0/0x9d0
netlink_rcv_skb+0x124/0x320
netlink_unicast+0x430/0x600
netlink_sendmsg+0x8fa/0xd60
sock_sendmsg+0xb1/0xe0
___sys_sendmsg+0x678/0x980
__sys_sendmsg+0xc4/0x210
do_syscall_64+0x232/0x7f0
return_from_SYSCALL_64+0x0/0x75
Freed by task 112:
kasan_slab_free+0x71/0xc0
kfree+0x114/0x320
rcu_process_callbacks+0xc3f/0x1600
__do_softirq+0x2bf/0xc06
The buggy address belongs to the object at ffff881b83dae600
which belongs to the cache kmalloc-4096 of size 4096
The buggy address is located 24 bytes inside of
4096-byte region [ffff881b83dae600, ffff881b83daf600)
The buggy address belongs to the page:
page:ffffea006e0f6a00 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
flags: 0x17ffffc0008100(slab|head)
raw: 0017ffffc0008100 0000000000000000 0000000000000000 0000000100070007
raw: dead000000000100 dead000000000200 ffff880187c0e600 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff881b83dae500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff881b83dae580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff881b83dae600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff881b83dae680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff881b83dae700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
The problem is that the htnode is freed before the linked knodes and the
latter will try to access the first at u32_destroy_key() time.
This change addresses the issue using the htnode refcnt to guarantee
the correct free order. While at it also add a RCU annotation,
to keep sparse happy.
v1 -> v2: use rtnl_derefence() instead of RCU read locks
v2 -> v3:
- don't check refcnt in u32_destroy_hnode()
- cleaned-up u32_destroy() implementation
- cleaned-up code comment
v3 -> v4:
- dropped unneeded comment
Reported-by: Li Shuang <[email protected]>
Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 filter")
Signed-off-by: Paolo Abeni <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Support the AFS dynamic root which is a pseudo-volume that doesn't connect
to any server resource, but rather is just a root directory that
dynamically creates mountpoint directories where the name of such a
directory is the name of the cell.
Such a mount can be created thus:
mount -t afs none /afs -o dyn
Dynamic root superblocks aren't shared except by bind mounts and
propagation. Cell root volumes can then be mounted by referring to them by
name, e.g.:
ls /afs/grand.central.org/
ls /afs/.grand.central.org/
The kernel will upcall to consult the DNS if the address wasn't supplied
directly.
Signed-off-by: David Howells <[email protected]>
|
|
Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.
For sockmap, which is added in the next patches, a ULP is used to
hook into TCP sockets close state. In this case the ULP being added
is done at map insert time and the ULP is known and done on the kernel
side. In this case the named lookup is not needed. Because we don't
want to expose psock internals to user space socket options a user
visible flag is also added. For TLS this is set for BPF it will be
cleared.
Alos remove pr_notice, user gets an error code back and should check
that rather than rely on logs.
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
410f03831 ("sctp: add routing output fallback"):
When walking the address_list, successive ip_route_output_key() calls
may return the same rt->dst with the reference incremented on each call.
The code would not decrement the dst refcount when the dst pointer was
identical from the previous iteration, causing the dst refcnt leak.
Testcase:
ip netns add TEST
ip netns exec TEST ip link set lo up
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link set dev dummy0 netns TEST
ip link set dev dummy1 netns TEST
ip link set dev dummy2 netns TEST
ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
ip netns exec TEST ip link set dummy0 up
ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
ip netns exec TEST ip link set dummy1 up
ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
ip netns exec TEST ip link set dummy2 up
ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 20000 -s -B 192.168.1.3
ip netns del TEST
In 4.4 and 4.9 kernels this results to:
[ 354.179591] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 364.419674] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 374.663664] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 384.903717] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 395.143724] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 405.383645] unregister_netdevice: waiting for lo to become free. Usage count = 1
...
Fixes: 410f03831 ("sctp: add routing output fallback")
Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
Signed-off-by: Tommi Rantala <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Acked-by: Neil Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
When going through the bind address list in sctp_v6_get_dst() and
the previously found address is better ('matchlen > bmatchlen'),
the code continues to the next iteration without releasing currently
held destination.
Fix it by releasing 'bdst' before continue to the next iteration, and
instead of introducing one more '!IS_ERR(bdst)' check for dst_release(),
move the already existed one right after ip6_dst_lookup_flow(), i.e. we
shouldn't proceed further if we get an error for the route lookup.
Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
Signed-off-by: Alexey Kodanev <[email protected]>
Acked-by: Neil Horman <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Ensure that we release the TCP socket once it is in the TCP_CLOSE or
TCP_TIME_WAIT state (and only then) so that we don't confuse rkhunter
and its ilk.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Setting values in struct sock directly is the usual method. Remove
the long dead code using set_fs() and the related comment.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Alexei Starovoitov says:
====================
pull-request: bpf 2018-02-02
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) support XDP attach in libbpf, from Eric.
2) minor fixes, from Daniel, Jakub, Yonghong, Alexei.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull spectre/meltdown updates from Thomas Gleixner:
"The next round of updates related to melted spectrum:
- The initial set of spectre V1 mitigations:
- Array index speculation blocker and its usage for syscall,
fdtable and the n180211 driver.
- Speculation barrier and its usage in user access functions
- Make indirect calls in KVM speculation safe
- Blacklisting of known to be broken microcodes so IPBP/IBSR are not
touched.
- The initial IBPB support and its usage in context switch
- The exposure of the new speculation MSRs to KVM guests.
- A fix for a regression in x86/32 related to the cpu entry area
- Proper whitelisting for known to be safe CPUs from the mitigations.
- objtool fixes to deal proper with retpolines and alternatives
- Exclude __init functions from retpolines which speeds up the boot
process.
- Removal of the syscall64 fast path and related cleanups and
simplifications
- Removal of the unpatched paravirt mode which is yet another source
of indirect unproteced calls.
- A new and undisputed version of the module mismatch warning
- A couple of cleanup and correctness fixes all over the place
Yet another step towards full mitigation. There are a few things still
missing like the RBS underflow mitigation for Skylake and other small
details, but that's being worked on.
That said, I'm taking a belated christmas vacation for a week and hope
that everything is magically solved when I'm back on Feb 12th"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
KVM/x86: Add IBPB support
KVM/x86: Update the reverse_cpuid list to include CPUID_7_EDX
x86/speculation: Fix typo IBRS_ATT, which should be IBRS_ALL
x86/pti: Mark constant arrays as __initconst
x86/spectre: Simplify spectre_v2 command line parsing
x86/retpoline: Avoid retpolines for built-in __init functions
x86/kvm: Update spectre-v1 mitigation
KVM: VMX: make MSR bitmaps per-VCPU
x86/paravirt: Remove 'noreplace-paravirt' cmdline option
x86/speculation: Use Indirect Branch Prediction Barrier in context switch
x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
x86/spectre: Fix spelling mistake: "vunerable"-> "vulnerable"
x86/spectre: Report get_user mitigation for spectre_v1
nl80211: Sanitize array index in parse_txq_params
vfs, fdtable: Prevent bounds-check bypass via speculative execution
x86/syscall: Sanitize syscall table de-references under speculation
x86/get_user: Use pointer masking to limit speculation
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy whitelisting from Kees Cook:
"Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs.
To further restrict what memory is available for copying, this creates
a way to whitelist specific areas of a given slab cache object for
copying to/from userspace, allowing much finer granularity of access
control.
Slab caches that are never exposed to userspace can declare no
whitelist for their objects, thereby keeping them unavailable to
userspace via dynamic copy operations. (Note, an implicit form of
whitelisting is the use of constant sizes in usercopy operations and
get_user()/put_user(); these bypass all hardened usercopy checks since
these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over
the next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage"
* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
lkdtm: Update usercopy tests for whitelisting
usercopy: Restrict non-usercopy caches to size 0
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
kvm: whitelist struct kvm_vcpu_arch
arm: Implement thread_struct whitelist for hardened usercopy
arm64: Implement thread_struct whitelist for hardened usercopy
x86: Implement thread_struct whitelist for hardened usercopy
fork: Provide usercopy whitelisting for task_struct
fork: Define usercopy region in thread_stack slab caches
fork: Define usercopy region in mm_struct slab caches
net: Restrict unwhitelisted proto caches to size 0
sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
sctp: Define usercopy region in SCTP proto slab cache
caif: Define usercopy region in caif proto slab cache
ip: Define usercopy region in IP proto slab cache
net: Define usercopy region in struct proto slab cache
scsi: Define usercopy region in scsi_sense_cache slab cache
cifs: Define usercopy region in cifs_request slab cache
vxfs: Define usercopy region in vxfs_inode slab cache
ufs: Define usercopy region in ufs_inode_cache slab cache
...
|
|
Pull networking fixes from David Miller:
1) The bnx2x can hang if you give it a GSO packet with a segment size
which is too big for the hardware, detect and drop in this case.
From Daniel Axtens.
2) Fix some overflows and pointer leaks in xtables, from Dmitry Vyukov.
3) Missing RCU locking in igmp, from Eric Dumazet.
4) Fix RX checksum handling on r8152, it can only checksum UDP and TCP
packets. From Hayes Wang.
5) Minor pacing tweak to TCP BBR congestion control, from Neal
Cardwell.
6) Missing RCU annotations in cls_u32, from Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
Revert "defer call to mem_cgroup_sk_alloc()"
soreuseport: fix mem leak in reuseport_add_sock()
net: qlge: use memmove instead of skb_copy_to_linear_data
net: qed: use correct strncpy() size
net: cxgb4: avoid memcpy beyond end of source buffer
cls_u32: add missing RCU annotation.
r8152: set rx mode early when linking on
r8152: fix wrong checksum status for received IPv4 packets
nfp: fix TLV offset calculation
net: pxa168_eth: add netconsole support
net: igmp: add a missing rcu locking section
ibmvnic: fix firmware version when no firmware level has been provided by the VIOS server
vmxnet3: remove redundant initialization of pointer 'rq'
lan78xx: remove redundant initialization of pointer 'phydev'
net: jme: remove unused initialization of 'rxdesc'
rtnetlink: remove check for IFLA_IF_NETNSID
rocker: fix possible null pointer dereference in rocker_router_fib_event_work
inet: Avoid unitialized variable warning in inet_unhash()
net: bridge: Fix uninitialized error in br_fdb_sync_static()
openvswitch: Remove padding from packet before L3+ conntrack processing
...
|
|
This patch effectively reverts commit 9f1c2674b328 ("net: memcontrol:
defer call to mem_cgroup_sk_alloc()").
Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
memcg socket memory accounting, as packets received before memcg
pointer initialization are not accounted and are causing refcounting
underflow on socket release.
Actually the free-after-use problem was fixed by
commit c0576e397508 ("net: call cgroup_sk_alloc() earlier in
sk_clone_lock()") for the cgroup pointer.
So, let's revert it and call mem_cgroup_sk_alloc() just before
cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
we're cloning, and it holds a reference to the memcg.
Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
mem_cgroup_sk_alloc(). I see no reasons why bumping the root
memcg counter is a good reason to panic, and there are no realistic
ways to hit it.
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
reuseport_add_sock() needs to deal with attaching a socket having
its own sk_reuseport_cb, after a prior
setsockopt(SO_ATTACH_REUSEPORT_?BPF)
Without this fix, not only a WARN_ONCE() was issued, but we were also
leaking memory.
Thanks to sysbot and Eric Biggers for providing us nice C repros.
------------[ cut here ]------------
socket already in reuseport group
WARNING: CPU: 0 PID: 3496 at net/core/sock_reuseport.c:119
reuseport_add_sock+0x742/0x9b0 net/core/sock_reuseport.c:117
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 3496 Comm: syzkaller869503 Not tainted 4.15.0-rc6+ #245
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
panic+0x1e4/0x41c kernel/panic.c:183
__warn+0x1dc/0x200 kernel/panic.c:547
report_bug+0x211/0x2d0 lib/bug.c:184
fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
fixup_bug arch/x86/kernel/traps.c:247 [inline]
do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
Fixes: ef456144da8e ("soreuseport: define reuseport groups")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: [email protected]
Acked-by: Craig Gallek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In a couple of points of the control path, n->ht_down is currently
accessed without the required RCU annotation. The accesses are
safe, but sparse complaints. Since we already held the
rtnl lock, let use rtnl_dereference().
Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs")
Fixes: de5df63228fc ("net: sched: cls_u32 changes to knode must appear atomic to readers")
Signed-off-by: Paolo Abeni <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Michal Kalderon reports a BUG that occurs just after device removal:
[ 169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
[ 169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]
The RPC/RDMA client transport attempts to allocate some resources
on demand. Registered buffers are one such resource. These are
allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
and Reply messages. A hardware resource is associated with each of
these buffers, as they can be used for a Send or Receive Work
Request.
If a device is removed from under an NFS/RDMA mount, the transport
layer is responsible for releasing all hardware resources before
the device can be finally unplugged. A BUG results when the NFS
mount hasn't yet seen much activity: the transport tries to release
resources that haven't yet been allocated.
rpcrdma_free_regbuf() already checks for this case, so just move
that check to cover the DEVICE_REMOVAL case as well.
Reported-by: Michal Kalderon <[email protected]>
Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA ...")
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Michal Kalderon <[email protected]>
Cc: [email protected] # v4.12+
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Commit 16f906d66cd7 ("xprtrdma: Reduce required number of send
SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
a problem where xprtrdma would not work if the device's max_sge
capability was small (low single digits).
At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
each RPC. ri_max_send_sges is set to this value:
ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
Then when marshaling each RPC, rpcrdma_args_inline uses that value
to determine whether the device has enough Send SGEs to convey an
NFS WRITE payload inline, or whether instead a Read chunk is
required.
More recently, commit ae72950abf99 ("xprtrdma: Add data structure to
manage RDMA Send arguments") used the ri_max_send_sges value to
calculate the size of an array, but that commit erroneously assumed
ri_max_send_sges contains a value similar to the device's max_sge,
and not one that was reduced by the minimum SGE count.
This assumption results in the calculated size of the sendctx's
Send SGE array to be too small. When the array is used to marshal
an RPC, the code can write Send SGEs into the following sendctx
element in that array, corrupting it. When the device's max_sge is
large, this issue is entirely harmless; but it results in an oops
in the provider's post_send method, if dev.attrs.max_sge is small.
So let's straighten this out: ri_max_send_sges will now contain a
value with the same meaning as dev.attrs.max_sge, which makes
the code easier to understand, and enables rpcrdma_sendctx_create
to calculate the size of the SGE array correctly.
Reported-by: Michal Kalderon <[email protected]>
Fixes: 16f906d66cd7 ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Michal Kalderon <[email protected]>
Cc: [email protected] # v4.10+
Signed-off-by: Anna Schumaker <[email protected]>
|
|
nft_flow_offload module removal does not require to flush existing
flowtables, it is valid to remove this module while keeping flowtables
around.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
If netdevice goes down, then flowtable entries are scheduled to be
removed. Wait for garbage collector to have a chance to run so it can
delete them from the hashtable.
The flush call might sleep, so hold the nfnl mutex from
nft_flow_table_iterate() instead of rcu read side lock. The use of the
nfnl mutex is also implicitly fixing races between updates via nfnetlink
and netdevice event.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|