aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-10-19netfilter: nf_flow_table: remove flowtable hook flush routine in netns exit ↵Taehee Yoo1-3/+0
routine When device is unregistered, flowtable flush routine is called by notifier_call(nf_tables_flowtable_event). and exit callback of nftables pernet_operation(nf_tables_exit_net) also has flowtable flush routine. but when network namespace is destroyed, both notifier_call and pernet_operation are called. hence flowtable flush routine in pernet_operation is unnecessary. test commands: %ip netns add vm1 %ip netns exec vm1 nft add table ip filter %ip netns exec vm1 nft add flowtable ip filter w \ { hook ingress priority 0\; devices = { lo }\; } %ip netns del vm1 splat looks like: [ 265.187019] WARNING: CPU: 0 PID: 87 at net/netfilter/core.c:309 nf_hook_entry_head+0xc7/0xf0 [ 265.187112] Modules linked in: nf_flow_table_ipv4 nf_flow_table nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip_tables x_tables [ 265.187390] CPU: 0 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc3+ #5 [ 265.187453] Workqueue: netns cleanup_net [ 265.187514] RIP: 0010:nf_hook_entry_head+0xc7/0xf0 [ 265.187546] Code: 8d 81 68 03 00 00 5b c3 89 d0 83 fa 04 48 8d 84 c7 e8 11 00 00 76 81 0f 0b 31 c0 e9 78 ff ff ff 0f 0b 48 83 c4 08 31 c0 5b c3 <0f> 0b 31 c0 e9 65 ff ff ff 0f 0b 31 c0 e9 5c ff ff ff 48 89 0c 24 [ 265.187573] RSP: 0018:ffff88011546f098 EFLAGS: 00010246 [ 265.187624] RAX: ffffffff8d90e135 RBX: 1ffff10022a8de1c RCX: 0000000000000000 [ 265.187645] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff880116298040 [ 265.187645] RBP: ffff88010ea4c1a8 R08: 0000000000000000 R09: 0000000000000000 [ 265.187645] R10: ffff88011546f1d8 R11: ffffed0022c532c1 R12: ffff88010ea4c1d0 [ 265.187645] R13: 0000000000000005 R14: dffffc0000000000 R15: ffff88010ea4c1c4 [ 265.187645] FS: 0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000 [ 265.187645] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 265.187645] CR2: 00007fdfb8d00000 CR3: 0000000057a16000 CR4: 00000000001006f0 [ 265.187645] Call Trace: [ 265.187645] __nf_unregister_net_hook+0xca/0x5d0 [ 265.187645] ? nf_hook_entries_free.part.3+0x80/0x80 [ 265.187645] ? save_trace+0x300/0x300 [ 265.187645] nf_unregister_net_hooks+0x2e/0x40 [ 265.187645] nf_tables_exit_net+0x479/0x1340 [nf_tables] [ 265.187645] ? find_held_lock+0x39/0x1c0 [ 265.187645] ? nf_tables_abort+0x30/0x30 [nf_tables] [ 265.187645] ? inet_frag_destroy_rcu+0xd0/0xd0 [ 265.187645] ? trace_hardirqs_on+0x93/0x210 [ 265.187645] ? __bpf_trace_preemptirq_template+0x10/0x10 [ 265.187645] ? inet_frag_destroy_rcu+0xd0/0xd0 [ 265.187645] ? inet_frag_destroy_rcu+0xd0/0xd0 [ 265.187645] ? __mutex_unlock_slowpath+0x17f/0x740 [ 265.187645] ? wait_for_completion+0x710/0x710 [ 265.187645] ? bucket_table_free+0xb2/0x1f0 [ 265.187645] ? nested_table_free+0x130/0x130 [ 265.187645] ? __lock_is_held+0xb4/0x140 [ 265.187645] ops_exit_list.isra.10+0x94/0x140 [ 265.187645] cleanup_net+0x45b/0x900 [ ... ] This WARNING means that hook unregisteration is failed because all flowtables hooks are already unregistered by notifier_call. Network namespace exit routine guarantees that all devices will be unregistered first. then, other exit callbacks of pernet_operations are called. so that removing flowtable flush routine in exit callback of pernet_operation(nf_tables_exit_net) doesn't make flowtable leak. Signed-off-by: Taehee Yoo <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-10-16netfilter: xt_nat: fix DNAT target for shifted portmap rangesPaolo Abeni1-0/+2
The commit 2eb0f624b709 ("netfilter: add NAT support for shifted portmap ranges") did not set the checkentry/destroy callbacks for the newly added DNAT target. As a result, rulesets using only such nat targets are not effective, as the relevant conntrack hooks are not enabled. The above affect also nft_compat rulesets. Fix the issue adding the missing initializers. Fixes: 2eb0f624b709 ("netfilter: add NAT support for shifted portmap ranges") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-10-11netfilter: nft_compat: do not dump private areaPablo Neira Ayuso1-2/+22
Zero pad private area, otherwise we expose private kernel pointer to userspace. This patch also zeroes the tail area after the ->matchsize and ->targetsize that results from XT_ALIGN(). Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Reported-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-10-11netfilter: xt_TEE: add missing code to get interface index in checkentry.Taehee Yoo1-0/+7
checkentry(tee_tg_check) should initialize priv->oif from dev if possible. But only netdevice notifier handler can set that. Hence priv->oif is always -1 until notifier handler is called. Fixes: 9e2f6c5d78db ("netfilter: Rework xt_TEE netdevice notifier") Signed-off-by: Taehee Yoo <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-10-11netfilter: xt_TEE: fix wrong interface selectionTaehee Yoo1-17/+52
TEE netdevice notifier handler checks only interface name. however each netns can have same interface name. hence other netns's interface could be selected. test commands: %ip netns add vm1 %iptables -I INPUT -p icmp -j TEE --gateway 192.168.1.1 --oif enp2s0 %ip link set enp2s0 netns vm1 Above rule is in the root netns. but that rule could get enp2s0 ifindex of vm1 by notifier handler. After this patch, TEE rule is added to the per-netns list. Fixes: 9e2f6c5d78db ("netfilter: Rework xt_TEE netdevice notifier") Signed-off-by: Taehee Yoo <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-10-11netfilter: nft_osf: usage from output path is not validFernando Fernandez Mancera1-0/+10
The nft_osf extension, like xt_osf, is not supported from the output path. Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf") Signed-off-by: Fernando Fernandez Mancera <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-10-11netfilter: nft_set_rbtree: allow loose matching of closing element in intervalPablo Neira Ayuso1-2/+8
Allow to find closest matching for the right side of an interval (end flag set on) so we allow lookups in inner ranges, eg. 10-20 in 5-25. Fixes: ba0e4d9917b4 ("netfilter: nf_tables: get set elements via netlink") Reported-by: Phil Sutter <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-10-10rds: RDS (tcp) hangs on sendto() to unresponding addressKa-Cheong Poon1-3/+10
In rds_send_mprds_hash(), if the calculated hash value is non-zero and the MPRDS connections are not yet up, it will wait. But it should not wait if the send is non-blocking. In this case, it should just use the base c_path for sending the message. Signed-off-by: Ka-Cheong Poon <[email protected]> Acked-by: Santosh Shilimkar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-10net: make skb_partial_csum_set() more robust against overflowsEric Dumazet1-5/+7
syzbot managed to crash in skb_checksum_help() [1] : BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb)); Root cause is the following check in skb_partial_csum_set() if (unlikely(start > skb_headlen(skb)) || unlikely((int)start + off > skb_headlen(skb) - 2)) return false; If skb_headlen(skb) is 1, then (skb_headlen(skb) - 2) becomes 0xffffffff and the check fails to detect that ((int)start + off) is off the limit, since the compare is unsigned. When we fix that, then the first condition (start > skb_headlen(skb)) becomes obsolete. Then we should also check that (skb_headroom(skb) + start) wont overflow 16bit field. [1] kernel BUG at net/core/dev.c:2880! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 7330 Comm: syz-executor4 Not tainted 4.19.0-rc6+ #253 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:skb_checksum_help+0x9e3/0xbb0 net/core/dev.c:2880 Code: 85 00 ff ff ff 48 c1 e8 03 42 80 3c 28 00 0f 84 09 fb ff ff 48 8b bd 00 ff ff ff e8 97 a8 b9 fb e9 f8 fa ff ff e8 2d 09 76 fb <0f> 0b 48 8b bd 28 ff ff ff e8 1f a8 b9 fb e9 b1 f6 ff ff 48 89 cf RSP: 0018:ffff8801d83a6f60 EFLAGS: 00010293 RAX: ffff8801b9834380 RBX: ffff8801b9f8d8c0 RCX: ffffffff8608c6d7 RDX: 0000000000000000 RSI: ffffffff8608cc63 RDI: 0000000000000006 RBP: ffff8801d83a7068 R08: ffff8801b9834380 R09: 0000000000000000 R10: ffff8801d83a76d8 R11: 0000000000000000 R12: 0000000000000001 R13: 0000000000010001 R14: 000000000000ffff R15: 00000000000000a8 FS: 00007f1a66db5700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7d77f091b0 CR3: 00000001ba252000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: skb_csum_hwoffload_help+0x8f/0xe0 net/core/dev.c:3269 validate_xmit_skb+0xa2a/0xf30 net/core/dev.c:3312 __dev_queue_xmit+0xc2f/0x3950 net/core/dev.c:3797 dev_queue_xmit+0x17/0x20 net/core/dev.c:3838 packet_snd net/packet/af_packet.c:2928 [inline] packet_sendmsg+0x422d/0x64c0 net/packet/af_packet.c:2953 Fixes: 5ff8dda3035d ("net: Ensure partial checksum offset is inside the skb head") Signed-off-by: Eric Dumazet <[email protected]> Cc: Herbert Xu <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-10Merge branch 'devlink-param-type-string-fixes'David S. Miller2-8/+47
Moshe Shemesh says: ==================== devlink param type string fixes This patchset fixes devlink param infrastructure for string param type. The devlink param infrastructure doesn't handle copying the string data correctly. The first two patches fix it and the third patch adds helper function to safely copy string value without exceeding DEVLINK_PARAM_MAX_STRING_VALUE. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-10-10devlink: Add helper function for safely copy string paramMoshe Shemesh2-3/+28
Devlink string param buffer is allocated at the size of DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure this size is not exceeded. Renamed DEVLINK_PARAM_MAX_STRING_VALUE to __DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by devlink only. The driver should use the helper function instead to verify it doesn't exceed the allowed length. Signed-off-by: Moshe Shemesh <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-10devlink: Fix param cmode driverinit for string typeMoshe Shemesh1-3/+12
Driverinit configuration mode value is held by devlink to enable the driver fetch the value after reload command. In case the param type is string devlink should copy the value from driver string buffer to devlink string buffer on devlink_param_driverinit_value_set() and vice-versa on devlink_param_driverinit_value_get(). Fixes: ec01aeb1803e ("devlink: Add support for get/set driverinit value") Signed-off-by: Moshe Shemesh <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-10devlink: Fix param set handling for string typeMoshe Shemesh2-4/+9
In case devlink param type is string, it needs to copy the string value it got from the input to devlink_param_value. Fixes: e3b7ca18ad7b ("devlink: Add param set command") Signed-off-by: Moshe Shemesh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-09Merge branch 'ena-fixes'David S. Miller2-14/+16
Arthur Kiyanovski says: ==================== minor bug fixes for ENA Ethernet driver Arthur Kiyanovski (4): net: ena: fix warning in rmmod caused by double iounmap net: ena: fix rare bug when failed restart/resume is followed by driver removal net: ena: fix NULL dereference due to untimely napi initialization net: ena: fix auto casting to boolean ==================== Signed-off-by: David S. Miller <[email protected]>
2018-10-09net: ena: fix auto casting to booleanArthur Kiyanovski1-4/+4
Eliminate potential auto casting compilation error. Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Arthur Kiyanovski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-09net: ena: fix NULL dereference due to untimely napi initializationArthur Kiyanovski1-2/+7
napi poll functions should be initialized before running request_irq(), to handle a rare condition where there is a pending interrupt, causing the ISR to fire immediately while the poll function wasn't set yet, causing a NULL dereference. Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Arthur Kiyanovski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-09net: ena: fix rare bug when failed restart/resume is followed by driver removalArthur Kiyanovski1-0/+4
In a rare scenario when ena_device_restore() fails, followed by device remove, an FLR will not be issued. In this case, the device will keep sending asynchronous AENQ keep-alive events, even after driver removal, leading to memory corruption. Fixes: 8c5c7abdeb2d ("net: ena: add power management ops to the ENA driver") Signed-off-by: Arthur Kiyanovski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-09net: ena: fix warning in rmmod caused by double iounmapArthur Kiyanovski1-8/+1
Memory mapped with devm_ioremap is automatically freed when the driver is disconnected from the device. Therefore there is no need to explicitly call devm_iounmap. Fixes: 0857d92f71b6 ("net: ena: add missing unmap bars on device removal") Fixes: 411838e7b41c ("net: ena: fix rare kernel crash when bar memory remap fails") Signed-off-by: Arthur Kiyanovski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-07Merge branch 'net-smc-userspace-breakage-fixes'David S. Miller1-11/+14
Eugene Syromiatnikov says: ==================== net/smc: userspace breakage fixes These two patches correct some userspace-affecting issues introduced during 4.19 development cycle, specifically: * New structure "struct smcd_diag_dmbinfo" has been defined in a way that would lead to different layout of the structure on most 32-bit ABIs in comparison with layout on 64-bit ABIs; * One of the commits renamed an UAPI-exposed field name. Changes since v1: * Managed not to forget to add --cover-letter. * Commit ID format in commit message has been changed in accordance with Sergei Shtylyov's recommendations. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-10-07net/smc: retain old name for diag_mode fieldEugene Syromiatnikov1-1/+4
Commit c601171d7a60 ("net/smc: provide smc mode in smc_diag.c") changed the name of diag_fallback field of struct smc_diag_msg structure to diag_mode. However, this structure is a part of UAPI, and this change breaks user space applications that use it ([1], for example). Since the new name is more suitable, convert the field to a union that provides access to the data via both the new and the old name. [1] https://gitlab.com/strace/strace/blob/v4.24/netlink_smc_diag.c#L165 Fixes: c601171d7a60 ("net/smc: provide smc mode in smc_diag.c") Signed-off-by: Eugene Syromiatnikov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-07net/smc: use __aligned_u64 for 64-bit smc_diag fieldsEugene Syromiatnikov1-11/+11
Commit 4b1b7d3b30a6 ("net/smc: add SMC-D diag support") introduced new UAPI-exposed structure, struct smcd_diag_dmbinfo. However, it's not usable by compat binaries, as it has different layout there. Probably, the most straightforward fix that will avoid similar issues in the future is to use __aligned_u64 for 64-bit fields. Fixes: 4b1b7d3b30a6 ("net/smc: add SMC-D diag support") Signed-off-by: Eugene Syromiatnikov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-07net: sched: cls_u32: fix hnode refcountingAl Viro1-5/+5
cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via ->hlist and via ->tp_root together. u32_destroy() drops the former and, in case when there had been links, leaves the sucker on the list. As the result, there's nothing to protect it from getting freed once links are dropped. That also makes the "is it busy" check incapable of catching the root hnode - it *is* busy (there's a reference from tp), but we don't see it as something separate. "Is it our root?" check partially covers that, but the problem exists for others' roots as well. AFAICS, the minimal fix preserving the existing behaviour (where it doesn't include oopsen, that is) would be this: * count tp->root and tp_c->hlist as separate references. I.e. have u32_init() set refcount to 2, not 1. * in u32_destroy() we always drop the former; in u32_destroy_hnode() - the latter. That way we have *all* references contributing to refcount. List removal happens in u32_destroy_hnode() (called only when ->refcnt is 1) an in u32_destroy() in case of tc_u_common going away, along with everything reachable from it. IOW, that way we know that u32_destroy_key() won't free something still on the list (or pointed to by someone's ->root). Reproducer: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: \ u32 divisor 1 tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: \ u32 divisor 1 tc filter add dev eth0 parent ffff: protocol ip prio 100 \ handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \ plus 0 eat match ip protocol 6 ff tc filter delete dev eth0 parent ffff: protocol ip prio 200 tc filter change dev eth0 parent ffff: protocol ip prio 100 \ handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \ eat match ip protocol 6 ff tc filter delete dev eth0 parent ffff: protocol ip prio 100 Signed-off-by: Al Viro <[email protected]> Signed-off-by: Jamal Hadi Salim <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-07udp: Unbreak modules that rely on external __skb_recv_udp() availabilityJiri Kosina1-1/+1
Commit 2276f58ac589 ("udp: use a separate rx queue for packet reception") turned static inline __skb_recv_udp() from being a trivial helper around __skb_recv_datagram() into a UDP specific implementaion, making it EXPORT_SYMBOL_GPL() at the same time. There are external modules that got broken by __skb_recv_udp() not being visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL(). Rationale (one of those) why this is actually "technically correct" thing to do: __skb_recv_udp() used to be an inline wrapper around __skb_recv_datagram(), which itself (still, and correctly so, I believe) is EXPORT_SYMBOL(). Cc: Paolo Abeni <[email protected]> Cc: Eric Dumazet <[email protected]> Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception") Signed-off-by: Jiri Kosina <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netGreg Kroah-Hartman24-81/+203
Dave writes: "Networking fixes: 1) Fix truncation of 32-bit right shift in bpf, from Jann Horn. 2) Fix memory leak in wireless wext compat, from Stefan Seyfried. 3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao. 4) Need to cancel pending work when unbinding in smsc75xx otherwise we oops, also from Yu Zhao. 5) Don't allow enslaving a team device to itself, from Ido Schimmel. 6) Fix backwards compat with older userspace for rtnetlink FDB dumps. From Mauricio Faria. 7) Add validation of tc policy netlink attributes, from David Ahern. 8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits) net: mvpp2: Extract the correct ethtype from the skb for tx csum offload ipv6: take rcu lock in rawv6_send_hdrinc() net: sched: Add policy validation for tc attributes rtnetlink: fix rtnl_fdb_dump() for ndmsg header yam: fix a missing-check bug net: bpfilter: Fix type cast and pointer warnings net: cxgb3_main: fix a missing-check bug bpf: 32-bit RSH verification must truncate input before the ALU op net: phy: phylink: fix SFP interface autodetection be2net: don't flip hw_features when VXLANs are added/deleted net/packet: fix packet drop as of virtio gso net: dsa: b53: Keep CPU port as tagged in all VLANs openvswitch: load NAT helper bnxt_en: get the reduced max_irqs by the ones used by RDMA bnxt_en: free hwrm resources, if driver probe fails. bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request bnxt_en: Fix VNIC reservations on the PF. team: Forbid enslaving team device to itself net/usb: cancel pending work when unbinding smsc75xx mlxsw: spectrum: Delete RIF when VLAN device is removed ...
2018-10-05Merge branch 'akpm'Greg Kroah-Hartman18-30/+189
* akpm: mm: madvise(MADV_DODUMP): allow hugetlbfs pages ocfs2: fix locking for res->tracking and dlm->tracking_list mm/vmscan.c: fix int overflow in callers of do_shrink_slab() mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly mm/vmstat.c: fix outdated vmstat_text proc: restrict kernel stack dumps to root mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes mm/migrate.c: split only transparent huge pages when allocation fails ipc/shm.c: use ERR_CAST() for shm_lock() error return mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl mm, thp: fix mlocking THP page with migration enabled ocfs2: fix crash in ocfs2_duplicate_clusters_by_page() hugetlb: take PMD sharing into account when flushing tlb/caches mm: migration: fix migration of huge PMD shared pages
2018-10-05mm: madvise(MADV_DODUMP): allow hugetlbfs pagesDaniel Black1-1/+1
Reproducer, assuming 2M of hugetlbfs available: Hugetlbfs mounted, size=2M and option user=testuser # mount | grep ^hugetlbfs hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan) # sysctl vm.nr_hugepages=1 vm.nr_hugepages = 1 # grep Huge /proc/meminfo AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 1 HugePages_Free: 1 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 2048 kB Code: #include <sys/mman.h> #include <stddef.h> #define SIZE 2*1024*1024 int main() { void *ptr; ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0); madvise(ptr, SIZE, MADV_DONTDUMP); madvise(ptr, SIZE, MADV_DODUMP); } Compile and strace: mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000 madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0 madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument) hugetlbfs pages have VM_DONTEXPAND in the VmFlags driver pages based on author testing with analysis from Florian Weimer[1]. The inclusion of VM_DONTEXPAND into the VM_SPECIAL defination was a consequence of the large useage of VM_DONTEXPAND in device drivers. A consequence of [2] is that VM_DONTEXPAND marked pages are unable to be marked DODUMP. A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs memory for a while and later request that madvise(MADV_DODUMP) on the same memory. We correct this omission by allowing madvice(MADV_DODUMP) on hugetlbfs pages. [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit [2] commit 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers") Link: http://lkml.kernel.org/r/[email protected] Link: https://lists.launchpad.net/maria-discuss/msg05245.html Fixes: 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers") Reported-by: Kenneth Penza <[email protected]> Signed-off-by: Daniel Black <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05ocfs2: fix locking for res->tracking and dlm->tracking_listAshish Samant1-2/+2
In dlm_init_lockres() we access and modify res->tracking and dlm->tracking_list without holding dlm->track_lock. This can cause list corruptions and can end up in kernel panic. Fix this by locking res->tracking and dlm->tracking_list with dlm->track_lock instead of dlm->spinlock. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ashish Samant <[email protected]> Reviewed-by: Changwei Ge <[email protected]> Acked-by: Joseph Qi <[email protected]> Acked-by: Jun Piao <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05mm/vmscan.c: fix int overflow in callers of do_shrink_slab()Kirill Tkhai1-4/+3
do_shrink_slab() returns unsigned long value, and the placing into int variable cuts high bytes off. Then we compare ret and 0xfffffffe (since SHRINK_EMPTY is converted to ret type). Thus a large number of objects returned by do_shrink_slab() may be interpreted as SHRINK_EMPTY, if low bytes of their value are equal to 0xfffffffe. Fix that by declaration ret as unsigned long in these functions. Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <[email protected]> Reported-by: Cyrill Gorcunov <[email protected]> Acked-by: Cyrill Gorcunov <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properlyJann Horn1-0/+3
5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside the kernel unconditional to reduce #ifdef soup, but (either to avoid showing dummy zero counters to userspace, or because that code was missed) didn't update the vmstat_array, meaning that all following counters would be shown with incorrect values. This only affects kernel builds with CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n. Link: http://lkml.kernel.org/r/[email protected] Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP") Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Kemi Wang <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05mm/vmstat.c: fix outdated vmstat_textJann Horn1-1/+0
7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding entry in vmstat_text. This causes an out-of-bounds access in vmstat_show(). Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which is probably very rare. Link: http://lkml.kernel.org/r/[email protected] Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Kemi Wang <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05proc: restrict kernel stack dumps to rootJann Horn1-0/+14
Currently, you can use /proc/self/task/*/stack to cause a stack walk on a task you control while it is running on another CPU. That means that the stack can change under the stack walker. The stack walker does have guards against going completely off the rails and into random kernel memory, but it can interpret random data from your kernel stack as instruction pointers and stack pointers. This can cause exposure of kernel stack contents to userspace. Restrict the ability to inspect kernel stacks of arbitrary tasks to root in order to prevent a local attacker from exploiting racy stack unwinding to leak kernel task stack contents. See the added comment for a longer rationale. There don't seem to be any users of this userspace API that can't gracefully bail out if reading from the file fails. Therefore, I believe that this change is unlikely to break things. In the case that this patch does end up needing a revert, the next-best solution might be to fake a single-entry stack based on wchan. Link: http://lkml.kernel.org/r/[email protected] Fixes: 2ec220e27f50 ("proc: add /proc/*/stack") Signed-off-by: Jann Horn <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Ken Chen <[email protected]> Cc: Will Deacon <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H . Peter Anvin" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizesAnshuman Khandual4-0/+8
ARM64 architecture also supports 32MB and 512MB HugeTLB page sizes. This just adds mmap() system call argument encoding for them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Punit Agrawal <[email protected]> Acked-by: Mike Kravetz <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05mm/migrate.c: split only transparent huge pages when allocation failsAnshuman Khandual1-1/+1
split_huge_page_to_list() fails on HugeTLB pages. I was experimenting with moving 32MB contig HugeTLB pages on arm64 (with a debug patch applied) and hit the following stack trace when the kernel crashed. [ 3732.462797] Call trace: [ 3732.462835] split_huge_page_to_list+0x3b0/0x858 [ 3732.462913] migrate_pages+0x728/0xc20 [ 3732.462999] soft_offline_page+0x448/0x8b0 [ 3732.463097] __arm64_sys_madvise+0x724/0x850 [ 3732.463197] el0_svc_handler+0x74/0x110 [ 3732.463297] el0_svc+0x8/0xc [ 3732.463347] Code: d1000400 f90b0e60 f2fbd5a2 a94982a1 (f9000420) When unmap_and_move[_huge_page]() fails due to lack of memory, the splitting should happen only for transparent huge pages not for HugeTLB pages. PageTransHuge() returns true for both THP and HugeTLB pages. Hence the conditonal check should test PagesHuge() flag to make sure that given pages is not a HugeTLB one. Link: http://lkml.kernel.org/r/[email protected] Fixes: 94723aafb9 ("mm: unclutter THP migration") Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Zi Yan <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05ipc/shm.c: use ERR_CAST() for shm_lock() error returnKees Cook1-1/+1
This uses ERR_CAST() instead of an open-coded cast, as it is casting across structure pointers, which upsets __randomize_layout: ipc/shm.c: In function `shm_lock': ipc/shm.c:209:9: note: randstruct: casting between randomized structure pointer types (ssa): `struct shmid_kernel' and `struct kern_ipc_perm' return (void *)ipcp; ^~~~~~~~~~~~ Link: http://lkml.kernel.org/r/20180919180722.GA15073@beast Fixes: 82061c57ce93 ("ipc: drop ipc_lock()") Signed-off-by: Kees Cook <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Manfred Spraul <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctlYueHaibing1-1/+2
get_user_pages_fast() will return negative value if no pages were pinned, then be converted to a unsigned, which is compared to zero, giving the wrong result. Link: http://lkml.kernel.org/r/[email protected] Fixes: 09e35a4a1ca8 ("mm/gup_benchmark: handle gup failures") Signed-off-by: YueHaibing <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05mm, thp: fix mlocking THP page with migration enabledKirill A. Shutemov2-1/+4
A transparent huge page is represented by a single entry on an LRU list. Therefore, we can only make unevictable an entire compound page, not individual subpages. If a user tries to mlock() part of a huge page, we want the rest of the page to be reclaimable. We handle this by keeping PTE-mapped huge pages on normal LRU lists: the PMD on border of VM_LOCKED VMA will be split into PTE table. Introduction of THP migration breaks[1] the rules around mlocking THP pages. If we had a single PMD mapping of the page in mlocked VMA, the page will get mlocked, regardless of PTE mappings of the page. For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in remove_migration_pmd(). Anon THP pages can only be shared between processes via fork(). Mlocked page can only be shared if parent mlocked it before forking, otherwise CoW will be triggered on mlock(). For Anon-THP, we can fix the issue by munlocking the page on removing PTE migration entry for the page. PTEs for the page will always come after mlocked PMD: rmap walks VMAs from oldest to newest. Test-case: #include <unistd.h> #include <sys/mman.h> #include <sys/wait.h> #include <linux/mempolicy.h> #include <numaif.h> int main(void) { unsigned long nodemask = 4; void *addr; addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0); if (fork()) { wait(NULL); return 0; } mlock(addr, 4UL << 10); mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES, &nodemask, 4, MPOL_MF_MOVE); return 0; } [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com Link: http://lkml.kernel.org/r/[email protected] Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path") Signed-off-by: Kirill A. Shutemov <[email protected]> Reported-by: Vegard Nossum <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: <[email protected]> [4.14+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()Larry Chen1-4/+12
ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages is dirty. When a page has not been written back, it is still in dirty state. If ocfs2_duplicate_clusters_by_page() is called against the dirty page, the crash happens. To fix this bug, we can just unlock the page and wait until the page until its not dirty. The following is the backtrace: kernel BUG at /root/code/ocfs2/refcounttree.c:2961! [exception RIP: ocfs2_duplicate_clusters_by_page+822] __ocfs2_move_extent+0x80/0x450 [ocfs2] ? __ocfs2_claim_clusters+0x130/0x250 [ocfs2] ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2] __ocfs2_move_extents_range+0x2a4/0x470 [ocfs2] ocfs2_move_extents+0x180/0x3b0 [ocfs2] ? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2] ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2] ocfs2_ioctl+0x253/0x640 [ocfs2] do_vfs_ioctl+0x90/0x5f0 SyS_ioctl+0x74/0x80 do_syscall_64+0x74/0x140 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Once we find the page is dirty, we do not wait until it's clean, rather we use write_one_page() to write it back Link: http://lkml.kernel.org/r/[email protected] [[email protected]: update comments] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: coding-style fixes] Signed-off-by: Larry Chen <[email protected]> Acked-by: Changwei Ge <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Joseph Qi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05hugetlb: take PMD sharing into account when flushing tlb/cachesMike Kravetz1-9/+44
When fixing an issue with PMD sharing and migration, it was discovered via code inspection that other callers of huge_pmd_unshare potentially have an issue with cache and tlb flushing. Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst case ranges for mmu notifiers. Ensure that this range is flushed if huge_pmd_unshare succeeds and unmaps a PUD_SUZE area. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05mm: migration: fix migration of huge PMD shared pagesMike Kravetz4-5/+94
The page migration code employs try_to_unmap() to try and unmap the source page. This is accomplished by using rmap_walk to find all vmas where the page is mapped. This search stops when page mapcount is zero. For shared PMD huge pages, the page map count is always 1 no matter the number of mappings. Shared mappings are tracked via the reference count of the PMD page. Therefore, try_to_unmap stops prematurely and does not completely unmap all mappings of the source page. This problem can result is data corruption as writes to the original source page can happen after contents of the page are copied to the target page. Hence, data is lost. This problem was originally seen as DB corruption of shared global areas after a huge page was soft offlined due to ECC memory errors. DB developers noticed they could reproduce the issue by (hotplug) offlining memory used to back huge pages. A simple testcase can reproduce the problem by creating a shared PMD mapping (note that this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using migrate_pages() to migrate process pages between nodes while continually writing to the huge pages being migrated. To fix, have the try_to_unmap_one routine check for huge PMD sharing by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared mapping it will be 'unshared' which removes the page table entry and drops the reference on the PMD page. After this, flush caches and TLB. mmu notifiers are called before locking page tables, but we can not be sure of PMD sharing until page tables are locked. Therefore, check for the possibility of PMD sharing before locking so that notifiers can prepare for the worst possible case. Link: http://lkml.kernel.org/r/[email protected] [[email protected]: make _range_in_vma() a static inline] Link: http://lkml.kernel.org/r/[email protected] Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-10-05Merge tag 'pci-v4.19-fixes-3' of ↵Greg Kroah-Hartman3-13/+67
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Bjorn writes: "PCI fixes for v4.19: - Reprogram bridge prefetch registers to fix NVIDIA and Radeon issues after suspend/resume (Daniel Drake) - Fix mvebu I/O mapping creation sequence (Thomas Petazzoni) - Fix minor MAINTAINERS file match issue (Bjorn Helgaas)" * tag 'pci-v4.19-fixes-3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: mvebu: Fix PCI I/O mapping creation sequence MAINTAINERS: Remove obsolete drivers/pci pattern from ACPI section PCI: Reprogram bridge prefetch registers on resume
2018-10-05Merge tag 'for-4.19/dm-fixes-2' of ↵Greg Kroah-Hartman5-15/+20
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Mike writes: "device mapper fixes - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during 4.19 merge window. - Fix leak and dangling pointer in DM multipath's scsi_dh related code. - A couple stable@ fixes for DM cache's resize support. - A DM raid fix to remove "const" from decipher_sync_action()'s return type." * tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: fix resize crash if user doesn't reload cache table dm cache metadata: ignore hints array being too small during resize dm raid: remove bogus const from decipher_sync_action() return type dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer dm thin metadata: fix __udivdi3 undefined on 32-bit
2018-10-05Merge tag 'gpio-v4.19-3' of ↵Greg Kroah-Hartman1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Linus writes: "A single GPIO fix: Free the last used descriptor, an off by one error. This is tagged for stable as well." * tag 'gpio-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpiolib: Free the last requested descriptor
2018-10-05Merge tag 'pm-4.19-rc7' of ↵Greg Kroah-Hartman1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Rafael writes: "Power management fix for 4.19-rc7 Fix a bug that may cause runtime PM to misbehave for some devices after a failing or aborted system suspend which is nasty enough for an -rc7 time frame fix." * tag 'pm-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / core: Clear the direct_complete flag on errors
2018-10-05Merge branch 'perf-urgent-for-linus' of ↵Greg Kroah-Hartman4-14/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Ingo writes: "perf fixes: - fix a CPU#0 hot unplug bug and a PCI enumeration bug in the x86 Intel uncore PMU driver - fix a CPU event enumeration bug in the x86 AMD PMU driver - fix a perf ring-buffer corruption bug when using tracepoints - fix a PMU unregister locking bug" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX perf/ring_buffer: Prevent concurent ring buffer access perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded physical package ID 0 perf/core: Fix perf_pmu_unregister() locking
2018-10-05Merge branch 'x86-urgent-for-linus' of ↵Greg Kroah-Hartman6-15/+211
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Ingo writes: "x86 fixes: Misc fixes: - fix various vDSO bugs: asm constraints and retpolines - add vDSO test units to make sure they never re-appear - fix UV platform TSC initialization bug - fix build warning on Clang" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Fix vDSO syscall fallback asm constraint regression x86/cpu/amd: Remove unnecessary parentheses x86/vdso: Only enable vDSO retpolines when enabled and supported x86/tsc: Fix UV TSC initialization x86/platform/uv: Provide is_early_uv_system() selftests/x86: Add clock_gettime() tests to test_vdso x86/vdso: Fix asm constraints on vDSO syscall fallbacks
2018-10-05Merge branch 'sched-urgent-for-linus' of ↵Greg Kroah-Hartman8-108/+95
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Ingo writes: "scheduler fixes: These fixes address a rather involved performance regression between v4.17->v4.19 in the sched/numa auto-balancing code. Since distros really need this fix we accelerated it to sched/urgent for a faster upstream merge. NUMA scheduling and balancing performance is now largely back to v4.17 levels, without reintroducing the NUMA placement bugs that v4.18 and v4.19 fixed. Many thanks to Srikar Dronamraju, Mel Gorman and Jirka Hladky, for reporting, testing, re-testing and solving this rather complex set of bugs." * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration sched/numa: Avoid task migration for small NUMA improvement mm/migrate: Use spin_trylock() while resetting rate limit sched/numa: Limit the conditions where scan period is reset sched/numa: Reset scan rate whenever task moves across nodes sched/numa: Pass destination CPU as a parameter to migrate_task_rq sched/numa: Stop multiple tasks from moving to the CPU at the same time
2018-10-05Merge branch 'locking-urgent-for-linus' of ↵Greg Kroah-Hartman2-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Ingo writes: "locking fixes: A fix in the ww_mutex self-test that produces a scary splat, plus an updates to the maintained-filed patters in MAINTAINER." * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/ww_mutex: Fix runtime warning in the WW mutex selftest MAINTAINERS: Remove dead path from LOCKING PRIMITIVES entry
2018-10-05Merge tag 'sound-4.19-rc7' of ↵Greg Kroah-Hartman2-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Takashi writes: "sound fixes for 4.19-rc7 Just two small fixes for HD-audio: one is for a typo in completion timeout, and another a fixup for Dell machines as usual" * tag 'sound-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: hda/realtek - Cannot adjust speaker's volume on Dell XPS 27 7760 ALSA: hda: Fix the audio-component completion timeout
2018-10-05net: mvpp2: Extract the correct ethtype from the skb for tx csum offloadMaxime Chevallier1-4/+5
When offloading the L3 and L4 csum computation on TX, we need to extract the l3_proto from the ethtype, independently of the presence of a vlan tag. The actual driver uses skb->protocol as-is, resulting in packets with the wrong L4 checksum being sent when there's a vlan tag in the packet header and checksum offloading is enabled. This commit makes use of vlan_protocol_get() to get the correct ethtype regardless the presence of a vlan tag. Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit") Signed-off-by: Maxime Chevallier <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-10-05ipv6: take rcu lock in rawv6_send_hdrinc()Wei Wang1-9/+20
In rawv6_send_hdrinc(), in order to avoid an extra dst_hold(), we directly assign the dst to skb and set passed in dst to NULL to avoid double free. However, in error case, we free skb and then do stats update with the dst pointer passed in. This causes use-after-free on the dst. Fix it by taking rcu read lock right before dst could get released to make sure dst does not get freed until the stats update is done. Note: we don't have this issue in ipv4 cause dst is not used for stats update in v4. Syzkaller reported following crash: BUG: KASAN: use-after-free in rawv6_send_hdrinc net/ipv6/raw.c:692 [inline] BUG: KASAN: use-after-free in rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921 Read of size 8 at addr ffff8801d95ba730 by task syz-executor0/32088 CPU: 1 PID: 32088 Comm: syz-executor0 Not tainted 4.19.0-rc2+ #93 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 rawv6_send_hdrinc net/ipv6/raw.c:692 [inline] rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:621 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:631 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114 __sys_sendmsg+0x11d/0x280 net/socket.c:2152 __do_sys_sendmsg net/socket.c:2161 [inline] __se_sys_sendmsg net/socket.c:2159 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457099 Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f83756edc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f83756ee6d4 RCX: 0000000000457099 RDX: 0000000000000000 RSI: 0000000020003840 RDI: 0000000000000004 RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000004d4b30 R14: 00000000004c90b1 R15: 0000000000000000 Allocated by task 32088: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490 kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554 dst_alloc+0xbb/0x1d0 net/core/dst.c:105 ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:353 ip6_rt_cache_alloc+0x247/0x7b0 net/ipv6/route.c:1186 ip6_pol_route+0x8f8/0xd90 net/ipv6/route.c:1895 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2093 fib6_rule_lookup+0x277/0x860 net/ipv6/fib6_rules.c:122 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2121 ip6_route_output include/net/ip6_route.h:88 [inline] ip6_dst_lookup_tail+0xe27/0x1d60 net/ipv6/ip6_output.c:951 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079 rawv6_sendmsg+0x12d9/0x4630 net/ipv6/raw.c:905 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:621 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:631 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114 __sys_sendmsg+0x11d/0x280 net/socket.c:2152 __do_sys_sendmsg net/socket.c:2161 [inline] __se_sys_sendmsg net/socket.c:2159 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 5356: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kmem_cache_free+0x83/0x290 mm/slab.c:3756 dst_destroy+0x267/0x3c0 net/core/dst.c:141 dst_destroy_rcu+0x16/0x19 net/core/dst.c:154 __rcu_reclaim kernel/rcu/rcu.h:236 [inline] rcu_do_batch kernel/rcu/tree.c:2576 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline] __rcu_process_callbacks kernel/rcu/tree.c:2847 [inline] rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864 __do_softirq+0x30b/0xad8 kernel/softirq.c:292 Fixes: 1789a640f556 ("raw: avoid two atomics in xmit") Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>