aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-05-17Merge branch 'ip6_gre-Fixes-in-headroom-handling'David S. Miller1-39/+145
Petr Machata says: ==================== net: ip6_gre: Fixes in headroom handling This series mends some problems in headroom management in ip6_gre module. The current code base has the following three closely-related problems: - ip6gretap tunnels neglect to ensure there's enough writable headroom before pushing GRE headers. - ip6erspan does this, but assumes that dev->needed_headroom is primed. But that doesn't happen until ip6_tnl_xmit() is called later. Thus for the first packet, ip6erspan actually behaves like ip6gretap above. - ip6erspan shares some of the code with ip6gretap, including calculations of needed header length. While there is custom ERSPAN-specific code for calculating the headroom, the computed values are overwritten by the ip6gretap code. The first two issues lead to a kernel panic in situations where a packet is mirrored from a veth device to the device in question. They are fixed, respectively, in patches #1 and #2, which include the full panic trace and a reproducer. The rest of the patchset deals with the last issue. In patches #3 to #6, several functions are split up into reusable parts. Finally in patch #7 these blocks are used to compose ERSPAN-specific callbacks where necessary to fix the hlen calculation. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-05-17net: ip6_gre: Fix ip6erspan hlen calculationPetr Machata1-9/+65
Even though ip6erspan_tap_init() sets up hlen and tun_hlen according to what ERSPAN needs, it goes ahead to call ip6gre_tnl_link_config() which overwrites these settings with GRE-specific ones. Similarly for changelink callbacks, which are handled by ip6gre_changelink() calls ip6gre_tnl_change() calls ip6gre_tnl_link_config() as well. The difference ends up being 12 vs. 20 bytes, and this is generally not a problem, because a 12-byte request likely ends up allocating more and the extra 8 bytes are thus available. However correct it is not. So replace the newlink and changelink callbacks with an ERSPAN-specific ones, reusing the newly-introduced _common() functions. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <[email protected]> Acked-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17net: ip6_gre: Split up ip6gre_changelink()Petr Machata1-9/+24
Extract from ip6gre_changelink() a reusable function ip6gre_changelink_common(). This will allow introduction of ERSPAN-specific _changelink() function with not a lot of code duplication. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <[email protected]> Acked-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17net: ip6_gre: Split up ip6gre_newlink()Petr Machata1-6/+18
Extract from ip6gre_newlink() a reusable function ip6gre_newlink_common(). The ip6gre_tnl_link_config() call needs to be made customizable for ERSPAN, thus reorder it with calls to ip6_tnl_change_mtu() and dev_hold(), and extract the whole tail to the caller, ip6gre_newlink(). Thus enable an ERSPAN-specific _newlink() function without a lot of duplicity. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <[email protected]> Acked-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17net: ip6_gre: Split up ip6gre_tnl_change()Petr Machata1-2/+8
Split a reusable function ip6gre_tnl_copy_tnl_parm() from ip6gre_tnl_change(). This will allow ERSPAN-specific code to reuse the common parts while customizing the behavior for ERSPAN. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <[email protected]> Acked-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17net: ip6_gre: Split up ip6gre_tnl_link_config()Petr Machata1-12/+26
The function ip6gre_tnl_link_config() is used for setting up configuration of both ip6gretap and ip6erspan tunnels. Split the function into the common part and the route-lookup part. The latter then takes the calculated header length as an argument. This split will allow the patches down the line to sneak in a custom header length computation for the ERSPAN tunnel. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <[email protected]> Acked-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17net: ip6_gre: Fix headroom request in ip6erspan_tunnel_xmit()Petr Machata1-1/+1
dev->needed_headroom is not primed until ip6_tnl_xmit(), so it starts out zero. Thus the call to skb_cow_head() fails to actually make sure there's enough headroom to push the ERSPAN headers to. That can lead to the panic cited below. (Reproducer below that). Fix by requesting either needed_headroom if already primed, or just the bare minimum needed for the header otherwise. [ 190.703567] kernel BUG at net/core/skbuff.c:104! [ 190.708384] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI [ 190.714007] Modules linked in: act_mirred cls_matchall ip6_gre ip6_tunnel tunnel6 gre sch_ingress vrf veth x86_pkg_temp_thermal mlx_platform nfsd e1000e leds_mlxcpld [ 190.728975] CPU: 1 PID: 959 Comm: kworker/1:2 Not tainted 4.17.0-rc4-net_master-custom-139 #10 [ 190.737647] Hardware name: Mellanox Technologies Ltd. "MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016 [ 190.747006] Workqueue: ipv6_addrconf addrconf_dad_work [ 190.752222] RIP: 0010:skb_panic+0xc3/0x100 [ 190.756358] RSP: 0018:ffff8801d54072f0 EFLAGS: 00010282 [ 190.761629] RAX: 0000000000000085 RBX: ffff8801c1a8ecc0 RCX: 0000000000000000 [ 190.768830] RDX: 0000000000000085 RSI: dffffc0000000000 RDI: ffffed003aa80e54 [ 190.776025] RBP: ffff8801bd1ec5a0 R08: ffffed003aabce19 R09: ffffed003aabce19 [ 190.783226] R10: 0000000000000001 R11: ffffed003aabce18 R12: ffff8801bf695dbe [ 190.790418] R13: 0000000000000084 R14: 00000000000006c0 R15: ffff8801bf695dc8 [ 190.797621] FS: 0000000000000000(0000) GS:ffff8801d5400000(0000) knlGS:0000000000000000 [ 190.805786] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 190.811582] CR2: 000055fa929aced0 CR3: 0000000003228004 CR4: 00000000001606e0 [ 190.818790] Call Trace: [ 190.821264] <IRQ> [ 190.823314] ? ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre] [ 190.828940] ? ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre] [ 190.834562] skb_push+0x78/0x90 [ 190.837749] ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre] [ 190.843219] ? ip6gre_tunnel_ioctl+0xd90/0xd90 [ip6_gre] [ 190.848577] ? debug_check_no_locks_freed+0x210/0x210 [ 190.853679] ? debug_check_no_locks_freed+0x210/0x210 [ 190.858783] ? print_irqtrace_events+0x120/0x120 [ 190.863451] ? sched_clock_cpu+0x18/0x210 [ 190.867496] ? cyc2ns_read_end+0x10/0x10 [ 190.871474] ? skb_network_protocol+0x76/0x200 [ 190.875977] dev_hard_start_xmit+0x137/0x770 [ 190.880317] ? do_raw_spin_trylock+0x6d/0xa0 [ 190.884624] sch_direct_xmit+0x2ef/0x5d0 [ 190.888589] ? pfifo_fast_dequeue+0x3fa/0x670 [ 190.892994] ? pfifo_fast_change_tx_queue_len+0x810/0x810 [ 190.898455] ? __lock_is_held+0xa0/0x160 [ 190.902422] __qdisc_run+0x39e/0xfc0 [ 190.906041] ? _raw_spin_unlock+0x29/0x40 [ 190.910090] ? pfifo_fast_enqueue+0x24b/0x3e0 [ 190.914501] ? sch_direct_xmit+0x5d0/0x5d0 [ 190.918658] ? pfifo_fast_dequeue+0x670/0x670 [ 190.923047] ? __dev_queue_xmit+0x172/0x1770 [ 190.927365] ? preempt_count_sub+0xf/0xd0 [ 190.931421] __dev_queue_xmit+0x410/0x1770 [ 190.935553] ? ___slab_alloc+0x605/0x930 [ 190.939524] ? print_irqtrace_events+0x120/0x120 [ 190.944186] ? memcpy+0x34/0x50 [ 190.947364] ? netdev_pick_tx+0x1c0/0x1c0 [ 190.951428] ? __skb_clone+0x2fd/0x3d0 [ 190.955218] ? __copy_skb_header+0x270/0x270 [ 190.959537] ? rcu_read_lock_sched_held+0x93/0xa0 [ 190.964282] ? kmem_cache_alloc+0x344/0x4d0 [ 190.968520] ? cyc2ns_read_end+0x10/0x10 [ 190.972495] ? skb_clone+0x123/0x230 [ 190.976112] ? skb_split+0x820/0x820 [ 190.979747] ? tcf_mirred+0x554/0x930 [act_mirred] [ 190.984582] tcf_mirred+0x554/0x930 [act_mirred] [ 190.989252] ? tcf_mirred_act_wants_ingress.part.2+0x10/0x10 [act_mirred] [ 190.996109] ? __lock_acquire+0x706/0x26e0 [ 191.000239] ? sched_clock_cpu+0x18/0x210 [ 191.004294] tcf_action_exec+0xcf/0x2a0 [ 191.008179] tcf_classify+0xfa/0x340 [ 191.011794] __netif_receive_skb_core+0x8e1/0x1c60 [ 191.016630] ? debug_check_no_locks_freed+0x210/0x210 [ 191.021732] ? nf_ingress+0x500/0x500 [ 191.025458] ? process_backlog+0x347/0x4b0 [ 191.029619] ? print_irqtrace_events+0x120/0x120 [ 191.034302] ? lock_acquire+0xd8/0x320 [ 191.038089] ? process_backlog+0x1b6/0x4b0 [ 191.042246] ? process_backlog+0xc2/0x4b0 [ 191.046303] process_backlog+0xc2/0x4b0 [ 191.050189] net_rx_action+0x5cc/0x980 [ 191.053991] ? napi_complete_done+0x2c0/0x2c0 [ 191.058386] ? mark_lock+0x13d/0xb40 [ 191.062001] ? clockevents_program_event+0x6b/0x1d0 [ 191.066922] ? print_irqtrace_events+0x120/0x120 [ 191.071593] ? __lock_is_held+0xa0/0x160 [ 191.075566] __do_softirq+0x1d4/0x9d2 [ 191.079282] ? ip6_finish_output2+0x524/0x1460 [ 191.083771] do_softirq_own_stack+0x2a/0x40 [ 191.087994] </IRQ> [ 191.090130] do_softirq.part.13+0x38/0x40 [ 191.094178] __local_bh_enable_ip+0x135/0x190 [ 191.098591] ip6_finish_output2+0x54d/0x1460 [ 191.102916] ? ip6_forward_finish+0x2f0/0x2f0 [ 191.107314] ? ip6_mtu+0x3c/0x2c0 [ 191.110674] ? ip6_finish_output+0x2f8/0x650 [ 191.114992] ? ip6_output+0x12a/0x500 [ 191.118696] ip6_output+0x12a/0x500 [ 191.122223] ? ip6_route_dev_notify+0x5b0/0x5b0 [ 191.126807] ? ip6_finish_output+0x650/0x650 [ 191.131120] ? ip6_fragment+0x1a60/0x1a60 [ 191.135182] ? icmp6_dst_alloc+0x26e/0x470 [ 191.139317] mld_sendpack+0x672/0x830 [ 191.143021] ? igmp6_mcf_seq_next+0x2f0/0x2f0 [ 191.147429] ? __local_bh_enable_ip+0x77/0x190 [ 191.151913] ipv6_mc_dad_complete+0x47/0x90 [ 191.156144] addrconf_dad_completed+0x561/0x720 [ 191.160731] ? addrconf_rs_timer+0x3a0/0x3a0 [ 191.165036] ? mark_held_locks+0xc9/0x140 [ 191.169095] ? __local_bh_enable_ip+0x77/0x190 [ 191.173570] ? addrconf_dad_work+0x50d/0xa20 [ 191.177886] ? addrconf_dad_work+0x529/0xa20 [ 191.182194] addrconf_dad_work+0x529/0xa20 [ 191.186342] ? addrconf_dad_completed+0x720/0x720 [ 191.191088] ? __lock_is_held+0xa0/0x160 [ 191.195059] ? process_one_work+0x45d/0xe20 [ 191.199302] ? process_one_work+0x51e/0xe20 [ 191.203531] ? rcu_read_lock_sched_held+0x93/0xa0 [ 191.208279] process_one_work+0x51e/0xe20 [ 191.212340] ? pwq_dec_nr_in_flight+0x200/0x200 [ 191.216912] ? get_lock_stats+0x4b/0xf0 [ 191.220788] ? preempt_count_sub+0xf/0xd0 [ 191.224844] ? worker_thread+0x219/0x860 [ 191.228823] ? do_raw_spin_trylock+0x6d/0xa0 [ 191.233142] worker_thread+0xeb/0x860 [ 191.236848] ? process_one_work+0xe20/0xe20 [ 191.241095] kthread+0x206/0x300 [ 191.244352] ? process_one_work+0xe20/0xe20 [ 191.248587] ? kthread_stop+0x570/0x570 [ 191.252459] ret_from_fork+0x3a/0x50 [ 191.256082] Code: 14 3e ff 8b 4b 78 55 4d 89 f9 41 56 41 55 48 c7 c7 a0 cf db 82 41 54 44 8b 44 24 2c 48 8b 54 24 30 48 8b 74 24 20 e8 16 94 13 ff <0f> 0b 48 c7 c7 60 8e 1f 85 48 83 c4 20 e8 55 ef a6 ff 89 74 24 [ 191.275327] RIP: skb_panic+0xc3/0x100 RSP: ffff8801d54072f0 [ 191.281024] ---[ end trace 7ea51094e099e006 ]--- [ 191.285724] Kernel panic - not syncing: Fatal exception in interrupt [ 191.292168] Kernel Offset: disabled [ 191.295697] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Reproducer: ip link add h1 type veth peer name swp1 ip link add h3 type veth peer name swp3 ip link set dev h1 up ip address add 192.0.2.1/28 dev h1 ip link add dev vh3 type vrf table 20 ip link set dev h3 master vh3 ip link set dev vh3 up ip link set dev h3 up ip link set dev swp3 up ip address add dev swp3 2001:db8:2::1/64 ip link set dev swp1 up tc qdisc add dev swp1 clsact ip link add name gt6 type ip6erspan \ local 2001:db8:2::1 remote 2001:db8:2::2 oseq okey 123 ip link set dev gt6 up sleep 1 tc filter add dev swp1 ingress pref 1000 matchall skip_hw \ action mirred egress mirror dev gt6 ping -I h1 192.0.2.2 Fixes: e41c7c68ea77 ("ip6erspan: make sure enough headroom at xmit.") Signed-off-by: Petr Machata <[email protected]> Acked-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17net: ip6_gre: Request headroom in __gre6_xmit()Petr Machata1-0/+3
__gre6_xmit() pushes GRE headers before handing over to ip6_tnl_xmit() for generic IP-in-IP processing. However it doesn't make sure that there is enough headroom to push the header to. That can lead to the panic cited below. (Reproducer below that). Fix by requesting either needed_headroom if already primed, or just the bare minimum needed for the header otherwise. [ 158.576725] kernel BUG at net/core/skbuff.c:104! [ 158.581510] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI [ 158.587174] Modules linked in: act_mirred cls_matchall ip6_gre ip6_tunnel tunnel6 gre sch_ingress vrf veth x86_pkg_temp_thermal mlx_platform nfsd e1000e leds_mlxcpld [ 158.602268] CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 4.17.0-rc4-net_master-custom-139 #10 [ 158.610938] Hardware name: Mellanox Technologies Ltd. "MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016 [ 158.620426] RIP: 0010:skb_panic+0xc3/0x100 [ 158.624586] RSP: 0018:ffff8801d3f27110 EFLAGS: 00010286 [ 158.629882] RAX: 0000000000000082 RBX: ffff8801c02cc040 RCX: 0000000000000000 [ 158.637127] RDX: 0000000000000082 RSI: dffffc0000000000 RDI: ffffed003a7e4e18 [ 158.644366] RBP: ffff8801bfec8020 R08: ffffed003aabce19 R09: ffffed003aabce19 [ 158.651574] R10: 000000000000000b R11: ffffed003aabce18 R12: ffff8801c364de66 [ 158.658786] R13: 000000000000002c R14: 00000000000000c0 R15: ffff8801c364de68 [ 158.666007] FS: 0000000000000000(0000) GS:ffff8801d5400000(0000) knlGS:0000000000000000 [ 158.674212] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 158.680036] CR2: 00007f4b3702dcd0 CR3: 0000000003228002 CR4: 00000000001606e0 [ 158.687228] Call Trace: [ 158.689752] ? __gre6_xmit+0x246/0xd80 [ip6_gre] [ 158.694475] ? __gre6_xmit+0x246/0xd80 [ip6_gre] [ 158.699141] skb_push+0x78/0x90 [ 158.702344] __gre6_xmit+0x246/0xd80 [ip6_gre] [ 158.706872] ip6gre_tunnel_xmit+0x3bc/0x610 [ip6_gre] [ 158.711992] ? __gre6_xmit+0xd80/0xd80 [ip6_gre] [ 158.716668] ? debug_check_no_locks_freed+0x210/0x210 [ 158.721761] ? print_irqtrace_events+0x120/0x120 [ 158.726461] ? sched_clock_cpu+0x18/0x210 [ 158.730572] ? sched_clock_cpu+0x18/0x210 [ 158.734692] ? cyc2ns_read_end+0x10/0x10 [ 158.738705] ? skb_network_protocol+0x76/0x200 [ 158.743216] ? netif_skb_features+0x1b2/0x550 [ 158.747648] dev_hard_start_xmit+0x137/0x770 [ 158.752010] sch_direct_xmit+0x2ef/0x5d0 [ 158.755992] ? pfifo_fast_dequeue+0x3fa/0x670 [ 158.760460] ? pfifo_fast_change_tx_queue_len+0x810/0x810 [ 158.765975] ? __lock_is_held+0xa0/0x160 [ 158.770002] __qdisc_run+0x39e/0xfc0 [ 158.773673] ? _raw_spin_unlock+0x29/0x40 [ 158.777781] ? pfifo_fast_enqueue+0x24b/0x3e0 [ 158.782191] ? sch_direct_xmit+0x5d0/0x5d0 [ 158.786372] ? pfifo_fast_dequeue+0x670/0x670 [ 158.790818] ? __dev_queue_xmit+0x172/0x1770 [ 158.795195] ? preempt_count_sub+0xf/0xd0 [ 158.799313] __dev_queue_xmit+0x410/0x1770 [ 158.803512] ? ___slab_alloc+0x605/0x930 [ 158.807525] ? ___slab_alloc+0x605/0x930 [ 158.811540] ? memcpy+0x34/0x50 [ 158.814768] ? netdev_pick_tx+0x1c0/0x1c0 [ 158.818895] ? __skb_clone+0x2fd/0x3d0 [ 158.822712] ? __copy_skb_header+0x270/0x270 [ 158.827079] ? rcu_read_lock_sched_held+0x93/0xa0 [ 158.831903] ? kmem_cache_alloc+0x344/0x4d0 [ 158.836199] ? skb_clone+0x123/0x230 [ 158.839869] ? skb_split+0x820/0x820 [ 158.843521] ? tcf_mirred+0x554/0x930 [act_mirred] [ 158.848407] tcf_mirred+0x554/0x930 [act_mirred] [ 158.853104] ? tcf_mirred_act_wants_ingress.part.2+0x10/0x10 [act_mirred] [ 158.860005] ? __lock_acquire+0x706/0x26e0 [ 158.864162] ? mark_lock+0x13d/0xb40 [ 158.867832] tcf_action_exec+0xcf/0x2a0 [ 158.871736] tcf_classify+0xfa/0x340 [ 158.875402] __netif_receive_skb_core+0x8e1/0x1c60 [ 158.880334] ? nf_ingress+0x500/0x500 [ 158.884059] ? process_backlog+0x347/0x4b0 [ 158.888241] ? lock_acquire+0xd8/0x320 [ 158.892050] ? process_backlog+0x1b6/0x4b0 [ 158.896228] ? process_backlog+0xc2/0x4b0 [ 158.900291] process_backlog+0xc2/0x4b0 [ 158.904210] net_rx_action+0x5cc/0x980 [ 158.908047] ? napi_complete_done+0x2c0/0x2c0 [ 158.912525] ? rcu_read_unlock+0x80/0x80 [ 158.916534] ? __lock_is_held+0x34/0x160 [ 158.920541] __do_softirq+0x1d4/0x9d2 [ 158.924308] ? trace_event_raw_event_irq_handler_exit+0x140/0x140 [ 158.930515] run_ksoftirqd+0x1d/0x40 [ 158.934152] smpboot_thread_fn+0x32b/0x690 [ 158.938299] ? sort_range+0x20/0x20 [ 158.941842] ? preempt_count_sub+0xf/0xd0 [ 158.945940] ? schedule+0x5b/0x140 [ 158.949412] kthread+0x206/0x300 [ 158.952689] ? sort_range+0x20/0x20 [ 158.956249] ? kthread_stop+0x570/0x570 [ 158.960164] ret_from_fork+0x3a/0x50 [ 158.963823] Code: 14 3e ff 8b 4b 78 55 4d 89 f9 41 56 41 55 48 c7 c7 a0 cf db 82 41 54 44 8b 44 24 2c 48 8b 54 24 30 48 8b 74 24 20 e8 16 94 13 ff <0f> 0b 48 c7 c7 60 8e 1f 85 48 83 c4 20 e8 55 ef a6 ff 89 74 24 [ 158.983235] RIP: skb_panic+0xc3/0x100 RSP: ffff8801d3f27110 [ 158.988935] ---[ end trace 5af56ee845aa6cc8 ]--- [ 158.993641] Kernel panic - not syncing: Fatal exception in interrupt [ 159.000176] Kernel Offset: disabled [ 159.003767] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Reproducer: ip link add h1 type veth peer name swp1 ip link add h3 type veth peer name swp3 ip link set dev h1 up ip address add 192.0.2.1/28 dev h1 ip link add dev vh3 type vrf table 20 ip link set dev h3 master vh3 ip link set dev vh3 up ip link set dev h3 up ip link set dev swp3 up ip address add dev swp3 2001:db8:2::1/64 ip link set dev swp1 up tc qdisc add dev swp1 clsact ip link add name gt6 type ip6gretap \ local 2001:db8:2::1 remote 2001:db8:2::2 ip link set dev gt6 up sleep 1 tc filter add dev swp1 ingress pref 1000 matchall skip_hw \ action mirred egress mirror dev gt6 ping -I h1 192.0.2.2 Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Petr Machata <[email protected]> Acked-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17selftests/bpf: check return value of fopen in test_verifier.cJesper Dangaard Brouer1-0/+5
Commit 0a6748740368 ("selftests/bpf: Only run tests if !bpf_disabled") forgot to check return value of fopen. This caused some confusion, when running test_verifier (from tools/testing/selftests/bpf/) on an older kernel (< v4.4) as it will simply seqfault. This fix avoids the segfault and prints an error, but allow program to continue. Given the sysctl was introduced in 1be7f75d1668 ("bpf: enable non-root eBPF programs"), we know that the running kernel cannot support unpriv, thus continue with unpriv_disabled = true. Fixes: 0a6748740368 ("selftests/bpf: Only run tests if !bpf_disabled") Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-17erspan: fix invalid erspan version.William Tu2-2/+7
ERSPAN only support version 1 and 2. When packets send to an erspan device which does not have proper version number set, drop the packet. In real case, we observe multicast packets sent to the erspan pernet device, erspan0, which does not have erspan version configured. Reported-by: Greg Rose <[email protected]> Signed-off-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17x86/apic/x2apic: Initialize cluster ID properlyThomas Gleixner1-0/+1
Rick bisected a regression on large systems which use the x2apic cluster mode for interrupt delivery to the commit wich reworked the cluster management. The problem is caused by a missing initialization of the clusterid field in the shared cluster data structures. So all structures end up with cluster ID 0 which only allows sharing between all CPUs which belong to cluster 0. All other CPUs with a cluster ID > 0 cannot share the data structure because they cannot find existing data with their cluster ID. This causes malfunction with IPIs because IPIs are sent to the wrong cluster and the caller waits for ever that the target CPU handles the IPI. Add the missing initialization when a upcoming CPU is the first in a cluster so that the later booting CPUs can find the data and share it for proper operation. Fixes: 023a611748fd ("x86/apic/x2apic: Simplify cluster management") Reported-by: Rick Warner <[email protected]> Bisected-by: Rick Warner <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Rick Warner <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2018-05-17Merge branch 'ibmvnic-Fix-bugs-and-memory-leaks'David S. Miller1-11/+17
Thomas Falcon says: ==================== ibmvnic: Fix bugs and memory leaks This is a small patch series fixing up some bugs and memory leaks in the ibmvnic driver. The first fix frees up previously allocated memory that should be freed in case of an error. The second fixes a reset case that was failing due to TX/RX queue IRQ's being erroneously disabled without being enabled again. The final patch fixes incorrect reallocated of statistics buffers during a device reset, resulting in loss of statistics information and a memory leak. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-05-17ibmvnic: Fix statistics buffers memory leakThomas Falcon1-9/+15
Move initialization of statistics buffers from ibmvnic_init function into ibmvnic_probe. In the current state, ibmvnic_init will be called again during a device reset, resulting in the allocation of new buffers without freeing the old ones. Signed-off-by: Thomas Falcon <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17ibmvnic: Fix non-fatal firmware error resetThomas Falcon1-2/+1
It is not necessary to disable interrupt lines here during a reset to handle a non-fatal firmware error. Move that call within the code block that handles the other cases that do require interrupts to be disabled and re-enabled. Signed-off-by: Thomas Falcon <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17ibmvnic: Free coherent DMA memory if FW map failedThomas Falcon1-0/+1
If the firmware map fails for whatever reason, remember to free up the memory after. Signed-off-by: Thomas Falcon <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17net/ipv4: Initialize proto and ports in flow structDavid Ahern3-3/+14
Updating the FIB tracepoint for the recent change to allow rules using the protocol and ports exposed a few places where the entries in the flow struct are not initialized. For __fib_validate_source add the call to fib4_rules_early_flow_dissect since it is invoked for the input path. For netfilter, add the memset on the flow struct to avoid future problems like this. In ip_route_input_slow need to set the fields if the skb dissection does not happen. Fixes: bfff4862653b ("net: fib_rules: support for match on ip_proto, sport and dport") Signed-off-by: David Ahern <[email protected]> Acked-by: Roopa Prabhu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17tls: don't use stack memory in a scatterlistMatt Mullins2-5/+7
scatterlist code expects virt_to_page() to work, which fails with CONFIG_VMAP_STACK=y. Fixes: c46234ebb4d1e ("tls: RX path for ktls") Signed-off-by: Matt Mullins <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds18-87/+157
Pull kvm fixes from Paolo Bonzini: - ARM/ARM64 locking fixes - x86 fixes: PCID, UMIP, locking - improved support for recent Windows version that have a 2048 Hz APIC timer - rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME - better behaved selftests * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME KVM: arm/arm64: VGIC/ITS save/restore: protect kvm_read_guest() calls KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock KVM: arm/arm64: VGIC/ITS: Promote irq_lock() in update_affinity KVM: arm/arm64: Properly protect VGIC locks from IRQs KVM: X86: Lower the default timer frequency limit to 200us KVM: vmx: update sec exec controls for UMIP iff emulating UMIP kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled KVM: selftests: exit with 0 status code when tests cannot be run KVM: hyperv: idr_find needs RCU protection x86: Delay skip of emulated hypercall instruction KVM: Extend MAX_IRQ_ROUTES to 4096 for all archs
2018-05-17Merge tag 'kvm-s390-master-4.17-1' of ↵Paolo Bonzini1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master KVM: s390: Fix vsie handling for transactional diagnostic block vsie (nested KVM) might reject a valid input. Fix it.
2018-05-17Merge tag 'sound-4.17-rc6' of ↵Linus Torvalds5-3/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "We have a core fix in the compat code for covering a potential race (double references), but it's a very minor change. The rest are all small device-specific quirks, as well as a correction of the new UAC3 support code" * tag 'sound-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: usb-audio: Use Class Specific EP for UAC3 devices. ALSA: hda/realtek - Clevo P950ER ALC1220 Fixup ALSA: usb: mixer: volume quirk for CM102-A+/102S+ ALSA: hda: Add Lenovo C50 All in one to the power_save blacklist ALSA: control: fix a redundant-copy issue
2018-05-17kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIMEMichael S. Tsirkin3-8/+8
KVM_HINTS_DEDICATED seems to be somewhat confusing: Guest doesn't really care whether it's the only task running on a host CPU as long as it's not preempted. And there are more reasons for Guest to be preempted than host CPU sharing, for example, with memory overcommit it can get preempted on a memory access, post copy migration can cause preemption, etc. Let's call it KVM_HINTS_REALTIME which seems to better match what guests expect. Also, the flag most be set on all vCPUs - current guests assume this. Note so in the documentation. Signed-off-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-05-17Merge branch 'for-linus' of ↵Linus Torvalds23-167/+422
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: - a fix for the vfio ccw translation code - update an incorrect email address in the MAINTAINERS file - fix a division by zero oops in the cpum_sf code found by trinity - two fixes for the error handling of the qdio code - several spectre related patches to convert all left-over indirect branches in the kernel to expoline branches - update defconfigs to avoid warnings due to the netfilter Kconfig changes - avoid several compiler warnings in the kexec_file code for s390 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/qdio: don't release memory in qdio_setup_irq() s390/qdio: fix access to uninitialized qdio_q fields s390/cpum_sf: ensure sample frequency of perf event attributes is non-zero s390: use expoline thunks in the BPF JIT s390: extend expoline to BC instructions s390: remove indirect branch from do_softirq_own_stack s390: move spectre sysfs attribute code s390/kernel: use expoline for indirect branches s390/ftrace: use expoline for indirect branches s390/lib: use expoline for indirect branches s390/crc32-vx: use expoline for indirect branches s390: move expoline assembler macros to a header vfio: ccw: fix cleanup if cp_prefetch fails s390/kexec_file: add declaration of purgatory related globals s390: update defconfigs MAINTAINERS: update s390 zcrypt maintainers email address
2018-05-17Merge tag 'selinux-pr-20180516' of ↵Linus Torvalds1-22/+28
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull SELinux fixes from Paul Moore: "A small pull request to fix a few regressions in the SELinux/SCTP code with applications that call bind() with AF_UNSPEC/INADDR_ANY. The individual commit descriptions have more information, but the commits themselves should be self explanatory" * tag 'selinux-pr-20180516' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: correctly handle sa_family cases in selinux_sctp_bind_connect() selinux: fix address family in bind() and connect() to match address/port selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()
2018-05-17proc: do not access cmdline nor environ from file-backed areasWilly Tarreau3-4/+8
proc_pid_cmdline_read() and environ_read() directly access the target process' VM to retrieve the command line and environment. If this process remaps these areas onto a file via mmap(), the requesting process may experience various issues such as extra delays if the underlying device is slow to respond. Let's simply refuse to access file-backed areas in these functions. For this we add a new FOLL_ANON gup flag that is passed to all calls to access_remote_vm(). The code already takes care of such failures (including unmapped areas). Accesses via /proc/pid/mem were not changed though. This was assigned CVE-2018-1120. Note for stable backports: the patch may apply to kernels prior to 4.11 but silently miss one location; it must be checked that no call to access_remote_vm() keeps zero as the last argument. Reported-by: Qualys Security Advisory <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: [email protected] Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-05-17bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=nColy Li1-1/+3
Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()") returns the return value of debugfs_create_dir() to bcache_init(). When CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes bcache_init() failedi. This patch makes bch_debug_init() always returns 0 if CONFIG_DEBUG_FS=n, so bcache can continue to work for the kernels which don't have debugfs enanbled. Changelog: v4: Add Acked-by from Kent Overstreet. v3: Use IS_ENABLED(CONFIG_DEBUG_FS) to replace #ifdef DEBUG_FS. v2: Remove a warning information v1: Initial version. Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()") Cc: [email protected] Signed-off-by: Coly Li <[email protected]> Reported-by: Massimo B. <[email protected]> Reported-by: Kai Krakow <[email protected]> Tested-by: Kai Krakow <[email protected]> Acked-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-17KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBDTom Lendacky6-18/+50
Expose the new virtualized architectural mechanism, VIRT_SSBD, for using speculative store bypass disable (SSBD) under SVM. This will allow guests to use SSBD on hardware that uses non-architectural mechanisms for enabling SSBD. [ tglx: Folded the migration fixup from Paolo Bonzini ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-05-17x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFGThomas Gleixner2-0/+36
Add the necessary logic for supporting the emulated VIRT_SPEC_CTRL MSR to x86_virt_spec_ctrl(). If either X86_FEATURE_LS_CFG_SSBD or X86_FEATURE_VIRT_SPEC_CTRL is set then use the new guest_virt_spec_ctrl argument to check whether the state must be modified on the host. The update reuses speculative_store_bypass_update() so the ZEN-specific sibling coordination can be reused. Signed-off-by: Thomas Gleixner <[email protected]>
2018-05-17x86/bugs: Rework spec_ctrl base and mask logicThomas Gleixner1-7/+19
x86_spec_ctrL_mask is intended to mask out bits from a MSR_SPEC_CTRL value which are not to be modified. However the implementation is not really used and the bitmask was inverted to make a check easier, which was removed in "x86/bugs: Remove x86_spec_ctrl_set()" Aside of that it is missing the STIBP bit if it is supported by the platform, so if the mask would be used in x86_virt_spec_ctrl() then it would prevent a guest from setting STIBP. Add the STIBP bit if supported and use the mask in x86_virt_spec_ctrl() to sanitize the value which is supplied by the guest. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]>
2018-05-17x86/bugs: Remove x86_spec_ctrl_set()Thomas Gleixner2-13/+2
x86_spec_ctrl_set() is only used in bugs.c and the extra mask checks there provide no real value as both call sites can just write x86_spec_ctrl_base to MSR_SPEC_CTRL. x86_spec_ctrl_base is valid and does not need any extra masking or checking. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/bugs: Expose x86_spec_ctrl_base directlyThomas Gleixner3-24/+6
x86_spec_ctrl_base is the system wide default value for the SPEC_CTRL MSR. x86_spec_ctrl_get_default() returns x86_spec_ctrl_base and was intended to prevent modification to that variable. Though the variable is read only after init and globaly visible already. Remove the function and export the variable instead. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}Borislav Petkov2-49/+44
Function bodies are very similar and are going to grow more almost identical code. Add a bool arg to determine whether SPEC_CTRL is being set for the guest or restored to the host. No functional changes. Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/speculation: Rework speculative_store_bypass_update()Thomas Gleixner3-4/+9
The upcoming support for the virtual SPEC_CTRL MSR on AMD needs to reuse speculative_store_bypass_update() to avoid code duplication. Add an argument for supplying a thread info (TIF) value and create a wrapper speculative_store_bypass_update_current() which is used at the existing call site. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/speculation: Add virtualized speculative store bypass disable supportTom Lendacky4-2/+18
Some AMD processors only support a non-architectural means of enabling speculative store bypass disable (SSBD). To allow a simplified view of this to a guest, an architectural definition has been created through a new CPUID bit, 0x80000008_EBX[25], and a new MSR, 0xc001011f. With this, a hypervisor can virtualize the existence of this definition and provide an architectural method for using SSBD to a guest. Add the new CPUID feature, the new MSR and update the existing SSBD support to use this MSR when present. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]>
2018-05-17x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRLThomas Gleixner4-9/+35
AMD is proposing a VIRT_SPEC_CTRL MSR to handle the Speculative Store Bypass Disable via MSR_AMD64_LS_CFG so that guests do not have to care about the bit position of the SSBD bit and thus facilitate migration. Also, the sibling coordination on Family 17H CPUs can only be done on the host. Extend x86_spec_ctrl_set_guest() and x86_spec_ctrl_restore_host() with an extra argument for the VIRT_SPEC_CTRL MSR. Hand in 0 from VMX and in SVM add a new virt_spec_ctrl member to the CPU data structure which is going to be used in later patches for the actual implementation. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/speculation: Handle HT correctly on AMDThomas Gleixner3-6/+130
The AMD64_LS_CFG MSR is a per core MSR on Family 17H CPUs. That means when hyperthreading is enabled the SSBD bit toggle needs to take both cores into account. Otherwise the following situation can happen: CPU0 CPU1 disable SSB disable SSB enable SSB <- Enables it for the Core, i.e. for CPU0 as well So after the SSB enable on CPU1 the task on CPU0 runs with SSB enabled again. On Intel the SSBD control is per core as well, but the synchronization logic is implemented behind the per thread SPEC_CTRL MSR. It works like this: CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL i.e. if one of the threads enables a mitigation then this affects both and the mitigation is only disabled in the core when both threads disabled it. Add the necessary synchronization logic for AMD family 17H. Unfortunately that requires a spinlock to serialize the access to the MSR, but the locks are only shared between siblings. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/cpufeatures: Add FEATURE_ZENThomas Gleixner2-0/+2
Add a ZEN feature bit so family-dependent static_cpu_has() optimizations can be built for ZEN. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/cpufeatures: Disentangle SSBD enumerationThomas Gleixner6-16/+14
The SSBD enumeration is similarly to the other bits magically shared between Intel and AMD though the mechanisms are different. Make X86_FEATURE_SSBD synthetic and set it depending on the vendor specific features or family dependent setup. Change the Intel bit to X86_FEATURE_SPEC_CTRL_SSBD to denote that SSBD is controlled via MSR_SPEC_CTRL and fix up the usage sites. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRSThomas Gleixner4-9/+20
The availability of the SPEC_CTRL MSR is enumerated by a CPUID bit on Intel and implied by IBRS or STIBP support on AMD. That's just confusing and in case an AMD CPU has IBRS not supported because the underlying problem has been fixed but has another bit valid in the SPEC_CTRL MSR, the thing falls apart. Add a synthetic feature bit X86_FEATURE_MSR_SPEC_CTRL to denote the availability on both Intel and AMD. While at it replace the boot_cpu_has() checks with static_cpu_has() where possible. This prevents late microcode loading from exposing SPEC_CTRL, but late loading is already very limited as it does not reevaluate the mitigation options and other bits and pieces. Having static_cpu_has() is the simplest and least fragile solution. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/speculation: Use synthetic bits for IBRS/IBPB/STIBPBorislav Petkov5-23/+26
Intel and AMD have different CPUID bits hence for those use synthetic bits which get set on the respective vendor's in init_speculation_control(). So that debacles like what the commit message of c65732e4f721 ("x86/cpu: Restore CPUID_8000_0008_EBX reload") talks about don't happen anymore. Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Jörg Otte <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-05-17KVM: SVM: Move spec control call after restore of GSThomas Gleixner1-12/+12
svm_vcpu_run() invokes x86_spec_ctrl_restore_host() after VMEXIT, but before the host GS is restored. x86_spec_ctrl_restore_host() uses 'current' to determine the host SSBD state of the thread. 'current' is GS based, but host GS is not yet restored and the access causes a triple fault. Move the call after the host GS restore. Fixes: 885f82bfbc6f x86/process: Allow runtime control of Speculative Store Bypass Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Acked-by: Paolo Bonzini <[email protected]>
2018-05-18powerpc/powernv: Fix NVRAM sleep in invalid context when crashingNicholas Piggin1-2/+12
Similarly to opal_event_shutdown, opal_nvram_write can be called in the crash path with irqs disabled. Special case the delay to avoid sleeping in invalid context. Fixes: 3b8070335f75 ("powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops") Cc: [email protected] # v3.2 Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-05-17MAINTAINERS: add entry for STM32 I2C driverPierre-Yves MORDRET1-0/+6
Add I2C/SMBUS Driver entry for STM32 family from ST Microelectronics. Signed-off-by: Pierre-Yves MORDRET <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
2018-05-17Makefile: disable PIE before testing asm gotoMichal Kubecek1-2/+3
Since commit e501ce957a78 ("x86: Force asm-goto"), aarch64 build on distributions which enable PIE by default (e.g. openSUSE Tumbleweed) does not detect support for asm goto correctly. The problem is that ARM specific part of scripts/gcc-goto.sh fails with PIE even with recent gcc versions. Moving the asm goto detection up in Makefile put it before the place where we disable PIE. As a result, kernel is built without jump label support. Move the lines disabling PIE before the asm goto test to make it work. Fixes: e501ce957a78 ("x86: Force asm-goto") Reported-by: Andreas Faerber <[email protected]> Signed-off-by: Michal Kubecek <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-05-17kbuild: gcov: enable -fno-tree-loop-im if supportedNick Desaulniers1-1/+3
Clang does not recognize this compiler option. Reported-by: Prasad Sodagudi <[email protected]> Signed-off-by: Nick Desaulniers <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-05-17btrfs: fix crash when trying to resume balance without the resume flagAnand Jain1-0/+9
We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance() only, which isn't called during the remount. So when resuming from the paused balance we hit the bug: kernel: kernel BUG at fs/btrfs/volumes.c:3890! :: kernel: balance_kthread+0x51/0x60 [btrfs] kernel: kthread+0x111/0x130 :: kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8 Reproducer: On a mounted filesystem: btrfs balance start --full-balance /btrfs btrfs balance pause /btrfs mount -o remount,ro /dev/sdb /btrfs mount -o remount,rw /dev/sdb /btrfs To fix this set the BTRFS_BALANCE_RESUME flag in btrfs_resume_balance_async(). CC: [email protected] # 4.4+ Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-17btrfs: Fix delalloc inodes invalidation during transaction abortNikolay Borisov1-11/+15
When a transaction is aborted btrfs_cleanup_transaction is called to cleanup all the various in-flight bits and pieces which migth be active. One of those is delalloc inodes - inodes which have dirty pages which haven't been persisted yet. Currently the process of freeing such delalloc inodes in exceptional circumstances such as transaction abort boiled down to calling btrfs_invalidate_inodes whose sole job is to invalidate the dentries for all inodes related to a root. This is in fact wrong and insufficient since such delalloc inodes will likely have pending pages or ordered-extents and will be linked to the sb->s_inode_list. This means that unmounting a btrfs instance with an aborted transaction could potentially lead inodes/their pages visible to the system long after their superblock has been freed. This in turn leads to a "use-after-free" situation once page shrink is triggered. This situation could be simulated by running generic/019 which would cause such inodes to be left hanging, followed by generic/176 which causes memory pressure and page eviction which lead to touching the freed super block instance. This situation is additionally detected by the unmount code of VFS with the following message: "VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..." Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); in free_fs_root for the same reason. This patch aims to rectify the sitaution by doing the following: 1. Change btrfs_destroy_delalloc_inodes so that it calls invalidate_inode_pages2 for every inode on the delalloc list, this ensures that all the pages of the inode are released. This function boils down to calling btrfs_releasepage. During test I observed cases where inodes on the delalloc list were having an i_count of 0, so this necessitates using igrab to be sure we are working on a non-freed inode. 2. Since calling btrfs_releasepage might queue delayed iputs move the call out to btrfs_cleanup_transaction in btrfs_error_commit_super before calling run_delayed_iputs for the last time. This is necessary to ensure that delayed iputs are run. Note: this patch is tagged for 4.14 stable but the fix applies to older versions too but needs to be backported manually due to conflicts. CC: [email protected] # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions CC: [email protected] # 4.14.x Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add comment to igrab ] Signed-off-by: David Sterba <[email protected]>
2018-05-17btrfs: Split btrfs_del_delalloc_inode into 2 functionsNikolay Borisov2-3/+12
This is in preparation of fixing delalloc inodes leakage on transaction abort. Also export the new function. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-17btrfs: fix reading stale metadata blocks after degraded raid1 mountsLiu Bo1-3/+3
If a btree block, aka. extent buffer, is not available in the extent buffer cache, it'll be read out from the disk instead, i.e. btrfs_search_slot() read_block_for_search() # hold parent and its lock, go to read child btrfs_release_path() read_tree_block() # read child Unfortunately, the parent lock got released before reading child, so commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had used 0 as parent transid to read the child block. It forces read_tree_block() not to check if parent transid is different with the generation id of the child that it reads out from disk. A simple PoC is included in btrfs/124, 0. A two-disk raid1 btrfs, 1. Right after mkfs.btrfs, block A is allocated to be device tree's root. 2. Mount this filesystem and put it in use, after a while, device tree's root got COW but block A hasn't been allocated/overwritten yet. 3. Umount it and reload the btrfs module to remove both disks from the global @fs_devices list. 4. mount -odegraded dev1 and write some data, so now block A is allocated to be a leaf in checksum tree. Note that only dev1 has the latest metadata of this filesystem. 5. Umount it and mount it again normally (with both disks), since raid1 can pick up one disk by the writer task's pid, if btrfs_search_slot() needs to read block A, dev2 which does NOT have the latest metadata might be read for block A, then we got a stale block A. 6. As parent transid is not checked, block A is marked as uptodate and put into the extent buffer cache, so the future search won't bother to read disk again, which means it'll make changes on this stale one and make it dirty and flush it onto disk. To avoid the problem, parent transid needs to be passed to read_tree_block(). In order to get a valid parent transid, we need to hold the parent's lock until finishing reading child. This patch needs to be slightly adapted for stable kernels, the &first_key parameter added to read_tree_block() is from 4.16+ (581c1760415c4). The fix is to replace 0 by 'gen'. Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race") CC: [email protected] # 4.4+ Signed-off-by: Liu Bo <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>
2018-05-17btrfs: property: Set incompat flag if lzo/zstd compression is setMisono Tomohiro1-4/+8
Incompat flag of LZO/ZSTD compression should be set at: 1. mount time (-o compress/compress-force) 2. when defrag is done 3. when property is set Currently 3. is missing and this commit adds this. This could lead to a filesystem that uses ZSTD but is not marked as such. If a kernel without a ZSTD support encounteres a ZSTD compressed extent, it will handle that but this could be confusing to the user. Typically the filesystem is mounted with the ZSTD option, but the discrepancy can arise when a filesystem is never mounted with ZSTD and then the property on some file is set (and some new extents are written). A simple mount with -o compress=zstd will fix that up on an unpatched kernel. Same goes for LZO, but this has been around for a very long time (2.6.37) so it's unlikely that a pre-LZO kernel would be used. Fixes: 5c1aab1dd544 ("btrfs: Add zstd support") CC: [email protected] # 4.14+ Signed-off-by: Tomohiro Misono <[email protected]> Reviewed-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add user visible impact ] Signed-off-by: David Sterba <[email protected]>
2018-05-17Btrfs: fix duplicate extents after fsync of file with prealloc extentsFilipe Manana1-25/+112
In commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), on fsync, we started to always log all prealloc extents beyond an inode's i_size in order to avoid losing them after a power failure. However under some cases this can lead to the log replay code to create duplicate extent items, with different lengths, in the extent tree. That happens because, as of that commit, we can now log extent items based on extent maps that are not on the "modified" list of extent maps of the inode's extent map tree. Logging extent items based on extent maps is used during the fast fsync path to save time and for this to work reliably it requires that the extent maps are not merged with other adjacent extent maps - having the extent maps in the list of modified extents gives such guarantee. Consider the following example, captured during a long run of fsstress, which illustrates this problem. We have inode 271, in the filesystem tree (root 5), for which all of the following operations and discussion apply to. A buffered write starts at offset 312391 with a length of 933471 bytes (end offset at 1245862). At this point we have, for this inode, the following extent maps with the their field values: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 376832, block_start 1106399232, block_len 376832, orig_block_len 376832 em C, start 417792, orig_start 417792, len 782336, block_start 18446744073709551613, block_len 0, orig_block_len 0 em D, start 1200128, orig_start 1200128, len 835584, block_start 1106776064, block_len 835584, orig_block_len 835584 em E, start 2035712, orig_start 2035712, len 245760, block_start 1107611648, block_len 245760, orig_block_len 245760 Extent map A corresponds to a hole and extent maps D and E correspond to preallocated extents. Extent map D ends where extent map E begins (1106776064 + 835584 = 1107611648), but these extent maps were not merged because they are in the inode's list of modified extent maps. An fsync against this inode is made, which triggers the fast path (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback of the data previously written using buffered IO, and when the respective ordered extent finishes, btrfs_drop_extents() is called against the (aligned) range 311296..1249279. This causes a split of extent map D at btrfs_drop_extent_cache(), replacing extent map D with a new extent map D', also added to the list of modified extents, with the following values: em D', start 1249280, orig_start of 1200128, block_start 1106825216 (= 1106776064 + 1249280 - 1200128), orig_block_len 835584, block_len 786432 (835584 - (1249280 - 1200128)) Then, during the fast fsync, btrfs_log_changed_extents() is called and extent maps D' and E are removed from the list of modified extents. The flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged clear_em_logging() is called on each of them, and that makes extent map E to be merged with extent map D' (try_merge_map()), resulting in D' being deleted and E adjusted to: em E, start 1249280, orig_start 1200128, len 1032192, block_start 1106825216, block_len 1032192, orig_block_len 245760 A direct IO write at offset 1847296 and length of 360448 bytes (end offset at 2207744) starts, and at that moment the following extent maps exist for our inode: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832 em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984 em E (prealloc), start 1249280, orig_start 1200128, len 1032192, block_start 1106825216, block_len 1032192, orig_block_len 245760 The dio write results in drop_extent_cache() being called twice. The first time for a range that starts at offset 1847296 and ends at offset 2035711 (length of 188416), which results in a double split of extent map E, replacing it with two new extent maps: em F, start 1249280, orig_start 1200128, block_start 1106825216, block_len 598016, orig_block_len 598016 em G, start 2035712, orig_start 1200128, block_start 1107611648, block_len 245760, orig_block_len 1032192 It also creates a new extent map that represents a part of the requested IO (through create_io_em()): em H, start 1847296, len 188416, block_start 1107423232, block_len 188416 The second call to drop_extent_cache() has a range with a start offset of 2035712 and end offset of 2207743 (length of 172032). This leads to replacing extent map G with a new extent map I with the following values: em I, start 2207744, orig_start 1200128, block_start 1107783680, block_len 73728, orig_block_len 1032192 It also creates a new extent map that represents the second part of the requested IO (through create_io_em()): em J, start 2035712, len 172032, block_start 1107611648, block_len 172032 The dio write set the inode's i_size to 2207744 bytes. After the dio write the inode has the following extent maps: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832 em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984 em F, start 1249280, orig_start 1200128, len 598016, block_start 1106825216, block_len 598016, orig_block_len 598016 em H, start 1847296, orig_start 1200128, len 188416, block_start 1107423232, block_len 188416, orig_block_len 835584 em J, start 2035712, orig_start 2035712, len 172032, block_start 1107611648, block_len 172032, orig_block_len 245760 em I, start 2207744, orig_start 1200128, len 73728, block_start 1107783680, block_len 73728, orig_block_len 1032192 Now do some change to the file, like adding a xattr for example and then fsync it again. This triggers a fast fsync path, and as of commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), we use the extent map I to log a file extent item because it's a prealloc extent and it starts at an offset matching the inode's i_size. However when we log it, we create a file extent item with a value for the disk byte location that is wrong, as can be seen from the following output of "btrfs inspect-internal dump-tree": item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53 generation 22 type 2 (prealloc) prealloc data disk byte 1106776064 nr 1032192 prealloc data offset 1007616 nr 73728 Here the disk byte value corresponds to calculation based on some fields from the extent map I: 1106776064 = block_start (1107783680) - 1007616 (extent_offset) extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616 The disk byte value of 1106776064 clashes with disk byte values of the file extent items at offsets 1249280 and 1847296 in the fs tree: item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53 generation 20 type 2 (prealloc) prealloc data disk byte 1106776064 nr 835584 prealloc data offset 49152 nr 598016 item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53 generation 20 type 1 (regular) extent data disk byte 1106776064 nr 835584 extent data offset 647168 nr 188416 ram 835584 extent compression 0 (none) item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53 generation 20 type 1 (regular) extent data disk byte 1107611648 nr 245760 extent data offset 0 nr 172032 ram 245760 extent compression 0 (none) item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53 generation 20 type 2 (prealloc) prealloc data disk byte 1107611648 nr 245760 prealloc data offset 172032 nr 73728 Instead of the disk byte value of 1106776064, the value of 1107611648 should have been logged. Also the data offset value should have been 172032 and not 1007616. After a log replay we end up getting two extent items in the extent tree with different lengths, one of 835584, which is correct and existed before the log replay, and another one of 1032192 which is wrong and is based on the logged file extent item: item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53 refs 2 gen 15 flags DATA extent data backref root 5 objectid 271 offset 1200128 count 2 item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53 refs 1 gen 22 flags DATA extent data backref root 5 objectid 271 offset 1200128 count 1 Obviously this leads to many problems and a filesystem check reports many errors: (...) checking extents Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1 extent item 1106776064 has multiple extent items ref mismatch on [1106776064 835584] extent item 2, found 3 Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680 Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70 Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192 backpointer mismatch on [1106776064 835584] checking free space cache block group 1103101952 has wrong amount of free space failed to load free space cache for block group 1103101952 checking fs roots (...) So fix this by logging the prealloc extents beyond the inode's i_size based on searches in the subvolume tree instead of the extent maps. Fixes: 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay") CC: [email protected] # 4.14+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>