path: root/kernel
Age | Commit message | Author | Files | Lines
2020-03-12 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 10 | -55/+144
Minor overlapping changes, nothing serious. Signed-off-by: David S. Miller <[email protected]>
2020-03-12 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 1 | -0/+4
Pull networking fixes from David Miller:

"It looks like a decent sized set of fixes, but a lot of these are one liner off-by-one and similar type changes:

 1) Fix netlink header pointer to calculate bad attribute offset reported to user. From Pablo Neira Ayuso.
 2) Don't double clear PHY interrupts when ->did_interrupt is set, from Heiner Kallweit.
 3) Add missing validation of various (devlink, nl802154, fib, etc.) attributes, from Jakub Kicinski.
 4) Missing *pos increments in various netfilter seq_next ops, from Vasily Averin.
 5) Missing break in of_mdiobus_register() loop, from Dajun Jin.
 6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.
 7) Work around FMAN erratum A050385, from Madalin Bucur.
 8) Make sure ARP header is pulled early enough in bonding driver, from Eric Dumazet.
 9) Do a cond_resched() during multicast processing of ipvlan and macvlan, from Mahesh Bandewar.
 10) Don't attach cgroups to unrelated sockets when in interrupt context, from Shakeel Butt.
 11) Fix tpacket ring state management when encountering unknown GSO types. From Willem de Bruijn.
 12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend() only in the suspend context. From Heiner Kallweit"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
  net: systemport: fix index check to avoid an array out of bounds access
  tc-testing: add ETS scheduler to tdc build configuration
  net: phy: fix MDIO bus PM PHY resuming
  net: hns3: clear port base VLAN when unload PF
  net: hns3: fix RMW issue for VLAN filter switch
  net: hns3: fix VF VLAN table entries inconsistent issue
  net: hns3: fix "tc qdisc del" failed issue
  taprio: Fix sending packets without dequeueing them
  net: mvmdio: avoid error message for optional IRQ
  net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
  net: memcg: fix lockdep splat in inet_csk_accept()
  s390/qeth: implement smarter resizing of the RX buffer pool
  s390/qeth: refactor buffer pool code
  s390/qeth: use page pointers to manage RX buffer pool
  seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
  net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
  net/packet: tpacket_rcv: do not increment ring index on drop
  sxgbe: Fix off by one in samsung driver strncpy size arg
  net: caif: Add lockdep expression to RCU traversal primitive
  MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
  ...
2020-03-11 | Merge tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux | Linus Torvalds | 1 | -0/+10
Pull thread fix from Christian Brauner:

"This contains a single fix for a regression which was introduced when we introduced the ability to select a specific pid at process creation time.

When this feature is requested, the error value will be set to -EPERM after exiting the pid allocation loop. This caused EPERM to be returned when e.g. the init process/child subreaper of the pid namespace has already died where we used to return ENOMEM before.

The first patch here simply fixes the regression by unconditionally setting the return value back to ENOMEM again once we've successfully allocated the requested pid number. This should be easy to backport to v5.5.

The second patch adds a comment explaining that we must keep returning ENOMEM since we've been doing it for a long time and have explicitly documented this behavior for userspace. This seemed worthwhile because we now have at least two separate examples where people tried to change the return value to something other than ENOMEM (the first version of the regression fix did that too, and the commit message links to an earlier patch that tried to do the same).

I have a simple regression test to make sure we catch this regression in the future but since that introduces a whole new selftest subdir and test files I'll keep this for v5.7"

* tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  pid: make ENOMEM return value more obvious
  pid: Fix error return value in some cases
2020-03-11 | Merge tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace | Linus Torvalds | 1 | -0/+2
Pull ftrace fix from Steven Rostedt:

"Have ftrace lookup_rec() return a consistent record otherwise it can break live patching"

* tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Return the first found result in lookup_rec()
2020-03-11 | ftrace: Return the first found result in lookup_rec() | Artem Savkov | 1 | -0/+2
It appears that ip ranges can overlap. In that case lookup_rec() returns whatever result it got last, even if it found nothing in the last searched page.

This breaks an obscure livepatch late module patching usecase:
  - load livepatch
  - load the patched module
  - unload livepatch
  - try to load livepatch again

To fix this, return from lookup_rec() as soon as it finds the record containing the searched-for ip. This is how the code behaved prior to the introduction of lookup_rec().

Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Fixes: 7e16f581a817 ("ftrace: Separate out functionality from ftrace_location_range()")
Signed-off-by: Artem Savkov <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
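The early-return behavior described in the ftrace fix can be illustrated with a small userspace model. This is only a sketch: `struct rec`, `struct rec_page` and `lookup_rec_sketch()` are hypothetical names, and a linear scan stands in for whatever per-page search the kernel actually performs.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for ftrace's record pages. */
struct rec {
	unsigned long ip;
};

struct rec_page {
	struct rec *records;
	size_t count;
	struct rec_page *next;
};

/*
 * Return the first record whose ip lies in [start, end], or NULL.
 * Returning on the first hit means a later page that contains no match
 * can never overwrite an earlier, valid result.
 */
static struct rec *lookup_rec_sketch(struct rec_page *pages,
				     unsigned long start, unsigned long end)
{
	for (struct rec_page *pg = pages; pg; pg = pg->next) {
		for (size_t i = 0; i < pg->count; i++) {
			unsigned long ip = pg->records[i].ip;

			if (ip >= start && ip <= end)
				return &pg->records[i];
		}
	}
	return NULL;
}
```

The buggy pattern the message describes would instead keep a result variable that is reassigned for every page, so an empty search of the last page would leave it NULL.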
2020-03-10 | cgroup: memcg: net: do not associate sock with unrelated cgroup | Shakeel Butt | 1 | -0/+4
We are testing network memory accounting in our setup and noticed inconsistent network memory usage, and often unrelated cgroups' network usage correlates with the testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context, especially for cgroup v1.

mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assume that this can only happen from sk_clone_lock() and that the source sock object already has an associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with an unrelated interrupted cgroup.

Cgroup v2 can also suffer if the source sock object was created by a process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt.

WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so memory used by such sockets will not be accounted by the memcg.

The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

    CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
    Hardware name: ...
    Call Trace:
     <IRQ>
     dump_stack+0x57/0x75
     mem_cgroup_sk_alloc+0xe9/0xf0
     sk_clone_lock+0x2a7/0x420
     inet_csk_clone_lock+0x1b/0x110
     tcp_create_openreq_child+0x23/0x3b0
     tcp_v6_syn_recv_sock+0x88/0x730
     tcp_check_req+0x429/0x560
     tcp_v6_rcv+0x72d/0xa40
     ip6_protocol_deliver_rcu+0xc9/0x400
     ip6_input+0x44/0xd0
     ? ip6_protocol_deliver_rcu+0x400/0x400
     ip6_rcv_finish+0x71/0x80
     ipv6_rcv+0x5b/0xe0
     ? ip6_sublist_rcv+0x2e0/0x2e0
     process_backlog+0x108/0x1e0
     net_rx_action+0x26b/0x460
     __do_softirq+0x104/0x2a6
     do_softirq_own_stack+0x2a/0x40
     </IRQ>
     do_softirq.part.19+0x40/0x50
     __local_bh_enable_ip+0x51/0x60
     ip6_finish_output2+0x23d/0x520
     ? ip6table_mangle_hook+0x55/0x160
     __ip6_finish_output+0xa1/0x100
     ip6_finish_output+0x30/0xd0
     ip6_output+0x73/0x120
     ? __ip6_finish_output+0x100/0x100
     ip6_xmit+0x2e3/0x600
     ? ipv6_anycast_cleanup+0x50/0x50
     ? inet6_csk_route_socket+0x136/0x1e0
     ? skb_free_head+0x1e/0x30
     inet6_csk_xmit+0x95/0xf0
     __tcp_transmit_skb+0x5b4/0xb20
     __tcp_send_ack.part.60+0xa3/0x110
     tcp_send_ack+0x1d/0x20
     tcp_rcv_state_process+0xe64/0xe80
     ? tcp_v6_connect+0x5d1/0x5f0
     tcp_v6_do_rcv+0x1b1/0x3f0
     ? tcp_v6_do_rcv+0x1b1/0x3f0
     __release_sock+0x7f/0xd0
     release_sock+0x30/0xa0
     __inet_stream_connect+0x1c3/0x3b0
     ? prepare_to_wait+0xb0/0xb0
     inet_stream_connect+0x3b/0x60
     __sys_connect+0x101/0x120
     ? __sys_getsockopt+0x11b/0x140
     __x64_sys_connect+0x1a/0x20
     do_syscall_64+0x51/0x200
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
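The "do nothing in interrupt" rule from the cgroup/memcg message can be modeled in userspace roughly as follows. All types and names here (`sock_model`, `ctx_model`, `sk_alloc_cgroup_sketch()`) are hypothetical; the kernel has an `in_interrupt()` helper for this kind of context check, but the message itself does not say which check the patch uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace models, not kernel types. */
struct cgroup_model {
	const char *name;
};

struct sock_model {
	struct cgroup_model *cgrp;
};

/* Snapshot of the execution context at allocation time (assumption). */
struct ctx_model {
	bool in_interrupt;                 /* true when running in irq/softirq */
	struct cgroup_model *current_cgrp; /* cgroup of the running/interrupted task */
};

/*
 * Associate a new socket with the current task's cgroup, except in
 * interrupt context, where "current" is just whatever task happened
 * to be interrupted and is unrelated to the socket.
 */
static void sk_alloc_cgroup_sketch(struct sock_model *sk,
				   const struct ctx_model *ctx)
{
	if (ctx->in_interrupt)
		return; /* leave the socket unassociated */

	sk->cgrp = ctx->current_cgrp;
}
```

The trade-off is the one the WARNING paragraph spells out: sockets created from interrupt context are no longer charged to any memcg, in exchange for never being charged to the wrong one.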
2020-03-10 | Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup | Linus Torvalds | 2 | -14/+28
Pull cgroup fixes from Tejun Heo:

 - cgroup.procs listing related fixes. It didn't interlock properly with exiting tasks leaving a short window where a cgroup has empty cgroup.procs but still can't be removed and misbehaved on short reads.
 - psi_show() crash fix on 32bit ino archs
 - Empty release_agent handling fix

* 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup1: don't call release_agent when it is ""
  cgroup: fix psi_show() crash on 32bit ino archs
  cgroup: Iterate tasks that did not finish do_exit()
  cgroup: cgroup_procs_next should increase position index
  cgroup-v1: cgroup_pidlist_next should update position index
2020-03-10 | Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq | Linus Torvalds | 1 | -6/+8
Pull workqueue fixes from Tejun Heo:

"Workqueue has been incorrectly round-robining per-cpu work items. Hillf's patch fixes that.

The other patch documents memory-ordering properties of workqueue operations"

* 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: don't use wq_select_unbound_cpu() for bound works
  workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()
2020-03-10 | workqueue: don't use wq_select_unbound_cpu() for bound works | Hillf Danton | 1 | -6/+8
wq_select_unbound_cpu() is designed for unbound workqueues only, but it's wrongly called when using a bound workqueue too. Fixing this ensures work queued to a bound workqueue with cpu=WORK_CPU_UNBOUND always runs on the local CPU. Before, that would happen only if wq_unbound_cpumask happened to include it (likely almost always the case), or was empty, or we got lucky with forced round-robin placement. So restricting /sys/devices/virtual/workqueue/cpumask to a small subset of a machine's CPUs would cause some bound work items to run unexpectedly there. Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs") Cc: [email protected] # v4.5+ Signed-off-by: Hillf Danton <[email protected]> [dj: massage changelog] Signed-off-by: Daniel Jordan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: [email protected] Signed-off-by: Tejun Heo <[email protected]>
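The CPU-selection rule described in the workqueue message can be sketched as a small decision function. Everything below is a hypothetical userspace model: `WORK_CPU_UNBOUND_MODEL`, `select_unbound_cpu_model()` and `pick_cpu_sketch()` are illustrative names, CPUs are limited to 0..31 so a plain bitmask can stand in for a cpumask, and the lowest allowed CPU replaces the kernel's round-robin placement.

```c
#include <stdbool.h>

#define WORK_CPU_UNBOUND_MODEL (-1) /* placeholder for "caller requested no CPU" */

/*
 * Stand-in for the unbound selection: stay local if the mask is empty
 * or allows the local CPU, otherwise pick the lowest allowed CPU.
 * Assumes local_cpu is in 0..31.
 */
static int select_unbound_cpu_model(int local_cpu, unsigned int allowed_mask)
{
	if (allowed_mask == 0 || (allowed_mask & (1u << local_cpu)))
		return local_cpu;

	for (int cpu = 0; cpu < 32; cpu++) {
		if (allowed_mask & (1u << cpu))
			return cpu;
	}
	return local_cpu;
}

/* Decide where a work item runs; bound workqueues never consult the mask. */
static int pick_cpu_sketch(bool wq_is_unbound, int req_cpu, int local_cpu,
			   unsigned int allowed_mask)
{
	if (req_cpu != WORK_CPU_UNBOUND_MODEL)
		return req_cpu; /* explicit CPU requested by the caller */

	if (wq_is_unbound)
		return select_unbound_cpu_model(local_cpu, allowed_mask);

	return local_cpu; /* bound workqueue: always the local CPU */
}
```

With a restricted mask such as `0x3` and a local CPU of 5, the model returns 5 for a bound workqueue and 0 for an unbound one, which is the distinction the fix restores.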
2020-03-09 | pid: make ENOMEM return value more obvious | Christian Brauner | 1 | -0/+8
The alloc_pid() codepath used to be simpler. With the introduction of the ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID") it got more complex. It hasn't been super obvious that ENOMEM is returned when the pid namespace init process/child subreaper of the pid namespace has died, as can be seen from multiple attempts to improve this, see e.g. [1] and most recently [2]. We regressed returning ENOMEM in [3] and [2] restored it. Let's add a comment on top explaining that this is historic and documented behavior and cannot easily be changed.

[1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
[2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
[3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Christian Brauner <[email protected]>
2020-03-08 | pid: Fix error return value in some cases | Corey Minyard | 1 | -0/+2
Recent changes to alloc_pid() allow the pid number to be specified on the command line. If set_tid_size is set, then the code scanning the levels will hard-set retval to -EPERM, overriding its previous -ENOMEM value.

After the code scanning the levels, there are error returns that do not set retval, assuming it is still set to -ENOMEM.

So set retval back to -ENOMEM after scanning the levels.

Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Corey Minyard <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Adrian Reber <[email protected]>
Cc: <[email protected]> # 5.5
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: fixup commit message]
Signed-off-by: Christian Brauner <[email protected]>
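The control flow behind this pid fix can be reduced to a few lines. The function below is a hypothetical model, not the kernel's alloc_pid(): the parameters `may_set_tid` and `ns_init_alive` are invented to stand in for the permission check inside the level scan and for the later failure path that relies on retval.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of the error handling described above. */
static int alloc_pid_sketch(size_t set_tid_size, bool may_set_tid,
			    bool ns_init_alive)
{
	int retval = -ENOMEM;

	/* Stand-in for the loop that scans the namespace levels. */
	if (set_tid_size) {
		retval = -EPERM; /* hard-set while a specific pid is requested */
		if (!may_set_tid)
			return retval;
	}

	/*
	 * The fix: later error returns do not set retval themselves and
	 * assume it still holds -ENOMEM, so restore it after the scan.
	 */
	retval = -ENOMEM;

	if (!ns_init_alive)
		return retval; /* documented behavior: ENOMEM, not EPERM */

	return 0;
}
```

Without the reset, the `!ns_init_alive` path in this model would leak -EPERM whenever `set_tid_size` is nonzero, which is the regression the message describes.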
2020-03-07 | Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block | Linus Torvalds | 1 | -1/+4
Pull block fixes from Jens Axboe:

"Here are a few fixes that should go into this release. This contains:

 - Revert of a bad bcache patch from this merge window
 - Removed unused function (Daniel)
 - Fixup for the blktrace fix from Jan from this release (Cengiz)
 - Fix of deeper level bfqq overwrite in BFQ (Carlo)"

* tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
  block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
  blktrace: fix dereference after null check
  Revert "bcache: ignore pending signals when creating gc and allocator thread"
  block: Remove used kblockd_schedule_work_on()
2020-03-07 | Merge tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux | Linus Torvalds | 2 | -3/+3
Pull thread fixes from Christian Brauner:

"Here are a few hopefully uncontroversial fixes:

 - Use RCU_INIT_POINTER() when initializing rcu protected members in task_struct to fix sparse warnings.
 - Add pidfd_fdinfo_test binary to .gitignore file"

* tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
  selftests: pidfd: Add pidfd_fdinfo_test in .gitignore
  exit: Fix Sparse errors and warnings
  fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()
2020-03-05 | blktrace: fix dereference after null check | Cengiz Can | 1 | -1/+4
There was a recent change in blktrace.c that added a RCU protection to `q->blk_trace` in order to fix a use-after-free issue during access.

However the change missed an edge case that can lead to dereferencing of `bt` pointer even when it's NULL:

Coverity static analyzer marked this as a FORWARD_NULL issue with CID 1460458.

```
/kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
1898            ret = 0;
1899            if (bt == NULL)
1900                    ret = blk_trace_setup_queue(q, bdev);
1901
1902            if (ret == 0) {
1903                    if (attr == &dev_attr_act_mask)
>>>     CID 1460458:  Null pointer dereferences  (FORWARD_NULL)
>>>     Dereferencing null pointer "bt".
1904                            bt->act_mask = value;
1905                    else if (attr == &dev_attr_pid)
1906                            bt->pid = value;
1907                    else if (attr == &dev_attr_start_lba)
1908                            bt->start_lba = value;
1909                    else if (attr == &dev_attr_end_lba)
```

Added a reassignment with RCU annotation to fix the issue.

Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU")
Cc: [email protected]
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Cengiz Can <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
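The shape of the blktrace fix, re-reading the pointer after the setup call, can be shown with a userspace model. The names `queue_model`, `setup_queue_model()` and `store_act_mask_sketch()` are hypothetical; a plain pointer load stands in for the RCU-annotated access the message mentions.

```c
#include <stddef.h>

/* Hypothetical userspace models of the queue and its trace state. */
struct blk_trace_model {
	unsigned int act_mask;
};

struct queue_model {
	struct blk_trace_model *blk_trace; /* NULL until tracing is set up */
	struct blk_trace_model storage;    /* backing object for this sketch */
};

/* Stand-in for the setup call: installs a trace object and returns 0. */
static int setup_queue_model(struct queue_model *q)
{
	q->storage.act_mask = 0;
	q->blk_trace = &q->storage;
	return 0;
}

static int store_act_mask_sketch(struct queue_model *q, unsigned int value)
{
	struct blk_trace_model *bt = q->blk_trace; /* kernel: RCU-annotated load */
	int ret = 0;

	if (bt == NULL) {
		ret = setup_queue_model(q);
		/* Re-read: the local copy is still NULL after a successful setup. */
		bt = q->blk_trace;
	}

	if (ret == 0)
		bt->act_mask = value;

	return ret;
}
```

Dropping the re-read line reproduces the pattern Coverity flagged: setup succeeds, `ret` is 0, and the stale NULL `bt` is dereferenced.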
2020-03-04 | cgroup1: don't call release_agent when it is "" | Tycho Andersen | 1 | -1/+1
Older (and maybe current) versions of systemd set release_agent to "" when shutting down, but do not set notify_on_release to 0. Since 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()"), we filter out such calls when the user mode helper path is "". However, when used in conjunction with an actual (i.e. non "") STATIC_USERMODEHELPER, the path is never "", so the real usermode helper will be called with argv[0] == "". Let's avoid this by not invoking the release_agent when it is "". Signed-off-by: Tycho Andersen <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
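The guard described in the release_agent message amounts to an empty-string check before invoking the helper. A minimal sketch, with `should_run_release_agent()` as a hypothetical name rather than the kernel's own function:

```c
#include <stdbool.h>
#include <stddef.h>

/* Only invoke the agent when a non-empty path has been configured. */
static bool should_run_release_agent(const char *agent_path)
{
	return agent_path != NULL && agent_path[0] != '\0';
}
```

Checking at the cgroup side, rather than relying on the usermode-helper path being "", is what keeps a real STATIC_USERMODEHELPER from being started with an empty argv[0].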
2020-03-04 | cgroup: fix psi_show() crash on 32bit ino archs | Qian Cai | 1 | -3/+3
Similar to the commit d7495343228f ("cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) does not equal 1 on 32bit ino archs, which triggers all sorts of issues with psi_show() on s390x. For example,

    BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/
    Read of size 4 at addr 000000001e0ce000 by task read_all/3667
    collect_percpu_times+0x2d0/0x798
    psi_show+0x7c/0x2a8
    seq_read+0x2ac/0x830
    vfs_read+0x92/0x150
    ksys_read+0xe2/0x188
    system_call+0xd8/0x2b4

Fix it by using cgroup_ino().

Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
Signed-off-by: Qian Cai <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected] # v5.5
2020-03-02 | Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -0/+2
Pull scheduler fix from Ingo Molnar:

"Fix a scheduler statistics bug"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix statistics for find_idlest_group()
2020-02-29 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | David S. Miller | 12 | -115/+228
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-02-28

The following pull-request contains BPF updates for your *net-next* tree.

We've added 41 non-merge commits during the last 7 day(s) which contain a total of 49 files changed, 1383 insertions(+), 499 deletions(-).

The main changes are:

1) BPF and Real-Time nicely co-exist.

2) bpftool feature improvements.

3) retrieve bpf_sk_storage via INET_DIAG.
====================

Signed-off-by: David S. Miller <[email protected]>
2020-02-28 | Merge tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block | Linus Torvalds | 1 | -31/+83
Pull block fixes from Jens Axboe:

 - Passthrough insertion fix (Ming)
 - Kill off some unused arguments (John)
 - blktrace RCU fix (Jan)
 - Dead fields removal for null_blk (Dongli)
 - NVMe polled IO fix (Bijan)

* tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
  nvme-pci: Hold cq_poll_lock while completing CQEs
  blk-mq: Remove some unused function arguments
  null_blk: remove unused fields in 'nullb_cmd'
  blktrace: Protect q->blk_trace with RCU
  blk-mq: insert passthrough request into hctx->dispatch directly
2020-02-28 | Merge tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds | 1 | -1/+1
Pull power management fixes from Rafael Wysocki:

"Fix a recent cpufreq initialization regression (Rafael Wysocki), revert a devfreq commit that made incompatible changes and broke user land on some systems (Orson Zhai), drop a stale reference to a document that has gone away recently (Jonathan Neuschäfer), and fix a typo in a hibernation code comment (Alexandre Belloni)"

* tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Fix policy initialization for internal governor drivers
  Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
  PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
  Documentation: power: Drop reference to interface.rst
2020-02-28 | exit: Fix Sparse errors and warnings | Madhuparna Bhowmik | 1 | -2/+2
This patch fixes the following sparse error: kernel/exit.c:627:25: error: incompatible types in comparison expression And the following warning: kernel/exit.c:626:40: warning: incorrect type in assignment Signed-off-by: Madhuparna Bhowmik <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: Christian Brauner <[email protected]> [[email protected]: edit commit message] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2020-02-28 | fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer() | Madhuparna Bhowmik | 1 | -1/+1
Use RCU_INIT_POINTER() instead of rcu_access_pointer() in copy_sighand(). Suggested-by: Oleg Nesterov <[email protected]> Signed-off-by: Madhuparna Bhowmik <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: Christian Brauner <[email protected]> [[email protected]: edit commit message] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2020-02-28 | Merge branches 'pm-sleep' and 'pm-devfreq' | Rafael J. Wysocki | 1 | -1/+1
* pm-sleep:
  PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
  Documentation: power: Drop reference to interface.rst

* pm-devfreq:
  Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
2020-02-27 | bpf: INET_DIAG support in bpf_sk_storage | Martin KaFai Lau | 1 | -0/+15
This patch adds INET_DIAG support to bpf_sk_storage.

1. Although this series adds bpf_sk_storage diag capability to inet sk, bpf_sk_storage is in general applicable to all fullsock. Hence, the bpf_sk_storage logic will operate on SK_DIAG_* nlattr. The caller will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as the argument.

2. The request will be like:
	INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in a later patch)
		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
		......

   Considering there could be multiple bpf_sk_storages in a sk, instead of reusing INET_DIAG_INFO ("ss -i"), the user can select some specific bpf_sk_storage to dump by specifying an array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD.

   If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages of a sk.

3. The reply will be like:
	INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in a later patch)
		SK_DIAG_BPF_STORAGE (nla_nest)
			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
		SK_DIAG_BPF_STORAGE (nla_nest)
			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
		......

4. Unlike other INET_DIAG info of a sk, which is pretty static, the size required to dump the bpf_sk_storage(s) of a sk is dynamic as the system adds more bpf_sk_storage_map. It is hard to set a static min_dump_alloc size.

   Hence, this series learns it at runtime and adjusts cb->min_dump_alloc as it iterates all sk(s) of a system. The "unsigned int *res_diag_size" in bpf_sk_storage_diag_put() is for this purpose.

   The next patch will update cb->min_dump_alloc as it iterates the sk(s).

Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
2020-02-27 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 11 | -122/+225
The mptcp conflict was overlapping additions. The SMC conflict was an addition and a removal happening at the same time. Signed-off-by: David S. Miller <[email protected]>
2020-02-28 | bpf: Replace zero-length array with flexible-array member | Gustavo A. R. Silva | 3 | -3/+3
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by this change:

"Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor
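To make the allocation remark concrete, here is a small userspace sketch that allocates the `struct foo` from the message with a flexible array member. `struct boo`'s contents and the helper `foo_alloc()` are assumptions added for illustration; the message only shows the declaration.

```c
#include <stdlib.h>
#include <string.h>

struct boo {
	int value; /* assumed payload; the message does not define struct boo */
};

struct foo {
	int stuff;
	struct boo array[]; /* C99 flexible array member: must be the last field */
};

/* Allocate a struct foo with room for n trailing elements (hypothetical helper). */
static struct foo *foo_alloc(size_t n)
{
	/* sizeof(*f) covers the header only; the array is added explicitly. */
	struct foo *f = malloc(sizeof(*f) + n * sizeof(f->array[0]));

	if (!f)
		return NULL;

	f->stuff = (int)n;
	memset(f->array, 0, n * sizeof(f->array[0]));
	return f;
}
```

The size computation is the same one that would be written for a zero-length array, which is why switching the declaration does not change existing allocation code. (Kernel code would typically guard the multiplication against overflow; that is omitted here.)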
2020-02-27 | Merge tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit | Linus Torvalds | 2 | -51/+60
Pull audit fixes from Paul Moore:

"Two fixes for problems found by syzbot:

 - Moving audit filter structure fields into a union caused some problems in the code which populates that filter structure. We keep the union (that idea is a good one), but we are fixing the code so that it doesn't needlessly set fields in the union and mess up the error handling.
 - The audit_receive_msg() function wasn't validating user input as well as it should in all cases, we add the necessary checks"

* tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: always check the netlink payload length in audit_receive_msg()
  audit: fix error handling in audit_data_to_entry()
2020-02-27 | sched/fair: Fix statistics for find_idlest_group() | Vincent Guittot | 1 | -0/+2
sgs->group_weight is not set while gathering statistics in update_sg_wakeup_stats(). This means that a group can be classified as fully busy with 0 running tasks if utilization is high enough. This path is mainly used for fork and exec. Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()") Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Mel Gorman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-02-26 | Merge tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace | Linus Torvalds | 4 | -35/+127
Pull tracing and bootconfig updates:

"Fixes and changes to bootconfig before it goes live in a release.

Change in API of bootconfig (before it comes live in a release):
 - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists
 - Set CONFIG_BOOT_CONFIG to 'n' by default
 - Show error if "bootconfig" on cmdline but not compiled in
 - Prevent redefining the same value
 - Have a way to append values
 - Added a SELECT BLK_DEV_INITRD to fix a build failure

Synthetic event fixes:
 - Switch to raw_smp_processor_id() for recording CPU value in preempt section. (No care for what the value actually is)
 - Fix samples always recording u64 values
 - Fix endianness
 - Check number of values matches number of fields
 - Fix a printing bug

Fix of trace_printk() breaking postponed start up tests

Make a function static that is only used in a single file"

* tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
  bootconfig: Add append value operator support
  bootconfig: Prohibit re-defining value on same key
  bootconfig: Print array as multiple commands for legacy command line
  bootconfig: Reject subkey and value on same parent key
  tools/bootconfig: Remove unneeded error message silencer
  bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
  bootconfig: Set CONFIG_BOOT_CONFIG=n by default
  tracing: Clear trace_state when starting trace
  bootconfig: Mark boot_config_checksum() static
  tracing: Disable trace_printk() on post poned tests
  tracing: Have synthetic event test use raw_smp_processor_id()
  tracing: Fix number printing bug in print_synth_event()
  tracing: Check that number of vals matches number of synth event fields
  tracing: Make synth_event trace functions endian-correct
  tracing: Make sure synth_event_trace() example always uses u64
2020-02-26 | signal: avoid double atomic counter increments for user accounting | Linus Torvalds | 1 | -9/+14
When queueing a signal, we increment both the users count of pending signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount of the user struct itself (because we keep a reference to the user in the signal structure in order to correctly account for it when freeing).

That turns out to be fairly expensive, because both of them are atomic updates, and particularly under extreme signal handling pressure on big machines, you can get a lot of cache contention on the user struct. That can then cause horrid cacheline ping-pong when you do these multiple accesses.

So change the reference counting to only pin the user for the _first_ pending signal, and to unpin it when the last pending signal is dequeued. That means that when a user sees a lot of concurrent signal queuing - which is the only situation when this matters - the only atomic access needed is generally the 'sigpending' count update.

This was noticed because of a particularly odd timing artifact on a dual-socket 96C/192T Cascade Lake platform: when you get into bad contention, it for some reason seems to be much worse on that machine when the contention happens in the upper 32-byte half of the cacheline.

As a result, the kernel test robot will-it-scale 'signal1' benchmark had an odd performance regression simply due to random alignment of the 'struct user_struct' (and pointed to a completely unrelated and apparently nonsensical commit for the regression).

Avoiding the double increments (and decrements on the dequeueing side, of course) makes for much less contention and hugely improved performance on that will-it-scale microbenchmark.

Quoting Feng Tang:

 "It makes a big difference, that the performance score is tripled! bump from original 17000 to 54000. Also the gap between 5.0-rc6 and 5.0-rc6+Jiri's patch is reduced to around 2%"

[ The "2% gap" is the odd cacheline placement difference on that platform: under the extreme contention case, the effect of which half of the cacheline was hot was 5%, so with the reduced contention the odd timing artifact is reduced too ]

It does help in the non-contended case too, but is not nearly as noticeable.

Reported-and-tested-by: Feng Tang <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Philip Li <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
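The "pin on first, unpin on last" scheme from the signal accounting message can be sketched with C11 atomics. This is a userspace model under stated assumptions: `user_model`, `sig_queue_sketch()` and `sig_dequeue_sketch()` are hypothetical names, and RLIMIT_SIGPENDING checking is left out entirely.

```c
#include <stdatomic.h>

/* Hypothetical model of the per-user accounting state. */
struct user_model {
	atomic_int refcount;   /* keeps the user struct alive */
	atomic_int sigpending; /* number of queued signals */
};

/* Queue side: always bump sigpending, take a reference only on 0 -> 1. */
static void sig_queue_sketch(struct user_model *u)
{
	/* atomic_fetch_add returns the previous value */
	if (atomic_fetch_add(&u->sigpending, 1) == 0)
		atomic_fetch_add(&u->refcount, 1);
}

/* Dequeue side: always drop sigpending, release the reference on 1 -> 0. */
static void sig_dequeue_sketch(struct user_model *u)
{
	if (atomic_fetch_sub(&u->sigpending, 1) == 1)
		atomic_fetch_sub(&u->refcount, 1);
}
```

With many signals pending at once, each queue or dequeue touches only `sigpending`; `refcount` changes twice in total instead of twice per signal, which is where the reduced cacheline contention comes from.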
2020-02-25 | bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue | Masami Hiramatsu | 1 | -1/+0
Since commit d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default") also changed CONFIG_BOOTTIME_TRACING to select CONFIG_BOOT_CONFIG in order to show boot-time tracing on the menu, it introduced wrong dependencies with BLK_DEV_INITRD as below.

    WARNING: unmet direct dependencies detected for BOOT_CONFIG
      Depends on [n]: BLK_DEV_INITRD [=n]
      Selected by [y]:
      - BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y]

This makes CONFIG_BOOT_CONFIG select CONFIG_BLK_DEV_INITRD to fix this error, and makes CONFIG_BOOTTIME_TRACING=n by default, so that both boot-time tracing and boot configuration are off by default but still appear on the menu list.

Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2
Fixes: d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default")
Reported-by: Randy Dunlap <[email protected]>
Compiled-tested-by: Randy Dunlap <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-02-25 | blktrace: Protect q->blk_trace with RCU | Jan Kara | 1 | -31/+83
KASAN is reporting that __blk_add_trace() has a use-after-free issue when accessing q->blk_trace. Indeed the switching of block tracing (and thus eventual freeing of q->blk_trace) is completely unsynchronized with the currently running tracing, and thus it can happen that the blk_trace structure is being freed just while __blk_add_trace() works on it.

Protect accesses to q->blk_trace by RCU during tracing and make sure we wait for the end of the RCU grace period when shutting down tracing. Luckily that is a rare enough event that we can afford that. Note that postponing the freeing of blk_trace to an RCU callback is better avoided, as it could have unexpected user visible side-effects: debugfs files would still exist for a short while after block tracing has been shut down.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
CC: [email protected]
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Tested-by: Ming Lei <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reported-by: Tristan Madani <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
2020-02-24bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabledDavid Miller1-3/+15
In an RT kernel down_read_trylock() cannot be used from NMI context and up_read_non_owner() is another problematic issue. So in such a configuration, simply elide the annotated stackmap and just report the raw IPs. In the longer term, it might be possible to provide atomic-friendly versions of the page cache traversal which will at least provide the info if the pages are resident and don't need to be paged in. [ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work callback and add a comment ] Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf, lpm: Make locking RT friendlyThomas Gleixner1-6/+6
The LPM trie map cannot be used in contexts like perf, kprobes and tracing as this map type dynamically allocates memory. The memory allocation happens with a raw spinlock held, which is a truly spinning lock on a PREEMPT_RT enabled kernel and disables preemption and interrupts. As RT does not allow memory allocation from such a section for various reasons, convert the raw spinlock to a regular spinlock. On an RT enabled kernel these locks are substituted by 'sleeping' spinlocks which provide the proper protection but keep the code preemptible. On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does not cause any functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf: Prepare hashtab locking for PREEMPT_RTThomas Gleixner1-9/+56
PREEMPT_RT forbids certain operations like memory allocations (even with GFP_ATOMIC) from atomic contexts. This is required because even with GFP_ATOMIC the memory allocator calls into code paths which acquire locks with long held lock sections. To ensure deterministic behaviour these locks are regular spinlocks, which are converted to 'sleepable' spinlocks on RT. The only true atomic contexts on an RT kernel are the low level hardware handling, scheduling, low level interrupt handling, NMIs etc. None of these contexts should ever do memory allocations. As regular device interrupt handlers and soft interrupts are forced into thread context, the existing code which does
  spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();
just works. In theory the BPF locks could be converted to regular spinlocks as well, but the bucket locks and percpu_freelist locks can be taken from arbitrary contexts (perf, kprobes, tracepoints) which are required to be atomic contexts even on RT. These mechanisms require preallocated maps, so there is no need to invoke memory allocations within the lock held sections. BPF maps which need dynamic allocation are only used from (forced) thread context on RT and can therefore use regular spinlocks, which in turn allows memory allocations to be invoked from the lock held section. To achieve this make the hash bucket lock a union of a raw and a regular spinlock and initialize and lock/unlock either the raw spinlock for preallocated maps or the regular variant for maps which require memory allocations. On a non RT kernel this distinction is neither possible nor required. spinlock maps to raw_spinlock and the extra code and conditional is optimized out by the compiler. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
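The lock selection described above can be sketched in plain userspace C. This is only an illustrative model, not the kernel code: the `*_model` types and function names are hypothetical, the "locks" are simple counters, and `rt_enabled` stands in for a compile-time PREEMPT_RT check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two kernel lock flavours. */
struct raw_lock_model      { int held; };  /* always spins */
struct sleeping_lock_model { int held; };  /* may sleep on RT */

/* The bucket carries only one lock; which member is used is fixed per map. */
struct bucket_model {
    union {
        struct raw_lock_model      raw_lock;
        struct sleeping_lock_model lock;
    };
};

struct htab_model {
    bool rt_enabled;  /* assumption: models a PREEMPT_RT build */
    bool prealloc;    /* map elements were preallocated */
    struct bucket_model bucket;
};

/* Raw lock on non-RT kernels, and on RT for preallocated maps. */
static bool htab_model_use_raw_lock(const struct htab_model *htab)
{
    return !htab->rt_enabled || htab->prealloc;
}

static void htab_model_lock_bucket(struct htab_model *htab)
{
    if (htab_model_use_raw_lock(htab)) {
        assert(htab->bucket.raw_lock.held == 0);
        htab->bucket.raw_lock.held = 1;
    } else {
        assert(htab->bucket.lock.held == 0);
        htab->bucket.lock.held = 1;
    }
}

static void htab_model_unlock_bucket(struct htab_model *htab)
{
    if (htab_model_use_raw_lock(htab))
        htab->bucket.raw_lock.held = 0;
    else
        htab->bucket.lock.held = 0;
}
```

In this model only an RT build with a non-preallocated map takes the sleeping variant; every other combination keeps the raw lock, which matches the "no functional change" claim for non-RT kernels.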
2020-02-24bpf: Factor out hashtab bucket lock operationsThomas Gleixner1-23/+46
As a preparation for making the BPF locking RT friendly, factor out the hash bucket lock operations into inline functions. This allows to do the necessary RT modification in one place instead of sprinkling it all over the place. No functional change. The now unused htab argument of the lock/unlock functions will be used in the next step which adds PREEMPT_RT support. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf: Replace open coded recursion prevention in sys_bpf()Thomas Gleixner1-19/+8
The required protection is that the caller cannot be migrated to a different CPU as these functions end up in places which take either a hash bucket lock or might trigger a kprobe inside the memory allocator. Both scenarios can lead to deadlocks. The deadlock prevention is per CPU by incrementing a per CPU variable which temporarily blocks the invocation of BPF programs from perf and kprobes. Replace the open coded preempt_[dis|en]able and __this_cpu_[inc|dec] pairs with the new helper functions. These functions are already prepared to make BPF work on PREEMPT_RT enabled kernels. No functional change for !RT kernels. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf: Use recursion prevention helpers in hashtab codeThomas Gleixner1-8/+4
The required protection is that the caller cannot be migrated to a different CPU as these places take either a hash bucket lock or might trigger a kprobe inside the memory allocator. Both scenarios can lead to deadlocks. The deadlock prevention is per CPU by incrementing a per CPU variable which temporarily blocks the invocation of BPF programs from perf and kprobes. Replace the open coded preempt_disable/enable() and this_cpu_inc/dec() pairs with the new recursion prevention helpers to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT kernel the migrate disable/enable in the helpers map to preempt_disable/enable(), i.e. no functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
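A rough userspace model of the per-CPU recursion prevention may help here. Everything below is an assumption made for illustration: the `model_*` names are hypothetical, the CPU is passed explicitly instead of being implied by the running context, and migration disabling is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Per-CPU counter: non-zero means "do not run instrumentation BPF here". */
static int model_prog_active[MODEL_NR_CPUS];
/* Per-CPU migrate-disable depth (preempt_disable() on a non-RT kernel). */
static int model_migrate_depth[MODEL_NR_CPUS];

/* Enter a section that takes bucket locks or may hit a kprobe. */
static void model_disable_instrumentation(int cpu)
{
    model_migrate_depth[cpu]++;  /* stay on this CPU first */
    model_prog_active[cpu]++;    /* then block perf/kprobe BPF programs */
}

static void model_enable_instrumentation(int cpu)
{
    assert(model_prog_active[cpu] > 0);
    model_prog_active[cpu]--;
    model_migrate_depth[cpu]--;
}

/* Called from a (modelled) kprobe or perf event; false means "skipped". */
static bool model_try_run_instrumentation_prog(int cpu)
{
    return model_prog_active[cpu] == 0;
}
```

The counter is per CPU, so the protection only holds if the task cannot move to another CPU between the increment and the decrement. That is why the message stresses migration rather than preemption.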
2020-02-24bpf: Use migrate_disable/enabe() in trampoline code.David Miller1-4/+5
Use migrate_disable/enable() instead of preemption disable/enable to reflect the purpose. This allows PREEMPT_RT to substitute it with an actual migration disable implementation. On non RT kernels this is still mapped to preempt_disable/enable(). Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites.David Miller1-3/+1
All of these cases are strictly of the form:
  preempt_disable();
  BPF_PROG_RUN(...);
  preempt_enable();
Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN() with:
  migrate_disable();
  BPF_PROG_RUN(...);
  migrate_enable();
On non RT enabled kernels this maps to preempt_disable/enable() and on RT enabled kernels this solely prevents migration, which is sufficient as there is no requirement to prevent reentrancy to any BPF program from a preempting task. The only requirement is that the program stays on the same CPU. Therefore, this is a trivially correct transformation. The seccomp loop does not need protection over the loop. It only needs protection per BPF filter program. [ tglx: Converted to bpf_prog_run_pin_on_cpu() ] Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
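The wrapper pattern can be illustrated with a small userspace sketch. It is not the kernel's bpf_prog_run_pin_on_cpu(); the `model_*` names are hypothetical, a "program" is just a function pointer, and migration disabling is modelled as a depth counter.

```c
#include <assert.h>
#include <stddef.h>

/* Modelled migrate-disable nesting depth for the current task. */
static int model_migrate_disable_depth;

static void model_migrate_disable(void)
{
    model_migrate_disable_depth++;
}

static void model_migrate_enable(void)
{
    assert(model_migrate_disable_depth > 0);
    model_migrate_disable_depth--;
}

/* A "BPF program" in this model is just a function taking a context. */
typedef unsigned int (*model_prog_fn)(const void *ctx);

/* Run the program while pinned to the current CPU, then unpin. */
static unsigned int model_prog_run_pin_on_cpu(model_prog_fn prog,
                                              const void *ctx)
{
    unsigned int ret;

    model_migrate_disable();
    ret = prog(ctx);
    model_migrate_enable();
    return ret;
}

/* Example program: returns the pin depth it observes while running. */
static unsigned int model_prog_report_depth(const void *ctx)
{
    (void)ctx;
    return (unsigned int)model_migrate_disable_depth;
}
```

Calling `model_prog_run_pin_on_cpu(model_prog_report_depth, NULL)` returns the depth seen inside the program, and the depth is back to zero afterwards. For a loop such as the seccomp filter chain, the wrapper would be applied per program rather than around the whole loop.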
2020-02-24bpf: Dont iterate over possible CPUs with interrupts disabledThomas Gleixner1-10/+10
pcpu_freelist_populate() is disabling interrupts and then iterates over the possible CPUs. The reason why this disables interrupts is to silence lockdep because the invoked ___pcpu_freelist_push() takes spin locks. Neither the interrupt disabling nor the locking is required in this function because it's called during initialization and the resulting map is not yet visible to anything. Split out the actual push assignment into an inline, call it from the loop and remove the interrupt disable. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
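As a hedged illustration of why no locking is needed at initialization, the sketch below populates per-CPU free lists with a plain, lock-free push helper. The `model_*` names, the fixed CPU count and the way nodes are split between CPUs are assumptions chosen for the example, not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

struct model_node { struct model_node *next; };
struct model_head { struct model_node *first; };

/* Bare push: no lock, no interrupt disabling. Safe only while the
 * list is not yet visible to any other context. */
static inline void model_push_node(struct model_head *head,
                                   struct model_node *node)
{
    node->next = head->first;
    head->first = node;
}

/* Hand out nodes to each CPU's list in contiguous chunks. */
static void model_freelist_populate(struct model_head heads[MODEL_NR_CPUS],
                                    struct model_node *nodes,
                                    size_t nr_nodes)
{
    size_t per_cpu = nr_nodes / MODEL_NR_CPUS + 1;
    size_t i = 0;

    for (int cpu = 0; cpu < MODEL_NR_CPUS && i < nr_nodes; cpu++) {
        for (size_t n = 0; n < per_cpu && i < nr_nodes; n++, i++)
            model_push_node(&heads[cpu], &nodes[i]);
    }
}

/* Count the nodes on one list. */
static size_t model_freelist_len(const struct model_head *head)
{
    size_t len = 0;

    for (const struct model_node *n = head->first; n; n = n->next)
        len++;
    return len;
}
```

A locked push used at run time could call the same inline helper after taking the per-CPU lock, so the list manipulation stays in one place.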
2020-02-24bpf: Remove recursion prevention from rcu free callbackThomas Gleixner1-8/+0
If an element is freed via RCU then recursion into BPF instrumentation functions is not a concern. The element is already detached from the map and the RCU callback does not hold any locks on which a kprobe, perf event or tracepoint attached BPF program could deadlock. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24perf/bpf: Remove preempt disable around BPF invocationThomas Gleixner1-2/+0
The BPF invocation from the perf event overflow handler does not require to disable preemption because this is called from NMI or at least hard interrupt context which is already non-preemptible. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf/trace: Remove redundant preempt_disable from trace_call_bpf()Thomas Gleixner1-2/+1
Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf: disable preemption for bpf progs attached to uprobeAlexei Starovoitov1-2/+9
trace_call_bpf() no longer disables preemption on its own. All callers of this function have to do it explicitly. Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Thomas Gleixner <[email protected]>
2020-02-24bpf/trace: Remove EXPORT from trace_call_bpf()Thomas Gleixner1-1/+0
All callers are built in. No point to export this. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2020-02-24bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()Thomas Gleixner1-2/+1
__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation. This is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf: Update locking comment in hashtab codeThomas Gleixner1-4/+20
The comment where the bucket lock is acquired says: /* bpf_map_update_elem() can be called in_irq() */ which is not really helpful, and aside from that it does not explain the subtle details of the hash bucket locks, especially in the context of BPF and perf, kprobes and tracing. Add a comment at the top of the file which explains the protection scopes and the details of how potential deadlocks are prevented. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-02-24bpf: Enforce preallocation for instrumentation programs on RTThomas Gleixner1-4/+9
Aside from the general unsafety of run-time map allocation for instrumentation type programs, RT enabled kernels have another constraint: The instrumentation programs are invoked with preemption disabled, but the memory allocator spinlocks cannot be acquired in atomic context because they are converted to 'sleeping' spinlocks on RT. Therefore enforce map preallocation for these program types when RT is enabled. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
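The resulting policy, combined with the one-time warning introduced by the commit listed next, can be summarized as a small decision function. This is a hypothetical model written for illustration: the enum, the function name and the boolean parameters are not the verifier's actual interface.

```c
#include <assert.h>
#include <stdbool.h>

enum model_verdict {
    MODEL_ALLOW,   /* load the program without complaint */
    MODEL_WARN,    /* load it, but emit a warning */
    MODEL_REJECT   /* refuse to load */
};

/* Decide what to do when a program references a hash map.
 *  instrumentation_prog: kprobe, tracepoint or perf style program
 *  map_prealloc:         map elements were preallocated
 *  rt_enabled:           assumption, models a PREEMPT_RT build
 */
static enum model_verdict model_check_map_prealloc(bool instrumentation_prog,
                                                   bool map_prealloc,
                                                   bool rt_enabled)
{
    if (!instrumentation_prog || map_prealloc)
        return MODEL_ALLOW;
    if (rt_enabled)
        return MODEL_REJECT;  /* allocator locks may sleep on RT */
    return MODEL_WARN;        /* kept working for backwards compatibility */
}
```

For example, `model_check_map_prealloc(true, false, true)` yields `MODEL_REJECT`, while the same program on a non-RT build yields `MODEL_WARN`.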
2020-02-24bpf: Tighten the requirements for preallocated hash mapsThomas Gleixner1-11/+28
The assumption that only programs attached to perf NMI events can deadlock on memory allocators is wrong. Assume the following simplified callchain:
  kmalloc() from regular non BPF context
    cache empty
    freelist empty
    lock(zone->lock);
      tracepoint or kprobe
        BPF()
          update_elem()
            lock(bucket)
              kmalloc()
                cache empty
                freelist empty
                lock(zone->lock);  <- DEADLOCK
There are other ways which do not involve locking to create wreckage:
  kmalloc() from regular non BPF context
    local_irq_save();
    ...
    obj = slab_first();
      kprobe()
        BPF()
          update_elem()
            lock(bucket)
              kmalloc()
                local_irq_save();
                ...
                obj = slab_first(); <- Same object as above ...
So preallocation _must_ be enforced for all variants of intrusive instrumentation. Unfortunately immediate enforcement would break backwards compatibility, so for now such programs still are allowed to run, but a one time warning is emitted in dmesg and the verifier emits a warning in the verifier log as well so developers are made aware about this and can fix their programs before the enforcement becomes mandatory. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
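The second, lock-free failure mode can be reproduced in a toy userspace model. The allocator below is deliberately non-reentrant and is not how the kernel slab allocator works internally; the `model_*` names are hypothetical, and the "kprobe" is simply a callback fired between reading the first free object and committing the allocation.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SLAB_OBJECTS 4

struct model_slab {
    int objects[MODEL_SLAB_OBJECTS];
    size_t first;                    /* index of the first free object */
};

typedef void (*model_probe_fn)(struct model_slab *slab);

/* Non-reentrant allocation: read the first free slot, then commit.
 * An optional probe fires in between, modelling a kprobe hit. */
static int *model_slab_alloc(struct model_slab *slab, model_probe_fn probe)
{
    size_t idx = slab->first;        /* obj = slab_first(); */

    assert(idx < MODEL_SLAB_OBJECTS);
    if (probe)
        probe(slab);                 /* kprobe() -> BPF() -> update_elem() */
    slab->first = idx + 1;           /* commit, unaware of the nested call */
    return &slab->objects[idx];
}

/* Object handed out to the nested (probe) allocation. */
static int *model_nested_obj;

/* Modelled BPF program doing a run-time map element allocation. */
static void model_probe_update_elem(struct model_slab *slab)
{
    model_nested_obj = model_slab_alloc(slab, NULL);
}
```

With a fresh slab, `model_slab_alloc(&slab, model_probe_update_elem)` returns the same pointer that was stored in `model_nested_obj`, so two owners now share one object. A preallocated map avoids the nested allocator call entirely.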