Age | Commit message | Author | Files | Lines
2020-03-29 | i3c: convert to use i2c_new_client_device() | Wolfram Sang | 1 | -1/+1
Move away from the deprecated API. Signed-off-by: Wolfram Sang <[email protected]> Signed-off-by: Boris Brezillon <[email protected]> Link: https://lore.kernel.org/linux-i3c/[email protected]
2020-03-28 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 23 | -80/+221
Pull networking fixes from David Miller:
 1) Fix memory leak in vti6, from Torsten Hilbrich.
 2) Fix double free in xfrm_policy_timer, from YueHaibing.
 3) NL80211_ATTR_CHANNEL_WIDTH attribute is put with wrong type, from Johannes Berg.
 4) Wrong allocation failure check in qlcnic driver, from Xu Wang.
 5) Get ks8851-ml IO operations right, for real this time, from Marek Vasut.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
  r8169: fix PHY driver check on platforms w/o module softdeps
  net: ks8851-ml: Fix IO operations, again
  mlxsw: spectrum_mr: Fix list iteration in error path
  qlcnic: Fix bad kzalloc null test
  mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX
  mac80211: mark station unauthorized before key removal
  mac80211: Check port authorization in the ieee80211_tx_dequeue() case
  cfg80211: Do not warn on same channel at the end of CSA
  mac80211: drop data frames without key on encrypted links
  ieee80211: fix HE SPR size calculation
  nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type
  xfrm: policy: Fix doulbe free in xfrm_policy_timer
  bpf: Explicitly memset some bpf info structures declared on the stack
  bpf: Explicitly memset the bpf_attr structure
  bpf: Sanitize the bpf_struct_ops tcp-cc name
  vti6: Fix memory leak of skb if input policy check fails
  esp: remove the skb from the chain when it's enqueued in cryptd_wq
  ipv6: xfrm6_tunnel.c: Use built-in RCU list checking
  xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire
  xfrm: fix uctx len check in verify_sec_ctx_len
  ...
2020-03-28 | Merge branch 'ifla_xdp_expected_fd' | Alexei Starovoitov | 9 | -9/+146
Toke Høiland-Jørgensen says:
====================
This series adds support for atomically replacing the XDP program loaded on an interface. This is achieved by means of a new netlink attribute that can specify the expected previous program to replace on the interface. If set, the kernel will compare this "expected fd" attribute with the program currently loaded on the interface, and reject the operation if it does not match. With this primitive, userspace applications can avoid stepping on each other's toes when simultaneously updating the loaded XDP program.
Changelog:
v4:
- Switch back to passing FD instead of ID (Andrii)
- Rename flag to XDP_FLAGS_REPLACE (for consistency with other similar uses)
v3:
- Pass existing ID instead of FD (Jakub)
- Use opts struct for new libbpf function (Andrii)
v2:
- Fix checkpatch nits and add .strict_start_type to netlink policy (Jakub)
====================
Signed-off-by: Alexei Starovoitov <[email protected]>
2020-03-28 | selftests/bpf: Add tests for attaching XDP programs | Toke Høiland-Jørgensen | 1 | -0/+62
This adds tests for the various replacement operations using IFLA_XDP_EXPECTED_FD. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-03-28 | libbpf: Add function to set link XDP fd while specifying old program | Toke Høiland-Jørgensen | 3 | -1/+42
This adds a new function to set the XDP fd while specifying the FD of the program to replace, using the newly added IFLA_XDP_EXPECTED_FD netlink parameter. The new function uses the opts struct mechanism to be extendable in the future. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
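The "opts struct mechanism" mentioned in this commit can be pictured with a small plain-C model. This is only a sketch of the general idea, not the libbpf definition: `demo_opts`, `DEMO_OPTS_HAS` and `demo_get_old_fd` are hypothetical names. The caller records the size of the struct it was compiled against in the first field, so the callee can tell which trailing fields the caller actually knows about and new fields can be appended later.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical options struct for illustration; NOT the real libbpf layout. */
struct demo_opts {
    size_t sz;   /* caller sets this to the sizeof() it was built with */
    int old_fd;  /* fd of the program expected to be attached */
};

/* True if the caller's struct is large enough to contain `field`. */
#define DEMO_OPTS_HAS(opts, field) \
    ((opts) != NULL && \
     (opts)->sz >= offsetof(struct demo_opts, field) + sizeof((opts)->field))

/*
 * Copy the expected fd into *old_fd when the caller supplied one.
 * Returns 1 if the field was present, 0 otherwise.
 */
static int demo_get_old_fd(const struct demo_opts *opts, int *old_fd)
{
    assert(old_fd != NULL);
    if (!DEMO_OPTS_HAS(opts, old_fd))
        return 0;
    *old_fd = opts->old_fd;
    return 1;
}
```

A caller built against an older, shorter struct simply passes a smaller `sz`, and the callee treats the newer fields as absent.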
2020-03-28 | tools: Add EXPECTED_FD-related definitions in if_link.h | Toke Høiland-Jørgensen | 1 | -1/+3
This adds the IFLA_XDP_EXPECTED_FD netlink attribute definition and the XDP_FLAGS_REPLACE flag to if_link.h in tools/include. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-03-28 | xdp: Support specifying expected existing program when attaching XDP | Toke Høiland-Jørgensen | 4 | -7/+39
While it is currently possible for userspace to specify that an existing XDP program should not be replaced when attaching to an interface, there is no mechanism to safely replace a specific XDP program with another. This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be set along with IFLA_XDP_FD. If set, the kernel will check that the program currently loaded on the interface matches the expected one, and fail the operation if it does not. This corresponds to a 'cmpxchg' memory operation. Setting the new attribute with a negative value means that no program is expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST flag. A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly request checking of the EXPECTED_FD attribute. This is needed for userspace to discover whether the kernel supports the new attribute. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
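The 'cmpxchg' analogy in this commit message can be sketched in userspace C. This is a toy model, not kernel code: `demo_iface`, `demo_replace_prog` and the use of program ids instead of file descriptors are assumptions made for illustration, and the `-EEXIST` return value is simply chosen for the demo.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for an interface; prog_id 0 means "nothing attached". */
struct demo_iface {
    int prog_id;
};

/*
 * Attach new_id only if the currently attached program is expected_id.
 * Passing expected_id == 0 models "expect that no program is attached".
 * Returns 0 on success, -EEXIST on mismatch (error code chosen for the demo).
 */
static int demo_replace_prog(struct demo_iface *iface, int expected_id, int new_id)
{
    assert(iface != NULL);
    if (iface->prog_id != expected_id)
        return -EEXIST;   /* someone else changed the program first */
    iface->prog_id = new_id;
    return 0;
}
```

Two updaters that both read the same "current" program and then both try to replace it cannot silently overwrite each other: the second call fails because its expectation no longer matches.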
2020-03-28 | platform/x86: surface3_power: Fix Kconfig section ordering | Andy Shevchenko | 1 | -6/+6
The Kconfig section is misplaced. Put it in the same order as in the Makefile for this driver. Signed-off-by: Andy Shevchenko <[email protected]>
2020-03-28 | platform/x86: surface3_power: Add missed headers | Andy Shevchenko | 1 | -0/+2
We obviously are users of bits.h and types.h. Add them to the list. Signed-off-by: Andy Shevchenko <[email protected]>
2020-03-28 | platform/x86: surface3_power: Reformat GUID assignment | Andy Shevchenko | 1 | -2/+3
For better readability, reformat the GUID assignment. While here, add a comment showing how this GUID looks in its string representation. Signed-off-by: Andy Shevchenko <[email protected]>
2020-03-28 | platform/x86: surface3_power: Drop useless macro ACPI_PTR() | Andy Shevchenko | 1 | -1/+1
The driver depends on ACPI, so this macro always evaluates to its parameter and is therefore useless. Drop it for good. Signed-off-by: Andy Shevchenko <[email protected]>
2020-03-28 | platform/x86: surface3_power: Prefix POLL_INTERVAL with SURFACE_3 | Andy Shevchenko | 1 | -3/+3
For better namespace maintenance prefix POLL_INTERVAL macro with SURFACE_3. Signed-off-by: Andy Shevchenko <[email protected]>
2020-03-28 | platform/x86: surface3_power: Simplify mshw0011_adp_psr() to one liner | Andy Shevchenko | 1 | -8/+1
Refactor mshw0011_adp_psr() to be a one-liner. Signed-off-by: Andy Shevchenko <[email protected]>
2020-03-28 | platform/x86: surface3_power: Use dev_err() instead of pr_err() | Andy Shevchenko | 1 | -1/+1
We have a device, so we can use it to print messages. Signed-off-by: Andy Shevchenko <[email protected]>
2020-03-28 | platform/x86: surface3_power: Drop unused structure definition | Andy Shevchenko | 1 | -7/+0
As reported by the kbuild bot, the struct mshw0011_lookup is never used. Drop its definition for good. Reported-by: kbuild test robot <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]>
2020-03-28 | Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux | Linus Torvalds | 5 | -15/+13
Pull i2c fixes from Wolfram Sang:
 "Three more driver bugfixes, and two doc improvements fixing build warnings while we are here"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: pca-platform: Use platform_irq_get_optional
  i2c: st: fix missing struct parameter description
  i2c: nvidia-gpu: Handle timeout correctly in gpu_i2c_check_status()
  i2c: fix a doc warning
  i2c: hix5hd2: add missed clk_disable_unprepare in remove
2020-03-28 | bpf: Fix build warning regarding missing prototypes | Jean-Philippe Menil | 1 | -0/+4
Fix build warnings when building net/bpf/test_run.o with W=1 due to missing prototype for bpf_fentry_test{1..6}. Instead of declaring prototypes, turn off warnings with __diag_{push,ignore,pop} as pointed out by Alexei. Signed-off-by: Jean-Philippe Menil <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
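Outside the kernel, the same "turn the warning off for a region" approach can be sketched with compiler pragmas. This is a userspace analogue under the assumption that GCC or Clang is used; it does not use the kernel's `__diag_*` macros, and the `demo_fentry_*` functions are hypothetical stand-ins.

```c
#include <assert.h>

/* Silence -Wmissing-prototypes only for the functions in this region. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmissing-prototypes"
int demo_fentry_test1(int a)
{
    return a + 1;
}

int demo_fentry_test2(int a, int b)
{
    return a + b;
}

/* Small usage example: (1 + 1) + (2 + 3) == 7. */
int demo_fentry_run(void)
{
    int r = demo_fentry_test1(1) + demo_fentry_test2(2, 3);

    assert(r == 7);
    return r;
}
#pragma GCC diagnostic pop
```

The push/pop pair keeps the suppression local, so functions defined after the region are still checked for missing prototypes.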
2020-03-28 | MIPS: ralink: mt7621: Fix soc_device introduction | Thomas Bogendoerfer | 2 | -2/+3
Depending on selected SMP config options soc_device didn't get initialised at all. With UP config vmlinux didn't link because of missing soc bus. Fixes: 71b9b5e0130d ("MIPS: ralink: mt7621: introduce 'soc_device' initialization") Signed-off-by: Thomas Bogendoerfer <[email protected]> Tested-by: René van Dorst <[email protected]>
2020-03-28 | Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi | Linus Torvalds | 2 | -3/+5
Pull SCSI fixes from James Bottomley:
 "Two small fixes: one in drivers (qla2xxx), and one in the core (sd) to try to cope with USB enclosures that silently change reported parameters"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: Fix optimal I/O size for devices that change reported values
  scsi: qla2xxx: Fix I/Os being passed down when FC device is being deleted
2020-03-28 | libbpf, xsk: Init all ring members in xsk_umem__create and xsk_socket__create | Fletcher Dunn | 1 | -2/+14
Fix a sharp edge in xsk_umem__create and xsk_socket__create. Almost all of the members of the ring buffer structs are initialized, but the "cached_xxx" variables are not all initialized. The caller is required to zero them. This is needlessly dangerous. The results if you don't do it can be very bad. For example, they can cause xsk_prod_nb_free and xsk_cons_nb_avail to return values greater than the size of the queue. xsk_ring_cons__peek can return an index that does not refer to an item that has been queued. I have confirmed that without this change, my program misbehaves unless I memset the ring buffers to zero before calling the function. Afterwards, my program works without (or with) the memset. Signed-off-by: Fletcher Dunn <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Magnus Karlsson <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
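The failure mode described here can be reproduced with a simplified model of a producer ring. This is a sketch only: `demo_ring` and its helpers are hypothetical and merely imitate the style of free-space calculation that trusts cached indices; they are not the libbpf structures or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical producer ring; not the libbpf struct. */
struct demo_ring {
    uint32_t cached_prod;  /* producer's local copy of its own index */
    uint32_t cached_cons;  /* cached consumer index, stored pre-offset by size */
    uint32_t size;         /* number of slots */
    uint32_t consumer;     /* shared consumer index (shared memory in real life) */
};

/* Initialise every member, including the cached ones. */
static void demo_ring_init(struct demo_ring *r, uint32_t size)
{
    assert(size != 0);
    r->size = size;
    r->consumer = 0;
    r->cached_prod = 0;
    r->cached_cons = r->consumer + size;
}

/* Free slots as seen by the producer; trusts the cache when it looks big enough. */
static uint32_t demo_ring_nb_free(struct demo_ring *r, uint32_t nb)
{
    uint32_t free_entries = r->cached_cons - r->cached_prod;

    if (free_entries >= nb)
        return free_entries;

    /* Cache looked too small: refresh it from the shared index. */
    r->cached_cons = r->consumer + r->size;
    return r->cached_cons - r->cached_prod;
}
```

With a properly initialised 8-slot ring, `demo_ring_nb_free()` reports 8 free slots. If `cached_cons` holds stale garbage instead, the fast path accepts it and reports more free slots than the ring has, which is the kind of misbehaviour the commit message describes.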
2020-03-28 | fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t | Thomas Gleixner | 4 | -26/+16
Bit spinlocks are problematic if PREEMPT_RT is enabled, because they disable preemption, which is undesired for latency reasons and breaks when regular spinlocks are taken within the bit_spinlock locked region because regular spinlocks are converted to 'sleeping spinlocks' on RT. PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this problem. The replacement was done conditionally at compile time, but Christoph requested to do an unconditional conversion. Jan suggested to move the spinlock into an existing padding hole which avoids a size increase of struct buffer_head on production kernels. As a benefit the lock gains lockdep coverage. [ bigeasy: Remove the wrapper and use always spinlock_t and move it into the padding hole ] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
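Why a lock placed in a padding hole costs no extra space can be shown with two toy structs. This is a sketch assuming a typical 64-bit ABI where `uint64_t` is 8-byte aligned and assuming a 4-byte lock word; the structs are hypothetical and unrelated to the real layout of struct buffer_head.

```c
#include <assert.h>
#include <stdint.h>

/* Layout with an implicit 4-byte hole after `count` on typical 64-bit ABIs. */
struct demo_head_before {
    uint64_t state;
    uint32_t count;
    /* 4 bytes of padding here so that `data` is 8-byte aligned */
    uint64_t data;
};

/* Same layout, but a 4-byte lock word now occupies the former hole. */
struct demo_head_after {
    uint64_t state;
    uint32_t count;
    uint32_t lock;   /* stand-in for a small lock word */
    uint64_t data;
};

/* Returns 1 when adding the lock did not grow the struct. */
static int demo_lock_fits_in_hole(void)
{
    /* Adding a member can never shrink the struct. */
    assert(sizeof(struct demo_head_after) >= sizeof(struct demo_head_before));
    return sizeof(struct demo_head_after) == sizeof(struct demo_head_before);
}
```

On ABIs with different alignment rules the hole may not exist, in which case the function returns 0 and the extra member does increase the size.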
2020-03-28 | thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t | Clark Williams | 1 | -12/+12
The pkg_temp_lock spinlock is acquired in the thermal vector handler which is truly atomic context even on PREEMPT_RT kernels. The critical sections are tiny, so change it to a raw spinlock. Signed-off-by: Clark Williams <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-03-28 | Documentation/locking/locktypes: Minor copy editor fixes | Randy Dunlap | 1 | -11/+11
Minor editorial fixes:
- remove 'enabled' from PREEMPT_RT enabled kernels for consistency
- add some periods for consistency
- add "'" for possessive CPU's
- spell out interrupts
[ tglx: Picked up Paul's suggestions ]
Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-28 | Documentation/locking/locktypes: Further clarifications and wordsmithing | Thomas Gleixner | 1 | -50/+98
The documentation of rw_semaphores is wrong as it claims that the non-owner reader release is not supported by RT. That's just history biased memory distortion. Split the 'Owner semantics' section up and add separate sections for semaphore and rw_semaphore to reflect reality. Aside from that, the following updates are done:
- Add pseudo code to document the spinlock state preserving mechanism on PREEMPT_RT
- Wordsmith the bitspinlock and lock nesting sections
Co-developed-by: Paul McKenney <[email protected]> Signed-off-by: Paul McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-28 | x86/boot/compressed: Fix debug_puthex() parameter type | Joerg Roedel | 1 | -1/+1
In the CONFIG_X86_VERBOSE_BOOTUP=Y case, the debug_puthex() macro just turns into __puthex(), which takes 'unsigned long' as parameter. But in the CONFIG_X86_VERBOSE_BOOTUP=N case, it is a function which takes 'unsigned char *', causing compile warnings when the function is used. Fix the parameter type to get rid of the warnings. Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-28 | Merge branch 'uaccess.futex' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into locking/core | Thomas Gleixner | 22 | -184/+94
Pull uaccess futex cleanups for Al Viro: Consolidate access_ok() usage and the futex uaccess function zoo.
2020-03-28 | Merge branch 'next.uaccess-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into x86/cleanups | Thomas Gleixner | 17 | -778/+401
Pull uaccess cleanups from Al Viro: Consolidate the user access areas and get rid of uaccess_try(), user_ex() and other warts.
2020-03-28 | m68knommu: Remove mm.h include from uaccess_no.h | Thomas Gleixner | 1 | -1/+0
In file included from include/linux/huge_mm.h:8,
 from include/linux/mm.h:567,
 from arch/m68k/include/asm/uaccess_no.h:8,
 from arch/m68k/include/asm/uaccess.h:3,
 from include/linux/uaccess.h:11,
 from include/linux/sched/task.h:11,
 from include/linux/sched/signal.h:9,
 from include/linux/rcuwait.h:6,
 from include/linux/percpu-rwsem.h:7,
 from kernel/locking/percpu-rwsem.c:6:
include/linux/fs.h:1422:29: error: array type has incomplete element type 'struct percpu_rw_semaphore'
 1422 | struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];
Removing the include of linux/mm.h from the uaccess header solves the problem and various build tests of nommu configurations still work.
Fixes: 80fbaf1c3f29 ("rcuwait: Add @state argument to rcuwait_wait_event()") Reported-by: kbuild test robot <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-28 | cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus() | Thomas Gleixner | 2 | -5/+11
A recent change to freeze_secondary_cpus() which added an early abort if a wakeup is pending missed the fact that the function is also invoked for shutdown, reboot and kexec via disable_nonboot_cpus(). In case of disable_nonboot_cpus() the wakeup event needs to be ignored as the purpose is to terminate the currently running kernel. Add a 'suspend' argument which is only set when the freeze is in the context of a suspend operation. If not set, then a possibly pending wakeup event is ignored. Fixes: a66d955e910a ("cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending") Reported-by: Boqun Feng <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Pavankumar Kondeti <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
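The control flow this fix describes can be modelled in a few lines of userspace C. It is a sketch only: the `demo_*` names and the global flag are hypothetical stand-ins for the kernel's functions and its wakeup query, and `-EBUSY` is an error code chosen for the demo.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the "is a wakeup pending?" query. */
static bool demo_wakeup_pending;

/*
 * Model of the freeze path: a pending wakeup aborts the operation only
 * when the freeze is part of a suspend. Shutdown, reboot and kexec pass
 * suspend == false and always proceed.
 */
static int demo_freeze_secondary_cpus(bool suspend)
{
    if (suspend && demo_wakeup_pending)
        return -EBUSY;   /* error code chosen for the demo */
    /* ... the non-boot CPUs would be taken offline here ... */
    return 0;
}

/* Shutdown-style caller: must never be stopped by a wakeup event. */
static int demo_disable_nonboot_cpus(void)
{
    int ret = demo_freeze_secondary_cpus(false);

    assert(ret == 0);
    return ret;
}
```

With a wakeup pending, the suspend path backs out while the shutdown path still completes, which is the behaviour the added argument is meant to restore.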
2020-03-28 | Revert "clocksource/drivers/timer-probe: Avoid creating dead devices" | Thomas Gleixner | 1 | -2/+0
This reverts commit 4f41fe386a94639cd9a1831298d4f85db5662f1e. The change breaks systems on which the DT node of a device is used by multiple drivers. The proposed workaround to clear OF_POPULATED is just a band aid and this needs to be cleaned up at the root of the problem. Revert this for now. Reported-by: Ionela Voinescu <[email protected]> Reported-by: Jon Hunter <[email protected]> Requested-by: Rob Herring <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Saravana Kannan <[email protected]> Cc: Daniel Lezcano <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-03-28 | MAINTAINERS: erofs: update my email address | Gao Xiang | 1 | -1/+1
This email address will not be available in a few days. Update my own email address to [email protected], which should be available all the time. Link: https://lore.kernel.org/r/[email protected] Acked-by: Chao Yu <[email protected]> Signed-off-by: Gao Xiang <[email protected]>
2020-03-28 | i2c: pca-platform: Use platform_irq_get_optional | Chris Packham | 1 | -1/+1
The interrupt is not required, so use platform_get_irq_optional() to avoid error messages like
  i2c-pca-platform 22080000.i2c: IRQ index 0 not found
Signed-off-by: Chris Packham <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
2020-03-28 | i2c: st: fix missing struct parameter description | Alain Volmat | 1 | -0/+1
Fix a missing struct parameter description to allow warning free W=1 compilation. Signed-off-by: Alain Volmat <[email protected]> Reviewed-by: Patrice Chotard <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
2020-03-27 | x86: get rid of user_atomic_cmpxchg_inatomic() | Al Viro | 2 | -94/+19
Only one user left; the thing had been made polymorphic back in 2013 for the sake of MPX. No point keeping it now that MPX is gone. Convert futex_atomic_cmpxchg_inatomic() to user_access_{begin,end}() while we are at it. Signed-off-by: Al Viro <[email protected]>
2020-03-27 | generic arch_futex_atomic_op_inuser() doesn't need access_ok() | Al Viro | 1 | -2/+0
uses get_user() and put_user() for memory accesses Signed-off-by: Al Viro <[email protected]>
2020-03-27 | x86: don't reload after cmpxchg in unsafe_atomic_op2() loop | Al Viro | 1 | -8/+8
lock cmpxchg leaves the current value in eax; no need to reload it. Signed-off-by: Al Viro <[email protected]>
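The same observation holds for C11 atomics, which makes for a compact userspace illustration. This is a sketch, not the kernel's inline assembly: `demo_atomic_or` is a hypothetical helper. On failure, `atomic_compare_exchange_weak()` writes the value it found into the expected-value variable, much as cmpxchg leaves the current value in eax, so the retry loop needs no separate reload.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically OR `bits` into *word and return the previous value. */
static uint32_t demo_atomic_or(_Atomic uint32_t *word, uint32_t bits)
{
    uint32_t old = atomic_load(word);   /* single initial load */

    /*
     * On failure, atomic_compare_exchange_weak() stores the value it
     * found into `old`, so the loop has nothing to reload.
     */
    while (!atomic_compare_exchange_weak(word, &old, old | bits))
        ;
    return old;
}

/* Small single-threaded usage example. */
static void demo_atomic_or_example(void)
{
    _Atomic uint32_t flags = 0x1;
    uint32_t prev = demo_atomic_or(&flags, 0x4);

    assert(prev == 0x1);
    assert(atomic_load(&flags) == 0x5);
}
```

Reloading inside the loop would still be correct, only redundant, which is why the commit is a pure cleanup.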
2020-03-27 | x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end() | Al Viro | 1 | -26/+36
Lift stac/clac pairs from __futex_atomic_op{1,2} into arch_futex_atomic_op_inuser(), fold them with access_ok() in there. The switch in arch_futex_atomic_op_inuser() is what has required the previous (objtool) commit... Signed-off-by: Al Viro <[email protected]>
2020-03-27 | objtool: whitelist __sanitizer_cov_trace_switch() | Al Viro | 1 | -0/+1
it's not really different from e.g. __sanitizer_cov_trace_cmp4(); as it is, the switches that generate an array of labels get rejected by objtool, while slightly different set of cases that gets compiled into a series of comparisons is accepted. Signed-off-by: Al Viro <[email protected]>
2020-03-27 | [parisc, s390, sparc64] no need for access_ok() in futex handling | Al Viro | 3 | -7/+0
access_ok() is always true on those Signed-off-by: Al Viro <[email protected]>
2020-03-27 | sh: no need of access_ok() in arch_futex_atomic_op_inuser() | Al Viro | 1 | -3/+0
everything it uses is doing access_ok() already Signed-off-by: Al Viro <[email protected]>
2020-03-27 | futex: arch_futex_atomic_op_inuser() calling conventions change | Al Viro | 20 | -57/+43
Move access_ok() in and pagefault_enable()/pagefault_disable() out. Mechanical conversion only - some instances don't really need a separate access_ok() at all (e.g. the ones only using get_user()/put_user(), or architectures where access_ok() is always true); we'll deal with that in followups. Signed-off-by: Al Viro <[email protected]>
2020-03-27 | Merge branch 'cgroup-helpers' | Alexei Starovoitov | 11 | -14/+336
Daniel Borkmann says: ==================== This adds various straight-forward helper improvements and additions to BPF cgroup based connect(), sendmsg(), recvmsg() and bind-related hooks which would allow to implement more fine-grained policies and improve current load balancer limitations we're seeing. For details please see individual patches. I've tested them on Kubernetes & Cilium and also added selftests for the small verifier extension. Thanks! ==================== Signed-off-by: Alexei Starovoitov <[email protected]>
2020-03-27 | bpf: Add selftest cases for ctx_or_null argument type | Daniel Borkmann | 1 | -0/+105
Add various tests to make sure the verifier keeps catching them:
  # ./test_verifier
  [...]
  #230/p pass ctx or null check, 1: ctx OK
  #231/p pass ctx or null check, 2: null OK
  #232/p pass ctx or null check, 3: 1 OK
  #233/p pass ctx or null check, 4: ctx - const OK
  #234/p pass ctx or null check, 5: null (connect) OK
  #235/p pass ctx or null check, 6: null (bind) OK
  #236/p pass ctx or null check, 7: ctx (bind) OK
  #237/p pass ctx or null check, 8: null (bind) OK
  [...]
  Summary: 1595 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/c74758d07b1b678036465ef7f068a49e9efd3548.1585323121.git.daniel@iogearbox.net
2020-03-27 | bpf: Enable retrival of pid/tgid/comm from bpf cgroup hooks | Daniel Borkmann | 1 | -0/+8
We already have the bpf_get_current_uid_gid() helper enabled, and given we now have perf event RB output available for connect(), sendmsg(), recvmsg() and bind-related hooks, add a trivial change to enable bpf_get_current_pid_tgid() and bpf_get_current_comm() as well. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/18744744ed93c06343be8b41edcfd858706f39d7.1585323121.git.daniel@iogearbox.net
2020-03-27 | bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id | Daniel Borkmann | 6 | -2/+72
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(), recvmsg() and bind-related hooks in order to retrieve the cgroup v2 context which can then be used as part of the key for BPF map lookups, for example. Given these hooks operate in process context 'current' is always valid and pointing to the app that is performing mentioned syscalls if it's subject to a v2 cgroup. Also with same motivation of commit 7723628101aa ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper") enable retrieval of ancestor from current so the cgroup id can be used for policy lookups which can then forbid connect() / bind(), for example. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
2020-03-27 | bpf: Allow to retrieve cgroup v1 classid from v2 hooks | Daniel Borkmann | 2 | -1/+27
Today, Kubernetes is still operating on cgroups v1, however, it is possible to retrieve the task's classid based on 'current' out of connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators which attach to the root cgroup v2 hook in a mixed env like in case of Cilium, for example, in order to then correlate certain pod traffic and use it as part of the key for BPF map lookups. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
2020-03-27 | bpf: Add netns cookie and enable it for bpf cgroup hooks | Daniel Borkmann | 7 | -8/+103
In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness.
In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also have the need to differentiate between the initial network namespace and non-initial ones. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not translated from connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic.
On top of determining whether we're in the initial or a non-initial network namespace we also have a need for a socket-cookie like mechanism scoped to network namespaces. Socket cookies have the nice property that they can be combined as part of the key structure e.g. for BPF LRU maps without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services.
Therefore, add a new bpf_get_netns_cookie() helper which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would provide the cookie for the initial network namespace while passing the context instead of NULL would provide the cookie from the application's network namespace. We're using a hole, so no size increase; the assignment happens only once. Therefore this allows for a comparison on initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as well, as the need arises.
(*) Both externalTrafficPolicy={Local|Cluster} types
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27 | bpf: Enable perf event rb output for bpf cgroup progs | Daniel Borkmann | 1 | -5/+9
Currently, connect(), sendmsg(), recvmsg() and bind-related hooks are all lacking perf event rb output in order to push notifications or monitoring events up to user space. Back in commit a5a3a828cd00 ("bpf: add perf event notificaton support for sock_ops"), I've worked with Sowmini to enable them for sock_ops where the context part is not used (as opposed to skbs for example where the packet data can be appended). Make the bpf_sockopt_event_output() helper generic and enable it for mentioned hooks. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/69c39daf87e076b31e52473c902e9bfd37559124.1585323121.git.daniel@iogearbox.net
2020-03-27 | bpf: Enable retrieval of socket cookie for bind/post-bind hook | Daniel Borkmann | 1 | -0/+14
We currently make heavy use of the socket cookie in BPF's connect(), sendmsg() and recvmsg() hooks for load-balancing decisions. However, it is currently not enabled/implemented in BPF {post-}bind hooks where it can later be used in combination for correlation in the tc egress path, for example. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/e9d71f310715332f12d238cc650c1edc5be55119.1585323121.git.daniel@iogearbox.net
2020-03-27 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | David S. Miller | 4 | -20/+25
Daniel Borkmann says:
====================
pull-request: bpf 2020-03-27
The following pull-request contains BPF updates for your *net* tree. We've added 3 non-merge commits during the last 4 day(s) which contain a total of 4 files changed, 25 insertions(+), 20 deletions(-).
The main changes are:
1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid having to rely on compiler to do so. Issues have been noticed on some compilers with padding and other oddities where the request was then unexpectedly rejected, from Greg Kroah-Hartman.
2) Sanitize the bpf_struct_ops TCP congestion control name in order to avoid problematic characters such as whitespaces, from Martin KaFai Lau.
====================
Signed-off-by: David S. Miller <[email protected]>