Age | Commit message (Collapse) | Author | Files | Lines |
|
The bitmap set does not support for expressions, skip it from the
estimation step.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
If the global set expression definition mismatches the dynset
expression, then bail out.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Otherwise, nft_lookup might dereference an uninitialized pointer to the
element extension.
Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
When CONFIG_NF_CONNTRACK_MARK is not set, any CTA_MARK or CTA_MARK_MASK
in netlink message are not supported. We should return an error when one
of them is set, not both
Fixes: 9306425b70bf ("netfilter: ctnetlink: must check mark attributes vs NULL")
Signed-off-by: Romain Bellan <[email protected]>
Signed-off-by: Florent Fourcot <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
KP Singh says:
====================
** Motivation
Google does analysis of rich runtime security data to detect and thwart
threats in real-time. Currently, this is done in custom kernel modules
but we would like to replace this with something that's upstream and
useful to others.
The current kernel infrastructure for providing telemetry (Audit, Perf
etc.) is disjoint from access enforcement (i.e. LSMs). Augmenting the
information provided by audit requires kernel changes to audit, its
policy language and user-space components. Furthermore, building a MAC
policy based on the newly added telemetry data requires changes to
various LSMs and their respective policy languages.
This patchset allows BPF programs to be attached to LSM hooks This
facilitates a unified and dynamic (not requiring re-compilation of the
kernel) audit and MAC policy.
** Why an LSM?
Linux Security Modules target security behaviours rather than the
kernel's API. For example, it's easy to miss out a newly added system
call for executing processes (eg. execve, execveat etc.) but the LSM
framework ensures that all process executions trigger the relevant hooks
irrespective of how the process was executed.
Allowing users to implement LSM hooks at runtime also benefits the LSM
eco-system by enabling a quick feedback loop from the security community
about the kind of behaviours that the LSM Framework should be targeting.
** How does it work?
The patchset introduces a new eBPF (https://docs.cilium.io/en/v1.6/bpf/)
program type BPF_PROG_TYPE_LSM which can only be attached to LSM hooks.
Loading and attachment of BPF programs requires CAP_SYS_ADMIN.
The new LSM registers nop functions (bpf_lsm_<hook_name>) as LSM hook
callbacks. Their purpose is to provide a definite point where BPF
programs can be attached as BPF_TRAMP_MODIFY_RETURN trampoline programs
for hooks that return an int, and BPF_TRAMP_FEXIT trampoline programs
for void LSM hooks.
Audit logs can be written using a format chosen by the eBPF program to
the perf events buffer or to global eBPF variables or maps and can be
further processed in user-space.
** BTF Based Design
The current design uses BTF:
* https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html
* https://lwn.net/Articles/803258
which allows verifiable read-only structure accesses by field names
rather than fixed offsets. This allows accessing the hook parameters
using a dynamically created context which provides a certain degree of
ABI stability:
// Only declare the structure and fields intended to be used
// in the program
struct vm_area_struct {
unsigned long vm_start;
} __attribute__((preserve_access_index));
// Declare the eBPF program mprotect_audit which attaches to
// to the file_mprotect LSM hook and accepts three arguments.
SEC("lsm/file_mprotect")
int BPF_PROG(mprotect_audit, struct vm_area_struct *vma,
unsigned long reqprot, unsigned long prot, int ret)
{
unsigned long vm_start = vma->vm_start;
return 0;
}
By relocating field offsets, BTF makes a large portion of kernel data
structures readily accessible across kernel versions without requiring a
large corpus of BPF helper functions and requiring recompilation with
every kernel version. The BTF type information is also used by the BPF
verifier to validate memory accesses within the BPF program and also
prevents arbitrary writes to the kernel memory.
The limitations of BTF compatibility are described in BPF Co-Re
(http://vger.kernel.org/bpfconf2019_talks/bpf-core.pdf, i.e. field
renames, #defines and changes to the signature of LSM hooks). This
design imposes that the MAC policy (eBPF programs) be updated when the
inspected kernel structures change outside of BTF compatibility
guarantees. In practice, this is only required when a structure field
used by a current policy is removed (or renamed) or when the used LSM
hooks change. We expect the maintenance cost of these changes to be
acceptable as compared to the design presented in the RFC.
(https://lore.kernel.org/bpf/[email protected]/).
** Usage Examples
A simple example and some documentation is included in the patchset.
In order to better illustrate the capabilities of the framework some
more advanced prototype (not-ready for review) code has also been
published separately:
* Logging execution events (including environment variables and
arguments)
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c
* Detecting deletion of running executables:
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_detect_exec_unlink.c
* Detection of writes to /proc/<pid>/mem:
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c
We have updated Google's internal telemetry infrastructure and have
started deploying this LSM on our Linux Workstations. This gives us more
confidence in the real-world applications of such a system.
** Changelog:
- v8 -> v9:
https://lore.kernel.org/bpf/[email protected]/
* Fixed a selftest crash when CONFIG_LSM doesn't have "bpf".
* Added James' Ack.
* Rebase.
- v7 -> v8:
https://lore.kernel.org/bpf/[email protected]/
* Removed CAP_MAC_ADMIN check from bpf_lsm_verify_prog. LSMs can add it
in their own bpf_prog hook. This can be revisited as a separate patch.
* Added Andrii and James' Ack/Review tags.
* Fixed an indentation issue and missing newlines in selftest error
a cases.
* Updated a comment as suggested by Alexei.
* Updated the documentation to use the newer libbpf API and some other
fixes.
* Rebase
- v6 -> v7:
https://lore.kernel.org/bpf/[email protected]/
* Removed __weak from the LSM attachment nops per Kees' suggestion.
Will send a separate patch (if needed) to update the noinline
definition in include/linux/compiler_attributes.h.
* waitpid to wait specifically for the forked child in selftests.
* Comment format fixes in security/... as suggested by Casey.
* Added Acks from Kees and Andrii and Casey's Reviewed-by: tags to
the respective patches.
* Rebase
- v5 -> v6:
https://lore.kernel.org/bpf/[email protected]/
* Updated LSM_HOOK macro to define a default value and cleaned up the
BPF LSM hook declarations.
* Added Yonghong's Acks and Kees' Reviewed-by tags.
* Simplification of the selftest code.
* Rebase and fixes suggested by Andrii and Yonghong and some other minor
fixes noticed in internal review.
- v4 -> v5:
https://lore.kernel.org/bpf/[email protected]/
* Removed static keys and special casing of BPF calls from the LSM
framework.
* Initialized the BPF callbacks (nops) as proper LSM hooks.
* Updated to using the newly introduced BPF_TRAMP_MODIFY_RETURN
trampolines in https://lkml.org/lkml/2020/3/4/877
* Addressed Andrii's feedback and rebased.
- v3 -> v4:
* Moved away from allocating a separate security_hook_heads and adding a
new special case for arch_prepare_bpf_trampoline to using BPF fexit
trampolines called from the right place in the LSM hook and toggled by
static keys based on the discussion in:
https://lore.kernel.org/bpf/CAG48ez25mW+_oCxgCtbiGMX07g_ph79UOJa07h=o_6B6+Q-u5g@mail.gmail.com/
* Since the code does not deal with security_hook_heads anymore, it goes
from "being a BPF LSM" to "BPF program attachment to LSM hooks".
* Added a new test case which ensures that the BPF programs' return value
is reflected by the LSM hook.
- v2 -> v3 does not change the overall design and has some minor fixes:
* LSM_ORDER_LAST is introduced to represent the behaviour of the BPF LSM
* Fixed the inadvertent clobbering of the LSM Hook error codes
* Added GPL license requirement to the commit log
* The lsm_hook_idx is now the more conventional 0-based index
* Some changes were split into a separate patch ("Load btf_vmlinux only
once per object")
https://lore.kernel.org/bpf/[email protected]/
* Addressed Andrii's feedback on the BTF implementation
* Documentation update for using generated vmlinux.h to simplify
programs
* Rebase
- Changes since v1:
https://lore.kernel.org/bpf/[email protected]
* Eliminate the requirement to maintain LSM hooks separately in
security/bpf/hooks.h Use BPF trampolines to dynamically allocate
security hooks
* Drop the use of securityfs as bpftool provides the required
introspection capabilities. Update the tests to use the bpf_skeleton
and global variables
* Use O_CLOEXEC anonymous fds to represent BPF attachment in line with
the other BPF programs with the possibility to use bpf program pinning
in the future to provide "permanent attachment".
* Drop the logic based on prog names for handling re-attachment.
* Drop bpf_lsm_event_output from this series and send it as a separate
patch.
====================
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Document how eBPF programs (BPF_PROG_TYPE_LSM) can be loaded and
attached (BPF_LSM_MAC) to the LSM hooks.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Thomas Garnier <[email protected]>
Reviewed-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
* Load/attach a BPF program that hooks to file_mprotect (int)
and bprm_committed_creds (void).
* Perform an action that triggers the hook.
* Verify if the audit event was received using the shared global
variables for the process executed.
* Verify if the mprotect returns a -EPERM.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Thomas Garnier <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Since BPF_PROG_TYPE_LSM uses the same attaching mechanism as
BPF_PROG_TYPE_TRACING, the common logic is refactored into a static
function bpf_program__attach_btf_id.
A new API call bpf_program__attach_lsm is still added to avoid userspace
conflicts if this ever changes in the future.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
* The hooks are initialized using the definitions in
include/linux/lsm_hook_defs.h.
* The LSM can be enabled / disabled with CONFIG_BPF_LSM.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Acked-by: Kees Cook <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
JITed BPF programs are dynamically attached to the LSM hooks
using BPF trampolines. The trampoline prologue generates code to handle
conversion of the signature of the hook to the appropriate BPF context.
The allocated trampoline programs are attached to the nop functions
initialized as LSM hooks.
BPF_PROG_TYPE_LSM programs must have a GPL compatible license and
and need CAP_SYS_ADMIN (required for loading eBPF programs).
Upon attachment:
* A BPF fexit trampoline is used for LSM hooks with a void return type.
* A BPF fmod_ret trampoline is used for LSM hooks which return an
int. The attached programs can override the return value of the
bpf LSM hook to indicate a MAC Policy decision.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are
generated for each LSM hook. These functions are initialized as LSM
hooks in a subsequent patch.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
The information about the different types of LSM hooks is scattered
in two locations i.e. union security_list_options and
struct security_hook_heads. Rather than duplicating this information
even further for BPF_PROG_TYPE_LSM, define all the hooks with the
LSM_HOOK macro in lsm_hook_defs.h which is then used to generate all
the data structures required by the LSM framework.
The LSM hooks are defined as:
LSM_HOOK(<return_type>, <default_value>, <hook_name>, args...)
with <default_value> acccessible in security.c as:
LSM_RET_DEFAULT(<hook_name>)
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Casey Schaufler <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Introduce types and configs for bpf programs that can be attached to
LSM hooks. The programs can be enabled by the config option
CONFIG_BPF_LSM.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Thomas Garnier <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This adds a test to exercise the new bpf_map__set_initial_value() function.
The test simply overrides the global data section with all zeroes, and
checks that the new value makes it into the kernel map on load.
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
For internal maps (most notably the maps backing global variables), libbpf
uses an internal mmaped area to store the data after opening the object.
This data is subsequently copied into the kernel map when the object is
loaded.
This adds a function to set a new value for that data, which can be used to
before it is loaded into the kernel. This is especially relevant for RODATA
maps, since those are frozen on load.
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Fix a redefinition of 'net_gen_cookie' error that was overlooked
when net ns is not configured.
Fixes: f318903c0bf4 ("bpf: Add netns cookie and enable it for bpf cgroup hooks")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
|
|
To 2.26
Signed-off-by: Steve French <[email protected]>
|
|
When encryption is used, smb2_transform_hdr is defined on the stack and is
passed to the transport. This doesn't work with RDMA as the buffer needs to
be DMA'ed.
Fix it by using kmalloc.
Signed-off-by: Long Li <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
When a RDMA packet is received and server is extending send credits, we should
check and unblock senders immediately in IRQ context. Doing it in a worker
queue causes unnecessary delay and doesn't save much CPU on the receive path.
Signed-off-by: Long Li <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
SMBDirect send/receive
The packet size needs to take account of SMB2 header size and possible
encryption header size. This is only done when signing is used and it is for
RDMA send/receive, not read/write.
Also remove the dead SMBD code in smb2_negotiate_r(w)size.
Signed-off-by: Long Li <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- Second batch of the GICv4.1 support saga
- Level triggered interrupt support for the stm32 controller
- Versatile-fpga chained interrupt fixes
- DT support for cascaded VIC interrupt controller
- RPi irqchip initialization fixes
- Multi-instance support for the Xilinx interrupt controller
- Multi-instance support for the PLIC interrupt controller
- CPU hotplug support for the PLIC interrupt controller
- Ingenic X1000 TCU support
- Small fixes all over the shop (GICv3, GICv4, Xilinx, Atmel, sa1111)
- Cleanups (setup_irq removal, zero-length array removal)
|
|
request_irq() is preferred over setup_irq(). Invocations of setup_irq()
occur after memory allocators are ready.
setup_irq() was required in older kernels as the memory allocator was not
available during early boot.
Hence replace setup_irq() by request_irq().
Signed-off-by: afzal mohammed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/82667ae23520611b2a9d8db77e1d8aeb982f08e5.1585320721.git.afzal.mohd.ma@gmail.com
|
|
request_irq() is preferred over setup_irq(). Invocations of setup_irq()
occur after memory allocators are ready.
setup_irq() was required in older kernels as the memory allocator was not
available during early boot.
Hence replace setup_irq() by request_irq().
Signed-off-by: afzal mohammed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/b060312689820559121ee0a6456bbc1202fb7ee5.1585320721.git.afzal.mohd.ma@gmail.com
|
|
request_irq() is preferred over setup_irq(). Invocations of setup_irq()
occur after memory allocators are ready.
setup_irq() was required in older kernels as the memory allocator was not
available during early boot.
Hence replace setup_irq() by request_irq().
Signed-off-by: afzal mohammed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/e84ac60de8f747d49ce082659e51595f708c29d4.1585320721.git.afzal.mohd.ma@gmail.com
|
|
request_irq() is preferred over setup_irq(). Invocations of setup_irq()
occur after memory allocators are ready.
setup_irq() was required in older kernels as the memory allocator was not
available during early boot.
Hence replace setup_irq() by request_irq().
Signed-off-by: afzal mohammed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/56e991e920ce5806771fab892574cba89a3d413f.1585320721.git.afzal.mohd.ma@gmail.com
|
|
request_irq() is preferred over setup_irq(). Invocations of setup_irq()
occur after memory allocators are ready.
setup_irq() was required in older kernels as the memory allocator was not
available during early boot.
Hence replace setup_irq() by request_irq().
Signed-off-by: afzal mohammed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Matt Turner <[email protected]>
Link: https://lkml.kernel.org/r/51f8ae7da9f47a23596388141933efa2bdef317b.1585320721.git.afzal.mohd.ma@gmail.com
|
|
Merge vm fixes from Andrew Morton:
"5 fixes"
* emailed patches from Andrew Morton <[email protected]>:
mm/sparse: fix kernel crash with pfn_section_valid check
mm: fork: fix kernel_stack memcg stats for various stack implementations
hugetlb_cgroup: fix illegal access to memory
drivers/base/memory.c: indicate all memory blocks as removable
mm/swapfile.c: move inode_lock out of claim_swapfile
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single fix for the Hyper-V clocksource driver to make sched clock
actually return nanoseconds and not the virtual clock value which
increments at 10e7 HZ (100ns)"
* tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/hyper-v: Make sched clock return nanoseconds correctly
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single bugfix to prevent reference leaks in irq affinity notifiers"
* tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Fix reference leaks on irq affinity notifiers
|
|
Fix the crash like this:
BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc000000000c3447c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
...
NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
LR [c000000000088354] vmemmap_free+0x144/0x320
Call Trace:
section_deactivate+0x220/0x240
__remove_pages+0x118/0x170
arch_remove_memory+0x3c/0x150
memunmap_pages+0x1cc/0x2f0
devm_action_release+0x30/0x50
release_nodes+0x2f8/0x3e0
device_release_driver_internal+0x168/0x270
unbind_store+0x130/0x170
drv_attr_store+0x44/0x60
sysfs_kf_write+0x68/0x80
kernfs_fop_write+0x100/0x290
__vfs_write+0x3c/0x70
vfs_write+0xcc/0x240
ksys_write+0x7c/0x140
system_call+0x5c/0x68
The crash is due to NULL dereference at
test_bit(idx, ms->usage->subsection_map);
due to ms->usage = NULL in pfn_section_valid()
With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
depopulate_section_mem(). This was done so that pfn_page() can work
correctly with kernel config that disables SPARSEMEM_VMEMMAP. With that
config pfn_to_page does
__section_mem_map_addr(__sec) + __pfn;
where
static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
unsigned long map = section->section_mem_map;
map &= SECTION_MAP_MASK;
return (struct page *)map;
}
Now with SPASEMEM_VMEMAP enabled, mem_section->usage->subsection_map is
used to check the pfn validity (pfn_valid()). Since section_deactivate
release mem_section->usage if a section is fully deactivated,
pfn_valid() check after a subsection_deactivate cause a kernel crash.
static inline int pfn_valid(unsigned long pfn)
{
...
return early_section(ms) || pfn_section_valid(ms, pfn);
}
where
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
int idx = subsection_map_index(pfn);
return test_bit(idx, ms->usage->subsection_map);
}
Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
freed. For architectures like ppc64 where large pages are used for
vmmemap mapping (16MB), a specific vmemmap mapping can cover multiple
sections. Hence before a vmemmap mapping page can be freed, the kernel
needs to make sure there are no valid sections within that mapping.
Clearing the section valid bit before depopulate_section_memap enables
this.
[[email protected]: add comment]
Link: http://lkml.kernel.org/r/[email protected]: http://lkml.kernel.org/r/[email protected]
Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
Reported-by: Sachin Sant <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Pankaj Gupta <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().
In the first and the second cases page->mem_cgroup pointer is set, but
in the third it's not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg . In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
page->mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.
It can lead to kernel_stack per-memcg counters permanently showing 0 on
some architectures (depending on the configuration).
In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as a first argument, uses
mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
mod_memcg_state(). It allows to handle all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c .
Note: This is a special version of the patch created for stable
backports. It contains code from the following two patches:
- mm: memcg/slab: introduce mem_cgroup_from_obj()
- mm: fork: fix kernel_stack memcg stats for various stack implementations
[[email protected]: introduce mem_cgroup_from_obj()]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This appears to be a mistake in commit faced7e0806cf ("mm: hugetlb
controller for cgroups v2").
Essentially that commit does a hugetlb_cgroup_from_counter assuming that
page_counter_try_charge has initialized counter.
But if that has failed then it seems will not initialize counter, so
hugetlb_cgroup_from_counter(counter) ends up pointing to random memory,
causing kasan to complain.
The solution is to simply use 'h_cg', instead of
hugetlb_cgroup_from_counter(counter), since that is a reference to the
hugetlb_cgroup anyway. After this change kasan ceases to complain.
Fixes: faced7e0806cf ("mm: hugetlb controller for cgroups v2")
Reported-by: [email protected]
Signed-off-by: Mina Almasry <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Giuseppe Scrivano <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We see multiple issues with the implementation/interface to compute
whether a memory block can be offlined (exposed via
/sys/devices/system/memory/memoryX/removable) and would like to simplify
it (remove the implementation).
1. It runs basically lockless. While this might be good for performance,
we see possible races with memory offlining that will require at
least some sort of locking to fix.
2. Nowadays, more false positives are possible. No arch-specific checks
are performed that validate if memory offlining will not be denied
right away (and such check will require locking). For example, arm64
won't allow to offline any memory block that was added during boot -
which will imply a very high error rate. Other archs have other
constraints.
3. The interface is inherently racy. E.g., if a memory block is detected
to be removable (and was not a false positive at that time), there is
still no guarantee that offlining will actually succeed. So any
caller already has to deal with false positives.
4. It is unclear which performance benefit this interface actually
provides. The introducing commit 5c755e9fd813 ("memory-hotplug: add
sysfs removable attribute for hotplug memory remove") mentioned
"A user-level agent must be able to identify which sections
of memory are likely to be removable before attempting the
potentially expensive operation."
However, no actual performance comparison was included.
Known users:
- lsmem: Will group memory blocks based on the "removable" property. [1]
- chmem: Indirect user. It has a RANGE mode where one can specify
removable ranges identified via lsmem to be offlined. However,
it also has a "SIZE" mode, which allows a sysadmin to skip the
manual "identify removable blocks" step. [2]
- powerpc-utils: Uses the "removable" attribute to skip some memory
blocks right away when trying to find some to offline+remove.
However, with ballooning enabled, it already skips this
information completely (because it once resulted in many false
negatives). Therefore, the implementation can deal with false
positives properly already. [3]
According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer
driven from userspace via the drmgr command (powerpc-utils). Nowadays
it's managed in the kernel - including onlining/offlining of memory
blocks - triggered by drmgr writing to /sys/kernel/dlpar. So the
affected legacy userspace handling is only active on old kernels. Only
very old versions of drmgr on a new kernel (unlikely) might execute
slower - totally acceptable.
With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
break any user space tool. We implement a very bad heuristic now.
Without CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report
"not removable" as before.
Original discussion can be found in [4] ("[PATCH RFC v1] mm:
is_mem_section_removable() overhaul").
Other users of is_mem_section_removable() will be removed next, so that
we can remove is_mem_section_removable() completely.
[1] http://man7.org/linux/man-pages/man1/lsmem.1.html
[2] http://man7.org/linux/man-pages/man8/chmem.8.html
[3] https://github.com/ibm-power-utilities/powerpc-utils
[4] https://lkml.kernel.org/r/[email protected]
Also, this patch probably fixes a crash reported by Steve.
http://lkml.kernel.org/r/CAPcyv4jpdaNvJ67SkjyUJLBnBnXXQv686BiVW042g03FUmWLXw@mail.gmail.com
Reported-by: "Scargall, Steve" <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Nathan Fontenot <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Badari Pulavarty <[email protected]>
Cc: Robert Jennings <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Karel Zak <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
claim_swapfile() currently keeps the inode locked when it is successful,
or the file is already swapfile (with -EBUSY). And, on the other error
cases, it does not lock the inode.
This inconsistency of the lock state and return value is quite confusing
and actually causing a bad unlock balance as below in the "bad_swap"
section of __do_sys_swapon().
This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE
check out of claim_swapfile(). The inode is unlocked in
"bad_swap_unlock_inode" section, so that the inode is ensured to be
unlocked at "bad_swap". Thus, error handling codes after the locking now
jumps to "bad_swap_unlock_inode" instead of "bad_swap".
=====================================
WARNING: bad unlock balance detected!
5.5.0-rc7+ #176 Not tainted
-------------------------------------
swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
but there are no more locks to release!
other info that might help us debug this:
no locks held by swapon/4294.
stack backtrace:
CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
Call Trace:
dump_stack+0xa1/0xea
print_unlock_imbalance_bug.cold+0x114/0x123
lock_release+0x562/0xed0
up_write+0x2d/0x490
__do_sys_swapon+0x94b/0x3550
__x64_sys_swapon+0x54/0x80
do_syscall_64+0xa4/0x4b0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f15da0a0dc7
Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices")
Signed-off-by: Naohiro Aota <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Qais Youef <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch fixes follwoing warning:
block/blk-core.c: In function ‘blk_alloc_queue’:
block/blk-core.c:558:10: warning: returning ‘int’ from a function with return type ‘struct request_queue *’ makes pointer from integer without a cast [-Wint-conversion]
return -EINVAL;
Fixes: 3d745ea5b095a ("block: simplify queue allocation")
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Instead of dropping refs+kfree, use the helper added in previous patch.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
nf_queue is problematic when another NF_QUEUE invocation happens
from nf_reinject().
1. nf_queue is invoked, increments state->sk refcount.
2. skb is queued, waiting for verdict.
3. sk is closed/released.
3. verdict comes back, nf_reinject is called.
4. nf_reinject drops the reference -- refcount can now drop to 0
Instead of get_ref/release_ref pattern, we need to nest the get_ref calls:
get_ref
get_ref
release_ref
release_ref
So that when we invoke the next processing stage (another netfilter
or the okfn()), we hold at least one reference count on the
devices/socket.
After previous patch, it is now safe to put the entry even after okfn()
has potentially free'd the skb.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
The refcount is done via entry->skb, which does work fine.
Major problem: When putting the refcount of the bridge ports, we
must always put the references while the skb is still around.
However, we will need to put the references after okfn() to avoid
a possible 1 -> 0 -> 1 refcount transition, so we cannot use the
skb pointer anymore.
Place the physports in the queue entry structure instead to allow
for refcounting changes in the next patch.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
This is a preparation patch, no logical changes.
Move free_entry into core and rename it to something more sensible.
Will ease followup patches which will complicate the refcount handling.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
The target outputmakefile is used to generate a Makefile
for out-of-tree builds and does not depend on the kernel
configuration.
Signed-off-by: David Engraf <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
As commit 5ef872636ca7 ("kbuild: get rid of misleading $(AS) from
documents") noted, we rarely use $(AS) directly in the kernel build.
Now that the only/last user of $(AS) in drivers/net/wan/Makefile was
converted to $(CC), $(AS) is no longer used in the build process.
You can still pass in AS=clang, which is just a switch to turn on
the LLVM integrated assembler.
Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
|
|
Split the big recipe into 3 stages: compile, link, and hexdump.
After this commit, the build log with CONFIG_WANXL_BUILD_FIRMWARE
will look like this:
M68KAS drivers/net/wan/wanxlfw.o
M68KLD drivers/net/wan/wanxlfw.bin
BLDFW drivers/net/wan/wanxlfw.inc
CC [M] drivers/net/wan/wanxl.o
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
The firmware source, wanxlfw.S, is currently compiled by the combo of
$(CPP) and $(M68KAS). This is not what we usually do for compiling *.S
files. In fact, this Makefile is the only user of $(AS) in the kernel
build.
Instead of combining $(CPP) and (AS) from different tool sets, using
$(M68KCC) as an assembler driver is simpler, and saner.
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
As far as I understood from the Kconfig help text, this build rule is
used to rebuild the driver firmware, which runs on an old m68k-based
chip. So, you need m68k tools for the firmware rebuild.
wanxl.c is a PCI driver, but CONFIG_M68K does not select CONFIG_HAVE_PCI.
So, you cannot enable CONFIG_WANXL_BUILD_FIRMWARE for ARCH=m68k. In other
words, ifeq ($(ARCH),m68k) is false here.
I am keeping the dead code for now, but rebuilding the firmware requires
'as68k' and 'ld68k', which I do not have in hand.
Instead, the kernel.org m68k GCC [1] successfully built it.
Allowing a user to pass in CROSS_COMPILE_M68K= is handier.
[1] https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/9.2.0/x86_64-gcc-9.2.0-nolibc-m68k-linux.tar.xz
Suggested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
GNU Make commit 8c888d95f618 ("[SV 8297] Implement "grouped targets"
for explicit rules.") added the '&:' syntax.
I think '&:' is a perfect fit here, but we cannot use it any time
soon. Just add a TODO comment.
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
Add -Wall to catch more warnings for C++ host programs.
When I submitted the previous version, the 0-day bot reported
-Wc++11-compat warnings for old GCC:
HOSTCXX -fPIC scripts/gcc-plugins/latent_entropy_plugin.o
In file included from /usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/tm.h:28:0,
from scripts/gcc-plugins/gcc-common.h:15,
from scripts/gcc-plugins/latent_entropy_plugin.c:78:
/usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/config/elfos.h:102:21: warning: C++11 requires a space between string literal and macro [-Wc++11-compat]
fprintf ((FILE), "%s"HOST_WIDE_INT_PRINT_UNSIGNED"\n",\
^
/usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/config/elfos.h:170:24: warning: C++11 requires a space between string literal and macro [-Wc++11-compat]
fprintf ((FILE), ","HOST_WIDE_INT_PRINT_UNSIGNED",%u\n", \
^
In file included from /usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/tm.h:42:0,
from scripts/gcc-plugins/gcc-common.h:15,
from scripts/gcc-plugins/latent_entropy_plugin.c:78:
/usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/defaults.h:126:24: warning: C++11 requires a space between string literal and macro [-Wc++11-compat]
fprintf ((FILE), ","HOST_WIDE_INT_PRINT_UNSIGNED",%u\n", \
^
The source of the warnings is in the plugin headers, so we have no
control of it. I just suppressed them by adding -Wno-c++11-compat to
scripts/gcc-plugins/Makefile.
Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Kees Cook <[email protected]>
|
|
If this file were compiled with -Wall, the following warning would be
reported:
scripts/kconfig/qconf.cc:312:6: warning: unused variable ‘i’ [-Wunused-variable]
int i;
^
The commit prepares to turn on -Wall for C++ host programs.
Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
|
|
Commit:
ec93fc371f014a6f ("efi/libstub: Add support for loading the initrd from a device path")
added a diagnostic print to the ARM version of the EFI stub that
reports whether an initrd has been loaded that was passed
via the command line using initrd=.
However, it failed to take into account that, for historical reasons,
the file loading routines return EFI_SUCCESS when no file was found,
and the only way to decide whether a file was loaded is to inspect
the 'size' argument that is passed by reference. So let's inspect
this returned size, to prevent the print from being emitted even if
no initrd was loaded at all.
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
|
|
Commit:
9f9223778ef3 ("efi/libstub/arm: Make efi_entry() an ordinary PE/COFF entrypoint")
did some code refactoring to get rid of the EFI entry point assembler
code, and in the process, it got rid of the assignment of image_addr
to the value of _text. Instead, it switched to using the image_base
field of the efi_loaded_image struct provided by UEFI, which should
contain the same value.
However, Michael reports that this is not the case: older GRUB builds
corrupt this value in some way, and since we can easily switch back to
referring to _text to discover this value, let's simply do that.
While at it, fix another issue in commit 9f9223778ef3, which may result
in the unassigned image_addr to be misidentified as the preferred load
offset of the kernel, which is unlikely but will cause a boot crash if
it does occur.
Finally, let's add a warning if the _text vs. image_base discrepancy is
detected, so we can tell more easily how widespread this issue actually
is.
Reported-by: Michael Kelley <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
|