aboutsummaryrefslogtreecommitdiff
path: root/mm/percpu.c
AgeCommit message (Collapse)AuthorFilesLines
2020-10-30percpu: convert flexible array initializers to use struct_size()Dennis Zhou1-4/+4
Use the safer macro as sparked by the long discussion in [1]. [1] https://lore.kernel.org/lkml/[email protected]/ Reviewed-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Dennis Zhou <[email protected]>
2020-10-18mm: kmem: move memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current()Roman Gushchin1-2/+1
Patch series "mm: kmem: kernel memory accounting in an interrupt context". This patchset implements memcg-based memory accounting of allocations made from an interrupt context. Historically, such allocations were passed unaccounted mostly because charging the memory cgroup of the current process wasn't an option. Also performance reasons were likely a reason too. The remote charging API allows to temporarily overwrite the currently active memory cgroup, so that all memory allocations are accounted towards some specified memory cgroup instead of the memory cgroup of the current process. This patchset extends the remote charging API so that it can be used from an interrupt context. Then it removes the fence that prevented the accounting of allocations made from an interrupt context. It also contains a couple of optimizations/code refactorings. This patchset doesn't directly enable accounting for any specific allocations, but prepares the code base for it. The bpf memory accounting will likely be the first user of it: a typical example is a bpf program parsing an incoming network packet, which allocates an entry in hashmap map to store some information. This patch (of 4): Currently memcg_kmem_bypass() is called before obtaining the current memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the number of call sites and allows further code simplifications. Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-09-17percpu: fix first chunk size calculation for populated bitmapSunghyun Jin1-1/+1
Variable populated, which is a member of struct pcpu_chunk, is used as a unit of size of unsigned long. However, size of populated is miscounted. So, I fix this minor part. Fixes: 8ab16c43ea79 ("percpu: change the number of pages marked in the first_chunk pop bitmap") Cc: <[email protected]> # 4.14+ Signed-off-by: Sunghyun Jin <[email protected]> Signed-off-by: Dennis Zhou <[email protected]>
2020-08-12mm: memcg/percpu: per-memcg percpu memory statisticsRoman Gushchin1-0/+10
Percpu memory can represent a noticeable chunk of the total memory consumption, especially on big machines with many CPUs. Let's track percpu memory usage for each memcg and display it in memory.stat. A percpu allocation is usually scattered over multiple pages (and nodes), and can be significantly smaller than a page. So let's add a byte-sized counter on the memcg level: MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs can be perfectly reused for percpu case. [[email protected]: v3] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Dennis Zhou <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Tobin C. Harding <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Waiman Long <[email protected]> Cc: Bixuan Cui <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Stephen Rothwell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-08-12mm: memcg/percpu: account percpu memory to memory cgroupsRoman Gushchin1-21/+164
Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. To implement the perpcu accounting it's possible to take the slab memory accounting as a model to follow. Let's introduce two types of percpu chunks: root and memcg. What makes memcg chunks different is an additional space allocated to store memcg membership information. If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used. If it's possible to charge the corresponding size to the target memory cgroup, allocation is performed, and the memcg ownership data is recorded. System-wide allocations are performed using root chunks, so there is no additional memory overhead. To implement a fast reparenting of percpu memory on memcg removal, we don't store mem_cgroup pointers directly: instead we use obj_cgroup API, introduced for slab accounting. [[email protected]: fix CONFIG_MEMCG_KMEM=n build errors and warning] [[email protected]: move unreachable code, per Roman] [[email protected]: mm/percpu: fix 'defined but not used' warning] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Bixuan Cui <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Dennis Zhou <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Tobin C. Harding <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Waiman Long <[email protected]> Cc: Bixuan Cui <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Stephen Rothwell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-08-12percpu: return number of released bytes from pcpu_free_area()Roman Gushchin1-3/+10
Patch series "mm: memcg accounting of percpu memory", v3. This patchset adds percpu memory accounting to memory cgroups. It's based on the rework of the slab controller and reuses concepts and features introduced for the per-object slab accounting. Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. Percpu allocations by their nature are scattered over multiple pages, so they can't be tracked on the per-page basis. So the per-object tracking introduced by the new slab controller is reused. The patchset implements charging of percpu allocations, adds memcg-level statistics, enables accounting for percpu allocations made by memory cgroup internals and provides some basic tests. To implement the accounting of percpu memory without a significant memory and performance overhead the following approach is used: all accounted allocations are placed into a separate percpu chunk (or chunks). These chunks are similar to default chunks, except that they do have an attached vector of pointers to obj_cgroup objects, which is big enough to save a pointer for each allocated object. On the allocation, if the allocation has to be accounted (__GFP_ACCOUNT is passed, the allocating process belongs to a non-root memory cgroup, etc), the memory cgroup is getting charged and if the maximum limit is not exceeded the allocation is performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the allocation is forced over the limit, depending on gfp (as any other kernel memory allocation). The memory cgroup information is saved in the obj_cgroup vector at the corresponding offset. On the release time the memcg information is restored from the vector and the cgroup is getting uncharged. Unaccounted allocations (at this point the absolute majority of all percpu allocations) are performed in the old way, so no additional overhead is expected. To avoid pinning dying memory cgroups by outstanding allocations, obj_cgroup API is used instead of directly saving memory cgroup pointers. obj_cgroup is basically a pointer to a memory cgroup with a standalone reference counter. The trick is that it can be atomically swapped to point at the parent cgroup, so that the original memory cgroup can be released prior to all objects, which has been charged to it. Because all charges and statistics are fully recursive, it's perfectly correct to uncharge the parent cgroup instead. This scheme is used in the slab memory accounting, and percpu memory can just follow the scheme. This patch (of 5): To implement accounting of percpu memory we need the information about the size of freed object. Return it from pcpu_free_area(). Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Dennis Zhou <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Tobin C. Harding <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Waiman Long <[email protected]> cC: Michal Koutný[email protected]> Cc: Bixuan Cui <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Stephen Rothwell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-07-16treewide: Remove uninitialized_var() usageKees Cook1-1/+1
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/[email protected]/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <[email protected]> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <[email protected]> # IB Acked-by: Kalle Valo <[email protected]> # wireless drivers Reviewed-by: Chao Yu <[email protected]> # erofs Signed-off-by: Kees Cook <[email protected]>
2020-06-02mm: remove the pgprot argument to __vmallocChristoph Hellwig1-1/+1
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Michael Kelley <[email protected]> [hyperv] Acked-by: Gao Xiang <[email protected]> [erofs] Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Wei Liu <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Sakari Ailus <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-05-07percpu: make pcpu_alloc() aware of current gfp contextFilipe Manana1-4/+10
Since 5.7-rc1, on btrfs we have a percpu counter initialization for which we always pass a GFP_KERNEL gfp_t argument (this happens since commit 2992df73268f78 ("btrfs: Implement DREW lock")). That is safe in some contextes but not on others where allowing fs reclaim could lead to a deadlock because we are either holding some btrfs lock needed for a transaction commit or holding a btrfs transaction handle open. Because of that we surround the call to the function that initializes the percpu counter with a NOFS context using memalloc_nofs_save() (this is done at btrfs_init_fs_root()). However it turns out that this is not enough to prevent a possible deadlock because percpu_alloc() determines if it is in an atomic context by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this case) and it is not aware that a NOFS context is set. Because percpu_alloc() thinks it is in a non atomic context it locks the pcpu_alloc_mutex. This can result in a btrfs deadlock when pcpu_balance_workfn() is running, has acquired that mutex and is waiting for reclaim, while the btrfs task that called percpu_counter_init() (and therefore percpu_alloc()) is holding either the btrfs commit_root semaphore or a transaction handle (done fs/btrfs/backref.c: iterate_extent_inodes()), which prevents reclaim from finishing as an attempt to commit the current btrfs transaction will deadlock. Lockdep reports this issue with the following trace: ====================================================== WARNING: possible circular locking dependency detected 5.6.0-rc7-btrfs-next-77 #1 Not tainted ------------------------------------------------------ kswapd0/91 is trying to acquire lock: ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] but task is already holding lock: ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (fs_reclaim){+.+.}: fs_reclaim_acquire.part.0+0x25/0x30 __kmalloc+0x5f/0x3a0 pcpu_create_chunk+0x19/0x230 pcpu_balance_workfn+0x56a/0x680 process_one_work+0x235/0x5f0 worker_thread+0x50/0x3b0 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 -> #3 (pcpu_alloc_mutex){+.+.}: __mutex_lock+0xa9/0xaf0 pcpu_alloc+0x480/0x7c0 __percpu_counter_init+0x50/0xd0 btrfs_drew_lock_init+0x22/0x70 [btrfs] btrfs_get_fs_root+0x29c/0x5c0 [btrfs] resolve_indirect_refs+0x120/0xa30 [btrfs] find_parent_nodes+0x50b/0xf30 [btrfs] btrfs_find_all_leafs+0x60/0xb0 [btrfs] iterate_extent_inodes+0x139/0x2f0 [btrfs] iterate_inodes_from_logical+0xa1/0xe0 [btrfs] btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs] btrfs_ioctl+0x165a/0x3130 [btrfs] ksys_ioctl+0x87/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #2 (&fs_info->commit_root_sem){++++}: down_write+0x38/0x70 btrfs_cache_block_group+0x2ec/0x500 [btrfs] find_free_extent+0xc6a/0x1600 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x122/0x5a0 [btrfs] btrfs_cow_block+0x106/0x240 [btrfs] commit_cowonly_roots+0x55/0x310 [btrfs] btrfs_commit_transaction+0x509/0xb20 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x93/0xc0 exit_to_usermode_loop+0xf9/0x100 do_syscall_64+0x20d/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #1 (&space_info->groups_sem){++++}: down_read+0x3c/0x140 find_free_extent+0xef6/0x1600 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x122/0x5a0 [btrfs] btrfs_cow_block+0x106/0x240 [btrfs] btrfs_search_slot+0x50c/0xd60 [btrfs] btrfs_lookup_inode+0x3a/0xc0 [btrfs] __btrfs_update_delayed_inode+0x90/0x280 [btrfs] __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs] __btrfs_run_delayed_items+0x8e/0x180 [btrfs] btrfs_commit_transaction+0x31b/0xb20 [btrfs] iterate_supers+0x87/0xf0 ksys_sync+0x60/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x5c/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&delayed_node->mutex){+.+.}: __lock_acquire+0xef0/0x1c80 lock_acquire+0xa2/0x1d0 __mutex_lock+0xa9/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x40d/0x560 [btrfs] evict+0xd9/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x124/0x1a0 do_shrink_slab+0x176/0x440 shrink_slab+0x23a/0x2c0 shrink_node+0x188/0x6e0 balance_pgdat+0x31d/0x7f0 kswapd+0x238/0x550 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(pcpu_alloc_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/91: #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0 #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50 stack backtrace: CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8f/0xd0 check_noncircular+0x170/0x190 __lock_acquire+0xef0/0x1c80 lock_acquire+0xa2/0x1d0 __mutex_lock+0xa9/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x40d/0x560 [btrfs] evict+0xd9/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x124/0x1a0 do_shrink_slab+0x176/0x440 shrink_slab+0x23a/0x2c0 shrink_node+0x188/0x6e0 balance_pgdat+0x31d/0x7f0 kswapd+0x238/0x550 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL to percpu_counter_init() in contextes where it is not reclaim safe, however that type of approach is discouraged since memalloc_[nofs|noio]_save() were introduced. Therefore this change makes pcpu_alloc() look up into an existing nofs/noio context before deciding whether it is in an atomic context or not. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Dennis Zhou <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Christoph Lameter <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-01percpu: update copyright emails to [email protected]Dennis Zhou1-1/+1
Currently there are 3 emails tied to me in the kernel tree, I'd rather [email protected] be the only one. Signed-off-by: Dennis Zhou <[email protected]>
2020-01-20bitmap: genericize percpu bitmap region iteratorsDennis Zhou1-45/+16
Bitmaps are fairly popular for their space efficiency, but we don't have generic iterators available. Make percpu's bitmap region iterators available to everyone. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-09-04percpu: Use struct_size() helperGustavo A. R. Silva1-1/+1
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct pcpu_alloc_info { ... struct pcpu_group_info groups[]; }; Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. So, replace the following form: sizeof(*ai) + nr_groups * sizeof(ai->groups[0]) with: struct_size(ai, groups, nr_groups) This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Dennis Zhou <[email protected]>
2019-07-23percpu: fix typo in pcpu_setup_first_chunk() commentChristophe JAILLET1-1/+1
s/perpcu/percpu/ Signed-off-by: Christophe JAILLET <[email protected]> Signed-off-by: Dennis Zhou <[email protected]> [Dennis: updated title]
2019-07-04percpu: Make pcpu_setup_first_chunk() void functionKefeng Wang1-12/+7
pcpu_setup_first_chunk() will panic or BUG_ON if the are some error and doesn't return any error, hence it can be defined to return void. Reported-by: kbuild test robot <[email protected]> Signed-off-by: Kefeng Wang <[email protected]> Signed-off-by: Dennis Zhou <[email protected]> [Dennis: fixed kbuild warning for pcpu_page_first_chunk()]
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428Thomas Gleixner1-2/+1
Based on 1 normalized pattern(s): this file is released under the gplv2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 68 file(s). Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Armijn Hemel <[email protected]> Reviewed-by: Allison Randal <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-05-13Merge branch 'for-5.2' of ↵Linus Torvalds1-162/+387
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu Pull percpu updates from Dennis Zhou: - scan hint update which helps address performance issues with heavily fragmented blocks - lockdep fix when freeing an allocation causes balance work to be scheduled * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: remove spurious lock dependency between percpu and sched percpu: use chunk scan_hint to skip some scanning percpu: convert chunk hints to be based on pcpu_block_md percpu: make pcpu_block_md generic percpu: use block scan_hint to only scan forward percpu: remember largest area skipped during allocation percpu: add block level scan_hint percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE percpu: relegate chunks unusable when failing small allocations percpu: manage chunks based on contig_bits instead of free_bytes percpu: introduce helper to determine if two regions overlap percpu: do not search past bitmap when allocating an area percpu: update free path with correct new free region
2019-05-08percpu: remove spurious lock dependency between percpu and schedJohn Sperbeck1-1/+5
In free_percpu() we sometimes call pcpu_schedule_balance_work() to queue a work item (which does a wakeup) while holding pcpu_lock. This creates an unnecessary lock dependency between pcpu_lock and the scheduler's pi_lock. There are other places where we call pcpu_schedule_balance_work() without hold pcpu_lock, and this case doesn't need to be different. Moving the call outside the lock prevents the following lockdep splat when running tools/testing/selftests/bpf/{test_maps,test_progs} in sequence with lockdep enabled: ====================================================== WARNING: possible circular locking dependency detected 5.1.0-dbg-DEV #1 Not tainted ------------------------------------------------------ kworker/23:255/18872 is trying to acquire lock: 000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520 but task is already holding lock: 00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (pcpu_lock){..-.}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 pcpu_alloc+0xfa/0x780 __alloc_percpu_gfp+0x12/0x20 alloc_htab_elem+0x184/0x2b0 __htab_percpu_map_update_elem+0x252/0x290 bpf_percpu_hash_update+0x7c/0x130 __do_sys_bpf+0x1912/0x1be0 __x64_sys_bpf+0x1a/0x20 do_syscall_64+0x59/0x400 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #3 (&htab->buckets[i].lock){....}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 htab_map_update_elem+0x1af/0x3a0 -> #2 (&rq->lock){-.-.}: lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 task_fork_fair+0x37/0x160 sched_fork+0x211/0x310 copy_process.part.43+0x7b1/0x2160 _do_fork+0xda/0x6b0 kernel_thread+0x29/0x30 rest_init+0x22/0x260 arch_call_rest_init+0xe/0x10 start_kernel+0x4fd/0x520 x86_64_start_reservations+0x24/0x26 x86_64_start_kernel+0x6f/0x72 secondary_startup_64+0xa4/0xb0 -> #1 (&p->pi_lock){-.-.}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 try_to_wake_up+0x41/0x600 wake_up_process+0x15/0x20 create_worker+0x16b/0x1e0 workqueue_init+0x279/0x2ee kernel_init_freeable+0xf7/0x288 kernel_init+0xf/0x180 ret_from_fork+0x24/0x30 -> #0 (&(&pool->lock)->rlock){-.-.}: __lock_acquire+0x101f/0x12a0 lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 __queue_work+0xb2/0x520 queue_work_on+0x38/0x80 free_percpu+0x221/0x260 pcpu_freelist_destroy+0x11/0x20 stack_map_free+0x2a/0x40 bpf_map_free_deferred+0x3c/0x50 process_one_work+0x1f7/0x580 worker_thread+0x54/0x410 kthread+0x10f/0x150 ret_from_fork+0x24/0x30 other info that might help us debug this: Chain exists of: &(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pcpu_lock); lock(&htab->buckets[i].lock); lock(pcpu_lock); lock(&(&pool->lock)->rlock); *** DEADLOCK *** 3 locks held by kworker/23:255/18872: #0: 00000000b36a6e16 ((wq_completion)events){+.+.}, at: process_one_work+0x17a/0x580 #1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x17a/0x580 #2: 00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260 stack backtrace: CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1 Hardware name: ... Workqueue: events bpf_map_free_deferred Call Trace: dump_stack+0x67/0x95 print_circular_bug.isra.38+0x1c6/0x220 check_prev_add.constprop.50+0x9f6/0xd20 __lock_acquire+0x101f/0x12a0 lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 __queue_work+0xb2/0x520 queue_work_on+0x38/0x80 free_percpu+0x221/0x260 pcpu_freelist_destroy+0x11/0x20 stack_map_free+0x2a/0x40 bpf_map_free_deferred+0x3c/0x50 process_one_work+0x1f7/0x580 worker_thread+0x54/0x410 kthread+0x10f/0x150 ret_from_fork+0x24/0x30 Signed-off-by: John Sperbeck <[email protected]> Signed-off-by: Dennis Zhou <[email protected]>
2019-03-18percpu: stop printing kernel addressesMatteo Croce1-4/+4
Since commit ad67b74d2469d9b8 ("printk: hash addresses printed with %p"), at boot "____ptrval____" is printed instead of actual addresses: percpu: Embedded 38 pages/cpu @(____ptrval____) s124376 r0 d31272 u524288 Instead of changing the print to "%px", and leaking kernel addresses, just remove the print completely, cfr. e.g. commit 071929dbdd865f77 ("arm64: Stop printing the virtual memory layout"). Signed-off-by: Matteo Croce <[email protected]> Signed-off-by: Dennis Zhou <[email protected]>
2019-03-13percpu: use chunk scan_hint to skip some scanningDennis Zhou1-9/+27
Just like blocks, chunks now maintain a scan_hint. This can be used to skip some scanning by promoting the scan_hint to be the contig_hint. The chunk's scan_hint is primarily updated on the backside and relies on full scanning when a block becomes free or the free region spans across blocks. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-13percpu: convert chunk hints to be based on pcpu_block_mdDennis Zhou1-67/+53
As mentioned in the last patch, a chunk's hints are no different than a block just responsible for more bits. This converts chunk level hints to use a pcpu_block_md to maintain them. This lets us reuse the same hint helper functions as a block. The left_free and right_free are unused by the chunk's pcpu_block_md. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-13percpu: make pcpu_block_md genericDennis Zhou1-7/+13
In reality, a chunk is just a block covering a larger number of bits. The hints themselves are one in the same. Rather than maintaining the hints separately, first introduce nr_bits to genericize pcpu_block_update() to correctly maintain block->right_free. The next patch will convert chunk hints to be managed as a pcpu_block_md. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-13percpu: use block scan_hint to only scan forwardDennis Zhou1-6/+17
Blocks now remember the latest scan_hint. This can be used on the allocation path as when a contig_hint is broken, we can promote the scan_hint to the contig_hint and scan forward from there. This works because pcpu_block_refresh_hint() is only called on the allocation path while block free regions are updated manually in pcpu_block_update_hint_free(). Signed-off-by: Dennis Zhou <[email protected]>
2019-03-13percpu: remember largest area skipped during allocationDennis Zhou1-2/+99
Percpu allocations attempt to do first fit by scanning forward from the first_free of a block. However, fragmentation from allocation requests can cause holes not seen by block hint update functions. To address this, create a local version of bitmap_find_next_zero_area_off() that remembers the largest area skipped over. The caveat is that it only sees regions skipped over due to not fitting, not regions skipped due to alignment. Prior to updating the scan_hint, a scan backwards is done to try and recover free bits skipped due to alignment. While this can cause scanning to miss earlier possible free areas, smaller allocations will eventually fill those holes due to first fit. Signed-off-by: Dennis Zhou <[email protected]>
2019-03-13percpu: add block level scan_hintDennis Zhou1-7/+94
Fragmentation can cause both blocks and chunks to have an early first_firee bit available, but only able to satisfy allocations much later on. This patch introduces a scan_hint to help mitigate some unnecessary scanning. The scan_hint remembers the largest area prior to the contig_hint. If the contig_hint == scan_hint, then scan_hint_start > contig_hint_start. This is necessary for scan_hint discovery when refreshing a block. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-13percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZEDennis Zhou1-66/+44
Previously, block size was flexible based on the constraint that the GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the overhead that keeping a floating number of populated free pages required scanning over the free regions of a chunk. Setting the block size to be fixed at PAGE_SIZE lets us know when an empty page becomes used as we will break a full contig_hint of a block. This means we no longer have to scan the whole chunk upon breaking a contig_hint which empty page management piggybacked off. A later patch takes advantage of this to optimize the allocation path by only scanning forward using the scan_hint introduced later too. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-13percpu: relegate chunks unusable when failing small allocationsDennis Zhou1-9/+26
In certain cases, requestors of percpu memory may want specific alignments. However, it is possible to end up in situations where the contig_hint matches, but the alignment does not. This causes excess scanning of chunks that will fail. To prevent this, if a small allocation fails (< 32B), the chunk is moved to the empty list. Once an allocation is freed from that chunk, it is placed back into rotation. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-13percpu: manage chunks based on contig_bits instead of free_bytesDennis Zhou1-1/+1
When a chunk becomes fragmented, it can end up having a large number of small allocation areas free. The free_bytes sorting of chunks leads to unnecessary checking of chunks that cannot satisfy the allocation. Switch to contig_bits sorting to prevent scanning chunks that may not be able to service the allocation request. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-13percpu: introduce helper to determine if two regions overlapDennis Zhou1-4/+24
While block hints were always accurate, it's possible when spanning across blocks that we miss updating the chunk's contig_hint. Rather than rely on correctness of the boundaries of hints, do a full overlap comparison. A future patch introduces the scan_hint which makes the contig_hint slightly fuzzy as they can at times be smaller than the actual hint. Signed-off-by: Dennis Zhou <[email protected]>
2019-03-13percpu: do not search past bitmap when allocating an areaDennis Zhou1-1/+2
pcpu_find_block_fit() guarantees that a fit is found within PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as it compares against the block's contig_hint. This can lead to incorrectly scanning past the end of the bitmap. The behavior was okay given the check after for bit_off >= end and the correctness of the hints from pcpu_find_block_fit(). This patch fixes this by bounding the end offset by the number of bits in a chunk. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-13percpu: update free path with correct new free regionDennis Zhou1-1/+1
When updating the chunk's contig_hint on the free path of a hint that does not touch the page boundaries, it was incorrectly using the starting offset of the free region and the block's contig_hint. This could lead to incorrect assumptions about fit given a size and better alignment of the start. Fix this by using (end - start) as this is only called when updating a hint within a block. Signed-off-by: Dennis Zhou <[email protected]> Reviewed-by: Peng Fan <[email protected]>
2019-03-12memblock: drop memblock_alloc_*_nopanic() variantsMike Rapoport1-7/+4
As all the memblock allocation functions return NULL in case of error rather than panic(), the duplicates with _nopanic suffix can be removed. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Petr Mladek <[email protected]> [printk] Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Guo Ren <[email protected]> [c-sky] Cc: Heiko Carstens <[email protected]> Cc: Juergen Gross <[email protected]> [Xen] Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Paul Burton <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rob Herring <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-03-12mm/percpu: add checks for the return value of memblock_alloc*()Mike Rapoport1-17/+56
Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Guo Ren <[email protected]> [c-sky] Cc: Heiko Carstens <[email protected]> Cc: Juergen Gross <[email protected]> [Xen] Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Paul Burton <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rob Herring <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-23percpu: use nr_groups as check conditionPeng Fan1-1/+1
group_cnt array is defined with NR_CPUS entries, but normally nr_groups will not reach up to NR_CPUS. So there is no issue to the current code. Checking other parts of pcpu_build_alloc_info, use nr_groups as check condition, so make it consistent to use 'group < nr_groups' as for loop check. In case we do have nr_groups equals with NR_CPUS, we could also avoid memory access out of bounds. Signed-off-by: Peng Fan <[email protected]> Signed-off-by: Dennis Zhou <[email protected]>
2018-11-01Merge branch 'for-4.20' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu Pull percpu fixes from Dennis Zhou: "Two small things for v4.20. The first fixes a clang uninitialized variable warning for arm64 in the default path calls BUILD_BUG(). The second removes an unnecessary unlikely() in a WARN_ON() use" * 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: arm64: percpu: Initialize ret in the default case mm: percpu: remove unnecessary unlikely()
2018-10-31memblock: stop using implicit alignment to SMP_CACHE_BYTESMike Rapoport1-18/+20
When a memblock allocation APIs are called with align = 0, the alignment is implicitly set to SMP_CACHE_BYTES. Implicit alignment is done deep in the memblock allocator and it can come as a surprise. Not that such an alignment would be wrong even when used incorrectly but it is better to be explicit for the sake of clarity and the prinicple of the least surprise. Replace all such uses of memblock APIs with the 'align' parameter explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment in the memblock internal allocation functions. For the case when memblock APIs are used via helper functions, e.g. like iommu_arena_new_node() in Alpha, the helper functions were detected with Coccinelle's help and then manually examined and updated where appropriate. The direct memblock APIs users were updated using the semantic patch below: @@ expression size, min_addr, max_addr, nid; @@ ( | - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc(size, 0) + memblock_alloc(size, SMP_CACHE_BYTES) | - memblock_alloc_raw(size, 0) + memblock_alloc_raw(size, SMP_CACHE_BYTES) | - memblock_alloc_from(size, 0, min_addr) + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_nopanic(size, 0) + memblock_alloc_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_low(size, 0) + memblock_alloc_low(size, SMP_CACHE_BYTES) | - memblock_alloc_low_nopanic(size, 0) + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_from_nopanic(size, 0, min_addr) + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_node(size, 0, nid) + memblock_alloc_node(size, SMP_CACHE_BYTES, nid) ) [[email protected]: changelog update] [[email protected]: coding-style fixes] [[email protected]: fix missed uses of implicit alignment] Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Michael Ellerman <[email protected]> [powerpc] Acked-by: Michal Hocko <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michal Simek <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-31mm: remove include/linux/bootmem.hMike Rapoport1-1/+1
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [[email protected]: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chris Zankel <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Burton <[email protected]> Cc: Richard Kuo <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Serge Semin <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-31memblock: remove _virt from APIs returning virtual addressMike Rapoport1-14/+14
The conversion is done using sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chris Zankel <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Burton <[email protected]> Cc: Richard Kuo <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Serge Semin <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-07percpu: stop leaking bitmap metadata blocksMike Rapoport1-0/+1
The commit ca460b3c9627 ("percpu: introduce bitmap metadata blocks") introduced bitmap metadata blocks. These metadata blocks are allocated whenever a new chunk is created, but they are never freed. Fix it. Fixes: ca460b3c9627 ("percpu: introduce bitmap metadata blocks") Signed-off-by: Mike Rapoport <[email protected]> Cc: [email protected] Signed-off-by: Dennis Zhou <[email protected]>
2018-09-12mm: percpu: remove unnecessary unlikely()Igor Stoppa1-1/+1
WARN_ON() already contains an unlikely(), so it's not necessary to wrap it into another. Signed-off-by: Igor Stoppa <[email protected]> Cc: Tejun Heo <[email protected]> Cc: zijun_hu <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Dennis Zhou <[email protected]>
2018-08-22/proc/meminfo: add percpu populated pages countDennis Zhou (Facebook)1-0/+29
Currently, percpu memory only exposes allocation and utilization information via debugfs. This more or less is only really useful for understanding the fragmentation and allocation information at a per-chunk level with a few global counters. This is also gated behind a config. BPF and cgroup, for example, have seen an increase in use causing increased use of percpu memory. Let's make it easier for someone to identify how much memory is being used. This patch adds the "Percpu" stat to meminfo to more easily look up how much percpu memory is in use. This number includes the cost for all allocated backing pages and not just insight at the per a unit, per chunk level. Metadata is excluded. I think excluding metadata is fair because the backing memory scales with the numbere of cpus and can quickly outweigh the metadata. It also makes this calculation light. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dennis Zhou <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Roman Gushchin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-02Merge tag 'arch-removal' of ↵Linus Torvalds1-4/+0
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pul removal of obsolete architecture ports from Arnd Bergmann: "This removes the entire architecture code for blackfin, cris, frv, m32r, metag, mn10300, score, and tile, including the associated device drivers. I have been working with the (former) maintainers for each one to ensure that my interpretation was right and the code is definitely unused in mainline kernels. Many had fond memories of working on the respective ports to start with and getting them included in upstream, but also saw no point in keeping the port alive without any users. In the end, it seems that while the eight architectures are extremely different, they all suffered the same fate: There was one company in charge of an SoC line, a CPU microarchitecture and a software ecosystem, which was more costly than licensing newer off-the-shelf CPU cores from a third party (typically ARM, MIPS, or RISC-V). It seems that all the SoC product lines are still around, but have not used the custom CPU architectures for several years at this point. In contrast, CPU instruction sets that remain popular and have actively maintained kernel ports tend to all be used across multiple licensees. [ See the new nds32 port merged in the previous commit for the next generation of "one company in charge of an SoC line, a CPU microarchitecture and a software ecosystem" - Linus ] The removal came out of a discussion that is now documented at https://lwn.net/Articles/748074/. Unlike the original plans, I'm not marking any ports as deprecated but remove them all at once after I made sure that they are all unused. Some architectures (notably tile, mn10300, and blackfin) are still being shipped in products with old kernels, but those products will never be updated to newer kernel releases. After this series, we still have a few architectures without mainline gcc support: - unicore32 and hexagon both have very outdated gcc releases, but the maintainers promised to work on providing something newer. At least in case of hexagon, this will only be llvm, not gcc. - openrisc, risc-v and nds32 are still in the process of finishing their support or getting it added to mainline gcc in the first place. They all have patched gcc-7.3 ports that work to some degree, but complete upstream support won't happen before gcc-8.1. Csky posted their first kernel patch set last week, their situation will be similar [ Palmer Dabbelt points out that RISC-V support is in mainline gcc since gcc-7, although gcc-7.3.0 is the recommended minimum - Linus ]" This really says it all: 2498 files changed, 95 insertions(+), 467668 deletions(-) * tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (74 commits) MAINTAINERS: UNICORE32: Change email account staging: iio: remove iio-trig-bfin-timer driver tty: hvc: remove tile driver tty: remove bfin_jtag_comm and hvc_bfin_jtag drivers serial: remove tile uart driver serial: remove m32r_sio driver serial: remove blackfin drivers serial: remove cris/etrax uart drivers usb: Remove Blackfin references in USB support usb: isp1362: remove blackfin arch glue usb: musb: remove blackfin port usb: host: remove tilegx platform glue pwm: remove pwm-bfin driver i2c: remove bfin-twi driver spi: remove blackfin related host drivers watchdog: remove bfin_wdt driver can: remove bfin_can driver mmc: remove bfin_sdh driver input: misc: remove blackfin rotary driver input: keyboard: remove bf54x driver ...
2018-03-26treewide: simplify Kconfig dependencies for removed archsArnd Bergmann1-4/+0
A lot of Kconfig symbols have architecture specific dependencies. In those cases that depend on architectures we have already removed, they can be omitted. Acked-by: Kalle Valo <[email protected]> Acked-by: Alexandre Belloni <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2018-03-19mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()Kirill Tkhai1-2/+11
In case of memory deficit and low percpu memory pages, pcpu_balance_workfn() takes pcpu_alloc_mutex for a long time (as it makes memory allocations itself and waits for memory reclaim). If tasks doing pcpu_alloc() are choosen by OOM killer, they can't exit, because they are waiting for the mutex. The patch makes pcpu_alloc() to care about killing signal and use mutex_lock_killable(), when it's allowed by GFP flags. This guarantees, a task does not miss SIGKILL from OOM killer. Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-03-19percpu: include linux/sched.h for cond_resched()Tejun Heo1-0/+1
microblaze build broke due to missing declaration of the cond_resched() invocation added recently. Let's include linux/sched.h explicitly. Signed-off-by: Tejun Heo <[email protected]> Reported-by: kbuild test robot <[email protected]>
2018-02-23percpu: add a schedule point in pcpu_balance_workfn()Eric Dumazet1-0/+1
When a large BPF percpu map is destroyed, I have seen pcpu_balance_workfn() holding cpu for hundreds of milliseconds. On KASAN config and 112 hyperthreads, average time to destroy a chunk is ~4 ms. [ 2489.841376] destroy chunk 1 in 4148689 ns ... [ 2490.093428] destroy chunk 32 in 4072718 ns Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-02-18percpu: allow select gfp to be passed to underlying allocatorsDennis Zhou1-9/+7
The prior patch added support for passing gfp flags through to the underlying allocators. This patch allows users to pass along gfp flags (currently only __GFP_NORETRY and __GFP_NOWARN) to the underlying allocators. This should allow users to decide if they are ok with failing allocations recovering in a more graceful way. Additionally, gfp passing was done as additional flags in the previous patch. Instead, change this to caller passed semantics. GFP_KERNEL is also removed as the default flag. It continues to be used for internally caused underlying percpu allocations. V2: Removed gfp_percpu_mask in favor of doing it inline. Removed GFP_KERNEL as a default flag for __alloc_percpu_gfp. Signed-off-by: Dennis Zhou <[email protected]> Suggested-by: Daniel Borkmann <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-02-18percpu: add __GFP_NORETRY semantics to the percpu balancing pathDennis Zhou1-17/+27
Percpu memory using the vmalloc area based chunk allocator lazily populates chunks by first requesting the full virtual address space required for the chunk and subsequently adding pages as allocations come through. To ensure atomic allocations can succeed, a workqueue item is used to maintain a minimum number of empty pages. In certain scenarios, such as reported in [1], it is possible that physical memory becomes quite scarce which can result in either a rather long time spent trying to find free pages or worse, a kernel panic. This patch adds support for __GFP_NORETRY and __GFP_NOWARN passing them through to the underlying allocators. This should prevent any unnecessary panics potentially caused by the workqueue item. The passing of gfp around is as additional flags rather than a full set of flags. The next patch will change these to caller passed semantics. V2: Added const modifier to gfp flags in the balance path. Removed an extra whitespace. [1] https://lkml.org/lkml/2018/2/12/551 Signed-off-by: Dennis Zhou <[email protected]> Suggested-by: Daniel Borkmann <[email protected]> Reported-by: [email protected] Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-02-18percpu: match chunk allocator declarations with definitionsDennis Zhou1-2/+4
At some point the function declaration parameters got out of sync with the function definitions in percpu-vm.c and percpu-km.c. This patch makes them match again. Signed-off-by: Dennis Zhou <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2017-11-27percpu: hack to let the CRIS architecture to boot until they clean upNicolas Pitre1-0/+4
Commit 438a506180 ("percpu: don't forget to free the temporary struct pcpu_alloc_info") uncovered a problem on the CRIS architecture where the bootmem allocator is initialized with virtual addresses. Given it has: #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) then things just work out because the end result is the same whether you give this a physical or a virtual address. Untill you call memblock_free_early(__pa(address)) that is, because values from __pa() don't match with the virtual addresses stuffed in the bootmem allocator anymore. Avoid freeing the temporary pcpu_alloc_info memory on that architecture until they fix things up to let the kernel boot like it did before. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Fixes: 438a506180 ("percpu: don't forget to free the temporary struct pcpu_alloc_info")
2017-11-15Merge branch 'for-4.15' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu update from Tejun Heo: "Another minor pull request. It only contains one commit which can reclaim a bit of memory wasted during boot on UP" * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: don't forget to free the temporary struct pcpu_alloc_info