Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2017-11-15 | gfs2: use pagevec_lookup_range_tag() | Jan Kara | 1 | -18/+2 | |
We want only pages from the given range in gfs2_write_cache_jdata(). Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove the unnecessary code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: Bob Peterson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | f2fs: use find_get_pages_tag() for looking up single page | Jan Kara | 1 | -6/+7 | |
__get_first_dirty_index() wants to look up only the first dirty page after a given index. There's no point in using pagevec_lookup_tag() for that. Just use find_get_pages_tag() directly. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Chao Yu <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: Jaegeuk Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | f2fs: simplify page iteration loops | Jan Kara | 2 | -50/+28 | |
In several places we want to iterate over all tagged pages in a mapping. However, the code was apparently copied from places that iterate only over a limited range: it checks for index <= end and optimizes the case where we are coming close to the range end, which is all pointless when end == ULONG_MAX. So just remove this dead code. [[email protected]: fix warnings] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Reviewed-by: Chao Yu <[email protected]> Cc: Jaegeuk Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | f2fs: use pagevec_lookup_range_tag() | Jan Kara | 1 | -7/+2 | |
We want only pages from the given range in f2fs_write_cache_pages(). Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove the unnecessary code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Chao Yu <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: Jaegeuk Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | ext4: use pagevec_lookup_range_tag() | Jan Kara | 1 | -12/+2 | |
We want only pages from the given range in ext4_writepages(). Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove the unnecessary code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | ceph: use pagevec_lookup_range_tag() | Jan Kara | 1 | -16/+3 | |
We want only pages from the given range in ceph_writepages_start(). Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove the unnecessary code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: "Yan, Zheng" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | btrfs: use pagevec_lookup_range_tag() | Jan Kara | 1 | -15/+4 | |
We want only pages from the given range in btree_write_cache_pages() and extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove the unnecessary code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: David Sterba <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: David Sterba <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm: implement find_get_pages_range_tag() | Jan Kara | 4 | -17/+48 | |
Patch series "Ranged pagevec tagged lookup", v3. In this series I provide a ranged variant of pagevec_lookup_tag() and use it in places where it makes sense. This series removes some common code and it also has a potential for speeding up some operations similarly as for pagevec_lookup_range() (but for now I can think of only artificial cases where this happens). This patch (of 16): Implement a variant of find_get_pages_tag() that stops iterating at given index. Lots of users of this function (through pagevec_lookup()) actually want a range lookup and all of them are currently open-coding this. Also create corresponding pagevec_lookup_range_tag() function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Chao Yu <[email protected]> Cc: David Howells <[email protected]> Cc: David Sterba <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Steve French <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: "Yan, Zheng" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm/page_owner.c: reduce page_owner structure size | Ayush Mittal | 1 | -2/+2 | |
The maximum page order can be at most 10, which fits in a short data type (2 bytes). last_migrate_reason is defined as an enum whose values also fit in a short data type (2 bytes). The total structure size is currently 16 bytes; after the change it goes down to 12 bytes. Vlastimil said: "Looks like it works, so why not.
    Before: [ 0.001000] allocated 50331648 bytes of page_ext
    After:  [ 0.001000] allocated 41943040 bytes of page_ext"
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ayush Mittal <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: Amit Sahrawat <[email protected]> Cc: Vaneet Narang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
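A sketch of the shrunken structure described above. The field set follows the mm/page_owner.c of that era; the exact field order in the merged patch may differ.

```c
#include <linux/gfp.h>
#include <linux/stackdepot.h>

/* per-page page_owner metadata stored in page_ext (sketch) */
struct page_owner {
	unsigned short order;		/* was unsigned int: order <= 10 fits in 2 bytes */
	short last_migrate_reason;	/* was int: the enum's small range fits in 2 bytes */
	gfp_t gfp_mask;
	depot_stack_handle_t handle;	/* stack depot handle of the allocation trace */
};
```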
2017-11-15 | mm/cma.c: change pr_info to pr_err for cma_alloc fail log | Pintu Agarwal | 1 | -1/+1 | |
It was observed that the cma_alloc failure message used pr_info instead of pr_err. This leads to problems if the printk debug level is set below 7: the cma_alloc failure will not be captured in the log and it will be difficult to debug. Simply replace the pr_info with pr_err to capture the failure log. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pintu Agarwal <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jaewon Kim <[email protected]> Cc: Doug Berger <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm, arch: remove empty_bad_page* | Michal Hocko | 7 | -54/+2 | |
empty_bad_page() and empty_bad_pte_table() seem to be relics from the old days which have not been used by any code for a long time. I have tried to find out exactly when, but this is not really all that straightforward due to many code movements - the traces disappear around the 2.4 era. Anyway, no code references either empty_bad_page or empty_bad_pte_table. We only allocate storage which is not used by anybody, so remove them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Ralf Baechle <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: David Howells <[email protected]> Cc: Rich Felker <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm/swap_slots.c: fix race conditions in swap_slots cache init | Tim Chen | 1 | -2/+9 | |
Memory allocations can happen before the swap_slots cache initialization is completed during cpu bring-up. If we are low on memory, we could call get_swap_page() and access swap_slots_cache before it is fully initialized. Add a check in get_swap_page() for an initialized swap_slots_cache to prevent this condition. A similar check already exists in free_swap_slot(). Also annotate the checks to indicate the likely condition. We also add a memory barrier to make sure that the lock initialization is done before the assignment of the cache->slots and cache->slots_ret pointers. This ensures the assumption that it is safe to acquire the slots cache locks and use the slots cache when the corresponding cache->slots or cache->slots_ret pointers are non-NULL. [[email protected]: tidy up comment] [[email protected]: fix spello in comment] Link: http://lkml.kernel.org/r/65a9d0f133f63e66bba37b53b2fd0464b7cae771.1500677066.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <[email protected]> Reported-by: Wenwei Tao <[email protected]> Acked-by: Ying Huang <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
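A minimal sketch of the publish/consume ordering described above, not the exact patch; the lock_initialized field and check_cache_active() helper are assumptions based on the swap_slots cache code of that era.

```c
/* initialization side (per-CPU swap_slots_cache setup), sketch */
mutex_init(&cache->alloc_lock);
spin_lock_init(&cache->free_lock);
cache->lock_initialized = true;

/*
 * Make the lock initialization visible before the slots pointers that
 * readers test for non-NULL below.
 */
mb();

cache->slots = slots;
cache->slots_ret = slots_ret;

/* reader side, e.g. get_swap_page(), sketch */
if (likely(check_cache_active() && cache->slots)) {
	/* safe to take cache->alloc_lock and use the per-CPU slots cache */
}
```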
2017-11-15 | mm: remove unused pgdat->inactive_ratio | Andrey Ryabinin | 3 | -11/+3 | |
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") 'pgdat->inactive_ratio' is not used, except for printing "node_inactive_ratio: 0" in /proc/zoneinfo output. Remove it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm/mmu_notifier: avoid call to invalidate_range() in range_end() | Jérôme Glisse | 5 | -12/+83 | |
This is an optimization patch that only affects mmu_notifier users which rely on the invalidate_range() callback. It avoids calling that callback twice in a row from inside __mmu_notifier_invalidate_range_end().
Existing pattern (before this patch):
    mmu_notifier_invalidate_range_start()
    pte/pmd/pud_clear_flush_notify()
    mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_end()
    mmu_notifier_invalidate_range()
New pattern (after this patch):
    mmu_notifier_invalidate_range_start()
    pte/pmd/pud_clear_flush_notify()
    mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_only_end()
We call the invalidate_range callback after clearing the page table under the page table lock and skip the call to invalidate_range inside the __mmu_notifier_invalidate_range_end() function. Idea from Andrea Arcangeli. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Andrew Donnellan <[email protected]> Cc: Nadav Amit <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
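A hedged sketch of the call-site pattern described above, around a COW-style PTE replacement. Taking and dropping the page table lock around the clear/set is elided, and the helper name replace_pte_notify_once() is illustrative, not an actual kernel function.

```c
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Replace the PTE at @addr with @new_pte, notifying secondary TLBs exactly
 * once (under the page table lock, from the *_notify clear helper) and
 * skipping the duplicate callback that mmu_notifier_invalidate_range_end()
 * would otherwise issue.
 */
static void replace_pte_notify_once(struct mm_struct *mm,
				    struct vm_area_struct *vma,
				    unsigned long addr, pte_t *ptep,
				    pte_t new_pte)
{
	unsigned long start = addr & PAGE_MASK;
	unsigned long end = start + PAGE_SIZE;

	mmu_notifier_invalidate_range_start(mm, start, end);

	/* clears the PTE and calls ->invalidate_range() under the PTL */
	ptep_clear_flush_notify(vma, addr, ptep);
	set_pte_at_notify(mm, addr, ptep, new_pte);

	/* does NOT call ->invalidate_range() a second time */
	mmu_notifier_invalidate_range_only_end(mm, start, end);
}
```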
2017-11-15 | mm/mmu_notifier: avoid double notification when it is useless | Jérôme Glisse | 7 | -17/+198 | |
This patch only affects users of the mmu_notifier->invalidate_range callback, which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ... and it is an optimization for those users. Everyone else is unaffected by it. When clearing a pte/pmd we are given a choice to notify the event under the page table lock (the notify versions of the *_clear_flush helpers do call mmu_notifier_invalidate_range). But that notification is not necessary in all cases. This patch removes almost all cases where it is useless to have a call to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end. It also adds documentation in all those cases explaining why. Below is a more in-depth analysis of why this is fine to do: Consider a secondary TLB (non-CPU TLB) like an IOMMU TLB or a device TLB (when the device uses something like ATS/PASID to get the IOMMU to walk the CPU page table to access a process virtual address space). There are only 2 cases when you need to notify those secondary TLBs while holding the page table lock when clearing a pte/pmd:
    A) the page backing the address is freed before mmu_notifier_invalidate_range_end
    B) a page table entry is updated to point to a new page (COW, write fault on zero page, __replace_page(), ...)
Case A is obvious: you do not want to take the risk of the device writing to a page that might now be used by something completely different. Case B is more subtle. For correctness it requires the following sequence to happen:
    - take page table lock
    - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
    - set page table entry to point to new page
If clearing the page table entry is not followed by a notify before setting the new pte/pmd value, then you can break a memory model like C11 or C++11 for the device. Consider the following scenario (the device uses a feature similar to ATS/PASID): two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we assume they are write protected for COW (the other cases of B apply too).
    [Time N] -----------------------------------------------------------------
    CPU-thread-0  {try to write to addrA}
    CPU-thread-1  {try to write to addrB}
    CPU-thread-2  {}
    CPU-thread-3  {}
    DEV-thread-0  {read addrA and populate device TLB}
    DEV-thread-2  {read addrB and populate device TLB}
    [Time N+1] ---------------------------------------------------------------
    CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
    CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
    CPU-thread-2  {}
    CPU-thread-3  {}
    DEV-thread-0  {}
    DEV-thread-2  {}
    [Time N+2] ---------------------------------------------------------------
    CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
    CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
    CPU-thread-2  {}
    CPU-thread-3  {}
    DEV-thread-0  {}
    DEV-thread-2  {}
    [Time N+3] ---------------------------------------------------------------
    CPU-thread-0  {preempted}
    CPU-thread-1  {preempted}
    CPU-thread-2  {write to addrA which is a write to new page}
    CPU-thread-3  {}
    DEV-thread-0  {}
    DEV-thread-2  {}
    [Time N+3] ---------------------------------------------------------------
    CPU-thread-0  {preempted}
    CPU-thread-1  {preempted}
    CPU-thread-2  {}
    CPU-thread-3  {write to addrB which is a write to new page}
    DEV-thread-0  {}
    DEV-thread-2  {}
    [Time N+4] ---------------------------------------------------------------
    CPU-thread-0  {preempted}
    CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
    CPU-thread-2  {}
    CPU-thread-3  {}
    DEV-thread-0  {}
    DEV-thread-2  {}
    [Time N+5] ---------------------------------------------------------------
    CPU-thread-0  {preempted}
    CPU-thread-1  {}
    CPU-thread-2  {}
    CPU-thread-3  {}
    DEV-thread-0  {read addrA from old page}
    DEV-thread-2  {read addrB from new page}
So here, because at time N+2 the clearing of the page table entry was not paired with a notification to invalidate the secondary TLB, the device sees the new value for addrB before seeing the new value for addrA. This breaks total memory ordering for the device. When changing a pte to write protect, or to point to a new write-protected page with the same content (KSM), it is OK to delay the invalidate_range callback to mmu_notifier_invalidate_range_end() outside the page table lock. This is true even if the thread doing the page table update is preempted right after releasing the page table lock, before calling mmu_notifier_invalidate_range_end(). Thanks to Andrea for thinking of a problematic scenario for COW. [[email protected]: v2] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Andrew Donnellan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | zsmalloc: calling zs_map_object() from irq is a bug | Sergey Senozhatsky | 1 | -1/+1 | |
Use BUG_ON(in_interrupt()) in zs_map_object(). This is not a new BUG_ON(), it's always been there, but was recently changed to VM_BUG_ON(). There are several problems with that. First, we use per-CPU mappings both in zsmalloc and in zram, and an interrupt may easily corrupt those buffers. Second, and more importantly, we believe it's possible to start leaking sensitive information. Consider the following case:
    -> process P
         swap out
          zram
           per-cpu mapping CPU1
           compress page A
    -> IRQ
         swap out
          zram
           per-cpu mapping CPU1
           compress page B
           write page from per-cpu mapping CPU1 to zsmalloc pool
         iret
    -> process P
           write page from per-cpu mapping CPU1 to zsmalloc pool  [*]
         return
* so we store overwritten data that actually belongs to another page (task) and potentially contains sensitive data. And when process P page faults, it's going to read (swap in) that other task's data. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm/hmm: constify hmm_devmem_page_get_drvdata() parameter | Ralph Campbell | 1 | -2/+2 | |
Constify the pointer parameter to avoid issues when it is used from code that only has a const struct page pointer in the first place. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ralph Campbell <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | fs/hugetlbfs/inode.c: remove redundant -EINVAL return from hugetlbfs_setattr() | Anshuman Khandual | 1 | -1/+0 | |
There is no need to have a local return code set with -EINVAL when both the conditions following it return error codes appropriately. Just remove the redundant one. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | zram: remove zlib from the list of recommended algorithms | Sergey Senozhatsky | 1 | -3/+0 | |
ZSTD tends to outperform deflate/inflate, thus we remove zlib from the list of recommended algorithms and recommend zstd instead. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Suggested-by: Minchan Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | zram: add zstd to the supported algorithms list | Sergey Senozhatsky | 1 | -0/+3 | |
Add ZSTD to the list of supported compression algorithms. ZRAM fio perf test: LZO DEFLATE ZSTD #jobs1 WRITE: (2180MB/s) (77.2MB/s) (1429MB/s) WRITE: (1617MB/s) (77.7MB/s) (1202MB/s) READ: (426MB/s) (595MB/s) (1181MB/s) READ: (422MB/s) (572MB/s) (1020MB/s) READ: (318MB/s) (67.8MB/s) (563MB/s) WRITE: (318MB/s) (67.9MB/s) (564MB/s) READ: (336MB/s) (68.3MB/s) (583MB/s) WRITE: (335MB/s) (68.2MB/s) (582MB/s) #jobs2 WRITE: (3441MB/s) (152MB/s) (2141MB/s) WRITE: (2507MB/s) (147MB/s) (1888MB/s) READ: (801MB/s) (1146MB/s) (1890MB/s) READ: (767MB/s) (1096MB/s) (2073MB/s) READ: (621MB/s) (126MB/s) (1009MB/s) WRITE: (621MB/s) (126MB/s) (1009MB/s) READ: (656MB/s) (125MB/s) (1075MB/s) WRITE: (657MB/s) (126MB/s) (1077MB/s) #jobs3 WRITE: (4772MB/s) (225MB/s) (3394MB/s) WRITE: (3905MB/s) (211MB/s) (2939MB/s) READ: (1216MB/s) (1608MB/s) (3218MB/s) READ: (1159MB/s) (1431MB/s) (2981MB/s) READ: (906MB/s) (156MB/s) (1457MB/s) WRITE: (907MB/s) (156MB/s) (1458MB/s) READ: (953MB/s) (158MB/s) (1595MB/s) WRITE: (952MB/s) (157MB/s) (1593MB/s) #jobs4 WRITE: (6036MB/s) (265MB/s) (4469MB/s) WRITE: (5059MB/s) (263MB/s) (3951MB/s) READ: (1618MB/s) (2066MB/s) (4276MB/s) READ: (1573MB/s) (1942MB/s) (3830MB/s) READ: (1202MB/s) (227MB/s) (1971MB/s) WRITE: (1200MB/s) (227MB/s) (1968MB/s) READ: (1265MB/s) (226MB/s) (2116MB/s) WRITE: (1264MB/s) (226MB/s) (2114MB/s) #jobs5 WRITE: (5339MB/s) (233MB/s) (3781MB/s) WRITE: (4298MB/s) (234MB/s) (3276MB/s) READ: (1626MB/s) (2048MB/s) (4081MB/s) READ: (1567MB/s) (1929MB/s) (3758MB/s) READ: (1174MB/s) (205MB/s) (1747MB/s) WRITE: (1173MB/s) (204MB/s) (1746MB/s) READ: (1214MB/s) (208MB/s) (1890MB/s) WRITE: (1215MB/s) (208MB/s) (1892MB/s) #jobs6 WRITE: (5666MB/s) (270MB/s) (4338MB/s) WRITE: (4828MB/s) (267MB/s) (3772MB/s) READ: (1803MB/s) (2058MB/s) (4946MB/s) READ: (1805MB/s) (2156MB/s) (4711MB/s) READ: (1334MB/s) (235MB/s) (2135MB/s) WRITE: (1335MB/s) (235MB/s) (2137MB/s) READ: (1364MB/s) (236MB/s) (2268MB/s) WRITE: (1365MB/s) (237MB/s) (2270MB/s) #jobs7 WRITE: (5474MB/s) (270MB/s) (4300MB/s) WRITE: (4666MB/s) (266MB/s) (3817MB/s) READ: (2022MB/s) (2319MB/s) (5472MB/s) READ: (1924MB/s) (2260MB/s) (5031MB/s) READ: (1369MB/s) (242MB/s) (2153MB/s) WRITE: (1370MB/s) (242MB/s) (2155MB/s) READ: (1499MB/s) (246MB/s) (2310MB/s) WRITE: (1497MB/s) (246MB/s) (2307MB/s) #jobs8 WRITE: (5558MB/s) (273MB/s) (4439MB/s) WRITE: (4763MB/s) (271MB/s) (3918MB/s) READ: (2201MB/s) (2599MB/s) (6062MB/s) READ: (2105MB/s) (2463MB/s) (5413MB/s) READ: (1490MB/s) (252MB/s) (2238MB/s) WRITE: (1488MB/s) (252MB/s) (2236MB/s) READ: (1566MB/s) (254MB/s) (2434MB/s) WRITE: (1568MB/s) (254MB/s) (2437MB/s) #jobs9 WRITE: (5120MB/s) (264MB/s) (4035MB/s) WRITE: (4531MB/s) (267MB/s) (3740MB/s) READ: (1940MB/s) (2258MB/s) (4986MB/s) READ: (2024MB/s) (2387MB/s) (4871MB/s) READ: (1343MB/s) (246MB/s) (2038MB/s) WRITE: (1342MB/s) (246MB/s) (2037MB/s) READ: (1553MB/s) (238MB/s) (2243MB/s) WRITE: (1552MB/s) (238MB/s) (2242MB/s) #jobs10 WRITE: (5345MB/s) (271MB/s) (3988MB/s) WRITE: (4750MB/s) (254MB/s) (3668MB/s) READ: (1876MB/s) (2363MB/s) (5150MB/s) READ: (1990MB/s) (2256MB/s) (5080MB/s) READ: (1355MB/s) (250MB/s) (2019MB/s) WRITE: (1356MB/s) (251MB/s) (2020MB/s) READ: (1490MB/s) (252MB/s) (2202MB/s) WRITE: (1488MB/s) (252MB/s) (2199MB/s) jobs1 perfstat instructions 52,065,555,710 ( 0.79) 855,731,114,587 ( 2.64) 54,280,709,944 ( 1.40) branches 14,020,427,116 ( 725.847) 101,733,449,582 (1074.521) 11,170,591,067 ( 992.869) branch-misses 22,626,174 ( 0.16%) 274,197,885 ( 0.27%) 25,915,805 ( 0.23%) jobs2 perfstat instructions 103,633,110,402 
( 0.75) 1,710,822,100,914 ( 2.59) 107,879,874,104 ( 1.28) branches 27,931,237,282 ( 679.203) 203,298,267,479 (1037.326) 22,185,350,842 ( 884.427) branch-misses 46,103,811 ( 0.17%) 533,747,204 ( 0.26%) 49,682,483 ( 0.22%) jobs3 perfstat instructions 154,857,283,657 ( 0.76) 2,565,748,974,197 ( 2.57) 161,515,435,813 ( 1.31) branches 41,759,490,355 ( 670.529) 304,905,605,277 ( 978.765) 33,215,805,907 ( 888.003) branch-misses 74,263,293 ( 0.18%) 759,746,240 ( 0.25%) 76,841,196 ( 0.23%) jobs4 perfstat instructions 206,215,849,076 ( 0.75) 3,420,169,460,897 ( 2.60) 215,003,061,664 ( 1.31) branches 55,632,141,739 ( 666.501) 406,394,977,433 ( 927.241) 44,214,322,251 ( 883.532) branch-misses 102,287,788 ( 0.18%) 1,098,617,314 ( 0.27%) 103,891,040 ( 0.23%) jobs5 perfstat instructions 258,711,315,588 ( 0.67) 4,275,657,533,244 ( 2.23) 269,332,235,685 ( 1.08) branches 69,802,821,166 ( 588.823) 507,996,211,252 ( 797.036) 55,450,846,129 ( 735.095) branch-misses 129,217,214 ( 0.19%) 1,243,284,991 ( 0.24%) 173,512,278 ( 0.31%) jobs6 perfstat instructions 312,796,166,008 ( 0.61) 5,133,896,344,660 ( 2.02) 323,658,769,588 ( 1.04) branches 84,372,488,583 ( 520.541) 610,310,494,402 ( 697.642) 66,683,292,992 ( 693.939) branch-misses 159,438,978 ( 0.19%) 1,396,368,563 ( 0.23%) 174,406,934 ( 0.26%) jobs7 perfstat instructions 363,211,372,930 ( 0.56) 5,988,205,600,879 ( 1.75) 377,824,674,156 ( 0.93) branches 98,057,013,765 ( 463.117) 711,841,255,974 ( 598.762) 77,879,009,954 ( 600.443) branch-misses 199,513,153 ( 0.20%) 1,507,651,077 ( 0.21%) 248,203,369 ( 0.32%) jobs8 perfstat instructions 413,960,354,615 ( 0.52) 6,842,918,558,378 ( 1.45) 431,938,486,581 ( 0.83) branches 111,812,574,884 ( 414.224) 813,299,084,518 ( 491.173) 89,062,699,827 ( 517.795) branch-misses 233,584,845 ( 0.21%) 1,531,593,921 ( 0.19%) 286,818,489 ( 0.32%) jobs9 perfstat instructions 465,976,220,300 ( 0.53) 7,698,467,237,372 ( 1.47) 486,352,600,321 ( 0.84) branches 125,931,456,162 ( 424.063) 915,207,005,715 ( 498.192) 100,370,404,090 ( 517.439) branch-misses 256,992,445 ( 0.20%) 1,782,809,816 ( 0.19%) 345,239,380 ( 0.34%) jobs10 perfstat instructions 517,406,372,715 ( 0.53) 8,553,527,312,900 ( 1.48) 540,732,653,094 ( 0.84) branches 139,839,780,676 ( 427.732) 1,016,737,699,389 ( 503.172) 111,696,557,638 ( 516.750) branch-misses 259,595,561 ( 0.19%) 1,952,570,279 ( 0.19%) 357,818,661 ( 0.32%) seconds elapsed 20.630411534 96.084546565 12.743373571 seconds elapsed 22.292627625 100.984155001 14.407413560 seconds elapsed 22.396016966 110.344880848 14.032201392 seconds elapsed 22.517330949 113.351459170 14.243074935 seconds elapsed 28.548305104 156.515193765 19.159286861 seconds elapsed 30.453538116 164.559937678 19.362492717 seconds elapsed 33.467108086 188.486827481 21.492612173 seconds elapsed 35.617727591 209.602677783 23.256422492 seconds elapsed 42.584239509 243.959902566 28.458540338 seconds elapsed 47.683632526 269.635248851 31.542404137 Over all, ZSTD has slower WRITE, but much faster READ (perhaps a static compression buffer used during the test helped ZSTD a lot), which results in faster test results. 
Memory consumption (zram mm_stat file):
    zram LZO mm_stat
    mm_stat (jobs1):  2147483648 23068672 33558528 0 33558528 0 0
    mm_stat (jobs2):  2147483648 23068672 33558528 0 33558528 0 0
    mm_stat (jobs3):  2147483648 23068672 33558528 0 33562624 0 0
    mm_stat (jobs4):  2147483648 23068672 33558528 0 33558528 0 0
    mm_stat (jobs5):  2147483648 23068672 33558528 0 33558528 0 0
    mm_stat (jobs6):  2147483648 23068672 33558528 0 33562624 0 0
    mm_stat (jobs7):  2147483648 23068672 33558528 0 33566720 0 0
    mm_stat (jobs8):  2147483648 23068672 33558528 0 33558528 0 0
    mm_stat (jobs9):  2147483648 23068672 33558528 0 33558528 0 0
    mm_stat (jobs10): 2147483648 23068672 33558528 0 33562624 0 0
    zram DEFLATE mm_stat
    mm_stat (jobs1):  2147483648 16252928 25178112 0 25178112 0 0
    mm_stat (jobs2):  2147483648 16252928 25178112 0 25178112 0 0
    mm_stat (jobs3):  2147483648 16252928 25178112 0 25178112 0 0
    mm_stat (jobs4):  2147483648 16252928 25178112 0 25178112 0 0
    mm_stat (jobs5):  2147483648 16252928 25178112 0 25178112 0 0
    mm_stat (jobs6):  2147483648 16252928 25178112 0 25178112 0 0
    mm_stat (jobs7):  2147483648 16252928 25178112 0 25190400 0 0
    mm_stat (jobs8):  2147483648 16252928 25178112 0 25190400 0 0
    mm_stat (jobs9):  2147483648 16252928 25178112 0 25178112 0 0
    mm_stat (jobs10): 2147483648 16252928 25178112 0 25178112 0 0
    zram ZSTD mm_stat
    mm_stat (jobs1):  2147483648 11010048 16781312 0 16781312 0 0
    mm_stat (jobs2):  2147483648 11010048 16781312 0 16781312 0 0
    mm_stat (jobs3):  2147483648 11010048 16781312 0 16785408 0 0
    mm_stat (jobs4):  2147483648 11010048 16781312 0 16781312 0 0
    mm_stat (jobs5):  2147483648 11010048 16781312 0 16781312 0 0
    mm_stat (jobs6):  2147483648 11010048 16781312 0 16781312 0 0
    mm_stat (jobs7):  2147483648 11010048 16781312 0 16781312 0 0
    mm_stat (jobs8):  2147483648 11010048 16781312 0 16781312 0 0
    mm_stat (jobs9):  2147483648 11010048 16781312 0 16785408 0 0
    mm_stat (jobs10): 2147483648 11010048 16781312 0 16781312 0 0
==================================================================================
Official benchmarks [1]:
    Compressor name      Ratio   Compression  Decompress.
    zstd 1.1.3 -1        2.877   430 MB/s     1110 MB/s
    zlib 1.2.8 -1        2.743   110 MB/s     400 MB/s
    brotli 0.5.2 -0      2.708   400 MB/s     430 MB/s
    quicklz 1.5.0 -1     2.238   550 MB/s     710 MB/s
    lzo1x 2.09 -1        2.108   650 MB/s     830 MB/s
    lz4 1.7.5            2.101   720 MB/s     3600 MB/s
    snappy 1.1.3         2.091   500 MB/s     1650 MB/s
    lzf 3.6 -1           2.077   400 MB/s     860 MB/s
Minchan said:
    : I did test with my sample data and compared zstd with deflate. zstd's
    : compress ratio is lower a little bit but compression speed is much faster
    : 3 times more and decompress speed is too 2 times more. With different
    : data, it is different but overall, zstd would be better for speed at the
    : cost of a little lower compress ratio(about 5%) so I believe it's worth to
    : replace deflate.
[1] https://github.com/facebook/zstd Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Tested-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical | Yafang Shao | 2 | -1/+11 | |
The vm direct limit setting must be set greater than the vm background limit setting. Otherwise, print a warning to help the operator figure out that the vm dirtiness settings are in an illogical state. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yafang Shao <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
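A minimal sketch of the check described above, as it might look where the dirty thresholds are computed (e.g. domain_dirty_limits()); the exact message text and the fallback of halving the foreground threshold are assumptions, not the verbatim patch.

```c
	/* thresh = foreground (direct) dirty limit, bg_thresh = background limit */
	if (unlikely(bg_thresh >= thresh)) {
		pr_warn("vm direct limit must be set greater than background limit.\n");
		bg_thresh = thresh / 2;
	}
```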
2017-11-15 | mm/memblock.c: make the index explicit argument of for_each_memblock_type | Gioh Kim | 2 | -8/+8 | |
The for_each_memblock_type macro relies on an idx variable defined in the caller's context. Silent macro arguments are almost always the wrong thing to do. They make code harder to read and easier to get wrong. Let's use an explicit iterator parameter for for_each_memblock_type and make the code more obvious. This patch is a mere cleanup and shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gioh Kim <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
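A sketch of the change: the iterator becomes an explicit macro parameter instead of a name silently picked up from the caller's scope. The bodies follow the mm/memblock.c definition of that time and may differ in detail from the merged patch.

```c
/* before: silently uses a variable named 'idx' from the caller's scope */
#define for_each_memblock_type(memblock_type, rgn)			\
	for (idx = 0, rgn = &memblock_type->regions[0];			\
	     idx < memblock_type->cnt;					\
	     idx++, rgn = &memblock_type->regions[idx])

/* after: the iterator is an explicit argument */
#define for_each_memblock_type(i, memblock_type, rgn)			\
	for (i = 0, rgn = &memblock_type->regions[0];			\
	     i < memblock_type->cnt;					\
	     i++, rgn = &memblock_type->regions[i])
```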
2017-11-15 | mm, memory_hotplug: remove timeout from __offline_memory | Michal Hocko | 1 | -7/+3 | |
We have had a hardcoded 120s timeout after which the memory offline fails basically since hot remove was introduced. This is essentially a policy implemented in the kernel. Moreover there is no way to adjust the timeout, and so we are sometimes facing memory offline failures if the system is under heavy memory pressure or a very intensive CPU workload on large machines. It is not very clear what purpose the timeout actually serves. The offline operation is interruptible by a signal, so if userspace wants some timeout-based termination this can be done trivially by sending a signal. If there is a strong usecase to do this from the kernel then we should do it properly and have it tunable from userspace with the timeout disabled by default, along with an explanation of who uses it and for what purpose. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Michael Ellerman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm, memory_hotplug: do not fail offlining too early | Michal Hocko | 1 | -30/+10 | |
Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2. While testing memory hotplug on a large 4TB machine we have noticed that memory offlining is just too eager to fail. The primary reason is that the retry logic is just too easy to give up. We have 4 ways out of the offline - we have a permanent failure (isolation or memory notifiers fail, or hugetlb pages cannot be dropped) - userspace sends a signal - a hardcoded 120s timeout expires - page migration fails 5 times This is way too convoluted and it doesn't scale very well. We have seen both temporary migration failures as well as 120s being triggered. After removing those restrictions we were able to pass stress testing during memory hot remove without any other negative side effects observed. Therefore I suggest dropping both hard coded policies. I couldn't have found any specific reason for them in the changelog. I neither didn't get any response [1] from Kamezawa. If we need some upper bound - e.g. timeout based - then we should have a proper and user defined policy for that. In any case there should be a clear use case when introducing it. This patch (of 2): Memory offlining can fail too eagerly under heavy memory pressure. page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3 flags: 0x9855fe40010048(uptodate|active|mappedtodisk) page dumped because: isolation failed page->mem_cgroup:ffff8801cd662000 memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed Isolation has failed here because the page is not on LRU. Most probably because it was on the pcp LRU cache or it has been removed from the LRU already but it hasn't been freed yet. In both cases the page doesn't look non-migrable so retrying more makes sense. __offline_pages seems rather cluttered when it comes to the retry logic. We have 5 retries at maximum and a timeout. We could argue whether the timeout makes sense but failing just because of a race when somebody isoltes a page from LRU or puts it on a pcp LRU lists is just wrong. It only takes it to race with a process which unmaps some pages and remove them from the LRU list and we can fail the whole offline because of something that is a temporary condition and actually not harmful for the offline. Please note that unmovable pages should be already excluded during start_isolate_page_range. We could argue that has_unmovable_pages is racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable pages in most cases and movable zone shouldn't contain unmovable pages at all. Some of those pages might be pinned but not for ever because that would be a bug on its own. In any case the context is still interruptible and so the userspace can easily bail out when the operation takes too long. This is certainly better behavior than a hardcoded retry loop which is racy. Fix this by removing the max retry count and only rely on the timeout resp. interruption by a signal from the userspace. Also retry rather than fail when check_pages_isolated sees some !free pages because those could be a result of the race as well. 
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Michael Ellerman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm, page_alloc: fail has_unmovable_pages when seeing reserved pages | Michal Hocko | 1 | -0/+3 | |
Reserved pages should be completely ignored by the core mm because they have a special meaning for their owners. has_unmovable_pages doesn't check for them, so we rely on other tests (reference count, or PageLRU) to fail on such pages. Although this happens to work, it is safer to simply check for them explicitly rather than rely on fields which the owner of the page may abuse for special purposes. Please note that this is more of a further fortification of the code rather than a fix of an existing issue. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages() | Michal Hocko | 3 | -7/+17 | |
Joonsoo has noticed that "mm: drop migrate type checks from has_unmovable_pages" would break the CMA allocator, because it relies on has_unmovable_pages returning false even for CMA pageblocks which in fact don't have to be movable:
    alloc_contig_range
      start_isolate_page_range
        set_migratetype_isolate
          has_unmovable_pages
This is a result of the code sharing between CMA and memory hotplug, while each one has a different idea of what has_unmovable_pages should return. This is unfortunate, but fixing it properly would require a lot of code duplication. Fix the issue by introducing a requested-migrate-type argument and special-casing MIGRATE_CMA, where CMA pageblocks are handled properly. This will work for memory hotplug because it requires MIGRATE_MOVABLE. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Joonsoo Kim <[email protected]> Tested-by: Stefan Wahren <[email protected]> Tested-by: Ran Wang <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
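A schematic sketch of the special-casing described above, not the exact hunk: has_unmovable_pages() receives the caller's requested migratetype, and CMA isolation (via alloc_contig_range) is allowed to isolate CMA pageblocks that memory hotplug would have to treat as unmovable. The surrounding per-page checks and the real parameter list are simplified here.

```c
	/*
	 * Inside has_unmovable_pages(), which now receives the requested
	 * migratetype from set_migratetype_isolate():
	 *
	 * CMA allocations (alloc_contig_range) legitimately isolate CMA
	 * pageblocks even though their pages are not movable in the
	 * memory-hotplug sense, so consider them movable here.  Memory
	 * hotplug asks with MIGRATE_MOVABLE and keeps the strict check.
	 */
	if (is_migrate_cma(migratetype) && is_migrate_cma_page(page))
		return false;

	/* ... strict per-page movability checks follow ... */
```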
2017-11-15 | mm: drop migrate type checks from has_unmovable_pages | Michal Hocko | 1 | -4/+0 | |
Michael has noticed that memory offline tries to migrate kernel code pages when doing
    echo 0 > /sys/devices/system/memory/memory0/online
The current implementation will fail the operation after several failed page migration attempts, but we shouldn't even attempt to migrate that memory; we should fail right away because this memory is clearly not migratable. This will become a real problem when we drop the retry loop counter resp. timeout. The real problem is in has_unmovable_pages, in fact. We should fail if there are any non-migratable pages in the area. In order to guarantee that, remove the migrate type checks, because MIGRATE_MOVABLE is not guaranteed to contain only migratable pages. It is merely a heuristic. Similarly, MIGRATE_CMA does guarantee that the page allocator doesn't allocate any non-migratable pages from the block, but CMA allocations themselves are unlikely to be migratable. Therefore remove both checks. [[email protected]: remove unused local `mt'] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Michael Ellerman <[email protected]> Tested-by: Michael Ellerman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Tested-by: Tony Lindgren <[email protected]> Tested-by: Ran Wang <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm/page-writeback.c: remove unused parameter from balance_dirty_pages() | Tahsin Erdogan | 1 | -3/+2 | |
"mapping" parameter to balance_dirty_pages() is not used anymore. Fixes: dfb8ae567835 ("writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tahsin Erdogan <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm, swap: fix false error message in __swp_swapcount() | Huang Ying | 1 | -0/+3 | |
When a page fault occurs for a swap entry, the physical swap readahead (not the VMA-based swap readahead) may read ahead several swap entries after the faulting swap entry. The readahead algorithm calculates some of the swap entries to read ahead by increasing the offset of the faulting swap entry without checking whether they are beyond the end of the swap device, and it relies on __swp_swapcount() and swapcache_prepare() to check that. Although __swp_swapcount() checks the swap entry passed in, it will complain with the following error message for the expected invalid swap entry, which may confuse end users:
    swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in swapin_readahead() to avoid passing out-of-bound swap entries and the swap entry reserved for the swap header to __swp_swapcount() and swapcache_prepare(). Link: http://lkml.kernel.org/r/[email protected] Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots") Signed-off-by: "Huang, Ying" <[email protected]> Reported-by: Christian Kujau <[email protected]> Acked-by: Minchan Kim <[email protected]> Suggested-by: Minchan Kim <[email protected]> Cc: Tim Chen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> [4.11+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference | Minchan Kim | 3 | -9/+23 | |
When SWP_SYNCHRONOUS_IO swapped-in pages are shared by several processes, skipping the swap cache causes unnecessary memory wastage: with a read swap-in fault they could all share a single page if that page were in the swap cache, avoiding the allocation of new pages with the same content. This patch makes the swapcache skipping work only if the swap pte is not shared. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm, swap: skip swapcache for swapin of synchronous device | Minchan Kim | 4 | -23/+57 | |
With fast swap storage, platforms want to use swap more aggressively, and swap-in is crucial to application latency. The rw_page() based synchronous devices like zram, pmem and btt are such fast storage. When I profile swap-in performance with a zram lz4 decompress test, the software overhead is more than 70%. Maybe it would be bigger on nvdimm. This patch aims to reduce swap-in latency by skipping the swapcache if the swap device is a synchronous device like an rw_page() based device. It improves my swap-in test (5G sequential swap-in, no readahead) by 45%, from 2.41sec to 1.64sec. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
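A hedged sketch of the swapcache bypass described above, roughly in the shape it would take in do_swap_page(); the surrounding fault handling is elided and the exact helpers and flag semantics should be checked against the kernel version in question.

```c
	struct swap_info_struct *si = swp_swap_info(entry);

	if ((si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(si, entry) == 1) {
		/* fast, synchronous device and an unshared entry: bypass the swap cache */
		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
		if (page) {
			__SetPageLocked(page);
			__SetPageSwapBacked(page);
			set_page_private(page, entry.val);
			lru_cache_add_anon(page);
			swap_readpage(page, true);	/* read the page right away */
		}
	} else {
		/* slow path: go through the swap cache as usual */
		page = lookup_swap_cache(entry, vma, vmf->address);
	}
```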
2017-11-15 | mm, swap: introduce SWP_SYNCHRONOUS_IO | Minchan Kim | 2 | -1/+5 | |
If rw_page() based fast storage is used for swap devices, we need to detect it to enhance swap IO operations. This patch is preparation for optimizing the swap-in operation in the next patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | bdi: introduce BDI_CAP_SYNCHRONOUS_IO | Minchan Kim | 5 | -1/+16 | |
As discussed at https://lkml.kernel.org/r/<[email protected]>, someday we will remove rw_page(). If so, we need something to detect such super-fast storage on which synchronous IO operations like the current rw_page() are always a win. Introduce BDI_CAP_SYNCHRONOUS_IO to indicate such devices. With it, we can use various optimization techniques. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
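A sketch of how the capability is exposed and consumed, based on the description above; the accessor mirrors the other BDI_CAP_* helpers in backing-dev.h, and the zram_mark_synchronous() wrapper is illustrative, not a real driver function.

```c
#include <linux/backing-dev.h>
#include <linux/genhd.h>

/* accessor in the style of the other BDI_CAP_* helpers (sketch) */
static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
{
	return bdi->capabilities & BDI_CAP_SYNCHRONOUS_IO;
}

/* a synchronous rw_page() driver such as zram would advertise it like this */
static void zram_mark_synchronous(struct gendisk *disk)
{
	disk->queue->backing_dev_info->capabilities |=
			BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO;
}
```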
2017-11-15 | zram: set BDI_CAP_STABLE_WRITES once | Minchan Kim | 1 | -10/+6 | |
With fast swap storage, platforms want to use swap more aggressively, and swap-in is crucial to application latency. The rw_page() based synchronous devices like zram, pmem and btt are such fast storage. When I profile swap-in performance with a zram lz4 decompress test, the software overhead is more than 70%. Maybe it would be bigger on nvdimm. This patchset reduces swap-in latency by skipping the swapcache if the swap device is a synchronous device like an rw_page() based device. It improves my swap-in test (5G sequential swap-in, no readahead) by 45%, from 2.41sec to 1.64sec. This patch (of 4): Commit 19b7ccf8651d ("block: get rid of blk_integrity_revalidate()") fixed a weird thing (i.e., resetting the BDI_CAP_STABLE_WRITES flag unconditionally whenever revalidate_disk is called), so zram doesn't need to reset the flag any more when revalidating the bdev. Instead, set the flag just once when the zram device is created. It shouldn't change any behavior. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm: update comments for struct page.mapping | Changbin Du | 1 | -2/+4 | |
struct page.mapping can be NULL, or it can point to a single object of type address_space, anon_vma, or a KSM private structure. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Changbin Du <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | net/rds/ib_fmr.c: use kmalloc_array_node() | Johannes Thumshirn | 1 | -2/+2 | |
Now that we have a NUMA-aware version of kmalloc_array() we can use it instead of kmalloc_node() without an overflow check in the size calculation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Santosh Shilimkar <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: David Rientjes <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Hal Rosenstock <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | mm/mempool.c: use kmalloc_array_node() | Johannes Thumshirn | 1 | -1/+1 | |
Now that we have a NUMA-aware version of kmalloc_array() we can use it instead of kmalloc_node() without an overflow check in the size calculation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: David Rientjes <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Hal Rosenstock <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Santosh Shilimkar <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | drivers/infiniband/sw/rdmavt/qp.c: use kmalloc_array_node() | Johannes Thumshirn | 1 | -1/+1 | |
Now that we have a NUMA-aware version of kmalloc_array() we can use it instead of kmalloc_node() without an overflow check in the size calculation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: David Rientjes <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Hal Rosenstock <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Santosh Shilimkar <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | drivers/infiniband/hw/qib/qib_init.c: use kmalloc_array_node() | Johannes Thumshirn | 1 | -2/+3 | |
Now that we have a NUMA-aware version of kmalloc_array() we can use it instead of kmalloc_node() without an overflow check in the size calculation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Hal Rosenstock <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: David Rientjes <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Santosh Shilimkar <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | block/blk-mq.c: use kmalloc_array_node() | Johannes Thumshirn | 1 | -1/+1 | |
Now that we have a NUMA-aware version of kmalloc_array() we can use it instead of kmalloc_node() without an overflow check in the size calculation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jens Axboe <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Hal Rosenstock <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Santosh Shilimkar <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | include/linux/slab.h: add kmalloc_array_node() and kcalloc_node() | Johannes Thumshirn | 1 | -0/+16 | |
Patch series "Add kmalloc_array_node() and kcalloc_node()". Our current memeory allocation routines suffer form an API imbalance, for one we have kmalloc_array() and kcalloc() which check for overflows in size multiplication and we have kmalloc_node() and kzalloc_node() which allow for memory allocation on a certain NUMA node but don't check for eventual overflows. This patch (of 6): We have kmalloc_array() and kcalloc() wrappers on top of kmalloc() which ensure us overflow free multiplication for the size of a memory allocation but these implementations are not NUMA-aware. Likewise we have kmalloc_node() which is a NUMA-aware version of kmalloc() but the implementation is not aware of any possible overflows in eventual size calculations. Introduce a combination of the two above cases to have a NUMA-node aware version of kmalloc_array() and kcalloc(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Thumshirn <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: David Rientjes <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Hal Rosenstock <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Santosh Shilimkar <[email protected]> Cc: Sean Hefty <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | slub: fix sysfs duplicate filename creation when slub_debug=O | Miles Chen | 1 | -0/+4 | |
When slub_debug=O is set, it is possible to clear the debug flags for an "unmergeable" slab cache in kmem_cache_open(). This makes the "unmergeable" cache become "mergeable" in sysfs_slab_add(). These caches will generate their "unique IDs" via create_unique_id(), but it is possible to create identical unique IDs. In my experiment, sgpool-128, names_cache and biovec-256 generate the same ID ":Ft-0004096" and the kernel reports "sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'". To repeat my experiment, set disable_higher_order_debug=1 and CONFIG_SLUB_DEBUG_ON=y in kernel-4.14. Fix this issue by setting unmergeable=1 if slub_debug=O and the default slub_debug contains any no-merge flags.
Call path:
    kmem_cache_create()
      __kmem_cache_alias()    -> we set SLAB_NEVER_MERGE flags here
      create_cache()
        __kmem_cache_create()
          kmem_cache_open()   -> clear DEBUG_METADATA_FLAGS
          sysfs_slab_add()    -> the slab cache is mergeable now

    sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.14.0-rc7ajb-00131-gd4c2e9f-dirty #123
    Hardware name: linux,dummy-virt (DT)
    task: ffffffc07d4e0080 task.stack: ffffff8008008000
    PC is at sysfs_warn_dup+0x60/0x7c
    LR is at sysfs_warn_dup+0x60/0x7c
    pc : lr : pstate: 60000145
    Call trace:
      sysfs_warn_dup+0x60/0x7c
      sysfs_create_dir_ns+0x98/0xa0
      kobject_add_internal+0xa0/0x294
      kobject_init_and_add+0x90/0xb4
      sysfs_slab_add+0x90/0x200
      __kmem_cache_create+0x26c/0x438
      kmem_cache_create+0x164/0x1f4
      sg_pool_init+0x60/0x100
      do_one_initcall+0x38/0x12c
      kernel_init_freeable+0x138/0x1d4
      kernel_init+0x10/0xfc
      ret_from_fork+0x10/0x18
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Miles Chen <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
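A sketch of the fix as the changelog describes it (force the cache to stay unmergeable when slub_debug=O downgraded its debug flags); the exact condition in the merged patch may be phrased differently.

```c
	/* in sysfs_slab_add(), before computing the cache's sysfs name (sketch) */
	if (!unmergeable && disable_higher_order_debug &&
	    (slub_debug & DEBUG_METADATA_FLAGS))
		/*
		 * The cache lost its debug flags in kmem_cache_open() because
		 * of slub_debug=O, but the default debug flags would have made
		 * it unmergeable; keep it unmergeable so create_unique_id()
		 * cannot produce a duplicate /sys/kernel/slab/:Ft-* name.
		 */
		unmergeable = 1;
```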
2017-11-15 | slab, slub, slob: convert slab_flags_t to 32-bit | Alexey Dobriyan | 4 | -28/+28 | |
struct kmem_cache::flags is "unsigned long", which is wider than necessary on 64-bit since no flags are defined in the upper 32 bits. Switch the field to 32 bits and save some space on x86_64 until such flags appear:

add/remove: 0/0 grow/shrink: 0/107 up/down: 0/-657 (-657)
function          old   new   delta
sysfs_slab_add    720   719   -1
...
check_object      699   676   -23

[[email protected]: fix printk warning] Link: http://lkml.kernel.org/r/20171021100635.GA8287@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Acked-by: Pekka Enberg <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
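Roughly, the conversion narrows the underlying type as sketched below (illustrative only; the real patch also touches every user of the type and the printk formats, hence the noted warning fix):

#include <linux/types.h>

/* Sketch: the flags type shrinks from "unsigned long" to 32 bits. */
typedef unsigned int __bitwise slab_flags_t;

struct kmem_cache_sketch {
	slab_flags_t flags;	/* was: unsigned long flags; */
	/* ... remaining members unchanged ... */
};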
2017-11-15 | slab, slub, slob: add slab_flags_t | Alexey Dobriyan | 15 | -81/+97 | |
Add a sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON, etc). SLAB is temporarily bloated by switching the type to "unsigned long"; the follow-up conversion to 32-bit removes the bloat again. Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Acked-by: Pekka Enberg <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
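A sketch of what a sparse-checked flag type looks like (__bitwise and __force are the standard sparse annotations from <linux/types.h>; the flag values shown follow include/linux/slab.h of that era but are not quoted from the patch):

#include <linux/types.h>

/* Temporarily "unsigned long", per the commit; narrowed to 32-bit later. */
typedef unsigned long __bitwise slab_flags_t;

/* __force casts keep sparse quiet when defining the flag constants. */
#define SLAB_RED_ZONE	((slab_flags_t __force)0x00000400U)
#define SLAB_POISON	((slab_flags_t __force)0x00000800U)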
2017-11-15 | mm/slab.c: only set __GFP_RECLAIMABLE once | David Rientjes | 1 | -2/+2 | |
SLAB_RECLAIM_ACCOUNT is a permanent attribute of a slab cache. Set __GFP_RECLAIMABLE as part of its ->allocflags rather than check the cachep flag on every page allocation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
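Roughly, the change moves the test from the allocation hot path into one-time cache setup, as in this sketch (not the exact hunk):

/* At cache setup: fold the permanent attribute into the cached GFP mask. */
static void cache_set_reclaim_hint(struct kmem_cache *cachep)
{
	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
		cachep->allocflags |= __GFP_RECLAIMABLE;
}

/*
 * The page allocation path then passes cachep->allocflags directly instead
 * of re-testing SLAB_RECLAIM_ACCOUNT on every kmem_getpages() call.
 */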
2017-11-15 | mm/slob.c: remove an unnecessary check for __GFP_ZERO | Miles Chen | 1 | -1/+1 | |
Current flow guarantees a valid pointer when handling the __GFP_ZERO case. So remove the unnecessary NULL pointer check. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Miles Chen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
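Illustratively, the simplification has this shape (a generic sketch, not the exact slob.c hunk):

	/* 'b' is already guaranteed valid here, so testing it again is dead code. */
	if (unlikely(gfp & __GFP_ZERO))		/* was: (gfp & __GFP_ZERO) && b */
		memset(b, 0, size);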
2017-11-15 | mm: oom: show unreclaimable slab info when unreclaimable slabs > user memory | Yang Shi | 3 | -2/+67 | |
The kernel may panic when an OOM happens without a killable process; sometimes this is caused by huge unreclaimable slabs used by the kernel. Although kdump could help debug such a problem, it is not available on all architectures and may itself malfunction. Since the kernel has already panicked, it is worth capturing such information in dmesg to aid troubleshooting. Print out unreclaimable slab info (used size and total size) for caches whose actual memory usage is not zero (num_objs * size != 0) when the amount of unreclaimable slab memory is greater than total user memory (LRU pages). The output looks like:

Unreclaimable slab info:
Name                Used     Total
rpc_buffers         31KB     31KB
rpc_tasks           7KB      7KB
ebitmap_node        1964KB   1964KB
avtab_node          5024KB   5024KB
xfs_buf             1402KB   1402KB
xfs_ili             134KB    134KB
xfs_efi_item        115KB    115KB
xfs_efd_item        115KB    115KB
xfs_buf_item        134KB    134KB
xfs_log_item_desc   342KB    342KB
xfs_trans           1412KB   1412KB
xfs_ifork           212KB    212KB

[[email protected]: v11] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
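A sketch of the trigger condition, comparing unreclaimable slab pages against user (LRU) pages before dumping (the helper name is assumed here; the counter names follow the vmstat node statistics of that era, and the exact check in the patch may differ):

#include <linux/vmstat.h>

static bool should_dump_unreclaim_slab(void)
{
	unsigned long nr_lru;

	/* Sum up everything on the LRU lists, i.e. "user" memory. */
	nr_lru = global_node_page_state(NR_ACTIVE_ANON) +
		 global_node_page_state(NR_INACTIVE_ANON) +
		 global_node_page_state(NR_ACTIVE_FILE) +
		 global_node_page_state(NR_INACTIVE_FILE) +
		 global_node_page_state(NR_ISOLATED_ANON) +
		 global_node_page_state(NR_ISOLATED_FILE) +
		 global_node_page_state(NR_UNEVICTABLE);

	/* Dump only when unreclaimable slab outweighs all user memory. */
	return global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru;
}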
2017-11-15 | mm: slabinfo: remove CONFIG_SLABINFO | Yang Shi | 5 | -15/+6 | |
According to the discussion with Christoph (https://marc.info/?l=linux-kernel&m=150695909709711&w=2), it seems pointless to keep CONFIG_SLABINFO around. This patch removes the CONFIG_SLABINFO config option; /proc/slabinfo remains available. [[email protected]: v11] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pekka Enberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | tools: slabinfo: add "-U" option to show unreclaimable slabs only | Yang Shi | 1 | -1/+10 | |
Patch series "oom: capture unreclaimable slab info in oom message", v10. Recently we ran into a oom issue, kernel panic due to no killable process. The dmesg shows huge unreclaimable slabs used almost 100% memory, but kdump doesn't capture vmcore due to some reason. So, it may sound better to capture unreclaimable slab info in oom message when kernel panic to aid trouble shooting and cover the corner case. Since kernel already panic, so capturing more information sounds worthy and doesn't bother normal oom killer. With the patchset, tools/vm/slabinfo has a new option, "-U", to show unreclaimable slab only. And, oom will print all non zero (num_objs * size != 0) unreclaimable slabs in oom killer message. This patch (of 3): Add "-U" option to show unreclaimable slabs only. "-U" and "-S" together can tell us what unreclaimable slabs use the most memory to help debug huge unreclaimable slabs issue. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> | |||||
2017-11-15 | ocfs2: remove unneeded goto in ocfs2_reserve_cluster_bitmap_bits() | Guozhonghua | 1 | -4/+1 | |
Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4F3CDE3A9@H3CMLB14-EX.srv.huawei-3com.com Signed-off-by: guozhonghua <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Joseph Qi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> |