path: root/include
Age | Commit message | Author | Files | Lines
2013-09-11mm/writeback: make writeback_inodes_wb staticWanpeng Li1-2/+0
It's not used globally and could be static. Signed-off-by: Wanpeng Li <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Fengguang Wu <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: vmscan: fix do_try_to_free_pages() livelockLisa Du3-2/+1
This patch is based on KOSAKI's work and I add a little more description, please refer to https://lkml.org/lkml/2012/6/14/74. Currently, I found the system can enter a state where there are lots of free pages in a zone but only order-0 and order-1 pages, which means the zone is heavily fragmented; then a high-order allocation can cause a long stall (e.g. 60 seconds) in the direct reclaim path, especially in a no-swap and no-compaction environment. This problem happened on v3.4, but it seems the issue still lives in the current tree. The reason is that do_try_to_free_pages() enters a livelock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced, as kswapd thinks there is little point in trying all over again and wants to avoid an infinite loop. Instead it changes the order from high-order to 0-order, because kswapd thinks order-0 is the most important. Look at 73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep and may leave zone->all_unreclaimable = 0. It assumes high-order users can still perform direct reclaim if they wish. Direct reclaim continues to reclaim for a high order which is not a COSTLY_ORDER, without the oom-killer, until kswapd turns on zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it means direct reclaim depends on kswapd to break this loop. In the worst case, direct reclaim may continue page reclaim forever while kswapd sleeps forever, until someone like a watchdog detects it and finally kills the process. As described in: http://thread.gmane.org/gmane.linux.kernel.mm/103737 We can't turn on zone->all_unreclaimable from the direct reclaim path because the direct reclaim path doesn't take any lock, so that approach is racy. Thus this patch removes the zone->all_unreclaimable field completely and recalculates the zone reclaimable state every time. Note: we can't take the idea that direct reclaim sees zone->pages_scanned directly while kswapd continues to use zone->all_unreclaimable, because that is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name) describes the detail. [[email protected]: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar <[email protected]> Cc: Ying Han <[email protected]> Cc: Nick Piggin <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Bob Liu <[email protected]> Cc: Neil Zhang <[email protected]> Cc: Russell King - ARM Linux <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Lisa Du <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
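A minimal sketch of the recalculated check, using the uninlined helper names mentioned above (the 6x scanned-pages threshold is an assumption about the exact heuristic, not stated in this log):

    unsigned long zone_reclaimable_pages(struct zone *zone)
    {
            unsigned long nr;

            nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                 zone_page_state(zone, NR_INACTIVE_FILE);

            if (get_nr_swap_pages() > 0)
                    nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                          zone_page_state(zone, NR_INACTIVE_ANON);

            return nr;
    }

    bool zone_reclaimable(struct zone *zone)
    {
            /* treat a zone as unreclaimable once it has been scanned
             * several times over without making progress */
            return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
    }

Direct reclaim then checks zone_reclaimable() on each pass instead of relying on kswapd having set zone->all_unreclaimable.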
2013-09-11mm: munlock: manual pte walk in fast path instead of follow_page_mask()Vlastimil Babka1-6/+6
Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each individual struct page. This entails repeated full page table translations and page table lock taken for each page separately. This patch avoids the costly follow_page_mask() where possible, by iterating over ptes within single pmd under single page table lock. The first pte is obtained by get_locked_pte() for non-THP page acquired by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock is filled up using pte_walk as long as pte_present() and vm_normal_page() are sufficient to obtain the struct page. After this patch, a 14% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Jörn Engel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: track vma changes with VM_SOFTDIRTY bitCyrill Gorcunov1-0/+6
Pavel reported that in case a vma area gets unmapped and then mapped (or expanded) in place, the soft dirty tracker won't be able to recognize this situation, since it works on the pte level and ptes get zapped on unmap, losing the soft dirty bit of course. So to resolve this situation we need to track actions on the vma level, which is where the VM_SOFTDIRTY flag comes in. When a new vma area is created (or an old one is expanded) we set this bit, and keep it there until the application asks for the soft dirty bit to be cleared. Thus when a user space application tracks memory changes, it can now detect whether a vma area has been renewed. Reported-by: Pavel Emelyanov <[email protected]> Signed-off-by: Cyrill Gorcunov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Matt Mackall <[email protected]> Cc: Xiao Guangrong <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Rob Landley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11memblock, numa: binary search node idYinghai Lu1-0/+2
The current early_pfn_to_nid() on architectures that support memblock goes over memblock.memory one entry at a time, so it takes too many tries near the end. We can use the existing memblock_search() to find the node id for a given pfn, which saves some time on bigger systems that have many entries in the memblock.memory array. Here are the timing differences for several machines. In each case, with the patch, less time was spent in __early_pfn_to_nid().

                            3.11-rc5    with patch    difference (%)
                            --------    ----------    --------------
    UV1: 256 nodes  9TB:      411.66        402.47     -9.19 (2.23%)
    UV2: 255 nodes 16TB:     1141.02       1138.12     -2.90 (0.25%)
    UV2:  64 nodes  2TB:      128.15        126.53     -1.62 (1.26%)
    UV2:  32 nodes  2TB:      121.87        121.07     -0.80 (0.66%)

Time in seconds. Signed-off-by: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Acked-by: Russ Anderson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
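A minimal sketch of the binary-search lookup described above (the exported helper name and exact signature are assumptions; memblock_search() is the existing internal binary search in mm/memblock.c):

    /* binary-search memblock.memory for the region covering @pfn and
     * return its node id, or -1 if the pfn is not covered at all */
    int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
                            unsigned long *start_pfn, unsigned long *end_pfn)
    {
            struct memblock_type *type = &memblock.memory;
            int mid = memblock_search(type, (phys_addr_t)pfn << PAGE_SHIFT);

            if (mid == -1)
                    return -1;

            *start_pfn = type->regions[mid].base >> PAGE_SHIFT;
            *end_pfn = (type->regions[mid].base + type->regions[mid].size)
                            >> PAGE_SHIFT;

            return type->regions[mid].nid;
    }

__early_pfn_to_nid() can then call this helper instead of walking the array linearly.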
2013-09-11mm: migrate: check movability of hugepage in unmap_and_move_huge_page()Naoya Horiguchi1-0/+12
Currently hugepage migration works well only for pmd-based hugepages (mainly due to lack of testing), so we had better not enable migration of other levels of hugepages until we are ready for it. Some users of hugepage migration (mbind, move_pages, and migrate_pages) do a page table walk and check pud/pmd_huge() there, so they are safe. But the other users (soft offline and memory hot-remove) don't do this, so without this patch they can try to migrate unexpected types of hugepages. To prevent this, we introduce hugepage_migration_support() as an architecture-dependent check of whether hugepages are implemented on a pmd basis or not. And since on some architectures multiple sizes of hugepages are available, hugepage_migration_support() also checks the hugepage size. Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
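A minimal sketch of the check described above, assuming an architecture-provided pmd_huge_support() predicate (that helper name is an assumption for illustration):

    static inline int hugepage_migration_support(struct hstate *h)
    {
            /* only pmd-sized hugepages are considered migratable for now */
            return pmd_huge_support() && (huge_page_shift(h) == PMD_SHIFT);
    }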
2013-09-11mm: memory-hotplug: enable memory hotplug to handle hugepageNaoya Horiguchi1-0/+6
Until now we can't offline memory blocks which contain hugepages, because a hugepage is considered an unmovable page. But now, with this patch series, a hugepage has become movable, so by using hugepage migration we can offline such memory blocks. What's different from other users of hugepage migration is that we need to decompose all the hugepages inside the target memory block into free buddy pages after hugepage migration, because otherwise free hugepages remaining in the memory block interfere with the memory offlining. For this reason we introduce the new functions dissolve_free_huge_page() and dissolve_free_huge_pages(). Other than that, what this patch does is straightforward: it adds hugepage migration code, that is, it adds hugepage handling to the functions which scan over pfns and collect hugepages to be migrated, and adds a hugepage allocation function to alloc_migrate_target(). As for larger hugepages (1GB for x86_64), it's not easy to do hotremove over them because they are larger than a memory block. So for now we simply leave them to fail as before. [[email protected]: remove duplicated include] Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: migrate: remove VM_HUGETLB from vma flag check in vma_migratable()Naoya Horiguchi1-1/+1
Enable hugepage migration from migrate_pages(2), move_pages(2), and mbind(2). Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: mbind: add hugepage migration code to mbind()Naoya Horiguchi1-0/+3
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with mbind(2) after applying the enablement patch which comes later in this series. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: soft-offline: use migrate_pages() instead of migrate_huge_page()Naoya Horiguchi1-5/+0
Currently migrate_huge_page() takes a pointer to a hugepage to be migrated as an argument, instead of taking a pointer to the list of hugepages to be migrated. This behavior was introduced in commit 189ebff28 ("hugetlb: simplify migrate_huge_page()"), and was OK because until now hugepage migration has been enabled only for soft-offlining, which migrates only one hugepage in a single call. But the situation will change with the later patches in this series, which enable other users of page migration to support hugepage migration. They can kick off migration for both normal pages and hugepages in a single call, so we need to go back to the original implementation which uses linked lists to collect the hugepages to be migrated. With this patch, soft_offline_huge_page() switches to using migrate_pages(), and migrate_huge_page() is not used any more. So let's remove it. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: migrate: make core migration code aware of hugepageNaoya Horiguchi1-0/+4
Currently hugepage migration is available only for soft offlining, but it's also useful for some other users of page migration (clearly because users of hugepages can enjoy the benefits of mempolicy and memory hotplug.) So this patchset tries to extend such users to support hugepage migration. The target of this patchset is to enable hugepage migration for NUMA related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and memory hotplug. This patchset does not add hugepage migration for memory compaction, because users of memory compaction mainly expect to construct thp by arranging raw pages, and there's little or no need to compact hugepages. CMA, another user of page migration, could benefit from hugepage migration, but is not enabled to support it for now (just because of lack of testing and expertise in CMA.) Hugepage migration of non-pmd-based hugepages (for example 1GB hugepages in x86_64, or hugepages in architectures like ia64) is not enabled for now (again, because of lack of testing.) As for how these are achieved, I extended the API (migrate_pages()) to handle hugepages (with patches 1 and 2) and adjusted the code of each caller to check and collect movable hugepages (with patches 3-7). The remaining 2 patches are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is about making sure that we only migrate pmd-based hugepages. And patch 9 is about choosing the appropriate zone for hugepage allocation. My test is mainly a functional one, simply kicking hugepage migration via each entry point and confirming that migration is done correctly. Test code is available here: git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git And I always run the libhugetlbfs test when changing hugetlbfs's code. With this patchset, no regression was found in the test. This patch (of 9): Before enabling each user of page migration to support hugepage, this patch enables the list of pages for migration to link not only LRU pages, but also hugepages. As a result, putback_movable_pages() and migrate_pages() can handle both LRU pages and hugepages. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11lib/genalloc.c: fix overflow of ending address of memory chunkJoonyoung Shim1-2/+2
In struct gen_pool_chunk, end_addr means the end address of the memory chunk (inclusive), but in the implementation it is treated as start address + size of the memory chunk (exclusive), so it points to one past the chunk instead of the correct ending address. The exclusive ending address overflows for a memory chunk that includes the last address of the memory map, e.g. when the starting address is 0xFFF00000 and the size is 0x100000 on a 32-bit machine, the ending address will be 0x100000000. Use the correct, inclusive ending address: starting address + size - 1. [[email protected]: add comment to struct gen_pool_chunk:end_addr] Signed-off-by: Joonyoung Shim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
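A minimal sketch of the fixed convention when a chunk is registered (field names from the description; the surrounding gen_pool_add_virt() code is omitted):

    /*
     * end_addr is inclusive.  With the old exclusive convention,
     * start 0xFFF00000 + size 0x100000 = 0x100000000, which wraps to 0
     * in a 32-bit unsigned long and breaks every range comparison.
     */
    chunk->start_addr = virt;
    chunk->end_addr = virt + size - 1;      /* e.g. 0xFFFFFFFF */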
2013-09-11vmstat: create separate function to fold per cpu diffs into local countersChristoph Lameter1-1/+2
The main idea behind this patchset is to reduce the vmstat update overhead by avoiding interrupt enable/disable and the use of per cpu atomics. This patch (of 3): It is better to have a separate folding function because refresh_cpu_vm_stats() also does other things like expire pages in the page allocator caches. If we have a separate function then refresh_cpu_vm_stats() is only called from the local cpu which allows additional optimizations. The folding function is only called when a cpu is being downed and therefore no other processor will be accessing the counters. Also simplifies synchronization. [[email protected]: fix UP build] Signed-off-by: Christoph Lameter <[email protected]> Cc: KOSAKI Motohiro <[email protected]> CC: Tejun Heo <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11swap: clean-up #ifdef in page_mapping()Joonsoo Kim1-0/+1
PageSwapCache() is always false when !CONFIG_SWAP, so the compiler properly discards the related code. Therefore, we don't need the explicit #ifdef. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: page_alloc: fair zone allocator policyJohannes Weiner1-0/+1
Each zone that holds userspace pages of one workload must be aged at a speed proportional to the zone size. Otherwise, the time an individual page gets to stay in memory depends on the zone it happened to be allocated in. Asymmetry in the zone aging creates rather unpredictable aging behavior and results in the wrong pages being reclaimed, activated etc. But exactly this happens right now because of the way the page allocator and kswapd interact. The page allocator uses per-node lists of all zones in the system, ordered by preference, when allocating a new page. When the first iteration does not yield any results, kswapd is woken up and the allocator retries. Due to the way kswapd reclaims zones below the high watermark while a zone can be allocated from when it is above the low watermark, the allocator may keep kswapd running while kswapd reclaim ensures that the page allocator can keep allocating from the first zone in the zonelist for extended periods of time. Meanwhile the other zones rarely see new allocations and thus get aged much slower in comparison. The result is that the occasional page placed in lower zones gets relatively more time in memory, even gets promoted to the active list after its peers have long been evicted. Meanwhile, the bulk of the working set may be thrashing on the preferred zone even though there may be significant amounts of memory available in the lower zones. Even the most basic test -- repeatedly reading a file slightly bigger than memory -- shows how broken the zone aging is. In this scenario, no single page should be able to stay in memory long enough to get referenced twice and activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
        nr_inactive_file 0
        nr_active_file 0
        nr_inactive_file 0
        nr_active_file 8
        nr_inactive_file 1582
        nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
        nr_inactive_file 0
        nr_active_file 70
        nr_inactive_file 258753
        nr_active_file 443214
        nr_inactive_file 149793
        nr_active_file 12021

Fix this with a very simple round robin allocator. Each zone is allowed a batch of allocations that is proportional to the zone's size, after which it is treated as full. The batch counters are reset when all zones have been tried and the allocator enters the slowpath and kicks off kswapd reclaim. Allocation and reclaim is now fairly spread out to all available/allowable zones:

    $ grep active_file /proc/zoneinfo
        nr_inactive_file 0
        nr_active_file 0
        nr_inactive_file 174
        nr_active_file 4865
        nr_inactive_file 53
        nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
        nr_inactive_file 0
        nr_active_file 0
        nr_inactive_file 666622
        nr_active_file 4988
        nr_inactive_file 190969
        nr_active_file 937

When zone_reclaim_mode is enabled, allocations will now spread out to all zones on the local node, not just the first preferred zone (which on a 4G node might be a tiny Normal zone). Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Paul Bolle <[email protected]> Cc: Zlatko Calusic <[email protected]> Tested-by: Kevin Hilman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
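A minimal sketch of the fair batch accounting in the allocator fastpath (the per-zone counter name NR_ALLOC_BATCH and the exact replenish formula are assumptions about the implementation, not stated in this log):

    /* zonelist scan: skip zones whose fair-share batch is exhausted */
    if (alloc_flags & ALLOC_WMARK_LOW) {
            if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
                    continue;
    }

    /* after a successful allocation from this zone */
    __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));

    /* on entering the slowpath: refill every zone's batch in
     * proportion to its size before kicking kswapd */
    __mod_zone_page_state(zone, NR_ALLOC_BATCH,
                          high_wmark_pages(zone) - low_wmark_pages(zone) -
                          zone_page_state(zone, NR_ALLOC_BATCH));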
2013-09-11mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag tracepoint()Srivatsa S. Bhat1-3/+7
In the current code, the value of fallback_migratetype that is printed using the mm_page_alloc_extfrag tracepoint is the value of the migratetype *after* it has been set to the preferred migratetype (if the ownership was changed). Obviously that wouldn't have been the original intent. (We already have a separate 'change_ownership' field to tell whether the ownership of the pageblock was changed from the fallback_migratetype to the preferred type.) The intent of the fallback_migratetype field is to show the migratetype from which we borrowed pages in order to satisfy the allocation request. So fix the code to print that value correctly. Signed-off-by: Srivatsa S. Bhat <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Cody P Schafer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11swap: make cluster allocation per-cpuShaohua Li1-0/+11
swap cluster allocation is to get better request merging to improve performance. But the cluster is shared globally; if multiple tasks are doing swap, this causes interleaved disk access. Multiple tasks doing swap is quite common: for example, each numa node has a kswapd thread doing swap, plus multiple threads/processes doing direct page reclaim. The io scheduler can't help too much here, because tasks don't send swapout IO down to the block layer at the same time. The block layer does merge some IOs, but a lot are not merged, depending on how many tasks are doing swapout concurrently. In practice, I've seen a lot of small size IO in swapout workloads. We make the cluster allocation per-cpu here. The interleaved disk access issue goes away. All tasks swap out to their own cluster, so swapout becomes sequential, which can easily be merged into big size IO. If one CPU can't get its per-cpu cluster (for example, there is no free cluster anymore in the swap), it falls back to scanning swap_map. The CPU can still continue to swap. We don't need to recycle free swap entries of other CPUs. In my test (swap to a 2-disk raid0 partition), this improves swapout throughput by around 10%, and request size is increased significantly. How this impacts swap readahead is uncertain though. On one side, page reclaim always isolates and swaps several adjacent pages, which makes page reclaim write the pages sequentially and benefits readahead. On the other side, several CPUs writing pages interleaved means the pages don't live _sequentially_ but relatively _near_. In the per-cpu allocation case, if adjacent pages are written by different cpus, they will live relatively _far_. So how this impacts swap readahead depends on how many pages page reclaim isolates and swaps at one time. If the number is big, this patch will benefit swap readahead. Of course, this is about the sequential access pattern. The patch has no impact for the random access pattern, because the new cluster allocation algorithm is just for SSD. An alternative solution is organizing the swap layout to be per-mm instead of this per-cpu approach. In the per-mm layout, we allocate a disk range for each mm, so pages of one mm live adjacently in the swap disk. The per-mm layout has potential lock contention issues if multiple reclaimers are swapping pages from one mm. For a sequential workload, the per-mm layout is better for implementing swap readahead, because pages from the mm are adjacent on disk. But the per-cpu layout isn't very bad in this workload, as page reclaim always isolates and swaps several pages at one time, so such pages will still live on disk sequentially and readahead can utilize this. For a random workload, the per-mm layout doesn't benefit request merging, because it's quite possible that pages from different mms are swapped out at the same time and IO can't be merged in the per-mm layout, while with the per-cpu layout we can merge requests from any mm. Considering that a random workload is more common in workloads with swap (and the per-cpu approach isn't too bad for sequential workloads either), I'm choosing the per-cpu layout. [[email protected]: coding-style fixes] Signed-off-by: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Kyungmin Park <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rafael Aquini <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11swap: make swap discard asyncShaohua Li1-9/+11
swap can do cluster discard for SSD, which is good, but there are some problems here: 1. swap does the discard just before page reclaim gets a swap entry and writes the disk sectors. This is useless for high end SSD, because an overwrite to a sector implies a discard of the original sector too. A discard + overwrite == overwrite. 2. the purpose of doing discard is to improve SSD firmware garbage collection. Ideally we should send discard as early as possible, so the firmware can do something smart. Sending discard just after a swap entry is freed is considered early compared to sending discard before the write. Of course, if the workload is already bound to gc speed, sending discard earlier or later doesn't make a difference. 3. block discard is a sync API, which will delay scan_swap_map() significantly. 4. Write and discard commands can be executed in parallel in a PCIe SSD. Making swap discard async can make execution more efficient. This patch makes swap discard async and moves the discard to where the swap entry is freed. Discard and write have no dependence now, so the above issues can be avoided. Ideally we should do discard for any freed sectors, but some SSD discard is very slow. This patch still does discard for a whole cluster. My test does several rounds of 'mmap, write, unmap', which will trigger a lot of swap discard. On a fusionio card, with this patch, the test runtime is reduced to 18% of the time without it, so around 5.5x faster. [[email protected]: coding-style fixes] Signed-off-by: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Kyungmin Park <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rafael Aquini <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11swap: change block allocation algorithm for SSDShaohua Li1-0/+20
I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to 20~30% CPU time (when a cluster is hard to find, the CPU time can be up to 80%), which becomes a bottleneck. scan_swap_map() scans a byte array to search for a 256-page cluster, which is very slow. Here I introduce a simple algorithm to search for a cluster. Since we only care about 256-page clusters, we can just use a counter to track whether a cluster is free. Every 256 pages use one int to store the counter. If the counter of a cluster is 0, the cluster is free. All free clusters are added to a list, so searching for a cluster is very efficient. With this, the scan_swap_map() overhead disappears. This might help low end SD card swap too, because if the cluster is aligned, SD firmware can do flash erase more efficiently. We only enable the algorithm for SSD. Hard disk swap isn't fast enough and has a downside with the algorithm which might introduce a regression (see below). The patch slightly changes which cluster is chosen. It always adds a free cluster to the list tail. This can help wear leveling for low end SSD too. And if no cluster is found, scan_swap_map() will do its search from the end of the last free cluster, which is random. For SSD, this isn't a problem at all. Another downside is that the cluster must be aligned to 256 pages, which reduces the chance of finding a cluster. I would expect this isn't a big problem for SSD, because there is no seek penalty. (And this is the reason I only enable the algorithm for SSD.) Signed-off-by: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Kyungmin Park <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rafael Aquini <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: vmstats: track TLB flush stats on UP tooDave Hansen1-1/+2
The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled for SMP. UP systems do not do remote TLB flushes, so compile those counters out on UP. arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is probably an optimization since both the mtrr code and __flush_tlb() write cr4. It would probably be safe to make that a flush_tlb_all() (and then get these statistics), but the mtrr code is ancient and I'm hesitant to touch it other than to just stick in the counters. [[email protected]: tweak comments] Signed-off-by: Dave Hansen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11mm: vmstats: tlb flush countersDave Hansen1-0/+5
I was investigating some TLB flush scaling issues and realized that we do not have any good methods for figuring out how many TLB flushes we are doing. It would be nice to be able to do these in generic code, but the arch-independent calls don't explicitly specify whether we actually need to do remote flushes or not. In the end, we really need to know if we actually _did_ global vs. local invalidations, so that leaves us with few options other than to muck with the counters from arch-specific code. Signed-off-by: Dave Hansen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
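A minimal sketch of the counters this adds (the enum member names are assumptions based on the description; they live in the generic vm_event machinery and are bumped from the arch TLB-flush code):

    /* include/linux/vm_event_item.h: new events in enum vm_event_item */
    NR_TLB_REMOTE_FLUSH,            /* cpu tried to flush others' tlbs */
    NR_TLB_REMOTE_FLUSH_RECEIVED,   /* cpu received ipi for flush */
    NR_TLB_LOCAL_FLUSH_ALL,
    NR_TLB_LOCAL_FLUSH_ONE,

    /* arch/x86/mm/tlb.c: bump the counter at each flush site, e.g. */
    count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
    local_flush_tlb();

The values then show up alongside the other vm_event counters in /proc/vmstat.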
2013-09-11mm: mempolicy: turn vma_set_policy() into vma_dup_policy()Oleg Nesterov1-2/+7
Simple cleanup. Every user of vma_set_policy() does the same work, which looks a bit annoying imho. Add a new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: Oleg Nesterov <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
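A minimal sketch of the helper as described (duplicate the source vma's policy, then install it on the destination; the exact error handling is an assumption):

    int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
    {
            struct mempolicy *pol = mpol_dup(vma_policy(src));

            if (IS_ERR(pol))
                    return PTR_ERR(pol);
            dst->vm_policy = pol;
            return 0;
    }

Callers such as fork, split_vma and copy_vma can then replace their open-coded mpol_dup() + vma_set_policy() pairs with a single call.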
2013-09-11block: support embedded device command line partitionCai Zhiyong1-0/+43
Read the block device partition table from the kernel command line. Such partitions are used for fixed block devices (eMMC) in embedded devices. There is no MBR, which saves storage space, the bootloader can easily access data by absolute address on the block device, and users can easily change the partition layout. This code references the MTD partition parser, source "drivers/mtd/cmdlinepart.c". For the partition syntax, see "Documentation/block/cmdline-partition.txt". [[email protected]: fix printk text] [[email protected]: fix error return code in parse_parts()] Signed-off-by: Cai Zhiyong <[email protected]> Cc: Karel Zak <[email protected]> Cc: "Wanglin (Albert)" <[email protected]> Cc: Marius Groeger <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Brian Norris <[email protected]> Cc: Artem Bityutskiy <[email protected]> Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
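A hypothetical command-line example following the blkdevparts= syntax described in the referenced documentation (device and partition names here are made up for illustration):

    blkdevparts=<blkdev-id>:<size>[@<offset>](name)[,<size>(name)...][;<blkdev-id>:...]

    blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)

A trailing "-" takes all remaining space on that device, mirroring the mtdparts= convention.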
2013-09-11include/linux/sched.h: don't use task->pid/tgid in same_thread_group/has_group_leader_pidOleg Nesterov1-4/+4
task_struct->pid/tgid should go away. 1. Change same_thread_group() to use task->signal for comparison. 2. Change has_group_leader_pid(task) to compare task_pid(task) with signal->leader_pid. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Sergey Dyasly <[email protected]> Reviewed-by: "Eric W. Biederman" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
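A minimal sketch of the two helpers after the change, matching points 1 and 2 above (exact formatting is an assumption):

    static inline bool same_thread_group(struct task_struct *p1,
                                         struct task_struct *p2)
    {
            /* threads of one group share a signal_struct */
            return p1->signal == p2->signal;
    }

    static inline bool has_group_leader_pid(struct task_struct *p)
    {
            /* compare pids instead of reading p->tgid */
            return task_pid(p) == p->signal->leader_pid;
    }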
2013-09-11include/linux/smp.h:on_each_cpu(): switch back to a C functionAndrew Morton1-8/+12
Revert commit c846ef7deba2 ("include/linux/smp.h:on_each_cpu(): switch back to a macro"). It turns out that the problematic linux/irqflags.h include was fixed within ia64 and mn10300. Cc: Geert Uytterhoeven <[email protected]> Cc: David Daney <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
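A minimal sketch of the UP on_each_cpu() as a C inline again (the local_irq_save/restore pair is what needs linux/irqflags.h; the exact body is an assumption based on the usual UP stub):

    static inline int on_each_cpu(smp_call_func_t func, void *info, int wait)
    {
            unsigned long flags;

            /* on UP there is only the local cpu, so just call the
             * function with interrupts disabled */
            local_irq_save(flags);
            func(info);
            local_irq_restore(flags);
            return 0;
    }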
2013-09-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-0/+2
Pull networking fixes from David Miller: 1) Brown paper bag fix in HTB scheduler, class options set incorrectly due to a typo. Fix from Vimalkumar. 2) It's possible for the ipv6 FIB garbage collector to run before all the necessary data structures are set up during init, defer the notifier registry to avoid this problem. Fix from Michal Kubecek. 3) New i40e ethernet driver from the Intel folks. 4) Add new qmi wwan device IDs, from Bjørn Mork. 5) Doorbell lock in bnx2x driver is not initialized properly in some configurations, fix from Ariel Elior. 6) Revert an ipv6 packet option padding change that broke standardized ipv6 implementation test suites. From Jiri Pirko. 7) Fix synchronization of ARP information in bonding layer, from Nikolay Aleksandrov. 8) Fix missing error return resulting in illegal memory accesses in openvswitch, from Daniel Borkmann. 9) SCTP doesn't signal poll events properly due to mistaken operator precedence, fix also from Daniel Borkmann. 10) __netdev_pick_tx() passes wrong index to sk_tx_queue_set() which essentially disables caching of TX queue in sockets :-/ Fix from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits) net_sched: htb: fix a typo in htb_change_class() net: qmi_wwan: add new Qualcomm devices ipv6: don't call fib6_run_gc() until routing is ready net: tilegx driver: avoid compiler warning fib6_rules: fix indentation irda: vlsi_ir: Remove casting the return value which is a void pointer irda: donauboe: Remove casting the return value which is a void pointer net: fix multiqueue selection net: sctp: fix smatch warning in sctp_send_asconf_del_ip net: sctp: fix bug in sctp_poll for SOCK_SELECT_ERR_QUEUE net: fib: fib6_add: fix potential NULL pointer dereference net: ovs: flow: fix potential illegal memory access in __parse_flow_nlattrs bcm63xx_enet: remove deprecated IRQF_DISABLED net: korina: remove deprecated IRQF_DISABLED macvlan: Move skb_clone check closer to call qlcnic: Fix warning reported by kbuild test robot. bonding: fix bond_arp_rcv setting and arp validate desync state bonding: fix store_arp_validate race with mode change ipv6/exthdrs: accept tlv which includes only padding bnx2x: avoid atomic allocations during initialization ...
2013-09-11ipv6: don't call fib6_run_gc() until routing is readyMichal Kubeček1-0/+2
When loading the ipv6 module, ndisc_init() is called before ip6_route_init(). As the former registers a handler calling fib6_run_gc(), this opens a window to run the garbage collector before necessary data structures are initialized. If a network device is initialized in this window, adding MAC address to it triggers a NETDEV_CHANGEADDR event, leading to a crash in fib6_clean_all(). Take the event handler registration out of ndisc_init() into a separate function ndisc_late_init() and move it after ip6_route_init(). Signed-off-by: Michal Kubecek <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2013-09-11Merge tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fsLinus Torvalds1-0/+5
Pull 9p updates from Eric Van Hensbergen: "Minor 9p fixes and tweaks for 3.12 merge window The first fixes namespace issues which cause a kernel NULL pointer dereference, the second fixes uevent handling to work better with udev, and the third switches some code to use strlcpy instead of strncpy in order to be safer. All changes have been baking in for-next for at least 2 weeks" * tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: avoid accessing utsname after namespace has been torn down 9p: send uevent after adding/removing mount_tag attribute fs: 9p: use strlcpy instead of strncpy
2013-09-11Merge branch 'pm-cpufreq'Rafael J. Wysocki1-1/+0
* pm-cpufreq: intel_pstate: Add Haswell CPU models Revert "cpufreq: make sure frequency transitions are serialized" cpufreq: Use signed type for 'ret' variable, to store negative error values cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock cpufreq: Split __cpufreq_remove_dev() into two parts cpufreq: Fix wrong time unit conversion cpufreq: serialize calls to __cpufreq_governor() cpufreq: don't allow governor limits to be changed when it is disabled
2013-09-11Merge remote-tracking branch 'asoc/fix/rsnd' into asoc-linusMark Brown1-1/+1
2013-09-11Merge remote-tracking branch 'asoc/fix/fsl' into asoc-linusMark Brown4-6/+14
2013-09-11Merge remote-tracking branch 'asoc/fix/atmel' into asoc-linusMark Brown11-157/+389
2013-09-10Merge tag 'for-v3.12' of git://git.infradead.org/battery-2.6Linus Torvalds3-0/+58
Pull battery/power supply driver updates from Anton Vorontsov: "New drivers: - APM X-Gene system reboot driver by Feng Kan and Loc Ho (APM). - Qualcomm MSM reboot/poweroff driver by Abhimanyu Kapur (Codeaurora). - Texas Instruments BQ24190 charger driver by Mark A. Greer (Animal Creek Technologies). - Texas Instruments TWL4030 MADC battery driver by Lukas Märdian and Marek Belisko (Golden Delicious Computers). The driver is used on Freerunner GTA04 phones. Highlighted fixes and improvements: - Suspend/wakeup logic improvements: power supply objects will block system suspend until all power supply events are processed. Thanks to Zoran Markovic (Linaro), Arve Hjonnevag and Todd Poynor (Google)" * tag 'for-v3.12' of git://git.infradead.org/battery-2.6: rx51_battery: Fix channel number when reading adc value power: Add twl4030_madc battery driver. bq24190_charger: Workaround SS definition problem on i386 builds power_supply: Prevent suspend until power supply events are processed vexpress-poweroff: Should depend on the required infrastructure twl4030-charger: Fix compiler warning with regulator_enable() rx51_battery: Replace hardcoded channels values. bq24190_charger: Add support for TI BQ24190 Battery Charger ab8500-charger: We print an unintended error message max8925_power: Fix missing of_node_put power_supply: Replace strict_strtol() with kstrtol() power: Add APM X-Gene system reboot driver power_supply: tosa_battery: Get rid of irq_to_gpio usage power supply: collie_battery: Convert to use dev_pm_ops power_supply: Make goldfish_battery depend on GOLDFISH || COMPILE_TEST power: reset: Add msm restart support MAINTAINERS: drivers/power: add entry for SmartReflex AVS drivers
2013-09-10target/iscsi: Bump versions to v4.1.0Nicholas Bellinger1-1/+1
Signed-off-by: Nicholas Bellinger <[email protected]>
2013-09-10Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds3-0/+252
Pull drm fixes from Dave Airlie: "Daniel had some fixes queued up, that were delayed, the stolen memory ones and vga arbiter ones are quite useful, along with his usual bunch of stuff, nothing for HSW outputs yet. The one nouveau fix is for a regression I caused with the poweroff stuff" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (30 commits) drm/nouveau: fix oops on runtime suspend/resume drm/i915: Delay disabling of VGA memory until vgacon->fbcon handoff is done drm/i915: try not to lose backlight CBLV precision drm/i915: Confine page flips to BCS on Valleyview drm/i915: Skip stolen region initialisation if none is reserved drm/i915: fix gpu hang vs. flip stall deadlocks drm/i915: Hold an object reference whilst we shrink it drm/i915: fix i9xx_crtc_clock_get for multiplied pixels drm/i915: handle sdvo input pixel multiplier correctly again drm/i915: fix hpd work vs. flush_work in the pageflip code deadlock drm/i915: fix up the relocate_entry refactoring drm/i915: Fix pipe config warnings when dealing with LVDS fixed mode drm/i915: Don't call sg_free_table() if sg_alloc_table() fails i915: Update VGA arbiter support for newer devices vgaarb: Fix VGA decodes changes vgaarb: Don't disable resources that are not owned drm/i915: Pin pages whilst mapping the dma-buf drm/i915: enable trickle feed on Haswell x86: add early quirk for reserving Intel graphics stolen memory v5 drm/i915: split PCI IDs out into i915_drm.h v4 ...
2013-09-10Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2-3/+20
Pull nfsd updates from Bruce Fields: "This was a very quiet cycle! Just a few bugfixes and some cleanup" * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: rpc: let xdr layer allocate gssproxy receieve pages rpc: fix huge kmalloc's in gss-proxy rpc: comment on linux_cred encoding, treat all as unsigned rpc: clean up decoding of gssproxy linux creds svcrpc: remove unused rq_resused nfsd4: nfsd4_create_clid_dir prints uninitialized data nfsd4: fix leak of inode reference on delegation failure Revert "nfsd: nfs4_file_get_access: need to be more careful with O_RDWR" sunrpc: prepare NFS for 2038 nfsd4: fix setlease error return nfsd: nfs4_file_get_access: need to be more careful with O_RDWR
2013-09-10target: Add Third Party Copy (3PC) bit in INQUIRY responseNicholas Bellinger1-0/+3
This patch adds the Third Party Copy (3PC) bit to signal support for EXTENDED_COPY within standard inquiry response data. Also add emulate_3pc device attribute in configfs (enabled by default) to allow the exposure of this bit to be disabled, if necessary. Cc: Christoph Hellwig <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Martin Petersen <[email protected]> Cc: Chris Mason <[email protected]> Cc: Roland Dreier <[email protected]> Cc: Zach Brown <[email protected]> Cc: James Bottomley <[email protected]> Cc: Nicholas Bellinger <[email protected]> Signed-off-by: Nicholas Bellinger <[email protected]>
2013-09-10target: Add support for EXTENDED_COPY copy offload emulationNicholas Bellinger1-0/+1
This patch adds support for EXTENDED_COPY emulation from SPC-3, which enables full copy offload target support within both a single virtual backend device, and across multiple virtual backend devices. It also functions independent of target fabric, and supports copy offload across multiple target fabric ports. This implementation supports both EXTENDED_COPY PUSH and PULL models of operation, so the actual CDB may be received on either the source or destination logical unit. For Target Descriptors, it currently supports the NAA IEEE Registered Extended designator (type 0xe4), which allows the reference of target ports to occur independent of fabric type using EVPD 0x83 WWNs. For Segment Descriptors, it currently supports copy from block to block (0x02) mode. It also honors any present SCSI reservations of the destination target port. Note that only Supports No List Identifier (SNLID=1) mode is supported. Also included is basic RECEIVE_COPY_RESULTS with service action type OPERATING PARAMETERS (0x03) required for SNLID=1 operation. v3 changes: - Fix incorrect return type in target_do_receive_copy_results() (Fengguang) v2 changes: - Use target_alloc_sgl() instead of transport_generic_get_mem() - Convert debug output to use pr_debug() - Convert target_xcopy_parse_target_descriptors() NAA IEEE WWN dump to use 0x%16phN format specification - Drop unnecessary xcopy_pt_cmd->xpt_passthrough_wsem, and associated usage in xcopy_pt_write_pending() and target_xcopy_issue_pt_cmd() - Add check for unsupported EXTENDED_COPY(LID4) service action bits in target_do_xcopy() Cc: Christoph Hellwig <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Martin Petersen <[email protected]> Cc: Chris Mason <[email protected]> Cc: Roland Dreier <[email protected]> Cc: Zach Brown <[email protected]> Cc: James Bottomley <[email protected]> Cc: Nicholas Bellinger <[email protected]> Signed-off-by: Nicholas Bellinger <[email protected]>
2013-09-10target: Add global device list for EXTENDED_COPYNicholas Bellinger1-0/+1
EXTENDED_COPY needs to be able to search a global list of devices based on NAA WWN device identifiers, so add a simple g_device_list protected by g_device_mutex. Cc: Christoph Hellwig <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Martin Petersen <[email protected]> Cc: Chris Mason <[email protected]> Cc: Roland Dreier <[email protected]> Cc: Zach Brown <[email protected]> Cc: James Bottomley <[email protected]> Cc: Nicholas Bellinger <[email protected]> Signed-off-by: Nicholas Bellinger <[email protected]>
2013-09-10target: Make helpers non static for EXTENDED_COPY command setupNicholas Bellinger1-0/+4
Both target_alloc_sgl() and transport_generic_map_mem_to_cmd() are required by EXTENDED_COPY logic when setting up internally dispatched command descriptors, so go ahead and make both of these non static. Cc: Christoph Hellwig <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Martin Petersen <[email protected]> Cc: Chris Mason <[email protected]> Cc: Roland Dreier <[email protected]> Cc: Zach Brown <[email protected]> Cc: James Bottomley <[email protected]> Cc: Nicholas Bellinger <[email protected]> Signed-off-by: Nicholas Bellinger <[email protected]>
2013-09-10target/tcm_qla2xxx: Add/use target_reverse_dma_direction() in target_core_fabric.hNicholas Bellinger1-0/+26
Reversing the dma_data_direction for pci_map_sg() friends is useful for other drivers, so move it from tcm_qla2xxx into inline code within target_core_fabric.h. Also drop the internal usage of the equivalent in tcm_qla2xxx fabric code. Reported-by: Christoph Hellwig <[email protected]> Cc: Roland Dreier <[email protected]> Cc: Giridhar Malavali <[email protected]> Cc: Chad Dupuis <[email protected]> Cc: Nicholas Bellinger <[email protected]> Signed-off-by: Nicholas Bellinger <[email protected]>
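A minimal sketch of how such an inline helper could look in target_core_fabric.h (the handling of DMA_NONE and the exact cases are assumptions):

    static inline enum dma_data_direction
    target_reverse_dma_direction(struct se_cmd *se_cmd)
    {
            /* map the direction seen by the initiator to the direction
             * the local HBA must use for pci_map_sg() and friends */
            switch (se_cmd->data_direction) {
            case DMA_TO_DEVICE:
                    return DMA_FROM_DEVICE;
            case DMA_FROM_DEVICE:
                    return DMA_TO_DEVICE;
            case DMA_NONE:
            default:
                    return DMA_NONE;
            }
    }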
2013-09-10target: Add support for COMPARE_AND_WRITE emulationNicholas Bellinger1-0/+1
This patch adds support for COMPARE_AND_WRITE emulation on a per block basis. This logic is used as an atomic test-and-set primitive, currently used by VMWare ESX VAAI for performing array side locking of individual VMFS extent ownership. This includes the COMPARE_AND_WRITE CDB parsing within sbc_parse_cdb(), and does the majority of the work within compare_and_write_callback(): performing the verify instance user data comparison, and the subsequent write instance user data I/O submission upon a successful comparison. The synchronization is enforced by se_device->caw_sem, which is obtained before the initial READ I/O submission in sbc_compare_and_write(). The semaphore is then released upon MISCOMPARE in compare_and_write_callback(), or upon WRITE instance user-data completion in compare_and_write_post(). The implementation currently assumes a single logical block (NoLB=1). v4 changes: - Explicitly clear cmd->transport_complete_callback for two failure cases in sbc_compare_and_write() in order to avoid double unlock of ->caw_sem in compare_and_write_callback() (Dan Carpenter) v3 changes: - Convert se_device->caw_mutex to ->caw_sem v2 changes: - Set SCF_COMPARE_AND_WRITE and cmd->execute_cmd() to sbc_compare_and_write() during setup in sbc_parse_cdb() - Use sbc_compare_and_write() for initial READ submission with DMA_FROM_DEVICE - Reset cmd->execute_cmd() to sbc_execute_rw() for write instance user-data in compare_and_write_callback() - Drop SCF_BIDI command flag usage - Set TRANSPORT_PROCESSING + transport_state flags before write instance submission, and convert to __target_execute_cmd() - Prevent sbc_get_size() from being called twice to generate incorrect size in sbc_parse_cdb() - Enforce se_device->caw_mutex synchronization between initial READ I/O submission, and final WRITE I/O completion. Cc: Christoph Hellwig <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Martin Petersen <[email protected]> Cc: Chris Mason <[email protected]> Cc: James Bottomley <[email protected]> Cc: Nicholas Bellinger <[email protected]> Signed-off-by: Nicholas Bellinger <[email protected]>
2013-09-10list_lru: dynamically adjust node arraysGlauber Costa1-11/+2
We currently use a compile-time constant to size the node array for the list_lru structure. Due to this, we don't need to allocate any memory at initialization time. But as a consequence, the structures that contain embedded list_lru lists can become way too big (the superblock for instance contains two of them). This patch aims at ameliorating this situation by dynamically allocating the node arrays with the firmware provided nr_node_ids. Signed-off-by: Glauber Costa <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Al Viro <[email protected]> Cc: Artem Bityutskiy <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Stultz <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Steven Whitehouse <[email protected]> Cc: Thomas Hellstrom <[email protected]> Cc: Trond Myklebust <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
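A minimal sketch of list_lru_init() allocating the node array at runtime from the firmware-provided nr_node_ids instead of sizing it with a compile-time MAX_NUMNODES constant (field names follow the list_lru code; the exact layout is an assumption):

    int list_lru_init(struct list_lru *lru)
    {
            int i;

            lru->node = kzalloc(nr_node_ids * sizeof(*lru->node), GFP_KERNEL);
            if (!lru->node)
                    return -ENOMEM;

            nodes_clear(lru->active_nodes);
            for (i = 0; i < nr_node_ids; i++) {
                    spin_lock_init(&lru->node[i].lock);
                    INIT_LIST_HEAD(&lru->node[i].list);
                    lru->node[i].nr_items = 0;
            }
            return 0;
    }

Structures embedding a list_lru (the superblock, for instance) then need a matching list_lru_destroy() to free the array.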
2013-09-10shrinker: Kill old ->shrink API.Dave Chinner2-12/+7
There are no more users of this API, so kill it dead, dead, dead and quietly bury the corpse in a shallow, unmarked grave in a dark forest deep in the hills... [[email protected]: added flowers to the grave] Signed-off-by: Dave Chinner <[email protected]> Signed-off-by: Glauber Costa <[email protected]> Reviewed-by: Greg Thelen <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Al Viro <[email protected]> Cc: Artem Bityutskiy <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Stultz <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Steven Whitehouse <[email protected]> Cc: Thomas Hellstrom <[email protected]> Cc: Trond Myklebust <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2013-09-10fs: convert inode and dentry shrinking to be node awareDave Chinner1-2/+2
Now that the shrinker is passing a node in the scan control structure, we can pass this to the generic LRU list code to isolate reclaim to the lists on matching nodes. Signed-off-by: Dave Chinner <[email protected]> Signed-off-by: Glauber Costa <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Al Viro <[email protected]> Cc: Artem Bityutskiy <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Stultz <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Steven Whitehouse <[email protected]> Cc: Thomas Hellstrom <[email protected]> Cc: Trond Myklebust <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2013-09-10vmscan: per-node deferred workGlauber Costa1-2/+12
The list_lru infrastructure already keeps per-node LRU lists in its node-specific list_lru_node arrays and provides us with a per-node API, and the shrinkers are properly equipped with node information. This means that we can now focus our shrinking effort on a single node, but the work that is deferred from one run to another is kept global at nr_in_batch. Work can be deferred, for instance, during direct reclaim under a GFP_NOFS allocation, in which situation all the filesystem shrinkers will be prevented from running and will accumulate in nr_in_batch the amount of work they should have done, but could not. This creates an impedance problem, where upon node pressure, deferred work will accumulate and end up being flushed in other nodes. The problem we describe is particularly harmful on big machines, where many nodes can accumulate at the same time, all adding to the global counter nr_in_batch. As we accumulate more and more, we start to ask the caches to flush even bigger numbers. The result is that the caches are depleted and do not stabilize. To achieve stable steady state behavior, we need to tackle it differently. In this patch we keep the deferred count per-node, in the new array nr_deferred[] (the name is also a bit more descriptive), and will never accumulate that to other nodes. Signed-off-by: Glauber Costa <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Al Viro <[email protected]> Cc: Artem Bityutskiy <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Stultz <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Steven Whitehouse <[email protected]> Cc: Thomas Hellstrom <[email protected]> Cc: Trond Myklebust <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2013-09-10shrinker: add node awarenessDave Chinner1-0/+3
Pass the node of the current zone being reclaimed to shrink_slab(), allowing the shrinker control nodemask to be set appropriately for node-aware shrinkers. Signed-off-by: Dave Chinner <[email protected]> Signed-off-by: Glauber Costa <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Al Viro <[email protected]> Cc: Artem Bityutskiy <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Stultz <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Steven Whitehouse <[email protected]> Cc: Thomas Hellstrom <[email protected]> Cc: Trond Myklebust <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
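A rough standalone C sketch of the idea (the structure, fields, and functions here are illustrative assumptions, not the kernel's shrink_control API): a scan control carries a node mask that the reclaim path fills in and node-aware shrinkers honor.

#include <stdint.h>

/* Simplified stand-in for a node-aware scan control. */
struct scan_control_sketch {
	unsigned long nr_to_scan;
	uint64_t      nodes_to_scan;   /* bit N set => scan objects on node N */
};

/* Reclaim path: restrict the scan to the node of the zone being reclaimed. */
static void setup_scan(struct scan_control_sketch *sc, int nid, unsigned long nr)
{
	sc->nr_to_scan = nr;
	sc->nodes_to_scan = 1ULL << nid;
}

/* A node-aware shrinker walks only the per-node lists whose bit is set;
 * freeing is represented by a dummy count here. */
static unsigned long shrink_sketch(const struct scan_control_sketch *sc, int nr_nodes)
{
	unsigned long freed = 0;
	int nid;

	for (nid = 0; nid < nr_nodes; nid++)
		if (sc->nodes_to_scan & (1ULL << nid))
			freed += sc->nr_to_scan;
	return freed;
}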
2013-09-10list_lru: remove special case function list_lru_dispose_all.Glauber Costa1-17/+0
The list_lru implementation has one function, list_lru_dispose_all, with only one user (the dentry code). At first, such a function appears to make sense because we are really not interested in the result of isolating each dentry separately - all of them are going away anyway. However, its implementation is buggy in the following way: When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries marking them with DCACHE_SHRINK_LIST. However, this is done without the nlru->lock taken. The immediate result of that is that someone else may add or remove the dentry from the LRU at the same time. When list_lru_del happens in that scenario we will see an element that is not yet marked with DCACHE_SHRINK_LIST (even though it will be in the future) and remove it from an LRU on which the element no longer resides. Since list_lru_dispose_all will in effect count down nlru's nr_items and list_lru_del will do the same, this will lead to an imbalance. The solution for this would not be so simple: we can obviously just keep the lru_lock taken, but then we have no guarantee that we will be able to acquire the dentry lock (dentry->d_lock). To properly solve this, we need a communication mechanism between the lru and dentry code, so they can coordinate this with each other. Such a mechanism already exists in the form of the list_lru_walk_cb callback. So it is possible to construct a dcache-side prune function that does the right thing only by calling list_lru_walk in a loop until no more dentries are available. With only one user, plus the fact that a sane solution for the problem would involve bouncing between dcache and list_lru anyway, I see little justification to keep the special-case list_lru_dispose_all in tree. Signed-off-by: Glauber Costa <[email protected]> Cc: Michal Hocko <[email protected]> Acked-by: Dave Chinner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
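The replacement pattern suggested above - repeatedly walking the LRU from the cache side until it is drained - can be sketched in standalone C as follows (the toy_* names and the callback signature are illustrative, not the kernel's list_lru_walk_cb interface):

#include <stddef.h>

enum lru_status { LRU_REMOVED, LRU_SKIP };

typedef enum lru_status (*walk_cb)(void *item, void *arg);

struct toy_lru {
	void   **items;     /* flat array standing in for the linked list */
	size_t   nr_items;
};

/* Walk up to nr_to_walk items, letting the callback decide per item;
 * isolated items are removed (here: compacted out of the array). */
static size_t toy_lru_walk(struct toy_lru *lru, walk_cb cb, void *arg,
			   size_t nr_to_walk)
{
	size_t visited = 0;
	size_t i = 0;

	while (i < lru->nr_items && visited < nr_to_walk) {
		visited++;
		if (cb(lru->items[i], arg) == LRU_REMOVED)
			lru->items[i] = lru->items[--lru->nr_items];
		else
			i++;   /* could not isolate it on this pass; move on */
	}
	return visited;
}

/* Cache-side "dispose all": keep bouncing back into the walk until the
 * LRU is drained; assumes the callback eventually isolates every item. */
static void toy_prune_all(struct toy_lru *lru, walk_cb isolate, void *arg)
{
	while (lru->nr_items > 0)
		toy_lru_walk(lru, isolate, arg, lru->nr_items);
}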
2013-09-10list_lru: per-node APIGlauber Costa1-5/+34
This patch adapts the list_lru API to accept an optional node argument, to be used by NUMA-aware shrinking functions. Code that does not care about the NUMA placement of objects can still call into the very same functions as before. They will simply iterate over all nodes. Signed-off-by: Glauber Costa <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Al Viro <[email protected]> Cc: Artem Bityutskiy <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Stultz <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Steven Whitehouse <[email protected]> Cc: Thomas Hellstrom <[email protected]> Cc: Trond Myklebust <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
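A stripped-down standalone C sketch of that shape (all names are illustrative, not the kernel's list_lru declarations): a _node variant for NUMA-aware callers plus a node-agnostic wrapper that simply loops over every node.

#include <stddef.h>

#define NR_NODES 4   /* illustrative fixed node count */

struct node_lru {
	size_t nr_items;   /* items currently on this node's list */
};

struct per_node_lru {
	struct node_lru node[NR_NODES];
};

/* NUMA-aware callers ask about one node explicitly... */
static size_t lru_count_node(const struct per_node_lru *lru, int nid)
{
	return lru->node[nid].nr_items;
}

/* ...while node-agnostic callers keep the old interface, which now
 * simply iterates over every node. */
static size_t lru_count(const struct per_node_lru *lru)
{
	size_t total = 0;
	int nid;

	for (nid = 0; nid < NR_NODES; nid++)
		total += lru_count_node(lru, nid);
	return total;
}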
2013-09-10list_lru: per-node list infrastructureDave Chinner1-5/+18
Now that we have an LRU list API, we can start to enhance the implementation. This splits the single LRU list into per-node lists and locks to enhance scalability. Items are placed on lists according to the node the memory belongs to. To make scanning the lists efficient, we also track whether the per-node lists have entries in them in an active nodemask. Note: we use a fixed-size array for the node LRU, so this struct can be very big if MAX_NUMNODES is big. If this becomes a problem it is fixable by turning the array into a pointer and dynamically allocating it to nr_node_ids. That quantity is firmware-provided and would still provide room for all nodes, at the cost of a pointer lookup and an extra allocation, which may very well fail. [[email protected]: fix warnings, added note about node lru] Signed-off-by: Dave Chinner <[email protected]> Signed-off-by: Glauber Costa <[email protected]> Reviewed-by: Greg Thelen <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Al Viro <[email protected]> Cc: Artem Bityutskiy <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Stultz <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Steven Whitehouse <[email protected]> Cc: Thomas Hellstrom <[email protected]> Cc: Trond Myklebust <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
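The layout described above can be sketched in standalone C as follows (illustrative names, a pthread mutex in place of a spinlock, and a plain 64-bit mask in place of nodemask_t):

#include <pthread.h>
#include <stdint.h>

#define MAX_NODES 64   /* fixed-size array, as discussed in the note above */

struct toy_list_head {
	struct toy_list_head *next, *prev;
};

/* One list and lock per node: lock contention is split across nodes, and
 * an item goes on the list of the node its memory belongs to. */
struct toy_lru_node {
	pthread_mutex_t      lock;
	struct toy_list_head list;
	long                 nr_items;
};

struct toy_list_lru {
	struct toy_lru_node node[MAX_NODES];
	uint64_t            active_nodes;   /* bit N set => node N has entries */
};

/* Add an item to its node's list; assumes each per-node list was set up
 * as an empty circular list and its mutex was initialized. */
static void toy_lru_add(struct toy_list_lru *lru, int nid,
			struct toy_list_head *item)
{
	struct toy_lru_node *nlru = &lru->node[nid];

	pthread_mutex_lock(&nlru->lock);
	item->next = nlru->list.next;
	item->prev = &nlru->list;
	nlru->list.next->prev = item;
	nlru->list.next = item;
	if (nlru->nr_items++ == 0)
		lru->active_nodes |= 1ULL << nid;   /* node is now worth scanning */
	pthread_mutex_unlock(&nlru->lock);
}

Keeping a "has entries" bit per node lets a scanner skip empty nodes without touching their locks, which is the efficiency point the active nodemask is meant to address.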