path: root/include/linux/vmstat.h
2012-12-11  mm: numa: Add pte updates, hinting and migration stats  (Mel Gorman, 1 file, -0/+8)
It is tricky to quantify the basic cost of automatic NUMA placement in a meaningful manner. This patch adds some vmstats that can be used as part of a basic costing model.

    u         = basic unit = sizeof(void *)
    Ca        = cost of struct page access = sizeof(struct page) / u
    Cpte      = cost of PTE access = Ca
    Cupdate   = cost of PTE update = (2 * Cpte) + (2 * Wlock)
                where Cpte is incurred twice, for a read and a write, and Wlock
                is a constant representing the cost of taking or releasing a lock
    Cnumahint = cost of a minor page fault = some high constant, e.g. 1000
    Cpagerw   = cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci        = cost of page isolation = Ca + Wi
                where Wi is a constant that should reflect the approximate cost
                of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
                where Wnuma is the approximate NUMA factor. 1 is local. 1.2 would
                imply that remote accesses are 20% more expensive.

    Balancing cost = Cpte * numa_pte_updates +
                     Cnumahint * numa_hint_faults +
                     Ci * numa_pages_migrated +
                     Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages were isolated even though it would miss pages that failed to migrate. A vmstat counter could have been added for it, but the isolation cost is pretty marginal in comparison to the overall cost, so it seemed overkill.

The ideal way to measure automatic placement benefit would be to count the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

but the information is not readily available. As a workload converges, the expectation is that the number of remote NUMA hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults

where this is measured for the last N NUMA hints recorded. When the workload is fully converged the value is 1. This can measure whether the placement policy is converging and how fast it is doing so.

Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]>
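The model above is easiest to sanity-check by plugging numbers in. The snippet below is an illustrative, userspace-only calculation with made-up counter readings and arbitrary weight constants (Wlock, Wi, Wnuma), assuming a 64-byte struct page and a 4096-byte page; it is not taken from the patch.

    #include <stdio.h>

    int main(void)
    {
            /* all weights and counter values below are invented for illustration */
            double u = sizeof(void *);                /* basic unit */
            double Ca = 64 / u;                       /* assumes sizeof(struct page) == 64 */
            double Cpte = Ca;
            double Wi = 1.0, Wnuma = 1.2;
            double Cnumahint = 1000;
            double Cpagerw = Ca + 4096 / u;           /* assumes PAGE_SIZE == 4096 */
            double Ci = Ca + Wi;
            double Cpagecopy = Cpagerw + Cpagerw * Wnuma + Ci + Ci * Wnuma;

            /* hypothetical /proc/vmstat readings */
            double numa_pte_updates = 100000, numa_hint_faults = 2000;
            double numa_hint_faults_local = 1800, numa_pages_migrated = 500;

            double cost = Cpte * numa_pte_updates +
                          Cnumahint * numa_hint_faults +
                          Ci * numa_pages_migrated +
                          Cpagecopy * numa_pages_migrated;

            printf("balancing cost: %.0f units\n", cost);
            printf("convergence:    %.2f\n",
                   numa_hint_faults_local / numa_hint_faults);
            return 0;
    }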
2012-10-09  memory-hotplug: fix zone stat mismatch  (Minchan Kim, 1 file, -0/+4)
During memory hotplug I found that NR_ISOLATED_[ANON|FILE] were increasing, causing the kernel to hang. When the system doesn't have enough free pages, it enters reclaim but never reclaims any pages due to too_many_isolated()==true, and it loops forever. The cause is that when we do memory hot-add after memory remove, __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset() although the vm_stat_diff of all CPUs still have values. In addition, when we offline all pages of the zone, we reset them in zone_pcp_reset without draining, so we lose some zone stat items. Reviewed-by: Wen Congyang <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Cc: Kamezawa Hiroyuki <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  cma: count free CMA pages  (Bartlomiej Zolnierkiewicz, 1 file, -0/+8)
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this counter always available (not only when CONFIG_CMA=y). [[email protected]: use conventional migratetype naming] Signed-off-by: Bartlomiej Zolnierkiewicz <[email protected]> Signed-off-by: Kyungmin Park <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
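As a rough sketch (not taken from the patch), a counter like this is kept in sync wherever CMA pageblocks enter or leave the free lists, using the ordinary vmstat accessors declared in this header; the helper names and hook placement below are assumptions.

    #include <linux/mmzone.h>
    #include <linux/vmstat.h>

    /* hypothetical helper: call where pages of a given migratetype are added
     * to (positive delta) or removed from (negative delta) the free lists */
    static inline void account_free_cma(struct zone *zone, int migratetype,
                                        long nr_pages)
    {
            if (migratetype == MIGRATE_CMA)
                    __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
    }

    /* a later watermark check can then ignore pages reserved for CMA: */
    static inline long usable_free_pages(struct zone *zone)
    {
            return zone_page_state(zone, NR_FREE_PAGES) -
                   zone_page_state(zone, NR_FREE_CMA_PAGES);
    }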
2012-07-31  mm: remove redundant initialization  (Minchan Kim, 1 file, -5/+0)
pg_data_t is zeroed before reaching free_area_init_core(), so remove the now unnecessary initializations. Signed-off-by: Minchan Kim <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Ralf Baechle <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-07-26  atomic: use <linux/atomic.h>  (Arun Sharma, 1 file, -1/+1)
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: David Miller <[email protected]> Cc: Eric Dumazet <[email protected]> Acked-by: Mike Frysinger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-26  mm: move enum vm_event_item into a standalone header file  (Andrew Morton, 1 file, -61/+1)
enums are problematic because they cannot be forward-declared:

    akpm2:/home/akpm> cat t.c

    enum foo;

    static inline void bar(enum foo f)
    {
    }

    akpm2:/home/akpm> gcc -c t.c
    t.c:4: error: parameter 1 ('f') has incomplete type

So move the enum's definition into a standalone header file which can be used wherever its definition is needed.

Cc: Ying Han <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
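In other words, the enum gets its own small header that anything can include for the full definition. Abbreviated sketch of the resulting layout (the item list is truncated here):

    /* include/linux/vm_event_item.h -- the new standalone header */
    enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, /* ... */ };

    /* include/linux/vmstat.h -- now pulls in the definition instead of
     * providing it, so inline helpers taking the enum keep working */
    #include <linux/vm_event_item.h>

    static inline void count_vm_event(enum vm_event_item item)
    {
            /* per-cpu accounting elided */
    }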
2011-05-25  mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur  (KOSAKI Motohiro, 1 file, -0/+3)
Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug doesn't. There is no reason for this. [[email protected]: fix CONFIG_SMP=n build] Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25  mm: per-node vmstat: show proper vmstats  (KOSAKI Motohiro, 1 file, -1/+3)
commit 2ac390370a ("writeback: add /sys/devices/system/node/<node>/vmstat") added a per-node vmstat entry, but strangely it only shows nr_written and nr_dirtied.

    # cat /sys/devices/system/node/node20/vmstat
    nr_written 0
    nr_dirtied 0

Of course, that is not adequate. With this patch, the file shows all VM statistics, as /proc/vmstat does.

    # cat /sys/devices/system/node/node0/vmstat
    nr_free_pages 899224
    nr_inactive_anon 201
    nr_active_anon 17380
    nr_inactive_file 31572
    nr_active_file 28277
    nr_unevictable 0
    nr_mlock 0
    nr_anon_pages 17321
    nr_mapped 8640
    nr_file_pages 60107
    nr_dirty 33
    nr_writeback 0
    nr_slab_reclaimable 6850
    nr_slab_unreclaimable 7604
    nr_page_table_pages 3105
    nr_kernel_stack 175
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem 260
    nr_dirtied 1050
    nr_written 938
    numa_hit 962872
    numa_miss 0
    numa_foreign 0
    numa_interleave 8617
    numa_local 962872
    numa_other 0
    nr_anon_transparent_hugepages 0

[[email protected]: no externs in .c files]
Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Michael Rubin <[email protected]> Cc: Wu Fengguang <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-04-14  mm: add VM counters for transparent hugepages  (Andi Kleen, 1 file, -0/+7)
I found it difficult to make sense of transparent huge pages without having any counters for its actions. Add some counters to vmstat for allocation of transparent hugepages and fallback to smaller pages. Optional patch, but useful for development and understanding the system. Contains improvements from Andrea Arcangeli and Johannes Weiner [[email protected]: coding-style fixes] [[email protected]: fix vmstat_text[] entries] Signed-off-by: Andi Kleen <[email protected]> Acked-by: Andrea Arcangeli <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-03-22  mm: add __GFP_OTHER_NODE flag  (Andi Kleen, 1 file, -2/+2)
Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in zone_statistics() that an allocation is on behalf of another thread. This way the local and remote counters can be still correct, even when background daemons like khugepaged are changing memory mappings. This only affects the accounting, but I think it's worth doing that right to avoid confusing users. I first tried to just pass down the right node, but this required a lot of changes to pass down this parameter and at least one addition of a 10th argument to a 9 argument function. Using the flag is a lot less intrusive. Open: should be also used for migration? [[email protected]: coding-style fixes] Signed-off-by: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
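A hedged sketch of the accounting idea (simplified; not the exact code from the patch): with __GFP_OTHER_NODE, "local" is judged against the preferred zone's node instead of the node of the CPU doing the allocation, so khugepaged's work is charged to the process it acts on behalf of.

    static void zone_statistics(struct zone *preferred_zone, struct zone *z,
                                gfp_t flags)
    {
            if (z->zone_pgdat == preferred_zone->zone_pgdat) {
                    __inc_zone_state(z, NUMA_HIT);
            } else {
                    __inc_zone_state(z, NUMA_MISS);
                    __inc_zone_state(preferred_zone, NUMA_FOREIGN);
            }

            /* __GFP_OTHER_NODE: compare against the node the allocation is
             * for, not the node of the CPU that happens to be allocating */
            if (z->node == ((flags & __GFP_OTHER_NODE) ?
                            preferred_zone->node : numa_node_id()))
                    __inc_zone_state(z, NUMA_LOCAL);
            else
                    __inc_zone_state(z, NUMA_OTHER);
    }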
2011-01-13  mm: vmstat: use a single setter function and callback for adjusting percpu thresholds  (Mel Gorman, 1 file, -4/+6)
reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid errors due to counter drift. The functions duplicate some code so this patch replaces them with a single set_pgdat_percpu_threshold() that takes a callback function to calculate the desired threshold as a parameter. [[email protected]: readability tweak] [[email protected]: set_pgdat_percpu_threshold(): don't use for_each_online_cpu] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
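A hedged sketch of the resulting shape (the exact signatures and fields may differ from the patch): one setter walks the node's zones and applies whatever threshold the supplied callback computes, so the kswapd wake/sleep paths differ only in which callback they pass.

    void set_pgdat_percpu_threshold(pg_data_t *pgdat,
                                    int (*calculate_pressure)(struct zone *))
    {
            struct zone *zone;
            int cpu, threshold, i;

            for (i = 0; i < pgdat->nr_zones; i++) {
                    zone = &pgdat->node_zones[i];
                    if (!zone->percpu_drift_mark)
                            continue;

                    threshold = calculate_pressure(zone);
                    for_each_possible_cpu(cpu)
                            per_cpu_ptr(zone->pageset, cpu)->stat_threshold
                                    = threshold;
            }
    }

    /* callers then choose the policy by callback, roughly: */
    /*   set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold); */
    /*   set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);   */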
2011-01-13  mm: page allocator: adjust the per-cpu counter threshold when memory is low  (Mel Gorman, 1 file, -0/+5)
Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To avoid synchronization overhead, these counters are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimate and the real value of NR_FREE_PAGES can be very high. The system can get into a state where pages are allocated far below the min watermark, potentially causing livelock issues. The commit solved the problem by taking a better reading of NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can consume a large amount of CPU time on systems with many sockets due to cache line bouncing. This patch takes a different approach. For large machines where counter drift might be unsafe and while kswapd is awake, the per-cpu thresholds for the target pgdat are reduced to limit the level of drift to what should be a safe level. This incurs a performance penalty under heavy memory pressure by a factor that depends on the workload and the machine, but the machine should function correctly without accidentally exhausting all memory on a node. There is an additional cost when kswapd wakes and sleeps, but the event is not expected to be frequent - in Shaohua's test case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is introduced that takes a more accurate reading of NR_FREE_PAGES when called from wakeup_kswapd, when deciding whether it is really safe to go back to sleep in sleeping_prematurely() and when deciding if a zone is really balanced or not in balance_pgdat(). We are still using an expensive function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark functions is reduced. The following report is on the percentage of time cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot() and zone_page_state():

    vanilla            11.6615%
    disable-threshold   0.2584%

David said:

: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity. I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).

Signed-off-by: Mel Gorman <[email protected]> Reported-by: Shaohua Li <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Tested-by: Nicolas Bareil <[email protected]> Cc: David Rientjes <[email protected]> Cc: Kyle McMartin <[email protected]> Cc: <[email protected]> [2.6.37.1, 2.6.36.x] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-09-09  mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake  (Christoph Lameter, 1 file, -0/+22)
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is cheaper than scanning a number of lists. To avoid synchronization overhead, counter deltas are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimated and real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the number of pages actually free in the buddy lists, the VM can allocate pages below the min watermark, at worst reducing the real number of free pages to zero. Even if the OOM killer kills some victim to free memory, it may not free memory if the exit path requires a new page, resulting in livelock. This patch introduces a zone_page_state_snapshot() function (courtesy of Christoph) that takes a slightly more accurate view of an arbitrary vmstat counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark being accidentally broken. The estimate is not perfect and may result in cache line bounces but is expected to be lighter than the IPI calls necessary to continually drain the per-cpu counters while kswapd is awake. Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
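The snapshot reader boils down to folding each CPU's undrained delta into the global per-zone counter; the sketch below is close to, but not guaranteed identical to, the helper the patch adds.

    static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                                    enum zone_stat_item item)
    {
            long x = atomic_long_read(&zone->vm_stat[item]);

    #ifdef CONFIG_SMP
            int cpu;

            for_each_online_cpu(cpu)
                    x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

            if (x < 0)
                    x = 0;
    #endif
            return x;
    }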
2010-05-25  mm: compaction: direct compact when a high-order allocation fails  (Mel Gorman, 1 file, -0/+1)
Ordinarily when a high-order allocation fails, direct reclaim is entered to free pages to satisfy the allocation. With this patch, it is determined if an allocation failed due to external fragmentation instead of low memory and if so, the calling process will compact until a suitable page is freed. Compaction by moving pages in memory is considerably cheaper than paging out to disk and works where there are locked pages or no swap. If compaction fails to free a page of a suitable size, then reclaim will still occur. Direct compaction returns as soon as possible. As each block is compacted, it is checked if a suitable page has been freed and if so, it returns. [[email protected]: Fix build errors] [[email protected]: fix count_vm_event preempt in memory compaction direct reclaim] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-05-25  mm: compaction: memory compaction core  (Mel Gorman, 1 file, -0/+3)
This patch is the core of a mechanism which compacts memory in a zone by relocating movable pages towards the end of the zone. A single compaction run involves a migration scanner and a free scanner. Both scanners operate on pageblock-sized areas in the zone. The migration scanner starts at the bottom of the zone and searches for all movable pages within each area, isolating them onto a private list called migratelist. The free scanner starts at the top of the zone and searches for suitable areas and consumes the free pages within making them available for the migration scanner. The pages isolated for migration are then migrated to the newly isolated free pages. [[email protected]: Fix unsafe optimisation] [[email protected]: do not schedule work on other CPUs for compaction] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
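A greatly simplified, illustrative-only sketch of the two scanners described above; every helper named here is a stand-in for illustration, not the real compaction API.

    /* stand-in helpers, assumed for illustration only */
    unsigned long isolate_movable_pages(struct zone *zone, unsigned long pfn,
                                        struct list_head *migratelist);
    unsigned long isolate_free_pages(struct zone *zone, unsigned long pfn,
                                     struct list_head *freelist);
    void migrate_isolated_pages(struct list_head *migratelist,
                                struct list_head *freelist);

    static void compact_zone_sketch(struct zone *zone)
    {
            unsigned long migrate_pfn = zone->zone_start_pfn;   /* scans upwards   */
            unsigned long free_pfn = zone->zone_start_pfn +
                                     zone->spanned_pages;       /* scans downwards */
            LIST_HEAD(migratelist);
            LIST_HEAD(freelist);

            /* stop when the two scanners meet in the middle of the zone */
            while (migrate_pfn < free_pfn) {
                    migrate_pfn = isolate_movable_pages(zone, migrate_pfn,
                                                        &migratelist);
                    free_pfn = isolate_free_pages(zone, free_pfn, &freelist);
                    migrate_isolated_pages(&migratelist, &freelist);
            }
    }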
2010-01-05  Merge branch 'master' into percpu  (Tejun Heo, 1 file, -0/+2)
Conflicts:
    arch/powerpc/platforms/pseries/hvCall.S
    include/linux/percpu.h
2009-12-15  vmscan: stop kswapd waiting on congestion when the min watermark is not being met  (KOSAKI Motohiro, 1 file, -1/+2)
If reclaim fails to make sufficient progress, the priority is raised. Once the priority is higher, kswapd starts waiting on congestion. However, if the zone is below the min watermark then kswapd needs to continue working without delay as there is a danger of an increased rate of GFP_ATOMIC allocation failure. This patch changes the conditions under which kswapd waits on congestion by only going to sleep if the min watermarks are being met. [[email protected]: add stats to track how relevant the logic is] [[email protected]: make kswapd only check its own zones and rename the relevant counters] Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2009-12-15  vmscan: have kswapd sleep for a short interval and double check it should be asleep  (Mel Gorman, 1 file, -0/+1)
After kswapd balances all zones in a pgdat, it goes to sleep. In the event of no IO congestion, kswapd can go to sleep very shortly after the high watermark was reached. If there are a constant stream of allocations from parallel processes, it can mean that kswapd went to sleep too quickly and the high watermark is not being maintained for sufficient length time. This patch makes kswapd go to sleep as a two-stage process. It first tries to sleep for HZ/10. If it is woken up by another process or the high watermark is no longer met, it's considered a premature sleep and kswapd continues work. Otherwise it goes fully to sleep. This adds more counters to distinguish between fast and slow breaches of watermarks. A "fast" premature sleep is one where the low watermark was hit in a very short time after kswapd going to sleep. A "slow" premature sleep indicates that the high watermark was breached after a very short interval. Signed-off-by: Mel Gorman <[email protected]> Cc: Frans Pop <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2009-10-29  percpu: remove per_cpu__ prefix.  (Rusty Russell, 1 file, -4/+4)
Now that the return from alloc_percpu is compatible with the address of per-cpu vars, it makes sense to hand around the address of per-cpu variables. To make this sane, we remove the per_cpu__ prefix we created to stop people accidentally using these vars directly. Now that we have sparse, we can use that instead (next patch). tj: * Updated to convert stuff which was missed by or added after the original patch. * Kill per_cpu_var() macro. Signed-off-by: Rusty Russell <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Christoph Lameter <[email protected]>
2009-10-03  this_cpu: Use this_cpu ops for VM statistics  (Christoph Lameter, 1 file, -6/+4)
Using per cpu atomics for the vm statistics reduces their overhead. And in the case of x86 we are guaranteed that they will never race even in the lax form used for vm statistics. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
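Roughly, the conversion turns the per-cpu event helpers in this header into this_cpu operations; a hedged sketch of the before/after shape, simplified from the real header:

    DECLARE_PER_CPU(struct vm_event_state, vm_event_states);

    static inline void __count_vm_event(enum vm_event_item item)
    {
            /* was roughly: __get_cpu_var(vm_event_states).event[item]++; */
            __this_cpu_inc(vm_event_states.event[item]);
    }

    static inline void count_vm_event(enum vm_event_item item)
    {
            /* the non-__ form is safe without the caller disabling preemption */
            this_cpu_inc(vm_event_states.event[item]);
    }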
2009-09-22  mm: count only reclaimable lru pages  (Wu Fengguang, 1 file, -9/+2)
global_lru_pages() / zone_lru_pages() can be used in two ways:
- to estimate max reclaimable pages in determine_dirtyable_memory()
- to calculate the slab scan ratio

When swap is full or not present, the anon lru lists are not reclaimable and also won't be scanned. So the anon pages shall not be counted in either usage scenario. Also rename to _reclaimable_pages: now they are counting the possibly reclaimable lru pages. It can greatly (and correctly) increase the slab scan rate under high memory pressure (when most file pages have been reclaimed and swap is full/absent), thus reducing false OOM kills.

Acked-by: Peter Zijlstra <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Wu Fengguang <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: Jesse Barnes <[email protected]> Cc: David Howells <[email protected]> Cc: "Li, Ming Chun" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2009-09-22  mm: remove __{add,sub}_zone_page_state()  (KOSAKI Motohiro, 1 file, -5/+0)
__add_zone_page_state() and __sub_zone_page_state() are unused. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2009-06-16  vmscan: count the number of times zone_reclaim() scans and fails  (Mel Gorman, 1 file, -0/+3)
On NUMA machines, the administrator can configure zone_reclaim_mode, which is a more targeted form of direct reclaim. On machines with large NUMA distances, for example, zone_reclaim_mode defaults to 1, meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile, but it is possible for the heuristic to fail and for the CPU to get tied up scanning uselessly. Detecting the situation requires some guesswork and experimentation, so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during high CPU utilisation this counter is increasing rapidly, then the resolution to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0. [[email protected]: name things consistently] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2009-06-16  mm: remove CONFIG_UNEVICTABLE_LRU config option  (KOSAKI Motohiro, 1 file, -2/+0)
Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this configurability is unnecessary. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Andi Kleen <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Matt Mackall <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Lee Schermerhorn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2008-10-23  proc: move /proc/zoneinfo boilerplate to mm/vmstat.c  (Alexey Dobriyan, 1 file, -1/+0)
Signed-off-by: Alexey Dobriyan <[email protected]> Acked-by: Christoph Lameter <[email protected]>
2008-10-23  proc: move /proc/vmstat boilerplate to mm/vmstat.c  (Alexey Dobriyan, 1 file, -1/+0)
Signed-off-by: Alexey Dobriyan <[email protected]> Acked-by: Christoph Lameter <[email protected]>
2008-10-23  proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c  (Alexey Dobriyan, 1 file, -1/+0)
Signed-off-by: Alexey Dobriyan <[email protected]>
2008-10-23  proc: move /proc/buddyinfo boilerplate to mm/vmstat.c  (Alexey Dobriyan, 1 file, -1/+0)
Signed-off-by: Alexey Dobriyan <[email protected]>
2008-10-20  mlock: count attempts to free mlocked page  (Lee Schermerhorn, 1 file, -0/+1)
Allow freeing of mlock()ed pages. This shouldn't happen, but during development it occasionally did. This patch allows us to survive that condition, while keeping the statistics and events correct for debug. Signed-off-by: Lee Schermerhorn <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2008-10-20  vmstat: mlocked pages statistics  (Nick Piggin, 1 file, -0/+4)
Add NR_MLOCK zone page state, which provides a (conservative) count of mlocked pages (actually, the number of mlocked pages moved off the LRU). Reworked by lts to fit in with the modified mlock page support in the Reclaim Scalability series. [[email protected]: fix incorrect Mlocked field of /proc/meminfo] [[email protected]: mlocked-pages: add event counting with statistics] Signed-off-by: Nick Piggin <[email protected]> Signed-off-by: Lee Schermerhorn <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2008-10-20  unevictable lru: add event counting with statistics  (Lee Schermerhorn, 1 file, -0/+5)
Fix to unevictable-lru-page-statistics.patch: add the unevictable lru infrastructure vm events to the statistics patch. Rename the "NORECL_" and "noreclaim_" symbols and text strings to "UNEVICTABLE_" and "unevictable_", respectively. Currently, both the infrastructure and the mlocked pages event are added by a single patch later in the series. This makes it difficult to add or rework the incremental patches. The events actually "belong" with the stats, so pull them up to here. Also, restore the event counting to putback_lru_page(). This was removed from a previous patch in the series where it was "misplaced". The actual events weren't defined that early. Signed-off-by: Lee Schermerhorn <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2008-10-20  vmscan: split LRU lists into anon & file sets  (Rik van Riel, 1 file, -0/+10)
Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs. The advantage of doing this is that the VM will not have to scan over lots of anonymous pages (which we generally do not want to swap out), just to find the page cache pages that it should evict. This patch has the infrastructure and a basic policy to balance how much we scan the anon lists and how much we scan the file lists. The big policy changes are in separate patches. [[email protected]: collect lru meminfo statistics from correct offset] [[email protected]: prevent incorrect oom under split_lru] [[email protected]: fix pagevec_move_tail() doesn't treat unevictable page] [[email protected]: memcg swapbacked pages active] [[email protected]: splitlru: BDI_CAP_SWAP_BACKED] [[email protected]: fix /proc/vmstat units] [[email protected]: memcg: fix handling of shmem migration] [[email protected]: adjust Quicklists field of /proc/meminfo] [[email protected]: fix style issue of get_scan_ratio()] Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Lee Schermerhorn <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
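A hedged, abbreviated sketch of the split: each LRU becomes an anon and a file variant, and helpers can ask which kind a list is (the unevictable list arrives later in the series).

    enum lru_list {
            LRU_INACTIVE_ANON,
            LRU_ACTIVE_ANON,
            LRU_INACTIVE_FILE,
            LRU_ACTIVE_FILE,
            NR_LRU_LISTS
    };

    static inline int is_file_lru(enum lru_list lru)
    {
            return lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE;
    }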
2008-07-24  mm/vmstat.c: proper externs  (Adrian Bunk, 1 file, -0/+6)
This patch adds proper extern declarations for five variables in include/linux/vmstat.h Signed-off-by: Adrian Bunk <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2008-04-28  hugetlb: vmstat events for huge page allocations  (Adam Litke, 1 file, -0/+4)
Allocating huge pages directly from the buddy allocator is not guaranteed to succeed. Success depends on several factors (such as the amount of physical memory available and the level of fragmentation). With the addition of dynamic hugetlb pool resizing, allocations can occur much more frequently. For these reasons it is desirable to keep track of huge page allocation successes and failures. Add two new vmstat entries to track huge page allocations that succeed and fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE being enabled. [[email protected]: reduced ifdeffery] Signed-off-by: Adam Litke <[email protected]> Signed-off-by: Eric Munson <[email protected]> Tested-by: Mel Gorman <[email protected]> Reviewed-by: Andy Whitcroft <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2008-04-28  mm: remember what the preferred zone is for zone_statistics  (Mel Gorman, 1 file, -1/+1)
On NUMA, zone_statistics() is used to record events like numa hit, miss and foreign. It assumes that the first zone in a zonelist is the preferred zone. When multiple zonelists are replaced by one that is filtered, this is no longer the case. This patch records what the preferred zone is rather than assuming the first zone in the zonelist is it. This simplifies the reading of later patches in this set. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Lee Schermerhorn <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Nick Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2008-02-29  let __dec_zone_page_state use __dec_zone_state  (Uwe Kleine-König, 1 file, -2/+1)
This removes code duplication and makes __dec_zone_page_state look like __inc_zone_page_state. Signed-off-by: Uwe Kleine-König <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2007-07-17  Create the ZONE_MOVABLE zone  (Mel Gorman, 1 file, -2/+3)
The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM and __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a single memory partition while allowing movable allocations to be satisfied from either partition. The patches may be applied with the list-based anti-fragmentation patches that groups pages together based on mobility. The size of the zone is determined by a kernelcore= parameter specified at boot-time. This specifies how much memory is usable by non-movable allocations and the remainder is used for ZONE_MOVABLE. Any range of pages within ZONE_MOVABLE can be released by migrating the pages or by reclaiming. When selecting a zone to take pages from for ZONE_MOVABLE, there are two things to consider. First, only memory from the highest populated zone is used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second, the amount of memory usable by the kernel will be spread evenly throughout NUMA nodes where possible. If the nodes are not of equal size, the amount of memory usable by the kernel on some nodes may be greater than others. By default, the zone is not as useful for hugetlb allocations because they are pinned and non-migratable (currently at least). A sysctl is provided that allows huge pages to be allocated from that zone. This means that the huge page pool can be resized to the size of ZONE_MOVABLE during the lifetime of the system assuming that pages are not mlocked. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. Credit goes to Andy Whitcroft for catching a large variety of problems during review of the patches. This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added memory continues to be placed in their existing destination as there is no mechanism to redirect them to a specific zone. [[email protected]: Fix section mismatch of memory hotplug related code] [[email protected]: various fixes] Signed-off-by: Mel Gorman <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Yasunori Goto <[email protected]> Cc: William Lee Irwin III <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2007-05-09  vmstat: use our own timer events  (Christoph Lameter, 1 file, -3/+0)
vmstat is currently using the cache reaper to periodically bring the statistics up to date. The cache reaper only exists in SLUB as a way to provide compatibility with SLAB. This patch removes the vmstat calls from the slab allocators and provides its own handling.

The advantage is also that we can use a different frequency for the updates. Refreshing vm stats is a pretty fast job, so we can run this every second and stagger it by only one tick. This will lead to some overlap in large systems. F.e. a system running at 250 HZ with 1024 processors will have 4 vm updates occurring at once. However, the vm stats update only accesses per-node information. It is only necessary to stagger the vm statistics updates per processor in each node. Vm counter updates occurring on distant nodes will not cause cacheline contention.

We could implement an alternate approach that runs the first processor on each node at the second and then each of the other processors on a node on a subsequent tick. That may be useful to keep a large amount of the second free of timer activity. Maybe the timer folks will have some feedback on this one?

[[email protected]: add missing break]
Cc: Arjan van de Ven <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Jiri Slaby <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
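The staggering described above amounts to arming each CPU's delayed work one extra tick apart; a hedged sketch of the shape, close to but not necessarily identical to the code the patch adds to mm/vmstat.c:

    static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
    int sysctl_stat_interval __read_mostly = HZ;

    static void vmstat_update(struct work_struct *w)
    {
            refresh_cpu_vm_stats(smp_processor_id());
            schedule_delayed_work(&__get_cpu_var(vmstat_work),
                                  sysctl_stat_interval);
    }

    static void start_cpu_timer(int cpu)
    {
            struct delayed_work *work = &per_cpu(vmstat_work, cpu);

            INIT_DELAYED_WORK(work, vmstat_update);
            /* HZ + cpu: everyone runs once a second, offset by CPU number */
            schedule_delayed_work_on(cpu, work, HZ + cpu);
    }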
2007-02-11  [PATCH] count_vm_events-warning-fix  (Andrew Morton, 1 file, -18/+29)
- Prevent things like this:

      block/ll_rw_blk.c: In function 'submit_bio':
      block/ll_rw_blk.c:3222: warning: unused variable 'count'

  inlines are very, very preferable to macros.

- remove unused get_cpu_vm_events() macro

Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
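The warning arises because a macro stub leaves its arguments unevaluated, while an empty static inline still "uses" them. A hedged sketch of the difference; the submit_bio caller below is abbreviated and hypothetical:

    #ifdef CONFIG_VM_EVENT_COUNTERS
    static inline void count_vm_events(enum vm_event_item item, long delta)
    {
            /* real per-cpu accounting elided */
    }
    #else
    /* an empty inline, NOT "#define count_vm_events(e, d) do { } while (0)",
     * so arguments that exist only for accounting still count as used */
    static inline void count_vm_events(enum vm_event_item item, long delta)
    {
    }
    #endif

    void submit_bio_sketch(int rw, unsigned long sectors)
    {
            int count = sectors;    /* only used for the statistics below */

            count_vm_events(rw ? PGPGOUT : PGPGIN, count);
            /* with the macro form and CONFIG_VM_EVENT_COUNTERS=n this would
             * warn: unused variable 'count' */
    }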
2007-02-11  [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM  (Christoph Lameter, 1 file, -4/+13)
Make ZONE_DMA optional in core code.

- ifdef all code for ZONE_DMA and related definitions following the example for ZONE_DMA32 and ZONE_HIGHMEM.
- Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of 0.
- Modify the VM statistics to work correctly without a DMA zone.
- Modify slab to not create DMA slabs if there is no ZONE_DMA.

[[email protected]: cleanup] [[email protected]: build fix] [[email protected]: Simplify calculation of the number of bits we need for ZONES_SHIFT] Signed-off-by: Christoph Lameter <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Kyle McMartin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: James Bottomley <[email protected]> Cc: Paul Mundt <[email protected]> Signed-off-by: Andy Whitcroft <[email protected]> Signed-off-by: Jeff Dike <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2007-02-11  [PATCH] Drop free_pages()  (Christoph Lameter, 1 file, -0/+1)
nr_free_pages is now a simple access to a global variable. Make it a macro instead of a function. nr_free_pages now requires vmstat.h to be included. There is one occurrence in power management where we need to add the include. Directly refer to global_page_state() there to clarify why the #include was added. [[email protected]: arm build fix] [[email protected]: sparc64 build fix] Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2007-02-11  [PATCH] Use ZVC for inactive and active counts  (Christoph Lameter, 1 file, -0/+9)
The determination of the dirty ratio to determine writeback behavior is currently based on the number of total pages on the system. However, not all pages in the system may be dirtied. Thus the ratio is always too low and can never reach 100%. The ratio may be particularly skewed if large hugepage allocations, slab allocations or device driver buffers make large sections of memory not available anymore. In that case we may get into a situation in which f.e. the background writeback ratio of 40% cannot be reached anymore, which leads to undesired writeback behavior.

This patchset fixes that issue by determining the ratio based on the actual pages that may potentially be dirty. These are the pages on the active and the inactive list plus free pages. The problem with those counts has so far been that it is expensive to calculate these because counts from multiple nodes and multiple zones will have to be summed up. This patchset makes these counters ZVC counters. This means that a current sum per zone, per node and for the whole system is always available via global variables and not expensive anymore to calculate.

The patchset results in some other good side effects:

- Removal of the various functions that sum up free, active and inactive page counts
- Cleanup of the functions that display information via the proc filesystem

This patch: The use of a ZVC for nr_inactive and nr_active allows a simplification of some counter operations. More ZVC functionality is used for sums etc in the following patches.

[[email protected]: UP build fix]
Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
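The payoff, roughly: with the active/inactive counts kept as ZVCs, the dirtyable-memory estimate becomes a few cheap global reads instead of a walk over every node and zone. A simplified, hedged sketch (the real determine_dirtyable_memory() also has highmem handling):

    static unsigned long dirtyable_memory_sketch(void)
    {
            return global_page_state(NR_FREE_PAGES) +
                   global_page_state(NR_INACTIVE) +
                   global_page_state(NR_ACTIVE);
    }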
2006-12-22  [PATCH] fix vm_events_fold_cpu() build breakage  (Magnus Damm, 1 file, -0/+6)
2.6.20-rc1 does not build properly if CONFIG_VM_EVENT_COUNTERS is set and CONFIG_HOTPLUG is unset:

    CC      init/version.o
    LD      init/built-in.o
    LD      .tmp_vmlinux1
    mm/built-in.o: In function `page_alloc_cpu_notify':
    page_alloc.c:(.text+0x56eb): undefined reference to `vm_events_fold_cpu'
    make: *** [.tmp_vmlinux1] Error 1

[[email protected]: cleanup]
Signed-off-by: Magnus Damm <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2006-12-22  [PATCH] CONFIG_VM_EVENT_COUNTER comment decrustify  (Paul Jackson, 1 file, -2/+3)
The VM event counters, enabled by CONFIG_VM_EVENT_COUNTERS and exposed in /proc/vmstat, have become more essential to non-EMBEDDED kernel configurations than they were in the past. Comments in the code and the Kconfig configuration explanation were stale, downplaying their role excessively. Refresh those comments to correctly reflect the current role of VM event counters. Signed-off-by: Paul Jackson <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2006-09-29  [PATCH] Remove another config.h  (Dave Jones, 1 file, -1/+0)
After the asm/ uses of #include <linux/config.h> this one is the next biggest source of noise. Signed-off-by: Dave Jones <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2006-09-26  [PATCH] reduce MAX_NR_ZONES: remove display of counters for unconfigured zones  (Christoph Lameter, 1 file, -1/+13)
eventcounters: Do not display counters for zones that are not available on an arch Do not define or display counters for the DMA32 and the HIGHMEM zone if such zones were not configured. [[email protected]: s390 fix] [[email protected]: s390 fix] Signed-off-by: Christoph Lameter <[email protected]> Cc: Martin Schwidefsky <[email protected]> Signed-off-by: Heiko Carstens <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2006-09-26  [PATCH] reduce MAX_NR_ZONES: make ZONE_DMA32 optional  (Christoph Lameter, 1 file, -3/+1)
Make ZONE_DMA32 optional:

- Add #ifdefs around ZONE_DMA32 specific code and definitions.
- Add a CONFIG_ZONE_DMA32 config option and use that for x86_64, which alone needs this zone.
- Remove the use of CONFIG_DMA_IS_DMA32 and CONFIG_DMA_IS_NORMAL for ia64 and fix up the way per-node ZVCs are calculated.
- Fall back to the prior GFP_ZONEMASK of 0x03 if there is no DMA32 zone.

Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2006-08-06  [PATCH] fix vmstat per cpu usage  (Jan Blunck, 1 file, -4/+4)
The per cpu variables are used incorrectly in vmstat.h. Signed-off-by: Jan Blunck <[email protected]> Cc: Christoph Lameter <[email protected]> Acked-by: Steve Fox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2006-07-10  [PATCH] ZVC: add __inc_zone_state for !SMP configuration  (Christoph Lameter, 1 file, -2/+7)
It turns out that there is a way to build a kernel with NUMA and no SMP. In that case we are missing one definition __inc_zone_state. Provide that missing __inc_zone_state. (akpm: NUMA && !SMP sounds odd, but I am told "But there is the concept of cpuless nodes. A NUMA system without SMP has a single processor but multiple memory nodes. This used to work before on IA64 (wasn't aware of it, never seen anyone with this kind of thing).") Acked-by: Tony Luck <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
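For reference, the !SMP fallback amounts to bumping the zone and global counters directly, since there are no per-cpu deltas to worry about; a hedged sketch of what such a definition looks like:

    static inline void __inc_zone_state(struct zone *zone,
                                        enum zone_stat_item item)
    {
            atomic_long_inc(&zone->vm_stat[item]);
            atomic_long_inc(&vm_stat[item]);
    }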
2006-07-10  [PATCH] count_vm_events() fix  (Andrew Morton, 1 file, -1/+1)
Dopey bug. Causes hopelessly-wrong numbers from vmstat(8) and several other counters. Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>