mm: hugetlb: clear target sub-page last when clearing huge page - blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

diff options

author	Huang Ying <[email protected]>	2017-09-06 16:25:04 -0700
committer	Linus Torvalds <[email protected]>	2017-09-06 17:27:30 -0700
commit	c79b57e462b5d2f47afa5f175cf1828f16e18612 (patch)
tree	2f4b14b2341cd6d88b11b51e77866da252b40203 /drivers/fpga/fpga-bridge.c
parent	212925802454672e6cd2949a727f5e2c1377bf06 (diff)

mm: hugetlb: clear target sub-page last when clearing huge page

Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when clearing huge page on x86_64 platform, the cache footprint is 2M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache pressure is heavy when clearing the huge page, and we clear the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing clearing the end of the huge page. And it is possible for the application to access the begin of the huge page after clearing the huge page. To help the above situation, in this patch, when we clear a huge page, the order to clear sub-pages is changed. In quite some situation, we can get the address that the application will access after we clear the huge page, for example, in a page fault handler. Instead of clearing the huge page from begin to end, we will clear the sub-pages farthest from the the sub-page to access firstly, and clear the sub-page to access last. This will make the sub-page to access most cache-hot and sub-pages around it more cache-hot too. If we cannot know the address the application will access, the begin of the huge page is assumed to be the the address the application will access. With this patch, the throughput increases ~28.3% in vm-scalability anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case creates 72 processes, each process mmap a big anonymous memory area and writes to it from the begin to the end. For each process, other processes could be seen as other workload which generates heavy cache pressure. At the same time, the cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per cycle) increased from 0.56 to 0.74, and the time spent in user space is reduced ~7.9% Christopher Lameter suggests to clear bytes inside a sub-page from end to begin too. But tests show no visible performance difference in the tests. May because the size of page is small compared with the cache size. Thanks Andi Kleen to propose to use address to access to determine the order of sub-pages to clear. The hugetlbfs access address could be improved, will do that in another patch. [[email protected]: improve readability of clear_huge_page()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Suggested-by: Andi Kleen <[email protected]> Signed-off-by: "Huang, Ying" <[email protected]> Acked-by: Jan Kara <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Nadia Yvette Chambers <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Diffstat (limited to 'drivers/fpga/fpga-bridge.c')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: