| Age | Commit message (Collapse) | Author | Files | Lines |
|
Due to a merge conflict, the sched_relax_domain_level control file was marked
as being handled by cpuset_read/write_u64, but the code to handle it was
actually in cpuset_common_file_read/write.
Since the value being written/read is in fact a signed integer, it should be
treated as such; this patch adds cpuset_read/write_s64 functions, and uses
them to handle the sched_relax_domain_level file.
With this patch, the sched_relax_domain_level can be read and written, and the
correct contents seen/updated.
Signed-off-by: Paul Menage <[email protected]>
Cc: Hidetoshi Seto <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: Ingo Molnar <[email protected]>
Reviewed-by: Li Zefan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This flag provides the hardwalling properties of mem_exclusive, without
enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient
to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
nodes.
Signed-off-by: Paul Menage <[email protected]>
Acked-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently the cpusets mem_exclusive flag is overloaded to mean both
"no-overlapping" and "no GFP_KERNEL allocations outside this cpuset".
These patches add a new mem_hardwall flag with just the allocation restriction
part of the mem_exclusive semantics, without breaking backwards-compatibility
for those who continue to use just mem_exclusive. Additionally, the cgroup
control file registration for cpusets is cleaned up to reduce boilerplate.
This patch:
This change tidies up the cpusets control file definitions, and reduces the
amount of boilerplate required to add/change control files in the future.
Signed-off-by: Paul Menage <[email protected]>
Reviewed-by: Li Zefan <[email protected]>
Acked-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Make the following needlessly global functions static:
- cpuset_test_cpumask()
- cpuset_change_cpumask()
- cpuset_do_move_task()
Signed-off-by: Adrian Bunk <[email protected]>
Acked-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Many of the cpusets control files are simple integer values, which don't
require the overhead of memory allocations for reads and writes.
Move the handlers for these control files into cpuset_read_u64() and
cpuset_write_u64().
[[email protected]: ad dmissing `break']
Signed-off-by: Paul Menage <[email protected]>
Cc: "Li Zefan" <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: "YAMAMOTO Takashi" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kernel/cpuset.c:1268:52: warning: Using plain integer as NULL pointer
kernel/pid_namespace.c:95:24: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <[email protected]>
Reviewed-by: Paul Jackson <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch renames mpol_copy() to mpol_dup() because, well, that's what it
does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
existing mempolicy, allocates a new one and copies the contents.
In a later patch, I want to use the name mpol_copy() to copy the contents from
one mempolicy to another like, e.g., strcpy() does for strings.
Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy. As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages() that
takes a nodemask for further filtering. This eliminates the need for
MPOL_BIND to create a custom zonelist.
A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist. I.e., pages will be allocated from the closest allowed node with
available memory.
[[email protected]: Mempolicy: update stale documentation and comments]
[[email protected]: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
[[email protected]: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Filtering zonelists requires very frequent use of zone_idx(). This is costly
as it involves a lookup of another structure and a substraction operation. As
the zone_idx is often required, it should be quickly accessible. The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.
This patch introduces a struct zoneref to store a zone pointer and a zone
index. The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary. Helpers are given for accessing the zone index as
well as the node index.
[[email protected]: Suggested struct zoneref instead of embedding information in pointers]
[[email protected]: mm-have-zonelist: fix memcg ooms]
[[email protected]: just return do_try_to_free_pages]
[[email protected]: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
[rebased for sched-devel/latest]
- Add a new cpuset file, having levels:
sched_relax_domain_level
- Modify partition_sched_domains() and build_sched_domains()
to take attributes parameter passed from cpuset.
- Fill newidle_idx for node domains which currently unused but
might be required if sched_relax_domain_level become higher.
- We can change the default level by boot option 'relax_domain_level='.
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
* Cleaned up references to cpumask_scnprintf() and added new
cpulist_scnprintf() interfaces where appropriate.
* Fix some small bugs (or code efficiency improvments) for various uses
of cpumask_scnprintf.
* Clean up some checkpatch errors.
Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
* Modify cpuset_cpus_allowed to return the currently allowed cpuset
via a pointer argument instead of as the function return value.
* Use new set_cpus_allowed_ptr function.
* Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
mm migration is no longer done in cpuset_update_task_memory_state() so it
can no longer take current->mm->mmap_sem, so fix the obsolete comment.
[ This changed in commit 04c19fa6f16047abff2288ddbc1f0798ede5a849
("cpuset: migrate all tasks in cpuset at once") when the mm migration
was moved from cpuset_update_task_memory_state() to update_nodemask() ]
Signed-off-by: David Rientjes <[email protected]>
Cc: Paul Jackson <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently we possibly lookup the pid in the wrong pid namespace. So
seq_file convert proc_pid_status which ensures the proper pid namespaces is
passed in.
[[email protected]: coding-style fixes]
[[email protected]: build fix]
[[email protected]: another build fix]
[[email protected]: s390 build fix]
[[email protected]: fix task_name() output]
[[email protected]: fix nommu build]
Signed-off-by: Eric W. Biederman <[email protected]>
Cc: Andrew Morgan <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Cedric Le Goater <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- Narrow the scope of callback_mutex in scan_for_empty_cpusets().
- Avoid rewriting the cpus, mems of cpusets except when it is likely that
we'll be changing them.
- Have remove_tasks_in_empty_cpuset() also check for empty mems.
Signed-off-by: Paul Jackson <[email protected]>
Acked-by: Cliff Wickman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Paul Menage <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Various minor formatting and comment tweaks to Cliff Wickman's
[PATCH_3_of_3]_cpusets__update_cpumask_revision.patch
I had had "iff", meaning "if and only if" in a comment. However, except for
ancient mathematicians, the abbreviation "iff" was a tad too cryptic. Cliff
changed it to "if", presumably figuring that the "iff" was a typo. However,
it was the "only if" half of the conjunction that was most interesting.
Reword to emphasis the "only if" aspect.
The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
callback_mutex had to be held on entry. The opposite is true.
Several mentions of attach_task() in comments needed to be
changed to cgroup_attach_task().
A comment about notify_on_release was no longer relevant,
as the line of code it had commented, namely:
set_bit(CS_RELEASED_RESOURCE, &parent->flags);
is no longer present in that place in the cpuset.c code.
Similarly a comment about notify_on_release before the
scan_for_empty_cpusets() routine was no longer relevant.
Removed extra parentheses and unnecessary return statement.
Renamed attach_task() to cpuset_attach() in various comments.
Removed comment about not needing memory migration, as it seems the migration
is done anyway, via the cpuset_attach() callback from cgroup_attach_task().
Signed-off-by: Paul Jackson <[email protected]>
Acked-by: Cliff Wickman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Paul Menage <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some of the comments in kernel/cpuset.c were stale following the
transition to control groups; this patch updates them to more closely
match reality.
Signed-off-by: Paul Menage <[email protected]>
Acked-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the new function cgroup_scan_tasks() to step through all tasks in a
cpuset.
[[email protected]: coding-style fixes]
Signed-off-by: Cliff Wickman <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch corrects a situation that occurs when one disables all the cpus in
a cpuset.
Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
which is incorrect because it may then overlap its cpu-exclusive sibling.
Tasks of an empty cpuset should be moved to the cpuset which is the parent of
their current cpuset. Or if the parent cpuset has no cpus, to its parent,
etc.
And the empty cpuset should be released (if it is flagged notify_on_release).
Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
iterate through all tasks in the cpu-less cpuset. We are deliberately
avoiding a walk of the tasklist.
[[email protected]: coding-style fixes]
Signed-off-by: Cliff Wickman <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use
get_online_cpus and put_online_cpus instead as it highlights the
refcount semantics in these operations.
The new API guarantees protection against the cpu-hotplug operation, but
it doesn't guarantee serialized access to any of the local data
structures. Hence the changes needs to be reviewed.
In case of pseries_add_processor/pseries_remove_processor, use
cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the
cpu_present_map there.
Signed-off-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
been running on that cpu.
Currently, such a task is migrated:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any cpu which is both online and among that task's cpus_allowed
It is typical of a multithreaded application running on a large NUMA system to
have its tasks confined to a cpuset so as to cluster them near the memory that
they share. Furthermore, it is typical to explicitly place such a task on a
specific cpu in that cpuset. And in that case the task's cpus_allowed
includes only a single cpu.
This patch would insert a preference to migrate such a task to some cpu within
its cpuset (and set its cpus_allowed to its entire cpuset).
With this patch, migrate the task to:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any online cpu within the task's cpuset
3) to any cpu which is both online and among that task's cpus_allowed
In order to do this, move_task_off_dead_cpu() must make a call to
cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
not block. (name change - per Oleg's suggestion)
Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
the cpuset mutex during the whole migrate_live_tasks() and
migrate_dead_tasks() procedure.
[[email protected]: build fix]
[[email protected]: Fix indentation and spacing]
Signed-off-by: Cliff Wickman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:
- collect batches of tasks under tasklist_lock and then call
set_cpus_allowed() on them outside the lock (since this can sleep).
- add a simple generic priority heap type to allow efficient collection
of batches of tasks to be processed without duplicating or missing any
tasks in subsequent batches.
- make "cpus" file update a no-op if the mask hasn't changed
- fix race between update_cpumask() and sched_setaffinity() by making
sched_setaffinity() post-check that it's not running on any cpus outside
cpuset_cpus_allowed().
[[email protected]: coding-style fixes]
Signed-off-by: Paul Menage <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Cedric Le Goater <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Decrustify the kernel/cpuset.c 'cpus' and 'mems' updating code.
Other than subtle improvements in the consistency of identifying
white space at the beginning and end of passed in masks, this
doesn't make any visible difference in behaviour. But it's
one or two hundred kernel text bytes smaller, and easier to
understand.
[[email protected]: coding-style fix]
Signed-off-by: Paul Jackson <[email protected]>
Reviewed-by: Paul Menage <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new per-cpuset flag called 'sched_load_balance'.
When enabled in a cpuset (the default value) it tells the kernel scheduler
that the scheduler should provide the normal load balancing on the CPUs in
that cpuset, sometimes moving tasks from one CPU to a second CPU if the
second CPU is less loaded and if that task is allowed to run there.
When disabled (write "0" to the file) then it tells the kernel scheduler
that load balancing is not required for the CPUs in that cpuset.
Now even if this flag is disabled for some cpuset, the kernel may still
have to load balance some or all the CPUs in that cpuset, if some
overlapping cpuset has its sched_load_balance flag enabled.
If there are some CPUs that are not in any cpuset whose sched_load_balance
flag is enabled, the kernel scheduler will not load balance tasks to those
CPUs.
Moreover the kernel will partition the 'sched domains' (non-overlapping
sets of CPUs over which load balancing is attempted) into the finest
granularity partition that it can find, while still keeping any two CPUs
that are in the same shed_load_balance enabled cpuset in the same element
of the partition.
This serves two purposes:
1) It provides a mechanism for real time isolation of some CPUs, and
2) it can be used to improve performance on systems with many CPUs
by supporting configurations in which load balancing is not done
across all CPUs at once, but rather only done in several smaller
disjoint sets of CPUs.
This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets
See further the Documentation and comments in the code itself.
[[email protected]: don't be weird]
Signed-off-by: Paul Jackson <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem
The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.
Signed-off-by: Paul Menage <[email protected]>
Cc: Serge E. Hallyn <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: Kirill Korotaev <[email protected]>
Cc: Herbert Poetzl <[email protected]>
Cc: Srivatsa Vaddagiri <[email protected]>
Cc: Cedric Le Goater <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.
The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.
This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.
Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.
With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign. This cpuset code would occassionally trigger that warning.
The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.
Cc: "Eric W. Biederman" <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Herbert Poetzl <[email protected]>
Cc: Kirill Korotaev <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Srivatsa Vaddagiri <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Daniel Walker <[email protected]>
Cc: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of testing for overlap in the memory nodes of the the nearest
exclusive ancestor of both current and the candidate task, it is better to
simply test for intersection between the task's mems_allowed in their task
descriptors. This does not require taking callback_mutex since it is only
used as a hint in the badness scoring.
Tasks that do not have an intersection in their mems_allowed with the current
task are not explicitly restricted from being OOM killed because it is quite
possible that the candidate task has allocated memory there before and has
since changed its mems_allowed.
Cc: Andrea Arcangeli <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the cpuset hooks that defined sched domains depending on the setting
of the 'cpu_exclusive' flag.
The cpu_exclusive flag can only be set on a child if it is set on the
parent.
This made that flag painfully unsuitable for use as a flag defining a
partitioning of a system.
It was entirely unobvious to a cpuset user what partitioning of sched
domains they would be causing when they set that one cpu_exclusive bit on
one cpuset, because it depended on what CPUs were in the remainder of that
cpusets siblings and child cpusets, after subtracting out other
cpu_exclusive cpusets.
Furthermore, there was no way on production systems to query the
result.
Using the cpu_exclusive flag for this was simply wrong from the get go.
Fortunately, it was sufficiently borked that so far as I know, almost no
successful use has been made of this. One real time group did use it to
affectively isolate CPUs from any load balancing efforts. They are willing
to adapt to alternative mechanisms for this, such as someway to manipulate
the list of isolated CPUs on a running system. They can do without this
present cpu_exclusive based mechanism while we develop an alternative.
There is a real risk, to the best of my understanding, of users
accidentally setting up a partitioned scheduler domains, inhibiting desired
load balancing across all their CPUs, due to the nonobvious (from the
cpuset perspective) side affects of the cpu_exclusive flag.
Furthermore, since there was no way on a running system to see what one was
doing with sched domains, this change will be invisible to any using code.
Unless they have real insight to the scheduler load balancing choices, they
will be unable to detect that this change has been made in the kernel's
behaviour.
Initial discussion on lkml of this patch has generated much comment. My
(probably controversial) take on that discussion is that it has reached a
rough concensus that the current cpuset cpu_exclusive mechanism for
defining sched domains is borked. There is no concensus on the
replacement. But since we can remove this mechanism, and since its
continued presence risks causing unwanted partitioning of the schedulers
load balancing, we should remove it while we can, as we proceed to work the
replacement scheduler domain mechanisms.
Signed-off-by: Paul Jackson <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations. When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.
This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved. i.e. they can be migrated by deleting
them and re-reading the information from elsewhere.
Signed-off-by: Mel Gorman <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
cpusets try to ensure that any node added to a cpuset's mems_allowed is
on-line and contains memory. The assumption was that online nodes contained
memory. Thus, it is possible to add memoryless nodes to a cpuset and then add
tasks to this cpuset. This results in continuous series of oom-kill and
apparent system hang.
Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in
place of node_online_map when vetting memories. Return error if admin
attempts to write a non-empty mems_allowed node mask containing only
memoryless-nodes.
Signed-off-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Bob Picco <[email protected]>
Signed-off-by: Nishanth Aravamudan <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rather than using a tri-state integer for the wait flag in
call_usermodehelper_exec, define a proper enum, and use that. I've
preserved the integer values so that any callers I've missed should
still work OK.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Srivatsa Vaddagiri <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (80 commits)
KVM: Use CPU_DYING for disabling virtualization
KVM: Tune hotplug/suspend IPIs
KVM: Keep track of which cpus have virtualization enabled
SMP: Allow smp_call_function_single() to current cpu
i386: Allow smp_call_function_single() to current cpu
x86_64: Allow smp_call_function_single() to current cpu
HOTPLUG: Adapt thermal throttle to CPU_DYING
HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING
HOTPLUG: Add CPU_DYING notifier
KVM: Clean up #includes
KVM: Remove kvmfs in favor of the anonymous inodes source
KVM: SVM: Reliably detect if SVM was disabled by BIOS
KVM: VMX: Remove unnecessary code in vmx_tlb_flush()
KVM: MMU: Fix Wrong tlb flush order
KVM: VMX: Reinitialize the real-mode tss when entering real mode
KVM: Avoid useless memory write when possible
KVM: Fix x86 emulator writeback
KVM: Add support for in-kernel pio handlers
KVM: VMX: Fix interrupt checking on lightweight exit
KVM: Adds support for in-kernel mmio handlers
...
|
|
cpuset.c:update_nodemask() uses a write_lock_irq() on tasklist_lock to
block concurrent forks; a read_lock() suffices and is less intrusive.
Signed-off-by: Paul Menage<[email protected]>
Acked-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
CPU_DYING is called in atomic context, so don't try to take any locks.
Signed-off-by: Avi Kivity <[email protected]>
|
|
The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.
The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.
This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.
Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.
With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign. This cpuset code would occassionally trigger that warning.
The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.
Cc: "Eric W. Biederman" <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Herbert Poetzl <[email protected]>
Cc: Kirill Korotaev <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Srivatsa Vaddagiri <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Cleanup using simple_read_from_buffer() for /dev/cpuset/tasks and
/proc/config.gz.
Cc: Paul Jackson <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
You currently cannot remove all cpus or mems from cpus_allowed or
mems_allowed of a cpuset. We now allow both if there are no attached
tasks.
Acked-by: Paul Jackson <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Paul Menage <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently cpuset_exit() changes the exiting task's ->cpuset pointer w/o
taking task_lock(). This can lead to ugly races between attach_task and
cpuset_exit. Details of the races are described at
http://lkml.org/lkml/2007/3/24/132.
Patch below closes those races.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Paul Menage <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
OOM killed tasks have access to memory reserves as specified by the
TIF_MEMDIE flag in the hopes that it will quickly exit. If such a task has
memory allocations constrained by cpusets, we may encounter a deadlock if a
blocking task cannot exit because it cannot allocate the necessary memory.
We allow tasks that have the TIF_MEMDIE flag to allocate memory anywhere,
including outside its cpuset restriction, so that it can quickly die
regardless of whether it is __GFP_HARDWALL.
Cc: Andi Kleen <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Many struct inode_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
Signed-off-by: Arjan van de Ven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
Signed-off-by: Arjan van de Ven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
fs/proc/base.c:1869: warning: initialization discards qualifiers from pointer target type
fs/proc/base.c:2150: warning: initialization discards qualifiers from pointer target type
Cc: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:
cpuset_zone_allowed_hardwall()
cpuset_zone_allowed_softwall()
Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.
If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.
Unfortunately, this meant that users would end up with the softwall version
without thinking about it. Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occassion.
The hardwall version requires that the current tasks mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)
The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclusing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)
This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.
If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.
This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines. It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.
For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.
Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.
Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Josef Sipek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When using fake NUMA setup, the number of memory nodes can greatly exceed
the number of CPUs. So the current limit in cpuset_common_file_write() is
insufficient.
Signed-off-by: Paul Menage <[email protected]>
Acked-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- move some file_operations structs into the .rodata section
- move static strings from policy_types[] array into the .rodata section
- fix generic seq_operations usages, so that those structs may be defined
as "const" as well
[[email protected]: couple of fixes]
Signed-off-by: Helge Deller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.
the compiler can skip truly unused functions just fine:
text data bss dec hex filename
1624412 728710 3674856 6027978 5bfaca vmlinux.before
1624412 728710 3674856 6027978 5bfaca vmlinux.after
[[email protected]: topology.c fix]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A couple of minor code simplifications to the kernel/cpuset.c code. No
functional change. Just a little less code and a little more readable.
Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|