Age | Commit message (Collapse) | Author | Files | Lines |
|
Before this patch, when function try_rgrp_unlink queued a glock for
delete_work to reclaim the space, it used the inode glock to do so.
That's different from the iopen callback which uses the iopen glock
for the same purpose. We should be consistent and always use the
iopen glock. This may also save us reference counting problems with
the inode glock, since clear_glock does an extra glock_put() for the
inode glock.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Some error cases in gfs2_create_inode were not unlocking the iopen
glock, getting the reference count off. This adds the proper unlock.
The error logic in function gfs2_create_inode was also convoluted,
so this patch simplifies it. It also takes care of a bug in
which gfs2_qa_delete() was not called in an error case.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
In function gfs2_delete_inode() we write and flush the mapping for
a glock, among other things. We truncate the mapping for the inode,
but we never truncate the mapping for the glock. This patch makes it
also truncate the metamapping. This avoid cases where the glock is
reused by another process who is trying to recreate an inode in its
place using the same block.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch changes every glock_dq for iopen glocks into a dq_wait.
This makes sure that iopen glocks do not outlive the inode itself.
In turn, that ensures that anyone trying to unlink the glock will
be able to find the inode when it receives a remote iopen callback.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
When gfs2 was unmounting filesystems or changing them to read-only it
was clearing the SDF_JOURNAL_LIVE bit before the final log flush. This
caused a race. If an inode glock got demoted in the gap between
clearing the bit and the shutdown flush, it would be unable to reserve
log space to clear out the active items list in inode_go_sync, causing an
error in inode_go_inval because the glock was still dirty.
To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the
shutdown log flush. This means that, because of the locking on the log
blocks, either inode_go_sync will be able to reserve space to clean the
glock before the shutdown flush, or the shutdown flush will clean the
glock itself, before inode_go_sync fails to reserve the space. Either
way, the glock will be clean before inode_go_inval.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
gfs2 currently returns 31 bits of filename hash as a cookie that readdir
uses for an offset into the directory. When there are a large number of
directory entries, the likelihood of a collision goes up way too
quickly. GFS2 will now return cookies that are guaranteed unique for a
while, and then fail back to using 30 bits of filename hash.
Specifically, the directory leaf blocks are divided up into chunks based
on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
cookie is based off the chunk where it starts, in the linked list of
leaf blocks that it hashes to (there are 131072 hash buckets). Directory
entries will have unique names until they take reach chunk 8192.
Assuming the largest filenames possible, and the least efficient spacing
possible, this new method will still be able to return unique names when
the previous method has statistically more than a 99% chance of a
collision. The non-unique names it fails back to are guaranteed to not
collide with the unique names.
unique cookies will be in this format:
- 1 bit "0" to make sure the the returned cookie is positive
- 17 bits for the hash table index
- 1 bit for the mode "0"
- 13 bits for the offset
non-unique cookies will be in this format:
- 1 bit "0" to make sure the the returned cookie is positive
- 17 bits for the hash table index
- 1 bit for the mode "1"
- 13 more bits of the name hash
Another benefit of location based cookies, is that once a directory's
exhash table is fully extended (so that multiple hash table indexs do
not use the same leaf blocks), gfs2 can skip sorting the directory
entries until it reaches the non-unique ones, and then it only needs to
sort these. This provides a significant speed up for directory reads of
very large directories.
The only issue is that for these cookies to continue to point to the
correct entry as files are added and removed from the directory, gfs2
must keep the entries at the same offset in the leaf block when they are
split (see my previous patch). This means that until all the nodes in a
cluster are running with code that will split the directory leaf blocks
this way, none of the nodes can use the new cookie code. To deal with
this, gfs2 now has the mount option loccookie, which, if set, will make
it return these new location based cookies. This option must not be set
until all nodes in the cluster are at least running this version of the
kernel code, and you have guaranteed that there are no outstanding
cookies required by other software, such as NFS.
gfs2 uses some of the extra space at the end of the gfs2_dirent
structure to store the calculated readdir cookies. This keeps us from
needing to allocate a seperate array to hold these values. gfs2
recomputes the cookie stored in de_cookie for every readdir call. The
time it takes to do so is small, and if gfs2 expected this value to be
saved on disk, the new code wouldn't work correctly on filesystems
created with an earlier version of gfs2.
One issue with adding de_cookie to the union in the gfs2_dirent
structure is that it caused the union to align itself to a 4 byte
boundary, instead of its previous 2 byte boundary. This changed the
offset of de_rahead. To solve that, I pulled de_rahead out of the union,
since it does not need to be there.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Currently, when gfs2 splits a directory leaf block, the dirents that
need to be copied to the new leaf block are packed into the start of it.
This is good for space efficiency. However, if gfs2 were to copy those
dirents into the exact same offset in the new leaf block as they had in
the old block, it would be able to generate a readdir cookie based on
the dirent location, that would be guaranteed to be unique up well past
where the current code is statistically almost guaranteed to have
collisions. So, gfs2 now keeps the dirent's offset in the block the
same when it copies it to the new leaf block.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
At some point in the past, we used to have a timeout when GFS2 was
unmounting, trying to clear out its glocks. If the timeout expires,
it would dump the remaining glocks to the kernel messages so that
developers can debug the problem. That timeout was eliminated,
probably by accident. This patch reintroduces it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Before this patch, function update_statfs called gfs2_statfs_change_out
to update the master statfs buffer without the sd_statfs_spin held.
In theory, another process could call gfs2_statfs_sync, which takes
the sd_statfs_spin lock and re-reads m_sc from the buffer. So there's
a theoretical timing window in which one process could write the
master statfs buffer, then another comes along and re-reads it, wiping
out the changes.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
This patch makes no functional changes. Its goal is to reduce the
size of the gfs2 inode in memory by rearranging structures and
changing the size of some variables within the structure.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Before this patch, multi-block reservation structures were allocated
from a special slab. This patch folds the structure into the gfs2_inode
structure. The disadvantage is that the gfs2_inode needs more memory,
even when a file is opened read-only. The advantages are: (a) we don't
need the special slab and the extra time it takes to allocate and
deallocate from it. (b) we no longer need to worry that the structure
exists for things like quota management. (c) This also allows us to
remove the calls to get_write_access and put_write_access since we
know the structure will exist.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Remove POSIX_ACL_XATTR_{ACCESS,DEFAULT} and GFS2_POSIX_ACL_{ACCESS,DEFAULT}
and replace them with the definitions in <include/uapi/linux/xattr.h>.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Function gfs2_xattr_acl_chmod is unused since commit e01580bf.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This patch basically reverts the majority of patch 5407e24.
That patch eliminated the gfs2_qadata structure in favor of just
using the reservations structure. The problem with doing that is that
it increases the size of the reservations structure. That is not an
issue until it comes time to fold the reservations structure into the
inode in memory so we know it's always there. By separating out the
quota structure again, we aren't punishing the non-quota users by
making all the inodes bigger, requiring more slab space. This patch
creates a new slab area to allocate the quota stuff so it's managed
a little more sanely.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Instead of submitting a READ_SYNC bio for the inode and a READA bio for
the inode's extended attributes through submit_bh, submit a single READ_SYNC
bio for both through submit_bio when possible. This can be more
efficient on some kinds of block devices.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
When gfs2 allocates an inode and its extended attribute block next to
each other at inode create time, the inode's directory entry indicates
that in de_rahead. In that case, we can readahead the extended
attribute block when we read in the inode.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
This lockdep splat was being triggered on umount:
[55715.973122] ===============================
[55715.980169] [ INFO: suspicious RCU usage. ]
[55715.981021] 4.3.0-11553-g8d3de01-dirty #15 Tainted: G W
[55715.982353] -------------------------------
[55715.983301] fs/gfs2/glock.c:1427 suspicious rcu_dereference_protected() usage!
The code it refers to is the rht_for_each_entry_safe usage in
glock_hash_walk. The condition that triggers the warning is
lockdep_rht_bucket_is_held(tbl, hash) which is checked in the
__rcu_dereference_protected macro.
The rhashtable buckets are not changed in glock_hash_walk so it's safe
to rely on the rcu protection. Replace the rht_for_each_entry_safe()
usage with rht_for_each_entry_rcu(), which doesn't care whether the
bucket lock is held if the rcu read lock is held.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr cleanups from Al Viro.
* 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
f2fs: xattr simplifications
squashfs: xattr simplifications
9p: xattr simplifications
xattr handlers: Pass handler to operations instead of flags
jffs2: Add missing capability check for listing trusted xattrs
hfsplus: Remove unused xattr handler list operations
ubifs: Remove unused security xattr handler
vfs: Fix the posix_acl_xattr_list return value
vfs: Check attribute names in posix acl xattr handers
|
|
The xattr_handler operations are currently all passed a file system
specific flags value which the operations can use to disambiguate between
different handlers; some file systems use that to distinguish the xattr
namespace, for example. In some oprations, it would be useful to also have
access to the handler prefix. To allow that, pass a pointer to the handler
to operations instead of the flags value alone.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
When new files and directories are created inside a parent directory
we automatically inherit the GFS2_DIF_SYSTEM flag (if set) and assign
it to the new file/dirs.
All new system files/dirs created in the metafs by, say gfs2_jadd,
will have this flag set because they will have parent directories in
the metafs whose GFS2_DIF_SYSTEM flag has already been set (most likely
by a previous mkfs.gfs2)
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Merge third patch-bomb from Andrew Morton:
"We're pretty much done over here - I'm still waiting for a nouveau
merge so I can cleanly finish up Christoph's dma-mapping rework.
- bunch of small misc stuff
- fold abs64() into abs(), remove abs64()
- new_valid_dev() cleanups
- binfmt_elf_fdpic feature work"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits)
fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries
fs/stat.c: remove unnecessary new_valid_dev() check
fs/reiserfs/namei.c: remove unnecessary new_valid_dev() check
fs/nilfs2/namei.c: remove unnecessary new_valid_dev() check
fs/ncpfs/dir.c: remove unnecessary new_valid_dev() check
fs/jfs: remove unnecessary new_valid_dev() checks
fs/hpfs/namei.c: remove unnecessary new_valid_dev() check
fs/f2fs/namei.c: remove unnecessary new_valid_dev() check
fs/ext2/namei.c: remove unnecessary new_valid_dev() check
fs/exofs/namei.c: remove unnecessary new_valid_dev() check
fs/btrfs/inode.c: remove unnecessary new_valid_dev() check
fs/9p: remove unnecessary new_valid_dev() checks
include/linux/kdev_t.h: old/new_valid_dev() can return bool
include/linux/kdev_t.h: remove unused huge_valid_dev()
kmap_atomic_to_page() has no users, remove it
drivers/scsi/cxgbi: fix build with EXTRA_CFLAGS
dma: remove external references to dma_supported
Documentation/sysctl/vm.txt: fix misleading code reference of overcommit_memory
remove abs64()
kernel.h: make abs() work with 64-bit types
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
"Here is a list of patches we've accumulated for GFS2 for the current
upstream merge window. There are only six patches this time:
1. A cleanup patch from Andreas to remove the gl_spin #define in favor
of its value for the sake of clarity.
2. A fix from Andy Price to mark the inode dirty during fallocate.
3. A fix from Andy Price to set s_mode on mount failures to prevent a
stack trace.
4 A patch from me to prevent a kernel BUG() in trans_add_meta/trans_add_data
due to uninitialized storage.
5. A patch from me to protecting our freeing of the in-core directory
hash table to prevent double-free.
6. A fix for a page/block rounding problem that resulted in a metadata
coherency problem when the block size != page size"
I've got a lot more patches in various stages of review and testing,
but I'm afraid they'll have to wait until the next merge window. So
next time we're likely to have a lot more"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Fix rgrp end rounding problem for bsize < page size
GFS2: Protect freeing directory hash table with i_lock spin_lock
gfs2: Remove gl_spin define
gfs2: Add missing else in trans_add_meta/data
GFS2: Set s_mode before parsing mount options
GFS2: fallocate: do not rely on file_update_time to mark the inode dirty
|
|
Switch everything to the new and more capable implementation of abs().
Mainly to give the new abs() a bit of a workout.
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch fixes a bug introduced by commit 7005c3e. That patch
tries to map a vm range for resource groups, but the calculation
breaks down when the block size is less than the page size.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
This patch changes function gfs2_dir_hash_inval so it uses the
i_lock spin_lock to protect the in-core hash table, i_hash_cache.
This will prevent double-frees due to a race between gfs2_evict_inode
and inode invalidation.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a
gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in
gl_lockref). Remove that define to make the references to gl_lockref.lock more
obvious.
Signed-off-by: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct
locks API function, use the check within locks_lock_inode_wait(). This
allows for some later cleanup.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
|
|
This patch fixes a timing window that causes a segfault.
The problem is that bd can remain NULL throughout the function
and then reference that NULL pointer if the bh->b_private starts
out NULL, then someone sets it to non-NULL inside the locking.
In that case, bd still needs to be set.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
In the generic mount_bdev() function, deactivate_locked_super() is
called after the fill_super() call fails, at which point s_mode has been
set. kill_block_super() expects this and dumps a warning when
FMODE_EXCL is not set in s_mode.
In gfs2_mount() we call deactivate_locked_super() on failure of
gfs2_mount_args(), at which point s_mode has not yet been set. This
causes kill_block_super() to dump a stack trace when gfs2 fails to mount
with invalid options. Set s_mode earlier in gfs2_mount() to avoid that.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Previously __gfs2_fallocate() relied on file_update_time() marking the
inode dirty, but that's not a safe assumption as that function doesn't
dirty the inode in some cases. Mark the inode dirty explicitly.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"Here is a list of patches we've accumulated for GFS2 for the current
upstream merge window. This time we've only got six patches, many of
which are very small:
- three cleanups from Andreas Gruenbacher, including a nice cleanup
of the sequence file code for the sbstats debugfs file.
- a patch from Ben Hutchings that changes statistics variables from
signed to unsigned.
- two patches from me that increase GFS2's glock scalability by
switching from a conventional hash table to rhashtable"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: A minor "sbstats" cleanup
gfs2: Fix a typo in a comment
gfs2: Make statistics unsigned, suitable for use with do_div()
GFS2: Use resizable hash table for glocks
GFS2: Move glock superblock pointer to field gl_name
gfs2: Simplify the seq file code for "sbstats"
|
|
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g. new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else. This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.
Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
of "sudo" is something more sneaky:
$ BASE="ovl"
$ MNT="$BASE/mnt"
$ LOW="$BASE/lower"
$ UP="$BASE/upper"
$ WORK="$BASE/work/ 0 0
none /proc fuse.pwn user_id=1000"
$ mkdir -p "$LOW" "$UP" "$WORK"
$ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
$ cat /proc/mounts
none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
none /proc fuse.pwn user_id=1000 0 0
$ fusermount -u /proc
$ cat /proc/mounts
cat: /proc/mounts: No such file or directory
This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed. Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.
[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
It seems cleaner to avoid the temporary value here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
None of these statistics can meaningfully be negative, and the
numerator for do_div() must have the type u64. The generic
implementation of do_div() used on some 32-bit architectures asserts
that, resulting in a compiler error in gfs2_rgrp_congested().
Fixes: 0166b197c2ed ("GFS2: Average in only non-zero round-trip times ...")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
This patch changes the glock hash table from a normal hash table to
a resizable hash table, which scales better. This also simplifies
a lot of code.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
What uniquely identifies a glock in the glock hash table is not
gl_name, but gl_name and its superblock pointer. This patch makes
the gl_name field correspond to a unique glock identifier. That will
allow us to simplify hashing with a future patch, since the hash
algorithm can then take the gl_name and hash its components in one
operation.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
Don't use struct gfs2_glock_iter as the helper data structure for iterating
through "sbstats"; we are not iterating through glocks here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"Here are the patches we've accumulated for GFS2 for the current
upstream merge window. We have a good mixture this time. Here are
some of the features:
- Fix a problem with RO mounts writing to the journal.
- Further improvements to quotas on GFS2.
- Added support for rename2 and RENAME_EXCHANGE on GFS2.
- Increase performance by making glock lru_list less of a bottleneck.
- Increase performance by avoiding unnecessary buffer_head releases.
- Increase performance by using average glock round trip time from all CPUs.
- Fixes for some compiler warnings and minor white space issues.
- Other misc bug fixes"
* tag 'gfs2-merge-window' of git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Don't brelse rgrp buffer_heads every allocation
GFS2: Don't add all glocks to the lru
gfs2: Don't support fallocate on jdata files
gfs2: s64 cast for negative quota value
gfs2: limit quota log messages
gfs2: fix quota updates on block boundaries
gfs2: fix shadow warning in gfs2_rbm_find()
gfs2: kerneldoc warning fixes
gfs2: convert simple_str to kstr
GFS2: make sure S_NOSEC flag isn't overwritten
GFS2: add support for rename2 and RENAME_EXCHANGE
gfs2: handle NULL rgd in set_rgrp_preferences
GFS2: inode.c: indent with TABs, not spaces
GFS2: mark the journal idle to fix ro mounts
GFS2: Average in only non-zero round-trip times for congestion stats
GFS2: Use average srttb value in congestion calculations
|
|
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
|
|
This patch allows the block allocation code to retain the buffers
for the resource groups so they don't need to be re-read from buffer
cache with every request. This is a performance improvement that's
especially noticeable when resource groups are very large. For
example, with 2GB resource groups and 4K blocks, there can be 33
blocks for every resource group. This patch allows those 33 buffers
to be kept around and not read in and thrown away with every
operation. The buffers are released when the resource group is
either synced or invalidated.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
|
|
The glocks used for resource groups often come and go hundreds of
thousands of times per second. Adding them to the lru list just
adds unnecessary contention for the lru_lock spin_lock, especially
considering we're almost certainly going to re-use the glock and
take it back off the lru microseconds later. We never want the
glock shrinker to cull them anyway. This patch adds a new bit in
the glops that determines which glock types get put onto the lru
list and which ones don't.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
We cannot provide an efficient implementation due to the headers
on the data blocks, so there doesn't seem much point in having it.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
One-line fix to cast quota value to s64 before comparison.
By default the quantity is treated as u64.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch makes the quota subsystem only report once that a
particular user/group has exceeded their allotted quota.
Previously, it was possible for a program to continuously try
exceeding quota (despite receiving EDQUOT) and in turn trigger
gfs2 to issue a kernel log message about quota exceed. In theory,
this could get out of hand and flood the log and the filesystem
hosting the log files.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
For smaller block sizes (512B, 1K, 2K), some quotas straddle block
boundaries such that the usage value is on one block and the rest
of the quota is on the previous block. In such cases, the value
does not get updated correctly. This patch fixes that by addressing
the boundary conditions correctly.
This patch also adds a (s64) cast that was missing in a call to
gfs2_quota_change() in inode.c
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|