aboutsummaryrefslogtreecommitdiff
path: root/fs/nfs
AgeCommit message (Collapse)AuthorFilesLines
2012-10-08pnfsblock: cleanup nfs4_blkdev_getPeng Tao2-21/+5
It is not needed at all and it is messing with return values... Reported-by: Wei Yongjun <[email protected]> Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-08NFS41: send real read size in layoutgetPeng Tao1-1/+9
For buffer read, use offst-to-isize. For direct read, use dreq->bytes_left. Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-08NFS41: send real write size in layoutgetPeng Tao6-7/+57
For buffer write, block layout client scan inode mapping to find next hole and use offset-to-hole as layoutget length. Object layout client uses offset-to-isize as layoutget length. For direct write, both block layout and object layout use dreq->bytes_left. Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-08NFS: track direct IO left bytesPeng Tao1-0/+5
Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-05NFSv4.1: Cleanup ugliness in pnfs_layoutgets_blocked()Trond Myklebust1-11/+14
Split it into two functions, one which checks if layoutgets are blocked, and one which checks if the layout stateid has expired. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-04NFSv4.1: Ensure that the layout sequence id stays 'close' to the currentTrond Myklebust1-13/+8
Clamp the layout barrier sequence id to the current sequence id minus the maximum number of outstanding layoutget requests. Also ensure that we correctly initialise lo->plh_barrier if there are no layout segments associated to this layout header. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-04NFSv4.1: Deal with seqid wraparound in the pNFS return-on-close codeTrond Myklebust1-1/+1
Signed-off-by: Trond Myklebust <[email protected]>
2012-10-03NFSv4 set open access operation call flag in nfs4_init_opendata_resAndy Adamson1-1/+1
nfs4_open_recover_helper zeros the nfs4_opendata result structures, removing the result access_request information which leads to an XDR decode error. Move the setting of the result access_request field to nfs4_init_opendata_res which sets all the other required nfs4_opendata result fields and is shared between the open and recover open paths. Signed-off-by: Andy Adamson <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-03NFSv4.1: Remove the dependency on CONFIG_EXPERIMENTALTrond Myklebust1-2/+2
CONFIG_EXPERIMENTAL is deprecated and, regardless of that, this code is being enabled in most newer distributions. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02Merge branch 'for-linus' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...
2012-10-02fs: push rcu_barrier() from deactivate_locked_super() to filesystemsKirill A. Shutemov1-0/+5
There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2012-10-02NFSv4 reduce attribute requests for open reclaimAndy Adamson2-27/+89
We currently make no distinction in attribute requests between normal OPENs and OPEN with CLAIM_PREVIOUS. This offers more possibility of failures in the GETATTR response which foils OPEN reclaim attempts. Reduce the requested attributes to the bare minimum needed to update the reclaim open stateid and split nfs4_opendata_to_nfs4_state processing accordingly. Signed-off-by: Andy Adamson <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4: nfs4_open_done first must check that GETATTR decoded a file typeTrond Myklebust1-1/+3
...before it can check the validity of that file type. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4.1: Deal with wraparound when updating the layout "barrier" seqidTrond Myklebust1-4/+7
...and fix a bug in pnfs_set_layout_stateid. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4.1: Deal with wraparound issues when updating the layout stateidTrond Myklebust1-1/+10
...and add a helper function. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4.1: Always set the layout stateid if this is the first layoutgetTrond Myklebust1-3/+5
If the list of layout segments is empty, we must unconditionally set the layout stateid. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4.1: Fix another refcount issue in pnfs_find_alloc_layoutTrond Myklebust1-7/+8
Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4: don't put ACCESS in OPEN compound if O_EXCLWeston Andros Adamson2-7/+17
Don't put an ACCESS op in OPEN compound if O_EXCL, because ACCESS will return permission denied for all bits until close. Fixes a regression due to commit 6168f62c (NFSv4: Add ACCESS operation to OPEN compound) Signed-off-by: Weston Andros Adamson <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4: don't check MAY_WRITE access bit in OPENWeston Andros Adamson1-3/+3
Don't check MAY_WRITE as a newly created file may not have write mode bits, but POSIX allows the creating process to write regardless. This is ok because NFSv4 OPEN ops handle write permissions correctly - the ACCESS in the OPEN compound is to differentiate READ v EXEC permissions. Fixes a regression due to commit 6168f62c (NFSv4: Add ACCESS operation to OPEN compound) Signed-off-by: Weston Andros Adamson <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFS: Set key construction data for the legacy upcallBryan Schumaker1-0/+1
This prevents a null pointer dereference when nfs_idmap_complete_pipe_upcall_locked() calls complete_request_key(). Fixes a regression caused by commit 0cac12023 (NFSv4: Ensure that idmap_pipe_downcall sanity-checks the downcall data). Signed-off-by: Bryan Schumaker <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4.1: don't do two EXCHANGE_IDs on mountWeston Andros Adamson1-0/+1
Since the addition of NFSv4 server trunking detection the mount context calls nfs4_proc_exchange_id then schedules the state manager, which also calls nfs4_proc_exchange_id. Setting the NFS4CLNT_LEASE_CONFIRM bit makes the state manager skip the unneeded EXCHANGE_ID and continue on with session creation. Reported-by: Jorge Mora <[email protected]> Signed-off-by: Weston Andros Adamson <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02KEYS: Use keyring_alloc() to create special keyringsDavid Howells1-8/+4
Use keyring_alloc() to create special keyrings now that it has a permissions parameter rather than using key_alloc() + key_instantiate_and_link(). Also document and export keyring_alloc() so that modules can use it too. Signed-off-by: David Howells <[email protected]>
2012-10-02Merge branch 'for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "This is a mostly modest set of changes to enable basic user namespace support. This allows the code to code to compile with user namespaces enabled and removes the assumption there is only the initial user namespace. Everything is converted except for the most complex of the filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs, nfs, ocfs2 and xfs as those patches need a bit more review. The strategy is to push kuid_t and kgid_t values are far down into subsystems and filesystems as reasonable. Leaving the make_kuid and from_kuid operations to happen at the edge of userspace, as the values come off the disk, and as the values come in from the network. Letting compile type incompatible compile errors (present when user namespaces are enabled) guide me to find the issues. The most tricky areas have been the places where we had an implicit union of uid and gid values and were storing them in an unsigned int. Those places were converted into explicit unions. I made certain to handle those places with simple trivial patches. Out of that work I discovered we have generic interfaces for storing quota by projid. I had never heard of the project identifiers before. Adding full user namespace support for project identifiers accounts for most of the code size growth in my git tree. Ultimately there will be work to relax privlige checks from "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing root in a user names to do those things that today we only forbid to non-root users because it will confuse suid root applications. While I was pushing kuid_t and kgid_t changes deep into the audit code I made a few other cleanups. I capitalized on the fact we process netlink messages in the context of the message sender. I removed usage of NETLINK_CRED, and started directly using current->tty. Some of these patches have also made it into maintainer trees, with no problems from identical code from different trees showing up in linux-next. After reading through all of this code I feel like I might be able to win a game of kernel trivial pursuit." Fix up some fairly trivial conflicts in netfilter uid/git logging code. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits) userns: Convert the ufs filesystem to use kuid/kgid where appropriate userns: Convert the udf filesystem to use kuid/kgid where appropriate userns: Convert ubifs to use kuid/kgid userns: Convert squashfs to use kuid/kgid where appropriate userns: Convert reiserfs to use kuid and kgid where appropriate userns: Convert jfs to use kuid/kgid where appropriate userns: Convert jffs2 to use kuid and kgid where appropriate userns: Convert hpfs to use kuid and kgid where appropriate userns: Convert btrfs to use kuid/kgid where appropriate userns: Convert bfs to use kuid/kgid where appropriate userns: Convert affs to use kuid/kgid wherwe appropriate userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids userns: On ia64 deal with current_uid and current_gid being kuid and kgid userns: On ppc convert current_uid from a kuid before printing. userns: Convert s390 getting uid and gid system calls to use kuid and kgid userns: Convert s390 hypfs to use kuid and kgid where appropriate userns: Convert binder ipc to use kuids userns: Teach security_path_chown to take kuids and kgids userns: Add user namespace support to IMA userns: Convert EVM to deal with kuids and kgids in it's hmac computation ...
2012-10-02Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-2/+1
Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...
2012-10-02NFS: nfs41_walk_client_list(): re-lock before iteratingChuck Lever1-0/+1
Sparse identified an execution path in nfs41_walk_client_list() where the nfs_client_lock is not re-acquired before taking the next loop iteration. fs/nfs/nfs4client.c:437:9: sparse: context imbalance in 'nfs41_walk_client_list' - different lock contexts for basic block Signed-off-by: Chuck Lever <[email protected]> Cc: Fengguang Wu <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4.1: Handle BAD_STATEID and EXPIRED errors in layoutgetTrond Myklebust1-8/+26
If the layoutget call returns a stateid error, we want to invalidate the layout stateid, and/or recover the open stateid. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02NFSv4.1: bl_pg_init_write should be staticTrond Myklebust1-1/+1
Reported-by: Fengguang Wu <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02nfs: include internal.h in getroot.hStanislav Kinsbursky1-0/+2
Sparse warning: fs/nfs/nfs4getroot.c:11:5: warning: symbol 'nfs4_get_rootfh' was not declared. Should it be static? Signed-off-by: Stanislav Kinsbursky <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02nfs: include nfs4_fh.h in nfs4sysctl.cStanislav Kinsbursky1-0/+1
Sparse warnings: fs/nfs/nfs4sysctl.c:56:5: warning: symbol 'nfs4_register_sysctl' was not declared. Should it be static? fs/nfs/nfs4sysctl.c:64:6: warning: symbol 'nfs4_unregister_sysctl' was not declared. Should it be static? Signed-off-by: Stanislav Kinsbursky <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02nfs: declare nfs_xdev_mount as staticStanislav Kinsbursky1-1/+1
Sparse warning: fs/nfs/super.c:2517:15: warning: symbol 'nfs_xdev_mount' was not declared. Should it be static? Signed-off-by: Stanislav Kinsbursky <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02nfs: declare nfs_callback_tcp_port in headerStanislav Kinsbursky1-0/+1
Sparse warning: fs/nfs/super.c:2638:16: sparse: symbol 'nfs_callback_tcpport' was not declared. Should it be static? Reported-by: Fengguang Wu <[email protected]> Signed-off-by: Stanislav Kinsbursky <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-02nfs: include NFSv4 header in netns.hStanislav Kinsbursky1-0/+1
Build error: fs/nfs/netns.h:27:15: error: 'NFS4_MAX_MINOR_VERSION' undeclared here (not in a function) Reported-by: Fengguang Wu <[email protected]> Signed-off-by: Stanislav Kinsbursky <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFSv4.0 reclaim reboot state when re-establishing clientidAndy Adamson1-2/+6
We should reclaim reboot state when the clientid is stale. Signed-off-by: Andy Adamson <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFSv4: nfs4_match_clientids is only used by NFSv4.1Trond Myklebust1-15/+15
Fix another compiler warning. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFSv4: Fix the minor version callback channel startupTrond Myklebust1-14/+13
The current spaghetti code confuses some versions of gcc (and just looks ugly as hell)! Clean up... Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFSv4: Fix up a merge conflict between migration and container changesTrond Myklebust1-2/+3
nfs_callback_tcpport is now per-net_namespace. Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01Merge branch 'bugfixes' into nfs-for-nextTrond Myklebust1-34/+70
2012-10-01nfs: replace strict_strto* with kstrto*Daniel Walter2-4/+4
[nfs] replace strict_str* with kstr* variants * replace string conversions with newer kstr* functions Signed-off-by: Daniel Walter <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFS: Remove unnecessary semicolons (fs/nfs/client.c)Yanchuan Nian1-2/+2
There are some unnecessary semicolons in function find_nfs_version. Just remove them. Signed-off-by: Yanchuan Nian <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01pnfsblock: use list_move_tail instead of list_del/list_add_tailWei Yongjun1-2/+1
Using list_move_tail() instead of list_del() + list_add_tail(). spatch with a semantic match is used to found this problem. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01pnfsblock: fix non-aligned DIO writePeng Tao1-3/+31
For DIO writes, if it is not blocksize aligned, we need to do internal serialization. It may slow down writers anyway. So we just bail them out and resend to MDS. Cc: stable <[email protected]> [since v3.4] Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01pnfsblock: fix non-aligned DIO readPeng Tao1-8/+56
For DIO read, if it is not sector aligned, we should reject it and resend via MDS. Otherwise there might be data corruption. Also teach bl_read_pagelist to handle partial page reads for DIO. Cc: stable <[email protected]> [since v3.4] Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01pnfsblock: fix partial page buffer wirtePeng Tao2-12/+166
If applications use flock to protect its write range, generic NFS will not do read-modify-write cycle at page cache level. Therefore LD should know how to handle non-sector aligned writes. Otherwise there will be data corruption. Cc: stable <[email protected]> Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01Revert "pnfsblock: bail out partial page IO"Peng Tao1-36/+3
This reverts commit 159e0561e322dd8008fff59e36efff8d2bdd0b0e, in favor of a more complete fix to the alignment issue. Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFS41: fix error of setting blocklayoutdriverPeng Tao2-2/+4
After commit e38eb650 (NFS: set_pnfs_layoutdriver() from nfs4_proc_fsinfo()), set_pnfs_layoutdriver() is called inside nfs4_proc_fsinfo(), but pnfs_blksize is not set. It causes setting blocklayoutdriver failure and pnfsblock mount failure. Cc: stable <[email protected]> [since v3.5] Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFSv41: fix DIO write_io calculationPeng Tao1-2/+2
pnfs_within_mdsthreshold() is called inside pg_init. We need to set read_io/write_io before that. Otherwise we fail pnfs_within_mdsthreshold() and IO goes to MDS. A simple test case: dd if=foo of=/mnt/pnfs/bar bs=10M count=1 oflag=direct Signed-off-by: Peng Tao <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFS: Add nfs4_unique_id boot parameterChuck Lever3-1/+13
An optional boot parameter is introduced to allow client administrators to specify a string that the Linux NFS client can insert into its nfs_client_id4 id string, to make it both more globally unique, and to ensure that it doesn't change even if the client's nodename changes. If this boot parameter is not specified, the client's nodename is used, as before. Client installation procedures can create a unique string (typically, a UUID) which remains unchanged during the lifetime of that client instance. This works just like creating a UUID for the label of the system's root and boot volumes. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFS: Discover NFSv4 server trunking when mountingChuck Lever6-2/+454
"Server trunking" is a fancy named for a multi-homed NFS server. Trunking might occur if a client sends NFS requests for a single workload to multiple network interfaces on the same server. There are some implications for NFSv4 state management that make it useful for a client to know if a single NFSv4 server instance is multi-homed. (Note this is only a consideration for NFSv4, not for legacy versions of NFS, which are stateless). If a client cares about server trunking, no NFSv4 operations can proceed until that client determines who it is talking to. Thus server IP trunking discovery must be done when the client first encounters an unfamiliar server IP address. The nfs_get_client() function walks the nfs_client_list and matches on server IP address. The outcome of that walk tells us immediately if we have an unfamiliar server IP address. It invokes nfs_init_client() in this case. Thus, nfs4_init_client() is a good spot to perform trunking discovery. Discovery requires a client to establish a fresh client ID, so our client will now send SETCLIENTID or EXCHANGE_ID as the first NFS operation after a successful ping, rather than waiting for an application to perform an operation that requires NFSv4 state. The exact process for detecting trunking is different for NFSv4.0 and NFSv4.1, so a minorversion-specific init_client callout method is introduced. CLID_INUSE recovery is important for the trunking discovery process. CLID_INUSE is a sign the server recognizes the client's nfs_client_id4 id string, but the client is using the wrong principal this time for the SETCLIENTID operation. The SETCLIENTID must be retried with a series of different principals until one works, and then the rest of trunking discovery can proceed. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFS: Use the same nfs_client_id4 for every serverChuck Lever1-12/+39
Currently, when identifying itself to NFS servers, the Linux NFS client uses a unique nfs_client_id4.id string for each server IP address it talks with. For example, when client A talks to server X, the client identifies itself using a string like "AX". The requirements for these strings are specified in detail by RFC 3530 (and bis). This form of client identification presents a problem for Transparent State Migration. When client A's state on server X is migrated to server Y, it continues to be associated with string "AX." But, according to the rules of client string construction above, client A will present string "AY" when communicating with server Y. Server Y thus has no way to know that client A should be associated with the state migrated from server X. "AX" is all but abandoned, interfering with establishing fresh state for client A on server Y. To support transparent state migration, then, NFSv4.0 clients must instead use the same nfs_client_id4.id string to identify themselves to every NFS server; something like "A". Now a client identifies itself as "A" to server X. When a file system on server X transitions to server Y, and client A identifies itself as "A" to server Y, Y will know immediately that the state associated with "A," whether it is native or migrated, is owned by the client, and can merge both into a single lease. As a pre-requisite to adding support for NFSv4 migration to the Linux NFS client, this patch changes the way Linux identifies itself to NFS servers via the SETCLIENTID (NFSv4 minor version 0) and EXCHANGE_ID (NFSv4 minor version 1) operations. In addition to removing the server's IP address from nfs_client_id4, the Linux NFS client will also no longer use its own source IP address as part of the nfs_client_id4 string. On multi-homed clients, the value of this address depends on the address family and network routing used to contact the server, thus it can be different for each server. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-10-01NFS: Introduce "migration" mount optionChuck Lever2-0/+22
Currently, the Linux client uses a unique nfs_client_id4.id string when identifying itself to distinct NFS servers. To support transparent state migration, the Linux client will have to use the same nfs_client_id4 string for all servers it communicates with (also known as the "uniform client string" approach). Otherwise NFS servers can not recognize that open and lock state need to be merged after a file system transition. Unfortunately, there are some NFSv4.0 servers currently in the field that do not tolerate the uniform client string approach. Thus, by default, our NFSv4.0 mounts will continue to use the current approach, and we introduce a mount option that switches them to use the uniform model. Client administrators must identify which servers can be mounted with this option. Eventually most NFSv4.0 servers will be able to handle the uniform approach, and we can change the default. The first mount of a server controls the behavior for all subsequent mounts for the lifetime of that set of mounts of that server. After the last mount of that server is gone, the client erases the data structure that tracks the lease. A subsequent lease may then honor a different "migration" setting. This patch adds only the infrastructure for parsing the new mount option. Support for uniform client strings is added in a subsequent patch. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>