summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2014-04-04libceph: dump pg_temp mappings to debugfsIlya Dryomov1-0/+11
Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g: pg_temp 2.6 [2,3,4] Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04libceph: do not prefix osd lines with \t in debugfs outputIlya Dryomov1-1/+1
To save screen space in anticipation of more fields (e.g. primary affinity). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04libceph: refer to osdmap directly in osdmap_show()Ilya Dryomov1-12/+14
To make it more readable and save screen space. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04crush: support chooseleaf_vary_r tunable (tunables3) by defaultIlya Dryomov1-1/+9
Add TUNABLES3 feature (chooseleaf_vary_r tunable) to a set of features supported by default. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04crush: add SET_CHOOSELEAF_VARY_R stepIlya Dryomov2-0/+6
This lets you adjust the vary_r tunable on a per-rule basis. Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04crush: add chooseleaf_vary_r tunableIlya Dryomov2-6/+30
The current crush_choose_firstn code will re-use the same 'r' value for the recursive call. That means that if we are hitting a collision or rejection for some reason (say, an OSD that is marked out) and need to retry, we will keep making the same (bad) choice in that recursive selection. Introduce a tunable that fixes that behavior by incorporating the parent 'r' value into the recursive starting point, so that a different path will be taken in subsequent placement attempts. Note that this was done from the get-go for the new crush_choose_indep algorithm. This was exposed by a user who was seeing PGs stuck in active+remapped after reweight-by-utilization because the up set mapped to a single OSD. Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04crush: allow crush rules to set (re)tries counts to 0Ilya Dryomov1-2/+2
These two fields are misnomers; they are *retry* counts. Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04crush: fix off-by-one errors in total_tries refactorIlya Dryomov1-19/+27
Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH code to allow adjustment of the retry counts on a per-pool basis. That commit had an off-by-one bug: the previous "tries" counter was a *retry* count, not a *try* count, but the new code was passing in 1 meaning there should be no retries. Fix the ftotal vs tries comparison to use < instead of <= to fix the problem. Note that the original code used <= here, which means the global "choose_total_tries" tunable is actually counting retries. Compensate for that by adding 1 in crush_do_rule when we pull the tunable into the local variable. This was noticed looking at output from a user provided osdmap. Unfortunately the map doesn't illustrate the change in mapping behavior and I haven't managed to construct one yet that does. Inspection of the crush debug output now aligns with prior versions, though. Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04ceph: don't include ceph.{file,dir}.layout vxattr in listxattr()Yan, Zheng1-2/+2
This avoids 'cp -a' modifying layout of new files/directories. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04ceph: check buffer size in ceph_vxattrcb_layout()Yan, Zheng1-9/+25
If buffer size is zero, return the size of layout vxattr. If buffer size is not zero, check if it is large enough for layout vxattr. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04ceph: fix null pointer dereference in discard_cap_releases()Yan, Zheng1-9/+12
send_mds_reconnect() may call discard_cap_releases() after all release messages have been dropped by cleanup_cap_releases() Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance()Yan, Zheng1-0/+6
When there is no more data, ceph_msg_data_{pages,pagelist}_advance() should not move on to the next page. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04ceph: Remove get/set acl on symlinksFabian Frederick1-2/+0
Remove unsupported symlink operations. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-04-04ceph: set mds_wanted when MDS reply changes a cap to auth capYan, Zheng1-1/+3
When adjusting caps client wants, MDS does not record caps that are not allowed. For non-auth MDS, it does not record WR caps. So when a MDS reply changes a non-auth cap to auth cap, client needs to set cap's mds_wanted according to the reply. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04ceph: use fl->fl_file as owner identifier of flock and posix lockYan, Zheng4-22/+45
flock and posix lock should use fl->fl_file instead of process ID as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner is usually equal to fl->fl_file, but it also can be a customized value). The process ID of who holds the lock is just for F_GETLK fcntl(2). The fix is rename the 'pid' fields of struct ceph_mds_request_args and struct ceph_filelock to 'owner', rename 'pid_namespace' fields to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages. We also set the most significant bit of the 'owner' field. MDS can use that bit to distinguish between old and new clients. The MDS counterpart of this patch modifies the flock code to not take the 'pid_namespace' into consideration when checking conflict locks. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04ceph: forbid mandatory file lockYan, Zheng1-0/+12
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04ceph: use fl->fl_type to decide flock operationYan, Zheng1-12/+9
VFS does not directly pass flock's operation code to filesystem's flock callback. It translates the operation code to the form how posix lock's parameters are presented. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04ceph: update i_max_size even if inode version does not changeYan, Zheng1-8/+8
handle following sequence of events: - client releases a inode with i_max_size > 0. The release message is queued. (is not sent to the auth MDS) - a 'lookup' request reply from non-auth MDS returns the same inode. - client opens the inode in write mode. The version of inode trace in 'open' request reply is equal to the cached inode's version. - client requests new max size. The MDS ignores the request because it does not affect client's write range Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04ceph: make sure write caps are registered with auth MDSYan, Zheng1-1/+4
Only auth MDS can issue write caps to clients, so don't consider write caps registered with non-auth MDS as valid. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03ceph: print inode number for LOOKUPINO requestYan, Zheng1-0/+2
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03ceph: add get_name() NFS export callbackYan, Zheng4-1/+92
Use the newly introduced LOOKUPNAME MDS request to connect child inode to its parent directory. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03ceph: fix ceph_fh_to_parent()Yan, Zheng1-33/+9
ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field of struct ceph_nfs_confh. This is wrong, it should return dentry that corresponds to the 'parent_ino' field. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03ceph: add get_parent() NFS export callbackYan, Zheng1-0/+60
The callback uses LOOKUPPARENT MDS request to find parent. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03ceph: simplify ceph_fh_to_dentry()Yan, Zheng1-135/+32
MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way. So __cfh_to_dentry() is redundant. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03ceph: fscache: Wait for completion of object initializationYunchuan Wen1-0/+1
The object store limit needs to be updated after writing, and this can be done provided the corresponding object has already been initialized. Current object initialization is done asynchrously, which introduce a race if a file is opened, then immediately followed by a writing, the initialization may have not completed, the code will reach the ASSERT in fscache_submit_exclusive_op() to cause kernel bug. Tested-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com> Signed-off-by: Min Chen <minchen@ubuntukylin.com> Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03ceph: fscache: Update object store limit after file writingYunchuan Wen1-0/+3
Synchronize object->store_limit[_l] with new inode->i_size after file writing. Tested-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com> Signed-off-by: Min Chen <minchen@ubuntukylin.com> Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03ceph: fscache: add an interface to synchronize object store limitYunchuan Wen1-0/+10
Add an interface to explicitly synchronize object->store_limit[_l] with inode->i_size Tested-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com> Signed-off-by: Min Chen <minchen@ubuntukylin.com> Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03ceph: do not set r_old_dentry_dir on link()Sage Weil1-2/+1
This is racy--we do not know whather d_parent has changed out from underneath us because i_mutex is not held on the source inode's directory. Also, taking this reference is useless. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03ceph: do not assume r_old_dentry[_dir] always set togetherSage Weil2-4/+6
Do not assume that r_old_dentry implies that r_old_dentry_dir is also true. Separate out the ref cleanup and make the debugs dump behave when it is NULL. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03ceph: do not chain inode updates to parent fsyncSage Weil4-17/+5
The fsync(dirfd) only covers namespace operations, not inode updates. We do not need to cover setattr variants or O_TRUNC. Reported-by: Al Viro <viro@xeniv.linux.org.uk> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename()Sage Weil1-1/+2
This is just old_dir; no reason to abuse the dcache pointers. Reported-by: Al Viro <viro.zeniv.linux.org.uk> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03ceph: let MDS adjust readdir 'frag'Yan, Zheng1-3/+0
If readdir 'frag' is adjusted, readdir 'offset' should be reset. Otherwise some dentries may be lost when readdir and fragmenting directory happen at the some. Another way to fix this issue is let MDS adjust readdir 'frag'. The code that handles MDS reply reset the readdir 'offset' if the readdir reply is different than the requested one. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03ceph: fix reset_readdir()Yan, Zheng1-3/+6
When changing readdir postion, fi->next_offset should be set to 0 if the new postion is not in the first dirfrag. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03ceph: fix ceph_dir_llseek()Yan, Zheng2-7/+7
Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for directory. For a fragmented directory, offset (frag_t, off) can be larger than inode->i_sb->s_maxbytes. At the very beginning of ceph_dir_llseek(), local variable old_offset is initialized to parameter offset. This doesn't make sense neither. Old_offset should be ceph_make_fpos(fi->frag, fi->next_offset). Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd opIlya Dryomov1-15/+39
In an effort to reduce fragmentation, prefix every rbd write with a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set to the object size (1 << order). Backwards compatibility is taken care of on the libceph/osd side. "The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to do it once. The reason every rbd write is prefixed is that rbd doesn't explicitly create objects and relies on writes creating them implicitly, so there is no place to stick a single hint op into. To get around that we decided to prefix every rbd write with a hint (just like write and setattr ops, hint op will create an object implicitly if it doesn't exist)." Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03rbd: num_ops parameter for rbd_osd_req_create()Ilya Dryomov1-10/+18
In preparation for prefixing rbd writes with an allocation hint introduce a num_ops parameter for rbd_osd_req_create(). The rationale is that not every write request is a write op that needs to be prefixed (e.g. watch op), so the num_ops logic needs to be in the callers. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03libceph: bump CEPH_OSD_MAX_OP to 3Ilya Dryomov2-2/+2
Our longest osd request now contains 3 ops: copyup+hint+write. Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was hard-coded to 2. Fix it, and switch to rbd_assert while at it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd opIlya Dryomov3-0/+42
This is primarily for rbd's benefit and is supposed to combat fragmentation: "... knowing that rbd images have a 4m size, librbd can pass a hint that will let the osd do the xfs allocation size ioctl on new files so that they are allocated in 1m or 4m chunks. We've seen cases where users with rbd workloads have very high levels of fragmentation in xfs and this would mitigate that and probably have a pretty nice performance benefit." SETALLOCHINT is considered advisory, so our backwards compatibility mechanism here is to set FAILOK flag for all SETALLOCHINT ops. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03libceph: encode CEPH_OSD_OP_FLAG_* op flagsIlya Dryomov3-1/+4
Encode ceph_osd_op::flags field so that it gets sent over the wire. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03rbd: fix error paths in rbd_img_request_fill()Ilya Dryomov1-1/+1
Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is not only insufficient, but also triggers an rbd_assert() in rbd_obj_request_destroy(): Assertion failure in rbd_obj_request_destroy() at line 1867: rbd_assert(obj_request->img_request == NULL); rbd_img_obj_request_add() adds obj_requests to the img_request, the opposite is rbd_img_obj_request_del(). Use it. Fixes: http://tracker.ceph.com/issues/7327 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03rbd: remove out_partial label in rbd_img_request_fill()Ilya Dryomov1-4/+3
Commit 03507db631c94 ("rbd: fix buffer size for writes to images with snapshots") moved the call to rbd_img_obj_request_add() up, making the out_partial label bogus. Remove it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03libceph: a per-osdc crush scratch bufferIlya Dryomov2-9/+19
With the addition of erasure coding support in the future, scratch variable-length array in crush_do_rule_ary() is going to grow to at least 200 bytes on average, on top of another 128 bytes consumed by rawosd/osd arrays in the call chain. Replace it with a buffer inside struct osdmap and a mutex. This shouldn't result in any contention, because all osd requests were already serialized by request_mutex at that point; the only unlocked caller was ceph_ioctl_get_dataloc(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-03-30Linux 3.14v3.14Linus Torvalds1-1/+1
2014-03-30Merge branch 'for-linus-2' of ↵Linus Torvalds4-77/+134
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Switch mnt_hash to hlist, turning the races between __lookup_mnt() and hash modifications into false negatives from __lookup_mnt() (instead of hangs)" On the false negatives from __lookup_mnt(): "The *only* thing we care about is not getting stuck in __lookup_mnt(). If it misses an entry because something in front of it just got moved around, etc, we are fine. We'll notice that mount_lock mismatch and that'll be it" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch mnt_hash to hlist don't bother with propagate_mnt() unless the target is shared keep shadowed vfsmounts together resizable namespace.c hashes
2014-03-30MAINTAINERS: resume as Documentation maintainerRandy Dunlap1-2/+2
I am the new kernel tree Documentation maintainer (except for parts that are handled by other people, of course). Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Rob Landley <rob@landley.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-30Merge branch 'for-linus' of ↵Linus Torvalds2-32/+45
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: "Some more updates for the input subsystem. You will get a fix for race in mousedev that has been causing quite a few oopses lately and a small fixup for force feedback support in evdev" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: mousedev - fix race when creating mixed device Input: don't modify the id of ioctl-provided ff effect on upload failure
2014-03-30AUDIT: Allow login in non-init namespacesEric Paris1-1/+11
It its possible to configure your PAM stack to refuse login if audit messages (about the login) were unable to be sent. This is common in many distros and thus normal configuration of many containers. The PAM modules determine if audit is enabled/disabled in the kernel based on the return value from sending an audit message on the netlink socket. If userspace gets back ECONNREFUSED it believes audit is disabled in the kernel. If it gets any other error else it refuses to let the login proceed. Just about ever since the introduction of namespaces the kernel audit subsystem has returned EPERM if the task sending a message was not in the init user or pid namespace. So many forms of containers have never worked if audit was enabled in the kernel. BUT if the container was not in net_init then the kernel network code would send ECONNREFUSED (instead of the audit code sending EPERM). Thus by pure accident/dumb luck/bug if an admin configured the PAM stack to reject all logins that didn't talk to audit, but then ran the login untility in the non-init_net namespace, it would work!! Clearly this was a bug, but it is a bug some people expected. With the introduction of network namespace support in 3.14-rc1 the two bugs stopped cancelling each other out. Now, containers in the non-init_net namespace refused to let users log in (just like PAM was configfured!) Obviously some people were not happy that what used to let users log in, now didn't! This fix is kinda hacky. We return ECONNREFUSED for all non-init relevant namespaces. That means that not only will the old broken non-init_net setups continue to work, now the broken non-init_pid or non-init_user setups will 'work'. They don't really work, since audit isn't logging things. But it's what most users want. In 3.15 we should have patches to support not only the non-init_net (3.14) namespace but also the non-init_pid and non-init_user namespace. So all will be right in the world. This just opens the doors wide open on 3.14 and hopefully makes users happy, if not the audit system... Reported-by: Andre Tomt <andre@tomt.net> Reported-by: Adam Richter <adam_richter2004@yahoo.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-30ext4: atomically set inode->i_flags in ext4_set_inode_flags()Theodore Ts'o2-6/+24
Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time. Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-30switch mnt_hash to hlistAl Viro4-50/+61
fixes RCU bug - walking through hlist is safe in face of element moves, since it's self-terminating. Cyclic lists are not - if we end up jumping to another hash chain, we'll loop infinitely without ever hitting the original list head. [fix for dumb braino folded] Spotted by: Max Kellermann <mk@cm4all.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30don't bother with propagate_mnt() unless the target is sharedAl Viro1-10/+7
If the dest_mnt is not shared, propagate_mnt() does nothing - there's no mounts to propagate to and thus no copies to create. Might as well don't bother calling it in that case. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>