Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull ceph updates from Ilya Dryomov:
"Notable items here are
- a series to take advantage of David Howells' netfs helper library
from Jeff
- three new filesystem client metrics from Xiubo
- ceph.dir.rsnaps vxattr from Yanhu
- two auth-related fixes from myself, marked for stable.
Interspersed is a smattering of assorted fixes and cleanups across the
filesystem"
* tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client: (24 commits)
libceph: allow addrvecs with a single NONE/blank address
libceph: don't set global_id until we get an auth ticket
libceph: bump CephXAuthenticate encoding version
ceph: don't allow access to MDS-private inodes
ceph: fix up some bare fetches of i_size
ceph: convert some PAGE_SIZE invocations to thp_size()
ceph: support getting ceph.dir.rsnaps vxattr
ceph: drop pinned_page parameter from ceph_get_caps
ceph: fix inode leak on getattr error in __fh_to_dentry
ceph: only check pool permissions for regular files
ceph: send opened files/pinned caps/opened inodes metrics to MDS daemon
ceph: avoid counting the same request twice or more
ceph: rename the metric helpers
ceph: fix kerneldoc copypasta over ceph_start_io_direct
ceph: use attach/detach_page_private for tracking snap context
ceph: don't use d_add in ceph_handle_snapdir
ceph: don't clobber i_snap_caps on non-I_NEW inode
ceph: fix fall-through warnings for Clang
ceph: convert ceph_readpages to ceph_readahead
ceph: convert ceph_write_begin to netfs_write_begin
...
|
|
Normally, an unused OSD id/slot is represented by an empty addrvec.
However, it also appears to be possible to generate an osdmap where
an unused OSD id/slot has an addrvec with a single blank address of
type NONE. Allow such addrvecs and make the end result be exactly
the same as for the empty addrvec case -- leave addr intact.
Cc: stable@vger.kernel.org # 5.11+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
With the introduction of enforcing mode, setting global_id as soon
as we get it in the first MAuth reply will result in EACCES if the
connection is reset before we get the second MAuth reply containing
an auth ticket -- because on retry we would attempt to reclaim that
global_id with no auth ticket at hand.
Neither ceph_auth_client nor ceph_mon_client depend on global_id
being set ealy, so just delay the setting until we get and process
the second MAuth reply. While at it, complain if the monitor sends
a zero global_id or changes our global_id as the session is likely
to fail after that.
Cc: stable@vger.kernel.org # needs backporting for < 5.11
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
|
|
A dummy v3 encoding (exactly the same as v2) was introduced so that
the monitors can distinguish broken clients that may not include their
auth ticket in CEPHX_GET_AUTH_SESSION_KEY request on reconnects, thus
failing to prove previous possession of their global_id (one part of
CVE-2021-20288).
The kernel client has always included its auth ticket, so it is
compatible with enforcing mode as is. However we want to bump the
encoding version to avoid having to authenticate twice on the initial
connect -- all legacy (CephXAuthenticate < v3) are now forced do so in
order to expose insecure global_id reclaim.
Marking for stable since at least for 5.11 and 5.12 it is trivial
(v2 -> v3).
Cc: stable@vger.kernel.org # 5.11+
URL: https://tracker.ceph.com/issues/50452
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
|
|
Modify "inital" to "initial" in net/ceph/osdmap.c.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 83aff95eb9d6 ("libceph: remove 'osdtimeout' option") deprecated
osdtimeout over 8 years ago, but it is still recognized. Let's remove
it entirely.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
These options were introduced in 3.19 with support for message signing
and are rather useless, as explained in commit a51983e4dd2d ("libceph:
add nocephx_sign_messages option"). Deprecate them.
In case there is someone out there with a cluster that lacks support
for MSG_AUTH feature (very unlikely but has to be considered since we
haven't formally raised the bar from argonaut to bobtail yet), make
nocephx_sign_messages also waive MSG_AUTH requirement. This is probably
how it should have been done in the first place -- if we aren't going
to sign, requiring the signing feature makes no sense.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
This line dates back to 2013, but cppcheck complained because commit
2f713615ddd9 ("libceph: move msgr1 protocol implementation to its own
file") moved it. Add parenthesis to silence the warning.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Since a few years, kernel addresses are no longer included in oops
dumps, at least on x86. All we get is a symbol name with offset and
size.
This is a problem for ceph_connection_operations handlers, especially
con->ops->dispatch(). All three handlers have the same name and there
is little context to disambiguate between e.g. monitor and OSD clients
because almost everything is inlined. gdb sneakily stops at the first
matching symbol, so one has to resort to nm and addr2line.
Some of these are already prefixed with mon_, osd_ or mds_. Let's do
the same for all others.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
|
|
Try and avoid leaving bits and pieces of session key and connection
secret (gets split into GCM key and a pair of GCM IVs) around.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
crypto_shash_setkey() and crypto_aead_setkey() will do a (small)
GFP_ATOMIC allocation to align the key if it isn't suitably aligned.
It's not a big deal, but at the same time easy to avoid.
The actual alignment requirement is dynamic, queryable with
crypto_shash_alignmask() and crypto_aead_alignmask(), but shouldn't
be stricter than 16 bytes for our algorithms.
Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
auth_signature frame is 68 bytes in plain mode and 96 bytes in
secure mode but we are requesting 68 bytes in both modes. By luck,
this doesn't actually result in any invalid memory accesses because
the allocation is satisfied out of kmalloc-96 slab and so exactly
96 bytes are allocated, but KASAN rightfully complains.
Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Reported-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Pull ceph updates from Ilya Dryomov:
"The big ticket item here is support for msgr2 on-wire protocol, which
adds the option of full in-transit encryption using AES-GCM algorithm
(myself).
On top of that we have a series to avoid intermittent errors during
recovery with recover_session=clean and some MDS request encoding work
from Jeff, a cap handling fix and assorted observability improvements
from Luis and Xiubo and a good number of cleanups.
Luis also ran into a corner case with quotas which sadly means that we
are back to denying cross-quota-realm renames"
* tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits)
libceph: drop ceph_auth_{create,update}_authorizer()
libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
libceph: introduce connection modes and ms_mode option
libceph, rbd: ignore addr->type while comparing in some cases
libceph, ceph: get and handle cluster maps with addrvecs
libceph: factor out finish_auth()
libceph: drop ac->ops->name field
libceph: amend cephx init_protocol() and build_request()
libceph, ceph: incorporate nautilus cephx changes
libceph: safer en/decoding of cephx requests and replies
libceph: more insight into ticket expiry and invalidation
libceph: move msgr1 protocol specific fields to its own struct
libceph: move msgr1 protocol implementation to its own file
libceph: separate msgr1 protocol implementation
libceph: export remaining protocol independent infrastructure
libceph: export zero_page
libceph: rename and export con->flags bits
libceph: rename and export con->state states
libceph: make con->state an int
...
|
|
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
This shouldn't cause any functional changes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Implement msgr2.1 wire protocol, available since nautilus 14.2.11
and octopus 15.2.5. msgr2.0 wire protocol is not implemented -- it
has several security, integrity and robustness issues and therefore
considered deprecated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
msgr2 supports two connection modes: crc (plain) and secure (on-wire
encryption). Connection mode is picked by server based on input from
client.
Introduce ms_mode option:
ms_mode=legacy - msgr1 (default)
ms_mode=crc - crc mode, if denied fail
ms_mode=secure - secure mode, if denied fail
ms_mode=prefer-crc - crc mode, if denied agree to secure mode
ms_mode=prefer-secure - secure mode, if denied agree to crc mode
ms_mode affects all connections, we don't separate connections to mons
like it's done in userspace with ms_client_mode vs ms_mon_client_mode.
For now the default is legacy, to be flipped to prefer-crc after some
time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
For libceph, this ensures that libceph instance sharing (share option)
continues to work. For rbd, this avoids blocklisting alive lock owners
(locker addr is always LEGACY, while watcher addr is ANY in nautilus).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In preparation for msgr2, make the cluster send us maps with addrvecs
including both LEGACY and MSGR2 addrs instead of a single LEGACY addr.
This means advertising support for SERVER_NAUTILUS and also some older
features: SERVER_MIMIC, MONENC and MONNAMES.
MONNAMES and MONENC are actually pre-argonaut, we just never updated
ceph_monmap_decode() for them. Decoding is unconditional, see commit
23c625ce3065 ("libceph: assume argonaut on the server side").
SERVER_MIMIC doesn't bear any meaning for the kernel client.
Since ceph_decode_entity_addrvec() is guarded by encoding version
checks (and in msgr2 case it is guarded implicitly by the fact that
server is speaking msgr2), we assume MSG_ADDR2 for it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In preparation for msgr2, factor out finish_auth() so it is suitable
for both existing MAuth message based authentication and upcoming msgr2
authentication exchange.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In msgr2, initial authentication happens with an exchange of msgr2
control frames -- MAuth message and struct ceph_mon_request_header
aren't used. Make that optional.
Stop reporting cephx protocol as "x". Use "cephx" instead.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
- request service tickets together with auth ticket. Currently we get
auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service
tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message.
Since nautilus, desired service tickets are shared togther with auth
ticket in CEPHX_GET_AUTH_SESSION_KEY reply.
- propagate session key and connection secret, if any. In preparation
for msgr2, update handle_reply() and verify_authorizer_reply() auth
ops to propagate session key and connection secret. Since nautilus,
if secure mode is negotiated, connection secret is shared either in
CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer
reply (for osds and mdses).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Make it clear that "need" is a union of "missing" and "have, but up
for renewal" and dout when the ticket goes missing due to expiry or
invalidation by client.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
A couple whitespace fixups, no functional changes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
A pure move, no other changes.
Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}()
helpers are also moved. msgr2 will bring its own, more efficient,
variants based on iov_iter. Switching msgr1 to them was considered
but decided against to avoid subtle regressions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In preparation for msgr2, define internal messenger <-> protocol
interface (as opposed to external messenger <-> client interface, which
is struct ceph_connection_operations) consisting of try_read(),
try_write(), revoke(), revoke_incoming(), opened(), reset_session() and
reset_protocol() ops. The semantics are exactly the same as they are
now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In preparation for msgr2, make all protocol independent functions
in messenger.c global.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In preparation for msgr2, make zero_page global.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In preparation for msgr2, move the defines to the header file.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In preparation for msgr2, rename msgr1 specific states and move the
defines to the header file.
Also drop state transition comments. They don't cover all possible
transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more
harm than good.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
unsigned long is a leftover from when con->state used to be a set of
bits managed with set_bit(), clear_bit(), etc. Save a bit of memory.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Our messenger instance addr->port is normally zero -- anything else is
nonsensical because as a client we connect to multiple servers and don't
listen on any port. However, a user can supply an arbitrary addr:port
via ip option and the port is currently preserved. Zero it.
Conversely, make sure our addr->nonce is non-zero. A zero nonce is
special: in combination with a zero port, it is used to blocklist the
entire ip.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Move the logic of grabbing the next message from the queue into its own
function. Like ceph_con_in_msg_alloc(), this is protocol independent.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and
struct ceph_msg_header in general) is msgr1 specific. While the struct
is deeply ingrained inside and outside the messenger, con->in_hdr field
can be separated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Make it possible to have local cursors and embed them outside struct
ceph_msg.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Make it easier to follow and remove dependency on msgr1 specific
CEPH_MSGR_TAG_SEQ.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
It is set in process_ack() but never used.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Stick with pr_info message because session reset isn't an error most of
the time. When it is (i.e. if the server denies the reconnect attempt),
we get a bunch of other pr_err messages.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
con->peer_global_seq is part of session state. Clear it when
the server tells us to reset, not just in ceph_con_close().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
With just session reset bits left, rename appropriately.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Move protocol reset bits into ceph_con_reset_protocol(), leaving
just session reset bits.
Note that con->out_skip is now reset on faults. This fixes a crash
in the case of a stateful session getting a fault while in the middle
of revoking a message.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
A fault due to a version mismatch or a feature set mismatch used to be
treated differently from other faults: the connection would get closed
without trying to reconnect and there was a ->bad_proto() connection op
for notifying about that.
This changed a long time ago, see commits 6384bb8b8e88 ("libceph: kill
bad_proto ceph connection op") and 0fa6ebc600bc ("libceph: fix protocol
feature mismatch failure path"). Nowadays these aren't any different
from other faults (i.e. we try to reconnect even though the mismatch
won't resolve until the server is replaced). reset_connection() calls
there are rather confusing because reset_connection() resets a session
together an individual instance of the protocol. This is cleaned up
in the next patch.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The current setting allows the backoff to climb up to 5 minutes. This
is too high -- it becomes hard to tell whether the client is stuck on
something or just in backoff.
In userspace, ms_max_backoff is defaulted to 15 seconds. Let's do the
same.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
con->out_msg must be cleared on Policy::stateful_server
(!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the
reconnection attempt, because after writing the banner the
messenger moves on to writing the data section of that message
(either from where it got interrupted by the connection reset or
from the beginning) instead of writing struct ceph_msg_connect.
This results in a bizarre error message because the server
sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
ceph_msg_connect:
libceph: mds0 (1)172.21.15.45:6828 socket error on write
ceph: mds0 reconnect start
libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch
AFAICT this bug goes back to the dawn of the kernel client.
The reason it survived for so long is that only MDS sessions
are stateful and only two MDS messages have a data section:
CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
The connection has to get reset precisely when such message
is being sent -- in this case it was the former.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/47723
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
Match the server side logs.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The queued con->work can start executing (and therefore logging)
before we get to this "con->work has been queued" message, making
the logs confusing. Move it up, with the meaning of "con->work
is about to be queued".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|