Age | Commit message (Collapse) | Author | Files | Lines |
|
In the process of disentagling/libraryizing bset.c from the rest of the
bcache code.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
It was a single element mempool before, it's slightly cleaner to just use a real
mempool.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Used this fixed code to find and fix the bug fixed by
a4d885097b0ac0cd1337f171f2d4b83e946094d4.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
That was a terrible name for a macro, add some better helpers to replace it.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Now that we've got code for raid5/6 stripe awareness, bcache just needs
to know about the stripes and when writing partial stripes is expensive
- we probably don't want to enable this optimization for raid1 or 10,
even though they have stripes. So add a flag to queue_limits.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
This error path shouldn't have been hit in practice.. and we've got reworked
reserve code coming soon so that it shouldn't _ever_ be bit... but if we've got
code for this error path it should be correct.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
We need a reserve for allocating buckets for new btree nodes - and now that
we've got multiple btrees, it really needs to be per btree.
This reworks the reserves so we've got separate freelists for each reserve
instead of watermarks, which seems to make things a bit cleaner, and it adds
some code so that btree_split() can make sure the reserve is available before it
starts.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Also flesh out the documentation a bit
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Another minor performance optimization
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Unnecessary since a bucket that has dirty pointers pointing to it can
never be invalidated - and skipping it is a measurable performance
boost, since the bucket gen will usually be a cache miss.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
We were unnecessarily waiting on a journal write to complete when we just needed
to start a journal write and start setting up the next one.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
The real fix is where we check the bytes we need against how much is
remaining - we also need to check for a journal entry bigger than our
buffer, we'll never write those and it would be bad if we tried to read
one.
Also improve the diagnostic messages.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.
The code that read the extents back in thus not bother with splitting extents -
if it saw an extent that ovelapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.
I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
|
|
When extending a metadata space map we should do the first commit whilst
still in bootstrap mode -- a mode where all blocks get allocated in the
new area.
That way the commit overhead is allocated from the newly added space.
Otherwise we risk running out of space.
With this fix, and the previous commit "dm space map common: make sure
new space is used during extend", the following device mapper testsuite
test passes:
dmtest run --suite thin-provisioning -n /resize_metadata_no_io/
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
When extending a low level space map we should update nr_blocks at
the start so the new space is used for the index entries.
Otherwise extend can fail, e.g.: sm_metadata_extend call sequence
that fails:
-> sm_ll_extend
-> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block
=> returns -ENOSPC because smm->begin == smm->ll.nr_blocks
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
There may be other parts of the kernel holding a reference on the dm
kobject. We must wait until all references are dropped before
deallocating the mapped_device structure.
The dm_kobject_release method signals that all references are dropped
via completion. But dm_kobject_release doesn't free the kobject (which
is embedded in the mapped_device structure).
This is the sequence of operations:
* when destroying a DM device, call kobject_put from dm_sysfs_exit
* wait until all users stop using the kobject, when it happens the
release method is called
* the release method signals the completion and should return without
delay
* the dm device removal code that waits on the completion continues
* the dm device removal code drops the dm_mod reference the device had
* the dm device removal code frees the mapped_device structure that
contains the kobject
Using kobject this way should avoid the module unload race that was
mentioned at the beginning of this thread:
https://lkml.org/lkml/2014/1/4/83
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
The comparison is always true and the compiler optimizes it out anyway.
Milan offered additional context relative to the original commit
784aae735d ("dm: add name and uuid to sysfs") which introduced the code:
"I think it is just relict of some experiments before I committed this
simple embedded sysfs kobj handling".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache.
This patch introduces three new tunables that allow you to tweak the
promotion threshold by adding a small value. These adjustments depend
on the io type:
read_promote_adjustment: READ io, default 4
write_promote_adjustment: WRITE io, default 8
discard_promote_adjustment: READ/WRITE io to a discarded block, default 1
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The pool mode must not be switched until after the corresponding pool
process_* methods have been established. Otherwise, because
set_pool_mode() isn't interlocked with the IO path for performance
reasons, the IO path can end up executing process_* operations that
don't match the mode. This patch eliminates problems like the following
(as seen on really fast PCIe SSD storage when transitioning the pool's
mode from PM_READ_ONLY to PM_WRITE):
kernel: device-mapper: thin: 253:2: reached low water mark for data device: sending event.
kernel: device-mapper: thin: 253:2: no free data space available.
kernel: device-mapper: thin: 253:2: switching pool to read-only mode
kernel: device-mapper: thin: 253:2: switching pool to write mode
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 11 PID: 7564 at drivers/md/dm-thin.c:995 handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]()
...
kernel: Workqueue: dm-thin do_worker [dm_thin_pool]
kernel: 00000000000003e3 ffff880308831cc8 ffffffff8152ebcb 00000000000003e3
kernel: 0000000000000000 ffff880308831d08 ffffffff8104c46c ffff88032502a800
kernel: ffff880036409000 ffff88030ec7ce00 0000000000000001 00000000ffffffc3
kernel: Call Trace:
kernel: [<ffffffff8152ebcb>] dump_stack+0x49/0x5e
kernel: [<ffffffff8104c46c>] warn_slowpath_common+0x8c/0xc0
kernel: [<ffffffff8104c4ba>] warn_slowpath_null+0x1a/0x20
kernel: [<ffffffffa001e2c6>] handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]
kernel: [<ffffffffa001f276>] process_bio_read_only+0x136/0x180 [dm_thin_pool]
kernel: [<ffffffffa0020b75>] process_deferred_bios+0xc5/0x230 [dm_thin_pool]
kernel: [<ffffffffa0020d31>] do_worker+0x51/0x60 [dm_thin_pool]
kernel: [<ffffffff81067823>] process_one_work+0x183/0x490
kernel: [<ffffffff81068c70>] worker_thread+0x120/0x3a0
kernel: [<ffffffff81068b50>] ? manage_workers+0x160/0x160
kernel: [<ffffffff8106e86e>] kthread+0xce/0xf0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: [<ffffffff8153b3ec>] ret_from_fork+0x7c/0xb0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: ---[ end trace 3f00528e08ffa55c ]---
kernel: device-mapper: thin: pool mode is PM_WRITE not PM_READ_ONLY like expected!?
dm-thin.c:995 was the WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY);
at the top of handle_unserviceable_bio(). And as the additional
debugging I had conveys: the pool mode was _not_ PM_READ_ONLY like
expected, it was already PM_WRITE, yet pool->process_bio was still set
to process_bio_read_only().
Also, while fixing this up, reduce logging of redundant pool mode
transitions by checking new_mode is different from old_mode.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
The pool's error_if_no_space flag can easily serve the same purpose that
no_free_space did, namely: control whether handle_unserviceable_bio()
will error a bio or requeue it.
This is cleaner since error_if_no_space is established when the pool's
features are processed during table load. So it avoids managing the
no_free_space flag by taking the pool's spinlock.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
If the pool runs out of data or metadata space, the pool can either
queue or error the IO destined to the data device. The default is to
queue the IO until more space is added.
An admin may now configure the pool to error IO when no space is
available by setting the 'error_if_no_space' feature when loading the
thin-pool table.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Now that we switch the pool to read-only mode when the data device runs
out of space it causes active writers to get IO errors once we resume
after resizing the data device.
If no_free_space is set, save bios to the 'retry_on_resume_list' and
requeue them on resume (once the data or metadata device may have been
resized).
With this patch the resize_io test passes again (on slower storage):
dmtest run --suite thin-provisioning -n /resize_io/
Later patches fix some subtle races associated with the pool mode
transitions done as part of the pool's -ENOSPC handling. These races
are exposed on fast storage (e.g. PCIe SSD).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Factor out_of_data_space() out of alloc_data_block(). Eliminate the use
of 'no_free_space' as a latch in alloc_data_block() -- this is no longer
needed now that we switch to read-only mode when we run out of data or
metadata space. In a later patch, the 'no_free_space' flag will be
eliminated entirely (in favor of checking metadata rather than relying
on a transient flag).
Move no metdata space handling into metdata_operation_failed(). Set
no_free_space when metadata space is exhausted too. This is useful,
because it offers consistency, for the following patch that will requeue
data IOs if no_free_space.
Also, rename no_space() to retry_bios_on_resume().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Introduce metadata_operation_failed() wrappers, around set_pool_mode(),
to assist with improving the consistency of how metadata failures are
handled. Logging is improved and metadata operation failures trigger
read-only mode immediately.
Also, eliminate redundant set_pool_mode() calls in the two
alloc_data_block() caller's error paths.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Factor check_low_water_mark() out of alloc_data_block().
Change a couple unsigned flags in the pool structure to bool.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Mappings could be processed in descending logical block order,
particularly if buffered IO is used. This could adversely affect the
latency of IO processing. Fix this by adding mappings to the end of the
'prepared_mappings' and 'prepared_discards' lists.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Also, move 'err' member in dm_thin_new_mapping structure to eliminate 4
byte hole (reduces size from 88 bytes to 80).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
DM's persistent-data library is now used my multiple targets so
exclusive references to "pool" or "thin provisioning" need to be
cleaned up. Adjust Kconfig's DM_DEBUG_BLOCK_STACK_TRACING text
and remove "pool" from a block manager error message.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
The "unable to allocate new metadata block" error can be a particularly
verbose error if there is a systemic issue with the metadata device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Starting with commit c0820cf5ad095 ("dm: introduce per_bio_data"),
device mapper has the capability to pre-allocate a target-specific
structure with the bio.
This patch changes dm-delay to use this facility instead of a slab cache
and mempool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
A device mapper table is allocated in the following way:
* The function dm_table_create is called, it gets the number of targets
as an argument -- it allocates a targets array accordingly.
* For each target, we call dm_table_add_target.
If we add more targets than were specified in dm_table_create, the
function dm_table_add_target reallocates the targets array. However,
this reallocation code is wrong - it moves the targets array to a new
location, while some target constructors hold pointers to the array in
the old location.
The following DM target drivers save the pointer to the target
structure, so they corrupt memory if the target array is moved:
multipath, raid, mirror, snapshot, stripe, switch, thin, verity.
Under normal circumstances, the reallocation function is not called
(because dm_table_create is called with the correct number of targets),
so the buggy reallocation code is not used.
Prior to the fix "dm table: fail dm_table_create on dm_round_up
overflow", the reallocation code could only be used in case the user
specifies too large a value in param->target_count, such as 0xffffffff.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
If a snapshot is created and later deleted the origin dm_thin_device's
snapshotted_time will have been updated to reflect the snapshot's
creation time. The 'shared' flag in the dm_thin_lookup_result struct
returned from dm_thin_find_block() is an approximation based on
snapshotted_time -- this is done to avoid 0(n), or worse, time
complexity. In this case, the shared flag would be true.
But because the 'shared' flag reflects an approximation a block can be
incorrectly assumed to be shared (e.g. false positive for 'shared'
because the snapshot no longer exists). This could result in discards
issued to a thin device not being passed down to the pool's underlying
data device.
To fix this we double check that a thin block is really still in-use
after a mapping is removed using dm_pool_block_is_used(). If the
reference count for a block is now zero the discard is allowed to be
passed down.
Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
structure -- reflects that the 'shared' flag in the response from
dm_thin_find_block() can only be held as definitive if false is
returned.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
As additional members are added to the dm_thin_new_mapping structure
care should be taken to make sure they get initialized before use.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
|
|
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
|
|
Pull block fixes from Jens Axboe:
- fix for a memory leak on certain unplug events
- a collection of bcache fixes from Kent and Nicolas
- a few null_blk fixes and updates form Matias
- a marking of static of functions in the stec pci-e driver
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: support submit_queues on use_per_node_hctx
null_blk: set use_per_node_hctx param to false
null_blk: corrections to documentation
null_blk: warning on ignored submit_queues param
null_blk: refactor init and init errors code paths
null_blk: documentation
null_blk: mem garbage on NUMA systems during init
drivers: block: Mark the functions as static in skd_main.c
bcache: New writeback PD controller
bcache: bugfix for race between moving_gc and bucket_invalidate
bcache: fix for gc and writeback race
bcache: bugfix - moving_gc now moves only correct buckets
bcache: fix for gc crashing when no sectors are used
bcache: Fix heap_peek() macro
bcache: Fix for can_attach_cache()
bcache: Fix dirty_data accounting
bcache: Use uninterruptible sleep in writeback
bcache: kthread don't set writeback task to INTERUPTIBLE
block: fix memory leaks on unplugging block device
bcache: fix sparse non static symbol warning
|
|
We want these fixes here to handle some merge issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
into for-linus
Kent writes:
Jens - small pile of bcache fixes. I've been slacking on the writeback
fixes but those definitely need to get into 3.13.
|
|
The old writeback PD controller could get into states where it had throttled all
the way down and take way too long to recover - it was too complicated to really
understand what it was doing.
This rewrites a good chunk of it to hopefully be simpler and make more sense,
and it also pays more attention to units which should make the behaviour a bit
easier to understand.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
There is a possibility for a bucket to be invalidated by the allocator
while moving_gc was copying it's contents to another bucket, if the
bucket only held cached data. To prevent this moving checks for
a stale ptr (to an invalidated bucket), before and after reads.
It it finds one, it simply ignores moving that data. This only
affects bcache if the moving_gc was turned on, note that it's
off by default.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Garbage collector needs to check keys in the writeback keybuf to
make sure it's not invalidating buckets to which the writeback
keys point to.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Removed gc_move_threshold because picking buckets only by
threshold could lead moving extra buckets (ei. if there are
buckets at the threshold that aren't supposed to be moved
do to space considerations).
This is replaced by a GC_MOVE bit in the gc_mark bitmask.
Now only marked buckets get moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|