summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2016-08-01Btrfs: send, add missing error check for calls to path_loop()Filipe Manana1-0/+2
The function path_loop() can return a negative integer, signaling an error, 0 if there's no path loop and 1 if there's a path loop. We were treating any non zero values as meaning that a path loop exists. Fix this by explicitly checking for errors and gracefully return them to user space. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01Btrfs: send, fix failure to move directories with the same name aroundRobbie Ko1-5/+95
When doing an incremental send we can end up not moving directories that have the same name. This happens when the same parent directory has different child directories with the same name in the parent and send snapshots. For example, consider the following scenario: Parent snapshot: . (ino 256) |---- d/ (ino 257) | |--- p1/ (ino 258) | |---- p1/ (ino 259) Send snapshot: . (ino 256) |--- d/ (ino 257) |--- p1/ (ino 259) |--- p1/ (ino 258) The directory named "d" (inode 257) has in both snapshots an entry with the name "p1" but it refers to different inodes in both snapshots (inode 258 in the parent snapshot and inode 259 in the send snapshot). When attempting to move inode 258, the operation is delayed because its new parent, inode 259, was not yet moved/renamed (as the stream is currently processing inode 258). Then when processing inode 259, we also end up delaying its move/rename operation so that it happens after inode 258 is moved/renamed. This decision to delay the move/rename rename operation of inode 259 is due to the fact that the new parent inode (257) still has inode 258 as its child, which has the same name has inode 259. So we end up with inode 258 move/rename operation waiting for inode's 259 move/rename operation, which in turn it waiting for inode's 258 move/rename. This results in ending the send stream without issuing move/rename operations for inodes 258 and 259 and generating the following warnings in syslog/dmesg: [148402.979747] ------------[ cut here ]------------ [148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs] [148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1 [148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000 [148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8 [148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000 [148402.988136] Call Trace: [148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd [148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f [148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs] [148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs] [148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57 [148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34 [148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be [148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77 [148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71 [148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79 [148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8 [148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa [148403.011373] ---[ end trace a4539270c8056f8b ]--- [148403.012296] ------------[ cut here ]------------ [148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs] [148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1 [148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000 [148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8 [148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000 [148403.020104] Call Trace: [148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd [148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f [148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs] [148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs] [148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57 [148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34 [148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be [148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77 [148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71 [148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79 [148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8 [148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa [148403.038981] ---[ end trace a4539270c8056f8c ]--- There's another issue caused by similar (but more complex) changes in the directory hierarchy that makes move/rename operations fail, described with the following example: Parent snapshot: . |---- a/ (ino 262) | |---- c/ (ino 268) | |---- d/ (ino 263) |---- ance/ (ino 267) |---- e/ (ino 264) |---- f/ (ino 265) |---- ance/ (ino 266) Send snapshot: . |---- a/ (ino 262) |---- c/ (ino 268) | |---- ance/ (ino 267) | |---- d/ (ino 263) | |---- ance/ (ino 266) | |---- f/ (ino 265) |---- e/ (ino 264) When the inode 265 is processed, the path for inode 267 is computed, which at that time corresponds to "d/ance", and it's stored in the names cache. Later on when processing inode 266, we end up orphanizing (renaming to a name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has the same name as inode 266 and it's currently a child of the new parent directory (inode 263) for inode 266. After the orphanization and while we are still processing inode 266, a rename operation for inode 266 is generated. However the source path for that rename operation is incorrect because it ends up using the old, pre-orphanization, name of inode 267. The no longer valid name for inode 267 was previously cached when processing inode 265 and it remains usable and considered valid until the inode currently being processed has a number greater than 267. This resulted in the receiving side failing with the following error: ERROR: rename d/ance/ance -> d/ance failed: No such file or directory So fix these issues by detecting such circular dependencies for rename operations and by clearing the cached name of an inode once the inode is orphanized. A test case for fstests will follow soon. Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> [Rewrote change log to be more detailed and organized, and improved comments] Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01Btrfs: add missing check for writeback errors on fsyncFilipe Manana1-0/+8
When we start an fsync we start ordered extents for all delalloc ranges. However before attempting to log the inode, we only wait for those ordered extents if we are not doing a full sync (bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's flags). This means that if an ordered extent completes with an IO error before we check if we can skip logging the inode, we will not catch and report the IO error to user space. This is because on an IO error, when the ordered extent completes we do not update the inode, so if the inode was not previously updated by the current transaction we end up not logging it through calls to fsync and therefore not check its mapping flags for the presence of IO errors. Fix this by checking for errors in the flags of the inode's mapping when we notice we can skip logging the inode. This caused sporadic failures in the test generic/331 (which explicitly tests for IO errors during an fsync call). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-07-31Merge branch 'for-linus-4.8' of ↵Linus Torvalds6-338/+544
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is dedicated to Josef's enospc rework, which we've been testing for a few releases now. It fixes some early enospc problems and is dramatically faster. This also includes an updated fix for the delalloc accounting that happens after a fault in copy_from_user. My patch in v4.7 was almost but not quite enough" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix delalloc accounting after copy_from_user faults Btrfs: avoid deadlocks during reservations in btrfs_truncate_block Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes Btrfs: fill relocation block rsv after allocation Btrfs: always use trans->block_rsv for orphans Btrfs: change how we calculate the global block rsv Btrfs: use root when checking need_async_flush Btrfs: don't bother kicking async if there's nothing to reclaim Btrfs: fix release reserved extents trace points Btrfs: add fsid to some tracepoints Btrfs: add tracepoints for flush events Btrfs: fix delalloc reservation amount tracepoint Btrfs: trace pinned extents Btrfs: introduce ticketed enospc infrastructure Btrfs: add tracepoint for adding block groups Btrfs: warn_on for unaccounted spaces Btrfs: change delayed reservation fallback behavior Btrfs: always reserve metadata for delalloc extents Btrfs: fix callers of btrfs_block_rsv_migrate Btrfs: add bytes_readonly to the spaceinfo at once
2016-07-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+2
Merge updates from Andrew Morton: - a few misc bits - ocfs2 - most(?) of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits) thp: fix comments of __pmd_trans_huge_lock() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root mm: memcontrol: fix documentation for compound parameter mm: memcontrol: remove BUG_ON in uncharge_list mm: fix build warnings in <linux/compaction.h> mm, thp: convert from optimistic swapin collapsing to conservative mm, thp: fix comment inconsistency for swapin readahead functions thp: update Documentation/{vm/transhuge,filesystems/proc}.txt shmem: split huge pages beyond i_size under memory pressure thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE khugepaged: add support of collapse for tmpfs/shmem pages shmem: make shmem_inode_info::lock irq-safe khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() thp: extract khugepaged from mm/huge_memory.c shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings shmem: add huge pages support shmem: get_unmapped_area align huge page shmem: prepare huge= mount option and sysfs knob mm, rmap: account shmem thp pages ...
2016-07-26mm, memcg: use consistent gfp flags during readaheadMichal Hocko1-1/+2
Vladimir has noticed that we might declare memcg oom even during readahead because read_pages only uses GFP_KERNEL (with mapping_gfp restriction) while __do_page_cache_readahead uses page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from OOMs. This gfp mask discrepancy is really unfortunate and easily fixable. Drop page_cache_alloc_readahead() which only has one user and outsource the gfp_mask logic into readahead_gfp_mask and propagate this mask from __do_page_cache_readahead down to read_pages. This alone would have only very limited impact as most filesystems are implementing ->readpages and the common implementation mpage_readpages does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to use readahead_gfp_mask instead as this function is called only during readahead as well. The same applies to read_cache_pages. ext4 has its own ext4_mpage_readpages but the path which has pages != NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are doing a very similar pattern to mpage_readpages so the same can be applied to them as well. [akpm@linux-foundation.org: coding-style fixes] [mhocko@suse.com: restrict gfp mask in mpage_alloc] Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Chris Mason <clm@fb.com> Cc: Steve French <sfrench@samba.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Cc: Mike Marshall <hubcap@omnibond.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Changman Lee <cm224.lee@samsung.com> Cc: Chao Yu <yuchao0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-blockLinus Torvalds14-209/+222
Pull core block updates from Jens Axboe: - the big change is the cleanup from Mike Christie, cleaning up our uses of command types and modified flags. This is what will throw some merge conflicts - regression fix for the above for btrfs, from Vincent - following up to the above, better packing of struct request from Christoph - a 2038 fix for blktrace from Arnd - a few trivial/spelling fixes from Bart Van Assche - a front merge check fix from Damien, which could cause issues on SMR drives - Atari partition fix from Gabriel - convert cfq to highres timers, since jiffies isn't granular enough for some devices these days. From Jan and Jeff - CFQ priority boost fix idle classes, from me - cleanup series from Ming, improving our bio/bvec iteration - a direct issue fix for blk-mq from Omar - fix for plug merging not involving the IO scheduler, like we do for other types of merges. From Tahsin - expose DAX type internally and through sysfs. From Toshi and Yigal * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits) block: Fix front merge check block: do not merge requests without consulting with io scheduler block: Fix spelling in a source code comment block: expose QUEUE_FLAG_DAX in sysfs block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Btrfs: fix comparison in __btrfs_map_block() block: atari: Return early for unsupported sector size Doc: block: Fix a typo in queue-sysfs.txt cfq-iosched: Charge at least 1 jiffie instead of 1 ns cfq-iosched: Fix regression in bonnie++ rewrite performance cfq-iosched: Convert slice_resid from u64 to s64 block: Convert fifo_time from ulong to u64 blktrace: avoid using timespec block/blk-cgroup.c: Declare local symbols static block/bio-integrity.c: Add #include "blk.h" block/partition-generic.c: Remove a set-but-not-used variable block: bio: kill BIO_MAX_SIZE cfq-iosched: temporarily boost queue priority for idle classes block: drbd: avoid to use BIO_MAX_SIZE block: bio: remove BIO_MAX_SECTORS ...
2016-07-26btrfs: btrfs_abort_transaction, drop root parameterJeff Mahoney17-152/+147
__btrfs_abort_transaction doesn't use its root parameter except to obtain an fs_info pointer. We can obtain that from trans->root->fs_info for now and from trans->fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: add btrfs_trans_handle->fs_info pointerJeff Mahoney4-4/+6
btrfs_trans_handle->root is documented as for use for confirming that the root passed in to start the transaction is the same as the one ending it. It's used in several places when an fs_info pointer is needed, so let's just add an fs_info pointer directly. Eventually, the root pointer can be removed. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transactionJeff Mahoney1-1/+1
In btrfs_relocate_chunk, we get a transaction handle via btrfs_start_trans_remove_block_group, which starts the transaction using the extent root. When we call btrfs_end_transaction, we're calling it using the chunk root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: convert nodesize macros to static inlinesJeff Mahoney1-15/+33
This patch converts the macros used to calculate various node size limits to static inlines. That way we get type checking for free. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: introduce BTRFS_MAX_ITEM_SIZEJeff Mahoney4-10/+8
We use BTRFS_LEAF_DATA_SIZE - sizeof(struct btrfs_item) in several places. This introduces a BTRFS_MAX_ITEM_SIZE macro to do the same. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: cleanup, remove prototype for btrfs_find_root_refJeff Mahoney1-3/+0
The function isn't implemented anywhere. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: copy_to_sk drop unused root parameterJeff Mahoney1-3/+2
The root parameter for copy_to_sk is not used at all. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: simpilify btrfs_subvol_inherit_propsJeff Mahoney1-3/+3
We just need a superblock, but we look it up using two different roots depending on the call site. Let's just use a superblock pointer initialized at the outset. This is mostly for Coccinelle not to choke on my root push up set. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy rootJeff Mahoney7-21/+19
Now that we have a dummy fs_info associated with each test that uses a root, we don't need the DUMMY_ROOT bit anymore. This lets us make choices without needing an actual root like in e.g. btrfs_find_create_tree_block. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: tests, require fs_info for rootJeff Mahoney10-61/+103
This allows the upcoming patchset to push nodesize and sectorsize into fs_info. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: tests, move initialization into tests/Jeff Mahoney3-77/+48
We have all these stubs that only exist because they're called from btrfs_run_sanity_tests, which is a static inside super.c. Let's just move it all into tests/btrfs-tests.c and only have one stub. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: btrfs_test_opt and friends should take a btrfs_fs_infoJeff Mahoney13-130/+135
btrfs_test_opt and friends only use the root pointer to access the fs_info. Let's pass the fs_info directly in preparation to eliminate similar patterns all over btrfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: prefix fsid to all trace eventsJeff Mahoney5-22/+27
When using trace events to debug a problem, it's impossible to determine which file system generated a particular event. This patch adds a macro to prefix standard information to the head of a trace event. The extent_state alloc/free events are all that's left without an fs_info available. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: plumb fs_info into btrfs_workJeff Mahoney4-31/+63
In order to provide an fsid for trace events, we'll need a btrfs_fs_info pointer. The most lightweight way to do that for btrfs_work structures is to associate it with the __btrfs_workqueue structure. Each queued btrfs_work structure has a workqueue associated with it, so that's a natural fit. It's a privately defined structures, so we add accessors to retrieve the fs_info pointer. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: remove obsolete part of comment in statfsDavid Sterba1-3/+0
The mixed blockgroup reporting has been fixed by commit ae02d1bd070767e109f4a6f1bb1f466e9698a355 "btrfs: fix mixed block count of available space" Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: hide test-only member under ifdefDavid Sterba2-0/+4
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: Ratelimit "no csum found" info messageNikolay Borisov1-1/+1
Recently during a crash it became apparent that this particular message can be printed so many times that it causes the softlockup detector to trigger. Fix it by ratelimiting it. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: Add ratelimit to btrfs printingNikolay Borisov1-2/+24
This patch adds ratelimiting to all messages which are not using the _rl version of the various printing APIs in btrfs. This is designed to be used as a safety net, since a flood messages might cause the softlockup detector to trigger. To reduce interference between different classes of messages use a separate ratelimit state for every class of message. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: fix unexpected balance crash due to BUG_ONLiu Bo1-4/+24
Mounting a btrfs can resume previous balance operations asynchronously. An user got a crash when one drive has some corrupt sectors. Since balance can cancel itself in case of any error, we can gracefully return errors to upper layers and let balance do the cancel job. Reported-by: sash <master.b.at.raven@chefmail.de> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: fix panic in balance due to EIOLiu Bo1-0/+4
During build_backref_tree(), if we fail to read a btree node, we can eventually run into BUG_ON(cache->nr_nodes) that we put in backref_cache_cleanup(), meaning we have at least one memory leak. This frees the backref_node that we's allocated at the very beginning of build_backref_tree(). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: fix eb memory leak due to readpage failureLiu Bo1-3/+22
eb->io_pages is set in read_extent_buffer_pages(). In case of readpage failure, for pages that have been added to bio, it calls bio_endio and later readpage_io_failed_hook() does the work. When this eb's page (couldn't be the 1st page) fails to add itself to bio due to failure in merge_bio(), it cannot decrease eb->io_pages via bio_endio, and ends up with a memory leak eventually. This lets __do_readpage propagate errors to callers and adds the 'atomic_dec(&eb->io_pages)'. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: change BUG_ON()'s to ASSERT()'s in backref_cache_cleanup()Liu Bo1-6/+6
Since it is just an in-memory building of the backrefs of several btree blocks, nothing is fatal other than memory leaks, so this changes BUG_ON()'s to ASSERT()'s. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: fix free space calculation in dump_space_info()Wang Xiaoguang1-2/+2
In btrfs, btrfs_space_info's bytes_may_use is treated as fs used space, as what we do in reserve_metadata_bytes() or btrfs_alloc_data_chunk_ondemand(), so in dump_space_info(), when calculating free space, we should also subtract btrfs_space_info's bytes_may_use. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: subpage-blocksize: Rate limit scrub error messageChandan Rajendra1-1/+1
btrfs/073 invokes scrub ioctl in a tight loop. In subpage-blocksize scenario this results in a lot of "scrub: size assumption sectorsize != PAGE_SIZE " messages being printed on the console. To reduce the number of such messages this commit uses btrfs_err_rl() instead of btrfs_err(). Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: expand cow_file_range() to support in-band dedup and subpage-blocksizeWang Xiaoguang2-11/+41
Extract cow_file_range() new parameters for both in-band dedupe and subpage sector size patchset. This should make conflict of both patchset to minimal, and reduce the effort needed to rebase them. Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: fix BUG_ON in btrfs_submit_compressed_writeLiu Bo1-2/+8
This is similar to btrfs_submit_compressed_read(), if we fail after bio is allocated, then we can use bio_endio() and errors are saved in bio->bi_error. But please note that we don't return errors to its caller because the caller assumes it won't call endio to cleanup on error. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: make sure device is synced before returnAnand Jain1-0/+5
An inconsistent behavior due to stale reads from the disk was reported mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html This patch will make sure devices are synced before return in the unmount thread. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: reorg btrfs_close_one_device()Anand Jain1-36/+35
Moves closer to the caller and removes declaration Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: Cleanup compress_file_range()Ashish Samant1-37/+35
Remove unnecessary checks in compress_file_range(). Signed-off-by: Ashish Samant <ashish.samant@oracle.com> [ minor coding style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: cleanup BUG_ON in merge_bioLiu Bo2-3/+6
One can use btrfs-corrupt-block to hit BUG_ON() in merge_bio(), thus this aims to stop anyone to panic the whole system by using their btrfs. Since the error in merge_bio can only come from __btrfs_map_block() when chunk tree mapping has something insane and __btrfs_map_block() has already had printed the reason, we can just return errors in merge_bio. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: Fix slab accounting flagsNikolay Borisov9-16/+16
BTRFS is using a variety of slab caches to satisfy internal needs. Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT, meaning allocations from the caches are going to be accounted as SReclaimable. At the same time btrfs is not registering any shrinkers whatsoever, thus preventing memory from the slabs to be shrunk. This means those caches are not in fact reclaimable. To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the inode cache, since this one is being freed by the generic VFS super_block shrinker. Also set the transaction related caches as SLAB_TEMPORARY, to better document the lifetime of the objects (it just translates to SLAB_RECLAIM_ACCOUNT). Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: Replace -ENOENT by -ERANGE in btrfs_get_acl()Salah Triki1-2/+1
size contains the value returned by posix_acl_from_xattr(), which returns -ERANGE, -ENODATA, zero, or an integer greater than zero. So replace -ENOENT by -ERANGE. Signed-off-by: Salah Triki <salah.triki@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: Handle uninitialised inode evictionNikolay Borisov1-1/+8
The code flow in btrfs_new_inode allows for btrfs_evict_inode to be called with not fully initialised inode (e.g. ->root member not being set). This can happen when btrfs_set_inode_index in btrfs_new_inode fails, which in turn would call iput for the newly allocated inode. This in turn leads to vfs calling into btrfs_evict_inode. This leads to null pointer dereference. To handle this situation check whether the passed inode has root set and just free it in case it doesn't. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: fix read_node_slot to return errorsLiu Bo2-21/+52
We use read_node_slot() to read btree node and it has two cases, a) slot is out of range, which means 'no such entry' b) we fail to read the block, due to checksum fails or corrupted content or not with uptodate flag. But we're returning NULL in both cases, this makes it return -ENOENT in case a) and return -EIO in case b), and this fixes its callers as well as btrfs_search_forward() 's caller to catch the new errors. The problem is reported by Peter Becker, and I can manage to hit the same BUG_ON by mounting my fuzz image. Reported-by: Peter Becker <floyd.net@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: fix double free of fs rootLiu Bo1-9/+3
I got this warning while mounting a btrfs image, [ 3020.509606] ------------[ cut here ]------------ [ 3020.510107] WARNING: CPU: 3 PID: 5581 at lib/idr.c:1051 ida_remove+0xca/0x190 [ 3020.510853] ida_remove called for id=42 which is not allocated. [ 3020.511466] Modules linked in: [ 3020.511802] CPU: 3 PID: 5581 Comm: mount Not tainted 4.7.0-rc5+ #274 [ 3020.512438] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 [ 3020.513385] 0000000000000286 0000000021295d86 ffff88006c66b8f0 ffffffff8182ba5a [ 3020.514153] 0000000000000000 0000000000000009 ffff88006c66b930 ffffffff810e0ed7 [ 3020.514928] 0000041b00000000 ffffffff8289a8c0 ffff88007f437880 0000000000000000 [ 3020.515717] Call Trace: [ 3020.515965] [<ffffffff8182ba5a>] dump_stack+0xc9/0x13f [ 3020.516487] [<ffffffff810e0ed7>] __warn+0x147/0x160 [ 3020.517005] [<ffffffff810e0f4f>] warn_slowpath_fmt+0x5f/0x80 [ 3020.517572] [<ffffffff8182e6ca>] ida_remove+0xca/0x190 [ 3020.518075] [<ffffffff813a2bcc>] free_anon_bdev+0x2c/0x60 [ 3020.518609] [<ffffffff81657a9f>] free_fs_root+0x13f/0x160 [ 3020.519138] [<ffffffff8165c679>] btrfs_get_fs_root+0x379/0x3d0 [ 3020.519710] [<ffffffff81e6e975>] ? __mutex_unlock_slowpath+0x155/0x2c0 [ 3020.520366] [<ffffffff816615b1>] open_ctree+0x2e91/0x3200 [ 3020.520965] [<ffffffff8161ede2>] btrfs_mount+0x1322/0x15b0 [ 3020.521536] [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170 [ 3020.522167] [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210 [ 3020.522780] [<ffffffff813a4f59>] mount_fs+0x49/0x2c0 [ 3020.523305] [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0 [ 3020.523872] [<ffffffff8161dee1>] btrfs_mount+0x421/0x15b0 [ 3020.524402] [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170 [ 3020.525045] [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210 [ 3020.525657] [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210 [ 3020.526289] [<ffffffff813a4f59>] mount_fs+0x49/0x2c0 [ 3020.526803] [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0 [ 3020.527365] [<ffffffff813dc27a>] do_mount+0x41a/0x1770 [ 3020.527899] [<ffffffff812e800d>] ? strndup_user+0x6d/0xc0 [ 3020.528447] [<ffffffff812e7f68>] ? memdup_user+0x78/0xb0 [ 3020.528987] [<ffffffff813ddad0>] SyS_mount+0x150/0x160 [ 3020.529493] [<ffffffff81e72b7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd It turns out that we free fs root twice, btrfs_init_fs_root() calls free_anon_bdev(root->anon_dev) and later then btrfs_get_fs_root() cals free_fs_root which does another free_anon_bdev() and it ends up with the above warning. Instead of reset root->anon_dev to 0 after free_anon_bdev(), we can let btrfs_init_fs_root() return directly since its callers have already done the free job by calling free_fs_root(). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: error out if generic_bin_search get invalid argumentsLiu Bo1-0/+8
With btrfs-corrupt-block, one can set btree node/leaf's field, if we assign a negative value to node/leaf, we can get various hangs, eg. if extent_root's nritems is -2ULL, then we get stuck in btrfs_read_block_groups() because it has a while loop and btrfs_search_slot() on extent_root will always return the first child. This lets us know what's happening and returns a EINVAL to callers instead of returning the first item. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: check inconsistence between chunk and block groupLiu Bo1-1/+16
With btrfs-corrupt-block, one can drop one chunk item and mounting will end up with a panic in btrfs_full_stripe_len(). This doesn't not remove the BUG_ON, but instead checks it a bit earlier when we find the block group item. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: add missing bytes_readonly attribute file in sysfsWang Xiaoguang1-0/+2
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-21Btrfs: fix delalloc accounting after copy_from_user faultsChris Mason1-7/+5
Commit 56244ef151c3cd11 was almost but not quite enough to fix the reservation math after btrfs_copy_from_user returned partial copies. Some users are still seeing warnings in btrfs_destroy_inode, and with a long enough test run I'm able to trigger them as well. This patch fixes the accounting math again, bringing it much closer to the way it was before the sectorsize conversion Chandan did. The problem is accounting for the offset into the page/sector when we do a partial copy. This one just uses the dirty_sectors variable which should already be updated properly. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v4.6+
2016-07-20Btrfs: avoid deadlocks during reservations in btrfs_truncate_blockJosef Bacik1-0/+5
The new enospc code makes it possible to deadlock if we don't use FLUSH_LIMIT during reservations inside a transaction. This enforces the correct flush type to avoid both deadlocks and assertions Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2016-07-18Btrfs: fix comparison in __btrfs_map_block()Vincent Stehlé1-1/+1
Add missing comparison to op in expression, which was forgotten when doing the REQ_OP transition. Fixes: b3d3fa519905 ("btrfs: update __btrfs_map_block for REQ_OP transition") Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytesJosef Bacik2-17/+22
We used to allow you to set FLUSH_ALL and then just wouldn't do things like commit transactions or wait on ordered extents if we noticed you were in a transaction. However now that all the flushing for FLUSH_ALL is asynchronous we've lost the ability to tell, and we could end up deadlocking. So instead use FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if we error out to preserve the previous behavior. I've also added an ASSERT() to catch anybody else who tries to do this. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: fill relocation block rsv after allocationJosef Bacik1-0/+6
Since we set the reloc control before we've reserved our space for relocation we could race with a root being dirtied and not actually have space to do our init reloc root. So once we've allocated it and set it up go ahead and make our reservation before setting the relocate control, that way anybody who tries to do the reloc root init has space to use. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>