summaryrefslogtreecommitdiff
path: root/fs/ocfs2/ocfs2.h
AgeCommit message (Collapse)AuthorFilesLines
2017-02-22ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lockEric Ren1-0/+1
We are in the situation that we have to avoid recursive cluster locking, but there is no way to check if a cluster lock has been taken by a precess already. Mostly, we can avoid recursive locking by writing code carefully. However, we found that it's very hard to handle the routines that are invoked directly by vfs code. For instance: const struct inode_operations ocfs2_file_iops = { .permission = ocfs2_permission, .get_acl = ocfs2_iop_get_acl, .set_acl = ocfs2_iop_set_acl, }; Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR): do_sys_open may_open inode_permission ocfs2_permission ocfs2_inode_lock() <=== first time generic_permission get_acl ocfs2_iop_get_acl ocfs2_inode_lock() <=== recursive one A deadlock will occur if a remote EX request comes in between two of ocfs2_inode_lock(). Briefly describe how the deadlock is formed: On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of the remote EX lock request. Another hand, the recursive cluster lock (the second one) will be blocked in in __ocfs2_cluster_lock() because of OCFS2_LOCK_BLOCKED. But, the downconvert never complete, why? because there is no chance for the first cluster lock on this node to be unlocked - we block ourselves in the code path. The idea to fix this issue is mostly taken from gfs2 code. 1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track of the processes' pid who has taken the cluster lock of this lock resource; 2. introduce a new flag for ocfs2_inode_lock_full: OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for us if we've got cluster lock. 3. export a helper: ocfs2_is_locked_by_me() is used to check if we have got the cluster lock in the upper code path. The tracking logic should be used by some of the ocfs2 vfs's callbacks, to solve the recursive locking issue cuased by the fact that vfs routines can call into each other. The performance penalty of processing the holder list should only be seen at a few cases where the tracking logic is used, such as get/set acl. You may ask what if the first time we got a PR lock, and the second time we want a EX lock? fortunately, this case never happens in the real world, as far as I can see, including permission check, (get|set)_(acl|attr), and the gfs2 code also do so. [sfr@canb.auug.org.au remove some inlines] Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12ocfs2: use time64_t to represent orphan scan timesDeepa Dinamani1-1/+1
struct timespec is not y2038 safe. Use time64_t which is y2038 safe to represent orphan scan times. time64_t is sufficient here as only the seconds delta times are relevant. Also use appropriate time functions that return time in time64_t format. Time functions now return monotonic time instead of real time as only delta scan times are relevant and these values are not persistent across reboots. The format string for the debug print is still using long as this is only the time elapsed since the last scan and long is sufficient to represent this value. Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov1-10/+10
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix occurring deadlock by changing ocfs2_wq from global to localjiangyiwen1-0/+8
This patch fixes a deadlock, as follows: Node 1 Node 2 Node 3 1)volume a and b are only mount vol a only mount vol b mounted 2) start to mount b start to mount a 3) check hb of Node 3 check hb of Node 2 in vol a, qs_holds++ in vol b, qs_holds++ 4) -------------------- all nodes' network down -------------------- 5) progress of mount b the same situation as failed, and then call Node 2 ocfs2_dismount_volume. but the process is hung, since there is a work in ocfs2_wq cannot beo completed. This work is about vol a, because ocfs2_wq is global wq. BTW, this work which is scheduled in ocfs2_wq is ocfs2_orphan_scan_work, and the context in this work needs to take inode lock of orphan_dir, because lockres owner are Node 1 and all nodes' nework has been down at the same time, so it can't get the inode lock. 6) Why can't this node be fenced when network disconnected? Because the process of mount is hung what caused qs_holds is not equal 0. Because all works in the ocfs2_wq are relative to the super block. The solution is to change the ocfs2_wq from global to local. In other words, move it into struct ocfs2_super. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: add errors=continueGoldwyn Rodrigues1-0/+2
OCFS2 is often used in high-availaibility systems. However, ocfs2 converts the filesystem to read-only at the drop of the hat. This may not be necessary, since turning the filesystem read-only would affect other running processes as well, decreasing availability. This attempt is to add errors=continue, which would return the EIO to the calling process and terminate furhter processing so that the filesystem is not corrupted further. However, the filesystem is not converted to read-only. As a future plan, I intend to create a small utility or extend fsck.ocfs2 to fix small errors such as in the inode. The input to the utility such as the inode can come from the kernel logs so we don't have to schedule a downtime for fixing small-enough errors. The patch changes the ocfs2_error to return an error. The error returned depends on the mount option set. If none is set, the default is to turn the filesystem read-only. Perhaps errors=continue is not the best option name. Historically it is used for making an attempt to progress in the current process itself. Should we call it errors=eio? or errors=killproc? Suggestions/Comments welcome. Sources are available at: https://github.com/goldwynr/linux/tree/error-cont Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24ocfs2: fix wrong check in ocfs2_direct_IO_get_blocksJoseph Qi1-0/+10
contig_blocks gotten from ocfs2_extent_map_get_blocks cannot be compared with clusters_to_alloc. So convert it to clusters first. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Weiwei Wang <wangww631@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12ocfs2: make append_dio an incompat featureMark Fasheh1-1/+1
It turns out that making this feature ro_compat isn't quite enough to prevent accidental corruption on mount from older kernels. Ocfs2 (like other file systems) will process orphaned inodes even when the user mounts in 'ro' mode. So for the case of a filesystem not knowing the append_dio feature, mounting the filesystem could result in orphaned-for-dio files being deleted, which we clearly don't want. So instead, turn this into an incompat flag. Btw, this is kind of my fault - initially I asked that we add a flag to cover the feature and even suggested that we use an ro flag. It wasn't until I was looking through our commits for v4.0-rc1 that I realized we actually want this to be incompat. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16ocfs2: set append dio as a ro compat featureJoseph Qi1-0/+8
Intruduce a bit OCFS2_FEATURE_RO_COMPAT_APPEND_DIO and check it in write flow. If the bit is not set, fall back to the old way. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Xuejiufei <xuejiufei@huawei.com> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16ocfs2: implement ocfs2_direct_IO_writeJoseph Qi1-0/+10
Implement ocfs2_direct_IO_write. Add the inode to orphan dir first, and then delete it once append O_DIRECT finished. This is to make sure block allocation and inode size are consistent. [akpm@linux-foundation.org: fix it for "block: Add discard flag to blkdev_issue_zeroout() function"] Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Xuejiufei <xuejiufei@huawei.com> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16ocfs2: add orphan recovery types in ocfs2_recover_orphansJoseph Qi1-0/+5
Define two orphan recovery types, which indicates if need truncate file or not. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Xuejiufei <xuejiufei@huawei.com> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10ocfs2: add a mount option journal_async_commit on ocfs2 filesystemalex chen1-0/+2
Add a mount option to support JBD2 feature: JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT. When this feature is opened, journal commit block can be written to disk without waiting for descriptor blocks, which can improve journal commit performance. This option will enable 'journal_checksum' internally. Using the fs_mark benchmark, using journal_async_commit shows a 50% improvement, the files per second go up from 215.2 to 317.5. test script: fs_mark -d /mnt/ocfs2/ -s 10240 -n 1000 default: FSUse% Count Size Files/sec App Overhead 0 1000 10240 215.2 17878 with journal_async_commit option: FSUse% Count Size Files/sec App Overhead 0 1000 10240 317.5 17881 Signed-off-by: Alex Chen <alex.chen@huawei.com> Signed-off-by: Weiwei Wang <wangww631@huawei.comm> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10ocfs2: do not set OCFS2_LOCK_UPCONVERT_FINISHING if nonblocking lock can not ↵Xue jiufei1-0/+6
be granted at once ocfs2_readpages() use nonblocking flag to avoid page lock inversion. It will trigger cluster hang because that flag OCFS2_LOCK_UPCONVERT_FINISHING is not cleared if nonblocking lock cannot be granted at once. The flag would prevent dc thread from downconverting. So other nodes cannot acheive this lockres for ever. So we should not set OCFS2_LOCK_UPCONVERT_FINISHING when receiving ast if nonblocking lock had already returned. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04ocfs2: fix umount hang while shutting down truncate logXue jiufei1-0/+1
Revert commit 75f82eaa502c ("ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously") because it may cause a umount hang while shutting down the truncate log. fix NULL pointer dereference when dismount and ocfs2rec simultaneously The situation is as followes: ocfs2_dismout_volume -> ocfs2_recovery_exit -> free osb->recovery_map -> ocfs2_truncate_shutdown -> lock global bitmap inode -> ocfs2_wait_for_recovery -> check whether osb->recovery_map->rm_used is zero Because osb->recovery_map is already freed, rm_used can be any other values, so it may yield umount hang. To prevent NULL pointer dereference while getting sys_root_inode, we use a osb_tl_disable flag to disable schedule osb_truncate_log_wq after truncate log shutdown. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03ocfs2: avoid system inode ref confusion by adding mutex lockjiangyiwen1-0/+2
The following case may lead to the same system inode ref in confusion. A thread B thread ocfs2_get_system_file_inode ->get_local_system_inode ->_ocfs2_get_system_file_inode because of *arr == NULL, ocfs2_get_system_file_inode ->get_local_system_inode ->_ocfs2_get_system_file_inode gets first ref thru _ocfs2_get_system_file_inode, gets second ref thru igrab and set *arr = inode at the moment, B thread also gets two refs, so lead to one more inode ref. So add mutex lock to avoid multi thread set two inode ref once at the same time. Signed-off-by: jiangyiwen <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03ocfs2: revert iput deferring code in ocfs2_drop_dentry_lockGoldwyn Rodrigues1-24/+4
The following patches are reverted in this patch because these patches caused performance regression in the remote unlink() calls. ea455f8ab683 - ocfs2: Push out dropping of dentry lock to ocfs2_wq f7b1aa69be13 - ocfs2: Fix deadlock on umount 5fd131893793 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount Previous patches in this series removed the possible deadlocks from downconvert thread so the above patches shouldn't be needed anymore. The regression is caused because these patches delay the iput() in case of dentry unlocks. This also delays the unlocking of the open lockres. The open lockresource is required to test if the inode can be wiped from disk or not. When the deleting node does not get the open lock, it marks it as orphan (even though it is not in use by another node/process) and causes a journal checkpoint. This delays operations following the inode eviction. This also moves the inode to the orphaned inode which further causes more I/O and a lot of unneccessary orphans. The following script can be used to generate the load causing issues: declare -a create declare -a remove declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384) unique="`mktemp -u XXXXX`" script="/tmp/idontknow-${unique}.sh" cat <<EOF > "${script}" for n in {1..8}; do mkdir -p test/dir\${n} eval touch test/dir\${n}/foo{1.."\$1"} done EOF chmod 700 "${script}" function fcreate () { exec 2>&1 /usr/bin/time --format=%E "${script}" "$1" } function fremove () { exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*" } function fcp () { exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new" } echo ------------------------------------------------- echo "| # files | create #s | copy #s | remove #s |" echo ------------------------------------------------- for ((x=0; x < ${#iterations[*]} ; x++)) do create[$x]="`fcreate ${iterations[$x]}`" copy[$x]="`fcp ${iterations[$x]}`" remove[$x]="`fremove`" printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]} done rm "${script}" echo "------------------------" Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03ocfs2: implement delayed dropping of last dquot referenceJan Kara1-0/+5
We cannot drop last dquot reference from downconvert thread as that creates the following deadlock: NODE 1 NODE2 holds dentry lock for 'foo' holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE dquot_initialize(bar) ocfs2_dquot_acquire() ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) ... downconvert thread (triggered from another node or a different process from NODE2) ocfs2_dentry_post_unlock() ... iput(foo) ocfs2_evict_inode(foo) ocfs2_clear_inode(foo) dquot_drop(inode) ... ocfs2_dquot_release() ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) - blocks finds we need more space in quota file ... ocfs2_extend_no_holes() ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE) - deadlocks waiting for downconvert thread We solve the problem by postponing dropping of the last dquot reference to a workqueue if it happens from the downconvert thread. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: add clustername to cluster connectionGoldwyn Rodrigues1-0/+1
This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM handling up to the times with respect to DLM (>=4.0.1) and corosync (2.3.x). AFAIK, cman also is being phased out for a unified corosync cluster stack. fs/dlm performs all the functions with respect to fencing and node management and provides the API's to do so for ocfs2. For all future references, DLM stands for fs/dlm code. The advantages are: + No need to run an additional userspace daemon (ocfs2_controld) + No controld device handling and controld protocol + Shifting responsibilities of node management to DLM layer For backward compatibility, we are keeping the controld handling code. Once enough time has passed we can remove a significant portion of the code. This was tested by using the kernel with changes on older unmodified tools. The kernel used ocfs2_controld as expected, and displayed the appropriate warning message. This feature requires modification in the userspace ocfs2-tools. The changes can be found at: https://github.com/goldwynr/ocfs2-tools branch: nocontrold Currently, not many checks are present in the userspace code, but that would change soon. This patch (of 6): Add clustername to cluster connection. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ocfs2: remove unecessary variable needs_checkpointGoldwyn Rodrigues1-1/+0
Code cleanup: needs_checkpoint is assigned to but never used. Delete the variable. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Jeff Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-01ocfs2: avoid unaligned access to dqc_bitmapAkinobu Mita1-0/+47
The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned, but not 64-bit aligned. The dqc_bitmap is accessed by ocfs2_set_bit(), ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit(). These are wrapper macros for ext2_*_bit() which need to take an unsigned long aligned address (though some architectures are able to handle unaligned address correctly) So some 64bit architectures may not be able to access the dqc_bitmap correctly. This avoids such unaligned access by using another wrapper functions for ext2_*_bit(). The code is taken from fs/ext4/mballoc.c which also need to handle unaligned bitmap access. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-31ocfs2: use proper little-endian bitopsAkinobu Mita1-2/+2
Using __test_and_{set,clear}_bit_le() with ignoring its return value can be replaced with __{set,clear}_bit_le(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: ocfs2-devel@oss.oracle.com Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-03-28Merge branch 'upstream-linus' of ↵Linus Torvalds1-9/+14
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (39 commits) Treat writes as new when holes span across page boundaries fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS. ocfs2/dlm: Move kmalloc() outside the spinlock ocfs2: Make the left masklogs compat. ocfs2: Remove masklog ML_AIO. ocfs2: Remove masklog ML_UPTODATE. ocfs2: Remove masklog ML_BH_IO. ocfs2: Remove masklog ML_JOURNAL. ocfs2: Remove masklog ML_EXPORT. ocfs2: Remove masklog ML_DCACHE. ocfs2: Remove masklog ML_NAMEI. ocfs2: Remove mlog(0) from fs/ocfs2/dir.c ocfs2: remove NAMEI from symlink.c ocfs2: Remove masklog ML_QUOTA. ocfs2: Remove mlog(0) from quota_local.c. ocfs2: Remove masklog ML_RESERVATIONS. ocfs2: Remove masklog ML_XATTR. ocfs2: Remove masklog ML_SUPER. ocfs2: Remove mlog(0) from fs/ocfs2/heartbeat.c ocfs2: Remove mlog(0) from fs/ocfs2/slot_map.c ... Fix up trivial conflict in fs/ocfs2/super.c
2011-03-23ocfs2: use little-endian bitopsAkinobu Mita1-5/+5
As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Joel Becker <joel.becker@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-20ocfs2: Use hrtimer to track ocfs2 fs lock statsSunil Mushran1-9/+14
Patch makes use of the hrtimer to track times in ocfs2 lock stats. The patch is a bit involved to ensure no additional impact on the memory footprint. The size of ocfs2_inode_cache remains 1280 bytes on 32-bit systems. A related change was to modify the unit of the max wait time from nanosec to microsec allowing us to track max time larger than 4 secs. This change necessitated the bumping of the output version in the debugfs file, locking_state, from 2 to 3. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2010-12-16ocfs2: Try to free truncate log when meeting ENOSPC in write.Tao Ma1-0/+5
Recently, one of our colleagues meet with a problem that if we write/delete a 32mb files repeatly, we will get an ENOSPC in the end. And the corresponding bug is 1288. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1288 The real problem is that although we have freed the clusters, they are in truncate log and they will be summed up so that we can free them once in a whole. So this patch just try to resolve it. In case we see -ENOSPC in ocfs2_write_begin_no_lock, we will check whether the truncate log has enough clusters for our need, if yes, we will try to flush the truncate log at that point and try again. This method is inspired by Mark Fasheh <mfasheh@suse.com>. Thanks. Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-11-18ocfs2: char is not always signedMilton Miller1-3/+3
Commit 1c66b360fe262 (Change some lock status member in ocfs2_lock_res to char.) states that these fields need to be signed due to comparision to -1, but only changed the type from unsigned char to char. However, it is a compiler option if char is a signed or unsigned type. Change these fields to signed char so the code will work with all compilers. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-11-13ocfs2: Change some lock status member in ocfs2_lock_res to char.Tao Ma1-3/+3
Commit 83fd9c7 changes l_level, l_requested and l_blocking of ocfs2_lock_res from int to unsigned char. But actually it is initially as -1(ocfs2_lock_res_init_common) which correspoding to 255 for unsigned char. So the whole dlm lock mechanism doesn't work now which means a disaster to ocfs2. Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-10-15Merge branch 'globalheartbeat-2' of ↵Joel Becker1-6/+34
git://oss.oracle.com/git/smushran/linux-2.6 into ocfs2-merge-window Conflicts: fs/ocfs2/ocfs2.h
2010-10-11ocfs2: Add a mount option "coherency=*" to handle cluster coherency for ↵Tristan Ye1-0/+3
O_DIRECT writes. Currently, the default behavior of O_DIRECT writes was allowing concurrent writing among nodes to the same file, with no cluster coherency guaranteed (no EX lock held). This can leave stale data in the cache for buffered reads on other nodes. The new mount option introduce a chance to choose two different behaviors for O_DIRECT writes: * coherency=full, as the default value, will disallow concurrent O_DIRECT writes by taking EX locks. * coherency=buffered, allow concurrent O_DIRECT writes without EX lock among nodes, which gains high performance at risk of getting stale data on other nodes. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-10-07ocfs2: Add support for heartbeat=global mount optionSunil Mushran1-1/+3
Adds support for heartbeat=global mount option. It ensures that the heartbeat mode passed matches the one enabled on disk. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-09ocfs2: Add an incompat feature flag OCFS2_FEATURE_INCOMPAT_CLUSTERINFOSunil Mushran1-2/+29
OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for both userspace and o2cb cluster stacks. It also allows us to extend cluster info to include stack flags. This patch also adds stackflags to sb->s_clusterinfo. It also introduces a clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled global heartbeat mode. This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-09-10ocfs2: Cache system inodes of other slots.Tao Ma1-1/+2
Durring orphan scan, if we are slot 0, and we are replaying orphan_dir:0001, the general process is that for every file in this dir: 1. we will iget orphan_dir:0001, since there is no inode for it. we will have to create an inode and read it from the disk. 2. do the normal work, such as delete_inode and remove it from the dir if it is allowed. 3. call iput orphan_dir:0001 when we are done. In this case, since we have no dcache for this inode, i_count will reach 0, and VFS will have to call clear_inode and in ocfs2_clear_inode we will checkpoint the inode which will let ocfs2_cmt and journald begin to work. 4. We loop back to 1 for the next file. So you see, actually for every deleted file, we have to read the orphan dir from the disk and checkpoint the journal. It is very time consuming and cause a lot of journal checkpoint I/O. A better solution is that we can have another reference for these inodes in ocfs2_super. So if there is no other race among nodes(which will let dlmglue to checkpoint the inode), for step 3, clear_inode won't be called and for step 1, we may only need to read the inode for the 1st time. This is a big win for us. So this patch will try to cache system inodes of other slots so that we will have one more reference for these inodes and avoid the extra inode read and journal checkpoint. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10Reorganize data elements to reduce struct sizesGoldwyn Rodrigues1-8/+15
Thanks for the comments. I have incorportated them all. CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled. Statistics now look like - ocfs2_write_ctxt: 2144 - 2136 = 8 ocfs2_inode_info: 1960 - 1848 = 112 ocfs2_journal: 168 - 160 = 8 ocfs2_lock_res: 336 - 304 = 32 ocfs2_refcount_tree: 512 - 472 = 40 Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-04-22ocfs2: Some tiny bug fixes for discontiguous block allocation.Tao Ma1-1/+1
The fixes include: 1. some endian problems. 2. we should use bit/bpc in ocfs2_block_group_grow_discontig to allocate clusters. 3. set num_clusters properly in __ocfs2_claim_clusters. 4. change name from ocfs2_supports_discontig_bh to ocfs2_supports_discontig_bg. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-04-13ocfs2: Allocate discontiguous block groups.Joel Becker1-0/+7
If we cannot get a contiguous region for a block group, allocate a discontiguous one when the filesystem supports it. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-05-05ocfs2: Add dir_resv_level mount optionMark Fasheh1-0/+1
The default behavior for directory reservations stays the same, but we add a mount option so people can tweak the size of directory reservations according to their workloads. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05ocfs2: increase the default size of local alloc windowsMark Fasheh1-0/+3
I have observed that the current size of 8M gives us pretty poor fragmentation on multi-threaded workloads which do lots of writes. Generally, I can increase the size of local alloc windows and observe a marked decrease in fragmentation, even up and beyond window sizes of 512 megabytes. This makes sense for a couple reasons - larger local alloc means more room for reservation windows. On multi-node workloads the larger local alloc helps as well because we don't have to do window slides as often. Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no longer used and the comment above it was out of date. To test fragmentation, I used a workload which launched 4 threads that did 4k writes into a series of about 140 alternating files. With resv_level=2, and a 4k/4k file system I observed the following average fragmentation for various localalloc= parameters: localalloc= avg. fragmentation 8 48 32 16 64 10 120 7 On larger cluster sizes, the difference is more dramatic. The new default size top out at 256M, which we'll only get for cluster sizes of 32K and above. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05ocfs2: clean up localalloc mount option size parsingMark Fasheh1-0/+6
This patch pulls the local alloc sizing code into localalloc.c and provides a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged except that I correctly calculate the maximum local alloc size. The old code in ocfs2_parse_options() calculated the max size as: ocfs2_local_alloc_size(sb) * 8 which is correct, in bits. Unfortunately though the option passed in is in megabytes. Ultimately, this bug made no real difference - the shrink code would catch a too-large size and bring it down to something reasonable. Still, it's less than efficient as-is. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05ocfs2: allocation reservationsMark Fasheh1-0/+5
This patch improves Ocfs2 allocation policy by allowing an inode to reserve a portion of the local alloc bitmap for itself. The reserved portion (allocation window) is advisory in that other allocation windows might steal it if the local alloc bitmap becomes full. Otherwise, the reservations are honored and guaranteed to be free. When the local alloc window is moved to a different portion of the bitmap, existing reservations are discarded. Reservation windows are represented internally by a red-black tree. Within that tree, each node represents the reservation window of one inode. An LRU of active reservations is also maintained. When new data is written, we allocate it from the inodes window. When all bits in a window are exhausted, we allocate a new one as close to the previous one as possible. Should we not find free space, an existing reservation is pulled off the LRU and cannibalized. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2010-03-23ocfs2: Clear undo bits when local alloc is freedMark Fasheh1-2/+12
When the local alloc file changes windows, unused bits are freed back to the global bitmap. By defnition, those bits can not be in use by any file. Also, the local alloc will never have been able to allocate those bits if they were part of a previous truncate. Therefore it makes sense that we should clear unused local alloc bits in the undo buffer so that they can be used immediatly. [ Modified to call it ocfs2_release_clusters() -- Joel ] Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-02Ocfs2: Move ocfs2 ioctl definitions from ocfs2_fs.h to newly added ocfs2_ioctl.hTristan Ye1-0/+1
Currently we were adding ioctl cmds/structures for ocfs2 into ocfs2_fs.h which was used for define ocfs2 on-disk layout. That sounds a little bit confusing, and it may be quickly polluted espcially when growing the ocfs2_info_request ioctls afterwards(it will grow i bet). As a result, such OCFS2 IOCs do need to be placed somewhere other than ocfs2_fs.h, a separated ocfs2_ioctl.h will be added to store such ioctl structures and definitions which could also be used from userspace to invoke ioctls call. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Attach the connection to the lksbJoel Becker1-1/+1
We're going to want it in the ast functions, so we convert union ocfs2_dlm_lksb to struct ocfs2_dlm_lksb and let it carry the connection. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: add extent block stealing for ocfs2 v5Tiger Yang1-27/+2
This patch add extent block (metadata) stealing mechanism for extent allocation. This mechanism is same as the inode stealing. if no room in slot specific extent_alloc, we will try to allocate extent block from the next slot. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Acked-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02ocfs2: Prevent a livelock in dlmglueSunil Mushran1-0/+4
There is possibility of a livelock in __ocfs2_cluster_lock(). If a node were to get an ast for an upconvert request, followed immediately by a bast, there is a small window where the fs may downconvert the lock before the process requesting the upconvert is able to take the lock. This patch adds a new flag to indicate that the upconvert is still in progress and that the dc thread should not downconvert it right now. Wengang Wang <wen.gang.wang@oracle.com> and Joel Becker <joel.becker@oracle.com> contributed heavily to this patch. Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-12-24Merge branch 'upstream-linus' of ↵Linus Torvalds1-3/+5
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2/trivial: Use le16_to_cpu for a disk value in xattr.c ocfs2/trivial: Use proper mask for 2 places in hearbeat.c Ocfs2: Let ocfs2 support fiemap for symlink and fast symlink. Ocfs2: Should ocfs2 support fiemap for S_IFDIR inode? ocfs2: Use FIEMAP_EXTENT_SHARED fiemap: Add new extent flag FIEMAP_EXTENT_SHARED ocfs2: replace u8 by __u8 in ocfs2_fs.h ocfs2: explicit declare uninitialized var in user_cluster_connect() ocfs2-devel: remove redundant OCFS2_MOUNT_POSIX_ACL check in ocfs2_get_acl_nolock() ocfs2: return -EAGAIN instead of EAGAIN in dlm ocfs2/cluster: Make fence method configurable - v2 ocfs2: Set MS_POSIXACL on remount ocfs2: Make acl use the default ocfs2: Always include ACL support
2009-11-13ocfs2: Trivial cleanup of jbd compatibility layer removalSunil Mushran1-6/+1
Mainline commit 53ef99cad9878f02f27bb30bc304fc42af8bdd6e removed the JBD compatibility layer from OCFS2. This patch removes the last remaining remnants of that. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-10-28ocfs2: Make acl use the defaultJan Kara1-3/+5
Change acl mount options handling to match the one of XFS and BTRFS and hopefully it is also easier to use now. When admin does not specify any acl mount option, acls are enabled if and only if the filesystem has xattr feature enabled. If admin specifies 'acl' mount option, we fail the mount if the filesystem does not have xattr feature and thus acls cannot be enabled. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-22ocfs2: Add functions for extents refcounted.Tao Ma1-0/+7
Add function ocfs2_mark_extent_refcounted which can mark an extent refcounted. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Add refcount tree lock mechanism.Tao Ma1-0/+4
Implement locking around struct ocfs2_refcount_tree. This protects all read/write operations on refcount trees. ocfs2_refcount_tree has its own lock and its own caching_info, protecting buffers among multiple nodes. User must call ocfs2_lock_refcount_tree before his operation on the tree and unlock it after that. ocfs2_refcount_trees are referenced by the block number of the refcount tree root block, So we create an rb-tree on the ocfs2_super to look them up. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Add ocfs2_read_refcount_block.Tao Ma1-0/+3
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-04ocfs2: Pass struct ocfs2_caching_info to the journal functions.Joel Becker1-1/+7
The next step in divorcing metadata I/O management from struct inode is to pass struct ocfs2_caching_info to the journal functions. Thus the journal locks a metadata cache with the cache io_lock function. It also can compare ci_last_trans and ci_created_trans directly. This is a large patch because of all the places we change ocfs2_journal_access..(handle, inode, ...) to ocfs2_journal_access..(handle, INODE_CACHE(inode), ...). Signed-off-by: Joel Becker <joel.becker@oracle.com>