Commit graph

1067 commits

Author SHA1 Message Date
Christoph Hellwig
59d697b702 btrfs: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Chris Mason
b263c2c8bf Btrfs: fix extent_buffer leak during tree log replay
During tree log replay, we read in the tree log roots,
process them and then free them.  A recent change
takes an extra reference on the root node of the tree
when the root is read in, and stores that reference
in root->commit_root.

This reference was not being freed, leaving us with
one buffer pinned in ram for each subvol with
a tree log root after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:24:47 -04:00
Chris Mason
0b4dcea579 Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
This happens during subvol creation.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:13:35 -04:00
Chris Mason
067c28adc5 Btrfs: fix -o nodatasum printk spelling
It was printing nodatacsum, which was not the correct option name.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 09:30:13 -04:00
Yan Zheng
85d4198e40 Btrfs: check duplicate backrefs for both data and metadata
lookup_inline_extent_backref only checks for duplicate backref for data
extents. It assumes backrefs for tree block never conflict.

This patch makes lookup_inline_extent_backref check for duplicate backrefs
for both data and tree block, so that we can detect potential bug earlier.
This is a safety check, strictly speaking it is not required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 08:51:34 -04:00
Shin Hong
fd0fb038d5 Btrfs: init worker struct fields before kthread-run
This patch fixes a bug which may result race condition
between btrfs_start_workers() and worker_loop().

btrfs_start_workers() executed in a parent thread writes
on workers->worker and worker_loop() in a child thread
reads workers->worker. However, there is no synchronization
enforcing the order of two operations.

This patch makes btrfs_start_workers() fill workers->worker
before it starts a child thread with worker_loop()

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 20:11:29 -04:00
Hisashi Hifumi
4eedeb75e7 Btrfs: pin buffers during write_dev_supers
write_dev_supers is called in sequence.  First is it called with wait == 0,
which starts IO on all of the super blocks for a given device.  Then it is
called with wait == 1 to make sure they all reach the disk.

It doesn't currently pin the buffers between the two calls, and it also
assumes the buffers won't go away between the two calls, leading to
an oops if the VM manages to free the buffers in the middle of the sync.

This fixes that assumption and updates the code to return an error if things
are not up to date when the wait == 1 run is done.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 16:49:25 -04:00
Chris Mason
e5e9a5206a Btrfs: avoid races between super writeout and device list updates
On multi-device filesystems, btrfs writes supers to all of the devices
before considering a sync complete.  There wasn't any additional
locking between super writeout and the device list management code
because device management was done inside a transaction and
super writeout only happened  with no transation writers running.

With the btrfs fsync log and other async transaction updates, this
has been racey for some time.  This adds a mutex to protect
the device list.  The existing volume mutex could not be reused due to
transaction lock ordering requirements.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 15:17:02 -04:00
Al Viro
7df336ec12 Fix btrfs when ACLs are configured out
... otherwise generic_permission() will allow *anything* for all
files you don't own and that have some group permissions.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:36:43 -04:00
Hisashi Hifumi
524724ed1f Btrfs: fdatasync should skip metadata writeout
In btrfs, fdatasync and fsync are identical, but
fdatasync should skip committing transaction when
inode->i_state is set just I_DIRTY_SYNC and this indicates
only atime or/and mtime updates.
Following patch improves fdatasync throughput.

--file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
--file-fsync-mode=fdatasync run

Results:
-2.6.30-rc8
Test execution summary:
    total time:                          1980.6540s
    total number of events:              10001
    total time taken by event execution: 1192.9804
    per-request statistics:
         min:                            0.0000s
         avg:                            0.1193s
         max:                            15.3720s
         approx.  95 percentile:         0.7257s

Threads fairness:
    events (avg/stddev):           625.0625/151.32
    execution time (avg/stddev):   74.5613/9.46

-2.6.30-rc8-patched
Test execution summary:
    total time:                          1695.9118s
    total number of events:              10000
    total time taken by event execution: 871.3214
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0871s
         max:                            10.4644s
         approx.  95 percentile:         0.4787s

Threads fairness:
    events (avg/stddev):           625.0000/131.86
    execution time (avg/stddev):   54.4576/8.98

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
David Woodhouse
163e783e6a Btrfs: remove crc32c.h and use libcrc32c directly.
There's no need to preserve this abstraction; it used to let us use
hardware crc32c support directly, but libcrc32c is already doing that for us
through the crypto API -- so we're already using the Intel crc32c
acceleration where appropriate.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
Christoph Hellwig
6cbff00f46 Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Add support for the standard attributes set via chattr and read via
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
c289811cc0 Btrfs: autodetect SSD devices
During mount, btrfs will check the queue nonrot flag
for all the devices found in the FS.  If they are all
non-rotating, SSD mode is enabled by default.

If the FS was mounted with -o nossd, the non-rotating
flag is ignored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
451d7585a8 Btrfs: add mount -o ssd_spread to spread allocations out
Some SSDs perform best when reusing block numbers often, while
others perform much better when clustering strictly allocates
big chunks of unused space.

The default mount -o ssd will find rough groupings of blocks
where there are a bunch of free blocks that might have some
allocated blocks mixed in.

mount -o ssd_spread will make sure there are no allocated blocks
mixed in.  It should perform better on lower end SSDs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
c604480171 Btrfs: avoid allocation clusters that are too spread out
In SSD mode for data, and all the time for metadata the allocator
will try to find a cluster of nearby blocks for allocations.  This
commit adds extra checks to make sure that each free block in the
cluster is close to the last one.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:51 -04:00
Chris Mason
3b30c22f64 Btrfs: Add mount -o nossd
This allows you to turn off the ssd mode via remount.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:50 -04:00
Chris Mason
d644d8a1e3 Btrfs: avoid IO stalls behind congested devices in a multi-device FS
The btrfs IO submission threads try to service a bunch of devices with a small
number of threads.  They do a congestion check to try and avoid waiting
on requests for a busy device.

The checks make sure we've sent a few requests down to a given device just so
that we aren't bouncing between busy devices without actually sending down
any IO.  The counter used to decide if we can switch to the next device
is somewhat overloaded.  It is also being used to decide if we've done
a good batch of requests between the WRITE_SYNC or regular priority lists.
It may get reset to zero often, leaving us hammering on a busy device
instead of moving on to another disk.

This commit adds a new counter for the number of bios sent while
servicing a device.  It doesn't get reset or fiddled with.  On
multi-device filesystems, this fixes IO stalls in streaming
write workloads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:49 -04:00
Chris Mason
d84275c938 Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
Btrfs uses dedicated threads to submit bios when checksumming is on,
which allows us to make sure the threads dedicated to checksumming don't get
stuck waiting for requests.  For each btrfs device, there are
two lists of bios.  One list is for WRITE_SYNC bios and the other
is for regular priority bios.

The IO submission threads used to process all of the WRITE_SYNC bios first and
then switch to the regular bios.  This commit makes sure we don't completely
starve the regular bios by rotating between the two lists.

WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries
to run in batches to avoid seeking.  Benchmarking shows this eliminates
stalls during streaming buffered writes on both multi-device and
single device filesystems.

If the regular bios starve, the system can end up with a large amount of ram
pinned down in writeback pages.  If we are a little more fair between the two
classes, we're able to keep throughput up and make progress on the bulk of
our dirty ram.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:49 -04:00
Chris Mason
585ad2c379 Btrfs: fix metadata dirty throttling limits
Once a metadata block has been written, it must be recowed, so the
btrfs dirty balancing call has a check to make sure a fair amount of metadata
was actually dirty before it started writing it back to disk.

A previous commit had changed the dirty tracking for metadata without
updating the btrfs dirty balancing checks.  This commit switches it
to use the correct counter.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:48 -04:00
Chris Mason
2c943de6ad Btrfs: reduce mount -o ssd CPU usage
The block allocator in SSD mode will try to find groups of free blocks
that are close together.  This commit makes it loop less on a given
group size before bumping it.

The end result is that we are less likely to fill small holes in the
available free space, but we don't waste as much CPU building the
large cluster used by ssd mode.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:48 -04:00
Chris Mason
cfbb930846 Btrfs: balance btree more often
With the new back reference code, the cost of a balance has gone down
in terms of the number of back reference updates done.  This commit
makes us more aggressively balance leaves and nodes as they become
less full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Chris Mason
b361242102 Btrfs: stop avoiding balancing at the end of the transaction.
When the delayed reference code was added, some checks were added
to avoid extra balancing while the delayed references were being flushed.
This made for less efficient btrees, but it reduced the chances of
loops where no forward progress was made because the balances made
more delayed ref updates.

With the new dead root removal code and the mixed back references,
the extent allocation tree is no longer using precise back refs, and
the delayed reference updates don't carry the risk of looping forever
anymore.  So, the balance avoidance is no longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Yan Zheng
5d4f98a28c Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Yan Zheng
5c939df56c btrfs: Fix set/clear_extent_bit for 'end == (u64)-1'
There are some 'start = state->end + 1;' like code in set_extent_bit
and clear_extent_bit. They overflow when end == (u64)-1.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Linus Torvalds
064e38aade Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Fix oops and use after free during space balancing
  Btrfs: set device->total_disk_bytes when adding new device
2009-06-05 11:54:28 -07:00
Chris Mason
44fb551163 Btrfs: Fix oops and use after free during space balancing
The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks.  It starts off with a hint
to help find the best block group for a given allocation.

The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing.  If it is being
freed, the list pointers in it are bogus and can't be trusted.  But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.

The fix used here is to check to make sure the block group we find really
is on the list before we use it.  list_del_init is used when removing
it from the list, so we can do a proper check.

The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster.  If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.

The fix used here is to check the current cluster against the
current allocation flags.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 15:41:27 -04:00
Yan Zheng
2cc3c559fb Btrfs: set device->total_disk_bytes when adding new device
It was not being properly initialized, and so the size saved to
disk was not correct.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 09:23:57 -04:00
Linus Torvalds
5732c46849 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
  Btrfs: make show_options result match actual option names
  Btrfs: remove outdated comment in btrfs_ioctl_resize()
  Btrfs: remove some WARN_ONs in the IO failure path
  Btrfs: Don't loop forever on metadata IO failures
  Btrfs: init inode ordered_data_close flag properly
2009-05-14 19:18:44 -07:00
Sankar P
9f55684c2d Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
Signed-off-by: Sankar P <sankar.curiosity@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Sage Weil
6b65c5c61b Btrfs: make show_options result match actual option names
The notreelog and flushoncommit mount options were being printed slightly
differently.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Li Hong
5d847a8ed9 Btrfs: remove outdated comment in btrfs_ioctl_resize()
In Li Zefan's commit dae7b665cf,
a combination call of kmalloc() and copy_from_user() is replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason
cc7b0c9b70 Btrfs: remove some WARN_ONs in the IO failure path
These debugging WARN_ONs make too much console noise during regular
IO failures.  An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason
76a05b35a3 Btrfs: Don't loop forever on metadata IO failures
When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block.  If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.

The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads.  The end result was looping forever on buffers that were never
going to become up to date.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:32 -04:00
Chris Mason
2757495c90 Btrfs: init inode ordered_data_close flag properly
This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits.  It was not being properly set to zero when the inode was
being setup.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:31 -04:00
Al Viro
6f5bbff9a1 Convert obvious places to deactivate_locked_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Linus Torvalds
4ebf662337 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: look for acls during btrfs_read_locked_inode
  Btrfs: fix acl caching
  Btrfs: Fix a bunch of printk() warnings.
  Btrfs: Fix a trivial warning using max() of u64 vs ULL.
  Btrfs: remove unused btrfs_bit_radix slab
  Btrfs: ratelimit IO error printks
  Btrfs: remove #if 0 code
  Btrfs: When shrinking, only update disk size on success
  Btrfs: fix deadlocks and stalls on dead root removal
  Btrfs: fix fallocate deadlock on inode extent lock
  Btrfs: kill btrfs_cache_create
  Btrfs: don't export symbols
  Btrfs: simplify makefile
  Btrfs: try to keep a healthy ratio of metadata vs data block groups
2009-04-27 11:16:33 -07:00
Chris Mason
46a53cca82 Btrfs: look for acls during btrfs_read_locked_inode
This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items.
If it is certain a given inode has no acls, it will set the in memory acl
fields to null to avoid acl lookups completely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 13:18:35 -04:00
Chris Mason
7b1a14bbb0 Btrfs: fix acl caching
Linus noticed the btrfs code to cache acls wasn't properly caching
a NULL acl when the inode didn't have any acls.  This meant the common
case of no acls resulted in expensive btree searches every time the
kernel checked permissions (which is quite often).

This is a modified version of Linus' original patch:

Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode.
This forces an acl lookup when permission checks are done.

Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields
are set to null.

Fix btrfs_get_acl to use the right return value from __btrfs_getxattr
when deciding to cache a NULL acl.  It was storing a NULL acl when
__btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning
-ENODATA for this case.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 13:18:26 -04:00
Joel Becker
21380931eb Btrfs: Fix a bunch of printk() warnings.
Just happened to notice a bunch of %llu vs u64 warnings.  Here's a patch
to cast them all.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Joel Becker
e63b6a6c0f Btrfs: Fix a trivial warning using max() of u64 vs ULL.
A small warning popped up on ia64 because inode-map.c was comparing a
u64 object id with the ULL FIRST_FREE_OBJECTID.  My first thought was
that all the OBJECTID constants should contain the u64 cast because
btrfs code deals entirely in u64s.  But then I saw how large that was,
and figured I'd just fix the max() call.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Chris Mason
45c06543af Btrfs: remove unused btrfs_bit_radix slab
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:48 -04:00
Chris Mason
193f284d49 Btrfs: ratelimit IO error printks
Btrfs has printks for various IO errors, including bad checksums and
mismatches between what we expect the block headers to contain and what
we actually find on the disk.

Longer term we need a real reporting mechanism for this, but for now
printk is going to have to do.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 07:41:47 -04:00
Chris Mason
b7967db75a Btrfs: remove #if 0 code
Btrfs had some old code sitting around under #if 0, this drops it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 07:40:52 -04:00
Chris Ball
d6397baee4 Btrfs: When shrinking, only update disk size on success
Previously, we updated a device's size prior to attempting a shrink
operation.  This patch moves the device resizing logic to only happen if
the shrink completes successfully.  In the process, it introduces a new
field to btrfs_device -- disk_total_bytes -- to track the on-disk size.

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 07:40:51 -04:00
Chris Mason
59bc5c758e Btrfs: fix deadlocks and stalls on dead root removal
After a transaction commit, the old root of the subvol btrees are sent through
snapshot removal.  This is what actually frees up any blocks replaced by
COW, and anything the old blocks pointed to.

Snapshot deletion will pause when a transaction commit has started, which
helps to avoid a huge amount of delayed reference count updates piling up
as the transaction is trying to close.

But, this pause happens after the snapshot deletion process has asked other
procs on the system to throttle back a bit so that it can make progress.

We don't want to throttle everyone while we're waiting for the transaction
commit, it leads to deadlocks in the user transaction ioctls used by Ceph
and makes things slower in general.

This patch changes things to avoid the throttling while we sleep.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:05 -04:00
Chris Mason
e980b50cda Btrfs: fix fallocate deadlock on inode extent lock
The btrfs fallocate call takes an extent lock on the entire range
being fallocated, and then runs through insert_reserved_extent on each
extent as they are allocated.

The problem with this is that btrfs_drop_extents may decide to try
and take the same extent lock fallocate was already holding.  The solution
used here is to push down knowledge of the range that is already locked
going into btrfs_drop_extents.

It turns out that at least one other caller had the same bug.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:05 -04:00
Christoph Hellwig
9601e3f633 Btrfs: kill btrfs_cache_create
Just use kmem_cache_create directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:04 -04:00
Christoph Hellwig
0d4bf11e53 Btrfs: don't export symbols
Currently the extent_map code is only for btrfs so don't export it's
symbols.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:04 -04:00
Christoph Hellwig
2ea2544ef5 Btrfs: simplify makefile
Get rid of the hacks for building out of tree, and always use += for
assigning to the object lists.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:03 -04:00
Josef Bacik
97e728d435 Btrfs: try to keep a healthy ratio of metadata vs data block groups
This patch makes the chunk allocator keep a good ratio of metadata vs data
block groups.  By default for every 8 data block groups, we'll allocate 1
metadata chunk, or about 12% of the disk will be allocated for metadata.  This
can be changed by specifying the metadata_ratio mount option.

This is simply the number of data block groups that have to be allocated to
force a metadata chunk allocation.  By making sure we allocate metadata chunks
more often, we are less likely to get into situations where the whole disk
has been allocated as data block groups.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:02 -04:00
Linus Torvalds
ccc5ff94c6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix btrfs fallocate oops and deadlock
  Btrfs: use the right node in reada_for_balance
  Btrfs: fix oops on page->mapping->host during writepage
  Btrfs: add a priority queue to the async thread helpers
  Btrfs: use WRITE_SYNC for synchronous writes
2009-04-21 14:12:58 -07:00
Chris Mason
546888da82 Btrfs: fix btrfs fallocate oops and deadlock
Btrfs fallocate was incorrectly starting a transaction with a lock held
on the extent_io tree for the file, which could deadlock.  Strictly
speaking it was using join_transaction which would be safe, but it is better
to move the transaction outside of the lock.

When preallocated extents are overwritten, btrfs_mark_buffer_dirty was
being called on an unlocked buffer.  This was triggering an assertion and
oops because the lock is supposed to be held.

The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had
been run.  btrfs_del_item takes care of dirtying things, so the solution is a
to skip the btrfs_mark_buffer_dirty call in this case.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-21 12:45:12 -04:00
Li Zefan
dae7b665cf btrfs: use memdup_user()
Remove open-coded memdup_user().

Note this changes some GFP_NOFS to GFP_KERNEL, since copy_from_user() may
cause pagefault, it's pointless to pass GFP_NOFS to kmalloc().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20 23:02:50 -04:00
Chris Mason
8c594ea81d Btrfs: use the right node in reada_for_balance
reada_for_balance was using the wrong index into the path node array,
so it wasn't reading the right blocks.  We never directly used the
results of the read done by this function because the btree search is
started over at the end.

This fixes reada_for_balance to reada in the correct node and to
avoid searching past the last slot in the node.  It also makes sure to
hold the parent lock while we are finding the nodes to read.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20 15:53:09 -04:00
Chris Mason
11c8349b4e Btrfs: fix oops on page->mapping->host during writepage
The extent_io writepage call updates the writepage index in the inode
as it makes progress.  But, it was doing the update after unlocking the page,
which isn't legal because page->mapping can't be trusted once the page
is unlocked.

This lead to an oops, especially common with compression turned on.  The
fix here is to update the writeback index before unlocking the page.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20 15:53:09 -04:00
Chris Mason
d313d7a31a Btrfs: add a priority queue to the async thread helpers
Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
higher priority.  But, the checksumming helper threads prevent it
from being fully effective.

There are two problems.  First, a big queue of pending checksumming
will delay the synchronous IO behind other lower priority writes.  Second,
the checksumming uses an ordered async work queue.  The ordering makes sure
that IOs are sent to the block layer in the same order they are sent
to the checksumming threads.  Usually this gives us less seeky IO.

But, when we start mixing IO priorities, the lower priority IO can delay
the higher priority IO.

This patch solves both problems by adding a high priority list to the async
helper threads, and a new btrfs_set_work_high_prio(), which is used
to make put a new async work item onto the higher priority list.

The ordering is still done on high priority IO, but all of the high
priority bios are ordered separately from the low priority bios.  This
ordering is purely an IO optimization, it is not involved in data
or metadata integrity.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20 15:53:08 -04:00
Chris Mason
ffbd517d5a Btrfs: use WRITE_SYNC for synchronous writes
Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
writes we plan on waiting on in the near future.  This patch
mirrors recent changes in other filesystems and the generic code to
use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
other latency critical writes.

Btrfs uses async worker threads for checksumming before the write is done,
and then again to actually submit the bios.  The bio submission code just
runs a per-device list of bios that need to be sent down the pipe.

This list is split into low priority and high priority lists so the
WRITE_SYNC IO happens first.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20 15:53:08 -04:00
Linus Torvalds
b983471794 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: BUG to BUG_ON changes
  Btrfs: remove dead code
  Btrfs: remove dead code
  Btrfs: fix typos in comments
  Btrfs: remove unused ftrace include
  Btrfs: fix __ucmpdi2 compile bug on 32 bit builds
  Btrfs: free inode struct when btrfs_new_inode fails
  Btrfs: fix race in worker_loop
  Btrfs: add flushoncommit mount option
  Btrfs: notreelog mount option
  Btrfs: introduce btrfs_show_options
  Btrfs: rework allocation clustering
  Btrfs: Optimize locking in btrfs_next_leaf()
  Btrfs: break up btrfs_search_slot into smaller pieces
  Btrfs: kill the pinned_mutex
  Btrfs: kill the block group alloc mutex
  Btrfs: clean up find_free_extent
  Btrfs: free space cache cleanups
  Btrfs: unplug in the async bio submission threads
  Btrfs: keep processing bios for a given bdev if our proc is batching
2009-04-03 15:14:44 -07:00
Linus Torvalds
8fe74cf053 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  Remove two unneeded exports and make two symbols static in fs/mpage.c
  Cleanup after commit 585d3bc06f
  Trim includes of fdtable.h
  Don't crap into descriptor table in binfmt_som
  Trim includes in binfmt_elf
  Don't mess with descriptor table in load_elf_binary()
  Get rid of indirect include of fs_struct.h
  New helper - current_umask()
  check_unsafe_exec() doesn't care about signal handlers sharing
  New locking/refcounting for fs_struct
  Take fs_struct handling to new file (fs/fs_struct.c)
  Get rid of bumping fs_struct refcount in pivot_root(2)
  Kill unsharing fs_struct in __set_personality()
2009-04-02 21:09:10 -07:00
Stoyan Gaydarov
c293498be6 Btrfs: BUG to BUG_ON changes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 17:05:11 -04:00
Dan Carpenter
3e7ad38d20 Btrfs: remove dead code
Remove an unneeded return statement and conditional

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 16:46:06 -04:00
Dan Carpenter
ff0a5836ac Btrfs: remove dead code
merge is always NULL at this point.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 16:46:06 -04:00
Wu Fengguang
d4a789474a Btrfs: fix typos in comments
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 16:46:06 -04:00
Jim Owens
2e966ed22c Btrfs: remove unused ftrace include
Signed-off-by: jim owens <jowens@hp.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 17:02:55 -04:00
Heiko Carstens
93dbfad7ac Btrfs: fix __ucmpdi2 compile bug on 32 bit builds
We get this on 32 builds:

fs/built-in.o: In function `extent_fiemap':
(.text+0x1019f2): undefined reference to `__ucmpdi2'

Happens because of a switch statement with a 64 bit argument.
Convert this to an if statement to fix this.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 10:33:45 -04:00
Shen Feng
09771430f3 Btrfs: free inode struct when btrfs_new_inode fails
btrfs_new_inode doesn't call iput to free the inode
when it fails.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 16:46:06 -04:00
Amit Gud
b5555f7711 Btrfs: fix race in worker_loop
Need to check kthread_should_stop after schedule_timeout() before calling
schedule(). This causes threads to sleep with potentially no one to wake them
up causing mount(2) to hang in btrfs_stop_workers waiting for threads to stop.

Signed-off-by: Amit Gud <gud@ksu.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 17:01:27 -04:00
Sage Weil
dccae99995 Btrfs: add flushoncommit mount option
The 'flushoncommit' mount option forces any data dirtied by a write in a
prior transaction to commit as part of the current commit.  This makes
the committed state a fully consistent view of the file system from the
application's perspective (i.e., it includes all completed file system
operations).  This was previously the behavior only when a snapshot is
created.

This is used by Ceph to ensure that completed writes make it to the
platter along with the metadata operations they are bound to (by
BTRFS_IOC_TRANS_{START,END}).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 16:59:01 -04:00
Sage Weil
3a5e14048a Btrfs: notreelog mount option
Add a 'notreelog' mount option to disable the tree log (used by fsync,
O_SYNC writes).  This is much slower, but the tree logging produces
inconsistent views into the FS for ceph.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 16:49:40 -04:00
Eric Paris
a9572a15a8 Btrfs: introduce btrfs_show_options
btrfs options can change at times other than mount, yet /proc/mounts shows the
options string used when the fs was mounted (an example would be when btrfs
determines that barriers aren't useful and turns them off.)  This patch
instead outputs the actual options in use by btrfs.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 16:46:06 -04:00
Chris Mason
fa9c0d795f Btrfs: rework allocation clustering
Because btrfs is copy-on-write, we end up picking new locations for
blocks very often.  This makes it fairly difficult to maintain perfect
read patterns over time, but we can at least do some optimizations
for writes.

This is done today by remembering the last place we allocated and
trying to find a free space hole big enough to hold more than just one
allocation.  The end result is that we tend to write sequentially to
the drive.

This happens all the time for metadata and it happens for data
when mounted -o ssd.  But, the way we record it is fairly racey
and it tends to fragment the free space over time because we are trying
to allocate fairly large areas at once.

This commit gets rid of the races by adding a free space cluster object
with dedicated locking to make sure that only one process at a time
is out replacing the cluster.

The free space fragmentation is somewhat solved by allowing a cluster
to be comprised of smaller free space extents.  This part definitely
adds some CPU time to the cluster allocations, but it allows the allocator
to consume the small holes left behind by cow.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 09:47:43 -04:00
Chris Mason
8e73f27501 Btrfs: Optimize locking in btrfs_next_leaf()
btrfs_next_leaf was using blocking locks when it could have been using
faster spinning ones instead.  This adds a few extra checks around
the pieces that block and switches over to spinning locks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 10:14:18 -04:00
Chris Mason
c8c42864f6 Btrfs: break up btrfs_search_slot into smaller pieces
btrfs_search_slot was doing too many things at once.  This breaks
it up into more reasonable units.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 10:14:18 -04:00
Josef Bacik
04018de5d4 Btrfs: kill the pinned_mutex
This patch removes the pinned_mutex.  The extent io map has an internal tree
lock that protects the tree itself, and since we only copy the extent io map
when we are committing the transaction we don't need it there.  We also don't
need it when caching the block group since searching through the tree is also
protected by the internal map spin lock.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-04-03 10:14:18 -04:00
Josef Bacik
6226cb0a5e Btrfs: kill the block group alloc mutex
This patch removes the block group alloc mutex used to protect the free space
tree for allocations and replaces it with a spin lock which is used only to
protect the free space rb tree.  This means we only take the lock when we are
directly manipulating the tree, which makes us a touch faster with
multi-threaded workloads.

This patch also gets rid of btrfs_find_free_space and replaces it with
btrfs_find_space_for_alloc, which takes the number of bytes you want to
allocate, and empty_size, which is used to indicate how much free space should
be at the end of the allocation.

It will return an offset for the allocator to use.  If we don't end up using it
we _must_ call btrfs_add_free_space to put it back.  This is the tradeoff to
kill the alloc_mutex, since we need to make sure nobody else comes along and
takes our space.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-04-03 10:14:18 -04:00
Josef Bacik
2552d17e32 Btrfs: clean up find_free_extent
I've replaced the strange looping constructs with a list_for_each_entry on
space_info->block_groups.  If we have a hint we just jump into the loop with
the block group and start looking for space.  If we don't find anything we
start at the beginning and start looking.  We never come out of the loop with a
ref on the block_group _unless_ we found space to use, then we drop it after we
set the trans block_group.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-04-03 10:14:19 -04:00
Josef Bacik
70cb074345 Btrfs: free space cache cleanups
This patch cleans up the free space cache code a bit.  It better documents the
idiosyncrasies of tree_search_offset and makes the code make a bit more sense.
I took out the info allocation at the start of __btrfs_add_free_space and put it
where it makes more sense.  This was left over cruft from when alloc_mutex
existed.  Also all of the re-searches we do to make sure we inserted properly.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-04-03 10:14:19 -04:00
Chris Mason
bedf762ba3 Btrfs: unplug in the async bio submission threads
Btrfs pages being written get set to writeback, and then may go through
a number of steps before they hit the block layer.  This includes compression,
checksumming and async bio submission.

The end result is that someone who writes a page and then does
wait_on_page_writeback is likely to unplug the queue before the bio they
cared about got there.

We could fix this by marking bios sync, or by doing more frequent unplugs,
but this commit just changes the async bio submission code to unplug
after it has processed all the bios for a device.  The async bio submission
does a fair job of collection bios, so this shouldn't be a huge problem
for reducing merging at the elevator.

For streaming O_DIRECT writes on a 5 drive array, it boosts performance
from 386MB/s to 460MB/s.

Thanks to Hisashi Hifumi for helping with this work.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 10:32:58 -04:00
Chris Mason
b765ead57d Btrfs: keep processing bios for a given bdev if our proc is batching
Btrfs uses async helper threads to submit write bios so the checksumming
helper threads don't block on the disk.

The submit bio threads may process bios for more than one block device,
so when they find one device congested they try to move on to other
devices instead of blocking in get_request_wait for one device.

This does a pretty good job of keeping multiple devices busy, but the
congested flag has a number of problems.  A congested device may still
give you a request, and other procs that aren't backing off the congested
device may starve you out.

This commit uses the io_context stored in current to decide if our process
has been made a batching process by the block layer.  If so, it keeps
sending IO down for at least one batch.  This helps make sure we do
a good amount of work each time we visit a bdev, and avoids large IO
stalls in multi-device workloads.

It's also very ugly.  A better solution is in the works with Jens Axboe.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 10:27:10 -04:00
Linus Torvalds
c226fd659f Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: try to free metadata pages when we free btree blocks
  Btrfs: add extra flushing for renames and truncates
  Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod
  Btrfs: optimize fsyncs on old files
  Btrfs: tree logging unlink/rename fixes
  Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay
  Btrfs: limit balancing work while flushing delayed refs
  Btrfs: readahead checksums during btrfs_finish_ordered_io
  Btrfs: leave btree locks spinning more often
  Btrfs: Only let very young transactions grow during commit
  Btrfs: Check for a blocking lock before taking the spin
  Btrfs: reduce stack in cow_file_range
  Btrfs: reduce stalls during transaction commit
  Btrfs: process the delayed reference queue in clusters
  Btrfs: try to cleanup delayed refs while freeing extents
  Btrfs: reduce stack usage in some crucial tree balancing functions
  Btrfs: do extent allocation and reference count updates in the background
  Btrfs: don't preallocate metadata blocks during btrfs_search_slot
2009-04-01 10:20:44 -07:00
Nick Piggin
56a76f8275 fs: fix page_mkwrite error cases in core code and btrfs
page_mkwrite is called with neither the page lock nor the ptl held.  This
means a page can be concurrently truncated or invalidated out from
underneath it.  Callers are supposed to prevent truncate races themselves,
however previously the only thing they can do in case they hit one is to
raise a SIGBUS.  A sigbus is wrong for the case that the page has been
invalidated or truncated within i_size (eg.  hole punched).  Callers may
also have to perform memory allocations in this path, where again, SIGBUS
would be wrong.

The previous patch ("mm: page_mkwrite change prototype to match fault")
made it possible to properly specify errors.  Convert the generic buffer.c
code and btrfs to return sane error values (in the case of page removed
from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
without doing anything, and the fault will be retried properly).

This fixes core code, and converts btrfs as a template/example.  All other
filesystems defining their own page_mkwrite should be fixed in a similar
manner.

Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:14 -07:00
Nick Piggin
c2ec175c39 mm: page_mkwrite change prototype to match fault
Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags.  There should be no functional change.

This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg.  virtual_address to the
driver, which might be important in some special cases).

This is required for a subsequent fix.  And will also make it easier to
merge page_mkwrite() with fault() in future.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:14 -07:00
Al Viro
ce3b0f8d5c New helper - current_umask()
current->fs->umask is what most of fs_struct users are doing.
Put that into a helper function.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:26 -04:00
Chris Mason
d57e62b897 Btrfs: try to free metadata pages when we free btree blocks
COW means we cycle though blocks fairly quickly, and once we
free an extent on disk, it doesn't make much sense to keep the pages around.

This commit tries to immediately free the page when we free the extent,
which lowers our memory footprint significantly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-31 14:27:58 -04:00
Chris Mason
5a3f23d515 Btrfs: add extra flushing for renames and truncates
Renames and truncates are both common ways to replace old data with new
data.  The filesystem can make an effort to make sure the new data is
on disk before actually replacing the old data.

This is especially important for rename, which many application use as
though it were atomic for both the data and the metadata involved.  The
current btrfs code will happily replace a file that is fully on disk
with one that was just created and still has pending IO.

If we crash after transaction commit but before the IO is done, we'll end
up replacing a good file with a zero length file.  The solution used
here is to create a list of inodes that need special ordering and force
them to disk before the commit is done.  This is similar to the
ext3 style data=ordering, except it is only done on selected files.

Btrfs is able to get away with this because it does not wait on commits
very often, even for fsync (which use a sub-commit).

For renames, we order the file when it wasn't already
on disk and when it is replacing an existing file.  Larger files
are sent to filemap_flush right away (before the transaction handle is
opened).

For truncates, we order if the file goes from non-zero size down to
zero size.  This is a little different, because at the time of the
truncate the file has no dirty bytes to order.  But, we flag the inode
so that it is added to the ordered list on close (via release method).  We
also immediately add it to the ordered list of the current transaction
so that we can try to flush down any writes the application sneaks in
before commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-31 14:27:58 -04:00
Jens Axboe
6933c02e9c btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty
Chris says it's safe to kill.

Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-03-26 11:01:35 +01:00
Chris Mason
1a81af4d1d Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod
btrfs_update_delayed_ref is optimized to add and remove different
references in one pass through the delayed ref tree.  It is a zero
sum on the total number of refs on a given extent.

But, the code was recording an extra ref in the head node.  This
never made it down to the disk but was used when deciding if it was
safe to free the extent while dropping snapshots.

The fix used here is to make sure the ref_mod count is unchanged
on the head ref when btrfs_update_delayed_ref is called.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-25 09:55:11 -04:00
Chris Mason
af4176b49c Btrfs: optimize fsyncs on old files
The fsync log has code to make sure all of the parents of a file are in the
log along with the file.  It uses a minimal log of the parent directory
inodes, just enough to get the parent directory on disk.

If the transaction that originally created a file is fully on disk,
and the file hasn't been renamed or linked into other directories, we
can safely skip the parent directory walk.  We know the file is on disk
somewhere and we can go ahead and just log that single file.

This is more important now because unrelated unlinks in the parent directory
might make us force a commit if we try to log the parent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:52 -04:00
Chris Mason
12fcfd22fe Btrfs: tree logging unlink/rename fixes
The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.

The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log.  This patch adds a few new rules
to the tree logging.

1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done.  The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.

Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged.  For renames this is only done when
renaming to a different directory.

mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file

The fsync above will unlink the original some_dir without recording
it in its new location (foo2).  After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit

2) we must log any new names for any file or dir that is in the fsync
log.  This way we make sure not to lose files that are unlinked during
the same transaction.

2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.

2a is actually the more important variant.  Without the extra logging
a crash might unlink the old name without recreating the new one

3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf

mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)

The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir.  After a crash the rm -rf must
be replayed.  This must be able to recurse down the entire
directory tree.  The inode link count fixup code takes care of the
ugly details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:52 -04:00
Chris Mason
a74ac32207 Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay
During log replay, inodes are copied from the log to the main filesystem
btrees.  Sometimes they have a zero link count in the log but they actually
gain links during the replay or have some in the main btree.

This patch updates the link count to be at least one after copying the
inode out of the log.  This makes sure the inode is deleted during an
iput while the rest of the replay code is still working on it.

The log replay has fixup code to make sure that link counts are correct
at the end of the replay, so we could use any non-zero number here and
it would work fine.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:51 -04:00
Chris Mason
a4b6e07d1a Btrfs: limit balancing work while flushing delayed refs
The delayed reference mechanism is responsible for all updates to the
extent allocation trees, including those updates created while processing
the delayed references.

This commit tries to limit the amount of work that gets created during
the final run of delayed refs before a commit.  It avoids cowing new blocks
unless it is required to finish the commit, and so it avoids new allocations
that were not really required.

The goal is to avoid infinite loops where we are always making more work
on the final run of delayed refs.  Over the long term we'll make a
special log for the last delayed ref updates as well.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:51 -04:00
Chris Mason
5d13a98f3b Btrfs: readahead checksums during btrfs_finish_ordered_io
This reads in blocks in the checksum btree before starting the
transaction in btrfs_finish_ordered_io.  It makes it much more likely
we'll be able to do operations inside the transaction without
needing any btree reads, which limits transaction latencies overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:51 -04:00
Chris Mason
b9473439d3 Btrfs: leave btree locks spinning more often
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying.  This may require a kmalloc and it
was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer.  Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:28 -04:00
Chris Mason
89573b9c51 Btrfs: Only let very young transactions grow during commit
Commits are fairly expensive, and so btrfs has code to sit around for a while
during the commit and let new writers come in.

But, while we're sitting there, new delayed refs might be added, and those
can be expensive to process as well.  Unless the transaction is very very
young, it makes sense to go ahead and let the commit finish without hanging
around.

The commit grow loop isn't as important as it used to be, the fsync logging
code handles most performance critical syncs now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:28 -04:00
Chris Mason
66d7e85ea7 Btrfs: Check for a blocking lock before taking the spin
This reduces contention on the extent buffer spin locks by testing for a
blocking lock before trying to take the spinlock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:27 -04:00
Chris Mason
7f366cfecf Btrfs: reduce stack in cow_file_range
The fs/btrfs/inode.c code to run delayed allocation during writout
needed some stack usage optimization.  This is the first pass, it does
the check for compression earlier on, which allows us to do the common
(no compression) case higher up in the call chain.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:27 -04:00
Chris Mason
b7ec40d784 Btrfs: reduce stalls during transaction commit
To avoid deadlocks and reduce latencies during some critical operations, some
transaction writers are allowed to jump into the running transaction and make
it run a little longer, while others sit around and wait for the commit to
finish.

This is a bit unfair, especially when the callers that jump in do a bunch
of IO that makes all the others procs on the box wait.  This commit
reduces the stalls this produces by pre-reading file extent pointers
during btrfs_finish_ordered_io before the transaction is joined.

It also tunes the drop_snapshot code to politely wait for transactions
that have started writing out their delayed refs to finish.  This avoids
new delayed refs being flooded into the queue while we're trying to
close off the transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason
c3e69d58e8 Btrfs: process the delayed reference queue in clusters
The delayed reference queue maintains pending operations that need to
be done to the extent allocation tree.  These are processed by
finding records in the tree that are not currently being processed one at
a time.

This is slow because it uses lots of time searching through the rbtree
and because it creates lock contention on the extent allocation tree
when lots of different procs are running delayed refs at the same time.

This commit changes things to grab a cluster of refs for processing,
using a cursor into the rbtree as the starting point of the next search.
This way we walk smoothly through the rbtree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason
1887be66dc Btrfs: try to cleanup delayed refs while freeing extents
When extents are freed, it is likely that we've removed the last
delayed reference update for the extent.  This checks the delayed
ref tree when things are freed, and if no ref updates area left it
immediately processes the delayed ref.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason
44871b1b24 Btrfs: reduce stack usage in some crucial tree balancing functions
Many of the tree balancing functions follow the same pattern.

1) cow a block
2) do something to the result

This commit breaks them up into two functions so the variables and
code required for part two don't suck down stack during part one.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:25 -04:00