Since tree log blocks get freed every transaction, they never really
need to be written to disk. This skips the step where we update
metadata to record they were allocated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Pin down data blocks to prevent them from being reallocated like so:
trans 1: allocate file extent
trans 2: free file extent
trans 3: free file extent during old snapshot deletion
trans 3: allocate file extent to new file
trans 3: fsync new file
Before the tree logging code, this was legal because the fsync
would commit the transation that did the final data extent free
and the transaction that allocated the extent to the new file
at the same time.
With the tree logging code, the tree log subtransaction can commit
before the transaction that freed the extent. If we crash,
we're left with two different files using the extent.
* Don't wait in start_transaction if log replay is going on. This
avoids deadlocks from iput while we're cleaning up link counts in the
replay code.
* Don't deadlock in replay_one_name by trying to read an inode off
the disk while holding paths for the directory
* Hold the buffer lock while we mark a buffer as written. This
closes a race where someone is changing a buffer while we write it.
They are supposed to mark it dirty again after they change it, but
this violates the cow rules.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.
After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Far from the perfect fix, but these structs are small. TODO for the
next release. The block group cache structs are referenced in many
different places, and it isn't safe to just free them while resizing.
A real fix will be a larger change to the allocator so that it doesn't
have to carry about the block group cache structs to find good places
to search for free blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Make walk_down_tree wake up throttled tasks more often
* Make walk_down_tree call cond_resched during long loops
* As the size of the ref cache grows, wait longer in throttle
* Get rid of the reada code in walk_down_tree, the leaves don't get
read anymore, thanks to the ref cache.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A btree block cow has two parts, the first is to allocate a destination
block and the second is to copy the old bock over.
The first part needs locks in the extent allocation tree, and may need to
do IO. This changeset splits that into a separate function that can be
called without any tree locks held.
btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block. This often means that many writers will
pre-alloc a new destination for a the same contended block, but they
cache their prealloc for later use on lower levels in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While dropping snapshots, walk_down_tree does most of the work of checking
reference counts and limiting tree traversal to just the blocks that
we are freeing.
It dropped and held the allocation mutex in strange and confusing ways,
this commit changes it to only hold the mutex while actually freeing a block.
The rest of the checks around reference counts should be safe without the lock
because we only allow one process in btrfs_drop_snapshot at a time. Other
processes dropping reference counts should not drop it to 1 because
their tree roots already have an extra ref on the block.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This avoids waiting for transactions with pages locked by breaking out
the code to wait for the current transaction to close into a function
called by btrfs_throttle.
It also lowers the limits for where we start throttling.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.
The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To check whether a given file extent is referenced by multiple snapshots, the
checker walks down the fs tree through dead root and checks all tree blocks in
the path.
We can easily detect whether a given tree block is directly referenced by other
snapshot. We can also detect any indirect reference from other snapshot by
checking reference's generation. The checker can always detect multiple
references, but can't reliably detect cases of single reference. So btrfs may
do file data cow even there is only one reference.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A large reference cache is directly related to a lot of work pending
for the cleaner thread. This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.
Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.
This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.
Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.
This creates a cache so that IO isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.
Also, the relocation code needs to wait for ordered IO before scanning
the block group again. This is because the extents are not removed
until the IO for the new extents is finished
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This releases the alloc_mutex in a few places that hold it for over long
operations. btrfs_lookup_block_group is changed so that it doesn't need
the mutex at all.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This replaces the use of the page cache lock bit for locking, which wasn't
suitable for block size < page size and couldn't be used recursively.
The mutexes alone don't fix either problem, but they are the first step.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* In btrfs_delete_inode, wait for ordered extents after calling
truncate_inode_pages. This is much faster, and more correct
* Properly clear our the PageChecked bit everywhere we redirty the page.
* Change the writepage fixup handler to lock the page range and check to
see if an ordered extent had been inserted since the improperly dirtied
page was discovered
* Wait for ordered extents outside the transaction. This isn't required
for locking rules but does improve transaction latencies
* Reduce contention on the alloc_mutex by dropping it while incrementing
refs on a node/leaf and while dropping refs on a leaf.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_drop_extents is always called with a range lock held on the inode.
But, it may operate on extents outside that range as it drops and splits
them.
This patch adds a per-inode mutex that is held while calling
btrfs_drop_extents and while inserting new extents into the tree. It
prevents races from two procs working against adjacent ranges in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk. This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.
The new code changes the way data allocations and extents work:
* When delayed allocation is filled, data extents are reserved, and
the extent bit EXTENT_ORDERED is set on the entire range of the extent.
A struct btrfs_ordered_extent is allocated an inserted into a per-inode
rbtree to track the pending extents.
* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
to that page.
* When all of the bytes corresponding to a single struct btrfs_ordered_extent
are written, The previously reserved extent is inserted into the FS
btree and into the extent allocation trees. The checksums for the file
data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.
This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems. The auto-defrag needs to be done differently.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This makes it possible for callers to check for extent_buffers in cache
without deadlocking against any btree locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This creates one kthread for commits and one kthread for
deleting old snapshots. All the work queues are removed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree. But, those locks might already be held in other places, leading
to deadlocks.
Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups. A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree. This drops
locks held by a path when doing reads during a tree search.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Extent alloctions are still protected by a large alloc_mutex.
Objectid allocations are covered by a objectid mutex
Other btree operations are protected by a lock on individual btree nodes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The allocation trees and the chunk trees are serialized via their own
dedicated mutexes. This means allocation location is still not very
fine grained.
The main FS btree is protected by locks on each block in the btree. Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.
The end result of a search is now a path where only the lowest level
is locked. Releasing or freeing the path drops any locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Force chunk allocation when find_free_extent has to do a full scan
* Record the max key at the start of defrag so it doesn't run forever
* Block groups might not be contiguous, make a forward search for the
next block group in extent-tree.c
* Get rid of extra checks for total fs size
* Fix relocate_one_reference to avoid relocating the same file data block
twice when referenced by an older transaction
* Use the open device count when allocating chunks so that we don't
try to allocate from devices that don't exist
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When duplicate copies exist, writes are allowed to fail to one of those
copies. This changeset includes a few changes that allow the FS to
continue even when some IOs fail.
It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes to are detected.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Once part of a delalloc request fails the cow checks, just cow the
entire range
It is possible for the back references to all be from the same root,
but still have snapshots against an extent. The checks are now more strict,
forcing cow any time there are multiple refs against the data extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, nodatacow only checked to make sure multiple roots didn't have
references on a single extent. This check makes sure that multiple
inodes don't have references.
nodatacow needed an extra check to see if the block group was currently
readonly. This way cows forced by the chunk relocation code are honored.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This required a few structural changes to the code that manages bdev pointers:
The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev. This allows us to avoid swapping the super block bdev pointer
around at run time.
The code to read in the super block no longer goes through the extent
buffer interface. Things got ugly keeping the mapping constant.
Signed-off-by: Chris Mason <chris.mason@oracle.com>