Commit graph

245 commits

Author SHA1 Message Date
Tao Ma
d3ad8434aa jbd2: use WRITE_SYNC in journal checkpoint
In journal checkpoint, we write the buffer and wait for its finish.
But in cfq, the async queue has a very low priority, and in our test,
if there are too many sync queues and every queue is filled up with
requests, the write request will be delayed for quite a long time and
all the tasks which are waiting for journal space will end with errors like:

INFO: task attr_set:3816 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
attr_set      D ffff880028393480     0  3816      1 0x00000000
 ffff8802073fbae8 0000000000000086 ffff8802140847c8 ffff8800283934e8
 ffff8802073fb9d8 ffffffff8103e456 ffff8802140847b8 ffff8801ed728080
 ffff8801db4bc080 ffff8801ed728450 ffff880028393480 0000000000000002
Call Trace:
 [<ffffffff8103e456>] ? __dequeue_entity+0x33/0x38
 [<ffffffff8103caad>] ? need_resched+0x23/0x2d
 [<ffffffff814006a6>] ? thread_return+0xa2/0xbc
 [<ffffffffa01f6224>] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
 [<ffffffffa01f6224>] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
 [<ffffffff81400d31>] __mutex_lock_common+0x14e/0x1a9
 [<ffffffffa021dbfb>] ? brelse+0x13/0x15 [ext4]
 [<ffffffff81400ddb>] __mutex_lock_slowpath+0x19/0x1b
 [<ffffffff81400b2d>] mutex_lock+0x1b/0x32
 [<ffffffffa01f927b>] __jbd2_journal_insert_checkpoint+0xe3/0x20c [jbd2]
 [<ffffffffa01f547b>] start_this_handle+0x438/0x527 [jbd2]
 [<ffffffff8106f491>] ? autoremove_wake_function+0x0/0x3e
 [<ffffffffa01f560b>] jbd2_journal_start+0xa1/0xcc [jbd2]
 [<ffffffffa02353be>] ext4_journal_start_sb+0x57/0x81 [ext4]
 [<ffffffffa024a314>] ext4_xattr_set+0x6c/0xe3 [ext4]
 [<ffffffffa024aaff>] ext4_xattr_user_set+0x42/0x4b [ext4]
 [<ffffffff81145adb>] generic_setxattr+0x6b/0x76
 [<ffffffff81146ac0>] __vfs_setxattr_noperm+0x47/0xc0
 [<ffffffff81146bb8>] vfs_setxattr+0x7f/0x9a
 [<ffffffff81146c88>] setxattr+0xb5/0xe8
 [<ffffffff81137467>] ? do_filp_open+0x571/0xa6e
 [<ffffffff81146d26>] sys_fsetxattr+0x6b/0x91
 [<ffffffff81002d32>] system_call_fastpath+0x16/0x1b

So this patch tries to use WRITE_SYNC in __flush_batch so that the request will
be moved into sync queue and handled by cfq timely. We also use the new plug,
sot that all the WRITE_SYNC requests can be given as a whole when we unplug it.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Reported-by: Robin Dong <sanbai@taobao.com>
2011-06-27 12:36:29 -04:00
Jan Kara
de1b794130 jbd2: Fix oops in jbd2_journal_remove_journal_head()
jbd2_journal_remove_journal_head() can oops when trying to access
journal_head returned by bh2jh(). This is caused for example by the
following race:

	TASK1					TASK2
  jbd2_journal_commit_transaction()
    ...
    processing t_forget list
      __jbd2_journal_refile_buffer(jh);
      if (!jh->b_transaction) {
        jbd_unlock_bh_state(bh);
					jbd2_journal_try_to_free_buffers()
					  jbd2_journal_grab_journal_head(bh)
					  jbd_lock_bh_state(bh)
					  __journal_try_to_free_buffer()
					  jbd2_journal_put_journal_head(jh)
        jbd2_journal_remove_journal_head(bh);

jbd2_journal_put_journal_head() in TASK2 sees that b_jcount == 0 and
buffer is not part of any transaction and thus frees journal_head
before TASK1 gets to doing so. Note that even buffer_head can be
released by try_to_free_buffers() after
jbd2_journal_put_journal_head() which adds even larger opportunity for
oops (but I didn't see this happen in reality).

Fix the problem by making transactions hold their own journal_head
reference (in b_jcount). That way we don't have to remove journal_head
explicitely via jbd2_journal_remove_journal_head() and instead just
remove journal_head when b_jcount drops to zero. The result of this is
that [__]jbd2_journal_refile_buffer(),
[__]jbd2_journal_unfile_buffer(), and
__jdb2_journal_remove_checkpoint() can free journal_head which needs
modification of a few callers. Also we have to be careful because once
journal_head is removed, buffer_head might be freed as well. So we
have to get our own buffer_head reference where it matters.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-06-13 15:38:22 -04:00
Tao Ma
1fb74cda1b jbd2: Remove obsolete parameters in the comments for some jbd2 functions
credits isn't a parameter for jbd2_journal_get_write_access and
jbd2_journal_get_undo_access. So remove the corresponding comments.

Acked-by: Jan Kara <jack@suse.cz>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-06-12 22:44:10 -04:00
Linus Torvalds
35806b4f7c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
  jbd2: Add MAINTAINERS entry
  jbd2: fix a potential leak of a journal_head on an error path
  ext4: teach ext4_ext_split to calculate extents efficiently
  ext4: Convert ext4 to new truncate calling convention
  ext4: do not normalize block requests from fallocate()
  ext4: enable "punch hole" functionality
  ext4: add "punch hole" flag to ext4_map_blocks()
  ext4: punch out extents
  ext4: add new function ext4_block_zero_page_range()
  ext4: add flag to ext4_has_free_blocks
  ext4: reserve inodes and feature code for 'quota' feature
  ext4: add support for multiple mount protection
  ext4: ensure f_bfree returned by ext4_statfs() is non-negative
  ext4: protect bb_first_free in ext4_trim_all_free() with group lock
  ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
  jbd2: Fix comment to match the code in jbd2__journal_start()
  ext4: fix waiting and sending of a barrier in ext4_sync_file()
  jbd2: Add function jbd2_trans_will_send_data_barrier()
  jbd2: fix sending of data flush on journal commit
  ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly
  ...
2011-05-26 09:53:20 -07:00
Ding Dinghua
3991b4008c jbd2: fix a potential leak of a journal_head on an error path
drop jh->b_jcount in error path

Signed-off-by: Ding Dinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25 17:43:48 -04:00
Eryu Guan
c867516de5 jbd2: Fix comment to match the code in jbd2__journal_start()
jbd2__journal_start() returns an ERR_PTR() value rather than NULL on
failure.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 17:09:58 -04:00
Jan Kara
bbd2be3691 jbd2: Add function jbd2_trans_will_send_data_barrier()
Provide a function which returns whether a transaction with given tid
will send a flush to the filesystem device.  The function will be used
by ext4 to detect whether fsync needs to send a separate flush or not.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:59:18 -04:00
Jan Kara
81be12c817 jbd2: fix sending of data flush on journal commit
In data=ordered mode, it's theoretically possible (however rare) that
an inode is filed to transaction's t_inode_list and a flusher thread
writes all the data and inode is reclaimed before the transaction
starts to commit.  In such a case, we could erroneously omit sending a
flush to file system device when it is different from the journal
device (because data can still be in disk cache only).

Fix the problem by setting a flag in a transaction when some inode is added
to it and then send disk flush in the commit code when the flag is set.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:52:40 -04:00
Tao Ma
28e35e42fb jbd2: Fix the wrong calculation of t_max_wait in update_t_max_wait
t_max_wait is added in commit 8e85fb3f to indicate how long we
were waiting for new transaction to start. In commit 6d0bf005,
it is moved to another function named update_t_max_wait to
avoid a build warning. But the wrong thing is that the original
'ts' is initialized in the start of function start_this_handle
and we can calculate t_max_wait in the right way. while with
this change, ts is initialized within the function and t_max_wait
can never be calculated right.

This patch moves the initialization of ts to the original beginning
of start_this_handle and pass it to function update_t_max_wait so
that it can be calculated right and the build warning is avoided also.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-22 21:45:26 -04:00
Tao Ma
9199e66528 jbd/jbd2: remove obsolete summarise_journal_usage.
summarise_journal_usage seems to be obsolete for a long time,
so remove it.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-05-17 13:47:42 +02:00
Theodore Ts'o
1be2add685 jbd2: only print the debugging information for tid wraparound once
If we somehow wrap, we don't want to keep printing the warning message
over and over again.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-08 19:37:54 -04:00
Jan Kara
229309caeb jbd2: Fix forever sleeping process in do_get_write_access()
In do_get_write_access() we wait on BH_Unshadow bit for buffer to get
from shadow state. The waking code in journal_commit_transaction() has
a bug because it does not issue a memory barrier after the buffer is
moved from the shadow state and before wake_up_bit() is called. Thus a
waitqueue check can happen before the buffer is actually moved from
the shadow state and waiting process may never be woken. Fix the
problem by issuing proper barrier.

Reported-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-08 19:09:53 -04:00
Theodore Ts'o
deeeaf13b2 jbd2: fix fsync() tid wraparound bug
If an application program does not make any changes to the indirect
blocks or extent tree, i_datasync_tid will not get updated.  If there
are enough commits (i.e., 2**31) such that tid_geq()'s calculations
wrap, and there isn't a currently active transaction at the time of
the fdatasync() call, this can end up triggering a BUG_ON in
fs/jbd2/commit.c:

	J_ASSERT(journal->j_running_transaction != NULL);

It's pretty rare that this can happen, since it requires the use of
fdatasync() plus *very* frequent and excessive use of fsync().  But
with the right workload, it can.

We fix this by replacing the use of tid_geq() with an equality test,
since there's only one valid transaction id that we is valid for us to
wait until it is commited: namely, the currently running transaction
(if it exists).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-01 18:16:26 -04:00
Linus Torvalds
a97b52022a Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix data corruption regression by reverting commit 6de9843dab
  ext4: Allow indirect-block file to grow the file size to max file size
  ext4: allow an active handle to be started when freezing
  ext4: sync the directory inode in ext4_sync_parent()
  ext4: init timer earlier to avoid a kernel panic in __save_error_info
  jbd2: fix potential memory leak on transaction commit
  ext4: fix a double free in ext4_register_li_request
  ext4: fix credits computing for indirect mapped files
  ext4: remove unnecessary [cm]time update of quota file
  jbd2: move bdget out of critical section
2011-04-11 15:45:47 -07:00
Zhang Huan
6cba611e60 jbd2: fix potential memory leak on transaction commit
There is potential memory leak of journal head in function
jbd2_journal_commit_transaction. The problem is that JBD2 will not
reclaim the journal head of commit record if error occurs or journal
is abotred.

I use the following script to reproduce this issue, on a RHEL6
system. I found it very easy to reproduce with async commit enabled.

mount /dev/sdb /mnt -o journal_checksum,journal_async_commit
touch /mnt/xxx
echo offline > /sys/block/sdb/device/state
sync
umount /mnt
rmmod ext4
rmmod jbd2

Removal of the jbd2 module will make slab complaining that
"cache `jbd2_journal_head': can't free all objects".

Signed-off-by: Zhang Huan <zhhuan@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-05 19:16:20 -04:00
Zhu Yanhai
50f689af01 jbd2: move bdget out of critical section
bdget() should not be called when we hold spinlocks since
it might sleep.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-04 12:58:12 -04:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds
6c51038900 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
  Documentation/iostats.txt: bit-size reference etc.
  cfq-iosched: removing unnecessary think time checking
  cfq-iosched: Don't clear queue stats when preempt.
  blk-throttle: Reset group slice when limits are changed
  blk-cgroup: Only give unaccounted_time under debug
  cfq-iosched: Don't set active queue in preempt
  block: fix non-atomic access to genhd inflight structures
  block: attempt to merge with existing requests on plug flush
  block: NULL dereference on error path in __blkdev_get()
  cfq-iosched: Don't update group weights when on service tree
  fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
  block: Require subsystems to explicitly allocate bio_set integrity mempool
  jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  fs: make fsync_buffers_list() plug
  mm: make generic_writepages() use plugging
  blk-cgroup: Add unaccounted time to timeslice_used.
  block: fixup plugging stubs for !CONFIG_BLOCK
  block: remove obsolete comments for blkdev_issue_zeroout.
  blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
  ...

Fix up conflicts in fs/{aio.c,super.c}
2011-03-24 10:16:26 -07:00
Jens Axboe
82f04ab47e jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
'write_op' was still used, even though it was always WRITE_SYNC now.
Add plugging around the cases where it submits IO, and flush them
before we end up waiting for that IO.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-17 11:01:52 +01:00
Jens Axboe
721a9602e6 block: kill off REQ_UNPLUG
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:27 +01:00
Justin P. Mattock
3c26bdb423 jbd: Remove one to many n's in a word.
The Patch below removes one to many "n's" in a word..

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-02-28 21:55:58 +01:00
Theodore Ts'o
e447183180 jbd2: call __jbd2_log_start_commit with j_state_lock write locked
On an SMP ARM system running ext4, I've received a report that the
first J_ASSERT in jbd2_journal_commit_transaction has been triggering:

	J_ASSERT(journal->j_running_transaction != NULL);

While investigating possible causes for this problem, I noticed that
__jbd2_log_start_commit() is getting called with j_state_lock only
read-locked, in spite of the fact that it's possible for it might
j_commit_request.  Fix this by grabbing the necessary information so
we can test to see if we need to start a new transaction before
dropping the read lock, and then calling jbd2_log_start_commit() which
will grab the write lock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:18:24 -05:00
Linus Torvalds
008d23e485 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  Documentation/trace/events.txt: Remove obsolete sched_signal_send.
  writeback: fix global_dirty_limits comment runtime -> real-time
  ppc: fix comment typo singal -> signal
  drivers: fix comment typo diable -> disable.
  m68k: fix comment typo diable -> disable.
  wireless: comment typo fix diable -> disable.
  media: comment typo fix diable -> disable.
  remove doc for obsolete dynamic-printk kernel-parameter
  remove extraneous 'is' from Documentation/iostats.txt
  Fix spelling milisec -> ms in snd_ps3 module parameter description
  Fix spelling mistakes in comments
  Revert conflicting V4L changes
  i7core_edac: fix typos in comments
  mm/rmap.c: fix comment
  sound, ca0106: Fix assignment to 'channel'.
  hrtimer: fix a typo in comment
  init/Kconfig: fix typo
  anon_inodes: fix wrong function name in comment
  fix comment typos concerning "consistent"
  poll: fix a typo in comment
  ...

Fix up trivial conflicts in:
 - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
 - fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13 10:05:56 -08:00
Theodore Ts'o
8aefcd557d ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing.  This allows us to further slim down the ext4_inode_info
structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:29:43 -05:00
Jiri Kosina
4b7bd36470 Merge branch 'master' into for-next
Conflicts:
	MAINTAINERS
	arch/arm/mach-omap2/pm24xx.c
	drivers/scsi/bfa/bfa_fcpim.c

Needed to update to apply fixes for which the old branch was too
outdated.
2010-12-22 18:57:02 +01:00
Theodore Ts'o
b7271b0a39 jbd2: simplify return path of journal_init_common
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:39:38 -05:00
Theodore Ts'o
9a4f6271b6 jbd2: move debug message into debug #ifdef
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:36:33 -05:00
Theodore Ts'o
ae00b267f3 jbd2: remove unnecessary goto statement
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:34:20 -05:00
Theodore Ts'o
a1dd533184 jbd2: use offset_in_page() instead of manual calculation
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:13:40 -05:00
Theodore Ts'o
cfef2c6a55 jbd2: Fix a debug message in do_get_write_access()
'buffer_head' should be 'journal_head'

This is a port of a patch which Namhyung Kim <namhyung@gmail.com> made
to fs/jbd to jbd2.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:07:34 -05:00
Theodore Ts'o
670be5a78a jbd2: Use pr_notice_ratelimited() in journal_alloc_journal_head()
We had an open-coded version of printk_ratelimited(); use the provided
abstraction to make the code cleaner and easier to understand.

Based on a similar patch for fs/jbd from Namhyung Kim <namhyung@gmail.com>

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-17 10:44:16 -05:00
Uwe Kleine-König
a34f0b3139 fix comment typos concerning "consistent"
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-12-10 16:04:28 +01:00
yangsheng
0587aa3d11 jbd2: fix /proc/fs/jbd2/<dev> when using an external journal
In jbd2_journal_init_dev(), we need make sure the journal structure is
fully initialzied before calling jbd2_stats_proc_init().

Reviewed-by: Andreas Dilger <andreas.dilger@oracle.com>
Signed-off-by: yangsheng <sheng.yang@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:26 -05:00
Thomas Gleixner
51dfacdef3 jbd2: Convert jbd2_slab_create_sem to mutex
jbd2_slab_create_sem is used as a mutex, so make it one.

[ akpm muttered: We may as well make it local to
jbd2_journal_create_slab() also. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <alpine.LFD.2.00.1010162231480.2496@localhost6.localdomain6>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-10-30 12:12:50 +02:00
Linus Torvalds
81280572ca Merge branch 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
  ext4,jbd2: convert tracepoints to use major/minor numbers
  ext4: optimize orphan_list handling for ext4_setattr
  ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
  ext4: fix compile error in ext4_fallocate()
  ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static
  ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()
  ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static
  ext4: rename {ext,idx}_pblock and inline small extent functions
  ext4: make various ext4 functions be static
  ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
  ext4: fix kernel oops if the journal superblock has a non-zero j_errno
  ext4: update writeback_index based on last page scanned
  ext4: implement writeback livelock avoidance using page tagging
  ext4: tidy up a void argument in inode.c
  ext4: add batched_discard into ext4 feature list
  ext4: Add batched discard support for ext4
  fs: Add FITRIM ioctl
  ext4: Use return value from sb_issue_discard()
  ext4: Check return value of sb_getblk() and friends
  ext4: use bio layer instead of buffer layer in mpage_da_submit_io
  ...
2010-10-27 21:54:31 -07:00
Theodore Ts'o
a107e5a3a4 Merge branch 'next' into upstream-merge
Conflicts:
	fs/ext4/inode.c
	fs/ext4/mballoc.c
	include/trace/events/ext4.h
2010-10-27 23:44:47 -04:00
Linus Torvalds
7d2f280e75 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits)
  quota: Fix possible oops in __dquot_initialize()
  ext3: Update kernel-doc comments
  jbd/2: fixed typos
  ext2: fixed typo.
  ext3: Fix debug messages in ext3_group_extend()
  jbd: Convert atomic_inc() to get_bh()
  ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate()
  jbd: Fix debug message in do_get_write_access()
  jbd: Check return value of __getblk()
  ext3: Use DIV_ROUND_UP() on group desc block counting
  ext3: Return proper error code on ext3_fill_super()
  ext3: Remove unnecessary casts on bh->b_data
  ext3: Cleanup ext3_setup_super()
  quota: Fix issuing of warnings from dquot_transfer
  quota: fix dquot_disable vs dquot_transfer race v2
  jbd: Convert bitops to buffer fns
  ext3/jbd: Avoid WARN() messages when failing to write the superblock
  jbd: Use offset_in_page() instead of manual calculation
  jbd: Remove unnecessary goto statement
  jbd: Use printk_ratelimited() in journal_alloc_journal_head()
  ...
2010-10-27 20:13:18 -07:00
Theodore Ts'o
5c2178e785 jbd2: Add sanity check for attempts to start handle during umount
An attempt to modify the file system during the call to
jbd2_destroy_journal() can lead to a system lockup.  So add some
checking to make it much more obvious when this happens to and to
determine where the offending code is located.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Brian King
39e3ac2599 jbd2: Fix I/O hang in jbd2_journal_release_jbd_inode
This fixes a hang seen in jbd2_journal_release_jbd_inode
on a lot of Power 6 systems running with ext4. When we get
in the hung state, all I/O to the disk in question gets blocked
where we stay indefinitely. Looking at the task list, I can see
we are stuck in jbd2_journal_release_jbd_inode waiting on a
wake up. I added some debug code to detect this scenario and
dump additional data if we were stuck in jbd2_journal_release_jbd_inode
for longer than 30 minutes. When it hit, I was able to see that
i_flags was 0, suggesting we missed the wake up.

This patch changes i_flags to be an unsigned long, uses bit operators
to access it, and adds barriers around the accesses. Prior to applying
this patch, we were regularly hitting this hang on numerous systems
in our test environment. After applying the patch, the hangs no longer
occur.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:25:12 -04:00
Andrea Gelmini
bcf3d0bcff jbd/2: fixed typos
"wakup"

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:05 +02:00
Linus Torvalds
a2887097f2 Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
  xen-blkfront: disable barrier/flush write support
  Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
  block: remove BLKDEV_IFL_WAIT
  aic7xxx_old: removed unused 'req' variable
  block: remove the BH_Eopnotsupp flag
  block: remove the BLKDEV_IFL_BARRIER flag
  block: remove the WRITE_BARRIER flag
  swap: do not send discards as barriers
  fat: do not send discards as barriers
  ext4: do not send discards as barriers
  jbd2: replace barriers with explicit flush / FUA usage
  jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
  jbd: replace barriers with explicit flush / FUA usage
  nilfs2: replace barriers with explicit flush / FUA usage
  reiserfs: replace barriers with explicit flush / FUA usage
  gfs2: replace barriers with explicit flush / FUA usage
  btrfs: replace barriers with explicit flush / FUA usage
  xfs: replace barriers with explicit flush / FUA usage
  block: pass gfp_mask and flags to sb_issue_discard
  dm: convey that all flushes are processed as empty
  ...
2010-10-22 17:07:18 -07:00
Linus Torvalds
e9dd2b6837 Merge branch 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
  cfq-iosched: Fix a gcc 4.5 warning and put some comments
  block: Turn bvec_k{un,}map_irq() into static inline functions
  block: fix accounting bug on cross partition merges
  block: Make the integrity mapped property a bio flag
  block: Fix double free in blk_integrity_unregister
  block: Ensure physical block size is unsigned int
  blkio-throttle: Fix possible multiplication overflow in iops calculations
  blkio-throttle: limit max iops value to UINT_MAX
  blkio-throttle: There is no need to convert jiffies to milli seconds
  blkio-throttle: Fix link failure failure on i386
  blkio: Recalculate the throttled bio dispatch time upon throttle limit change
  blkio: Add root group to td->tg_list
  blkio: deletion of a cgroup was causes oops
  blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
  block: set the bounce_pfn to the actual DMA limit rather than to max memory
  block: revert bad fix for memory hotplug causing bounces
  Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
  block: set the bounce_pfn to the actual DMA limit rather than to max memory
  block: Prevent hang_check firing during long I/O
  cfq: improve fsync performance for small files
  ...

Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
2010-10-22 17:00:32 -07:00
Corrado Zoccolo
749ef9f842 cfq: improve fsync performance for small files
Fsync performance for small files achieved by cfq on high-end disks is
lower than what deadline can achieve, due to idling introduced between
the sync write happening in process context and the journal commit.

Moreover, when competing with a sequential reader, a process writing
small files and fsync-ing them is starved.

This patch fixes the two problems by:
- marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE
  flag set,
- force all queues that have REQ_NOIDLE requests to be put in the noidle
  tree.

Having the queue associated to the fsync-ing process and the one associated
 to journal commits in the noidle tree allows:
- switching between them without idling,
- fairness vs. competing idling queues, since they will be serviced only
  after the noidle tree expires its slice.

Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-20 15:24:50 +02:00
Christoph Hellwig
dd3932eddf block: remove BLKDEV_IFL_WAIT
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller.  To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16 20:52:58 +02:00
Patrick J. LoPresti
1113e1b504 JBD2: Allow feature checks before journal recovery
Before we start accessing a huge (> 16 TiB) OCFS2 volume, we need to
confirm that its journal supports 64-bit offsets.  In particular, we
need to check the journal's feature bits before recovering the journal.

This is not possible with JBD2 at present, because the journal
superblock (where the feature bits reside) is not loaded from disk until
the journal is recovered.

This patch loads the journal superblock in
jbd2_journal_check_used_features() if it has not already been loaded,
allowing us to check the feature bits before journal recovery.

Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 08:41:54 -07:00
Christoph Hellwig
9c35575bbe jbd2: replace barriers with explicit flush / FUA usage
Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the
EOPNOTSUPP detection for barriers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:39 +02:00
Jan Kara
f73bee4985 jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
Currently JBD2 relies blkdev_issue_flush() draining the queue when ASYNC_COMMIT
feature is set. This property is going away so make JBD2 wait for buffers it
needs on its own before submitting the cache flush.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:39 +02:00
Christoph Hellwig
9cb569d601 remove SWRITE* I/O types
These flags aren't real I/O types, but tell ll_rw_block to always
lock the buffer instead of giving up on a failed trylock.

Instead add a new write_dirty_buffer helper that implements this semantic
and use it from the existing SWRITE* callers.  Note that the ll_rw_block
code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
this patch fixes.

In the ufs code clean up the helper that used to call ll_rw_block
to mirror sync_dirty_buffer, which is the function it implements for
compound buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 01:09:01 -04:00
Christoph Hellwig
87e99511ea kill BH_Ordered flag
Instead of abusing a buffer_head flag just add a variant of
sync_dirty_buffer which allows passing the exact type of write
flag required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 01:09:00 -04:00
Theodore Ts'o
6d0bf00512 ext4: clean up compiler warning in start_this_handle()
Fix the compiler warning:

  fs/jbd2/transaction.c: In function ‘start_this_handle’:
  fs/jbd2/transaction.c:98: warning: unused variable ‘ts’

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-09 17:28:38 -04:00
Linus Torvalds
09dc942c2a Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: Adding error check after calling ext4_mb_regular_allocator()
  ext4: Fix dirtying of journalled buffers in data=journal mode
  ext4: re-inline ext4_rec_len_(to|from)_disk functions
  jbd2: Remove t_handle_lock from start_this_handle()
  jbd2: Change j_state_lock to be a rwlock_t
  jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
  ext4: Add mount options in superblock
  ext4: force block allocation on quota_off
  ext4: fix freeze deadlock under IO
  ext4: drop inode from orphan list if ext4_delete_inode() fails
  ext4: check to make make sure bd_dev is set before dereferencing it
  jbd2: Make barrier messages less scary
  ext4: don't print scary messages for allocation failures post-abort
  ext4: fix EFBIG edge case when writing to large non-extent file
  ext4: fix ext4_get_blocks references
  ext4: Always journal quota file modifications
  ext4: Fix potential memory leak in ext4_fill_super
  ext4: Don't error out the fs if the user tries to make a file too big
  ext4: allocate stripe-multiple IOs on stripe boundaries
  ext4: move aio completion after unwritten extent conversion
  ...

Fix up conflicts in fs/ext4/inode.c as per Ted.

Fix up xfs conflicts as per earlier xfs merge.
2010-08-07 13:03:53 -07:00
Theodore Ts'o
8dd420466c jbd2: Remove t_handle_lock from start_this_handle()
This should remove the last exclusive lock from start_this_handle(),
so that we should now be able to start multiple transactions at the
same time on large SMP systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-03 21:38:29 -04:00
Theodore Ts'o
a931da6ac9 jbd2: Change j_state_lock to be a rwlock_t
Lockstat reports have shown that j_state_lock is a major source of
lock contention, especially on systems with more than 4 CPU cores.  So
change it to be a read/write spinlock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-03 21:35:12 -04:00
Theodore Ts'o
a51dca9cd3 jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
By using an atomic_t for t_updates and t_outstanding credits, this
should allow us to not need to take transaction t_handle_lock in
jbd2_journal_stop().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-02 08:43:25 -04:00
Eric Sandeen
cc937db74b jbd2: Make barrier messages less scary
Saying things like "sync failed" when a device does
not support barriers makes users slightly more worried than
they need to be; rather than talking about sync failures,
let's just state the barrier-based facts.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:08 -04:00
Theodore Ts'o
47def82672 jbd2: Remove __GFP_NOFAIL from jbd2 layer
__GFP_NOFAIL is going away, so add our own retry loop.  Also add
jbd2__journal_start() and jbd2__journal_restart() which take a gfp
mask, so that file systems can optionally (re)start transaction
handles using GFP_KERNEL.  If they do this, then they need to be
prepared to handle receiving an PTR_ERR(-ENOMEM) error, and be ready
to reflect that error up to userspace.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:05 -04:00
Jan Kara
13ceef099e jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
OCFS2 uses t_commit trigger to compute and store checksum of the just
committed blocks. When a buffer has b_frozen_data, checksum is computed
for it instead of b_data but this can result in an old checksum being
written to the filesystem in the following scenario:

1) transaction1 is opened
2) handle1 is opened
3) journal_access(handle1, bh)
    - This sets jh->b_transaction to transaction1
4) modify(bh)
5) journal_dirty(handle1, bh)
6) handle1 is closed
7) start committing transaction1, opening transaction2
8) handle2 is opened
9) journal_access(handle2, bh)
    - This copies off b_frozen_data to make it safe for transaction1 to commit.
      jh->b_next_transaction is set to transaction2.
10) jbd2_journal_write_metadata() checksums b_frozen_data
11) the journal correctly writes b_frozen_data to the disk journal
12) handle2 is closed
    - There was no dirty call for the bh on handle2, so it is never queued for
      any more journal operation
13) Checkpointing finally happens, and it just spools the bh via normal buffer
writeback.  This will write b_data, which was never triggered on and thus
contains a wrong (old) checksum.

This patch fixes the problem by calling the trigger at the moment data is
frozen for journal commit - i.e., either when b_frozen_data is created by
do_get_write_access or just before we write a buffer to the log if
b_frozen_data does not exist. We also rename the trigger to t_frozen as
that better describes when it is called.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15 15:17:47 -07:00
Andi Kleen
5a0790c2c4 ext4: remove initialized but not read variables
No real bugs found, just removed some dead code.

Found by gcc 4.6's new warnings.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 13:28:03 -04:00
Linus Torvalds
e4ce30f377 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: Make fsync sync new parent directories in no-journal mode
  ext4: Drop whitespace at end of lines
  ext4: Fix compat EXT4_IOC_ADD_GROUP
  ext4: Conditionally define compat ioctl numbers
  tracing: Convert more ext4 events to DEFINE_EVENT
  ext4: Add new tracepoints to track mballoc's buddy bitmap loads
  ext4: Add a missing trace hook
  ext4: restart ext4_ext_remove_space() after transaction restart
  ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted
  ext4: Avoid crashing on NULL ptr dereference on a filesystem error
  ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
  ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE()
  ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks()
  ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks()
  ext4: Use our own write_cache_pages()
  ext4: Show journal_checksum option
  ext4: Fix for ext4_mb_collect_stats()
  ext4: check for a good block group before loading buddy pages
  ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate
  ext4: Remove extraneous newlines in ext4_msg() calls
  ...

Fixed up trivial conflict in fs/ext4/fsync.c
2010-05-27 10:26:37 -07:00
Jens Axboe
ee9a3607fb Merge branch 'master' into for-2.6.35
Conflicts:
	fs/ext3/fsync.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21 21:27:26 +02:00
Theodore Ts'o
c35a56a090 jbd2: Improve scalability by not taking j_state_lock in jbd2_journal_stop()
One of the most contended locks in the jbd2 layer is j_state_lock when
running dbench.  This is especially true if using the real-time kernel
with its "sleeping spinlocks" patch that replaces spinlocks with
priority inheriting mutexes --- but it also shows up on large SMP
benchmarks.

Thanks to John Stultz for pointing this out.

Reviewed by Mingming Cao and Jan Kara.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 05:00:00 -04:00
Bill Pemberton
8ac97b74bc jbd2: use NULL instead of 0 when pointer is needed
Fixes sparse warning:

fs/jbd2/journal.c:1892:9: warning: Using plain integer as NULL pointer

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-05-11 10:04:36 +02:00
Dmitry Monakhov
fbd9b09a17 blkdev: generalize flags for blkdev_issue_fn functions
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-28 19:47:36 +02:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
dingdinghua
23e2af3518 jbd2: clean up an assertion in jbd2_journal_commit_transaction()
commit_transaction has the same value as journal->j_running_transaction,
so we can simplify the assert statement.

Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 12:11:20 -05:00
dingdinghua
ba869023ea jbd2: delay discarding buffers in journal_unmap_buffer
Delay discarding buffers in journal_unmap_buffer until
we know that "add to orphan" operation has definitely been
committed, otherwise the log space of committing transation
may be freed and reused before truncate get committed, updates
may get lost if crash happens.

Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 16:35:42 -05:00
Theodore Ts'o
d2eecb0393 ext4: Use slab allocator for sub-page sized allocations
Now that the SLUB seems to be fixed so that it respects the requested
alignment, use kmem_cache_alloc() to allocator if the block size of
the buffer heads to be allocated is less than the page size.
Previously, we were using 16k page on a Power system for each buffer,
even when the file system was using 1k or 4k block size.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-07 10:36:20 -05:00
Theodore Ts'o
71f2be213a ext4: Add new tracepoint for jbd2_cleanup_journal_tail
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 07:45:44 -05:00
Andrew Morton
3ebfdf885a jbd2: don't use __GFP_NOFAIL in journal_init_common()
It triggers the warning in get_page_from_freelist(), and it isn't
appropriate to use __GFP_NOFAIL here anyway.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14843

Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 08:05:15 -05:00
Theodore Ts'o
cc3e1bea5d ext4, jbd2: Add barriers for file systems with exernal journals
This is a bit complicated because we are trying to optimize when we
send barriers to the fs data disk.  We could just throw in an extra
barrier to the data disk whenever we send a barrier to the journal
disk, but that's not always strictly necessary.

We only need to send a barrier during a commit when there are data
blocks which are must be written out due to an inode written in
ordered mode, or if fsync() depends on the commit to force data blocks
to disk.  Finally, before we drop transactions from the beginning of
the journal during a checkpoint operation, we need to guarantee that
any blocks that were flushed out to the data disk are firmly on the
rust platter before we drop the transaction from the journal.

Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 06:52:08 -05:00
Yin Kangkai
765f836190 jbd: jbd-debug and jbd2-debug should be writable
jbd-debug and jbd2-debug is currently read-only (S_IRUGO), which is not
correct. Make it writable so that we can start debuging.

Signed-off-by: Yin Kangkai <kangkai.yin@intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23 13:44:13 +01:00
Linus Torvalds
b6e3224fb2 Revert "task_struct: make journal_info conditional"
This reverts commit e4c570c4cb, as
requested by Alexey:

 "I think I gave a good enough arguments to not merge it.
  To iterate:
   * patch makes impossible to start using ext3 on EXT3_FS=n kernels
     without reboot.
   * this is done only for one pointer on task_struct"

  None of config options which define task_struct are tristate directly
  or effectively."

Requested-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-17 13:23:24 -08:00
Hiroshi Shimamoto
e4c570c4cb task_struct: make journal_info conditional
journal_info in task_struct is used in journaling file system only.  So
introduce CONFIG_FS_JOURNAL_INFO and make it conditional.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:27 -08:00
Linus Torvalds
3126c136bc Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
  ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
  ext3: Fix data / filesystem corruption when write fails to copy data
  ext4: Support for 64-bit quota format
  ext3: Support for vfsv1 quota format
  quota: Implement quota format with 64-bit space and inode limits
  quota: Move definition of QFMT_OCFS2 to linux/quota.h
  ext2: fix comment in ext2_find_entry about return values
  ext3: Unify log messages in ext3
  ext2: clear uptodate flag on super block I/O error
  ext2: Unify log messages in ext2
  ext3: make "norecovery" an alias for "noload"
  ext3: Don't update the superblock in ext3_statfs()
  ext3: journal all modifications in ext3_xattr_set_handle
  ext2: Explicitly assign values to on-disk enum of filetypes
  quota: Fix WARN_ON in lookup_one_len
  const: struct quota_format_ops
  ubifs: remove manual O_SYNC handling
  afs: remove manual O_SYNC handling
  kill wait_on_page_writeback_range
  vfs: Implement proper O_SYNC semantics
  ...
2009-12-11 15:31:13 -08:00
Christoph Hellwig
94004ed726 kill wait_on_page_writeback_range
All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:50 +01:00
Theodore Ts'o
3b799d15f2 jbd2: Export jbd2_log_start_commit to fix ext4 build
This fixes:
    ERROR: "jbd2_log_start_commit" [fs/ext4/ext4.ko] undefined!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 20:42:53 -05:00
Theodore Ts'o
e6ec116b67 jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()
OOM happens.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-01 09:04:42 -05:00
Theodore Ts'o
e6a47428de jbd2: don't wipe the journal on a failed journal checksum
If there is a failed journal checksum, don't reset the journal.  This
allows for userspace programs to decide how to recover from this
situation.  It may be that ignoring the journal checksum failure might
be a better way of recovering the file system.  Once we add per-block
checksums, we can definitely do better.  Until then, a system
administrator can try backing up the file system image (or taking a
snapshot) and and trying to determine experimentally whether ignoring
the checksum failure or aborting the journal replay results in less
data loss.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-15 15:31:37 -05:00
Tao Ma
7b02bec07e JBD/JBD2: free j_wbuf if journal init fails.
If journal init fails, we need to free j_wbuf.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-11-11 15:24:14 +01:00
Alexey Dobriyan
828c09509b const: constify remaining file_operations
[akpm@linux-foundation.org: fix KVM]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-01 16:11:11 -07:00
Theodore Ts'o
bf6993276f jbd2: Use tracepoints for history file
The /proc/fs/jbd2/<dev>/history was maintained manually; by using
tracepoints, we can get all of the existing functionality of the /proc
file plus extra capabilities thanks to the ftrace infrastructure.  We
save memory as a bonus.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 00:32:06 -04:00
Theodore Ts'o
90576c0b9a ext4, jbd2: Drop unneeded printks at mount and unmount time
There are a number of kernel printk's which are printed when an ext4
filesystem is mounted and unmounted.  Disable them to economize space
in the system logs.  In addition, disabling the mballoc stats by
default saves a number of unneeded atomic operations for every block
allocation or deallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 15:51:30 -04:00
James Morris
88e9d34c72 seq_file: constify seq_operations
Make all seq_operations structs const, to help mitigate against
revectoring user-triggerable function pointers.

This is derived from the grsecurity patch, although generated from scratch
because it's simpler than extracting the changes from there.

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:29 -07:00
Theodore Ts'o
0e3d2a6313 ext4: Fix async commit mode to be safe by using a barrier
Previously the journal_async_commit mount option was equivalent to
using barrier=0 (and just as unsafe).  This patch fixes it so that we
eliminate the barrier before the commit block (by not using ordered
mode), and explicitly issuing an empty barrier bio after writing the
commit block.  Because of the journal checksum, it is safe to do this;
if the journal blocks are not all written before a power failure, the
checksum in the commit block will prevent the last transaction from
being replayed.

Using the fs_mark benchmark, using journal_async_commit shows a 50%
improvement:

FSUse%        Count         Size    Files/sec     App Overhead
     8         1000        10240         30.5            28242

vs.

FSUse%        Count         Size    Files/sec     App Overhead
     8         1000        10240         45.8            28620


Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-11 09:30:12 -04:00
Jan Kara
9599b0e597 jbd2: Annotate transaction start also for jbd2_journal_restart()
lockdep annotation for a transaction start has been at the end of
jbd2_journal_start(). But a transaction is also started from
jbd2_journal_restart(). Move the lockdep annotation to start_this_handle()
which covers both cases.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17 21:23:17 -04:00
Andreas Dilger
b1f485f20e jbd2: round commit timer up to avoid uncommitted transaction
fix jiffie rounding in jbd commit timer setup code.  Rounding down
could cause the timer to be fired before the corresponding transaction
has expired.  That transaction can stay not committed forever if no
new transaction is created or expicit sync/umount happens.

Signed-off-by: Alex Zhuravlev (Tomas) <alex.zhuravlev@sun.com>
Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-10 22:51:53 -04:00
Jan Kara
f6f50e28f0 jbd2: Fail to load a journal if it is too short
Due to on disk corruption, it can happen that journal is too short. Fail
to load it in such case so that we don't oops somewhere later.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-17 10:40:01 -04:00
Jens Axboe
1fe06ad892 writeback: get rid of wbc->for_writepages
It's only set, it's never checked. Kill it.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-16 15:16:18 +02:00
dingdinghua
96577c4382 jbd2: fix race between write_metadata_buffer and get_write_access
The function jbd2_journal_write_metadata_buffer() calls
jbd_unlock_bh_state(bh_in) too early; this could potentially allow
another thread to call get_write_access on the buffer head, modify the
data, and dirty it, and allowing the wrong data to be written into the
journal.  Fortunately, if we lose this race, the only time this will
actually cause filesystem corruption is if there is a system crash or
other unclean shutdown of the system before the next commit can take
place.

Signed-off-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 17:55:35 -04:00
Jan Kara
f91d1d0417 jbd2: Fix a race between checkpointing code and journal_get_write_access()
The following race can happen:

 CPU1                          CPU2
                               checkpointing code checks the buffer, adds
                                 it to an array for writeback
 do_get_write_access()
 ...
 lock_buffer()
 unlock_buffer()
                               flush_batch() submits the buffer for IO
 __jbd2_journal_file_buffer()

So a buffer under writeout is returned from
do_get_write_access(). Since the filesystem code relies on the fact
that journaled buffers cannot be written out, it does not take the
buffer lock and so it can modify buffer while it is under
writeout. That can lead to a filesystem corruption if we crash at the
right moment.

We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on BJ_None list. Actually, we clear the dirty
bit regardless the list the buffer is in and warn about the fact if
the buffer is already journalled.

Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.

Reported-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 16:16:20 -04:00
Theodore Ts'o
b574480507 jbd2: Remove GFP_ATOMIC kmalloc from inside spinlock critical region
Fix jbd2_dev_to_name(), a function used when pretty-printting jbd2 and
ext4 tracepoints.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-20 23:34:44 -04:00
Hisashi Hifumi
536fc240e7 jbd2: clean up jbd2_journal_try_to_free_buffers()
This patch reverts 3f31fddf, which is no longer needed because if a
race between freeing buffer and committing transaction functionality
occurs and dio gets error, currently dio falls back to buffered IO due
to the commit 6ccfa806.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 20:08:51 -04:00
Theodore Ts'o
879c5e6b7c jbd2: convert instrumentation from markers to tracepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 11:47:48 -04:00
Alberto Bertogli
bfcd3555af jbd2: Fix minor typos in comments in fs/jbd2/journal.c
Signed-off-by: Alberto Bertogli <albertito@blitiri.com.ar>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2009-06-09 00:06:20 -04:00
Theodore Ts'o
67c457a8c3 jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records
The revoke records must be written using the same way as the rest of
the blocks during the commit process; that is, either marked as
synchronous writes or as asynchornous writes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-14 07:50:56 -04:00
Jens Axboe
4194b1eaf1 jbd2: use WRITE_SYNC_PLUG instead of WRITE_SYNC
When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:54 -07:00
Jan Kara
86db97c87f jbd2: Update locking coments
Update information about locking in JBD2 revoke code. Inconsistency in
comments found by Lin Tan <tammy000@gmail.com>.

CC: Lin Tan <tammy000@gmail.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-27 17:20:40 -04:00
Theodore Ts'o
7058548cd5 ext4: Use WRITE_SYNC for commits which are caused by fsync()
If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-25 23:35:46 -04:00
Jan Kara
7f5aa21508 jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
If we race with commit code setting i_transaction to NULL, we could
possibly dereference it.  Proper locking requires the journal pointer
(to access journal->j_list_lock), which we don't have.  So we have to
change the prototype of the function so that filesystem passes us the
journal pointer.  Also add a more detailed comment about why the
function jbd2_journal_begin_ordered_truncate() does what it does and
how it should be used.

Thanks to Dan Carpenter <error27@gmail.com> for pointing to the
suspitious code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Joel Becker <joel.becker@oracle.com>
CC: linux-ext4@vger.kernel.org
CC: ocfs2-devel@oss.oracle.com
CC: mfasheh@suse.de
CC: Dan Carpenter <error27@gmail.com>
2009-02-10 11:15:34 -05:00
Jan Kara
c88ccea314 jbd2: Fix return value of jbd2_journal_start_commit()
The function jbd2_journal_start_commit() returns 1 if either a
transaction is committing or the function has queued a transaction
commit. But it returns 0 if we raced with somebody queueing the
transaction commit as well. This resulted in ext4_sync_fs() not
functioning correctly (description from Arthur Jones): 

   In the case of a data=ordered umount with pending long symlinks
   which are delayed due to a long list of other I/O on the backing
   block device, this causes the buffer associated with the long
   symlinks to not be moved to the inode dirty list in the second
   phase of fsync_super.  Then, before they can be dirtied again,
   kjournald exits, seeing the UMOUNT flag and the dirty pages are
   never written to the backing block device, causing long symlink
   corruption and exposing new or previously freed block data to
   userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes jbd2_journal_start_commit() to always return 1 when
there's a transaction committing or queued for commit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CC: Eric Sandeen <sandeen@redhat.com>
CC: linux-ext4@vger.kernel.org
2009-02-10 11:27:46 -05:00