Univention Bugzilla – Attachment 4217 Details for
Bug 26192
ext4-Korruption bei Verwendung von sparse-Files, AIO
Home
|
New
|
Browse
|
Search
|
[?]
|
Reports
|
Requests
|
Help
|
New Account
|
Log In
[x]
|
Forgot Password
Login:
[x]
[patch]
Backport e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d: ext4: serialize unaligned asynchronous DIO
26192_ext4-serialize-unaligned-asynchronous-DIO.patch (text/plain), 8.90 KB, created by
Philipp Hahn
on 2012-02-23 14:29:26 CET
(
hide
)
Description:
Backport e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d: ext4: serialize unaligned asynchronous DIO
Filename:
MIME Type:
Creator:
Philipp Hahn
Created:
2012-02-23 14:29:26 CET
Size:
8.90 KB
patch
obsolete
>commit e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d >Author: Eric Sandeen <sandeen@redhat.com> >Date: Sat Feb 12 08:17:34 2011 -0500 > > ext4: serialize unaligned asynchronous DIO > > ext4 has a data corruption case when doing non-block-aligned > asynchronous direct IO into a sparse file, as demonstrated > by xfstest 240. > > The root cause is that while ext4 preallocates space in the > hole, mappings of that space still look "new" and > dio_zero_block() will zero out the unwritten portions. When > more than one AIO thread is going, they both find this "new" > block and race to zero out their portion; this is uncoordinated > and causes data corruption. > > Dave Chinner fixed this for xfs by simply serializing all > unaligned asynchronous direct IO. I've done the same here. > The difference is that we only wait on conversions, not all IO. > This is a very big hammer, and I'm not very pleased with > stuffing this into ext4_file_write(). But since ext4 is > DIO_LOCKING, we need to serialize it at this high level. > > I tried to move this into ext4_ext_direct_IO, but by then > we have the i_mutex already, and we will wait on the > work queue to do conversions - which must also take the > i_mutex. So that won't work. > > This was originally exposed by qemu-kvm installing to > a raw disk image with a normal sector-63 alignment. I've > tested a backport of this patch with qemu, and it does > avoid the corruption. It is also quite a lot slower > (14 min for package installs, vs. 8 min for well-aligned) > but I'll take slow correctness over fast corruption any day. > > Mingming suggested that we can track outstanding > conversions, and wait on those so that non-sparse > files won't be affected, and I've implemented that here; > unaligned AIO to nonsparse files won't take a perf hit. > > [tytso@mit.edu: Keep the mutex as a hashed array instead > of bloating the ext4 inode] > > [tytso@mit.edu: Fix up namespace issues so that global > variables are protected with an "ext4_" prefix.] > > Signed-off-by: Eric Sandeen <sandeen@redhat.com> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > >diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h >index 67c46ed..10edb67 100644 >--- a/fs/ext4/ext4.h >+++ b/fs/ext4/ext4.h >@@ -781,6 +781,8 @@ struct ext4_inode_info { > /* on-disk additional length */ > __u16 i_extra_isize; > >+ atomic_t i_aiodio_unwritten; /* Nr. of inflight conversions pending */ >+ > spinlock_t i_block_reservation_lock; > #ifdef CONFIG_QUOTA > /* quota space reservation, managed internally by quota code */ >@@ -1889,6 +1891,15 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh) > > #define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1) > >+/* For ioend & aio unwritten conversion wait queues */ >+#define EXT4_WQ_HASH_SZ 37 >+#define ext4_ioend_wq(v) (&ext4__ioend_wq[((unsigned long)(v)) %\ >+ EXT4_WQ_HASH_SZ]) >+#define ext4_aio_mutex(v) (&ext4__aio_mutex[((unsigned long)(v)) %\ >+ EXT4_WQ_HASH_SZ]) >+extern wait_queue_head_t ext4__ioend_wq[EXT4_WQ_HASH_SZ]; >+extern struct mutex ext4__aio_mutex[EXT4_WQ_HASH_SZ]; >+ > #endif /* __KERNEL__ */ > > #endif /* _EXT4_H */ >diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c >index ecbe20d..f4830e6 100644 >--- a/fs/ext4/extents.c >+++ b/fs/ext4/extents.c >@@ -3118,9 +3118,10 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, > * that this IO needs to convertion to written when IO is > * completed > */ >- if (io) >+ if (io && !(io->flag & DIO_AIO_UNWRITTEN)) { > io->flag = DIO_AIO_UNWRITTEN; >- else >+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); >+ } else > ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN); > goto out; > } >@@ -3404,9 +3405,10 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, > * that we need to perform convertion when IO is done. > */ > if (flags == EXT4_GET_BLOCKS_DIO_CREATE_EXT) { >- if (io) >+ if (io && !(io->flag & DIO_AIO_UNWRITTEN)) { > io->flag = DIO_AIO_UNWRITTEN; >- else >+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); >+ } else > ext4_set_inode_state(inode, > EXT4_STATE_DIO_UNWRITTEN); > } >diff --git a/fs/ext4/file.c b/fs/ext4/file.c >index 2a60541..a18b45c 100644 >--- a/fs/ext4/file.c >+++ b/fs/ext4/file.c >@@ -54,11 +54,47 @@ static int ext4_release_file(struct inode *inode, struct file *filp) > return 0; > } > >+static void ext4_aiodio_wait(struct inode *inode) >+{ >+ wait_queue_head_t *wq = ext4_ioend_wq(inode); >+ >+ wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_aiodio_unwritten) == 0)); >+} >+ >+/* >+ * This tests whether the IO in question is block-aligned or not. >+ * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they >+ * are converted to written only after the IO is complete. Until they are >+ * mapped, these blocks appear as holes, so dio_zero_block() will assume that >+ * it needs to zero out portions of the start and/or end block. If 2 AIO >+ * threads are at work on the same unwritten block, they must be synchronized >+ * or one thread will zero the other's data, causing corruption. >+ */ >+static int >+ext4_unaligned_aio(struct inode *inode, const struct iovec *iov, >+ unsigned long nr_segs, loff_t pos) >+{ >+ struct super_block *sb = inode->i_sb; >+ int blockmask = sb->s_blocksize - 1; >+ size_t count = iov_length(iov, nr_segs); >+ loff_t final_size = pos + count; >+ >+ if (pos >= inode->i_size) >+ return 0; >+ >+ if ((pos & blockmask) || (final_size & blockmask)) >+ return 1; >+ >+ return 0; >+} >+ > static ssize_t > ext4_file_write(struct kiocb *iocb, const struct iovec *iov, > unsigned long nr_segs, loff_t pos) > { > struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode; >+ int unaligned_aio = 0; >+ int ret; > > /* > * If we have encountered a bitmap-format file, the size limit >@@ -76,9 +112,31 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, > nr_segs = iov_shorten((struct iovec *)iov, nr_segs, > sbi->s_bitmap_maxbytes - pos); > } >+ } else if (unlikely((iocb->ki_filp->f_flags & O_DIRECT) && >+ !is_sync_kiocb(iocb))) { >+ unaligned_aio = ext4_unaligned_aio(inode, iov, nr_segs, pos); > } > >- return generic_file_aio_write(iocb, iov, nr_segs, pos); >+ /* Unaligned direct AIO must be serialized; see comment above */ >+ if (unaligned_aio) { >+ static unsigned long unaligned_warn_time; >+ >+ /* Warn about this once per day */ >+ if (printk_timed_ratelimit(&unaligned_warn_time, 60*60*24*HZ)) >+ ext4_msg(inode->i_sb, KERN_WARNING, >+ "Unaligned AIO/DIO on inode %ld by %s; " >+ "performance will be poor.", >+ inode->i_ino, current->comm); >+ mutex_lock(ext4_aio_mutex(inode)); >+ ext4_aiodio_wait(inode); >+ } >+ >+ ret = generic_file_aio_write(iocb, iov, nr_segs, pos); >+ >+ if (unaligned_aio) >+ mutex_unlock(ext4_aio_mutex(inode)); >+ >+ return ret; > } > > static const struct vm_operations_struct ext4_file_vm_ops = { >diff --git a/fs/ext4/super.c b/fs/ext4/super.c >index f27e045..2cb8748 100644 >--- a/fs/ext4/super.c >+++ b/fs/ext4/super.c >@@ -713,6 +713,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb) > ei->cur_aio_dio = NULL; > ei->i_sync_tid = 0; > ei->i_datasync_tid = 0; >+ atomic_set(&ei->i_aiodio_unwritten, 0); > > return &ei->vfs_inode; > } >@@ -4002,11 +4003,21 @@ static struct file_system_type ext4_fs_type = { > .fs_flags = FS_REQUIRES_DEV, > }; > >+/* Shared across all ext4 file systems */ >+wait_queue_head_t ext4__ioend_wq[EXT4_WQ_HASH_SZ]; >+struct mutex ext4__aio_mutex[EXT4_WQ_HASH_SZ]; >+ > static int __init init_ext4_fs(void) > { >- int err; >+ int i, err; > > ext4_check_flag_values(); >+ >+ for (i = 0; i < EXT4_WQ_HASH_SZ; i++) { >+ mutex_init(&ext4__aio_mutex[i]); >+ init_waitqueue_head(&ext4__ioend_wq[i]); >+ } >+ > err = init_ext4_system_zone(); > if (err) > return err; >diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >--- a/fs/ext4/inode.c 2012-02-22 18:21:15.000000000 +0100 >+++ b/fs/ext4/inode.c 2012-02-23 10:53:48.893489058 +0100 >@@ -3609,6 +3609,7 @@ > struct inode *inode = io->inode; > loff_t offset = io->offset; > ssize_t size = io->size; >+ wait_queue_head_t *wq; > int ret = 0; > > ext4_debug("end_aio_dio_onlock: io 0x%p from inode %lu,list->next 0x%p," >@@ -3619,7 +3620,7 @@ > if (list_empty(&io->list)) > return ret; > >- if (io->flag != DIO_AIO_UNWRITTEN) >+ if (!(io->flag & DIO_AIO_UNWRITTEN)) > return ret; > > if (offset + size <= i_size_read(inode)) >@@ -3633,7 +3634,16 @@ > } > > /* clear the DIO AIO unwritten flag */ >- io->flag = 0; >+ if (io->flag & DIO_AIO_UNWRITTEN) { >+ io->flag &= ~DIO_AIO_UNWRITTEN; >+ /* Wake up anyone waiting on unwritten extent conversion */ >+ wq = ext4_ioend_wq(io->inode); >+ if (atomic_dec_and_test(&EXT4_I(inode)->i_aiodio_unwritten) && >+ waitqueue_active(wq)) { >+ wake_up_all(wq); >+ } >+ } >+ > return ret; > } > /* >@@ -3747,7 +3757,7 @@ > size); > > /* if not aio dio with unwritten extents, just free io and return */ >- if (io_end->flag != DIO_AIO_UNWRITTEN){ >+ if (!(io_end->flag & DIO_AIO_UNWRITTEN)) { > ext4_free_io_end(io_end); > iocb->private = NULL; > return;
You cannot view the attachment while viewing its details because your browser does not support IFRAMEs.
View the attachment on a separate page
.
View Attachment As Diff
View Attachment As Raw
Actions:
View
|
Diff
Attachments on
bug 26192
:
4192
|
4212
| 4217