(This article is for SLE11-SP3, which is based on the Linux 3.0 kernel.)
Recently, people have reported that on SLE11-SP3 I/O requests are not merged on a device mapper target, and ‘iostat’ shows an average request size of only 4KB.
This is not a bug, and it has no negative performance impact. Here I try to explain why it is not a bug and how it happens. A few Linux file system and block layer details will be mentioned, but they won’t be complex to understand.
The story is this: on a SLE11-SP3 machine, an LVM volume is created (as a linear device mapper target), and ‘dd’ is used to generate sequential WRITE I/Os to this volume. People tried both buffered I/O and direct I/O with the ‘dd’ command, on the raw device mapper target and on an ext3 file system on top of the device mapper target. So there are 4 conditions:
1) buffered I/O on raw device mapper target,
dd if=/dev/zero of=/dev/dm-0 bs=1M
2) direct I/O on raw device mapper target,
dd if=/dev/zero of=/dev/dm-0 bs=1M oflag=direct
3) buffered I/O on ext3 file system (on top of the device mapper target),
dd if=/dev/zero of=/mnt/img bs=1M
4) direct I/O on ext3 file system (on top of the device mapper target),
dd if=/dev/zero of=/mnt/img bs=1M oflag=direct
For 2) and 4), large request sizes are observed, from hundreds to thousands of sectors; the maximum request size is 2048 sectors (because bs=1M). But for 1) and 3), all the request sizes that ‘iostat’ reports for device mapper target dm-0 are 8 sectors (4KB).
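The numbers above can be cross-checked by hand. The following Python sketch assumes the usual /proc/diskstats field layout and computes the average request size as sectors transferred divided by I/Os completed over an interval, which is, as far as I know, how ‘iostat’ derives its average request size. The helper names and the two sample lines are made up for illustration; they are not real measurements.

```python
def parse_diskstats_line(line):
    """Pick the counters we need out of one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "rd_ios": int(f[3]),      # reads completed
        "rd_sectors": int(f[5]),  # sectors read
        "wr_ios": int(f[7]),      # writes completed
        "wr_sectors": int(f[9]),  # sectors written
    }


def avg_request_sectors(before, after):
    """Average request size, in 512-byte sectors, between two snapshots."""
    ios = (after["rd_ios"] - before["rd_ios"]) + (
        after["wr_ios"] - before["wr_ios"])
    sectors = (after["rd_sectors"] - before["rd_sectors"]) + (
        after["wr_sectors"] - before["wr_sectors"])
    return sectors / ios if ios else 0.0


# Made-up snapshots of dm-0 taken before and after a buffered 'dd' run:
# 256 more writes completed, 2048 more sectors written.
t0 = parse_diskstats_line("253 0 dm-0 100 0 800 10 1000 0 8000 50 0 60 60")
t1 = parse_diskstats_line("253 0 dm-0 100 0 800 10 1256 0 10048 70 0 80 80")
print(avg_request_sectors(t0, t1))
```

With these sample counters the result is 8 sectors per request, i.e. 4KB, which matches what is seen in conditions 1) and 3).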
The question is: sequential write I/Os are supposed to be merged into larger ones, so why is the request size reported by ‘iostat’ for device mapper target /dev/dm-0 only 4KB, rather than something larger?
First, the simple answer: a) this device mapper target does not merge small bios into larger ones, and b) the upper layer code only issues 4KB bios to the device mapper layer.
Let me explain these 2 points in detail. For direct I/O, the request size seen by the device mapper target is the actual size sent from the upper layer; it may come directly from the application buffer, or be adjusted by the file system. So we only need to look at the buffered I/O cases.
a) device mapper target does not merge bios
Device mapper only handles bios. A linear device mapper target (a common and simple LVM volume) only re-maps the original bio from the logical device mapper target to the actual underlying storage device, or splits the bio (of the device mapper target) into smaller ones if the original bio crosses multiple underlying storage devices. It never combines small bios into larger ones; it just re-maps the bios and submits them to the underlying block layer. The elevator, a.k.a. the I/O scheduler, handles request merging and scheduling; device mapper does not.
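To make the difference between re-mapping and merging concrete, here is a toy Python model. It is not kernel code: the Bio and Segment tuples, the function names, the device names and the 2048-sector request limit are all assumptions chosen for illustration. linear_map() only re-maps a bio, splitting it where it crosses a segment boundary, while elevator_back_merge() stands in for the merging that happens later in the I/O scheduler of the underlying device.

```python
from collections import namedtuple

# A bio is modelled as (device, start sector, length in sectors).
Bio = namedtuple("Bio", "dev sector nr_sectors")
# Linear table entry: logical [start, start+length) maps to dev at offset.
Segment = namedtuple("Segment", "start length dev offset")


def linear_map(bio, table):
    """Re-map one bio; split it only where it crosses a segment boundary.

    The table is assumed to be sorted by start and free of gaps.
    """
    out = []
    sector, remaining = bio.sector, bio.nr_sectors
    for seg in table:
        seg_end = seg.start + seg.length
        if remaining == 0 or not (seg.start <= sector < seg_end):
            continue
        n = min(remaining, seg_end - sector)
        out.append(Bio(seg.dev, seg.offset + (sector - seg.start), n))
        sector += n
        remaining -= n
    return out


def elevator_back_merge(bios, max_sectors=2048):
    """Append each bio to the previous request when it is contiguous."""
    requests = []
    for b in bios:
        last = requests[-1] if requests else None
        if (last and last.dev == b.dev
                and last.sector + last.nr_sectors == b.sector
                and last.nr_sectors + b.nr_sectors <= max_sectors):
            requests[-1] = last._replace(
                nr_sectors=last.nr_sectors + b.nr_sectors)
        else:
            requests.append(b)
    return requests


# Hypothetical volume made of two segments; the boundary at sector 1004
# deliberately falls in the middle of one 8-sector bio.
table = [Segment(0, 1004, "sda1", 2048), Segment(1004, 2000, "sdb1", 2048)]
# 256 sequential 8-sector bios, as sent by buffered writeback.
bios = [Bio("dm-0", s, 8) for s in range(0, 2048, 8)]
mapped = [m for b in bios for m in linear_map(b, table)]
print(len(mapped), max(m.nr_sectors for m in mapped))
print(elevator_back_merge(mapped))
```

In this model the device mapper step never produces a bio larger than the 8 sectors it received, and only the later merge step builds large requests. That is why the statistics of dm-0 and of the underlying disk can look so different for the same workload.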
b) upper layer code issues only 4KB size bios
For buffered I/O, the file system only dirties the pages that contain the data to be written to disk; the actual write is performed automatically by the write back and journal code:
– journal: ext3 uses jbd to handle journaling. In data=ordered mode, jbd only handles metadata blocks, and submits the metadata I/Os through buffer heads, which means the maximum size is one page (4KB).
– write back: the kernel code that actually submits I/O to disk in the write back code path is mm/page-writeback.c:do_writepages(). In SLE11-SP3 it looks like this,
    int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
    {
            int ret;

            if (wbc->nr_to_write <= 0)
                    return 0;
            if (mapping->a_ops->writepages)
                    ret = mapping->a_ops->writepages(mapping, wbc);
            else
                    ret = generic_writepages(mapping, wbc);
            return ret;
    }
A device mapper target is created on devtmpfs, which does not have a writepages() method defined. Ext3 does not have a writepages() method defined in its a_ops set either, so both conditions go into generic_writepages().
Inside generic_writepages(), the I/O code path is: generic_writepages()==>write_cache_pages()==>__writepage()==>mapping->a_ops->writepage(). For the different conditions, the implementations of mapping->a_ops->writepage() are different.
b.1) raw device mapper target
In SLE11-SP3, the block device mapping->a_ops->writepage() is defined as fs/block_dev.c:blkdev_writepage(), and its code path to submit I/O is: blkdev_writepage()==>block_write_full_page()==>block_write_full_page_endio()==>__block_write_full_page(). In __block_write_full_page(), a buffer head covering this page is finally submitted to the underlying block layer by submit_bh(). So the device mapper layer only receives bios of 4KB size in this case.
b.2) ext3 file system on top of raw device mapper target
In SLE11-SP3, the mapping->a_ops->writepage() method in the ext3 file system is defined in three ways, corresponding to the three different data journal modes. Here I use data=ordered mode as the example. Ext3 uses jbd as its journaling infrastructure; when the journal works in data=ordered mode (the default mode in SLE11-SP3), mapping->a_ops->writepage() is defined as fs/ext3/inode.c:ext3_ordered_writepage(). Inside this function, block_write_full_page() is called to write the page to the block layer and, the same as in the raw device mapper target condition, submit_bh() is finally called to submit a bio with one page to the device mapper layer. Therefore in this case, the device mapper target still only receives bios of 4KB size.
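The two cases above share the same shape, which can be sketched in a few lines of Python. This is only a toy model of the code path described in this article: the functions borrow the kernel names for readability but are not the kernel implementations, the a_ops dictionary is a hypothetical stand-in for the real operations table, and the pages are assumed to sit contiguously on disk.

```python
PAGE_SIZE = 4096
SECTOR_SIZE = 512


def block_writepage(index):
    """Submit a single page as one bio: (start sector, length in sectors)."""
    per_page = PAGE_SIZE // SECTOR_SIZE
    return (index * per_page, per_page)


def generic_writepages(dirty_pages, writepage):
    """Toy generic path: one writepage() call, hence one bio, per page."""
    return [writepage(index) for index in sorted(dirty_pages)]


def do_writepages(a_ops, dirty_pages):
    """Use writepages() if the mapping defines it, else the generic path."""
    if a_ops.get("writepages"):
        return a_ops["writepages"](dirty_pages)
    return generic_writepages(dirty_pages, a_ops["writepage"])


# 1MB written by 'dd bs=1M' dirties 256 pages of 4KB each.
dirty = set(range((1024 * 1024) // PAGE_SIZE))
# No "writepages" entry, as described for both conditions in the text.
bios = do_writepages({"writepage": block_writepage}, dirty)
print(len(bios), {n for _, n in bios})
```

In this model a 1MB buffered write turns into 256 separate bios of 8 sectors each, which is exactly the request size that ‘iostat’ shows on dm-0.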
Finally, let’s go back to my first simple answer: a) this device mapper target does not merge small bios into larger ones, and b) the upper layer code only issues 4KB bios to the device mapper layer.