An enjoyed kernel apprentice

June 27, 2010

Random I/O — Is raw device always faster than file system?

Filed under: File System Magic — colyli @ 8:53 am

For some implementations of distributed file systems, like TFS [1], the developers think that storing data directly on raw devices (e.g. /dev/sdb, /dev/sdc, …) might be faster than storing it on file systems.

Their choice is reasonable:

1, Random I/O on large files gets no help from the file system page cache.

2, The <logical offset, physical offset> mapping introduces more I/O on file systems than on a raw disk.

3, Managing metadata on separate, powerful servers removes the need to use file systems on the data nodes.

The penalty for this “higher” performance is management cost; storing data on raw devices introduces difficulties like:

1, It is harder to back up and restore the data.

2, Flexible management is impossible without special tools for the raw device.

3, There is no convenient way to access or manage the data on the raw device.

These penalties are hard for system administrators to ignore. Furthermore, the story of “higher” performance is not exactly true today:

1, For file systems using block pointers for the <logical offset, physical offset> mapping, a large file takes too many pointer blocks. For example, on Ext3 with 4KB blocks, a 2TB file needs around 520K+ pointer blocks. Most of these pointer blocks are cold during random I/O, which results in lower random I/O performance than on a raw device.

2, For file systems using extents for the <logical offset, physical offset> mapping, the number of extent blocks depends on how many fragments a large file has. For example, on Ext4 with a maximum block group size of 128MB, a 2TB file has around 16384 fragments. Mapping these 16K fragments needs 16K extent records, which fit in 50+ extent blocks. So it is very easy to hit a hot extent in memory during random I/O on a large file.

3, If the <logical offset, physical offset> mapping can be kept hot in memory, random I/O performance on a file system should be no worse than on a raw device.
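The arithmetic behind points 1 and 2 can be checked directly. The sketch below is a back-of-the-envelope estimate, not an exact on-disk layout; it assumes 4KB blocks, 4-byte block pointers on Ext3, 12-byte extent records on Ext4, and one extent per 128MB block group:

```python
# Back-of-the-envelope estimate of mapping metadata for a 2TB file.
# Assumptions: 4KB blocks, 4-byte block pointers (Ext3),
# 12-byte extent records (Ext4), one extent per 128MB block group.

FILE_SIZE = 2 * 2**40                # 2TB
BLOCK_SIZE = 4096                    # 4KB file system block

# --- Ext3: block-pointer mapping ---
data_blocks = FILE_SIZE // BLOCK_SIZE            # 536,870,912 data blocks
ptrs_per_block = BLOCK_SIZE // 4                 # 1024 pointers per indirect block
single = data_blocks // ptrs_per_block           # 524,288 single-indirect blocks
double = single // ptrs_per_block                # 512 double-indirect blocks
triple = 1                                       # one triple-indirect block
ext3_pointer_blocks = single + double + triple   # ~520K+ pointer blocks

# --- Ext4: extent mapping ---
FRAGMENT_SIZE = 128 * 2**20                      # worst case: one extent per 128MB
extents = FILE_SIZE // FRAGMENT_SIZE             # 16384 extents
recs_per_block = BLOCK_SIZE // 12                # ~341 extent records per 4KB block
ext4_extent_blocks = -(-extents // recs_per_block)  # ceil division: ~49 blocks

print(ext3_pointer_blocks)   # 524801
print(extents)               # 16384
print(ext4_extent_blocks)    # 49
```

So Ext4 needs roughly 200KB of extent blocks to map the whole file, while Ext3 needs roughly 2GB of pointer blocks; the former fits comfortably in memory, the latter cannot stay hot under random I/O.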

To verify my guess, I did some performance testing. I share part of the data here.

Processor: AMD Opteron 6174 (2.2GHz) x 2

Memory: DDR3 1333MHz 4GB x 4

Hard disk: 5400RPM SATA 2TB x 3 [2]

File size: almost 2TB (created by dd)

Random I/O access: 100K random reads

IO size: 512 bytes

File systems: Ext3, Ext4 (with and without directio)

Test tool: seekrw [3]
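For readers who cannot fetch the seekrw source [3], the core of such a test is a loop of random positioned reads over the file. Here is a minimal sketch in Python (the real tool is a separate program; the function name is mine, and the -d / O_DIRECT mode is omitted because it requires aligned buffers):

```python
import os
import random
import time

def random_read_test(path, count, io_size=512):
    """Issue `count` random reads of `io_size` bytes across the file,
    mimicking `seekrw -a count -l io_size -r` (buffered I/O only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        nblocks = size // io_size            # number of io_size-sized chunks
        start = time.monotonic()
        for _ in range(count):
            # read one chunk at a random io_size-aligned offset
            offset = random.randrange(nblocks) * io_size
            data = os.pread(fd, io_size, offset)
            assert len(data) == io_size
        return time.monotonic() - start      # wall clock seconds
    finally:
        os.close(fd)
```

Something like random_read_test('/mnt/ext4/img', 100000) would then correspond to the buffered-I/O runs below; with the file far larger than RAM, almost every read misses the page cache and hits the disk.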

* With page cache

- Command

seekrw -f /mnt/ext3/img -a 100000 -l 512 -r

seekrw -f /mnt/ext4/img -a 100000 -l 512 -r

- Performance result

Device       tps     Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
Ext3 (sdc)   95.88   767.07       0            46024      0
Ext4 (sdd)   60.72   485.60       0            29136      0

- Wall clock time

Ext3: real time: 34 minutes 23 seconds 557537 usec

Ext4: real time: 24 minutes 44 seconds 10118 usec

* directio (without pagecache)

- Command

seekrw -f /mnt/ext3/img -a 100000 -l 512 -r -d

seekrw -f /mnt/ext4/img -a 100000 -l 512 -r -d

- Performance result

Device       tps     Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
Ext3 (sdc)   94.93   415.77       0            12473      0
Ext4 (sdd)   67.90   67.90        0            2037       0
Raw  (sdf)   67.27   538.13       0            16144      0

- Wall clock time

Ext3: real time: 33 minutes 26 seconds 947875 usec

Ext4: real time: 24 minutes 25 seconds 545536 usec

sdf: real time: 24 minutes 38 seconds 523379 usec    (raw device)

From the above numbers, Ext4 is clearly faster than Ext3 on random I/O, with or without page cache (Ext3 takes roughly 37–39% longer); this is expected.
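The percentages can be checked directly from the wall clock times above:

```python
# Ratio of Ext3 to Ext4 wall clock times, taken from the runs above.
buffered = (34*60 + 23.557537) / (24*60 + 44.010118)   # with page cache
direct   = (33*60 + 26.947875) / (24*60 + 25.545536)   # with directio

print(round((buffered - 1) * 100))   # Ext3 ~39% slower with page cache
print(round((direct - 1) * 100))     # Ext3 ~37% slower with directio
```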

The random I/O results on Ext4 and the raw device are almost the same; this is also as expected. For file systems that map <logical offset, physical offset> with extents, it is quite easy to keep most of the mapping records hot in memory. Random I/O on the raw device has *NO* obvious performance advantage over Ext4.

Dear developers, how about considering extent-based file systems now? :-)

[1] TFS (TaobaoFS), a distributed file system. It is developed by the core system team of Taobao and will be open source very soon.

[2] The hard disks are connected to the system via eSATA through a RocketRAID 644 card.

[3] seekrw source code can be downloaded from

1 Comment »

  1. This question has indeed been asked a million times. Regarding the test results, comparing buffered I/O (page cache) vs. direct I/O, buffered I/O should show better results than direct I/O in a file-system-level test. Ext3 in your test did show this, but Ext4 doesn’t have this behavior. Why? Also, all of the testing is read-only; the page cache is of almost no use if you didn’t enable readahead at the block device layer.

    I may find a chance to do the comparison with a mixed read/write workload to see the result.
    I might also try iozone to control the I/O block size and see whether it affects the result.
    Anyway, this is a good approach to a non-philosophical problem

    Comment by jebtang — August 21, 2010 @ 9:17 am


