For some implementations of distributed file systems, like TFS, developers believe that storing data directly on a raw device (e.g. /dev/sdb, /dev/sdc…) is faster than storing it on a file system.
Their reasoning sounds plausible:
1, Random I/O on a large file gets little benefit from the file system page cache.
2, The <logical offset, physical offset> mapping introduces more I/O on a file system than on a raw disk.
3, Managing metadata on separate, powerful servers removes the need to use file systems on data nodes.
The penalty for this “higher” performance is management cost; storing data on raw devices introduces difficulties such as:
1, Backing up and restoring the data is harder.
2, Flexible management is impossible without special tools for the raw device.
3, There is no convenient way to access or manage the data on a raw device.
These penalties are hard for system administrators to ignore. Furthermore, the story of “higher” performance is not exactly true today:
1, For file systems that map <logical offset, physical offset> with block pointers, a large file takes many pointer blocks. For example, on Ext3 with 4KB blocks, a 2TB file needs around 520K+ pointer blocks. Most of these pointer blocks are cold during random I/O, which results in lower random I/O performance than on a raw device.
2, For file systems that map <logical offset, physical offset> with extents, the number of extent blocks depends on how many fragments a large file has. For example, on Ext4 with a maximum block group size of 128MB, a 2TB file has around 16384 fragments. Mapping these 16K fragments takes 16K extent records, which fit in 50+ extent blocks. So for random I/O on a large file, it is very easy to hit a hot extent block in memory.
3, If the <logical offset, physical offset> mapping can be kept hot in memory, random I/O performance on a file system might be no worse than on a raw device.
To verify this guess, I did some performance testing; part of the data is shared here.
Processor: AMD opteron 6174 (2.2 GHz) x 2
Memory: DDR3 1333MHz 4GB x 4
Hard disk: 5400RPM SATA 2TB x 3 
File size: 2TB (created by dd, approximately)
Random I/O access: 100K times read
IO size: 512 bytes
File systems: Ext3, Ext4 (with and without directio)
Test tool: seekrw 
* With page cache
seekrw -f /mnt/ext3/img -a 100000 -l 512 -r
seekrw -f /mnt/ext4/img -a 100000 -l 512 -r
– Performance result
       Device   tps     Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
Ext3   sdc      95.88   767.07       0            46024      0
Ext4   sdd      60.72   485.6        0            29136      0
– Wall clock time
Ext3: real time: 34 minutes 23 seconds 557537 usec
Ext4: real time: 24 minutes 44 seconds 10118 usec
* directio (without pagecache)
seekrw -f /mnt/ext3/img -a 100000 -l 512 -r -d
seekrw -f /mnt/ext4/img -a 100000 -l 512 -r -d
– Performance result
       Device   tps     Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
Ext3   sdc      94.93   415.77       0            12473      0
Ext4   sdd      67.9    67.9         0            2037       0
Raw    sdf      67.27   538.13       0            16144      0
– Wall clock time
Ext3: real time: 33 minutes 26 seconds 947875 usec
Ext4: real time: 24 minutes 25 seconds 545536 usec
sdf: real time: 24 minutes 38 seconds 523379 usec (raw device)
From the above performance numbers, Ext4 is 39% faster than Ext3 on random I/O, with or without the page cache; this is expected.
The random I/O results on Ext4 and on the raw device are almost the same, which is also expected. For file systems that map <logical offset, physical offset> with extents, it is quite easy to keep most of the mapping records hot in memory, so random I/O on a raw device has *NO* obvious performance advantage over Ext4.
Dear developers, how about considering extent-based file systems now 🙂
 TFS, TaobaoFS: a distributed file system deployed for http://www.taobao.com . It is developed by the core system team of Taobao and will be open sourced very soon.
 The hard disks are connected to the system via eSATA connectors on a RocketRAID 644 card.
 The seekrw source code can be downloaded from http://www.mlxos.org/misc/seekrw.c