An enjoyed kernel apprentice
Just another WordPress weblog

June 30, 2010

Taobao joins open source

Filed under: Great Days — colyli @ 10:27 am

Taobao's open source community

Today, Taobao announces its open source community —

This is a historic day: a leading Chinese internet and e-business company joins the open source world through work proven in its own practice.

The first project released there is TAIR. Tair is a distributed, high-performance key/value storage system which has been used in Taobao's infrastructure for some time. Taobao is on the way to making more internal projects open source. Yes, talk is cheap, show the code!

If you are working on a large scale website with more than 10K server nodes, checking the projects there may help you avoid reinventing the wheel. Please visit and join the community to contribute. I believe people can make the community better and better. Currently, most of the developers are Chinese speakers, which is why you can find square characters on the website. I believe more changes will come in the future, because the people behind the community like continuous improvement 🙂

Of course, there are some other contributions to the open source community from Taobao which cannot be found there. For example, I believe patches from Taobao will appear in the Linux kernel changelog very soon 🙂

June 27, 2010

Random I/O — Is a raw device always faster than a file system?

Filed under: File System Magic — colyli @ 8:53 am

For some implementations of distributed file systems, like TFS [1], developers think storing data directly on raw devices (e.g. /dev/sdb, /dev/sdc …) might be faster than storing it on file systems.

Their choice is reasonable:

1, Random I/O on a large file cannot get any help from the file system page cache.

2, The <logical offset, physical offset> mapping introduces more I/O on a file system than on a raw disk.

3, Managing metadata on separate, powerful servers avoids the need to use a file system on the data nodes.

The penalty for the “higher” performance is management cost; storing data on raw devices introduces difficulties like:

1, It is harder to back up/restore the data.

2, Flexible management is impossible without special management tools for the raw device.

3, There is no convenient method to access/manage the data on a raw device.

These penalties are hard for system administrators to ignore. Furthermore, the story of “higher” performance is not exactly true today:

1, For file systems using block pointers for the <logical offset, physical offset> mapping, a large file takes too many pointer blocks. For example, on Ext3 with 4KB blocks, a 2TB file needs around 520K+ pointer blocks. Most of the pointer blocks are cold during random I/O, which results in lower random I/O performance numbers than on a raw device.

2, For file systems using extents for the <logical offset, physical offset> mapping, the number of extent blocks depends on how many fragments a large file has. For example, on Ext4 with a max block group size of 128MB, a 2TB file has around 16384 fragments. Mapping these 16K fragments takes 16K extent records, which fit in 50+ extent blocks. During random I/O on a large file it is very easy to hit an extent block that is hot in memory.

3, If the <logical offset, physical offset> mapping can be kept hot in memory, random I/O performance on a file system might be no worse than on a raw device.

In order to verify my guess, I did some performance testing.  I share part of the data here.

Processor: AMD opteron 6174 (2.2 GHz) x 2

Memory: DDR3 1333MHz 4GB x 4

Hard disk: 5400RPM SATA 2TB x 3 [2]

File size: (created by dd) almost 2TB

Random I/O access: 100K times read

IO size: 512 bytes

File systems: Ext3, Ext4 (with and without directio)

test tool: seekrw [3]

* With page cache

– Command

seekrw -f /mnt/ext3/img -a 100000 -l 512 -r

seekrw -f /mnt/ext4/img -a 100000 -l 512 -r

– Performance result

Device       tps      Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
Ext3 (sdc)   95.88    767.07       0            46024      0
Ext4 (sdd)   60.72    485.60       0            29136      0

– Wall clock time

Ext3: real time: 34 minutes 23 seconds 557537 usec

Ext4: real time: 24 minutes 44 seconds 10118 usec

* directio (without pagecache)

– Command

seekrw -f /mnt/ext3/img -a 100000 -l 512 -r -d

seekrw -f /mnt/ext4/img -a 100000 -l 512 -r -d

– Performance result

Device       tps      Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
Ext3 (sdc)   94.93    415.77       0            12473      0
Ext4 (sdd)   67.90    67.90        0            2037       0
Raw  (sdf)   67.27    538.13       0            16144      0

– Wall clock time

Ext3: real time: 33 minutes 26 seconds 947875 usec

Ext4: real time: 24 minutes 25 seconds 545536 usec

sdf: real time: 24 minutes 38 seconds 523379 usec    (raw device)

From the above performance numbers, Ext4 is 39% faster than Ext3 for random I/O, with or without page cache; this is expected.

The random I/O results on Ext4 and on the raw device are almost the same. This is also as expected: for file systems mapping <logical offset, physical offset> by extents, it is quite easy to keep most of the mapping records hot in memory. Random I/O on a raw device has *NO* obvious performance advantage over Ext4.

Dear developers, how about considering extent based file systems now 🙂

[1] TFS, TaobaoFS. A distributed file system deployed for . It is developed by Taobao's core system team and will be open source very soon.

[2] The hard disks are connected to the system via a RocketRAID 644 card through eSATA connectors.

[3] The seekrw source code can be downloaded from
