An enjoyed kernel apprentice Just another WordPress weblog

October 16, 2010

China Linux Storage and File System Workshop 2010

Filed under: File System Magic,Great Days,kernel — colyli @ 12:41 pm

[CLSF 2010, Oct 14~15, Intel Zizhu Campus, Shanghai, China]

Similar to Linux Storage and File System Summit in north America, China Linux Storage and File System Workshop is a chance to make most of active upstream I/O related kernel developers get together and share their ideas and current status.

We (CLSF committee) invited around 26 persons to China LSF 2010, including community developers who contribute to Linux I/O subsystem, and engineers who develop their storage products/solutions based on Linux. In order to reduce travel cost to all attendees, we decided to co-locate China LSF with CLK (China Linux Kernel Developers Conference) in Shanghai.

This year, Intel OTC (Opensource Technology Center) contributed a lot to the conference organization. She kindly provided free and comfortable conference room, donated employees to help the organization and preparation, two intern students acted as volunteers helping on many trivial stuffs.

CLSF2010 is a two days’ conference,  here are some interesting topics (IMHO) which I’d like to share on my blog. I don’t understand very well on every topic, if there is any error/mistake in this text, please let me know. Any errata is welcome 🙂

— Writeback, led by Fengguang Wu

— CFQ, Block IO Controller & Write IO Controller, led by Jianfeng Gui, Fengguang Wu

— Btrfs, led by Coly Li

— SSD & Block Layer, led by Shaohua Li

— VFS Scalability, led by Tao Ma

— Kernel Tracing, led by Zefan Li

— Kernel Testing and Benchmarking, led by Alex Shi

Beside the above topics, we also had ‘From Industry’ sessions, engineers from Baidu, Taobao and EMC shared their experience when building their own storage solutions/products based on Linux.

In this blog, I’d like to share the information I got from CLSF 2010, hope it could be informative 😉

Write back

The first session started from Write back,  which is quite hot recently. Fengguang does quite a few work on it, and kindly volunteer to lead this session.

An idea was brought out to limit the dirty page ratio by per-process. Fengguang made a patch and shared a demo picture with us. When dirty pages exceeds the up-limit specified to a process, kernel will write back the dirty pages of this process smoothly, until the dirty page numbers reduced to a pre-configured rate. This idea is helpful to processes hold a large number of dirty pages.  Some people concerned this patch didn’t help the condition that a lot of processes and each hold a few dirty pages. Fengguang replied for server application, if this condition happened, the design might be buggy.

People also mentioned now the erase block size of SSD increased from KBs to MBs, adopting a bigger page numbers in writing out may help on the whole file system performance. Engineers from Baidu shared their experience,

— Increase the write out size from 4MB to 40MB, they achieved 20% performance improvement.

— Use extent based file system, they got better continuous on-disk layout and less memory consume for metadata.

Fengguang also shared his idea on how to control process to write pages, the original idea was control dirty pages by I/O (calling writeback_inode(dirtied * 3/2)), after several times improvement it became wait_for_writeback(dirteid/throttle_bandwidth). By this means, the I/O bandwidth of dirty pages to a process also got controlled.

During the discussion, Fengguang pointed out the event that a page got dirty was more important than whether a page was dirty. Engineers from Baidu said, in order to avoid a kernel/user space memory copy during file read/write, while using kernel page cache, they used mmap to read/write file pages other than calling read/write syscalls. In this case, a page writable in mmap is initialized as read only firstly, when the writing happened a page fault was triggered, then kernel knew this page got dirty.

It seems many ideas are under working to improve the writeback performance, including active writeback in back group, and some cooperation with underlying block layer. My current focus is not here, anyway I believe people in the room could help a bit out 🙂


Recently, there are many developers in China start to work on btrfs, e.g. Xie Miao, Zefan Li, Shaohua Li, Zheng Yan, … Therefore we specially arranged a two hours session for btrfs. The main purpose of the btrfs session is to share what we are doing on btrfs.

Most of people agreed that btrfs needed a real fsck tool now. Engineers from Fujitsu said they had a plan to invest people on btrfs checking tool development. Miao Xie, Zefan Li, Coly Li and other developers suggested to consider the pain of fsck from beginning,

— memory consuming

Now a 10TB+ storage media is cheap and common, for large file system built on them, doing fsck needs more memory to hold meta data (e.g. bitmap, dir blocks, inode blocks, btree internal blocks …). For online fsck, consuming too many memory in file system checking will have negative performance impact to page cache or other applications. For offline fack, it was not a problem, now online fsck is coming, we have to encounter this open question now 🙂

— fsck speed

A tree structured file system has (much) more meta data than a table structured file system (like Ext2/3/4), which may mean more I/O and more time. For a 10TB+ 80% full file system, how to reduce the file system checking time will be a key issue, especially for online service workload. I proposed an solution, allocating metadata to SSD or other higher seek speed device, then checking on metadata may have no (or a little) seeking time, which results a faster file system checking.

Weeks before, two intern students Kunshan Wang and Shaoyan Wang, they worked with me, wrote a very basic patch set (including kernel and user space code), to allocate metadata from a higher seek time device. This patch set is compiling passed, the students did a quite basic verification on meta data allocation, the patch worked. I don’t review the patch yet, by a quite rough code checking, there is much improvement needed. I post this draft patch set to China LSF mailing list, to call for more comments from CLSF attendees. Hope next month,  I can have time to improve the great job by Kunshan and Shaoyan.

Zefan Li said there was a todo list of btrfs, a long term task was data de-duplication, and a short term task was allocating data from SSD. Herbert Xu pointed out, the underlying storage media impacted file system performance quite a lot, from a benchmark from Ric Wheeler of Redhat, on Fusion IO high end PCI-E SSD, there is almost no performance difference between well known file system like xfs, ext2/3/4 or btrfs.

People also said that these days, the code review or merge of btrfs patches were often delayed, it seemed btrfs maintainer was too busy to handle the community patches. There was reply from the maintainer that the condition will be improved and patches would be handled in time, but there was no obvious improvement so far. I can understand when a person has more emergent task like kernel tree maintenance, he or she does have difficulty to handle non-trivial patches in time if this is not his or her highest priority job. From CLSF, I find more and more Chinese developers start to work on btrfs, I hope they should be patient if their patches don’t get handled in time 🙂

Engineers from Intel OTC mentioned there is no btrfs support from popular boot loader like Grub2. For me, IIRC there is someone working on it, and the patches are almost ready. Shaohua mentioned why not loading the Linux kernel by a linux kernel, like the kboot project does. People pointed out there still should be something to load the first Linux kernel, this was a chicken-and-egg question 🙂 My point was, it should not be very hard to enable the btrfs support in boot loader, a small Google Summer of Code project could make it. I’d like to port and merge the patches (if they are available) to openSUSE since I maintain openSuSE grub2 package.

Shaohua Li shared his experience on btrfs development for Meego project, he did some work on fast boot and read ahead on btrfs. Shaohua said there was some performance advance observed on btrfs, and the better result was achieved by some hacking, like a big read ahead size, a dedicated work queue to handle write request and using a big write back size. Fengguang Wu and Tao Ma pointed out this might be a general hacking, because Ext4 and OCFS2 also did the similar hacking for better performance.

Finally Shaohua Li pointed out there was a huge opportunity to improve the scalability of btrfs, since there still were many global locking, cache missing existing in current code.

SSD & Block Layer

This was a quite interesting session led by Shaohua Li. Shaohua started the session by some observed problems between SSD and block layer,

— Throughput is high, like network

— Disk controller gap, no MSI-x…

— Big locks, queue lock, scsi host lock, …

Shaohua shared some benchmark result showed that for high IOPS the interrupt over loaded on a single CPU,  even on a multi processors system, the interrupts could not be balanced to multi processors, which was a bottleneck to handle interrupts invoked by I/O of SSD.  If a system had 4 SDDs, a processor ran 100% to handle the interrupts and how throughput was around 60%-80%.

A workaround here was polling. Replacing interrupt by blk_iopoll could help the performance number, which could reduce processor overload on interrupts handling. However, Herbert Xu points out the key issue was current hardware didn’t support multi-queue to handle same interrupts. Different interrupts could be balanced to every processor in the system, but unlike network hardware, same interrupt could not be balanced into multi-queue and only be handled by a single processor. A hardware multi-queue support should be the silver bullet.

For SSD like Fusion IO produces, the IOPS could be one million + IOPS on a single SSD device, the parallel load is much more higher than on traditional hard disk. Herbert, Zefan and I agreed that some hidden race defect should be observed very soon.

Right now, block layer is not ready for such high parallel I/O load.  Herbert Xu pointed out that lock contention might be a big issue to solve. The source of the lock contention was cache consistence cost for global resource which protected by locking. Convert the global resource to a per-CPU local data might be a direction to solve the locking contention issue. Since Jens and Nick can touch Fusion IO devices more conveniently, we believe they can work with other developers to help out a lot.

Kernel Tracing

Zefan Li helped to lead an interesting session about kernel tracing. I don’t have any real understanding for any kernel trace infrastructure, for me the only tool is printk(). IMHO printk is the best trace/debug tool for kernel programming. Anyway, debugging is always an attractive topic to curious programmer, and I felt Zefan did his job quite well 🙂

The OCFS2 developer Tao Ma, mentioned OCFS2 currently using a printk wrapper trace code, which was not flexible and quite obsolete, OCFS2 developers were thinking of using a trace infrastructure like ftrace.

Zefan pointed out using ftrace to replace previous printk based trace messages should be careful, there might be ABI (application binary interface) issue for user space tools. Some user space tools work with kernel message (one can check kernel message with kmesg command). An Intel engineer mentioned there was accident recently that a kernel message modification caused the powertop tools didn’t work correctly.

For file system trace, the situation might be easier. Because most of the trace info was used by file system developers or testers, the one adding trace info into file system code might ignore the ABI issue with happy. Anyway, it was just “might”, not “be able to”.

Zefan said there was patch introduced TRACE_EVENT_ABI, if some trace info could form a stable user space ABI they could be announced by TRACE_EVENT_ABI.

This session also discussed how ftrace working. Now I know the trace info stored in a ring buffer. If ftrace is enabled but the ring buffer is not, user is still not able to receive trace info. People also said that a user space trace tool would be necessary.

Someone said perf tool currently getting more and more powerful, it was probably that integrating trace function into perf. Linux kernel only needs one trace tool,  some people in this workshop think it might be perf (for me, I have no point, because I use neither).

Finally Herbert again suggested people to pay attention on scalability issues when adding trace point. Currently the ring buffer was not a per-CPU local area, adding trace point might introduce performance regression for existing optimized code.

From Industry

In last year’s BeijingLSF, we invited two engineers from Lenovo. They shared their experience using Linux as the base system for their storage solution. This session had a quite positive feed back, and all committee member suggested to continue the From Industry sessions again this year.

For ChinaLSF2010, we invited 3 companies to share their ideas with other attendees, engineers from Baidu, Taobao and EMC  led three interesting sessions, people had chance to know which kind of difficulties they encountered, how they solved the problems and what they achieved from their solution or work around. Here I share some interesting points on my blog.

From Taobao

Engineers from Taobao also shared their works based on Linux storage and file systems,  the projects were Tair and TFS.

Tair is a distributed cache system used inside Taobao, TFS is a distributed user space file system to store Taobao goods pictures.  For detail information, please check 🙂

From EMC

Engineers from EMC shared their work on file system recovery, especially file system checking. Tao Ma and I, we also mentioned what we did in fsck.ocfs2 (ocfs2 file system checking tool). The opinion from EMC was, even an online file system checking was possible, the offline fsck was still required. Because an offline file system checking could check and fix a file system from a higher level scope.

Other points were also discussed in previous sessions, including memory occupation, time consuming …

From Baidu

This was the first time I knew people from Baidu, and had chance to knew what they did on Linux kernel. Thanks to Baidu kernel team, we had opportunity to know what they did in the past years.

Guangjun Xie from Baidu started the session by introducing Baidu’s I/O workload, most of the I/O were indexing and distributed computing related, reading performance was more desired then writing performance. In order to reduce memory copying in data reading, they used mmap to read data pages from underlying media to page cache.  Accessing the page via mmap could not use the advantage of Linux kernel page cache replacement algorithm, while Baidu didn’t want to implement a similar page cache within user space. Therefore they used a not-beautiful-but-efficient workaround, they implemented an in-house system call, the system call updated the page (returned by mmap) in kernel’s page LRU. By this means, the data page could be management by kernel’s page cache code. Some people pointed out this was mmap() + read ahead. From Baidu’s benchmark their effort increased 100% searching workload performance on a single node server.

Baidu also tried to use bigger block size of Ext2 file system, to make data block layout more continuous, also from their performance data the bigger block size also resulted a better I/O performance. IMHO, a local mod ocfs2 file system may achieve a similar performance, because the basic block unit of ocfs2 is a cluster, the cluster size could be from 4KB to 1MB.

Baidu also tried to compress/decompress the data when writing/reading from disk, since most of Baidu’s data was text, the compress rate was quite satisfied high. They even used a PCIE compressing card, the performance result was pretty good.

Guangjun also mentioned, when they used SATA disks, some I/O error was silence error, for meta data, this was a fatal error, at least meta data checksum was necessary. For data checksum, they did it in application level.


Now comes to the last part of this blog, let me give my own conclusion to ChinaLSF 2010 🙂

IMHO, the organization and preparation this year is much better than BeijingLSF 2009, people from Intel Shanghai OTC contribute a lot of time and effort before/during/after the workshop, without their effort, we can not have such a successful event. Also a big thank you should give our sponsor EMC China, they not only sponsor conference expense, but also send engineers to share their development experience.

Let’s wait for next year for ChinaLSF 2011 🙂


  1. A small update:
    The statement “…better result was achieved by some hacking, like a big read ahead size…” in the Btrfs segment is a bit out. We tracked this issue here:

    In the mobile device with MMC card MeeGo will use a much smaller readahead window.

    Comment by Zhu Yanhai — October 19, 2010 @ 9:17 pm

  2. Yeah, I should specify the context, for searching engine application, a bigger read ahead window may be helpful. For hand hold devices, such big window size normally does not help, especially when the device I/O is slow.
    Thanks for your comment 🙂

    Comment by colyli — October 20, 2010 @ 1:55 am

  3. Haha, I find YOU in the first pic.
    Although I can’t understand some paragraphs. But the part “From Industry” is interesting to me.;)

    Comment by Lena — October 20, 2010 @ 11:50 am

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress