An enjoyed kernel apprentice Just another WordPress weblog

February 27, 2017

RAID1: avoid unnecessary spin locks in I/O barrier code

Filed under: Block Layer Magic,kernel — colyli @ 11:56 am

When I run a parallel reading performan testing on a md raid1 device with two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is only 2.7GB/s, this is around 50% of the idea performance number.

The perf reports locking contention happens at allow_barrier() and wait_barrier() code,

|        – 41.41%  fio [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
|             – _raw_spin_lock_irqsave
|                         + 89.92% allow_barrier
|                         + 9.34% __wake_up
|        – 37.30%  fio [kernel.kallsyms]  [k] _raw_spin_lock_irq
|              – _raw_spin_lock_irq
|                         – 100.00% wait_barrier

The reason is, in these I/O barrier related functions,

– raise_barrier()
– lower_barrier()
– wait_barrier()
– allow_barrier()

They always hold conf->resync_lock firstly, even there are only regular reading I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in I/O barrier code, and only holding conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provides comments to improve it. I continue to work on it, and make the patch into current form.

In the new simpler raid1 I/O barrier implementation, there are two wait barrier functions,

  • wait_barrier()

Which calls _wait_barrier(), is used for regular write I/O. If there is resync I/O happening on the same I/O barrier bucket, or the whole array is frozen, task will wait until no barrier on same barrier bucket, or the whold array is unfreezed.

  • wait_read_barrier()

Since regular read I/O won’t interfere with resync I/O (read_balance() will make sure only uptodate data will be read out), it is unnecessary to wait for barrier in regular read I/Os, waiting in only necessary when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->barrier[idx] are very carefully designed in raise_barrier(), lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to avoid unnecessary spin locks in these functions. Once conf->nr_pengding[idx] is increased, a resync I/O with same barrier bucket index has to wait in raise_barrier(). Then in _wait_barrier() if no barrier raised in same barrier bucket index and array is not frozen, the regular I/O doesn’t need to hold conf->resync_lock, it can just increase conf->nr_pending[idx], and return to its caller. wait_read_barrier() is very similar to _wait_barrier(), the only difference is it only waits when array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier code almostly gets rid of all spin lock cost.

This patch significantly improves raid1 reading peroformance. From my testing, a raid1 device built by two NVMe SSD, runs fio with 64KB blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).


Thanks to Shaohua and Neil, very patient to explain memory barrier and atomic operations to me, help me to compose this patch in a correct way. This patch is merged into Linux v4.11 with commit ID 824e47daddbf.

RAID1: a new I/O barrier implementation to remove resync window

Filed under: Block Layer Magic,kernel — colyli @ 11:42 am

‘Commit 79ef3a8aa1cb (“raid1: Rewrite the implementation of iobarrier.”)’ introduces a sliding resync window for raid1 I/O barrier, this idea limits I/O barriers to happen only inside a slidingresync window, for regular I/Os out of this resync window they don’t need to wait for barrier any more. On large raid1 device, it helps a lot to improve parallel writing I/O throughput when there are background resync I/Os performing at same time.

The idea of sliding resync widow is awesome, but code complexity is a challenge. Sliding resync window requires several variables to work collectively, this is complexed and very hard to make it work correctly. Just grep “Fixes: 79ef3a8aa1” in kernel git log, there are 8 more patches to fix the original resync window patch. This is not the end, any further related modification may easily introduce more regression.

Therefore I decide to implement a much simpler raid1 I/O barrier, by removing resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,

  • Do not maintain a global unique resync window
  • Use multiple hash buckets to reduce I/O barrier conflicts, regular I/O only has to wait for a resync I/O when both them have same barrier bucket index, vice versa.
  • I/O barrier can be reduced to an acceptable number if there are enough barrier buckets

Here I explain how the barrier buckets are designed,


The whole LBA address space of a raid1 device is divided into multiple barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.

Bio requests won’t go across border of barrier unit size, that means maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes. For random I/O 64MB is large enough for both read and write requests, for sequential I/O considering underlying block layer may merge them into larger requests, 64MB is still good enough.

Neil Brown also points out that for resync operation, “we want the resync to move from region to region fairly quickly so that the slowness caused by having to synchronize with the resync is averaged out over a fairly small time frame”. For full speed resync, 64MB should take less then 1 second. When resync is competing with other I/O, it could take up a few minutes. Therefore 64MB size is fairly good range for resync.


There are BARRIER_BUCKETS_NR buckets in total, which is defined by,


this patch makes the bellowed members of struct r1conf from integer to array of integers,

– int     nr_pending;
– int     nr_waiting;
– int     nr_queued;
– int     barrier;
+ int   *nr_pending;
+ int   *nr_waiting;
+ int   *nr_queued;
+ int   *barrier;

number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB kernel space page size, (PAGE_SHIFT – 2) indecates there are 1024 I/O barrier buckets, and each array of integers occupies single memory page. 1024 means for a request which is smaller than the I/O barrier unit size has ~0.1% chance to wait for resync to pause, which is quite a small enough fraction. Also requesting single memory page is more friendly to kernel page allocator than larger memory size.

  • I/O barrier bucket is indexed by bio start sector

If multiple I/O requests hit different I/O barrier units, they only need to compete I/O barrier with other I/Os which hit the same I/O barrier bucket index with each other. The index of a barrier bucket which a bio should look for is calculated by sector_to_idx() which is defined in raid1.h as an inline function,

+    static inline int sector_to_idx(sector_t sector)
+   {
+            return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
+                                             BARRIER_BUCKETS_NR_BITS);
+   }

Here sector_nr is the start sector number of a bio.

  • Single bio won’t go across boundary of a I/O barrier unit

If a request goes across boundary of barrier unit, it will be split. A bio may be split in raid1_make_request() or raid1_sync_request(), if sectors returned by align_to_barrier_unit_end() is smaller than original bio size.

Comparing to single sliding resync window,

  • Currently resync I/O grows linearly, therefore regular and resync I/O will conflict within a single barrier units. So the I/O behavior is similar to single sliding resync window.
  • But a barrier unit bucket is shared by all barrier units with identical barrier uinit index, the probability of conflict might be higher than single sliding resync window, in condition that writing I/Os always hit barrier units which have identical barrier bucket indexs with the resync I/Os. This is a very rare condition in real I/O work loads, I cannot imagine how it could happen in practice.
  • Therefore we can achieve a good enough low conflict rate with much simpler barrier algorithm and implementation.


Great thanks to Shaohua and Neil, review the code, point out many bugs, and provide very useful suggestion. Finally we make it, this patch is merged in Linux v4.11 with commit ID fd76863e37fe.

October 13, 2016

Why 4KB I/O requests are not merged on DM target

Filed under: File System Magic,kernel — colyli @ 10:00 am

(This article is for SLE11-SP3, which is based on Linux 3.0 kernel.)

Recently people report that on SLE11-SP3, they observe I/O requests are not merged on device mapper target, and ‘iostat’ displays average request only in 4KB size.

This is not a bug, no negative performance impact. Here I try to explain why this situation is not a bug and how it happens, a few Linux file system and block layer stuffs will be mentioned, but it won’t be complexed to understand.

The story is, from a SLE11-SP3 machine, a LVM volume is created (as linear device mapper target), and ‘dd’ is used to generate sequential WRITE I/Os on to this volume. People tried to use buffered I/O and direct I/O with ‘dd’ command, on raw device mapper target, or on an ext3 file system on top of the device mapper target. So there are 4 conditions,

1) buffered I/O on raw device mapper target,
dd if=/dev/zero of=/dev/dm-0 bs=1M
2) direct I/O on raw device mapper target,
dd if=/dev/zero of=/dev/dm-0 bs=1M oflag=direct
3) buffered I/O on ext3 file system (on top of the device mapper target),
dd if=/dev/zero of=/mnt/img bs=1M
4) direct I/O on ext3 file system (on top of the device mapper target),
dd if=/dev/zero of=/mnt/img bs=1M oflag=direct

For 2) and 4), large request sizes are observed from hundreds to thousands sectors, maximum request size is 2048 sectors (because bs=1M). But for 1) and 3), all the request sizes displayed from ‘iostat’ on device mapper target dm-0 are 8 sectors (4KB).

The question is, sequential write I/Os are supposed to be merged into larger ones, why the request size reported by ‘iostat’ from device mapper target /dev/dm-0 is only 4KB, and not merged into larger request size ?

At first, let me give the simple answer: a) this device mapper target does not merge small bios into large one, and b) upper layer code only issues 4KB size bios to device mapper layer.

Let me explain the above 2 points of simple answer in details. For direct I/O, the request size in device mapper target is the actual size sent from upper layer, it might be directly from application buffer, or adjusted by file system. so we only look at buffered I/O cases.

a) device mapper target does not merge bios
Device mapper only handles bios. In case of linear device mapper target (a common & simple lvm volume), it only re-maps the original bio from the logical device mapper target to actual underlying storage device, or maybe split the bio (of the device mapper target) into smaller ones if the original bio goes across multiple underlying storage devices. It never combines small bios into larger ones, it just re-maps the bios, and submit them to underlying block layer. Elevator, a.k.a I/O scheduler handles request merging and scheduling, device mapper does not.

b) upper layer code issues only 4KB size bios
For buffered I/O, file system only dirties the pages which contains the data writing to disk, the actual write action is handled by write back and journal code automatically,
– journal: ext3 uses jbd to handle journaling, in data=ordered mode, jbd only handles meta data blocks, and submit the metadata I/Os in buffer head, which means the maximum size is one page (4KB).
– write back: the actual kernel code to submit I/O to disk in write back code path is mm/page-writeback.c:do_writepages(). In SLE11-SP3 it looks like this,

1084 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
1085 {
1086         int ret;
1088         if (wbc->nr_to_write <= 0)
1089                 return 0;
1090         if (mapping->a_ops->writepages)
1091                 ret = mapping->a_ops->writepages(mapping, wbc);
1092         else
1093                 ret = generic_writepages(mapping, wbc);
1094         return ret;
1095 }

Device mapper target is created on devtmpfs, which does not have writepages() method defined. Ext3 does not have writepages() method defined in its a_ops set neither, so both conditions will go into generic_writepages().

Inside generic_writepages(), the I/O code path is: generic_writepages()==>write_cache_pages()==>__writepage()==>mapping->a_ops->writepage(). For different conditions, the implementation of mapping->a_ops->writeback() are different.

b.1) raw device mapper target
In SLE11-SP3, block device mapping->a_ops->writepage() is defined in fs/block_dev.c:blkdev_writepage(), its code path to submit I/O is: blkdev_writepage()==>block_write_full_page()==>block_write_full_page_endio()==>__block_write_full_page(). In __block_write_full_page(), finally a buffer head contains this page is submitted to underlying block layer by submit_bh(). So device mapper layer only receives bio with 4KB size in this case.

b.2) ext3 file system on top of raw device mapper target
In SLE11-SP3, mapping->a_ops->writeback() method in ext3 file system is defined in three ways, corresponding to three different data journal modes. Here I use data=ordered mode as the example. Ext3 uses jbd as its journaling infrastructure, when journal works in data=ordered mode (the default mode in SLE11-SP3), mapping->a_ops->writeback() is defined as fs/ext3/inode.c:ext3_ordered_writepage(). Inside this function, block_write_full_page() is called to write page to block layer, same to the raw device mapper target condition, finally submit_bh() is called to submit bio with one page to device mapper layer. Therefore in this case, device mapper target still only receives bios with 4KB size.

Finally let’s back to my first simple answer: a) this device mapper target does not merge small bios into large one, and b) upper layer code only issues 4KB size bios to device mapper layer.


June 17, 2016

My DCTC2016 talk: Linux MD RAID performance improvement since 3.11 to 4.6

Filed under: Great Days,kernel — colyli @ 3:11 am

This week I was invited by Memblaze to give a talk on Data Center Technology Conference 2016 about Linux MD RAID performance on NVMe SSD. In the past 3 years, Linux community make a lot of effort to improve MD RAID performance on high speed media, especially on RAID456. I happen to maintain block layer for SUSE Linux, back port quite a lot patches back to Linux 3.12.

From this talk, I list a selected recognized effort from Linux kernel community on MD RAID5 performance improvement, and how much performance data is increased by each patch (set), it looks quite impressive. Many people contribute their talent on this job, I am glad to say “Thank you all ” !


A slide in Mandarin of this talk can be found here, currently I don’t have time to translate it in English, maybe several months later …

April 26, 2016

libelf-devel is required when building kernel module

Filed under: Basic Knowledge,kernel — colyli @ 10:19 pm

For most documents about kernel module building just need a Makefile like this,

HELLO = helloworld

obj-m += $(HELLO).o

$(HELLO)-objs := hello.o world.o

KERNEL_SOURCE := /lib/modules/`uname -r`/build/


        $(MAKE) -C $(KERNEL_SOURCE) M=`pwd` modules


        $(MAKE) -C $(KERNEL_SOURCE) M=`pwd` clean

        $(RM) Module.markers modules.order

Then type “make” will make everything set. (Of cause source of current kernel is ready at /lib/modules/`uname -r`/build).

But yesterday when I tried to build a kernel module for Linux-4.6-rc5 (openSUSE vanilla) kernel, I observed an error never saw before.

helloworld> make

make -C /lib/modules/`uname -r`/build/ M=`pwd` modules

make[1]: Entering directory ‘/home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla’

make[2]: *** No rule to make target ‘/home/colyli/source/tmp/helloworld/hello.o’, needed by ‘/home/colyli/source/tmp/helloworld/helloworld.o’.  Stop.

Makefile:1428: recipe for target ‘_module_/home/colyli/source/tmp/helloworld’ failed

make[1]: *** [_module_/home/colyli/source/tmp/helloworld] Error 2

make[1]: Leaving directory ‘/home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla’

Makefile:10: recipe for target ‘default’ failed

make: *** [default] Error 2

It seems nothing missed, but the error message was there.  Today a friend (Chang Liu from Memblaze) tells me maybe I should check the output of “make modules_prepare” in the kernel source directory, here is my result,

linux-4.6-rc5-vanilla> make modules_prepare

Makefile:1016: “Cannot use CONFIG_STACK_VALIDATION, please install libelf-dev or elfutils-libelf-devel”

This is an informative clue, so I install libelf-dev package and re-run “make modules_prepare”,

linux-4.6-rc5-vanilla> make modules_prepare

  CHK     include/config/kernel.release

  CHK     include/generated/uapi/linux/version.h

  CHK     include/generated/utsrelease.h

  CHK     include/generated/bounds.h

  CHK     include/generated/timeconst.h

  CHK     include/generated/asm-offsets.h

  CALL    scripts/

  DESCEND  objtool

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/fixdep.o

  LD       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/fixdep-in.o

  LINK     /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/fixdep

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/exec-cmd.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/help.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/pager.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/parse-options.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/run-command.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/sigchain.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/subcmd-config.o

  LD       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/libsubcmd-in.o

  AR       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/libsubcmd.a

  GEN      /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/arch/x86/insn/inat-tables.c

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/arch/x86/decode.o

  LD       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/arch/x86/objtool-in.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/builtin-check.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/elf.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/special.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/objtool.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/libstring.o

  LD       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/objtool-in.o

  LINK     /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/objtool

No complain anymore, then I back to kernel module source directory, run “make” again,

helloworld> make

make -C /lib/modules/`uname -r`/build/ M=`pwd` modules

make[1]: Entering directory ‘/home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla’

  CC [M]  /home/colyli/source/tmp/helloworld/hello.o

  CC [M]  /home/colyli/source/tmp/helloworld/world.o

  LD [M]  /home/colyli/source/tmp/helloworld/helloworld.o

  Building modules, stage 2.

  MODPOST 1 modules

  CC      /home/colyli/source/tmp/helloworld/helloworld.mod.o

  LD [M]  /home/colyli/source/tmp/helloworld/helloworld.ko

make[1]: Leaving directory ‘/home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla’

Everything is cool, the kernel module is built. So installing libelf-devel package solve the problem.

But it is still not clear to me, why missing libelf-devel may cause kernel module building failed. If you know the reason, please give me a hint. Thanks in advance.

March 13, 2014

A six lectures OS kernel development course

Filed under: Basic Knowledge,kernel — colyli @ 12:17 pm

Since year 2010 after I joined Taobao (a subsidiary of Alibaba Group), I help my employer to build a Linux kernel team, to maintain in-house Linux kernel and optimize system performance continuously. The team grew from 1 person to 10 persons in the next 2 years, we made some successful stories by internal projects, while having 200+ patches merged into upstream Linux kernel.

In these 2 years, I found most of programmers had just a little concept on how to write code to cooperate with Linux kernel perfectly. And I found I was not the only person had similar conclusion. A colleague of mine, Zhitong Wang, a system software engineer from Ali Cloud (another subsidiary company of Alibaba Group), asked me whether I had interest to design and promote a course on OS kernel development, to help other junior developers to write better code on Linux servers. We had more then 100K real hardware servers online, if we could help other developers to improve 1% performance in their code, no doubt it would be extremely cool.

Very soon, we agreed on the outline of this course. This was a six lectures course, each one taking 120 ~ 150 minutes,


  • First class: Loading Kernel

This class introduced how a runnable OS kernel was loaded by boot loader and how the first instruction of the kernel was executed.

  • Second class: Protected Mode Programming

This class introduced very basic concept on x86 protect mode programming, which was fundamental to rested four classes.

  • Third class: System Call

This class explained how to design and implement system call interface, how priority transfer was happened.

  • Forth class: Process scheduling

We expected people was able to understand how a simplest scheduler was working and how context switch was made.

  • Fifth class: Physical Memory Management

In this class people could have a basic idea that how memory size was detected, how memory was managed before buddy system initialized, how buddy and slab system working.

  • Sixth class: Virtual Memory Management

Finally there were enough back ground knowledge to introduce how memory map, virtual memory area, page fault was designed and implemented, there was also a few slide pages introduces TLB and huge pages.


In next 6 months, Zhitong and I finished first version of  all slides. When Alibaba training department knew we were preparing an OS kernel development training, they helped us to arrange time slots both in Beijing and Hangzhou (Alibaba Group office location). We did the first wave training in 4 months, around 30 persons attended each class. We received a lot of positive feed back beyond our expectation. Many colleagues told me they were too busy to attend all these six classes, and required us to arrange this course again.

This was great encouragement to us. We knew the training material could be better, we yet had better method to make audience understand kernel development more. By this motivation, with many helpful suggestions from Zhitong, I spent half year to re-write all slide pages for all six classes, to make the materials to be more logical, consistent and scrutable.

Thanks to my employer, I may prepare course material in working hours, and accomplish the second wave training earlier. In last two classes, the teaching room was full, even some people had to stand for hours. Again, many colleagues complained they were too busy to miss some of the classes, and asked me to arrange another wave sometime in future.

This is not an easy task, I gave 6 classes both in Beijing and Hangzhou, including Q&A it was more than 30 hours. But I decide to arrange another wave of the course again, maybe start in Oct 2014, to show my honor to all people who helped and encouraged me 🙂

Here you may find all slide files for these six classes, they are written in simplified Chinese.
[There are more than enough document in English, but in Chinese the more the better ]

* Class 1: osdev1-loading_kernel
* Class 2: osdev2-protected_mode_programming
* Class 3: osdev3-system_call
* Class 4: osdev4-process_scheduling
* Class 5: osdev5-physical_memory_management
* Class 6: osdev6-virtual_memory_management

November 22, 2010

Three Practical System Workloads of Taobao

Filed under: kernel — colyli @ 9:44 pm

Days ago, I gave a talk on an academic seminar at ACT of Beihang University ( In my talk, I introduced three typical system workloads we (a group of system software developers inside Taobao) observed from the most heavily used/deployed product lines. The introduction was quite brief, no detail touched here. we don’t mind to share what we did imperfectly, and we would like to open mind to cooperate with open source community and industries to improve 🙂

If you find there is anything unclear or misleading, please let me know. Communication makes things better most of time 🙂

[The slide file can be found here]

October 16, 2010

China Linux Storage and File System Workshop 2010

Filed under: File System Magic,Great Days,kernel — colyli @ 12:41 pm

[CLSF 2010, Oct 14~15, Intel Zizhu Campus, Shanghai, China]

Similar to Linux Storage and File System Summit in north America, China Linux Storage and File System Workshop is a chance to make most of active upstream I/O related kernel developers get together and share their ideas and current status.

We (CLSF committee) invited around 26 persons to China LSF 2010, including community developers who contribute to Linux I/O subsystem, and engineers who develop their storage products/solutions based on Linux. In order to reduce travel cost to all attendees, we decided to co-locate China LSF with CLK (China Linux Kernel Developers Conference) in Shanghai.

This year, Intel OTC (Opensource Technology Center) contributed a lot to the conference organization. She kindly provided free and comfortable conference room, donated employees to help the organization and preparation, two intern students acted as volunteers helping on many trivial stuffs.

CLSF2010 is a two days’ conference,  here are some interesting topics (IMHO) which I’d like to share on my blog. I don’t understand very well on every topic, if there is any error/mistake in this text, please let me know. Any errata is welcome 🙂

— Writeback, led by Fengguang Wu

— CFQ, Block IO Controller & Write IO Controller, led by Jianfeng Gui, Fengguang Wu

— Btrfs, led by Coly Li

— SSD & Block Layer, led by Shaohua Li

— VFS Scalability, led by Tao Ma

— Kernel Tracing, led by Zefan Li

— Kernel Testing and Benchmarking, led by Alex Shi

Beside the above topics, we also had ‘From Industry’ sessions, engineers from Baidu, Taobao and EMC shared their experience when building their own storage solutions/products based on Linux.

In this blog, I’d like to share the information I got from CLSF 2010, hope it could be informative 😉

Write back

The first session started from Write back,  which is quite hot recently. Fengguang does quite a few work on it, and kindly volunteer to lead this session.

An idea was brought out to limit the dirty page ratio by per-process. Fengguang made a patch and shared a demo picture with us. When dirty pages exceeds the up-limit specified to a process, kernel will write back the dirty pages of this process smoothly, until the dirty page numbers reduced to a pre-configured rate. This idea is helpful to processes hold a large number of dirty pages.  Some people concerned this patch didn’t help the condition that a lot of processes and each hold a few dirty pages. Fengguang replied for server application, if this condition happened, the design might be buggy.

People also mentioned now the erase block size of SSD increased from KBs to MBs, adopting a bigger page numbers in writing out may help on the whole file system performance. Engineers from Baidu shared their experience,

— Increase the write out size from 4MB to 40MB, they achieved 20% performance improvement.

— Use extent based file system, they got better continuous on-disk layout and less memory consume for metadata.

Fengguang also shared his idea on how to control process to write pages, the original idea was control dirty pages by I/O (calling writeback_inode(dirtied * 3/2)), after several times improvement it became wait_for_writeback(dirteid/throttle_bandwidth). By this means, the I/O bandwidth of dirty pages to a process also got controlled.

During the discussion, Fengguang pointed out the event that a page got dirty was more important than whether a page was dirty. Engineers from Baidu said, in order to avoid a kernel/user space memory copy during file read/write, while using kernel page cache, they used mmap to read/write file pages other than calling read/write syscalls. In this case, a page writable in mmap is initialized as read only firstly, when the writing happened a page fault was triggered, then kernel knew this page got dirty.

It seems many ideas are under working to improve the writeback performance, including active writeback in back group, and some cooperation with underlying block layer. My current focus is not here, anyway I believe people in the room could help a bit out 🙂


Recently, there are many developers in China start to work on btrfs, e.g. Xie Miao, Zefan Li, Shaohua Li, Zheng Yan, … Therefore we specially arranged a two hours session for btrfs. The main purpose of the btrfs session is to share what we are doing on btrfs.

Most of people agreed that btrfs needed a real fsck tool now. Engineers from Fujitsu said they had a plan to invest people on btrfs checking tool development. Miao Xie, Zefan Li, Coly Li and other developers suggested to consider the pain of fsck from beginning,

— memory consuming

Now a 10TB+ storage media is cheap and common, for large file system built on them, doing fsck needs more memory to hold meta data (e.g. bitmap, dir blocks, inode blocks, btree internal blocks …). For online fsck, consuming too many memory in file system checking will have negative performance impact to page cache or other applications. For offline fack, it was not a problem, now online fsck is coming, we have to encounter this open question now 🙂

— fsck speed

A tree structured file system has (much) more meta data than a table structured file system (like Ext2/3/4), which may mean more I/O and more time. For a 10TB+ 80% full file system, how to reduce the file system checking time will be a key issue, especially for online service workload. I proposed an solution, allocating metadata to SSD or other higher seek speed device, then checking on metadata may have no (or a little) seeking time, which results a faster file system checking.

Weeks before, two intern students Kunshan Wang and Shaoyan Wang, they worked with me, wrote a very basic patch set (including kernel and user space code), to allocate metadata from a higher seek time device. This patch set is compiling passed, the students did a quite basic verification on meta data allocation, the patch worked. I don’t review the patch yet, by a quite rough code checking, there is much improvement needed. I post this draft patch set to China LSF mailing list, to call for more comments from CLSF attendees. Hope next month,  I can have time to improve the great job by Kunshan and Shaoyan.

Zefan Li said there was a todo list of btrfs, a long term task was data de-duplication, and a short term task was allocating data from SSD. Herbert Xu pointed out, the underlying storage media impacted file system performance quite a lot, from a benchmark from Ric Wheeler of Redhat, on Fusion IO high end PCI-E SSD, there is almost no performance difference between well known file system like xfs, ext2/3/4 or btrfs.

People also said that these days, the code review or merge of btrfs patches were often delayed, it seemed btrfs maintainer was too busy to handle the community patches. There was reply from the maintainer that the condition will be improved and patches would be handled in time, but there was no obvious improvement so far. I can understand when a person has more emergent task like kernel tree maintenance, he or she does have difficulty to handle non-trivial patches in time if this is not his or her highest priority job. From CLSF, I find more and more Chinese developers start to work on btrfs, I hope they should be patient if their patches don’t get handled in time 🙂

Engineers from Intel OTC mentioned there is no btrfs support from popular boot loader like Grub2. For me, IIRC there is someone working on it, and the patches are almost ready. Shaohua mentioned why not loading the Linux kernel by a linux kernel, like the kboot project does. People pointed out there still should be something to load the first Linux kernel, this was a chicken-and-egg question 🙂 My point was, it should not be very hard to enable the btrfs support in boot loader, a small Google Summer of Code project could make it. I’d like to port and merge the patches (if they are available) to openSUSE since I maintain openSuSE grub2 package.

Shaohua Li shared his experience on btrfs development for Meego project, he did some work on fast boot and read ahead on btrfs. Shaohua said there was some performance advance observed on btrfs, and the better result was achieved by some hacking, like a big read ahead size, a dedicated work queue to handle write request and using a big write back size. Fengguang Wu and Tao Ma pointed out this might be a general hacking, because Ext4 and OCFS2 also did the similar hacking for better performance.

Finally Shaohua Li pointed out there was a huge opportunity to improve the scalability of btrfs, since there still were many global locking, cache missing existing in current code.

SSD & Block Layer

This was a quite interesting session led by Shaohua Li. Shaohua started the session by some observed problems between SSD and block layer,

— Throughput is high, like network

— Disk controller gap, no MSI-x…

— Big locks, queue lock, scsi host lock, …

Shaohua shared some benchmark result showed that for high IOPS the interrupt over loaded on a single CPU,  even on a multi processors system, the interrupts could not be balanced to multi processors, which was a bottleneck to handle interrupts invoked by I/O of SSD.  If a system had 4 SDDs, a processor ran 100% to handle the interrupts and how throughput was around 60%-80%.

A workaround here was polling. Replacing interrupt by blk_iopoll could help the performance number, which could reduce processor overload on interrupts handling. However, Herbert Xu points out the key issue was current hardware didn’t support multi-queue to handle same interrupts. Different interrupts could be balanced to every processor in the system, but unlike network hardware, same interrupt could not be balanced into multi-queue and only be handled by a single processor. A hardware multi-queue support should be the silver bullet.

For SSD like Fusion IO produces, the IOPS could be one million + IOPS on a single SSD device, the parallel load is much more higher than on traditional hard disk. Herbert, Zefan and I agreed that some hidden race defect should be observed very soon.

Right now, block layer is not ready for such high parallel I/O load.  Herbert Xu pointed out that lock contention might be a big issue to solve. The source of the lock contention was cache consistence cost for global resource which protected by locking. Convert the global resource to a per-CPU local data might be a direction to solve the locking contention issue. Since Jens and Nick can touch Fusion IO devices more conveniently, we believe they can work with other developers to help out a lot.

Kernel Tracing

Zefan Li helped to lead an interesting session about kernel tracing. I don’t have any real understanding for any kernel trace infrastructure, for me the only tool is printk(). IMHO printk is the best trace/debug tool for kernel programming. Anyway, debugging is always an attractive topic to curious programmer, and I felt Zefan did his job quite well 🙂

The OCFS2 developer Tao Ma, mentioned OCFS2 currently using a printk wrapper trace code, which was not flexible and quite obsolete, OCFS2 developers were thinking of using a trace infrastructure like ftrace.

Zefan pointed out using ftrace to replace previous printk based trace messages should be careful, there might be ABI (application binary interface) issue for user space tools. Some user space tools work with kernel message (one can check kernel message with kmesg command). An Intel engineer mentioned there was accident recently that a kernel message modification caused the powertop tools didn’t work correctly.

For file system trace, the situation might be easier. Because most of the trace info was used by file system developers or testers, the one adding trace info into file system code might ignore the ABI issue with happy. Anyway, it was just “might”, not “be able to”.

Zefan said there was patch introduced TRACE_EVENT_ABI, if some trace info could form a stable user space ABI they could be announced by TRACE_EVENT_ABI.

This session also discussed how ftrace working. Now I know the trace info stored in a ring buffer. If ftrace is enabled but the ring buffer is not, user is still not able to receive trace info. People also said that a user space trace tool would be necessary.

Someone said perf tool currently getting more and more powerful, it was probably that integrating trace function into perf. Linux kernel only needs one trace tool,  some people in this workshop think it might be perf (for me, I have no point, because I use neither).

Finally Herbert again suggested people to pay attention on scalability issues when adding trace point. Currently the ring buffer was not a per-CPU local area, adding trace point might introduce performance regression for existing optimized code.

From Industry

In last year’s BeijingLSF, we invited two engineers from Lenovo. They shared their experience using Linux as the base system for their storage solution. This session had a quite positive feed back, and all committee member suggested to continue the From Industry sessions again this year.

For ChinaLSF2010, we invited 3 companies to share their ideas with other attendees, engineers from Baidu, Taobao and EMC  led three interesting sessions, people had chance to know which kind of difficulties they encountered, how they solved the problems and what they achieved from their solution or work around. Here I share some interesting points on my blog.

From Taobao

Engineers from Taobao also shared their works based on Linux storage and file systems,  the projects were Tair and TFS.

Tair is a distributed cache system used inside Taobao, TFS is a distributed user space file system to store Taobao goods pictures.  For detail information, please check 🙂

From EMC

Engineers from EMC shared their work on file system recovery, especially file system checking. Tao Ma and I, we also mentioned what we did in fsck.ocfs2 (ocfs2 file system checking tool). The opinion from EMC was, even an online file system checking was possible, the offline fsck was still required. Because an offline file system checking could check and fix a file system from a higher level scope.

Other points were also discussed in previous sessions, including memory occupation, time consuming …

From Baidu

This was the first time I knew people from Baidu, and had chance to knew what they did on Linux kernel. Thanks to Baidu kernel team, we had opportunity to know what they did in the past years.

Guangjun Xie from Baidu started the session by introducing Baidu’s I/O workload, most of the I/O were indexing and distributed computing related, reading performance was more desired then writing performance. In order to reduce memory copying in data reading, they used mmap to read data pages from underlying media to page cache.  Accessing the page via mmap could not use the advantage of Linux kernel page cache replacement algorithm, while Baidu didn’t want to implement a similar page cache within user space. Therefore they used a not-beautiful-but-efficient workaround, they implemented an in-house system call, the system call updated the page (returned by mmap) in kernel’s page LRU. By this means, the data page could be management by kernel’s page cache code. Some people pointed out this was mmap() + read ahead. From Baidu’s benchmark their effort increased 100% searching workload performance on a single node server.

Baidu also tried to use bigger block size of Ext2 file system, to make data block layout more continuous, also from their performance data the bigger block size also resulted a better I/O performance. IMHO, a local mod ocfs2 file system may achieve a similar performance, because the basic block unit of ocfs2 is a cluster, the cluster size could be from 4KB to 1MB.

Baidu also tried to compress/decompress the data when writing/reading from disk, since most of Baidu’s data was text, the compress rate was quite satisfied high. They even used a PCIE compressing card, the performance result was pretty good.

Guangjun also mentioned, when they used SATA disks, some I/O error was silence error, for meta data, this was a fatal error, at least meta data checksum was necessary. For data checksum, they did it in application level.


Now comes to the last part of this blog, let me give my own conclusion to ChinaLSF 2010 🙂

IMHO, the organization and preparation this year is much better than BeijingLSF 2009, people from Intel Shanghai OTC contribute a lot of time and effort before/during/after the workshop, without their effort, we can not have such a successful event. Also a big thank you should give our sponsor EMC China, they not only sponsor conference expense, but also send engineers to share their development experience.

Let’s wait for next year for ChinaLSF 2011 🙂

October 27, 2009

BeijingLSF 2009

Filed under: File System Magic,Great Days,kernel — colyli @ 12:31 pm

In the passed several months, I was helping to organize BeijingLSF(Beijing Linux Storage and File System Workshop) with other kernel developers in China. This great event happened on Oct 24, here is my report.

Since early this year, budget control happens in almost all companies/organizations, many local kernel developers in China could not attend LSF in United States (it’s quite interesting that most of kernel developers in China are storage related, mm, cgroup, io controller, fs, device mapper …). In such condition, we found there were enough people inside China to sit together for discussion on storage and file system related topics. In 2009 April, a proposal was posted on for BeijingLSF. Many people provided positive feed back, then we started to organize this event.

A 7 persons’ committee was set up firstly, people were from Novell, Intel, Oracle, Redhat, Freescale. The committee made a 20 persons’ invitation list. The website was built on and all invitees registered. Fortunately we got sponsorship from Novell China for soft drink and snack.

There were 6 sessions for BeijingLSF 2009. There was no talk, people just sat together to discuss on specific topics.

The first session was distributed locker manager. I led the session, the discussion included,

– introduced back ground of dlm, and current issues from fs/dlm (most of the issues were from my closed-or-opened and open-to-community BNCs).

– Oracle ocfs2 developers explained why ocfs2 using 64bytes lock value block.

– Jiaju Zhang (Novell) explained his patches for dlm performance improvement.

– Tao Ma (Oracle), Jiaju Zhang (Novell), Xinwei Hu (Novell) and I, discussed how dlm working with ocfs2’s user mode cluster stack.

The second session was clustering file system, led by Tao Ma (Oracle). Tao suggested people to introduce each other before the session. During the introduction, discussion happened when people introduced their current projects. When the discussion finished, 40 minutes passed. The workshop had no introduction time planed, therefore most time of this session was used for people to know each other. IMHO it was worth. This was the first time almost all storage and file system developers in China sat together, people came to know the faces behind email addresses.

The last session in the morning was shared storage and snapshot, led by Xinwei Hu (Novell). Xinwei introduced how logical volume management working in clustering environment, then discussion happened on,

– Considering snapshot start to happen on file system level, snapshot by device mapper might be less and less important in future.

– Is it possible to support snapshot by lvm in clustering environment and is it worth ? There was no conclusion from the discussion, and I’d like to hear from device mapper developers 🙂

After the lunch, the first session in the afternoon was VFS readahead and writeback. The session was led by Fengguang Wu (Intel), a 6 pages slide made people discuss for 90 minutes. Wu spent 20 minutes to introduce his patch, then people talked about,

– Why anonymous pages and file pages should be treated differently.

– In order to improvement writeback performance, MM should be able to suspend (maybe there is some better term) a process who making too many dirty pages.

– On some kind of SSD, linear read/write was slower than discrete read/write. If the storage media was SSD, there might be little difference for the writeback policy.

Second session in the afternoon was I/O controller and I/O bandwidth. The session was led by Jiangfeng Gui (Fujitsu). This talk was quite new to most of the attendees. Jianfeng explained the conception of io controller very well, at least I understood it was a software conception, not a haredware 🙂 io controller was an interesting idea, but most of concern in the workshop was focused on its complexity.

The last session of the workshop was interaction with industry. We invited an engineer from Lenovo – Yilei Lu, who was working on a internet storage solution based on Linux operating system. Yilei introduced how they used Linux as base system in their storage solution, what problems or difficulties they encountered. Many people provided suggestions to the questions, and most of the developers were very happen to hear feed back from the development users.

After all the six sessions, there was light talks. Almost all attendees said this workshop was the first effort to make upstream active developers to sit together in China. Some people showed their willing to sponsor BeijingLSF next year (if there is), some people said they could help to organize similar events in their cities. IMHO, BeijingLSF is a great and successful event. The most important thing is even not discussion, this is the *first* time for *ALL* (yes ALL) most active storage related developers within China to see each other, and have chance to talk face to face. Unlike LSF in United States, BeijingLSF has little effect to Linux storage and file system development, anyway it’s a great effort to make discussion and face-to-face communication happen.

Novell acts a very important role and contributes quite a lot to BeijingLSF. I was able to use the ITO (what a great idea!) to help organize BeijingLSF, and Novell China sponsored soft drink and snack to make all attendees more comfortable while talking whole day.

Finally please permit me to thank all attendees, they are,

Bo Yang, Coly Li, Fengguang Wu, Herbert Xu, Jeff He, Jeff Liu, Jiaju Zhang, Jianfeng Gui, Michael Fu, Tao Ma, Tiger Yang, Xiao Luo, Xinwei Hu, Xu Wang, Yang Li, Yawei Niu, Yilei Lu, Yu Zhong, Zefan Li, Zheng Yan.

Your coming and participating make BeijingLSF being a great and successful event.

Beijing Linux Storage and File System Workshop 2009

Beijing Linux Storage and File System Workshop 2009

[If you are interested on how the attendees look alike, please check]

June 29, 2009

when failed to apply openSuSE patches

Filed under: kernel — Tags: — colyli @ 2:02 am

openSuSE kernel is an upstream kernel tree with a set of patches. Last week, the second time I encountered the patch applying failure. From the .rej file, I could not see any conflict. This problem blocked me for the whole weekend, until my colleague Jeff Mohoeny and Jiri Slaby showed me the magic.

Jiri told me there were hardlinks from the git repository to the patched directory. If some files from the git repository got modified, the one in patched directory also got modified. Unless the modification got patched or edited by quilt edit, it should be problematic. I removed all the decompressed upstream kernel tree and repatched everything from the tar.bz2 upstream code file, everything backed to work.

Jeff kindly replied me a hint to solve this issue when using vim (cause it will try to edit in place), which is adding something into .vimrc:

set backup

if  version > 700

set    backupcopy=auto,breakhardlink


set    backupcopy=no


Jeff also told me the “if” part was because the SLES9 vim does not support the breakhardlink option, but it’ll still do the right thing.

By this experience, I realized it’s time to setup a blog for myself. Here it is 🙂

Powered by WordPress